✅ Finding a needle in Haystack: Facebook's photo storage [2010]

https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf


Top 3 takeaways

Traditional NAS performance can be significantly improved by caching file metadata, so that disk reads are limited to file data and the I/O for looking up the inode number and reading the inode is eliminated.
Packing multiple blobs into large files significantly reduces the metadata footprint, primarily because many blobs share the overhead of a single inode.
Given an access pattern where reads are orders of magnitude more frequent on recently uploaded files, caching those files in memory significantly reduces the read load on storage. Storage can then remain optimized for writing, which increases throughput via sequential, append-only writes.

Summary

Object store.
Access pattern: written once, read often, never modified, and rarely deleted.
Goals:
- Increase throughput, reduce cost, reduce latency
Tenets
- Designed to reduce the amount of file metadata per photo so that all of it can be kept in main memory, enforcing 1 disk read per image.
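The tenet above can be sketched as a toy append-only volume. This is a minimal illustration, not the paper's implementation: the `Volume` class, its method names, and the on-disk layout (raw photo bytes with no headers) are all assumptions made for the example.

```python
import os


class Volume:
    """Toy append-only volume: many photos packed into one large file."""

    def __init__(self, path):
        self.path = path
        # photo_id -> (offset, size); this mapping lives entirely in memory.
        self.index = {}
        # "a+b" creates the file if needed; every write goes to the end.
        self.f = open(path, "a+b")

    def put(self, photo_id, data):
        # Find the current end of the file, then append sequentially.
        offset = self.f.seek(0, os.SEEK_END)
        self.f.write(data)
        self.f.flush()
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        # The lookup is a pure in-memory operation (no metadata I/O).
        offset, size = self.index[photo_id]
        # One seek plus one read against the volume file fetches the photo.
        self.f.seek(offset)
        return self.f.read(size)
```

Because the offset and size come from memory, the only disk access on the read path is the read of the photo bytes themselves.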
Bottlenecks
- With the NAS solution, reading 1 image requires at least 3 disk operations: translate the filename to an inode number, read the inode from disk, and read the file itself. Solution → Cache metadata more effectively and enforce only 1 disk read per image read.
- The NAS solution stores metadata on disk that is never used, such as permissions.
- Placing thousands of files in a directory on NAS is extremely inefficient because the directory blockmap is too large to be cached effectively. Solution → Reduce the maximum number of files in a directory.
- Storing a single photo per file costs hundreds of bytes of inode metadata per photo. This can be reduced by storing multiple photos in a single file, thus maintaining very large files.
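A quick back-of-the-envelope calculation shows why the per-photo metadata size decides whether everything fits in memory. All numbers below are illustrative assumptions chosen for the example (photo count, entry sizes, available RAM), not figures taken from these notes.

```python
def metadata_footprint(num_photos, bytes_per_photo):
    """Total metadata size in bytes for a given per-photo cost."""
    return num_photos * bytes_per_photo


def fits_in_memory(num_photos, bytes_per_photo, ram_bytes):
    """True if all metadata can be held in RAM, so lookups need no disk I/O."""
    return metadata_footprint(num_photos, bytes_per_photo) <= ram_bytes


# Illustrative assumptions only.
PHOTOS_PER_MACHINE = 100_000_000
INODE_BYTES = 500         # stands in for "hundreds of bytes" of inode metadata
PACKED_ENTRY_BYTES = 16   # hypothetical compact entry: id, offset, size
RAM_BYTES = 8 * 2**30     # hypothetical 8 GiB set aside for metadata

per_file = metadata_footprint(PHOTOS_PER_MACHINE, INODE_BYTES)
packed = metadata_footprint(PHOTOS_PER_MACHINE, PACKED_ENTRY_BYTES)

print(f"one file per photo: {per_file / 2**30:.1f} GiB of metadata")
print(f"packed into volumes: {packed / 2**30:.1f} GiB of metadata")
print("per-file fits in RAM:", fits_in_memory(PHOTOS_PER_MACHINE, INODE_BYTES, RAM_BYTES))
print("packed fits in RAM:  ", fits_in_memory(PHOTOS_PER_MACHINE, PACKED_ENTRY_BYTES, RAM_BYTES))
```

Under these assumed numbers, the per-file layout overflows memory while the packed layout fits comfortably, which is the condition that makes a single disk read per image possible.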
Health & correctness
- A background task (pitchfork) continuously evaluates the health of each Store machine. If the checks fail, the machine's volumes are marked as read-only.
- Index files allow a Store machine to build its in-memory mapping quickly, shortening restart times (otherwise, the machine would need to read the full contents of the volume). Index files are updated asynchronously and act as checkpoints: they include the seq_id/offset they are current up to, which can be used to read the remaining information from the storage files.
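The checkpoint idea behind index files can be sketched as follows: load the index first, then scan only the part of the volume written after the last checkpointed entry. The record layouts (`HEADER`, `INDEX_ENTRY`) and the function names are hypothetical and chosen for the example; the real on-disk format is more involved.

```python
import os
import struct

HEADER = struct.Struct("<QI")        # hypothetical record header: photo_id, data size
INDEX_ENTRY = struct.Struct("<QQI")  # hypothetical index entry: photo_id, offset, size


def append_photo(volume_path, photo_id, data):
    """Append one record (header + data) to the volume; return its offset."""
    with open(volume_path, "ab") as vol:
        offset = vol.seek(0, os.SEEK_END)
        vol.write(HEADER.pack(photo_id, len(data)) + data)
    return offset


def checkpoint(index_path, photo_id, offset, size):
    """Record an entry in the index file (done asynchronously in practice)."""
    with open(index_path, "ab") as idx:
        idx.write(INDEX_ENTRY.pack(photo_id, offset, size))


def rebuild_mapping(volume_path, index_path):
    """Rebuild photo_id -> (offset, size) from the index plus the volume tail."""
    mapping = {}
    resume_at = 0
    # Step 1: load every checkpointed entry from the small index file.
    if os.path.exists(index_path):
        with open(index_path, "rb") as idx:
            raw = idx.read()
        raw = raw[: len(raw) - len(raw) % INDEX_ENTRY.size]  # drop a torn tail entry
        for photo_id, offset, size in INDEX_ENTRY.iter_unpack(raw):
            mapping[photo_id] = (offset, size)
            resume_at = max(resume_at, offset + HEADER.size + size)
    # Step 2: scan only the records written after the last checkpoint.
    with open(volume_path, "rb") as vol:
        vol.seek(resume_at)
        while True:
            pos = vol.tell()
            header = vol.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # reached the end of the volume
            photo_id, size = HEADER.unpack(header)
            vol.seek(size, os.SEEK_CUR)  # skip the photo bytes themselves
            mapping[photo_id] = (pos, size)
    return mapping
```

With this layout, a restart reads the compact index plus a short tail of the volume instead of every record in it.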
Discussion
- CDNs do a good job for hot content such as profile pictures and recently uploaded images. However, they are not cost-effective and don’t yield good hit rates on long-tail images, which are a common access pattern at Facebook.