✅ On designing and deploying internet-scale services [MS, 2008]

Src: https://s3.amazonaws.com/systemsandpapers/papers/hamilton.pdf

Summary:

Most operations issues come from design and development or are best solved there
Low-cost administration correlates highly with how closely the development, test, and operations teams work together
Simplicity is key to efficient operations

Basic recommendations

Design for failure → The entire service must be capable of surviving a failure without human intervention. The best way to test the failure path is to never shut down the service normally; just hard-fail it. Paths that aren’t frequently used don’t work when needed.
Redundancy and fault recovery
- Is the ops team willing and able to bring down any server in the service at any time without draining the load first? If so, the service has synchronous redundancy (no data loss), failure detection, and automatic take-over.
Support single-version software.
Assume dependencies will fail. Can rely on cached data temporarily, etc.
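A minimal sketch of the cached-data fallback, assuming an in-process cache is acceptable. The names `CachedDependency`, `fetch`, and `max_stale_seconds` are hypothetical and not from the paper:

```python
import time
from typing import Callable, Dict, Tuple


class CachedDependency:
    """Wrap a dependency call and serve the last good value when it fails."""

    def __init__(self, fetch: Callable[[str], str], max_stale_seconds: float = 300.0):
        self._fetch = fetch                  # hypothetical call to the remote dependency
        self._max_stale = max_stale_seconds  # how long stale data is tolerated (assumed)
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except Exception:
            # Dependency failed: fall back to cached data if it is fresh enough.
            cached = self._cache.get(key)
            if cached is not None and time.monotonic() - cached[0] <= self._max_stale:
                return cached[1]
            raise  # nothing usable in the cache, surface the failure
        # Dependency succeeded: remember the value for future failures.
        self._cache[key] = (time.monotonic(), value)
        return value
```

The fallback is temporary by design: once the cached entry is older than `max_stale_seconds`, the failure is surfaced to the caller.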
Design with insulation in mind. One pod or cluster should not affect others.
Allow for (rare) emergency human interventions. Intervention actions must be written as scripts and tested in production often.
It’s better to stop admitting work into an overloaded system than to keep accepting work and begin to thrash. Degrade gracefully rather than hard-fail. Rate limiting!
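One common way to do admission control is a token bucket. This is a sketch under that assumption (the paper does not prescribe a specific algorithm); `TokenBucket`, `handle`, and the 429 response are illustrative choices:

```python
import time


class TokenBucket:
    """Admit at most `rate_per_sec` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock          # injectable so the limiter can be tested
        self.last = clock()

    def try_admit(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(request, bucket: TokenBucket, do_work):
    """Reject early when over the limit instead of queueing more work."""
    if not bucket.try_admit():
        return 429, "try again later"
    return 200, do_work(request)
```

Rejecting at the front door keeps the work already admitted fast, which is the graceful-degradation behavior the note describes.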
Partition the service → Partitions should be infinitely adjustable and fine-grained. The recommendation is to use a lookup table at the mid-tier that maps fine-grained entities (e.g. users) to the system where their data is managed. Fine-grained partitions can be moved freely between servers. E.g. Cassandra's virtual partitioning scheme.
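A sketch of the lookup-table idea, using a virtual-partition variation: entities hash into many small partitions, and the table maps each partition to a server. `PartitionMap`, the server names, and the round-robin initial placement are all assumptions for illustration:

```python
import zlib
from typing import Dict, List


class PartitionMap:
    """Mid-tier lookup table: entity -> small partition -> server."""

    def __init__(self, num_partitions: int, servers: List[str]):
        self.num_partitions = num_partitions
        # Initial placement is round-robin; any placement policy would do.
        self.partition_to_server: Dict[int, str] = {
            p: servers[p % len(servers)] for p in range(num_partitions)
        }

    def partition_of(self, entity_id: str) -> int:
        # crc32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(entity_id.encode("utf-8")) % self.num_partitions

    def server_for(self, entity_id: str) -> str:
        return self.partition_to_server[self.partition_of(entity_id)]

    def move(self, partition: int, new_server: str) -> None:
        # Rebalancing only updates the table; entity -> partition never changes.
        # Copying the partition's data to new_server is out of scope here.
        self.partition_to_server[partition] = new_server


pm = PartitionMap(num_partitions=64, servers=["db-a", "db-b"])
p = pm.partition_of("alice")
pm.move(p, "db-c")  # "alice" and everyone else in partition p now route to db-c
```

Because each partition is small, load can be rebalanced one partition at a time without rehashing every entity.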
Analyze throughput and latency. Do so with other ops running, e.g. DB maintenance or migrations. This will help catch issues during periodic management tasks.
- RPS, #concurrent users, metrics mapping load to resource requirements
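As a rough illustration of mapping load to resource requirements, the arithmetic below sizes a tier from measured per-server throughput. The function name, the utilization target, and the example numbers are hypothetical, not figures from the paper:

```python
import math


def servers_needed(peak_rps: float, rps_per_server: float,
                   target_utilization: float = 0.5, spare: int = 1) -> int:
    """Estimate server count from peak load and measured per-server capacity.

    rps_per_server should be measured while maintenance tasks are running,
    so the estimate already includes that overhead.
    """
    if rps_per_server <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("rps_per_server must be > 0 and utilization in (0, 1]")
    usable = rps_per_server * target_utilization  # headroom for spikes
    return math.ceil(peak_rps / usable) + spare   # spare covers server failures


# Hypothetical inputs: 10,000 RPS peak, 400 RPS per server, 50% target utilization.
estimate = servers_needed(10_000, 400, target_utilization=0.5, spare=2)
```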
Handle failures and correct errors at the service level, where the full execution context is available, rather than at lower levels of the stack. E.g. if a dependent API fails, don’t expect it to recover.