Src: https://s3.amazonaws.com/systemsandpapers/papers/hamilton.pdf
Summary:
Basic recommendations
Design for Failure → Entire service must be capable of surviving a failure without human intervention. The best way to test the failure path is to never shut down the service normally. Just hard-fail it. Paths that aren’t used frequently don’t work when needed.
Redundancy and fault recovery
Support single-version software.
Assume dependencies will fail. Can rely on cached data temporarily, etc.
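A minimal sketch of the "rely on cached data temporarily" idea, not taken from the paper. The class name `CachedFallback`, the `max_stale_seconds` parameter, and the injectable `clock` are all hypothetical names chosen for illustration. The wrapper calls the dependency, remembers the last good answer, and serves that answer for a bounded time if the dependency starts failing.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFallback:
    """Wrap a dependency call; serve the last good value while it is failing.

    Hypothetical helper for illustration, not an API from the paper.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_stale_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        # key -> (time the value was stored, value)
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except Exception:
            # The dependency failed: fall back to cached data if it is
            # still within the staleness budget, otherwise surface the error.
            cached = self._cache.get(key)
            if cached is not None and self._clock() - cached[0] <= self._max_stale:
                return cached[1]
            raise
        # The dependency answered: refresh the cache and return the fresh value.
        self._cache[key] = (self._clock(), value)
        return value
```

The staleness bound matters: serving cached data forever would hide a dead dependency, so once the budget is exhausted the original exception is re-raised and the caller has to degrade some other way.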
Design with insulation in mind. One pod or cluster should not affect others.
Allow for (rare) emergency human interventions. Intervention actions must be written as scripts and tested in production often.
It’s better to not let more work into an overloaded system than to continue accepting work and begin to thrash. Gracefully degrade rather than hard fail. Rate limiting!
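One common way to stop work at the door is a token bucket. The sketch below is an assumption about how the rate-limiting note could be implemented, not something the paper prescribes. `TokenBucket`, `try_admit`, and the injectable `clock` are hypothetical names.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit at most `rate_per_second` requests on average, with bursts
    of up to `capacity`. Illustrative sketch, single-threaded."""

    def __init__(
        self,
        rate_per_second: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._tokens = capacity  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_admit(self) -> bool:
        """Return True if the request may enter, False if it should be shed."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A caller that gets `False` would reject the request immediately (for an HTTP service, typically with a 429 or 503) instead of queueing it, so the load already inside the system can still finish.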
Partition the service → Infinitely-adjustable and fine-grained. The recommendation is to use a lookup table at the mid-tier that maps fine-grained entities (e.g. users) to the system where their data is managed. Fine-grained partitions can be moved freely between servers. E.g. Cassandra virtual partitioning scheme.
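A rough sketch of the mid-tier lookup table, assuming users are the partitioned entity. `PartitionDirectory` and its method names are hypothetical. The sketch only models the mapping; a real system would also need to copy the entity's data before flipping the table entry.

```python
from typing import Dict


class PartitionDirectory:
    """Maps each entity (here: a user id) to the server that holds its data.

    Illustrative in-memory version; a real directory would be persisted
    and replicated.
    """

    def __init__(self) -> None:
        self._location: Dict[str, str] = {}

    def assign(self, entity_id: str, server: str) -> None:
        """Record where a new entity's data lives."""
        self._location[entity_id] = server

    def locate(self, entity_id: str) -> str:
        """Return the server for an entity; raises KeyError if unknown."""
        return self._location[entity_id]

    def move(self, entity_id: str, new_server: str) -> str:
        """Repoint one entity to another server and return the old server.

        Assumes the data has already been copied to `new_server`.
        """
        old_server = self._location[entity_id]
        self._location[entity_id] = new_server
        return old_server
```

Because the table is keyed per entity rather than by a hash range or a fixed shard count, one hot user can be moved to a quieter server without touching anyone else's placement.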
Analyze throughput and latency. Do so with other ops running, e.g. DB maintenance or migrations. This will help catch issues during periodic management tasks.
Handle failures and correct errors at the service level where the full execution context is available rather than at lower levels of the stack. E.g. if a dependent API fails, don’t expect it to recover on its own; handle the failure in the service.