Over the last couple of years, I’ve compiled a short list of software design principles that resonated with me. Most come from industry white papers and engineering blogs, and I’ve linked each one to the real-world systems that applied it.

Principles and philosophies

Last Updated: Feb 27, 2024, Live Document

  1. Any change must impact a user-facing or operational issue. ✅ Scaling Memcache at Facebook [Meta, 2013]
  2. Most operations issues come from design and development or are best solved there. ✅ On designing and deploying internet-scale services [MS, 2008]
  3. Simplicity is key to efficient operations. ✅ On designing and deploying internet-scale services [MS, 2008]
  4. Delay complexity until it is necessary. ✅ Data Management for internet-scale single sign on [Google, 04]
  5. It is better to design for correctness under as many errors as possible than to place bets on how rare certain error scenarios are. ✅ Data Management for internet-scale single sign on [Google, 04]
  6. Make one significant change at a time → with multiple simultaneous changes, cause and effect cannot be correlated easily. ✅ On designing and deploying internet-scale services [MS, 2008]
  7. Treat the probability of something as a tunable parameter.
  8. Favor stateless components that enable rapid iteration and a simple deployment process.
  9. Make failure a common case and reduce the state space → Instead of handling every possible failure scenario separately, the transaction system proactively shuts down when it detects a failure. This reduces all failure handling to a single recovery operation. ✅ FoundationDB: A Distributed Unbundled Transactional Key Value Store [Apple, Snowflake, 2021], ✅ Raft: In search of an understandable consensus algorithm
  10. Fail fast and recover fast → Detect failures early and reconfigure quickly, rather than relying on quorums to mask failures. ✅ FoundationDB: A Distributed Unbundled Transactional Key Value Store [Apple, Snowflake, 2021]
  11. Simpler, modular systems are easier to understand and maintain. Use problem decomposition: whenever possible, divide a problem into separate pieces that can be solved, explained, and understood relatively independently. For example, Raft separates leader election, log replication, safety, and membership changes. ✅ Raft: In search of an understandable consensus algorithm
  12. Designing systems for predictability over absolute efficiency improves system stability. Components such as caches can improve performance, but do not let them hide the work that would be performed in their absence. That way the system is always provisioned to handle the unexpected. ✅ Amazon Dynamo and DynamoDB [Amazon, 2007, 2022]
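
Principles 9 and 10 can be sketched in a few lines of Python. This is only an illustration of the idea, not how FoundationDB implements it: `TransactionSystem`, `ComponentFailure`, and the operation functions are all hypothetical names invented for the example. The point is that every failure, whatever its cause, goes through the same shutdown-and-recover path, and that startup uses that path too.

```python
class ComponentFailure(Exception):
    """Hypothetical signal that some part of the system has failed."""


class TransactionSystem:
    """Toy system with exactly one failure-handling path: recover()."""

    def __init__(self):
        self.generation = 0   # incremented by every recovery
        self.running = False
        self.recover()        # startup goes through the same path as recovery

    def recover(self):
        # The single recovery operation: start a fresh generation.
        self.generation += 1
        self.running = True

    def execute(self, operation):
        try:
            return operation()
        except ComponentFailure:
            # Fail fast: no per-failure special cases. Shut down, then
            # run the one recovery path regardless of what went wrong.
            self.running = False
            self.recover()
            return None


def ok():
    return "committed"


def disk_error():
    raise ComponentFailure("disk")


def network_partition():
    raise ComponentFailure("network")


system = TransactionSystem()
results = [system.execute(op) for op in (ok, disk_error, network_partition, ok)]
print(results, system.generation)
```

Because the disk error and the network partition are handled identically, the recovery path is exercised often, which is what "make failure a common case" is meant to achieve: the code that runs during an incident is code that runs all the time.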