System Design

Concepts and Study Areas

  • ACID
    • Atomicity guarantees that multiple statements withing a group of transactions are treated as a single “atomic” unit and either all complete successfully, or are rolled back so as not to leave the underlying data in an inconsistent state.
    • Consistency ensures that a transaction can only mutate data from one valid state to another.
    • Isolation controlling concurrency such the multiple concurrent actions on data leave the data in a valid state, the same as which if each transaction occurred serially.
    • Durability guarantees that once committed a transaction will remain and be persisted even in the event of a crash.
  • API: Application Program Interface is a set of definitions and protocols for enabling communications between multiple systems regardless of language implementation (and location)
  • CAP Theorem: Consistency, Availability, Partition tolerance; any distributed system can only provide two of the aforementioned guarantees.
    • Consistency: Every read returns the most recent write or an error.
    • Availability: Every request returns a valid response without the guarantee that it is the most recent write.
    • Partition tolerance: They system will continue to operate regardless of the loss of one or more messages between nodes in the cluster or a loss of a node in the cluster.
    • When a partition event occurs there are two basic choices:
      • Cancel the current operation and reduce availability of the most recent data but ensure consistency of the data. You must return an error or timeout if you cannot guarantee the most up to date data.
      • Proceed with the operation to provide availability of the data and risk returning “inconsistent” data. You must return the most recently available version of the data even though you cannot guarantee it is up to date.
  • Eventual Consistency: an informal guarantee that in order to achieve high availability in a distributed system if there are no new writes to a given record “eventually” all reads will return the last updated value. They are classified as providing BASE semantics (basically-available, soft-state, eventual consistency) which is the opposite to an ACID system (think PH in chemistry).
  • Consistency vs. Coherence: Coherence refers to reading and writing to the same memory location such that the writes to that location will be seen in order and that the data in that location remains valid, not written to by multiple processors/processes at the same time. All reads from that location will return the most recently written value. Consistency applies to r/w operations to other locations in an order that makes sense to a given application.
  • PGP, TLS and Encryption
  • Vertical scaling
  • Horizontal scaling
  • Caching Systems, Key/Value Stores
  • Content Distribution Networks
  • Columnar Data Stores and Data Structures
  • Indexing
  • Load balancing
  • Message Queues
  • Microservices
  • NoSQL
  • OLAP and OLTP
  • ORB: Object Request Broker is a framework that enables clients to locate servers and call operations or functions on them as if they were in the same process. The JavaEE platform enables various implementations of RMI (Remote Method Invocation), and IIOP (Internet Inter-ORB Protocol) calls to remote methods or endpoints. It manages the interactions between clients and servers.
  • ORM: Object Relational Mapping, a bridge between a programatic/typically struct or object view of data and its underlying data store. The data store is typically an RDBMs.
    • Java: Hibernate, JPA
    • Python: SQLAlchemy, The Django ORM
    • Go: GORM
  • Point of Multiplication: The point in a system architecture where a single stream of data is forked (or multiplied) into additional data streams.
  • Service Mesh
  • Sharding, Database replication, and Database Replication
  • Site Reliability Engineering
  • State Machines
    • https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf
  • Write-ahead Logging
  • Reverse Proxies
  • WebSockets
  • Zero-copy

Solving System Design Problems

High Level Concepts and Principals

Following are the high-level system design concepts and principals that you should be familiar with:

  • Statelessness: If possible processes should be stateless. This reduces complexity and makes scaling easier. Typically, it is easier to make ingest, ETL pipelines stateless. There are certainly operations that require state and in those cases how that state is maintained is critical
  • Idempotency:
  • Source of Truth:
  • Contracts:

Steps to Solve System Design Problems

Keep in mind the distinction between Functional vs Non-Functional Requirements. The basic difference is that functional requirements define how the system must work and non-functional requirements define how it must perform.

  1. Understand the Goals
    1. Who are the users? Are they primarily humans, machines or a combination of the two?
    2. What value are they trying to get out of the system? What do they need it for and how are they going to use it?
    3. What type of data is going in and out of the system?
  2. Define the Scope
    1. Are there existing pieces that we are leveraging? Authentication and Authorization? Logging? Metrics, Monitoring and Alerting?
    2. What are systems are we integrating with?
    3. What systems does the system we are designing rely on?
    4. What systems rely on the system we are designing?
    5. What types of data need to be stored?
    6. Will you require data for offline analytics?
    7. What are the storage retention requirements?
    8. Any laws/rules/regulations that need to be taken into account (privacy, GDPR, etc.)?
  3. Define the Scale
    1. How many requests or inputs per second? How big is the request data?
    2. How many users in total?
    3. How many concurrent users?
    4. Request types? Number of reads/writes per second? Is it read or write heavy or evenly balanced? Even if there are more reads than writes does it mean more data has to be accessed and aggregated before being returned to the user?
    5. What does the usage pattern look like? Is it generally consistent throughout the day or are there spikes? What about usage spikes based on time of year?
    6. What are the SLOs, SLAs and error budgets?
    7. What is the expected amount of time or size of data we are going to retain?
    8. Will the system store PII? What are the privacy, CCPA, GDPR requirements and how will those requests be handled? This will certainly speak to how the data is stored and the system is designed to enable the ability to delete user data in a manageable way.
  4. Start at the High-Level, Block Diagram, and then Flesh Out Details
    1. Sketch out what an overall system would look like and enumerate all of the major pieces required for the system
    2. Define entry points and how users and/or other systems will interact with the system.
    3. Online and Offline processes?
    4. Start simple and iterate as you flesh out the details
    5. Basic components
      1. Application service layer that directly serves requests
      2. Various back-end services to separate the concerns of different parts of the system as a whole
      3. Storage layer: Is there static data that must be stored (images, videos)? What about in-memory caches, transactional databases?
      4. Typical components
        1. Webserver/Loadbalancer
        2. Services specific to the domain
        3. Database, certainly distributed with at the very least master/slave replication
        4. Caching
        5. Backups
        6. DR
  5. Data Structures and Algorithms
    1. For each of the components of the system outline what basic type of data storage and/or sorting, searching algorithms are most appropriate for the use case.
    2. Think about and explain how things will scale
  6. Tradeoffs and Compromises: Real world systems will always have trade-offs. There is never an infinite budget and infinite compute. Think CAP theorem.
    1. Is there data that becomes less likely to be read and can tolerate additional latency? Maybe that’s put on a slower storage medium to save costs
    2. What type of database might you use? Can you achieve the same goals with a distributed PostgreSQL cluster that you can with a multi-million dollar RDBMS appliance?
    3. Why are you choosing certain technologies? Supportable, ease of finding qualified Engineers, licencing, vendor lock-in, etc.
  7. Identify Bottlenecks
    1. Given the characteristics of the system there are always choke-points in a system that require careful consideration and additional decomposition to alleviate the bottleneck.

Protocols and Technologies

Formulas and Constants

Following are handy formulas and constants that you should know when going through system design exercises.

  • Permutations: The formula to calculate the number of permutations from a set of distinct items is: P(n, r) = n! / (n – r)!. Which means, the permutations of n number of distinct items in a set of length r is the factorial of n divided by the factorial of n – r
    • For example: If n = 10, and r = 2. 3628800/(10 – 2)!, or 3628800/40320 = 90
  • Number of Seconds in a Day: 86400
  • Maximum Length of a URL: 2048 characters

Links