A six-step framework to tackle any system design question

Introduction

In the book of System Design Interview by Lewis C. Lin, a six-step framework called PEDALS is introduced to help answer any system design question.

PEDALS stand for:

  • Process requirements
  • Estimate
  • Design the service
  • Articulate the data model
  • List the architectural components
  • Scale

The systematic method will power you through the whole system design interview.

Process requirements

It means to clarify the question before you answer it.

  • What is it? What does the system do? What are the goals or outcomes?
  • What features does it provide? What features should be left out?

Why is clarification important? If the requirements are not clarified,

  • the desired features would not be included or correctly designed
  • the interviewers’ expectation would not be met
  • the precious time would be wasted in unexpected area

By clarifying the question, you prove that you can take proper actions for any ambiguous question in real world.

Estimate

It is always a good idea to estimate the scale of the system as it helps reflect if the designed system could fulfill the functional requirements. The requirements might include:

  • Number of users
  • Number of active users(NAU)
  • Requrests per second(RPS)
  • Logins per seconds
  • Transactions per second(TPS) for E-commerce
  • Likes/dislikes per second, shares per second, comments per second for social media sites
  • Searches per second for sites with a search feature
  • Storage needed
  • Servers needed
  • Network bandwidth needed

Based on the functional requirements above, it’s possible to estimate the system requirements including servers, storage and netowrk bandwidth needed.

To estimate hardware resource needed, we need to understand that there are four major resources in a computer system.

  • CPU
  • Memory
  • Storage
  • Network

Estimate servers needed

The modern computer system is a multi-processor system. It varies from single CPU core to multiple CPU cores. The following is a 32 CPU threads system.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

In order to estimate how many servers are needed, we can approach in the following manner.

  1. How much work can single CPU do?
  2. How much work can single server do?
  3. How many servers are needed?

Let’s have an example to go through this approach.

Let’s say it takes 100ms for a sinlge-core CPU system to handle single client request. It means the system can handle 10 requests per second. So, we can extrapolate a 32-core system can handle 320 requests per second. Let’s say we have to handle 320,000 requests per second(RPS). It means 1000 servers are needed.

Notice that this is a rough calculation only considering CPU needs. In real case, there might be other performance bottleneck to handle 320 requests per second in a system. For example, the system might be already I/O bound before running out of CPU bandwidth. But this method still gives us an estimation when to consider the CPU computation resource only.

Estimate storage needed

To estimate storage needed, we can approach as below.

  1. Identify the different data types
  2. Estimate the space needed for each data type
  3. Get the total space needed

Let’s take YouTube as an example to understand this approach.

  • Data types: videos, thumbnail images and comments.
  • Let’s assume there are roughly 2B users and 5% users(100M users) upload videos consistently. On average, each user has a weekly upload(~50 videos per year). Roughly, 13M videos(100M*50/365) are uploaded daily. Let’s assume the video is 10 minutes long on average and it takes 50MB storage space after compression. Let’s say each video has a thumbnail image of 20KB. Each video has about 5 comments and the size of each comment is 200 bytes. In total, the space need for each video is 50MB + 20KB + 1KB, roughly 50MB. By multiplying 13M videos, it needs 619TB storage in a day.

Estimate network bandwidth needed

Determine the incoming and outgoing data for network bandwidth estimation

  • We already know there would be ~619TB data uploaded to YouTube in a day. Dividing this by the number of seconds in a day(619TB/86400 seconds), the incoming data to YouTube would be 7.3GB/s.
  • Let’s say 10% of YouTube users are daily active users. With approximately 200M daily users, let’s assume a user watches 10 videos a day. Then YouTube would have 2B views in a day. This would result in ~93PB outgoing data in a day. Dividing this by the number of seconds in a day(93PB/86400 seconds), the outgoing speed would be 1128GB/s.

Design the service

When we process the requirements, we should already collect the clear enough requiremnts before the system design. And now, we need to define what to build and figure out how to build the system service.

CRUD framework

  • Create
  • Read
  • Update
  • Delete

For example, to build a YouTube-like video service. We can use CRUD to brainstorm the possible the system actions.

  • Create

  • New users

  • New channels

  • Upload videos

  • Add comments

  • Read

  • View videos

  • Read comments

  • Search videos

  • Read video recommendations

  • Update

  • Edit video metadata

  • Edit comments

  • Delete

  • Delete users

  • Delete channels

  • Delete videos

  • Delete comments

Now we can further define services(API endpoints) as below.

  • /users
  • /channels
  • /videos
  • /comments
  • /recommendations

Be RESTFUL: API Best Practices

  • Use Nouns. REST is known for using the HTTP commands(GET/POST/DELETE/PUT) to read or write data. When the API is invoded via HTTP requests, the HTTP request comes with the corresponding verbs(GET/POST/DELETE/PUT).
  • Use nesting to show the hierarchy. For example, /users/1 returns a specific user with id=1.
  • Return json. It is the industry standard today.
  • Support filtering and pagination. It helps minimize the latency and waste.
  • Use plural

Design strategies

  • Information processing strategies

  • Batch strategy(Scheduled processing)

  • Chain-of-Command strategy

  • Checklist strategy

  • Rate limiting strategies

  • Fixed window strategy

  • Sliding window strategy

  • Token bucket strategy

  • Leaky bucket strategy

  • Limiting concurrent requests strategy

  • Critical requests strategy

  • Communication strategies

  • Middleman strategy

  • Town crier strategy

  • Asynchronous strategy

  • Latency strategies

  • Main-replica strategy

  • Push vs. Pull strategy

  • Precompute strategy

  • Lazy loading strategy

  • Peer-to-peer strategy

  • Efficiency strategies

  • Divide and conquer strategy

  • Listener strategy

  • Space reduction strategies

  • Mario and Luigi strategy

  • Synchronization strategies

  • Locks strategy

  • Error handling strategies

  • Code readability, maintainability, and elegance strategies

  • Security strategies

Articulate the data model

  • Schema(Tables, Fields)

  • SQL vs. NoSQL database

  • SQL database: MySQL, PostgreSQL, etc

  • NoSQL database: MongoDB, Redis, etc

  • NoSQL databases have a flexible or unstructured database schema.

  • Non-database storage

  • Distributed file systems(e.g. Store files in HDFS and file path in database)

  • Object storage(e.g. S3)

List the architectural components

  • Logical architecture, physical architecture, cloud architecture
  • Service-oriented architecture
  • Draw architecture diagrams

Scale

How to tackle common scale issues

What were the estimated scale requirements in step 2?

  • Number of total users
  • Number of active users
  • Requests per second(RPS)
  • Logins per second
  • Storage needed

Resolve the following system bottleneck:

  • CPU
  • Memory
  • Storage
  • Latency

We can use the following strtegies to scale a small functional system.

  • Load balancing
  • Read replica databases
  • Distributed file systems
  • Content delivery networks
  • Daemon process pre-computation

Problem: Handling more users and requests

  • Horizontal scaling and load balancer
  • Vertical scaling

Problem: Handling more reads

  • Read replicas

  • drawback: lack of data consistency

  • Sharding

  • Shard by customer or range

  • Shard by geography

  • Shard by hash function

  • Disadvantages: slower join performance, additional application code, recurring maintenance

Problem: Avoiding crashes

Chaos Engineering is a discipline where engineering teams purposefully create controlled failure within a large scale system. It can emulate:

  • Failure of a data center
  • Failure of a database shard
  • Randomly thrown exceptions
  • Skew in distributed system clocks
  • Failure of a key internal microservice
  • “Storm” tests that emulate power outages

Problem: Providing data consistency

No system can simultaneously provide two of the following guarantees(CAP):

  • Consistency - wirtes are immediately in effect
  • Availability - requests receive a valid response
  • Partition Tolerance - functionality despite network errors

When scaling a system, there is a direct relationship between consistency and availability. When horizontally scaling, you’ll see availability dramatically increase; however, consistency will suffer because the system becomes distributed.

To create stronger read consistency, you’ll have to decrease the effect of data replication your system creates, which will decrease availability.

Problem: Need to improve latency

As a system grows and creates an increasingly complex workflow, latency becomes a critical factor.

The bottleneck can be caused by system resource starvation, which could be alleviated through horizontal and vertical scaling. It also can be caused by application limitation. Another strategy to consider is caching which helps reduce unnecessary disk I/O operations.

Identify and alleviate scalability bottlenecks