System Design interview questions & answers

System design is the process of defining the architecture, components, and data flow of a large-scale software system to meet requirements like scalability, reliability, and performance. It involves choosing how to distribute load, store data, cache results, and handle failures while balancing trade-offs. Interviews assess whether you can reason about these decisions for systems serving millions of users.

Updated 2026-06-18 · 15 real, commonly-asked questions with answers.

Key takeaways

  • Core areas to revise for System Design: Scalability, Load balancing, Caching, Database design & sharding, CAP theorem & consistency.
  • This guide answers 15 of the most-asked System Design interview questions.

Scalability · Load balancing · Caching · Database design & sharding · CAP theorem & consistency · Message queues · Microservices · High availability

Top 15 System Design interview questions

Q1. What is the difference between horizontal and vertical scaling?

Vertical scaling adds more power (CPU, RAM) to a single machine. It is simple, but it is capped by the largest machine you can buy and leaves that machine as a single point of failure. Horizontal scaling adds more machines and distributes load across them. It scales much further, but adds complexity in coordination and consistency. Large systems favor horizontal scaling for resilience and elasticity.

Q2. What is a load balancer and how does it work?

A load balancer distributes incoming requests across multiple servers to prevent any one from being overwhelmed and to improve availability. It uses algorithms like round-robin, least-connections, or hashing, and performs health checks to route traffic only to healthy servers. It also enables horizontal scaling and can terminate TLS (SSL) connections so backend servers do not have to.
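
As a rough illustration of round-robin selection combined with health checks, here is a minimal in-memory sketch. The class name `RoundRobinBalancer` and its methods are hypothetical, and a real load balancer would probe servers over the network rather than rely on manual `mark_down` calls.

```python
class RoundRobinBalancer:
    """Rotates through servers in order, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._index = 0

    def mark_down(self, server):
        # In practice a failed health check would trigger this.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call.
        for _ in range(len(self.servers)):
            server = self.servers[self._index]
            self._index = (self._index + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

Calling `next_server()` repeatedly cycles through the pool, and a server that has been marked down is simply skipped until it is marked up again.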

Q3. What is caching and why is it used?

Caching stores frequently accessed data in fast storage, like memory, so future requests are served without recomputing or re-fetching from a slower source. It dramatically reduces latency and load on databases and backends. The trade-off is potential staleness, which is managed with expiration policies and invalidation strategies.
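
The expiration and invalidation ideas above can be sketched as a small in-process cache with a time-to-live. This is an illustrative sketch only: `TTLCache`, the `loader` callback, and the injectable `clock` parameter are assumptions made for the example, not part of any particular caching product.

```python
import time


class TTLCache:
    """Caches values for a fixed time-to-live, then reloads them."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: skip the slow source
        value = loader(key)  # miss or expired: fetch from the slow source
        self._store[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        # Explicit invalidation, e.g. after the underlying data changes.
        self._store.pop(key, None)
```

A short TTL limits staleness at the cost of more trips to the backend, while a long TTL does the opposite, which is the trade-off the answer describes.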

Q4. Explain the CAP theorem.

The CAP theorem states that a distributed system can guarantee at most two of three properties at once: Consistency, Availability, and Partition tolerance. Since network partitions are unavoidable, the real choice during a partition is between consistency and availability. CP systems reject or time out requests they cannot serve consistently, while AP systems stay available but may serve stale data.

Q5. What is database sharding?

Sharding horizontally partitions a database across multiple servers, with each shard holding a subset of the data based on a shard key. It lets the database scale beyond one machine's capacity for both storage and throughput. The challenges are choosing a shard key that avoids hotspots and handling queries that span multiple shards.
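
One simple way to map a shard key to a shard is to hash it and take the result modulo the number of shards. The sketch below assumes a hypothetical `shard_for` helper and string keys. It is one possible scheme among several (range-based and consistent hashing are common alternatives), not a prescribed one.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Return the shard index for a key using a stable hash."""
    # hashlib gives the same result across processes and restarts,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

A caveat of plain modulo hashing is that changing `num_shards` remaps most keys, which is one reason consistent hashing is often preferred when shards are added or removed frequently.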

Q6. What is the difference between SQL and NoSQL in system design?

SQL databases offer strong consistency, structured schemas, and powerful joins, suiting transactional systems with complex relationships. NoSQL databases offer flexible schemas and easier horizontal scaling, suiting high-volume, simple-access patterns like key-value or document stores. The choice depends on consistency needs, query patterns, and scale.

Q7. What is a message queue and when would you use one?

A message queue is a buffer that lets services communicate asynchronously by passing messages, decoupling producers from consumers. It smooths traffic spikes, enables retries, and lets components scale and fail independently. It is used for tasks like sending emails, processing uploads, or coordinating microservices.
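
The decoupling of producer and consumer can be illustrated inside a single process with Python's standard `queue` and `threading` modules. This is only a sketch: a production system would use a dedicated broker, and the `run_worker` and `demo` functions and the email addresses are made up for the example.

```python
import queue
import threading


def run_worker(tasks: queue.Queue, results: list) -> None:
    """Consumer: processes messages until it sees the None sentinel."""
    while True:
        message = tasks.get()
        if message is None:  # sentinel value: stop the worker
            tasks.task_done()
            break
        results.append(f"sent email to {message}")
        tasks.task_done()


def demo() -> list:
    tasks = queue.Queue()
    results = []
    worker = threading.Thread(target=run_worker, args=(tasks, results))
    worker.start()
    for address in ["a@example.com", "b@example.com"]:
        tasks.put(address)  # producer enqueues and moves on immediately
    tasks.put(None)
    worker.join()
    return results
```

The producer never waits for an email to be sent. It only enqueues work, so a slow or temporarily failing consumer does not block the caller.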

Q8. What is the difference between latency and throughput?

Latency is the time to complete a single request, while throughput is the number of requests a system handles per unit of time. They are related but distinct: a system can have high throughput through parallelism while individual requests still have high latency. Good design optimizes both according to the application's needs.
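
A quick way to see the relationship is Little's law, which for a steady-state system gives throughput as roughly the number of requests in flight divided by the latency per request. The helper below is a back-of-the-envelope sketch under that assumption. The name `max_throughput` and the numbers used are illustrative.

```python
def max_throughput(concurrency: int, latency_seconds: float) -> float:
    """Approximate requests per second for a given number of
    requests in flight and a fixed per-request latency."""
    return concurrency / latency_seconds


# One worker at 200 ms per request handles about 5 requests/second.
single = max_throughput(1, 0.2)
# 100 parallel workers reach about 500 requests/second,
# yet each individual request still takes 200 ms.
parallel = max_throughput(100, 0.2)
```

Adding parallelism raised throughput a hundredfold without improving latency at all, which is why the two must be measured and tuned separately.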

Q9. How would you design a URL shortener?

You generate a short unique key for each long URL, store the mapping in a database, and redirect on lookup. The key can come from a base-62 encoding of an auto-increment ID, which is collision-free but produces guessable, sequential keys, or from a truncated hash of the URL, which needs a collision check because shorter keys collide more often. To scale, you add caching for popular links, read replicas, and sharding on the key.
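
The base-62 approach can be sketched as follows. The alphabet ordering and the function names `encode_base62` and `decode_base62` are choices made for this example. Any fixed ordering of the 62 characters works as long as encoding and decoding agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_base62(n: int) -> str:
    """Turn a non-negative integer ID into a short base-62 key."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(key: str) -> int:
    """Recover the integer ID from a base-62 key."""
    n = 0
    for ch in key:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

With this scheme, keys of up to seven characters cover 62^7 (about 3.5 trillion) distinct IDs, which gives a feel for how key length trades off against capacity.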

Q10. What is a CDN and why is it used?

A content delivery network is a geographically distributed set of edge servers that cache static content close to users. By serving images, CSS, and videos from a nearby edge rather than the origin, it cuts latency and offloads the origin server. CDNs also absorb traffic spikes and add protection against some attacks.

Q11. What is the difference between strong and eventual consistency?

Strong consistency guarantees every read returns the most recent write, so all clients see the same data immediately, at the cost of latency and availability. Eventual consistency allows replicas to be temporarily out of sync, converging over time, which improves availability and performance. Banking favors strong consistency, while systems like social feeds tolerate eventual consistency.
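
In quorum-replicated stores, one common way to tune between these two models is the overlap rule: with N replicas, a read quorum R and a write quorum W are guaranteed to share at least one replica when R + W > N. The sketch below only checks that arithmetic. The function name is hypothetical, and real systems add further conditions (for example around failure handling) before they can claim strong consistency.

```python
def quorums_overlap(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """True when every read quorum must include at least one replica
    that took part in the latest acknowledged write quorum."""
    return read_quorum + write_quorum > n_replicas


# N=3, R=2, W=2: quorums overlap, so a read contacts an up-to-date replica.
strict = quorums_overlap(3, 2, 2)
# N=3, R=1, W=1: faster and more available, but a read may hit a stale replica.
relaxed = quorums_overlap(3, 1, 1)
```

Lower quorums favor latency and availability, while overlapping quorums move the system toward the stronger guarantee.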

Q12. What is a microservices architecture?

Microservices structure an application as a collection of small, independently deployable services, each owning a specific business capability and communicating over the network. This allows independent scaling, technology choices, and deployment, improving team autonomy. The trade-offs are operational complexity, network overhead, and harder distributed debugging compared to a monolith.

Q13. How do you handle a single point of failure?

You eliminate single points of failure through redundancy: running multiple instances behind a load balancer, replicating databases, and distributing across availability zones. Health checks and automatic failover redirect traffic away from failed components. The goal is that no single component's failure takes down the whole system.

Q14. What is rate limiting and why is it important?

Rate limiting caps how many requests a client can make in a time window to protect a system from abuse, overload, and runaway costs. Algorithms include token bucket, leaky bucket, and fixed or sliding windows. It ensures fair usage, defends against denial-of-service, and keeps the service stable under heavy load.
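
Of the algorithms listed, the token bucket is perhaps the easiest to sketch: tokens refill at a steady rate up to a cap, and each request spends one. The class below is a single-process illustration with hypothetical names and an injectable `clock`. A distributed limiter would keep this state in a shared store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Add tokens for the time that has passed, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False  # bucket empty: reject or delay the request
```

In a real service there would typically be one bucket per client or API key, so that one noisy caller cannot exhaust capacity for everyone else.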

Q15. How would you approach a system design interview question?

Start by clarifying requirements and constraints, including functional needs, scale, and read-write patterns, then estimate capacity. Define the high-level architecture and data model, then drill into components like databases, caching, and load balancing while discussing trade-offs. Finish by addressing bottlenecks, failure handling, and how the design scales.
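
The capacity-estimation step is usually simple arithmetic. The sketch below shows one way to structure it. The function name, its parameters, and the example figures (10 million daily users, 10 requests each, 10% writes of 1,000 bytes) are all assumptions chosen for illustration, not numbers from any real system.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_capacity(daily_active_users, requests_per_user_per_day,
                      write_fraction, bytes_per_write):
    """Back-of-the-envelope average QPS and daily storage growth."""
    requests_per_day = daily_active_users * requests_per_user_per_day
    avg_qps = requests_per_day / SECONDS_PER_DAY
    writes_per_day = requests_per_day * write_fraction
    storage_gb_per_day = writes_per_day * bytes_per_write / 1e9
    return {"avg_qps": avg_qps, "storage_gb_per_day": storage_gb_per_day}


# Hypothetical inputs: roughly 1,157 requests/second on average
# and about 10 GB of new data per day.
example = estimate_capacity(10_000_000, 10, 0.1, 1_000)
```

Peak traffic is typically some multiple of the average, so interviewers often expect you to state a peak factor as a further assumption before sizing servers.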
