Scale from zero to millions of users

Single Server Setup

To start with something simple, everything runs on a single server.

1. Users access the website through a domain name, such as api.mysite.com. The Domain Name System (DNS) is usually a paid service provided by third parties and not hosted on our servers.
2. An Internet Protocol (IP) address is returned to the browser or mobile app. In the example, IP address 15.125.23.214 is returned.
3. Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) [1] requests are sent directly to your web server.
4. The web server returns HTML pages or a JSON response for rendering.

Database

With the growth of the user base, one server is not enough, and we need multiple servers: one for web/mobile traffic, the other for the database. Separating web/mobile traffic (web tier) and database (data tier) servers allows them to be scaled independently.

Which databases to use? You can choose between a traditional relational database and a non-relational database.

Relational databases are also called SQL databases. The most popular ones are MySQL, Oracle database, PostgreSQL, etc. Relational databases represent and store data in tables and rows. You can perform join operations using SQL across different database tables.

Non-relational databases are also called NoSQL databases. Popular ones are CouchDB, Neo4j, Cassandra, MongoDB, Amazon DynamoDB, etc. These databases are grouped into four categories: key-value stores, graph stores, column stores, and document stores. Join operations are generally not supported in non-relational databases.

Non-relational databases might be the right choice if:

- Your application requires super-low latency.
- Your data is unstructured, or you do not have any relational data.
- You only need to serialize and deserialize data (JSON, XML, YAML, etc.).
- You need to store a massive amount of data.

Vertical scaling vs Horizontal scaling

Vertical scaling, referred to as “scale up”, means adding more power (CPU, RAM, etc.) to your servers. Horizontal scaling, referred to as “scale-out”, allows you to scale by adding more servers to your pool of resources.

When traffic is low, vertical scaling is a great option, and its simplicity is its main advantage. Unfortunately, it comes with serious limitations:

- Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
- Vertical scaling does not provide failover and redundancy. If the server goes down, the website/app goes down with it completely.

Horizontal scaling is more desirable for large-scale applications due to the limitations of vertical scaling.

Load Balancer

In the previous design, users connect to the web server directly, so they cannot access the website if the web server is offline. In another scenario, if many users access the web server simultaneously and it reaches the web server’s load limit, users generally experience slower responses or fail to connect to the server. A load balancer is the best technique to address these problems.

💡 A load balancer evenly distributes incoming traffic among the web servers defined in a load-balanced set.

With this setup, web servers are no longer reachable directly by clients.
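To make the distribution idea concrete, here is a minimal sketch of round-robin load balancing, one common distribution strategy. The server addresses and the request loop are hypothetical stand-ins, not a real load balancer implementation:

```python
import itertools

# Hypothetical pool of web servers behind the load balancer.
SERVERS = ["10.0.0.1:80", "10.0.0.2:80"]

# Round-robin: hand out servers in a repeating cycle so that
# traffic is spread evenly across the load-balanced set.
_rotation = itertools.cycle(SERVERS)

def pick_server() -> str:
    """Return the next web server to receive an incoming request."""
    return next(_rotation)

# Each incoming request is forwarded to a different server in turn.
for request_id in range(4):
    print(f"request {request_id} -> {pick_server()}")
```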
After a load balancer and a second web server are added, we have addressed the failover issue and improved the availability of the web tier:

- If server 1 goes offline, all the traffic is routed to server 2. This prevents the website from going offline. We can also add a new healthy web server to the server pool to balance the load.
- If the website traffic grows rapidly and two servers are not enough to handle it, the load balancer handles this problem gracefully. You only need to add more servers to the web server pool, and the load balancer automatically starts to send requests to them.

Database Replication

The current design has one database, so it does not support failover and redundancy. Database replication is a common technique to address those problems.

💡 Database replication can be used in many database management systems, usually with a master/slave relationship between the original (master) and the copies (slaves).

A master database generally only supports write operations. A slave database gets copies of the data from the master database and only supports read operations. All the data-modifying commands like insert, delete, or update must be sent to the master database (a brief sketch of this routing rule appears below). Most applications require a much higher ratio of reads to writes; thus, the number of slave databases in a system is usually larger than the number of master databases.

Advantages:

- Better performance: In the master-slave model, all writes and updates happen on master nodes, whereas read operations are distributed across slave nodes. This model improves performance because it allows more queries to be processed in parallel.
- Reliability: If one of your database servers is destroyed, data is still preserved. You do not need to worry about data loss because data is replicated across multiple locations.
- High availability: By replicating data across different locations, your website remains in operation even if a database is offline, as you can access data stored on another database server.

What if one of the databases goes offline?

- If only one slave database is available and it goes offline, read operations are directed to the master database temporarily.
- If multiple slave databases are available, read operations are redirected to other healthy slave databases.
- If the master database goes offline, a slave database is promoted to be the new master.

Cache

It is time to improve the load/response time. This can be done by adding a cache layer.

💡 A cache is a temporary storage area that stores the results of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly.

The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, the ability to reduce database workloads, and the ability to scale the cache tier independently.

After receiving a request, a web server first checks whether the cache has the available response. If it does, it sends the data back to the client. If not, it queries the database, stores the response in the cache, and sends it back to the client. This caching strategy is called a read-through cache (other caching strategies are available).
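Returning to the database replication section, here is a minimal sketch of the read/write routing rule. The connection strings and the simple statement check are placeholder assumptions; in practice a database driver or proxy usually handles this:

```python
import random

# Hypothetical replica set: one master for writes, slaves for reads.
MASTER = "db-master:5432"
SLAVES = ["db-slave-1:5432", "db-slave-2:5432"]

def route_query(sql: str) -> str:
    """Send data-modifying statements to the master; spread reads over slaves."""
    is_write = sql.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}
    if is_write or not SLAVES:
        return MASTER                # writes (or no healthy slave) go to master
    return random.choice(SLAVES)     # reads are distributed across slaves

print(route_query("INSERT INTO users VALUES (1)"))  # -> db-master:5432
print(route_query("SELECT * FROM users"))           # -> one of the slaves
```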
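And the read-through flow just described, sketched in the same spirit. The in-memory dictionary, the query_database helper, and the TTL value are stand-ins for a real cache server, a real database call, and an expiration policy:

```python
import time

cache = {}          # stand-in for a cache server such as Memcached or Redis
TTL_SECONDS = 60    # assumed expiration time; see the expiration policy below

def query_database(key: str) -> str:
    return f"row-for-{key}"   # stand-in for a real database query

def get(key: str) -> str:
    """Serve from cache when possible; otherwise fetch from the database and cache it."""
    entry = cache.get(key)
    if entry is not None and entry["expires_at"] > time.time():
        return entry["value"]                      # cache hit
    value = query_database(key)                    # cache miss: hit the database
    cache[key] = {"value": value, "expires_at": time.time() + TTL_SECONDS}
    return value
```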
Here are a few considerations for using a cache system:

- When to use cache: Consider using cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. For instance, if a cache server restarts, all the data in memory is lost. Thus, important data should be saved in persistent data stores.
- Expiration policy: It is good practice to implement an expiration policy. Once cached data expires, it is removed from the cache. Without an expiration policy, cached data stays in memory permanently. It is advisable not to make the expiration date too short, as this causes the system to reload data from the database too frequently. Meanwhile, it is advisable not to make it too long, as the data can become stale.
- Consistency: This involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and the cache are not in a single transaction.
- Mitigating failures: A single cache server represents a potential single point of failure (SPOF): if it fails, it can stop the entire system from working. As a result, multiple cache servers across different data centers are recommended to avoid a SPOF.
- Eviction policy: Once the cache is full, any request to add items to the cache might cause existing items to be removed. This is called cache eviction. Least-recently-used (LRU) is the most popular cache eviction policy. Other eviction policies, such as least-frequently-used (LFU) or first-in-first-out (FIFO), can be adopted to satisfy different use cases.

CDN - Content Delivery Network

💡 A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc.

Here is how a CDN works at a high level: when a user visits a website, the CDN server closest to the user delivers the static content. Intuitively, the farther users are from CDN servers, the slower the website loads. For example, if CDN servers are in San Francisco, users in Los Angeles will get content faster than users in Europe.

CDN workflow

1. User A tries to get image.png by using an image URL.
2. If the CDN server does not have image.png in its cache, the CDN server requests the file from the origin, which can be a web server or online storage like Amazon S3.
3. The origin returns image.png to the CDN server, optionally including the HTTP header Time-to-Live (TTL), which describes how long the image should be cached.
4. The CDN caches the image and returns it to User A. The image remains cached in the CDN until the TTL expires.
5. User B sends a request to get the same image.
6. The image is returned from the cache as long as the TTL has not expired.

Considerations for using a CDN:

- Cost: CDNs are run by third-party providers, and you are charged for data transfers in and out of the CDN.
- Setting an appropriate cache expiry: For time-sensitive content, setting a suitable cache expiry time is important. It should be neither too long nor too short: if it is too long, the content might no longer be fresh; if it is too short, it causes repeated reloading of content from origin servers to the CDN.
- CDN fallback: You should consider how your website/application copes with CDN failure. If there is a temporary CDN outage, clients should be able to detect the problem and request resources from the origin.
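As a concrete illustration of the fallback point, here is a minimal client-side sketch under stated assumptions: the CDN and origin URLs are hypothetical, and the timeouts are arbitrary:

```python
import urllib.error
import urllib.request

# Hypothetical URLs: the same asset served by the CDN and by the origin.
CDN_URL = "https://cdn.mysite.com/image.png"
ORIGIN_URL = "https://www.mysite.com/image.png"

def fetch_asset() -> bytes:
    """Try the CDN first; on a CDN outage, fall back to the origin server."""
    try:
        with urllib.request.urlopen(CDN_URL, timeout=2) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError):
        # CDN unreachable or slow: request the resource from the origin instead.
        with urllib.request.urlopen(ORIGIN_URL, timeout=5) as resp:
            return resp.read()
```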
Design with CDN & Cache included

Stateful vs Stateless Systems

It is time to consider scaling the web tier horizontally. For this, we need to move state (for instance, user session data) out of the web tier. A good practice is to store session data in persistent storage such as a relational database or NoSQL. Each web server in the cluster can then access state data from the databases. This is called a stateless web tier.

Stateful Architecture

A stateful server remembers client data (state) from one request to the next. User A’s session data and profile image are stored on Server 1. To authenticate User A, HTTP requests must be routed to Server 1. If a request is sent to another server, like Server 2, authentication fails because Server 2 does not contain User A’s session data. Similarly, all HTTP requests from User B must be routed to Server 2, and all requests from User C must be sent to Server 3.

The issue is that every request from the same client must be routed to the same server. This adds overhead, makes adding or removing servers much more difficult, and makes it challenging to handle server failures.

Stateless Architecture

A stateless server keeps no state information. In this architecture, HTTP requests from users can be sent to any web server, which fetches state data from a shared data store. State data is stored in a shared data store and kept out of the web servers. A stateless system is simpler, more robust, and more scalable (a brief sketch of this pattern appears below, after the message queue discussion).

Design with the horizontal scaling update (stateless architecture)

After the state data is moved out of the web servers, auto-scaling of the web tier is easily achieved by adding or removing servers based on traffic load.

Data Centers

To improve availability and provide a better user experience across wider geographical areas, supporting multiple data centers is crucial.

Example with 2 Data Centers

In normal operation, users are geo-routed to the closest data center, with traffic split x% to US-East and (100 - x)% to US-West. In the event of any significant data center outage, we direct all traffic to a healthy data center.

Message Queue

To further scale our system, we need to decouple different components of the system so they can be scaled independently. A message queue is a key strategy employed by many real-world distributed systems to solve this problem.

💡 A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests.

Message queue architecture

Input services, called producers/publishers, create messages and publish them to a message queue. Other services or servers, called consumers/subscribers, connect to the queue and perform actions defined by the messages.

With the message queue, the producer can post a message to the queue even when the consumer is unavailable to process it, and the consumer can read messages from the queue even when the producer is unavailable. The producer and the consumer can be scaled independently.

Use case: Your application supports photo customization, including cropping, sharpening, blurring, etc. These customization tasks take time to complete. Web servers publish photo processing jobs to the message queue, and photo processing workers (consumers) pick up the jobs from the queue and asynchronously perform the customization tasks. When the queue grows large, more workers are added to reduce the processing time; if the queue is empty most of the time, the number of workers can be reduced.
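To make the producer/consumer split concrete, here is a minimal in-process sketch using Python’s standard queue module. A real system would use a dedicated broker (RabbitMQ, Kafka, SQS, etc.), and the job format is made up for illustration:

```python
import queue
import threading

job_queue = queue.Queue()   # stand-in for a dedicated message broker

def web_server():
    """Producer: publishes photo processing jobs and returns immediately."""
    for photo in ["a.png", "b.png", "c.png"]:
        job_queue.put({"photo": photo, "op": "sharpen"})

def worker():
    """Consumer: picks up jobs and performs the slow customization work."""
    while True:
        job = job_queue.get()
        print(f"processing {job['op']} on {job['photo']}")
        job_queue.task_done()

threading.Thread(target=worker, daemon=True).start()  # add more to scale out
web_server()
job_queue.join()   # wait until every published job has been processed
```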
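And here is the stateless web tier pattern referenced earlier, as a minimal sketch: any server can handle any request because sessions live in a shared store. The dictionary is a stand-in for something like Redis or a NoSQL table, and the handler is hypothetical:

```python
# Stand-in for a shared session store (e.g., Redis or a NoSQL table):
# because it lives outside the web servers, any server can serve any user.
session_store = {}

def handle_request(server_name: str, session_id: str) -> str:
    """Any web server can authenticate any user by reading the shared store."""
    session = session_store.get(session_id)
    if session is None:
        return f"{server_name}: 401 unauthenticated"
    return f"{server_name}: hello {session['user']}"

session_store["abc123"] = {"user": "UserA"}
print(handle_request("server-1", "abc123"))  # server-1: hello UserA
print(handle_request("server-2", "abc123"))  # server-2 works equally well
```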
Logging, metrics, automation

- Logging: Monitoring error logs is important because it helps to identify errors and problems in the system. You can monitor error logs at the per-server level or use tools to aggregate them into a centralized service for easy search and viewing.
- Metrics: Collecting different types of metrics helps us gain business insights and understand the health status of the system. Some useful metrics:
  - Host-level metrics: CPU, memory, disk I/O, etc.
  - Aggregated-level metrics: for example, the performance of the entire database tier, cache tier, etc.
  - Key business metrics: daily active users, retention, revenue, etc.
- Automation: When a system gets big and complex, we need to build or leverage automation tools to improve productivity. Continuous integration is a good practice, in which each code check-in is verified through automation, allowing teams to detect problems early. Besides, automating your build, test, and deploy processes could improve developer productivity significantly.

Design with Message Queue, Logging, metrics & automation

Database Scaling

As the data grows every day, your database gets more overloaded. It is time to scale the data tier.

Vertical vs Horizontal scaling

Vertical scaling, also known as scaling up, means adding more power (CPU, RAM, disk, etc.) to an existing machine. However, vertical scaling comes with serious drawbacks:

- There are hardware limits. If you have a large user base, a single server is not enough.
- There is a greater risk of single points of failure.
- The overall cost of vertical scaling is high, because powerful servers are much more expensive.

Horizontal scaling, also known as sharding, is the practice of adding more servers. Sharding separates large databases into smaller, more easily managed parts called shards. Each shard shares the same schema, though the actual data on each shard is unique to it.

Sharding Use Case

User data is allocated to a database server based on user IDs. Anytime you access data, a hash function is used to find the corresponding shard. In our example, user_id % 4 is used as the hash function. If the result equals 0, shard 0 is used to store and fetch data; if the result equals 1, shard 1 is used. The same logic applies to the other shards.

The most important factor to consider when implementing a sharding strategy is the choice of the sharding key. A sharding key (also known as a partition key) consists of one or more columns that determine how data is distributed; here, user_id is the sharding key. A sharding key allows you to retrieve and modify data efficiently by routing database queries to the correct database. When choosing a sharding key, one of the most important criteria is to choose a key that distributes data evenly.

Problems with Sharding

- Celebrity problem: Excessive access to a specific shard could cause server overload.
- Resharding: Certain shards might experience shard exhaustion faster than others due to uneven data distribution. When shard exhaustion happens, it requires updating the sharding function and moving data around.
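The hash-based routing described in the use case, as a minimal sketch; the shard connection strings are placeholders:

```python
# Placeholder connection strings for the four shards in the example.
SHARDS = ["shard-0:5432", "shard-1:5432", "shard-2:5432", "shard-3:5432"]

def shard_for(user_id: int) -> str:
    """Route a query to the shard chosen by the hash function user_id % 4."""
    return SHARDS[user_id % len(SHARDS)]

print(shard_for(1))   # user_id 1 -> shard-1:5432
print(shard_for(4))   # user_id 4 % 4 == 0 -> shard-0:5432
```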
Design with Horizontal Database Scaling (Sharding)