Uploaded by Boitumelo Rethabile

Data-Intensive Applications: Reliability & Scalability

Databases Notes: LU1
Many applications today are data-intensive rather than compute-intensive.
The limiting factors for these applications are usually the amount of data, the complexity of data,
and the speed at which it is changing, not CPU power.
A data-intensive application is generally built from standard building blocks providing commonly
needed functionality, including the ability to:
Store data for later retrieval (databases)
Remember the result of an expensive operation to speed up reads (caches)
Allow users to search data by keyword or filter it in various ways (search indexes)
Send messages to other processes so they can be handled asynchronously, e.g. for data that updates
continuously (stream processing)
Deal with accumulated data at a later point rather than immediately (batch processing)
These building blocks provide abstraction: they can be used without thinking too much about, or
being aware of, the underlying implementation.
DATA SYSTEMS (Message queues vs Databases)
 Similar in that both store data, but they have different access patterns and implementations.
 Both fall under the umbrella term “data systems” because some datastores can be used as message
queues (Redis), and some message queues provide database-like durability guarantees (Apache
Kafka). The boundaries between the two have become blurred.
 A single tool cannot fulfil all the data processing and storage requirements of an app, so the work
is broken down into tasks that a single tool can perform, and the different tools are then stitched
together using application code.
 It is the responsibility of the application code to keep the various tools in sync with each other,
e.g. when there is an application-managed caching layer, or a full-text search server separate from
the main database.
By combining tools to provide a service, the API hides the implementation details from the client.
In Figure 1, the service can guarantee that the cache is updated or invalidated correctly on writes,
so that clients see consistent results.
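As a minimal sketch of the guarantee described above, assuming plain in-memory dicts stand in for the main database and the cache (the class and method names are hypothetical and not part of these notes), application code can hide both tools behind one API and invalidate the cache on every write:

```python
class CachedStore:
    """Hypothetical facade hiding a database and a cache behind one API."""

    def __init__(self):
        self._db = {}      # stand-in for the main database
        self._cache = {}   # stand-in for the caching layer
        self.db_reads = 0  # counts how often a read reaches the database

    def read(self, key):
        # Serve from the cache when possible; otherwise fall back to the database.
        if key in self._cache:
            return self._cache[key]
        self.db_reads += 1
        value = self._db.get(key)
        if value is not None:
            self._cache[key] = value
        return value

    def write(self, key, value):
        # Write to the database first, then invalidate the cached copy so the
        # next read cannot return a stale value.
        self._db[key] = value
        self._cache.pop(key, None)
```

The client only calls read and write; whether a value came from the cache or the database is an implementation detail it never sees.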
Factors that influence the design of a data system include:
 Skills and experience of people
 Legacy system dependencies
 Time scale for delivery
 Organisation’s tolerance for risk
The system should continue to work correctly (function at the desired level of performance) even in
the face of adversity (faults and human error).
EXPECTATIONS (Working Correctly)
o Application performs the functions that are expected by the user.
o Can tolerate users making mistakes or using the software in unexpected ways.
o Performance is satisfactory for the required use case under expected load and data volume.
o System prevents unauthorized access.
FAULTS (What could go wrong)
Fault: a component of the system deviating from its spec.
Failure: the system as a whole stops providing the required service.
Resilient (fault-tolerant) system = one that anticipates faults and copes with them.
“Fault-tolerant” might suggest that the system can tolerate every possible fault, which is not feasible
in reality. The probability of a fault cannot be reduced to zero, so fault-tolerance mechanisms must be
designed to prevent faults from causing failures.
Trigger faults deliberately to find bugs and other weaknesses. This ensures that the
fault-tolerance machinery is continually exercised and tested, which increases confidence that
faults will be handled appropriately when they occur naturally.
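A minimal sketch of deliberate fault injection, under stated assumptions: the names below are hypothetical, the "fault-tolerance machinery" is a simple bounded retry, and faults are injected on fixed call numbers rather than at random so that the example is repeatable.

```python
class InjectedFault(Exception):
    """Raised deliberately by the fault injector (hypothetical name)."""


def inject_faults(func, fail_on_calls):
    """Wrap func so that the listed call numbers (1-based) raise InjectedFault."""
    calls = {"n": 0}

    def wrapper(*args, **kwargs):
        calls["n"] += 1
        if calls["n"] in fail_on_calls:
            raise InjectedFault(f"injected fault on call {calls['n']}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts=3):
    """Mechanism under test: retry a bounded number of times, then give up."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error
```

Wrapping a dependency with inject_faults in a test environment shows whether the retry logic really copes with two consecutive faults, and whether it gives up cleanly when the fault persists.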
Hardware faults: hard disks crash, RAM becomes faulty, someone unplugs the wrong network cable,
the power goes out, etc.
Add redundancy to individual hardware components to reduce the failure rate of the system. When
one component dies, the redundant component takes its place while the broken component is
being repaired (for apps where availability is essential):
- Set up disks in RAID configuration.
- Dual power supply for servers.
- Hot-swappable CPUs.
- Backup power and generators for data centers.
Apps now use a larger number of machines (because of increased data volumes and computing
demands), which increases the rate of hardware faults. Using software fault-tolerance techniques in
addition to hardware redundancy allows the system to tolerate the loss of an entire machine.
Software errors: systematic errors within the system. They are harder to anticipate and, because
they are correlated across nodes, they tend to cause many more system failures than uncorrelated
hardware faults. E.g.:
- Software bug that causes instance of application server to crash when given bad input.
- Runaway process that uses up some shared resource (memory, disk space or network bandwidth).
- Service that the system depends on that becomes slow and/or unresponsive.
- Cascading failures, where a small fault in one component triggers a fault in another component,
which in turn triggers further faults.
Solutions:
- Thorough testing.
- Process isolation.
- Allow processes to crash and restart.
- Measure, monitor and analyse system behaviour.
Humans are not reliable, so make the system reliable in spite of this:
Design systems in a way that minimizes the opportunity for errors to occur. Well-designed
abstractions (APIs) make it easy to do the right thing. But if the interfaces are too restrictive,
people will find ways to work around them, nullifying their benefit.
Decouple the places where people make the most mistakes from the places where they can
cause failures. Provide a non-production environment (sandbox) where people can experiment
safely using real data without affecting the running system.
Test thoroughly at all levels. Automated testing is especially useful for covering corner cases
that rarely arise in normal operation.
Allow quick and easy recovery from human error to minimize its impact. Ensure that rolling
back configuration changes is fast. Roll out new code gradually so that unexpected bugs
affect only a small subset of the system.
Set up detailed and clear monitoring, e.g. performance metrics and error rates. Monitoring
can show early warning signals so that problems can be attended to as soon as possible.
Implement good management practices and training.
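The gradual rollout mentioned above can be sketched as a percentage-based check. This is only an illustration under assumptions: the function name and the idea of bucketing users by a hash of their ID are hypothetical choices, not something these notes prescribe.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent
```

Because the bucket depends only on the user ID, a user who receives the new code at 10% still receives it at 50%, and setting percent to 0 acts as a fast rollback.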
Bugs in apps cause loss of productivity, which negatively affects organisations, and there is a
responsibility to users to ensure that their data is not corrupted.
There should be reasonable ways to deal with the growth of the system (in data volume, traffic
volume or its complexity).
Ability of system to deal with increased load.
If the system grows in a particular way, what are the options for coping with the growth, and how
can computing resources be added to handle the additional load?
 Load is described by load parameters; the best choice of parameters depends on the system
architecture. They can be:
- Requests per second to web server
- Ratio of reads and writes in db
- Number of simultaneously active users in a chatroom
 Investigate what happens when load increases.
- What happens to performance when increasing load parameters and keeping system
resources unchanged?
- How much do resources need to increase when load parameters are increased and
performance must stay the same?
In a batch processing system, performance = throughput: the number of records processed per
second, or the total time taken to run a job on a dataset of a certain size.
In online systems, performance = the response time of the service: the time between a client sending
a request and receiving a response (note that the response time is not the same for every request).
- Response time = distribution of values that can be measured
- Varies because of random additional latency: a context switch to a background process, loss
of network packets, etc.
- Can use the average response time to estimate the typical response time, but it does not show
how many users actually experienced a delay.
- Percentiles. Sort response times from fastest to slowest; if the median (50th percentile)
response time is 200 ms (for a single request), half of the requests return in less than 200 ms
and the other half take longer.
- Look at higher percentiles (95, 99, 99.9) – tail latencies – to see how bad outliers are. If
95th percentile is 1.5sec, 95 out of 100 requests take less than 1.5sec and 5 out of 100
take longer.
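- Tail latencies are important because they directly affect users' experience. Amazon uses the
99.9th percentile to describe response time requirements for internal services.
- Customers with the slowest requests often have the most data on their accounts because they
have made many purchases, so they are the most valuable customers.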
- Reducing response times at high percentiles = difficult because they are easily affected
by random events out of your control.
- Percentiles are used in Service Level Objectives and Service Level Agreements, where the
expected performance and availability of service are defined.
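To make the percentile bullets above concrete, here is a minimal sketch using the nearest-rank definition (one of several common definitions; the function name and the sample data are assumptions made for illustration):

```python
import math


def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(response_times_ms)  # fastest to slowest
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Nine fast requests and one slow outlier (hypothetical measurements, in ms).
samples = [100] * 9 + [2000]
average = sum(samples) / len(samples)  # 290.0, which matches no real request
median = percentile(samples, 50)       # 100
p95 = percentile(samples, 95)          # 2000, the outlier shows up in the tail
```

The average of 290 ms describes neither the typical request nor the slow one, while the median and the 95th percentile together show that most requests are fast and the tail is slow.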
Queueing delays are often responsible for a large part of the response time at high percentiles.
A server can process only a small number of things in parallel, so it takes just a small number of
slow requests to hold up the processing of subsequent requests (head-of-line blocking).
Even if the subsequent requests are fast to process, the client will see a slow response time because
of the time spent waiting for the prior request to complete. Response times must therefore be
measured on the client side.
 Scaling up (vertical scaling): move to a more powerful machine.
 Scaling out (horizontal scaling): distribute the load across multiple smaller machines, also known as a shared-nothing architecture.
 Moving a stateful data system from a single node to a distributed setup can introduce a lot of complexity.
Therefore, scale up until the cost of scaling up forces the system to be distributed.
 Elastic systems that can automatically add computing resources when an increase in load is
detected. Useful if load is unpredictable.
 Scaling manually involves a human analysing the system capacity and deciding to add more machines.
 An architecture that scales well is built around assumptions about which operations will be common
and which will be rare.
 Scaling is specific to a particular application.
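For the elastic systems described above, the core scaling decision can be sketched as a capacity calculation. The function name, the choice of requests per second as the load parameter, and the fixed per-machine capacity are all assumptions made for illustration; a real autoscaler would also smooth the load signal and limit how quickly it scales.

```python
import math


def machines_needed(requests_per_sec, capacity_per_machine, min_machines=1):
    """Smallest machine count whose combined capacity covers the observed load."""
    if capacity_per_machine <= 0:
        raise ValueError("capacity_per_machine must be positive")
    return max(min_machines, math.ceil(requests_per_sec / capacity_per_machine))
```

For example, with a hypothetical capacity of 100 requests per second per machine, a load of 950 requests per second calls for 10 machines and 1001 requests per second calls for 11.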
Other people should be able to productively work on the system when maintaining or adapting it to
new use cases.
Design software in a way that will minimize problems during maintenance, thus avoiding the
creation of legacy software.
Make it easy for teams to keep system running smoothly.
Good operations team typically responsible for:
o Monitor the health of the system and quickly restore service when it is in a bad state.
o Track down the cause of problems, such as system failures or degraded performance.
o Keep software platforms up to date.
o Keep tabs on how different systems affect each other.
Good operability means making routine tasks easy, allowing operations team to focus their
efforts on high value activities.
Things that data systems can do to make routine tasks easy:
o Provide good visibility into runtime behaviour and internals of the system, with good monitoring.
o Avoid dependency on individual machines.
o Provide good support for automation and integration with standard tools.
o Exhibit predictable behaviour, minimizing surprises.
o Self-healing where appropriate, but also giving administrators manual control over the
system state when needed.
Make it easy for new engineers to understand the system by removing complexity.
Complexity in larger projects slows down the people who have to work on the system and increases
the cost of maintenance.
Symptoms of complexity:
- Explosion of state space
- Tight coupling of modules
- Tangled dependencies
- Inconsistent naming and terminology
Complexity = greater risk of introducing bugs when making a change (new developers may not
understand hidden assumptions).
Simplicity != reducing functionality; it means removing accidental complexity (complexity that is not
inherent in the problem the software solves, but arises only from the implementation).
Tools for removing accidental complexity:
- Abstraction: hide implementation detail behind a simple-to-understand façade. This allows
the abstraction to be reused instead of implementing a similar thing multiple times, which
leads to higher-quality applications.
- High-level programming languages are abstractions that hide machine code, CPU registers
and syscalls.
Making change easy.
AKA Agility.
Agile working patterns provide a framework for adapting to change.
The Agile community has developed technical tools and patterns that aid in developing software in
a changing environment: TDD (test-driven development) and refactoring.
The ease with which a data system can be modified and adapted to change is closely linked to its
simplicity and abstractions. Simple, easy-to-understand systems are easier to modify.