# Distributed Computing Challenges
## Scalability

### Independent parallel processing of requests
- Ideally linear scalability (n more servers = support for n more users)
- Though this is hard to achieve in practice because of:
  - Overheads & synchronization
  - Load imbalances that create hot-spots
  - Amdahl's law: the speedup from parallelizing a sequence of operations is limited. Even if some operations can be sped up by performing them in parallel, the operations that cannot (such as reading or writing data) bound how much the system can be improved (see the sketch after this list).
- It is therefore necessary to partition both data and compute in order to meet the load demand
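
As a minimal sketch (not tied to any particular system), Amdahl's law can be computed directly: with a parallel fraction `p` of the work and `n` workers, the overall speedup is `1 / ((1 - p) + p / n)`, so the serial fraction caps the gain no matter how many servers are added.

```python
# Amdahl's law: the parallel fraction p is sped up by n workers,
# the serial fraction (1 - p) is not.
def amdahl_speedup(p: float, n: int) -> float:
    """Overall speedup with parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, the speedup is capped
# at 1 / 0.05 = 20x, however many servers are added.
for n in (2, 8, 64, 1024):
    print(f"n={n:4d}  speedup={amdahl_speedup(0.95, n):5.2f}x")
```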
## Fault Tolerance

### Mask & recover from failures
- Because full redundancy is too expensive, use quick detection & failure recovery instead
- Types of failure recovery
- Replication (replicate data & service; raises consistency issues)
- Re-computation (easy for stateless services; for stateful computation, remember the data lineage so lost results can be rebuilt)
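
A hypothetical sketch of lineage-based re-computation (all names are illustrative): instead of replicating every intermediate result, remember how it was derived and re-run those steps if the cached value is lost.

```python
from typing import Callable

class LineageValue:
    """A value that remembers its lineage: how to rebuild itself."""
    def __init__(self, compute: Callable[[], object]):
        self.compute = compute   # recorded lineage
        self.cached = None
        self.valid = False

    def get(self) -> object:
        if not self.valid:       # e.g. lost after a node failure
            self.cached = self.compute()
            self.valid = True
        return self.cached

    def invalidate(self) -> None:
        """Simulate losing this partition in a failure."""
        self.valid = False

raw = LineageValue(lambda: list(range(10)))
doubled = LineageValue(lambda: [x * 2 for x in raw.get()])

print(doubled.get())   # computed once
doubled.invalidate()   # "failure": result lost
print(doubled.get())   # transparently recomputed from lineage
```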
## High Availability

### Service operates 24/7
- Downtime = bad customer experience & loss in revenue
- Commitment to a certain % of availability captured in service level agreements (SLAs)
- How to achieve high availability?
- Eliminate single points of failure through redundancy
- Reliable crossover (failover) to the redundant components
- Efficiently monitor & detect failures
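
A minimal sketch of heartbeat-based failure detection (names and the timeout are illustrative): each server periodically reports a heartbeat, and if none arrives within the timeout the monitor marks it failed so traffic can cross over to a replica.

```python
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, server: str) -> None:
        """Record that a server is still alive."""
        self.last_seen[server] = time.monotonic()

    def failed_servers(self) -> list[str]:
        """Servers whose last heartbeat is older than the timeout."""
        now = time.monotonic()
        return [s for s, t in self.last_seen.items()
                if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=0.1)
monitor.heartbeat("replica-a")
monitor.heartbeat("replica-b")
time.sleep(0.15)
monitor.heartbeat("replica-a")   # replica-b stays silent
print(monitor.failed_servers())  # ['replica-b'] -> fail over
```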
## Consistency

### Consistent results
- Different trade-offs exist when replicating application state
- CAP theorem: it is impossible for a distributed data store to simultaneously provide more than 2 of the following 3 guarantees:
  - Consistency
  - Availability
  - Partition tolerance (the system keeps operating despite network partitions)
- Main choice: strongly consistent but with additional latency vs. inconsistent but with better performance & availability (see the sketch below)
- Many use-cases therefore embrace eventual consistency in favor of high availability
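
A toy sketch of that trade-off (illustrative, not a real store): reading a quorum of replicas returns the latest write at the cost of extra round trips, while reading a single replica is faster but may return stale data until replication catches up.

```python
# Three replicas; here the latest write (version 2) has so far
# reached only one of them.
replicas = [
    {"version": 2, "value": "new"},   # has the latest write
    {"version": 1, "value": "old"},   # lagging replica
    {"version": 1, "value": "old"},   # lagging replica
]

def quorum_read(replicas, quorum=2):
    """Read several replicas and keep the highest-versioned answer.
    Reads are strongly consistent when write quorum + read quorum
    exceeds the replica count, at the cost of extra latency."""
    answers = replicas[:quorum]
    return max(answers, key=lambda r: r["version"])["value"]

def read_one(replicas):
    """Eventually consistent: any single replica may answer."""
    return replicas[-1]["value"]

print(quorum_read(replicas))  # 'new' (waits on 2 replicas)
print(read_one(replicas))     # 'old' (fast, but stale)
```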
## Performance

### Predictable low latency & high throughput
- Latency affects traffic and therefore revenue. A delay of 100 ms can already mean a 6% drop in sales.
- Optimize for tail latency, i.e. the slowest fraction of the request distribution (e.g. the 99th or 99.9th percentile of response times). Tail effects are amplified at scale by fan-outs across microservices & data partitions (see the sketch below).
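
A small sketch of why fan-out amplifies the tail: if a user request fans out to `n` backend calls and each call independently lands in the tail (say, above its p99) with probability 0.01, the chance that the user request is slow grows quickly with `n`. The independence assumption is a simplification for illustration.

```python
def p_slow_request(n_fanout: int, p_tail: float = 0.01) -> float:
    """Probability that at least one of n parallel calls hits the tail."""
    return 1.0 - (1.0 - p_tail) ** n_fanout

# With a fan-out of 100, roughly 63% of user requests
# experience at least one tail-latency backend call.
for n in (1, 10, 100):
    print(f"fan-out {n:3d}: {p_slow_request(n):.1%} of requests hit the tail")
```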