# Distributed Computing Challenges
## Scalability

### Independent parallel processing of requests
- Ideally linear scalability (n more servers = support for n more users)
- Though this is hard to achieve in practice because of:
  - Overheads & synchronization
  - Load imbalances that create hot-spots
  - Amdahl's law: the speedup from parallelizing a sequence of operations is limited. Even if some operations can be sped up by performing them in parallel, the operations that cannot (such as reading or writing data) bound how much the system can be improved (see the sketch after this list).
- It is therefore necessary to partition both data and compute in order to meet the load demand
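
As a minimal sketch (not tied to any particular system), Amdahl's law can be computed directly: with a parallel fraction `p` of the work and `n` workers, the overall speedup is `1 / ((1 - p) + p / n)`, so the serial fraction caps the gain no matter how many servers are added.

```python
# Amdahl's law: the parallel fraction p is sped up by n workers,
# the serial fraction (1 - p) is not.
def amdahl_speedup(p: float, n: int) -> float:
    """Overall speedup with parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, the speedup is capped
# at 1 / 0.05 = 20x, however many servers are added.
for n in (2, 8, 64, 1024):
    print(f"n={n:4d}  speedup={amdahl_speedup(0.95, n):5.2f}x")
```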
## Fault Tolerance

### Mask & recover from failures
- Because full redundancy is too expensive, use quick detection & failure recovery instead
- Types of failure recovery
- Replication (replicate data & service; raises consistency issues)
- Re-computation (easy for stateless services; for stateful computation, remember the data lineage so lost results can be rebuilt)
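
A hypothetical sketch of lineage-based re-computation (all names are illustrative): instead of replicating every intermediate result, remember how it was derived and re-run those steps if the cached value is lost.

```python
from typing import Callable

class LineageValue:
    """A value that remembers its lineage: how to rebuild itself."""
    def __init__(self, compute: Callable[[], object]):
        self.compute = compute   # recorded lineage
        self.cached = None
        self.valid = False

    def get(self) -> object:
        if not self.valid:       # e.g. lost after a node failure
            self.cached = self.compute()
            self.valid = True
        return self.cached

    def invalidate(self) -> None:
        """Simulate losing this partition in a failure."""
        self.valid = False

raw = LineageValue(lambda: list(range(10)))
doubled = LineageValue(lambda: [x * 2 for x in raw.get()])

print(doubled.get())   # computed once
doubled.invalidate()   # "failure": result lost
print(doubled.get())   # transparently recomputed from lineage
```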
## High Availability

### Service operates 24/7
- Downtime = bad customer experience & loss in revenue
- Commitment to a certain % of availability captured in service level agreements (SLAs)
- How to achieve high availability?
- Eliminate single points of failure through redundancy
- Reliable crossover (failover) to the redundant components
- Efficiently monitor & detect failures
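
A minimal sketch of heartbeat-based failure detection (names and the timeout are illustrative): each server periodically reports a heartbeat, and if none arrives within the timeout the monitor marks it failed so traffic can cross over to a replica.

```python
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, server: str) -> None:
        """Record that a server is still alive."""
        self.last_seen[server] = time.monotonic()

    def failed_servers(self) -> list[str]:
        """Servers whose last heartbeat is older than the timeout."""
        now = time.monotonic()
        return [s for s, t in self.last_seen.items()
                if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=0.1)
monitor.heartbeat("replica-a")
monitor.heartbeat("replica-b")
time.sleep(0.15)
monitor.heartbeat("replica-a")   # replica-b stays silent
print(monitor.failed_servers())  # ['replica-b'] -> fail over
```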
## Consistency

### Consistent results
- Different trade-offs exist when replicating application state
- CAP theorem: it is impossible for a distributed data store to simultaneously provide more than 2 of the following 3 guarantees:
  - Consistency
  - Availability
  - Partition tolerance (the system keeps operating despite network partitions)
- Main choice: strongly consistent but with additional latency vs. inconsistent but with better performance & availability (see the sketch below)
- Many use-cases therefore embrace eventual consistency in favor of high availability
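
A toy sketch of that trade-off (illustrative, not a real store): reading a quorum of replicas returns the latest write at the cost of extra round trips, while reading a single replica is faster but may return stale data until replication catches up.

```python
# Three replicas; here the latest write (version 2) has so far
# reached only one of them.
replicas = [
    {"version": 2, "value": "new"},   # has the latest write
    {"version": 1, "value": "old"},   # lagging replica
    {"version": 1, "value": "old"},   # lagging replica
]

def quorum_read(replicas, quorum=2):
    """Read several replicas and keep the highest-versioned answer.
    Reads are strongly consistent when write quorum + read quorum
    exceeds the replica count, at the cost of extra latency."""
    answers = replicas[:quorum]
    return max(answers, key=lambda r: r["version"])["value"]

def read_one(replicas):
    """Eventually consistent: any single replica may answer."""
    return replicas[-1]["value"]

print(quorum_read(replicas))  # 'new' (waits on 2 replicas)
print(read_one(replicas))     # 'old' (fast, but stale)
```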
## Performance

### Predictable low latency & high throughput
- Latency affects traffic and therefore revenue. A delay of 100 ms can already mean a 6% drop in sales.
- Optimize for tail latency, i.e. the slowest fraction of the request distribution (e.g. the 99th or 99.9th percentile of response times). Tail effects are amplified at scale by fan-outs across microservices & data partitions (see the sketch below).
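
A small sketch of why fan-out amplifies the tail: if a user request fans out to `n` backend calls and each call independently lands in the tail (say, above its p99) with probability 0.01, the chance that the user request is slow grows quickly with `n`. The independence assumption is a simplification for illustration.

```python
def p_slow_request(n_fanout: int, p_tail: float = 0.01) -> float:
    """Probability that at least one of n parallel calls hits the tail."""
    return 1.0 - (1.0 - p_tail) ** n_fanout

# With a fan-out of 100, roughly 63% of user requests
# experience at least one tail-latency backend call.
for n in (1, 10, 100):
    print(f"fan-out {n:3d}: {p_slow_request(n):.1%} of requests hit the tail")
```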