Distributed Computing Challenges

Scalability

Independent parallel processing of requests

  • Ideally, scalability is linear (n times as many servers support n times as many users)
  • In practice this is hard to achieve because of
    1. Overheads & synchronization

    2. Load imbalances that create hot-spots

    3. Amdahl’s law

      The speedup from parallel processing is limited by the part of the work that must run sequentially. Even if some operations can be sped up by running in parallel, operations that cannot, such as reading or writing shared data, limit how much faster the overall system can become.

  • It is therefore necessary to partition both data and compute in order to meet the load demand
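
To make the limit from Amdahl’s law concrete, here is a minimal sketch of the standard formula: speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n the number of workers. The function name and the 95% figure are illustrative assumptions, not values from a real system.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time no matter how many workers are added.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload in which 95% of the work can run in parallel.
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} workers -> speedup {amdahl_speedup(0.95, n):.2f}")
```

With 95% parallel work the speedup can never exceed 1 / 0.05 = 20, however many servers are added, which is why scalability is rarely linear.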

Fault Tolerance

Mask & recover from failures

  • Because full redundancy is too expensive, rely on quick failure detection & recovery instead
  • Types of failure recovery
    1. Replication (replicates data & services, but raises consistency issues)
    2. Re-computation (easy for stateless services; otherwise remember the data lineage so lost results can be recomputed)
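
Recovery by re-computation can be sketched as follows. This is a toy illustration under assumptions: the `Dataset` class and its fields are hypothetical and not taken from any particular framework. Each dataset records its parent and the transformation that produced it, so a lost result can be rebuilt by replaying its lineage.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Dataset:
    """A dataset that remembers how it was derived (its lineage)."""
    name: str
    parent: Optional["Dataset"] = None
    transform: Optional[Callable[[list], list]] = None
    cached: Optional[list] = None  # materialized result; may be lost on failure

    def compute(self) -> list:
        if self.cached is not None:
            return self.cached
        if self.parent is None or self.transform is None:
            raise RuntimeError(f"{self.name}: data lost and no lineage to rebuild it")
        # Rebuild from the parent by replaying the recorded transformation.
        self.cached = self.transform(self.parent.compute())
        return self.cached


source = Dataset("source", cached=[1, 2, 3, 4])
doubled = Dataset("doubled", parent=source, transform=lambda xs: [2 * x for x in xs])
filtered = Dataset("filtered", parent=doubled, transform=lambda xs: [x for x in xs if x > 4])

first = filtered.compute()

# Simulate a failure that wipes the intermediate and final results.
doubled.cached = None
filtered.cached = None

recovered = filtered.compute()  # recomputed from `source` via the lineage
```

The source data itself still has to survive (for example through replication); lineage only avoids having to replicate every derived result.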

High Availability

Service operates 24/7

  • Downtime = bad customer experience & loss in revenue
  • Commitment to a certain % of availability captured in service level agreements (SLAs)
  • How to achieve high availability?
    1. Eliminate single points of failure through redundancy
    2. Reliable crossover (the failover mechanism must not itself become a point of failure)
    3. Efficiently monitor & detect failures
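
The arithmetic behind an availability commitment can be sketched like this. The function names are illustrative, and the redundancy formula assumes independent failures and perfect crossover, which real systems only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Yearly downtime budget implied by an availability target such as 0.999."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant components.

    Assumes failures are independent and failover always works.
    """
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} available -> {allowed_downtime_hours(target):.2f} h downtime/year")
    print(f"two 99% replicas -> {redundant_availability(0.99, 2):.4%}")
```

Under these assumptions, "three nines" leaves under nine hours of downtime per year, and two redundant components that are each 99% available reach roughly 99.99% together.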

Consistency

Consistent results

  • Replicating application state involves different trade-offs

  • CAP Theorem △

    It is impossible for a distributed data store to simultaneously provide more than 2 out of these 3 guarantees:

    1. Consistency
    2. Availability
    3. Partition tolerance (the system keeps working despite network failures between nodes)
  • Main choice: strong consistency at the cost of additional latency vs. weaker consistency with better performance & availability

  • Many use cases prefer to embrace eventual consistency in exchange for high availability
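
The trade-off can be illustrated with a toy two-replica store. This is a sketch under assumptions: `ReplicatedStore` and its methods are hypothetical, and real replication happens over a network rather than between in-process dictionaries. An asynchronous write returns immediately but leaves the secondary stale until replication catches up, while a synchronous write waits for the secondary first.

```python
class ReplicatedStore:
    """Toy primary/secondary store used to contrast sync and async replication."""

    def __init__(self) -> None:
        self.primary: dict = {}
        self.secondary: dict = {}
        self.pending: list = []  # writes not yet applied to the secondary

    def replicate(self) -> None:
        """Apply all outstanding writes to the secondary, in order."""
        for key, value in self.pending:
            self.secondary[key] = value
        self.pending.clear()

    def write(self, key, value, sync: bool = False) -> None:
        self.primary[key] = value
        self.pending.append((key, value))
        if sync:
            # Strong consistency: do not return until the secondary has caught up.
            self.replicate()
        # Otherwise: acknowledge now and replicate later (eventual consistency).

    def read_secondary(self, key):
        return self.secondary.get(key)


store = ReplicatedStore()
store.write("cart", "book")                 # fast, asynchronous write
stale = store.read_secondary("cart")        # secondary has not seen the write yet
store.replicate()                           # background replication catches up
fresh = store.read_secondary("cart")
store.write("cart", "book+pen", sync=True)  # slower write, consistent immediately
strong = store.read_secondary("cart")
```

Reads from the secondary are temporarily stale after the asynchronous write but converge once replication runs, which is the behaviour eventual consistency accepts in return for lower write latency and higher availability.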

Performance

Predictable low-latency & high throughput

  • Latency affects traffic and therefore revenue. A delay of as little as 100 ms can already mean a 6% drop in sales.
  • Optimize for tail latency, i.e. the slowest fraction of requests in the latency distribution (for example, the slowest 1% of response times). Scale amplifies the tail: a request that fans out to many microservices & data partitions is only as fast as its slowest sub-request.
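
The amplification of tail latency by fan-out can be estimated with a short calculation. This is a sketch that assumes sub-request latencies are independent; the function name is illustrative. If each sub-request has a 1% chance of landing in the slow tail, a request that waits for many of them is very likely to hit at least one slow response.

```python
def prob_slow_request(fan_out: int, quantile: float = 0.99) -> float:
    """Probability that at least one of `fan_out` independent sub-requests
    is slower than the given latency quantile (e.g. 0.99 for the p99)."""
    if fan_out < 1:
        raise ValueError("fan_out must be >= 1")
    return 1.0 - quantile ** fan_out


if __name__ == "__main__":
    for k in (1, 10, 100):
        print(f"fan-out {k:>3}: {prob_slow_request(k):.1%} of requests hit the p99 tail")
```

Under this independence assumption, a fan-out of 100 means that roughly 63% of user-facing requests are slowed down by at least one sub-request in the slowest 1%.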


Mara Avramescu
Author
Software Engineer & CS @ TUM