Scalable System Design

Consistent Hashing

Consistent hashing is used to minimize the number of keys to rebalance when your application cluster is resized. It also allows the nodes to rebalance themselves with traffic evenly distributed.

  • [Ringpop] (https://ringpop.readthedocs.org/en/latest/architecture_design.html#concepts) from Uber uses FarmHash as its hashing function because it’s fast and provides good distribution. Consistent hashing applies a hash function to not only the identity of your data, but also the nodes within your cluster that are operating on that data. Ringpop uses a red-black tree to implement its underlying data structure for its ring which provides log n, lookups, inserts, and removals.

IMAGE ALT TEXT HERE


2 dimensional sharding

With simple hash partitioning, expanding clusters in place involves a non-trivial amount of operational work b/c data needs to be shuffled around as the number of hash partitions increases. Instead, you can create a two-dimensional sharding scheme that data is first divided into multiple time tiers then within each time tier, data is divided into partitions based on a hash function. And within each hash partition, you can further divided into chunks.

In Twitter, they use this approach to handle their full index for all tweets. They use Mesos to generate inverted indices in parallel. A set of chunk "Segment" will be assigned to each machine (ie. Earlybird).

  • To grow data capacity over time, they will add time tiers. Existing time tiers will remain unchanged. This allows them to expand the cluster in place.
  • To grow serving capacity (QPS) over time, we can add more replicas.

2 dimensional sharding

A larger number of Earlybird machines per cluster translates to more operational overhead. They reduced cluster size by:

  • Reduce Hash Partition Count: pack more segments onto each Earlybird.
  • Reduce Replica: increase QPS each Earlybird could serve

Tuning hardware

In order to pack more segments onto each Earlybird, they went for SSD than RAM as RAM was too expensive and limited by # of DIMM slots per machine. SSDs were significantly less expensive ($/terabyte) than RAM. SSDs also provided much higher read/write performance compared to regular spindle disks.

However, SSDs were still orders of magnitude slower than RAM. Switching from RAM to SSD, Earlybird QPS capacity took a major hit. To increase serving capacity, they made multiple optimizations:

  • Tuning kernel parameters to optimize SSD performance
  • Packing multiple DocValues fields together to reduce SSD random access,
  • Loading frequently accessed fields directly in-process and more.

Search Client - 2 levels scatter and gather

  • First level is to query partition and merge their results at each time tier. (optimize via not going to all time tier based on the query)
  • Second level is to merge result per each time tier into final merge result.

2 level scatter and gather

Reference

Group Communication

  • Uber developed TChannel that is a networking framing protocol used for general RPC. Ringpop uses TChannel as its proxying channel and transport of choice. It supports out-of-order responses at extremely high performance with benchmarks ranging from 20,000 to 40,000 operations per second.