scalability

Why Uber Engineering Switched from Postgres to MySQL

“Why Uber Engineering Switched from Postgres to MySQL” is an interesting study with plenty of technical detail on why MySQL was a better choice than PostgreSQL for the very demanding growth of Uber. These kinds of issues are probably way out of scope for any “regular Joe” application, but the insight into the differences between the MySQL and PostgreSQL architectures is still useful.

The main PostgreSQL limitations covered by the study are:

Inefficient architecture for writes
Inefficient data replication
Issues with table corruption
Poor replica MVCC support
Difficulty upgrading to newer releases

Distributed architecture concepts I learned while building a large payments system

Gergely Orosz, an engineer who worked at Uber on the company’s large-scale payments system, shares some of the distributed architecture concepts he had to learn in the blog post titled “Distributed architecture concepts I learned while building a large payments system”.

The article is very well written and easy to follow. But it’s also a goldmine of links to other resources on the subject. Here’s a list of links and concepts for quick research and/or click-through later:

Service Level Agreements (SLAs)
- Availability / service uptime (in percentage of time a year)
- Accuracy (in percentage)
- Capacity (in requests per second)
- Latency (95th and 99th percentiles)
Horizontal vs. vertical scaling
- Horizontal scaling is adding more machines, much preferred for distributed systems.
- Vertical scaling is upgrading machines to more powerful ones.
Consistency
Data Durability (here’s some more on the subject)
Message Persistence and Durability
- RabbitMQ
- Kafka Streams
Idempotency (here’s some more on the different strategies)
Sharding and Quorum
- Resharding
- Foursquare post-mortem on the 17-hour downtime in 2010
- Quorum in Cassandra
The Actor Model
- The Actor Model in 10 Minutes
- Communicating Sequential Processes (CSP), as an alternative
Reactive Architecture
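
The SLA items at the top of this list come down to simple arithmetic, so here is a minimal sketch of it. This is my own illustration, not code from the article: the function names are made up, the year is assumed to have 365 days, and latency percentiles use the nearest-rank method (other definitions exist and give slightly different numbers).

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that
    at least pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_seconds(target) / 3600
        print(f"{target}% availability allows about {hours:.2f} hours of downtime a year")

    # Hypothetical latency samples in milliseconds.
    latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]
    print("p95:", percentile(latencies_ms, 95), "ms")
    print("p99:", percentile(latencies_ms, 99), "ms")
```

Under these assumptions, a 99.9% target leaves roughly 8.76 hours of downtime a year, which is a handy number to keep in mind when reading SLAs.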

Amazon AWS : Scaling Up to Your First 10 Million Users

ENT309 Scaling Up to Your First 10 Million Users from Amazon Web Services

This must be one of the greatest presentations on Amazon AWS that I’ve ever seen. It uses a gradual approach, from small and simple to huge and complex. It covers a whole lot of different Amazon AWS services, how they complement each other, at which stage and scale they become useful, and more.

Even quickly jumping through the slides gave me a lot to think (and Google) about.

Mcrouter: a memcached protocol router

Mcrouter is an open source tool developed by Facebook for scaling memcached deployments:

Mcrouter is a memcached protocol router for scaling memcached (http://memcached.org/) deployments. It’s a core component of cache infrastructure at Facebook and Instagram where mcrouter handles almost 5 billion requests per second at peak.

Here is a good overview of some of the scenarios where Mcrouter is useful. There’s more than one. Here are some of the features to get you started:

Memcached ASCII protocol
Connection pooling
Multiple hashing schemes
Prefix routing
Replicated pools
Production traffic shadowing
Online reconfiguration
Flexible routing
Destination health monitoring/automatic failover
Cold cache warm up
Broadcast operations
Reliable delete stream
Multi-cluster support
Rich stats and debug commands
Quality of service
Large values
Multi-level caches
IPv6 support
SSL support
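
To get a feel for what "prefix routing" and "hashing schemes" mean in a list like this, here is a rough sketch of the general idea. It is not Mcrouter's configuration format or API: the pool names, server addresses and function names are all hypothetical, and the hashing is plain modulo hashing chosen for brevity.

```python
import hashlib

# Hypothetical pools keyed by key prefix, for illustration only.
POOLS = {
    "user:": ["cache-a1:11211", "cache-a2:11211"],
    "feed:": ["cache-b1:11211", "cache-b2:11211", "cache-b3:11211"],
}
DEFAULT_POOL = ["cache-default:11211"]


def pick_pool(key: str) -> list[str]:
    """Prefix routing: the longest matching prefix wins, else the default pool."""
    matches = [prefix for prefix in POOLS if key.startswith(prefix)]
    if not matches:
        return DEFAULT_POOL
    return POOLS[max(matches, key=len)]


def pick_server(key: str) -> str:
    """Hash the key to choose one server inside the selected pool."""
    pool = pick_pool(key)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


if __name__ == "__main__":
    for key in ("user:42", "feed:42", "session:42"):
        print(key, "->", pick_server(key))
```

The same key always lands on the same server, which is what makes the cache useful. The weakness of modulo hashing is that changing the pool size remaps most keys, which is the usual argument for consistent hashing in real deployments.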

Software Engineering Radio : CAP Theorem

On the way to work today I enjoyed an excellent episode of Software Engineering Radio which featured an interview with Eric Brewer, a VP of Infrastructure at Google, probably more famous for his CAP Theorem.

In theoretical computer science, the CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it succeeded or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

The discussion around “2 out of 3” was very thought-provoking and, at first, a little bit counter-intuitive. If you don’t want to listen to the show, read through this page, which covers the important bits.

The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C. Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A. Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A. In some sense, the NoSQL movement is about creating choices that focus on availability first and consistency second; databases that adhere to ACID properties (atomicity, consistency, isolation, and durability) do the opposite.
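
The two-node thought experiment in the quote is small enough to play with in code. Below is a toy sketch, with assumptions of my own: the `Cluster` and `Node` classes are hypothetical, replication is instant when the link is up, and the "CP" mode is simplified to refuse every write it cannot replicate (the quote only requires one side to act as unavailable).

```python
class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0


class Cluster:
    """Two replicas joined by a link that can be cut.

    mode "AP": keep accepting writes during a partition (replicas may diverge).
    mode "CP": refuse writes that cannot be replicated (replicas stay equal).
    """

    def __init__(self, mode: str) -> None:
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.a = Node("a")
        self.b = Node("b")
        self.partitioned = False

    def write(self, node: Node, value: int) -> bool:
        """Return True if the write was accepted, False if it was refused."""
        other = self.b if node is self.a else self.a
        if not self.partitioned:
            node.value = value
            other.value = value  # link is up: replicate immediately
            return True
        if self.mode == "AP":
            node.value = value  # accepted locally only: forfeits C
            return True
        return False  # refused: forfeits A

    def consistent(self) -> bool:
        return self.a.value == self.b.value


if __name__ == "__main__":
    for mode in ("AP", "CP"):
        cluster = Cluster(mode)
        cluster.write(cluster.a, 1)
        cluster.partitioned = True
        accepted = cluster.write(cluster.a, 2)
        print(mode, "write accepted:", accepted, "consistent:", cluster.consistent())
```

During the partition the "AP" cluster accepts the write and the replicas disagree, while the "CP" cluster stays consistent by turning the write away. That is the C-versus-A choice from the quote in a dozen lines.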

This puts some of the current trends into perspective.