In the beginning, we built these integrations by effectively bouncing through Facebook web services using ad-hoc endpoints. However, we found that this approach was cumbersome and limited our ability to use internal Facebook services. Moving Instagram into Facebook's data centers, on the other hand, would ease the integration with other internal Facebook systems and allow us to take advantage of tooling built to manage large-scale server deployments.
The primary goals of the migration were to keep the site fully available during the transition, avoid impacting feature development, and minimize infrastructure-level changes to avoid operational complexity.
Easier said than done: the task looked incredibly daunting on the face of it; we were running many thousands of instances in EC2, with new ones spinning up every day. To minimize downtime and operational complexity, it was essential that instances running in EC2 and VPC behaved as if they were part of the same network. AWS offered no way to share security groups or to bridge the private EC2 and VPC networks; the only way to communicate between the two private networks is to use the public address space.
So we developed Neti, a dynamic iptables manipulation daemon written in Python and backed by ZooKeeper. Neti provides both the missing security group functionality and a single address for each instance, regardless of whether it is running in EC2 or VPC.
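To make the mechanics concrete, here is a minimal sketch of the kind of ZooKeeper-backed iptables daemon described above. It is illustrative rather than Neti's actual code: the ZooKeeper path, the chain name, and the rule layout are assumptions, and it uses the kazoo client library.

```python
# Illustrative sketch, not Neti's real implementation: every instance registers
# its public IP under a ZooKeeper path, and each peer rebuilds a dedicated
# iptables chain whenever the membership changes.
import subprocess
from kazoo.client import KazooClient

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"   # assumed ZooKeeper ensemble
PEERS_PATH = "/neti/instances"            # hypothetical registry path
CHAIN = "NETI"                            # hypothetical dedicated chain

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
zk.ensure_path(PEERS_PATH)

def iptables(*args):
    """Run one iptables command, tolerating 'chain already exists' errors."""
    subprocess.run(["iptables", *args], check=False)

def rebuild_rules(peer_nodes):
    """Allow traffic only from registered peers, mimicking a security group."""
    iptables("-N", CHAIN)                  # create the chain if it is missing
    iptables("-F", CHAIN)                  # then start from a clean slate
    for node in peer_nodes:
        data, _stat = zk.get(f"{PEERS_PATH}/{node}")
        iptables("-A", CHAIN, "-s", data.decode(), "-j", "ACCEPT")
    iptables("-A", CHAIN, "-j", "DROP")    # everything else is blocked

# Rebuild the rule set whenever the set of registered instances changes.
@zk.ChildrenWatch(PEERS_PATH)
def on_membership_change(children):
    rebuild_rules(children)
```

Each instance would also register its own public IP as an ephemeral node under the same path, so peers drop it from their rule sets when it disappears.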
Networking aside, everyone loves stateless services: they're easy to deploy and scale, and you can spin them up whenever and wherever you need them. The truth is, we also need stateful services such as Cassandra to store user data, and state is where serving users from data centers on both sides of the ocean gets hard. Running Cassandra with too many copies of the data not only increases the complexity of maintaining the database; it's also a waste of capacity, not to mention that having quorum requests travel across the ocean is just…slow.
Instagram also uses TAO, a distributed data store for the social graph. We run TAO as a single master per shard, and no slave accepts writes for its shard; slaves forward all write requests to the shard's master region. Because all writes happen in the master region, which lives in the United States, write latency is unbearable from Europe.
You may notice that our problem is basically the speed of light: can we reduce the time it takes a request to travel across the ocean, or even make the round trip disappear? Here are two ways we tackled it. Take Cassandra first. Imagine there are five data centers in the United States and three in the European Union. If we deploy Cassandra in Europe simply by duplicating the current clusters, the replication factor becomes eight and quorum requests must talk to five out of eight replicas, some of which sit on the other side of the ocean. Instead, we can partition the data set into two parts, one with its quorum inside the United States and the other with its quorum inside the European Union.
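The arithmetic is worth spelling out. Cassandra's quorum is a majority of the replicas (floor of RF/2, plus one), so the sketch below contrasts the duplicated layout with the partitioned one; the per-partition replication factors of five and three are illustrative assumptions, not production settings.

```python
# Back-of-the-envelope quorum math; the per-partition replication factors
# below are illustrative assumptions.
def quorum(replication_factor: int) -> int:
    """Cassandra quorum: a majority of the replicas."""
    return replication_factor // 2 + 1

# Naive duplication: one replica set spanning 5 US + 3 EU data centers.
print(quorum(8))   # 5 -> quorum requests must reach across the ocean

# Partitioned layout: each partition keeps all of its replicas on one continent.
print(quorum(5))   # 3 of 5 replicas, all in the United States
print(quorum(3))   # 2 of 3 replicas, all in the European Union
```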
With that partitioning in place, a quorum request for each partition can stay on the same continent, solving the round-trip latency issue. The second approach concerns TAO, and it should look almost identical to the end user: when we send a write to TAO, TAO updates locally and does not block on sending the write to the master database synchronously; instead, it queues the write in the local region. In the write's local region, the data is available immediately from TAO, while in other regions it becomes available after it propagates from the local region.
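Here is an illustrative sketch of that write path; the class and queue below are stand-ins invented for the example, not TAO's real interfaces. The point is simply that the write becomes visible locally at once and crosses the ocean asynchronously.

```python
# Stand-in sketch of an asynchronous, region-local write path (not TAO's API).
import queue
import threading

class RegionalGraphStore:
    def __init__(self, region: str, master_writer):
        self.region = region
        self.local_data = {}            # this region's copy of the graph
        self.outbox = queue.Queue()     # writes waiting to reach the master region
        self.master_writer = master_writer
        threading.Thread(target=self._drain_outbox, daemon=True).start()

    def write(self, key, value):
        # 1. Make the write visible locally right away.
        self.local_data[key] = value
        # 2. Queue it for the master region instead of blocking on the round trip.
        self.outbox.put((key, value))

    def _drain_outbox(self):
        while True:
            key, value = self.outbox.get()
            # The master region applies the write and later propagates it to
            # the other regions, just like a regular master-region write.
            self.master_writer(key, value)
```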
This propagation is similar to regular writes today, which fan out from the master region. Different services may have different bottlenecks, but by focusing on reducing or removing cross-ocean traffic, we can tackle the problems one by one.

As in every infrastructure project, we've learned some important lessons along the way. Here are the main ones. The key to expanding to multiple data centers is to distinguish global data from local data. Global data needs to be replicated across data centers, while local data can differ per region: for example, the async jobs created by a web server only need to be visible in that region.
The next consideration is hardware resources, which can be roughly divided into three types: storage, computing, and caching. Storage comes first: both PostgreSQL and Cassandra have mature replication frameworks that work well as globally consistent data stores, and global data neatly maps to the data stored in these servers.
The goal is eventual consistency of this data across data centers, with some replication delay. Because there are vastly more reads than writes, having a read replica in each region avoids cross-data-center reads from the web servers.
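As a sketch of what this looks like at the application layer, Django's database-router hook can pin reads to the regional replica while sending writes to the primary. The alias names and the REGION environment variable below are assumptions made for the example, not our actual configuration.

```python
# Hypothetical Django database router: reads hit the regional replica,
# writes go to the single primary. Alias names are assumptions.
import os

REGION = os.environ.get("REGION", "us_east")   # assumed per-host setting

class RegionalRouter:
    """Route reads to this region's replica and writes to the primary."""

    def db_for_read(self, model, **hints):
        return f"replica_{REGION}"             # e.g. DATABASES["replica_us_east"]

    def db_for_write(self, model, **hints):
        return "primary"                       # the cross-region primary

# settings.py (assumed module path):
# DATABASE_ROUTERS = ["routers.RegionalRouter"]
```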
Writes to PostgreSQL, however, still cross data centers, because writes always go to the primary. Web servers and async servers are easily distributed computing resources: they are stateless and only need to access data locally. A web server can create async jobs that are queued by async message brokers and then consumed by async servers, all in the same region.

Caching is a different story. The cache tier is deployed per data center and is not replicated across regions, which means that updates to the cache in one data center are not reflected in another, creating a challenge for moving to multiple data centers.
Imagine a user comments on your newly posted photo. In the one-data-center case, the web server that served the request simply updates the cache with the new comment, and a follower reading from the same cache sees it right away. With multiple data centers, though, if the commenter and the follower are served from different regions, the follower's regional cache is never updated and the comment never appears. Our solution is to use PgQ and enhance it to insert cache invalidation events into the databases that are being modified. On the primary side, the web server inserts the data change and a cache invalidation event in the same transaction. On the replica side, the event arrives along with the replicated data, and a cache invalidation daemon in that region consumes it and invalidates (or regenerates) the regional cache. This solves the cache consistency issue.
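Below is a rough sketch of that primary/replica flow using PgQ's SQL functions; the queue name, table names, and cache keys are assumptions, and the production setup enhances PgQ beyond what is shown here.

```python
# Rough sketch of transactional cache-invalidation events with PgQ.
# Queue name, tables, and cache keys are illustrative assumptions.
import psycopg2
from pymemcache.client.base import Client as Memcache

QUEUE = "cache_invalidation"   # hypothetical PgQ queue

# Primary side: write the data and the invalidation event in one transaction.
def add_comment(conn, media_id, user_id, text):
    with conn:                 # commit both statements atomically
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO comments (media_id, user_id, text) VALUES (%s, %s, %s)",
                (media_id, user_id, text),
            )
            cur.execute(
                "SELECT pgq.insert_event(%s, %s, %s)",
                (QUEUE, "invalidate", f"comments:{media_id}"),
            )

# Replica side: consume replicated events and invalidate the regional cache.
def consume_invalidations(conn, consumer="regional_invalidator"):
    cache = Memcache(("localhost", 11211))
    with conn.cursor() as cur:
        cur.execute("SELECT pgq.next_batch(%s, %s)", (QUEUE, consumer))
        (batch_id,) = cur.fetchone()
        if batch_id is None:
            return             # no new events yet
        cur.execute("SELECT ev_data FROM pgq.get_batch_events(%s)", (batch_id,))
        for (ev_data,) in cur.fetchall():
            cache.delete(ev_data)   # drop the stale key in this region
        cur.execute("SELECT pgq.finish_batch(%s)", (batch_id,))
    conn.commit()
```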
On the other hand, compared to the one-region case where Django servers directly update the cache without re-reading from the database, this approach increases the read load on the databases. To mitigate this, we took two approaches: (1) reduce the computational resources needed for each read by denormalizing counters; (2) reduce the number of reads by using cache leases.
The most commonly cached keys are counters. For example, we would use a counter to determine the number of people who liked a specific post from Justin Bieber.
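A minimal sketch of the counter denormalization idea, under assumed table names: rather than recomputing an aggregate over the likes table every time the cached counter is invalidated, the like and a per-media counter row are updated in the same transaction, so refilling the cache is a single-row read.

```python
# Illustrative denormalized counter; table and column names are assumptions.
import psycopg2

def add_like(conn, media_id, user_id):
    with conn, conn.cursor() as cur:   # one transaction for both statements
        cur.execute(
            "INSERT INTO likes (media_id, user_id) VALUES (%s, %s)",
            (media_id, user_id),
        )
        cur.execute(
            "UPDATE media_like_counts SET like_count = like_count + 1 "
            "WHERE media_id = %s",
            (media_id,),
        )

def read_like_count(conn, media_id):
    # Cheap single-row read used when regenerating the cached counter.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT like_count FROM media_like_counts WHERE media_id = %s",
            (media_id,),
        )
        row = cur.fetchone()
        return row[0] if row else 0
```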