
Decomposing Twitter (Database Perspective)

Twitter is one of the latest and hottest Web 2.0 services, used by millions of people around the world every day. How does it manage the enormous amount of data produced by its users? This article reviews the technical aspects of the Twitter service, how it handles such a tremendous volume of tweets and other data, and what we can learn from the findings.

Twitter

As we all know, Twitter, which launched its service in 2006, is expanding at an amazing pace. According to the official About Twitter page, as of September 2010 some 175 million users use the service, generating about 95 million tweets a day. That works out to roughly 1,100 new tweets every second on average; at peak times, over 3,000 tweets are generated per second. More than 600 million searches are performed each day.

Simply put, the scale of the service is amazing, but what is even more amazing is its growth rate. Behind Twitter's rapid growth lies the seamless processing of huge amounts of data.

Understanding Twitter

For those of us who are satisfied with sitting in the passenger seat, here is a brief outline of the services provided by Twitter.

Twitter is a micro blogging service. Users are limited to sending 140 characters per message, which is called a tweet.

Here are two core services of Twitter:

  1. One is to allow you to read (follow) the tweets of others in whom you are interested.
  2. The other is to allow other users (followers) who are interested in you to read your tweets.
Other features include the RT (ReTweet) function, which posts a link to another user's tweet, and the Reply function, which lets you reply to the tweets of others.

Replied tweets can only be seen by users who follow both you and the owner of the original tweet. These two additional features are what make Twitter difficult to implement: they complicate system design and create headaches for administrators.

Just as G-Dragon is the most popular user on me2Day, Korea's most popular alternative to Twitter, the pop singer Lady Gaga (http://twitter.com/Ladygaga) has the largest number of followers on Twitter. As of today, 7,617,063 people follow Lady Gaga, and in return she follows 152,130 users. Katy Perry (http://twitter.com/katyperry), who ranks 8th among the most-followed users on Twitter, has 5,158,745 followers.

Imagine the number of ReTweets and Replies generated by these two users alone, and one can easily guess how devastating the combination can become. When it passes the point of no return, Twitter's Fail Whale appears.

[Image: Twitter's Fail Whale]

Real-Time Data in Twitter

The data tweeted by users must be processed in real time. As mentioned earlier, it is not easy to attain low latency and high throughput at the same time when the size of the generated data is enormous.

Here are four types of Real-Time Data that give Twitter engineers a headache:

  1. Tweets
  2. Timelines
  3. Social graphs
  4. Search indices
Let's see how these four types of data are handled, and how the associated problems are solved.

Tweets

A typical tweet record consists of the id of the tweet, the id (user_id) of the user who posted it, the text, and the creation time (in addition to Reply information).

The single RDBMS table used by earlier versions of Twitter is shown below:

id    user_id    text                                   created_at
20    12         just setting up my twitter             2006-03-21 20:50:14
29    12         inviting coworkers                     2006-03-21 21:02:56
34    16         Oh shit, I just twittered a little.    2006-03-21 21:08:09

In this original implementation Twitter used master-slave replication and Memcached for read throughput. But when Twitter became popular and disk usage surpassed 800 GB, performance reached its limit and table partitioning became necessary.

In this situation, there are two types of easily applicable partitioning methods.

  1. User id-based partitioning
  2. Tweet id-based partitioning
The first has the drawback that all N partitions must be searched to find a specific tweet.

The second has a similar drawback: all N partitions must be searched to find the tweets created by a specific user.

To circumvent these drawbacks, Twitter uses time axis-based partitioning. In other words, tweets are stored in chronological order, and writing moves on to the next partition when the current table exceeds a certain size. This is reasonable, as most read operations are for recent tweets. When searching for tweets created by a specific user, it is usually enough to scan only a couple of the most recent partitions to retrieve a sufficient number of tweets.
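
To illustrate the idea, here is a minimal, hypothetical sketch (in Python, with in-memory lists standing in for partitions; none of the names come from Twitter's actual code): a lookup for a user's recent tweets walks the partitions from newest to oldest and stops as soon as enough rows have been collected.

# Hypothetical sketch of time-axis partitioning (not Twitter's actual code).
# Each partition holds tweets for one time window; lookups walk partitions
# newest-first and stop as soon as enough rows have been collected.

from typing import Dict, List

# Partitions ordered oldest -> newest; each tweet is a small dict.
partitions: List[List[Dict]] = [
    [{"id": 20, "user_id": 12, "text": "just setting up my twitter"}],             # oldest
    [{"id": 29, "user_id": 12, "text": "inviting coworkers"}],
    [{"id": 34, "user_id": 16, "text": "Oh shit, I just twittered a little."}],    # newest
]

def recent_tweets_by_user(user_id: int, limit: int = 20) -> List[Dict]:
    """Collect up to `limit` tweets by a user, scanning newest partitions first."""
    result: List[Dict] = []
    for partition in reversed(partitions):      # newest partition first
        result.extend(t for t in partition if t["user_id"] == user_id)
        if len(result) >= limit:                # usually a couple of partitions suffice
            break
    return result[:limit]

print(recent_tweets_by_user(12))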

However, there is a problem with this method: the most recent partitions are bound to become overloaded. To address this, Twitter adds temporary partitions alongside the recent ones in order to increase concurrency, and leans heavily on Memcached to increase the effectiveness of caching.

This is what Twitter currently does to store tweets. The introduction of Cassandra, a NoSQL database, is being considered for more effective handling of an even larger number of tweets in the future.

To reduce the burden of partitioning, Twitter is considering separating the store keyed by tweet id from a store keyed by user (user_id), whose value is the sequence of that user's tweet ids.
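
A minimal sketch of what such a split might look like, assuming two hypothetical in-memory key-value stores (these names are illustrative, not Twitter's):

# Hypothetical sketch: one store keyed by tweet id (the tweet bodies) and one
# keyed by user_id (the sequence of that user's tweet ids), so each lookup
# pattern hits exactly one store.

from collections import defaultdict
from typing import Dict, List

tweets_by_id: Dict[int, dict] = {}                            # tweet id -> tweet record
tweet_ids_by_user: Dict[int, List[int]] = defaultdict(list)   # user_id -> tweet ids, newest first

def store_tweet(tweet: dict) -> None:
    """Write the tweet body once, and append its id to the author's sequence."""
    tweets_by_id[tweet["id"]] = tweet
    tweet_ids_by_user[tweet["user_id"]].insert(0, tweet["id"])

def tweets_of(user_id: int, limit: int = 20) -> List[dict]:
    """Read the id sequence for a user, then fetch the bodies by id."""
    return [tweets_by_id[t] for t in tweet_ids_by_user[user_id][:limit]]

store_tweet({"id": 29, "user_id": 12, "text": "inviting coworkers"})
print(tweets_of(12))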

Twitter wants its tweet storage to be non-relational. (Unfortunately, this is not yet certain. The reason, I think, is that Cassandra's performance can drop sharply when one of the nodes in the distributed system fails, because the resulting reconfiguration increases the workload on the other nodes.)

Timelines

A Timeline is the sequence of tweets from the users whom I follow. Given that definition, a query to retrieve the 20 most recent tweets for a specific user to read might look something like this:

SELECT * FROM tweets
WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?)
ORDER BY created_at DESC
LIMIT 20

The query above degrades severely as the number of users I follow grows. Judging that such a schema and query could not solve the problem, Twitter instead builds the Timeline in advance through a process called fanout (a kind of unfolding).

In other words, the timeline sequence that must contain a new tweet is built in advance by the fanout process, and the key factor is how fast fanout can run. The resulting sequences are stored in Memcached so that new tweets are reflected in timelines quickly. (Off-line fanout is a batch process that builds timelines in advance, folding in new tweets regardless of user requests; on-line fanout builds the timeline when a user requests it.) Twitter achieved an impressive boost in performance after implementing fanout.
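
As a rough illustration of the fanout idea, here is a hypothetical sketch in which a plain dictionary stands in for Memcached and the follower lists are kept in memory; the names and sizes are illustrative only, not Twitter's implementation.

# Hypothetical fanout sketch (a plain dict stands in for Memcached).
# When a tweet is posted, its id is prepended to the cached timeline of
# every follower, so reading a timeline becomes a single cache lookup.

from collections import defaultdict
from typing import Dict, List

followers: Dict[int, List[int]] = {12: [16, 17, 18]}    # user_id -> follower ids
timelines: Dict[int, List[int]] = defaultdict(list)     # follower id -> tweet ids, newest first
TIMELINE_LEN = 800                                       # keep only recent entries in the cache

def fanout(tweet_id: int, author_id: int) -> None:
    """Deliver a new tweet id to each follower's cached timeline."""
    for follower_id in followers.get(author_id, []):
        timeline = timelines[follower_id]
        timeline.insert(0, tweet_id)      # newest first
        del timeline[TIMELINE_LEN:]       # trim the cache entry

fanout(35, 12)
print(timelines[16][:20])   # a timeline read is now just a cache fetch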

date          average tps    peak tps    deliveries
2008-10-07    30             120         21,000
2010-04-15    700            2,000       1,200,000

Throughput improved from 21,000 timeline deliveries per second to 1.2 million deliveries per second.

The point of this example is that high throughput and low latency can be achieved by changing the storage hierarchy and the time at which the work is done. That is, memory and disk each need to be used for the purpose they suit, and off-line and on-line processing need to be combined.

Social Graphs

A Social Graph expresses the relationship between following and followers. The Block relationship may be added here as well, if we take it one step further. A variety of operations are required over these relationships, such as union, intersection, difference, and inclusion. As mentioned in the previous section, Reply makes things less simple, since the original poster of the tweet and the author of the Reply must also be shown to the right users.

source_id    destination_id
20           12
29           12
34           16

The initial format looked like the above: the Follow relationship was stored in a single table and replicated in master-slave fashion. The problem, however, is that write throughput is unsatisfactory, and as the table grows the indices no longer fit in memory.

A solution is to maintain two tables, Forward and Backward. Each table is maintained according to its direction (source or destination), and both are partitioned by user_id.

Forward Table

source_id    destination_id    updated_at    X
20           12                20:50:14      X
20           13                20:51:32
20           16

Backward Table

destination_id    source_id    updated_at    X
12                20           20:50:14      X
12                32           20:51:32
12                16

Using this method enables primary key-based partitioning, speeds up lookups, and makes the indices easier to manage.
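
To make the benefit concrete, here is a hypothetical sketch in which plain Python dictionaries stand in for the partitioned Forward and Backward tables; it only shows which lookups and set operations each direction makes cheap, and the names are assumptions rather than Twitter's code.

# Hypothetical sketch of the Forward/Backward split. Keeping both directions
# denormalized means each question is answered from a single user_id-keyed
# partition, and set operations (intersection, inclusion) stay cheap.

from typing import Dict, Set

forward: Dict[int, Set[int]] = {20: {12, 13, 16}}    # source_id -> ids it follows
backward: Dict[int, Set[int]] = {12: {20, 32, 16}}   # destination_id -> ids following it

def followees(user_id: int) -> Set[int]:
    """Whom does user_id follow? One lookup in the Forward partition for user_id."""
    return forward.get(user_id, set())

def followers_of(user_id: int) -> Set[int]:
    """Who follows user_id? One lookup in the Backward partition for user_id."""
    return backward.get(user_id, set())

def follows(source_id: int, destination_id: int) -> bool:
    """Inclusion test, e.g. to decide who may see a Reply."""
    return destination_id in followees(source_id)

print(followees(20), followers_of(12), follows(20, 12))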

One more important thing: for effective partitioning and replication of a table, operations on the data should be commutative (the order in which they are applied does not matter) and idempotent (applying the same operation several times gives the same result). This precondition is provided by Gizzard, Twitter's sharding (or partitioning) framework. With these properties, an operation that failed because of a temporary malfunction can simply be executed again at any later time, regardless of whether it was already applied.

To make updates internally commutative, every row carries a timestamp that determines its validity, and an update to the mutable attributes of a relationship is discarded when its timestamp is earlier than the timestamp already recorded in the row.
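
The following is a minimal sketch of such a timestamp-guarded ("last write wins") update, assuming a hypothetical in-memory edge store; it is meant only to show why retries and reordering become harmless under these rules.

# Hypothetical sketch of timestamp-guarded edge updates. Because an operation
# only takes effect if its timestamp is newer than the one stored in the row,
# replaying it is idempotent and late-arriving operations can be applied in
# any order (commutative), which is what Gizzard-style retries rely on.

from typing import Dict, Tuple

# (source_id, destination_id) -> (state, updated_at)
edges: Dict[Tuple[int, int], Tuple[str, int]] = {}

def apply_edge_op(source_id: int, destination_id: int, state: str, ts: int) -> None:
    """Apply a follow/unfollow operation unless a newer version is already stored."""
    key = (source_id, destination_id)
    current = edges.get(key)
    if current is None or current[1] < ts:   # older timestamps are discarded
        edges[key] = (state, ts)

apply_edge_op(20, 12, "follow", ts=100)
apply_edge_op(20, 12, "unfollow", ts=90)    # late-arriving older op: ignored
apply_edge_op(20, 12, "follow", ts=100)     # replayed op: harmless
print(edges[(20, 12)])                       # ('follow', 100)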

The performance of the Social Graphs of Twitter is as follows:

cardinality iteration write materialize inclusion
1ms 100 edges/ms 16 ms 1 ms

Search Indices

Twitter supports searching for specific keywords. For this, it assigns an id (term_id) to every keyword and stores the ids (doc_id) of the tweets containing that keyword.

term_id    doc_id
20         12
20         86
34         16

In earlier versions of Twitter, the relationship between term_id and doc_id was stored in a single table, as shown above. But the indices grew so large that they could no longer be loaded into RAM.

Partitioning was required to solve this problem, and time-based partitioning was applied, similar to the partitioning of tweets. This has the advantage that the data is immutable: once created, it never changes. In addition, the partitioning is quite effective because it satisfies the commutative and idempotent properties, both of which greatly help replication and distribution.
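
As an illustration, here is a hypothetical sketch of a time-partitioned inverted index, with each partition modeled as an immutable in-memory mapping from term_id to doc_ids; the data and names are illustrative assumptions.

# Hypothetical sketch of a time-partitioned inverted index. Each partition is
# an immutable mapping term_id -> doc_ids for one time window; a keyword query
# walks partitions newest-first until enough documents are found.

from typing import Dict, List

# Partitions ordered oldest -> newest.
index_partitions: List[Dict[int, List[int]]] = [
    {20: [12, 86], 34: [16]},     # older window
    {20: [120], 34: [130, 131]},  # newer window
]

def search(term_id: int, limit: int = 20) -> List[int]:
    """Return up to `limit` doc_ids containing the term, newest partitions first."""
    docs: List[int] = []
    for partition in reversed(index_partitions):
        docs.extend(partition.get(term_id, []))
        if len(docs) >= limit:
            break
    return docs[:limit]

print(search(20))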

Twitter plans to further improve this feature by separating partitions for document and time and replacing MySQL, the lowest-level storage, with Lucene, which is optimized for searching.

Twitter Performance

The following is a table showing the performance data of the current Twitter service.

            reads/second    writes/second    cardinality
Tweets      100 k           850              12 b
Timelines   80 k            1.2 m            a lot
Graphs      100 k           20 k             13 b
Search      13 k            21 k             315 m

The Lessons Learned

Here is what we can take away from this decomposition of Twitter's database and the analysis of its four major types of data.

  • There are no permanent solutions (respond flexibly to changing circumstances).
  • There are no perfect solutions - there are only solutions that are useful for a certain amount of time.
  • Scalability does not achieve itself - it is achieved by appropriately combining partitioning, indexing, and replication.
  • All data touched by a read-time query must be served from memory (disk is for permanent storage only).
  • Only a small number of problems can be solved through pre-computation.
  • For best performance always pay attention to locality!

In Conclusion

We have studied Twitter's approach to catching two hares at once: Low Latency and High Throughput.

Quick readers may have already noticed how the two hares are chased: by giving the data the attributes that NoSQL pursues while still using an RDBMS as the lowest-level storage. In short, Twitter assigns the commutative and idempotent attributes to its data for effective partitioning and replication. From this perspective, Twitter is already NoSQL-ready (it remains to be seen whether that NoSQL is Cassandra).

I learned much from Twitter while looking for a data store for NHN. In addition, when The Platform Magazine asked me to write a feature article about NoSQL, I thought sharing this information would benefit people with the same concerns and might save them some of the time they would otherwise spend on this issue.

N-Store (a tentative name), NHN's first high-speed storage system, is currently under development. The N-Store project members are working hard to deliver a storage system that will live up to expectations.

Kim Hyeo, Development Center, NHN.

If you like this article, add me on http://twitter.com/cubrid.

See also

  1. http://www.slideshare.net/netik/billions-of-hits-scaling-twitter
  2. http://twitter.com/about
  3. http://engineering.twitter.com/2010/05/introducing-flockdb.html
  4. http://en.wikipedia.org/wiki/Gizzard_(scala_framework)
  5. http://gigaom.com/2010/04/07/gizzard-anyone-twitter-offers-up-code-for-distributed-data/
  6. http://github.com/twitter/gizzard
  7. https://github.com/twitter/flockdb
  8. https://github.com/twitter/gizzmo
  9. https://github.com/robey/kestrel
  10. http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster
  11. http://www.readwriteweb.com/cloud/2011/01/how-twitter-uses-nosql.php
  12. Big Data in Real-Time at Twitter: http://www.slideshare.net/nkallen/q-con-3770885


