posted last year in Dev Platform category by Jeon Won Hee
"Ins and outs of NHN" is a series of articles that compares platforms and services from third-party vendors with NHN's own solutions. The topic of this first article in the series is Database Sharding Platform. I will introduce about the efforts being made from inside and outside of NHN to implement Database Sharding. I will first explain the concept of Sharding data vs. Partitioning data. Then review the common methodology to implement sharding. And finally I will compare all sharding platforms. This article is recommented for developers interested in big data management.
To store and search a volume of data which is so big that it cannot be handled by one database, you must find a way to use multiple databases. Although there are some databases made for distributed environments, such as Cassandra or Dynamo, these have many functional constraints, such as weakness in terms of search range or inability to use the JOIN operations. In order to expand data while using a relatively feature-rich functionality, it is recommended to use RDBMS by sharding the databases.
In the past, the Sharding logic was implemented directly on the application layer, but now there is an increasing number of examples which introduce Sharding platforms which allow to move the Sharding logic from the application layer to database or middleware layers. At a core level, Sharding platforms must respond effectively to ever-increasing data without failure, and handle different data characteristics and models depending on services.
In this article I will compare Spock Proxy, the Sharding platform based on MySQL Proxy, Gizzard, created by Twitter, and CUBRID SHARD, native database sharding feature in CUBRID which is set to launch in the first half of 2012. A table of other solutions was previous posted at Database Sharding with CUBRID.
Horizontal Partitioning and Sharding
Horizontal partitioning is a design that divides and stores data, such as a schema, into two or more tables within one database. For example, to handle large data of user messages, such schema may be created so that messages by users from city A are stored in one table, while messages by users from city B are stored in another table. This allows to reduce the size of the index and increase the concurrency of operations. The key point in this approach is that the data is partitioned between tables within a single database.
Sharding, on the other hand, is the distributed approach where data is horizontally partitioned between tables created in physically different databases. Thus, Sharding is a method to store the messages by users from city A in database A and messages by users from city B in database B. Here each database is called a Shard.
As you have to work on multiple databases, there could be functional limitations based on circumstances and also drawbacks in terms of consistency and replication, including JOIN operations. In many cases, Sharding used to be implemented at the application server level. There have been many attempts to provide this at the platform level. These can be classified into a pattern that operates in the application server, such as Hibernate Shards, a middle tier pattern such as CUBRID SHARD, Spock Proxy, or Gizzard, and a pattern that provides the Sharding functionality from a database itself, e.g. nStore or MongoDB.
Middle Tier Sharding Platform
By default, a Sharding platform should consider the following items:
- Database location abstraction
- Monitoring/Ease of operations
Database location abstraction and scalability are different from each other but connected. Database location abstraction ensures that on the application layer you do not need to know which data (which Row) is located in which database. The application is connected only with the Sharding platform. Connecting to a database is what the Sharding platform should do. In addition, Sharding platform should carry out the task to add a replicated storage to a specific Shard (one of the partitioned databases) in order to migrate database for replacement without restarting or changing the application code. When the Sharding platform comes to a stop, it is obvious that an application server will not be able to access the database. For this reason, the Sharding platform should provide redundancy.
For monitoring, the Sharding platform should be able to provide a number of requests and error information according to Shard keys (a standard to determine which data is stored in which database).
Comparison between Spock Proxy and CUBRID SHARD
The key function of the Sharding platform is the CRUD operation (CREATE, READ, UPDATE, DELETE) which chooses one database among many according to 'a standard set by a developer', in other words the Sharding strategy. We will compare Spock Proxy, a typical Sharding platform of MySQL, with the CUBRID SHARD platform developed by NHN.
Spock Proxy is a Sharding platform designed based on MySQL Proxy. In MySQL Proxy, you can execute the Lau scrip code, which is written by a developer before and after performing SQL. The primary purpose of using MySQL Proxy is to analyze and modify SQL. To use Spock Proxy, you need to create a MySQL database to manage the information about shards and how the data should be distributed. The row is a rule for Sharding.
Figure 1: Specifying Sharding rules in Spock Proxy.
CUBRID SHARD is the Sharding platform for CUBRID. The uniqueness of CUBRID SHARD is that it can also be used with MySQL as well as Oracle. It is set to be launched in the first half of 2012, and is planned to be used for processing the meta information database system of NDrive service, the cloud storage system developed by NHN.
The following table shows a summary of a comparison between Spock Proxy and CUBRID SHARD.
Table 1 Comparison between Spock Proxy and CUBRID SHARD
|Spock Proxy||CUBRID SHARD|
|Sharding rule storage||DBMS Table||Configuration file|
|How to create Shard keys||Modulo||
|How to find a Shard key||SQL parsing||Using HINT|
|Strength||No need to change SQL||
||Requires the change to SQL queries to insert sharding HINT|
Spock Proxy stores Sharding rules in the table in MySQL universal_db database. The SQL received from an application server is parsed and checked whether the query has shard keys. If shard keys are provided, the MySQL instance will be identified according to the standard recorded in universal_db, then SQL will be relayed to that MySQL instance.
When using this method, you do not need to describe information related to shard keys in SQL, unlike in CUBRID SHARD. Therefore, if you did not use Sharding before for your coding, but recently had to use due to the data increase, you may use Spock Proxy which will work without requiring you to change the SQL in your application. Note that this is limited to cases where there is no need to change the schema for Sharding, or when Sharding can be applied without changing the SQL that you use. However, the method used to find the Shard after parsing SQL, as used by Spock Proxy, has weaknesses in terms of performance. It could lead to unnecessary work, as SQL should be parsed twice: once by Spock Proxy to determine the MySQL instance, then by MySQL itself.
In CUBRID SHARD, SQL HINT is used which allows to avoid parsing SQL twice. Suppose there is a table as follows:
To use the student_no column data as a shard key, an application server sends the following prepared SQL to CUBRID SHARD.
SELECT student_no, name, age FROM student WHERE student_no = /*+ SHARD_KEY */ ?
CUBRID SHARD checks if a HINT displays /*+ SHARD_KEY */ in SQL, in which case the column data with the corresponding HINT will be used as a Shard key. It then reads the student_no value, which follows the hint, and identifies RDBMS based on the configuration file and transmits the corresponding query. As such, the benefit of using HINT is that you can improve processing efficiency by avoiding parsing SQL twice, and react to various RDBMS without violating the database location abstract.
- CUBRID SHARDING provides various HINTs, in addition to /*+ SHARD_KEY */.
- Typically, there is /*+ SHARD_ID(__id__) */, a HINT that allows you to find a special shard.
- The /*+ SHARD_VAL(__shard_key_val__) */ HINT can also play a role in finding a special shard for tables which have no shard keys. While this HINT is the same in that it searches for tables that have no specific shard keys, it configures a value for shard key column directly without choosing a shard, and selects the shard according to the internal rules of the middleware.
Unlike Spock Proxy, where Sharding rules are inserted into a database table, in CUBRID SHARD Sharding rules are specified in the configuration file. If there are three access addresses for actual RDBMS storage, you should specify DB addresses to CUBRID SHARD as follows.
0 shardDB shardNODE1:3306 1 shardDB shardNODE2:3306 2 shardDB shardNODE3:3306 3 shardDB shardNODE4:3306
If you want to use the Modulo method, write the corresponding value in the configuration file as below.
SHARD_KEY_MODULAR = 256
In addition, specify [MIN..MAX] according to the value generated from SHARD_KEY_MODULAR, and describe which shard to send the corresponding query to.
#min max shard_id 0 63 0 64 127 1 128 191 2 192 255 3
If you need a more subtle method than Modulo, you can create program code of your own that calculates the Shard ID for Shard keys.
SHARD_KEY_LIBRARY_NAME = libshardkeyid.so SHARD_FUNCTION_NAME = user_get_shard_key
As shown in the above example, you can register your own library which will calculate the Shard ID logic.
The common weakness of both Spock Proxy and CUBRID SHARD is that they both require additional network IO time for each additional hop because they are implemented as a middle tier.
The following figure displays a general process of internal execution in CUBRID SHARD, which is performed when a developer executes a query. When a developer executes a query, the query is analyzed by the DB shard middleware, which determines to which shard it will be sent. Then the query is transmitted to the selected shard, and finally the middleware delivers the response to the client.
Figure 2: CUBRID SHARD process.
Gizzard is a Sharding platform developed by Twitter. It is also a middle tier just like Spock Proxy and CUBRID SHARD. However, its usage and architecture are quite different from Spock Proxy or CUBRID SHARD.
The following figure shows how the data can be shared between several databases deploying two units of Gizzard.
Figure 3: Gizzard deployment diagram.
In this case, the interface being used by an application server is Thrift, an RPC protocol, not JDBC. Therefore, it is similar to DBGW (CUBRID SHARD) platform developed by NHN. When the schema of a storage changes, Gizzard may also need modifications. While it can be a constraint, it could also open up access not only to RDBMS, but also to various other databases (e.g. Lucene).
Gizzard is written based on Scala, and you can expand its functions by adding Scala codes as needed. Instead of viewing it as a complete product, we recommend that you download the source and modify it to suit your needs.
The biggest advantage of Gizzard is that you can perform hotspot response and database migration in the middle tier platform level. For example, Gizzard provides the direct replication feature, as shown in the following figure.
Figure 4: The replication feature of Gizzard.
Spock Proxy or CUBRID SHARD do not offer this feature, as they do not need it. CUBRID, MySQL, and Oracle have their own replication feature, thus they do not need to offer the feature from the middle tier.
The replication feature provided by Gizzard is very simple - send each request to several replicas - but cannot handle variety of requests that may occur while running the replication feature. However, it can be useful when using a database without the replication feature.
Figure 5: The migration feature of Gizzard.
As shown in the figure above, it offers the replication feature which does not require a service halt. When migrating a certain database, the feature replicates the details of a database storage before replication configuration, after allowing the new data to be written in two locations by configuring the replication. This approach, of course, cannot be used on RDBMS due to consistency issue.
Limitations of Sharding
Sharding, on the other hand, also has its own limitations. Typical constraints of Sharding are as follows:
- It cannot perform JOIN operations for two or more Shardings.
- auto_increment (serial) values can vary depending on Sharding.
- last_insert_id() value is no more valid.
- shard key column value must not be updated; requires delete, then insert.
- does not allow to access two or more Shards from one transaction.
When using Sharding, therefore, it is important that you perform right data modeling and schema design to prevent the above constraint issues.
To learn more about Sharding and other features of CUBRID I suggest you to see the Slideshare presentation at:
By Jeon Won Hee, Senior Software Engineer at CUBRID DBMS Development Lab, NHN Corporation.