Open Source RDBMS - Seamless, Scalable, Stable and Free

한국어 | Login |Register

Current Events
Join our developers event to win one of the valuable prizes!
posted 3 years ago
viewed 11785 times
Share this article

Database Sharding with CUBRID

Hi everybody!

Our development team has just released the User Specs for the Sharding feature which we are going to implement this year in CUBRID. In this blog I will explain the overall plan and how the database sharding will work in CUBRID.

What is Database Sharding?

At this moment you might be wondering what database sharding means. For this, we also need to understand what database partitioning is and their difference.

In simple words, the database partitioning is a method of dividing (partitioning) the data between tables in a single database instance.

db-partitioning.png

While the database sharding is a method of horizontal partitioning the data between tables in multiple machines. In this case, each individual partition is referred to as a shard or database shard.

db-sharding.png

Why Database Sharding?

There are three reasons why we want to provide the native support in CUBRID for database sharding:

  1. To enable horizontal scaling across multiple nodes.
  2. To provide application scale-out.
  3. To minimize application code changes.

What is it like without database sharding?

Take a look at the following figure to understand the communication process which takes places between an application and a database without sharding.

without-sharding.png

On the back-end you have several servers with separate databases, each of which has a table. The tables share the common schema. Briefly, they are identical, but store different data based on how the developers hard coded in their application. Assume the following three simple conditions are stated in the application which determine where should the data be stored. Normally, a quite complex static hashing scheme is implemented to aid the application in determining the target database. For this case, let's look at the easy one for better comprehension.

  1. WHERE id <= 500,000, store in TBL 01 which is in DB 1.
  2. WHERE id > 500,000 AND id <= 1,000,000, store in TBL 02 which is in DB 2.
  3. WHERE id > 1,000,000, store in TBL 03 in DB 3.

In case the developers add a new database, here are the steps they need to perform.

add-db-without-sharding.png

  1. Add a database.
  2. Add a broker and specify the port.
  3. Update the application code to use the new port.

Every time we add a new database with a table and want our application start using it, we need to manually change the application code which explicitly instructs which database table to use. To access different databases, the application also needs to establish a new connection. As a result:

  • There are too many connections between the application and brokers to access each DB.
  • Troublesome to maintain the application code as it should be changed whenever a new DB is added.

What is it like WITH database sharding?

add-db-with-sharding.png

In case of database sharding, all we need to do is to:

  1. Add a database.
  2. Update the broker metadata directory.

Everything else will be performed by the database server and a special broker which will parse the SQL query, select the necessary shard and relay the request.

How will CUBRID differ from the existing solutions?

There are many other solution already out there which provide database sharding support for different database systems. You can see some of the distinguished ones in the table below.

sharding-solutions.png

Most of the solutions are developed on top of a particular database. They usually run as a middleware which needs to parse the queries to determine the target database. Once the database is known, the middleware passes along the request, and the database parses the SQL query again. Thus such solutions often slow down the overall performance of the database.

The solution which CUBRID is going to provide will leverage its native broker middleware. Thus, no extra layer will be added which would have hindered the performance. Normally when you install CUBRID, it creates a broker which parses the queries sent by the client applications and relays the request to the database.

In the new architecture CUBRID will create two brokers by default, one of which will serve for sharding purpose. This broker will parse the queries as it usually does, but to determine which shard is necessary to relay the request to, it will receive a shard key as a hint in the SQL query like:

SELECT * FROM table WHERE /* shardkey */ id = ? AND …

Then based on the information set in the metadata folder, the broker will communicate with the necessary database to fetch the data. If you need to explicitly instruct the application to request a certain database, all you have to do is to add a hint into your normal SQL query. Thus,

  • Your application is not required to know where the table resides.
  • You will have a single view to access all the databases.
  • And, there will not be any performance overhead.
As a result we will have the following architecture for database sharding.

cubrid-shard-overview.png

Features

The database sharing with CUBRID will provide the following features.

  • Up to 1024 Shards.
  • Many Shard ID Generation Algorithm

    • Extendible Hash/Range
    • Or Composite (special purpose)
  • Dynamic Binding Query
  • Connection Pool for Shards
  • Shard Targeted Query
  • Generic Table (Not Sharded Table)
  • Configurable Shard Metadata by SQL
  • Retrieve Shard Configurations
  • One or More Brokers

    • Read Only Broker, Load Balancing
  • High availability for each Shard

    • Master/Slave DB
    • Quick & Easy split a shard
  • Support CUBRID/MySQL/Oracle as Shards

Do you think this is going to be a really nice sharding solution? What solution do you currently deploy in your environment? Please leave your comments below.



comments powered by Disqus