Open Source RDBMS - Seamless, Scalable, Stable and Free

한국어 | Login |Register

Current Events
Join our developers event to win one of the valuable prizes!
posted last year
viewed 8299 times
Share this article

The CUBRID Broker Story

Unlike other RDBMS, CUBRID has a middle layer called a BROKER.

For better understanding of CUBRID let’s take a look at why the BROKER is needed, what are its pros and cons.

Developers who use CUBRID know that its features a 3-tier structure, "API – BROKER – DB Server". The most frequently asked question by those who are new to CUBRID is:

"Why is CUBRID built in a 3-tier structure?"

Another commonly asked question is:

"The BROKER port is required in the connection string. What does the BROKER do?

This article aims at providing developers with answers as to why the BROKER is needed, what features it has, and what the pros and cons of the BROKER is.

CUBRID Client Module

  1. There is a database, which is a collection of data.
  2. Then there is a server, which controls the data.
  3. And a client that sends requests to the server.
This is the general pattern that CUBRID shares in its architecture. Now take a look at CUBRID's Process Structure illustrated in the figure below.

Figure 1: CUBRID Process Architecture

You can see that the process that acts as the database server is called cub_server (the lower part). Also the only client module that can send a request to this cub_server process is libcubridcs.so (a part of the cub_cas module). This library is written in C-language. The native application that uses libcubridcs.so will look like the following.

cubrid-process-architecture.png

And below you can see what these two client and server processes do when, for instance, a SELECT statement is executed in the native application. The figure shows an abbreviated form of the query processing stages. The processing stage may change depending on whether a plan cache was used, on the isolation level that was set, or depending on the form of query.

cubrid-request-process.png

This schema above shows that CUBRID’s client module does not just send to the server the request received from the application program, but also it is responsible for other tasks. The client module:

  • parses the query;
  • optimizes it;
  • creates a query plan;
  • and sends the plan to the server.

The server then searches and manipulates the data according to the plan sent by the client.

To create a query plan a lot of information is needed for the client module. Let’s say the following query is executed.

SELECT * from foo where id = 1
  1. The client module first checks whether the query meets the CUBRID SQL syntax before processing it. It has a query parser which cuts the SQL statement into the smallest tokens, thus analyzes the statement.
  2. After confirming that there are no syntax errors, the client module checks the statement for semantic errors. For example, it checks:
    • whether a foo table actually exists;
    • whether the current user has a SELECT authority;
    • if there is a column called id in the foo table;
    • if the column id is comparable with the numeric values, etc.
  3. The more complicated the query is by join or complex WHERE condition, the more syntax errors are checked. Since it is extremely inefficient to send a request to a server every time schema information is needed, schema information is cached in the client.
  4. A lock cache is also needed. If a query such as that in the example is executed, then tasks, such as changing the schema for the table, must not be conducted until SELECT on the foo table is completed. Using the lock can be used so that the table is not modified when the table is being searched. In CUBRID, the IS lock is obtained when the table is being SELECTed. Inserting and selecting data can be made on the table when the IS lock is obtained, but altering the table schema, or modifying through a full scan of the table is not allowed until the IS lock has been released. A IS lock on the table to select must be secured in the client module. However the lock must be cached in the client module since it is inefficient to request a lock to server every time, even worse on performance.
  5. In the final stage, it optimizes the query and then creates a plan. It checks whether a usable index exists, decides which join method should be used, and sees if the sorting process can be omitted if there is an ORDER BY clause.
  6. Then the client module transfers the optimized query plan to the server so to be executed.

We’ve looked at simplified stages of processing a query. In the stages by a client module, it performs pre-execution processes and then sends a request to the server. The client request is handled in the server. Therefore, the client has many modules in order to process the query. The overall structure of CUBRID is explained in the following diagram.

cubrid-client-server-modules.png

In the above figure you can also notice that the client module provides a native C-interface for JAVA or PHP APIs which applications can use to communicate with the database. To summarize, the 3-tier architecture allows CUBRID to upload the APIs and move the query processing logic to the client module at the same time unloading the database server to do these tasks. Thus, CUBRID APIs are light, and the CUBRID Server is not busy. I believe it became clear why CUBRID adopts a 3-tier architecture.

For a better comprehension, below I am going to compare how the same process would work in a 2-tier architecture which is used in most other relational database systems.

  1. One way to process the query is to enable the client module to be called directly by JAVA or PHP applications. One of the biggest problems in this approach is that the application becomes too heavy because of the JDBC. The application consumes too much resources (CPU, Memory, etc.) because the client module performs many stages for query processing.
  2. The second method is to separate the architecture that the client module only receives a request from an application and the server performs all of query processes. This is an extremely complicated task since the existing architecture should be re-designed. No matter how difficult the task is, we have to ask a question as “How beneficial will a 2-tier architecture be?” The CUBRID Development Lab came to the conclusion that there is nothing to gain from the modifications. The DBMS server processes most actions that the DBMS is required to do, such as database management, buffer management to change data saved on disks to memory structures, concurrency management, failure restoration management, and this requires a lot of system resources. In most cases, the server's load causes a performance bottleneck therefore to share the server’s loads with the client module assures system resources availability for server processing and maximizes server’s capacity. For this reason, it is beneficial to have the 3-tier architecture.

As we’ve looked through two methods so far, 2-tier architecture is not suitable, therefore a 3-tier structure can be considered as an alternative. Now leave the DBMS server and client module structure as they are, and add a middle layer that can process JAVA and PHP requests. Then application driver conducts relatively simple tasks, and the requests can be handled in the middle layer and in the DBMS server. In CUBRID, this middle layer between a driver and the DBMS server is called, CUBRID BROKER.

CUBRID BROKER

BROKER is not a single process but two processes, which are cub_broker and cub_cas.

The cub_cas is a process which handles the query request by parsing and query planning. In other words, it is a client of the DBMS server. An active connection to process a request, among many other connections requested from the driver, corresponds to a single cub_cas.

The other process, cub_broker decides which cub_cas to allocate when the driver requests a new connection, and manages the cub_cas processs accordingly. A single cub_broker process manages multiple cub_cas processes. One cub_broker and an “N” number of cub_cas processes consist of a single BROKER group. Multiple groups can be operated depending on the user setting.

The diagram illustrated in the Figure 1 above shows how processes are interacted when the application requests a connection.

  1. When the app sends a connection request, the request is sent to cub_broker.
  2. Then the cub_broker selects available cub_cas and sends the application’s connection to cub_cas, thus connecting the driver and the cub_cas.
  3. The cub_cas accesses the cub_master to connect the DB server and requests a connection to the cub_server. A connection between cub_cas and cub_server is created once the cub_master passes the connection from the cub_cas to the cub_server (like the application connection is passed to cub_cas by cub_broker). This connection may last even when the driver is disconnected.

Now we have seen what the BROKER is and how it works in the query processing. Now let’s take a look at the pros and cons of the 3-tier structure.

The Cons of the BROKER

Complexity

One of the first shortcomings that comes to mind when you think of the BROKER is the architecture looks complicated. It gives to users many difficulties to understand additional processes in the server and broker to handle the query.

The database server and the BROKER operate as a separate component. To the DBMS server, the cub_cas is a single client. To the BROKER, the DBMS server is a target component to connect based on the driver’s request. This relationship may seem hard to understand for those who are unfamiliar with CUBRID. For example, let’s say that the number of connections has been configured in the application but more connections are necessary at this moment. In CUBRID, the user should change the settings in both the BROKER and the DB server respectively. BROKER setting has to be changed to increase the number of cub_cas which can receive the maximum number of requests from the application. The cub_server setting must also be modified to increase the number of cub_server which can receive the maximum number of requests from the cub_cas. It may be confusing as to why two settings have to be modified when the number of connections increases. This is because, as we mentioned above, the BROKER and database server both have different roles. However, we may make changes to this in the future. On the other hand, this separate component enables to have the flexibility and scalability of the configuration.

For this reason, CUBRID has two configuration files; cubrid.conf for the DB server and client library and cubrid_broker.conf for the BROKER.

CUBRID Configuration Files

Performance

Another shortcoming the BROKER has is that its performance decreases due to the increase of hops. The decline in performance may seem natural since the number of processing stages increases. Performance quality can be assessed with response time and throughput. If the process stages increases, the response time naturally increases. The response time when processing a single query can be within a difference of mss, but the gap can widen greatly if the number of repetitions increase.

The increase in the number of communications is the most significant reason behind the increase in response time.

However, unlike a single client environment, in a multi-client environment the increase of response time due to the increase in communications cannot make a big difference. In environments where the server load becomes extremely high, the database server and client module can be separated, which will increase the amount of resource the server can use. The increase in server’s resources means the increase of server’s throughput. Consequently, having an additional layer, BROKER, is not something that you should be concerned about, because it doesn’t have a grave influence on throughput, and can result in better performance. If the performance seriously matters, then you can use connection pooling and statement pooling because it decreases the number of request sent from the driver to the server.

Failure points

The third shortfall is that the failure points increase because of the existing middle layer (BROKER). Fortunately CUBRID has prepared measures for such situations. CUBRID supports HA (High Availability) feature on the DB server, as well as on the BROKER. By providing the connection URL specifying the active and standby BROKER’s IP and port, the failure on the BROKER can be managed by failover to a standby BROKER so as to keep the service alive.

The Pros of the BROKER

Now let’s take a look at the merits of the BROKER.

Sharing resources

One merit provided by the BROKER is that it can share the cub_cas which handles connections. As was mentioned above, the connection, which handles the actual query, is connected one to one to the cub_cas. In other words, there must be as many cub_cas processes active as concurrent connections.

What if the application creates two connections but there is only one cub_cas?

For convenience sake, let’s call the two connections C1 and C2. C1 is connected to the cub_cas and is processing queries. If the C2 sends a connection request, the cub_broker looks for a cub_cas that can be used. The cub_broker finds one cub_cas process, which is currently processing C1. In such a case, the cub_cas which is currently in service breaks off the connection with C1 and goes to process C2 only when C1 disconnects, or the time between the end of the current transaction and before a new transaction begins. If two cub_cass are active for C1 and C2, then the connection will not be terminated and the queries for C1 and C2 will be processed smoothly.

However, if there is only one cub_cas, C1 and C2 must share a cub_cas and the connection with cub_cas will be terminated every time C1 or C2’s transaction is over. The driver is aware that their connection can be terminated after the transaction, due to the situation before, and will try to reconnect. When the server restarts in the connection pooling environment, connections are restored just as the situation mentioned above, and the application does not have to examine the validity of the connections. The following diagram shows how the processing is done when two connections share a single cub_cas.

cub_cas-connction-management.png

However, frequently used connections have to compete with a single cub_cas, which can cause problems in the performance. So the BROKER parameter value should be configured so that the adequate number of cub_cas are running. However, if a majority of the connections are in idle state in the pool, then sharing the connection is more rational. This will be more discussed in the CAS for Oracle/MySQL later on.

Logging and monitoring

The BROKER's second strength is a convenient monitoring and logging features. Since all queries requested by applications, such as JDBC, pass through the BROKER, a part of the processing can be identified through the BROKER’s status. The status can be identified through utilities that monitor the BROKER status. The status information, shown per cub_cas, has the following meanings:

  • IDLEcub_cas is not connected to the driver.
  • CLIENT WAIT – in the middle of transaction, the 1st request is processed and is waiting for the 2nd request from the application. For example, a transaction consists of two queries, INSERT and UPDATE and commit. The CLIENT WAIT state is activated after the INSERT execution is completed and before the UPDATE query is requested. If this status continues, then the application is probably not sending the 2nd request.
  • CLOSE WAIT – after a transaction in ended, the application has not requested the following query. A new connection request can be handled in cub_cas CLOSE WAIT state only if an IDLE cub_cas is not available.
  • BUSY – a query execution request has been sent to the database server. The query statement that is being executed can be checked with the monitoring utility.
Information such as the QPS, the number of errors, the number of slow queries that is being processed in the BROKER is also monitored. Besides monitoring, the information being processed in the BROKER is also recorded in log files, which provides information on connection as well as query executions.

The following shows how a query log is recorded when a simple JDBC application is executed. The log provides information on when the query was executed, what values were bound, execution time, number of tuples, etc.

String connection_url = "jdbc:cubrid:localhost:53300:testdb:public::?";
Connection con =   DriverManager.getConnection(connection_url, null,   null);
String query = "select   * from foo where id = ?";
 
PreparedStatement pstmt = con.prepareStatement(query);
 
pstmt.setInt(1, 1);
 
ResultSet rs = pstmt.executeQuery();
 
while (rs.next())
{
    // fetch result
}
 
rs.close();
pstmt.close();
con.close();
03/21   18:51:36.467 (0) CLIENT IP 127.0.0.1
03/21   18:51:36.473 (0) connect db cgkdb user public url   jdbc:cubrid:localhost:53300:cgkdb:public::?
 
03/21   18:51:36.476 (1) prepare 0 select * from foo where id = ?
03/21   18:51:36.477 (1) prepare srv_h_id 1
03/21   18:51:36.491 (1) execute srv_h_id 1 select * from foo where id = ?
03/21   18:51:36.491 (1) bind 1 : INT 1
03/21   18:51:36.529 (1) execute 0 tuple 1 time 0.055
03/21   18:51:36.529 (0) auto_commit
03/21   18:51:36.529 (0) auto_commit 0
03/21   18:51:36.529 (0) *** elapsed time 0.052
 
03/21   18:51:36.537 (0) con_close
03/21   18:51:36.537 (0) disconnect

Flexibility

The third advantage of BROKER lies in its various modes. As briefly mentioned before, CUBRID supports HA. Database server has active-standby structure, and data in an active server gets duplicated to a standby server, thereby enabling a continuous service when a failure occurs on the active server through automatical failover to the standby server. While write operations (INSERT/DELETE/UPDATE) are required to be handled by the active server, read operations (SELECT) can be processed by both active and standby servers. According to operations the application executes, a mode to select to which server the application should access is needed.

There are three modes for BROKER.

  • Read-write mode is for accessing the active server with read and write operations;
  • Read-only mode is for accessing the standby server only for read. In case of connection failure, the broker can redirect the requests to the active server.
  • Slave-only mode is also for accessing the standby server only for read, but in case of connection failure, the broker will not redirect the requests to the active server but return the error.

When the BROKER is set in read-only mode and the standby server is accessible, then the broker connects to a standby server. In case of connection failure, the BROKER connects to the active server. Even if the database server has changed its role (i.e. changed from being active to being standby) due to a failure, the application shall not consider which server is active because the BROKER automatically connects to a active/standby server according to its read-write or read-only mode.

In slave-only mode the BROKER can access only a standby server. In a case when a failure occurs on the standby server and there is only an active server, all the requested connections to slave-only BROKER return an error. This mode is useful when a standby server is only allowed to execute queries in order to analyze the inserted data in a real production service however those queries should be blocked from reaching the active server because they may decrease the performance in service.

However if the failure on the standby server lasts too long and the application is needed to connect to the active server, then a DBA can change this slave-only BROKER to read-only. This is very convenient, as there is no need to make code modification on the application level; no application developer is required to be involved. Only DBA is involved and is capable of switching the modes according to their needs. Below is the image of server connection according to BROKER mode.

cubrid-broker-modes.png

For more detailed examples of the BROKER in different modes, you can refer to the presentation we gave at OSCON 2011: CUBRID HA - Guaranteed Way to Never-Die Web Services.

So, now we have briefly looked into advantages and disadvantages of the BROKER model. Let me show you some examples of possible BROKER development.

CAS for Oracle/MySQL

CAS for Oracle/MySQL is a middleware connection pool manager for ORACLE and MySQL server based on CUBRID BROKER and JDBC driver. The main purpose of CAS for Oracle/MySQL is to be able to use the advantages of CUBRID Broker's connection management also in Oracle/MySQL. Let’s say there are two open connections in a single Web server which are in idle state in the connection pool. If there are 10 Web servers in idle state, this means there are 20 connections in total which are mostly idle. If the number of Web servers increases as a service grows in traffic, the number of idle connections will also increase accordingly. In this case, if application connections can be shared in the BROKER layer, we can avoid the increase of idle connections. The connection loads on the DBMS server can be reduced if cub_cas is set to operate accordingly not by total connections set in Web servers, but also by actual load of DBMS. The following image is simplified structure of CAS for Oracle/MySQL, which you can also see in the above mentioned OSCON presentation.

cubrid-cas-for-oracle.png

DB sharding

As database grows in volume, there are some cases where you need to manage the data larger than manageable capacity that each device can handle. In such cases, the data needs to be allocated to several databases with the same schema. Generally, when processed in application, you are to select a sharding key and a rule for deciding a shard based on a key value and then to pick up a database and query to it. For example, as for forum database, forum ID shall be a shard key and a shard will be determined through hash function in the application. If such a sharding method is supported in the BROKER, then the application has to transfer only a shard key with a query and the BROKER will determine a shard by the shard key and properly execute the query to the shard. In this case, there will be no need for implementing logic for deciding a shard in application. Only changing the BROKER’s configuration will be necessary when more shards add up.

We wrote a separate blog about CUBRID Database Sharding. Refer to it for more explanation and examples.

Load balancing

In a case of increased read load on the DB, one of the possible ways to handle the load is to extend the replication server, then distribute read requests. CUBRID also supports 1:N active standby configurations in HA and allows to have additional replicas for read balancing.

In BROKER, however, loads shall be balanced dynamically deciding how much load is applied to each replicated server. For example, let’s say in a case where there are ten connections, five of them are connected to the replicated server 1 while the other give - to the replicated server 2. It is impossible to forecast how much connections will be used at that moment when each of 5 connections is made. If the connection pool operates in First-In-First-Out, throughput per a connection could vary greatly one from another and so could throughput per a replicated server. However, if connection can be changed after evaluating its throughput dynamically, then it can balance each of throughputs and distribute server loads automatically by only changing BROKER’s configuration even when additional replicated server is added.

What DB sharding and load balancing have in common is that by the time the request is handled, a target DB should be selected. And the good news is that the 3-tier structure is capable of responding to such requests in a more flexible manner. CAS for Oracle/MySQL is being supported in CUBRID version 8.4.0, while the Database Sharding and load balancing featues will be provided in the upcoming versions.

Conclusion

So far, we have looked into CUBRID and its middleware BROKER. Although there are some disadvantages of the 3-tier structure, I believe that there are many more advantages in it which allow more flexibility to connection management as the server environment gets more complex. Taking advantage of its unique structural advantage, starting from CAS for Oracle/MySQL, the CUBRID BROKER shall evolve to become an independent middleware.

By Chol Gyu Kang, CUBRID Dev. Lab Leader, NHN Corporation.



comments powered by Disqus