Open Source RDBMS - Seamless, Scalable, Stable and Free

한국어 | Login |Register

How to Efficiently Import Large Files in CUBRID

Sometimes we need to import large data files to CUBRID tables which can go over several GB. In this article we will cover different ways to perform this task. They are:

  • The easiest way to import data to CUBRID is using the CUBRID Manager.
  • You can also insert using the data using PHP, Java, or other CUBRID drivers.
  • Otherwise, you can use CSQL, CUBRID's command line SQL interpreter.

First, we will show you some basic test results so that you can have an overall idea why certain solution, among those listed above, is faster or slower. Then we will list the recommendation you are encouraged to follow when importing large data to CUBRID.

Test Overview

For each of these solution we will perform a test on a small amount of data (100,000 records) to provide some idea on each of the solution's performance. The setting will be executed on Windows 7 x86 with CUBRID 8.4.0 installed. We will use .

  • CSQL
    • -S (Stand-alone execution mode)
    • -C (Client-server execution mode)
  • CUBRID Manager
  • PHP API

Additionally the following configurations will be applied to each of the solutions.

  • Commit cycle will be set to 5,000.
  • We will measure the execution time of only the INSERT statements, thus we will have the database and tables created in advance.

Test Scenario

Running CSQL in the command line (-S and -C modes)

CSQL is a command line tool which allows to execute SQL queries to CUBRID database server. When compared to CUBRID Manager, it is much lighter and faster. There are two modes in which CSQL can be executed. One is Stand-alone mode (-S), the other is Client-server mode (-C).

  • In the Stand-alone mode, CSQL Interpreter directly accesses database files and executes commands including server process functions. That is, SQL statements can be sent and executed to a database without a separate database server process running for the task. Since the standalone mode allows only one user access at a given time, it is suitable for management tasks by Database Administrators (DBAs).
  • In the Client-server mode, CSQL Interpreter usually operates as a client process and accesses the server process.

For more, read CSQL Execution Mode.

Now we will create a database for our test in the command line.

$> cubrid createdb dbtest

Then we need to connect to the dbtest database using CSQL and create the necessary tables.

$> csql demodb

        CUBRID SQL Interpreter


Type `;help' for help messages.

csql> CREATE TABLE test1(a int, b TIMESTAMP, c int AUTO_INCREMENT)
csql> ;ru

Current transaction has been committed.

1 command(s) successfully processed.
csql> ;ex
  • ;ru is a session command which directs CSQL to run the entered SQL statement;
  • ;ex stands for exit. For more commands see Session Commands.

Since we have prepared everything, let's create a dbtest.sql file with the INSERT statements which will be used in our test.

INSERT INTO test1 VALUES (0 , SYS_TIMESTAMP, NULL);
INSERT INTO test2 VALUES (1 , SYS_TIMESTAMP, NULL);
……………………
INSERT INTO test1 VALUES (99998 , SYS_TIMESTAMP, NULL);
INSERT INTO test1 VALUES (99999 , SYS_TIMESTAMP, NULL);

To run the test we direct CSQL to load the SQL statements from the file and execute them. To start CSQL in Stand-alone mode, we run the following command:

$> csql -u dba -p 1111 –S -i dbtest1.sql dbtest

To start CSQL in Client-server mode, we run the following command:

$> csql -u dba -p 1111 –C -i dbtest1.sql dbtest

Importing in PHP

Like in the previous example, we will use the same database and table information. Then we will run the following PHP code to insert 100,000 records.

$host_ip = "localhost";
$host_port = 33000;
$db_name = "dbtest";
$userId = "dba";
$password = "1111";

$cubrid_con = @cubrid_connect($host_ip, $host_port, $db_name, $userId,   $password);

if($cubrid_con)
{
    $sql = "insert into " . $db_name . " (a, b) values (?, SYS_TIMESTAMP)";
    $reg = cubrid_prepare($cubrid_con, $sql, CUBRID_INCLUDE_OID);
    
    // Insert 100,000 records in the loop.
    for($i = 0; $i < 100000; $i++)
    {   
        $res = cubrid_bind($reg, 1, $i);
        $res = cubrid_execute($reg);

        // Commit once in 5,000 times (commit cycle).
        if (($i+1) % 5000 == 0)
        {   
            cubrid_commit($cubrid_con);
            echo $i, "
";
        }
    }
}

Importing in CM

We will import the data from the file we have previously created for CSQL. We will use CUBRID Manager's default Import Data feature (see the image below).

Import Data in the CUBRID Manager

Test Results

The following are the results we obtained from this experimental test ran on a small amount of data. The following table and graph represent the data in seconds.

50,000 records 100,000 records 300,000 records
csql-S 5 10 29
csql-C 111 224 599
PHP 68 136 413
CM 17 33 96

CUBRID Tools Import Results

Conclusion

Use CSQL in Stand-alone mode

As we can see from this test, CSQL in Stand-alone mode is the fastest solution to import the data to CUBRID. This is because it directly deals with the database without interacting with server processes. It behaves as if server itself is performing the operations, not the client which is connected to the server. For this reason it is the fastest solution.

However, there are cases when we cannot use the Stand-alone mode, since in Stand-alone mode no other database user has to interact with the database. This means that CSQL should be the only user which works with the database at that particular moment. Thus Stand-alone mode implies that the database should not be running. If it is running, this means another user (host) is using it. In this case, you will receive the following warning.

$> csql -S demodb

ERROR: Unable to mount disk volume "C:CUBRIDdatabasesdemodbdemodb_lgat".
The database "C:CUBRIDDATABA~1demodbdemodb", to which the disk volume belongs,
is in use by user USER-PC$ on process 3096 of host user-PC since Thu Sep 22 11:04:01 2011.

In such cases, either you have to make sure no other user is interacting with the database, by stopping the database, or use other solutions. To stop the database in the command line, type the following command where demodb is the name of your database.

cubrid server stop dbtest1

Create CONSTRAINT after Importing

This is one of the foremost important recommendations for users who import large data into tables. Do not create any type of INDEXing before you import the data. This related to INDEX, UNIQUE, REVERSE INDEX, REVERSE UNIQUE, and even PRIMARY KEY (PK automatically creates INDEX). Otherwise, every INSERT during the importing process will force the server to reindex the new value which adds up to the overall time. So:

  1. Create your table.
  2. Define your columns with data types, but do not add any constraints event the PRIMARY KEY.
  3. Import your data.
  4. Only then create the constraints.

Turn OFF Logging

There are two types of logging:

  • Client-side logging
  • Server-side logging

Client-side logging

The client-size logging refers to SQL_LOG parameter of the Broker, the middleware of CUBRID. By default it is turned ON.

When SQL_LOG is configured to ON all SQL statements processed by the CUBRID Application Server (CAS) will be stored in the external log file. So this adds up some time, too. So unless you are sure that you need logging for your import, turn it OFF. It will significantly improve the import performance.

There are several ways to turn SQL_LOG OFF. We will show you how to change its value in the CUBRID Manager and in the command line.

CM Example

To turn OFF SQL_LOG in the CUBRID Manager, right click on the broker and select Properties.

Broker properties

In the modal window set the value of SQL_LOG to OFF. Press the OK button.

Broker properties window

To apply the changes, we need to restart the Broker. Right click on the broker and select Broker Off. Then start it again by clicking on the Broker On menu. Then do your importing.

Command Line Example

Open in the text editor the cubrid_broker.conf default broker configuration file located in the conf directory where you have installed CUBRID. In the file set SQL_LOG = OFF and save it as shown below.

[broker]
...
SQL_LOG                 =OFF
...

Then restart the broker.

cubrid broker restart

Server-side logging

The server-side logging is determined by the media_failure_support parameter of the CUBRID Server which specifies whether or not to store archive logs in case of storage media failure. If the parameter is configured to yes, which is the default value, all active logs are copied to archive logs when the active logs are full and the transaction is active. If it is configured to no, archive logs created after the active logs are full are deleted automatically. Note that archive logs are deleted automatically if the value of the parameter is configured to no.

By setting its value to no it is possible to gain some more time when performing mass import. To change its value in the CUBRID Manager, right click on the host and select Properties.

CUBRID Host Properties

In the modal window set the value of media_failure_support parameter to no. Press the OK button.

CUBRID Server Common Parameters

To apply the changes, make sure to restart the server.

Utilize Threads and Commit Count

If you use CUBRID Manager to import the data, make sure you utilize threads and commit counts.

Thread will direct CM to use several concurrent connections to perform INSERTs. To specify the number of threads, set the Thread count value. However you should remember that too many thread may vice-versa deteriorate the performance. It all depends on your hardware spec. Normally we recommend to use somewhere between 5 and 10 threads.

Import Data

Also, the execution time will depend on the commit cycle. It determines how often the inserted data should be committed. Committing frequently will deteriorate the performance. However, too high number might require too much memory. The value around 5,000 is recommended.

Utilize data_buffer_size

The data_buffer_size is the server parameter that configures the number of data pages to be cached in the memory by the Database Server. The greater the value of the data_buffer_size parameter, the more data pages can be cached in the buffer, which will provide the advantage of decreased disk I/O cost. However, if this parameter is too large, the buffer pool can be swapped out by the operating system because the system memory is excessively occupied. It is recommended to configure the data_buffer_size parameter in a way the required memory size is less than two-thirds of the system memory size. The default value is 512M.

Utilize insert_execution_mode

This insert_execution_mode is a very useful setting which allows to perform the INSERT operations on the server side instead of the client side. It has 7 modes (for more see the Database Server Parameters). By default it's value is 1, which means all INSERT INTO ... SELECT ... queries will be executed on the server side. When importing the data, you do not need this mode. Instead, you will need to set insert_execution_mode to 2, which will direct to perform all common INSERT statements like INSERT INTO ... VALUES ... to be processed by the server. See the following image.

This parameter is useful when the memory capacity of the client is limited or the dirty reading of INSERTed data is required.

CUBRID Insert Execution Mode Parameter

In conclusion, let's wrap up all these.

  • Use CSQL in Stand-alone mode.
  • Create CONSTRAINT after importing.
  • Turn OFF Logging.
  • Utilize Threads and Commit Count.
  • Utilize data_buffer_size.
  • Utilize insert_execution_mode.

References

See also

CUBRID Query Tuning Techniques

This article has been written by one of the CUBRID core developers to help users improve their application performance by understanding h...

What is Covering Index in CUBRID 8.4.0?

Starting from version 8.4.0 CUBRID supports Covering Index. Wikipedia explains it as:

Run CUBRID Manager on Mac OS X

It is really the great news that CUBRID Manager is now available for Mac users. In this guide you will see how to establish a remote database connect...




You are either using a very old browser or a browser that is not supported.
In order to browse cubrid.org you need to have one of the following browsers:



Internet Explorer: Mozilla Firefox: Google Chrome: