<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/">
    <channel>
        <title>How to Efficiently Import Large Files in CUBRID</title>
        <link>http://www.cubrid.org/?mid=import_large_file</link>
        <description>How to Efficiently Import Large Files in CUBRID</description>
        <language>en</language>
        <pubDate>Thu, 17 Mar 2011 11:24:40 -0800</pubDate>
        <lastBuildDate>Mon, 14 Nov 2011 06:29:37 -0800</lastBuildDate>
        <generator>XpressEngine 1.4.4.1</generator>
                        										        <item>
            <title>How to Efficiently Import Large Files in CUBRID</title>
            <dc:creator>CUBRID</dc:creator>
            <link>http://www.cubrid.org/import_large_file</link>
            <guid isPermaLink="true">http://www.cubrid.org/import_large_file</guid>
                                    <description><![CDATA[<h1>How to Efficiently Import Large Files in CUBRID</h1>

<div class="contents-table"></div>

<p>Sometimes we need to import large data files, which can exceed several GB, into CUBRID tables. In this article we will cover different ways to perform this task:</p><ul><li>The easiest way to import data into CUBRID is to use the <b>CUBRID Manager</b>.</li><li>You can also insert the data using PHP, Java, or other <b>CUBRID drivers</b>.</li><li>Alternatively, you can use <b>CSQL</b>, CUBRID's command line SQL interpreter.</li></ul>

<p>First, we will show some basic test results so that you can get an overall idea of why each of the solutions listed above is faster or slower. Then we will list the recommendations you are encouraged to follow when importing large data into CUBRID.</p>

<h2>Test Overview</h2><p>For each of these solutions we will run a test on a small amount of data (100,000 records) to give some idea of its performance. The tests are run on Windows 7 x86 with CUBRID 8.4.0 installed. We will use the following tools:</p><ul><li>CSQL<ul><li><b>-S</b> (Stand-alone execution mode)</li><li><b>-C</b> (Client-server execution mode)</li></ul></li><li>CUBRID Manager</li><li>PHP API</li></ul><p>Additionally, the following configuration will be applied to each of the solutions.</p><ul><li>The commit cycle will be set to <b>5,000</b>.</li><li>We will measure the execution time of the INSERT statements only, so the database and tables will be created in advance.</li></ul>

<h2>Test Scenario</h2>

<h3>Running CSQL in the command line (-S and -C modes)</h3>

<p><a href="/manual/840/en/Introduction%20to%20the%20CSQL%20Interpreter" target="_self">CSQL</a> is a command line tool which allows you to execute SQL queries against a CUBRID database server. Compared to CUBRID Manager, it is much lighter and faster. There are two modes in which CSQL can be executed. One is <b>Stand-alone</b>&nbsp;mode (-S), the other is <b>Client-server</b>&nbsp;mode (-C).</p>

<ul><li>In the <b>Stand-alone</b> mode, CSQL Interpreter directly accesses database files and executes commands including server process functions. That is, SQL statements can be sent and executed to a database without a separate database server process running for the task. Since the standalone mode allows only one user access at a given time, it is suitable for management tasks by Database Administrators (DBAs).</li><li>In the <b>Client-server</b>&nbsp;mode,&nbsp;CSQL Interpreter usually operates as a client process and accesses the server process.</li></ul>

<p>For more, read <a href="/manual/840/en/CSQL%20Execution%20Mode" target="_self">CSQL Execution Mode</a>.</p>

<p>Now we will create a database for our test in the command line.</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="true" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">$&gt; cubrid createdb dbtest</div>

<p>Then we need to <a href="/manual/840/en/Connecting%20to%20Local%20Host" target="_self">connect</a> to the <b>dbtest</b>&nbsp;database using&nbsp;CSQL and create the necessary tables.</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
$&gt; csql dbtest

        CUBRID SQL Interpreter


Type `;help' for help messages.

csql&gt; CREATE TABLE test1(a int, b TIMESTAMP, c int AUTO_INCREMENT)
csql&gt; ;ru

Current transaction has been committed.

1 command(s) successfully processed.
csql&gt; ;ex
</div>

<ul><li><b>;ru</b>&nbsp;is a session command which directs CSQL to <i>run</i> the entered SQL statement;</li><li><b>;ex</b>&nbsp;stands for <i>exit</i>. For more commands see <a href="/manual/840/en/Session%20Commands" target="_self">Session Commands</a>.</li></ul><p>Now that everything is prepared, let's create a <b>dbtest1.sql</b> file with the INSERT statements which will be used in our test.</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="true" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
INSERT INTO test1 VALUES (0 , SYS_TIMESTAMP, NULL);
INSERT INTO test1 VALUES (1 , SYS_TIMESTAMP, NULL);
……………………
INSERT INTO test1 VALUES (99998 , SYS_TIMESTAMP, NULL);
INSERT INTO test1 VALUES (99999 , SYS_TIMESTAMP, NULL);
</div>
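
<p>Writing 100,000 statements by hand is impractical, so the file is normally generated. The following is a minimal sketch, assuming a POSIX shell is available (the test machine in this article runs Windows, so this is an assumption); the function name <code>generate_inserts</code> is hypothetical and not part of CUBRID.</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="true" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
# generate_inserts COUNT: print COUNT INSERT statements for table test1,
# numbered from 0 to COUNT-1, one statement per line.
generate_inserts() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'INSERT INTO test1 VALUES (%d, SYS_TIMESTAMP, NULL);\n' "$i"
        i=$((i + 1))
    done
}

# Usage: redirect the output into the file that CSQL will read.
#   generate_inserts 100000 &gt; dbtest1.sql
</div>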

<p>To run the test we direct CSQL to load the SQL statements from the file and execute them. To start CSQL in Stand-alone mode, we run the following command:</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="true" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
$&gt; csql -u dba -p 1111 -S -i dbtest1.sql dbtest
</div>

<p>To start CSQL in Client-server mode, we run the following command:</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="true" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
$&gt; csql -u dba -p 1111 -C -i dbtest1.sql dbtest
</div>
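
<p>Since the test measures only execution time, each run can be wrapped in a small timing helper. This is a rough sketch, assuming a POSIX-like shell whose <code>date</code> supports <code>+%s</code>; the helper name <code>run_timed</code> is hypothetical, and the resolution is whole seconds, which matches the granularity of the results reported in this article.</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="true" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
# run_timed LABEL COMMAND...: run COMMAND, then report the elapsed
# wall-clock time in whole seconds together with the exit status.
run_timed() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit status $status)"
    return "$status"
}

# Usage with the two commands from this article:
#   run_timed csql-S csql -u dba -p 1111 -S -i dbtest1.sql dbtest
#   run_timed csql-C csql -u dba -p 1111 -C -i dbtest1.sql dbtest
</div>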

<h3>Importing in PHP</h3>

<p>As in the previous example, we will use the same database and table. We will run the following PHP code to insert 100,000 records.</p>

<div editor_component="code_highlighter" code_type="php" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
$host_ip = "localhost";
$host_port = 33000;
$db_name = "dbtest";
$userId = "dba";
$password = "1111";

$cubrid_con = @cubrid_connect($host_ip, $host_port, $db_name, $userId,   $password);

if($cubrid_con)
{
    // Insert into the test1 table created earlier (not into the database name).
    $sql = "insert into test1 (a, b) values (?, SYS_TIMESTAMP)";
    $reg = cubrid_prepare($cubrid_con, $sql, CUBRID_INCLUDE_OID);
    
    // Insert 100,000 records in the loop.
    for($i = 0; $i &lt; 100000; $i++)
    {   
        $res = cubrid_bind($reg, 1, $i);
        $res = cubrid_execute($reg);

        // Commit once in 5,000 times (commit cycle).
        if (($i+1) % 5000 == 0)
        {   
            cubrid_commit($cubrid_con);
            echo $i, "\n";
        }
    }
}
</div>

<h3>Importing in CM</h3><p>We will import the data from the file we have previously created for CSQL.&nbsp;We will use CUBRID Manager's default <code class="menu">Import Data</code> feature (see the image below).</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/cm-import-data.png" alt="Import Data in the CUBRID Manager" width="393" height="552" editor_component="image_link"/></p>

<h2>Test Results</h2><p>The following are the results we obtained from this experimental test run on a small amount of data. The table and graph below show the execution times in <b>seconds</b>.</p>

<table class="blackcap rowbg">
<thead>
<tr>
<th></th>
<th>50,000 records</th>
<th>100,000 records</th>
<th>300,000 records</th>
</tr>
</thead>
<tbody>
<tr>
<th>csql-S</th>
<td>5</td>
<td>10</td>
<td>29</td>
</tr>
<tr>
<th>csql-C</th>
<td>111</td>
<td>224</td>
<td>599</td>
</tr>
<tr>
<th>PHP</th>
<td>68</td>
<td>136</td>
<td>413</td>
</tr>
<tr>
<th>CM</th>
<td>17</td>
<td>33</td>
<td>96</td>
</tr>
</tbody>
</table>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/cubrid-tools-import-results.jpg" alt="CUBRID Tools Import Results" width="445" height="249" editor_component="image_link"/></p>

<h2>Conclusion</h2><h3>Use CSQL in Stand-alone mode</h3><p>As we can see from this test,&nbsp;<b>CSQL in Stand-alone</b>&nbsp;mode is the fastest way to import data into CUBRID. This is because it works directly with the database files without interacting with a server process. It behaves as if the server itself were performing the operations, rather than a client connected to the server.</p><p>However, there are cases when we cannot use <i>Stand-alone</i> mode, because in this mode no other user may interact with the database: CSQL must be the only user working with the database at that moment. Thus <i>Stand-alone</i>&nbsp;mode requires that the database server not be running. If it is running, the database is considered to be in use by another user (host), and you will receive the following error.</p>

<div editor_component="code_highlighter" code_type="Plain" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
$&gt; csql -S demodb

ERROR: Unable to mount disk volume "C:\CUBRID\databases\demodb\demodb_lgat".
The database "C:\CUBRID\DATABA~1\demodb\demodb", to which the disk volume belongs,
is in use by user USER-PC$ on process 3096 of host user-PC since Thu Sep 22 11:04:01 2011.
</div>

<p>In such cases, you either have to make sure no other user is interacting with the database by stopping the database server, or use one of the other solutions. To stop the database from the command line, type the following command, where dbtest is the name of your database.</p><div editor_component="code_highlighter" code_type="Plain" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">cubrid server stop dbtest</div>

<h3>Create CONSTRAINT after Importing</h3>

<p>This is one of the most important recommendations for users who import large data into tables. <b>Do not</b> create any kind of index before you import the data. This applies to <code>INDEX</code>, <code>UNIQUE</code>, <code>REVERSE INDEX</code>, <code>REVERSE UNIQUE</code>, and even <code>PRIMARY KEY</code> (a PK automatically creates an index). Otherwise, every <code>INSERT</code> during the import forces the server to update the index with the new value, which adds to the overall time. So:</p>

<ol>
  <li>Create your table.</li>
  <li>Define your columns with data types, but <b>do not</b> add any constraints, not even the <code>PRIMARY KEY</code>.</li>
  <li>Import your data.</li>
  <li>Only then create the constraints.</li>
</ol>
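
<p>The last step above can be sketched as a small script that produces the constraint statements to run once the import has finished. This is only an illustration under stated assumptions: the helper name <code>post_import_sql</code>, the constraint and index names, the choice of columns, and the file name <code>constraints.sql</code> are all hypothetical, and a POSIX shell is assumed.</p>

<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="true" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
# post_import_sql TABLE PK_COLUMN INDEX_COLUMN: print the statements that add
# a primary key and a secondary index to a table that was loaded without them.
post_import_sql() {
    table=$1
    pk_column=$2
    index_column=$3
    printf 'ALTER TABLE %s ADD CONSTRAINT pk_%s_%s PRIMARY KEY (%s);\n' \
        "$table" "$table" "$pk_column" "$pk_column"
    printf 'CREATE INDEX idx_%s_%s ON %s (%s);\n' \
        "$table" "$index_column" "$table" "$index_column"
}

# Usage, after the data has been imported:
#   post_import_sql test1 a b &gt; constraints.sql
#   csql -u dba -p 1111 -S -i constraints.sql dbtest
</div>

<p>Building the index once over the complete data set avoids updating it on every single <code>INSERT</code>.</p>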

<h3>Turn OFF Logging</h3>

<p>There are two types of logging:</p><ul><li>Client-side logging</li><li>Server-side logging</li></ul><h4>Client-side logging</h4><p><b>Client-side logging</b> refers to the&nbsp;<span class="Apple-style-span" style="font-family: monospace; "><a href="/manual/840/en/SQL%20Logs">SQL_LOG</a></span><span class="Apple-style-span">&nbsp;parameter of the Broker</span>, the middleware of CUBRID.&nbsp;By default it&nbsp;is <a href="/manual/840/en/Broker%20Logs">turned <code>ON</code></a>.</p>

<p>When <code>SQL_LOG</code> is set to ON, all SQL statements processed by the CUBRID Application Server (CAS) are stored in an external log file, which adds to the import time. Unless you are sure that you need logging for your import, turn it OFF. It will significantly improve the import performance.</p>

<p>There are several ways to turn <code>SQL_LOG</code> OFF. We will show you how to change its value in the <a href="/cubrid_manager">CUBRID Manager</a> and in the command line.</p>

<h5>CM Example</h5>

<p>To turn OFF <code>SQL_LOG</code> in the CUBRID Manager, right click on the broker and select <code class="menu">Properties</code>.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/broker-properties.png" alt="Broker properties" width="249" height="297" editor_component="image_link"/></p>

<p>In the modal window set the value of <code>SQL_LOG</code> to OFF. Press the OK button.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/broker-properties-window.png" alt="Broker properties window" width="725" height="495" editor_component="image_link"/></p>

<p>To apply the changes, we need to restart the Broker. Right click on the broker and select <code class="menu">Broker Off</code>. Then start it again by clicking on the <code class="menu">Broker On</code> menu.&nbsp;Then do your importing.</p>

<h5>Command Line Example</h5>

<p>In a text editor, open <b>cubrid_broker.conf</b>, the default broker configuration file located in the <b>conf</b> directory of your CUBRID installation. Set <code>SQL_LOG = OFF</code> as shown below and save the file.</p>

<div class="code">
<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false">
[broker]
...
SQL_LOG                 =OFF
...
</div>
</div>

<p>Then restart the broker.</p>

<div class="code">
<div editor_component="code_highlighter" code_type="plain" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false">
cubrid broker restart
</div>
</div>

<h4>Server-side logging</h4>

<p>Server-side logging is determined by the <b>media_failure_support</b>&nbsp;parameter of the CUBRID Server, which&nbsp;specifies whether or not to store archive logs in case of storage media failure. If the parameter is set to <b>yes</b>, which is the default value, all active logs are copied to archive logs when the active logs are full and the transaction is active. If it is set to <b>no</b>, archive logs created after the active logs are full are deleted automatically.</p><p>By setting its value to <b>no</b>&nbsp;it is possible to save some more time when performing a mass import. To change its value&nbsp;in the CUBRID Manager, right click on the host and select&nbsp;<code class="menu">Properties</code>.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/host-properties.png" alt="CUBRID Host Properties" width="249" height="310" editor_component="image_link"/></p>

<p>In the modal window set the value of&nbsp;<b>media_failure_support</b>&nbsp;parameter&nbsp;to <b>no</b>. Press the OK button.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/server-common-parameters.png" alt="CUBRID Server Common Parameters" width="735" height="781" editor_component="image_link"/><br /></p><p>To apply the changes, make sure to restart the server.</p><h3>Utilize Threads and Commit Count</h3><p>If you use CUBRID Manager to import the data, make sure you utilize threads and commit counts.</p><p>Threads direct CM to use several concurrent connections to perform INSERTs. To specify the number of threads, set the <b>Thread count</b>&nbsp;value. However, remember that too many threads may, on the contrary, degrade performance. It all depends on your hardware specification. Normally we recommend using somewhere between 5 and 10 threads.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/import-data-window.png" alt="Import Data" width="525" height="781" editor_component="image_link"/></p>

<p>Also, the execution time will depend on the&nbsp;<b>commit cycle</b>, which determines how often the inserted data is committed. Committing too&nbsp;frequently&nbsp;degrades performance, while too high a value might require too much memory. A value around&nbsp;<b>5,000</b>&nbsp;is recommended.</p>

<h3>Utilize data_buffer_size</h3>

<p>The <b>data_buffer_size</b>&nbsp;server parameter&nbsp;configures the number of data pages to be cached in memory by the Database Server.&nbsp;The greater the value of the <b>data_buffer_size</b> parameter, the more data pages can be cached in the buffer, which reduces disk I/O cost. However, if this parameter is too large, the buffer pool can be swapped out by the operating system because too much system memory is occupied. It is recommended to configure the data_buffer_size parameter so that the required memory size is less than two-thirds of the system memory size. The default value is 512M.</p><h3>Utilize insert_execution_mode</h3><p>The <b>insert_execution_mode</b>&nbsp;parameter is a very useful setting which allows INSERT operations to be performed on the server side instead of the client side. It has 7 modes (for more see the <a href="/manual/840/en/Other%20Parameters" target="_self">Database Server Parameters</a>). By default its value is <b>1</b>, which means all <b>INSERT INTO ... SELECT ...</b>&nbsp;queries will be executed on the server side. When importing data, you do not need this mode. Instead, set <b>insert_execution_mode</b>&nbsp;to 2, which directs all common INSERT statements like <b>INSERT INTO ... VALUES ...</b>&nbsp;to be processed by the server. See the following image.</p><p>This parameter is useful when the memory capacity of the client is limited or when dirty reading of INSERTed data is required.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/858/054/server-common-parameters-insert-execution-mode.png" alt="CUBRID Insert Execution Mode Parameter" width="735" height="799" editor_component="image_link"/></p>

<p>In conclusion, let's wrap up all of these recommendations.</p><ul><li>Use CSQL in Stand-alone mode.</li><li>Create CONSTRAINT after importing.</li><li>Turn OFF Logging.</li><li>Utilize Threads and Commit Count.</li><li>Utilize data_buffer_size.</li><li>Utilize insert_execution_mode.</li></ul>

<h2>References</h2>

<ul>
  <li><a href="/manual/840/en/cubrid_broker.conf%20Configuration%20File%20and%20Default%20Parameters">cubrid_broker.conf Configuration File and Default Parameters</a></li>
  <li><a href="/manual/840/en/Broker%20Logs">Broker Logs</a></li>
  <li><a href="/manual/840/en/Dynamically%20Changing%20Broker%20Parameters">Dynamically Changing Broker Parameters</a></li>
  <li><a href="/manual/840/en/Checking%20Broker%20Status">Checking Broker Status</a></li>
  <li><a href="/manual/840/en/Parameter%20by%20Broker">Parameter By Broker</a></li>
  <li><a href="/?_filter=search&amp;mid=forum&amp;cpage=1&amp;category=195524&amp;act=dispForumContent&amp;document_srl=195665" target="_self">Issues importing a large MySQL table</a> - Forum post</li>
</ul>]]></description>
                        <pubDate>Thu, 17 Mar 2011 10:25:16 -0800</pubDate>
                        <category>performance</category>
                        <category>import</category>
                        <category>csql</category>
                        <category>cm</category>
                                </item>
            </channel>
</rss>
