<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/">
    <channel>
        <title>CUBRID Query Tuning Techniques</title>
        <link>http://www.cubrid.org/?mid=cubrid_query_tuning</link>
        <description>CUBRID Query Tuning Techniques</description>
        <language>en</language>
        <pubDate>Sun, 04 Sep 2011 20:19:02 -0800</pubDate>
        <lastBuildDate>Wed, 31 Oct 2012 00:26:35 -0800</lastBuildDate>
        <generator>XpressEngine 1.4.4.1</generator>
                        										        <item>
            <title>CUBRID Query Tun...</title>
            <dc:creator>CUBRID</dc:creator>
            <link>http://www.cubrid.org/cubrid_query_tuning</link>
            <guid isPermaLink="true">http://www.cubrid.org/cubrid_query_tuning</guid>
                                    <description><![CDATA[<h1>CUBRID Query Tuning Techniques</h1>

<div class="contents-table"></div>

<p>This article was written by one of the CUBRID core developers to help users improve their application performance by understanding how CUBRID indexing behaves and how to tune it.</p><p>Database tuning, which is necessary to build a high-performance Web service, depends largely on how well index tuning is done. However, over the past three years, while CUBRID has been applied to NHN's internal and external services, it has turned out that surprisingly many developers either do not know much about <i>Database Indexing</i> or do not know how to use it.</p><p>In this article I will explain the index structure and the various indexing techniques introduced in CUBRID 2008 R4.0 on July 1st, 2011 to significantly improve CUBRID Database performance. Some of these techniques are also present in other database management systems.</p><h2>Understanding Index Structure and Scanning</h2><p>In CUBRID, indexing is implemented as a <a href="/cubrid_covering_index#why-do-we-need-covering-index" target="_self">B+tree</a> structure in which index values are stored in leaf nodes. Unlike a <a href="http://en.wikipedia.org/wiki/B-tree" target="_self">B-tree</a>, which keeps data together with the keys in every node, a B+tree stores the pointers to data only in the leaf nodes, along with the keys.</p><p>The non-leaf nodes above the leaf nodes, on the other hand, are typical B-tree nodes that act as an index for quickly finding the leaf nodes. Additionally, the leaf nodes are connected in a linked list, which allows faster sequential access, for instance when performing a <b>range search</b>, one of the key performance advantages of CUBRID over MySQL. Below you can see a typical B+tree structure.</p><p><img src="http://www.cubrid.org/files/attach/images/49/282/220/b+tree-structure.png" alt="B+tree Structure" width="735" height="299" editor_component="image_link"/><br /></p><p>Here is how <b>Index Range Scan</b> works. Without an index, finding the rows in a given range means reading every record of the table, from the very first to the last, and checking whether each one falls in that range. Because index keys are sorted, an index scan can instead start at a specific location and stop as soon as the key values no longer match the search criteria. A <i>range scan</i> therefore needs two sub-keys: the first represents the <b>min value</b> and the second the <b>max value</b>.</p><p>Index Range Scan is a two-step process. The first step is to traverse the tree from the root node to the leaf nodes and find these sub-keys. The second step is to read the records sequentially, starting from the <i>min</i> key. If the <i>max</i> key is not found in the current record, the scan reads the next record, and so on until the <i>max</i> key is found. When this sequential search for the <i>max</i> key is completed, the full range search is done.</p><p>See also <a href="/cubrid_840_key_features#limit-optimizations" target="_self">Multi Range Scanning</a>.</p><h2>Understanding Query Process Utilizing Index Scan</h2><p>For a practical example, let's consider the following table structure.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
CREATE TABLE tbl (a INT NOT NULL, b STRING, c BIGINT);
</div>

<p>And we create a multi-column index on columns a and b.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
CREATE INDEX idx ON tbl (a, b);
</div>

<p>Let's insert some records.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
INSERT INTO tbl VALUES (1, 'AAA', 123), (2, 'AAA', 12), …;
</div>

<p>The following picture shows index nodes pointing to data stored in the heap file on the disk.</p><ol><li>The index values (both <b>a</b> and <b>b</b>) are sorted in ascending order (by default).</li><li>Each index node has a pointer, illustrated by an arrow, to the corresponding data (table row) in the heap file.</li><li>The data in the heap file is stored in random order.</li></ol>

<p style="text-align: center;"><img src="http://www.cubrid.org/files/attach/images/49/753/202/cubrid-index-structure.png" alt="CUBRID Index Structure" width="382" height="500" editor_component="image_link"/><br /></p>

<p>Let's see how the index scanning is usually performed. On the table defined above with the data we have inserted, we will execute the following SELECT query.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT * FROM tbl
WHERE a &gt; 1 AND a &lt; 5
AND b &lt; 'K'
AND c &gt; 10000
ORDER BY b;</div>

<p>When executed, the search is completed in three steps.</p>

<ol><li><b>Key Range </b>search<b>:</b> CUBRID will first find all index nodes where <b><span style="color: rgb(255, 0, 0); ">a &gt; 1 and a &lt; 5</span></b>.</li><li><b>Key Filter </b>search<b>:</b> Here <i>Key Range</i> cannot be used, but the records can still be filtered using the index keys. Thus, among the nodes found in the first step, CUBRID will find all nodes where <b><span style="color: rgb(0, 117, 200); ">b &lt; 'K'</span></b>.</li><li><b>Data Filter </b>search<b>:</b> Since column <b>c</b> is not indexed, <i>Key Filter</i> cannot be used for the condition on <b>c</b>. To obtain the value of <b>c</b>, it is necessary to look up the heap file.</li></ol>

<p>In general, the query is processed as follows.</p>

<ol><li>First, <i>Key Range</i> and <i>Key Filter</i> are applied. These steps produce a list of OIDs (Object Identifiers), which tell CUBRID exactly where on the disk the required rows are located.</li><li>Based on these OIDs, the server will look up the heap file to retrieve the corresponding records. CUBRID will then apply the <i>Data Filter</i> to these records, if the query has one, and read the values of the columns listed in the SELECT clause. The results will be stored in a temporary page.</li><li>If an <i>ORDER BY</i> or <i>GROUP BY</i> clause is present, the records stored in the temporary page are sorted, and the final result is generated.</li></ol>
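
<p>As a rough sketch of how these steps map onto the example query above, the same statement can be annotated condition by condition. The comments only restate the steps described in this section; they are not output produced by CUBRID.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT * FROM tbl
WHERE a &gt; 1 AND a &lt; 5   -- Key Range: bounds the scan on a, the first column of idx (a, b)
AND b &lt; 'K'               -- Key Filter: checked on the index key, before any heap lookup
AND c &gt; 10000             -- Data Filter: c is not in idx, so the row is read from the heap file
ORDER BY b;                -- final sort of the records collected in the temporary page
</div>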

<p>The following figure illustrates this process: first the <i>Key Range</i> is applied, then the <i>Key Filter</i>, then the <i>Data Filter</i>, and finally <i>sorting</i>.</p><p><img src="http://www.cubrid.org/files/attach/images/49/282/220/search-through-index-and-data-table.png" alt="Search through index and data table" width="700" height="509" editor_component="image_link"/><br /></p><h3>Notes to remember when using indexes</h3><ol><li>For the Optimizer to use an index, the WHERE clause should contain a range condition (&lt;, &gt;, &lt;=, &gt;=, =). If no range condition is defined, the Optimizer will attempt a sequential scan of the table.</li><li>Also, for an Index Scan to be triggered, the WHERE clause should contain the first column of the index key.</li><li>Sometimes developers include all columns in the index but use only one or two of them in the WHERE clause. Overloading indexes like this is not a good idea.</li><li>The second column of an index is sorted only within each value of the first column, as shown in the figure above. Therefore, if the first column of the index is not listed in the condition, a Range Scan cannot be applied. It is important to have the first column in the condition, while it matters much less whether the second column is present.</li><li>Since the B+tree structure is built by comparing index values (less or greater), conditions such as &lt;&gt;, != or NULL checks prevent the Optimizer from using the index even when the first column of the index is listed in the condition. The same applies when the indexed column is wrapped in a function call. For example, the Optimizer cannot use an index for the following queries even if columns <b>grade</b>, <b>email_addr</b> and <b>yymm</b> are indexed.<br />
<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT * FROM student WHERE grade &lt;&gt; 'A';
SELECT name, email_addr FROM student WHERE email_addr IS NOT NULL;
SELECT student_id FROM record WHERE substring(yymm, 1, 4) = '1997';</div></li></ol><h2>Index Optimization: Disk I/O minimization is essential for tuning</h2><p>By the nature of a B+tree, the cost of reaching any leaf page is almost the same. The most expensive part of a search is therefore the <b>Key Range</b> scan itself: reading the leaf nodes from the beginning of the range to its end, together with the corresponding table data from the disk.</p><p>I/O in CUBRID is performed in pages. This means that reading even a single column of a single record requires reading the entire page that the record belongs to. The Optimizer determines whether to read the index or the table. Therefore, the most important query performance indicator is the number of pages involved in I/O operations, and it largely determines how the Optimizer works. In other words, what matters most is not the number of records that satisfy the condition but the number of pages that have to be read.</p><p>Below I will explain the techniques you can use to reduce the number of pages read during index scanning.</p><h3>Optimization 1: Utilizing Key Filter</h3><p>As explained earlier, a condition that cannot be part of the <i>Key Range</i> can still be evaluated on the index keys as a <i>Key Filter</i>. When the WHERE clause contains such a condition, fewer data pages are accessed during the index scan. Since the data pages are not sorted on the disk, access to them is random and thus more expensive than index page access. Therefore, including a <i>Key Filter</i> condition in the WHERE clause can boost performance.</p><p>Additionally, a <i>Data Filter</i> condition can be turned into a <i>Key Filter</i> by adding its column to the index. For example, suppose a table has a two-column index defined as <b>idx_1 ON (group_id, name)</b>, and we want to execute the following query, which uses this index.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT * FROM user WHERE group_id = 10 AND age &gt; 40;
</div>

<p>If we assume there are 100 records with <b>group_id = 10</b> and 10 of them also have <b>age &gt; 40</b>, then after the index scan returns the OIDs (Object Identifiers) of the 100 rows where group_id = 10, in the worst case the data pages will be accessed 100 times (<i>since the <b>age</b> column is not part of the index</i>). However, if the <b>age</b> column is added to index <b>idx_1</b>, as in <b>(group_id, name, age)</b>, the <b>age &gt; 40</b> condition will be treated as a <i>Key Filter</i>, so only 10 data page accesses by OID will be needed.</p><h3>Optimization 2: Covering Index</h3><p>If an index contains all the columns a SELECT query needs, the results of the query can be built by reading index keys only, without accessing the data pages on the disk. Such an index, which <i>covers</i> all the requested columns, is called a <b>Covering Index</b>.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT a, b FROM tbl WHERE a &gt; 1 AND a &lt; 5 AND b &lt; 'K' ORDER BY b;
</div>

<p>A <i>Covering Index</i> can be applied to the above query, because both columns <b>a</b> and <b>b</b>, the only columns used anywhere in the query, are included in the same index. The following figure shows that the <i>Covering Index</i> is used in the query and no data page access is performed, i.e. no disk I/O for the data pages. Instead, the index key values obtained from the index scan and stored in the <b>Key Buffer</b> are used to build the final results.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/753/202/cubrid-covering-index.png" alt="CUBRID Covering Index" title="CUBRID Covering Index" width="544" height="500" editor_component="image_link"/></p>
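
<p>For contrast, here is a sketch of a query that the index <b>idx (a, b)</b> would presumably not cover, assuming the table <b>tbl</b> and the index defined earlier. It differs from the covered query only in that it also selects column <b>c</b>, which is not part of the index, so each qualifying row would still have to be read from the heap file.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
-- Not covered: c is not stored in idx (a, b), so the data pages must be accessed.
SELECT a, b, c FROM tbl WHERE a &gt; 1 AND a &lt; 5 AND b &lt; 'K' ORDER BY b;
</div>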

<p>Thus, a <i>Covering Index</i> never entails data page access. In addition, if a covered query is executed very often, the index will be kept in the DB Buffer Cache, which plays a significant role in reducing disk I/O. Thus:</p><ol><li>If the index key size is smaller than the record size;</li><li>And it is confirmed that the query covered by the index will be executed very often;</li></ol><p>... then a covering index can be expected to increase performance dramatically.</p><h3>Optimization 3: Replacing sort operations</h3><p>Since the result set produced by an index scan is already sorted by the index columns, the separate sort operation required by an ORDER BY or GROUP BY clause can be skipped. For this to happen, the columns in the ORDER BY or GROUP BY clause must be specified in the same order as they are defined in the index.</p><p>Previously we noted that an index cannot be used if only the second column of the index appears in the condition. For example, the following query would not use the index. To use the index, the first index column must be present.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
CREATE INDEX idx ON tbl (a, b);
SELECT COUNT(*) FROM tbl ORDER BY b;
</div>

<p>However, if the "<b>=</b>" operator is used to compare the first column of the index, that column can be omitted from the ORDER BY or GROUP BY clause.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT COUNT(*) FROM tbl WHERE a = 100 ORDER BY b;
</div>

<p>In the above case, sorting in ORDER BY will be skipped, since the final result set is already sorted. The following query and figure also show the case when sorting in GROUP BY is skipped.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT COUNT(*) FROM tbl WHERE a &gt; 1 AND a &lt; 5 AND b &lt; 'K' AND c &gt; 10000 GROUP BY a;
</div>

<p><img src="http://www.cubrid.org/files/attach/images/49/753/202/cubrid-group-by.png" alt="Skipping sorting in GROUP BY" title="Skipping sorting in GROUP BY" class="iePngFix" width="668" height="500" editor_component="image_link"/></p><h3>Optimization 4: LIMIT Optimization</h3><p>As you already know, the LIMIT clause limits the number of final results returned to the client. If a query with a LIMIT clause has no <i>Data Filter</i>, there is no need to scan all the key values in the <i>Key Range</i>. Instead, the scan returns only as many records as the LIMIT clause indicates and then stops. This keeps the system from accessing data pages it does not need, thus reducing unnecessary I/O operations.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT * FROM tbl WHERE a = 2 AND b &lt; 'K' ORDER BY b LIMIT 3;
</div>

<p>When the above query is executed, the following figure shows that the system does not scan all the index key values but stops once the LIMIT count is reached. In other words, even if the records with <b>a = 2</b> are stored in 10 data pages on the disk, the query with this LIMIT clause will scan only 3 records, reading only one data page.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/753/202/cubrid-key-limit.png" alt="CUBRID Key Limit" title="CUBRID Key Limit" class="iePngFix" width="552" height="500" style="" editor_component="image_link"/><br /></p>
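
<p>By the same reasoning, a query that does have a <i>Data Filter</i> cannot simply stop after the first three keys. The following sketch is a hypothetical variant of the query above, assuming the same <b>tbl</b> and <b>idx</b>; it adds a condition on the non-indexed column <b>c</b>. Because some of the rows found through the index may be rejected after the heap lookup, the scan would have to continue at least until three rows pass the filter or the <i>Key Range</i> is exhausted. This is an inference from the description above, not documented CUBRID behavior.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
-- Hypothetical variant: c &gt; 10000 is a Data Filter, evaluated only after the heap lookup.
SELECT * FROM tbl WHERE a = 2 AND b &lt; 'K' AND c &gt; 10000 ORDER BY b LIMIT 3;
</div>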

<p>Meanwhile, LIMIT optimization can also be applied to a query that contains an IN clause. If an indexed column is used in the IN clause, CUBRID will use the <i>Key Range</i> technique to scan only those records whose keys are among the values listed in the IN clause. If a LIMIT count is also defined, as in the following query, CUBRID will scan at most 3 keys in each of the 3 key ranges and then stop scanning that range. In other words, LIMIT optimization can be applied to each index scan. For a visual illustration, see the figure below.</p>

<div editor_component="code_highlighter" code_type="Sql" file_path="" description="" first_line="1" collapse="false" nogutter="false" nocontrols="false" style="border: #666666 1px dotted; border-left: #22aaee 5px solid; padding: 5px; background: #FAFAFA url(modules/editor/components/code_highlighter/code.png) no-repeat top right;">
SELECT * FROM tbl WHERE a IN (2, 4, 5) AND b &lt; 'K' ORDER BY b LIMIT 3;
</div>

<p>Since the ORDER BY clause sorts the entire result set, when a query has more than one <i>Key Range</i>, as the one above does, the results of the individual index scans would normally have to be sorted once they are all collected. However, when ORDER BY is used together with LIMIT, the intermediate results can be replaced on the fly while the <i>Key Ranges</i> are being scanned. In CUBRID this process is called <b>In-place Sorting</b>.</p><p>The figure below provides a detailed explanation of the search process for the above query.</p><ol><li>First, CUBRID will scan the index keys where <b>a = 2 and b &lt; 'K'</b>. Since there is a LIMIT of 3 records, the index scan for the <b>a = 2</b> <i>range</i> will stop once 3 OIDs whose <b>a</b> and <b>b</b> satisfy the criteria have been scanned.</li><li>Second, the <b>a = 4</b> <i>range</i> will be scanned similarly. Since 3 OIDs have already been obtained from the previous <b>a = 2</b> <i>range</i>, CUBRID will read the first index key in the <b>a = 4</b> <i>range</i> and check whether its <b>b</b> value is less than the ones in the temp result set. Since <b>b</b> in (4, 'DAA') is not less than <b>b</b> in (2, 'CCC'), CUBRID will stop scanning the <b>a = 4</b> <i>range</i>: the <b>b</b> key values within a range are already sorted, so if the first value is larger than the values in the temp result set, every later value in that range is larger too.</li><li>Third, as in the previous step, it will scan the <b>a = 5</b> <i>range</i> and compare <b>b</b> values. Since <b>b</b> in (5, 'AAA') is smaller than <b>b</b> in (2, 'CCC') and <b>b</b> in (2, 'ABC'), OID10 will be inserted into the second position of the temp result set. Thus the temp result set will have OID5, OID10, and OID9. Looking at the next value within this same range, i.e. (5, 'BBB'), CUBRID will stop further index scanning and return the results.</li></ol><p>Thus, the <b>In-place Sorting</b> technique narrows the index scan range even further. And since no separate sorting of the final results takes place, In-place Sorting can significantly improve query execution performance.</p>

<p><img src="http://www.cubrid.org/files/attach/images/49/753/202/cubrid-multi-range-limit.png" alt="cubrid-multi-range-limit.png" title="cubrid-multi-range-limit.png" class="iePngFix" width="610" height="500" style="" editor_component="image_link"/><br /></p>

<h2>Conclusion</h2><p>Even though indexes are good to have, more indexes are not always better. When many indexes are declared for a table, SELECT queries get faster; however, the cost of maintaining these indexes increases, so the performance of INSERT/UPDATE/DELETE operations decreases. Thus, the core of DB tuning lies in creating an adequate number of indexes and optimizing queries to take advantage of them. For this, it is necessary to consider the following factors as a whole:</p><ul><li>Index structure implemented in the DBMS.</li><li>Variety of indexing techniques like those explained above.</li><li>Query patterns and the frequency of their usage.</li><li>I/O cost</li><li>and the cost of the storage space</li></ul><p>Considering these, it is possible to create very efficiently indexed queries.</p><p>For more related tutorials, refer to the following articles.</p>]]></description>
                        <pubDate>Sun, 04 Sep 2011 19:19:35 -0800</pubDate>
                        <category>index</category>
                        <category>tuning</category>
                        <category>sql</category>
                        <category>performance</category>
                        <category>covering index</category>
                        <category>query</category>
                                </item>
            </channel>
</rss>
