CUBRID Query Tuning Techniques
This article has been written by one of the CUBRID core developers to help users improve their application performance by understanding how CUBRID Indexing behaves and how to tune it.
Database tuning, necessary to build a high performance Web service, depends largely on how well index tuning is utilized. However, over the past three years while CUBRID has been applied to NHN's internal and external services surprisingly many developers either do not know much about Database Indexing or how to use it.
In this article I will explain about index structure and various indexing techniques introduced in CUBRID 2008 R4.0 on July 1st, 2011 to significantly improve CUBRID Database performance. Some of these techniques are present in other database management systems.
Understanding Index Structure and Scanning
In CUBRID Indexing is implemented in B+tree structure where index values are stored in leaf nodes. Unlike B-tree, which stores the actual data in leaf nodes, B+tree stores only pointers to data in leaf nodes along with keys.
On the other hand, non-leaf nodes, which are above the leaf nodes, are typical B-tree nodes which act as indexes to quickly find the leaf nodes. Additionally, leaf nodes are connected in a linked list which allows faster sequential access, for instance, when performing range search - one of the key performance advantages of CUBRID over MySQL. Below you can see a typical structure of B+tree.
Here is how Index Range Scan works. There is a database table, and in order to read the complete result set, it has to read all records from the very first till the last to check if records are in that particular range. However, since the index keys are sorted, the search starts at a specific location and stops right at the moment when the key values do not match the search criteria any more. Thus, for range scan two sub-keys are required. First represents the min value, the second - max value.
Index Range Scan is a two-step process. The first step is to traverse the tree from the root node to the leaf nodes and find these sub-keys. The second step is to read all the records starting from min key until max key. If max key is not found in the current record, it will ready the next record until the max key is found. When such sequential search of max key is completed, the full range search is done.
See also Multi Range Scanning.
Understanding Query Process Utilizing Index Scan
For a practical example, let's consider the following table structure.
CREATE TABLE tbl (a INT NOT NULL, b STRING, c BIGINT);
And we create a multi-column index on columns a and b.
CREATE INDEX idx ON tbl (a, b);
Let's insert some records.
INSERT INTO tbl VALUES (1, ‘AAA, 123), (2, ‘AAA’, 12), …;
The following picture shows index nodes pointing to data stored in the heap file on the disk.
- The index values (both a and b) are sorted in ascending order (by default).
- Each index node has a pointer to a corresponding data (table row) in the heap file illustrated by an arrow.
- The data in heap file is stored in random order.
Let's see how the index scanning is usually performed. On the table defined above with the data we have inserted, we will execute the following SELECT query.
SELECT * FROM tbl WHERE a > 1 AND a < 5 AND b < ‘K’ AND c > 10000 ORDER BY b;
When executed, the search is completed in three steps.
- Key Range search: CUBRID will first find all index nodes where a >1 and a < 5.
- Key Filter search: Here Key Range cannot be used, but we can still filter the records using index keys. Thus, among these nodes, it will find all nodes where b < 'K'.
- Data Filter search: Since column c is not indexed, we cannot use Key Filter any more. To obtain its value it is necessary to look up the heap file.
So, in general the following process is accomplished.
- First, Key Range and Key Filter are applied. These steps form a list of OIDs (Object Identifier) which tells CUBRID where exactly on the disk the required rows are located.
- Based on these OID, the server will look up the heap file to retrieve the corresponding records. Then CUBRID will apply either Data Filter to these records or will read the values of columns listed in the SELECT clause. The results will be stored in a temporary page.
- If ORDER BY or GROUP BY clauses are present, the records stored in the temporary page get sorted, and the final result is generated.
The following figure illustrates these processes. You can notice, first, Key Range has been applied, then Key Filter, then Data Filter, finally sorting.
Notes to remember when using indexes
- In order for Optimizer to use index, WHERE clause should contain a Range condition (<, >, <=, >=, =). If Range condition is not defined, the Optimizer will attempt to perform table sequential scan.
- Also, in order for Index Scan to be triggered, WHERE clause should contain the first column of the index key.
- Sometimes developers include all columns in the index and use only one or two of them in the WHERE clause. It's not a good idea to overload the indexes.
- Since it is not possible to sort the second column if the first column of the index is not listed in the condition, the Range Scan cannot be applied. The second columns is sorted only in accordance with the first column, as shown in the figure above. Thus, it is important to have the first column in the condition, while it does not matter much if the second column is present or not.
- Since B+tree structure is created through index values comparison (less or greater), the usage of irregular conditions such as <>, != or NULL will limit the Optimizer from using the index even when the first column of index is listed in the condition. For example, in the following queries index cannot be used by the Optimizer even if columns grade and email_addr are indexed.
SELECT * FROM student WHERE grade <> 'A'; SELECT name, email_addr FROM student WHERE email_addr IS NOT NULL; SELECT student_id FROM record WHERE substring(yymm, 1, 4) = ‘1997’;
Index Optimization: Disk I/O minimization is essential for tuning
It is the nature of B+tree that the cost of accessing any leaf page is almost same. In B+tree the most expensive part of searching process lies within the Key Range from the beginning until the end of scanning the leaf nodes and corresponding table data from the disk.
I/O in CUBRID is performed in pages. This means that in order to read only one column for a single record, it requires to read the entire page from the disk which this records belongs to. The Optimizer determines whether there is a need to read index or table. Therefore, the most important query performance indicator is the number of pages involved in I/O operations, which determines largely the way the Optimizer works. That is the most important is not reading the record which complies the condition but reading the number of pages.
Below I will explain about the techniques that you can use to reduce the number of pages involved in the process of index scanning.
Optimization 1: Utilizing Key Filter
As explained earlier, even if Key Filter is not included in Key Range it can process the condition using the index key. In this case, if Key Filter is included in the WHERE condition, it is possible to reduce the number of data pages accessed during the index scan. Since the data pages are not sorted on the disk, access to them are random. Thus, it is more expensive than index page access. Therefore, by indicating the Key Filter in WHERE clause, it is possible to boost the performance.
Additionally, Data Filter can be applied by adding a column to the index used in Key Filter. For example, there is a table with two column index defined idx_1 ON (group_id, name). And we want to execute the following query which utilizes this index.
SELECT * FROM user WHERE group_id = 10 AND age > 40;
If we assume there are 100 records with group_id = 10 and 10 records with age > 40, after index scan returns OIDs (Object Identifiers) of 100 rows where group_id = 100, in the worst case the data pages will be accessed 100 times (since age column is not a part of index). However, if age column is added to index idx_1 like (group_id, name, age) then the age > 40 condition will be treated as Key Filter, thus the number of data page access using OID will be only 10.
Optimization 2: Covering Index
If index is used to obtain all the results for a SELECT query, it is possible to build the results of this query by reading index keys only without accessing the data pages on the disk. Such index which covers all the requested columns is called Covering Index.
SELECT a, b FROM tbl WHERE a > 1 AND a < 5 AND b < ‘K’ ORDER BY b;
For the above query, Covering Index can be applied. This is because both columns a and b used in the entire query, are included in the same index. The following figure confirms that Covering Index is used in the query and no data page access is performed, i.e. no disk I/O. Instead, the index key values obtained from index scan results, stored in the Key Buffer, are used to build the final results.
Thus, Covering Index does not ever entail data page access, which means that if a covered query is used very often, the index will be saved in the DB Buffer Cache which plays significant role in disk I/O reduction. Thus:
- If the index key size is smaller than the record size;
- And it is confirmed that the query covered by index will be executed very often;
... then this assured to dramatically increase the performance.
Optimization 3: Replacing sort operations
Since the result set produced by index scanning is automatically sorted by index column, explicit sorting indicated in ORDER BY or GROUP BY clauses can be omitted when writing a query. In order to accomplish this, it is necessary to specify columns in ORDER BY or GROUP BY clauses in the same order as columns are defined in the index.
Previously we have noted that it is not possible to use index if only the second column of the index is provided in the condition. For example, the following query would not use index. In order to use index, the first index column should be present.
CREATE INDEX idx ON tbl (a, b); SELECT COUNT(*) FROM tbl ORDER BY b;
However, if "=" operator is used to perform comparison of the first column of the index, that column can be omitted from the ORDER BY or GROUP BY clause.
SELECT COUNT(*) FROM tbl WHERE a = 100 ORDER BY b;
In the above case, sorting in ORDER BY will be skipped, since the final result set is already sorted. The following query and figure also show the case when sorting in GROUP BY is skipped.
SELECT COUNT(*) FROM tbl WHERE a > 1 AND a < 5 AND b < ‘K’ AND c > 10000 GROUP BY a;
Optimization 4: LIMIT Optimization
As you already know LIMIT clause allows to limit the number of final results to be return to the client. If there is a LIMIT clause in the query which has no Data Filter, there is no need to scan all the key values corresponding to Key Range. Instead it is necessary to return only as many records as indicated in the LIMIT clause and stop further scanning. This eventually will prevent the system from accessing the data pages, thus reducing the number of unnecessary I/O ops.
SELECT * FROM tbl WHERE a = 2 AND b < ‘K’ ORDER BY b LIMIT 3;
When the above query is executed, looking at the following figure we can notice that the system does not scan all the index key values, but interrupts once the LIMIT count is reached. In other words, if there records with a = 2 are stored in 10 data pages on the disk, LIMIT clause will scan only 3 records by reading only one data page.
Meanwhile, it is also possible to apply LIMIT optimization to a query which contains IN clause. If index column is used in IN clause, CUBRID will use Key Range technique to scan only those records where keys are among those defined in IN clause. However, if LIMIT count is defined like in the following query, CUBRID will perform index scan only 3 times for each of those 3 sets of records, then interrupt index scanning. In other words, LIMIT optimization can be applied for each index scan. For visual illustration see the figure below.
SELECT * FROM tbl WHERE a IN (2, 4, 5) AND b < ‘K’ ORDER BY b LIMIT 3;
Since ORDER BY clause sorts the entire result set, if there are more than one Key Range as in the above query it is necessary to sort each index scan results once they are collected. However, when ORDER BY is used together with LIMIT, we can replace intermediary results on the fly while scanning Key Range. In CUBRID this process is called In-place Sorting.
The Figure below provides a detailed explanation of the search process for the above query.
- First, CUBRID will scan indexes where a = 2 and b < 'k'. Since there is a LIMIT of 3 records, the index scan for a = 2 range will stop when 3 OIDs have been scanned where a and b satisfy the criteria.
- Second, a = 4 range will be scanned similarly. Since it has already obtained 3 OIDs from the previuos a = 2 range, it will scan the first index key in a = 4 range and check if its b value is less than the ones in temp result set. Since b in (4, 'DAA') is not less than b in (2, 'CCC'), CUBRID will stop further scan for this a = 4 range, because b key values in this index are already sorted, so if the first value is larger than the value in our temp result set.
- Third, as in the previous step, it will scan a = 5 range and compare b values. Since b in (5, 'AAA') is smaller than b in (2, 'CCC') and b in (2, 'ABC'), OID10 will be inserted into the second position of the temp result set. Thus the temp result set will have OID5, OID10, and OID9. Looking at the next value within this same range, i.e. (5, 'BBB'), CUBRID will stop further index scan and return the results.
Thus, In-place Sorting technique allows to further narrow the index scan range. Thus, since separate sorting of the final results do not take place, In-place Sorting allows to significantly increase query execution performance.
Even though indexes are good to have, it is not necessary to create many indexes. When many indexes are declared for a table, SELECT queries get faster, however, the administrative cost of these indexes increases, thus the performance of INSERT/UPDATE/DELETE operations will decrease. Thus, the core DB tunning technique lies in creating the adequate number of indexes, and optimize queries in order to take advantage of these indexes. For this, it is necessary to consider the following indicators as a whole:
- Index structure implemented in the DBMS.
- Variety of indexing techniques like those explained above.
- Query patters and the frequency of their usage.
- I/O cost
- and the cost of the storage space
Considering these, it is possible to create very efficiently indexed queries.
For more related tutorials, refer to the following articles.
Starting from version 8.4.0 CUBRID supports Covering Index. Wikipedia explains it as:
This guide explains how to retrieve database information from CUBRID. This includes: table, index, and column names, if certain columns are indexes ...
The scope of this tutorial is to show you how to use one of the most special features implemented in CUBRID – Click Counter.
Sometimes you want to concatenate all column values from differenct rows returned by your SELECT statement in one value. For example, consider the c...
Sometimes we need to import large data files to CUBRID tables which can go over several GB. In this article we will cover different ways to perfo...