What is Covering Index in CUBRID 8.4.0?
Starting from version 8.4.0 CUBRID supports Covering Index. Wikipedia explains it as:
A covering index is a special case where the index itself contains the required data field(s) and can return the data.
In other words, covering index is applied for a query which uses only indexed columns. For instance, see the following example.
CREATE TABLE tbl (i INT, j INT); CREATE INDEX idx ON tbl(i, j); INSERT INTO tbl VALUES(1,11), (2,12); SELECT j FROM tbl WHERE i > 0;
As you can notice both column i, used in the WHERE clause, and column j, used in the SELECT list, have been indexed. When CUBRID executes this type of SELECT statements, the covering index can be scanned, as it contains both the condition value and the result value requested by the SELECT statement.
Why do we need covering index?
At this moment you might be curious why do we need this concept of covering index? To explain this, first we need to quickly look at CUBRID’s Index Structure.
All indexing in CUBRID, including primary and secondary as distinguished in other databases, is implemented using B+ tree.
As shown in the figure above, B+ tree consists of root, leaf, and non-leaf nodes, where:
- Each of these nodes has an index key value used to search for the record.
- The leaf nodes also contain a pointer to a data row in the form of an OID (Object Identifier: volume id, a page id and a slot id like @650|2|0).
- Thus, the data itself is stored outside of the index in a heap file located on the disk in the order which has no relationship with the indexing order.
This type of indexing is referred as non-clustered indexing as opposed to clustered index [link to the section below].
In general case, when the data is requested, CUBRID scans the index tree (if the used column is indexed), finds the location in the heap file where the column’s data row is stored, then fetches that data. Fetching the data from the heap file which is located on a physical disk, may results in heavy disk I/O operations depending on the traffic of the service.
Here comes the question: if the column we need (for instance j in the first example) is covered by an index, why do we need to fetch it from the heap file to get its value?
This results in unnecessary and expensive disk I/O operations which increases the time required to execute the query.
For this reason in CUBRID 8.4.0 we have implemented the concept of covering index, which allows CUBRID to skip the disk I/O and find the value of the column j (from the first example) immediately from the index tree. Thus, when both the index key value (column i) and the requested data value (column j) are covered by the index, it is called covering index.
Using Covering Index
The idea behind the covering index is to reduce the amount of disk I/O operations, which allows to increase the overall look-up performance in two ways:
- By skipping heap file lookup, it decreases the overall time spent to execute the query.
- Generally, the size of the index key is much smaller than the size of the entire row of the data, so reading the index key is much faster.
Moreover, if a certain index is frequently used in query executions, it can be cached further in the CUBRID memory buffer which will eliminate the disk I/O completely. Memory related parameters, though, should be tuned based on some observations and metrics. You should monitor your system resource usage and decide what numbers are best for you.
However, there are some tradeoffs as well you need to make in order to use covering index. For instance, in order to use it, either you need to extend the existing index column or add a new index. But this also increases the index volume size as well as the administrative costs such that when you INSERT or UPDATE the record, the index tree has to reflect the changes accordingly, thus it will have an impact on overall performance for write operations.
Moreover, as indexes are sorted by their value, when you try to retrieve records with unindexed columns, CUBRID will quickly find the records using the index, but will eventually retrieve the data from the heap file randomly for each record. Sometimes, sequentially accessing the heap file and reading the data may be cheaper in terms of the amount of I/O operations than the random heap access.
Therefore, before deciding whether to use covering index or not, you need to consider about query performance, the storage space cost, and I/O cost.
Typically, if you are sure that the index key size is smaller than the data of the entire row, and the query which will be using this index is executed frequently, then creating covering index can be reasonable.
To see if a query uses covering index, you can analyze it with CUBRID’s built-in Query Planner which allows you to compare and find the cheapest queries. For example, if you execute the SELECT query from the first example, you will get the following query plan:
Query plan: iscan class: tbl node index: idx term (covers) cost: fixed 0(0.0/0.0) var 1(0.0/1.0) card 0
If the query plan is labeled with the term “covers”, then the query will use covering index. Also you can notice the CPU and IO costs of executing this query. For other variations of the same query, you may get different values for cost. The lower the cost, the better. But again, lowest cost does not always mean the fastest query. You should bargain on this.
Looking at the information about the server execution statistics, it is possible to learn how many times the covering index has been used. This number will represent the number of tuples created during the covering index usage. To learn more about server execution statistics information, refer to Outputting Statistics Information of Server.
Clustered index is another type of index structure where physical data (rows) on the disk is sorted in the same order as the index column values are. Simply put, you can think of clustered index as an index structure which contains all the rows of the table as opposed to non-clustered index which contains only pointers to the data located on the disk.
This means that one table can contain only one clustered index. In this case, every time a new record is inserted, the physical sorting will be triggered. But at the same time clustered indexing allows to retrieve the data much faster as the data is physically sorted.
In contrast, non-clustered index store only pointers to the data on the disk, so physical sorting is not required. But look-up will take more time as it accesses the heap file randomly.
Both MySQL and MSSQL use clustered structure for primary indexes, while non-clustered structure is used for secondary indexes. In other words, columns which are PRIMARY KEY or UNIQUE use primary indexes, and the clustered index is used which stores the entire row data in its leaf nodes. The secondary index uses non-clustered structure which stores the clustering key, i.e. the primary key columns for the row, as well as the columns specified for the secondary index.
Thus, if a large clustered index key is defined, any non-clustered index defined on the same table will be significantly larger because the non-clustered index entries contain the clustering key.
Moreover, if results are obtained by scanning the secondary index, it is required to perform the primary index scan for the second time to find the row data, making in total two index scans.
In CUBRID there is an index_scan_key_buffer_size parameter that can be customized for covering index. When this parameter is used by covering index, it determines the memory buffer size allocated for index column value.
During index scanning, if this memory buffer is completely used up, the scanning will be paused until all the data in the buffer is processed. After that index scanning will be resumed. The less scanning is stopped, the better for performance. However, setting too big number for this parameter is not necessarily good.
If you plan to use covering index, you have to consider both index column size and the number of simultaneous database users when determining the value of this parameter. The default value of this parameter is 320K.
This article has been written by one of the CUBRID core developers to help users improve their application performance by understanding h...
This guide explains how to retrieve database information from CUBRID. This includes: table, index, and column names, if certain columns are indexes ...
Sometimes we need to import large data files to CUBRID tables which can go over several GB. In this article we will cover different ways to perfo...