Author: Henry Kwon
CUBRID 10 adds a variety of new features. One of the most significant is string compression. Let's take a look at it.
CUBRID compresses strings only when they are 255 bytes or longer, and skips compression when it would not be efficient. To see how good the compression ratio is, we inserted 100,000 rows of test data into CUBRID 9.3 (no string compression) and CUBRID 10.1 (with string compression) and compared the table sizes. The test scenarios are shown in the table below.

| Classification | Scenario |
|----------------|----------|
| Case #1 | Insert non-duplicated string data and check the compression rate |
| Case #2 | Insert duplicated string data and check the compression rate |
The table below shows how the 100,000 rows were generated for each case. First, one seed row of test data was inserted; the rest of the rows were then generated from it using the catalog table and a Cartesian product in an "insert ... select ..." statement.
Case #1:

```sql
create table str_comp ( col1 string );
insert into str_comp values ('aoisioeopsvnasdfpoasdfoasjd');
insert into str_comp select a.col1 from str_comp a, db_attribute b, db_attribute c, db_attribute d limit 99999;
```

Case #2:

```sql
create table str_comp2 ( col1 string );
insert into str_comp2 values ('aaaaaaaaaaaaaaaaaaaaaaaaaaa');
insert into str_comp2 select a.col1 from str_comp2 a, db_attribute b, db_attribute c, db_attribute d limit 99999;
```
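The "insert ... select" trick works because the Cartesian product of the seed table with three catalog tables multiplies the row count, and LIMIT caps the result. A rough sketch of the arithmetic (the factor 50 below is a hypothetical row count for db_attribute, not a value from this test):

```shell
# Hypothetical illustration of the Cartesian-product row explosion:
# 1 seed row joined with db_attribute three times, assuming it holds 50 rows.
seed=1
n=50                               # assumed row count of db_attribute
candidates=$(( seed * n * n * n ))
echo "$candidates candidate rows"  # LIMIT 99999 then caps the insert
```

With even a modest catalog table, a single seed row yields far more than the 99,999 rows needed.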
We ran the SQL statements above to insert 100,000 rows and then checked each table's size by running the "show heap capacity of <table name>;" command and reading the Num_pages value. The results are shown in the table below.
| Classification | No. of rows | 9.3 (Num_pages) | 10.1 (Num_pages) | Compression rate |
|----------------|-------------|-----------------|------------------|------------------|
| Case #1 | 100,000 | 3,704 | 3,704 | 0 % |
| Case #2 | 100,000 | 3,704 | 357 | 938 % |
For Case #1, the Num_pages value was the same in both versions: the non-duplicated string data could not be compressed efficiently, so CUBRID 10.1 stored the original values uncompressed, taking the same space as 9.3. For Case #2, the duplicated data compressed well, and the compression rate came out to 938 %.
According to the description of the CUBRID 10 string compression feature, a compressed string is decompressed when it is read from the database. This raises a question: how much time does decompression add when extracting data?
To answer that, we wrote and ran the following script. It takes the SQL statement to execute as a parameter, runs it with csql, writes the result to result_sql.txt, and prints the elapsed time (end time minus start time).
run_time.sh:

```bash
#!/bin/bash
StartTime=$(date +%s.%N)

RUN_SQL="$1"
#echo $RUN_SQL
csql -u dba -c "$RUN_SQL" -o result_sql.txt test_db

EndTime=$(date +%s.%N)
run_time=$(echo "$EndTime - $StartTime" | bc)
echo $run_time
```
The script was executed as follows:

```bash
sh run_time.sh "select * from str_comp limit 100000;"
```
The time required to extract the 100,000 rows on our in-house test server was measured as follows. These timings are not absolute values and may vary with hardware performance, OS settings, storage I/O performance, and so on.
| Classification | 9.3 Page Count | 9.3 Time (sec) | 10.1 Page Count | 10.1 Time (sec) | Compression Rate | Time Difference (sec) |
|----------------|----------------|----------------|------------------|-------------------|------------------|------------------------|
| Case #2 | 3,704 | 1.04317 | 357 | 0.848578 | 938 % | -0.194592 |
Extracting 100,000 rows took 1.04317 seconds in CUBRID 9.3 and 0.848578 seconds in 10.1, a difference of -0.194592 seconds. In other words, reading 100,000 rows of compressed 501-byte data, decompression included, was about 0.19 seconds faster.
In summary, the disk storage used was about 57 MB (3,704 pages × 16 KB page size) in 9.3 versus about 5 MB (357 pages × 16 KB) in 10.1, and processing was about 0.19 seconds faster. The string compression feature added in CUBRID 10 both reduces disk usage and improves processing performance.
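The disk figures in the summary follow directly from the Num_pages values, assuming the default 16 KB page size:

```shell
# Num_pages * page size, converted to MB (integer arithmetic, rounded down)
page_kb=16
echo "9.3:  $(( 3704 * page_kb / 1024 )) MB"   # prints "9.3:  57 MB"
echo "10.1: $(( 357 * page_kb / 1024 )) MB"    # prints "10.1: 5 MB"
```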