It started 4 years ago...

with a plan to make the CUBRID engine better, faster, stronger. We have since replaced Two-Phase Locking with Multiversion Concurrency Control; we have optimized and improved performance; we have added more features and fixed many issues. CUBRID 10.1 is finally here.

CUBRID 10.1 is better, bringing new features. The globalization suite is enhanced with time zone data types based on IANA definitions and a BINARY charset/collation for storing raw data in string data types. Common Table Expressions are another old ambition fulfilled by CUBRID 10.1; developers can write complex queries more easily and clearly, recursive queries included.

To achieve our goal of making the engine faster, we decided to push the workload limits on the TPC‑C benchmark (the standard for OLTP system configurations) further and further. We have almost doubled the optimum workload since 9.3. In parallel, we also tracked and improved results for other benchmarks: Sysbench by 30% and YCSB by 50% to 65%, while for the TPC‑W benchmark the Shopping mix is 6 times faster and the Ordering mix almost 4 times faster. CUBRID 10.1 outperforms MySQL on write-heavy workloads such as TPC‑C, Sysbench and YCSB A.

We made the engine stronger by rewriting the entire storage module. As a result, loading large databases is three times faster. How big can our databases be? Well, it is hard to tell, and we are tempted to say there is no limit. The CUBRID engine's maximum workload is usually capped by machine limits and never by database size.

CUBRID 10.1 is more stable too. The QA team extended the testing suite with new automated tests, covering many corner cases that may or may not occur in real-world scenarios. We fixed roughly 5000 such issues in CUBRID 10.0 and CUBRID 10.1 alone, more than the total number of issues fixed in all 9.x versions. The High Availability system is also more stable under heavier workloads, with replication delay reduced significantly.

We consider CUBRID 10.1 a success story. It is fast, strong and reliable, it is an open source solution, and it offers a comprehensive set of features usually found only in enterprise products.

Try out CUBRID 10.1 Engine right now!

version 10.1 (latest) for Windows

Need CUBRID for Linux? Try it out now

CUBRID 10.0 vs CUBRID 10.1

One year ago we released CUBRID 10.0 beta. It included a standard insert-only MVCC implementation and many other performance improvements, yet it was far from getting the results we wished to obtain. We changed the MVCC implementation and pushed more performance fixes, which finally led us to much better results.

The CUBRID 10.0 beta used the intuitive insert-only policy, adopted by most MVCC systems, as its MVCC versioning scheme. Long story short, this creates multiple row versions stored in different locations (different Object IDs, or OIDs); one bad consequence of this policy is index management – our initial implementation had to update indexes even for non-key updates (due to the change of object location).

Non-key updates are a very common operation in real-world and benchmark scenarios; thus, this weakness alone had dire consequences on overall performance. Although we made many other optimizations besides MVCC, some benchmarks barely showed any performance improvement compared to 9.3 READ UNCOMMITTED. We could boast an isolation level upgrade to READ COMMITTED, but that was not enough for our ambitions.

So we replaced the MVCC versioning scheme with an update in-place policy – new versions are substituted at the row's original location. A link to the undo log record address is saved in the row header, so the previous version can be fetched from the log. As a result, indexes remain unmodified for non-key updates. This change alone boosted our maximum TPC‑C workload by 40%.
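
To make the policy concrete, here is a minimal C++ sketch of the idea, with purely illustrative names and layout (it is not CUBRID's actual record format or visibility code): the row slot keeps only the newest version plus a link into the undo log, and a reader walks that link until it finds a version visible to its snapshot.

#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical names and layout; the real CUBRID record header differs.
using TxnId   = std::uint64_t;
using LogAddr = std::uint64_t;   // address of an undo log record, 0 = none

struct RowVersion {
    TxnId   modifier;            // transaction that produced this version
    LogAddr prev_version;        // link into the undo log (0 = no older version)
    std::vector<char> payload;   // the row data itself
};

struct Snapshot {
    TxnId upper_bound;           // transactions with id >= upper_bound started later
    std::vector<TxnId> active;   // transactions in progress at snapshot time
    bool is_visible(TxnId writer) const {      // aborted writers ignored for brevity
        if (writer >= upper_bound) return false;
        for (TxnId a : active)
            if (a == writer) return false;
        return true;
    }
};

// The row slot always holds the newest version; older versions are reached by
// following prev_version links, so indexes keep pointing at the same location
// even after non-key updates.
std::optional<RowVersion> read_visible(const RowVersion& newest, const Snapshot& snap,
                                       const std::vector<RowVersion>& undo_log) {
    const RowVersion* v = &newest;
    while (!snap.is_visible(v->modifier)) {
        if (v->prev_version == 0) return std::nullopt;   // nothing visible yet
        v = &undo_log[v->prev_version - 1];              // toy undo-log lookup
    }
    return *v;
}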

We pushed the TPC‑C limits further by fixing two other very-large-database bottlenecks. First we redesigned the disk and file managers, which fixed page allocation performance; then we redesigned the page buffering system, which solved the thrashing problem.

Performance

CUBRID 10.1's most important goal has always been performance, with a focus on write-heavy workloads. To follow our progress and to prove the effectiveness of our performance optimizations, we have used several standard benchmarks: TPC‑C, Sysbench, YCSB and TPC‑W. We increased our performance by 1.3 to 6 times, and we outperform MySQL on write-heavy workloads.

TPC‑C is considered the de facto standard for online transaction processing (OLTP) and was defined by the Transaction Processing Performance Council (TPC). It is a mixture of read-only and update-intensive transactions typical of a wholesale supplier managing orders (order acceptance, delivery, recording payments, checking the status of orders and monitoring stock levels).

The TPC‑C workload and database size depend on the configured number of warehouses (clients). Since it provides the heaviest workload, we also used TPC‑C as the principal guideline for tracking our performance improvements.

The benchmark's measured metrics are the new-order transaction rate (tpmC), the price per tpmC, and response times (average and 90th percentile).

Compared to 9.3, the maximum number of warehouses has increased from 1000W to 1800W and the maximum tpmC by almost 80%. CUBRID 10.1 also performs better than MySQL by over 30%.

The blue and green lines show how the tpmC scales with the number of warehouses, while the table shows 90th percentile new order latencies. CUBRID 10.1 peaks at 1800W, with 26.4k tpmC and only 0.60 seconds latency. MySQL peaks at 1400W, with 19.5k tpmC and over 4 seconds latency.

Sysbench OLTP is another heavy-workload benchmark used to test MySQL performance, which we adapted to test the CUBRID engine. Over the years we used Sysbench 0.5, but we have recently updated to Sysbench 1.0. There are two Sysbench variants, one for read/write workloads and one read-only; we use the read/write variant for our tests.

This benchmark measures the transactions per second and the average transaction execution time.

CUBRID 10.1's TPS outperforms MySQL's by 10%.

The Sysbench results show the difficulties we faced in CUBRID 10.0 with our first MVCC implementation, and how CUBRID 10.1 has overcome those issues.

Yahoo! Cloud Serving Benchmark (YCSB) is a framework that provides a common set of workloads for evaluating the performance of different key-value and cloud serving stores. We’ve been tracking the results for workloads A and B.

Workload A is the update-heavy workload, with a 50/50 mix of reads and writes; an example application is a session store recording recent actions.

Workload B has a 95/5 read/write mix; an example application is photo tagging: adding a tag is an update, but most operations read tags.

The benchmark measures throughput in operations per second (ops) and average/max latencies.

YCSB shows the gradual progress of our performance results since the release of 9.3. It also mirrors perfectly the difference between the write-heavy oriented CUBRID and the read-heavy oriented MySQL.

TPC‑W is a transactional web e-Commerce benchmark, simulating the activities of a business oriented transactional web server.

The benchmark defines three distinct Measurement Intervals, each for a distinct mix of web interactions, resulting in three distinct performance metrics. WIPS is the primary performance metric as it is based on an average shopping scenario. WIPSb and WIPSo are secondary performance metrics. While similar to the WIPS metric, the web interaction mix for WIPSb is primarily browsing, and the web interaction mix for WIPSo involves a significant percentage of ordering resulting in transaction processing.

The TPC‑W results show a big leap in performance for this type of workload. This is another example showing that the CUBRID engine fares better with update-intensive transactions.

The road has been long and at times winding, but we persevered. We have spent many hours, days and months looking at numbers and graphs, trying to find the most significant bottlenecks. Then we had to find ingenious ways to overcome them, sometimes simplifying the logic, other times trying a more complex but better tuned one; we ended up replacing many parts of our engine. The story that got us here, to these performance results, has many storylines.

In multi-threaded systems, performance bottlenecks are commonly related to concurrency on shared resources. Tables, rows, index keys, data pages and data files are only a few examples of shared resources typical of transactional systems. Moreover, the list becomes a lot bigger considering database internal processes which may also need concurrent access to the same resources – caches, logging and communication, to name just a few.

MVCC was the single major change that affected almost all engine modules. But the effort resulted in an optimized locking module: no object locks are required to read a row, and index key locks have been removed completely. That gave us the opportunity to further reduce the complexity of the locking system and eliminated deadlocks almost completely.

However, MVCC comes with a considerable overhead in maintaining the MVCC snapshot table, necessary to compute transaction snapshots. Our first, and not so naive, implementation stumbled in the point-read and point-update dominated YCSB benchmark. After careful analysis and incremental tuning, we ended up with a very complex, yet optimal, almost entirely lock-free snapshot system. A mutex may be used occasionally, but only for very short operations.
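
To illustrate the flavor of technique involved (a minimal sketch, not CUBRID's actual snapshot table), the reader side below copies the list of active transactions without taking any lock: a version counter is bumped before and after every modification, and a reader simply retries its copy if the counter was odd or changed underneath it.

#include <array>
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified snapshot table; not CUBRID's implementation.
// Writers (transaction begin/commit) are assumed to serialize among themselves,
// e.g. with the occasional short mutex mentioned above; readers never block.
class ActiveTxnTable {
    static constexpr std::size_t MAX_TXNS = 1024;
    std::atomic<std::uint64_t> version{0};                      // even = stable, odd = updating
    std::array<std::atomic<std::uint64_t>, MAX_TXNS> slots{};   // 0 = free slot
public:
    void set_slot(std::size_t i, std::uint64_t txn_id) {        // writer side
        version.fetch_add(1, std::memory_order_acq_rel);        // counter becomes odd
        slots[i].store(txn_id, std::memory_order_relaxed);
        version.fetch_add(1, std::memory_order_release);        // counter even again
    }
    std::vector<std::uint64_t> snapshot() const {               // reader side, lock free
        std::vector<std::uint64_t> copy;
        for (;;) {
            copy.clear();
            std::uint64_t v1 = version.load(std::memory_order_acquire);
            if (v1 & 1) continue;                               // writer in progress, retry
            for (const auto& slot : slots) {
                std::uint64_t id = slot.load(std::memory_order_relaxed);
                if (id != 0) copy.push_back(id);
            }
            std::uint64_t v2 = version.load(std::memory_order_acquire);
            if (v1 == v2) return copy;                          // the copy is consistent
        }
    }
};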

Another critical concurrency optimization, especially with the CUBRID 10.0 MVCC implementation, concerned the latches on index nodes. All index manipulation functions used to traverse the B+-trees using exclusive latches, in case nodes had to be split or merged. But splits and merges are rarely executed, and using exclusive latches just in case is overkill. So we added a page latch promotion feature, and shared, promotable traversals are used instead of exclusive ones. In the unlucky cases when the latch cannot be promoted, traversals are restarted, but this cost is small compared to the overall gain.
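
A minimal sketch of the promotion pattern, assuming a hypothetical latch type (CUBRID's real page latches are richer than this): traversals take the latch in shared mode, try to promote it only when a split or merge is actually required, and restart from the root when promotion fails because other readers are present.

#include <atomic>
#include <thread>

// Illustrative promotable latch: state > 0 = shared holders, -1 = exclusive.
class PromotableLatch {
    std::atomic<int> state{0};
public:
    void lock_shared() {
        for (;;) {
            int s = state.load(std::memory_order_relaxed);
            if (s >= 0 &&
                state.compare_exchange_weak(s, s + 1, std::memory_order_acquire))
                return;
            std::this_thread::yield();                 // exclusive holder present
        }
    }
    void unlock_shared()    { state.fetch_sub(1, std::memory_order_release); }
    void unlock_exclusive() { state.store(0, std::memory_order_release); }

    // Succeeds only when the caller is the single shared holder.
    bool try_promote() {
        int expected = 1;
        return state.compare_exchange_strong(expected, -1,
                                             std::memory_order_acquire);
    }
};

// Usage pattern on a leaf page that turns out to need a split.
// Returning false means "restart the traversal from the root".
bool split_if_needed(PromotableLatch& leaf_latch, bool needs_split) {
    leaf_latch.lock_shared();
    if (needs_split) {
        if (!leaf_latch.try_promote()) {
            leaf_latch.unlock_shared();
            return false;                              // unlucky case: restart
        }
        // ... perform the split under the exclusive latch ...
        leaf_latch.unlock_exclusive();
        return true;
    }
    // ... common case: no structural change, no promotion needed ...
    leaf_latch.unlock_shared();
    return true;
}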

We tested various corner cases and analyzed performance, discovering different bottlenecks in the engine's internal processes. We found that the existing synchronization tools – a simple mutex and a bulky, complex custom critical section – were not enough for the wide spectrum of access patterns. Therefore, we added two more tools to our synchronization suite: re-entrant mutexes and lightweight rwlocks, which are used most intensively in the network communication module.

Other access patterns are dominated by read operations. In many such cases, lock-free data structures relying on atomic operations are more adequate than synchronization tools; sometimes "write" operations must be repeated, but the overall performance is much better. We have two such data structures: a lock-free hash table, which is now used intensively – the lock table, query plan cache and session table being just a few important examples – and a lock-free circular queue, which is particularly well suited for communication pipelines between threads (e.g. job/resource assignment).
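
As a hedged illustration of the second structure, here is a minimal single-producer/single-consumer circular queue built on atomics; CUBRID's own queue handles more general producer/consumer patterns, but the core idea is similar: indices advanced with atomic loads and stores, and a push or pop that simply reports full/empty instead of blocking.

#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Minimal SPSC lock-free ring buffer; capacity must be a power of two.
template <typename T>
class SpscCircularQueue {
    std::vector<T> buf;
    const std::size_t mask;
    std::atomic<std::size_t> head{0};   // next slot to pop  (consumer owned)
    std::atomic<std::size_t> tail{0};   // next slot to push (producer owned)
public:
    explicit SpscCircularQueue(std::size_t capacity_pow2)
        : buf(capacity_pow2), mask(capacity_pow2 - 1) {}

    bool push(const T& item) {                       // producer thread only
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == buf.size())
            return false;                            // full
        buf[t & mask] = item;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {                         // consumer thread only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return std::nullopt;                     // empty
        T item = buf[h & mask];
        head.store(h + 1, std::memory_order_release);
        return item;
    }
};

// Example pipeline: a dispatcher pushes job ids, a worker thread pops them.
// SpscCircularQueue<int> jobs(1024); jobs.push(42); auto job = jobs.pop();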

The log page buffer is another shared data structure that used to be protected by our custom critical section. It became overloaded due to the increased demand from our MVCC implementation, so we replaced it with a custom lock-free data structure.

Some of our biggest changes targeted the management of very large databases. We have completely rewritten the file and disk managers, and we redesigned the page replacement algorithm, effectively fixing two important TPC‑C bottlenecks.

The database used for the TPC‑C benchmark scales in size with the configured workload. For the high-end workloads of 1800W/2000W, the database grows beyond 300 GB, while the largest index alone is bigger than 50 GB.

Once we fixed the MVCC policy issue and tried to increase the TPC‑C workload further and further, we faced two new issues: slow execution of operations that scale with database size (e.g. page allocations), and thrashing. As a consequence, the storage system was completely rewritten.

We first rewrote the file and disk managers to address a page allocation issue. The old file design was unable to quickly allocate a page in very large files; as a result, it blocked the largest index – which is also very active – whenever it tried to split its nodes. After the rewrite, most page allocations access and modify a single page, and only several pages in the worst-case scenarios. Page deallocation still scales linearly with file size, but it is significantly faster than before. Disk operations too are faster and more predictable.
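
The flavor of "allocation touches a single page" can be sketched as follows; this is a deliberately simplified, hypothetical layout, not CUBRID's actual file table format: free space is tracked in small bitmap pages, so allocating a page in the common case means testing and setting one bit in one bitmap page.

#include <bitset>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical simplification: each bitmap page tracks a fixed run of data pages.
constexpr std::size_t PAGES_PER_BITMAP = 8 * 1024;   // illustrative fan-out

struct BitmapPage {
    std::bitset<PAGES_PER_BITMAP> used;
    std::size_t free_count = PAGES_PER_BITMAP;       // lets us skip full bitmaps
};

// In the common case only one bitmap page is read and modified.
std::optional<std::uint64_t> allocate_page(std::vector<BitmapPage>& bitmaps) {
    for (std::size_t b = 0; b < bitmaps.size(); ++b) {
        if (bitmaps[b].free_count == 0) continue;    // full bitmap, skip
        for (std::size_t i = 0; i < PAGES_PER_BITMAP; ++i) {
            if (!bitmaps[b].used.test(i)) {
                bitmaps[b].used.set(i);
                --bitmaps[b].free_count;
                return b * PAGES_PER_BITMAP + i;     // resulting page id
            }
        }
    }
    return std::nullopt;                             // no free page: the file must grow
}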

The next problem was thrashing, and we had to rewrite our page buffering system. With a new approach called Page Quota, LRU lists are classified into shared and private lists; each thread has its own private list to avoid polluting the shared lists with single-access pages. The LRU list itself is more complex and more capable of filtering out truly cold pages. Searching for page replacement candidates was also optimized by channeling threads directly to victims and by forcing them to wait when the system becomes very stressed.
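
A rough sketch of the private-versus-shared classification (illustrative only, and far simpler than the real Page Quota machinery, which also handles eviction, quotas and concurrency): a page first enters the requesting thread's private LRU list, and only a repeated access graduates it to the shared list, so single-access pages never pollute the shared lists.

#include <cstdint>
#include <list>
#include <unordered_map>

// Toy model: each page id lives either in one thread's private LRU list
// or in the shared LRU list; eviction is omitted for brevity.
using PageId = std::uint64_t;

struct PageQuotaBuffer {
    std::unordered_map<int, std::list<PageId>> private_lru;   // keyed by thread id
    std::list<PageId> shared_lru;
    std::unordered_map<PageId, int> owner;                    // -1 = on the shared list

    void fix_page(int thread_id, PageId page) {
        auto it = owner.find(page);
        if (it == owner.end()) {
            // First access: the page goes to the requesting thread's private list.
            private_lru[thread_id].push_front(page);
            owner[page] = thread_id;
        } else if (it->second != -1) {
            // Repeated access (possibly by another thread): promote to shared.
            private_lru[it->second].remove(page);
            shared_lru.push_front(page);
            it->second = -1;
        } else {
            // Already shared: just move it to the hot end of the shared list.
            shared_lru.remove(page);
            shared_lru.push_front(page);
        }
    }
};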

As a side note, although not part of the benchmark results, we also optimized database loading time, which is also important for very large databases.

The result of these changes is a well-performing TPC‑C run at the 1800W high load, with a very low and stable I/O ratio and with no noticeable bottlenecks. Even the 1900W and 2000W workloads, although not as optimal as 1800W, still run within stable parameters.

Although we would like to share stories about all the performance tunings we have done over the last couple of years, we would need many more lines, and that is beyond the scope of this page. We will just mention a few more of them, hoping to provide more details in separate blog posts.

We micromanaged the code by cherry-picking what to inline and what to move to the next stack frame, based on execution patterns.

We improved the communication between the CAS and the server using lighter synchronization and by merging exchanged packets.

We store very large strings more efficiently by compressing them. The bigger the strings, the more space is saved.

Repeating class locks are avoided (in the most usual cases) – they had an unusually high impact on very short transactions.

Invalidating (and flushing) a page on deallocation is deferred; the flush thread writes it to disk later, and merging index nodes does not wait for I/O. Additionally, merging conditions are tightened and adjacent nodes are not latched as often, especially when the page replacement system is stressed.

The query plan cache is loaded much faster, at the cost of additional memory to cache clones.

The external sort algorithm is enhanced with a simple optimization of the find-nth-page functionality; simple as it was, the optimization had a big impact on index loading and on executing ordered queries.

Some bottlenecks in executing queries over partitioned tables have been removed: the global index has been divided into local indexes, fetching the partition pruning predicate was improved, and partition-specific DDL statements have been optimized.

Finally, we'll end with important improvements to the High Availability feature: faster replication and reduced failover time.

Features

CUBRID 10.1 also brings new features that improve globalization – time zone data types based on IANA data and the BINARY charset; SQL extensions – Common Table Expressions, NATURAL JOIN, new statement types, new functions and support for SCHEMA comments; and utilities – checkdb, spacedb, checksumdb, restoreslave and Docker support.

We provide full support for time zone data types, including arithmetic operations, upgrading existing date/time functions to handle time zones, and adding new specialized time zone functions. Time zone support implies a large amount of metadata to describe modifications through time and across the globe, and this metadata is prone to change for various reasons (mostly economic and political ones). We chose the IANA format since it is the de facto standard, it is free and open source, it is used by most applications, and it is continuously maintained. Since our time zone definitions use IANA data, the time zone configuration can be easily updated (using files downloaded from the IANA site - https://www.iana.org/time-zones). Our engine is packaged with tools for upgrading user data when time zone definitions change.

Some loose ends from the 9.x versions are resolved in the new version. Aside from known issues in the collation inference algorithm, we added the BINARY charset and BINARY collation, which allow raw data of reasonable size to be stored in string data types. This permitted us to define more intuitive charset conversion rules (now the ISO88591 charset is properly converted to/from other charsets, while the new BINARY charset is transparent to such operations). Supported collations and charsets are shown in system tables and views and have their own specialized SHOW statements.

Common Table Expressions are an old ambition fulfilled by CUBRID 10.1. Developers can write complex queries faster, and statements are easier to understand because common SQL patterns are isolated. The feature also supports recursive queries.

There are many other SQL extensions besides CTEs, such as NATURAL JOIN for queries; new statements – CREATE OR REPLACE PROCEDURE/FUNCTION, CREATE TABLE IF NOT EXISTS, KILL and new SHOW statements; support for SCHEMA comments; CASCADE CONSTRAINTS for DROP TABLE; and new functions.

We have extended the functionality of utilities such as checkdb and spacedb, and added new utilities such as checksumdb and restoreslave. To make things easier for you, you can build a Docker image in just a few simple steps - https://github.com/CUBRID/cubrid-docker.

What else

We've done a lot of performance tuning and added new features, but we also did a lot of code refactoring, fixed five thousand issues, moved the project to GitHub, and opened new channels of communication with our community.

Refactoring the code has been a central topic in CUBRID 10.0 and CUBRID 10.1. The end user may not notice this, but we know the importance of keeping the code neat and we have done a lot of cleaning (we still do). Many server modules have been rewritten partially or completely. The storage module has seen the most changes, with the disk manager, file manager and page buffering system being completely overhauled; the most basic B+-tree and heap operations have also been rewritten, a task that could not be postponed due to their increased complexity under MVCC. We also rewrote the lock manager, query manager, query plan cache, scan manager, performance monitoring, and log module.

Bug fixing was no less of a challenge. Our QA devils have done an excellent job of finding new ways to test our engine and surfacing unbelievable bugs. RQG, short for Random Query Generator, is a demonic tool that executes hundreds of random queries in parallel. The CTL tool allows us to run multi-threaded queries in a predictable order, testing the consistency of our transactional concurrency. The test case suites of existing tools were also extended. It was a common effort that allowed us to fix an impressive number of issues: 5000.

The tools that we use to test the engine daily are now also open source and incredibly easy to use, thanks to our QA rock stars.

Ideas we plan to explore next

We are very proud of what we have achieved over these years. It was the collective effort of dozens of people spread across the world, in Korea, China and Romania. Some of us have moved on to different projects, some have just recently joined, but we like to think we are all one big family.

We would like to end this story by offering a sneak peek into our future plans. Our full backlog has features that would probably take decades to be fully implemented, so we will filter them carefully.

Our next priorities are enhancing the High Availability feature, improving performance even further, and making CUBRID engine easier for you to use.

For High Availability, we have both short-term and long-term plans. We'll start with refactoring in the next milestone, cleaning up the replication logic, and then we will extend apply-on-slave and optimize network communication. Ultimately, we would like to provide a multi-master replication system.

We will continue to analyze and tackle our performance bottlenecks, with special attention to read-heavy workloads this time. Our broader plans include tasks like bulk inserts, sorting and index loading optimization, improving partitions, count queries, and many more.

Our plans for making CUBRID easier to use are still under construction. We started by making Docker available, and we want to extend our driver options. We will add a JSON data type and natively support JSON operations and indexes. We will continue to improve the administrator tools – CUBRID Manager especially.

The next milestone will also include the postponed out-of-row column feature: we will separate very large columns from the rest of the row to improve the performance of queries that do not read those columns, as often happens.

We end the sneak peek here with the promise that more will come in the form of blog posts, so stay tuned. Don't hesitate to check our project on GitHub and to write to us on /r/CUBRID or via email, whichever you prefer. Give our engine a test drive by clicking download below and tell us what you think ☺.

Try out CUBRID 10.1 Engine right now!

version 10.1 (latest) for Windows

Need CUBRID for Linux? Try it out now