better, faster, stronger
Try out CUBRID 10.1 Engine right now!
version 10.1 (latest) for Windows
One year ago we released CUBRID 10.0 beta. It included a standard insert-only MVCC implementation and many other performance improvements, yet it was far from getting the results we wished to obtain. We changed the MVCC implementation and pushed more performance fixes, which finally led us to much better results.
CUBRID 10.0 beta used the intuitive insert-only policy, adopted by most MVCC systems, as its MVCC versioning scheme. Long story short, this creates multiple row versions stored in different locations (Object IDs, or OIDs); one bad consequence of this policy is index management – our initial implementation had to update indexes even for non-key updates, because the object location changes.
Non-key updates are very common operations in real-world and benchmark scenarios; thus, this weakness alone had dire consequences on overall performance. Although we had done many other optimizations besides MVCC, some benchmarks barely showed any performance improvement compared to 9.3 READ UNCOMMITTED. We could boast an isolation level upgrade to READ COMMITTED, but that was not enough for our ambitions.
So we replaced the MVCC versioning scheme with an update in-place policy – new versions overwrite the row at its original location. A link to the undo log record address is saved in the row header, so the previous version can be fetched from the log. As a result, the index remains unmodified for non-key updates. This change alone boosted our maximum TPC‑C workload by 40%.
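To make the idea concrete, here is a minimal, self-contained sketch of update in-place versioning. All names and structures are illustrative, not CUBRID's actual code: the row keeps its location across updates (so the index never needs touching for non-key updates), and an older snapshot follows the undo-log link in the row header to fetch the previous version.

```c
/* Hypothetical sketch of update-in-place MVCC versioning (not CUBRID's
 * actual structures): the row keeps its location (OID) across updates,
 * so non-key updates never touch the index. */
#include <assert.h>
#include <string.h>

#define UNDO_SLOTS 16

typedef struct {
    unsigned long mvcc_ins_id;  /* transaction that wrote this version */
    long          undo_addr;    /* log address of previous version, -1 if none */
} row_header;

typedef struct {
    row_header hdr;
    char       data[32];
} row;

/* toy undo log: append-only array of previous versions */
static row  undo_log[UNDO_SLOTS];
static long undo_top = 0;

/* Update in place: copy the current version to the undo log, then
 * overwrite the row at its original location and link back. */
static void mvcc_update(row *r, unsigned long tran_id, const char *new_data)
{
    long prev = undo_top++;
    undo_log[prev] = *r;                  /* save previous version */
    r->hdr.mvcc_ins_id = tran_id;
    r->hdr.undo_addr = prev;              /* link to previous version */
    strncpy(r->data, new_data, sizeof r->data - 1);
}

/* A snapshot that cannot see writer `tran_id` follows the undo chain. */
static const row *mvcc_visible(const row *r, unsigned long snapshot_id)
{
    while (r && r->hdr.mvcc_ins_id > snapshot_id)
        r = (r->hdr.undo_addr >= 0) ? &undo_log[r->hdr.undo_addr] : 0;
    return r;
}
```

The key property is visible in the sketch: `mvcc_update` never changes where the row lives, so index entries pointing at it stay valid.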
We pushed the TPC‑C limits further by fixing two other bottlenecks that affect very large databases. First we redesigned the disk and file managers, which fixed page allocation performance; then we redesigned the page buffering system, which solved the thrashing problem.
The most important goal of CUBRID 10.1 has always been performance, with a focus on write-heavy workloads. To follow our progress and to prove the effectiveness of our performance optimizations, we have used several standard benchmarks: TPC‑C, Sysbench, YCSB, TPC‑W. We increased our performance by 1.3 to 6 times, and we outperform MySQL on write-heavy workloads.
In multi-threaded systems, performance bottlenecks are commonly related to concurrency on shared resources. Tables, rows, index keys, data pages and data files are only a few examples of shared resources typical of transactional systems. Moreover, the list becomes much bigger when considering database internal processes which may also need concurrent access to the same resources – caches, logging and communication, to name just a few.
MVCC was the single major change that affected almost all engine modules. But the effort resulted in an optimized locking module: no object locks are required to read a row, and index key locks have been removed completely. That gave us the opportunity to further reduce the complexity of the locking system and eliminated deadlocks almost completely.
However, MVCC comes with considerable overhead in maintaining the MVCC snapshot table, which is necessary to compute each transaction's snapshot. Our first, not-so-naive implementation stumbled on the point-read and point-update dominated YCSB benchmark. After careful analysis and incremental tuning, we ended up with a very complex, yet optimal, almost entirely lock-free snapshot system. A mutex may still be used occasionally, but only for very short operations.
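For readers unfamiliar with what a snapshot buys, the visibility test it enables can be sketched in a few lines. The structure and field names below are illustrative, not CUBRID's: a transaction records which writers were still in flight when it started, and their versions stay invisible to it.

```c
/* Minimal sketch of an MVCC snapshot visibility check (names are
 * illustrative, not CUBRID's): versions written by transactions that
 * were active, or not yet started, at snapshot time are invisible. */
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned long lowest_active;  /* ids below this are all committed */
    unsigned long next_id;        /* ids at or above this did not exist yet */
    unsigned long active[8];      /* in-flight ids in between */
    int           n_active;
} snapshot;

static bool snapshot_sees(const snapshot *s, unsigned long writer_id)
{
    if (writer_id < s->lowest_active) return true;   /* committed before us */
    if (writer_id >= s->next_id)      return false;  /* started after us */
    for (int i = 0; i < s->n_active; i++)
        if (s->active[i] == writer_id) return false; /* still running */
    return true;
}
```

Every read runs this check, which is why building the snapshot cheaply, ideally without locks, matters so much under point-read workloads like YCSB.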
Another critical concurrency optimization, especially with the CUBRID 10.0 MVCC implementation, was the latch on index nodes. All index manipulation functions used to traverse the B+-trees using exclusive latches, in case nodes had to be split or merged. But splits and merges are rarely executed, and taking exclusive latches just in case is overkill. So we added a page latch promotion feature, and shared, promotable traversals are used instead of exclusive ones. In the very unlucky cases when a latch cannot be promoted, traversals are restarted, but this cost is small compared to the overall gain.
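The shape of that optimization can be sketched with a toy latch; this is a single-threaded illustration with hypothetical names, not CUBRID's latch implementation. The common path holds only a shared latch; only when a split is actually needed does the thread try to promote, and a failed promotion restarts the traversal.

```c
/* Illustrative sketch of latch promotion on a B+-tree node (hypothetical
 * API, single-threaded toy): descend with a shared latch, promote it in
 * place only when a split is needed, restart if promotion fails. */
#include <assert.h>
#include <stdbool.h>

typedef struct { int readers; bool writer; } latch;

static void latch_shared(latch *l)            { l->readers++; }
static void latch_release_shared(latch *l)    { l->readers--; }
static void latch_release_exclusive(latch *l) { l->writer = false; }

/* Promotion succeeds only when we are the sole reader; otherwise the
 * caller must release and restart the traversal. */
static bool latch_try_promote(latch *l)
{
    if (l->readers == 1 && !l->writer) {
        l->readers = 0;
        l->writer = true;
        return true;
    }
    return false;
}

typedef struct { latch lt; int nkeys; } btree_node;

/* Returns the number of traversal attempts the insert took. */
static int btree_insert(btree_node *leaf, int capacity)
{
    for (int attempt = 1;; attempt++) {
        latch_shared(&leaf->lt);
        if (leaf->nkeys < capacity) {           /* common case: no split */
            leaf->nkeys++;
            latch_release_shared(&leaf->lt);
            return attempt;
        }
        if (latch_try_promote(&leaf->lt)) {     /* rare case: split needed */
            leaf->nkeys = leaf->nkeys / 2 + 1;  /* pretend we split */
            latch_release_exclusive(&leaf->lt);
            return attempt;
        }
        latch_release_shared(&leaf->lt);        /* promotion failed: retry */
    }
}
```

Since splits are rare, almost every traversal stays on the cheap shared path, which is the whole point of the feature.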
We've tested various corner cases and analyzed performance, discovering different bottlenecks in the engine's internal processes. We found that the existing synchronization tools – a simple mutex and a bulky, complex custom critical section – were not enough for the wide spectrum of access patterns. Therefore, we added two more tools to our synchronization suite: re-entrant mutexes and lightweight rwlocks, which are most intensively used in the network communication module.
Other access patterns are dominated by read operations. In many such cases, lock-free data structures relying on atomic operations are more adequate than synchronization tools; sometimes "write" operations must be retried, but the overall performance is much better. We have two such data structures: a lock-free hash table, which is now used intensively – the lock table, query plan cache and session table being just a few important examples – and a lock-free circular queue, which is particularly well suited for communication pipelines between threads (e.g. job/resource assignment).
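As a flavor of how such a queue works, here is a minimal single-producer/single-consumer circular queue built on C11 atomics. This is a simplified sketch of the general technique, not CUBRID's actual structure, which has to handle multiple producers and consumers.

```c
/* Sketch of a lock-free single-producer/single-consumer circular queue
 * using C11 atomics; a simplification of the general idea, not the
 * engine's actual multi-producer structure. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define QCAP 8u  /* power of two so head/tail can wrap freely */

typedef struct {
    _Atomic unsigned head;  /* next slot to consume */
    _Atomic unsigned tail;  /* next slot to produce */
    int slot[QCAP];
} spsc_queue;

static bool q_push(spsc_queue *q, int v)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP) return false;           /* full */
    q->slot[t % QCAP] = v;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

static bool q_pop(spsc_queue *q, int *out)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return false;                  /* empty */
    *out = q->slot[h % QCAP];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

The release/acquire pairing on `tail` and `head` is what makes the slot contents visible to the other thread without any mutex.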
The log page buffer is another shared data structure that used to be protected by our custom critical section. It became overloaded by the increased demand from our MVCC implementation, so we replaced it with a custom lock-free data structure.
Some of our biggest changes targeted the management of very large databases. We have completely rewritten the file and disk managers, and we redesigned the page replacement algorithm, effectively fixing two important TPC‑C bottlenecks.
The database used for the TPC‑C benchmark scales in size with the configured workload. For the high-end workloads of 1800W/2000W, the database grows beyond 300GB, while the largest index alone exceeds 50GB.
Once we fixed the MVCC policy issue and tried to push the TPC‑C workload further and further, we faced two new issues: slow execution of operations that scale with database size (e.g. page allocations), and thrashing. As a consequence, the storage system was completely rewritten.
We first rewrote the file and disk managers to address the page allocation issue. The old file design was unable to quickly allocate a page in very large files; as a result, it blocked the largest index – which is also among the most active – whenever its nodes had to be split. After the rewrite, most page allocations access and modify a single page, and only several pages in the worst-case scenarios. Page deallocation still scales linearly with file size, but it is significantly faster than before. Disk operations are also faster and more predictable.
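To see why "a single page" is enough, consider a hedged sketch in which free pages of an extent are tracked as a bitmap on one header page; the layout and names are invented for illustration and are not CUBRID's on-disk format. Allocation then reads and modifies only that header page, no matter how large the file has grown.

```c
/* Illustrative sketch (not CUBRID's actual layout): tracking free pages
 * of an extent in one bitmap means allocation touches a single page. */
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t free_map; } file_header_page;  /* 64 pages/extent */

/* Returns the allocated page index, or -1 if the extent is full. */
static int alloc_page(file_header_page *hdr)
{
    for (int i = 0; i < 64; i++) {
        if (!(hdr->free_map & ((uint64_t)1 << i))) {
            hdr->free_map |= (uint64_t)1 << i;   /* one page modified */
            return i;
        }
    }
    return -1;
}

static void dealloc_page(file_header_page *hdr, int i)
{
    hdr->free_map &= ~((uint64_t)1 << i);
}
```

A real file spans many extents, so the worst case touches a few header pages rather than one, which matches the "only several pages" behavior described above.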
The next problem was thrashing, and we had to rewrite our page buffering system. With a new approach called Page Quota, LRU lists are classified into shared and private lists; each thread has its own private list to avoid polluting the shared lists with single-access pages. The LRU list itself is more complex and more capable of filtering out truly cold pages. Searching for page replacement candidates was also optimized, by channeling threads directly to victims and by forcing them to wait when the system becomes very stressed.
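The core intuition behind the private/shared split can be captured in a toy model; this greatly simplifies Page Quota and uses invented names. A page first lands in the requesting thread's private list, and only a repeat access promotes it to the shared list, so one-time scans never evict the genuinely hot pages.

```c
/* Toy model of the private/shared LRU idea (greatly simplified, names
 * invented): only pages accessed more than once reach the shared list,
 * so single-access pages stay cheap eviction candidates. */
#include <assert.h>
#include <stdbool.h>

#define NPAGES 64

typedef enum { NOT_CACHED = 0, IN_PRIVATE, IN_SHARED } residence;

typedef struct { residence where[NPAGES]; } buffer_pool;

static void fix_page(buffer_pool *bp, int page)
{
    switch (bp->where[page]) {
    case NOT_CACHED: bp->where[page] = IN_PRIVATE; break; /* first touch */
    case IN_PRIVATE: bp->where[page] = IN_SHARED;  break; /* proven hot */
    case IN_SHARED:  /* already hot: would move to LRU head (omitted) */ break;
    }
}

/* Victims are taken from private lists first; shared pages survive. */
static bool is_eviction_candidate(const buffer_pool *bp, int page)
{
    return bp->where[page] != IN_SHARED;
}
```

The real system adds per-thread quotas and multi-zone LRU lists on top of this, but the filtering effect on single-access pages is the part that stops thrashing.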
As a side note, although not part of the benchmark results, we also optimized database loading, which is also important for very large databases.
The result of these changes is a well-performing TPC‑C run at the 1800W high-load configuration, with a very low and stable I/O ratio and no noticeable bottlenecks. Even the 1900W and 2000W workloads, although not as optimal as 1800W, still run with stable parameters.
Although we would like to share stories about all the performance tuning we have done over the last couple of years, that would require many more lines and is beyond the scope of this page. We will just mention a few more of them, hoping to provide more details in separate blog posts.
We micro-optimized the code by cherry-picking, based on execution patterns, which functions to inline and which to leave out-of-line.
We improved the communication between CAS and server by using lighter synchronization and by merging exchanged packets.
We store very large strings more efficiently by compressing them. The bigger the strings, the more space is saved.
Repeated class locks are avoided (in the most usual cases) – they had an unusually high impact on very short transactions.
Invalidating (and flushing) a page on deallocation is deferred; the flush thread writes it to disk later, so merging index nodes does not wait for I/O. Additionally, merge conditions were tightened, and adjacent nodes are not latched as often, especially when the page replacement system is stressed.
The query plan cache is loaded much faster, at the cost of additional memory to cache clones.
The external sort algorithm was enhanced with a simple optimization of the find-nth-page functionality; simple as it was, the optimization had a big impact on index loading and on executing ordered queries.
Some bottlenecks in executing queries over partitioned tables have been removed. The global index has been divided into local indexes. Fetching the partition pruning predicate was improved. Partition-specific DDL statements have been optimized.
Finally, we’ll end with important improvements to the High Availability feature: faster replication and reduced failover time.
CUBRID 10.1 also brings new features that improve globalization – time zone data types using IANA data and a BINARY charset; SQL extensions – Common Table Expressions, NATURAL JOIN, new statement types, functions and support for SCHEMA comments; and utilities – checkdb, spacedb, checksumdb, restoreslave and Docker support.
We’ve done a lot of performance tuning and added new features, but we have also done a lot of code refactoring, fixed five thousand issues, moved the project to GitHub and opened new channels of communication with our community.
Refactoring the code has been a central topic in CUBRID 10.0 and CUBRID 10.1. The end user may not notice this, but we know the importance of keeping the code neat and we have done a lot of cleaning (we still do). Many server modules have been rewritten partially or completely. The storage module has seen the most changes, with the disk manager, file manager and page buffering system completely overhauled; the most basic B+-tree and heap operations have also been rewritten, a task that could not be postponed due to their increased complexity under MVCC. We also rewrote the lock manager, query manager, query plan cache, scan manager, performance monitoring and log module.
Bug fixing was no less of a challenge. Our QA devils have done an excellent job of finding new ways to test our engine and surfacing unbelievable bugs. RQG, short for Random Query Generator, is a demonic tool that executes hundreds of random queries in parallel. The CTL tool allows us to run multi-threaded queries in a predictable order, testing the consistency of our transactional concurrency. The test-case suites of existing tools were also extended. It was a common effort that allowed us to fix an impressive number of 5000 issues.
The tools that we use to test the engine daily are now also open source and incredibly easy to use, thanks to our QA rock stars.