posted 5 years ago in Dev Platform category by Hye Jeong Lee
NoSQL is the talk of the town. And we have already covered what it is for in one of our previous blogs. Today I would like to share the NoSQL benchmark test results we have recently conducted. It will help you to understand if the soon to develop system is compatible to NoSQL, and which NoSQL product to select.
The interest in NoSQL continues to rise because the amount of data to process continues to increase. International internet companies, including Google and Facebook, have their own NoSQL solution designed to process the exploding amount of data.
Why are they using NoSQL instead of RDBMS?
RDBMS is known to experience burden when processing tera or peta unit large sized data. However, this can be resolved through sharding. For instance, NHN’s Café, Blog and News services use RDBMS through sharding.
There is no single correct answer in processing bulk data. Since every situation is different, the company must select the solution that best fits their situation and apply that for seamless service.
Out of the RDBMSs, Oracle is an exception since Oracle’s performance and functions, such as mass data processing or data synchronization, are far more superior to other RDBMS. However, the high cost can be a problem. Depending on the size, it may be more economical to develop a NoSQL than purchasing an Oracle license.
Even if you have enough expertise on how to process bulk data with RDBMS, we will need to have continuous interest and training in NoSQL. NoSQL is gaining popularity because of its ability to process mass data, but it still has many technical (or functional) limitations compared to RDBMS. However, this will be resolved as time passes.
NoSQL provides a non-relational, and in the long run, schema-free data model, allowing the horizontal extension of the system. Instead, it is less structured than RDBMS and does not guarantee ACID. Therefore, after the INSERT operation is completed and SELECT conducted, a different value can be acquired. And after the NoSQL demo has failed and restored, the stored value may be different than before. Operations such as transactions or join cannot be conducted.
While ACID has been abandoned, the flexibility, scalability and usability of data storage have increased. Therefore, it is more suitable for explosive amounts of data processing.
Now let’s see if the NoSQL products really play the expected part for the internet services you are to develop. Why don’t we take a look at the characteristics of the most widely used NoSQL products (Cassandra, HBase, MongoDB) by looking into each of its architectures and benchmark tests that were conducted.
Benchmarking Tests using YCSB
YCSB (Yahoo Cloud Servicing Benchmark) is a test framework developed by Yahoo. It allows to conduct benchmark tests on storages by creating "work loads". Through this benchmark, the storage most suitable for the service that is to be developed can be selected.
Basic operations are Insert, Update, Read, and Scan. There are basic workload sets that combine the basic operations, but new additional workloads can also be created.
YCBS currently supports Cassandra, HBase, MongoDB, Voldemort and JDBC. If tests on other storages are needed, then use YCBS interfaces to test during the development process.
This article contains tests conducted on the following products and versions.
Although Cassandra’s latest version is 0.8.0, we have decided to use the previous version known to be stable. Because when testing with the 0.8 version, the gossip protocol between nodes malfunctioned and the node up/down information was incorrect.
- HBase-0.90.2 (Hadoop-0.20-append)
The HBase-0.90.2 (Hadoop-0.20-append) was selected because, if not the Hadoop-append version, there may be problems on decreased durability in HDFS.
The test workload is as follows.
- Insert Only
Enter 50 million 1K-sized records to the empty DB.
- Read Only
Search the key in the Zipfian Distribution for a one hour period on the DB that contains 50 million 1K-sized records.
- Read & Update
Conduct read and update one-on-one instead of read under the identical conditions of Read Only.
There are three testing equipments with the same specifications:
- Nehalem 6 Core x 2 CPU, 16GB Memory.
Each conduct replicates and distributes three copies. However, with MongoDB the performance result was abnormally high when both replications and distribution compositions were organized, so the test was conducted only with the replica set. The parameter for each product is shown in the following listings. Other items were set to default.
# If a minimum of one succeeds while replicating, then it is returned to application. Consistency Level, Read=ONE, Write=ONE # Can be set to periodic(default) or batch. # Periodically writes (fsync) the commit log on the disk, # and batch executes regularly (1ms) collecting those to fsync every time. commitlog_sync: batch, commitlog_sync_batch_window_in_ms: 1 # Degree the key location is cached. When 1.0 - everything is cached. key_cached=1.0
HeapSize, HBase=8G, HDFS=2G
# Size of Oplog. Oplog is the log the master accumulated for replication. --oplogSize=100G
The results are as shown in the following figure.
- All three products showed better throughput in INSERT.
- Cassandra showed outstanding throughput in INSERT-only with 20,000 ops.
- HBase also presented relatively good performance in INSERT-only.
- The products did not show large differences with READ-only than INSERT-only.
- HBase’s performance was better than Cassandra in READ-only.
- Cassandra’s performance in READ-and-UPDATE was better than the other two products. Cassandra’s READ-and-UPDATE might have been higher than READ-only due to Cassandra’s excellent insert throughout.
- MongoDB’s throughout in all three conditions was the lowest of the three products. MongoDB, which uses the Memory Mapped File (mmap), probably showed poor performance because the large data size exceeded the physical memory size.
Cassandra and HBase Architecture
Both Cassandra and HBase are influenced by Google’s BigTable. (Although Cassandra was directly influenced by Amazon’s Dynamo.) BigTable is a Column-Oriented DB that stores data in a Multidimensional Sorted Map format and has the <row key, column family, column key> data structure. Column family is a basic unit that stores data with column groups that are related to each other.
In BigTable, data is written basically with the append method. In other words, when modifying data, the updates are appended to a file, rather than an in-place update in the stored file. The figure below shows BigTable’s data read/insert path.
When a write operation is inserted, it is first placed in a memory space called memtable. If the memtable is full, then the whole data is stored in a file called SSTable (Sorted String Table). Data may be lost if the server collapses before the memtable data is moved to SSTable, so to provide durability it is necessary to save the history of commit logs every time before writing to the memtable. When conducting the read operation, first find the pertaining key in the memtable. If it is not in the memtable, search for it in the SSTable. You may have to search multiple SSTables.
There are advantages in the writing operation if this architecture is used. This is because the ‘writing’ operation is only recorded in the memory and moved to the actual disk only after a certain amount has been accumulated. Therefore, concurrent I/O can be avoided. However, when "reading", if "reading" is done in the SSTable and not in the memtable, then the performance will relatively decrease. Both Cassandra and HBase use bloom filter to quickly judge whether the key exists in the SSTable and creates an index for use. However, if there are many SSTables, then a lot of I/O will be created during the reading operation. Therefore, compaction is done to improve the reading performance. Compaction is where two SSTables merge and sort to become one SSTable, which decreases the number of SSTables. The reading and writing performance improved as more compactions are done.
For these reasons READ and READ-and-UPDATE are much slower than INSERT operations in these NoSQL solutions.
Additionally, Cassandra uses consistent hashing. The HBase and Casandra may have their similarities, there are differences as well. Cassandra prefers AP (Availability & Partition tolerance) while HBase prefers CP (Consistency & Partition tolerance). CAP theorem is a theory in Distributed Computing.
The theory claims that there is no system that provides all three Consistency, Availability and Partition tolerance. For example, replications must be made in multiple nodes to increase usability, and the adjustability between replicated data must be met. However, in order to make operation possible, even during network partitioning, it becomes difficult to guarantee the adjustability between replications or data. Therefore, only a part of the CAP can be provided.
As mentioned before, Amazon’s Dynamo has a direct influence on Cassandra. Dynamo uses consistent hashing to disperse data to the key-value store, and provides high adjustability. Consistent hashing sequences the value (slot) that hashed the key and placed it in a ring format. The multiple nodes that create the cluster processes certain ranges of the ring. Therefore, every time a node in the cluster falls out or comes in, the closed node on the ring can take over the concerned range or distribute the range without rehashing.
Cassandra also raises the usability level with the concept of consistency level. This concept is related to replication. Confirmation of the number of replications and the completion of replication can be adjusted with the system parameter. For example, if three replications are maintained and a write operation is inserted, then the operation is only considered to be successful if the three replications are completed.
However, Cassandra allows only N (under 3) number of executions to be checked and immediately returns to value. Therefore, write can be conducted successfully even with failures in the replication node, which raises usability. Histories of failed write operations are recorded on a different node, and the operation can be retried at a later date (this is called “hinted handoff”). Since the success of replicated writing is not guaranteed, the data suitability is checked in the reading stage. Generally, if there are multiple replications, it is collected into one when reading. However, Cassandra keeps in mind that not all replications match, and reads the data from all three replications, checks whether it is identical, and restores the latest data if it is not suitable (this is called “read repair”).
This is why Cassandra is slower in reading than writing.
Now let’s look into HBase. HBase has the same structure as BigTable. While BigTable operates on the tablet unit, HBase disperses and replicates with the region unit. Key is arranged according to ranges, and the location of the region, where the key is stored in, is saved as meta data (this is called meta region). Therefore, if a key is inserted, first find the region, and later the client will cache this information. HBase places the writing operation in a single region, without distributing and splits the region if it becomes too large. Adjustments, such as creating regions in advance, must be made for the action above.
HDFS is in charge of data storage and replication. When writing a file with Hadoop’s distribution file system, HDFS writes multiple replications (synchronous replication) and only reads one of them when reading. This guarantees data integrity and lowers reading overhead. However, since HBase and HDFS operates separately, the location of the node that contains memtable and SSTable may change, which may cause additional network I/O.
What happens if there is a failure in the node? If the region server collapses, other nodes take over the tasks which comprised the region. In other words, the commit log that pertains to the region is replayed and memtable is created. If the failed node contains HDFS’s data, then chunk that pertains to the node is migrated to a different node.
Thus, we have looked at the similarities and differences of Cassandra and HBase.
Then why did the two products show a difference in writing during the comparison test?
- First of all, there is a difference in commit log. Cassandra writes the commit log once in the local file system. However, HBase enter the commit log in the HDFS and HDFS conducts the replication. Data locality is not put into consideration, so the file write request will probably be sent to a different physical node in the network.
- Second, Cassandra receives the write request on all three nodes, but HBase only ‘writes’ on a single region in the beginning, and receives requests on only one node.
Then why is there a difference in reading performance?
There are various differences, but for one READ request Cassandra reads the data three times while HBase only reads the data once. This places an influence on the throughput.
MongoDB is unlike Cassandra or HBase eplained earlier. There are too many functional groups that can be classified as NoSQL, which makes it difficult to compare with Cassandra and HBase on the same level.
Cassandra and HBase is for processing full-scale large amounts of data, while MongoDB can be used quickly, schema-free when using a certain amount of data.
MongoDB adopts a documented-oriented format, so it is more similar to RDBMS than a key-value or column oriented format.
MongoDB operates on a memory base and places high performance above data scalability. If reading and writing is conducted within the usable memory, then high-performance is possible. However, performance is not guaranteed if operations exceed the given memory. Therefore, MongoDB should be used as a medium or small-sized storage system.
Configure a test condition similar to the original test on a lower specification server and test 300 thousand records instead of 50 million, and MongoDB will record greater performance than Cassandra or HBase.
Mongo DB uses BSON for data storage. (As known as Binary JSON, but is similar to Hessian than JSON in data encoding format and expressive datatype.)
The logical unit that is expressed using BSON is called document. (When using NPC as an example, it is the same as when the NPC data is recorded on the disk and searched.) Document is a concept that corresponds to a row in RDBMS. The difference is that a document in MongoDB does not have a schema.
Accumulated documents are called collections. A collection corresponds to a table in RDBMS.
Basic query method is to search the collection to find documents that meet the conditions. Each document has a field called _id, which acts as the OID (Object Identifier), and all query results are returned with this _id.
Index can also be created for collections. An index can be prepared for the fields in the document. This index is implemented in B-Tree, similar to that of RDBM’s index. Each collection creates an index for _id by default. One of the strengths of MongoDB is that its functions are similar to RDBMS, in that it supports indexing and JOIN.
Further, we will now look at the structure of MongoDB.
MongoDB is configurable in replication and distribution configuration. The system configuration recommended by MongoDB is three shard servers and three replica sets for each server (in other words, nine nodes). Replications are made possible by Master/Slave or a replica set. Replica set supports automatic failover. It has one master and multiple slaves, and write can only be sent to the master. If the master is terminated, service is resumed after one of the slaves are appointed as a master.
The distribution method provided by MongoDB is sharding. Each collection can shard, and a field in the Collection can designate a shard key. Like HBase, sharding is operated with the Order-preserving partitioning method. Chunk is included in the <collection, minkey, maxkey>, a continuous data range within shard, and can be as big as 64M. Chunks that have exceeded the maximum size can split into two chunks, and the divided chunks can migrate to different shard servers. The following figure contains a picture of MongoDB, which is composed of shards and replica sets.
The client first contacts a process called mongos, which is in charge of routing and coordination. Coordination is done over multiple nodes through requests and the results are collected and returned. The config server retains meta information on shards and chunks.
Now let’s look into MongoDB's internal structure. MongoDB uses the asynchronous method for replication. When a “write” operation is inserted, master stores the history in Oplog. Oplog is a fixed sized, FlfO method collection called capped collection. Logs accumulated in the Oplog are regularly polled by the slave, read, then implemented. If the slave failed to replay the log while the Oplog fills up, then the slave must suspend log-based replications and synchronizes by resyncing the master’s data. One of Oplog’s other distinctive characteristics is that it can conduct the saved operations idempotently when replaying the log. Operation logs are saved in the order it was executed in, so logs can be replayed after that period. Ops such as increment are changed to set ops when logged for this purpose.
MongoDB’s replication and failover method and query power is similar to RDBMS. The most significant difference is that flexibility in data storage is provided to the application while schema-free data is stored.
Unlike NoSQL, RDBMS is "read" operation oriented. The detailed method differs for each product, but generally NoSQL products are implemented so that mass data can be ‘written’ quickly. On the other hand, RDBMS, which is ‘read’ oriented, not only enables quick reading, but also provides many functions and inquiry features. DBMS can quickly inquire values using indexes, but at the same time indexes slow down ‘writing’. Since an index must be configured for every ‘writing’ operation, the more indexes there are, the more time is needed to configure the indexes.
The difference of RDBMS and NoSQL can be compared with ACID and BASE, but comparisons can also be made with reading and writing.
If the business you are to build requires more reading than writing, and needs complex reading methods, as well as basic aggregate functions such as SUM() or AVG(), then RDBMS might be a better choice than NoSQL.
We have looked at benchmark tests done on Cassandra, HBase, MongoDB and the differences between the products. Cassandra and HBase show excellent writing abilities, but its reading performance did not meet expectations. This was because unlike existing RDBMSs, the two products were optimized for writing, and this resulted in a lot of concurrent I/O when reading.
MongoDB has similar structures to that of RDBMS and shows great flexibility in data modeling especially for medium and small-sized businesses.
By Lee Hye Jeong, Storage System Development Team, NHN Corporation.