posted last year in Dev Platform category by Ki Yeul Lee
SSDs are faster than HDDs. So would it be better to switch the storage medium from HDD to SSD? Before considering the switch, let's first look at how DBMS data buffering works.
An SSD has faster read/write speeds than an HDD, so replacing an HDD with an SSD will speed up many programs, including the OS. The DBMS performance gained by replacing HDD with SSD, however, can be minimal to none.
This is because the DBMS uses data buffering (page buffering). Data buffering is a form of caching, and it reduces IO on the storage medium. In an environment where data buffering runs efficiently (i.e. the cache hit ratio is high), the IO performance of the storage medium becomes relatively less important.
So, if you want to improve performance by switching from HDD to SSD, first check whether the DBMS' data buffering is operating properly. You may get the improvement you want from data buffering alone, without switching the storage medium to SSD.
In this article, we will look at data buffering efficiency in both HDD and SSD environments and at how TPS changes accordingly, to find an efficient way to size the data buffer and to choose the storage medium.
HDD Structure and IO Features
An HDD records data on revolving platters (magnetic discs). It is composed of several platters, which all revolve together rather than independently of one another. As you can see in Figure 1 below, each concentric circle on a platter is called a track, and the set of tracks at the same position on all platters (the same track number) is called a cylinder. The head reads and writes magnetic data on the platter, and since data can be recorded on both sides of a platter, each platter has two heads. Each head is attached to the end of a disc arm.
Figure 1: HDD's Track/Cylinder, Sector and Heads.
Data is recorded block by block, in consecutive order, along the cylinders of the rapidly spinning platters. Consecutively recorded data can therefore be accessed with a single positioning movement (one disc arm movement plus platter rotation).
Figure 2: Inside the HDD.
So, the IO time of an HDD is the sum of the "seek time, during which the disc arm moves to the relevant cylinder," the "rotational latency, during which the platter rotates until the relevant block (sector) reaches the head," and the "data transfer time, during which the head reads or writes data."
As the seek time and rotational latency are much longer than the data transfer time (seek time >> rotational latency > data transfer time), the OS generally uses IO scheduling that is optimized to minimize head movement and platter rotation.
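To make the three components concrete, here is a minimal Python sketch that adds them up for one random IO. The drive figures (8 ms average seek, 7,200 rpm, 100 MB/s transfer rate, 16 KB page) are hypothetical values chosen for illustration, not measurements from any drive in this article, and the function name hdd_io_time_ms is made up for this example.

```python
def hdd_io_time_ms(seek_ms, rpm, transfer_bytes, transfer_mb_per_s):
    """Estimate the time of one random HDD IO in milliseconds."""
    # On average the platter turns half a revolution before the
    # target sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    # Time to move the data once the head is over the right sector.
    transfer_ms = transfer_bytes / (transfer_mb_per_s * 1_000_000.0) * 1_000.0
    return seek_ms + rotational_latency_ms + transfer_ms


# Hypothetical drive: 8 ms seek, 7,200 rpm, 100 MB/s, 16 KB pages.
PAGE_BYTES = 16 * 1024
one_page_ms = hdd_io_time_ms(8.0, 7200, PAGE_BYTES, 100.0)
# 64 consecutive pages in a single request pay seek and rotation only once.
one_large_request_ms = hdd_io_time_ms(8.0, 7200, 64 * PAGE_BYTES, 100.0)

print(f"1 page, 1 request:     {one_page_ms:.2f} ms")
print(f"64 pages, 1 request:   {one_large_request_ms:.2f} ms")
print(f"64 pages, 64 requests: {64 * one_page_ms:.2f} ms")
```

Under these assumed numbers the transfer itself is a small fraction of a single-page IO, which is why positioning costs dominate.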
However, the benefit of IO scheduling cannot always be counted on, because applications issue IO requests to the OS very frequently and the OS cannot predict their timing.
When creating an application, make sure that it requests a small number of large-volume read/write operations rather than frequently requesting a large number of small-volume read/write operations. Doing so can increase the system performance by allowing an effective IO scheduling at the OS level.
CUBRID groups disc IO by page. In addition, IO requests that do not affect response time are internally sorted and processed by page number so that the OS can minimize seek time and rotational latency when scheduling them. Performance differs depending on whether this is applied.
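The effect of sorting requests by page number can be illustrated with a small sketch. It is not CUBRID's code: it simply assumes that the page number is a rough proxy for physical position on the disc, and it measures how far a hypothetical head would travel when visiting the same pages in arrival order versus sorted order. The function name and the page numbers are made up for the example.

```python
from typing import List


def head_travel(start: int, pages: List[int]) -> int:
    """Total distance moved when pages are visited in the given order."""
    distance = 0
    position = start
    for page in pages:
        distance += abs(page - position)
        position = page
    return distance


# Hypothetical batch of pending page writes, in arrival order.
pending = [90, 10, 70, 20, 80, 30]

arrival_order_cost = head_travel(0, pending)
sorted_order_cost = head_travel(0, sorted(pending))

print("arrival order:", arrival_order_cost)
print("sorted order: ", sorted_order_cost)
```

With this batch, the sorted order moves the head in one sweep instead of bouncing back and forth, so the travel distance drops from 390 to 90 units.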
In an HDD, the time the head spends reading or writing data is relatively short. Because of this, shortening seek time and rotational latency is crucial to improving disc IO performance. For sequential writes, an HDD sometimes even performs better than an SSD.
The Solid State Disk Structure and IO Features
A Solid State Disk (SSD), also called a solid state drive, is a storage device that uses flash memory as the storage medium. Flash memory is a type of EEPROM whose content can be read, written, or erased electrically.
Figure 3: Inside the SSD.
When the memory is initialized (erased), all bits are set to 1, and programming, which is done page by page, can change selected bits to 0. A programmed page can be programmed again, but only in one direction: a bit can change from 1 to 0, never from 0 to 1.
Because of this, before an area can be reused for arbitrary data, all bits in that area must first be erased, that is, reset to 1.
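The one-way nature of programming can be modeled in a few lines. The sketch below is a toy model, not a description of any real flash controller: it assumes 8-bit "pages" (real pages are far larger) and treats programming as a bitwise AND, which is one common way to picture "bits can only go from 1 to 0." The class and method names are invented for the example.

```python
ERASED = 0xFF  # toy 8-bit page; an erased page has every bit set to 1


class FlashBlock:
    """Toy erase block holding a few 8-bit pages."""

    def __init__(self, page_count):
        self.pages = [ERASED] * page_count

    def program(self, index, value):
        # Programming can only clear bits (1 -> 0), so the stored result
        # is the bitwise AND of the old content and the new value.
        self.pages[index] &= value

    def erase(self):
        # Erase works on the whole block and resets every bit to 1.
        self.pages = [ERASED] * len(self.pages)


block = FlashBlock(page_count=4)
block.program(0, 0b10101010)
block.program(0, 0b11110000)       # overwrite attempt without erasing
after_overwrite = block.pages[0]   # bits that were already 0 stay 0

block.erase()
block.program(0, 0b11110000)       # same write after an erase
after_erase = block.pages[0]

print(f"without erase: {after_overwrite:08b}")
print(f"after erase:   {after_erase:08b}")
```

Without the erase, the second write yields 10100000 rather than the intended 11110000; after erasing the block, the same write stores the intended value.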
In an HDD, the head takes about the same time to read data as to write it, but in an SSD read and write performance differ. Though this varies with each product, if reading a page takes a few hundred nanoseconds (ns), then writing takes microseconds (us) and erasing takes milliseconds (ms).
All bits in an area are erased at the same time, and erasing the entire chip usually takes about as long as erasing just a few pages. Flash memory is therefore usually organized so that erasing is done by sector (or block), which is a collection of several pages.
For flash memory, there is a limit on how many times a given sector can be erased. (The limit is usually about tens of thousands to hundreds of thousands of times.) Because of this, physical block addresses are mapped differently from logical block addresses, so that frequently updated areas such as the FAT do not keep rewriting the same physical location and the number of erasures is spread evenly across sectors. This process is called wear leveling.
Wear leveling has an additional effect. Changing the content of a page in place requires an erase followed by a write, but if the new content is written to a different physical location that has already been erased, the write can be performed without erasing first. However, if erased space cannot be secured in time, this benefit reaches its limit.
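The remapping idea behind wear leveling can be sketched as a very small translation layer. This is a deliberately simplified model under stated assumptions: one page per erase block, the stale block is erased immediately after each rewrite, and the least-erased free block is always chosen. Real SSD firmware is far more involved, and the class name TinyFTL and its methods are hypothetical.

```python
class TinyFTL:
    """Toy logical-to-physical mapping with wear leveling (one page per block)."""

    def __init__(self, physical_blocks):
        self.erase_count = [0] * physical_blocks
        self.free = set(range(physical_blocks))  # blocks that are already erased
        self.mapping = {}                        # logical address -> physical block
        self.data = {}                           # physical block -> content

    def write(self, logical, content):
        if not self.free:
            # A real device would have to stall here until a block is erased.
            raise RuntimeError("no erased block available")
        # Wear leveling: pick the erased block that has been erased least often.
        target = min(self.free, key=lambda b: (self.erase_count[b], b))
        self.free.remove(target)
        self.data[target] = content              # write without erasing first
        stale = self.mapping.get(logical)
        self.mapping[logical] = target
        if stale is not None:
            # The old copy is now stale; erase it so it can be reused later.
            del self.data[stale]
            self.erase_count[stale] += 1
            self.free.add(stale)

    def read(self, logical):
        return self.data[self.mapping[logical]]


ftl = TinyFTL(physical_blocks=4)
for version in range(8):
    ftl.write(0, f"v{version}")  # rewrite the same logical address repeatedly

print("content:", ftl.read(0))
print("erase counts per physical block:", ftl.erase_count)
```

Even though the same logical address is rewritten eight times, the erasures end up spread across all four physical blocks instead of hitting one of them seven times.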
For these reasons, the IO characteristics of an SSD are very different from those of an HDD. A read on an SSD needs only data transfer time, whereas an HDD also needs seek time and rotational latency, so an SSD can deliver a consistent response time even for random reads.
But when writing, if no erased space is left during wear leveling, a sector has to be erased first, and all IO tasks for that sector are put on hold in the meantime.
DBMS - Data Buffering
The data buffering of a DBMS is a type of caching. Memory is fast and disk is much slower, and data that has been used recently and frequently is likely to be used again in the near future. Frequently used data is therefore kept in memory so that it can be read from there instead of from disk.
Unlike the disc cache of the OS, DBMS buffering can take transactions into account. In addition, although it is based on the LRU caching algorithm, it separates a hot zone from a cold zone, which cuts the cost of LRU management dramatically.
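As a rough illustration of the LRU idea underneath, here is a minimal page buffer in Python that also counts hits and misses. It is plain LRU only: the hot zone / cold zone split and the transaction awareness mentioned above are not modeled, and nothing here reflects CUBRID's actual implementation. The class name, the read_from_disk callback, and the access sequence are assumptions made for the example.

```python
from collections import OrderedDict


class LRUPageBuffer:
    """Plain LRU page cache that tracks its own hit ratio."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> data, least recently used first
        self.hits = 0
        self.misses = 0

    def fetch(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.misses += 1
        data = read_from_disk(page_id)       # the slow path: storage IO
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return data

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


buffer = LRUPageBuffer(capacity=2)
for page_id in [1, 2, 1, 3, 1, 2]:
    buffer.fetch(page_id, lambda p: f"page-{p}")  # stand-in for a disc read

print("hits:", buffer.hits, "misses:", buffer.misses)
print("hit ratio:", round(buffer.hit_ratio(), 2))
```

Every miss in this sketch corresponds to one IO request that would reach the storage medium, which is why the hit ratio matters so much in what follows.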
The overall processing time for an IO request from the DBMS can be expressed as follows.
total access time = A * HR + B * (1 - HR)
A = Access time of memory
B = Average access time of storage
HR = Buffer Hit Ratio
Therefore, the difference in processing time between HDD and SSD, which comes entirely from the B * (1 - HR) term, shrinks as the hit ratio grows. A typical DBMS runs with a hit ratio of 95% or higher.
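A quick calculation shows how fast the gap closes. The sketch below plugs hypothetical latencies into the formula above (0.1 us for memory, 10,000 us for an HDD access, 100 us for an SSD access); these are round illustrative figures, not values measured in this test.

```python
def total_access_time(memory_time, storage_time, hit_ratio):
    """total access time = A * HR + B * (1 - HR)"""
    return memory_time * hit_ratio + storage_time * (1.0 - hit_ratio)


# Hypothetical average access times in microseconds.
MEMORY_US = 0.1
HDD_US = 10_000.0
SSD_US = 100.0


def hdd_ssd_gap(hit_ratio):
    """Extra time per request on HDD compared with SSD at a given hit ratio."""
    hdd = total_access_time(MEMORY_US, HDD_US, hit_ratio)
    ssd = total_access_time(MEMORY_US, SSD_US, hit_ratio)
    return hdd - ssd


for hit_ratio in (0.50, 0.90, 0.95, 0.99):
    hdd = total_access_time(MEMORY_US, HDD_US, hit_ratio)
    ssd = total_access_time(MEMORY_US, SSD_US, hit_ratio)
    print(f"HR={hit_ratio:.2f}  HDD={hdd:8.1f} us  SSD={ssd:6.1f} us  "
          f"gap={hdd_ssd_gap(hit_ratio):8.1f} us")
```

The gap equals (B for HDD - B for SSD) * (1 - HR), so under these assumptions going from a 50% to a 99% hit ratio cuts it by a factor of 50.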
Test Environment and Scenario
These are the specifications of the DBMS server used for our tests:
- CPU: Xeon® CPU L5650 (2.27GHz 2*6 cores)
- Memory: 16 GB
- HDD: 300 GB * 2
- SSD: 64 GB * 4
- OS: CentOS 5.3 x86_64
- DBMS: CUBRID 2008 R4.1 (8.4.1.0516) (64bit release build for linux_gnu)
The YCSB (Yahoo! Cloud Serving Benchmark) program, created by Yahoo to benchmark cloud storage services, was used as the load generator. YCSB lets you choose different query patterns and distributions for analysis. In this test, select requests were sent to the DBMS using the "latest" distribution (a modified zipfian distribution with a high-density access pattern), with 60 transactions processed concurrently.
We measured TPS, hit ratio and the number of IO requests processed per second while switching the storage medium between HDD and SSD on the same equipment and increasing the data buffer from 2G to 12G in 1G steps.
The results of this test are shown below.
Graph 1: TPS Progress by Buffer Size.
As the data buffer of the DBMS grows, TPS increases for both SSD and HDD. Because HDD is relatively slow, its TPS depends heavily on the page buffer size.
The graph for SSD shows a simple proportional relationship, except for the 2G - 4G section.
The graph for HDD, on the other hand, shows an S-shaped curve with an inflection point around 10G.
Graph 2 below, compiled from CUBRID stat dumps, shows how the number of IO requests made from the DBMS to the OS changes.
With SSD, the number of IO requests follows an exponential-like curve that decreases steadily as the buffer size grows. With HDD, on the other hand, the graph has an inflection point similar to the one in the TPS graph, except that it lies around 6G.
Graph 2: IO request Progress by Buffer Size.
The hit ratio of the DBMS data buffer is shown in Graph 3. Because both media were tested with the same distribution, SSD and HDD show the same shape. Although the hit ratio is the same, the IO request graphs differ, as in Graph 2, because of the difference in processing speed.
Graph 3: Hit Ratio Progress According to the Buffer Size.
The graph has the shape of an exponential function that approaches 100% asymptotically, and its slope becomes more gradual as the buffer size increases.
Result Analysis and Implications
As the test results indicate, the performance of a DBMS is very sensitive to the hit ratio of the data buffer.
Therefore, to improve the performance of a DBMS, it is crucial that the data buffer maintains an adequate hit ratio.
The data buffer should be sized to hold the main working set. Also, the DBMS schema and the application's access pattern should be designed so that a well-defined main working set forms.
If a good hit ratio cannot be maintained in this way, then consider switching to SSD to maintain the performance above a certain level.
Let's analyze this proposition in detail, using the test results.
The workload used for the test follows a zipfian distribution, a model that has a well-formed working set.
The DBMS data buffer must then be sized to adequately cover the working set. The hit ratio as a function of data buffer size has the shape shown in Graph 3.
The slope of the graph varies with the distribution of the working set, but the graph tends to have the same shape as Graph 3 (an exponential function with 100% as its asymptote). Cost grows in direct proportion to the buffer size, while the slope of the hit ratio keeps decreasing. The performance gained per unit of cost spent on extending the data buffer therefore starts to fall.
So, judging from the hit ratio alone, 8G, where the slope of the graph becomes more gradual, should be the right data buffer size.
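The diminishing return on buffer size can be reproduced with a simplified model. The sketch below assumes an idealized cache that always holds the most popular pages, a pure zipfian popularity with exponent 1, and a hypothetical database of 12,000 pages. It does not reproduce the YCSB "latest" distribution or the measured curve in Graph 3; it only shows why each additional unit of buffer buys less hit ratio than the one before.

```python
def ideal_hit_ratio(total_pages, buffer_pages, exponent=1.0):
    """Hit ratio when the buffer holds the most popular pages of a zipfian workload."""
    # Popularity of the page at a given rank is proportional to 1 / rank**exponent.
    weights = [1.0 / rank ** exponent for rank in range(1, total_pages + 1)]
    return sum(weights[:buffer_pages]) / sum(weights)


TOTAL_PAGES = 12_000                       # hypothetical database size in pages
sizes = [1_000 * k for k in range(1, 13)]  # buffer sizes from 1,000 to 12,000 pages
ratios = [ideal_hit_ratio(TOTAL_PAGES, size) for size in sizes]
# Hit ratio gained by each additional 1,000 pages of buffer.
gains = [later - earlier for earlier, later in zip(ratios, ratios[1:])]

for size, ratio in zip(sizes, ratios):
    print(f"buffer={size:6d} pages  hit ratio={ratio:.3f}")
print("gain per extra 1,000 pages:", [round(g, 3) for g in gains])
```

In this model every extra 1,000 pages costs the same amount of memory but adds a smaller slice of hit ratio than the previous 1,000, which matches the flattening shape described above.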
Intuitively, one would expect TPS to be proportional to the hit ratio, but as you can see in Graph 1, TPS follows different patterns depending on the IO characteristics of the storage medium.
Because SSD reads are fast, its TPS declines gradually as the buffer shrinks.
HDD, however, shows a pattern with an inflection point. In the area marked with a square in Graph 1, the TPS of SSD and HDD stay proportional to each other, but beyond that area the TPS of HDD first declines steeply and then declines gradually.
This phenomenon can be linked to the trend in IO requests in Graph 2. In section A (between 10G and 12G), the HDD is more than able to process the IO requests, which increase as the hit ratio falls. So the TPS graph there only reflects the difference in physical performance compared to SSD.
But once the buffer size falls to 10G or below (section B), the HDD cannot process the increased IO requests in time and starts to stall. The slowdown in IO processing causes the steep decline in the TPS graph.
This corresponds to the state described earlier, in which the IO scheduling of the OS is not working effectively.
As the data buffer grows smaller still (section C), the number of IO requests processed per unit of time starts to increase, which means the IO scheduling of the OS becomes more effective as more IO requests accumulate. This flattens the TPS drop. However, even though more IO requests are processed, the hit ratio keeps decreasing, so the overall TPS decreases as well.
Putting the results together, the hit ratio curve starts to flatten around 8G, but TPS keeps rising steeply until it reaches its inflection point at 10G. Because of this, the performance improvement relative to the cost of the data buffer is maximized around 8G - 10G.
So in an environment like this one, it would be more cost-efficient to set the data buffer to around 10G than to change the storage medium to SSD.
By Ki Yeul Lee, Senior Engineer, DBMS Development Lab, NHN Corporation.