Background Image

BLOG

?

Shortcut

PrevPrev Article

NextNext Article

Larger Font Smaller Font Up Down Go comment Print Attachment

Written by SeHun Park on 11/09/2021

- HASH SCAN


Hash Scan is a scan method for hash join. Hash Scan is applied in view or hierarchical query. When a subquery such as view is joined as inner, index scan cannot be used. In this case, performance degradation occurs due to repeated inquiry of a lot of data. In this situation, Hash Scan is used.

hash scan vs nl.jpg

The picture above shows the difference between Nested Loop join and Hash Scan in the absence of an index. In the case of NL join, the entire data of INNER is scanned as many as the number of rows of OUTER. In contrast, Hash Scan scans INNER data once when building a hash data structure and scans OUTER once when searching. Therefore, you can search for the desired data relatively very quickly.

 

Here, the internal structure of Hash Scan is written as the flow of the program development process.

 

 

 

- IN-MEMORY HASH SCAN


CUBRID's Hash Scan uses in-memory, hybrid, and file hash data structures depending on the amount of data. First, let's look at the in-memory structure.

 

The advantage of the in-memory hash scan structure is there is no performance degradation during random access. However, the disadvantage of this structure is that the memory size is limited. It cannot be used in all cases due to its disadvantages, but it is the fastest method due to its advantages. Because of its advantage, it is suitable for chaining hash structures.

 

in-memory hash table.jpg

If there is a collision of hash key values, a new entry is put into the next pointer. It is a simple and fast structure. However, when implemented in file format, problems with random access or space utilization may occur. More details about this will be discussed in the following File Hash Structure section. CUBRID performs in-memory hash scan only within a limited size. You can change the limit size using the max_hash_list_scan_size system parameter.

 

At this stage, rather than implementing an in-memory hash data structure, it was necessary to analyze OPTIMIZER and EXECUTOR and think more about which part should be modified. You can refer to the following link for details about it. Moreover, you can check the design-related contents in JIRA and the results of code review in GIT.

 

 

JIRA: http://jira.cubrid.org/browse/CBRD-23665

GIT: https://github.com/CUBRID/cubrid/pull/2389

 

 

- HYBRID HASH SCAN


This is a method of storing the OID (Object Identifier) of the temp file, not DATA, in the value of the in-memory hash data structure.
memory hash table with temp file.jpg

 

Because OIDs are smaller than DATA, in-memory hash data structures can be used in larger data sets. This method is relatively slower than the in-memory hash method because data from the temp file must be read at the time of lookup. This is the second scan method considered in the hash scan. Check out the link below for more details.

 

JIRA: http://jira.cubrid.org/browse/CBRD-23828

GIT: https://github.com/CUBRID/cubrid/pull/2537

 

 

- FILE HASH SCAN


A scan method that uses the file hash data structure. The extendible hash data structure is applied.

 


extendible hash.jpg

 

 

The diagram above shows the operation of the extendible hash algorithm. When overflow occurs, the bucket is divided. Because it operates in this way of partitioning, it is an algorithm that can maintain the bucket space utilization rate above 50%. Since one bucket is implemented as a page, which is the smallest unit of disk I/O, the higher the bucket space utilization, the lower the disk I/O. For this reason, file hash scans use an extendible hash algorithm.

 

 


file hash scan.jpg

 

 

This is the implementation of the extendible hash data structure in CUBRID. The Directory file stores the VPID, which is the Page Identifier. One Bucket is implemented as one page. The data in the Bucket is sorted, so the lookup uses a binary search.

 

One drawback of extendible hash data structures is that there are no exceptions for duplicate data values. For example, if overflow occurs because the same value is all stored in one bucket, it is an algorithm that can no longer be stored. For this, a new Duplicate Key Bucket is created and added in the form of chaining. If more than a certain amount of data is duplicated, the data is moved to the DK bucket. Through this, a file hash scan with excellent space utilization while being able to flexibly store duplicate values is completed. Visit the following links for a detailed explanation.

 

 

JIRA: http://jira.cubrid.org/browse/CBRD-23816

GIT: https://github.com/CUBRID/cubrid/pull/2781

 

 

- HASH SCAN for Hierarchical Queries


A hierarchical query has a special limitation in that it is necessary to perform a lookup between the hierarchies after the join. Because of this, index scans cannot be used for hierarchical queries with joins. What you need in this situation is a hash scan, right? It has been modified so that hash scan can be used for hierarchical queries as well. Check the link below for more details.

 

JIRA: http://jira.cubrid.org/browse/CBRD-23749

GIT: https://github.com/CUBRID/cubrid/pull/2520

 

 

- HASH JOIN


The in-memory hash scan is applied in the CUBRID11 version, and the file hash scan is applied in the CUBRID11.2 version which will be released soon. The hash join function is currently under development. The development of the hash join function is to add a new join method to OPTIMIZER. Currently, there are Nested Loop join and Sort Merge Join in CUBRID. The CUBRID development team is planning to improve the overall OPTIMIZER. OPTIMIZER will be able to generate a more optimal execution plan. And with that work, a hash join method will be added. Before the hash join is added, the execution plan cannot check whether a hash scan is used. Instead, you can check whether Hash Scan is used in the trace information.

 

trace.jpg

 


- HASH SCAN Performance


In situations where a hash scan is required, the query performance has become incomparably faster than before.

 

Performance of hash scan.jpg

 

The performance is greatly improved compared to the previous case in cases where subqueries are joined as inner or hierarchical queries with joins. CUBRID analyzes the causes of several other cases and reflects improvements to improve query performance. Among these improvements, there are REWRITER improvements such as View Merging and Subquery unnest, and improvements related to View Merging are currently in progress. Next, we will learn how to transform a query in DBMS and why rewrite techniques such as View Merging and Subquery unnest are necessary.


  1. Monitoring CUBRID through Scouter

    Written by TaeHwan Seo on 01/18/2022 CUBRID users can monitor items in CUBRID through the Scouter. It was developed based on CUBRID 11.0 version. Full features are available from CUBRID 10.2.1 Version. Scouter (Server, Client) is available from version 2.15.0, bug fixes and features will be added by participating in Scouter GitHub in the future. The latest version of Scouter (as in 2022.01.18) is Scouter 2.15.0, Multi Agent support and bug fixes are currently in the PR stage. 1. What is Scouter? Scouter is an Open Source Application Performance Management (APM), it provides monitoring function for applications and OS. Scouter Basic Configuration Scouter-provided Information ​- WAS Basic Information Response speed/profiling information for each request, number of server requests/number of re...
    Read More
  2. QUERY CACHE Hint

    Written by MinJong Kim on 12/09/2021 ABOUT QUERY CACHE With the release of CUBRID 11.0, the CUBRID DBMS supports QUERY CACHE hint. In this article, we will take some time to look at QUERY CACHE. 1. What is Query Cache? Query Cache is a DBMS feature that stores the statements together with the retrieved record set in memory using the SELECT query statement and returns the previously cached values when the identical query statement is requested. The query cache can be useful in an environment where you have tables that do not change very often and for which the server receives many identical queries. Queries using the QUERY_CACHE hint are cached in a dedicated memory area, and the results are also cached in separate disk space. Query Cache Features 1. The QUERY_CACHE hint only applies to SELE...
    Read More
  3. CUBRID INSIDE: HASH SCAN Method

    Written by SeHun Park on 11/09/2021 - HASH SCAN Hash Scan is a scan method for hash join. Hash Scan is applied in view or hierarchical query. When a subquery such as view is joined as inner, index scan cannot be used. In this case, performance degradation occurs due to repeated inquiry of a lot of data. In this situation, Hash Scan is used. The picture above shows the difference between Nested Loop join and Hash Scan in the absence of an index. In the case of NL join, the entire data of INNER is scanned as many as the number of rows of OUTER. In contrast, Hash Scan scans INNER data once when building a hash data structure and scans OUTER once when searching. Therefore, you can search for the desired data relatively very quickly. Here, the internal structure of Hash Scan is written as the fl...
    Read More
  4. Converting PL/SQL to CUBRID Java SP using ANTLR and StringTemplate

    Written by Youngjin Joo on 09/30/2021 CUBRID DBMS (hereinafter 'CUBRID') does not support PL/SQL. If you want to continue your project by creating functions or subprograms with PL/SQL syntax in CUBRID, you need to convert them to Java Stored Function/Procedure (hereinafter 'Java SP'). Database developers, administrators, and engineers are often familiar with PL/SQL syntax but not with programming languages. In addition, application development depends very little on the DBMS used, but converting PL/SQL to Java SP seems difficult because it feels like you're developing a new system. Therefore, while I am looking for an easy way to convert PL/SQL to Java SP, I found out about ANTLR. ANTLR is a tool for generating parsers. With the help of contributors around the world, ANT...
    Read More
  5. CUBRID Internal: Storage Management (Disk Manager, File Manager)

    Written by Jaeeun, Kim on 08/11/2021 Introduction Database, just as its name implies, it needs spaces to store data. CUBRID, the open source DBMS that operates for the operating system allocates as much space as needed from the operating system and uses it efficiently as needed. In this article, we will talk about how CUBRID internally manages the storage to store data in the persistent storage device. Through this article, we hope developers can access the open source database CUBRID more easily. - The content of this article is based on version 10.2.0-7094ba. (However, it seems to be no difference in the latest develop branch, 11.0.0-c83e33. ) CUBRID Storage Management The CUBRID server has multiple modules that operate and manage data complexly and sophisticatedly. Among them, there are ...
    Read More
Board Pagination Prev 1 2 3 4 5 6 Next
/ 6

Join the CUBRID Project on