Open Source RDBMS - Seamless, Scalable, Stable and Free

한국어 | Login |Register

Versions available for this page: CUBRID 9.0.0 | 

Specific Operations Related to Collation

LIKE Operation

The LIKE conditional expression compares patterns between string data, and returns TRUE if a string whose pattern matches the search word is found.

As already proven above, when using a '''collation without expansion support''', each codepoint will receive a single integer value, representing its weight in the comparison process. This weight value is computed based on collation settings (strength, casing etc.). Due to the fact that characters can always be regarded as single entities, trying to match a string with a pattern using the LIKE predicate is equivalent to checking if the string can be found in a certain range of strings. For example in order to process a predicate such as ''s LIKE 'abc%' '', Cubrid will first rewrite it as a range restriction for the string ''s''. ''s LIKE 'abc%'' means that ''s'' must start with the string ''abc''. In terms of string comparison, this is equvalent, in expansion-free collations, with ''s'' being greater than ''abc'', but smaller than its successor (using the English alphabet, ''abc''s successor would be ''abd'').

s LIKE 'abc%' → s ≥ 'abc' AND s < 'abd' (if using strictly the English aphabet)

This way, the actual interpretation of LIKE is replaced with simple comparisons, but ''Collations with expansion support'' behave differently. As described above, if a collation supporting expansions is used, single weight values are no longer calculated for each codepoint based on DUCET, but the information from their corresponding collation element list is stored with original values (even though it is compressed). To compare strings when using such a collation means comparing the concatenated lists of collation elements for each codepoint or expansion, level by level.

If the LIKE predicate rewrite method is kept the same as in a collation with no expansion support as above example, the comparison result can be wrong. To ensure the right query result, the LIKE predicate rewrite method is ran differently as the below example. That is, the LIKE predicate is added as a filter to exclude the wrong data which can be added in a collation with expansion.

s LIKE 'abc%' → s ≥ 'abc' AND s < 'abd' and s LIKE 'abc%' (if using strictly the English aphabet)

Prefix Index and Collation Expansion

A prefix index can be created on the collation without expansion; however, it cannot be created on the column which has the collation with expansion.

CREATE TABLE t1 (s1 VARCHAR(200) COLLATE utf8_ja_exp);

CREATE INDEX idx_t_s1 on t(s1(5)); -> not allowed : error

Index Covering

Covering index scan  is query optimization, in which if all values in query can be computed using only the values found BTREE+ index, without requiring additional row lookup in heap file.

For two strings values, ‘abc’ and ‘ABC’, only one value is stored in the BTREE+ index (this is either ‘abc’ or ‘ABC’ depending which one was inserted first), along with the count of heap values to which it corresponds (in this case, 2). A query using covering index optimization will return this index value twice instead of returning the two original values. As a generic rule, this may happen when at least two different strings produce the same sort key in a given collation. For this reason, for all UTF-8 collations with strength level less than 4 (quaternary) the index covering query optimization is disabled. This is controlled by strength="tertiary/quaternary" in <strength> tag of collation definition in LDML. Even with quaternary strength, there are some situations in which different strings produce same keys. These are acceptable cases in which different codepoints (but with similar graphical symbols) maps to the same weight value.

For more information about collations, see Globalization > Overview.

For more information about covering index, see CUBRID SQL Guide > Query Optimization > Using INDEX > Covering Index.