Versions available for this page: CUBRID 9.0.0
A collation is a set of rules that defines an order for characters and strings. One common type of collation is alphabetization.
In CUBRID, collations are supported for a number of European and Asian languages. In addition to the different alphabets, some of these languages require expansions or contractions to be defined for certain characters or character groups. Most of these aspects have been brought together by the Unicode Consortium in The Unicode Standard (up to version 6.1.0 in 2012). Most of the collation information is stored in the DUCET (Default Unicode Collation Element Table) file, http://www.unicode.org/Public/UCA/latest/allkeys.txt, which contains all characters required by most languages.
Most of the codepoints represented in DUCET are in the range 0 - FFFF, but codepoints beyond this range are also included. CUBRID ignores the latter and uses only the codepoints in the range 0 - FFFF (or up to a lower value, if configured).
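The range restriction above can be pictured as a simple filter. The following is a minimal sketch, not CUBRID's actual code: the names MAX_CODEPOINT and keep_entry are hypothetical, and dropping a multi-codepoint entry when any one of its codepoints is out of range is an assumption.

```python
# Hypothetical constant: the upper bound mentioned in the text (may be configured lower).
MAX_CODEPOINT = 0xFFFF

def keep_entry(codepoints, max_codepoint=MAX_CODEPOINT):
    """Return True if every codepoint of a DUCET entry is within the supported range.

    Assumption: an entry with several codepoints (a contraction) is dropped
    as soon as one of its codepoints exceeds the limit.
    """
    return all(cp <= max_codepoint for cp in codepoints)

entries = [[0x1100], [0x1D400], [0x0061, 0x0301]]
supported = [entry for entry in entries if keep_entry(entry)]
```

Here the entry for U+1D400 is discarded because it lies beyond FFFF, while the other two entries are kept.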
Each codepoint in DUCET has one or more 'collation elements' attached to it. A collation element is a set of four numeric values, representing weights for 4 levels of comparison. Weight values are in range 0 - FFFF.
In DUCET, a character is represented on a single line, in the form:
< codepoint_or_multiple_codepoints > ; [.W1.W2.W3.W4][....].... # < readable text explanation of the symbol/character >
A Korean character kiyeok is represented as follows:
1100 ; [.313B.0020.0002.1100] # HANGUL CHOSEONG KIYEOK
In this example, 1100 is the codepoint, [.313B.0020.0002.1100] is one collation element, 313B is the weight of Level 1, 0020 is the weight of Level 2, 0002 is the weight of Level 3, and 1100 is the weight of Level 4.
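The line format described above can be read mechanically. The sketch below is illustrative only: parse_ducet_line is a hypothetical helper and not part of CUBRID. It splits one entry into its codepoints, its collation elements (each a tuple of four weights), and the trailing comment. The pattern also accepts a leading `*` inside an element, which is an assumption about how some elements are marked in the real file.

```python
import re

# One collation element such as [.313B.0020.0002.1100]; four hexadecimal weights.
ELEMENT_RE = re.compile(
    r"\[[.*]([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\]"
)

def parse_ducet_line(line):
    """Return (codepoints, elements, comment) for one DUCET-style entry line."""
    body, _, comment = line.partition("#")            # text after '#' is the description
    codepoint_part, _, element_part = body.partition(";")
    codepoints = [int(cp, 16) for cp in codepoint_part.split()]
    elements = [
        tuple(int(weight, 16) for weight in match.groups())
        for match in ELEMENT_RE.finditer(element_part)
    ]
    return codepoints, elements, comment.strip()

cps, elems, comment = parse_ducet_line(
    "1100 ; [.313B.0020.0002.1100] # HANGUL CHOSEONG KIYEOK"
)
```

For the kiyeok line, cps holds the single codepoint 0x1100 and elems holds one element whose four weights are 0x313B, 0x0020, 0x0002 and 0x1100.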
Expansion support, as a functional property, means that a composed character can be interpreted as the sequence of characters it is made of. An obvious example is interpreting the character "æ" in the same way as the two-character string "ae". This is an expansion. In DUCET, expansions are represented by using more than one collation element for a codepoint or contraction. By default, CUBRID has expansions disabled. When a collation has expansions, comparing two strings requires several passes (up to the collation strength/level).
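To make the multi-pass idea concrete, here is a minimal sketch of a level-by-level comparison. It is not CUBRID's implementation. The table TOY_WEIGHTS and its weight values are invented for illustration and are not taken from DUCET, and the function names are hypothetical. The character "æ" is given two collation elements, which is how an expansion is expressed.

```python
# Toy weight table: each character maps to one or more (L1, L2, L3) elements.
# The values are invented for illustration only.
TOY_WEIGHTS = {
    "a": [(0x0100, 0x0020, 0x0002)],
    "e": [(0x0200, 0x0020, 0x0002)],
    "æ": [(0x0100, 0x0020, 0x0004), (0x0200, 0x0020, 0x0004)],  # expansion
}

def level_weights(text, level):
    """Collect the non-zero weights of one level across all elements of text."""
    weights = []
    for ch in text:
        for element in TOY_WEIGHTS[ch]:
            if element[level] != 0:
                weights.append(element[level])
    return weights

def compare(s1, s2, strength=3):
    """Return -1, 0 or 1, making one pass per level up to `strength`."""
    for level in range(strength):
        w1, w2 = level_weights(s1, level), level_weights(s2, level)
        if w1 != w2:
            return -1 if w1 < w2 else 1
    return 0
```

With this table, compare("æ", "ae", strength=1) finds the two strings equal, because their Level 1 weights match. At strength 3 a later pass tells them apart through the Level 3 weights.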