Versions available for this page: CUBRID 9.0.0

Contraction and Expansion of Collation

CUBRID supports contraction and expansion for collation. Contraction and expansion are available for UTF-8 charset collation.

Contraction and expansion are configured in the collation settings of the LDML file. Enabling them affects the size of the locale data (shared library) and server performance.

Contraction

A contraction is a sequence consisting of two or more codepoints, considered a single letter in sorting. For example, in the traditional Spanish sorting order, "ch" is considered a single letter. All words that begin with "ch" sort after all other words beginning with "c", but before words starting with "d". Other examples of contractions are "ch" in Czech, which sorts after "h", and "lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n" respectively.
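
The effect of a contraction on sort order can be sketched in a few lines of Python. This is only an illustration, not CUBRID's implementation: the `ALPHABET` list is a hypothetical, heavily truncated stand-in for the traditional Spanish alphabet, and `collation_units` and `sort_key` are names invented for this sketch.

```python
# Hypothetical, truncated alphabet for traditional Spanish sorting:
# "ch" is a single letter placed between "c" and "d".
ALPHABET = ["a", "b", "c", "ch", "d", "e", "h", "i", "o", "r"]
WEIGHT = {letter: rank for rank, letter in enumerate(ALPHABET)}
MAX_LEN = max(len(letter) for letter in ALPHABET)


def collation_units(text):
    """Split text into sortable units, preferring the longest contraction."""
    units = []
    i = 0
    while i < len(text):
        for size in range(MAX_LEN, 0, -1):
            candidate = text[i:i + size]
            if candidate in WEIGHT:
                units.append(candidate)
                i += len(candidate)
                break
        else:
            raise ValueError(f"no weight defined for {text[i]!r}")
    return units


def sort_key(text):
    """One weight per unit, so the contraction "ch" gets a single weight."""
    return [WEIGHT[unit] for unit in collation_units(text)]


words = ["dia", "chico", "cera", "coche", "carro"]
print(sorted(words, key=sort_key))
```

With this table, "chico" sorts after "coche" because "ch" is weighted as one letter that follows "c", whereas a plain codepoint sort would place "chico" before "coche".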

See http://userguide.icu-project.org/collation/concepts for additional information.

Some contractions are also defined in DUCET: http://www.unicode.org/Public/UCA/latest/allkeys.txt

Contractions are supported in both collation variants: with expansions and without expansions. Supporting contractions requires changes in a significant number of key areas and involves storing a contraction table inside the collation data. The handling of contractions is controlled by two LDML parameters in the <settings> tag of the collation definition: DUCETContractions="ignore/use" and TailoringContractions="ignore/use". The first controls whether contractions in the DUCET file are loaded into the collation. The second controls whether contractions defined by rules in the LDML file are ignored or used, which is easier than adding and deleting all the rules that introduce contractions.

Expansion

Expansions refer to codepoints which have more than one collation element. Enabling expansions in CUBRID radically changes the collation's behavior, as described below. The CUBRIDExpansions="use" parameter controls this behavior.
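
As a rough illustration of where these parameters live, the following Python sketch reads the three attributes from a <settings> element. The XML fragment is hypothetical (a real collation definition carries other attributes that are not shown here), `collation_options` is a name invented for this sketch, and treating a missing attribute as "off" is an assumption, not documented CUBRID behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical <settings> element using only the attribute names from this page.
SETTINGS_XML = (
    '<settings DUCETContractions="use" '
    'TailoringContractions="ignore" '
    'CUBRIDExpansions="use"/>'
)


def collation_options(settings_xml):
    """Return which of the three features are switched on."""
    attrs = ET.fromstring(settings_xml).attrib
    names = ("DUCETContractions", "TailoringContractions", "CUBRIDExpansions")
    # Assumption: any value other than "use", or a missing attribute, means off.
    return {name: attrs.get(name) == "use" for name in names}


print(collation_options(SETTINGS_XML))
```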

Collation without Expansion

In a collation without expansions, each codepoint is treated independently. Depending on the strength of the collation, the alphabet may or may not be fully sorted. The collation algorithm sorts the codepoints by comparing their weights across a set of levels, and then generates a single value that represents the weight of each codepoint. String comparison is then straightforward: comparing two strings in an expansion-free collation means comparing them codepoint by codepoint using the computed weight values.
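
The idea of collapsing multi-level weights into one value per codepoint can be sketched as follows. This is a simplified model under stated assumptions, not CUBRID's actual algorithm: the weights are the DUCET values quoted in the Example section of this page, the helper names (`level_key`, `single_weights`, `compare`) are invented, and the "uppercase first" tertiary setting is not modelled.

```python
# codepoint -> collation elements as (primary, secondary, tertiary) weights
DUCET = {
    "A": [(0x15A3, 0x0020, 0x0008)],
    "R": [(0x1770, 0x0020, 0x0008)],
    "r": [(0x1770, 0x0020, 0x0002)],
    "Ä": [(0x15A3, 0x0020, 0x0008), (0x0000, 0x0047, 0x0002)],
}


def level_key(elements, strength):
    """Per-level weight tuples for one codepoint, limited to `strength` levels."""
    return tuple(
        tuple(ce[level] for ce in elements if ce[level] != 0)
        for level in range(strength)
    )


def single_weights(table, strength):
    """Rank the distinct keys; codepoints with equal keys share one weight."""
    keys = {ch: level_key(ces, strength) for ch, ces in table.items()}
    ranks = {key: rank for rank, key in enumerate(sorted(set(keys.values())))}
    return {ch: ranks[key] for ch, key in keys.items()}


def compare(left, right, weights):
    """Codepoint-by-codepoint comparison; returns -1, 0 or 1."""
    a = [weights[ch] for ch in left]
    b = [weights[ch] for ch in right]
    return (a > b) - (a < b)


primary = single_weights(DUCET, strength=1)
secondary = single_weights(DUCET, strength=2)
print(compare("Ar", "Är", primary), compare("Ar", "Är", secondary))
```

At primary strength "A" and "Ä" receive the same weight, so the two strings compare as equal. At secondary strength "Ä" receives a larger weight than "A", so "Ar" sorts first.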

Collation with Expansion

In a collation with expansions, some composed characters (codepoints) are to be interpreted as an ordered list of other characters (codepoints). For example, 'æ' might need to be interpreted the same way as 'ae', 'ä' as 'ae', or 'å' as 'aa'. In DUCET, the collation element list of 'æ' is the concatenation of the collation element lists of 'a' and 'e', in this order. Deciding a particular order for the codepoints is no longer possible, and neither is computing a new weight value for each character/codepoint.

In a collation with expansions, string comparison is done by concatenating the collation elements for the codepoints/contractions in two lists (for the two strings) and then comparing the weights in those lists for each level.

Example

The purpose of these examples is to show that under different collation settings (with or without expansion support), string comparison might yield different results.

The following lines from DUCET correspond to the subset of codepoints used for the comparisons in the examples below.

0041  ; [.15A3.0020.0008.0041] # LATIN CAPITAL LETTER A

0052  ; [.1770.0020.0008.0052] # LATIN CAPITAL LETTER R

0061  ; [.15A3.0020.0002.0061] # LATIN SMALL LETTER A

0072  ; [.1770.0020.0002.0072] # LATIN SMALL LETTER R

00C4  ; [.15A3.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS;

00E4  ; [.15A3.0020.0002.0061][.0000.0047.0002.0308] # LATIN SMALL LETTER A WITH DIAERESIS;

Three types of settings for the collation will be illustrated:

  • Primary strength, no casing (level 1 only)
  • Secondary strength, no casing (levels 1 and 2)
  • Tertiary strength, uppercase first (levels 1, 2 and 3)

The strings ''Ar'' and ''Är'' will be sorted under each of these settings.

Collation without Expansions Support

When expansions are disabled, each codepoint is reassigned a new single-valued weight. Based on the algorithm described above, the weights for A, Á, Ä, R and their lowercase counterparts give the following order of codepoints for each of the collation settings listed above.

  • Primary strength: A = Ä < R = r
  • Secondary strength: A < Ä < R = r
  • Tertiary strength: A < Ä < R < r

The sort order for the chosen strings is easy to decide, since there are computed weights for each codepoint.

  • Primary strength: ''Ar'' = ''Är''
  • Secondary strength: ''Ar'' < ''Är''
  • Tertiary strength: ''Ar'' < ''Är''

Collation with Expansions

The sort order changes for a collation with expansions.

Based on DUCET, the concatenated lists of collation elements for the strings from our samples are provided below:

Ar [.15A3.0020.0008.0041][.1770.0020.0002.0072]

Är [.15A3.0020.0008.0041][.0000.0047.0002.0308][.1770.0020.0002.0072]

On the first pass, the level 1 weights are compared: 0x15A3 with 0x15A3. In the second iteration, the 0x0000 weight is skipped, and 0x1770 is compared with 0x1770. Since the strings are identical so far, the comparison continues with the level 2 weights: first 0x0020 with 0x0020, then 0x0020 with 0x0047, yielding ''Är'' > ''Ar''. This example shows how string comparison is done when using a collation with expansion support.
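
The level-by-level comparison described above can be sketched in Python as follows. This is a minimal model, not CUBRID's implementation: `CE_TABLE` holds only the DUCET entries quoted on this page, and `collation_elements` and `compare` are names invented for the sketch.

```python
# codepoint -> collation elements as (level 1, level 2, level 3) weights
CE_TABLE = {
    "A": [(0x15A3, 0x0020, 0x0008)],
    "r": [(0x1770, 0x0020, 0x0002)],
    "Ä": [(0x15A3, 0x0020, 0x0008), (0x0000, 0x0047, 0x0002)],
}


def collation_elements(text, table):
    """Concatenate the collation elements of every codepoint in text."""
    return [ce for ch in text for ce in table[ch]]


def compare(left, right, table, strength):
    """Compare level by level, skipping zero weights; returns -1, 0 or 1."""
    left_ces = collation_elements(left, table)
    right_ces = collation_elements(right, table)
    for level in range(strength):
        a = [ce[level] for ce in left_ces if ce[level] != 0]
        b = [ce[level] for ce in right_ces if ce[level] != 0]
        if a != b:
            return -1 if a < b else 1
    return 0


print(compare("Ar", "Är", CE_TABLE, strength=1))
print(compare("Ar", "Är", CE_TABLE, strength=2))
```

With only level 1 considered, the two strings compare as equal because the 0x0000 weight is skipped. Once level 2 is included, 0x0020 is compared with 0x0047 and "Ar" sorts before "Är".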

Let us change the collation settings, and show how one may obtain a different order for the same strings when using a collation for German, where ''Ä'' is supposed to be interpreted as the character group ''AE''.

The codepoints and collation elements of the characters involved in this example are as follows.

0041 ; [.15A3.0020.0008.0041] # LATIN CAPITAL LETTER A

0045 ; [.15FF.0020.0008.0045] # LATIN CAPITAL LETTER E

0072 ; [.1770.0020.0002.0072] # LATIN SMALL LETTER R

00C4 ; [.15A3.0020.0008.0041][.15FF.0020.0008.0045] # LATIN CAPITAL LETTER A WITH DIAERESIS; EXPANSION

When the strings ''Är'' and ''Ar'' are compared using a collation with expansion support, the algorithm compares the simulated concatenations of the collation element lists for the characters in the two strings.

Ar [.15A3.0020.0008.0041][.1770.0020.0002.0072]

Är [.15A3.0020.0008.0041][.15FF.0020.0008.0045][.1770.0020.0002.0072]

On the first pass, when comparing level 1 weights, 0x15A3 is compared with 0x15A3, then 0x1770 with 0x15FF, where a difference is found. Because 0x15FF is smaller than 0x1770, this comparison yields ''Är'' < ''Ar'', a result completely different from the one in the previous example.
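
An equivalent way to look at both examples is to build a sort key per string and let ordinary tuple comparison do the work. The sketch below assumes the two collation element tables quoted on this page. `sort_key` is an invented helper, and the zero used as a level separator is a convention of this sketch, not something taken from CUBRID.

```python
DUCET = {
    "A": [(0x15A3, 0x0020, 0x0008)],
    "r": [(0x1770, 0x0020, 0x0002)],
    "Ä": [(0x15A3, 0x0020, 0x0008), (0x0000, 0x0047, 0x0002)],
}

# Same table, except that "Ä" expands to the elements of "A" followed by "E".
GERMAN = dict(DUCET)
GERMAN["Ä"] = [(0x15A3, 0x0020, 0x0008), (0x15FF, 0x0020, 0x0008)]


def sort_key(text, table, strength=3):
    """Level 1 weights, then level 2, then level 3, skipping zero weights."""
    ces = [ce for ch in text for ce in table[ch]]
    key = []
    for level in range(strength):
        key.extend(ce[level] for ce in ces if ce[level] != 0)
        key.append(0)  # level separator, lower than any real weight
    return tuple(key)


strings = ["Är", "Ar"]
print(sorted(strings, key=lambda s: sort_key(s, DUCET)))
print(sorted(strings, key=lambda s: sort_key(s, GERMAN)))
```

Under the DUCET table "Ar" sorts first, while under the German-style table "Är" sorts first, which mirrors the two results derived above.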