Multilingual support includes internationalization and localization. Internationalization makes software adaptable to various languages and regions. Localization adapts it to the language and culture of a specific area by adding language-specific components. CUBRID supports multilingual collations, including those for European and Asian languages, to facilitate localization.
Terms related to internationalization are as follows:
- Character set: A group of encoded symbols (giving a specific number to a certain symbol)
- Collation: A set of rules for comparison of characters in the character set and for sorting data
- Locale: A set of parameters that defines the user's preferences for number format, calendar format (names of months and days), date/time format, collation, and currency format, depending on the user's language and country. A locale defines the linguistic localization. The character set of a locale defines how month names and other data are encoded. A locale identifier consists of a language identifier, usually followed by a region identifier, and is expressed as language[_territory][.codeset] (for example, Australian English using UTF-8 encoding is written as en_AU.UTF-8).
- Unicode normalization: The part of the Unicode character encoding standard that deals with sequences of code points which represent essentially the same character. CUBRID uses Normalization Form C (NFC: code points are decomposed and then recomposed) for input and Normalization Form D (NFD: code points are decomposed) for output. As an exception, however, CUBRID does not apply the canonical equivalence rule.
For example, the general NFC rule applies canonical equivalence, so code point 212A (Kelvin K) is converted to code point 4B (ASCII uppercase K). To keep the normalization algorithm fast and simple, CUBRID does not perform this conversion, and it does not perform the reverse conversion either.
- Canonical equivalence: A basic equivalence between characters or sequences of characters that cannot be visually distinguished when they are correctly rendered. For example, consider 'Å'. The angstrom sign 'Å' (Unicode U+212B) and the Latin letter 'Å' ('A' with a ring above, Unicode U+00C5) look the same but have different code points. Both decompose to 'A' followed by U+030A (combining ring above), so they are canonically equivalent.
- Compatibility equivalence: A weaker equivalence between characters or sequences of characters that represent the same abstract character. For example, consider the digit '2' (Unicode U+0032) and the superscript '²' (Unicode U+00B2). '²' is a different form of the digit '2', but it is visually distinguishable and has a different meaning, so the two are not canonically equivalent. Normalizing '2²' with NFC leaves '2²' unchanged, because NFC uses only canonical equivalence. With NFKC, however, '²' is decomposed to '2' under compatibility equivalence, so the result is '22'. Unicode normalization in CUBRID does not apply the compatibility equivalence rule.
For more details on Unicode normalization, see http://unicode.org/reports/tr15/.
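As a rough illustration of the equivalences described above, the following sketch uses Python's standard unicodedata module. It shows the general Unicode NFC, NFD, and NFKC behavior, not CUBRID's implementation: as stated above, CUBRID skips the canonical equivalence conversions, so it would leave the Kelvin sign unchanged. The constant and helper names are chosen for this example only.

```python
import unicodedata

# Code points taken from the examples in the text.
KELVIN = "\u212A"           # KELVIN SIGN
ANGSTROM = "\u212B"         # ANGSTROM SIGN
A_RING = "\u00C5"           # LATIN CAPITAL LETTER A WITH RING ABOVE
SUPERSCRIPT_TWO = "\u00B2"  # SUPERSCRIPT TWO


def nfc(text):
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", text)


def nfd(text):
    """Canonical decomposition only."""
    return unicodedata.normalize("NFD", text)


def nfkc(text):
    """Compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", text)


# Canonical equivalence: standard NFC maps the Kelvin sign to ASCII 'K'.
print(nfc(KELVIN) == "K")

# Both forms of 'Å' decompose to 'A' + U+030A (combining ring above).
print(nfd(ANGSTROM) == nfd(A_RING) == "A\u030A")

# NFC keeps the superscript; NFKC folds it into a plain digit.
print(nfc("2" + SUPERSCRIPT_TWO) == "2" + SUPERSCRIPT_TWO)
print(nfkc("2" + SUPERSCRIPT_TWO) == "22")
```

Each print statement reports whether the behavior described in the text holds under the standard normalization forms.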
The default values of the system parameters related to Unicode normalization are unicode_input_normalization=no and unicode_output_normalization=no. For a more detailed description of these parameters, see Syntax/Type Related Parameter.
A CUBRID locale is defined by the following attributes.
- Charset (codeset): How bytes are interpreted into single characters (Unicode codepoints)
- Collations: Locale data may contain several collations. Among all the collations defined for a locale in the LDML file, the last one is the default collation.
- Alphabet (casing rules): One locale data may have up to two alphabets, one for identifiers and one for user data.
- Calendar: Names of weekdays, months, day periods (AM/PM)
- Numbering settings: Symbols for digit grouping, monetary currency
- Text conversion data (for CSQL conversion): Optional. See Text conversion for CSQL.
- Unicode normalization data: Data converted by normalizing several characters with the same shape into one based on a specified rule. After normalization, characters with the same shape will have the same code value even though the locale is different. Each locale can activate/deactivate the normalization functionality.
Note: Generally, a locale supports a variety of character sets. However, CUBRID locales support both ISO and UTF-8 character sets only for English and Korean. Other user-defined locales that use an LDML file support the UTF-8 character set only.
Text console conversion works in the CSQL console interface. Most locales have an associated character set (or codepage on Windows), which makes it easy to write non-ASCII characters from the console. For example, the LDML file for the tr_TR.utf8 locale contains the line:
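The language[_territory][.codeset] identifier form described earlier can be split into its parts with a small parser. The following is a minimal sketch, not CUBRID code: the function name parse_locale_id and the exact character classes in the regular expression are assumptions made for this example, and the territory and codeset parts are treated as optional because the bracket notation marks them so.

```python
import re

# language[_territory][.codeset]; the character classes are illustrative guesses.
_LOCALE_ID = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z0-9]{2,3}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
)


def parse_locale_id(identifier):
    """Split a locale identifier into (language, territory, codeset).

    Parts that are absent from the identifier are returned as None.
    """
    match = _LOCALE_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a locale identifier: {identifier!r}")
    return match.group("language"), match.group("territory"), match.group("codeset")


print(parse_locale_id("en_AU.UTF-8"))
print(parse_locale_id("tr_TR.utf8"))
print(parse_locale_id("ko"))
```

The parser only checks the shape of the identifier. Whether a given locale is actually available depends on the locale data that has been configured and generated.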
<consoleconversion type="ISO88599" windows_codepage="28599" linux_charset="iso88599,ISO_8859-9,ISO8859-9,ISO-8859-9">
If the user sets the console to one of the above settings (chcp 28599 on Windows, or export LANG=tr_TR.iso88599 on Linux), CUBRID assumes that all input is encoded in the ISO-8859-9 charset and converts all data to UTF-8. When printing results, CUBRID performs the reverse conversion (from UTF-8 to ISO-8859-9).
The setting is optional in the sense that the XML tag is not required in LDML locale file.
For example, the locale km_KH.utf8 does not have an associated codepage.
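The round trip described above (console charset to UTF-8 on input, UTF-8 back to the console charset on output) can be sketched with Python's built-in codecs. This only illustrates the idea of the conversion; it is not how CSQL implements it, and the function names console_to_utf8 and utf8_to_console are hypothetical.

```python
def console_to_utf8(raw, console_charset):
    """Re-encode bytes typed in the console charset as UTF-8."""
    return raw.decode(console_charset).encode("utf-8")


def utf8_to_console(data, console_charset):
    """Re-encode UTF-8 bytes in the console charset for display."""
    return data.decode("utf-8").encode(console_charset)


# A Turkish word as a console set to ISO-8859-9 would send it: one byte per letter.
typed = "ışık".encode("iso-8859-9")

stored = console_to_utf8(typed, "iso-8859-9")     # input direction
printed = utf8_to_console(stored, "iso-8859-9")   # output direction

print(len(typed), len(stored))  # the UTF-8 form is longer than the console form
print(printed == typed)         # the round trip restores the original bytes
```

The letters 'ı' and 'ş' take one byte each in ISO-8859-9 and two bytes each in UTF-8, which is why the stored form is longer than the typed form.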
A collation is a set of information that defines an order for characters and strings. In CUBRID, a collation has the following properties.
- Strength: A measure of how "different" basic comparable items (characters) are. It affects selectivity. In LDML files, collation strength is configurable and has four levels. For example, a case-insensitive collation should be set to level "secondary" (2) or "primary" (1).
- Expansions and contractions: Whether the collation supports them or not
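To make the idea of strength levels more concrete, the following toy model builds a comparison key whose length depends on the chosen level. It is a deliberately simplified sketch and not the algorithm CUBRID or the Unicode Collation Algorithm uses: it models only three of the four levels, and it approximates "base letter", "accent", and "case" differences with Python's unicodedata module. The function name collation_key is hypothetical.

```python
import unicodedata


def collation_key(text, strength):
    """Return a comparison key for a toy collation with levels 1 to 3.

    1 (primary):   only base letters matter.
    2 (secondary): accents also matter, case does not.
    3 (tertiary):  accents and case both matter.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (accents) and fold case to get the base letters.
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    secondary = decomposed.casefold()  # keeps accents, ignores case
    tertiary = decomposed              # keeps accents and case
    return (primary, secondary, tertiary)[:strength]


print(collation_key("Résumé", 1) == collation_key("resume", 1))  # equal at primary
print(collation_key("Résumé", 2) == collation_key("résumé", 2))  # case ignored
print(collation_key("résumé", 2) == collation_key("resume", 2))  # accents differ
print(collation_key("Résumé", 3) == collation_key("résumé", 3))  # case differs
```

In this model, levels 1 and 2 both ignore case, which matches the statement that a case-insensitive collation uses the "primary" or "secondary" level.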
Each column has a collation, so when the LOWER and UPPER functions are applied, the casing rules of the locale that defines the collation's default language are used.
Depending on collation properties some CUBRID optimizations may be disabled for some collations:
- LIKE rewrite: disabled for collations which map several different characters to the same weight (case-insensitive collations, for example) and for collations with expansions.
- Covering index scan: disabled for collations which map several different characters to the same weight (see Using Indexes > Covering Index).
- Prefix index: cannot be created on columns using collation with expansions.
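The text above does not explain why these optimizations are disabled. The following sketch shows only one general difficulty, under the assumption that a LIKE prefix pattern is rewritten into a range over raw code values: when a collation gives several characters the same weight, such a range no longer returns the same rows as the pattern match. Both function names are hypothetical, and neither reflects CUBRID's actual query rewriter.

```python
def like_prefix_case_insensitive(values, prefix):
    """What LIKE 'prefix%' should match under a case-insensitive collation."""
    folded = prefix.casefold()
    return [v for v in values if v.casefold().startswith(folded)]


def like_prefix_as_binary_range(values, prefix):
    """A range rewrite on raw code values: prefix <= v < next_prefix."""
    next_prefix = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return [v for v in values if prefix <= v < next_prefix]


rows = ["abc", "ABC", "Abd", "ac"]

print(like_prefix_case_insensitive(rows, "ab"))
print(like_prefix_as_binary_range(rows, "ab"))
```

The case-insensitive match returns three rows, while the range on raw code values returns only "abc", so the two forms are not interchangeable for such a collation.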
Locale Save Location
CUBRID uses the following directories and files to set the locales.
- $CUBRID/conf/cubrid_locales.txt file: A configuration file containing the list of locales to be supported
- $CUBRID/conf/cubrid_locales.all.txt file: A configuration file template with the same structure as cubrid_locales.txt. It contains the entire list of locales that the current version of CUBRID can support without any effort from the end user.
- $CUBRID/locales/data directory: This contains files required to generate locale data.
- $CUBRID/locales/loclib directory: Contains a C header file, locale_lib_common.h, and an OS-dependent makefile, which are used in the process of generating the locale shared libraries.
- $CUBRID/locales/data/ducet.txt file: Text file containing default universal collation information (codepoints, contractions and expansions, to be more specific) and their weights, as standardized by The Unicode Consortium, which is the starting point for the creation of collations. For more information, see http://unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table.
- $CUBRID/locales/data/unicodedata.txt file: Text file containing information about each Unicode codepoint regarding casing, decomposition, normalization etc. CUBRID uses this to determine casing. For more information, see http://www.ksu.ru/eng/departments/ktk/test/perl/lib/unicode/UCDFF301.html.
- $CUBRID/locales/data/ldml directory: XML files, named with the convention cubrid_<locale_name>.xml, containing locale information in human-readable XML format (LDML: Locale Data Markup Language); one file for each supported language.
- $CUBRID/locales/data/codepages directory: Contains codepage console conversion data for single-byte codepages (8859-1.txt, 8859-15.txt, 8859-9.txt) and for double-byte codepages (CP1258.txt, CP923.txt, CP936.txt, CP949.txt).
- $CUBRID/bin/make_locale.sh file or %CUBRID%\bin\make_locale.bat file: A script file used to generate shared libraries for locale data
- $CUBRID/lib directory: Shared libraries for generated locales will be stored here.
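The layout above can be summarized in a small helper that builds the expected paths for a given installation root and locale name. This is only a sketch for checking an installation by hand: the function name locale_paths and the dictionary keys are made up for this example, and the installation root passed in is whatever $CUBRID points to on the machine.

```python
from pathlib import Path


def locale_paths(cubrid_root, locale_name):
    """Return the locale-related paths listed above for one locale."""
    root = Path(cubrid_root)
    data = root / "locales" / "data"
    return {
        "locale_list": root / "conf" / "cubrid_locales.txt",
        "locale_list_template": root / "conf" / "cubrid_locales.all.txt",
        "ducet": data / "ducet.txt",
        "unicode_data": data / "unicodedata.txt",
        # LDML files follow the cubrid_<locale_name>.xml naming convention.
        "ldml": data / "ldml" / f"cubrid_{locale_name}.xml",
        "codepages_dir": data / "codepages",
        "loclib_dir": root / "locales" / "loclib",
        "shared_lib_dir": root / "lib",
    }


# "/opt/cubrid" is a placeholder root used only for this example.
for name, path in locale_paths("/opt/cubrid", "tr_TR").items():
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```

Running the loop against a real installation root gives a quick view of which of the listed files and directories are present.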