Globalization

Globalization includes internationalization and localization. Internationalization can be applied to various languages and regions. Localization fits the language and culture in a specific area as appending the language-specific components. CUBRID supports multilingual collations including Europe and Asia to facilitate the localization.

If you want to know overall information about character data setting, see Configuration Guide for Characters.

If you want to know about charset, collation and locale, see An Overview of Globalization.

If you want to apply the wanted locale to the database, you have to set the locale firstly, then create the database. Regarding this setting, see Locale Setting.

If you want to change the collation or charset specified on the database, specify COLLATE modifier or CHARSET modifier to the column, table, expression, and specify COLLATE modifier or Charset Introducer to the string literal. Regarding this setting, see Collation.

The functions or operators related to strings can work differently by charset and collation. Regarding this, see Operations Requiring Collation and Charset.

An Overview of Globalization

Character data

Character data (strings) may be stored with VARCHAR(STRING), CHAR, ENUM, and they support charset and collation.

Charset(character set, codeset) controls how characters are stored (on any type of storage) as bytes, or how a series of bytes forms a character. CUBRID supports ISO-88591-1, UTF-8 and EUC-KR charsets. For UTF-8, we support only the Unicode characters up to codepoint 10FFFF (encoded on up to four bytes). For instance, the character "Ç" is encoded in codeset ISO-8859-1 using a single byte (C7), in UTF-8 is encoded with 2 bytes (C3 87), while in EUC-KR this character is not available.

Collation decides how strings compare. Most often, users require case insensitive and case sensitive comparisons. For instance, the strings "ABC" and "abc" are equal in a case insensitive collation, while in a case sensitive collation, they are not, and depending on other collation settings, the relationship can be "ABC" < "abc" , or "ABC" > "abc".

Collation means more than comparing character casing. Collation decides the relationship between two strings (greater, lower, equal), is used in string matching (LIKE), or computing boundaries in index scan.

In CUBRID, a collation implies a charset. For instance, collations "utf8_en_ci" and "iso88591_en_ci" perform case insensitive compare, but operate on different charsets. Although for ASCII range, in these particular cases the results are similar, the collation with "utf8_en_ci" is slower, since it needs to work on variable number of bytes (UTF-8 encoding).

  • "'a' COLLATE iso88591_en_ci" indicates "_iso88591'a' COLLATE iso88591_en_ci".
  • "'a' COLLATE utf8_en_ci" indicates "_utf8'a' COLLATE utf8_en_ci".

All string data types support precision. Special care is required with fixed characters(CHAR). The values of this types are padded to fill up the precision. For instance, inserting "abc" into a CHAR(5) column, will result in storing "abc " (2 padding spaces are added). Space (ASCII 32 decimal, Unicode 0020) is the padding character for most charsets. But, for EUC-KR charset, the padding consists of one character which is stored with two bytes (A1 A1).

Locale Attributes

Locale is defined by following attributes.

  • Charset (codeset) : How bytes are interpreted into single characters (Unicode codepoints)
  • Collations : Among all collations defined in locale of LDML(UNICODE Locale Data Markup Language) file, the last one is the default collation. Locale data may contain several collations.
  • Alphabet (casing rules) : One locale data may have up 2 alphabets, one for identifier and one for user data. One locale data can have two types of alphabets.
  • Calendar : Names of weekdays, months, day periods (AM/PM)
  • Numbering settings : Symbols for digit grouping, monetary currency
  • Text conversion data : For CSQL conversion. Option.
  • Unicode normalization data : Data converted by normalizing several characters with the same shape into one based on a specified rule. After normalization, characters with the same shape will have the same code value even though the locale is different. Each locale can activate/deactivate the normalization functionality.

Note

Generally, locale supports a variety of character sets. However, CUBRID locale supports both ISO and UTF-8 character sets for English and Korean. The other operator-defined locales using the LDML file support the UTF-8 character set only.

Collation Properties

A collation is an assembly of information which defines an order for characters and strings. In CUBRID, collation has the following properties.

  • Strength : This is a measure of how "different" basic comparable items (characters) are. This affects selectivity. In LDML files, collation strength is configurable and has four levels. For example a Case insensitive collation should be set with level = "secondary" (2) or "primary" (1).
  • Whether it supports or not expansions and contractions

Each column has a collation, so when applying LOWER(), UPPER() functions the casing rules of locale which defines the collation’s default language is used.

Depending on collation properties some CUBRID optimizations may be disabled for some collations:

  • LIKE rewrite: is disabled for collations which maps several different character to the same weight (case insensitive collations for example) and for collations with expansions.
  • Covering index scan: disabled for collations which maps several different character to the same weight (see Covering Index).
  • Prefix index: cannot be created on columns using collation with expansions.

For more information, see Collation settings impacting CUBRID features .

Collation Naming Rules

The collation name in CUBRID follows the conversion:

<charset>_<lang specific>_<desc1>_<desc2>_...
  • <charset>: The full charset name as used by CUBRID. iso88591, utf8, euckr.

  • <lang specific>: a region/language specific. The language code is expected as two characters; en, de, es, fr, it, ja, km, ko, tr, vi, zh. "gen" if it does not address a specific language, but a more general sorting rule.

  • <desc1>_<desc2>_...: They have the following meaning. Most of them apply only to LDML collations.

    • ci: case insensitive In LDML, can be obtained using the settings: strength="secondary" caseLevel="off" caseFirst="off".

    • cs: case sensitive; By default all collations are case sensitive. In LDML, can be obtained using at least: strength="tertiary".

    • bin/binary: it means that the sorting order under such collation is almost the same with the order of codepoints; If memory (byte) compare is used, then almost the same result would be obtain. Space character and EUC double-byte padding character are always sorted as zero in "bin" collation. No collations with such setting are currently configured in LDML (they are already available as built-in), but a similar one can be obtained using the maximum setting strength="quaternary" or strength="identical".

    • ai : accent insensitive; this means that 'Á' is sorted the same as 'A'. Due to particularities of the UCA based algorithms, an accent insensitive collation is also a case insensitive collation. In LDML, can be obtained using: strength="primary".

    • uca : this signals a UCA based collation; this is used only to differentiate such collations from similar built-in variants. All LDML collations are based on UCA, but in order to keep shorter names only two collations ( 'utf8_ko_cs_uca' , 'utf8_tr_cs_uca' ) have this description in their names, in order to differentiate them from 'utf8_ko_cs' and 'utf8_tr_cs' collations.

    • exp : this collations use a full-word matching/compare algorithm, contrary to the rest of collations which use character-by-character compare. This collation uses a more complex algorithm, with multiple passes which is much slower, but may prove useful for alphabetical sorts. In LDML, the Expansion needs to be explicit by adding CUBRIDExpansions="use".

    • ab : accent backwards; it is particularity of French-Canadian sorting, where level 2 of UCA (used to store accents weights) is compared from end of string towards the beginning. This collation setting can be used only when :ref`expansion` setting is also activated. The "backwards" setting allows for the following sorting:

      • Normal Accent Ordering : cote < coté < côte < côté
      • Backward Accent Ordering : cote < côte < coté < côté
    • cbm: contraction boundary match; it is a particularity of collations with Expansion and Contraction and refers to how it behaves at string matching when a Contraction is found.

      Suppose the collation has defined the Contraction "ch"; then normally, the pattern "bac" will not match the string"bachxxx" But when the collation is configured to allow "matching the characters starting a contraction", the above matching will return a positive. Only one collation is configured in this manner - 'utf8_ja_exp_cbm' - Japanese sorting requires a lot of contractions.

The collation names are not dynamically generated. They are user defined (configured in LDML), and should reflect the settings of the collation.

The name of collation influences the internal numeric id of the collation. For instance, in CUBRID only 256 collations are allowed, and the numeric IDs are assigned as:

  • 0 -31 : built-in collations (for these collations the name and id are hard-coded)
  • 32 - 46 : LDML collations having "gen" as "language" part
  • 47 - 255: the rest of LDML collations

If you want to include all locales into the database which CUBRID provide, first, copy cubrid_locales.all.txt of $CUBRID/conf directory into cubrid_locales.txt and next, run make_locale script(in extension, Linux is .sh, Windows is .bat). For more details on make_locale script, see Step 2: Compiling Locale. If you want to include the newly added locale information into the existing database, run "cubrid synccolldb <dbname>". For more information, see Synchronization of Database Collations with System Collations.

If you include all locales defined in LDML files, CUBRID has the following collations.

Collation Locale for casing Character range
iso88591_bin en_US - English ASCII + ISO88591 (C0-FE, except D7, F7)
iso88591_en_cs en_US - English ASCII + ISO88591 (C0-FE, except D7, F7)
iso88591_en_ci en_US - English ASCII + ISO88591 (C0-FE, except D7, F7)
utf8_bin en_US - English ASCII
euckr_bin ko_KR, same as en_US - English ASCII
utf8_en_cs en_US - English ASCII
utf8_en_ci en_US - English ASCII
utf8_tr_cs tr_TR - Turkish Turkish alphabet
utf8_ko_cs ko_KR, same as en_US - English ASCII
utf8_gen de_DE - German, generic Unicode casing customized with German rules All Unicode codepoints in range 0000-FFFF
utf8_gen_ai_ci de_DE - German, generic Unicode casing customized with German rules All Unicode codepoints in range 0000-FFFF
utf8_gen_ci de_DE - German, generic Unicode casing customized with German rules All Unicode codepoints in range 0000-FFFF
utf8_de_exp_ai_ci de_DE - German, generic Unicode casing customized with German rules All Unicode codepoints in range 0000-FFFF
utf8_de_exp de_DE - German, generic Unicode casing customized with German rules All Unicode codepoints in range 0000-FFFF
utf8_es_cs es_ES - Spanish, same as generic Unicode casing All Unicode codepoints in range 0000-FFFF
utf8_fr_exp_ab fr_FR - French, same as generic Unicode casing All Unicode codepoints in range 0000-FFFF
utf8_ja_exp ja_JP - Japanese, same as generic Unicode casing All Unicode codepoints in range 0000-FFFF
utf8_ja_exp_cbm ja_JP - Japanese, same as generic Unicode casing All Unicode codepoints in range 0000-FFFF
utf8_km_exp km_KH - Cambodian, same as generic Unicode casing All Unicode codepoints in range 0000-FFFF
utf8_ko_cs_uca ko_KR - Korean, same as generic Unicode casing All Unicode codepoints in range 0000-FFFF
utf8_tr_cs_uca tr_TR - Turkish, generic Unicode casing customized with Turkish rules All Unicode codepoints in range 0000-FFFF
utf8_vi_cs vi_VN - Vietnamese, same as generic Unicode casing All Unicode codepoints in range 0000-FFFF

On the above collations, 9 collations like iso88591_bin, iso88591_en_cs, iso88591_en_ci, utf8_bin, euckr_bin, utf8_en_cs, utf8_en_ci, utf8_tr_cs and utf8_ko_cs, are built in the CUBRID before running make_locale script.

Files For Locale Setting

CUBRID uses following directories and files to set the locales.

  • $CUBRID/conf/cubrid_locales.txt file: A configuration file containing the list of locales to be supported
  • $CUBRID/conf/cubrid_locales.all.txt file: A configuration file template with the same structure as cubrid_locales.txt. Contains the entire list of all the locales that the current version of CUBRID is capable of supporting without any efforts from the end user’s side.
  • $CUBRID/locales/data directory: This contains files required to generate locale data.
  • $CUBRID/locales/loclib directory: contains a C header file, locale_lib_common.h and OS dependent makefile which are used in the process of creating / generating locales shared libraries.
  • $CUBRID/locales/data/ducet.txt file: Text file containing default universal collation information (codepoints, contractions and expansions, to be more specific) and their weights, as standardized by The Unicode Consortium, which is the starting point for the creation of collations. For more information, see http://unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table .
  • $CUBRID/locales/data/unicodedata.txt file: Text file containing information about each Unicode codepoint regarding casing, decomposition, normalization etc. CUBRID uses this to determine casing. For more information, see http://www.ksu.ru/eng/departments/ktk/test/perl/lib/unicode/UCDFF301.html .
  • $CUBRID/locales/data/ldml directory: XML files, name with the convention cubrid_<locale_name>.xml, containing locale information presented in human-readable XML format (LDML Locale Data Markup Language); a file for each of the supported language.
  • $CUBRID/locales/data/codepages directory: contains codepage console conversion for single byte codepages(8859-1.txt , 8859-15.txt, 8859-9.txt) and codepage console conversion for double byte codepages(CP1258.txt , CP923.txt, CP936.txt , CP949.txt).
  • $CUBRID/bin/make_locale.sh file or %CUBRID%\bin\make_locale.bat file: A script file used to generate shared libraries for locale data
  • $CUBRID/lib directory: Shared libraries for generated locales will be stored here.

Locale Setting

Step 1: Selecting a Locale

Configure locales to use on $CUBRID/conf/cubrid_locales.txt. You can select all or some of locales which are supported.

CUBRID supports locales as follows: en_US, de_DE, es_ES, fr_FR, it_IT, ja_JP, km_KH, ko_KR, tr_TR, vi_VN, zh_CN.

The language and country for each locale are shown in the following table.

Locale Name Language - Country
en_US English - U.S.A.
de_DE German - Germany
es_ES Spanish - Spain
fr_FR French - France
it_IT Italian - Italy
ja_JP Japanese - Japan
km_KH Khmer - Cambodia
ko_KR Korean - Korea
tr_TR Turkish - Turkey
vi_VN Vietnamese - Vietnam
zh_CN Chinese - China

Note

The LDML files for the supported locales are named cubrid_<locale_name>.xml and they can be found in the $CUBRID/locales/data/ldml directory. If only a subset of these locales are to be supported by CUBRID, one must make sure their corresponding LDML files are present in the $CUBRID/locales/data/ldml folder. A locale cannot be used by CUBRID, unless it has an entry in cubrid_locales.txt file and it has a corresponding cubrid_<locale_name>.xml.

Locale libraries are generated according to the contents of $CUBRID/conf/cubrid_locales.txt configuration file. This file contains the language codes of the wanted locales (all user defined locales are generated with UTF-8 charset). Also, in this file can be configured the file paths for each locale LDML file and libraries can be optionally configured.

<lang_name>  <LDML file>                                        <lib file>
ko_KR        /home/CUBRID/locales/data/ldml/cubrid_ko_KR.xml    /home/CUBRID/lib/libcubrid_ko_KR.so

By default, the LDML files are found in $CUBRID/locales/data/ldml and the locale libraries in $CUBRID/lib; the filenames for LDML are formatted like: cubrid_<lang_name>.ldml. The filenames for libraries: libcubrid_<lang_name>.dll (.so for Linux).

Step 2: Compiling Locale

Once the requirements described above are met, the locales can be compiled.

Regarding the embedded locales in CUBRID, they can be used without compiling user locale library, so they can be used by skipping the step 2. But there are differences between the embedded locale and the library locale. Regarding this, refer Built-in Locale and Library Locale.

To compile the locale libraries, one must use the make_locale (.bat for Windows .sh for Linux) utility script from command console. The file is delivered in CUBRID/bin folder so it should be resolved by PATH environment variable. Here $CUBRID, $PATH are the environment variables of Linux, %CUBRID%, %PATH% are the environment variables of Windows.

Usage can be displayed by running make_locale.sh -h (make_locale /h in Windows. it requires Visual C++ 2005, 2008 or 2010 ).

make_locale.sh [options] [locale]

options ::= [-t 32|64 ] [-m debug|release]
locale ::= [de_DE|es_ES|fr_FR|it_IT|ja_JP|km_KH|ko_KR|tr_TR|vi_VN|zh_CN]
  • options

    • -t : Selects 32bit or 64bit (default value: 32).
    • -m : Selects release or debug. In general, release is selected (default value: release). The debug mode is provided for developers who would like to write the locale library themselves. Selects release or debug. In general, release is selected (default value: release). The debug mode is provided for developers who would like to write the locale library themselves.
  • locale : The locale name of the library to build. If locale is not specified, the build includes data from all configured locales. In this case, library file is stored in $CUBRID/lib directory with the name of libcubrid_all_locales.so (.dll for Windows).

To create user defined locale shared libraries, two choices are available:

  • Creating a single lib with all locales to be supported

    make_locale.sh                         # Build and pack all locales (32/release)
  • Creating one lib for each locale to be supported

    make_locale.sh -t 64 -m release ko_KR

The first choice is recommended. In this scenario, some data may be shared among locales. If you choose the first one, a lib supporting all locales has less than 15 MB; in the second one, consider for each locale library from 1 MB to more than 5 MB. Also the first one is recommended because it has no runtime overhead during restarting the servers when you choose the second one.

Warning

Limitations and Rules

  • Do not change the contents of $CUBRID/conf/cubrid_locales.txt after locales generation; Once generated the locales libraries, the contents of $CUBRID/conf/cubrid_locales.txt should not change (order of languages within file must also be preserved). During locale compiling, the generic collation use as default locale the first one in which is referenced; changing the order may cause different results with casing for such collation (utf8_gen_*).
  • Do not change the contents for $CUBRID/locales/data/*.txt files.

Note

Procedure of Executing make_locale.sh(.bat) Script

The processing in make_locale.sh(.bat) script

  1. Reads the .ldml file corresponding to a language, along with some other installed common data files like $CUBRID/locales/data/ducet.txt, $CUBRID/locales/data/unicodedata.txt, and  $CUBRID/locales/data/codepages/*.txt
  2. After processing of raw data, it writes in a temporary $CUBRID/locales/loclib/locale.c file C constants values and arrays consisting of locales data.
  3. The temporary file locale.c is passed to the platform compiler to build a .dll/.so file. This step assumes that the machines has an installed C/C++ compiler and linker. Currently, only the MS Visual Studio for Windows and gcc for Linux compilers are supported.
  4. Temporary files are removed.

Step 3: Setting CUBRID to Use a Specific Locale

Several locales can be defined, but only one locale can be selected as the default locale, by using the CUBRID_CHARSET environment variable.

In addition to the possibility of specifying a default locale, one can override the default calendar settings with the calendar settings from another locale, using the intl_date_lang system parameter.

  • CUBRID_CHARSET will be in the format: <locale_name>.[utf8 | iso] (e.g. tr_TR.utf8, en_EN.ISO, ko_KR.utf8)
  • intl_date_lang : <locale_name>. The possible values for <locale_name> are listed on Step 1: Selecting a Locale.

By default, if no charset is included in CUBRID_CHARSET, the ISO charset is assumed.

Note

Setting the Month/Day in Characters, AM/PM, and Number Format

For the function that inputs and outputs the day/time, you can set the month/day in characters, AM/PM, and number format by the locale in the intl_date_lang system parameter.

For the function that converts a string to numbers or the numbers to a string, you can set the string format by the locale in intl_number_lang system parameter.

Built-in Locale and Library Locale

Regarding the embedded locales in CUBRID, they can be used without compiling user locale library, so they can be used by skipping the step 2. But there are two differences between the embedded locale and the library locale.

  • Embedded(built-in) locale(and collation) are not aware of Unicode data For instance, casing (lower, upper) of (Á, á) is not available in embedded locales. The LDML locales provide data for Unicode codepoints up to 65535.
  • Also, the embedded collations deals only with ASCII range, or in case of 'utf8_tr_cs' - only ASCII and letters from Turkish alphabet. Embedded UTF-8 locales are not Unicode compatible, while compiled (LDML) locales are.

Currently, the built-in locales which can be set by CUBRID_CHARSET environment variable are:

  • en_US.iso88591
  • en_US.utf8
  • ko_KR.utf8
  • ko_KR.euckr
  • ko_KR.iso88591: Will have Romanized Korean names for month, day names.
  • tr_TR.utf8
  • tr_TR.iso88591: Will have Romanized Turkish names for month, day names.

The order stated above is important; if no charset is defined while configuring CUBRID_CHARSET, the charset is the charset of the locale shown first. For example, if CUBRID_CHARSET = ko_KR, the charset is specified to ko_KR.utf8, the first locale among the ko_KR in the above list. Locales of the other languages except the built-in locales should end with .utf8. For example, specify CUBRID_CHARSET = de_DE.utf8 for German.

The names of month and day for ko_KR.iso88591 and tr_TR.iso88591 should be Romanized. For example, "일요일" for Korean (Sunday in English) is Romanized to "Iryoil". Providing ISO-8859-1 characters only is required. For more information, see The Month/Day in Korean and Turkish Characters for ISO-8859-1 Charset.

The Month/Day in Korean and Turkish Characters for ISO-8859-1 Charset

In Korean or Turkish which have charset UTF-8 or in Korean which have charset EUC-KR, the month/day in characters and AM/PM are encoded according to the country. However, for ISO-8859-1 charset, if the month/day in characters and AM/PM in Korean or Turkish is used as its original encoding, an unexpected behavior may occur in the server process because of its complex expression. Therefore, the name should be Romanized. The default charset of CUBRID is ISO-8859-1 and the charset can be used for Korean and Turkish. The Romanized output format is as follows:

Day in Characters

Day in Characters Long/Short Format Long/Short Romanized Korean Long/Short Romanized Turkish
Sunday / Sun Iryoil / Il Pazar / Pz
Monday / Mon Woryoil / Wol Pazartesi / Pt
Tuesday / Tue Hwayoil / Hwa Sali / Sa
Wednesday / Wed Suyoil / Su Carsamba / Ca
Thursday / Thu Mogyoil / Mok Persembe / Pe
Friday / Fri Geumyoil / Geum Cuma / Cu
Saturday / Sat Toyoil / To Cumartesi / Ct

Month in Characters

Month in Characters Long/Short Format Long/Short Romanized Korean (Not Classified) Long/Short Romanized Turkish
January / Jan 1wol Ocak / Ock
February / Feb 2wol Subat / Sbt
March / Mar 3wol Mart / Mrt
April / Apr 4wol Nisan / Nsn
May / May 5wol Mayis / Mys
June / Jun 6wol Haziran / Hzr
July / Jul 7wol Temmuz / Tmz
August / Aug 8wol Agustos / Ags
September / Sep 9wol Eylul / Eyl
October / Oct 10wol Ekim / Ekm
November / Nov 11wol Kasim / Ksm
December / Dec 12wol Aralik / Arl

AM/PM in Characters

  Romanized in Korean Romanized in Turkish
AM ojeon AM
PM ohu PM

Step 4: Creating a Database with the Selected Locale Setting

Once the CUBRID_CHARSET and intl_date_lang environment variables have been set, one can create a new database (or delete and recreate an existing one). When issuing the command "cubrid createdb <db_name>", a database will be created using the settings in the variables described above. The charset and locale name are stored in "db_root" system catalog table. Once a database is created with a language and charset, it cannot change these settings.

Step 5 (optional): Manually Verifying the Locale File

The contents of locales libraries  may be displayed in human readable form using the dumplocale CUBRID utility. Execute cubrid dumplocale -h to output the usage. The used syntax is as follows:

cubrid dumplocale [options] [language-string]

options ::= -i|--input-file <shared_lib>
            -d|--calendar
            -n|--numeric
            {-a |--alphabet=}{l|lower|u|upper|both}
            -c|--codepoint-order
            -w|--weight-order
            {-s|--start-value} <starting_codepoint>
            {-e|--end-value} <ending_codepoint>
            -k
            -z

language-string ::= de_DE|es_ES|fr_FR|it_IT|ja_JP|km_KH|ko_KR|tr_TR|vi_VN|zh_CN
  • dumplocale: A command which dumps the contents of locale shared library previously generated using LDML input file.
  • language-string: One of de_DE, es_ES, fr_FR, it_IT, ja_JP, km_KH, ko_KR, tr_TR, vi_VN, zh_CN. Configures the locale language to dump the locale shared library. If it's not set, all languages which are configured on cubrid_locales.txt are given.

The followings are [options] for cubrid dumplocale.

-i, --input-file=FILE

The name of the locale shared library file (< shared_lib>) created previously. It includes the directory path.

-d, --calendar

Dumps the calendar and date/time data. Default value: No

-n, --numeric

Dumps the number data. Default value: No

-a, --alphabet=l|lower|u|upper|both

Dumps the alphabet and case data. Default value: No

--identifier-alphabet=l|lower|u|upper

Dumps the alphabet and case data for the identifier. Default value: No

-c, --codepoint-order

Dumps the collation data sorted by the codepoint value. Default value: No (displayed data: cp, char, weight, next-cp, char and weight)

-w, --weight-order

Dumps the collation data sorted by the weight value. Default value: No (displayed data: weight, cp, char)

-s, --start-value=CODEPOINT

Specifies the dump scope. Starting codepoint for -a, --identifier-alphabet, -c, -w options. Default value: 0

-e, --end-value=CODEPOINT

Specifies the dump scope. Ending codepoint for -a, --identifier-alphabet, -c, -w options. Default value: Max value read from the locale shared library.

-k, --console-conversion

Dumps the data of console conversion. Default value: No

-z, --normalization

Dumps the normalization data. Default value: No

The following example shows how to dump the calendar, number formatting, alphabet and case data, alphabet and case data for the identifier, collation sorting based on the codepoint order, collation sorting based on the weight, and the data in ko_KR locale into ko_KR_dump.txt by normalizing:

% cubrid dumplocale -d -n -a both -c -w -z ko_KR > ko_KR_dump.txt

It is highly recommended to redirect the console output to a file, as it can be very big data, and seeking information could prove to be difficult.

Synchronization of Database Collations with System Collations

CUBRID's normal operation requires that the system collation and the database collation must be the same. The system locale means that the locale which include built-in locales and library locales created through cubrid_locales.txt(refer Locale Setting), and it includes the system collation information. The database collation information is stored on the _db_collation system catalog table.

cubrid synccolldb utility checks if the database collation is the same with the system collation, and synchronize into the system collation if they are different. However, note that this utility doesn't transform the data itself stored on the database.

This utility can be used when the existing database collation should be changed after the system locale is changed. However, there are operations which the user have to do manually.

The user should do this operations before the synchronization. These operations can be done by running CSQL with cubrid_synccolldb_<database_name>.sql file, which is created by cubrid synccolldb -c.

  • change collation using ALTER TABLE .. MODIFY statement.
  • remove any views, indexes, triggers or partitions containing the collation.

Run synchrization with cubrid synccolldb. After then, do the following operations.

  • recreate views, indexes, triggers, or partitions
  • update application statements to use new collations

This utility should work only in offline mode.

synccolldb syntax is as follows.

cubrid synccolldb [options] database_name
  • cubrid: An integrated utility for the CUBRID service and database management.
  • synccolldb: A command to synchronize collations of a database with collations from the system(according to contents of locales libraries and $CUBRID/conf/cubrid_locales.txt).
  • database_name: A database name to be synchronized with collations from the system.

If [options] is omitted, synccolldb checks the collation differences between the system and the database, synchronize the database collation with the system collation, and create the cubrid_synccolldb_<database_name>.sql file including the queries of objects to be dropped before the synchronization.

The followings are [options] which are used on cubrid synccolldb.

-c, --check-only

This option prints out the collation information which is different between the database collation and the system collation.

-f, --force-only

This option doesn't ask when updating the database collation with the system collation.

The following shows that how it works when the system collation and the database collation are different.

Firstly, make locale library about ko_KR locale.

$ echo ko_KR > $CUBRID/conf/cubrid_locales.txt
$ make_locale.sh -t 64

Next, create the database.

$ cubrid createdb xdb --db-volume-size=20m --log-volume-size=20m

Create a schema. At this time, specify the needed collation in each table.

$ csql -S -udba xdb -i in.sql
CREATE TABLE dept(depname STRING PRIMARY KEY) COLLATE utf8_ko_cs_uca;
CREATE TABLE emp(eid INT PRIMARY KEY, depname STRING,address STRING) COLLATE utf8_ko_cs_uca;
ALTER TABLE emp ADD CONSTRAINT FOREIGN KEY (depname) REFERENCES dept(depname);

Change the locale setting of the system. If you do not any values on cubrid_locales.txt, the database consider that only built-in locales exist

$ echo "" > $CUBRID/conf/cubrid_locales.txt

Check the difference between system and database by running cubrid synccolldb -c command.

$ cubrid synccolldb -c xdb

----------------------------------------
----------------------------------------
Collation 'utf8_ko_cs_uca' (Id: 133) not found in database or changed in new system configuration.
----------------------------------------
----------------------------------------
Collation 'utf8_gen_ci' (Id: 44) not found in database or changed in new system configuration.
----------------------------------------
----------------------------------------
Collation 'utf8_gen_ai_ci' (Id: 37) not found in database or changed in new system configuration.
----------------------------------------
----------------------------------------
Collation 'utf8_gen' (Id: 32) not found in database or changed in new system configuration.
----------------------------------------
----------------------------------------
There are 4 collations in database which are not configured or are changed compared to system collations.
Synchronization of system collation into database is required.
Run 'cubrid synccolldb -f xdb'

If the indexes exist, firstly you should remove the indexes, and change the collation of each table, then recreate the indexes directly. The process to remove indexes and change the collation of tables can be executed by using cubrid_synccolldb_xdb.sql file which was created by synccolldb command. On the below example, a foreign key is the index which you should recreate.

$ cat cubrid_synccolldb_xdb.sql

ALTER TABLE [dept] COLLATE utf8_bin;
ALTER TABLE [emp] COLLATE utf8_bin;
ALTER TABLE [emp] DROP FOREIGN KEY [fk_emp_depname];
ALTER TABLE [dept] MODIFY [depname] VARCHAR(1073741823) COLLATE utf8_bin;
ALTER TABLE [emp] MODIFY [address] VARCHAR(1073741823) COLLATE utf8_bin;
ALTER TABLE [emp] MODIFY [depname] VARCHAR(1073741823) COLLATE utf8_bin;

$ csql -S -u dba -i cubrid_synccolldb_xdb.sql xdb

Removing the obsolete collations by executing the above cubrid_synccolldb_xdb.sql script file must be performed before forcing the synchronization of system collations into database.

Run cubrid synccolldb command. If the option is omitted, the message is shown to ask to run this command or not; if the -f option is given, the synchronization is run without checking message.

$ cubrid synccolldb xdb
Updating system collations may cause corruption of database. Continue (y/n) ?
Contents of '_db_collation' system table was updated with new system collations.

Recreate the dropped foreign key.

$ csql -S -u dba xdb

ALTER TABLE emp ADD CONSTRAINT FOREIGN KEY fk_emp_depname(depname) references dept(depname);

Note

In CUBRID, collations are identified by the ID number on the CUBRID server, and its range is from 0 to 255. LDML file is compiled with shared library, which offers the mapping information between the ID and the collation(name, attribute).

  • The system collation is the collation which is loaded from the locale library, by the CUBRID server and the CAS module.
  • The database collation is the collation which is stored into the _db_collation system table.

Collation

A collation is an assembly of information which defines an order for characters and strings. One common type of collation is called alphabetization.

If not explicitly set otherwise at column creation, the charset and collation of columns are charset and collation of table. The charset and collation are taken (in order in is found first) from the client. If the result of an expression is a character data type, gets the collation and charset by the collation inference with the operands of the expression.

Note

In CUBRID, collations are supported for a number of languages, including European and Asian. In addition to the different alphabets, some of these languages may require the definition of expansions or contractions for some characters or character groups. Most of these aspects have been put together by the Unicode Consortium into The Unicode Standard (up to version 6.1.0 in 2012). Most of the information is stored in the DUCET file http://www.unicode.org/Public/UCA/latest/allkeys.txt which contains all characters required by most languages.

Most of the codepoints represented in DUCET, are in range 0 - FFFF, but codepoints beyond this range are included. However, CUBRID will ignore the latest ones, and use only the codepoints in range 0 - FFFF (or a lower value, if configured).

Each codepoint in DUCET has one or more 'collation elements' attached to it. A collation element is a set of four numeric values, representing weights for 4 levels of comparison. Weight values are in range 0 - FFFF.

In DUCET, a character is represented on a single line, in the form:

< codepoint_or_multiple_codepoints >   ; [.W1.W2.W3.W4][....].... # < readable text explanation of the symbol/character >

A Korean character kiyeok is represented as follows:

1100  ; [.313B.0020.0002.1100] # HANGUL CHOSEONG KIYEOK

For example, 1100 is a codepoint, [.313B.0020.0002.1100] is one collation element, 313B is the weight of Level 1, 0020 is the weight of Level 2, 0002 is the weight of Level 3, and 1100 is the weight of Level 4.

Expansion support, defined as a functional property, means supporting the interpretation of a composed character as a pair of the same characters which it's made of. A rather obvious example is interpreting the character ''æ'' in the same way as the two character string ''ae''. This is an expansion. In DUCET, expansions are represented by using more than one collation element for a codepoint or contraction. By default, CUBRID has expansions disabled. Handling collations with expansions requires when comparing two strings several passes (up to the collation strength/level).

Charset and Collation of Column

Charset and Collation apply to string data types: VARCHAR (STRING), CHAR and ENUM . By default, all string data types inherit the default database collation and character set, but CUBRID supports two modifiers which affect collation and character set.

Charset

Character set may be specified as character string literal or as non-quoted identifier. Supported character sets:

  • ISO-8859-1
  • UTF-8 (with maximum 4 bytes per characters, which means it supports codepoints from 0 to 0x10FFFF)
  • EUC-KR (the support for this character set is only for backward compatibility reasons, its usage is not recommended)

Note

Previous versions of CUBRID 9.0 supported EUC-KR characters when ISO-8859-1 charset (the single one available) was set. From CUBRID 9.0 Beta, this is no longer available. EUC-KR characters should be used only with EUC-KR charset.

String Check

By default, all input data is assumed to be in the server character (set with CUBRID_CHARSET environment variable).  This may be overridden by SET NAMES or charset introducer (or COLLATE string literal modifier) (For more information, see Charset and Collation of Tables.

Invalid data may lead to undefined behavior or even crashes if string checking is disabled (by default is disabled). This can be enabled by intl_check_input_string system parameter. However, if you are sure that only valid data is input, you can obtain better performance by disabling string check.

Only UTF-8 and EUC-KR text data is checked for valid encodings. Since ISO-8859-1 is single byte encoding and all byte values are valid, there is no checking on this charset.

Charset Conversion

When collation / charset modifiers or normal collation inference requires it, character conversion may occur. Conversions are not reversible. The single effective charset conversion is from ISO88591 charset to UTF-8 charset. Losses may occur during this conversion: bytes  range 80-A0 are not valid ISO-8859-1 characters but may appear in strings. After conversion to UTF-8 this characters are replaced with '?'.

Conversion from UTF-8 or EUC-KR to ISO-8859-1 charset is a simple data stream re-interpretations (this is a trade-off since most Unicode characters do not have ISO-8859-1 correspondents).

ASCII characters are not affected by conversions: bytes in range 00-7F are encodings of the same characters in both ISO-8859-1  and UTF-8 character sets.

Rules for conversion of values from one charset to another:

Source \ Destination ISO-8859-1 UTF-8 EUC-KR
ISO-8859-1 No change Byte conversion. The byte size increases but the character length is the same. Not allowed
UTF-8 Byte reinterpretation. The byte size is the same but character length increases No change Not allowed
EUC-KR Byte reinterpretation. The byte size is the same but character length increases Not allowed No change

Collation

Collation may be specified as character string literal or as non-quoted identifier.

The following is a query on the _db_collation sytem table.

coll_id  coll_name        charset_name    is_builtin  has_expansions  contractions  uca_strength
================================================================================================
0        'iso88591_bin'   'iso88591'     'Yes'        'No'            0             'Not applicable'
1        'utf8_bin'       'utf8'         'Yes'        'No'            0             'Not applicable'
2        'iso88591_en_cs' 'iso88591'     'Yes'        'No'            0             'Not applicable'
3        'iso88591_en_ci' 'iso88591'     'Yes'        'No'            0             'Not applicable'
4        'utf8_en_cs'     'utf8'         'Yes'        'No'            0             'Not applicable'
5        'utf8_en_ci'     'utf8'         'Yes'        'No'            0             'Not applicable'
6        'utf8_tr_cs'     'utf8'         'Yes'        'No'            0             'Not applicable'
7        'utf8_ko_cs'     'utf8'         'Yes'        'No'            0             'Not applicable'
8        'euckr_bin'      'euckr'        'Yes'        'No'            0             'Not applicable'

Built-in collations are available without requiring additional user locale libraries.

Each collation has an associated charset. For this reason, it is not allowed to set incompatible pair to character set and collation.

When COLLATE modifier is specified without CHARSET modifier, then the default charset of collation is set. When CHARSET modifier is specified without COLLATE modifier, then the default collation is set. The default collation for character sets are the binary collation:

  • ISO-8859-1 : iso88591_bin
  • UTF-8 : utf8_bin
  • EUC-KR: euckr_bin

For more information on how to determine the collation among the expression parameters (operands) with different collations (and charsets), see How to Determine Collation among Columns with Different Collation.

CHARSET and COLLATE modifier

CUBRID supports two modifiers which affect collation and character set without following the default database collation and character set.

  • CHARACTER_SET (alias CHARSET) changes the columns character set
  • COLLATE changes the collation
<data_type> ::= <column_type> [<charset_modifier_clause>] [<collation_modifier_clause>]

<charset_modifier_clause> ::= {CHARACTER_SET | CHARSET} {<char_string_literal> | <identifier> }

<collation_modifier_clause> ::= {COLLATE } {<char_string_literal> | <identifier> }

The following example shows how to set the charset of the VARCHAR type column to UTF-8

CREATE TABLE t1 (s1 VARCHAR (100) CHARSET utf8);

The following example shows how to change the name of column s1 to c1 and the type to CHAR(10) with the collation of utf8_en_cs (the charset is the default charset of the collation, UTF-8).

ALTER TABLE t1 CHANGE s1 c1 CHAR(10) COLLATE utf8_en_cs;

The value of the c1 column is changed to the VARCHAR(5) type of which collation is iso88591_en_ci. It is performed by using the collation iso88591_en_ci for the type of column selected first or by using sorting.

SELECT CAST (c1 as VARCHAR(5) COLLATE 'iso88591_en_ci') FROM t1 ORDER BY 1;

The following query (same sorting) is similar to the above but the output column result is the original value.

SELECT c1 FROM t1 ORDER BY CAST (c1 as VARCHAR(5) COLLATE iso88591_en_ci);

How to Determine Collation among Columns with Different Collation

CREATE TABLE t (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE utf8_tr_cs);

-- insert values into both columns

SELECT s1, s2 FROM t WHERE s1 > s2;

In the above example, column s1 and column s2 have different collations. Comparing s1 with s2 means comparing the strings to determine which column value is "larger" among the records on the table t. In this case, an error will occur because the comparison between the collation utf8_en_cs and the collation utf8_tr_cs cannot be done.

The rules to determine the types of arguments for an expression are also applied to the rules to determine the collations.

  1. A common collation and a characterset are determined by considering all arguments of an expression.
  2. If an argument has a different collation(and a characterset) with a common collation(and a characterset) decided in No. 1., it is changed into the common collation(and a characterset).
  3. To change the collation, CAST() operator can be used.

Collation coercibility is used to determine the result collation of comparison expression. It expresses how easily the collation can be converted to the collation of the opposite argument. High collation coercibility when comparing two operands of an expression means that the collation can be easily converted to the collation of the opposite argument. That is, an argument with high collation coercibility can be changed to the collation of an argument with lower collation coercibility.

When an expression has various arguments with different collation, a common collation is computed based on each arguments collation and coercibility. The rules for collation inference are:

  1. Arguments with higher coercibility are coerced (or casted) to collation of arguments with lower coercibility
  2. When arguments have different collation but same coercibility, the expression’s collation cannot be resolved and an error is returned.

Below table shows the collation coercibility about arguments of the expression

Collation Coercibility Arguments of the Expression(Operands)
0 Operand having COLLATE modifier
1 Columns with non-binary collation
2 Columns with binary collation, except having ISO-8859-1 charset
3 Columns with binary collation and ISO-8859-1 charset(iso88591_bin)
4 SELECT values, Expression With non-binary collation
5 SELECT values, Expression With binary collation, except having ISO-8859-1 charset
6 SELECT values, Expression With binary collation and ISO-8859-1 charset(iso88591_bin)
7 Special functions (USER(), DATABASE(), SCHEMA(), VERSION())
8 Constants(string literals) With non-binary collation
9 Constants(string literals) With binary collation, except having ISO-8859-1 charset
10 Constants(string literals) With binary collation and ISO-8859-1 charset (iso88591_bin)
11 host variables, session variables

The following example shows converting two parameters with different collation to one collation.

  • Converting into the Wanted Collation

    The SELECT statement, failing to execute in the above example, is successfully executed by specifying a collation on one column by using the CAST operator as shown in the following query; then the two operands have the same collation.

    SELECT s1, s2 FROM t WHERE s1 > CAST (s2 AS STRING COLLATE utf8_en_cs);
    

    Also, by CAST s2 to binary collation, the s1 collation coercibility is 5, "fully convertible".

    SELECT s1, s2 FROM t WHERE s1 > CAST (s2 AS STRING COLLATE utf8_bin);
    

    In the following query, the second operand "CAST (s2 AS STRING COLLATE utf8_tr_cs)" is a sub-expression. The sub-expression has higher coercibility than the column (s1) so "CAST (s2 AS STRING COLLATE utf8_tr_cs)" is converted to the collation of s1.

    SELECT s1, s2 FROM t WHERE s1 > CAST (s2 AS STRING COLLATE utf8_tr_cs);
    

    Any expression has higher coercibility than any column. So "CONCAT (s2,'')" is converted to the collation of s1 in the following query and the query is successfully performed.

    SELECT s1, s2 FROM t WHERE s1 > CONCAT (s2,'');
    
  • Converting Collation of Constant and Column

    In the following case, comparison is made by using the collation of s1.

    SELECT s1, s2 FROM t WHERE s1 > 'abc';
    
  • When a Column is Created with Binary Collation

    CREATE TABLE t2 (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE utf8_bin);
    SELECT s1, s2 FROM t WHERE s1 > s2;
    

    In this case, s2 column's coercibility is 5(binary collation) and s2 can be "fully convertible" to the collation of s1. utf8_en_cs is used.

    CREATE TABLE t2 (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE iso88591_bin);
    SELECT s1, s2 FROM t WHERE s1 > s2;
    

    In this case, utf8_en_cs is used as collation, too. However, some overhead occurs to convert the charset to UTF-8 since s2 is the ISO charset. Charset conversion is made only when converting ISO to UTF-8.

    In the following query, the charset is not converted (UTF-8 byte data in s2 is easily reinterpreted to the ISO-8859-1 charset) but character comparison is made by using the iso88591_en_cs collation.

    CREATE TABLE t2 (
        s1 STRING COLLATE iso88591_en_cs,
        s2 STRING COLLATE utf8_bin
    );
    
    SELECT s1, s2 FROM t WHERE s1 > s2;
    
  • Converting Collation of Sub-Expression and Column

    Coercibility of sub-expressions is higher than coercibility of columns

    CREATE TABLE t (
        s1 STRING COLLATE utf8_en_cs,
        s2 STRING COLLATE utf8_tr_cs
    );
    
    SELECT s1, s2 FROM t WHERE s1 > s2 + 'abc';
    

    In this case, the second operand is the expression, so the collation of s1 is used.

    In the following example, an error occurs. An error occurs because '+' operation is tried for s2 and s3 where the collation is different.

    CREATE TABLE t (
        s1 STRING COLLATE utf8_en_cs,
        s2 STRING COLLATE utf8_tr_cs,
        s3 STRING COLLATE utf8_en_ci
    );
    
    SELECT s1, s2 FROM t WHERE s1 > s2 + s3;
    

    In the following example, the collation of s2 and s3 is utf8_tr_cs. Therefore, the collation of '+' expression is utf8_tr_cs, too. Expressions have higher coercibility than columns. Therefore, comparison operation is made by using the utf8_en_cs collation.

    CREATE TABLE t (
        s1 STRING COLLATE utf8_en_cs,
        s2 STRING COLLATE utf8_tr_cs,
        s3 STRING COLLATE utf8_tr_cs
    );
    
    SELECT s1, s2 FROM t WHERE s1 > s2 + s3;
    

Charset and Collation of Tables

The charset and the collation can be specified after the table creation syntax.

CREATE TABLE table_name ( column_list )  [CHARSET charset_name] [COLLATE collation_name]

If the charset and the collation of a column are omitted, the charset and the collation of a table is used. If the charset and the collation of a table are omitted, the charset and the collation of a system is used.

The following shows how to specify the collation on the table.

CREATE TABLE tbl(i1 INTEGER, s STRING) CHARSET utf8 COLLATE utf8_en_cs;

Charset and Collation of String Literals

The charset and the collation of a string literal are determined based on the following priority.

  1. Charset Introducer introducer or COLLATE modifier of string literal
  2. The charset and the collation defined by the SET NAMES Statement
  3. System charset and collation(Default collation set by the charset and the CUBRID_CHARSET environment variable)

SET NAMES Statement

The SET NAMES statement changes the default client charset and the collation. Therefore, all sentences in the client which has executed the statement have the specified charset and collation. The syntax is as follows.

SET NAMES [ charset_name ] [ COLLATE collation_name]
  • charset_name : Valid charset name is iso88591, utf8 and euckr.
  • collation_name : Collation setting can be omitted and all available collations can be set. The collation should be compatible with the charset; otherwise, an error occurs. To find the available collation names, look up the db_collation catalog VIEW (see Charset and Collation of Column).

The following example shows how to create the string literal with the default charset and collation.

SELECT 'a';

The following example shows how to create the string literal with the utf8 charset and utf8_bin collation(the default collation is the binary collation of the charset)

SET NAMES utf8;
SELECT 'a';

Charset Introducer

In front of the constant string, the charset introducer and the COLLATE modifier can be positioned. The charset introducer is the charset name starting with a underscore (_), coming before the constant string. The syntax to specify the CHARSET introducer and the COLLATE modifier for a string is as follows.

[charset_introducer]'constant-string' [ COLLATE collation_name ]
  • charset_introducer : a charset name starting with an underscore (_), can be omitted. One of _utf8, _iso88591, and _euckr can be entered.
  • constant-string : a constant string value.
  • collation_name : the name of a collation, which can be used in the system, can be omitted.

The default charset and collation of the constant string is determined based on the current database connected (the SET NAMES statement executed last or the default value).

  • When the string charset introducer is specified and the COLLATE modifier is omitted, the default collation (binary collation) of corresponding charset is set.
  • When the charset introducer is omitted and the COLLATE modifier is specified, the character is determined based on collation.

The following example shows how to specify the charset introducer and the COLLATE modifier.

SELECT 'cubrid';
SELECT _utf8'cubrid';
SELECT _utf8'cubrid' COLLATE utf8_en_cs;

The following example shows how to create the string literal with utf8 charset and utf8_en_cs collation. The COLLATE modifier of SELECT statement overrides the collation specified by SET NAMES syntax.

SET NAMES utf8 COLLATE utf8_en_ci;
SELECT 'a' COLLATE utf8_en_cs;

Charset and Collation of Expressions

The charset and collation of expression's result are inferred from charset and collation of arguments in the expression. Collation inference in CUBRID is based on coercibility. For more information, see How to Determine Collation among Columns with Different Collation.

All string matching function(LIKE, REPLACE, INSTR, POSITION, LOCATE, SUBSTRING_INDEX, FIND_IN_SET, etc) and comparison operators(<, >, =, etc) take collation into account.

Charset and Collation of System Data

The system charset is taken from CUBRID_CHARSET environment variable. The system collation is always the binary collation (<charset>_bin) of system charset. CUBRID supports three charset(iso88591, euckr, utf8), and accordingly three system collations.

Impact of CUBRID_CHARSET

The locale part of CUBRID_CHARSET controls:

  • character supported in identifiers and casing rules (called "alphabet")
  • default locale for date - string conversion functions
  • default locale for number - string conversion functions
  • console conversion in CSQL

Casing and identifiers

In CUBRID, identifiers are cases insensitive. Tables, columns, session variables, triggers, stored procedures are stored in lower case. Authentication identifiers (user and group names) are stored in upper case.

The ISO-8859-1 charset contains only 255 characters, so the primitives are able to use built-in data. Also the EUC-KR charset, from which only the ASCII compatible characters are considered for casing (and are handled in the code), is built-in.

The UTF-8 charset is a special case: There are built-in variants of UTF-8 locales (like en_US.utf8, tr_TR.utf8 and ko_KR.utf8) and LDML locales.

The built-in variant implement only the characters specific to the locale (ASCII characters for en_US.utf8 and ko_KR.utf8, ASCII + Turkish glyphs [1] for tr_TR.utf8). This means that while all UTF-8 characters encoded on maximum 4 bytes are still supported and accepted as identifiers, most of them are not handled as letters, and treated as any normal Unicode character by casing primitives. For instance, character "È" (Unicode codepoint 00C8) is allowed, but an identifier containing it will not be normalized to "è" (lower case).

CREATE TABLE ÈABC;

Therefore, after running above query, it will have a table name with "Èabc" into the system table, _db_class.

Using a LDML locale (built-in variants can also be overridden with a LDML variant), extends the supported Unicode characters up to codepoint FFFF. For instance, if CUBRID_CHARSET=es_ES.utf8 and the corresponding locale library is loaded, the previous statement will create a table with the name "èabc".

As previously mentioned, a set of casing rules and supported characters (letters) forms an "alphabet" in CUBRID (this is actually a tag in LDML). Some locales, like tr_TR and de_DE have specific casing rules: - in Turkish: lower('I')='ı' (dot-less lower i); upper ('i') = 'İ' (capital I with dot). - in German: upper ('ß')='SS' (two capital S letters).

Because of this, such locales have two sets of alphabets : one which applies to system data (identifiers) and one which applies to user data. The alphabet applying to user data include the special rules, while the system (identifiers) alphabet do not, thus making the system alphabets compatible between locales. This is required to avoid issues with identifiers (like in Turkish, where casing of the group name "public" results in errors -> "PUBLİC" != "PUBLIC").

It also provides a compatibility between databases with different locales (should be able to export - import schema and data).

String literal input and output

String literals data may be entered to CUBRID by various ways:

  • API interface (CCI)
  • language dependent interface - JDBC, Perl driver, etc.
  • CSQL - command line from console or input file

When receiving character data through drivers, CUBRID cannot be aware of the charset of those strings. All text data contained between quotes (string literals) are handled by CUBRID as raw bytes; the charset meta-information must be provided by client. CUBRID provides a way for the client to instruct it about which type of encoding is using for its character data. This is done with the SET NAMES statement or with charset introducer.

Text Conversion for CSQL

Text console conversion works in CSQL console interface. Most locales have associated character set (or codepage in Windows) which make it easy to write non-ASCII characters from console. For example in LDML for tr_TR.utf8 locale, there is a line:

<consoleconversion type="ISO88599" windows_codepage="28599" linux_charset="iso88599,ISO_8859-9,ISO8859-9,ISO-8859-9">

If the user set its console in one of the above settings (chcp 28599 in Windows, or export LANG=tr_TR.iso88599 in Linux), CUBRID assumes all input is encoded in ISO-8859-9 charset, and converts all data to UTF-8. Also when printing results, CUBRID performs the reverse conversion (from UTF-8 to ISO-8859-9). In Linux, to prevent this transform, using UTF-8(ex: export LANG=tr_TR.utf8) directly is recommended.

The setting is optional in the sense that the XML tag is not required in LDML locale file. For example, the locale km_KH.utf8 does not have an associated codepage.

Example for configuring French language and inputting French characters:

Set CUBRID_LANG/CUBRID_CHARSET=fr_FR.utf8, enable fr_FR in cubrid_locales.txt and compile the locales(see Locale Setting).

In Linux:

  • set console to receive UTF-8; set LANG=fr_FR.utf8 or en_US.utf8 (any locale with UTF-8). This setting will allow to input any UTF-8 character (not only French specific)
  • or, set console to receive ISO-8859-15; set LANG=fr_FR.iso885915; in LDML <consoleconversion> tag, set linux_charset="iso885915". This will receive only ISO-8859-15 characters which will be converted by CSQL to UTF-8 encoding.

In Windows:

  • set windows codepage to 28605 (chcp 28605); in LDML <consoleconversion> tag, set windows_codepage="28605". Codepage 28605 is the corresponding for ISO-8859-15 charset.

At input, the console conversion process takes all input (including statements) and performs the conversion (only if it is required - if it contains characters that needs conversion). At output (printing results, error messages), CSQL is more selective and does not convert all texts. For instance, printing of numeric values is not filtered through console conversion (since number text contains only ASCII characters).

Unicode Normalization

Glyphs [1] can be written in various forms using Unicode characters/codepoints. Most known are the decomposed and composed forms. For instance, the glyph 'Ä' is written in composed form with a single codepoint : 00C4, in UTF-8 these has two bytes : C3 84. In (fully) decomposed form, it written with two codepoints: 0041 ('A') and 0308 (COMBINING DIAERESIS), and in UTF-8 is encode using 3 bytes : 41 CC 88. Most text editors are able to handle both forms, so both encodings will appear as the same glyph : 'Ä'. Internally, CUBRID "knows" to work only with "fully composed" text.

For clients working with "fully decomposed" text, CUBRID can be configured to convert such text to "fully composed" and serve them back as "fully decomposed". Normalization is not a locale specific feature, it does not depend on locale.

unicode_input_normalization system parameter controls the composition at system level. For more details, see unicode_input_normalization.

The main use case is with both enabled (unicode_input_normalization, unicode_output_normalization) : this ensures that a string from a client knowing only decomposed Unicode is still properly handled by CUBRID. A second use case is with unicode_input_normalization = yes and unicode_output_normalization = no, for a client able to handle both types of Unicode writing.

Contraction and Expansion of Collation

CUBRID supports contraction and expansion for collation. Contraction and expansion are available for UTF-8 charset collation. You can see the contraction and expansion of collation in the collation setting in the LDML file. Using contraction and expansion affects the size of locale data (shared library) and server performance.

Contraction

A contraction is a sequence consisting of two or more codepoints, considered a single letter in sorting. For example, in the traditional Spanish sorting order, "ch" is considered a single letter. All words that begin with "ch" sort after all other words beginning with "c", but before words starting with "d". Other examples of contractions are "ch" in Czech, which sorts after "h", and "lj" and "nj" in Croatian and Latin Serbian, which sort after "l" and "n" respectively. See http://userguide.icu-project.org/collation/concepts for additional information. There are also some contractions defined in http://www.unicode.org/Public/UCA/latest/allkeys.txt DUCET.

Contractions are supported in both collation variants : with expansions and without expansions. Contractions support requires changes in a significant number of key areas. It also involves storing a contraction table inside the collation data. The handling of contractions is controlled by LDML parameters DUCETContractions="ignore/use" TailoringContractions="ignore/use" in <settings> tag of collation definition. The first one controls if contractions in DUCET file are loaded into collation, the second one controls if contractions defined by rules in LDML are ignore or not (easier way then adding-deleting all rules introducing contractions).

Expansion

Expansions refer to codepoints which have more than one collation element. Enabling expansions in CUBRID radically changes the collation's behavior as described below. The CUBRIDExpansions="use" parameter controls the this behavior.

Collation without Expansion

In a collation without expansions, each codepoint is treated independently. Based on the strength of the collation, the alphabet may or may not be fully sorted. A collation algorithm will sort the codepoints by comparing the weights in a set of levels, and then will generate a single value, representing the weight of the codepoint. String comparison will be rather straight-forward. Comparing two strings in an expansion-free collation means comparing codepoint by codepoint using the computed weight  values.

Collation with Expansion

In a collation with expansions, some composed characters (codepoints) are to be interpreted as an ordered list of other characters (codepoints). For example, 'æ' might require to be interpreted the same way as 'ae', or 'ä' as ''ae'' or ''aa''. In DUCET, the collation element list of 'æ' will be the concatenation of collation element lists of both 'a' and 'e', in this order. Deciding a particular order for the codepoints is no longer possible, and neither is computing new weight values for each character/codepoint.

In a collation with expansions, string comparison is done by concatenating the collation elements for the codepoints/contractions in two lists (for the two strings) and then comparing the weights in those lists for each level.

Example 1

The purpose of these examples is to show that under different collation settings (with or without expansion support), string comparison might yield different results.

Here there are the lines from DUCET which correspond to a subset of codepoints to be used for comparisons in the examples below.

0041  ; [.15A3.0020.0008.0041] # LATIN CAPITAL LETTER A
0052  ; [.1770.0020.0008.0052] # LATIN CAPITAL LETTER R
0061  ; [.15A3.0020.0002.0061] # LATIN SMALL LETTER A
0072  ; [.1770.0020.0002.0072] # LATIN SMALL LETTER R
00C4  ; [.15A3.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS;
00E4  ; [.15A3.0020.0002.0061][.0000.0047.0002.0308] # LATIN SMALL LETTER A WITH DIAERESIS;

Three types of settings for the collation will be illustrated:

  • Primary strength, no casing (level 1 only)
  • Secondary strength, no casing (levels 1 and 2)
  • Tertiary strength, uppercase first (levels 1, 2 and 3)

From now on, sorting of the strings "Ar" and "Är" will be attempted.

Collation without Expansions Support

When expansions are disabled, each codepoint is reassigning a new single valued weight. Based on the algorithms described above the weights for A, Ä, R and their lowercase correspondents, the order of the codepoints for these characters, for each collation settings example above, will be as follows.

  • Primary strength: A = Ä < R = r
  • Secondary strength: A < Ä < R = r
  • Tertiary strength: A < Ä < R < r

The sort order for the chosen strings is easy to decide, since there are computed weights for each codepoint.

  • Primary strength: "Ar" = "Är"
  • Secondary strength: "Ar" < "Är"
  • Tertiary strength: "Ar" < "Är"

Collation with Expansions

The sorting order is changed for collation with expansion. Based on DUCET, the concatenated lists of collation elements for the strings from our samples are provided below:

Ar [.15A3.0020.0008.0041][.1770.0020.0002.0072]
Är [.15A3.0020.0008.0041][.0000.0047.0002.0308][.1770.0020.0002.0072]

It is rather obvious that on the first pass, for level 1 weights, 0x15A3 will be compared with 0x15A3. In the second iteration, the 0x0000 weight will be skipped, and 0x1770 will be compared with 0x1770. Since the strings are declared identical so far, the comparison will continue on the level 2 weights, first comparing 0x0020 with 0x0020, then 0x0020 with 0x0047, yielding "Ar" < "Är". The example above was meant to show how strings comparison is done when using a collation with expansion support.

Let us change the collation settings, and show how one may obtain a different order for the same strings when using a collation for German, where "Ä" is supposed to be interpreted as the character group "AE". The codepoints and collation elements of the characters involved in this example are as follows.

0041  ; [.15A3.0020.0008.0041] # LATIN CAPITAL LETTER A
0045  ; [.15FF.0020.0008.0045] # LATIN CAPITAL LETTER E
0072  ; [.1770.0020.0002.0072] # LATIN SMALL LETTER R
00C4  ; [.15A3.0020.0008.0041][.15FF.0020.0008.0045] # LATIN CAPITAL LETTER A WITH DIAERESIS; EXPANSION

When comparing the strings "Är" and "Ar", the algorithm for string comparison when using a collation with expansion support will involve comparing the simulated concatenation of collation element lists for the characters in the two strings.

Ar [.15A3.0020.0008.0041][.1770.0020.0002.0072]
Är [.15A3.0020.0008.0041][.15FF.0020.0008.0045][.1770.0020.0002.0072]

On the first pass, when comparing level 1 weights, 0x15A3 will be compared with 0x15A3, then 0x1770 with 0x15FF, where a difference is found. This comparison yields "Ar" > "Är", a result completely different than the one for the previous example.

Example 2

In Canadian French sorting by the collation with expansion, accent is compared from end of string towards the beginning.

  • Normal Accent Ordering: cote < coté < côte < côté
  • Backward Accent Ordering: cote < côte < coté < côté

Operations Requiring Collation and Charset

Charset

Charset information is required for functions which use character primitives. There are exceptions : OCTET_LENGTH() and BIT_LENGTH() do not require charset internally to return the length in bytes and bits. However, for the same glyph (character symbol) stored in different charset, they return different values:

CREATE TABLE t (s_iso STRING CHARSET iso88591, s_utf8 STRING CHARSET utf8);
SET NAMES iso88591;
INSERT INTO t VALUES('È','È');

-- the first returnes 1, while the second does 2
SELECT OCTET_LENGTH(s_iso), OCTET_LENGTH(s_utf8) FROM t;

The previous example should be run from console (or a client) with ISO-8859-1 charset.

Collation

Collation is required in functions and operators which involves a comparison between two strings or matching two strings. These includes functions like : STRCMP(), POSITION(), LIKE condition, and operators (<,= , >=, etc.). Also clauses like ORDER BY, GROUP BY and aggregates(MIN(), MAX(), GROUP_CONCAT()) use collation.

Also, collation is considered in UPPER() and LOWER() functions, in the following manner:

  • Each collation has a default (parent) locale.
  • UPPER and LOWER functions are performed using the user alphabet of the default locale of the collation.

For most collations, the default locale is obvious (is embedded in the name):

  • utf8_tr_cs → tr_TR.utf8
  • iso88591_en_ci → en_US (ISO-8859-1 charset)

The binary collations have the following default locales:

  • iso88591_bin → en_US (ISO-8859-1 charset)
  • utf8_bin (en_US.utf8 - built-in locale - and handles ASCII characters only)
  • euckr_bin (ko_KR.euckr - built-in locale - and handles ASCII characters only)

There are some generic collations available in LDML. These collations have as default locale, the locale in which they are first found. The order of loading is the locales order from $CUBRID/conf/cubrid_locales.txt. Assuming the default order (alphabetical), the default locale for all generic LDML collations is de_DE (German).

Charset conversion

For the three charsets supported by CUBRID the conversion rules are:

  • euckr to/from utf8 not allowed

  • iso88591 to utf8 - byte conversion - it is the only transition where all characters in the source can be encoded in the destination.

  • iso88591 to euckr - not allowed (some characters cannot be encoded in EUC-KR)

  • utf8 and euckr to iso88591 - byte reinterpret; When value (to be converted) has domain with infinite precision, the size of storage remains the same, but precision increases.

    Note

    In 9.0 version, when value has domain with finite precision, the precision is kept resulting in data truncation. In 9.1 version, CUBRID no longer truncates the data to keep precision; in cases where precision is not enforced, CUBRID will extend the precision in order to keep data integrity.

Collation settings impacting CUBRID features

LIKE Conditional Optimization

The LIKE conditional expression compares patterns between string data, and returns TRUE if a string whose pattern matches the search word is found.

As already proven above, when using a "collation without expansion support", each codepoint will receive a single integer value, representing its weight in the comparison process. This weight value is computed based on collation settings (strength, casing etc.). Due to the fact that characters can always be regarded as single entities, trying to match a string with a pattern using the LIKE predicate is equivalent to checking if the string can be found in a certain range of strings. For example in order to process a predicate such as ''s LIKE 'abc%' '', CUBRID will first rewrite it as a range restriction for the string "s". "s LIKE 'abc%'" means that "s" must start with the string "abc". In terms of string comparison, this is equivalent, in expansion-free collations, with "s" being greater than "abc", but smaller than its successor (using the English alphabet, the successor of "abc" would be "abd").

s LIKE 'abc%' → s ≥ 'abc' AND s < 'abd' (if using strictly the English alphabet)

This way, the actual interpretation of LIKE is replaced with simple comparisons, but "Collations with expansion support" behave differently. As described above, if a collation supporting expansions is used, single weight values are no longer calculated for each codepoint based on DUCET, but the information from their corresponding collation element list is stored with original values (even though it is compressed). To compare strings when using such a collation means comparing the concatenated lists of collation elements for each codepoint or expansion, level by level. For more information about comparing strings on the collation with expansion, see Expansion.

If the LIKE predicate rewrite method is kept the same as in a collation with no expansion support as above example, the comparison result can be wrong. To ensure the right query result, the LIKE predicate rewrite method is ran differently as the below example. That is, the LIKE predicate is added as a filter to exclude the wrong data which can be added in a collation with expansion.

s LIKE 'abc%' → s ≥ 'abc' AND s < 'abd' and s LIKE 'abc%' (if using strictly the English alphabet)

Prefix Index and Collation Expansion

A prefix index can be created on the collation without expansion; however, it cannot be created on the column which has the collation with expansion.

CREATE TABLE tbl (col1 VARCHAR(200) COLLATE utf8_ja_exp);
CREATE INDEX idx_tbl_col1 on tbl(col1(5));

ERROR: before ' ; '
Prefix index is not allowed on attribute 'col1' (has collation with expansions).

Index Covering

Covering index scan is query optimization, in which if all values in query can be computed using only the values found in the index, without requiring additional row lookup in heap file. For more information, see Covering Index.

In the collation without casing, for two strings values, 'abc' and 'ABC', only one value is stored in the index(this is either 'abc' or 'ABC' depending which one was inserted first). As a result, the incorrect result may happen when at least two different strings produce the same sort key in a given collation. For this reason, for all UTF-8 collations with strength level less than 4 (quaternary), the index covering query optimization is disabled.

This is controlled by strength="tertiary/quaternary" in <strength> tag of collation definition in LDML. It should be considered to set this level as maximum strength, because the quaternary strength level requires not only more memory space and bigger size of the shared library file, but also string-comparison time.

For more information about collations, see Collation.

Summary of CUBRID Features for Each Collation

Collation LIKE condition kept after rewrite to range Allows index covering Allows prefix index
iso88591_bin No Yes Yes
iso88591_en_cs No Yes Yes
iso88591_en_ci Yes No Yes
utf8_bin No Yes Yes
euckr_bin No Yes Yes
utf8_en_cs No Yes Yes
utf8_en_ci Yes No Yes
utf8_tr_cs No Yes Yes
utf8_ko_cs No Yes Yes
utf8_gen No Yes Yes
utf8_gen_ai_ci Yes No Yes
utf8_gen_ci Yes No Yes
utf8_de_exp_ai_ci Yes No No
utf8_de_exp Yes No No
utf8_es_cs No Yes Yes
utf8_fr_exp_ab Yes No No
utf8_ja_exp Yes No No
utf8_ja_exp_cbm Yes No No
utf8_km_exp Yes No No
utf8_ko_cs_uca No Yes Yes
utf8_tr_cs_uca No Yes Yes
utf8_vi_cs No Yes Yes

Viewing Collation Information

To view the collation information, use CHARSET(), COLLATION() and COERCIBILITY() functions.

The information of the database collation can be shown on db_collation system view or SHOW COLLATION.

Configuration Guide for Characters

Database designers should take into account character data properties when designing the database structure. The following is the summarized guide when configuring aspects related to CUBRID character data.

CUBRID_CHARSET

  • by default, use CUBRID_CHARSET=en_US; this gives best performance.
  • using UTF-8 locale will increase storage requirement of fixed char(CHAR) by 4 times; using EUC-KR increases storage 3 times.
  • if user string literals have different charset and collation from system, query strings will grow as the string literals are decorated with them.
  • if localized (non-ASCII) characters will be used for identifiers, then use an .utf8 locale
  • once established the UTF-8 charset for CUBRID_CHARSET, it is best to use a LDML locale (this ensures that identifier names containing most Unicode characters are correctly cased).
  • setting a locale affects also conversion functions(intl_date_lang, intl_number_lang).
  • when you set CUBRID_CHARSET, there should be no concern on charset and collation of string-literals or user tables columns; all of them can be changed at run-time (with CAST() in queries) or ALTER .. CHANGE for a permanent change.

CHAR and VARCHAR

  • generally, use VARCHAR if there are large variations in actual number of characters in user data.
  • CHAR type is fixed length type. Therefore, Even if you store only English character in CHAR type, it requires 4 bytes storage in UTF-8 and 3 bytes in EUC-KR.
  • the precision of columns refers to the number of characters (glyphs).
  • after choosing precision, charset and collation should be set according to most used scenarios.

Choosing charset

  • even if your text contains non-ASCII character, use utf8 or euckr charsets only if application requires character counting, inserting, replacing
  • for CHAR data, the main concern should storage requirement (4x or utf8, 3x for euckr)
  • for both CHAR and VARCHAR data, there is some overhead when inserting/updating data: counting the precision (number of characters) of each instance is more consuming for non-ISO charsets.
  • in queries, charset of expressions may be converted using CAST() operator.

Choosing collation

  • if no collation dependent operations are performed (string searching, sorting, comparisons, casing), than choose binary collation for that charset
  • collation may be easily overridden using CAST() operator, and COLLATE modifier (in 9.1 version) if charset is unchanged between original charset of expression and the new collation.
  • collation controls also the casing rules of strings
  • collations with expansions are slower, but are more flexible and they perform whole-word sorting

Normalization

  • if your client applications send text data to CUBRID in decomposed form, then configure unicode_input_normalization = yes, so that CUBRID re-composes it and handles it in composed form
  • if your client "knows" to handle data only in decomposed form, than set unicode_output_normalization = yes, so that CUBRID always sends in decomposed form.
  • if the client "knows" both forms, then leave unicode_output_normalization = no

CAST vs COLLATE

Remark

  • Charset is assumed to be the same per CUBRID instance. Providing direct UTF-8 input from a client through CCI-JDBC is possible to a CUBRID instance started with UTF-8 charset. This is due to charset conversions (when CUBRID is using ISO charset, all input is assumed ISO and is converted to UTF-8, even client native UTF-8 strings). ASCII compatible characters are fully compatible with both ISO and UTF-8, and will not suffer any transformation.
  • Query plans printing: collation is not displayed in plans for results with late binding.
  • Only the Unicode code-points in range 0000-FFFF (Basic Multilingual Plan) are normalized.
  • Some locales use space character as separator for digit grouping (thousands, millions, ..). Space is allowed but not working properly in some cases of localized conversion from string to number.
  • User defined variable cannot be changed into the different collation from the system collation. For example, "set @v1='a' collate utf8_en_cs;" syntax cannot be executed when the system collation is iso88591.

Footnotes

[1](1, 2) glyph: an element for the shape of a character; a graphic symbol which indicates a shape or a form for a character. Because a glyph specifies the shape which is shown, several glyphs about one character can exist.

Table Of Contents

Previous topic

Partitioning

Next topic

Transaction and Lock

This Page