Open Source RDBMS - Seamless, Scalable, Stable and Free

한국어 | Login |Register

Versions available for this page: CUBRID 9.0.0 | 

Collation and Charset of Column

Collation (and character set) applies to string data types: VARCHAR (STRING), CHAR.

By default, all string data types inherit the default database collation and character set, but CUBRID supports two modifiers which affect collation and character set.

Charset

Character set may be specified as character string literal or as non-quoted identifier.

Supported character sets:

  • ISO-8859-1 (*)
  • UTF-8 (with maximum 4 bytes per characters, which means it supports codepoints from 0 to 0x10FFFF)
  • EUC-KR (the support for this character set is only for backward compatibility reasons, its usage is not recommended)

Character set may be specified as character string literal or as non-quoted identifier.

Supported character sets:

  • ISO-8859-1 (*)
  •   UTF-8 (with maximum 4 bytes per characters, which means it supports codepoints from 0 to 0x10FFFF)
  • EUC-KR (the support for this character set is only for backward compatibility reasons, its usage is not recommended)

* Note Previous versions of CUBRID supported EUC-KR characters when ISO-8859-1 charset (the single one available) was set. In Apricot, this is no longer available. EUC-KR characters should be used only with EUC-KR charset.

String Check

By default, all input data is assumed to be in the server character (set with CUBRID_LANG environment variable).  This may be overridden by SET NAMES or charset introducer (or COLLATE string literal modifier) (For more information, see Globalization > Collation of Charset and String.

Invalid data may lead to undefined behavior or even crashes if string checking is disabled (by default is disabled). This can be enabled by intl_check_input_string system parameter. However, if you are sure that only valid data is input, you can obtain better performance by disabling string check.

Only UTF-8 and EUC-KR text data is checked for valid encodings. Since ISO-8859-1 is single byte encoding and all byte values are valid, there is no checking on this charset.

Charset Conversion

When collation/charset modifiers or normal collation inference requires it, character conversion may occur. Conversions are not reversible. The single effective charset conversion is from ISO88591 charset to UTF-8 charset. Losses may occur during this conversion: bytes  range 80-A0 are not valid ISO-8859-1 characters but may appear in strings. After conversion to UTF-8 this characters are replaced with '?'.

Conversion from UTF-8 or EUC-KR to ISO-8859-1 charset is a simple data stream re-interpretations (this is a trade-off since most Unicode characters do not have ISO-8859-1 correspondents).

ASCII characters are not affected by conversions: bytes in range 00-7F are encodings of the same characters in both ISO-8859-1  and UTF-8 character sets.

Rules for conversion of values from one charset to another:

Source\Destination

ISO-8859-1

UTF-8

EUC-KR

ISO-8859-1

No change

Byte conversion.
The byte size increases but the character length is the same.

Not allowed

UTF-8

Byte reinterpretation.
The byte size is the same but character length increases.

No change

Not allowed

EUC-KR

Byte reinterpretation.
The byte size is the same but character length increases.

Not allowed

No change

Collation

Collation may be specified as character string literal or as non-quoted identifier.

The following is a query on the _db_collation sytem table.

coll_id  coll_name        charset_name    is_builtin  has_expansions  contractions uca_strength

==========================================================================================

0        'iso88591_bin'   'ISO8859-1'    'YES'        'NO'            0   'NOT APPLICABLE'

1        'utf8_bin'       'UTF-8'        'YES'        'NO'            0   'NOT APPLICABLE'

2        'iso88591_en_cs' 'ISO8859-1'    'YES'        'NO'            0   'NOT APPLICABLE'

3        'iso88591_en_ci' 'ISO8859-1'    'YES'        'NO'            0   'NOT APPLICABLE'

4        'utf8_en_cs'     'UTF-8'        'YES'        'NO'            0   'NOT APPLICABLE'

5        'utf8_en_ci'     'UTF-8'        'YES'        'NO'            0   'NOT APPLICABLE'

6        'utf8_tr_cs'     'UTF-8'        'YES'        'NO'            0   'NOT APPLICABLE'

7        'utf8_ko_cs'     'UTF-8'        'YES'        'NO'            0   'NOT APPLICABLE'

8        'euckr_bin'      'KSC-EUC'      'YES'        'NO'            0   'NOT APPLICABLE'

Built-in collations are available without requiring additional user locale libraries.

Each collation has an associated charset. For this reason, it is not allowed to set incompatible pair to character set and collation.

When COLLATE modifier is specified without CHARSET, then the default charset of collation is set.

When CHARSET modifier is specificer without COLLATE, then the default collation is set. The default collation for character sets are the binary collation:

  • ISO-8859-1 : iso88591_bin
  • UTF-8 : utf8_bin
  • EUC-KR: euckr_bin

For more information on how to determine the collation among the expression parameters (operands) with different collations (and charsets), see How to Determine Collation among Columns with Different Collations.

Syntax

CUBRID supports two modifiers which affect collation and character set without following the default database collation and character set.

  • CHARACTER_SET (alias CHARSET) changes the columns character set
  • COLLATE (alias COLLATION) changes the collation

<data_type> ::=

<column_type> [<charset_modifier_clause>] [<collation_modifier_clause>]

 

<charset_modifier_clause> ::= {CHARACTER_SET | CHARSET} {<char_string_literal> | <identifier> }

 

<collation_modifier_clause> ::= {COLLATE | COLLATION} {<char_string_literal> | <identifier> }

Example

The following example shows how to set the charset of the STRING type (the maximum value of the VARCHAR type) column to UTF-8

CREATE TABLE t1 (s1 STRING CHARSET utf8);

The following example shows how to change the name of column s1 to c1 and the type to CHAR(10) with the collation of utf8_en_cs (the charset is the default charset of the collation, UTF-8).

ALTER TABLE t1 CHANGE s1 c1 CHAR(10) COLLATE utf8_en_cs;

The value of the c1 column is changed to the VARCHAR(5) type of which collation is iso88591_en_ci. It is performed by using the collation iso88591_en_ci for the type of column selected first or by using sorting.

SELECT CAST (c1 as VARCHAR(5) COLLATE 'iso88591_en_ci') FROM t1 ORDER BY 1;

The following query (same sorting) is similar to the above but the output column result is the original value.

SELECT c1 FROM t1 ORDER BY CAST (c1 as VARCHAR(5) COLLATE iso88591_en_ci);

How to Determine Collation among Columns with Different Collation

CUBRID determines the collation and charset to be used for detecting columns when the columns (expressions) have different collations and charsets.

CREATE TABLE t (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE utf8_tr_cs);

-- insert values into both columns

SELECT s1, s2 FROM t WHERE s1 > s2;

In the above example, column s1 and column s2 have different collations. Comparing s1 with s2 means comparing the strings to determine which column value is "larger" among the records on the table t. The collation utf8_en_cs and the collation utf8_tr_cs cannot be compared to each other, so an error will be output.

Collation coercibility is used to determine the result collation of comparison expression. It expresses how easily the collation can be converted to the collation of the opposite argument. High collation coercibility when comparing two operands of an expression means that the collation can be easily converted to the collation of the opposite argument. That is, an argument with high collation coercibility can be changed to the collation of an argument with lower collation coercibility.

When an expression has various arguments with different collation, a common collation is computed based on each arguments collation and coercibility. The rules for collation inference are:

  1. Arguments with higher coercibility are coerced (or casted) to collation of arguments with lower coercibility
  2. When arguments have different collation but same coercibility, the expression’s collation cannot be resolved and an error is returned.
  3. Arguments which are sub-expressions with CAST operator are transparent for collations: the collation propagates to the operand of CAST; an argument which is an expression with CAST operator is still handled as a regular expression in terms of argument coercibility like any regular operator.

Level of Collation Change

Parameter (Operand) of the Expression

5
Convertible (string)

Constant

Host variable

An argument that contains system collation by default (iso88591_bin, utf8_bin) (*)

4
Convertible (system constant)

Special functions (USER (), DATABASE (), SCHEMA (), VERSION ())

3
Convertible (expression)

SELECT Value, sub-expression

2
Convertible (reserved)

Not used now

1
Convertible (implied collation)

Column

0
Non-convertible (explicit collation)

Not used now

(*) binary collation override the coercibility of argument type. General column arguments are not coercible, but columns with binary collations become fully coercible.

The following example shows converting two parameters with different collation to one collation.

Converting Desired Collation by Specifying It

The SELECT statement, failing to execute in the above example, is successfully executed by specifying a collation on one column by using the CAST function as shown in the following query; then the two operands have the same collation.

SELECT s1, s2 FROM t WHERE s1 > CAST (s2 AS STRING COLLATE utf8_en_cs);

Also, by CAST s2 to binary collation, the s1 collation coercibility is 5, "fully convertible".

SELECT s1, s2 FROM t WHERE s1 > CAST (s2 AS STRING COLLATE utf8_bin);

In the following query, the second operand "CAST (s2 AS STRING COLLATE utf8_tr_cs)" is a sub-expression. The sub-expression has higher coercibility than the column (s1) so "CAST (s2 AS STRING COLLATE utf8_tr_cs)" is converted to the collation of s1.

SELECT s1, s2 FROM t WHERE s1 > CAST (s2 AS STRING COLLATE utf8_tr_cs);

Any expression has higher coercibility than any column. So "CONCAT (s2,'')" is converted to the collation of s1 in the following query and the query is successfully performed.

SELECT s1, s2 FROM t WHERE s1 > CONCAT (s2,'');

Converting Collation of Constant and Column

In the following case, comparison is made by using the collation of s1.

SELECT s1, s2 FROM t WHERE s1 > 'abc';

When a Column is Created with Binary Collation

CREATE TABLE t2 (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE utf8_bin);

SELECT s1, s2 FROM t WHERE s1 > s2;

In this case, s2 is the binary collation. Therefore, its coercibility is 5 and s2 can be "fully convertible" to the collation of s1. utf8_en_cs is used.

CREATE TABLE t2 (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE iso88591_bin);

SELECT s1, s2 FROM t WHERE s1 > s2;

In this case, utf8_en_cs is used as collation, too. However, some overhead occurs to convert the charset to UTF-8 since s2 is the ISO charset. Charset conversion is made only when converting ISO to UTF-8.

In the following query, the charset is not converted (UTF08 byte data in s2 is easily reinterpreted to the ISO-8859-1 charset) but character comparison is made by using the iso88591_en_cs collation.

CREATE TABLE t2 (s1 STRING COLLATE iso88591_en_cs, s2 STRING COLLATE utf8_bin);

SELECT s1, s2 FROM t WHERE s1 > s2;

Converting Collation of Sub-Expression and Column

Coercibility of sub-expressions is higher than coercibility of columns

CREATE TABLE t (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE utf8_tr_cs);

SELECT s1, s2 FROM t WHERE s1 > s2 + 'abc';

In this case, the second operand is the expression, so the collation of s1 is used.

In the following example, an error occurs. An error occurs because '+' operation is tried for s2 and s3 where the collation is different.

CREATE TABLE t (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE utf8_tr_cs, s3 STRING COLLATE utf8_en_ci);

SELECT s1, s2 FROM t WHERE s1 > s2 + s3;

In the following example, the collation of s2 and s3 is utf8_tr_cs. Therefore, the collation of '+' expression is utf8_tr_cs, too. Expressions have higher coercibility than columns. Therefore, comparison operation is made by using the utf8_en_cs collation.

CREATE TABLE t (s1 STRING COLLATE utf8_en_cs, s2 STRING COLLATE utf8_tr_cs, s3 STRING COLLATE utf8_tr_cs);

SELECT s1, s2 FROM t WHERE s1 > s2 + s3;