Internationalization with CUBRID OLE DB Provider


The scope of this tutorial is to share some information on using CUBRID OLE DB Data Provider version 8.4.1 whenever internationalization and encoding must be implemented in your project. We will show you how to deal with non-ASCII characters: how to update and retrieve UTF-8 encoded data.

If you are not familiar with the CUBRID OLE DB Data Provider, please consider reading the introduction tutorials. Also, we are assuming that you are familiar with OLE DB technology in general and with the CUBRID RDBMS (http://www.cubrid.org).

Requirements:

CUBRID and Internationalization

Probably many of you are already familiar with the way MySQL handles Unicode. In particular, in MySQL you can configure the desired encoding in the connection string and on database objects (tables, columns):

SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;

CUBRID, on the other hand, implements a slightly different model. Users can store whatever data they want, including data in UTF-8 encoding, without specifying table/column encodings. In fact, the CUBRID server is not aware of which character set(s) you use, because it stores, reads, and compares the data in “plain” bytes. From the database engine's perspective, there are bytes and only bytes!

So we have CUBRID Rule #1:
It is the client’s responsibility to encode the data the way it is needed (UTF-8, in particular, is most probably used for non-ASCII data) before it is sent to CUBRID. CUBRID will store the byte sequence unaltered.

And the immediate consequence is Rule #2:
CUBRID will send the exact stored bytes back to the client. It is the client’s responsibility to interpret those bytes (decode them from UTF-8 into a wide char data type, for example).
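To make the “bytes only” model concrete, here is a small stand-alone sketch (not part of the driver) that shows the raw bytes a client would send for a UTF-8 string; per the two rules above, these exact bytes are what CUBRID stores and later returns:

```cpp
#include <cstdio>
#include <string>

// Hex dump of the raw bytes a UTF-8 client string occupies.
// These are exactly the bytes CUBRID stores and sends back unaltered.
std::string hex_bytes(const std::string &s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02X ", c);
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop trailing space
    return out;
}

// Example: the Korean syllable U+CC44 ("채") is the 3-byte UTF-8 sequence
// EC B1 84, written here as explicit escapes so the result does not depend
// on the source file's encoding:
//   hex_bytes("\xEC\xB1\x84")  ->  "EC B1 84"
```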

Single byte comparison

The last thing to keep in mind when dealing with character sets is the way CUBRID compares data. As we already stated, CUBRID stores, retrieves, and compares data in bytes. So, when you want to store text with multi-byte characters, set single_byte_compare=yes (the default is “no”) in cubrid.conf. Otherwise, the CUBRID server will try to compare in units of two or three bytes and may return a different result.
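In cubrid.conf itself this is a one-line change (shown as used with CUBRID 8.x; please check the manual for your version, as parameter names can change between releases):

```
# cubrid.conf
single_byte_compare=yes
```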

One easy way to do this is using CUBRID Manager:

[Figure: single_byte_compare configuration in CUBRID Manager]

Expect more support for character sets in the next versions of CUBRID. It is a work in progress, expected to be available in upcoming releases – stay tuned! ☺

VARCHAR/CHAR data type handling

Let’s take a look at a concrete example of handling non-ASCII data in a VARCHAR column. We will insert and retrieve some Korean data, using UTF-8 for encoding. We will also use CDynamicAccessor-based code and a table with just one VARCHAR column.

Here is how we can insert the data (for simplicity, all error handling code has been removed):

CCommand<CNoAccessor, CNoRowset> cmd;
 
#pragma execution_character_set("utf-8") 
 
TCHAR *sql_val = L"채식주의자 입니다";
 
TCHAR *sql = (TCHAR *)malloc(sizeof(TCHAR) * 128);
memset(sql, 0, sizeof(TCHAR) * 128);
wcscat(sql, L"INSERT INTO `t` VALUES('");
wcscat(sql, sql_val);
wcscat(sql, L"')");
 
//Get UTF-8 required length
int utf8_length = WideCharToMultiByte(
    CP_UTF8,		// Convert to UTF-8
    0,		// No special character conversions required 
    sql,		// UTF-16 string to convert from
    -1,		// the UTF-16 string is NUL-terminated
    NULL,		// Determining correct output buffer size
    0,		// Determining correct output buffer size
    NULL,		// Must be NULL for CP_UTF8
    NULL		// Must be NULL for CP_UTF8
 );

…

//Allocate space for the UTF-8 string
char* sql_utf8 = (char *)malloc(sizeof(char) * utf8_length);
 
utf8_length = WideCharToMultiByte(
    CP_UTF8,
    0,					
    sql,
    -1,		
    sql_utf8,
    utf8_length,
    NULL,
    NULL
);

…
 
HRESULT hr = cmd.Open(session, sql_utf8);

…
	
cmd.Close();

If you look in CUBRID Manager, you will see that the expected data is there (please make sure you set UTF-8 for the encoding in the CM client):

[Figure: View table data in CUBRID Manager]

If you wonder what the following line is all about

#pragma execution_character_set("utf-8")

… please read http://support.microsoft.com/kb/980263.

Summarizing, the things you should notice are:

  • When using the CUBRID OLE DB Provider, the recommended way to handle data is via the wide char data type (TCHAR).
  • We encoded the TCHAR data to UTF-8 before sending it to CUBRID; to accomplish that, we used the WideCharToMultiByte function (which must be called twice: once to get the needed buffer length and once to perform the conversion).
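If you are curious what the UTF-16 to UTF-8 step actually produces, here is a minimal, portable encoder that mirrors what WideCharToMultiByte(CP_UTF8, ...) does for BMP characters. This is an illustration only, assuming no surrogate pairs in the input; on Windows you should of course keep using the API itself:

```cpp
#include <cstdint>
#include <string>

// Minimal UTF-16 (BMP only) to UTF-8 encoder, illustrating the byte
// sequences that WideCharToMultiByte(CP_UTF8, ...) produces.
// Assumption: the input contains no surrogate pairs.
std::string utf16_to_utf8_bmp(const std::u16string &in) {
    std::string out;
    for (char16_t c : in) {
        uint32_t cp = c;
        if (cp < 0x80) {                                  // 1-byte sequence
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                          // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                          // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, utf16_to_utf8_bmp(u"\uCC44") yields the three bytes EC B1 84 – the same bytes the driver sends to the server for the Korean syllable 채.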

And now let’s see how we can retrieve the same data we have previously inserted:

CCommand<CDynamicAccessor, CRowset> cmd2;
hr = cmd2.Open(session, L"SELECT * FROM `t`");
…
 
hr = cmd2.MoveNext();
…
 
void *val = cmd2.GetValue(1);
char *ptr_val = (char *)val;
 
//Get wide char required length
int wchar_length = MultiByteToWideChar(
    CP_UTF8,			// convert from UTF-8
    MB_ERR_INVALID_CHARS,	// error on invalid chars
    ptr_val,			// source UTF-8 string
    -1,				// source string is NUL-terminated
    NULL,			// determining required buffer size
    0
);

…

//Allocate space for the wide char string
TCHAR *my_ptr_val = (TCHAR *)malloc(wchar_length * sizeof(TCHAR));
memset(my_ptr_val, '\0', wchar_length * sizeof(TCHAR));
 
wchar_length = MultiByteToWideChar(
    CP_UTF8,			// convert from UTF-8
    MB_ERR_INVALID_CHARS,
    ptr_val,			// source UTF-8 string
    -1,				// source string is NUL-terminated
    my_ptr_val,			// destination buffer
    wchar_length		// destination size, in wide characters (not bytes)
);
	
…


ATLASSERT(EQ_STR(my_ptr_val,  sql_val) == true);

As you can see, what we did was:

  • Used GetValue() to retrieve the data
  • Got a pointer to the data bytes
  • Used the MultiByteToWideChar function to convert the previously inserted UTF-8 data to wide char.

As you can see from this example, it is not really difficult to handle internationalized data using the driver ☺.

The things to remember always are:

  • Encode the data in the desired encoding before sending the data to CUBRID.
  • CUBRID returns “plain” bytes, so it is the client’s responsibility to “interpret” them in the correct encoding.

What about CHAR data type…?

Well, there is nothing really special – the same code we have shown above will also work with CHAR data types. The only difference you have to keep in mind is that the CHAR data type is padded with spaces – read more in the CHAR manual page.
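Because a CHAR(n) value comes back padded to its declared length, you will typically want to strip the trailing spaces after retrieval. A minimal sketch (the pad character is an ASCII space, so this works on the raw UTF-8 bytes, before the wide-char conversion):

```cpp
#include <string>

// Strip the trailing spaces that pad a CHAR(n) value up to its declared
// length; VARCHAR/STRING values need no such treatment.
std::string rtrim_char_padding(std::string s) {
    std::string::size_type end = s.find_last_not_of(' ');
    if (end == std::string::npos)
        return "";              // the value was all spaces
    s.erase(end + 1);
    return s;
}
```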

STRING data type

As you know, STRING is a data type that was introduced in CUBRID not long ago. Internally, the STRING data type is stored as VARCHAR(1073741823).

However, because the engine returns no length and because of some limitations in the driver implementation, the STRING data type sometimes needs to be handled in a specific way.

In particular, when dealing with internationalization and STRING, we recommend using standard accessor consumer templates. Even though you can theoretically store up to 1073741823 bytes in a STRING column, this is unlikely, so it is reasonable to assume you know a shorter maximum length for your data (in the worst case scenario, just use the maximum length, of course).

So let’s create a table and insert some data, the same way we did in the previous example:

if(!ExecuteSQL(_T("CREATE TABLE `t`(`str` STRING)")))
    return false;
	
CCommand<CNoAccessor, CNoRowset> cmd;
 
#pragma execution_character_set("utf-8") 
 
TCHAR *sql_val = L"Одного языка никогда недостаточно";
TCHAR *sql = (TCHAR *)malloc(sizeof(TCHAR) * 128);
memset(sql, 0, sizeof(TCHAR) * 128);
wcscat(sql, L"INSERT INTO `t` VALUES('");
wcscat(sql, sql_val);
wcscat(sql, L"')");
…

Next, let’s make sure we have our Accessor defined (we assume a maximum data length of 8912 bytes in the example below):

class CMyAccessor
{
public:
	char m_str[8912];
 
	BEGIN_COLUMN_MAP(CMyAccessor)
		COLUMN_ENTRY(1, m_str)
	END_COLUMN_MAP()
};

And here is how we can retrieve the data:

CCommand <CAccessor<CMyAccessor>> cmd2;
 
hr = cmd2.Open(session, L"SELECT * FROM `t`");
…
hr = cmd2.MoveNext();
…
 
char *ptr_val = cmd2.m_str;
 
//Get wide char required length
int wchar_length = MultiByteToWideChar(
    CP_UTF8,			// convert from UTF-8
    MB_ERR_INVALID_CHARS,	// error on invalid chars
    ptr_val,			// source UTF-8 string
    -1,				// source string is NUL-terminated
    NULL,			// determining required buffer size
    0
);

…

//Allocate space for the wide char string
TCHAR *my_ptr_val = (TCHAR *)malloc(wchar_length * sizeof(TCHAR));
memset(my_ptr_val, '\0', wchar_length * sizeof(TCHAR));
 
wchar_length = MultiByteToWideChar(
    CP_UTF8,			// convert from UTF-8
    MB_ERR_INVALID_CHARS,
    ptr_val,			// source UTF-8 string
    -1,				// source string is NUL-terminated
    my_ptr_val,			// destination buffer
    wchar_length		// destination size, in wide characters (not bytes)
);
	
…
	

ATLASSERT(EQ_STR(my_ptr_val,  sql_val) == true);

As you can see, there is nothing really different: the same concepts of encoding and decoding the data at the “client level” apply when using a non-dynamic accessor.

Links & Resources

This concludes the Internationalization with CUBRID OLE DB Provider tutorial. Please let us know your feedback – we highly appreciate your suggestions and comments - and remember to periodically check the CUBRID web site for more CUBRID tutorials and resources.

If you have questions, please ask them in our Q&A site.

Author: CUBRID