Oracle Charset and codepage

richard6688 · 2013-12-09 02:34:13

This time, I use mORMot to process some data which charset is Chinese. My Env is Os_Win7_codepage_936/Oracle_11g_charset_AL32UTF8.
I create prop with default codepage param 0, the result is mess character returned. After a few hour of tracing code, finally I change the codepage to cp_utf8 to make my work done. But the problem here still exists.

Follow the code in unit SynDBOracle.pas, when I use default param 0 to create prop, the Class use GetACP to obtain the OS default codepage 936 first. When query data, convert data returned by OCI from codepage 936 to UTF8. So, I use CP_UTF8 to bypass this convertion. From here, I think SynDBOracle may have a wrong concept about codepage and charset.

I have used Oracle since Oracle8. Each time, charset give us a little trouble when starting a new project, because Chinese Charecter is involved mainly. codepage , charset and NLS_LANG make us mad.

To make clear about problem, I'd like to discuss the concept about codepage , charset, which may make things right.

I think, the codepage and NLS_LANG is related to the client tools of Oracle, such as SQL*PLUS, EXP/IMP; but for direct connect, only charset is necessary to concern. Because, when mORMot connect to Oracle DB using OCI, the client means mORMot, the charset used by client means UTF8. If DB use UTF8, no conversion is needed. If DB use the charset other than UTF8, the conversion is needed. so here mORMot is nothing related to codepage concept. codepage is related to OS or App.

So, Based on the viewpoint above, CreateWithCodePage should changed to CreateWithCharset; the aCodepage param should change to aCharset. for default charset string, we can query DB to get the charset used by DB, or just set it as 'UTF8' simply.

I do'nt know what descripted here is make things clear, because English isn't my native language.

For Programming with Unicode, oracle have the doc here
http://docs.oracle.com/cd/B28359_01/ser … nicode.htm

By the way, another problem is the error msg returned by OCI display not right, I donot know how mORMot convert this message.

I know all the things discussed here not related to ASCII, because all the charset is superset of ASCII. for the guy not use CJK charset, it is difficult to find such error.

And also codepage/charset list in SynDBOracle may not be right for codepage 936. for codepage, consider following table

Common IANA character set Name      vs.     Oracle character set name
   (Http header & html meta tag)               (NLS_LANG character set part)
 

    UTF-8                                           UTF8
    windows-1250                                EE8MSWIN1250
    windows-1251                                CL8MSWIN1251
    windows-1252                                WE8MSWIN1252
    windows-1253                                EL8MSWIN1253
    windows-1254                                TR8MSWIN1254
    windows-1255                                IW8MSWIN1255
    windows-1256                                AR8MSWIN1256
    windows-1257                                BLT8MSWIN1257
    windows-1258                                VN8MSWIN1258
    windows-936 or GBK                        ZHS16GBK
    Big5                                            ZHT16MSWIN950
    Big5-HKSCS                                  ZHT16HKSCS – Hong Kong extension of big5
    TIS-620                                       TH8TISASCII
    Shift_JIS                                     JA16SJIS
    korean or KS_C_5601-1989             KO16MSWIN949

ab · 2013-12-09 09:42:19

Yes, in production, we use here huge Oracle databases with a dedicated Charset, which always map the Windows code page.
For instance, for our customers in Korea, we have a KO16MSWIN949 database, and GetACP=949 on the workstations.

I agree this is a convention on our side only, and should be made less specific.
Especially, new DB should better use UTF8 encoding, to support Unicode as required by most modern applications.

We have created a ticket to follow this refactoring.
See http://synopse.info/fossil/info/a6a639ec43

What are your expecations, for changing how TSQLDBOracleConnectionProperties should be initialized?
I believe even CreateWithCodePage/CreateWithCharSet is a wrong idea, since it breaks the default inherited TSQLDBConnectionProperties.Create() signature.
Some general custom parameters may definitively make sense, e.g. as is done with SynDBFireDAC (i.e. add parameters to the ServerName or DatabaseName parameters, encoded as URI, with a TSQLDBOracleConnectionProperties.URI() class function).

Feedback is welcome!

richard6688 · 2013-12-09 11:57:01

In my opnion, there is no CreateWithCodePage/CreateWithCharSet needed. when creating connection to OCI, just query Oracle for what Charset is used, this may return charset string such as AL32UTF8/KO16MSWIN949/ZHS16GBK, after that you may use OCINlsCharSetNameToId or a sql query to get related charset_ID;

sql to get DB Charset:

SELECT PROPERTY_VALUE
  FROM DATABASE_PROPERTIES
 WHERE PROPERTY_NAME = 'NLS_CHARACTERSET'

sql to convert charsetID

 SELECT NLS_CHARSET_ID('AL32UTF8') FROM dual

'AL32UTF8' here is the charset string to query.

but for this auto solution, I don't know if it's available in old db system such as Oracle DB 8.

Last edited by richard6688 (2013-12-09 11:57:51)

ab · 2013-12-09 15:34:56

Nice solution!

We have removed CreateWithCodePage() constructor: now the charset will be retrieved at connection, and used for CHAR/NVARCHAR2 fields.

See http://synopse.info/fossil/info/2950b39ac5

Hope it works on your side too.

Thanks a lot for the feedback!
You are in the contributor list at the beginning of SynDBOracle, now.

richard6688 · 2013-12-10 01:10:38

thanks for the quick code-fixed.

I have test the new code about SynDBOracle. Still have a problem here.
You did not process the UTF-8 compatible charset. AL32UTF8 charset is a superset of UTF8. it's UTF8 of Oracle.
on http://docs.oracle.com/cd/B28359_01/ser … m#NLSPG608
they said:

AL32UTF8

An Oracle Database character set for the SQL CHAR data type, which is used for the database character set. It encodes Unicode data in the UTF-8 encoding.

But Its oracle charset ID is not in the list you refer: http://www.mydul.net/charsets.html
folowing code will retrival all the charset ID list of oracle

select nls_charset_id(value) nls_charset_id,  value nls_charset_name, to_char(nls_charset_id(value),'xxxx') hex_id
from  v$nls_valid_values
where parameter = 'CHARACTERSET'
order by nls_charset_id(value)

the list of My oracle 11g here

...
"871","UTF8","  367"
"872","UTFE","  368"
"873","AL32UTF8","  369"
...

Unfortunately, UTF8 ID <> AL32UTF8 ID, so you shall make modified to code.
The Simple solution may modify TSQLDBOracleConnection.STRToUTF8 such as

const OCI_AL32UTF8=$369
procedure TSQLDBOracleConnection.STRToUTF8(P: PAnsiChar; var result: RawUTF8;
  ColumnDBCharSet,ColumnDBForm: integer);
var L: integer;
begin
  L := StrLen(PUTF8Char(P));
  if (L=0) or (ColumnDBCharSet in [OCI_UTF8, OCI_AL32UTF8]) or (ColumnDBForm=SQLCS_NCHAR) or
     (fOCICharSet in [OCI_UTF8, OCI_AL32UTF8]) then
    SetString(result,P,L) else
    result := fAnsiConvert.AnsiBufferToRawUTF8(P,L);
end;

UTF-8(unicode 3.0) was introduced before Oracle 9i. AL32UTF8(unicode 5.0) is introduced by 9i.
UTFE(unicode 5.0) is used only on EBCDIC platforms.

Last edited by richard6688 (2013-12-10 01:14:41)

ab · 2013-12-10 09:31:49

You are right.
I was not aware of OCI_AL32UTF8, nor that you may use UTF-8 in CHAR/VARCHAR2 instead of NCHAR/NVARCHAR2...

I've tried to enhance Oracle DB CharSet process.
See http://synopse.info/fossil/info/406380e3a2

richard6688 · 2013-12-10 10:23:40

Now, it's OK.

I have tested with my DB 11g with charset AL32UTF8 of CHAR/VARCHAR2 and national charset AL16UTF16 of NCHAR/NVARCHAR2.

Thanks, Good worK!

richard6688 · 2013-12-18 03:58:49

hi，
About error message, as I mentioned on #1, is not correct displayed for Chinese. though that isn't real problem, but for debug, is not handy.
after I check your code

procedure TSQLDBOracleLib.HandleError(Status: Integer; ErrorHandle: POCIError;
  InfoRaiseException: Boolean; LogLevelNoRaise: TSynLogInfo);
begin
......
 raise ESQLDBOracle.Create(String(msg));
end;

I don't know how to cast it use your ansiconvert, but can use delphi Tencodeing to convert it correctly.
code as:

raise ESQLDBOracle.Create(TEncoding.UTF8.GetString(Tbytes(msg)));

Now, I don't know its real means. you may know that to integrate the code.

ab · 2013-12-18 10:36:32

I have forced correct character conversion when processing Oracle error messages.
See http://synopse.info/fossil/info/2ab02f4ac0

mORMot Open Source

#1 2013-12-09 02:34:13

Oracle Charset and codepage

#2 2013-12-09 09:42:19

Re: Oracle Charset and codepage

#3 2013-12-09 11:57:01

Re: Oracle Charset and codepage

#4 2013-12-09 15:34:56

Re: Oracle Charset and codepage

#5 2013-12-10 01:10:38

Re: Oracle Charset and codepage

#6 2013-12-10 09:31:49

Re: Oracle Charset and codepage

#7 2013-12-10 10:23:40

Re: Oracle Charset and codepage

#8 2013-12-18 03:58:49

Re: Oracle Charset and codepage

#9 2013-12-18 10:36:32

Re: Oracle Charset and codepage

Board footer