#1 2015-10-05 08:39:08

ComingNine
Member
Registered: 2010-07-29
Posts: 294

Possible TSynAnsiConvert.UTF8ToAnsi bug when there is BOM present ?

The code page involved is 1252, and the character involved is the Copyright sign.

The file Tiny.pas is ANSI-encoded and contains a single Copyright sign, i.e., its content is the single byte 0xA9.
The file TinyUTF8WithoutBOM.pas contains the UTF8-encoded Copyright sign without the UTF8 BOM, i.e., the two bytes 0xC2, 0xA9.
The file TinyUTF8WithBOM.pas contains the UTF8-encoded Copyright sign preceded by the UTF8 BOM, i.e., the five bytes 0xEF, 0xBB, 0xBF, 0xC2, 0xA9.

Calling TSynAnsiConvert.Engine(CODEPAGE_US).AnsiToUTF8 converts the single byte 0xA9 to the two bytes 0xC2, 0xA9.
More importantly, calling TSynAnsiConvert.Engine(CODEPAGE_US).UTF8ToAnsi converts the two bytes 0xC2, 0xA9 back to the original single byte 0xA9. So far, everything works as expected.

However, calling TSynAnsiConvert.Engine(CODEPAGE_US).UTF8ToAnsi on the five bytes 0xEF, 0xBB, 0xBF, 0xC2, 0xA9 yields the two bytes 0xC2, 0xA9 instead of the original single byte 0xA9. Could you comment on whether this behavior is a bug?

program Project1; 
{$APPTYPE CONSOLE} 
uses FastMM4, SynCommons, mORMot, SysUtils;
begin
  SynCommons.FileFromString(
    TSynAnsiConvert.Engine(CODEPAGE_US).AnsiToUTF8(SynCommons.StringFromFile('Tiny.pas')), 
    'TinyUTF8WithoutBOM.pas');
  SynCommons.FileFromString(
    TSynAnsiConvert.Engine(CODEPAGE_US).UTF8ToAnsi(SynCommons.StringFromFile('TinyUTF8WithoutBOM.pas')), 
    'TinyConvertedBackFromUTF8WithoutBOM.pas');
  SynCommons.FileFromString(
    TSynAnsiConvert.Engine(CODEPAGE_US).UTF8ToAnsi(SynCommons.StringFromFile('TinyUTF8WithBOM.pas')), 
    'TinyConvertedBackFromUTF8WithBOM.pas');
end.

Last edited by ComingNine (2015-10-05 08:46:04)

Offline

#2 2015-10-05 08:48:39

ComingNine
Member
Registered: 2010-07-29
Posts: 294

Re: Possible TSynAnsiConvert.UTF8ToAnsi bug when there is BOM present ?

I have edited the post to make things clearer. Thank you for your efforts!

Offline

#3 2015-10-05 09:32:12

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,540
Website

Re: Possible TSynAnsiConvert.UTF8ToAnsi bug when there is BOM present ?

Please look at the AnyTextFileTo* functions (AnyTextFileToRawUTF8, for example).

Offline

#4 2015-10-05 10:13:26

ComingNine
Member
Registered: 2010-07-29
Posts: 294

Re: Possible TSynAnsiConvert.UTF8ToAnsi bug when there is BOM present ?

Dear mpv, thank you for your comment! I have checked, but I do not think AnyTextFile* is related here.

My question is essentially why TSynAnsiConvert.Engine(CODEPAGE_US).UTF8ToAnsi converts the five UTF8 bytes 0xEF, 0xBB, 0xBF, 0xC2, 0xA9 to the two UTF8 bytes 0xC2, 0xA9 instead of the single ANSI byte 0xA9. Could you comment on this?

Offline

#5 2015-10-05 10:25:56

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,225
Website

Re: Possible TSynAnsiConvert.UTF8ToAnsi bug when there is BOM present ?

Use AnyTextFileToRawUTF8() instead of StringFromFile().
It will recognize any BOM, and let the conversion take place as expected.
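
For illustration, here is a minimal sketch of that change applied to the third step of the program in post #1. It assumes the same file names as the original post. Passing True for the second parameter (AssumeUTF8IfNoBOM) is an assumption about what is wanted here: as far as I understand the SynCommons source, a file without a BOM is otherwise interpreted using the current ANSI code page, so please check this against your SynCommons version.

```pascal
program Project1;
{$APPTYPE CONSOLE}
uses
  SynCommons;
var
  Utf8: RawUTF8;
begin
  // AnyTextFileToRawUTF8 reads the file and handles a leading BOM itself,
  // so the BOM bytes are not passed on to UTF8ToAnsi.
  // Second parameter (AssumeUTF8IfNoBOM) = True: assumption that files
  // without a BOM should be treated as UTF-8.
  Utf8 := AnyTextFileToRawUTF8('TinyUTF8WithBOM.pas', True);
  // Convert the BOM-free UTF-8 content to code page 1252 and save it.
  FileFromString(
    TSynAnsiConvert.Engine(CODEPAGE_US).UTF8ToAnsi(Utf8),
    'TinyConvertedBackFromUTF8WithBOM.pas');
end.
```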

Offline

#6 2015-10-05 10:36:18

ComingNine
Member
Registered: 2010-07-29
Posts: 294

Re: Possible TSynAnsiConvert.UTF8ToAnsi bug when there is BOM present ?

Dear ab and mpv, thank you very much for your kind help!

Dear mpv, sorry, I did not realize that I should not feed a BOM into TSynAnsiConvert.UTF8ToAnsi!
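
For anyone who already has the raw file content in memory and cannot use AnyTextFileToRawUTF8, a rough sketch of stripping the BOM by hand before the conversion is shown below. StripUtf8Bom is a hypothetical helper written for this post, not a SynCommons function, and the program only depends on SysUtils.

```pascal
program StripBomSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper, not part of SynCommons: returns Content without a
// leading UTF-8 BOM (bytes EF BB BF); Content is returned unchanged otherwise.
function StripUtf8Bom(const Content: RawByteString): RawByteString;
begin
  if (Length(Content) >= 3) and (Ord(Content[1]) = $EF) and
     (Ord(Content[2]) = $BB) and (Ord(Content[3]) = $BF) then
    Result := Copy(Content, 4, Length(Content) - 3)
  else
    Result := Content;
end;

var
  WithBom, Stripped: RawByteString;
  i: Integer;
begin
  // The five bytes of TinyUTF8WithBOM.pas: the BOM followed by UTF-8 U+00A9.
  SetLength(WithBom, 5);
  WithBom[1] := AnsiChar($EF);
  WithBom[2] := AnsiChar($BB);
  WithBom[3] := AnsiChar($BF);
  WithBom[4] := AnsiChar($C2);
  WithBom[5] := AnsiChar($A9);

  Stripped := StripUtf8Bom(WithBom);

  // Print the remaining bytes in hexadecimal.
  for i := 1 to Length(Stripped) do
    Write(IntToHex(Ord(Stripped[i]), 2), ' ');
  WriteLn;
end.
```

The stripped content can then be passed to TSynAnsiConvert.Engine(CODEPAGE_US).UTF8ToAnsi in the same way as in the program from post #1.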

Offline
