
Purpose: Framework Core Low-Level Unicode UTF-8 UTF-16 Ansi Conversion
- this unit is a part of the Open Source Synopse mORMot framework 2, licensed under an MPL/GPL/LGPL tri-license - see LICENSE.md
| Unit Name | Description | |
|---|---|---|
| mormot.core.base | Framework Core Shared Types and RTL-like Functions | |
| mormot.core.os | Framework Core Low-Level Wrappers to the Operating-System API | |
| Objects | Description | |
|---|---|---|
| ESynUnicode | Exception raised by this unit in case of fatal conversion issue | |
| TSynAnsiConvert | An abstract class to handle Ansi to/from Unicode translation | |
| TSynAnsiFixedWidth | A class to handle Ansi to/from Unicode translation of fixed width encoding (i.e. non MBCS) | |
| TSynAnsiUtf16 | A class to handle UTF-16 to/from Unicode translation | |
| TSynAnsiUtf8 | A class to handle UTF-8 to/from Unicode translation | |
| TUtf8Table | | |
TUtf8Table = object(TObject)
function GetHighUtf8Ucs4(var U: PUtf8Char): Ucs4CodePoint;
Retrieve a >127 UCS4 CodePoint from UTF-8
ESynUnicode = class(ExceptionWithProps)
Exception raised by this unit in case of fatal conversion issue
TSynAnsiConvert = class(TObject)
An abstract class to handle Ansi to/from Unicode translation
- implementations of this class will handle efficiently all Code Pages
- this default implementation will use the Operating System APIs
- do not create your own instance of this class: retrieve one with TSynAnsiConvert.Engine() instead, which will initialize either a TSynAnsiFixedWidth or a TSynAnsiConvert instance as needed
constructor Create(aCodePage: cardinal); reintroduce; virtual;
Initialize the internal conversion engine
function AnsiBufferToUnicode(Dest: PWideChar; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PWideChar; overload; virtual;
Direct conversion of a PAnsiChar buffer into a Unicode buffer
- Dest^ buffer must be reserved with at least SourceChars*2 bytes
- this default implementation will use the Operating System APIs
- will append a trailing #0 to the returned PWideChar, unless NoTrailingZero is set
function AnsiBufferToUtf8(Dest: PUtf8Char; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PUtf8Char; overload; virtual;
Direct conversion of a PAnsiChar buffer into a UTF-8 encoded buffer
- Dest^ buffer must be reserved with at least SourceChars*3 bytes
- will append a trailing #0 to the returned PUtf8Char, unless NoTrailingZero is set
- this default implementation will use the Operating System APIs
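The SourceChars*3 sizing rule can be sketched as follows. This is a minimal example, assuming mORMot 2 is on the unit search path and that the returned pointer marks the end of the written UTF-8 data (which is what the "trailing #0 to the returned PUtf8Char" wording suggests). `AnsiBufferToUtf8Sketch` is a hypothetical helper name, not part of the unit.

```pascal
program AnsiBufferSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

uses
  mormot.core.base,
  mormot.core.unicode;

// Hypothetical helper: converts len Ansi bytes into a new RawUtf8.
function AnsiBufferToUtf8Sketch(engine: TSynAnsiConvert;
  src: PAnsiChar; len: cardinal): RawUtf8;
var
  tmp: array of AnsiChar;
  stop: PUtf8Char;
begin
  result := '';
  if (engine = nil) or (src = nil) or (len = 0) then
    exit;
  // worst case is SourceChars*3 bytes, plus one byte for the trailing #0
  SetLength(tmp, len * 3 + 1);
  stop := engine.AnsiBufferToUtf8(PUtf8Char(@tmp[0]), src, len);
  // assumption: stop points just after the last converted byte
  SetString(result, PAnsiChar(@tmp[0]),
    PAnsiChar(stop) - PAnsiChar(@tmp[0]));
end;

const
  // "cafe" with an e-acute, stored as code page 1252 bytes
  CAFE: array[0..3] of AnsiChar = ('c', 'a', 'f', AnsiChar($E9));
begin
  writeln(length(AnsiBufferToUtf8Sketch(
    TSynAnsiConvert.Engine(1252), @CAFE[0], 4)));
end.
```

Over-allocating by the documented worst case avoids a second pass to measure the output; the string is then trimmed to the bytes actually written.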
function AnsiToAnsi(From: TSynAnsiConvert; Source: PAnsiChar; SourceChars: cardinal): RawByteString; overload;
Convert any Ansi buffer (providing a From converter) into Ansi Text
function AnsiToAnsi(From: TSynAnsiConvert; const Source: RawByteString): RawByteString; overload;
Convert any Ansi Text (providing a From converter) into Ansi Text
function AnsiToRawUnicode( Source: PAnsiChar; SourceChars: cardinal): RawUnicode; overload; virtual;
Convert any Ansi buffer into a Unicode String
- returns a value using our RawUnicode kind of string
function AnsiToRawUnicode(const AnsiText: RawByteString): RawUnicode; overload;
Convert any Ansi Text into a UTF-16 Unicode String
- returns a value using our RawUnicode kind of string
function AnsiToUnicodeString(const Source: RawByteString): SynUnicode; overload;
Convert any Ansi Text into a Unicode String
- returns a SynUnicode, i.e. Delphi 2009+ UnicodeString or a WideString
function AnsiToUnicodeString( Source: PAnsiChar; SourceChars: cardinal): SynUnicode; overload;
Convert any Ansi buffer into a Unicode String
- returns a SynUnicode, i.e. Delphi 2009+ UnicodeString or a WideString
function AnsiToUtf8(const AnsiText: RawByteString): RawUtf8; virtual;
Convert any Ansi Text into a UTF-8 encoded String
- internally calls AnsiBufferToUtf8 virtual method
class function Engine(aCodePage: cardinal): TSynAnsiConvert;
Returns the engine corresponding to a given code page
- a global list of TSynAnsiConvert instances is handled by the unit - therefore, caller should not release the returned instance
- will return nil in case of unhandled code page
- if aCodePage is 0, will return CurrentAnsiConvert value
function RawUnicodeToAnsi(const Source: RawUnicode): RawByteString;
Convert any Unicode-encoded String into Ansi Text
- internally calls UnicodeBufferToAnsi virtual method
function UnicodeBufferToAnsi(Dest: PAnsiChar; Source: PWideChar; SourceChars: cardinal): PAnsiChar; overload; virtual;
Direct conversion of a Unicode buffer into a PAnsiChar buffer
- Dest^ buffer must be reserved with at least SourceChars * 3 bytes
- will detect and ignore any trailing UTF-16LE BOM marker
- this default implementation will rely on the Operating System for all non ASCII-7 chars
function UnicodeBufferToAnsi(Source: PWideChar; SourceChars: cardinal): RawByteString; overload; virtual;
Direct conversion of a Unicode buffer into an Ansi Text
function UnicodeStringToAnsi(const Source: SynUnicode): RawByteString;
Convert any Unicode-encoded String into Ansi Text
- internally calls UnicodeBufferToAnsi virtual method
function Utf8BufferToAnsi(Source: PUtf8Char; SourceChars: cardinal): RawByteString; overload;
Convert any UTF-8 encoded buffer into Ansi Text
- internally calls Utf8BufferToAnsi virtual method
function Utf8BufferToAnsi(Dest: PAnsiChar; Source: PUtf8Char; SourceChars: cardinal): PAnsiChar; overload; virtual;
Direct conversion of a UTF-8 encoded buffer into a PAnsiChar buffer
- Dest^ buffer must be reserved with at least SourceChars bytes
- no trailing #0 is appended to the buffer
function Utf8ToAnsi(const u: RawUtf8): RawByteString; virtual;
Convert any UTF-8 encoded String into Ansi Text
- internally calls Utf8BufferToAnsi virtual method
function Utf8ToAnsiBuffer2K(const S: RawUtf8; Dest: PAnsiChar; DestSize: integer): integer;
Direct conversion of a UTF-8 encoded string into a WinAnsi <2KB buffer
- will truncate the destination string to DestSize bytes (including the trailing #0), with a maximum handled size of 2048 bytes
- returns the number of bytes stored in Dest^ (i.e. the position of #0)
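A stack buffer is the natural destination for this method. The following is a minimal sketch, assuming mORMot 2 is on the unit search path; the 64-byte buffer size and the sample text are arbitrary choices for illustration.

```pascal
program Buffer2KSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  buf: array[0..63] of AnsiChar; // no heap allocation needed
  engine: TSynAnsiConvert;
  n: integer;
begin
  engine := TSynAnsiConvert.Engine(1252); // shared instance: do not free
  if engine = nil then
    exit;
  // DestSize includes the trailing #0, so longer input would be truncated
  n := engine.Utf8ToAnsiBuffer2K('plain ASCII text', @buf[0], SizeOf(buf));
  // n is the position of the #0, i.e. the number of bytes stored
  writeln(n, ' byte(s): ', PAnsiChar(@buf[0]));
end.
```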
procedure AnsiBufferToRawUtf8(Source: PAnsiChar; SourceChars: cardinal; out Value: RawUtf8); overload; virtual;
Direct conversion of a PAnsiChar buffer into a UTF-8 encoded string
- will call AnsiBufferToUnicode() overloaded virtual method
procedure Utf8BufferToAnsi(Source: PUtf8Char; SourceChars: cardinal; var result: RawByteString); overload; virtual;
Convert any UTF-8 encoded buffer into Ansi Text
- internally calls Utf8BufferToAnsi virtual method
property AnsiCharShift: byte read fAnsiCharShift;
Corresponding length binary shift used for worst conversion case
property CodePage: cardinal read fCodePage;
Corresponding code page
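Typical use of this class goes through `Engine()` rather than a constructor. The sketch below assumes mORMot 2 is on the unit search path; `RoundTrip1252` is a hypothetical helper, and the input bytes are built explicitly so that the example does not depend on the source file encoding.

```pascal
program EngineSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

uses
  mormot.core.base,
  mormot.core.unicode;

// Hypothetical helper: CP 1252 -> UTF-8 -> CP 1252 via the shared engine.
function RoundTrip1252(const Ansi: RawByteString;
  out Utf8Len: integer): RawByteString;
var
  engine: TSynAnsiConvert;
  utf8: RawUtf8;
begin
  result := '';
  Utf8Len := 0;
  engine := TSynAnsiConvert.Engine(1252); // owned by the unit: never Free it
  if engine = nil then
    exit; // unhandled code page
  utf8 := engine.AnsiToUtf8(Ansi);
  Utf8Len := length(utf8);
  result := engine.Utf8ToAnsi(utf8);
end;

var
  src, back: RawByteString;
  n: integer;
begin
  SetLength(src, 4);
  src[1] := 'c';
  src[2] := 'a';
  src[3] := 'f';
  src[4] := AnsiChar($E9); // e-acute in code page 1252
  back := RoundTrip1252(src, n);
  writeln('UTF-8 bytes: ', n, ', Ansi bytes back: ', length(back));
end.
```

The byte lengths are printed instead of comparing the strings directly, because comparing AnsiString values that carry different code pages may trigger an implicit conversion in some compilers.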
TSynAnsiFixedWidth = class(TSynAnsiConvert)
A class to handle Ansi to/from Unicode translation of fixed width encoding (i.e. non MBCS)
- this class will efficiently handle all available code pages without MBCS encoding - like WinAnsi (1252) or Russian (1251)
- it will use internal fast look-up tables for such encodings
- this class could take some time to generate, and will consume more than 64 KB of memory: do not create your own instance, but retrieve one with TSynAnsiConvert.Engine() instead, which will initialize either a TSynAnsiFixedWidth or a TSynAnsiConvert instance as needed
- this class has some additional methods (e.g. IsValid*) which take advantage of the internal lookup tables for fast processing
constructor Create(aCodePage: cardinal); override;
Initialize the internal conversion engine
function AnsiBufferToUnicode(Dest: PWideChar; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PWideChar; override;
Direct conversion of a PAnsiChar buffer into a Unicode buffer
- Dest^ buffer must be reserved with at least SourceChars*2 bytes
- will append a trailing #0 to the returned PWideChar, unless NoTrailingZero is set
function AnsiBufferToUtf8(Dest: PUtf8Char; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PUtf8Char; override;
Direct conversion of a PAnsiChar buffer into a UTF-8 encoded buffer
- Dest^ buffer must be reserved with at least SourceChars*3 bytes
- will append a trailing #0 to the returned PUtf8Char, unless NoTrailingZero is set
function AnsiToRawUnicode(Source: PAnsiChar; SourceChars: cardinal): RawUnicode; override;
Convert any Ansi buffer into a Unicode String
- returns a value using our RawUnicode kind of string
function IsValidAnsi(WideText: PWideChar; Length: PtrInt): boolean; overload;
Return TRUE if the supplied unicode buffer only contains characters of the corresponding Ansi code page
- i.e. if the text can be displayed using this code page
function IsValidAnsi(WideText: PWideChar): boolean; overload;
Return TRUE if the supplied unicode buffer only contains characters of the corresponding Ansi code page
- i.e. if the text can be displayed using this code page
function IsValidAnsiU(Utf8Text: PUtf8Char): boolean;
Return TRUE if the supplied UTF-8 buffer only contains characters of the corresponding Ansi code page
- i.e. if the text can be displayed using this code page
function IsValidAnsiU8Bit(Utf8Text: PUtf8Char): boolean;
Return TRUE if the supplied UTF-8 buffer only contains 8-bit characters of the corresponding Ansi code page
- i.e. if the text can be displayed with only 8-bit unicode characters (e.g. no "tm" or such) within this code page
function UnicodeBufferToAnsi(Dest: PAnsiChar; Source: PWideChar; SourceChars: cardinal): PAnsiChar; override;
Direct conversion of a Unicode buffer into a PAnsiChar buffer
- Dest^ buffer must be reserved with at least SourceChars * 3 bytes
- will detect and ignore any trailing UTF-16LE BOM marker
- this overridden version will use internal lookup tables for fast processing
function Utf8BufferToAnsi(Dest: PAnsiChar; Source: PUtf8Char; SourceChars: cardinal): PAnsiChar; override;
Direct conversion of a UTF-8 encoded buffer into a PAnsiChar buffer
- Dest^ buffer must be reserved with at least SourceChars bytes
- no trailing #0 is appended to the buffer
- non Ansi compatible characters are replaced by '?'
function WideCharToAnsiChar(wc: cardinal): integer;
Conversion of a wide char into the corresponding Ansi character
- return -1 for an unknown WideChar in the current code page
property AnsiToWide: TWordDynArray read fAnsiToWide;
Direct access to the Ansi-To-Unicode lookup table
- use this array like AnsiToWide: array[byte] of word
property WideToAnsi: TByteDynArray read fWideToAnsi;
Direct access to the Unicode-To-Ansi lookup table
- use this array like WideToAnsi: array[word] of byte
- any unhandled WideChar will return ord('?')
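The validation methods and lookup tables of this class could be exercised as below. This is a sketch only, assuming mORMot 2 is on the unit search path and that `Engine(1252)` hands back a TSynAnsiFixedWidth instance, as the class description suggests for WinAnsi; the code checks this with `is` before casting.

```pascal
program FixedWidthSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  engine: TSynAnsiConvert;
  fixed: TSynAnsiFixedWidth;
  euro: array[0..1] of WideChar;
begin
  engine := TSynAnsiConvert.Engine(1252);
  if not (engine is TSynAnsiFixedWidth) then
    exit; // nil, or not a fixed-width engine
  fixed := TSynAnsiFixedWidth(engine);
  euro[0] := WideChar($20AC); // EURO SIGN
  euro[1] := #0;              // IsValidAnsi(PWideChar) expects a #0 terminator
  writeln('euro fits in 1252 : ', fixed.IsValidAnsi(PWideChar(@euro[0])));
  // WideCharToAnsiChar returns -1 when the code page has no such character
  writeln('euro as Ansi byte : ', fixed.WideCharToAnsiChar($20AC));
  writeln('U+4E2D as Ansi    : ', fixed.WideCharToAnsiChar($4E2D));
  // AnsiToWide is indexed like array[byte] of word
  writeln('byte $E9 as UTF-16: ', fixed.AnsiToWide[$E9]);
end.
```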
TSynAnsiUtf8 = class(TSynAnsiConvert)
A class to handle UTF-8 to/from Unicode translation
- match the TSynAnsiConvert signature, for code page CP_UTF8
- this class is mostly a non-operation for conversion to/from UTF-8
constructor Create(aCodePage: cardinal); override;
Initialize the internal conversion engine
function AnsiBufferToUnicode(Dest: PWideChar; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PWideChar; override;
Direct conversion of a PAnsiChar UTF-8 buffer into a Unicode buffer
- Dest^ buffer must be reserved with at least SourceChars*2 bytes
- will append a trailing #0 to the returned PWideChar, unless NoTrailingZero is set
function AnsiBufferToUtf8(Dest: PUtf8Char; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PUtf8Char; override;
Direct conversion of a PAnsiChar UTF-8 buffer into a UTF-8 encoded buffer
- Dest^ buffer must be reserved with at least SourceChars*3 bytes
- will append a trailing #0 to the returned PUtf8Char, unless NoTrailingZero is set
function AnsiToRawUnicode(Source: PAnsiChar; SourceChars: cardinal): RawUnicode; override;
Convert any UTF-8 Ansi buffer into a Unicode String
- returns a value using our RawUnicode kind of string
function AnsiToUtf8(const AnsiText: RawByteString): RawUtf8; override;
Convert any Ansi Text into a UTF-8 encoded String
- directly assign the input as result, since no conversion is needed
function UnicodeBufferToAnsi(Source: PWideChar; SourceChars: cardinal): RawByteString; override;
Direct conversion of a Unicode buffer into an Ansi Text
function UnicodeBufferToAnsi(Dest: PAnsiChar; Source: PWideChar; SourceChars: cardinal): PAnsiChar; override;
Direct conversion of a Unicode buffer into a PAnsiChar UTF-8 buffer
- will detect and ignore any trailing UTF-16LE BOM marker
- Dest^ buffer must be reserved with at least SourceChars * 3 bytes
function Utf8BufferToAnsi(Dest: PAnsiChar; Source: PUtf8Char; SourceChars: cardinal): PAnsiChar; override;
Direct conversion of a UTF-8 encoded buffer into a PAnsiChar UTF-8 buffer
- Dest^ buffer must be reserved with at least SourceChars bytes
- no trailing #0 is appended to the buffer
function Utf8ToAnsi(const u: RawUtf8): RawByteString; override;
Convert any UTF-8 encoded String into Ansi Text
- directly assign the input as result, since no conversion is needed
procedure AnsiBufferToRawUtf8(Source: PAnsiChar; SourceChars: cardinal; out Value: RawUtf8); override;
Direct conversion of a PAnsiChar buffer into a UTF-8 encoded string
procedure Utf8BufferToAnsi(Source: PUtf8Char; SourceChars: cardinal; var result: RawByteString); override;
Convert any UTF-8 encoded buffer into Ansi Text
TSynAnsiUtf16 = class(TSynAnsiConvert)
A class to handle UTF-16 to/from Unicode translation
- match the TSynAnsiConvert signature, for code page CP_UTF16
- even if UTF-16 is not an Ansi format, code page CP_UTF16 may have been used to store UTF-16 encoded binary content
- this class is mostly a non-operation for conversion to/from Unicode
constructor Create(aCodePage: cardinal); override;
Initialize the internal conversion engine
function AnsiBufferToUnicode(Dest: PWideChar; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PWideChar; override;
Direct conversion of a PAnsiChar UTF-16 buffer into a Unicode buffer
- Dest^ buffer must be reserved with at least SourceChars*2 bytes
- will append a trailing #0 to the returned PWideChar, unless NoTrailingZero is set
function AnsiBufferToUtf8(Dest: PUtf8Char; Source: PAnsiChar; SourceChars: cardinal; NoTrailingZero: boolean = false): PUtf8Char; override;
Direct conversion of a PAnsiChar UTF-16 buffer into a UTF-8 encoded buffer
- Dest^ buffer must be reserved with at least SourceChars*3 bytes
- will append a trailing #0 to the returned PUtf8Char, unless NoTrailingZero is set
function AnsiToRawUnicode(Source: PAnsiChar; SourceChars: cardinal): RawUnicode; override;
Convert any UTF-16 Ansi buffer into a Unicode String
- returns a value using our RawUnicode kind of string
function UnicodeBufferToAnsi(Dest: PAnsiChar; Source: PWideChar; SourceChars: cardinal): PAnsiChar; override;
Direct conversion of a Unicode buffer into a PAnsiChar UTF-16 buffer
- Dest^ buffer must be reserved with at least SourceChars * 3 bytes
function Utf8BufferToAnsi(Dest: PAnsiChar; Source: PUtf8Char; SourceChars: cardinal): PAnsiChar; override;
Direct conversion of a UTF-8 encoded buffer into a PAnsiChar UTF-16 buffer
- Dest^ buffer must be reserved with at least SourceChars bytes
- no trailing #0 is appended to the buffer
PNormTable = ^TNormTable;
Pointer to a lookup table used for fast case conversion
PNormTableByte = ^TNormTableByte;
Pointer to a lookup table used for fast case conversion
PTextByteSet = ^TTextByteSet;
Points to an Ordinal lookup table used for branch-less text parsing
PTextCharSet = ^TTextCharSet;
Points to an AnsiChar lookup table used for branch-less text parsing
TBomFile = ( bomNone, bomUnicode, bomUtf8 );
Text file layout, as returned by BomFile() and StringFromBomFile()
- bomNone means there was no BOM recognized
- bomUnicode stands for UTF-16 LE encoding (as on Windows products)
- bomUtf8 stands for a UTF-8 BOM (as on Windows products)
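The three layouts map onto well-known byte signatures: FF FE for UTF-16 LE and EF BB BF for UTF-8. The sketch below only illustrates that mapping; `DetectBomSketch` is a hypothetical helper, not the unit's BomFile() function, and it does not reproduce any buffer adjustment the real functions may perform. It assumes mORMot 2 is on the unit search path, solely for the TBomFile enumeration.

```pascal
program BomSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

uses
  mormot.core.unicode; // only for the TBomFile enumeration

// Hypothetical helper: classify the first bytes of a file buffer.
function DetectBomSketch(p: PAnsiChar; len: integer): TBomFile;
begin
  result := bomNone;
  if p = nil then
    exit;
  if (len >= 2) and (ord(p[0]) = $FF) and (ord(p[1]) = $FE) then
    result := bomUnicode // U+FEFF stored in little-endian order
  else if (len >= 3) and (ord(p[0]) = $EF) and (ord(p[1]) = $BB) and
          (ord(p[2]) = $BF) then
    result := bomUtf8;   // U+FEFF encoded as UTF-8
end;

const
  UTF8_FILE: array[0..3] of byte = ($EF, $BB, $BF, $41);
  UTF16_FILE: array[0..3] of byte = ($FF, $FE, $41, $00);
  PLAIN_FILE: array[0..1] of byte = ($41, $42);
begin
  writeln(DetectBomSketch(PAnsiChar(@UTF8_FILE[0]), 4) = bomUtf8);     // TRUE
  writeln(DetectBomSketch(PAnsiChar(@UTF16_FILE[0]), 4) = bomUnicode); // TRUE
  writeln(DetectBomSketch(PAnsiChar(@PLAIN_FILE[0]), 2) = bomNone);    // TRUE
end.
```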
TCharConversionFlags = set of ( ccfNoTrailingZero, ccfReplacementCharacterForUnmatchedSurrogate);
Option set for RawUnicodeToUtf8() conversion
TIdemPropNameUSameLen = function(P1, P2: pointer; P1P2Len: PtrInt): boolean;
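Function prototype used to compare two identifier buffers of the same length (P1P2Len)
- implementation note from the source: Delphi does not like to inline goto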
TNormTable = packed array[AnsiChar] of AnsiChar;
Lookup table used for fast case conversion
TNormTableByte = packed array[byte] of byte;
Lookup table used for fast case conversion
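The idea behind such a table is that case conversion becomes a single array read per character. The following RTL-only sketch uses a hypothetical `TUpperSketch` type with the same shape as TNormTable and handles ASCII a..z only; the tables actually shipped with the unit are not shown here.

```pascal
program NormTableSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

type
  // same shape as TNormTable: one output char per input char
  TUpperSketch = packed array[AnsiChar] of AnsiChar;

var
  UpperTable: TUpperSketch;

procedure InitUpperTable;
var
  c: AnsiChar;
begin
  for c := low(AnsiChar) to high(AnsiChar) do
    UpperTable[c] := c; // identity by default
  for c := 'a' to 'z' do
    UpperTable[c] := AnsiChar(ord(c) - 32); // ASCII a..z -> A..Z
end;

procedure UpperInPlace(p: PAnsiChar);
begin
  if p = nil then
    exit;
  while p^ <> #0 do
  begin
    p^ := UpperTable[p^]; // one table read per char, no range comparison
    inc(p);
  end;
end;

var
  s: AnsiString;
begin
  InitUpperTable;
  s := 'MixEd 42';
  UniqueString(s); // make the constant writable before in-place changes
  UpperInPlace(PAnsiChar(s));
  writeln(s); // prints MIXED 42
end.
```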
TOnStringTranslate = procedure(var English: string) of object;
A generic callback, which can be used to translate some text on the fly
- maps procedure TLanguageFile.Translate(var English: string) signature as defined in mORMoti18n.pas
- can be used e.g. for TSynMustache's {{"English text}} callback
TSynAnsicharSet = set of AnsiChar;
Used to store a set of 8-bit encoded characters
TSynByteSet = set of byte;
Used to store a set of 8-bit unsigned integers
TTextByteSet = array[byte] of TTextChar;
Defines an Ordinal lookup table used for branch-less text parsing
TTextChar = set of ( tcNot01013, tc1013, tcCtrlNotLF, tcCtrlNot0Comma, tcWord, tcIdentifierFirstChar, tcIdentifier, tcUriUnreserved);
Character categories for text linefeed/word/identifier/uri parsing
- using such a set compiles into TEST [MEM], IMM so is more efficient than a regular set of AnsiChar which generates much slower BT [MEM], IMM
- the same 256-byte memory will also be reused from L1 CPU cache during the parsing of complex input
TTextCharSet = array[AnsiChar] of TTextChar;
Defines an AnsiChar lookup table used for branch-less text parsing
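A table of small sets indexed by character lets a parser test a category with one memory read. The RTL-only sketch below uses hypothetical `TSketchChar` categories (far fewer than the real TTextChar) to count words in a #0-terminated buffer; it illustrates the technique, not the unit's own tables.

```pascal
program TextCharSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

type
  // hypothetical categories for illustration only
  TSketchChar = set of (scWord, scLineFeed);
  TSketchCharSet = array[AnsiChar] of TSketchChar;

var
  Sketch: TSketchCharSet;

procedure InitSketch;
var
  c: AnsiChar;
begin
  FillChar(Sketch, SizeOf(Sketch), 0); // every entry starts as []
  for c := '0' to '9' do
    Include(Sketch[c], scWord);
  for c := 'a' to 'z' do
    Include(Sketch[c], scWord);
  for c := 'A' to 'Z' do
    Include(Sketch[c], scWord);
  Include(Sketch[#10], scLineFeed);
  Include(Sketch[#13], scLineFeed);
end;

function CountWords(p: PAnsiChar): integer;
begin
  result := 0;
  if p = nil then
    exit;
  while p^ <> #0 do
  begin
    // skip separators: any char whose entry lacks scWord
    while (p^ <> #0) and not (scWord in Sketch[p^]) do
      inc(p);
    if p^ = #0 then
      break;
    inc(result);
    // Sketch[#0] is [], so this loop also stops at the terminator
    while scWord in Sketch[p^] do
      inc(p);
  end;
end;

begin
  InitSketch;
  writeln(CountWords('hello, world 42')); // prints 3
end.
```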
TUtf8Compare = function(P1, P2: PUtf8Char): PtrInt;
Function prototype used internally for UTF-8 buffer comparison
- also used e.g. in mormot.core.variants unit
BOM_UTF16LE = #$FEFF;
UTF-16LE BOM WideChar marker, as existing e.g. in some UTF-16 Windows files
UNICODE_REPLACEMENT_CHARACTER = $fffd;
Replace any incoming character whose value is unrepresentable in Unicode
- set e.g. by GetUtf8WideChar(), Utf8UpperReference() or RawUnicodeToUtf8() when ccfReplacementCharacterForUnmatchedSurrogate is set
- encoded as $ef $bf $bd bytes in UTF-8
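The $ef $bf $bd byte sequence follows directly from the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx applied to $fffd. The RTL-only sketch below works through that arithmetic; `EncodeUtf8ThreeBytes` is a hypothetical helper and only covers code points in the $0800..$FFFF range.

```pascal
program ReplacementCharSketch;

{$ifdef FPC}{$mode delphi}{$endif}
{$apptype console}

uses
  SysUtils;

type
  TUtf8Triple = array[0..2] of byte;

// Hypothetical helper: encode a $0800..$FFFF code point as 3 UTF-8 bytes.
function EncodeUtf8ThreeBytes(cp: cardinal): TUtf8Triple;
begin
  result[0] := $E0 or ((cp shr 12) and $0F); // 1110xxxx: top 4 bits
  result[1] := $80 or ((cp shr 6) and $3F);  // 10xxxxxx: middle 6 bits
  result[2] := $80 or (cp and $3F);          // 10xxxxxx: low 6 bits
end;

var
  b: TUtf8Triple;
begin
  b := EncodeUtf8ThreeBytes($FFFD);
  writeln(IntToHex(b[0], 2), ' ', IntToHex(b[1], 2), ' ',
    IntToHex(b[2], 2)); // prints EF BF BD
end.
```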
| Functions or procedures | Description | |
|---|---|---|
| AddRawUtf8 | Add the Value to Values[], with an external count variable, for performance | |
| AddRawUtf8 | True if Value was added successfully in Values[] | |
| AddRawUtf8 | Add Value[] items to Values[], with an external count variable, for performance | |
| AddRawUtf8 | Add Value[] items to Values[] | |
| AddSortedRawUtf8 | Add a RawUtf8 value in an alphabetically sorted dynamic array of RawUtf8 | |
| AddString | Add the Value to Values[] string array | |
| Ansi7ToString | Convert any Ansi 7-bit encoded String into a RTL string | |
| Ansi7ToString | Convert any Ansi 7-bit encoded String into a RTL string | |
| Ansi7ToString | Convert any Ansi 7-bit encoded String into a RTL string | |
| AnsiBufferToTempUtf8 | Convert any Ansi memory buffer into UTF-8, using a TSynTempBuffer if needed | |
| AnsiCharToUtf8 | Convert an AnsiChar buffer (of a given code page) into a UTF-8 string | |
| AnsiIComp | Fast WinAnsi comparison using the NormToUpper[] array for all 8-bit values | |
| AnsiICompW | Fast case-insensitive Unicode comparison handling ASCII 7-bit chars | |
| AnsiToString | Convert an AnsiString (of a given code page) into a RTL string | |
| AnsiToUtf8 | Convert an AnsiString (of a given code page) into a UTF-8 string | |
| AnyAnsiToUtf8 | Direct conversion of an AnsiString with an unknown code page into a UTF-8 encoded String | |
| AnyAnsiToUtf8 | Direct conversion of an AnsiString with an unknown code page into a UTF-8 encoded String | |
| AnyTextFileToRawUtf8 | Read a File content into a RawUtf8, detecting any leading BOM | |
| AnyTextFileToString | Read a File content into a RTL string, detecting any leading BOM | |
| AnyTextFileToSynUnicode | Read a File content into SynUnicode string, detecting any leading BOM | |
| AppendShortComma | Fast append some UTF-8 text into a ShortString, with an ending ',' | |
| BomFile | Check the file BOM at the beginning of a file buffer | |
| CamelCase | Convert a string into a human-friendly CamelCase identifier | |
| CamelCase | Convert a string into a human-friendly CamelCase identifier | |
| CaseCopy | Low-level function called when inlining UpperCase(Copy) and LowerCase(Copy) | |
| CaseSelf | Low-level function called when inlining UpperCaseSelf and LowerCaseSelf | |
| CodePageToText | Convert a code page number into human-friendly text | |
| ContainsUtf8 | Return true if up^ is contained inside the UTF-8 buffer p^ | |
| ConvertCaseUtf8 | Fast conversion of the supplied text into 8-bit case sensitivity | |
| DeleteRawUtf8 | Delete a RawUtf8 item in a dynamic array of RawUtf8 | |
| DeleteRawUtf8 | Delete a RawUtf8 item in a dynamic array of RawUtf8 | |
| DetectRawUtf8 | Detect UTF-8 content and mark the variable with the CP_UTF8 codepage | |
| EndWith | Check case-insensitive matching ending of text in upTextEnd | |
| EndWithArray | Returns the index of a case-insensitive matching ending of p^ in upArray[] | |
| EndWithExact | Check case-sensitive matching ending of text in ending | |
| FastFindIndexedPUtf8Char | Retrieve the index of a PUtf8Char in a PUtf8Char array via a sorted index | |
| FastFindPUtf8CharSorted | Retrieve the index where a PUtf8Char is located in a sorted PUtf8Char array | |
| FastFindPUtf8CharSorted | Retrieve the index where a PUtf8Char is located in a sorted PUtf8Char array | |
| FastFindUpperPUtf8CharSorted | Retrieve the index where a PUtf8Char is located in a sorted uppercase array | |
| FastLocatePUtf8CharSorted | Retrieve the index where to insert a PUtf8Char in a sorted PUtf8Char array | |
| FastLocatePUtf8CharSorted | Retrieve the index where to insert a PUtf8Char in a sorted PUtf8Char array | |
| FillZero | Fill all bytes of this UTF-16 string with zeros, i.e. 'toto' -> #0#0#0#0 | |
| FillZero | Fill all bytes of this dynamic array of bytes with zeros | |
| FillZero | Fill all bytes of this UTF-8 string with zeros, i.e. 'toto' -> #0#0#0#0 | |
| FillZero | Fill all bytes of this memory buffer with zeros, i.e. 'toto' -> #0#0#0#0 | |
| FillZero | Fill all bytes of this UTF-8 string with zeros, i.e. 'toto' -> #0#0#0#0 | |
| FindAnsi | Return true if UpperValue (Ansi) is contained in A^ (Ansi) | |
| FindNameValue | Search for and return a value from its uppercased named entry | |
| FindNameValue | Search for a value from its uppercased named entry | |
| FindNameValuePointer | Search for and return a PUtf8Char value from its uppercased named entry | |
| FindNextUtf8WordBegin | Points to the beginning of the next word stored in U | |
| FindRawUtf8 | Low-level efficient search of Value in Values[] | |
| FindRawUtf8 | Return the index of Value in Values[], -1 if not found | |
| FindRawUtf8 | Return the index of Value in Values[], -1 if not found | |
| FindShortStringListExact | Fast search of an exact case-insensitive match of a RTTI's PShortString array | |
| FindShortStringListTrimLowerCase | Fast case-insensitive search of a left-trimmed lowercase match of a RTTI's PShortString array | |
| FindShortStringListTrimLowerCaseExact | Fast case-sensitive search of a left-trimmed lowercase match of a RTTI's PShortString array | |
| FindUnicode | Return true if Upper (Unicode encoded) is contained in U^ (UTF-8 encoded) | |
| FindUtf8 | Return true if UpperValue (Ansi) is contained in U^ (UTF-8 encoded) | |
| GetCaptionFromPCharLen | UnCamelCase and translate a char buffer | |
| GetHighUtf8Ucs4 | Internal function, used to retrieve a >127 UCS4 CodePoint from UTF-8 | |
| GetLineContains | Returns TRUE if the supplied uppercased text is contained in the text buffer | |
| GetLineSize | Compute the line length from source array of chars | |
| GetLineSizeSmallerThan | Returns true if the line length from source array of chars is not less than the specified count | |
| GetNextFieldProp | Retrieve the next SQL-like identifier within the UTF-8 buffer | |
| GetNextFieldPropSameLine | Retrieve the next identifier within the UTF-8 buffer on the same line | |
| GetNextLine | Extract a line from source array of chars | |
| GetNextStringLineToRawUnicode | Return next string delimited with #13#10 from P, nil if no more | |
| GetNextUtf8Upper | Retrieve the next UCS4 CodePoint stored in U, then update the U pointer | |
| GetUtf8WideChar | Decode UTF-16 WideChar from UTF-8 input buffer | |
| GotoEndOfQuotedString | Get the next character after a quoted buffer | |
| GotoNextNotSpace | Get the next character not in [#1..' '] | |
| GotoNextNotSpaceSameLine | Get the next character not in [#9,' '] | |
| GotoNextSpace | Get the next character in [#0..' '] | |
| IdemFileExt | Returns true if the file name extension contained in p^ is the same as extup^ | |
| IdemFileExts | Returns matching file name extension index as extup^ | |
| IdemPChar | Returns true if the beginning of p^ is the same as up^ | |
| IdemPChar | Returns true if the beginning of p^ is the same as up^ | |
| IdemPCharAndGetNextLine | Return true if IdemPChar(source,searchUp), and go to the next line of source | |
| IdemPCharArray | Returns the index of a matching beginning of p^ in upArray[] | |
| IdemPCharArrayBy2 | Returns the index of a matching beginning of p^ in upArray, whose items are all two characters long | |
| IdemPCharU | Returns true if the beginning of p^ is the same as up^ | |
| IdemPCharW | Returns true if the beginning of p^ is the same as up^ | |
| IdemPCharWithoutWhiteSpace | Returns true if the beginning of p^ is the same as up^, ignoring white spaces | |
| IdemPPChar | Returns the index of a matching beginning of p^ in nil-terminated up^ array | |
| IdemPropName | Case insensitive comparison of ASCII 7-bit identifiers | |
| IdemPropName | Case insensitive comparison of ASCII 7-bit identifiers | |
| IdemPropName | Case insensitive comparison of ASCII 7-bit identifiers | |
| IdemPropNameU | Case insensitive comparison of ASCII 7-bit identifiers | |
| IdemPropNameU | Case insensitive comparison of ASCII 7-bit identifiers | |
| IdemPropNameUSameLenNotNull | Case insensitive comparison of ASCII 7-bit identifiers of same length | |
| IsCaseSensitive | Check if the supplied text has some case-insensitive 'a'..'z','A'..'Z' chars | |
| IsCaseSensitive | Check if the supplied text has some case-insensitive 'a'..'z','A'..'Z' chars | |
| IsFixedWidthCodePage | Check if a code page is known to be of fixed width, i.e. not MBCS | |
| IsValidUtf8 | Returns TRUE if the supplied buffer has valid UTF-8 encoding | |
| IsValidUtf8 | Returns TRUE if the supplied buffer has valid UTF-8 encoding | |
| IsValidUtf8WithoutControlChars | Returns TRUE if the supplied buffer has valid UTF-8 encoding with no #0..#31 control characters | |
| IsValidUtf8WithoutControlChars | Returns TRUE if the supplied buffer has valid UTF-8 encoding with no #1..#31 control characters | |
| IsVoid | Check that all characters within text are spaces or control chars | |
| IsWinAnsi | Return TRUE if the supplied unicode buffer only contains WinAnsi characters | |
| IsWinAnsi | Return TRUE if the supplied unicode buffer only contains WinAnsi characters | |
| IsWinAnsiU | Return TRUE if the supplied UTF-8 buffer only contains WinAnsi characters | |
| IsWinAnsiU8Bit | Return TRUE if the supplied UTF-8 buffer only contains WinAnsi 8-bit characters | |
| IsZero | Returns TRUE if Value is nil or all supplied Values[] equal '' | |
| LeftU | Returns n leading characters | |
| LowerCase | Fast conversion of the supplied text into lowercase | |
| LowerCaseCopy | Fast conversion of the supplied text into lowercase | |
| LowerCaseSelf | Fast in-place conversion of the supplied variable text into lowercase | |
| LowerCaseSynUnicode | Use the RTL to convert the SynUnicode text to LowerCase | |
| LowerCaseU | Fast conversion of the supplied text into 8-bit lowercase | |
| LowerCaseUnicode | Accurate conversion of the supplied UTF-8 content into the corresponding lower-case Unicode characters | |
| NextNotSpaceCharIs | Check if the next character not in [#1..' '] matches a given value | |
| NextUtf8Ucs4 | Get the UCS4 CodePoint stored in P^ (decode UTF-8 if necessary) | |
| OnlyChar | Returns the supplied text content, without any other char than specified | |
| PosCharAny | Fast retrieve the position of any value of a given set of characters | |
| PosExI | An ASCII-7 case-insensitive version of PosEx() | |
| PosExI | A case-insensitive version of PosEx() with a specified lookup table | |
| PosExIPas | Internal function used when inlining PosExI() | |
| PosI | A non case-sensitive RawUtf8 version of Pos() | |
| PosIU | A non case-sensitive RawUtf8 version of Pos() | |
| PropNameSanitize | Try to generate a PropNameValid() output from an incoming text | |
| PropNamesValid | Returns TRUE if the given text buffers contains A..Z,0..9,_ characters | |
| PropNameValid | Returns TRUE if the given text buffer contains a..z,A..Z,0..9,_ characters | |
| QuickSortRawUtf8 | Sort a dynamic array of RawUtf8 items | |
| QuickSortRawUtf8 | Sort a RawUtf8 array, low values first | |
| QuotedStr | Format a text buffer with SQL-like quotes | |
| QuotedStr | Format a text content with SQL-like quotes | |
| QuotedStr | Format a text content with SQL-like quotes | |
| RawUnicodeToString | Convert any UTF-16 encoded buffer into a RTL string | |
| RawUnicodeToString | Convert any RawUnicode encoded string into a RTL string | |
| RawUnicodeToString | Convert any UTF-16 encoded buffer into a RTL string | |
| RawUnicodeToSynUnicode | Convert any UTF-16 buffer into a generic SynUnicode Text | |
| RawUnicodeToSynUnicode | Convert any RawUnicode String into a generic SynUnicode Text | |
| RawUnicodeToUtf8 | Convert a UTF-16 PWideChar buffer into a UTF-8 buffer | |
| RawUnicodeToUtf8 | Convert a RawUnicode string into a UTF-8 string | |
| RawUnicodeToUtf8 | Convert a UTF-16 PWideChar buffer into a UTF-8 string | |
| RawUnicodeToUtf8 | Convert a UTF-16 PWideChar buffer into a UTF-8 string | |
| RawUnicodeToUtf8 | Convert a UTF-16 PWideChar buffer into a UTF-8 string | |
| RawUnicodeToUtf8 | Convert a UTF-16 PWideChar buffer into a UTF-8 temporary buffer | |
| RawUnicodeToWinAnsi | Convert a UTF-16 PWideChar buffer into a WinAnsi (code page 1252) string | |
| RawUnicodeToWinAnsi | Convert a UTF-16 string into a WinAnsi (code page 1252) string | |
| RawUnicodeToWinPChar | Direct conversion of a UTF-16 encoded buffer into a WinAnsi PAnsiChar buffer | |
| RawUtf8DynArrayEquals | True if both TRawUtf8DynArray are the same | |
| RawUtf8DynArrayEquals | True if both TRawUtf8DynArray are the same for a given number of items | |
| RawUtf8FromFile | Read a File content into a RawUtf8, detecting any leading BOM | |
| RawUtf8OfChar | UTF-8 dedicated (and faster) alternative to StringOfChar(Ch, Count) | |
| RightU | Returns n trailing characters | |
| SameTextU | SameText() overloaded function with proper UTF-8 decoding | |
| ShortStringToUtf8 | Direct conversion of a WinAnsi ShortString into a UTF-8 text | |
| SortDynArrayAnsiStringI | Compare two "array of AnsiString" elements, with no case sensitivity | |
| SortDynArrayPUtf8CharI | Compare two "array of PUtf8Char/PAnsiChar" elements, with no case sensitivity | |
| SortDynArrayStringI | Compare two "array of RTL string" elements, with no case sensitivity | |
| SortDynArrayUnicodeStringI | Compare two "array of WideString/UnicodeString" elements, with no case sensitivity | |
| Split | Split a RawUtf8 string into two strings, according to SepStr separator | |
| Split | Split a RawUtf8 string into several strings, according to SepStr separator | |
| Split | Split a RawUtf8 string into two strings, according to SepStr separator | |
| SplitRight | Returns the last occurrence of the given SepChar separated context | |
| SplitRights | Returns the last occurrence of the given SepChar separated context | |
| StartWith | Check case-insensitive matching starting of text in upTextStart | |
| StartWithExact | Check case-sensitive matching starting of text in start | |
| StrCompIL | Our fast version of StrCompIL(), to be used with PUtf8Char | |
| StrCompL | Our fast version of StrCompL(), to be used with PUtf8Char | |
| strcspn | Pure pascal version of strcspn(), to be used with PUtf8Char/PAnsiChar | |
| StrIComp | Our fast version of StrIComp(), to be used with PUtf8Char/PAnsiChar | |
| StrICompLNotNil | StrIComp-like function with a length, lookup table and Str1/Str2 expected not nil | |
| StrICompNotNil | StrIComp-like function with a lookup table and Str1/Str2 expected not nil | |
| StrILNotNil | StrIComp function with a length, lookup table and Str1/Str2 expected not nil | |
| StringBufferToUtf8 | Convert any RTL string 0-terminated Text buffer into an UTF-8 string | |
| StringBufferToUtf8 | Convert any RTL string buffer into an UTF-8 encoded buffer | |
| StringDynArrayToRawUtf8DynArray | Convert the string dynamic array into a dynamic array of UTF-8 strings | |
| StringFromBomFile | Read a file into a temporary variable, check the BOM, and adjust the buffer | |
| StringListToRawUtf8DynArray | Convert the string list into a dynamic array of UTF-8 strings | |
| StringReplaceAll | Fast version of StringReplace(S, OldPattern, NewPattern, [rfReplaceAll]); | |
| StringReplaceAll | Fast version of several cascaded StringReplaceAll() | |
| StringReplaceAll | Case-sensitive (or not) StringReplace(S, OldPattern, NewPattern,[rfReplaceAll]) | |
| StringReplaceAllProcess | Actual replacement function called by StringReplaceAll() on first match | |
| StringReplaceChars | Fast replace of a specified char by a given string | |
| StringReplaceTabs | Fast replace of all #9 chars by a given string | |
| StringToAnsi7 | Convert any RTL string into Ansi 7-bit encoded String | |
| StringToRawUnicode | Convert any RTL string into a RawUnicode encoded String | |
| StringToRawUnicode | Convert any RTL string into a RawUnicode encoded String | |
| StringToSynUnicode | Convert any RTL string into a SynUnicode encoded String | |
| StringToSynUnicode | Convert any RTL string into a SynUnicode encoded String | |
| StringToUtf8 | Convert any RTL string into an UTF-8 encoded String | |
| StringToUtf8 | Convert any RTL string buffer into an UTF-8 encoded String | |
| StringToUtf8 | Convert any RTL string into an UTF-8 encoded String | |
| StringToUtf8 | Convert any RTL string into an UTF-8 encoded TSynTempBuffer | |
| StringToVariant | Convert any RTL string into a variant storing a UTF-8 string | |
| StringToVariant | Convert any RTL string into a variant storing a UTF-8 string | |
| StringToWinAnsi | Convert any RTL string into WinAnsi (Win-1252) 8-bit encoded String | |
| StrPosI | A non case-sensitive version of Pos() | |
| StrPosIReference | UTF-8 Unicode 10.0 case-insensitive Pattern search within UTF-8 buffer | |
| strspn | Pure pascal version of strspn(), to be used with PUtf8Char/PAnsiChar | |
| SynUnicodeToString | Convert any SynUnicode encoded string into a RTL string | |
| SynUnicodeToUtf8 | Convert a SynUnicode string into a UTF-8 string | |
| ToUtf8 | Convert any RTL string into an UTF-8 encoded String | |
| ToUtf8 | Convert any UTF-8 encoded ShortString Text into an UTF-8 encoded String | |
| TRawUtf8DynArrayFrom | Quick helper to initialize a dynamic array of RawUtf8 from some constants | |
| TrimChar | Returns the supplied text content, without any specified char | |
| TrimChars | Trim some leading and trailing chars | |
| TrimControlChars | Returns the supplied text content, without any control char | |
| TrimLeft | Trims leading whitespace characters from the string by removing new line, space, and tab characters | |
| TrimLeftLines | Trims leading whitespace of every line of the UTF-8 text | |
| TrimLeftLowerCase | Trim first lowercase chars ('otDone' will return 'Done' e.g.) | |
| TrimLeftLowerCaseShort | Trim first lowercase chars ('otDone' will return 'Done' e.g.) | |
| TrimLeftLowerCaseToShort | Trim first lowercase chars ('otDone' will return 'Done' e.g.) | |
| TrimLeftLowerCaseToShort | Trim first lowercase chars ('otDone' will return 'Done' e.g.) | |
| TrimOneChar | Returns the supplied text content, without one specified char | |
| TrimRight | Trims trailing whitespace characters from the string by removing trailing newline, space, and tab characters | |
| Ucs4ToUtf8 | UTF-8 encode one UCS4 CodePoint into Dest | |
| UnCamelCase | Convert a CamelCase string into a space separated one | |
| UnCamelCase | Convert a CamelCase string into a space separated one | |
| UnicodeBufferToString | Convert an Unicode buffer into a RTL string | |
| UnicodeBufferToUtf8 | Convert an Unicode buffer into a UTF-8 string | |
| UnicodeBufferToVariant | Convert an Unicode buffer into a variant storing a UTF-8 string | |
| UnicodeBufferToWinAnsi | Convert an Unicode buffer into a WinAnsi (code page 1252) string | |
| UniqueRawUtf8ZeroToTilde | Will fast replace all #0 chars as ~ | |
| UnQuotedSqlSymbolName | Unquote a SQL-compatible symbol name | |
| UnQuoteSqlString | Unquote a SQL-compatible string | |
| UnQuoteSqlStringVar | Unquote a SQL-compatible string | |
| UnZeroed | Convert a binary buffer into a fake ASCII/UTF-8 content without any #0 input | |
| UpperCase | Fast conversion of the supplied text into uppercase | |
| UpperCaseCopy | Fast conversion of the supplied text into uppercase | |
| UpperCaseCopy | Fast conversion of the supplied text into uppercase | |
| UpperCaseReference | UpperCase conversion of a UTF-8 string using our Unicode 10.0 tables | |
| UpperCaseSelf | Fast in-place conversion of the supplied variable text into uppercase | |
| UpperCaseSynUnicode | Use the RTL to convert the SynUnicode text to UpperCase | |
| UpperCaseU | Fast conversion of the supplied text into 8-bit uppercase | |
| UpperCaseUcs4Reference | UpperCase conversion of UTF-8 into UCS4 using our Unicode 10.0 tables | |
| UpperCaseUnicode | Accurate conversion of the supplied UTF-8 content into the corresponding upper-case Unicode characters | |
| UpperCopy | Copy source into dest^ with ASCII 7-bit upper case conversion | |
| UpperCopy255 | Copy source into a 256 chars dest^ buffer with 7-bit upper case conversion | |
| UpperCopy255Buf | Copy source^ into a 256 chars dest^ buffer with 7-bit upper case conversion | |
| UpperCopy255W | Copy UTF-16 source into dest^ with ASCII 7-bit upper case conversion | |
| UpperCopy255W | Copy WideChar source into dest^ with upper case conversion | |
| UpperCopyShort | Copy source into dest^ with ASCII 7-bit upper case conversion | |
| UpperCopyWin255 | Copy source into dest^ with WinAnsi 8-bit upper case conversion | |
| Utf16CharToUtf8 | UTF-8 encode one UTF-16 encoded UCS4 CodePoint into Dest | |
| Utf8DecodeToRawUnicode | Convert a UTF-8 string into a RawUnicode string | |
| Utf8DecodeToRawUnicode | Convert a UTF-8 encoded buffer into a RawUnicode string | |
| Utf8DecodeToRawUnicodeUI | Convert a UTF-8 string into a RawUnicode string | |
| Utf8DecodeToRawUnicodeUI | Convert a UTF-8 string into a RawUnicode string | |
| Utf8DecodeToString | Convert any UTF-8 encoded buffer into a RTL string | |
| Utf8DecodeToString | Convert any UTF-8 encoded buffer into a RTL string | |
| Utf8DecodeToUnicode | Convert any UTF-8 encoded string into an UTF-16 temporary buffer | |
| Utf8DecodeToUnicode | Convert any UTF-8 encoded buffer into an UTF-16 temporary buffer | |
| Utf8DecodeToUnicodeRawByteString | Convert an UTF-8 encoded buffer into a UTF-16 encoded RawByteString buffer | |
| Utf8DecodeToUnicodeRawByteString | Convert an UTF-8 encoded buffer into a UTF-16 encoded RawByteString buffer | |
| Utf8DecodeToUnicodeStream | Convert an UTF-8 encoded buffer into a UTF-16 encoded stream of bytes | |
| Utf8FirstLineToUtf16Length | Calculate the UTF-16 Unicode characters count of the UTF-8 encoded first line | |
| Utf8IComp | Fast UTF-8 comparison handling WinAnsi CP-1252 case folding | |
| Utf8ICompReference | UTF-8 comparison using our Unicode 10.0 tables | |
| Utf8ILComp | Fast UTF-8 comparison handling WinAnsi CP-1252 case folding | |
| Utf8ILCompReference | UTF-8 comparison using our Unicode 10.0 tables | |
| Utf8ToFileName | Convert any UTF-8 encoded String into a generic RTL file name string | |
| Utf8ToRawUtf8 | Direct conversion of a UTF-8 encoded zero terminated buffer into a RawUtf8 String | |
| Utf8ToShortString | Direct conversion of a UTF-8 encoded buffer into a WinAnsi ShortString buffer | |
| Utf8ToString | Convert any UTF-8 encoded String into a RTL string | |
| Utf8ToStringVar | Convert any UTF-8 encoded String into a RTL string | |
| Utf8ToSynUnicode | Convert any UTF-8 encoded String into a generic SynUnicode Text | |
| Utf8ToSynUnicode | Convert any UTF-8 encoded buffer into a generic SynUnicode Text | |
| Utf8ToSynUnicode | Convert any UTF-8 encoded String into a generic SynUnicode Text | |
| Utf8ToUnicodeLength | Calculate the UTF-16 Unicode characters count, UTF-8 encoded in source^ | |
| Utf8ToWideChar | Convert an UTF-8 encoded text into a WideChar (UTF-16) buffer | |
| Utf8ToWideChar | Convert an UTF-8 encoded text into a WideChar (UTF-16) buffer | |
| Utf8ToWideString | Convert any UTF-8 encoded String into a generic WideString Text | |
| Utf8ToWideString | Convert any UTF-8 encoded String into a generic WideString Text | |
| Utf8ToWideString | Convert any UTF-8 encoded String into a generic WideString Text | |
| Utf8ToWinAnsi | Direct conversion of a UTF-8 encoded zero terminated buffer into a WinAnsi String | |
| Utf8ToWinAnsi | Direct conversion of a UTF-8 encoded string into a WinAnsi String | |
| Utf8ToWinPChar | Direct conversion of a UTF-8 encoded buffer into a WinAnsi PAnsiChar buffer | |
| Utf8TruncatedLength | Compute the truncated length of the supplied UTF-8 value if it exceeds the specified bytes count | |
| Utf8TruncatedLength | Compute the truncated length of the supplied UTF-8 value if it exceeds the specified bytes count | |
| Utf8TruncateToLength | Will truncate the supplied UTF-8 value if its length exceeds the specified bytes count | |
| Utf8TruncateToUnicodeLength | Will truncate the supplied UTF-8 value if its length exceeds the specified UTF-16 Unicode characters count | |
| Utf8UpperCopy | Copy UTF-8 buffer into dest^ handling WinAnsi CP-1252 NormToUpper[] folding | |
| Utf8UpperCopy255 | Copy UTF-8 buffer into dest^ handling WinAnsi CP-1252 NormToUpper[] folding | |
| Utf8UpperReference | UpperCase conversion of a UTF-8 buffer using our Unicode 10.0 tables | |
| Utf8UpperReference | UpperCase conversion of a UTF-8 buffer using our Unicode 10.0 tables | |
| WideCharToWinAnsi | Conversion of a wide char into a WinAnsi (CodePage 1252) char index | |
| WideCharToWinAnsiChar | Conversion of a wide char into a WinAnsi (CodePage 1252) char | |
| WideStringToUtf8 | Convert a WideString into a UTF-8 string | |
| WideStringToWinAnsi | Convert a WideString into a WinAnsi (code page 1252) string | |
| WinAnsiBufferToUtf8 | Direct conversion of a WinAnsi PAnsiChar buffer into a UTF-8 encoded buffer | |
| WinAnsiToRawUnicode | Direct conversion of a WinAnsi (CodePage 1252) string into a Unicode encoded String | |
| WinAnsiToSynUnicode | Convert a Win-Ansi string into a Delphi 2009+ or FPC Unicode string | |
| WinAnsiToSynUnicode | Convert a Win-Ansi encoded buffer into a Delphi 2009+ or FPC Unicode string | |
| WinAnsiToUnicodeBuffer | Direct conversion of a WinAnsi (CodePage 1252) string into a Unicode buffer | |
| WinAnsiToUtf8 | Direct conversion of a WinAnsi (CodePage 1252) string into a UTF-8 encoded String | |
| WinAnsiToUtf8 | Direct conversion of a WinAnsi (CodePage 1252) string into a UTF-8 encoded String | |
| Zeroed | Convert a fake UTF-8 buffer without any #0 input back into its original binary |
procedure AddRawUtf8(var Values: TRawUtf8DynArray; var ValuesCount: integer; const Value: TRawUtf8DynArray); overload;
Add Value[] items to Values[], with an external count variable, for performance
procedure AddRawUtf8(var Values: TRawUtf8DynArray; const Value: TRawUtf8DynArray); overload;
Add Value[] items to Values[]
function AddRawUtf8(var Values: TRawUtf8DynArray; var ValuesCount: integer; const Value: RawUtf8): PtrInt; overload;
Add the Value to Values[], with an external count variable, for performance
function AddRawUtf8(var Values: TRawUtf8DynArray; const Value: RawUtf8; NoDuplicates: boolean = false; CaseSensitive: boolean = true): boolean; overload;
True if Value was added successfully in Values[]
function AddSortedRawUtf8(var Values: TRawUtf8DynArray; var ValuesCount: integer; const Value: RawUtf8; CoValues: PIntegerDynArray = nil; ForcedIndex: PtrInt = -1; Compare: TUtf8Compare = nil): PtrInt;
Add a RawUtf8 value in an alphabetically sorted dynamic array of RawUtf8
- returns the index where the Value was added successfully in Values[]
- returns -1 if the specified Value was already present in Values[] (we must avoid any duplicate for O(log(n)) binary search)
- if CoValues is set, its content will be moved to allow inserting a new value at CoValues[result] position - a typical usage of CoValues is to store the corresponding ID to each RawUtf8 item
- if FastLocatePUtf8CharSorted() has already been called, its result can be supplied in the optional ForcedIndex parameter
- by default, exact (case-sensitive) match is used; you can specify a custom compare function if needed in Compare optional parameter
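The contract above (binary search, -1 on duplicate, co-values shifted along with the names) can be illustrated with a small standalone Free Pascal program. This is only a sketch under stated assumptions: LocateSorted and AddSorted are hypothetical helpers written for this example, they work on plain AnsiString arrays, and they are not the mORMot implementation.

```pascal
program SortedInsertSketch;
{$mode objfpc}{$H+}

type
  TStrArray = array of AnsiString;
  TIntArray = array of integer;

// hypothetical helper: O(log(n)) binary search returning the insertion
// index, or -1 if Value is already present (case-sensitive comparison)
function LocateSorted(const Values: TStrArray; Count: integer;
  const Value: AnsiString): integer;
var
  L, R, M: integer;
begin
  L := 0;
  R := Count - 1;
  while L <= R do
  begin
    M := (L + R) shr 1;
    if Values[M] = Value then
      exit(-1)
    else if Values[M] < Value then
      L := M + 1
    else
      R := M - 1;
  end;
  result := L;
end;

// hypothetical helper: insert Value at its sorted position and move the
// CoValues[] items so that CoValues[result] is free for the caller
function AddSorted(var Values: TStrArray; var Count: integer;
  const Value: AnsiString; var CoValues: TIntArray): integer;
var
  i: integer;
begin
  result := LocateSorted(Values, Count, Value);
  if result < 0 then
    exit; // duplicate: nothing is added
  if Count = Length(Values) then
  begin
    SetLength(Values, Count + 8);   // Count is the external items count
    SetLength(CoValues, Count + 8);
  end;
  for i := Count - 1 downto result do
  begin
    Values[i + 1] := Values[i];
    CoValues[i + 1] := CoValues[i];
  end;
  Values[result] := Value;
  CoValues[result] := 0;
  inc(Count);
end;

var
  Names: TStrArray;
  IDs: TIntArray;
  Count, ndx: integer;
begin
  Count := 0;
  ndx := AddSorted(Names, Count, 'pear', IDs);
  IDs[ndx] := 10;
  ndx := AddSorted(Names, Count, 'apple', IDs); // 'pear' and its ID move to index 1
  IDs[ndx] := 20;
  ndx := AddSorted(Names, Count, 'pear', IDs);  // duplicate, so ndx = -1
  writeln(ndx, ' ', Count, ' ', Names[0], '=', IDs[0], ' ', Names[1], '=', IDs[1]);
end.
```

Keeping the count in a separate variable lets the arrays grow by chunks instead of being reallocated on every insertion, which is the reason given above for the external ValuesCount parameter.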
function AddString(var Values: TStringDynArray; const Value: string): PtrInt;
Add the Value to Values[] string array
procedure Ansi7ToString(Text: PWinAnsiChar; Len: PtrInt; var result: string); overload;
Convert any Ansi 7-bit encoded String into a RTL string
- the Text content must contain only 7-bit pure ASCII characters
function Ansi7ToString(Text: PWinAnsiChar; Len: PtrInt): string; overload;
Convert any Ansi 7-bit encoded String into a RTL string
- the Text content must contain only 7-bit pure ASCII characters
function Ansi7ToString(const Text: RawByteString): string; overload;
Convert any Ansi 7-bit encoded String into a RTL string
- the Text content must contain only 7-bit pure ASCII characters
function AnsiBufferToTempUtf8(var Temp: TSynTempBuffer; Buf: PAnsiChar; BufLen, CodePage: cardinal): PUtf8Char;
Convert any Ansi memory buffer into UTF-8, using a TSynTempBuffer if needed
- caller should release any memory by calling Temp.Done
- returns a pointer to the UTF-8 converted buffer - which may be Buf itself
procedure AnsiCharToUtf8(P: PAnsiChar; L: integer; var result: RawUtf8; CodePage: integer);
Convert an AnsiChar buffer (of a given code page) into a UTF-8 string
- the code page of the source buffer should be supplied in CodePage
- wrapper around TSynAnsiConvert.Engine(CodePage).AnsiBufferToRawUtf8()
function AnsiIComp(Str1, Str2: pointer): PtrInt;
Fast WinAnsi comparison using the NormToUpper[] array for all 8-bit values
function AnsiICompW(u1, u2: PWideChar): PtrInt;
Fast case-insensitive Unicode comparison handling ASCII 7-bit chars
- use the NormToUpperAnsi7Byte[] array, i.e. compare 'a'..'z' as 'A'..'Z'
- this version expects u1 and u2 to be zero-terminated
function AnsiToString(const Ansi: RawByteString; CodePage: integer): string;
Convert an AnsiString (of a given code page) into a RTL string
- the code page of the Ansi input should be supplied in CodePage
- wrapper around TSynAnsiConvert.Engine(CodePage) and string conversion
function AnsiToUtf8(const Ansi: RawByteString; CodePage: integer): RawUtf8;
Convert an AnsiString (of a given code page) into a UTF-8 string
- use AnyAnsiToUtf8() if you want to use the codepage of the input string
- wrapper around TSynAnsiConvert.Engine(CodePage).AnsiToUtf8()
procedure AnyAnsiToUtf8(const s: RawByteString; var result: RawUtf8); overload;
Direct conversion of an AnsiString with an unknown code page into an UTF-8 encoded String
- will assume CurrentAnsiConvert.CodePage prior to Delphi 2009
- newer UNICODE versions of Delphi will retrieve the code page from string
function AnyAnsiToUtf8(const s: RawByteString): RawUtf8; overload;
Direct conversion of an AnsiString with an unknown code page into an UTF-8 encoded String
- will assume CurrentAnsiConvert.CodePage prior to Delphi 2009
- newer UNICODE versions of Delphi will retrieve the code page from string
- use AnsiToUtf8() if you want to specify the codepage
function AnyTextFileToRawUtf8(const FileName: TFileName; AssumeUtf8IfNoBom: boolean = false): RawUtf8;
Read a File content into a RawUtf8, detecting any leading BOM
- assume file with no BOM is encoded with the current Ansi code page, not UTF-8, unless AssumeUtf8IfNoBom is true, in which case it behaves like RawUtf8FromFile()
function AnyTextFileToString(const FileName: TFileName; ForceUtf8: boolean = false): string;
Read a File content into a RTL string, detecting any leading BOM
- assume file with no BOM is encoded with the current Ansi code page, not UTF-8
- if ForceUtf8 is true, won't detect the BOM but assume whole file is UTF-8
function AnyTextFileToSynUnicode(const FileName: TFileName; ForceUtf8: boolean = false): SynUnicode;
Read a File content into SynUnicode string, detecting any leading BOM
- assume file with no BOM is encoded with the current Ansi code page, not UTF-8
- if ForceUtf8 is true, won't detect the BOM but assume whole file is UTF-8
procedure AppendShortComma(text: PAnsiChar; len: PtrInt; var result: ShortString; trimlowercase: boolean);
Fast append some UTF-8 text into a ShortString, with an ending ','
function BomFile(var Buffer: pointer; var BufferSize: PtrInt): TBomFile;
Check the file BOM at the beginning of a file buffer
- BOM is common only with Microsoft products
- returns bomNone if no BOM was recognized
- returns bomUnicode or bomUtf8 if a UTF-16LE or UTF-8 BOM was recognized, and adjusts Buffer/BufferSize to ignore the leading 2 or 3 bytes
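The BOM check described above can be sketched as a standalone Free Pascal program. The TBomKind type and the DetectBom function are hypothetical names used only for this illustration, and the sketch recognizes just the two markers mentioned above (FF FE for UTF-16LE, EF BB BF for UTF-8).

```pascal
program BomSketch;
{$mode objfpc}{$H+}

type
  // hypothetical enumeration for this sketch
  TBomKind = (bkNone, bkUnicode, bkUtf8);

// hypothetical helper: detect a leading BOM and skip it by adjusting
// Buffer and BufferSize in place
function DetectBom(var Buffer: PByte; var BufferSize: PtrInt): TBomKind;
begin
  result := bkNone;
  if Buffer = nil then
    exit;
  if (BufferSize >= 3) and (Buffer[0] = $EF) and (Buffer[1] = $BB) and
     (Buffer[2] = $BF) then
  begin
    result := bkUtf8;       // UTF-8 marker is 3 bytes long
    inc(Buffer, 3);
    dec(BufferSize, 3);
  end
  else if (BufferSize >= 2) and (Buffer[0] = $FF) and (Buffer[1] = $FE) then
  begin
    result := bkUnicode;    // UTF-16LE marker is 2 bytes long
    inc(Buffer, 2);
    dec(BufferSize, 2);
  end;
end;

const
  Sample: array[0..4] of byte = ($EF, $BB, $BF, ord('h'), ord('i'));
var
  P: PByte;
  Len: PtrInt;
  Kind: TBomKind;
begin
  P := @Sample[0];
  Len := SizeOf(Sample);
  Kind := DetectBom(P, Len);
  // Kind is bkUtf8, Len is now 2 and P points to the 'h' byte
  writeln(Kind = bkUtf8, ' ', Len, ' ', chr(P^));
end.
```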
procedure CamelCase(P: PAnsiChar; len: PtrInt; var s: RawUtf8; const isWord: TSynByteSet = [ord('0')..ord('9'), ord('a')..ord('z'), ord('A')..ord('Z')]); overload;
Convert a string into a human-friendly CamelCase identifier
- replacing spaces or punctuation by an uppercase character
- as such, it is not the reverse function to UnCamelCase()
procedure CamelCase(const text: RawUtf8; var s: RawUtf8; const isWord: TSynByteSet = [ord('0')..ord('9'), ord('a')..ord('z'), ord('A')..ord('Z')]); overload;
Convert a string into a human-friendly CamelCase identifier
- replacing spaces or punctuation by an uppercase character
- as such, it is not the reverse function to UnCamelCase()
procedure CaseCopy(Text: PUtf8Char; Len: PtrInt; Table: PNormTable; var Dest: RawUtf8);
Low-level function called when inlining UpperCase(Copy) and LowerCase(Copy)
procedure CaseSelf(var S: RawUtf8; Table: PNormTable);
Low-level function called when inlining UpperCaseSelf and LowerCaseSelf
function CodePageToText(aCodePage: cardinal): TShort16;
Convert a code page number into human-friendly text
function ContainsUtf8(p, up: PUtf8Char): boolean;
Return true if up^ is contained inside the UTF-8 buffer p^
- search up^ at the beginning of every UTF-8 word (aka in Soundex)
- here a "word" is a Win-Ansi word, i.e. '0'..'9', 'A'..'Z'
- up^ must be already Upper
function ConvertCaseUtf8(P: PUtf8Char; const Table: TNormTableByte): PtrInt;
Fast conversion of the supplied text into 8-bit case sensitivity
- convert the text in-place, returns the resulting length
- it will decode the supplied UTF-8 content to handle more than 7-bit ASCII characters during the conversion (leaving non-WinAnsi characters untouched)
- will not set the last char to #0 (caller must do that if necessary)
function DeleteRawUtf8(var Values: TRawUtf8DynArray; var ValuesCount: integer; Index: integer; CoValues: PIntegerDynArray = nil): boolean; overload;
Delete a RawUtf8 item in a dynamic array of RawUtf8
- if CoValues is set, the integer item at the same index is also deleted
function DeleteRawUtf8(var Values: TRawUtf8DynArray; Index: PtrInt): boolean; overload;
Delete a RawUtf8 item in a dynamic array of RawUtf8
procedure DetectRawUtf8(var source: RawByteString);
Detect UTF-8 content and mark the variable with the CP_UTF8 codepage
- to circumvent FPC concatenation bug with CP_UTF8 and CP_RAWBYTESTRING
function EndWith(const text, upTextEnd: RawUtf8): boolean;
Check case-insensitive matching ending of text in upTextEnd
- returns true if the item matched
- ignore case - upTextEnd must be already in upper case
- chars are compared as 7-bit Ansi only (no accented chars, nor UTF-8)
- see EndWithExact() from mormot.core.text for a case-sensitive version
function EndWithArray(const text: RawUtf8; const upArray: array of RawUtf8): integer;
Returns the index of a case-insensitive matching ending of text in upArray[]
- returns -1 if no item matched
- ignore case - upArray[] items must be already in upper case
- chars are compared as 7-bit Ansi only (no accented chars, nor UTF-8)
function EndWithExact(const text, textEnd: RawUtf8): boolean;
Check case-sensitive matching ending of text in textEnd
- returns true if the item matched
- see EndWith() from mormot.core.unicode for a case-insensitive version
function FastFindIndexedPUtf8Char(P: PPUtf8CharArray; R: PtrInt; var SortedIndexes: TCardinalDynArray; Value: PUtf8Char; ItemComp: TUtf8Compare): PtrInt;
Retrieve the index of a PUtf8Char in a PUtf8Char array via a sorted index
- will use fast O(log(n)) binary search algorithm
function FastFindPUtf8CharSorted(P: PPUtf8CharArray; R: PtrInt; Value: PUtf8Char): PtrInt; overload;
Retrieve the index where is located a PUtf8Char in a sorted PUtf8Char array
- R is the last index of available entries in P^ (i.e. Count-1)
- string comparison is case-sensitive StrComp (so will work with any PAnsiChar)
- returns -1 if the specified Value was not found
- will use inlined binary search algorithm with optimized x86_64 branchless asm
- slightly faster than plain FastFindPUtf8CharSorted(P,R,Value,@StrComp)
function FastFindPUtf8CharSorted(P: PPUtf8CharArray; R: PtrInt; Value: PUtf8Char; Compare: TUtf8Compare): PtrInt; overload;
Retrieve the index where is located a PUtf8Char in a sorted PUtf8Char array
- R is the last index of available entries in P^ (i.e. Count-1)
- string comparison will use the specified Compare function
- returns -1 if the specified Value was not found
- will use fast O(log(n)) binary search algorithm
function FastFindUpperPUtf8CharSorted(P: PPUtf8CharArray; R: PtrInt; Value: PUtf8Char; ValueLen: PtrInt): PtrInt;
Retrieve the index where is located a PUtf8Char in a sorted uppercase array
- P[] array is expected to be already uppercased
- searched Value is converted to uppercase before search via UpperCopy255Buf(), so is expected to be short, i.e. length < 250
- R is the last index of available entries in P^ (i.e. Count-1)
- returns -1 if the specified Value was not found
- will use fast O(log(n)) binary search algorithm
- slightly faster than plain FastFindPUtf8CharSorted(P,R,Value,@StrIComp)
function FastLocatePUtf8CharSorted(P: PPUtf8CharArray; R: PtrInt; Value: PUtf8Char): PtrInt; overload;
Retrieve the index where to insert a PUtf8Char in a sorted PUtf8Char array
- R is the last index of available entries in P^ (i.e. Count-1)
- string comparison is case-sensitive StrComp (so will work with any PAnsiChar)
- returns -1 if the specified Value was found (i.e. adding will duplicate a value)
- will use fast O(log(n)) binary search algorithm
function FastLocatePUtf8CharSorted(P: PPUtf8CharArray; R: PtrInt; Value: PUtf8Char; Compare: TUtf8Compare): PtrInt; overload;
Retrieve the index where to insert a PUtf8Char in a sorted PUtf8Char array
- this overloaded function accept a custom comparison function for sorting
- R is the last index of available entries in P^ (i.e. Count-1)
- string comparison will use the specified Compare function
- returns -1 if the specified Value was found (i.e. adding will duplicate a value)
- will use fast O(log(n)) binary search algorithm
procedure FillZero(var secret: RawByteString); overload;
Fill all bytes of this memory buffer with zeros, i.e. 'toto' -> #0#0#0#0
- will write the memory buffer directly, if this string instance is not shared (i.e. has refcount = 1), to avoid zeroing still-used values
- may be used to cleanup stack-allocated content
... finally FillZero(secret); end;
procedure FillZero(var secret: RawUtf8); overload;
Fill all bytes of this UTF-8 string with zeros, i.e. 'toto' -> #0#0#0#0
- will write the memory buffer directly, if this string instance is not shared (i.e. has refcount = 1), to avoid zeroing still-used values
- may be used to cleanup stack-allocated content
... finally FillZero(secret); end;
procedure FillZero(var secret: SpiUtf8); overload;
Fill all bytes of this UTF-8 string with zeros, i.e. 'toto' -> #0#0#0#0
- SpiUtf8 type has been defined explicitly to store Sensitive Personal Information
procedure FillZero(var secret: SynUnicode); overload;
Fill all bytes of this UTF-16 string with zeros, i.e. 'toto' -> #0#0#0#0
procedure FillZero(var secret: TBytes); overload;
Fill all bytes of this dynamic array of bytes with zeros
- will write the memory buffer directly, if this array instance is not shared (i.e. has refcount = 1), to avoid zeroing still-used values
function FindAnsi(A, UpperValue: PAnsiChar): boolean;
Return true if UpperValue (Ansi) is contained in A^ (Ansi)
- find UpperValue starting at word beginning, not inside words
function FindNameValue(P: PUtf8Char; UpperName: PAnsiChar): PUtf8Char; overload;
Search for a value from its uppercased named entry
- i.e. iterate IdemPChar(source,UpperName) over every line of the source
- returns the text just after UpperName if it has been found at line beginning
- returns nil if UpperName was not found at any line beginning
- could be used e.g. to efficiently extract a value from HTTP headers, whereas FindIniNameValue() is tuned for [section]-oriented INI files
function FindNameValue(const NameValuePairs: RawUtf8; UpperName: PAnsiChar; var Value: RawUtf8; KeepNotFoundValue: boolean = false; UpperNameSeparator: AnsiChar = #0): boolean; overload;
Search and return a value from its uppercased named entry
- i.e. iterate IdemPChar(source,UpperName) over every line of the source
- returns true and the trimmed text just after UpperName into Value if it has been found at line beginning
- returns false and set Value := '' if UpperName was not found (or leave Value untouched if KeepNotFoundValue is true)
- could be used e.g. to efficiently extract a value from HTTP headers, whereas FindIniNameValue() is tuned for [section]-oriented INI files
- do TrimLeftLines(NameValuePairs) first if the lines start with spaces/tabs
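The line-by-line lookup described above can be sketched in standalone Free Pascal. AfterUpperPrefix and FindValue are hypothetical helpers written for this illustration only. They assume 7-bit case folding ('a'..'z' compared as 'A'..'Z') and a name that starts at the very beginning of a line, and they are not the mORMot code.

```pascal
program NameValueSketch;
{$mode objfpc}{$H+}

// hypothetical helper: if the text at P starts with UpperName (ignoring
// 7-bit case), return the position just after it, otherwise nil
function AfterUpperPrefix(P, UpperName: PAnsiChar): PAnsiChar;
var
  c: AnsiChar;
begin
  result := nil;
  while UpperName^ <> #0 do
  begin
    c := P^;
    if c in ['a'..'z'] then
      c := AnsiChar(ord(c) - 32);
    if c <> UpperName^ then
      exit;             // mismatch, or end of text reached
    inc(P);
    inc(UpperName);
  end;
  result := P;
end;

// hypothetical helper: search UpperName at the beginning of every line
// and return the trimmed text which follows it
function FindValue(const Pairs: AnsiString; UpperName: PAnsiChar;
  out Value: AnsiString): boolean;
var
  P, V, E: PAnsiChar;
begin
  Value := '';
  result := false;
  P := PAnsiChar(Pairs);
  while P^ <> #0 do
  begin
    V := AfterUpperPrefix(P, UpperName);
    if V <> nil then
    begin
      while V^ in [' ', #9] do
        inc(V);                               // trim left
      E := V;
      while not (E^ in [#0, #10, #13]) do
        inc(E);                               // go to end of line
      while (E > V) and ((E - 1)^ in [' ', #9]) do
        dec(E);                               // trim right
      SetString(Value, V, E - V);
      exit(true);
    end;
    while not (P^ in [#0, #10, #13]) do
      inc(P);                                 // skip the current line
    while P^ in [#10, #13] do
      inc(P);                                 // skip the line feed(s)
  end;
end;

const
  Headers = 'Host: example.org'#13#10'Content-Type:  text/html '#13#10;
var
  v: AnsiString;
begin
  if FindValue(Headers, 'CONTENT-TYPE:', v) then
    writeln('[', v, ']'); // v contains 'text/html'
end.
```

Since the name is only tested at line start, a line indented with spaces or tabs would not match in this sketch, which is why trimming the lines first is suggested above.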
function FindNameValuePointer(NameValuePairs: PUtf8Char; UpperName: PAnsiChar; out FoundLen: PtrInt; UpperNameSeparator: AnsiChar): PUtf8Char;
Search and return a PUtf8Char value from its uppercased named entry
- as called when inlining FindNameValue()
- won't make any memory allocation, so could be fine for a quick lookup
function FindNextUtf8WordBegin(U: PUtf8Char): PUtf8Char;
Points to the beginning of the next word stored in U
- returns nil if reached the end of U (i.e. #0 char)
- here a "word" is a Win-Ansi word, i.e. '0'..'9', 'A'..'Z'
function FindRawUtf8(const Values: array of RawUtf8; const Value: RawUtf8; CaseSensitive: boolean = true): integer; overload;
Return the index of Value in Values[], -1 if not found
- CaseSensitive=false will use StrIComp() for A..Z / a..z equivalence
function FindRawUtf8(const Values: TRawUtf8DynArray; const Value: RawUtf8; CaseSensitive: boolean = true): integer; overload;
Return the index of Value in Values[], -1 if not found
- CaseSensitive=false will use StrIComp() for A..Z / a..z equivalence
function FindRawUtf8(Values: PRawUtf8; const Value: RawUtf8; ValuesCount: integer; CaseSensitive: boolean): integer; overload;
Low-level efficient search of Value in Values[]
- CaseSensitive=false will use StrIComp() for A..Z / a..z equivalence
function FindShortStringListExact(List: PShortString; MaxValue: integer; aValue: PUtf8Char; aValueLen: PtrInt): integer;
Fast search of an exact case-insensitive match of a RTTI's PShortString array
function FindShortStringListTrimLowerCase(List: PShortString; MaxValue: integer; aValue: PUtf8Char; aValueLen: PtrInt): integer;
Fast case-insensitive search of a left-trimmed lowercase match of a RTTI's PShortString array
function FindShortStringListTrimLowerCaseExact(List: PShortString; MaxValue: integer; aValue: PUtf8Char; aValueLen: PtrInt): integer;
Fast case-sensitive search of a left-trimmed lowercase match of a RTTI's PShortString array
function FindUnicode(PW: PWideChar; Upper: PWideChar; UpperLen: PtrInt): boolean;
Return true if Upper (Unicode encoded) is contained in PW^ (UTF-16 encoded)
- will use the slow but accurate Operating System API (Win32 or ICU) to perform the comparison at Unicode-level
- consider using StrPosIReference() for our faster Unicode 10.0 version
function FindUtf8(U: PUtf8Char; UpperValue: PAnsiChar): boolean;
Return true if UpperValue (Ansi) is contained in U^ (UTF-8 encoded)
- find UpperValue starting at word beginning, not inside words
- UTF-8 decoding is done on the fly (no temporary decoding buffer is used)
procedure GetCaptionFromPCharLen(P: PUtf8Char; out result: string);
UnCamelCase and translate a char buffer
- P is expected to be #0 ended
- return "string" type, i.e. UnicodeString for Delphi 2009+
function GetHighUtf8Ucs4(var U: PUtf8Char): Ucs4CodePoint;
Internal function, used to retrieve a >127 UCS4 CodePoint from UTF-8
- not to be called directly, but from inlined higher-level functions
- here U^ shall be always >= #$80
- typical use is as such:
ch := ord(P^);
if ch and $80 = 0 then
  inc(P)
else
  ch := GetHighUtf8Ucs4(P);
function GetLineContains(p, pEnd, up: PUtf8Char): boolean;
Returns TRUE if the supplied uppercased text is contained in the text buffer
function GetLineSize(P, PEnd: PUtf8Char): PtrUInt;
Compute the line length from source array of chars
- if PEnd = nil, end counting at either #0, #13 or #10
- otherwise, end counting at either #13 or #10
- just a wrapper around BufferLineLength() checking PEnd=nil case
function GetLineSizeSmallerThan(P, PEnd: PUtf8Char; aMinimalCount: integer): boolean;
Returns true if the line length from source array of chars is not less than the specified count
function GetNextFieldProp(var P: PUtf8Char; var Prop: RawUtf8): boolean;
Retrieve the next SQL-like identifier within the UTF-8 buffer
- will also trim any space (or line feeds) and trailing ';'
- any comment like '/*nocache*/' will be ignored
- returns true if something was set to Prop
function GetNextFieldPropSameLine(var P: PUtf8Char; var Prop: ShortString): boolean;
Retrieve the next identifier within the UTF-8 buffer on the same line
- GetNextFieldProp() will just handle line feeds (and ';') as spaces - which is fine e.g. for SQL, but not for regular config files with name/value pairs
- returns true if something was set to Prop
function GetNextLine(source: PUtf8Char; out next: PUtf8Char; andtrim: boolean = false): RawUtf8;
Extract a line from source array of chars
- next will contain the beginning of next line, or nil if source has ended
function GetNextStringLineToRawUnicode(var P: PChar): RawUnicode;
Return next string delimited with #13#10 from P, nil if no more
- this function returns a RawUnicode string type
function GetNextUtf8Upper(var U: PUtf8Char): Ucs4CodePoint;
Retrieve the next UCS4 CodePoint stored in U, then update the U pointer
- this function will decode the UTF-8 content before using NormToUpper[]
- will return '?' if the UCS4 CodePoint is higher than #255: so use this function only if you need to deal with ASCII characters (e.g. it's used for Soundex and for ContainsUtf8 function)
function GetUtf8WideChar(P: PUtf8Char): cardinal;
Decode UTF-16 WideChar from UTF-8 input buffer
- any surrogate (Ucs4>$ffff) is returned as UNICODE_REPLACEMENT_CHARACTER=$fffd
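The decoding rule above can be illustrated with a standalone Free Pascal sketch. NextCodePoint and NextWideChar are hypothetical helpers for this example, REPLACEMENT is a local constant standing for $fffd, and the validation is deliberately minimal, so this is not the mORMot implementation.

```pascal
program Utf8DecodeSketch;
{$mode objfpc}{$H+}

const
  REPLACEMENT = $fffd; // local constant for this sketch

// hypothetical helper: decode the code point stored at P^ and advance P
function NextCodePoint(var P: PByte): cardinal;
var
  extra: integer;
begin
  result := P^;
  inc(P);
  if (result and $80) = 0 then
    exit;                              // 7-bit ASCII: a single byte
  if (result and $E0) = $C0 then
  begin
    extra := 1;                        // 2-byte sequence
    result := result and $1F;
  end
  else if (result and $F0) = $E0 then
  begin
    extra := 2;                        // 3-byte sequence
    result := result and $0F;
  end
  else if (result and $F8) = $F0 then
  begin
    extra := 3;                        // 4-byte sequence
    result := result and $07;
  end
  else
    exit(REPLACEMENT);                 // invalid leading byte
  while extra > 0 do
  begin
    if (P^ and $C0) <> $80 then
      exit(REPLACEMENT);               // truncated sequence
    result := (result shl 6) or (P^ and $3F);
    inc(P);
    dec(extra);
  end;
end;

// hypothetical helper: anything above $ffff does not fit in one WideChar
function NextWideChar(var P: PByte): cardinal;
begin
  result := NextCodePoint(P);
  if result > $ffff then
    result := REPLACEMENT;
end;

const
  // 'A', U+00E9 (2 bytes), U+1F600 (4 bytes), then a #0 terminator
  Sample: array[0..7] of byte = ($41, $C3, $A9, $F0, $9F, $98, $80, 0);
var
  P: PByte;
begin
  P := @Sample[0];
  writeln(NextWideChar(P)); // 65
  writeln(NextWideChar(P)); // 233
  writeln(NextWideChar(P)); // 65533 = $fffd, since U+1F600 is above $ffff
end.
```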
function GotoEndOfQuotedString(P: PUtf8Char): PUtf8Char;
Get the next character after a quoted buffer
- the first character in P^ must be either ' or "
- it will return the latest quote position, ignoring double quotes within
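A minimal standalone sketch of this kind of scan, in Free Pascal, could look as follows. EndOfQuoted is a hypothetical helper, and returning the position of the ending #0 for an unterminated text is an assumption of this example, not something stated above.

```pascal
program QuotedSketch;
{$mode objfpc}{$H+}

// hypothetical helper: P^ must be the opening ' or " character; returns
// the position of the closing quote, or of the ending #0 if there is none
function EndOfQuoted(P: PAnsiChar): PAnsiChar;
var
  quote: AnsiChar;
begin
  quote := P^;
  inc(P);
  while P^ <> #0 do
  begin
    if P^ = quote then
      if (P + 1)^ = quote then
        inc(P)   // doubled quote: belongs to the content, skip its first char
      else
        break;   // single quote: this is the closing one
    inc(P);
  end;
  result := P;
end;

const
  // the buffer contains: 'it''s' and more
  Sql: PAnsiChar = '''it''''s'' and more';
begin
  // the closing quote is at offset 6 within the buffer
  writeln(EndOfQuoted(Sql) - Sql);
end.
```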
function GotoNextNotSpace(P: PUtf8Char): PUtf8Char;
Get the next character not in [#1..' ']
function GotoNextNotSpaceSameLine(P: PUtf8Char): PUtf8Char;
Get the next character not in [#9,' ']
function GotoNextSpace(P: PUtf8Char): PUtf8Char;
Get the next character in [#0..' ']
function IdemFileExt(p: PUtf8Char; extup: PAnsiChar; sepChar: AnsiChar = '.'): boolean;
Returns true if the file name extension contained in p^ is the same as extup^
- ignore case - extup^ must be already Upper
- chars are compared as 7-bit Ansi only (no accentuated chars, nor UTF-8)
- could be used e.g. like IdemFileExt(aFileName,'.JP');
function IdemFileExts(p: PUtf8Char; const extup: array of PAnsiChar; sepChar: AnsiChar = '.'): integer;
Returns matching file name extension index as extup^
- ignore case - extup[] must be already Upper
- chars are compared as 7-bit Ansi only (no accentuated chars, nor UTF-8)
- could be used e.g. like IdemFileExts(aFileName,['.PAS','.INC']);
function IdemPChar(p: PUtf8Char; up: PAnsiChar; table: PNormTable): boolean; overload;
Returns true if the beginning of p^ is the same as up^
- this overloaded function accept the uppercase lookup buffer as parameter
function IdemPChar(p: PUtf8Char; up: PAnsiChar): boolean; overload;
Returns true if the beginning of p^ is the same as up^
- ignore case - up^ must be already Upper
- chars are compared as 7-bit Ansi only (no accentuated characters); when you only need to search e.g. for field names, IdemPChar() is preferred, because it is faster than IdemPCharU() when UTF-8 decoding is not mandatory
- if p is nil, will return FALSE
- if up is nil, will return TRUE
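A typical use of this overload is a case-insensitive prefix test against an uppercase constant. The following is a minimal sketch under the assumption that the mORMot 2 units are available; the header text and program name are made up for the example.

```pascal
program IdemPCharDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  header: RawUtf8;
begin
  header := 'Content-Type: text/plain';
  // the second argument must already be in upper case
  if IdemPChar(pointer(header), 'CONTENT-TYPE:') then
    writeln('content type header found');
  // per the notes above, a nil p is documented to return FALSE
  writeln(IdemPChar(nil, 'ABC'));
end.
```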
function IdemPCharAndGetNextLine(var source: PUtf8Char; searchUp: PAnsiChar): boolean;
Return true if IdemPChar(source,searchUp), and go to the next line of source
function IdemPCharArray(p: PUtf8Char; const upArray: array of PAnsiChar): integer;
Returns the index of a matching beginning of p^ in upArray[]
- returns -1 if no item matched
- ignore case - upArray^ must be already Upper
- chars are compared as 7-bit Ansi only (no accentuated chars, nor UTF-8)
- warning: this function expects upArray[] items to have AT LEAST TWO CHARS (it will use a fast 16-bit comparison of initial 2 bytes)
- consider IdemPPChar() which is faster but a bit more verbose
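The returned index lends itself to a `case` dispatch. The sketch below is only an illustration of that pattern: the request line and the branch labels are hypothetical, and every item has at least two chars as the warning above requires.

```pascal
program IdemPCharArrayDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  line: RawUtf8;
begin
  line := 'post /api/items';
  // items are upper case and at least two chars long
  case IdemPCharArray(pointer(line), ['GET ', 'POST ', 'PUT ']) of
    0:
      writeln('read');
    1:
      writeln('create');
    2:
      writeln('update');
  else
    writeln('unsupported'); // -1 is returned when no item matched
  end;
end.
```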
function IdemPCharArrayBy2(p: PUtf8Char; const upArrayBy2Chars: RawUtf8): PtrInt;
Returns the index of a matching beginning of p^ in upArrayBy2Chars, taken as consecutive two-character items
- returns -1 if no item matched
- ignore case - upArrayBy2Chars must be already Upper
- chars are compared as 7-bit Ansi only (no accentuated chars, nor UTF-8)
function IdemPCharU(p, up: PUtf8Char): boolean;
Returns true if the beginning of p^ is the same as up^
- ignore case - up^ must be already Upper
- this version will decode the UTF-8 content before using NormToUpper[], so it will be slower than the IdemPChar() function above, but will handle WinAnsi accentuated characters (e.g. 'e' acute will be matched as 'E')
function IdemPCharW(p: PWideChar; up: PUtf8Char): boolean;
Returns true if the beginning of p^ is the same as up^
- ignore case - up^ must be already Upper
- this version expects p^ to point to a Unicode char array
function IdemPCharWithoutWhiteSpace(p: PUtf8Char; up: PAnsiChar): boolean;
Returns true if the beginning of p^ is the same as up^, ignoring white spaces
- ignore case - up^ must be already Upper
- any white space in the input p^ buffer is just ignored
- chars are compared as 7-bit Ansi only (no accentuated characters); when you only need to search e.g. for field names, IdemPChar() is preferred, because it is faster than IdemPCharU() when UTF-8 decoding is not mandatory
- if p is nil, will return FALSE
- if up is nil, will return TRUE
function IdemPPChar(p: PUtf8Char; up: PPAnsiChar): PtrInt;
Returns the index of a matching beginning of p^ in nil-terminated up^ array
- returns -1 if no item matched
- ignore case - each up^ must be already Upper
- chars are compared as 7-bit Ansi only (no accentuated chars, nor UTF-8)
- warning: this function expects up^ items to have AT LEAST TWO CHARS (it will use a fast 16-bit comparison of initial 2 bytes)
function IdemPropName(const P1, P2: ShortString): boolean; overload;
Case insensitive comparison of ASCII 7-bit identifiers
- use it with property names values (i.e. only including A..Z,0..9,_ chars)
- behavior is undefined with UTF-8 encoding (some false positive may occur)
function IdemPropName(const P1: ShortString; P2: PUtf8Char; P2Len: PtrInt): boolean; overload;
Case insensitive comparison of ASCII 7-bit identifiers
- use it with property names values (i.e. only including A..Z,0..9,_ chars)
- behavior is undefined with UTF-8 encoding (some false positive may occur)
function IdemPropName(P1, P2: PUtf8Char; P1Len, P2Len: PtrInt): boolean; overload;
Case insensitive comparison of ASCII 7-bit identifiers
- use it with property names values (i.e. only including A..Z,0..9,_ chars)
- behavior is undefined with UTF-8 encoding (some false positive may occur)
- this version expects P1 and P2 to be a PAnsiChar with specified lengths
function IdemPropNameU(const P1, P2: RawUtf8): boolean; overload;
Case insensitive comparison of ASCII 7-bit identifiers
- use it with property names values (i.e. only including A..Z,0..9,_ chars)
- behavior is undefined with UTF-8 encoding (some false positive may occur)
- is an alternative with PropNameEquals() to be used inlined e.g. in a loop
function IdemPropNameU(const P1: RawUtf8; P2: PUtf8Char; P2Len: PtrInt): boolean; overload;
Case insensitive comparison of ASCII 7-bit identifiers
- use it with property names values (i.e. only including A..Z,0..9,_ chars)
- behavior is undefined with UTF-8 encoding (some false positive may occur)
- this version expects P2 to be a PAnsiChar with specified length
function IdemPropNameUSameLenNotNull(P1, P2: PUtf8Char; P1P2Len: PtrInt): boolean;
Case insensitive comparison of ASCII 7-bit identifiers of same length
- use it with property names values (i.e. only including A..Z,0..9,_ chars)
- behavior is undefined with UTF-8 encoding (some false positive may occur)
- this version expects P1 and P2 to be a PAnsiChar with an already checked identical length, so may be used for a faster process, e.g. in a loop
- if P1 and P2 are RawUtf8, you should better call overloaded function IdemPropNameU(const P1,P2: RawUtf8), which would be slightly faster by using the length stored before the actual text buffer of each RawUtf8
function IsCaseSensitive(P: PUtf8Char; PLen: PtrInt): boolean; overload;
Check if the supplied text has some case-insensitive 'a'..'z','A'..'Z' chars
- will therefore be correct with true UTF-8 content, but only for 7-bit
function IsCaseSensitive(const S: RawUtf8): boolean; overload;
Check if the supplied text has some case-insensitive 'a'..'z','A'..'Z' chars
- will therefore be correct with true UTF-8 content, but only for 7-bit
function IsFixedWidthCodePage(aCodePage: cardinal): boolean;
Check if a code page is known to be of fixed width, i.e. not MBCS
- i.e. will be implemented as a TSynAnsiFixedWidth
function IsValidUtf8(const source: RawUtf8): boolean; overload;
Returns TRUE if the supplied buffer has valid UTF-8 encoding
- will also refuse #0 characters within the buffer
- on Haswell AVX2 Intel/AMD CPUs, will use very efficient ASM, reaching e.g. 21 GB/s parsing speed on a Core i5-13500
function IsValidUtf8(source: PUtf8Char): boolean; overload;
Returns TRUE if the supplied buffer has valid UTF-8 encoding
- will stop when the buffer contains #0
- just a wrapper around IsValidUtf8Buffer(source, StrLen(source)) so if you know the source length, you would better call IsValidUtf8Buffer() directly
- on Haswell AVX2 Intel/AMD CPUs, will use very efficient ASM, reaching e.g. 15 GB/s parsing speed on a Core i5-13500 - StrLen() itself runs at 37 GB/s
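To make the validation contract concrete, here is a small sketch that checks one plain ASCII string and one deliberately malformed byte sequence (a `$C3` lead byte followed by `$28`, which is not a continuation byte). It assumes the mORMot 2 units are on the search path; the constant and variable names are illustrative.

```pascal
program IsValidUtf8Demo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

const
  // two-byte lead $C3 followed by a byte outside $80..$BF
  INVALID_BYTES: array[0..1] of byte = ($C3, $28);
var
  good, bad: RawUtf8;
begin
  good := 'plain ascii';
  SetString(bad, PAnsiChar(@INVALID_BYTES), length(INVALID_BYTES));
  writeln(IsValidUtf8(good)); // RawUtf8 overload, length is known
  writeln(IsValidUtf8(bad));  // malformed sequence is rejected
end.
```

When the length is already known, the notes above suggest calling IsValidUtf8Buffer() directly rather than the PUtf8Char overload.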
function IsValidUtf8WithoutControlChars(source: PUtf8Char): boolean; overload;
Returns TRUE if the supplied buffer has valid UTF-8 encoding with no #1..#31 control characters
- supplied input is a pointer to a #0 ended text buffer
function IsValidUtf8WithoutControlChars(const source: RawUtf8): boolean; overload;
Returns TRUE if the supplied buffer has valid UTF-8 encoding with no #0..#31 control characters
- supplied input is a RawUtf8 variable
function IsVoid(const text: RawUtf8): boolean;
Check that all characters within text are spaces or control chars
- i.e. a faster alternative to if TrimU(text)='' then
function IsWinAnsi(WideText: PWideChar; Length: integer): boolean; overload;
Return TRUE if the supplied unicode buffer only contains WinAnsi characters
- i.e. if the text can be displayed using ANSI_CHARSET
function IsWinAnsi(WideText: PWideChar): boolean; overload;
Return TRUE if the supplied unicode buffer only contains WinAnsi characters
- i.e. if the text can be displayed using ANSI_CHARSET
function IsWinAnsiU(Utf8Text: PUtf8Char): boolean;
Return TRUE if the supplied UTF-8 buffer only contains WinAnsi characters
- i.e. if the text can be displayed using ANSI_CHARSET
function IsWinAnsiU8Bit(Utf8Text: PUtf8Char): boolean;
Return TRUE if the supplied UTF-8 buffer only contains WinAnsi 8-bit characters
- i.e. if the text can be displayed using ANSI_CHARSET with only 8-bit unicode characters (e.g. no "tm" or such)
function IsZero(const Values: TRawUtf8DynArray): boolean; overload;
Returns TRUE if Values is nil or all supplied Values[] equal ''
function LeftU(const S: RawUtf8; n: PtrInt): RawUtf8;
Returns n leading characters
function LowerCase(const S: RawUtf8): RawUtf8;
Fast conversion of the supplied text into lowercase
- this will only convert 'A'..'Z' into 'a'..'z' (no NormToLower use), and will therefore be correct with true UTF-8 content
procedure LowerCaseCopy(Text: PUtf8Char; Len: PtrInt; var Dest: RawUtf8);
Fast conversion of the supplied text into lowercase
- this will only convert 'A'..'Z' into 'a'..'z' (no NormToLower use), and will therefore be correct with true UTF-8 content
procedure LowerCaseSelf(var S: RawUtf8);
Fast in-place conversion of the supplied variable text into lowercase
- this will only convert 'A'..'Z' into 'a'..'z' (no NormToLower use), and will therefore be correct with true UTF-8 content, but only for 7-bit
function LowerCaseSynUnicode(const S: SynUnicode): SynUnicode;
Use the RTL to convert the SynUnicode text to LowerCase
function LowerCaseU(const S: RawUtf8): RawUtf8;
Fast conversion of the supplied text into 8-bit lowercase
- this will not only convert 'A'..'Z' into 'a'..'z', but also accentuated latin characters ('E' acute into 'e' e.g.), using NormToLower[] array
- it will therefore decode the supplied UTF-8 content to handle more than 7-bit of ascii characters
function LowerCaseUnicode(const S: RawUtf8): RawUtf8;
Accurate conversion of the supplied UTF-8 content into the corresponding lower-case Unicode characters
- will use the available API (e.g. Win32 or ICU), so may not be consistent on all systems - and also slower than LowerCase/LowerCaseU versions
function NextNotSpaceCharIs(var P: PUtf8Char; ch: AnsiChar): boolean;
Check if the next character not in [#1..' '] matches a given value
- first ignores any space character, i.e. any char in [#1..' ']
- then returns TRUE if P^=ch, setting P to the character after ch
- or returns FALSE if P^<>ch, leaving P at the level of the unexpected char
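The following sketch shows how the `var P` parameter can drive a tiny hand-written parser: leading blanks are skipped, then the expected separator is consumed. The input text is invented for the example, and the mORMot 2 units are assumed to be available.

```pascal
program NextNotSpaceCharIsDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  text: RawUtf8;
  p: PUtf8Char;
begin
  text := '   : 42';
  p := pointer(text);
  if NextNotSpaceCharIs(p, ':') then
    // on success, p has been moved to the character after ':'
    writeln('separator found')
  else
    // on failure, p is left on the unexpected character
    writeln('unexpected input');
end.
```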
function NextUtf8Ucs4(var P: PUtf8Char): Ucs4CodePoint;
Get the UCS4 CodePoint stored in P^ (decode UTF-8 if necessary)
function OnlyChar(const text: RawUtf8; const only: TSynAnsicharSet): RawUtf8;
Returns the supplied text content, without any other char than specified
- specify a custom char set to be included, e.g. as ['A'..'Z']
function PosCharAny(Str: PUtf8Char; Characters: PAnsiChar): PUtf8Char;
Fast retrieve the position of any value of a given set of characters
- see also strspn() function which is likely to be faster
function PosExI(const SubStr, S: RawUtf8; Offset: PtrUInt; Lookup: PNormTable): PtrInt; overload;
A case-insensitive version of PosEx() with a specified lookup table
- redirect to mormot.core.base PosEx() if Lookup = nil
function PosExI(const SubStr, S: RawUtf8; Offset: PtrUInt): PtrInt; overload;
An ASCII-7 case-insensitive version of PosEx()
- will use the NormToUpperAnsi7 lookup table for character conversion
function PosExIPas(Sub, P: PUtf8Char; Offset: PtrUInt; Lookup: PNormTable): PtrInt;
Internal function used when inlining PosExI()
function PosI(uppersubstr: PUtf8Char; const str: RawUtf8): PtrInt;
A non case-sensitive RawUtf8 version of Pos()
- uppersubstr is expected to be already in upper case
- this version handles only 7-bit ASCII (no accentuated characters)
- see PosIU() if you want a UTF-8 version with accentuated chars support
function PosIU(substr: PUtf8Char; const str: RawUtf8): integer;
A non case-sensitive RawUtf8 version of Pos()
- substr is expected to be already in upper case
- this version will decode the UTF-8 content before using NormToUpper[]
- see PosI() for a non-accentuated, but faster version
function PropNameSanitize(const text, fallback: RawUtf8): RawUtf8;
Try to generate a PropNameValid() output from an incoming text
- will trim all spaces, and replace most special chars by '_'
- if it is not PropNameValid() after those replacements, will return fallback
function PropNamesValid(const Values: array of RawUtf8): boolean;
Returns TRUE if the given text buffers contain A..Z,0..9,_ characters
- use it with property names values (i.e. only including A..Z,0..9,_ chars)
- this function allows numbers as first char, so won't check the first char the same way as PropNameValid(), which refuses digits following the pascal convention
function PropNameValid(P: PUtf8Char): boolean;
Returns TRUE if the given text buffer contains a..z,A..Z,0..9,_ characters
- should match most usual property names values or other identifier names in the business logic source code
- i.e. can be tested via IdemPropName*() functions, and the MongoDB-like extended JSON syntax as generated by dvoSerializeAsExtendedJson
- following classic pascal naming convention, first char must be alphabetical or '_' (i.e. not a digit), following chars can be alphanumerical or '_'
procedure QuickSortRawUtf8(Values: PRawUtf8Array; L, R: PtrInt; caseInsensitive: boolean = false); overload;
Sort a RawUtf8 array, low values first
procedure QuickSortRawUtf8(var Values: TRawUtf8DynArray; ValuesCount: integer; CoValues: PIntegerDynArray = nil; Compare: TUtf8Compare = nil); overload;
Sort a dynamic array of RawUtf8 items
- if CoValues is set, the integer items are also synchronized
- by default, exact (case-sensitive) match is used; you can specify a custom compare function if needed in Compare optional parameter
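A minimal sketch of the dynamic-array overload, using the default case-sensitive comparison and no CoValues. It relies on TRawUtf8DynArrayFrom() from this same unit to build the input; the sample values are arbitrary.

```pascal
program QuickSortRawUtf8Demo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  names: TRawUtf8DynArray;
  i: integer;
begin
  names := TRawUtf8DynArrayFrom(['pear', 'apple', 'fig']);
  // CoValues and Compare keep their nil defaults
  QuickSortRawUtf8(names, length(names));
  for i := 0 to high(names) do
    writeln(names[i]);
end.
```

Passing ValuesCount separately allows sorting only the used part of an array whose capacity is larger than its item count.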
function QuotedStr(const S: RawUtf8; Quote: AnsiChar = ''''): RawUtf8; overload;
Format a text content with SQL-like quotes
- this function implements what is specified in the official SQLite3 documentation: "A string constant is formed by enclosing the string in single quotes ('). A single quote within the string can be encoded by putting two single quotes in a row - as in Pascal."
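The doubling rule quoted above can be tried out with a short sketch. It assumes the mORMot 2 units are on the search path and deliberately does not use SysUtils, so that the unit's own QuotedStr() is the one being called.

```pascal
program QuotedStrDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

begin
  // the embedded single quote is doubled and the text is enclosed in quotes
  writeln(QuotedStr('it''s'));
  // same rule applied with a custom quote character
  writeln(QuotedStr('say "hi"', '"'));
end.
```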
procedure QuotedStr(const S: RawUtf8; Quote: AnsiChar; var result: RawUtf8); overload;
Format a text content with SQL-like quotes
procedure QuotedStr(P: PUtf8Char; PLen: PtrInt; Quote: AnsiChar; var result: RawUtf8); overload;
Format a text buffer with SQL-like quotes
function RawUnicodeToString(P: PWideChar; L: integer): string; overload;
Convert any UTF-16 encoded buffer into a RTL string
function RawUnicodeToString(const U: RawUnicode): string; overload;
Convert any RawUnicode encoded string into a RTL string
- uses StrLenW() and not length(U) to handle the case when U was used as a buffer
procedure RawUnicodeToString(P: PWideChar; L: integer; var result: string); overload;
Convert any UTF-16 encoded buffer into a RTL string
function RawUnicodeToSynUnicode(const Unicode: RawUnicode): SynUnicode; overload;
Convert any RawUnicode String into a generic SynUnicode Text
function RawUnicodeToSynUnicode( WideChar: PWideChar; WideCharCount: integer): SynUnicode; overload;
Convert any UTF-16 buffer into a generic SynUnicode Text
procedure RawUnicodeToUtf8(WideChar: PWideChar; WideCharCount: integer; var result: RawUtf8; Flags: TCharConversionFlags = [ccfNoTrailingZero]); overload;
Convert a UTF-16 PWideChar buffer into a UTF-8 string
function RawUnicodeToUtf8(WideChar: PWideChar; WideCharCount: integer; Flags: TCharConversionFlags = [ccfNoTrailingZero]): RawUtf8; overload;
Convert a UTF-16 PWideChar buffer into a UTF-8 string
procedure RawUnicodeToUtf8(WideChar: PWideChar; WideCharCount: integer; var result: TSynTempBuffer; Flags: TCharConversionFlags); overload;
Convert a UTF-16 PWideChar buffer into a UTF-8 temporary buffer
function RawUnicodeToUtf8(Dest: PUtf8Char; DestLen: PtrInt; Source: PWideChar; SourceLen: PtrInt; Flags: TCharConversionFlags): PtrInt; overload;
Convert a UTF-16 PWideChar buffer into a UTF-8 buffer
- replace system.UnicodeToUtf8 implementation, which is rather slow since Delphi 2009+
- append a trailing #0 to the ending PUtf8Char, unless ccfNoTrailingZero is set
- if ccfReplacementCharacterForUnmatchedSurrogate is set, this function will identify unmatched surrogate pairs and replace them with UNICODE_REPLACEMENT_CHARACTER - see https://en.wikipedia.org/wiki/Specials_(Unicode_block)
function RawUnicodeToUtf8(const Unicode: RawUnicode): RawUtf8; overload;
Convert a RawUnicode string into a UTF-8 string
function RawUnicodeToUtf8(WideChar: PWideChar; WideCharCount: integer; out Utf8Length: integer): RawUtf8; overload;
Convert a UTF-16 PWideChar buffer into a UTF-8 string
- this version doesn't resize the resulting RawUtf8 string, but returns the new resulting RawUtf8 byte count into Utf8Length
function RawUnicodeToWinAnsi(const Unicode: RawUnicode): WinAnsiString; overload;
Convert a UTF-16 string into a WinAnsi (code page 1252) string
function RawUnicodeToWinAnsi( WideChar: PWideChar; WideCharCount: integer): WinAnsiString; overload;
Convert a UTF-16 PWideChar buffer into a WinAnsi (code page 1252) string
procedure RawUnicodeToWinPChar(dest: PAnsiChar; source: PWideChar; WideCharCount: integer);
Direct conversion of a UTF-16 encoded buffer into a WinAnsi PAnsiChar buffer
function RawUtf8DynArrayEquals(const A, B: TRawUtf8DynArray; Count: integer): boolean; overload;
True if both TRawUtf8DynArray are the same for a given number of items
- A and B are expected to have at least Count items
- comparison is case-sensitive
function RawUtf8DynArrayEquals(const A, B: TRawUtf8DynArray): boolean; overload;
True if both TRawUtf8DynArray are the same
- comparison is case-sensitive
function RawUtf8FromFile(const FileName: TFileName): RawUtf8;
Read a File content into a RawUtf8, detecting any leading BOM
- will assume text file with no BOM is already UTF-8 encoded
- an alternative to StringFromFile() if you want to handle UTF-8 content and the files are likely to be natively UTF-8 encoded, or with a BOM
function RawUtf8OfChar(Ch: AnsiChar; Count: integer): RawUtf8;
UTF-8 dedicated (and faster) alternative to StringOfChar(Ch,Count)
function RightU(const S: RawUtf8; n: PtrInt): RawUtf8;
Returns n trailing characters
function SameTextU(const S1, S2: RawUtf8): boolean;
SameText() overloaded function with proper UTF-8 decoding
- fast version using NormToUpper[] array for all WinAnsi characters
- this version will decode each UTF-8 glyph before using NormToUpper[]
- current implementation handles UTF-16 surrogates as Utf8IComp()
function ShortStringToUtf8(const source: ShortString): RawUtf8;
Direct conversion of a WinAnsi ShortString into a UTF-8 text
- internally calls the fast WinAnsiConvert conversion class
function SortDynArrayAnsiStringI(const A, B): integer;
Compare two "array of AnsiString" elements, with no case sensitivity
- just a wrapper around inlined StrIComp()
function SortDynArrayPUtf8CharI(const A, B): integer;
Compare two "array of PUtf8Char/PAnsiChar" elements, with no case sensitivity
- just a wrapper around inlined StrIComp()
function SortDynArrayStringI(const A, B): integer;
Compare two "array of RTL string" elements, with no case sensitivity
- the expected string type is the RTL string
- just a wrapper around StrIComp() for AnsiString or AnsiICompW() for UNICODE
function SortDynArrayUnicodeStringI(const A, B): integer;
Compare two "array of WideString/UnicodeString" elements, with no case sensitivity
- implemented here since it would call AnsiICompW()
function Split(const Str, SepStr: RawUtf8; var LeftStr, RightStr: RawUtf8; ToUpperCase: boolean = false): boolean; overload;
Split a RawUtf8 string into two strings, according to SepStr separator
- returns true and LeftStr/RightStr if they were separated by SepStr
- if SepStr is not found, LeftStr=Str and RightStr='' and returns false
- if ToUpperCase is TRUE, then LeftStr and RightStr will be made uppercase
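As an example of the boolean result and the two `var` outputs, this sketch splits a hypothetical name/value pair. The input string is invented; the fallback branch relies on the documented behavior that LeftStr receives the whole input when SepStr is not found.

```pascal
program SplitDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  key, value: RawUtf8;
begin
  if Split('timeout=30', '=', key, value) then
    writeln(key, ' -> ', value)
  else
    // separator missing: key holds the whole input and value is ''
    writeln('no separator, left part is ', key);
end.
```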
function Split(const Str: RawUtf8; const SepStr: array of RawUtf8; const DestPtr: array of PRawUtf8): PtrInt; overload;
Split a RawUtf8 string into several strings, according to SepStr separator
- this overloaded function will fill a DestPtr[] array of PRawUtf8
- if any DestPtr[]=nil, the item will be skipped
- if input Str ends before all SepStr[] are found, DestPtr[] is set to ''
- returns the number of values extracted into DestPtr[]
function Split(const Str, SepStr: RawUtf8; var LeftStr: RawUtf8; ToUpperCase: boolean = false): RawUtf8; overload;
Split a RawUtf8 string into two strings, according to SepStr separator
- this overloaded function returns the right string as function result
- if SepStr is not found, LeftStr=Str and result=''
- if ToUpperCase is TRUE, then LeftStr and result will be made uppercase
function SplitRight(const Str: RawUtf8; SepChar: AnsiChar; LeftStr: PRawUtf8 = nil): RawUtf8;
Returns the last SepChar-separated part of Str, i.e. the text after the last occurrence of SepChar
- e.g. SplitRight('01/2/34','/')='34'
- if SepChar doesn't appear, will return Str, e.g. SplitRight('123','/')='123'
- if LeftStr is supplied, the RawUtf8 it points to will be filled with the left part just before SepChar ('' if SepChar doesn't appear)
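The two documented results can be reproduced with the sketch below, which also passes the optional LeftStr pointer. It assumes the mORMot 2 units are available; the variable name is illustrative.

```pascal
program SplitRightDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  left: RawUtf8;
begin
  writeln(SplitRight('01/2/34', '/')); // documented result: '34'
  writeln(SplitRight('123', '/'));     // no separator, documented result: '123'
  // optional third parameter receives the part before the last separator
  writeln(SplitRight('01/2/34', '/', @left));
  writeln(left);
end.
```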
function SplitRights(const Str, SepChar: RawUtf8): RawUtf8;
Returns the last part of Str, as separated by any of the SepChar characters
- e.g. SplitRights('path/one\two/file.ext','/\')='file.ext', i.e. SepChar='/\' will be like ExtractFileName() over a RawUtf8 string
- if no SepChar character appears, will return Str, e.g. SplitRights('123','/')='123'
function StartWith(const text, upTextStart: RawUtf8): boolean;
Check if text starts with upTextStart, with case-insensitive matching
- returns true if the item matched
- ignore case - upTextStart must be already in upper case
- chars are compared as 7-bit Ansi only (no accentuated chars, nor UTF-8)
- see StartWithExact() from mormot.core.text for a case-sensitive version
function StartWithExact(const text, textStart: RawUtf8): boolean;
Check if text starts with textStart, with case-sensitive matching
- returns true if the item matched
- see StartWith() from mormot.core.unicode for a case-insensitive version
function StrCompIL(P1, P2: pointer; L: PtrInt; Default: PtrInt = 0): PtrInt;
Our fast version of StrCompIL(), to be used with PUtf8Char
- i.e. make a case-insensitive comparison of two memory buffers, using supplied length
- Default value is returned if both P1 and P2 buffers are equal
function StrCompL(P1, P2: pointer; L: PtrInt; Default: PtrInt = 0): PtrInt;
Our fast version of StrCompL(), to be used with PUtf8Char
- i.e. make a binary comparison of two memory buffers, using supplied length
- Default value is returned if both P1 and P2 buffers are equal
function strcspn(s, reject: pointer): integer;
Pure pascal version of strcspn(), to be used with PUtf8Char/PAnsiChar
- returns size of initial segment of s which doesn't appear in reject chars, e.g.
strcspn('1234,6789',',')=4
- please note that this optimized version may read up to 3 bytes beyond reject but never after s end, so is safe e.g. over memory mapped files
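The documented `strcspn('1234,6789',',')=4` result is enough to extract a first field from a delimited string, as in this sketch. Both buffers are held in RawUtf8 variables so that they are #0-terminated; names are illustrative and the mORMot 2 units are assumed to be on the search path.

```pascal
program StrCSpnDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  csv, sep, field: RawUtf8;
  n: integer;
begin
  csv := '1234,6789';
  sep := ',';
  // length of the initial segment containing none of the reject chars
  n := strcspn(pointer(csv), pointer(sep));
  field := copy(csv, 1, n);
  writeln(n, ' ', field);
end.
```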
function StrIComp(Str1, Str2: pointer): PtrInt;
Our fast version of StrIComp(), to be used with PUtf8Char/PAnsiChar
function StrICompLNotNil(Str1, Str2: pointer; Up: PNormTableByte; L: PtrInt): PtrInt;
StrIComp-like function with a length, lookup table and Str1/Str2 expected not nil
function StrICompNotNil(Str1, Str2: pointer; Up: PNormTableByte): PtrInt;
StrIComp-like function with a lookup table and Str1/Str2 expected not nil
function StrILNotNil(Str1, Str2: pointer; Up: PNormTableByte; L: PtrInt): PtrInt;
StrIComp function with a length, lookup table and Str1/Str2 expected not nil
- returns L for whole match, or < L for a partial match
function StringBufferToUtf8(Dest: PUtf8Char; Source: PChar; SourceChars: PtrInt): PUtf8Char; overload;
Convert any RTL string buffer into a UTF-8 encoded buffer
- Dest must be able to receive at least SourceChars*3 bytes
- it will work as is with Delphi 2009+ (direct unicode conversion)
- under older version of Delphi (no unicode), it will use the current RTL codepage, as with WideString conversion (but without slow WideString usage)
procedure StringBufferToUtf8(Source: PChar; out result: RawUtf8); overload;
Convert any RTL string 0-terminated Text buffer into a UTF-8 string
- it will work as is with Delphi 2009+ (direct unicode conversion)
- under older version of Delphi (no unicode), it will use the current RTL codepage, as with WideString conversion (but without slow WideString usage)
procedure StringDynArrayToRawUtf8DynArray(const Source: TStringDynArray; var result: TRawUtf8DynArray);
Convert the string dynamic array into a dynamic array of UTF-8 strings
function StringFromBomFile(const FileName: TFileName; out FileContent: RawByteString; out Buffer: pointer; out BufferSize: PtrInt): TBomFile;
Read a file into a temporary variable, check the BOM, and adjust the buffer
procedure StringListToRawUtf8DynArray(Source: TStringList; var result: TRawUtf8DynArray);
Convert the string list into a dynamic array of UTF-8 strings
function StringReplaceAll(const S, OldPattern, NewPattern: RawUtf8; CaseInsensitive: boolean): RawUtf8; overload;
Case-sensitive (or not) StringReplace(S, OldPattern, NewPattern,[rfReplaceAll])
- calls plain StringReplaceAll() version for CaseInsensitive = false
- calls StringReplaceAll(.., NormToUpperAnsi7) if CaseInsensitive = true
function StringReplaceAll(const S: RawUtf8; const OldNewPatternPairs: array of RawUtf8; CaseInsensitive: boolean = false): RawUtf8; overload;
Fast version of several cascaded StringReplaceAll()
function StringReplaceAll(const S, OldPattern, NewPattern: RawUtf8; Lookup: PNormTable = nil): RawUtf8; overload;
Fast version of StringReplace(S, OldPattern, NewPattern, [rfReplaceAll]);
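The overloads above differ only in their last parameter, so a short sketch may help tell them apart. The first call uses the default `Lookup = nil`, the second the boolean overload; the uppercase pattern in the second call is a cautious choice for the example, not a documented requirement.

```pascal
program StringReplaceAllDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

begin
  // Lookup defaults to nil: every exact match is replaced
  writeln(StringReplaceAll('one two one', 'one', '1'));
  // boolean overload: ASCII-7 case-insensitive matching
  writeln(StringReplaceAll('Hello hello', 'HELLO', 'bye', true));
end.
```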
function StringReplaceAllProcess(const S, OldPattern, NewPattern: RawUtf8; found: integer; Lookup: PNormTable): RawUtf8;
Actual replacement function called by StringReplaceAll() on first match
- not to be called as such, but defined globally for proper inlining
function StringReplaceChars(const Source: RawUtf8; OldChar, NewChar: AnsiChar): RawUtf8;
Fast replace of a specified char by another given char
function StringReplaceTabs(const Source, TabText: RawUtf8): RawUtf8;
Fast replace of all #9 chars by a given string
function StringToAnsi7(const Text: string): RawByteString;
Convert any RTL string into Ansi 7-bit encoded String
- the Text content must contain only 7-bit pure ASCII characters
function StringToRawUnicode(const S: string): RawUnicode; overload;
Convert any RTL string into a RawUnicode encoded String
- it's preferred to use TLanguageFile.StringToUtf8() method in mORMoti18n, which will handle full i18n of your application
- it will work as is with Delphi 2009+ (direct unicode conversion)
- under older version of Delphi (no unicode), it will use the current RTL codepage, as with WideString conversion (but without slow WideString usage)
function StringToRawUnicode(P: PChar; L: integer): RawUnicode; overload;
Convert any RTL string into a RawUnicode encoded String
- it's preferred to use TLanguageFile.StringToUtf8() method in mORMoti18n, which will handle full i18n of your application
- it will work as is with Delphi 2009+ (direct unicode conversion)
- under older version of Delphi (no unicode), it will use the current RTL codepage, as with WideString conversion (but without slow WideString usage)
procedure StringToSynUnicode(const S: string; var result: SynUnicode); overload;
Convert any RTL string into a SynUnicode encoded String
- overloaded to avoid a copy to a temporary result string of a function
function StringToSynUnicode(const S: string): SynUnicode; overload;
Convert any RTL string into a SynUnicode encoded String
- it's preferred to use TLanguageFile.StringToUtf8() method in mORMoti18n, which will handle full i18n of your application
- it will work as is with Delphi 2009+ (direct unicode conversion)
- under older version of Delphi (no unicode), it will use the current RTL codepage, as with WideString conversion (but without slow WideString usage)
procedure StringToUtf8(const Text: string; var result: RawUtf8); overload;
Convert any RTL string into a UTF-8 encoded String
- this overloaded function uses a faster by-reference parameter for the result
function StringToUtf8(const Text: string; var Temp: TSynTempBuffer): integer; overload;
Convert any RTL string into a UTF-8 encoded TSynTempBuffer
- returns the number of UTF-8 bytes available in Temp.buf
- this overloaded function uses a TSynTempBuffer for the result to avoid any memory allocation for shorter content
- caller should call Temp.Done to release any heap-allocated memory
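The release rule stated above is easiest to honor with a try..finally block. This is a minimal sketch assuming the mORMot 2 units are available; the input text and variable names are illustrative.

```pascal
program StringToUtf8TempDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  tmp: TSynTempBuffer;
  len: integer;
begin
  len := StringToUtf8('hello world', tmp);
  try
    // tmp.buf now points to len bytes of UTF-8 text
    writeln(len);
  finally
    tmp.Done; // release any heap-allocated memory, as required above
  end;
end.
```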
function StringToUtf8(const Text: string): RawUtf8; overload;
Convert any RTL string into a UTF-8 encoded String
- in the VCL context, it's preferred to use TLanguageFile.StringToUtf8() method from mORMoti18n, which will handle full i18n of your application
- it will work as is with Delphi 2009+ (direct unicode conversion)
- under older version of Delphi (no unicode), it will use the current RTL codepage, as with WideString conversion (but without slow WideString usage)
procedure StringToUtf8(Text: PChar; TextLen: PtrInt; var result: RawUtf8); overload;
Convert any RTL string buffer into a UTF-8 encoded String
- it will work as is with Delphi 2009+ (direct unicode conversion)
- under older version of Delphi (no unicode), it will use the current RTL codepage, as with WideString conversion (but without slow WideString usage)
procedure StringToVariant(const Txt: string; var result: variant); overload;
Convert any RTL string into a variant storing a UTF-8 string
- could be used e.g. as TDocVariantData.AddValue() parameter
function StringToVariant(const Txt: string): variant; overload;
Convert any RTL string into a variant storing a UTF-8 string
- could be used e.g. as TDocVariantData.AddValue() parameter
function StringToWinAnsi(const Text: string): WinAnsiString;
Convert any RTL string into WinAnsi (Win-1252) 8-bit encoded String
function StrPosI(uppersubstr, str: PUtf8Char): PUtf8Char;
A non case-sensitive version of Pos()
- uppersubstr is expected to be already in upper case
- this version handles only 7-bit ASCII (no accentuated characters)
function StrPosIReference(U: PUtf8Char; const Up: RawUcs4): PUtf8Char;
UTF-8 Unicode 10.0 case-insensitive Pattern search within UTF-8 buffer
- returns nil if no match, or the Pattern position found inside U^
- Up should have been already converted using UpperCaseUcs4Reference()
- won't call the Operating System, so is consistent on all platforms, and doesn't require any temporary UTF-16 decoding
function strspn(s, accept: pointer): integer;
Pure pascal version of strspn(), to be used with PUtf8Char/PAnsiChar
- returns size of initial segment of s which appears in accept chars, e.g.
strspn('abcdef','debca')=5
- please note that this optimized version may read up to 3 bytes beyond accept but never after s end, so is safe e.g. over memory mapped files
function SynUnicodeToString(const U: SynUnicode): string;
Convert any SynUnicode encoded string into a RTL string
function SynUnicodeToUtf8(const Unicode: SynUnicode): RawUtf8;
Convert a SynUnicode string into a UTF-8 string
function ToUtf8(const Ansi7Text: ShortString): RawUtf8; overload;
Convert any UTF-8 encoded ShortString Text into an UTF-8 encoded String
- expects the supplied content to be already ASCII-7 or UTF-8 encoded, e.g. a RTTI type or property name: it won't work with Ansi-encoded strings
function ToUtf8(const Text: string): RawUtf8; overload;
Convert any RTL string into an UTF-8 encoded String
function TRawUtf8DynArrayFrom(const Values: array of RawUtf8): TRawUtf8DynArray;
Quick helper to initialize a dynamic array of RawUtf8 from some constants
- can be used e.g. as:
MyArray := TRawUtf8DynArrayFrom(['a','b','c']);
function TrimChar(const text: RawUtf8; const exclude: TSynAnsicharSet): RawUtf8;
Returns the supplied text content, without any specified char
- specify a custom char set to be excluded, e.g. as [#0 .. ' ']
procedure TrimChars(var S: RawUtf8; Left, Right: PtrInt);
Trim some leading and trailing chars, i.e. Left chars at the start and Right chars at the end
- if S is unique (RefCnt=1), will modify the RawUtf8 in place
- faster alternative to S := copy(S, Left + 1, length(S) - Left - Right)
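The equivalence with copy() stated above can be sketched as follows. This is only an illustration, assuming the mORMot 2 units are available; the bracketed sample text and the variable names are hypothetical.

```pascal
program TrimCharsDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  a, b: RawUtf8; // illustrative names
begin
  a := '[value]';
  b := a;
  // remove 1 char on the left and 1 char on the right
  TrimChars(a, 1, 1);
  // the documented slower equivalent, with Left = Right = 1
  b := copy(b, 1 + 1, length(b) - 1 - 1);
  writeln(a);     // value
  writeln(a = b); // both approaches are expected to agree
end.
```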
function TrimControlChars(const text: RawUtf8): RawUtf8;
Returns the supplied text content, without any control char
- here control chars have an ASCII code in [#0 .. ' '], i.e. text[] <= ' '
function TrimLeft(const S: RawUtf8): RawUtf8;
Trims leading whitespace characters from the string by removing new line, space, and tab characters
procedure TrimLeftLines(var S: RawUtf8);
Trims leading whitespace on every line of the UTF-8 text
- also delete void lines
- could be used e.g. before FindNameValue() call
- modification is made in-place so S will be modified
function TrimLeftLowerCase(const V: RawUtf8): PUtf8Char;
Trim first lowercase chars ('otDone' will return 'Done' e.g.)
- return a PUtf8Char to avoid any memory allocation
function TrimLeftLowerCaseShort(V: PShortString): RawUtf8;
Trim first lowercase chars ('otDone' will return 'Done' e.g.)
- returns a RawUtf8 string: enumeration names are pure 7-bit ANSI with Delphi 7 to 2007, and UTF-8 encoded with Delphi 2009+
procedure TrimLeftLowerCaseToShort(V: PShortString; out result: ShortString); overload;
Trim first lowercase chars ('otDone' will return 'Done' e.g.)
- return a ShortString: enumeration names are pure 7-bit ANSI with Delphi 7 to 2007, and UTF-8 encoded with Delphi 2009+
function TrimLeftLowerCaseToShort(V: PShortString): ShortString; overload;
Trim first lowercase chars ('otDone' will return 'Done' e.g.)
- return a ShortString: enumeration names are pure 7-bit ANSI with Delphi 7 to 2007, and UTF-8 encoded with Delphi 2009+
function TrimOneChar(const text: RawUtf8; exclude: AnsiChar): RawUtf8;
Returns the supplied text content, without one specified char
function TrimRight(const S: RawUtf8): RawUtf8;
Trims trailing whitespace characters from the string by removing trailing newline, space, and tab characters
function Ucs4ToUtf8(ucs4: Ucs4CodePoint; Dest: PUtf8Char): PtrInt;
UTF-8 encode one UCS4 CodePoint into Dest
- return the number of bytes written into Dest (i.e. from 1 up to 6)
- this method DOES properly handle UTF-16 surrogate pairs
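As a hedged illustration of the byte count returned, the sketch below encodes U+20AC (EURO SIGN) and dumps the bytes in hexadecimal. The buffer size, program name and variable names are assumptions made for this example; the three-byte result follows from the UTF-8 encoding rules, not from anything specific to this unit.

```pascal
program Ucs4Demo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  SysUtils,
  mormot.core.base,
  mormot.core.unicode;

var
  buf: array[0..7] of AnsiChar; // room for the longest documented sequence (6 bytes)
  n: PtrInt;
  i: integer;
begin
  n := Ucs4ToUtf8($20AC, @buf); // U+20AC EURO SIGN
  for i := 0 to n - 1 do
    write(IntToHex(ord(buf[i]), 2), ' ');
  writeln; // standard UTF-8 for U+20AC is E2 82 AC
end.
```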
function UnCamelCase(const S: RawUtf8): RawUtf8; overload;
Convert a CamelCase string into a space separated one
- 'OnLine' will return 'On line' e.g., and 'OnMyLINE' will return 'On my LINE'
- will handle capital words at the beginning, middle or end of the text, e.g. 'KLMFlightNumber' will return 'KLM flight number' and 'GoodBBCProgram' will return 'Good BBC program'
- will handle a number at the beginning, middle or end of the text, e.g. 'Email12' will return 'Email 12'
- '_' char is transformed into ' - '
- '__' chars are transformed into ': '
- return an RawUtf8 string: enumeration names are pure 7-bit ANSI with Delphi up to 2007, and UTF-8 encoded with Delphi 2009+
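The conversions listed above can be tried with a short program. This is a sketch under the assumption that the mORMot 2 units are on the search path; the inputs and the expected texts in the comments are the ones quoted in the description.

```pascal
program UnCamelCaseDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

begin
  // expected results are taken from the description, not measured here
  writeln(UnCamelCase('OnLine'));          // On line
  writeln(UnCamelCase('KLMFlightNumber')); // KLM flight number
  writeln(UnCamelCase('GoodBBCProgram'));  // Good BBC program
  writeln(UnCamelCase('Email12'));         // Email 12
end.
```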
function UnCamelCase(D, P: PUtf8Char): integer; overload;
Convert a CamelCase string into a space separated one
- 'OnLine' will return 'On line' e.g., and 'OnMyLINE' will return 'On my LINE'
- will handle capital words at the beginning, middle or end of the text, e.g. 'KLMFlightNumber' will return 'KLM flight number' and 'GoodBBCProgram' will return 'Good BBC program'
- will handle a number at the beginning, middle or end of the text, e.g. 'Email12' will return 'Email 12'
- return the char count written into D^
- D^ and P^ are expected to be UTF-8 encoded: enumeration and property names are pure 7-bit ANSI with Delphi 7 to 2007, and UTF-8 encoded with Delphi 2009+
- '_' char is transformed into ' - '
- '__' chars are transformed into ': '
function UnicodeBufferToString(source: PWideChar): string;
Convert an Unicode buffer into a RTL string
function UnicodeBufferToUtf8(source: PWideChar): RawUtf8;
Convert an Unicode buffer into a UTF-8 string
function UnicodeBufferToVariant(source: PWideChar): variant;
Convert an Unicode buffer into a variant storing a UTF-8 string
- could be used e.g. as TDocVariantData.AddValue() parameter
procedure UnicodeBufferToWinAnsi(source: PWideChar; out Dest: WinAnsiString);
Convert an Unicode buffer into a WinAnsi (code page 1252) string
procedure UniqueRawUtf8ZeroToTilde(var u: RawUtf8; MaxSize: PtrInt = maxInt);
Will fast replace all #0 chars as ~
- could be used after UniqueRawUtf8() on an in-place modified JSON buffer, in which all values have been ended with #0
- you can optionally specify a maximum size, in bytes (this won't reallocate the string, but just add a #0 at some point in the UTF-8 buffer)
- could allow logging of parsed input e.g. after an exception
function UnQuotedSqlSymbolName(const ExternalDBSymbol: RawUtf8): RawUtf8;
Unquote a SQL-compatible symbol name
- e.g. '[symbol]' -> 'symbol' or '"symbol"' -> 'symbol'
function UnQuoteSqlString(const Value: RawUtf8): RawUtf8;
Unquote a SQL-compatible string
function UnQuoteSqlStringVar(P: PUtf8Char; out Value: RawUtf8): PUtf8Char;
Unquote a SQL-compatible string
- the first character in P^ must be either ' or ", then any doubled quote inside the text is collapsed into a single one
- 'text '' end' -> text ' end
- "text "" end" -> text " end
- returns nil if P doesn't contain a valid SQL string
- returns a pointer just after the quoted text otherwise
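A minimal sketch of the parsing contract described above, assuming the mORMot 2 units are available. The input is the single-quoted sample from the description followed by some hypothetical trailing SQL; quotes are written as #39 to keep the Pascal literal readable.

```pascal
program UnQuoteDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  sql, value: RawUtf8; // illustrative names
  next: PUtf8Char;
begin
  // SQL text is: 'text '' end' AND more
  sql := #39'text '#39#39' end'#39' AND more';
  next := UnQuoteSqlStringVar(pointer(sql), value);
  if next = nil then
    writeln('not a valid SQL string')
  else
    // per the description, value is: text ' end
    // and next points just after the closing quote
    writeln(value);
end.
```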
function UnZeroed(const bin: RawByteString): RawUtf8;
Convert a binary buffer into a fake ASCII/UTF-8 content without any #0 input
- will use ~ char to escape any #0 as ~0 pair (and plain ~ as ~~ pair)
- output is just a bunch of non 0 bytes, so not truly valid UTF-8 content
- may be used as an alternative to Base64 encoding if 8-bit chars are allowed
- call Zeroed() as reverse function
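The escaping rule can be checked with a round trip through UnZeroed() and Zeroed(). The sketch below is only an illustration, assuming the mORMot 2 units are on the search path; the sample bytes and variable names are hypothetical, and the escaped form in the comment is derived from the rule stated above.

```pascal
program UnZeroedDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  bin, back: RawByteString; // illustrative names
  txt: RawUtf8;
begin
  bin := 'a'#0'b~c';   // 5 bytes, one #0 and one plain ~
  txt := UnZeroed(bin); // per the rule above: a~0b~~c
  back := Zeroed(txt);  // reverse conversion
  writeln(length(bin), ' -> ', length(txt), ' -> ', length(back));
  writeln(back = bin);  // the round trip should be lossless
end.
```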
function UpperCase(const S: RawUtf8): RawUtf8;
Fast conversion of the supplied text into uppercase
- this will only convert 'a'..'z' into 'A'..'Z' (no NormToUpper use), and will therefore be correct with true UTF-8 content, but only for 7-bit
procedure UpperCaseCopy(const Source: RawUtf8; var Dest: RawUtf8); overload;
Fast conversion of the supplied text into uppercase
- this will only convert 'a'..'z' into 'A'..'Z' (no NormToUpper use), and will therefore be correct with true UTF-8 content, but only for 7-bit
procedure UpperCaseCopy(Text: PUtf8Char; Len: PtrInt; var Dest: RawUtf8); overload;
Fast conversion of the supplied text into uppercase
- this will only convert 'a'..'z' into 'A'..'Z' (no NormToUpper use), and will therefore be correct with true UTF-8 content, but only for 7-bit
function UpperCaseReference(const S: RawUtf8): RawUtf8;
UpperCase conversion of a UTF-8 string using our Unicode 10.0 tables
- won't call the Operating System, so is consistent on all platforms, whereas UpperCaseUnicode() may vary depending on each library implementation
- won't use temporary UTF-16 decoding, and optimized for plain ASCII content
procedure UpperCaseSelf(var S: RawUtf8);
Fast in-place conversion of the supplied variable text into uppercase
- this will only convert 'a'..'z' into 'A'..'Z' (no NormToUpper use), and will therefore be correct with true UTF-8 content, but only for 7-bit
function UpperCaseSynUnicode(const S: SynUnicode): SynUnicode;
Use the RTL to convert the SynUnicode text to UpperCase
function UpperCaseU(const S: RawUtf8): RawUtf8;
Fast conversion of the supplied text into 8-bit uppercase
- this will not only convert 'a'..'z' into 'A'..'Z', but also accented Latin characters ('e' acute into 'E' e.g.), using the NormToUpper[] array
- it will therefore decode the supplied UTF-8 content to handle more than 7-bit ASCII characters (so this function is dedicated to the WinAnsi code page 1252 character set)
function UpperCaseUcs4Reference(const S: RawUtf8): RawUcs4;
UpperCase conversion of UTF-8 into UCS4 using our Unicode 10.0 tables
- won't call the Operating System, so is consistent on all platforms, whereas UpperCaseUnicode() may vary depending on each library implementation
function UpperCaseUnicode(const S: RawUtf8): RawUtf8;
Accurate conversion of the supplied UTF-8 content into the corresponding upper-case Unicode characters
- will use the available API (e.g. Win32 or ICU), so may not be consistent on all systems - consider UpperCaseReference() to use our Unicode 10.0 tables
- will temporarily decode S into and from UTF-16 so is likely to be slower
function UpperCopy(dest: PAnsiChar; const source: RawUtf8): PAnsiChar;
Copy source into dest^ with ASCII 7-bit upper case conversion
- returns final dest pointer
- will copy up to the source buffer end: so Dest^ should be big enough - which will be the case e.g. if Dest := pointer(source)
function UpperCopy255(dest: PAnsiChar; const source: RawUtf8): PAnsiChar; overload;
Copy source into a 256 chars dest^ buffer with 7-bit upper case conversion
- used internally for short keys match or case-insensitive hash
- returns final dest pointer
- will copy up to 255 AnsiChar (expect the dest buffer to be defined e.g. as array[byte] of AnsiChar on the caller stack)
function UpperCopy255Buf(dest: PAnsiChar; source: PUtf8Char; sourceLen: PtrInt): PAnsiChar;
Copy source^ into a 256 chars dest^ buffer with 7-bit upper case conversion
- used internally for short keys match or case-insensitive hash
- returns final dest pointer
- will copy up to 255 AnsiChar (expect the dest buffer to be defined e.g. as array[byte] of AnsiChar on the caller stack)
function UpperCopy255W(dest: PAnsiChar; source: PWideChar; L: PtrInt): PAnsiChar; overload;
Copy WideChar source into dest^ with upper case conversion
- used internally for short keys match or case-insensitive hash
- returns final dest pointer
- will copy up to 255 AnsiChar (expect the dest buffer to be array[byte] of AnsiChar), replacing any non WinAnsi character by '?'
function UpperCopy255W(dest: PAnsiChar; const source: SynUnicode): PAnsiChar; overload;
Copy UTF-16 source into dest^ with ASCII 7-bit upper case conversion
- used internally for short keys match or case-insensitive hash
- returns final dest pointer
- will copy up to 255 AnsiChar (expect the dest buffer to be array[byte] of AnsiChar), replacing any non WinAnsi character by '?'
function UpperCopyShort(dest: PAnsiChar; const source: ShortString): PAnsiChar;
Copy source into dest^ with ASCII 7-bit upper case conversion
- returns final dest pointer
- this special version expects source to be a ShortString
function UpperCopyWin255(dest: PWinAnsiChar; const source: RawUtf8): PWinAnsiChar;
Copy source into dest^ with WinAnsi 8-bit upper case conversion
- used internally for short keys match or case-insensitive hash
- returns final dest pointer
- will copy up to 255 AnsiChar (expect the dest buffer to be array[byte] of AnsiChar)
function Utf16CharToUtf8(Dest: PUtf8Char; var Source: PWord): integer;
UTF-8 encode one UTF-16 encoded UCS4 CodePoint into Dest
- return the number of bytes written into Dest (i.e. from 1 up to 6)
- Source will contain the next UTF-16 character
- this method DOES properly handle UTF-16 surrogate pairs
function Utf8DecodeToRawUnicode(P: PUtf8Char; L: integer): RawUnicode; overload;
Convert a UTF-8 encoded buffer into a RawUnicode string
- if L is 0, L is computed from zero terminated P buffer
- RawUnicode is ended by a WideChar(#0)
- faster than System.Utf8Decode() which uses slow widestrings
function Utf8DecodeToRawUnicode(const S: RawUtf8): RawUnicode; overload;
Convert a UTF-8 string into a RawUnicode string
function Utf8DecodeToRawUnicodeUI(const S: RawUtf8; var Dest: RawUnicode): integer; overload;
Convert a UTF-8 string into a RawUnicode string
- returns the resulting length (in bytes) of the content stored within Dest
- see also Utf8DecodeToUnicode() which uses a TSynTempBuffer for storage
function Utf8DecodeToRawUnicodeUI(const S: RawUtf8; DestLen: PInteger = nil): RawUnicode; overload;
Convert a UTF-8 string into a RawUnicode string
- this version doesn't resize the length of the result RawUnicode and is therefore useful before a Win32 Unicode API call (with nCount=-1)
- if DestLen is not nil, the resulting length (in bytes) will be stored within DestLen^
- see also Utf8DecodeToUnicode() which uses a TSynTempBuffer for storage
procedure Utf8DecodeToString(P: PUtf8Char; L: integer; var result: string); overload;
Convert any UTF-8 encoded buffer into a RTL string
function Utf8DecodeToString(P: PUtf8Char; L: integer): string; overload;
Convert any UTF-8 encoded buffer into a RTL string
- it's preferred to use TLanguageFile.Utf8ToString() in mORMoti18n, which will handle full i18n of your application
- works as-is with Delphi 2009+ (direct Unicode conversion)
- under older versions of Delphi (no Unicode), it will use the current RTL code page, as with WideString conversion (but without slow WideString usage)
function Utf8DecodeToUnicode(Text: PUtf8Char; Len: PtrInt; var temp: TSynTempBuffer): PtrInt; overload;
Convert any UTF-8 encoded buffer into an UTF-16 temporary buffer
function Utf8DecodeToUnicode(const Text: RawUtf8; var temp: TSynTempBuffer): PtrInt; overload;
Convert any UTF-8 encoded string into an UTF-16 temporary buffer
- returns the number of WideChar stored in temp (not bytes)
- caller should call temp.Done once temp.buf has been used
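The ownership rule for the temporary buffer can be sketched as below. This is a hedged example, assuming TSynTempBuffer comes from mormot.core.base and exposes the buf field and Done method mentioned above; the sample text and variable names are illustrative.

```pascal
program TempBufferDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  u: RawUtf8;          // illustrative names
  tmp: TSynTempBuffer;
  chars: PtrInt;
  w: PWideChar;
begin
  u := 'hello';
  chars := Utf8DecodeToUnicode(u, tmp);
  try
    w := tmp.buf; // UTF-16 content, only valid until tmp.Done
    // the result is a WideChar count, not a byte count
    writeln(chars, ' WideChar(s), first one is ', AnsiChar(ord(w^)));
  finally
    tmp.Done; // always release the temporary storage
  end;
end.
```

The try..finally block keeps the Done call paired with the decoding even if the code using tmp.buf raises an exception.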
function Utf8DecodeToUnicodeRawByteString(P: PUtf8Char; L: integer): RawByteString; overload;
Convert an UTF-8 encoded buffer into a UTF-16 encoded RawByteString buffer
- could be used instead of deprecated RawUnicode when a temp UTF-16 buffer is needed
function Utf8DecodeToUnicodeRawByteString(const U: RawUtf8): RawByteString; overload;
Convert an UTF-8 encoded buffer into a UTF-16 encoded RawByteString buffer
- could be used instead of deprecated RawUnicode when a temp UTF-16 buffer is needed
function Utf8DecodeToUnicodeStream(P: PUtf8Char; L: integer): TStream;
Convert an UTF-8 encoded buffer into a UTF-16 encoded stream of bytes
function Utf8FirstLineToUtf16Length(source: PUtf8Char): PtrInt;
Calculate the UTF-16 Unicode characters count of the UTF-8 encoded first line
- count may not match the UCS4 CodePoint, in case of UTF-16 surrogates
- end the parsing at first #13 or #10 character
function Utf8IComp(u1, u2: PUtf8Char): PtrInt;
Fast UTF-8 comparison handling WinAnsi CP-1252 case folding
- this version expects u1 and u2 to be zero-terminated
- decode the UTF-8 content before using NormToUpper[] lookup table
- matches our SYSTEMNOCASE custom (and default) SQLite 3 collation
- consider Utf8ICompReference() for Unicode 10.0 support
function Utf8ICompReference(u1, u2: PUtf8Char): PtrInt;
UTF-8 comparison using our Unicode 10.0 tables
- this version expects u1 and u2 to be zero-terminated
- Utf8IComp() handles WinAnsi CP-1252 latin accents - this one is Unicode
- won't call the Operating System, so is consistent on all platforms, and doesn't require any temporary UTF-16 decoding
- has a branchless optimized process of 7-bit ASCII charset [a..z] -> [A..Z]
function Utf8ILComp(u1, u2: PUtf8Char; L1, L2: cardinal): PtrInt;
Fast UTF-8 comparison handling WinAnsi CP-1252 case folding
- this version does not require u1 and u2 to be zero-terminated, but uses L1 and L2 as length for u1 and u2 respectively
- decode the UTF-8 content before using NormToUpper[] lookup table
- consider Utf8ILCompReference() for Unicode 10.0 support
function Utf8ILCompReference(u1, u2: PUtf8Char; L1, L2: integer): PtrInt;
UTF-8 comparison using our Unicode 10.0 tables
- this version does not require u1 and u2 to be zero-terminated, but uses L1 and L2 as length for u1 and u2 respectively
- Utf8ILComp() handles WinAnsi CP-1252 latin accents - this one is Unicode
- won't call the Operating System, so is consistent on all platforms, and doesn't require any temporary UTF-16 decoding
- has a branchless optimized process of 7-bit ASCII charset [a..z] -> [A..Z]
procedure Utf8ToFileName(const Text: RawUtf8; var result: TFileName);
Convert any UTF-8 encoded String into a generic RTL file name string
procedure Utf8ToRawUtf8(P: PUtf8Char; var result: RawUtf8);
Direct conversion of a UTF-8 encoded zero terminated buffer into a RawUtf8 String
procedure Utf8ToShortString(var dest: ShortString; source: PUtf8Char);
Direct conversion of a UTF-8 encoded buffer into a WinAnsi ShortString buffer
- non WinAnsi chars are replaced by '?' placeholders
function Utf8ToString(const Text: RawUtf8): string;
Convert any UTF-8 encoded String into a RTL string
- it's preferred to use TLanguageFile.Utf8ToString() in mORMoti18n, which will handle full i18n of your application
- works as-is with Delphi 2009+ (direct Unicode conversion)
- under older versions of Delphi (no Unicode), it will use the current RTL code page, as with WideString conversion (but without slow WideString usage)
procedure Utf8ToStringVar(const Text: RawUtf8; var result: string);
Convert any UTF-8 encoded String into a RTL string
function Utf8ToSynUnicode(const Text: RawUtf8): SynUnicode; overload;
Convert any UTF-8 encoded String into a generic SynUnicode Text
procedure Utf8ToSynUnicode(Text: PUtf8Char; Len: PtrInt; var result: SynUnicode); overload;
Convert any UTF-8 encoded buffer into a generic SynUnicode Text
procedure Utf8ToSynUnicode(const Text: RawUtf8; var result: SynUnicode); overload;
Convert any UTF-8 encoded String into a generic SynUnicode Text
function Utf8ToUnicodeLength(source: PUtf8Char): PtrUInt;
Calculate the UTF-16 Unicode characters count, UTF-8 encoded in source^
- count may not match the UCS4 CodePoint, in case of UTF-16 surrogates
- faster than System.Utf8ToUnicode with dest=nil
function Utf8ToWideChar(dest: PWideChar; source: PUtf8Char; MaxDestChars, sourceBytes: PtrInt; NoTrailingZero: boolean = false): PtrInt; overload;
Convert an UTF-8 encoded text into a WideChar (UTF-16) buffer
- faster than System.Utf8ToUnicode
- this overloaded function expects a MaxDestChars parameter
- sourceBytes can not be 0 for this function
- enough place must be available in dest buffer (guess is sourceBytes*3+2)
- a WideChar(#0) is added at the end (if something is written) unless NoTrailingZero is TRUE
- returns the BYTE COUNT (not WideChar count) written in dest, excluding the ending WideChar(#0)
function Utf8ToWideChar(dest: PWideChar; source: PUtf8Char; sourceBytes: PtrInt = 0; NoTrailingZero: boolean = false): PtrInt; overload;
Convert an UTF-8 encoded text into a WideChar (UTF-16) buffer
- faster than System.Utf8ToUnicode
- sourceBytes can be 0, in which case the length is computed from the zero-terminated source
- enough place must be available in dest buffer (guess is sourceBytes*3+2)
- a WideChar(#0) is added at the end (if something is written) unless NoTrailingZero is TRUE
- returns the BYTE count written in dest, excluding the ending WideChar(#0)
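The buffer sizing and the byte-count return value can be sketched as follows. This is an illustration only, assuming the mORMot 2 units are available; the dynamic array used as destination and the variable names are choices made for this example, and the sourceBytes*3+2 figure is the guess quoted in the description.

```pascal
program Utf8ToWideCharDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  src: RawUtf8;            // illustrative names
  dest: array of WideChar;
  bytes: PtrInt;
begin
  src := 'hello';
  // reserve at least sourceBytes*3+2 bytes, as suggested above
  SetLength(dest, (length(src) * 3 + 2) div SizeOf(WideChar) + 1);
  bytes := Utf8ToWideChar(pointer(dest), pointer(src), length(src));
  // the result is a BYTE count: divide to get the WideChar count
  writeln('bytes: ', bytes, ', WideChars: ', bytes div SizeOf(WideChar));
end.
```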
function Utf8ToWideString(const Text: RawUtf8): WideString; overload;
Convert any UTF-8 encoded String into a generic WideString Text
procedure Utf8ToWideString(const Text: RawUtf8; var result: WideString); overload;
Convert any UTF-8 encoded String into a generic WideString Text
procedure Utf8ToWideString(Text: PUtf8Char; Len: PtrInt; var result: WideString); overload;
Convert any UTF-8 encoded String into a generic WideString Text
function Utf8ToWinAnsi(const S: RawUtf8): WinAnsiString; overload;
Direct conversion of a UTF-8 encoded string into a WinAnsi String
function Utf8ToWinAnsi(P: PUtf8Char): WinAnsiString; overload;
Direct conversion of a UTF-8 encoded zero terminated buffer into a WinAnsi String
function Utf8ToWinPChar(dest: PAnsiChar; source: PUtf8Char; count: integer): integer;
Direct conversion of a UTF-8 encoded buffer into a WinAnsi PAnsiChar buffer
function Utf8TruncatedLength(const text: RawUtf8; maxBytes: PtrUInt): PtrInt; overload;
Compute the truncated length of the supplied UTF-8 value if it exceeds the specified bytes count
- this function will ensure that the returned content will contain only valid UTF-8 sequence, i.e. will trim the whole trailing UTF-8 sequence
- returns maxBytes if text was not truncated, or the number of fitting bytes
function Utf8TruncatedLength(text: PAnsiChar; textlen, maxBytes: PtrUInt): PtrInt; overload;
Compute the truncated length of the supplied UTF-8 value if it exceeds the specified bytes count
- this function will ensure that the returned content will contain only valid UTF-8 sequence, i.e. will trim the whole trailing UTF-8 sequence
- returns maxBytes if text was not truncated, or the number of fitting bytes
function Utf8TruncateToLength(var text: RawUtf8; maxBytes: PtrUInt): boolean;
Will truncate the supplied UTF-8 value if its length exceeds the specified bytes count
- this function will ensure that the returned content will contain only valid UTF-8 sequence, i.e. will trim the whole trailing UTF-8 sequence
- returns FALSE if text was not truncated, TRUE otherwise
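A hedged sketch of the truncation rule, assuming the mORMot 2 units are on the search path. The sample text ends with U+00E9 encoded as the two bytes C3 A9, which are set byte by byte so that no compiler code page conversion interferes; the names are illustrative, and the value 3 in the comments is what the description implies rather than a measured output.

```pascal
program TruncateDemo;

{$ifdef FPC}{$mode delphi}{$endif}
{$ifdef MSWINDOWS}{$apptype console}{$endif}

uses
  mormot.core.base,
  mormot.core.unicode;

var
  u: RawUtf8; // illustrative name
begin
  u := 'caf';
  SetLength(u, 5);
  u[4] := AnsiChar($C3); // first byte of U+00E9
  u[5] := AnsiChar($A9); // second byte of U+00E9
  // cutting at 4 bytes would split the 2-byte sequence, so the
  // whole sequence is expected to be dropped, leaving 3 bytes
  writeln(Utf8TruncatedLength(u, 4));
  if Utf8TruncateToLength(u, 4) then
    writeln('truncated to ', length(u), ' bytes')
  else
    writeln('not truncated');
end.
```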
function Utf8TruncateToUnicodeLength(var text: RawUtf8; maxUtf16: integer): boolean;
Will truncate the supplied UTF-8 value if its length exceeds the specified UTF-16 Unicode characters count
- count may not match the UCS4 CodePoint, in case of UTF-16 surrogates
- returns FALSE if text was not truncated, TRUE otherwise
function Utf8UpperCopy(Dest, Source: PUtf8Char; SourceChars: cardinal): PUtf8Char;
Copy UTF-8 buffer into dest^ handling WinAnsi CP-1252 NormToUpper[] folding
- returns the final dest pointer
- current implementation handles UTF-16 surrogates
function Utf8UpperCopy255(dest: PAnsiChar; const source: RawUtf8): PUtf8Char;
Copy UTF-8 buffer into dest^ handling WinAnsi CP-1252 NormToUpper[] folding
- returns the final dest pointer
- will copy up to 255 AnsiChar (expect the dest buffer to be array[byte] of AnsiChar), with UTF-8 encoding
function Utf8UpperReference(S, D: PUtf8Char): PUtf8Char; overload;
UpperCase conversion of a UTF-8 buffer using our Unicode 10.0 tables
- won't call the Operating System, so is consistent on all platforms, whereas UpperCaseUnicode() may vary depending on each library implementation
- some codepoints grow in length once converted, so the D^ buffer should be at least twice the size of S^
- any invalid input is replaced by UNICODE_REPLACEMENT_CHARACTER=$fffd
- won't use temporary UTF-16 decoding, and optimized for plain ASCII content
function Utf8UpperReference(S, D: PUtf8Char; SLen: PtrUInt): PUtf8Char; overload;
UpperCase conversion of a UTF-8 buffer using our Unicode 10.0 tables
- won't call the Operating System, so is consistent on all platforms, whereas UpperCaseUnicode() may vary depending on each library implementation
- some codepoints grow in length once converted, so the D^ buffer should be at least twice the size of S^
- any invalid input is replaced by UNICODE_REPLACEMENT_CHARACTER=$fffd
- won't use temporary UTF-16 decoding, and optimized for plain ASCII content
- knowing the Source length, this function will handle any ASCII 7-bit input by quad, for efficiency
function WideCharToWinAnsi(wc: cardinal): integer;
Conversion of a wide char into a WinAnsi (CodePage 1252) char index
- return -1 for an unknown WideChar in code page 1252
function WideCharToWinAnsiChar(wc: cardinal): AnsiChar;
Conversion of a wide char into a WinAnsi (CodePage 1252) char
- return '?' for an unknown WideChar in code page 1252
function WideStringToUtf8(const aText: WideString): RawUtf8;
Convert a WideString into a UTF-8 string
function WideStringToWinAnsi(const Wide: WideString): WinAnsiString;
Convert a WideString into a WinAnsi (code page 1252) string
function WinAnsiBufferToUtf8(Dest: PUtf8Char; Source: PAnsiChar; SourceChars: cardinal): PUtf8Char;
Direct conversion of a WinAnsi PAnsiChar buffer into a UTF-8 encoded buffer
- Dest^ buffer must be reserved with at least SourceChars*3
- call internally WinAnsiConvert fast conversion class
function WinAnsiToRawUnicode(const S: WinAnsiString): RawUnicode;
Direct conversion of a WinAnsi (CodePage 1252) string into a Unicode encoded String
- very fast, by using a fixed pre-calculated array for individual chars conversion
function WinAnsiToSynUnicode(WinAnsi: PAnsiChar; WinAnsiLen: PtrInt): SynUnicode; overload;
Convert a Win-Ansi encoded buffer into a Delphi 2009+ or FPC Unicode string
- this function is faster than the default RTL, since it uses no Win32 API call
function WinAnsiToSynUnicode(const WinAnsi: WinAnsiString): SynUnicode; overload;
Convert a Win-Ansi string into a Delphi 2009+ or FPC Unicode string
- this function is faster than the default RTL, since it uses no Win32 API call
procedure WinAnsiToUnicodeBuffer(const S: WinAnsiString; Dest: PWordArray; DestLen: PtrInt);
Direct conversion of a WinAnsi (CodePage 1252) string into a Unicode buffer
- very fast, by using a fixed pre-calculated array for individual chars conversion
- text will be truncated if necessary to avoid buffer overflow in Dest[]
function WinAnsiToUtf8(const S: WinAnsiString): RawUtf8; overload;
Direct conversion of a WinAnsi (CodePage 1252) string into a UTF-8 encoded String
- faster than SysUtils: don't use Utf8Encode(WideString) -> no Windows.Global(), and use a fixed pre-calculated array for individual chars conversion
function WinAnsiToUtf8(WinAnsi: PAnsiChar; WinAnsiLen: PtrInt): RawUtf8; overload;
Direct conversion of a WinAnsi (CodePage 1252) string into a UTF-8 encoded String
- faster than SysUtils: don't use Utf8Encode(WideString) -> no Windows.Global(), and use a fixed pre-calculated array for individual chars conversion
function Zeroed(const u: RawUtf8): RawByteString;
Convert a fake UTF-8 buffer without any #0 input back into its original binary
- may be used as an alternative to Base64 decoding if 8-bit chars are allowed
- call UnZeroed() as reverse function
CurrentAnsiConvert: TSynAnsiConvert;
Global TSynAnsiConvert instance to handle current system encoding
- this is the encoding as used by the AnsiString type, so will be used before Delphi 2009 to speed-up RTL string handling (especially for UTF-8)
- this instance is global and instantiated for the whole program lifetime
IdemPropNameUSameLen: array[boolean] of TIdemPropNameUSameLen;
Case (in)sensitive comparison of ASCII 7-bit identifiers of same length
IsValidUtf8Buffer: function(source: PUtf8Char; sourcelen: PtrInt): boolean;
Returns TRUE if the supplied buffer has valid UTF-8 encoding
- will also refuse #0 characters within the buffer
- on Haswell AVX2 Intel/AMD CPUs, will use very efficient ASM
LoadResStringTranslate: procedure(var Text: string) = nil;
This procedure type must be defined if a default system.pas is used
- expect generic "string" type, i.e. UnicodeString for Delphi 2009+
NormToLower: TNormTable;
Lookup table used for fast case conversion to lowercase
- handle 8-bit upper chars as in WinAnsi / code page 1252 (e.g. accents)
- is defined globally, since may be used from an inlined function
NormToLowerAnsi7: TNormTable;
This table will convert 'A'..'Z' into 'a'..'z'
- so it will work with UTF-8 without decoding, whereas NormToUpper[] expects WinAnsi encoding
NormToNorm: TNormTable;
Case sensitive NormToUpper[]/NormToLower[]-like table
- i.e. NormToNorm[c] = c
NormToUpper: TNormTable;
Lookup table used for fast case conversion to uppercase
- handle 8-bit upper chars as in WinAnsi / code page 1252 (e.g. accents)
- is defined globally, since may be used from an inlined function
NormToUpperAnsi7: TNormTable;
This table will convert 'a'..'z' into 'A'..'Z'
- so it will work with UTF-8 without decoding, whereas NormToUpper[] expects WinAnsi encoding
RawByteStringConvert: TSynAnsiFixedWidth;
Global TSynAnsiConvert instance with no encoding (RawByteString/RawBlob)
SortDynArrayAnsiStringByCase: array[boolean] of TDynArraySortCompare;
A quick wrapper to SortDynArrayAnsiString or SortDynArrayAnsiStringI comparison functions
StrCompByCase: array[boolean] of TUtf8Compare;
A quick wrapper to StrComp or StrIComp comparison functions
TEXT_CHARS: TTextCharSet;
Lookup table for text linefeed/word/identifier/uri branch-less parsing
Utf8AnsiConvert: TSynAnsiUtf8;
Global TSynAnsiConvert instance to handle UTF-8 encoding (code page CP_UTF8)
- this instance is global and instantiated for the whole program lifetime
WinAnsiConvert: TSynAnsiFixedWidth;
Global TSynAnsiConvert instance to handle WinAnsi encoding (code page 1252)
- this instance is global and instantiated for the whole program lifetime
- it will be created from hard-coded values, and not using the system API, since it appeared that some systems (e.g. in Russia) tweaked the registry so that code page 1252 maps to code page 1251