#1 2017-02-28 13:22:11

cybexr
Member
Registered: 2016-09-14
Posts: 78

Some feature request about sqlite3 FTS & tokenizer

Dear ab, when running mORMot sample 30, I wanted to try the unicode61 tokenizer. After changing the TSQLArticleSearch definition to TSQLArticleSearch = class(TSQLRecordFTS4Unicode61), the SQLite3 database generated by mORMot still declared the 'ArticleSearch' table with tokenize=simple.
After I checked mormot.pas and changed line 31023 to Self.InheritsFrom(TSQLRecordFTS3Unicode61), everything worked fine. So I created a pull request on GitHub. Please review it, and thank you again for your enthusiasm for this wonderful framework :-)

After doing some work on tokenizers, I found that none of the simple, porter and unicode61 tokenizers handles Chinese correctly. Is there any way to add the ICU tokenizer or some custom tokenizer?

Below is some information I found while investigating:
http://stackoverflow.com/questions/1838 … -when-i-in
https://github.com/wangwang4git/SQLite3-ICU
https://github.com/haifengkao/SqliteSubstringSearch
https://sqlite.org/fts3.html#tokenizer
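
The Chinese tokenization problem described above can be reproduced outside mORMot. The following is a minimal sketch in Python, assuming the bundled sqlite3 module was built with FTS4 enabled; the table name `t`, the column name `content`, the helper `matches` and the sample sentence are all made up for illustration. The built-in tokenizers split on ASCII separators only, so a run of Chinese characters is indexed as one single token and a search for a word inside that run finds nothing.

```python
import sqlite3


def matches(tokenizer, text, query):
    """Index `text` in a one-row FTS4 table and count rows matching `query`.

    `tokenizer` is inserted into the SQL as-is, so pass trusted names only.
    """
    con = sqlite3.connect(":memory:")
    try:
        # Hypothetical table/column names, not the ones used by sample 30.
        con.execute(
            f"CREATE VIRTUAL TABLE t USING fts4(content, tokenize={tokenizer})"
        )
        con.execute("INSERT INTO t(content) VALUES (?)", (text,))
        row = con.execute(
            "SELECT count(*) FROM t WHERE content MATCH ?", (query,)
        ).fetchone()
        return row[0]
    finally:
        con.close()


if __name__ == "__main__":
    sentence = "我喜欢全文检索"  # "I like full-text search", no spaces
    for tok in ("simple", "porter"):
        inner = matches(tok, sentence, "检索")    # a word inside the sentence
        whole = matches(tok, sentence, sentence)  # the whole sentence as one token
        print(tok, "inner word:", inner, "whole sentence:", whole)
```

With these tokenizers, only the query for the whole sentence should find the row, which is why a CJK-aware tokenizer (ICU or a custom one) is needed for word-level search.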


#2 2017-02-28 17:04:55

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,206

Re: Some feature request about sqlite3 FTS & tokenizer

I've merged your pull request.

The tokenize=... SQL is now generated from the TSQLRecordFTS3/4 class name.
e.g. TSQLRecordFTS4Porter -> tokenize=porter
See https://synopse.info/fossil/info/afd4717549

So if you define your own TSQLRecordFTS4ICU class, it will generate a virtual FTS4 table with the tokenize=icu parameter.
But you still need to add the tokenizer function to the SQLite3 engine.
I suspect you need to compile your SQLite3.obj with ICU, as described in https://sqlite.org/fts3.html#compiling_ … 3_and_fts4 i.e. with
-DSQLITE_ENABLE_ICU
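
Whether a given SQLite3 build actually knows a tokenizer can be checked before relying on tokenize=icu. The sketch below uses Python's sqlite3 module rather than mORMot, so it probes the SQLite library linked into Python, not your SQLite3.obj; the helper name `tokenizer_available` and the table name `probe` are hypothetical. It relies on the fact that creating an FTS4 table with an unregistered tokenizer fails with an error.

```python
import sqlite3


def tokenizer_available(name):
    """Return True if an FTS4 table can be created with tokenize=<name>.

    `name` is inserted into the SQL as-is, so pass trusted names only.
    Note that False can also mean that FTS4 itself is not compiled in.
    """
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            f"CREATE VIRTUAL TABLE probe USING fts4(content, tokenize={name})"
        )
        return True
    except sqlite3.OperationalError:
        # Typically "unknown tokenizer: <name>" when it is not registered.
        return False
    finally:
        con.close()


if __name__ == "__main__":
    for name in ("simple", "porter", "unicode61", "icu"):
        print(name, tokenizer_available(name))
```

On a library compiled without ICU support, the `icu` probe is expected to report False, which matches the point above: the class name only selects the tokenizer, the engine still has to provide it.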


#3 2017-03-01 01:48:33

cybexr
Member
Registered: 2016-09-14
Posts: 78

Re: Some feature request about sqlite3 FTS & tokenizer

ab, thank you for your quick response. I'll try to add a CJK-friendly tokenizer to the engine.

