Hi,
Some time ago I learned here the best way to batch insert millions of rows (thanks ab).
At first I was doing classic inserts (I wanted to build a chess database of games), but after some learning and asking,
I finally did them in a batch, as you can see in this example:
pgndb := TSQLRestServerDB.CreateWithOwnModel([TSQLpgnGame], DATABASE_FILE_NAME);
try
  pgndb.DB.LockingMode := lmExclusive;
  pgndb.DB.Synchronous := smOff;
  pgndb.CreateMissingTables;
  pgnbatch := TSQLRestBatch.Create(pgndb, TSQLpgnGame, INSERTS_PER_COMMIT);
  try
    pgnrec := TSQLpgnGame.Create;
    try
      // loop the batch here
      // do some stuff, parse the PGN file, build record
      pgnbatch.Add(pgnrec, true);
      pgnrec.ClearProperties;
      // end loop
    finally
      pgnrec.Free;
    end;
    pgndb.BatchSend(pgnbatch);
  finally
    pgnbatch.Free;
  end;
finally
  pgndb.Free;
  Writeln('Memory used finally = ', IntToThousandString(MemoryUsed));
...
Now I'm trying to learn a bit more about the right way to do things with mORMot.
As I want games to be unique in the DB, I created an index:
pgndb.CreateMissingTables;
pgndb.CreateSQLIndex(TSQLgame, 'Gamedata', true); // true = unique, yes
But... I was thinking (and sorry, because I learned DB and SQL the old-school way, when you did everything by hand with SQL)...
How do dupes work with this way of inserting? I mean, imagine I am inserting a batch of 1000 games and some of them are already in the DB (uniqueness here is only by gamedata, that is, the game itself, not players, tournament or whatever).
When I was doing classic inserts I got an exception when I tried to insert a duplicate row (duplicate by the index, I mean), so I could count them and process them separately.
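For reference, this is roughly the pattern I mean (a simplified sketch only, assuming direct SQLite3 access through SynSQLite3, with made-up table/field names; there, a violated unique index raises ESQLite3Exception):

uses
  SysUtils, SynCommons, SynSQLite3, SynSQLite3Static;

procedure ClassicInsert(db: TSQLDataBase; const games: array of RawUTF8);
var
  i, dupes: integer;
begin
  dupes := 0;
  db.TransactionBegin;
  for i := 0 to high(games) do
    try
      db.Execute(FormatUTF8('insert into pgnGame (Gamedata) values (%)',
        [QuotedStr(games[i])]));
    except
      on E: ESQLite3Exception do
        inc(dupes); // unique index violated -> this game is already stored
    end;
  db.Commit;
  Writeln('duplicates skipped: ', dupes);
end;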
I examined the examples and did some google searches but found nothing. Thanks!
Offline
Maintaining the indexes takes resources, and slows down insertion.
For instance, from a blank database, inserting the data, then creating the indexes after the insertion is the fastest.
You can try the "boInsertOrReplace" option when creating the Batch, to update existing content.
(it will use 'insert or replace into' instead of plain 'insert into' in the executed SQL)
But it will work with IDs, not with your "gamedata" field.
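A quick sketch, reusing the variable names from your first post (the options set is the fourth parameter of the TSQLRestBatch constructor):

  pgnbatch := TSQLRestBatch.Create(pgndb, TSQLpgnGame, INSERTS_PER_COMMIT,
    [boInsertOrReplace]);
  // each Add() is then sent as 'insert or replace into', so a row with the
  // same ID as an existing one will overwrite it instead of failing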
Offline
Using batches will not affect speed when DB.Synchronous := smOff. mORMot uses Int64 as the PK, so you also can't benefit from the "WITHOUT ROWID" optimization, https://www.sqlite.org/withoutrowid.html. I'm not sure that the @ab statement:
For instance, from a blank database, inserting the data, then creating the indexes after the insertion is the fastest.
is true; I'd rather say no.
Duplicates with a UNIQUE INDEX will always give you exceptions. After all, what is the purpose: fast inserts or fast search?
Offline
"Creating the indexes after the insertion is the fastest" is a known fact.
See e.g. https://www.whoishostingthis.com/compar … ulk-insert
This is how all DB engines work: maintaining the index during insertions has a cost (locking the table, fragmenting the disk....), whereas creating the index from scratch is a much faster process - see also https://stackoverflow.com/a/3688755/458259
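With the code from the first post, that order of operations would look more or less like this (a sketch):

  pgndb.CreateMissingTables;       // no secondary index yet
  // ... fill pgnbatch and run pgndb.BatchSend(pgnbatch) as in the first post ...
  pgndb.CreateSQLIndex(TSQLpgnGame, 'Gamedata', true);
  // note: creating a UNIQUE index afterwards will fail if duplicates
  // have already been inserted in the meantime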
Maintaining secondary indexes can drag down performance, especially in SQLite3, with a lot of data.
See https://stackoverflow.com/questions/15778716
In our code, we tend to try using only the primary RowID/ID, or a dedicated separate R-Tree index if needed.
BTW the whole https://www.whoishostingthis.com/compar … e/optimize article is a nice read.
Offline
"Creating the indexes after the insertion is the fastest" is a known fact.
See e.g. https://www.whoishostingthis.com/compar … ulk-insert
This is how all DB engines work: maintaining the index during insertions has a cost (locking the table, fragmenting the disk....), whereas creating the index from scratch is a much faster process - see also https://stackoverflow.com/a/3688755/458259
Hmm, maybe... It depends on the (order of the) data you're inserting. And I wonder, doesn't creating an index lock the table too?
Maintaining secondary indexes can drag down performance, especially in SQLite3, with a lot of data.
See https://stackoverflow.com/questions/15778716
In our code, we tend to try using only the primary RowID/ID, or a dedicated separate R-Tree index if needed.
What's the difference with good ol' plain fopen()/fseek()/fread()/fwrite()/fclose() then?
Lately I wonder what apps you are dealing with; they must be very problem-specific ... dumping tons of information without provisions for fast retrieval ... DAQ systems?
Regards,
Last edited by alpinistbg (2018-11-28 22:41:12)
Offline
Thanks to both of you! I knew about creating the index later; I looked through the examples and did some other tests, and it works that way, which somehow makes sense (you already know this if you have ever coded a B-tree).
My idea is to import many files; some of them are multi-gigabyte, others just a few KB. I don't want duplicate games, so I need the "gamedata" field to be unique.
So I need the index to be created beforehand, which means I can't create the index later, right?
Gamedata is a string of text. I first tried to hash this text (with SHA-1, for example) and make that SHA-1 field unique, but I found no improvement from this (SQLite does the job very well), so I reverted to just inserting the string plainly, with the field itself marked unique.
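That first attempt looked more or less like this (just a sketch, the field names are illustrative; SHA1() from SynCrypto returns the hexadecimal digest of a string):

uses
  SynCommons, SynCrypto, mORMot;

type
  TSQLpgnGame = class(TSQLRecord)
  private
    fGamedata: RawUTF8;
    fGameHash: RawUTF8;
  published
    property Gamedata: RawUTF8 read fGamedata write fGamedata;
    property GameHash: RawUTF8 read fGameHash write fGameHash; // SHA-1 hex of Gamedata
  end;

// at import time, before pgnbatch.Add():
//   pgnrec.GameHash := SHA1(pgnrec.Gamedata);
// and the unique index went on the short digest instead of the full text:
//   pgndb.CreateSQLIndex(TSQLpgnGame, 'GameHash', true);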
I did many insertion tests last year and finally found that batch inserting, as mORMot does it, was faster than inserting row by row inside a transaction.
But yesterday I read about OTA importing in a post by AB.
Still a draft
I think I am going to go back to my first attempt: simple inserts inside a transaction, maybe talking directly to the SQLite engine (I already have code for that).
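Roughly this kind of loop (again only a sketch, with the same made-up table/field names as above; SynSQLite3's TSQLRequest lets you prepare the insert once and rebind it for every row):

uses
  SynCommons, SynSQLite3, SynSQLite3Static;

procedure ImportGames(db: TSQLDataBase; const games: array of RawUTF8);
var
  stmt: TSQLRequest;
  i: integer;
begin
  db.TransactionBegin;               // one transaction for the whole file
  try
    stmt.Prepare(db.DB, 'insert into pgnGame (Gamedata) values (?)');
    try
      for i := 0 to high(games) do begin
        stmt.Bind(1, games[i]);      // bind the text of the game
        stmt.Step;                   // the try/except trick above could count dupes here
        stmt.Reset;                  // make the statement reusable for the next row
      end;
    finally
      stmt.Close;
    end;
    db.Commit;
  except
    db.RollBack;
    raise;
  end;
end;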
I want later to report many types of stats derived from the DB, so I'll need a lot of selects with grouping, counting, etc...
Of course, there are thousands of ways of doing a chess game collection, by hand (some python scripting) or with already very good chess databases (TarraschChess, for example, is open source, and the author has explained a lot about the decisions and code changes in his blog, very interesting).
Sorry if this thing is pretty basic, but I'm just a guy doing some experiments and trying to learn a bit. Thanks!
BTW, I tested a lot of ways to find the fastest method to read text files, and I finally wrote my own readln(), faster than the other methods. I can't guarantee it is bug-free, but I did a lot of testing and it worked. Basically, it is a simple function that reads a block of bytes from the file, scans for #10 and returns a string; the key point is how you handle the last (possibly) incomplete string in your buffer: you have to read another chunk and fix that. It could be sped up with vectorized scanning and other stuff like that, but that is totally beyond my knowledge. Anyway, it was faster than streams, stringlists, memory-mapped files, readln with a bigger SetTextBuf, etc.
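In case it helps anyone, here is the idea condensed into a sketch (not my real code, just the principle; it only counts lines here, a real import would hand each 'line' to the PGN parser, and lines may still carry a trailing #13 on CRLF files):

uses
  SysUtils, Classes;

function CountLines(const FileName: string): Int64;
const
  CHUNK = 1 shl 20; // read 1 MB at a time
var
  f: TFileStream;
  buf, carry, line: AnsiString;
  bytes, i, start: integer;
begin
  result := 0;
  carry := '';
  SetLength(buf, CHUNK);
  f := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    repeat
      bytes := f.Read(buf[1], CHUNK);
      start := 1;
      for i := 1 to bytes do
        if buf[i] = #10 then begin
          // a complete line = carried-over tail + everything up to this #10
          line := carry + Copy(buf, start, i - start);
          carry := '';
          start := i + 1;
          inc(result); // <- hand 'line' to the parser here
        end;
      // keep the (possibly) incomplete last line for the next chunk
      carry := carry + Copy(buf, start, bytes - start + 1);
    until bytes < CHUNK;
    if carry <> '' then
      inc(result); // final line without a trailing #10
  finally
    f.Free;
  end;
end;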
Last edited by enrcas (2018-11-29 12:19:48)
Offline