#1 2021-05-28 14:15:58

okoba
Member
Registered: 2019-09-29
Posts: 106

AES-NI speed in Dictionary

I've experienced 2X slowness with TSynDictionary (in Add and Find) when I use mormot.crypt.core and it activates AES-NI.

    
  for I := 0 to High(A) do
    A[I] := IntToStr(I);       
  Dic := TSynDictionary.Create(TypeInfo(TStringDynArray), TypeInfo(TInt64DynArray));
  T := GetTickCount64();
  for I := 0 to High(A) do
    Dic.Add(A[I], I);
  WriteLn('Fill: ', GetTickCount64() - T);

  T := GetTickCount64();
  for I := 1 to High(A) do
    Dic.Find(A[I]);
  WriteLn('Find: ', GetTickCount64() - T);
  Dic.Free;     

Here is the output of the mormot test:

  - crc32c: 320,100 assertions passed  69.16ms
      pas 430.8 MB/s fast 2 GB/s sse42 8.6 GB/s sse42+aesni 9.7 GB/s

Tested on Windows, latest mORMot and FPC on two machines with i3 and i9.

Offline

#2 2021-05-28 14:19:15

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: AES-NI speed in Dictionary

What is the A[] length and the Fill: Find: values displayed?

Offline

#3 2021-05-28 15:34:03

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: AES-NI speed in Dictionary

If I run something similar with UInt32ToUtf8() and RawUtf8 keys, I have:

     100000 add crc32c in 23.95ms i.e. 4,173,971/s, aver. 0us
     100000 find crc32c in 8.56ms i.e. 11,678,150/s, aver. 0us
     100000 add AesNiHash32 in 25.85ms i.e. 3,868,471/s, aver. 0us
     100000 find AesNiHash32 in 8.95ms i.e. 11,166,945/s, aver. 0us

If I run

    for i := 1 to MAX do
    begin
      UInt32ToUtf8(i, k);
      check(dict.Add(k, v) = i - 1);
..
    for i := 0 to MAX - 1 do
      HashAnsiString(@PPointerArray(dict.Keys.Value^)[i], crc32c);

with both hashers, on 1,000,000 items, here is the result:

     1000000 crc32c in 6.34ms i.e. 157,629,255/s, aver. 0us
     1000000 AesNiHash32 in 6.63ms i.e. 150,670,483/s, aver. 0us

So I don't see such a 2x speed difference, and the distribution should be much better with AesNiHash32 than with crc32c (much less colisions) on real data (not a simple increase sequence of integers).

   a[i] := RandomAnsi7(15 + i and 7);

This much more realistic array of keys gives a slightly better performance for AesNiHash32:

     100000 add crc32c in 29.25ms i.e. 3,418,569/s, aver. 0us
     100000 find crc32c in 11.10ms i.e. 9,006,574/s, aver. 0us
     100000 add AesNiHash32 in 28.56ms i.e. 3,500,420/s, aver. 0us
     100000 find AesNiHash32 in 10.74ms i.e. 9,304,922/s, aver. 0us

Also note that GetTickCount64 is not a good way of measuring time on Windows. It has a very low resolution. Use our TPrecisionTimer instead.

Offline

#4 2021-05-29 08:11:58

okoba
Member
Registered: 2019-09-29
Posts: 106

Re: AES-NI speed in Dictionary

I was reporting string dictionary without set capacity, surprised that why the new algorithm is not faster.
This time I did an another test: https://gist.github.com/OkobaPatino/cec … 358984f5c5
These are the result of latest mORMot, latest FPC, Windows and i9.

Remarks:
In increasing string AesNiHash32 is slower, near 25% with SetCapacity and near 50% without SetCapacity.
In random GUID AesNiHash32 is a bit slower.
If dictionary capacity is set beforehand, Add (400%) and Find (25%) is faster, no matter the Hasher.

Questions:
Find is slower when you do not set capacity before hand. Why?
Can I ask where much less collisions come to play with AesNiHash32 as it is always slower than crc32c?

BTW It would be nice to have the ability of choosing the Hasher of the dictionary now that we have more options.

Last edited by okoba (2021-05-29 08:23:31)

Offline

#5 2021-05-30 15:25:26

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: AES-NI speed in Dictionary

I guess a missing dic.Free would help stabilize your tests in your code.

Anyway, I have made further testing, and there seems to be a problem about TSynDictionary: it has more collisions than I expected. This is the reason why Find() is slower when the dictionary size grows up with no SetCapacity.
But with or without SetCapacity performance should be comparable. The current pattern is very conservative about memory consumption, but it comes with some performance penalty. I am currently tuning and fixing it. Some pull request to be released tomorrow.

What I have seen is that AesNiHash32 is indeed a bit slower than crc32c from some of my tests, but it has less collisions, and is not suggest to DoS attacks with pre-computed input.

Offline

#6 2021-05-31 16:22:14

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: AES-NI speed in Dictionary

I have rewritten TSynDictionary / TDynArrayHasher process.

The new associated tests include some numbers.
https://github.com/synopse/mORMot2/comm … d55d0R5850

In practice, the tests are not as simple as yours: it uses random access for Find(), not a naive for i := 0 to Count - 1 loop, which is neither realistic nor useful (you don't use a dictionary for such a loop).
In practice, r[]=i is faster than r[]=Random32() - up to 3x for count>100,000
Real numbers state that once we reach 100,000 items, the CPU cache starts to be a bottleneck for Find().

Then we still have O(1) process up to 10,000,000 items, with linear speed.
SetCapacity() makes the Add() twice faster, as expected.
AesNiHash32 is sometimes faster, sometimes slower. And it is always safer than crc32c since it doesn't suffer from potential DoS attacks from forged input data to trigger unexpected collisions.

Default options are stil AesNiHash32 as hash, and DYNARRAYHASH_LEMIRE + DYNARRAYHASH_PO2 as algorithms, and seems to give a good balance between performance and memory usage.

Offline

#7 2021-05-31 17:49:24

okoba
Member
Registered: 2019-09-29
Posts: 106

Re: AES-NI speed in Dictionary

Great update. Thank you ab!
I've checked your changes and tests, and rerun my test in an i3 too. Without SetCapacity results are 2X faster for Add() and 1.3X for find. Also Find() is almost the same with and without SetCapacity.  It's great!

Two questions, is there a count limit for TSynDictionary / TDynArrayHasher? What will be the O for more than 10M?

Last edited by okoba (2021-05-31 18:03:17)

Offline

#8 2021-05-31 19:02:35

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: AES-NI speed in Dictionary

For more than 1,000,000 I would switch to a SQLite3 database. Slower to insert, but at least request could be fast with minimal memory consumption.

There is no theoritical limit in TSynDictionary but the available memory.

Offline

#9 2021-05-31 19:12:37

okoba
Member
Registered: 2019-09-29
Posts: 106

Re: AES-NI speed in Dictionary

Thank you.

Offline

#10 2021-06-01 12:19:36

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: AES-NI speed in Dictionary

Note that TSynDictionary/TDynArrayHasher consumes little memory for its hash table, in comparison for most other implementations.
Only one 32-bit integer per slot, which is at most twice the array Capacity.
In the numbers displayed in this commit test, you see how much memory is involved for the hash table as "slots=..." information.

Offline

#11 2021-06-01 12:33:08

okoba
Member
Registered: 2019-09-29
Posts: 106

Re: AES-NI speed in Dictionary

Offline

#12 2021-06-01 13:22:43

mpv
Member
From: Ukraine
Registered: 2012-03-24
Posts: 1,534
Website

Re: AES-NI speed in Dictionary

@okoba - is this test executed on bare metal or inside VM? VM (especially HyperV) can be configured to emulate extended CPU instruction set (so called compatibility mode) and in such case performance can be very bad

Last edited by mpv (2021-06-01 13:23:04)

Offline

#13 2021-06-01 13:35:48

okoba
Member
Registered: 2019-09-29
Posts: 106

Re: AES-NI speed in Dictionary

Bare metal, Windows.

Offline

#14 2021-06-01 15:10:00

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,182
Website

Re: AES-NI speed in Dictionary

Thanks for the numbers.
Your tests are consistent with my own findings: with more number of items, the CPU cache is the bottleneck - not TSynDictionay.
This is the only reason why we can see Add() faster than Find() with 10,000,000 items: the former is linear, so is cache-friendly, whereas the second is random, so not cache-friendly.

And both crc32c or AesNiHash32 give similar numbers - with AesNi being safer and more proven for hash tables.

Your i9 is more than twice faster than my i5 during those tests, mostly due to a bigger cache I guess, and perhaps also because my i5 is a notebook / low voltage version.

Offline

Board footer

Powered by FluxBB