New crc32c() function using optimized asm and SSE 4.2 instruction

ab · 2014-05-25 12:52:57

Cyclic Redundancy Check (CRC) codes are widely used for integrity checking of data in fields such as storage and networking.
There is an ever-increasing need for very high-speed CRC computations on processors for end-to-end integrity checks.

We just introduced to mORMot's core unit (SynCommons.pas) a fast and efficient crc32c() function.

It will use either:
- Optimized x86 asm code, with unrolled loops;
- SSE 4.2 hardware crc32 instruction, if available.

Resulting speed is very good.

You would not use this new crc32c() function to replace the zlib's crc32() function, but as a convenient very fast hashing function at application level.
For instance, our TDynArray wrapper will use it for fast items hashing.

See http://blog.synopse.info/post/2014/05/2 … nstruction
and http://synopse.info/fossil/info/8fe7cc53b7 (and some following commits, including x64 support).

ab · 2014-08-25 12:56:23

Nice article on Eric's blog:
http://www.delphitools.info/2014/08/25/ … mment-3423

Feedback is always nice!

Johan · 2015-06-26 18:59:38

On my laptop: AMD64 crc32cfast is really really slow.
The following code:

program HashSpeedTest;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils,
  SynCommons in '..\..\Mormot\SynCommons.pas',
  FastDefaults,
  System.Generics.Defaults,
  System.Diagnostics;

var
  HashTest: array[0..10000] of Ansistring;

procedure InitHashTest;
var
  i,j: Integer;
  Len: integer;
  r: AnsiChar;
begin
  for i := Low(HashTest) to High(HashTest) do begin
    len:= Random(30)+3;
    SetLength(HashTest[i],Len);
    for j := 1 to len do begin
      r:= AnsiChar(Chr(Random(250)+1));
      HashTest[i][j]:= r;
    end;
  end;
  WriteLn('Init done');
end;

procedure crc;
var
  Timer: TStopWatch;
  i,a: integer;
  PrevCrc: integer;

begin
  for a := 0 to 5 do begin
    Timer:= TStopWatch.StartNew;
    PrevCRC:= 0;
    for i:= Low(HashTest) to High(HashTest) do begin
      PrevCrc:= crc32cfast(PrevCRC, PAnsiChar(pointer(HashTest[i])), Length(HashTest));
    end;
    Timer.Stop;
    WriteLn(Format('Crc32 took: %d millisec = %d ticks ',[Timer.ElapsedMilliseconds, Timer.ElapsedTicks]));
  end;
end;

procedure Murmur;
var
  Timer: TStopWatch;
  i,a: integer;
  PrevCrc: integer;
begin
  for a := 0 to 5 do begin
    Timer:= TStopWatch.StartNew;
    PrevCRC:= 0;
    for i:= Low(HashTest) to High(HashTest) do begin
      PrevCrc:= MurmurHash3((@HashTest[i])^, Length(HashTest[i]), PrevCrc);
    end;
    Timer.Stop;
    WriteLn(Format('Murmurhash3 took: %d millisec = %d ticks ',[Timer.ElapsedMilliseconds, Timer.ElapsedTicks]));
  end;
end;

procedure BobJenkins;
var
  Timer: TStopWatch;
  i,a: integer;
  PrevCrc: integer;
begin
  for a := 0 to 5 do begin
    Timer:= TStopWatch.StartNew;
    PrevCRC:= 0;
    for i:= Low(HashTest) to High(HashTest) do begin
      PrevCrc:= BobJenkinsHash((@HashTest[i])^, Length(HashTest[i]), PrevCrc);
    end;
    Timer.Stop;
    WriteLn(Format('Bobjenkins took: %d millisec = %d ticks ',[Timer.ElapsedMilliseconds, Timer.ElapsedTicks]));
  end;
end;

begin
  InitHashTest;
  Crc;
  Murmur;
  BobJenkins;
  WriteLn('done, press a key...');
  ReadLn;
end.

Outputs:

Init done
Crc32 took: 107 millisec = 230030 ticks
Crc32 took: 87 millisec = 187325 ticks
Crc32 took: 91 millisec = 195064 ticks
Crc32 took: 91 millisec = 196798 ticks
Crc32 took: 100 millisec = 215848 ticks
Crc32 took: 94 millisec = 202032 ticks
Murmurhash3 took: 0 millisec = 1312 ticks
Murmurhash3 took: 0 millisec = 1386 ticks
Murmurhash3 took: 1 millisec = 2561 ticks
Murmurhash3 took: 1 millisec = 2472 ticks
Murmurhash3 took: 1 millisec = 2464 ticks
Murmurhash3 took: 1 millisec = 2476 ticks
Bobjenkins took: 3 millisec = 7612 ticks
Bobjenkins took: 3 millisec = 7431 ticks
Bobjenkins took: 3 millisec = 6880 ticks
Bobjenkins took: 1 millisec = 3786 ticks
Bobjenkins took: 1 millisec = 3753 ticks
Bobjenkins took: 1 millisec = 3752 ticks
done, press a key...

As you can see I've eliminated the unneeded calls to `UniqueString` so what am I missing.
crc32c was supposed to be the end-all of hashes?
What am I missing?

ab · 2015-06-27 09:27:01

Your code is not correct.
You are hashing a lot more information with crc32 than with the others.

There is a [ i ] missing at the end of the line:

PrevCrc:= crc32cfast(PrevCRC, PAnsiChar(pointer(HashTest[i])), Length(HashTest));

Which should be:

PrevCrc:= crc32cfast(PrevCRC, PAnsiChar(pointer(HashTest[i])), Length(HashTest[i]));

Also test crc32c() - if your CPU support hardware crc32 instructions, it may help.

For a hashing algorithm, consider the number of collisions.
crc32c has pretty low number of collisions.

I did not notice it, but in latest version of Delphi, the BobJenkinsHash() implementation has been rewritten for speed.
Initial version (the one tested by Eric, I suspect) was very slow.
But since any TDictionary<> uses BobJenkinsHash(), it made sense for Embarcadero to optimize this part of the RTL.
Of course, crc32c() is still faster.

Johan · 2015-07-11 17:09:44

Fixed all the issues with testing and added some stress testing to the suite.
The result on my AMD laptop.

Starting testing
Zipcodes 00000..99999
BobJenkins Lookup3 took 127 ms, 272472 ticks
BobJenkins Lookup3 has 34 max low collisions
BobJenkins Lookup3 has 34 max high collisions
BobJenkins Lookup3 has 1 real collisions
Zipcodes 00000..99999
MurmurHash3 took 88 ms, 188765 ticks
MurmurHash3 has 34 max low collisions
MurmurHash3 has 34 max high collisions
MurmurHash3 has 1 real collisions
Zipcodes 00000..99999
CRC32c took 93 ms, 199934 ticks
CRC32c has 23 max low collisions
CRC32c has 24 max high collisions
CRC32c has 0 real collisions
Zipcodes 00000..99999
FNV_1a_Meiyan took 65 ms, 140930 ticks
FNV_1a_Meiyan has 32 max low collisions
FNV_1a_Meiyan has 32 max high collisions
FNV_1a_Meiyan has 0 real collisions
integers
BobJenkins Lookup3 took 111 ms, 238801 ticks
BobJenkins Lookup3 has 35 max low collisions
BobJenkins Lookup3 has 33 max high collisions
BobJenkins Lookup3 has 1 real collisions
integers
MurmurHash3 took 32 ms, 69188 ticks
MurmurHash3 has 34 max low collisions
MurmurHash3 has 36 max high collisions
MurmurHash3 has 0 real collisions
integers
CRC32c took 47 ms, 102769 ticks
CRC32c has 16 max low collisions
CRC32c has 16 max high collisions
CRC32c has 0 real collisions
integers
FNV_1a_Meiyan took 25 ms, 53990 ticks
FNV_1a_Meiyan has 47 max low collisions
FNV_1a_Meiyan has 23 max high collisions
FNV_1a_Meiyan has 0 real collisions
Loading 120 MB password file...
Done loading, starting the hash...
120 MB of Passwords
BobJenkins Lookup3 took 3714 ms, 7960607 ticks
BobJenkins Lookup3 has 2593 max low collisions
BobJenkins Lookup3 has 2595 max high collisions
BobJenkins Lookup3 has 1 real collisions
120 MB of Passwords
MurmurHash3 took 2562 ms, 5491776 ticks
MurmurHash3 has 2625 max low collisions
MurmurHash3 has 2620 max high collisions
MurmurHash3 has 1 real collisions
120 MB of Passwords
CRC32c took 2647 ms, 5674922 ticks
CRC32c has 2634 max low collisions
CRC32c has 2603 max high collisions
CRC32c has 2 real collisions
120 MB of Passwords
FNV_1a_Meiyan took 2381 ms, 5103116 ticks
FNV_1a_Meiyan has 2635 max low collisions
FNV_1a_Meiyan has 2622 max high collisions
FNV_1a_Meiyan has 1 real collisions
120 MB of Passwords as fast as possible
BobJenkins Lookup3 took 2852 ms, 6113225 ticks
BobJenkins Lookup3 has 0 max low collisions
BobJenkins Lookup3 has 0 max high collisions
BobJenkins Lookup3 has 0 real collisions
120 MB of Passwords as fast as possible
MurmurHash3 took 1707 ms, 3660022 ticks
MurmurHash3 has 0 max low collisions
MurmurHash3 has 0 max high collisions
MurmurHash3 has 0 real collisions
120 MB of Passwords as fast as possible
CRC32c took 1805 ms, 3869382 ticks
CRC32c has 0 max low collisions
CRC32c has 0 max high collisions
CRC32c has 0 real collisions
120 MB of Passwords as fast as possible
FNV_1a_Meiyan took 1560 ms, 3344962 ticks
FNV_1a_Meiyan has 0 max low collisions
FNV_1a_Meiyan has 0 max high collisions
FNV_1a_Meiyan has 0 real collisions
Done, press a key....

The longer the data to be hashed is, the faster Meiyan performs.
So far so big issues with collisions.

Can you make your hash functions compatible with the RTL ones?
Your order of parameters does not match Emba's.

This makes plugin of your hashes needlessly complicated.

I need this code to bridge the difference.

function HashCrc32(const Data; Len, InitData: integer): integer;
begin
  Result:= SynCommons.crc32cfast(InitData, @Data, Len);
end;

This slows down the hashing.
And on the CRC32SSE version that takes up a non-trivial percentage of the time.

Also I don't understand why the hashfunction takes a PAnsiString. Any pointer will do, surely.

Note that I'm using XE7. Dunno if Bobjenkins is optimized as much as possible.

Last remark:
The FNV_1a_Meiyan PurePascal version runs at the same speed as the optimized assembly function.
(At least on my laptop). All other hashes have PurePascal versions that run notably slower.

ab · 2015-07-11 17:46:33

So, with SSE4.2 aware CPU (almost all new models, I suspect), the winner would be definitively crc32c, I guess.
On my Core i7, crc32sse42 more than two times faster then the crc32fast you are using.

No other algorithm has such speed up potential as our crc32c version.

ab · 2021-02-15 07:52:52

About performance, SSE4.2 crc32c from Intel has a new challenger in town...
Our https://blog.synopse.info/?post/2021/02 … r-mORMot-2

mORMot Open Source

#1 2014-05-25 12:52:57

New crc32c() function using optimized asm and SSE 4.2 instruction

#2 2014-08-25 12:56:23

Re: New crc32c() function using optimized asm and SSE 4.2 instruction

#3 2015-06-26 18:59:38

Re: New crc32c() function using optimized asm and SSE 4.2 instruction

#4 2015-06-27 09:27:01

Re: New crc32c() function using optimized asm and SSE 4.2 instruction

#5 2015-07-11 17:09:44

Re: New crc32c() function using optimized asm and SSE 4.2 instruction

#6 2015-07-11 17:46:33

Re: New crc32c() function using optimized asm and SSE 4.2 instruction

#7 2021-02-15 07:52:52

Re: New crc32c() function using optimized asm and SSE 4.2 instruction

Board footer