Fastest CRC calculation

dataman · 2010-08-08 13:27:49

Hello!
Thank you for very interesting projects!

Offer your attention units with fastest parallel calculation CRC: CRC32 & CRC64
These modules created by Aleksandr Sharahov.

P.S. Sorry for bad english.

ab · 2010-08-08 18:55:00

Thanks for your code. Very interesting.

I like your alignment trick (in @head).

Correct me, but apparently you converted the "BYFOUR" version of the standard CRC implementation (in crc32.c), into an optimized easy to read asm version.

That's a good conversion.

In our SynZip unit, I used the standard crc32.c version in "BYFOUR", so I guess performance will be similar.

But it could be a good idea to replace the fixed tables by a dynamic one, to reduce code size. And use your version to get rid of the crc32.obj file, without any performance loss.

Nice work, indeed!

You should make some blog or web site to publish code like this. And post something to EMB forum, to get some links.

dataman · 2010-08-08 19:51:50

Oh no, this code is not mine!
I just use it and I'm not so well know asm

And Aleksandr has repeatedly was on the Fastcode Winner's List.

ab wrote:

You should make some blog or web site to publish code like this.

This code has already been published, but only on the Russian Delphi forum.

ab · 2010-08-09 06:35:25

I thought that dataman=Aleksandr Sharahov

Now I remember his name in the FastCode library.
That's why I found out that this code was well written!

I'll indeed adapt this source code and use it to replace the crc.obj file in SynZip.pas.

Thank for the link.

ab · 2010-08-09 08:00:48

I've added this crc32 code to our SynZip.pas unit.

Modifications are committed in our source code repository:
http://synopse.info/fossil/info/fa45f97d0c

- crc32 is now coded in inlined fast asm (crc32.obj is no longer necessary)
- crc32 hashing is performed using 8 tables, for better CPU pipelining and faster execution
- crc32 tables are created on the fly during unit initialization, therefore saving 8 KB of code size from standard crc32.obj, with no speed penalty

If you compare the code, I just made some code refactoring, put the 8 tables usage under conditional define BYFOUR (defined by default), added the necessary "not eax" to have the exact same result as standard crc32() function, and written some dedicated regression tests.

ab · 2014-05-25 13:38:40

... and now we have written a variation of the code, including fast SSE 4.2 hardware crc32 computation, if available!

See http://blog.synopse.info/post/2014/05/2 … nstruction

maximmasiutin · 2020-02-22 22:04:51

The SSE 4.2 hardware crc32 computation can only calculate CRC32 using Castagnoli polynomial, not the IEEE one.

There is also an assembler implementation of Slicing-by-8 for x64 (Win64) that you can take from https://github.com/maximmasiutin/CRC32- … 64-Asm-Pas
It is faster than any higher level (pascal, C) and only takes 1,20 CPU cycles per byte. It is based on the IA-32 version of Aleksandr Sharahov http://guildalfa.ru/alsha/node/2 but with an essential modification: second iteration of loop unrolling was removed because testing demonstrated it had no benefit, or even made code slower because of a branch in the middle that could cause branch misprediction penalty.

Besides that, with PUREPASCAL the current code of the Synopse mORMot framework implements Slicing-by-4, not Slicing-by-8. Please also consider using Slicing-by-8 implementation for PUREPASCAL (the above GitHub link has it too).

ab · 2020-02-23 11:07:19

Yes, the SSE 4.2 crc32c hardware opcode uses not the same polynomial than zip/deflate.

We implemented only Slicing-by-4 since with older CPUs, it pollutes less the L1 cache, so is actually faster in practice, for small buffers - e.g. shortest strings.
For this very same reason, we actually fallback to xxHash32 as hash function when SSE 4.2 is not available, since it has no lookup table.

Your slicing-by-8 version is very interesting.
But perhaps worth it only for biggest buffers.

Edit:
After some more research https://fastmail.blog/2015/12/03/the-se … ster-crc32 is refreshing.
Sounds like if slicing by 16 could do even more benefit in some cases...

Some reference implementations:
https://github.com/torvalds/linux/blob/ … ib/crc32.c
https://github.com/robn/cyrus-imapd/blo … ib/crc32.c

ab · 2020-02-24 17:44:29

I have reimplemented crc32cfast() in fast x86_64 / i386 asm using Maxim feedback.
Our versions seem slightly more efficient than the one supplied in the link, and it works with FPC and also on Linux.

I kept the slice-by-four for the "pure pascal" version, since ARM chips has usually (much) smaller L1 cache than Intel/AMD CPUs.

Please check https://synopse.info/fossil/info/2252d725bf

mORMot Open Source

#1 2010-08-08 13:27:49

Fastest CRC calculation

#2 2010-08-08 18:55:00

Re: Fastest CRC calculation

#3 2010-08-08 19:51:50

Re: Fastest CRC calculation

#4 2010-08-09 06:35:25

Re: Fastest CRC calculation

#5 2010-08-09 08:00:48

Re: Fastest CRC calculation

#6 2014-05-25 13:38:40

Re: Fastest CRC calculation

#7 2020-02-22 22:04:51

Re: Fastest CRC calculation

#8 2020-02-23 11:07:19

Re: Fastest CRC calculation

#9 2020-02-24 17:44:29

Re: Fastest CRC calculation

Board footer