#1 2024-12-10 20:39:02

Uefi
Member
Registered: 2024-02-14
Posts: 46

BloomFilter Again

Hi, I tried another implementation of the bloom filter and this is what I noticed:

1. TSynBloomFilter doesn't work with probability 0.00000000000001 maximum from 0.00001
2. TSynBloomfiter has too many collisions, about 1 in every 1000

In other implementations there is approximately 1 collision per 10 thousand

Please study and try another implementation BloomFilter:
https://gist.github.com/synopse/b418126 … 6811203928

Please review the current implementation of the bloom filter, please make improvements, there are a lot of collisions

Offline

#2 2024-12-10 20:46:32

Uefi
Member
Registered: 2024-02-14
Posts: 46

Re: BloomFilter Again

Full module texrex if need:

https://drive.google.com/file/d/15c1WyT … drive_link

I had to fix the module since it was written for a Pascal object, I seem to have fixed everything

Offline

#3 2024-12-10 20:55:12

Uefi
Member
Registered: 2024-02-14
Posts: 46

Re: BloomFilter Again

It seems to me that there are some errors or shortcomings in the current implementation of the bloom filter, so there are a lot of collisions !

Offline

#4 2024-12-10 22:45:29

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,742
Website

Re: BloomFilter Again

Please follow the forum rules and don't put code in the thread itself.

Offline

#5 2024-12-10 23:37:52

Uefi
Member
Registered: 2024-02-14
Posts: 46

Re: BloomFilter Again

ab wrote:

Please follow the forum rules and don't put code in the thread itself.

How then to place the code?

Offline

#6 2024-12-11 08:04:44

TPrami
Member
Registered: 2010-07-06
Posts: 123

Re: BloomFilter Again

ab wrote:

Please follow the forum rules and don't put code in the thread itself.

May I as why is this?

Would seem, if correct highlighting available, it would be better to post into the forum. External source(s) might disappear and make forum thread useless later on.

-Tee-

Offline

#7 2024-12-11 08:14:11

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,742
Website

Re: BloomFilter Again

1) Because it is part of the rules you accepted.
And we follow the rules, don't we?
Otherwise, it won't compile.

2) Because the forum use a php engine with a SQLite3 backend which is not good at huge content, and we backup the whole DB file at once.
So we try to keep the posts not too verbose, to keep the DB small enough for the long term (we have more than 42,000 of posts since 2010).
I have no time to change the forum engine.

3) Because using a gist is always better to copy & paste, and make several revisions, and tend to favor not snippets of code, but a minimal reproducible example.
As the forum rules proposes.

Offline

#8 2024-12-11 08:27:11

TPrami
Member
Registered: 2010-07-06
Posts: 123

Re: BloomFilter Again

Uefi wrote:

Hi, I tried another implementation of the bloom filter and this is what I noticed:

In other implementations there is approximately 1 collision per 10 thousand

Where is this other implementation from?
What is it's licence?

-Tee-

Offline

#9 2024-12-11 09:38:08

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,742
Website

Re: BloomFilter Again

@UEfi I don't see any implementation differences between TSynBloomFilter and the one you provided.
The only difference is that we use the "addition trick" of https://www.eecs.harvard.edu/~michaelm/ … -02-05.pdf which is well used and proven.

I don't want to start again the same discussion we had
https://synopse.info/forum/viewtopic.php?id=6864
There was a lot of misunderstanding in our previous discussions, I can't find what you are expecting and if something is wrong in our class.

What we call "false positive" is computed and verified as such in our regression tests TTestCoreBase.BloomFilters:

    n := 0;
    for i := SIZ + 1 to SIZ + SIZ shr 5 do
      if b.MayExist(@i, SizeOf(i)) then
        inc(n);
    falsepositive := (n * 100) / (SIZ shr 5);
    Check(falsepositive < 1, 'falsepositive=%', [falsepositive]);

With crc32c() we are at 0.704 for the integers in 0..SIZ range. With crc() we have 0.9. With other functions like AesNiHash32, which is fed from random so gives diverse values, we are around 0.8 - 1.2, which is as expected with our FalsePositivePercent = 1.

Bloom filters usecase is with FalsePositivePercent around 1, typically 0.5, 1, 2, or 5. If you expect 0.0000x then you need to use another algorithm.

Offline

#10 2024-12-11 10:19:37

TPrami
Member
Registered: 2010-07-06
Posts: 123

Re: BloomFilter Again

ab wrote:

2) Because the forum use a php engine with a SQLite3 backend which is not good at huge content, and we backup the whole DB file at once.

Fair enough...

Offline

#11 2024-12-11 19:46:25

Uefi
Member
Registered: 2024-02-14
Posts: 46

Re: BloomFilter Again

@ab Well, please test in your module collisions occur 1 in 5000 in this third-party library collisions occur 1 in 300000, but the formula is the same here and there

Bloom:=TTrBloomFilter.Create(100000000, 0.00000000000001);
Bloom:=TSynBloomFilter.Create(100000000, 0.000001, cryptcrc32(camd5));

Please test how many collisions there will be in both cases in any way

Last edited by Uefi (2024-12-11 20:01:54)

Offline

Board footer

Powered by FluxBB