You are not logged in.
Pages: 1
Hello, first of all I want to express my deep gratitude for your wonderful libraries mORMot and mORMot2, I noticed that TSynBloomFilter in both libraries works somehow incorrectly, namely, it incorrectly identifies duplicate lines from large files over 600 megabytes.
So I have the following code:
program Project2;
{$APPTYPE CONSOLE}
{$R *.res}
uses
System.SysUtils,
System.Classes,
System.Diagnostics,
mormot.core.search, // or SynTable
VCL.Dialogs;
procedure ReadFile(const fpath:string); inline;
var
uniq,dup:integer;
BloomFilter:TSynBloomFilter;
Reader: TStreamReader;
Line:string;
begin
dup:=0;
uniq:=0;
BloomFilter:=TSynBloomFilter.Create(5000000);
try
Reader := TStreamReader.Create(fpath, TEncoding.Default, True, 2048);
while not Reader.EndOfStream do begin
Line:=Trim(Reader.ReadLine);
if not BloomFilter.MayExist(Line) then begin
BloomFilter.Insert(Line);
inc(uniq);
end else
inc(dup);
end;
finally
Reader.Close;
FreeAndNil(Reader);
FreeAndNil(BloomFilter);
end;
writeln(Format('Duplicates:%d Uniq:%d',[dup, uniq]));
readln;
end;
var
op:Topendialog;
FF:string;
begin
op:=Topendialog.Create(nil);
if op.Execute then
FF:=op.FileName;
op.Free;
ReadFile(FF);
end.
Before I upload .txt files, I check them with other programs to remove duplicates and the results are very different,
Please check whether this is really a bug in the library itself or am I doing something wrong?
Last edited by Uefi (2024-04-20 23:37:03)
Offline
Please define "it incorrectly identifies duplicate lines" and "results are very different".
Note that by design, Bloom filters will return false positives, i.e. false duplicates, in range of the BloomFilter.FalsePositivePercent property (as supplied to Create).
And use RawByteString instead of string types.
Offline
Please define "it incorrectly identifies duplicate lines" and "results are very different".
Note that by design, Bloom filters will return false positives, i.e. false duplicates, in range of the BloomFilter.FalsePositivePercent property (as supplied to Create).
And use RawByteString instead of string types.
Hello, I mean that there is a very large number of collisions in a file that has no duplicates; it finds more than 1000 duplicates
Converting to the RawByteString type changes absolutely doesn't change anything
Last edited by Uefi (2024-04-21 08:37:06)
Offline
If you have 1% of lines which are notified, it is just as expected with the default aFalsePositivePercent=1 value.
https://en.wikipedia.org/wiki/Bloom_filter
https://en.wikipedia.org/wiki/False_pos … _negatives
Offline
If you have 1% of lines which are notified, it is just as expected with the default aFalsePositivePercent=1 value.
https://en.wikipedia.org/wiki/Bloom_filter
https://en.wikipedia.org/wiki/False_pos … _negatives
I achieved an order of magnitude fewer collisions by adding my crc32c function:
function crc32c(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
begin
Result:=BobJenkinsHash(buf[1], len, crc);
end;
Please try to study and check this:
in mormot.core.search.pas:
interface
{$I ..\mormot.defines.inc}
uses
classes,
sysutils,
System.Generics.Defaults, // Add This module
mormot.core.base,
mormot.core.os,
mormot.core.rtti,
mormot.core.unicode,
mormot.core.text,
mormot.core.buffers,
mormot.core.datetime,
mormot.core.data,
mormot.core.json;
...
{ ****************** Bloom Filter Probabilistic Index }
{ TSynBloomFilter }
function crc32c(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal; // Add my function here
begin
Result:=BobJenkinsHash(buf[1], len, crc);
end;
const
BLOOM_VERSION = 0;
BLOOM_MAXHASH = 32; // only 7 is needed for 1% false positive ratio
Murmurhash3 also performed well:
in mormot.core.search.pas:
interface
{$I ..\mormot.defines.inc}
uses
classes,
sysutils,
FastCompare, // Add This module
mormot.core.base,
mormot.core.os,
mormot.core.rtti,
mormot.core.unicode,
mormot.core.text,
mormot.core.buffers,
mormot.core.datetime,
mormot.core.data,
mormot.core.json;
...
{ ****************** Bloom Filter Probabilistic Index }
{ TSynBloomFilter }
function crc32c(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
begin
Result:=MurMurHash3(buf[1], len, crc);
end;
const
BLOOM_VERSION = 0;
BLOOM_MAXHASH = 32; // only 7 is needed for 1% false positive ratio
With my function I now get as it should be 0 duplicates in the file with no duplicates !
Last edited by Uefi (2024-04-21 19:41:51)
Offline
So it does work correctly, according to the hash function used.
By design, you will always have collisions in hash functions, especially after a modulo reduction.
crc32c is very fast on modern hardware, but may have more collisions depending on the dataset.
We can forge some input data with a lot of collisions with any non-cryptrographic hash functions.
I have added the ability to change the hash function if needed.
https://github.com/synopse/mORMot2/commit/056f0dcd
See e.g. CryptCrc32(caMd5/caSha1) from mormot.crypt.secure.
caMD5 is very good for this usage, and caSha1 may be very fast on modern SHA-NI Intel/AMD CPUs (more than 2GB/s).
Offline
Hello, I’m very glad that you are also interested in this question,
It seems that I was able to achieve almost 0% collisions with the hybrid use of hashes:
function crc32c(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
begin
Result:=BobJenkinsHash(buf[1], len, xxHash32(buf[1], len, crc)); // Spring4D module
end;
In mormot.core.search.pas:
uses
classes,
sysutils,
mormot.core.base,
mormot.core.os,
mormot.core.rtti,
mormot.core.unicode,
mormot.core.text,
mormot.core.buffers,
mormot.core.datetime,
mormot.core.data,
mormot.core.json,
System.Generics.Defaults, // Add this
Spring.Hash; // And This (Spring4D Module)
.......
{ ****************** Bloom Filter Probabilistic Index }
{ TSynBloomFilter }
function crc32c(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
begin
Result:=BobJenkinsHash(buf[1], len, xxHash32(buf[1], len, crc)); // Spring4D module
end;
const
BLOOM_VERSION = 0;
BLOOM_MAXHASH = 32; // only 7 is needed for 1% false positive ratio
And I create TSynBloomFilter like this:
BloomFilter:=TSynBloomFilter.Create(100000000,0.000001); // If you put less 0.000001 then there will be collisions Seems 0.000001 this is the limit =(
Please check this as well and report the results !
Last edited by Uefi (2024-04-22 08:23:52)
Offline
The number of collisions depends on the input data.
So you better switch to a cryptographic hash like caMd5 or caSha1, instead of tweaking some weak hash functions for your particular dataset.
Especially using the BobJenkins hash which is slower than mORMot MD5 AFAIR.
https://www.delphitools.info/2014/08/25 … -shootout/
Anyway, if you need 0% collisions, then you are misusing the Bloom filter. It is not what it is meant for.
You should be happy with the 1% collision rate you get.
Please try to understand the Bloom Filter usecases.
Offline
Can you please tell me how I can achieve 0% collisions without taking up too much RAM?
Offline
For a true collision-less algorithm, I would use a binary tree of SHA-256 hashes.
That is: collision-less hash and perfectly balanced efficient storage of the hashes.
Or perhaps a Merkle Tree if you want to store the content.
https://en.wikipedia.org/wiki/Merkle_tree
https://www.codementor.io/blog/merkle-trees-5h9arzd3n8
Offline
For a true collision-less algorithm, I would use a binary tree of SHA-256 hashes.
That is: collision-less hash and perfectly balanced efficient storage of the hashes.Or perhaps a Merkle Tree if you want to store the content.
https://en.wikipedia.org/wiki/Merkle_tree
https://www.codementor.io/blog/merkle-trees-5h9arzd3n8
I have just started to understand the new BloomFilter arsenal and I still don’t understand how to specify sha256 , sha512 in it
BloomFilter:=TSynBloomFilter.Create(100000000,0.00001, SHA3_512)); // NOT )
Tell me how I can find Merkle Tree in mORMot lib ?
Last edited by Uefi (2024-04-22 22:54:27)
Offline
Can you please tell me how I can achieve 0% collisions without taking up too much RAM?
Why are you so worried about RAM?
Most computers have gigabytes of free RAM, and it is pretty cheap to get more. Your mileage might vary, it all depends on everything for sure.
-Tee-
Last edited by TPrami (2024-04-23 04:06:23)
Offline
Uefi wrote:Can you please tell me how I can achieve 0% collisions without taking up too much RAM?
Why are you so worried about RAM?
Most computers have gigabytes of free RAM, and it is pretty cheap to get more. Your mileage might vary, it all depends on everything for sure.
-Tee-
Because the files that I process are very large 32-100 gigabytes, now imagine how much RAM the same THashSet will steal ?
By smaller standards, THashSet steals three times more RAM than the file itself 32*3 or 100*3 = 100-300 gigabytes RAM
Offline
Because the files that I process are very large 32-100 gigabytes, now imagine how much RAM the same THashSet will steal ?
By smaller standards, THashSet steals three times more RAM than the file itself 32*3 or 100*3 = 100-300 gigabytes RAM
Seems to be valid point. IF the processed file is tight (not and XML or JSON file, but binary etc), most likely it is too much to keep in memory anyways.
Depending what you are actually doing it could be better off in DB, or if it is some kind of lookup, wiit some bias/pattern maybe can have active HOT cache and if not found check from DB etc...
hard to tell, these are just shots in the dark...
-Tee-
Offline
How many diverse items have you in your huge file?
If you should not have any duplicate in your file, then you need to store each hash (or each truncated hash) somehow.
A database is no fast option for this. If you can store the hashes in memory, it would be faster.
If you store the SHA-256 THash256 of each, it is only one THash256 (32 bytes) per unique item.
My guess is that MD5 and THash128 may be enough for your usecase. If you have a newer computer with SHA-NI, try SHA-1 and truncate it to THash128 (16 bytes) per unique item.
There is no Merkle tree in mORMot.
And Merkle trees are not exactly for this purpose.
If you don't have too many unique items, you can use an array of THash256 and make a brute force search (e.g. next power-of-two bigger than 16 x the number of unique items in your input file).
But it is O(n) and could become slow.
Or you just use the last bits of the crypto THash256, and set a bit in a huge array of byte of a power-of-two size.
This may be the fastest and the easiest, and using a fixed set of memory.
Or you can make e.g. a binary tree of the poor (tm) by defining an
type
TTreeLeaf = array[0..SizeOf(THash256)-4] of byte;
TTree = array[byte, byte, byte] of array of TTreeLeaf;
Then store the THash256 using its 3 first bytes as fixed indexes, then store the remaining bytes in a dynamic array.
To search, you can quickly use the 3 first bytes, then make a brute force in the TTreeLeaf dynamic array. If it is too slow, you can just sort the TTreeLeaf dynamic array content, and use a binary search to locate the hash in O(log(n)) instead of O(n).
You could try first with 2 bytes indexes (array[byte, byte] of...) which would use less memory.
Or you could define a smaller TTreeLeaf size, e.g. only array[0..7] of byte, which is likely to be unique due to the cryptographic nature of the hash.
On Win64 target of course, to ensure you can have plenty of memory and are not limited to 2GB.
But we need to know more about the data.
Offline
Thank you very much for your great desire to help, what I don’t understand is how to use sha256 or sha512 for TSynBloomFilter for the newest version mORMot2 )?
Offline
Better stay away from bloom filters for your purpose, because it does not apply.
But if you want it, just make what I wrote above:
See e.g. CryptCrc32(caMd5/caSha1) from mormot.crypt.secure.
caMD5 is very good for this usage, and caSha1 may be very fast on modern SHA-NI Intel/AMD CPUs (more than 2GB/s).
Use e.g. the CryptCrc32(caMd5) result as THasher parameter to TSynBloomFilter.Create.
Offline
Thank you, I already realized that these hash functions already imply cryptographic strength. I tested BobJenkinsHash, MurmurHash3 and xxHash32 with a standard set of mORMot caMd5, caSha1, caCrc32 and others, in fact for some reason they lose a lot with them by an order of magnitude more collisions
For example with hybrid BobJenkinsHash and xxHash32 only 15 collisions, and with hybrid caMd5 and caSha1 539 collisions of course with the same settings TsynBloomFilter and the same file
function MyHasher(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
begin
Result:=BobJenkinsHash(buf[0], len, xxHash32(buf[0], len, crc)); // Spring4D module
end;
function MyHasher(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
var
H1,H2: Thasher;
begin
H1:=CryptCrc32(caMd5);
H2:=CryptCrc32(caSha1);
Result:=H1(h2(crc, buf, len), buf, len);
end;
BloomFilter:=TSynBloomFilter.Create(100000000,0.00001, MyHasher);
Last edited by Uefi (2024-04-23 10:04:17)
Offline
Please read again and try to understand my post above
https://synopse.info/forum/viewtopic.ph … 317#p41317
Or you just use the last bits of the crypto THash256, and set a bit in a huge array of byte of a power-of-two size.
This may be the fastest and the easiest, and using a fixed set of memory.
The TSynBloomFilter use several such bit indexes internally.
You can just use one with a cryptographic hash like caMd5 and a big enough size of the bitset.
And don't cascade hashes, it just work by chance on your dataset. Use a cryptographic hash with a big enough modulo.
Offline
Even any one hash from the MOrmot set is significantly inferior to BobJenkinsHash or xxHash32 please check this, is there any hashing error ?
I really don’t understand why, in fact, cryptographic hashes are suddenly inferior to the old non-cryptographic hashes ?
Last edited by Uefi (2024-04-23 10:10:54)
Offline
Even any one hash from the MOrmot set is significantly inferior to BobJenkinsHash or xxHash32 please check this, is there any hashing error ?
You are making a lot of confusion about hashes and collisions.
There always be collisisions if you reduce a hash to some smaller storage.
Of course mORMot crypto hashes are validated against vectors and live data since years, and give the correct output.
Cascading BobJenkinsHash with xxHash32 will always be poorer than a cryptographic hash, by design. It is "better" for your use case only by chance, when truncated to the TSynBloomFilter small storage.
A 128-bit crypto hash will always have better distribution than a 32-bit non crypto hash.
You are thinking the wrong way around, starting from your own particular case without understanding how hash reduction is applied.
Offline
You are thinking the wrong way around.
In fact, for some reason not, please check for yourself if you don't trust me
Offline
You just don't read or understand what I wrote.
I did not say that your results were wrong.
I said that your assumptions are wrong, about how hash reduction, e.g. in bloom filters, works.
TSynBloomFilter works correctly for a Bloom Filter. You just misuse it, and have misconceptions about this algorithm, and hashes or cryptography in general.
Offline
You just don't read or understand what I wrote.
I did not say that your results were wrong.
I said that your assumptions are wrong, about how hash reduction, e.g. in bloom filters, works.
TSynBloomFilter works correctly for a Bloom Filter. You just misuse it, and have misconceptions about this algorithm, and hashes or cryptography in general.
Okay, here the debate can actually drag on for a long time, I would like to try something like SHA256, SHA512 So I can do it like this:
uses
HlpHashFactory // module HashLib4Pascal
function MyHasher(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
begin
Result:=THashFactory.TCrypto.CreateSHA3_256.ComputeUntyped(buf, len).GetInt32;
end;
BloomFilter:=TSynBloomFilter.Create(100000000,0.00001, MyHasher);
Can I do something similar using just mORMot? Because HashLib4Pascal hashes very slowly
Last edited by Uefi (2024-04-23 10:38:24)
Offline
You have all the cryptographic hashes you need in mormot.crypt.core.pas.
Is it even possible to make a BloomFilter without false positives?
Last edited by Uefi (2024-04-24 04:30:05)
Offline
Is it even possible to make a BloomFilter without false positives?
As far as I know whole point of BloomFilter is that it allows False Positives. It is designed that way. And should be used in cases where that is OK.
-Tee-
Offline
You have all the cryptographic hashes you need in mormot.crypt.core.pas.
Hello, here are the test results for the standard TSynBloomFilter hashing and our own:
Hashing out of the box:
BloomFilter:=TSynBloomFilter.Create(100000000,0.00001);
Hashing with your own function:
function MyHasher(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
begin
Result:=BobJenkinsHash(buf[0], len, MurMurHash3(buf[0], len, crc xor len));
end;
BloomFilter:=TSynBloomFilter.Create(100000000,0.00001, MyHasher);
As you can see, standard hashing is much worse, both tests were carried out on the same file without duplicates
Last edited by Uefi (2024-04-26 11:39:09)
Offline
Just stop trying to use a Bloom filter when you don't need one!
Use MD5 + a single hash bit buffer and GetBitPtr() + SetBitPtr().
If it's not too difficult, could you write an example in Delphi ?
Offline
Just stop trying to use a Bloom filter when you don't need one!
Use MD5 + a single hash bit buffer and GetBitPtr() + SetBitPtr().
Hello, I forgot to answer, I think I did as you advised and now the results are normal:
function MyHasher(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
var
Hasher:Thasher;
begin
Hasher:=cryptCrc32(caMd5);
SetBitPtr(buf, 2);
Result:=Hasher(crc, buf, len);
end;
Now there are about 12-15 duplicates in file displacement seems to help a lot, I would like to somehow transfer SynBloomFilter to 64-bit hashes; in theory, there will be even fewer collisions
Last edited by Uefi (2024-05-11 18:57:41)
Offline
Pages: 1