#1 2018-06-21 08:53:07

Eric
Member
Registered: 2012-11-26
Posts: 129
Website

SynLZ fails to compress long sequences of the same byte

When testing compression, I bumped into an odd issue with SynLZ: when the input data is a long sequence of the same byte, compression fails.

The issue still happens even if you introduce a non-repeating sub-sequence (for instance, if you give one byte in the sequence a different value).

Another example is an 8-bit input string of 'hello' followed by 1000 space characters followed by 'world', which does not compress at all.
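
For reference, something along these lines reproduces it (this assumes the SynLZcompressdestlen / SynLZcompress1 signatures from SynLZ.pas; the test procedure itself is just for illustration):

uses
  SysUtils, SynLZ;

procedure TestRepeatedBytes;
var
  src, dst: AnsiString;
  len: integer;
begin
  src := 'hello' + StringOfChar(AnsiChar(' '), 1000) + 'world';
  SetLength(dst, SynLZcompressdestlen(length(src)));
  len := SynLZcompress1(pointer(src), length(src), pointer(dst));
  // expected: len much smaller than length(src)
  // observed: len stays around length(src), i.e. no compression at all
  writeln('input=', length(src), ' compressed=', len);
end;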

Offline

#2 2018-06-22 16:53:34

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,183
Website

Re: SynLZ fails to compress long sequences of the same byte

There is a known issue with such patterns.
The compression implementation should be modified to handle this kind of content (without breaking its encoding), but I haven't found the time to do it yet.
And to be honest, we never have such patterns in our real-life data structures, since we always un-flatten big sequences of identical numbers before applying the compression.

In practice, it is mitigated by the layer we use on top of SynLZ, which falls back to plain storage when the compressor runs into such issues.
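
Something along these lines, as a rough illustration of the idea (the one-byte marker and the function name are hypothetical, this is not the actual layer used in the framework):

// Illustrative sketch of the fallback idea only: the #0/#1 marker byte and
// the function name are hypothetical, not the framework's actual layer.
function CompressOrStore(const src: AnsiString): AnsiString;
var
  len: integer;
begin
  result := '';
  if src = '' then
    exit;
  SetLength(result, SynLZcompressdestlen(length(src)) + 1);
  len := SynLZcompress1(pointer(src), length(src), @result[2]);
  if len < length(src) then begin
    result[1] := #1;                      // marker: SynLZ-compressed payload
    SetLength(result, len + 1);
  end else begin
    result[1] := #0;                      // marker: stored as-is, no compression
    SetLength(result, length(src) + 1);
    Move(pointer(src)^, result[2], length(src));
  end;
end;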

So any input is welcome. smile

Offline

#3 2018-06-25 06:09:11

Eric
Member
Registered: 2012-11-26
Posts: 129
Website

Re: SynLZ fails to compress long sequences of the same byte

Only noticed it here when investigating a storage oddity for one sensor which got disconnected (so it was sending only zeroes) and became the #1 user of storage space.

Is there any description of how SynLZ works, beyond looking at the code of SynLZcompress1pas?

Offline

#4 2018-06-25 07:35:44

Eric
Member
Registered: 2012-11-26
Posts: 129
Website

Re: SynLZ fails to compress long sequences of the same byte

I see the issue derives from the case where the hash leads to src-o <= 2: the compressor is then dead in the water, as it just copies bytes from src to dst.
IIRC some LZ variations avoid that by "adopting" the not-yet-scanned bytes, which results in a form of RLE.
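
That trick works because the decoder copies such a match byte by byte, so a match whose offset is smaller than its length keeps re-reading bytes it has just written, and an offset of 1 degenerates into pure RLE. A generic sketch of that kind of copy (not SynLZ's actual decoder code):

// Generic LZ77-style overlapping copy (illustration only, not SynLZ's code):
// src may point inside the bytes just written, so an offset of 1 simply
// repeats the previous byte len times, i.e. behaves like RLE.
procedure CopyOverlappingMatch(var dst: PAnsiChar; offset, len: integer);
var
  src: PAnsiChar;
begin
  src := dst - offset;
  while len > 0 do begin
    dst^ := src^;   // byte-by-byte on purpose: each step may read a byte written by a previous one
    inc(dst);
    inc(src);
    dec(len);
  end;
end;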

If I understood the "src-o = 1" situation correctly, it is basically a case where the dictionary entry gets "refreshed" at every step for that particular hash?

Doing some basic tests, I have found it is possible to tailor the compressed stream so that it can be made to work. However, the issue is in the way the decoder looks up offsets: the RLE can only happen after the offset has been used once, which would mean the compression phase would need to maintain an image of the decompressor's offset state.

For instance, in an output where all bytes are the same, you can enter the block guarded by src-o>2 the second time (f.i. when CWBit > 2), but if you enter it the first time, then the "move(offset[h]^,dst^,t)" in the decompressor will fail, as offset[h] is still empty.

At the moment I am not sure I fully understand the CWBit logic though, so I may be overlooking something simple smile

Offline

#5 2018-06-25 14:50:48

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 14,183
Website

Re: SynLZ fails to compress long sequences of the same byte

Yes, I did the same trials, but they ended up with wrong decompression due to inconsistent hashing on the compression and decompression sides.

The CWBit logic is just about a Control Word: when a bit is set, it indicates that some bytes (the count follows) are to be taken from the current offset[] value.
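
To illustrate the idea only (hypothetical simplified token layout, not the real SynLZ bitstream), a decoder driven by such a control word goes roughly like this:

// Illustrative control-word decode loop (hypothetical simplified format,
// NOT SynLZ's actual encoding): CW carries 32 flag bits; a clear bit means
// "one literal byte follows", a set bit means "copy t bytes from offset[h]".
procedure ControlWordSketch(src, srcend, dst: PAnsiChar);
var
  CW: cardinal;
  CWbit, h, t: integer;
  offset: array[0..255] of PAnsiChar;   // previously written positions, by slot
begin
  FillChar(offset, SizeOf(offset), 0);
  CWbit := 0;
  while src < srcend do begin
    if CWbit = 0 then begin
      CW := PCardinal(src)^;            // reload the next 32 control bits
      inc(src, 4);
      CWbit := 32;
    end;
    if CW and 1 = 0 then begin
      dst^ := src^;                     // flag clear -> copy one literal byte
      inc(src);
      inc(dst);
    end else begin
      t := ord(src[0]);                 // hypothetical layout: 1 byte length,
      h := ord(src[1]);                 // 1 byte dictionary slot
      inc(src, 2);
      // if a set flag referenced a slot not filled yet, this move would read
      // from nil - which is exactly the "first use" issue discussed above
      move(offset[h]^, dst^, t);        // flag set -> take t bytes from offset[h]
      inc(dst, t);
    end;
    offset[ord(dst[-1])] := dst - 1;    // naive slot refresh, for illustration only
    CW := CW shr 1;
    dec(CWbit);
  end;
end;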

Offline
