Unsafe: (1) Likely to cause severe data loss even in case of the smallest corruption (a single bit flip). (2) Likely to produce false negatives.
It does seem like xz is somewhat overengineered, but I don't think that's a characteristic unique to it; any other algorithm with similar compression performance will behave similarly on corrupted data, since the whole point of compression, and the reason it works at all, is to remove redundancy. If you really want error correction, apply something like Reed-Solomon on top of the compressed data to reintroduce a little redundancy.
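For instance, a minimal sketch of that "compress, then add parity" pipeline in Python, assuming the third-party reedsolo package is available (the parity amount here is arbitrary; real tools like par2 spread parity across blocks so a burst error can't exceed the code's correction capacity):

```python
# Compress first (remove redundancy), then add Reed-Solomon parity
# (reintroduce a controlled amount of redundancy). Illustrative only.
import lzma
import reedsolo

PARITY_BYTES = 32                      # can correct up to 16 corrupted bytes per codeword
rsc = reedsolo.RSCodec(PARITY_BYTES)

original = b"the quick brown fox jumps over the lazy dog " * 200
compressed = lzma.compress(original)   # redundancy removed...
protected = rsc.encode(compressed)     # ...then a little added back as parity

# Simulate corruption: flip one bit in the middle of the protected stream.
damaged = bytearray(protected)
damaged[len(damaged) // 2] ^= 0x01

# decode() returns just the message in older reedsolo versions and a
# (message, message+ecc, errata positions) tuple in newer ones.
result = rsc.decode(bytes(damaged))
recovered = result[0] if isinstance(result, tuple) else result

assert lzma.decompress(bytes(recovered)) == original
print("recovered the original data despite the bit flip")
```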
I haven't read the whole article yet, but gzip and bzip2 recover fairly well from errors. As they are divided into blocks, flipping a bit will make you lose, at most, one block. This is not related to error correction though, but a way of mitigating losses.
>> Just one bit flip in the msb of any byte causes the remaining records to be read incorrectly. It also causes the size of the index to be calculated incorrectly, losing the position of the CRC32 and the stream footer.
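To make the quoted failure concrete: the xz index stores sizes as variable-length "multibyte" integers, where the high bit of each byte says whether another byte follows. A simplified Python sketch of that style of encoding (not the real xz parser) shows how a single flipped continuation bit merges fields and shifts everything after it:

```python
def encode_varint(n: int) -> bytes:
    """Encode n as 7 bits per byte, high bit set = another byte follows."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varints(buf: bytes) -> list[int]:
    """Decode back-to-back variable-length integers from buf."""
    values, value, shift = [], 0, 0
    for byte in buf:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit: more bytes follow
            shift += 7
        else:                    # terminating byte: integer complete
            values.append(value)
            value, shift = 0, 0
    return values

# Four index-like fields, e.g. two (compressed size, uncompressed size) records.
fields = [1000, 200000, 42, 7]
stream = b"".join(encode_varint(n) for n in fields)

damaged = bytearray(stream)
damaged[1] ^= 0x80   # set a stray continuation bit on the first field's last byte

print(decode_varints(stream))          # [1000, 200000, 42, 7]
print(decode_varints(bytes(damaged)))  # first two fields merge into one bogus
                                       # value; every later field shifts position
```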
> As they are divided into blocks, flipping a bit will make you lose, at most, one block.
What happens if bits in the block header are corrupted? If it can't find the start of the next block, the same thing will happen.
Also, breaking up the data into blocks will decrease compression, since each block starts with a fresh state. It is ultimately a tradeoff between compression ratio and error resistance.
> If it can't find the start of the next block, the same thing will happen.
This has been addressed in the article. "Bzip2 is affected by this defect to a lesser extent; it contains two unprotected length fields in each block header. Gzip may be considered free from this defect because its only top-level unprotected length field (XLEN) can be validated using the LEN fields in the extra subfields. Lzip is free from this defect."
> Also, breaking up the data into blocks will decrease compression
This has been tested pretty thoroughly: larger block sizes give rapidly diminishing marginal returns (see man bzip2), and the largest block size you can use with bzip2 is 900 kB.
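You can get a feel for the diminishing returns with Python's bz2 module, whose compresslevel 1..9 selects the 100 kB..900 kB block size. The corpus below is a throwaway synthetic one, so the exact numbers mean little; the shape of the curve is the point:

```python
import bz2
import random

random.seed(0)
words = ["block", "header", "crc", "stream", "index", "record", "footer", "byte"]
corpus = " ".join(random.choice(words) for _ in range(300_000)).encode()  # ~2 MB

for level in range(1, 10):                 # level N = N * 100 kB block size
    size = len(bz2.compress(corpus, compresslevel=level))
    print(f"{level * 100:>4} kB blocks: {size} bytes "
          f"({size / len(corpus):.2%} of original)")
```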
gzip (DEFLATE) usually references data from the preceding 32 kB window in the current block, so if you lose one block you're likely to lose the following one, and the one after that, and so on until the end of the compressed data.
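That back-referencing is the cascade. zlib does let you insert full-flush points that reset the window, so later data no longer depends on earlier data, at some cost in ratio. A rough sketch of the tradeoff (sizes are entirely corpus-dependent):

```python
import zlib

data = b"some mildly repetitive text about blocks, headers and checksums. " * 20_000

def gzip_size(chunks, flush_each_chunk):
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)   # wbits=31: gzip container
    out = bytearray()
    for chunk in chunks:
        out += comp.compress(chunk)
        if flush_each_chunk:
            out += comp.flush(zlib.Z_FULL_FLUSH)    # forget the 32 kB window here
    out += comp.flush()                              # finish the stream
    return len(out)

chunks = [data[i:i + 65536] for i in range(0, len(data), 65536)]
print("single dependent stream  :", gzip_size(chunks, flush_each_chunk=False))
print("window reset every 64 kB :", gzip_size(chunks, flush_each_chunk=True))
```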
Par2 has inherent limitations that make it look distinctly historical (no Unicode etc.). There is a preliminary Par3, but the main reason why Par doesn't enjoy the popularity it should have is simple: Abysmal tooling.
I've tried all the usual implementations available on Windows recently, and they were all unusable. As in: give it a hundred files, each in the single-megabyte range, and it will crash hard.
Why would par2 need to handle anything related to encoding? It acts on the data independently of its actual contents.
I'd agree about the tooling problems with it. It's "acceptable" on a linux/unix command line, but I've never seen anything elsewhere that even looked halfway usable.
> It does seem like xz is somewhat overengineered, but [...] any other algorithm with similar compression performance will yield similar behaviour on corrupted data
You seem to be confusing the format with the compression algorithm. From what the article says, the format is the problem: once any data is damaged, reading the rest of the file (as opposed to decompressing the data from the file) goes out the window, because there is no way to re-synchronize reading with the data blocks after the corruption.