Unsafe: (1) Likely to cause severe data loss even in case of the smallest corruption (a single bit flip). (2) Likely to produce false negatives.
It does seem like xz is somewhat overengineered, but I don't think that's a characteristic unique to it; any other algorithm with similar compression performance will behave similarly on corrupted data, since the whole point of compression, and the reason it works at all, is to remove redundancy. If you really want error correction, apply something like Reed-Solomon on top of the compressed data to reintroduce a little redundancy.
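For instance, a minimal sketch of that "compress, then add parity" pipeline in Python, assuming the third-party reedsolo package is available (the parity amount here is arbitrary; real tools like par2 spread parity across blocks so a burst error can't exceed the code's correction capacity):

```python
# Compress first (remove redundancy), then add Reed-Solomon parity
# (reintroduce a controlled amount of redundancy). Illustrative only.
import lzma
import reedsolo

PARITY_BYTES = 32                      # can correct up to 16 corrupted bytes per codeword
rsc = reedsolo.RSCodec(PARITY_BYTES)

original = b"the quick brown fox jumps over the lazy dog " * 200
compressed = lzma.compress(original)   # redundancy removed...
protected = rsc.encode(compressed)     # ...then a little added back as parity

# Simulate corruption: flip one bit in the middle of the protected stream.
damaged = bytearray(protected)
damaged[len(damaged) // 2] ^= 0x01

# decode() returns just the message in older reedsolo versions and a
# (message, message+ecc, errata positions) tuple in newer ones.
result = rsc.decode(bytes(damaged))
recovered = result[0] if isinstance(result, tuple) else result

assert lzma.decompress(bytes(recovered)) == original
print("recovered the original data despite the bit flip")
```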
I haven't read the whole article yet, but gzip and bzip2 recover fairly well from errors. As they are divided into blocks, flipping a bit will make you lose, at most, one block. This is not related to error correction though, but a way of mitigating losses.
>> Just one bit flip in the msb of any byte causes the remaining records to be read incorrectly. It also causes the size of the index to be calculated incorrectly, losing the position of the CRC32 and the stream footer.
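To make the quoted failure concrete: the xz index stores sizes as variable-length "multibyte" integers, where the high bit of each byte says whether another byte follows. A simplified Python sketch of that style of encoding (not the real xz parser) shows how a single flipped continuation bit merges fields and shifts everything after it:

```python
def encode_varint(n: int) -> bytes:
    """Encode n as 7 bits per byte, high bit set = another byte follows."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varints(buf: bytes) -> list[int]:
    """Decode back-to-back variable-length integers from buf."""
    values, value, shift = [], 0, 0
    for byte in buf:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit: more bytes follow
            shift += 7
        else:                    # terminating byte: integer complete
            values.append(value)
            value, shift = 0, 0
    return values

# Four index-like fields, e.g. two (compressed size, uncompressed size) records.
fields = [1000, 200000, 42, 7]
stream = b"".join(encode_varint(n) for n in fields)

damaged = bytearray(stream)
damaged[1] ^= 0x80   # set a stray continuation bit on the first field's last byte

print(decode_varints(stream))          # [1000, 200000, 42, 7]
print(decode_varints(bytes(damaged)))  # first two fields merge into one bogus
                                       # value; every later field shifts position
```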
> As they are divided into blocks, flipping a bit will make you lose, at most, one block.
What happens if bits in the block header are corrupted? If it can't find the start of the next block, the same thing will happen.
Also, breaking up the data into blocks will decrease compression, since each block starts with a fresh state. It is ultimately a tradeoff between compression ratio and error resistance.
> If it can't find the start of the next block, the same thing will happen.
This has been addressed in the article. "Bzip2 is affected by this defect to a lesser extent; it contains two unprotected length fields in each block header. Gzip may be considered free from this defect because its only top-level unprotected length field (XLEN) can be validated using the LEN fields in the extra subfields. Lzip is free from this defect."
> Also, breaking up the data into blocks will decrease compression
This has been tested pretty thoroughly: larger block sizes give rapidly diminishing marginal returns (see man bzip2), and the largest block size you can use with bzip2 is 900 kB.
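You can get a feel for the diminishing returns with Python's bz2 module, whose compresslevel 1..9 selects the 100 kB..900 kB block size. The corpus below is a throwaway synthetic one, so the exact numbers mean little; the shape of the curve is the point:

```python
import bz2
import random

random.seed(0)
words = ["block", "header", "crc", "stream", "index", "record", "footer", "byte"]
corpus = " ".join(random.choice(words) for _ in range(300_000)).encode()  # ~2 MB

for level in range(1, 10):                 # level N = N * 100 kB block size
    size = len(bz2.compress(corpus, compresslevel=level))
    print(f"{level * 100:>4} kB blocks: {size} bytes "
          f"({size / len(corpus):.2%} of original)")
```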
gzip (DEFLATE) usually references data from the preceding 32 kB window in the current block, so if you lose one block you're likely to lose the following one, and the one after that, and so on until the end of the compressed data.
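That back-referencing is the cascade. zlib does let you insert full-flush points that reset the window, so later data no longer depends on earlier data, at some cost in ratio. A rough sketch of the tradeoff (sizes are entirely corpus-dependent):

```python
import zlib

data = b"some mildly repetitive text about blocks, headers and checksums. " * 20_000

def gzip_size(chunks, flush_each_chunk):
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)   # wbits=31: gzip container
    out = bytearray()
    for chunk in chunks:
        out += comp.compress(chunk)
        if flush_each_chunk:
            out += comp.flush(zlib.Z_FULL_FLUSH)    # forget the 32 kB window here
    out += comp.flush()                              # finish the stream
    return len(out)

chunks = [data[i:i + 65536] for i in range(0, len(data), 65536)]
print("single dependent stream  :", gzip_size(chunks, flush_each_chunk=False))
print("window reset every 64 kB :", gzip_size(chunks, flush_each_chunk=True))
```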
Par2 has inherent limitations that make it look distinctly historical (no Unicode etc.). There is a preliminary Par3, but the main reason why Par doesn't enjoy the popularity it should have is simple: Abysmal tooling.
I've tried all the usual implementations available on Windows recently, and they were all unusable. As in: give it a hundred files, each in the single-megabyte range, and it will crash hard.
Why would par2 need to handle anything related to encoding? It acts on the data independently of its actual contents.
I'd agree about the tooling problems with it. It's "acceptable" on a linux/unix command line, but I've never seen anything elsewhere that even looked halfway usable.
> It does seem like xz is somewhat overengineered, but [...] any other algorithm with similar compression performance will yield similar behaviour on corrupted data
You seem to be confusing the format with the compression algorithm. From what the article says, the format is the problem: once any data is damaged, reading the rest of the file (as opposed to decompressing the data from the file) goes out the window, because there is no way to re-synchronize reading with the data blocks after the corruption.