An ode to bzip

169 points
3 days ago
by signa11

Comments


idoubtit

My experience does not match theirs when compressing text and code:

> bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.

I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.

Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's.
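For anyone who wants to reproduce this kind of comparison, here's a rough sketch using only Python's stdlib codecs (zstd isn't in older stdlibs, so this compares gzip, bzip2, and xz/LZMA; the sample is a synthetic SQL-ish text, not their actual files):

```python
import bz2, gzip, lzma

# Synthetic stand-in for "text and code": repetitive, structured text.
# (Hypothetical sample -- the commenters used a real SQL dump and the kernel.)
line = b"INSERT INTO t (id, name) VALUES (%d, 'user_%d');\n"
data = b"".join(line % (i, i) for i in range(20_000))

sizes = {
    "gzip -9": len(gzip.compress(data, compresslevel=9)),
    "bzip2 -9": len(bz2.compress(data, compresslevel=9)),
    "xz -6": len(lzma.compress(data)),
}
for name, size in sizes.items():
    print(f"{name:9s} {size:8d} bytes ({size / len(data):.1%} of original)")
```

On this kind of input all three do well; the interesting differences only show up with real, large files.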

3 days ago

mppm

What you are seeing here is probably the effect of window size. BZip has to perform the BWT strictly block-wise and is quite memory-hungry, so `bzip2 -9` uses a window size of 900KB, if I recall correctly. Dictionary-based algorithms are more flexible in this regard, and can gain a substantial advantage on very large and repetitive files. The article kind of forgets to mention this. Not that BZip isn't remarkably efficient for its simplicity, but it's not without limitations.
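The window-size effect is easy to demonstrate with the stdlib: put an exact repeat at a distance larger than bzip2's maximum 900KB block and bzip2 cannot see it, while a dictionary coder with a larger window can. A sketch, using xz/LZMA as the stand-in for zstd (which isn't in older stdlibs):

```python
import bz2, lzma, random

random.seed(0)
block = random.randbytes(1_000_000)   # 1 MB of incompressible noise
data = block + block                  # exact repeat at 1 MB distance

bz = len(bz2.compress(data, 9))       # 900 KB blocks: the repeat spans blocks
xz = len(lzma.compress(data))         # default 8 MB dictionary covers it

print(f"bzip2 -9: {bz} bytes, xz: {xz} bytes")
```

bzip2 ends up storing both copies at roughly full size, while LZMA stores roughly one copy plus a back-reference.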

2 days ago

bigiain

I suspect the reason for the difference here may be the specific use case and its implications for file size. The author's use case is Lua files to run in Minecraft, and I strongly suspect their 327KB example file is much closer to "typical" for that use case than a 1GB SQL file.

It wouldn't surprise me at all that "more modern" compression techniques work better on larger files. It also wouldn't surprise me too much if there was no such thing as a 1GB file when bzip was originally written; according to Wikipedia, bzip2 is almost 30 years old ("Initial release: 18 July 1996"), and there are mentions of the preceding bzip (without the 2), which must have been even earlier than that. In the mid/late 90s I was flying round-the-world trips with a dozen or so 380 or 500MB hard drives in my luggage to screw into our colo boxen in Singapore, London, and San Francisco (because our office only had 56k ADSL internet).

3 days ago

adrian_b

For large files, it is common to obtain much higher compression ratios by using a preprocessing step, e.g. lrzip (which invokes internal or external standard compressors after preprocessing the input to find long-range similarities).

For instance, "lrzip -b", which uses bzip2 for compression, typically achieves much higher compression ratios on big files than using either xz or zstd alone. Of course, you can also use lrzip with xz or zstd, with various parameters, but among the many existing possibilities you must find an optimum compromise between compression ratio and compression/decompression times.

2 days ago

ac29

> Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's

I compressed kernel 6.19.8 with zstd -19 --long and bzip3 (default settings). The latter compressed better and was about 8x faster.

3 days ago

cogman10

bzip is old and slow.

It was long surpassed by lzma and zstd.

But back in roughly the 00s, it was the best standard for compression, because the competition was DEFLATE/gzip.

3 days ago

duskwuff

Also potentially relevant: in the 00s, the performance gap between gzip and bzip2 wasn't quite as wide - gzip has benefited far more from modern CPU optimizations - and slow networks / small disks made a higher compression ratio more valuable.

3 days ago

yyyk

Even then, there were better options in the Windows world (RAR/ACE/etc.). Also, bzip2 was considered slow even when it was new.

3 days ago

thesz

RAR/ACE/etc. used continuous compression - all files were concatenated and compressed as if they were one single large file, much like what is done with .tar.bz2. Bzip on Windows did not do that; there was no equivalent of .tar.bz2 on Windows.

You can bzip2 -9 the files in some source code directory and tar the resulting .bz2 files. That would be more or less equivalent to creating a ZIP archive with a BWT compression method. Then compare the result with tar-ing the same source directory and running bzip2 -9 on the resulting .tar.

The continuous mode in RAR was something back then, exactly because RAR had long LZ77 window and compressed files as continuous stream.
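A toy experiment shows why solid mode wins. Here stdlib bz2 stands in for the archivers, and the "source files" are made up, but the effect is the same: per-file compression pays header overhead per file and can't share redundancy across files.

```python
import bz2

# 200 hypothetical source files that share most of their content.
files = [
    b"// Copyright 1999 Example Corp.\n"
    b"#include <stdio.h>\n"
    b"int feature_%03d(void) { return %d * 42; }\n" % (i, i)
    for i in range(200)
]

# Like zipping each file separately: every file compressed on its own.
per_file = sum(len(bz2.compress(f, 9)) for f in files)

# Like tar | bzip2: one continuous ("solid") stream.
solid = len(bz2.compress(b"".join(files), 9))

print(f"per-file total: {per_file} bytes, solid: {solid} bytes")
```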

3 days ago

yyyk

>continuous compression

'Solid compression' (as WinRAR calls it) is still optional with RAR. I recall the default is 'off'. At the time, that mode was still pretty good compared to bzip2.

3 days ago

eviks

is SQL file text or code?

3 days ago

8n4vidtmkvmk

And here I got the best compression out of xz for SQL.

3 days ago

saghm

Early on, the article mentions that xz and zstd have gotten more popular than bzip, and my admittedly naive understanding is that they're considered to have better tradeoffs in terms of compression time and overall space saved. The performance section heavily discusses encoding performance of gzip and bzip, but unless I'm missing something, the only references to xz or zstd in that section are brief handwaving about the decoding times probably being similar.

My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of the diminished popularity of bzip in favor of the non-gzip alternatives that it admits are the more popular choices in recent years.

3 days ago

0cf8612b2e1e

There was an analysis which argued that zstd is pareto optimal.

https://insanity.industries/post/pareto-optimal-compression/

3 days ago

lucb1e

> > Pareto optimality is a situation where no […] preference criterion can be better off without making at least one […] preference criterion worse off […].

(Omissions theirs.)

Wasn't that zstandard's stated goal? Not very surprising that it has this property, also considering it's much newer (2015) than the established tools like gzip (1992), bzip2 (1996), LZMA as used by xz utils (1999)

Edit: https://github.com/facebook/zstd/blob/4856a00164c1d7b947bd38... initial commit indeed states it's meant to have good ratios and good (de)compression speeds as compared to other tools, without sacrificing one for the other (»"Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).«). So Pareto by design, just not using that name

3 days ago

bonzini

> meant to have good ratios and good (de)compression speeds as compared to other tools

That does not mean it's Pareto optimal; Pareto-optimality forms a curve and while zstd, LZMA, LZ4, ZPAQ all want to be as close as possible to the curve, they focus on different parts of it. In particular zstd tries to stay on the middle part of the curve, while LZMA and LZ4 focus on opposite sides

          ___.--- higher throughput
        /     LZ4
       / zStd
      |
      ; LZMA
     |
     | ZPAQ
    lower size
Also, the Pareto curve is not necessarily known in advance. All you can do is add more and more algorithms or tweaks to understand what it looks like. For example, this blog post [https://insanity.industries/post/pareto-optimal-compression/] shows that prior to zstd, bzip2 and gzip were both pretty much Pareto optimal in the same area for ratio vs. compression speed. LZMA at low settings was a bit better but much slower. There was a huge gap between LZMA and LZ4, and bzip2/gzip filled it as best as they could.

The same blog post shows that zstd is an absolute speed demon at decompression; while not all zstd settings are Pareto optimal when looking at size vs compression speed (in particular LZMA wins at higher compression ratios, and even considering zstd only there's hardly a reason to use levels 11-15), zstd is pretty much Pareto optimal at all settings when looking at size vs. decompression speed. On the other hand at intermediate settings zstd is faster and produces smaller files than gzip, which therefore is not Pareto optimal (anymore).
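For the curious, "Pareto optimal" is mechanical to compute once you have measurements: a (size, time) point is on the frontier if no other point is at least as good on both axes and strictly better on one. A sketch with made-up numbers (not real benchmark data):

```python
def pareto_frontier(points):
    """Keep (name, size, time) entries not dominated by any other entry.
    Lower is better on both axes."""
    return [
        (n, s, t) for (n, s, t) in points
        if not any(
            (s2 <= s and t2 <= t) and (s2 < s or t2 < t)
            for (_, s2, t2) in points
        )
    ]

# Hypothetical (compressed size in MB, compression time in s) measurements.
measurements = [
    ("gzip -9", 320, 40),
    ("bzip2 -9", 260, 120),
    ("zstd -19", 250, 90),
    ("xz -9", 230, 400),
]
print(pareto_frontier(measurements))
```

With these invented numbers, bzip2 drops off the frontier exactly the way the blog post describes: zstd is both smaller and faster.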

2 days ago

rurban

This misses the very best compressors by Fabrice Bellard: https://bellard.org/nncp/ and, for text, ts_zip.

2 days ago

lucb1e

Interesting approach. The fastest of the 4 presented compressors ("LSTM (small)") is 24 times slower than xz, and their best compressor ("LSTM (large1)") is 429 times slower than xz. Let alone gzip or, presumably, zstandard (not shown in paper). They also ran the models on different CPUs (a Core i5 and a Xeon E5) so the results are not even comparable within the same paper. A linked webpage lists the author's decompression times, which are even worse: xz decompresses twelve thousand times faster (50MB/s vs. 4kB/s) when nncp has an Nvidia RTX 3090 and 24GB RAM available to it, which apparently speeds it up by 3x compared to the original CPU implementation.

At half the size of xz's output, there can be applications for this, but you need to:

- not care about compression time

- not be constrained on hardware requirements (7.6GB RAM, ideally let it run on a GPU)

- not care about decompression time either

- and the data must be text (I can't find benchmarks other than from English Wikipedia text, but various sources emphasize it's a text compressor so presumably this is no good on e.g. a spacecraft needing to transmit sensor/research data over a weak connection, even if the power budget trade-off of running a GPU instead of pumping power into the antenna were the optimal thing to do)

2 days ago

usefulcat

Bzip is slower than zstd and doesn’t compress as well as xz. There’s no place for it.

3 days ago

jgalt212

10 years ago we looked at replacing gzip archives with bzip. It was just too slow for the incremental space savings offered. Creating bzip archives took forever. I know xz is "better", but we still just gzip everywhere.

2 days ago

Sesse__

xz is generally slower-but-denser; it's not meant to be a full gzip replacement. Zstd, on the other hand, aims to be a "better gzip", in that it compresses about as fast as gzip but packs somewhat denser (and decompresses faster).

2 days ago

darkwater

You can pry bzip from my dead cold hands.

2 days ago

int_19h

Article has the numbers for their input:

uncompressed: 327005

(gzip) zopfli --i100: 75882

zstd -22 --long --ultra: 69018

xz -9: 67940

brotli -Z: 67859

lzip -9: 67651

bzip2 -9: 63727

bzip3: 61067

3 days ago

lucb1e

This matches my experience compressing structured text btw. Bzip2 will beat every other tool out there, both on compression ratio and, sadly, decompression time

OP says decompression time is so high because it has similar properties to a memory-hard password hash: it's bandwidth-bound due to the random access requirement. Even xz decompresses 2.5x faster, and I don't find it particularly fast

This is why I switched away, also for text compression; searching for anything that isn't near the beginning of a large file is tedious. My use-case for compression is generally not like OP's, that is, compressing 100KB so that it can fit into Minecraft (if I understood their purpose correctly); I compress files because they take too much disk space (gigabytes). But if I never wanted to access them, I'd not store them, so decompression speed matters. So I kinda do agree with GP that Bzip2 has limited purposes when Zstd is just a few % more bytes to store for over an order of magnitude more speed (1GB/s instead of 45MB/s)

Edit: And all that ignores non-json/xml/code/text compression tasks, where Bzip2/LZMA doesn't give you the best compression ratio. I'd argue it is premature optimization to use Bzip2 without a very specific use-case like OP has for very good code compression ratios and a simple decoder

3 days ago

necovek

I wonder what the combined speed would be for small to mid-sized text files if they were fully loaded into memory first? That swaps multiple random accesses for a single sequential read (even if "sequential" doesn't mean much with flash memory, it should still perform better than totally random access), plus memory random access, which should not be a bottleneck at these speeds.

Or perhaps this is already being done this way and it is a bottleneck?

This might not work for multi-gigabyte files, but most of text content is well under 1MiB.

2 days ago

lucb1e

This is essentially already being done. The kernel will keep as much of the file being operated on cached in RAM as will fit

Code can care about cache locality and improve upon this default behavior, like if you find a heuristic where you can do another part of the decompression already while some of the data is still in a CPU cache and you don't have to hit main RAM, but whether that's possible will depend on the algorithm

Being in RAM sadly doesn't mean that random access doesn't matter anymore. Just like how hard drives (compare to, say, tape drives) are random access, you still have swapping times to load the data into the CPU (or arm-moving times in the case of an HDD)

2 days ago

ac29

I tested bzip3 vs xz compressing enwik8: bzip3 was 7x faster and had ~20% better compression

3 days ago

adrian_b

I have also tested bzip3 and initially I encountered a great number of cases where bzip3 was much better than zstd from all points of view, i.e. compression ratio and execution times, for any zstd options.

Unfortunately, later I have also found other cases where the bzip3 performance was not so good.

I spend a lot of time on compression/decompression, but sadly there is no algorithm that can be said to be superior to all others in all circumstances, so for optimum results you must still be prepared to use any of them.

Even the ancient gzip is preferable in certain cases, when speed is more important than compression ratio, because sometimes it happens to provide a better compromise than newer algorithms.

2 days ago

thesz

BWT is a prediction by partial match (PPM) in disguise.

Consider "bananarama":

  "abananaram"
  "amabananar"
  "ananaramab"
  "anaramaban"
  "aramabanan"
  "bananarama"
  "mabananara"
  "nanaramaba"
  "naramabana"
  "ramabanana"
The last symbols on each line get context from first symbols of the same line. It is so due to rotation.

But, due to sorting, contexts are not contiguous for the (last) predicted character, and long dependencies are broken. This breaking of long dependencies is why MTF, which implicitly transforms direct symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.

[1] https://en.wikipedia.org/wiki/Zipf%27s_law
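The rotation table above is exactly what a toy BWT computes, and following it with MTF shows the Zipfian skew described here. A minimal sketch (fine for "bananarama", hopelessly slow for real inputs, where suffix arrays are used instead; note a real codec also records an index to make the transform invertible):

```python
def bwt(s):
    # Last column of the sorted rotation table.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(s):
    # Move-to-front: recently seen symbols get small indices.
    table = sorted(set(s))
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

transformed = bwt("bananarama")
print(transformed)        # similar contexts cluster the predicted symbols
print(mtf(transformed))   # mostly small numbers -> cheap to entropy-code
```

The BWT output is "mrbnnaaaaa" and the MTF output is [2, 4, 3, 4, 0, 4, 0, 0, 0, 0]: long runs of zeros, which is what makes the subsequent entropy coding effective.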

Given that, the author may find PPM*-based compressors to perform better compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: a "durilka-bububu" compressor that uses PPM fares better than BWT, by almost a third.

3 days ago

fl0ki

This seems as good a thread as any to mention that the gzhttp package in klauspost/compress for Go now supports zstd on both server handlers and client transports. Strangely this was added in a patch version instead of a minor version despite both expanding the API surface and changing default behavior.

https://github.com/klauspost/compress/releases/tag/v1.18.4

3 days ago

klauspost

About the versioning, glad you spotted it anyway. There isn't as much use of the gzhttp package compared to the other ones, so the bar is a bit higher for that one.

Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.

3 days ago

fl0ki

Yeah, I make it a habit to read the changelogs of every update to every direct dependency. I was anticipating this change for years, thanks for doing it!

3 days ago

terrelln

> Also making good progress on getting a slimmer version of zstd into the stdlib

Awesome! Please let me know if there is anything I can do to help

3 days ago

hexxagone

Notice that bzip3 has close to nothing to do with bzip2. It is a different BWT implementation with a different entropy codec, from a different author (as noted in the GitHub description "better and stronger spiritual successor to BZip2").

3 days ago

lucb1e

Was the name used with permission? Even if not trademarked (because open source freedom woohoo), it's a bit weird to release, say, Windows 12 without permission from the authors of Windows 11

I tried looking it up myself but it doesn't say in the readme or doc/ folder, there is no mention of any of the Bzip2 authors, and there is no website listed so I presume this Github page is canonical

3 days ago

tosti

Free software doesn't have a lot of trademarks, with the notable exception of Linux.

Also, the name is the algorithm. Bzip2 has versions and bzip3 is something else which has its own updated versions. Programs that implement a single algorithm often follow this pattern.

2 days ago

lucb1e

> Free software doesn't have a lot of trademarks

Hence my saying "not trademarked (because open source freedom woohoo)". I understand bzip3 is legal, but my question is whether it's a dick move

> the name is the algorithm. Bzip2 has versions and bzip3 is something else

I don't understand this. If bzip3 is "something else" compared to bzip2 then how can both be named after the algorithms they use? Nothing in bzip2's algorithm is related to the digit 2. If it used, say, powers of 2 centrally and bzip3 used pi for something, then the naming could make sense since it's indeed not a version number, but I don't remotely see this argument in these algorithms

Looking on Wikipedia and bzip2's website, there is no explanation of the name, and yesterday I read somewhere (either in the OP or in another comment) that it would stand for "better zip". It has nothing to do with PKZIP though. If they had both implemented pkzip's format, then a "spiritual successor" to bzip2 would be something like uzip, uzip2 (replacing "better" with "ultimate"), bbzip2, or any of a million other options, but that's not the pattern they chose to follow

How is this not just taking their established name for your own project with the version number +1?

2 days ago

eichin

Interesting detail on the algorithm, but it seems to completely miss that if you care about non-streaming performance, there are parallel versions of xz and gzip (pixz encodes compatible metadata about the breakup points, so that while plain xz can still decompress its output, pixz can use as many cores as you let it have). Great for disk-image OS installers (the reason I was benchmarking it in the first place - but this was about 5 years back; I don't know if those have gotten upstreamed...)

3 days ago

shawn_w

There's also a parallel version of bzip2, pbzip2.

https://man.archlinux.org/man/pbzip2.1.en

And zstd is multi threaded from the beginning.

3 days ago

joecool1029

Just use zstd unless you absolutely need to save a tiny bit more space. bzip2 and xz are extremely slow to compress.

3 days ago

lucb1e

> bzip2 and xz are extremely slow to compress

This depends on the setting. At setting -19 (not even using --long or other tuning), Zstd is 10x slower to compress than bzip2, and 20x slower than xz, and it still gets a worse compression ratio for anything that vaguely looks like text!

But I agree if you look at the decompression side of things. Bzip2 and xz are just no competition for zstd or the gzip family (but then gzip and friends have worse ratios again, so we're left with zstd). Overall I agree with your point ("just use zstd") but not for the fast compression speed, if you care somewhat about ratios at least

3 days ago

hexxagone

In the LZ high compression regime where LZ can compete in terms of ratio, BWT compressors are faster to compress and slower to decompress than LZ codecs. BWT compressors are also more amenable to parallelization (check bsc and kanzi for modern implementations besides bzip3).

3 days ago

silisili

I'd argue it's more workload dependent, and everything is a tradeoff.

In my own testing of compressing internal generic json blobs, I found brotli a clear winner when comparing space and time.

If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

zstd is good for many use cases, too, perhaps even most...but I think just telling everyone to always use it isn't necessarily the best advice.

3 days ago

joecool1029

> If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

It’s slower and compresses less than zstd. gzip should only be reached for as a compatibility option, that’s the only place it wins, it’s everywhere.

EDIT: If you must use it, use the modern implementation, https://www.zlib.net/pigz/

3 days ago

adrian_b

Any claims about compression programs are extremely data-dependent, so any general claim will be false for certain test cases.

I do a lot of data compression and decompression, and I would have liked a lot to find a magic algorithm that works better than all others, to simplify my work.

After extensive tests I have found no such algorithm. Depending on the input files and depending on the compromise between compression ratio and execution times that is desired, I must use various algorithms, including zstd and xz, but also bzip2, bzip3 and even gzip.

I use quite frequently gzip (executed after lrzip preprocessing) for some very large files, where it provides better compression at a given execution time, or faster execution at a given compression ratio, than zstd with any options.

Of course for other kinds of files, zstd wins, but all the claims that zstd should be used for ALL applications are extremely wrong.

Whenever you must compress or decompress frequently, you must test all available algorithms with various parameters, to determine what works best for you. For big files, something like lrzip preprocessing must also be tested, as it can change a lot the performance of a compression algorithm.

2 days ago

NooneAtAll3

why would one even care about compression speed on minecraft ComputerCraft machine?

size and decompression are the main limitations

3 days ago

pella

imho: the future is a specialized compressor optimized for your specific format. ( https://openzl.org/ , ... )
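A toy illustration of the idea (not how OpenZL actually works): if you know the format, you can rearrange the data so a general-purpose codec sees longer runs of similar bytes, e.g. storing a table column-wise instead of row-wise before compressing. The record layout here is invented:

```python
import gzip

# Hypothetical log records: each row mixes an id, a fixed tag, and a status.
rows = [f"{i},sensor-A,OK" for i in range(5_000)]
row_wise = "\n".join(rows).encode()

# Format-aware preprocessing: group each column's values together,
# so the tag and status columns become one long repetitive run each.
cols = zip(*(r.split(",") for r in rows))
col_wise = "\n".join(" ".join(c) for c in cols).encode()

print(len(gzip.compress(row_wise, 9)), len(gzip.compress(col_wise, 9)))
```

The preprocessing is trivially reversible, which is the point: the "specialized compressor" is mostly a reversible transform in front of an ordinary codec.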

3 days ago

lucb1e

It's good, but is it "the future" when it's extra work?

Consider that you could hand-code an algorithm to recognize cats in images but we would rather let the machine just figure it out for itself. We're kind of averse to manual work and complexity where we can brute force or heuristic our way out of the problem. For the 80% of situations where piping it into zstd gets you to stay within budget (bandwidth, storage, cpu time, whatever your constraint is), it's not really worth doing about 5000% more effort to squeeze out thrice the speed and a third less size

It really is considerably better, but I wonder how many people will do it, which means less implicit marketing by seeing it everywhere like we do the other tools, which means even fewer people will know to do it, etc.

3 days ago

cgag

This seems very cool. Was going to suggest submitting it, but I see there was a fairly popular thread 5 months ago for anyone interested: https://news.ycombinator.com/item?id=45492803

3 days ago

srean

That is an interesting link.

Does gmail use a special codec for storing emails ?

3 days ago

duskwuff

The biggest savings for a service like GMail are going to be based around deduplication - e.g. if you can recognize that a newsletter went out to a thousand subscribers and store those all as deltas from a "canonical" copy - congratulations, that's >1000:1 compression, better than you could achieve with any general-purpose compression. Similarly, if you can recognize that an email is an Amazon shipping confirmation or a Facebook message notification or some other commonly repeated "form letter", you can achieve huge savings by factoring out all the common elements in them, like images or stylesheets.
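A sketch of the delta idea with stdlib difflib (purely illustrative; whatever GMail actually does is not public, and the messages here are made up): keep one canonical copy and store each near-duplicate as a diff against it.

```python
import difflib

canonical = "Hello NAME,\nYour order has shipped.\nThanks,\nShop\n".splitlines(keepends=True)

def make_delta(variant_lines):
    # Store only the line-level diff against the canonical copy.
    return list(difflib.ndiff(canonical, variant_lines))

def restore(delta):
    # difflib.restore(delta, 2) recovers the "variant" side of the diff.
    return list(difflib.restore(delta, 2))

variant = "Hello Alice,\nYour order has shipped.\nThanks,\nShop\n".splitlines(keepends=True)
delta = make_delta(variant)
print("".join(restore(delta)))
```

Only the changed lines carry real cost in the stored delta; the shared "form letter" body is stored once.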

3 days ago

dataflow

I kind of doubt they would do this to be honest. Every near-copy of a message is going to have small differences in at least the envelope (not sure if encoding differences are also possible depending on the server), and possibly going to be under different guarantees or jurisdictions. And it would just take one mistake to screw things up and leak data from one person to another. All for saving a few gigabytes over an account's lifetime. Doesn't really seem worth it, does it?

3 days ago

srean

That's why a base and a delta. Whereas PP was talking about general compression algorithm, my question was different.

In line with the original comment, I was asking about specialized "codecs" for gmail.

Humans do not read the same email many times. That makes it a good target for compression. I believe machines do read the same email many times, but that could be architected around.

2 days ago

srean

Yes.

These and other email specific redundancies ought to be covered by any specialized compression scheme. Also note, a lot of standard compression is deduplication. Fundamentally they are not that different.

Given that one needs to support deletes, this will end up looking like a garbage collected deduplication file system.

2 days ago

jiggawatts

I recently did some work on genomics / bioinformatics, where terabyte-sized text datasets are common and are often compressed and decompressed multiple times in some workloads. This often becomes the serial bottleneck we all know and love from Amdahl's law.

I ran a bunch of benchmarks, and found that the only thing that mattered was if a particular tool or format supported parallel compression and/or parallel decompression. Nothing else was even close as a relevant factor.

If you're developing software for processing even potentially large files and you're using a format that is inherently serial, you've made a mistake. You're wasting 99.5% of a modern server's capacity, and soon that'll be 99.9%.

It really, really doesn't matter if one format is 5% faster or 5% bigger or whatever if you're throwing away a factor of 200 to 1,000 speedup that could be achieved through parallelism! Or conversely, the ability to throw up to 1,000x the compute at improving the compression ratio in the available wall clock time.

Checksumming, compression, decompression, encryption, and decryption over bulk data must all switch to fully parallel codecs, now. Not next year, next decade, or the year after we have 1,000+ core virtual machines in the cloud available on a whim. Oh wait, we do already: https://learn.microsoft.com/en-us/azure/virtual-machines/siz...

Not to mention: 200-400 Gbps NICs are standard now in all HPC VMs and starting to become commonplace in ordinary "database optimised" cloud virtual machines. Similarly, local and remote SSD-backed volumes that can read and write at speeds north of 10 GB/s.

There are very few (any?) compression algorithms that can keep up with a single TCP/IP stream on a data centre NIC or single file read from a modern SSD unless using parallelism. Similarly, most CPUs struggle to perform even just SHA or AES at those speeds on a single core.
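One reason gzip survives in these pipelines is that its framing makes pigz-style parallelism easy: compress chunks independently and concatenate the members, and any gzip decoder reads the result as one stream. A rough sketch (a thread pool works because zlib releases the GIL; real tools like pigz are far more careful about chunk boundaries and shared dictionaries):

```python
import gzip, io
from concurrent.futures import ThreadPoolExecutor

def parallel_gzip(data, chunk_size=1 << 20):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        members = pool.map(gzip.compress, chunks)
    # Concatenated gzip members form one valid gzip stream.
    return b"".join(members)

data = b"some large, compressible payload " * 200_000   # ~6.6 MB
blob = parallel_gzip(data)
# A plain single-threaded gzip reader consumes all members back to back.
print(gzip.GzipFile(fileobj=io.BytesIO(blob)).read() == data)
```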

2 days ago

hexxagone

Then you should take a look at https://github.com/flanglet/kanzi-cpp: it is optimized for fast roundtrips, multi-threaded by design and produces a seekable bitstream.

2 days ago

cobbzilla

My first exposure to bzip: The first Linux kernels I ever compiled & built myself (iirc ~v2.0.x), I packed as .tar.bz2 images. Ah the memories.

Yes, there are better compression options today.

3 days ago

Grom_PE

PPMd (of 7-Zip) would beat BZip2 for compressing plain text data.

3 days ago

vintermann

If you're implementing it for Computercraft anyway, there's no reason to stick to the standard. It's well known that bzip2 has a couple of extra steps which don't improve compression ratio at all.

I suggest implementing Scott's Bijective Burrows-Wheeler variant on bits rather than bytes, and do bijective run-length encoding of the resulting string. It's not exactly on the "pareto frontier", but it's fun!

3 days ago


roytam87

I wonder why people think there is only 1 bzip (which seems to be bzip2), when the original bzip exists: https://unix.stackexchange.com/a/125803 and also bzip3: https://github.com/iczelia/bzip3

2 days ago


kergonath

Bzip2 is slow. That’s the main issue. Gzip is good enough and much faster. Also, the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files is not a deal breaker, but it is annoying.

3 days ago

nine_k

Gzip is woefully old. Its only redeeming value is that it's already built into some old tools. Otherwise, use zstd, which is better and faster, both at compression and decompression. There's no reason to use gzip in anything new, except for backwards compatibility with something old.

3 days ago

kergonath

> Otherwise, use zstd, which is better and faster

Yes, I do. Zstd is my preferred solution nowadays. But gzip is not going anywhere as a fallback because there is a surprisingly high number of computers without a working libzstd.

3 days ago

duskwuff

One other redeeming quality that gzip/deflate does have is its low memory requirement (~32 KB per stream). If you're running on an embedded device, or if you're serving a ton of compressed streams at the same time, this can be a meaningful benefit.
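The window (and thus the memory) is even tunable below 32 KB via zlib's wbits parameter, at the cost of missing matches farther back than the window. A stdlib sketch:

```python
import random, zlib

random.seed(1)
block = random.randbytes(4_096)
data = block + block                     # exact repeat at 4 KB distance

def deflate(data, wbits):
    # wbits 9..15 -> window of 512 B .. 32 KB (and proportional memory).
    c = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()

small = deflate(data, 9)    # 512 B window: cannot reach back to the repeat
big = deflate(data, 15)     # 32 KB window: second block is one long match

print(len(small), len(big))
```

The zlib header records the window size, so a standard decompressor handles both streams.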

3 days ago

adrian_b

The claim that zstd is "better and faster", without additional qualifications, is false and misleading.

Indeed for many simple test cases zstd is both better and faster.

Despite that, it is possible to find input files for which zstd is either worse or slower than gzip, for any zstd options.

I have tested zstd a lot, but eventually I have chosen to use lrzip+gzip for certain very large files (tens of GB file size) with which I am working, because zstd was always either worse or slower. (For those files, at the same compression ratio and on a 6-year old PC, lrzip+gzip has a compression speed always greater than 100 MB/s, while zstd only of 30 to 40 MB/s.)

There are also other applications, where I do use zstd, but no compressing program is better than all the others in ALL applications.

2 days ago

duskwuff

bzip2 is particularly slow because the transform it depends on (the BWT) is "intrinsically slow" - it depends on cache-unfriendly operations with long dependency chains, preventing the CPU from extracting any parallelism:

https://cbloomrants.blogspot.com/2021/03/faster-inverse-bwt....

3 days ago

sedatk

> the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files

TIL. Now that's why gzip has a file header! But, tar.gz compresses even better, that's probably why it hasn't caught on.

3 days ago

pocksuppet

tar packs multiple files into one. If you concatenate two gzipped files and unzip them, you just get a concatenated file.

3 days ago

sedatk

Ah okay, I thought gzip would support decompressing multiple files that way.

3 days ago

kergonath

How it works is, if you have two files foo.gz and bar.gz, and cat foo.gz bar.gz > foobar.gz, then foobar.gz is a valid gzip file and uncompresses to a single file with the contents of foo and bar.

It’s handy because it is very easy to just append stuff at the end of a compressed file without having to uncompress-append-recompress. It is a bit niche but I have a couple of use cases where it makes everything simpler.
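This is straightforward to check from Python's stdlib, whose gzip reader also consumes multi-member files:

```python
import gzip, io

foo = gzip.compress(b"contents of foo\n")
bar = gzip.compress(b"contents of bar\n")
foobar = foo + bar   # the `cat foo.gz bar.gz > foobar.gz` equivalent

# One valid gzip stream, decompressing to the concatenated contents.
print(gzip.GzipFile(fileobj=io.BytesIO(foobar)).read())
```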

3 days ago

bmacho

tar supports that type of concatenation, so you can concatenate tar.gz files and unpack them all into separate files

3 days ago

sedatk

I know, but I've been always confused why a gzip file would have a filename field in its header if it's supposed to contain only one file. Obviously it's good to keep a backup of original filename somewhere, but it's confusing nonetheless.

2 days ago

saidnooneever

the catting issue might be more a problem with the bzip program's implementation than with the algorithm (it could expect an array of compressed files). that would only be impossible if the program cannot determine the length of the data from the file header, which again is technically not about the compression algo but about the file format it's carried in.

that being said, speed is important for compression, so for systems like webservers etc it's an easy sell ofc. very strong point (and smarter implementation in programs) for gzip

3 days ago

nine_k

Bzip2 is great for files that are compressed once, get decompressed many times, and the size is important. A good example is a software release.

3 days ago

pocksuppet

So is xz, or zstd, and the files are smaller. bzip2 disappeared from software releases when xz was widely available. gzip often remains, as the most compatible option, the FAT32 of compression algorithms.

3 days ago

lucb1e

Huh? Only if it gets decompressed a few times, I would say, because it's so extremely slow at it.

Like a software installation that you do one time. I wouldn't even want it for updates if I need to update large data files. The only use case I see is a first-time install where users are okay waiting a while, and small code patches that are distributed to a lot of people.

(Or indeed if size is important, but then again bzip2 only shines if the data is text-like. I don't really find this niche knowledge worth teaching for a few % of optimization, tbh. Better to teach general principles so people can find a fitting solution if they ever find themselves under specific constraints like OP's.)

3 days ago

joecool1029

> the catting issue might be more an implementation of bzip program problem than algorithm (it could expect an array of compressed files). that would only be impossible if the program cannot reason about the length of data from file header, which again is technically not something about compression algo but rather file format its carried through.

Long comment just to say: 'I have no idea what I'm writing about'

These compression algorithms do not have anything to do with filesystem structure. Anyway the reason you can’t cat together parts of bzip2 but you can with zstd (and gzip) is because zstd does everything in frames and everything in those frames can be decompressed separately (so you can seek and decompress parts). Bzip2 doesn’t do that.

So like, another place bzip2 sucks ass is large archives: you need to read through the entire archive before you can decompress the part you want, and without parity data that makes corruption far more likely to cost you the whole archive. Really, don't use it unless you have a super specific use case and know the tradeoffs; for the average person it was great back when we would spend the time compressing to save the time sending over dialup.

3 days ago

duskwuff

> zstd does everything in frames and everything in those frames can be decompressed separately (so you can seek and decompress parts). Bzip2 doesn’t do that.

This isn't accurate.

1) Most zstd streams consist of a single frame. The compressor only creates multiple frames if specifically directed to do so.

2) bzip2 blocks, by contrast, are fully independent - by default, the compressor works on 900 kB blocks of input, and each one is stored with no interdependencies between blocks. (However, software support for seeking within the archive is practically nonexistent.)
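The block size is even visible in the file header: a bzip2 stream starts with the magic "BZh" followed by an ASCII digit giving the block size in units of 100 kB. A small check using Python's stdlib bz2 module:

```python
import bz2

# The -1 .. -9 CLI levels map to 100 kB .. 900 kB blocks, and the chosen
# level is recorded as an ASCII digit right after the "BZh" magic bytes.
for level in (1, 9):
    header = bz2.compress(b"some input", compresslevel=level)[:4]
    print(header)  # b'BZh1', then b'BZh9'
```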

3 days ago

prerok

So... it's actually a reasonable objection to bzip2? I mean, you explained why it does not work with bzip2.

I think their argument is sound, and it makes bzip2 less useful in certain situations. I was once saved while resolving a problem when I figured out that concatenating gzipped files just works out of the box. If it didn't, it would have meant a bit more code, lots of additional testing, etc.

3 days ago

stefan_

bzip and gzip are both horrible, terribly slow. Wherever I see "gz" or "bz" I immediately rip that nonsense out for zstd. There is such a thing as a right choice, and zstd is it every time.

3 days ago

kergonath

> Wherever I see "gz" or "bz"

That should not happen too often, considering that IIRC bzip lasted only a couple of months before being replaced by bzip2.

3 days ago

laurencerowe

lz4 can still be the right choice when decompression speed matters. It's almost twice as fast at decompression with similar compression ratios to zstd's fast setting.

https://github.com/facebook/zstd?tab=readme-ov-file#benchmar...

3 days ago

anthk

pigz is damn fast at compressing. Also, a VAX with NetBSD can run gzip. So there it is. Go try these new fancy formats on a VAX, I dare you.

And yes, I prefer LZMA over the obsolete bzip2 any day, but gzip is like the ZIP of free formats, modulo packaging, which is the job of tar.

3 days ago

singpolyma3

Neither has been good enough for years.

3 days ago

ryan14975

[flagged]

3 days ago

foltik

LLM slop, and also incorrect.

900 kB is not "buffering the entire input"; it's just a slightly larger buffer size. Both are streaming algorithms.
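Python's stdlib shows the streaming nature directly: bz2.BZ2Compressor accepts input chunk by chunk and never needs the whole file in memory. A minimal sketch:

```python
import bz2

# Feed the compressor incrementally; it emits output as blocks fill up,
# and flush() finalizes the stream.
comp = bz2.BZ2Compressor(9)
out = b"".join([comp.compress(b"chunk-a " * 1000),
                comp.compress(b"chunk-b " * 1000),
                comp.flush()])
print(bz2.decompress(out) == b"chunk-a " * 1000 + b"chunk-b " * 1000)  # True
```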

2 days ago

jiggawatts

It's a bot/shill account advertising "zip as a service". I wish I was kidding.

2 days ago

comet_browser

[flagged]

3 days ago

nvme0n1p1

ai slop account

3 days ago