Design of GNU Parallel (2015)

170 points
13 days ago
by Havelock

Comments


ketzu

This was quite interesting to look through!

Perl 5.8.0 is over 20 years old (https://dev.perl.org/perl5/news/2002/07/18/580ann/) while CentOS 3.9 was released in 2007! It somehow seems both not-that-old and ancient at the same time.

My personal anecdote with gnu parallel was running into it while working in academia. It worked well and saved me some time, but I felt that it was unreasonable of a tool to ask for a citation to parallelise a script - it seemed that matplotlib, jupyter and co would need one as well. On the other hand, I decided to not use it, because I also feel that authors can ask for whatever they want.

12 days ago

ajsnigrutin

Yep, that's the great thing about perl... take a 20 year old script and it still works today. In comparison, if they used python, they'd be using python 2.2.

12 days ago

fmajid

That's basically a side-effect of Perl being a dead language, frozen because Perl 6 will never happen. It's surprisingly hard to eradicate, however.

12 days ago

chungy

Perl isn't dead, not by a long shot. Perl 6 happened too, and because compatibility was never even really a thought, it was renamed to Raku instead. There have been talks for a few years of finally bumping Perl's major version in order to change the defaults.

12 days ago

VyseofArcadia

There's value in stability, though.

Maybe it's not dead. Maybe it's just finished. Does everything need to keep changing? Change isn't always improvement, and even if it is, if you have to maintain backwards compatibility, sometimes the conceptual load of having to keep the old ways and the new ways in your head all the time isn't worth it.

Maybe we should start letting things just be finished.

12 days ago

uhtred

Perl 5 is still actively developed though, and presumably will become Perl 7 at some point.

Why does a language being stable mean it's dead? Is Awk dead?

12 days ago

ajsnigrutin

Not making breaking changes every few years doesn't mean that the language is dead. It's still being developed and new versions of perl are still coming out.

12 days ago

attractivechaos

> That's basically a side-effect of Perl being a dead language

Keeping long-term backward compatibility does not necessarily mean dying. C is 50 years old and still alive. I have written a lot more Perl than Python. IMHO, Perl is dying because its syntax is arcane and confusing. We can't solve this problem unless we design a brand new language.

12 days ago

chubot

Python 2.7 was released in 2010, and is even more frozen than Perl!

It still works, though you would have to archive/vendor dependencies

12 days ago

Ferret7446

It's a request, not a requirement. I see nothing wrong with the request nor if an individual decides to not cite it due to their principles/judgement.

12 days ago

ketzu

As I said, I think it is okay for authors to make any request they want, it is their software after all.

But I still think asking for a citation for GNU parallel is unreasonable. There is a huge body of software, of which GNU parallel is probably the least important, that contributed to (at least my) research. Blowing up citation lists with those makes the citation list borderline useless.

It makes citations into advertising space for software - it's bad enough being coerced into making them an advertisement for reviewers' papers.

12 days ago

a2800276

Wait what: `parallel` is a Perl script!? [1]

I would have thought it's black magic with assembler optimisations for MIPS and special considerations for HP-UX...

This is such a lovely and interesting writeup, it's wonderful that people take their time to share so generously!

[1]: an 11k LOC Perl script, you can read along here: https://github.com/gitGNU/gnu_parallel/blob/master/src/paral...

13 days ago

mhh__

assembly optimizations for starting processes?

12 days ago

remram

Maybe for reading the input, splitting it, and assembling the possibly-very-long argument lists passed to the processes.

12 days ago

chubot

Those things are all very fast compared to starting a process

12 days ago

remram

Command lines can be very long, so you can potentially read a million lines between executing processes.

12 days ago

NortySpock

I found GNU parallel useful when I wanted to queue up transcoding of flac files to mp3 on my Raspberry Pi. A few ffmpeg flags plus a list of files meant I could easily just saturate one job per core with a one-line bash command.

13 days ago

krylon

I like to use ts(1) for that. http://vicerveza.homeunix.net/~viric/soft/ts/

12 days ago

hkt

I've used it to parallelise updating hundreds of helm releases whose CI pipelines had ceased to exist. It is a neat tool.

13 days ago

noloblo

Can you please share the example code in gnu parallel

12 days ago

cricalix

parallel is a tool I've reached for many times; the citation bit it prints is odd - it seems to assume that the general use case is research/academic - but easily squelched.

A sample use case would be having a file that has words in it, one per line, and you want to run a program that operates on each word (device name, dollar amount, whatever). Sure, you can use a loop, but if the words and actions are independent, parallel is one way to spin up N copies of your program and pass it a single word from the file. Can get around Python's GIL without having to use multiprocessing or threads (as a more concrete example).
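A minimal sketch of that one-word-per-line use case, under some assumptions: the words file and the per-word "program" (here just uppercasing the word) are made-up stand-ins. It uses `xargs -P` so that it runs without GNU parallel installed; with parallel, something along the lines of `parallel -j 4 ./process_word {} :::: words.txt` should be roughly equivalent, where `./process_word` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical input: one word per line.
workdir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta > "$workdir/words.txt"

# Run up to 4 copies at once (-P 4), each receiving exactly one word (-n 1).
# The inner script is a stand-in for the real program: it uppercases the
# word and writes the result to <word>.out. $0 is the work directory and
# $1 is the word that xargs appended.
xargs -P 4 -n 1 sh -c 'printf "%s\n" "$1" | tr a-z A-Z > "$0/$1.out"' "$workdir" \
  < "$workdir/words.txt"

# Collect the results in a deterministic order.
count=$(ls "$workdir" | grep -c '\.out$')
result=$(sort "$workdir"/*.out | paste -s -d ' ' -)
echo "$count files: $result"

rm -rf "$workdir"
```

Because each word is handled by a separate process, the invocations do not share an interpreter, which is why this sidesteps the GIL for Python programs.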

Didn't realise that it busy waits, but I'm typically running it on a not very busy server with tens of cores.

13 days ago

chungy

Thankfully both Debian and Arch patch out the citation nonsense.

13 days ago

RhysU

It is "nonsense" because...?

A) You don't understand. Please read the "Citation notice" section in the article.

B) You understand but don't use GNU Parallel.

C) You understand and use GNU Parallel in a non-academic setting and find the hassle of supplying --no-notice to be onerous vs the effort to write/maintain your own tool.

D) You understand and use GNU Parallel in an academic setting and have cited Ole or plan to cite Ole.

From the article, nearly 10 years ago Ole added the citation behavior after discussing it with his users: https://lists.gnu.org/archive/html/parallel/2013-11/msg00006...

Ole's citations took off roughly coincident with this behavior being added: https://scholar.google.com/citations?hl=en&user=D7I0K34AAAAJ... (click "Cited By" and notice the bar chart).

12 days ago

hexane360

It's nonsense because the standard in academic settings is to cite works which contribute scientifically to the current work, not merely utilities. If I publish a paper on a command line tool for parallel processing, inspired by features from GNU parallel, I would cite GNU parallel. But if I'm doing (for instance) computational biology work, I'm not going to cite:

- the Linux kernel
- Python
- Matlab
- GNU parallel
- RFC 793
- Every other program I use

Asking for citations is fine. But GNU parallel wants to treat it like a requirement of using the software, without making it a condition of the copyright: "== Is the citation notice compatible with GPLv3? ==

Yes. The wording has been cleared by Richard M. Stallman to be compatible with GPLv3. This is because the citation notice is not part of the license, but part of academic tradition."

This is disingenuous, because citing every tool you use in preparing a scientific work is not part of academic tradition. And the statement that "If you pay 10000 EUR you should feel free to use GNU Parallel without citing." doesn't make any sense in the "academic tradition" framing. If Ole thinks citations are required by academic tradition, that shouldn't change if I pay him enough money.

"If you disagree with Richard M. Stallman's interpretation and feel the citation notice does not adhere to GPLv3, you should treat the software as if it is not available under GPLv3. And since GPLv3 is the only thing that would give you the right to change it, you would not be allowed to change the software.

In other words: If you want to remove the citation notice to make the software compliant with your interpretation of GPLv3, you first have to accept that the software is already compliant with GPLv3, because nothing else gives you the right to change it. And if you accept this, you do not need to change it to make it compliant."

And this is legal nonsense. If I release something under a license, and then break that license, that doesn't nullify the original license. Claiming otherwise would allow me to un-copyleft someone else's code.

12 days ago

justeleblanc

Also, in what world is OSS financed by citations (which is stated as fact in the manpage)? The whole thing is just bizarre. Do I have to cite the manufacturer of my desk because I wrote my paper there?

12 days ago

RhysU

> It's nonsense because the standard in academic settings is to cite works which contribute scientifically to the current work, not merely utilities.

Whether or not it's standard is irrelevant. Ole asked you to cite him if you use it. So, if you publish academically, either don't use it or cite him. If not using GNU Parallel hinders your science then the tool must be material to your work flows.

For comparison, how many dumb citations do people add to their papers that point to marginally relevant work coming out of the same research center or academic lineage? Those aren't scientifically relevant but they are standard. Let's not pretend the academy is full of citation purists.

12 days ago

chungy

At least in the real world, free software doesn't demand that you agree with authors or do anything really. For as long as Ole keeps Parallel as free software, we can use it regardless of complying with requests.

Quite honestly, I think the behavior is jerkishness of the highest order. A nice request could be made in the documentation; instead, the path chosen is to bully users of the software.

Once more, because it is free software, we are free to use it despite what Ole thinks. We are free to patch it out too.

12 days ago

RhysU

Ignoring his wishes and patching around them is also being a jerk. The dude didn't have to open source or maintain anything.

12 days ago

hexane360

"Ole asked you to cite him if you use it. So, if you publish academically, either don't use it or cite him."

Why? Whether something has contributed meaningfully to my research is my decision, not Ole's. Not having light "hinders my science", so I'll be sure to cite Edison on all my papers.

I agree with the sibling commentator that Ole's behavior is jerkish. Not because he asked for citations, but that he misleads users by claiming his request is standard, when it is decidedly not. He also obfuscates the voluntary nature of his request as much as possible, to make it seem like citing is a legal requirement. And he is inflammatory in responding to people who make the perfectly valid decision to not cite him, or to patch the notice out.

12 days ago

RhysU

> Why?

The Golden Rule.

You would be pissed if you spent years on something, felt it was a contribution, saw the community use it, asked them to cite it, and weren't cited.

12 days ago

malborodog

Or ya know, just use it and don't cite him?? Seems pretty easy!

12 days ago

Ferret7446

> GNU parallel wants to treat it like a requirement of using the software

I have never felt this, and that is not how FOSS works. By definition, they cannot restrict how you use the software. Thus, the citation request is just a request. Hypothetically, you could slander and ruin the author's life (the extreme polar opposite of a citation) and still freely use the software.

This is no different than an author asking users to retweet, post on reddit, etc. Certainly it may be annoying to some, but it does not restrict how you may use or fork the software.

One could fork GNU parallel to remove the copyright, and let the democratic public user base vote on whether they care enough to use your fork, or if they think you (or the other author) are an asshole, etc.

12 days ago

hexane360

>that is not how FOSS works

Right, and that's what the rest of my sentence was meant to convey. However, the author goes to great lengths to obfuscate this fact, as this FAQ demonstrates: https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

Not once in that 2000-word rant does Ole outright state that citation is entirely voluntary, and not a condition of the license. Instead, he describes the notice's "GPLv3 compatibility" in a way that incorrectly states you must either respect the citation notice or treat the software as if it is not open-source. He also responds with vitriol to people who do choose to fork his software, as evidenced here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905674

I wouldn't have a problem with the program's current behavior if it simply made you type 'i understand' instead of 'will cite', and made clear that it was a non-binding request. As is, the program attempts to sound like a license agreement while Ole insists to maintainers it is not.

10 days ago

chungy

How about

E) I understand and use GNU Parallel and also completely disagree with the author's insistence that citing tools is appropriate.

Even in your second link, almost everything listed are papers about Parallel itself. If I was writing about Parallel, I'd be fine with citing it. If instead it's the means to another end, I wouldn't.

12 days ago

xyzzy_plugh

It's nonsense because a utility like parallel shouldn't require state, let alone state used only to disable a nag message. It's far less annoying to simply patch out the nag.

As others point out, it's further annoying because it doesn't even make any sense to begin with. If it was asking for donations or something I could maybe even get behind it, but the current message is pretentious and useless. It serves no real purpose.

12 days ago

RhysU

Then hop on the mailing list and suggest he set up a donation drop and donate.

12 days ago

xyzzy_plugh

Arguing with Ole is a waste of time. My parallel is patched, I don't care any longer.

12 days ago

caro11ne

Is this not covered in depth in the f.a.q? https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

10 days ago

BooneJS

Before GNU Parallel I used to use Ruby's workers and job queue to keep ${N} cores busy with work. It sorta worked like GNU parallel but was quite basic. I've since switched to using GNU Parallel. Stable code I don't have to write doesn't have to be maintained... not to mention it has more features than I normally supported.

12 days ago

Alifatisk

What did you use exactly? I am curious: Resque? Sidekiq?

12 days ago

BooneJS

Ruby's Queue structure to push work, and Thread for spinning up workers based on the number of cores on the machine. Main thread would push all commands to run to the Queue, followed by ${N} shutdown hints, and ${N} Threads would pick them off in a while loop that would only stop when it saw a shutdown command. Once the last thread consumed the last shutdown hint, all threads were done and the script would exit. This was barely one step beyond a bash script that backgrounded all tasks at once and swamped a host until it slowly finished up.

12 days ago

docandrew

I couldn’t make heads or tails of what this would be useful for from the OP (maybe it’s something I should already have known), but this from the official site was pretty helpful: https://www.gnu.org/software/parallel/parallel_cheat.pdf

13 days ago

psychphysic

That cheat sheet is super enlightening!

But quite useless as it'll print poorly and is overall a waste of resources to have that lovely beach scene in the background.

13 days ago

kakadzhun

Try this resource instead. Although it is 100 pages, the introductory part is already useful in and of itself!

https://zenodo.org/record/1146014/files/GNU_Parallel_2018.pd...

13 days ago

chungy

Does Ole remember to cite LibreOffice in the production of that document?

12 days ago

RadiozRadioz

The beach will certainly make the cheat sheet stick in my memory, I can tell you that much.

13 days ago

bloopernova

I was able to remove the background using LibreOffice to open the PDF.

12 days ago

mianos

I once replaced a 10 machine Hadoop cluster job with a python script and parallel on my laptop because I didn't want to wait for hours for it to finish.

The i7 in my laptop, with quite a few cores/threads and a few optimisations, got the job finished in 10 minutes.

(I later put the Hadoop use on my resume, not the GNU parallel. That's the joke of modern job hunting. There is no interest in what you did, just buzzwords and leetcode. Luckily there are still a few places that value real work or I'd be too old to get a job. :) )

12 days ago

ZoomZoomZoom

If anyone needs a pretty basic alternative with Windows support, there's Rush:

https://github.com/shenwei356/rush

I use it pretty extensively with ffmpeg, imagemagick and the like.

I'd been using the mmstick/parallel for a while, but it moved to RedoxOS repos and then stopped being updated, while still having some issues not ironed out.

12 days ago

seized

Parallel is a fun tool. I use it as a sort of simple slurm to distribute work over many VMs to process tens to hundreds of TBs of data. Sometimes across 2400+ cores.

12 days ago

michalc

I've never been sure if it's too much of a hack, but I've used GNU parallel in Docker containers as a quick and easy way of getting multiple processes running for web applications.

And with the `--halt now,done=1` option (that I think is relatively recent?) it means that if any of the parallel processes exit, parallel would exit itself, the whole container will shut down, and external orchestration would start another one if needed.

13 days ago

vrnvu

Cool tip, thanks for sharing! I love letting processes crash *when possible* on failures so the OS restarts them for me, versus trying to handle it manually at the process level.

12 days ago

KronisLV

I've used Supervisor pretty successfully for this as well: http://supervisord.org/

Here's an example Dockerfile snippet that installs it in a Debian/Ubuntu container during the container build:

  RUN apt-get update \
      && apt-get -yq --no-upgrade install \
          supervisor \
      && apt-get clean \
      && rm -rf /var/lib/apt/lists /var/cache/apt/*
Then it's possible to create a configuration file, for example /etc/supervisord.conf, to specify what should run and how:

  [supervisord]
  nodaemon=true
  
  [program:php-fpm]
  command=/usr/sbin/php-fpm8.0 -c /etc/php/8.0/fpm/php-fpm.conf --nodaemonize
  stdout_logfile=/dev/stdout
  stdout_logfile_maxbytes=0
  stderr_logfile=/dev/stderr
  stderr_logfile_maxbytes=0
  
  [program:nginx]
  command=/usr/sbin/nginx
  stdout_logfile=/dev/stdout
  stdout_logfile_maxbytes=0
  stderr_logfile=/dev/stderr
  stderr_logfile_maxbytes=0
And finally it can be run inside of the container entrypoint, along the lines of this in docker-entrypoint.sh:

  #!/bin/bash
  echo "Software versions..."
  nginx -V && supervisord --version
  
  echo "Running Supervisor..."
  supervisord --configuration=/etc/supervisord.conf
Here's more information about the configuration file format, in case anyone is curious: http://supervisord.org/configuration.html

It should be noted that this package will bring in some dependencies, though, which may or may not be okay, depending on how stringent you are about space usage and what's in your containers. An example for an Ubuntu container:

  The following NEW packages will be installed:
    libexpat1 libmpdec3 libpython3-stdlib libpython3.10-minimal libpython3.10-stdlib libreadline8 libsqlite3-0 media-types
    python3 python3-minimal python3-pkg-resources python3.10 python3.10-minimal readline-common supervisor
  0 upgraded, 15 newly installed, 0 to remove and 0 not upgraded.
  Need to get 6905 kB of archives.
  After this operation, 25.7 MB of additional disk space will be used.
(just found the piece of software itself useful for this use case, figured I'd share my experiences)

My problem is that it's not always immediately clear how software that would normally run as a systemd service could be launched in the foreground instead. It usually takes a bit of digging around.

13 days ago

michalc

I have previously thought a bit about using something like Supervisor. And if I was running something a bit closer to the metal, with no other infrastructure to restart stuff, then I would be much more pro.

But inside Docker, where something else already has the job of restarting things if they fall over, it feels a bit overcomplicated in that there are multiple ways of doing the restarting. Plus, I think there is a touch more visibility - it's all just command line arguments to parallel:

    parallel --will-cite --line-buffer --jobs 2 --halt now,done=1 ::: \
        "some_proc some args" \
        "another_proc some more args"
12 days ago

fbdab103

This is pretty crafty. I do not know Supervisor well enough - if one of the services fails, can you engineer Supervisor to also crash so that it would bubble up to the container infrastructure? My understanding is that standard Supervisor would let the process die and/or restart the service.

12 days ago

KronisLV

Supervisor allows you to have event listeners (e.g. for processes quitting/crashing), so you can use those to achieve that and kill supervisor itself. Here's an example of people doing just that: https://gist.github.com/tomazzaman/63265dfab3a9a61781993212f...

12 days ago

fbdab103

Neato. Do not have an immediate use case for this, but definitely something I will consider for the future.

12 days ago

imglorp

Don't forget "make -j" is another option.

12 days ago

fmajid

Or `xargs -P`

12 days ago

fbdab103

I was just attempting to parallelize a makefile (~500 files, ~20 minutes per file), and I was not happy with the experience. Make syntax for globbing is not ideal. Doubly so as my files had spaces inside of them. All solvable of course, but I feel more comfortable leaning on a parallel/xargs/find workflow than esoteric make syntax to handle the realities of filenames in the wild.
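A rough sketch of the find/xargs style of workflow mentioned above, assuming made-up file names and a stand-in "task" that only writes a marker file. The point it tries to show is that NUL-delimited names (`-print0` with `-0`) keep filenames with spaces intact, which is the part that gets awkward in make.

```shell
#!/bin/sh
src=$(mktemp -d)
# Hypothetical inputs; two of the names deliberately contain spaces.
touch "$src/first file.txt" "$src/second file.txt" "$src/third.txt"

# -print0 / -0 pass the names NUL-delimited, so spaces are not split.
# -P 2 runs two jobs at a time; -n 1 hands each job exactly one file.
# The inner script is a stand-in for the real 20-minute task: it writes
# a "<name>.done" marker next to each input file.
find "$src" -name '*.txt' -print0 |
  xargs -0 -P 2 -n 1 sh -c 'echo processed > "$1.done"' sh

done_count=$(find "$src" -name '*.done' | wc -l | tr -d ' ')
echo "$done_count"

rm -rf "$src"
```

Marker files like these are one simple way to approximate "restart only failed files": a rerun could skip any input that already has its marker. That is an assumption about how one might structure it, not something make gives you here.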

Which is a shame - 95% of my make usage is PHONY targets where I have a task and not a generated artifact. My current use case would have greatly benefited from the native parallelism and the ability to restart only failed files.

12 days ago

anthk

Parallel, vidir to edit directories with nvi/vim, moreutils, detox to strip out any non-typeable chars...

These are a must have today.

12 days ago

InfamousRece

moreutils has its own parallel utility that I actually prefer to GNU parallel.

12 days ago

anthk

No problem, they work almost the same, I think. Oh, another bunch of small tools to help yourself:

    - entr. It runs a command on file/directory changes.
    - spt. Simple pomodoro technique. A good timer to help yourself work and take rests.
    - herbe. It works great as a notifier for spt. Add "play" from sox to write a script that both notifies and plays a sound in parallel.
    - sox/ffmpeg/imagemagick. Audio, video and image production and conversion on the CLI. A must have.
    - catdoc/antiword/odt2txt/wordgrinder+sc-im+gnuplot. Word/Excel/LibreOffice file reading and editing on the terminal. Gnuplot will help with sc-im. This can be a beast over SSH. With Gnuplot compiled with sixel support (and XTerm) you can do magic.
    - iomenu. Pick from a list of items (one per line) and choose, e.g. cat bookmarks.txt | iomenu | xargs firefox. I think it has fuzzy-finding matches.

I have several more. Simple battery meter (sbm), grabc to grab a color from the screen, pointtools+catpoint to do "presentations" over a terminal, nncp-go+yggdrasil for ad-hoc networking and secure encrypted backups between devices...
12 days ago

andrewshadura

There's also paexec

12 days ago

rurban

I wrote down a small usage example here: https://savannah.gnu.org/forum/forum.php?forum_id=9197

No need for massive distributed clusters when you have a simple perl oneliner

12 days ago

rockwotj

I recently used parallel to write a 1TB data file for testing using all cores

  seq 0 10000 | parallel dd if=/dev/urandom of=/mnt/foo/input bs=10M count=10 seek={}0
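One hedged caveat on the command above: as far as I know, GNU dd truncates the output file at the seek offset unless `conv=notrunc` is given, so with several writers on one file it may be safer to pass that flag. A scaled-down sketch under stated assumptions (4 blocks of 1 KiB, a temp file instead of /mnt/foo/input, and `xargs -P` standing in for parallel so it runs without it):

```shell
#!/bin/sh
# Scaled-down stand-in for the 1TB case: 4 chunks of 1 KiB each.
out=$(mktemp)

# Each worker writes one 1 KiB block at block offset {} (0..3).
# conv=notrunc tells dd not to truncate the file at its seek offset,
# so a worker starting late cannot cut off blocks written past it.
printf '%s\n' 0 1 2 3 |
  xargs -P 4 -I{} dd if=/dev/urandom of="$out" bs=1024 count=1 seek={} conv=notrunc 2>/dev/null

# All four blocks should be present regardless of completion order.
size=$(wc -c < "$out" | tr -d ' ')
echo "$size"

rm -f "$out"
```

In the original one-liner the jobs start in ascending offset order, so the truncation race is probably rare in practice; the flag just removes the dependence on ordering.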
12 days ago

codetrotter

Was it noticeably different from

    dd if=/dev/urandom of=/mnt/foo/input bs=10M count=100000
in the amount of time that it took?
12 days ago

rockwotj

Yes, I had 16 cores and I gave up on this version after several minutes. I don't remember the disk throughput difference, but it was significant.

12 days ago

adastra22

This should be I/O-limited though.

12 days ago

rockwotj

It was an NVMe disk, so it required all cores to saturate the device.

11 days ago

globalreset

What's the best rewrite of GNU Parallel in Rust? That citation thing is so annoying.

12 days ago