Zimtohrli: A New Psychoacoustic Perceptual Metric for Audio Compression

95 points
12 days ago
by judiisis

Comments


Dave_Rosenthal

A few comments:

- My understanding is that a gammachirp is the established filter to use for an auditory filter bank--any reason you chose an elliptical filter instead?

- I didn't look too closely, but it seems like you are analyzing the output of the filter bank as real numbers. I highly recommend you convolve with a complex representation of the filter and keep all of the math in the complex domain until you collapse to loudness (see the first sketch below).

- I'd not bucket into discrete 100 Hz time slices; instead, just convolve the temporal masking function with the full time resolution of the filter bank output.

- You want to think about some volume normalization step that would give the final minimized Zimtohrli distance metric between A and B*x, where x is a free variable for volume (see the second sketch below). Otherwise, a perceptual codec that just tends to make things a bit quieter might get a bad score.

- For Fletcher-Munson, I assume you are just using a curve at a high-ish volume? If so, good :)

- Not sure how you are spacing filter bank center frequencies relative to ERB size, but I'd recommend oversampling by a factor of 2-3. (That is, a few filters per ERB).

Apologies if any of these are off base--I just took a quick look.
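To make the complex-domain, temporal-masking, and ERB-spacing suggestions concrete, here is a minimal numpy/scipy sketch. Assumptions: scipy >= 1.6 for signal.gammatone (scipy ships no gammachirp, so a plain gammatone stands in), and all parameter values are illustrative, not Zimtohrli's.

```python
import numpy as np
from scipy.signal import gammatone, hilbert, lfilter

def erb_center_freqs(f_lo=50.0, f_hi=16000.0, filters_per_erb=2.5):
    # Center frequencies on the Glasberg & Moore ERB-rate scale,
    # oversampled to a few filters per ERB as suggested above.
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    erb_rate_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    n = int(np.ceil((erb_rate(f_hi) - erb_rate(f_lo)) * filters_per_erb))
    return erb_rate_inv(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n))

def analytic_envelopes(x, fs, fcs):
    # Filter each channel, then take the analytic signal so the math
    # stays complex until the collapse to an envelope/loudness value.
    envs = []
    for fc in fcs:  # every fc must stay below fs / 2
        b, a = gammatone(fc, 'iir', fs=fs)
        envs.append(np.abs(hilbert(lfilter(b, a, x))))
    return np.array(envs)  # (channels, samples), full time resolution

def temporal_mask(envs, fs, tau=0.05):
    # Convolve a decaying post-masking kernel at full resolution,
    # instead of bucketing into discrete 100 Hz time slices.
    t = np.arange(0.0, 5.0 * tau, 1.0 / fs)
    kernel = np.exp(-t / tau)
    kernel /= kernel.sum()  # preserve overall level
    return np.array([np.convolve(e, kernel)[:e.size] for e in envs])
```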
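And a sketch of the volume-normalization idea: search for the gain x that minimizes the distance, so a codec that merely makes things quieter isn't penalized. `distance_fn` is a hypothetical stand-in for whatever perceptual distance is being evaluated, and the gain bounds are arbitrary.

```python
from scipy.optimize import minimize_scalar

def gain_invariant_distance(ref, deg, distance_fn):
    # Minimize distance(ref, x * deg) over a scalar gain x, so overall
    # level differences don't dominate the score.
    res = minimize_scalar(lambda x: distance_fn(ref, x * deg),
                          bounds=(0.25, 4.0), method='bounded')
    return res.fun, res.x  # best distance, and the gain that achieved it
```

Bounded scalar minimization assumes the distance is reasonably well behaved in x; a coarse grid search over gains would be a safer first pass.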

11 days ago

qingcharles

This man codecs.

11 days ago

givinguflac

I looked through the deeper explanation and found this interesting:

“Performing a simple experiment where we have 5 separate components:

- 1000 Hz sine probe at 57 dB SPL
- 750 Hz sine masker A at 71 dB SPL
- 800 Hz sine masker B at 71 dB SPL
- 850 Hz sine masker C at 67 dB SPL
- 900 Hz sine masker D at 65 dB SPL

I record the following data:

When playing the probe + maskers A through D individually, I experience the probe approximately as intensely as a 1000 Hz tone at 53 dB SPL. When playing the probe + all maskers, I experience the probe approximately as intensely as a 1000 Hz tone at 48 dB SPL.”

I would be very interested in understanding more about their testing methodology, and especially their hardware setup.

Is the perceiver a trained listener? Are they using headphones or speakers or some other transducer method?

It's awfully difficult to say that there is equivalent perceived SPL for different frequency domains, even as a trained listener, especially given the different frequency responses of different listening setups.

The average user has no chance; hence my curiosity about their specific credentials, considering they’re building an entirely new perceptual model based on such judgments.

11 days ago

DoctorOetker

>It's awfully difficult to say that there is equivalent perceived SPL for different frequency domains, even as a trained listener.

The snippet you quote doesn't claim to compare intensities at different frequencies.

He is comparing only perceived 1 kHz intensities (in the presence or absence of maskers at other frequencies, whose intensities are not themselves being subjectively scored).

11 days ago

givinguflac

Ah, thank you for clarifying, I misunderstood but still have the same curiosity about their methods.

11 days ago

Thoreandan

Interesting, if hard to understand.

It would be nice to see ELI5 explanations for items like this, akin to Monty's 'A Digital Media Primer for Geeks' ( https://people.xiph.org/~xiphmont/demo/#:~:text=Xiph )

11 days ago

formerly_proven

I'm guessing the name is meant to allude to cinnamon pig ears (https://en.wikipedia.org/wiki/Palmier).

11 days ago

atoav

Probably. Zimt is cinnamon, and Ohrli is Swiss German dialect for ear.

11 days ago

jo-m

Öhrli actually. It maddens me to no end that they pick words which contain umlauts and then leave them out.

11 days ago

kopadudl

On the one hand it makes it easier for non-umlaut people, and it makes the names unique: when googling for zopfli and brotli, I was pointed to the GitHub repos very quickly.

11 days ago

b3orn

Öhrli is actually the Swiss German diminutive of Ohr (ear). Swiss German uses -li a lot for diminutives, whereas Standard German uses -chen or -lein; the vowel of the stem is turned into an umlaut: Ohr -> Öhrli/Öhrchen/Öhrlein.

11 days ago

DoctorOetker

Are there any associated scientific articles and/or datasets that back up the experimental claim/insinuation of matching JNDs or perceptual differences?

Is this a proposal without experimental verification?

11 days ago

Lerc

This seems to be targeted at signals that are already quite close. Is there anything similar for broad ballpark similarity?

Whenever I have searched for such things, I have more often encountered techniques designed to detect re-use for copyright reasons.

I have played around with generating instrument sounds from a blend of very few basic waveforms with attack, decay, sustain, release, pitch sliding and bell modulation.

While it is quite fun just trying to make things by tweaking parameters, your ear/perception drifts as you hear the same thing over and over.

It would be really nice to have an automated "how close is this abomination?". I'd even give evolution a go to try and make some more difficult matches.
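For anyone who wants to play along, a minimal piecewise-linear ADSR envelope in numpy (parameter names and defaults are illustrative):

```python
import numpy as np

def adsr(n, fs, a=0.01, d=0.1, s=0.6, r=0.2):
    # Attack to full level, decay to the sustain level s, hold,
    # then release to zero. a, d, r are seconds; s is a level in [0, 1].
    na, nd, nr = int(a * fs), int(d * fs), int(r * fs)
    ns = max(n - na - nd - nr, 0)
    env = np.concatenate([
        np.linspace(0.0, 1.0, na, endpoint=False),
        np.linspace(1.0, s, nd, endpoint=False),
        np.full(ns, s),
        np.linspace(s, 0.0, nr),
    ])
    return env[:n]

# e.g. a 440 Hz "pluck":
# np.sin(2 * np.pi * 440 * np.arange(n) / fs) * adsr(n, fs)
```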

11 days ago

tux3

How close is 'broad ballpark'? Have you tried chromaprint?

It's probably far from state of the art today, but you can get a percentage similarity out of it. I've successfully used it to find similar (or outright duplicate) songs in a big library.

11 days ago

Lerc

Things like chromaprint are why I have found it difficult to search for what I want.

These tools are geared towards identifying matches (one-to-many). Chromaprint specifically bins things into notes, assuming it is trying to match music.

I'm after something that will tell me, in human perception, how much a dog's bark sounds like a quack.

One-to-one comparison of short-ish samples, with no assumption of content style.
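As a crude baseline for that kind of one-to-one comparison, something like mean-pooled MFCC cosine similarity can work. This sketch assumes librosa is available, and it is emphatically not a validated perceptual metric:

```python
import numpy as np
import librosa

def rough_similarity(path_a, path_b, sr=22050):
    # "Broad ballpark" timbre similarity: mean-pooled MFCCs compared
    # with cosine similarity. 1.0 is near-identical, lower is less alike.
    feats = []
    for path in (path_a, path_b):
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)
        feats.append(mfcc / np.linalg.norm(mfcc))
    return float(feats[0] @ feats[1])
```

Mean-pooling throws away time structure, so this compares overall timbre rather than how the sound evolves; DTW over the MFCC frames would be a step up.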

11 days ago

yalok

It’d be very interesting to see the results of this metric for existing audio and voice codecs (like AAC, AAC-LD, MP3, Opus), and how it compares to the existing metrics for them.

Couldn’t find it in their paper.

11 days ago

ant6n

This says it works on just-noticeable differences. Would this work well if the quality of the compressed audio is very poor? Could one, for example, compare two speech codecs at 8 kHz, 4-bit against the original source to find out which one sounds better?

Or should one just... I dunno, calculate the mean squared error in some sort of continuous frequency domain, perhaps weighted by some hearing curve?
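For reference, a minimal sketch of that baseline, with the standard A-weighting curve standing in for the "hearing curve" (the choice of curve is an assumption):

```python
import numpy as np

def a_weight(f):
    # A-weighting magnitude response (linear scale, unnormalized).
    f2 = f ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return num / den

def weighted_spectral_mse(ref, deg, fs):
    # Per-bin magnitude error weighted by a fixed hearing curve.
    # ref and deg must be equal-length time-domain signals.
    f = np.fft.rfftfreq(ref.size, 1.0 / fs)
    err = np.abs(np.fft.rfft(ref)) - np.abs(np.fft.rfft(deg))
    return float(np.mean(a_weight(f) * err ** 2))
```

As mrob's reply below explains, this has no masking model, so it will misjudge codecs that deliberately hide their noise under maskers.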

11 days ago

mrob

Audibility of error (and sound in general) depends on what other audio is playing at the same time, with both frequency-domain and time-domain effects:

https://en.wikipedia.org/wiki/Auditory_masking

Here's a two-part lecture with audio demonstrations by Bernhard Seeber of the Audio Information Processing Group at the Technical University of Munich:

https://www.youtube.com/watch?v=R9UZnMsm9o8

https://www.youtube.com/watch?v=bU0_Kaj7cPk

A simple weighted frequency-domain error calculation is not very useful for comparing lossy audio codecs, because effectively exploiting auditory masking to hide the errors is a major factor in codec quality.
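To illustrate what such a metric is missing, here is a toy simultaneous-masking model over Bark bands (the slopes and offset are illustrative, not taken from any standard):

```python
import numpy as np

def masking_threshold_db(band_db, lower_slope=27.0, upper_slope=12.0,
                         offset_db=14.0):
    # Each band's level spreads to its neighbours with fixed dB-per-Bark
    # slopes (steeper toward lower bands, shallower toward higher ones);
    # the masked threshold in a band is the strongest contribution
    # from any masker, minus an offset.
    n = len(band_db)
    thr = np.full(n, -np.inf)
    for i, level in enumerate(band_db):   # i: masker band (Bark index)
        for j in range(n):                # j: masked band
            dz = j - i
            drop = upper_slope * dz if dz >= 0 else lower_slope * -dz
            thr[j] = max(thr[j], level - drop - offset_db)
    return thr  # coding error below thr[j] in band j is roughly inaudible
```

A masking-aware metric would count only the error energy above this per-band threshold, which is exactly what a plain weighted MSE cannot do.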

11 days ago

jononor

PEAQ/PESQ and ViSQOL are worth trying for that. In principle they operate as you suggest. I keep a short overview of audio quality methods/tools here: https://github.com/jonnor/machinehearing/blob/master/audio-q...

11 days ago

marcodiego

Can it be used to make LAME even better? I mean, I'm still fond of MP3, especially now that it is patent/royalty free and there are literally billions of compatible devices.

11 days ago

iamnotsure

Lossy compression may be a bad idea; brains may not support it very well.

11 days ago

bbstats

Very useful. I find a lot of audio SR (compression) algos sound really bad, likely just because the loss functions and/or eval metrics are 'inhuman'.

11 days ago

p0nce

How does it compare to ViSQOL v3?

11 days ago

rurban

It is tested against ViSQOL; see the code, and especially COMPARISON.md.

ViSQOL is still the overall winner.

7 days ago
