Cohere Transcribe: Speech Recognition

219 points
5 days ago
by gmays

Comments


dinakernel

My worry is that ASR will end up like OCR: if the multimodal large AI systems are good enough (latency-wise), the advantage of deep domain understanding eats the other technologies alive.

In OCR, even when the characters are poorly scanned, the deep domain understanding these large multimodal AIs have lets them work out what the document actually meant: this must be the order ID because, in the million invoices I have seen before, the order ID normally sits below the order date, and so on. My worry is that the same thing is going to happen in ASR.

5 days ago

progbits

This is both good and bad. Good ASR can often understand low quality / garbled speech that I could not figure out, but it also "over corrects" sometimes and replaces correct but low prior words with incorrect but much more common ones.

With OCR the risk is you get another xerox[1] incident where all your data looks plausible but is incorrect. Hope you kept the originals!

(This is why for my personal doc scans, I use OCR only for full text search, but retain the original raw scans forever)

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

5 days ago

corlinp

This is exactly the case today. Multimodal LLMs like gpt-4o-transcribe are way better than traditional ASR, not only because of deeper understanding but because of the ability to actually prompt it with your company's specific terminology, org chart, etc.

For example, if the prompt includes that Caitlin is an accountant and Kaitlyn is an engineer, if you transcribe "Tell Kaitlyn to review my PR" it will know who you're referring to. That's something WER doesn't really capture.

BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe
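To illustrate, a minimal Python sketch of that biasing approach (the `build_terminology_prompt` helper, the file name, and the terminology entries are hypothetical; `gpt-4o-transcribe` and the `prompt` parameter are real parts of OpenAI's transcription API):

```python
def build_terminology_prompt(entries):
    """Join org-specific terms and roles into a biasing prompt
    for the transcription model (hypothetical helper)."""
    return "Vocabulary and context: " + "; ".join(entries)

if __name__ == "__main__":
    # Hypothetical org chart; adapt to your own terminology.
    prompt = build_terminology_prompt([
        "Caitlin is an accountant",
        "Kaitlyn is an engineer",
    ])
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment
    client = OpenAI()
    with open("meeting.wav", "rb") as f:  # hypothetical audio file
        result = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=f,
            prompt=prompt,
        )
    print(result.text)
```

With a prompt like that, "Tell Kaitlyn to review my PR" is far more likely to be attributed to the engineer's spelling.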

5 days ago

Bolwin

Many ASR models already support prompts/adding your own terminology. This one doesn't, but full LLMs, especially such expensive ones, aren't needed for that.

5 days ago

nkzd

Why are you 'worried' about it? Shouldn't we strive for better technology even if it means some will 'lose'?

5 days ago

yorwba

"Better" isn't just about increasing benchmark numbers. Often, it's more important that a system fails safely than how often it fails. Automatic speech recognition that guesses when the input is unclear will occasionally be right and therefore have a lower word error rate, but if it's important that the output be correct, it might be better to insert "[unintelligible]" and have a human double-check.

5 days ago

IshKebab

It's better in terms of WER. It's not better in terms of not making shit up that sounds plausible.

Probably the answer is simply to tweak the metric so it's a bit smarter than WER: allow "unclear" output, which is penalised less than an actually incorrect answer. I'd be surprised if nobody has done that.
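A sketch of such a metric, assuming the model emits a literal "[unclear]" token; the name `lenient_wer` and the 0.5 substitution cost are my own illustration, not an established benchmark:

```python
def lenient_wer(ref, hyp, unclear="[unclear]", unclear_cost=0.5):
    """Word error rate where substituting an explicit 'unclear' marker
    for a reference word costs less than a confidently wrong word."""
    r, h = ref.split(), hyp.split()
    # Standard edit-distance DP with a reduced substitution cost
    # whenever the hypothesis word is the unclear marker.
    d = [[0.0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        d[i][0] = float(i)
    for j in range(1, len(h) + 1):
        d[0][j] = float(j)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                sub = d[i - 1][j - 1]
            elif h[j - 1] == unclear:
                sub = d[i - 1][j - 1] + unclear_cost
            else:
                sub = d[i - 1][j - 1] + 1.0
            d[i][j] = min(sub, d[i - 1][j] + 1.0, d[i][j - 1] + 1.0)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Under this scoring, hedging with "[unclear]" beats a plausible-sounding wrong guess, which is exactly the fail-safe behaviour being asked for.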

5 days ago

ks2048

Ideally, you'd be able to specify exactly what you want: do you want to write out filled pauses ("aaah", "umm")? Do you want a transcription of the disfluencies, restarts, etc., or just a cleaned-up version?

5 days ago

Tsarp

ASR has already proved its usefulness; dictation tools are a prime example. Ever since Whisper came out, locally running ASR models suddenly became a thing, which opened up so many variants:

https://superwhisper.com

https://carelesswhisper.app

https://macwhisper.com

5 days ago

regularfry

For quite a long time there will be a greater advantage to local processing for STT than for TTT chat, or even OCR. Being able to do STT on the device that owns the microphone means that the bandwidth off that device can be dramatically reduced, if it's even necessary for the task at hand.

5 days ago

gruez

> Limitations

>Timestamps/Speaker diarization. The model does not feature either of these.

What a shame. Is whisperx still the best choice if you want timestamps/diarization?

5 days ago

bartman

Even in the commercial space, there’s a lack of production grade ASR APIs that support diarization and word level timestamps.

My experiences with Google's Chirp have been horrendous: it sometimes skips sections of speech entirely, hallucinates speech where the audio contains noise, and has unreliable word-level timestamps. And all this even when using their new audio prefiltering feature.

AWS works slightly better, but also has trouble with keeping word level timestamps in sync.

Whisper is nice but hallucinates regularly.

OpenAI’s new transcription models are delivering accurate output but do not support word level timestamps…

A lot of this could be worked around by sending the resulting transcripts through a few layers of post processing, but… I just want to pay for an API that is reliable and saves me from doing all that work.

5 days ago

catlifeonmars

I wonder if you could run multiple models and average out the timestamps, kind of like how atomic clocks are used together and not separately
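A toy sketch of that idea, assuming every model emits the same word sequence so only the timestamps differ (the function name and tuple layout are invented for illustration):

```python
from statistics import median

def combine_word_timestamps(runs):
    """Given per-model word timelines, each a list of (word, start, end),
    take the median start/end per word position, somewhat like combining
    several imperfect clocks. Assumes all models agree on the words."""
    combined = []
    for words in zip(*runs):
        texts = {w for w, _, _ in words}
        if len(texts) != 1:
            raise ValueError(f"models disagree on word: {texts}")
        combined.append((
            words[0][0],
            median(s for _, s, _ in words),
            median(e for _, _, e in words),
        ))
    return combined
```

The median makes one model's outlier timestamp mostly harmless; in practice the hard part is aligning transcripts that don't agree word-for-word.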

5 days ago

stavros

Isn't Elevenlabs the best in this?

5 days ago

gardnr

They can have issues with the timestamps: https://github.com/elevenlabs/elevenlabs-python/issues/707

5 days ago

bartman

I've not tested their speech-to-text yet, but based on the docs it looks promising. Thanks for the suggestion!

5 days ago

stavros

It's fantastic, and their diarization is spot on as well.

5 days ago

akreal

WhisperX is not a model but a software package built around Whisper and some other models, including diarization and alignment ones. Something similar will likely be built around the Cohere Transcribe model, maybe even just as an integration into WhisperX itself.

5 days ago

atoav

I would try Qwen-ASR: https://qwen.ai/blog?id=qwen3asr

See the very bottom of the page for a transcription with timestamps.

5 days ago

mcbetz

Mistral Voxtral has timestamps and diarization and does a good job for German (have not tested for other languages yet).

5 days ago

GaggiX

There is also: https://github.com/linto-ai/whisper-timestamped

It doesn't use an extra model (so it supports every language that works with Whisper out of the box and uses less memory); it works by applying Dynamic Time Warping to the cross-attention weights.

5 days ago

oezi

Just a warning that plain WhisperX is more accurate and Whisper-timestamped has many weird quirks.

5 days ago

stavros

Diarization is done separately to ASR anyway (it's usually a separate run, after the ASR).
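That merge step can be sketched as assigning each ASR word the diarization segment it overlaps most; the function and data shapes here are illustrative, not any particular library's API:

```python
def assign_speakers(words, segments):
    """Attach a speaker label to each ASR word by maximal time overlap
    with diarization segments. `words` is a list of (word, start, end);
    `segments` is a list of (speaker, start, end)."""
    labelled = []
    for word, ws, we in words:
        best, best_overlap = None, 0.0
        for speaker, ss, se in segments:
            overlap = min(we, se) - max(ws, ss)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append((word, best))  # best is None if nothing overlaps
    return labelled
```

This is why unreliable word-level timestamps hurt twice: they shift the subtitles and misattribute speakers.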

5 days ago

lifesaverluke

5 days ago

angel-

Link doesn't work for me, can you double check it please? Or tell the name of it so I can look it up? Thanks!

5 days ago

satvikpendem

Enable show dead in your HN profile settings. The link works then as it's a dead show HN post.

5 days ago

geooff_

I can't say enough nice things about Cohere's services. I migrated over to their embedding model a few months ago for clip-style embeddings and it's been fantastic.

It has the most crisp, steady P50 of any external service I've used in a long time.

5 days ago

bluegatty

can u comment on overall quality? their models tend to be a bit smaller and less performant overall.

5 days ago

geooff_

My baseline was Jina, a Chinese model provider. I had major issues with their reliability. I have no comparison to offer in terms of offline metrics, as I had to do an emergency migration because their inference service had extended downtimes.

My experience with Cohere, and with interacting with their sales engineers, has been boring, and I say that in the most flattering way possible. Embeddings are a core service at this point, like VMs and DBs: they just need to work, and work well, and that's what they're selling.

5 days ago

roflcopter69

[dead]

5 days ago

kieloo

The problem with many STT models is that they mostly seem to be trained on perfectly accented speech and struggle a lot with foreign accents, so I'm curious to try this one as a Frenchman with a rather French English accent.

So far, the best I have found while testing models for my language learning app (Copycat Cafe) is Soniox. All others performed badly for non native accents. The worst were whisper-based models because they hallucinate when they misunderstand and tend to come up with random phrases that have nothing to do with the topic.

5 days ago

mnbbrown

Ran it over our internal dataset of ~250 recordings of people saying british postcodes (all kinds of accents, etc) - it's competitive for sure!

Soniox (stt-async-v4): 176/248 (71.0%)

ElevenLabs (scribe_v2): 170/248 (68.5%)

AssemblyAI (universal-3-pro): 166/248 (66.9%)

Deepgram (nova-3): 158/248 (63.7%)

AssemblyAI (universal-2): 148/248 (59.7%)

Cohere (transcribe-03-2026): 148/248 (59.7%)

Speechmatics (enhanced): 134/248 (54.0%)

P.S. How do I get this to render correctly on here?

5 days ago

jilijeanlouis

Did you try Gladia? It's ranked #1 on the STT blind test: https://compare-stt.com/

5 days ago

mnbbrown

Added Gladia:

- 1. Soniox (stt-async-v4): +176 new cases, running total 176/248 (71.0%)

- 2. ElevenLabs (scribe_v2): +26 new cases, running total 202/248 (81.5%)

- 3. Speechmatics (enhanced): +12 new cases, running total 214/248 (86.3%)

- 4. NVIDIA Parakeet (TDT 0.6B v2): +6 new cases, running total 220/248 (88.7%)

- 5. Mistral (voxtral-mini): +3 new cases, running total 223/248 (89.9%)

- 6. Gladia: +2 new cases, running total 225/248 (90.7%)

- 7. AssemblyAI (universal-2): +1 new cases, running total 226/248 (91.1%)

- 8. Deepgram (nova-3): +1 new cases, running total 227/248 (91.5%)

- 9. Cohere (transcribe-03-2026): +0 new cases, running total 227/248 (91.5%)

- 10. AssemblyAI (universal-3-pro): +0 new cases, running total 227/248 (91.5%)

5 days ago

scotty79

This benchmark should have Whisper large-v3 as one of the models.

4 days ago

Bolwin

Try two newlines between each one

5 days ago

ChrisMarshallNY

That, or add 4 spaces before each line (renders as a <pre>).

5 days ago

mkl

Two spaces: https://news.ycombinator.com/formatdoc

It's for code though, not lists or bullet points.

5 days ago

yorwba

Is the human baseline 248/248?

5 days ago

walthamstow

Assuming all the accents are British, I doubt it. I probably couldn't get all 248 myself.

5 days ago

mnbbrown

They are all transcribed by multiple blinded "accent natives". But yes, your point is valid - going to see if I can tease out the "single person accuracy".

5 days ago

_medihack_

Unfortunately, this model does not seem to support a custom vocabulary, word boosting or an additional prompt.

5 days ago

nodja

It's probably another ASR model that focuses on benchmarks and simple uses instead of more challenging real use cases.

I upload edited gameplay VODs of Twitch streams to YouTube, and use whisper-large-v3 to provide subtitles for accessibility reasons (YouTube's own auto-subtitles suck, though they've been getting better).

My checklist for a good ASR model for my use case is:

1. Have timestamp support.

2. Support overlapping speakers.

3. Accurate transcripts that don't coalesce half words/interrupted sentences.

4. Support non verbal stuff like [coughs], [groans], [laughs], [sighs], etc.

5. Allow context injection of non-trivial sizes (10k+ words)

1 is obvious: without timestamps we can't have subtitles, and forced alignment fails too often.

2 is crucial for real world scenarios because in the real world people talk over each other all the time, in my case it's a streamer talking over gameplay audio, or when the streamer has guests over. When 2 people speak the transcript either ignores one of them, or in the worst case, both of them.

3 and 4 are an accessibility thing: if you're deaf or hard of hearing, a more literal transcript of what's being said conveys better how the speaker is speaking. If all subtitles come out perfectly "spell-checked", it's clear your model is overfit to the benchmarks.

5 is not a requirement per se, more of a nice-to-have. In my use case the streamer is often reading stream chat, so feeding the model the list of users who recently talked, recent chat messages, on-screen text, etc. would make for more accurate transcripts.

I've tried many models, and the closest to fulfilling my needs are LLM-style models on top of forced alignment. That's too slow, though, so I've been sticking with Whisper: with WhisperX I can get a transcript in 5 minutes with just a single command.

One thing all these models do (including Whisper) is simply omit full sentences, which is the worst thing a model can do.

5 days ago

Nimitz14

3 and 4 are actually of negative value for most customers.

5 days ago

satozawa

[dead]

5 days ago

teach

Dumb question, but if this is "open source" is there source code somewhere? Or does that term mean something different in the world of models that must be trained to be useful?

5 days ago

Doman

Files can be downloaded here: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026/...

And someone has already converted it to onnx format: https://huggingface.co/eschmidbauer/cohere-transcribe-03-202... - so it can be run on CPU instead of GPU.

5 days ago

gunalx

The most-used definition is just available weights.

This kind of makes sense because "compiling" (training) the model costs prohibitively much, and we can still benefit from the artifacts.

5 days ago

stronglikedan

I presume it means the model itself.

5 days ago

stavros

To clarify, this is SOTA in its size category, right? It's not better than Parakeet, for example?

5 days ago

jwineinger

Looking at the ASR leaderboard (https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), Parakeet (.6B) is still near the top on speed, but about 10th on WER.

5 days ago

stavros

Thanks, I don't know how much to trust benchmarks so I figured I'd ask.

5 days ago

caminanteblanco

Well, to clarify: it's larger than Parakeet in parameter count (Parakeet is available at 0.6B and 1.1B, while this is 2B params), and it also performs better on the benchmarks Hugging Face publishes on the OpenASR leaderboard.

5 days ago

stavros

Ah, thanks, I mixed up the parameter counts. I guess Parakeet is 0.6B; I was somehow thinking 6B.

5 days ago

ChrisMarshallNY

I remember Dragon Dictate. You had to spend ages training it, and it still did a suckass job.

I was recently interviewed for a podcast, and she published it on Apple Podcasts. Apple does a transcript of the podcast; I assume it's some kind of AI (not sure if it's the same engine as Siri, which I'm not too thrilled with).

It made quite a few errors (not too bad, but errors nonetheless), but the thing that annoyed me the most is that it didn't differentiate between speakers.

5 days ago

aitchnyu

You mean the ones designed to work with 64 MB of RAM on the CPU? I downloaded far too much speech recognition and TTS shareware as a kid.

5 days ago

Void_

Just today I shipped support for this in Whisper Memos: https://whispermemos.com/changelog/2026-04-cohere-transcribe

Accurate and fast model, very happy with it so far!

5 days ago

ramon156

I had to set up Fireflies for our company recently. Cool tool, but I'm sending dozens of internal meetings to an American company; our ISO inspector wouldn't be pleased to know.

This is a good option. Will check it out.

5 days ago

Oras

There are many open-source STT models that can run locally on a Mac with good performance, such as Whisper and Parakeet.

5 days ago

topazas

How hard could it be to train it on other European languages?

5 days ago

gunalx

If you have to ask, you don't really need the answer.

Finding or creating training code doesn't seem too difficult. Beyond that, you'd need a pretty decent amount of high-quality training data (many hours of it), a few hours of high-end data-center GPU compute, and many iterations to get it right.

5 days ago

harvey9

It includes several European languages.

5 days ago

stronglikedan

hence "other" lol

5 days ago

BreezyBadger

Awesome. Going to see if I can port https://scrivvy.ai to this. (Based in Canada.)

5 days ago

neom

I knew that name looked familiar. Hello from a fellow Canuck who you follow on twitter. :D

5 days ago

simonw

It's great that this is Apache 2.0 licensed - several of Cohere's other models are licensed free for non-commercial use only.

5 days ago

kalmuraee

Multimodal models are way better.

5 days ago

Fidelix

Can you clarify? I tested a few and they are rubbish and don't have the same features.

5 days ago

bkitano19

Notable omission of Deepgram models in the comparisons?

5 days ago

[deleted]
5 days ago

scotty79

deepgram seems really good (esp Enhanced and Nova 3 models).

4 days ago

jilijeanlouis

Same for Gladia; it's ranked #1 in the STT blind tests: https://compare-stt.com/

5 days ago

aplomb1026

[dead]

5 days ago

theaicloser

[flagged]

5 days ago