Speech Dictation Mode for Emacs

127 points
4 months ago
by adityaathalye

Comments


tbran

To run speech-to-text on my laptop, I've been using Justine Tunney's downloadable single-executable Whisper file.

I use it to transcribe audio, then paste the transcript into an LLM to get notes on whatever it is. That helps me decide whether to watch or listen to something, and it saves a bunch of time.

Her tweet: https://x.com/JustineTunney/status/1825551821857010143

Instructions from Simon Willison: https://simonwillison.net/2024/Aug/19/whisperfile/

Command line options: https://github.com/Mozilla-Ocho/llamafile/issues/544#issueco...

4 months ago

jwr

Amazing work.

I am also impressed by the advances in technology. 20 years ago, I had severe RSI problems and worked on "vx-mode", a package for interfacing XEmacs to Dragon NaturallySpeaking, the best speech-recognition solution available at the time. My goals were similar, although the result was nowhere near what the OP has done. Also, speech recognition tech was nowhere near what we have now: I still remember buying good microphones, worrying about microphone placement relative to mouth, endless training and re-training…

This kind of software can make a huge difference for many people.

4 months ago

Jeff_Brown

I'm really happy about it but I'm not sure how game changing it would be for a blind person. It seems to require seeing what's on the page.

4 months ago

jwr

Perhaps not for a blind person, but for anyone with RSI or other hand/wrist impairments, this can make a huge difference. I speak from experience, having used dictation to work around RSI issues.

4 months ago

submeta

Year 2080: AGIs help you transcribe, structure, and lay out your code/text/thoughts. At the same time, HN posts: "New package for Emacs doing xyz".

4 months ago

raverbashing

And all it requires is some emacs version bump, some dependency upgrades, some external servers and changing the default shortcut in a confusing lisp file to something that doesn't require pressing 8 keys at the same time

4 months ago

kleiba

Fun fact: even pressing three keys at the same time is rare when using Emacs (although there are some three-key combos I use regularly), most shortcuts consist of consecutive key presses.

4 months ago

fhd2

I sometimes feel like playing the piano :D But the UX is better than you'd think, there's packages that show you what options you have for what key to press next, and the sequences are generally quite logical (e.g. CTRL-x followed by "p" has all the commands related to projects).

Plus you can always just enter the command instead of using the key stroke for it. Again, the default UX for that is a bit weak, but with a few packages it becomes pretty strong.

4 months ago

ashton314

> there's packages that show you what options you have for what key to press next

Rejoice! The excellent which-key package that does this comes bundled with Emacs 30! (Emacs 30 will probably be released soon.)
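For readers who want to try it, a minimal sketch of turning which-key on from an init file. It assumes Emacs 30 or later, or an older Emacs with the which-key package already installed; the delay value is an arbitrary illustration, not a recommendation from the thread.

```elisp
;; Sketch only: enable which-key if it is available in this Emacs.
(when (fboundp 'which-key-mode)
  ;; Seconds to wait after a prefix key before the popup appears
  ;; (0.5 is an arbitrary choice for illustration).
  (setq which-key-idle-delay 0.5)
  (which-key-mode 1))
```

After this, pressing a prefix such as `C-x` and pausing should list the keys that can follow it.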

> enter command… default UX is a bit weak

Agreed: the packages Helm, Ivy, and Vertico make this interface much nicer. I use Vertico [1] personally. Though, from Emacs 29, there are some really nice options you can set. I used the following in my Bedrock starter kit [2] to get nicer tab-completion: as soon as you hit TAB twice you'll get bumped into the Completion buffer to select something with your cursor.

Here's the relevant config:

    (setopt completion-auto-help 'always)                  ; Open completion always; `lazy' another option
    (setopt completions-max-height 20)                     ; This is arbitrary
    (setopt completions-detailed t)
    (setopt completions-format 'one-column)
    (setopt completions-group t)
    (setopt completion-auto-select 'second-tab)            ; Much more eager
    ;(setopt completion-auto-select t)                     ; See `C-h v completion-auto-select' for more possible values

There's more configuration options, of course, but this is helpful.

[1]: https://github.com/minad/vertico

[2]: https://codeberg.org/ashton314/emacs-bedrock

4 months ago

spauldo

which-key made it in? Sweet! I've been saying for years it should be in Emacs and turned on by default.

4 months ago

kleiba

True. I oftentimes find myself typing out the command rather than using some obscure key sequence like C-c C-v n (case in point: https://orgmode.org/manual/Key-bindings-and-Useful-Functions...). Since Emacs does tab completion for the command name too, I personally find that a better UX than using the "shortcut" (if I can remember it at all).

4 months ago

pxc

I tend to use search for infrequently used stuff and stuff I'm just trying to learn for the first time, then if I find myself using it several times in a session I look up the keybind to start practicing that. If it sticks, it sticks, and if it doesn't... the search functionality is great!

4 months ago

eptcyka

> the sequences are generally quite logical (e.g. CTRL-x followed by "p" has all the commands related to projects).

They really are not.

4 months ago

argiopetech

Depends on if you count shift. I C-M-% (query-replace-regexp) fairly regularly, and that's 4.

4 months ago

kleiba

Sure, shift counts. I suppose I would bind it to a more convenient keybinding if I used query-replace-regexp regularly, but note that I didn't say there weren't any such keybindings, just that they're rare.
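A minimal sketch of such a rebinding, for anyone who does use the command often. The key `C-c x` is an assumption chosen for illustration (`C-c` followed by a letter is reserved for user bindings); pick whatever is free in your setup.

```elisp
;; Sketch only: give the regexp query-replace command a two-step binding.
;; "C-c x" is an illustrative choice, not something from the thread.
(global-set-key (kbd "C-c x") #'query-replace-regexp)
```

The default `C-M-%` binding keeps working; this only adds a second way to reach the same command.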

4 months ago

b5n

I assume this varies widely across setups.

    (use-package visual-regexp
      :defer t
      :bind (("C-c r" . vr/replace)
             ("C-c q" . vr/query-replace)
             ("C-r" . vr/isearch-backward)
             ("C-s" . vr/isearch-forward)))

    (use-package visual-regexp-steroids
      :defer t)

4 months ago

wiz21c

year 2080: "M-x ai: imagine you are a smart emacs developer, write a configuration file that sets up LSP"

answer:

"I did it. Please note that you're using a Microsoft protocol. Microsoft has a long history of attacking the 4 core freedoms of the Free Software movement which are

The freedom to run the program as you wish, for any purpose (freedom 0). ..."

4 months ago

pxc

This is kinda ideal tbh. I like how, for instance, F-Droid warns users about anti-features and integrations with proprietary web services. Clear messaging about problematic software + freedom to nonetheless choose those problematic options is great.

That said, I don't think this is the way the FSF evaluates software, or that they'd treat an open protocol like this. I could imagine a warning like this about integrating with a proprietary language server in particular, though— and I'd be grateful for it! A locally-run AI assistant that cared about things like that would be super cool.

4 months ago

anthk

That AI would be running under GNU Hurd with Guix. Also, Scheme simplified itself so hard that it created something akin to the Common Lisp standard, unifying all ice's and srfi's into something manageable by humans in a single package.

Also it rewrote all of the legacy Emacs' Elisp into manageable Emacs Guile (with an uberfast JIT and/or libre Guile microcode from the FSF).

4 months ago

lepisma

Hey, author here. Didn't notice this came up on HN.

I wrote a small follow-up about trying to write and speak at the same time here: https://lepisma.xyz/journal/2024/09/13/can-i-output-two-stre...

4 months ago

pama

That's a cool idea. Could the LLM find the right location for the audio stream by simply having the context of the buffer, and the location of the text and audio cursor when the interaction starts?

4 months ago

lepisma

I think it could work. In my example of writing a docstring, I can see this working out with high probability.

4 months ago

voltaireodactyl

This looks very useful, and beautifully presented. Looking forward to being able to use it with a local model.

4 months ago

Jeff_Brown

I would use this for edits that are hard to do otherwise. Like, instead of typing `M-x align-regexp` and then figuring out what regular expression to type, I would just highlight a passage and say to the LLM "Can you align all the library names in this import statement?"
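For comparison, a rough sketch of what the manual route involves when `align-regexp` is called from Lisp. It aligns on `=` rather than on import names, purely as an illustration; the helper name `my/align-on-equals` is hypothetical.

```elisp
;; Hypothetical helper; the name is illustrative only.
(defun my/align-on-equals (beg end)
  "Align the first `=' on each line between BEG and END."
  (interactive "r")
  ;; Group 1 is the run of whitespace that `align-regexp' adjusts;
  ;; the spacing argument of 1 asks for one space before each `='.
  (align-regexp beg end "\\(\\s-*\\)=" 1 1 nil))
```

Select a region and run `M-x my/align-on-equals`. Working out the right regexp for a different pattern, such as library names in imports, is exactly the step the comment proposes handing to the LLM.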

4 months ago

BeetleB

I did something similar here:

https://blog.nawaz.org/posts/2023/Dec/cleaning-up-speech-rec...

I now use Whisper with a much expanded prompt and have the flow integrated both in Emacs and my WM.

Prior HN discussion:

https://news.ycombinator.com/item?id=40174921

I've since done hours of transcription with it - often transcribing whole emails. The challenge is that my brain thinks very differently while talking compared to while typing. As a result, my output is very verbose, and is very different from what I would have typed. I haven't figured out how to speak as if I'm typing.

4 months ago

ggm

"Emacs: Upgrade to MELPA"

ELPA installed s/w suite: "I'm sorry Dave, I can't do that"

4 months ago

anthk

More like: Emacs: pull all the libre MELPA repos into a local .el file to be checked on demand. Hide all the repos that are proprietary or depend on proprietary software.

4 months ago

ants_everywhere

nerd-dictation is a decent offline speech dictation tool for Linux that I've used with Emacs https://github.com/ideasman42/nerd-dictation
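One possible way to drive it from inside Emacs, sketched under assumptions: the `nerd-dictation` script is on your PATH, its speech model is installed where the tool expects it, and the `begin` and `end` subcommands behave as described in the project's README. The command names `my/nerd-dictation-begin` and `my/nerd-dictation-end` are hypothetical.

```elisp
;; Hypothetical wrappers; assumes `nerd-dictation' is on PATH and
;; already configured with a speech model.
(defun my/nerd-dictation-begin ()
  "Start a nerd-dictation session in the background."
  (interactive)
  (start-process "nerd-dictation" nil "nerd-dictation" "begin"))

(defun my/nerd-dictation-end ()
  "Ask the running nerd-dictation session to stop."
  (interactive)
  ;; Destination 0 discards output and returns without waiting.
  (call-process "nerd-dictation" nil 0 nil "end"))
```

Binding the two commands to adjacent keys gives a simple push-to-start, push-to-stop workflow.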

4 months ago

namidark

Has anyone gotten whisper.el/.cpp to work on OSX with the microphone permissions and Emacs?

4 months ago

zvmaz

Would the author mind sharing his Emacs configuration? So beautiful!

4 months ago