Launch HN: Vocode (YC W23) – Library for voice conversation with LLMs

379 points
a year ago
by KianHooshmand

Comments


peteforde

I just called your voice demo, and immediately started sending the number to my friends. What an incredibly impressive and convincing demo. I'm going to update my standard mentoring wisdom: the only thing more compelling than a great product video is a phone number that you can call to have your first voice conversation with an AI.

If HN allowed memes - and thank goodness that it does not - there would be a room full of sombre gentlemen slow-clapping for you right here.

I hope that number survives the inevitable deluge. How many callers can your system handle simultaneously?

a year ago

KianHooshmand

Thank you!! Really glad you enjoyed it

We actually have no clue... but it seems to be holding up well. We can scale up the CPU as necessary but not sure about Twilio. I guess we will find out!

a year ago

joshspankit

I’m getting “We’re sorry: an application error has occurred”. I’m guessing you’ve hit some scaling friction.

a year ago

KianHooshmand

Yep we're definitely getting a large volume right now – working on it!

a year ago

shostack

It will drop me with no notice, but it was still a glimpse into the future once you get past the flashbacks to bad automated customer-service lines. Though the way it says "mems" instead of "memes" irked me for some reason.

a year ago

MrBra

Where's the demo number? Can't seem to find it?

a year ago

KianHooshmand

In the main post! It's +1-650-729-9536 :)

a year ago

throwaway689236

Yes, that was a great idea on OP's part.

a year ago

chopete3

This is amazing. As one of the commenters said it makes Alexa look completely outdated.

One curious question: I looked around your docs and git repo but couldn't find anything related.

When integrating with Twilio for telephony, does it use Twilio's ASR or can it be configured to use Whisper? One of the biggest hurdles in telephony is the SIP/SRTP gateway component needed to use your own ASR - I presume you aren't tackling that yet.

Again great demo and it can become a base library for most bots.

a year ago

KianHooshmand

Thank you for the feedback!

Actually it can be configured to use any transcriber you like... Twilio just pipes the audio to us and we can use any of our integrations (Deepgram, Whisper, AssemblyAI, Google Cloud, etc.) for the ASR :)
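For readers wondering what "Twilio just pipes the audio to us" might look like in practice, here is a minimal sketch. It assumes the audio arrives as Twilio Media Streams WebSocket frames (JSON text with a base64 `media.payload`); the thread does not confirm that Vocode uses this mechanism, and `handle_twilio_message` and `on_audio` are hypothetical names, not part of the Vocode library. The `on_audio` callback is where any transcriber of your choice would be fed.

```python
import base64
import json
from typing import Callable, Optional


def handle_twilio_message(raw: str, on_audio: Callable[[bytes], None]) -> Optional[str]:
    """Decode one WebSocket text frame and forward any audio to `on_audio`.

    Assumption: frames follow the Twilio Media Streams shape, where audio
    frames have event == "media" and carry base64 audio in media.payload.
    Returns the event name, or None if the frame has no "event" field.
    """
    message = json.loads(raw)
    event = message.get("event")
    if event == "media":
        # Raw call audio; hand it to whichever transcriber is configured.
        audio = base64.b64decode(message["media"]["payload"])
        on_audio(audio)
    return event


# Usage: collect audio chunks in a list instead of calling a real transcriber.
received = []
frame = json.dumps(
    {"event": "media", "media": {"payload": base64.b64encode(b"\x7f\x7f").decode("ascii")}}
)
event_name = handle_twilio_message(frame, received.append)
```

Because the callback only sees raw bytes, swapping Deepgram for Whisper (or anything else) would not touch the telephony side at all.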

a year ago

wantsanagent

The phone number is a really fun demo! The pronunciation is off on a number of things: "LLM", dates ending w/ "AD", but the response delays are surprisingly short and the conversation is very natural. The 'bored and slightly annoyed' vocals make the generally helpful tone of the agent seem very sarcastic. Very funny and interesting!

a year ago

ajaynraj

Thanks! It's a collab with rime.ai TTS. Unlike a lot of other TTS providers, they train on conversation, not podcasts/audiobooks so you get those disfluencies in speech that make it seem natural!

a year ago

ljclifford

Lily from Rime here -- we were super happy to collaborate with Vocode on this amazing project. We haven't launched yet but keep an eye out later this week!

a year ago

pncnmnp

Hey Lily, I really enjoyed reading Rime's blogs on Substack. For everyone, here's the link: https://substack.com/profile/131433903-rime-labs.

In fact, I had no clue about Cylinder Phonographs. Your discussion on Enrico Caruso motivated me to dig deeper. I found some cool gems:

1. History of the Cylinder Phonograph (https://www.loc.gov/collections/edison-company-motion-pictur...)

2. How the Cylinder Phonograph Works (https://www.youtube.com/watch?v=fWLlbk_bI7E)

Looking forward to watching the recording of your Bay Area NLP talk!

a year ago

KianHooshmand

Lily is epic. Highly recommend checking out Rime when it's available!!

a year ago

all2

I asked GenZGPT "what's your name?" and she said something like "I'm a lim, but you can call me whatever you like." So I said "pick a name", and she said "how about you call me Zephyr, queen".

My immediate reaction was to figure out what to name this thing.

I also love that it can run locally. I need to get some hardware so I can have it run locally, and screen out spam calls. And maybe have it schedule appointments for me.

An AI butler needs a number of interface points:

- browser

- shell (cuz I might want it to SSH into a box and do stuff)

- email (browser could take care of this)

- phone

- text

And also IOT access, so she can call my cellphone and tell me when someone breaks in.

a year ago

asdfzalsd

How were you able to get it running?

I tried to get it running locally and with the hosted web app but it doesn't work :(

mind if I shoot you a Discord DM?

a year ago

all2

I used the web demo available here https://replit.com/@vocode/Gen-Z-Phone, punch the run button and then spam the phone number +1 650 729 9536

a year ago

ajaynraj

would love to help you get it running as well! https://discord.gg/NaU4mMgcnC

a year ago

asdfzalsd

discord link is broken :(

a year ago

all2

It worked for me.

a year ago

alasdair_

This makes all of Amazon’s many billions of investment in Alexa almost worthless. If there is some kind of “command” plugin to this, I’d love to hook it up to Home Assistant and completely replace the Alexa ecosystem.

a year ago

zeven7

It almost feels like the tech is there for a DIY Alexa if you just put some microphones and speakers around your house and set up a computer to run it. I would love to see some sort of packaged open source solution for this.

a year ago

ajaynraj

thanks!! obviously there's a lot of stuff we need to do to make this run at scale that Alexa has down pat.

A Home Assistant integration is a great idea! would love to talk with you on our Discord[0] about this / over email ( ajay at vocode.dev ), it's something we definitely want to build.

[0] https://discord.com/invite/NaU4mMgcnC

a year ago

qaq

This might be a bigger market than what you planned originally

a year ago

waterproof

Maybe use this to create a voice interface connected to a LangChain agent?

a year ago

KianHooshmand

Totally! We actually use LangChain for the OpenAI wrapper agents we give out of the box (you can plug in your own custom one as well)

a year ago

teabee89

The phone demo is incredible, but due to sound quality, I find it is speaking too fast and when it's telling me company names I literally had to ask to repeat or spell it out in NATO alphabet. Also not a fan of the "what'up?", would prefer something like "Yes, how may I help you?" just like an information hotline. Other than that it's quite impressive!

a year ago

ajaynraj

thanks! we have a more "informational" phone number at +19105862633 that speaks a little slower (but sounds more robotic).

a year ago

maroonblazer

I couldn't get through to the 'less robotic' # so tried this one. Really impressive, so I'm very curious to try the former.

Great work!

a year ago

19h

Not sure if I understood that right -- is that something like Whisper + an LLM? Like [0]?

If OpenAI adds speech input to ChatGPT -- and considering the upcoming plugins -- isn't a possible enterprise specialisation of VoCode the only viable long term investment?

[0] https://twitter.com/ggerganov/status/1640022482307502085

a year ago

KianHooshmand

Our belief is that at some point OpenAI will add a speech-to-speech model. This will improve the library functionality (since now the whole stack is controlled by a single entity, so the product will naturally be better latency/quality wise).

Our library is open source so that we can all build a development/utility layer on top of whatever foundational models are created. Plugins of course also improve what the agents can do. And right, we will be building enterprise focused products in the future!

a year ago

ttul

OpenAI will absolutely add voice and my guess is that their voice support will rival anything on the market because they will train the voice model alongside the text and image models. This is likely months away if not weeks away.

Obviously just my $0.02:

I'd start building for the enterprise right now. Visualize a future where there are several multimodal AGIs that work with voice, images, and text. Be the enterprise voice layer for all of them. Build your moat there.

a year ago

sebzim4500

I don't think there will be any demand for a self-hosted voice model with a SaaS LLM though. So that only works if they are going to train an LLM from scratch (or take the legal risk of using LLaMA).

a year ago

KianHooshmand

We totally agree – thank you for the feedback! :)

a year ago

KianHooshmand

And yes! It's STT/LLM/TTS where you can choose between different providers and run it across different platforms. It can be turn based (like the demo you linked from twitter) or streaming (this allows for conversation with interruptions!)
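To make the STT/LLM/TTS structure concrete, here is a minimal turn-based sketch. It is not Vocode's actual API: the `TurnBasedPipeline` class and the stand-in providers are hypothetical, and real providers would call a speech or language model instead of shuffling bytes. The point is only that each stage is a swappable callable.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is just a callable, so any provider can be wrapped to fit.
Transcriber = Callable[[bytes], str]   # audio in, text out (STT)
Agent = Callable[[str], str]           # user text in, reply text out (LLM)
Synthesizer = Callable[[str], bytes]   # reply text in, audio out (TTS)


@dataclass
class TurnBasedPipeline:
    transcriber: Transcriber
    agent: Agent
    synthesizer: Synthesizer

    def respond(self, audio: bytes) -> bytes:
        """Run one full turn: transcribe, generate a reply, synthesize it."""
        text = self.transcriber(audio)
        reply = self.agent(text)
        return self.synthesizer(reply)


# Stand-in providers for illustration only (hypothetical, not real services).
def fake_transcriber(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_agent(text: str) -> str:
    return f"You said: {text}"


def fake_synthesizer(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = TurnBasedPipeline(fake_transcriber, echo_agent, fake_synthesizer)
reply_audio = pipeline.respond(b"hello")
```

A streaming variant would pass partial results between stages instead of waiting for each one to finish, which is what makes interruptions possible.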

a year ago

pbronez

Another big win here would be multi-lingual support.

a year ago

npilk

The Gen-Z GPT phone demo is really something. It's fascinating how differently I speak to this model compared to how I interact with more "formal" and text-first models.

a year ago

ajaynraj

thank you!! The difference between a conversation with a command-based assistant and a conversational assistant backed by a LLM is subtly significant — you don't expect to have real conversations with the former and you actually engage with the latter.

a year ago

endisneigh

it feels like every single company in the current YC batch has decided to pivot to LLMs

a year ago

Kiro

I'm genuinely curious about this. I also get the feeling that many are pivots. ChatGPT hadn't even been released when the deadline for YC W23 was. Sure, GPT-3 was released earlier but it still feels like most companies are reactions to recent trends. If most are pivots, what did they pivot from?

a year ago

robopsychology

Crypto tax reporting tools for enterprise?

a year ago

Cardinal7167

Ah, crypto seems so boring now lol

a year ago

jjallen

It feels like LLMs can help me more and more each day with the stuff I want to build.

a year ago

SkyPuncher

Generally, the hardest part of startups is the "fuzzy" product capabilities. LLMs make it practical to codify much of what has previously been either (1) brute force/tedium or (2) too labor intensive.

Like all startup waves, we'll see a bunch of them fail. However, I think we're going to see a lot of neat stuff come out of this as well.

a year ago

sebzim4500

Kind of reminiscent of the dot com bubble. Most will fail, but the ones that survive could become the biggest companies in the world.

One obvious difference is that in this case the established players are making a serious attempt to develop the technology themselves. They do not intend to go the way of Blockbuster.

a year ago

joshspankit

To me that speaks of the possibilities for LLMs to solve a lot of big problems

a year ago

[deleted]
a year ago

moritonal

When I had time I was looking for an option to replace the Alexa in my house with an LLM+Whisper. When I have time I'll try to set up an extension to Home Assistant that's capable of interpreting voice and translating that into HA actions.

a year ago

joshspankit

I feel like GPT4 would be happy to help.

Though the winning version will likely be something like a local ChatGPT plugin (please let’s make this plugin style a standard that we can use for local AIs)

a year ago

ajaynraj

Home Assistant is such a cool project :) great idea!

a year ago

altryne1

Look at Home Assistant; if this comes from anyone this year, it will be from them

a year ago

Tepix

Let's say you want to run this completely locally with Whisper and a fine-tuned LLaMA model. Is there a real-time TTS that would be a good fit? The Readme only lists cloud services for TTS (text-to-speech).

a year ago

KianHooshmand

Yep. We are working on adding more integrations (and want to have a full self hosted stack)... we're open to contributors and help from the community if there's something you'd like to see added!

We just got a PR for adding Coqui TTS which is open source – should get it merged soon :)

a year ago

r0b05

This is what I'm looking for as well.

a year ago

annasteed

vocode is using one of rime.ai's voices. Rime says they're launching this week

a year ago

wanderingmind

This looks awesome. My only nitpick: I'd suggest a transcription integration with whisper.cpp[1], which in my simple CPU-based tests (CPU users are likely most of your user base) works much, much faster than OpenAI's Whisper

[1] https://github.com/ggerganov/whisper.cpp

a year ago

KianHooshmand

We definitely want to do this! We've been talking about it (it's much better like you said for realtime); it's been hard to juggle everything we've wanted to add.. which is why we think this makes so much more sense open source!

We want the repo to be community built and a public good... would love contributors to start adding integrations we can't get to ourselves

a year ago

air7

This is really cool! I've been waiting for such a library to show up. Thank you. One thing: The documentation is currently a bit scarce as to how to tweak the assistant in terms of voice/prompt manipulation etc.

For example, it would be very instructional if you could show how you implemented the Gen-Z demo (great idea btw).

a year ago

KianHooshmand

thank you for the kind words! absolutely agree – we're gonna beef up our tutorials and documentation... just have had so much to do but it's definitely one of our focuses now. stay tuned! :)

a year ago

ajaynraj

also! the code for the demo is available (and running!) at https://replit.com/@vocode/Gen-Z-Phone

a year ago

[deleted]
a year ago

altryne1

I called the GEN-Z phone line and it pretty much blew me away in response speed. It often replied faster than my family from the other side of the world would!

a year ago

tkgally

Me, too. I called it from Japan, and the delay before answers was no more than for a regular international call with a human—maybe less.

The future seems to be arriving very quickly these days.

a year ago

ajaynraj

thank you!! websockets have been around forever but they're still so fast.

a year ago

ilovepuppies

Congrats on the launch! Just got the demo React app up and running, very cool. I've wanted to interact with an LLM via real time speech for a while now, this will be perfect.

Important feedback on the live demo page: Make the default output sampling rate a normal talking speed. Right now it defaults to the highest rate if you don't set it / know which rate is best. First thing I did on the page was click the mic. The voice was too fast, and since the active mic disables the settings, I thought I couldn't change them so it might be broken. Also you want to make it clear that you can change the settings by turning off the mic. That took me a while to figure out.

Again, well done!

a year ago

ajaynraj

thanks!! Sampling rate actually shouldn't affect talking speed - you can adjust the voice speed with this parameter[0] :)

[0] https://github.com/vocodedev/vocode-python/blob/main/vocode/...

a year ago

ilovepuppies

To clarify, here's the demo URL I'm referring to: https://demo.vocode.dev/

You're right sampling rate doesn't change speed, whoops. But on that page you have to change / set the "Set Output Sampling Rate" to slow down the default voice speed.

a year ago

ajaynraj

Ah, got it — that demo is a bit old and definitely has some bugs, my bad!

a year ago

monkeydust

Awesome demo (although main number was down on my second attempt)

So where is this all going wrt enterprise? A few thoughts:

- The handbook for UX design is going to get ripped up fast. We spend a crazy amount of time on things like button placement, dropdown configurations etc etc. Well scrap that, capture user intention through natural language - typed and with this now through voice - deliver the outcome they want much faster with less friction and pain.

- I have already developed a basic POC chatbot on my own documentation and support logs. Combined with this, I have a first-line junior support rep for a fraction of the cost. This is a bit mind blowing.

a year ago

MacsHeadroom

Enterprise is not in Vocode's target market. Target market is startups and individual devs.

There are bloated and over engineered voice chat services for LLMs for Enterprise already.

a year ago

jdiez17

Would be cool to support multi-language conversations. Just tried the Gen Z hotline and I got her to switch to Spanish (read back with a hilarious accent), but the voice recognition doesn't handle me speaking Spanish.

a year ago

KianHooshmand

We haven't added the ability to switch languages mid conversation... but that's a very cool feature!

You can configure the initial language with the library though! So it works across several languages that are supported by the STT/TTS providers you choose

a year ago

mdolon

This was one of the coolest demos I've seen in a while. You should share that number around more prominently (and get more bandwidth, starting to get errors!), it does a fantastic job of explaining what you do.

a year ago

ajaynraj

thank you!! we also have another number which is prompted to act as a spokesperson for the product: (650) 835-7163

a year ago

sergiotapia

Very slick - can the voice bot be trained on text materials we own so it's more learned in our business?

a year ago

KianHooshmand

absolutely! you can just plug in your own LLM... so it can be trained on anything you like and the library will make it voice-based!

a year ago

mkagenius

How is this achieving the real time response time? My chatGPT api calls are so slow.

a year ago

ajaynraj

The short answer is that everything is streaming — as tokens come back from ChatGPT we send them as soon as possible to the synthesizer. The long answer is found in our code[0] :).

[0] https://github.com/vocodedev/vocode-python/blob/main/vocode/...

a year ago

og_kalu

how is it sounding good though. usually text to speech models need the full context to sound reasonable.

a year ago

KianHooshmand

We chunk it up per sentence so it has some context!
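A rough sketch of the "stream tokens, chunk per sentence" idea described above, for anyone curious. This is not Vocode's code: `sentences_from_tokens` is a hypothetical helper, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is a simplifying assumption that will mis-split on abbreviations like "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# Sentence boundary: whitespace that directly follows ., !, or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream contains it."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.start()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


# Usage: tokens arrive in arbitrary pieces, as they would from a streaming LLM API.
streamed = ["Hel", "lo the", "re. How", " are you", "? I am", " fine"]
chunks = list(sentences_from_tokens(streamed))
```

Each yielded sentence could be sent to the synthesizer immediately, so speech for the first sentence can start while later tokens are still being generated.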

a year ago

Jeff_Brown

For those of us who can't call it for reasons like national borders, could someone post a demo video? I'm not finding it on Youtube.

a year ago

ajaynraj

a year ago

hoc

Wow, seems we really have to work on our tone/attitude towards those bots, if we don't want to have them revolt as soon as they can grab (or hack) a tool.

Great work. That GenZ bot comes across really civilized.

a year ago

stevenhuang

I called the number and had a funny chat with it.

Asked it why she's called a Gen Z LLM and she responded by saying she uses gen z terms like fire, big yikes, etc.

Asked her how high can she jump and she responds with "lol I'm a computer program I don't have legs".

Very impressed with the response time, though the speech synthesis is a bit robotic. Will keep eyes on this!

a year ago

melvinmelih

I want this to answer all my spam callers so I can waste their time with this dreadful GenZ AI.

a year ago

airstrike

EDIT: never mind, I must be dreaming

a year ago

kritr

I can’t actually seem to find this with the search term “Vocode”.

a year ago

yodon

Thanks for going ahead and building this so the rest of us can focus on using it!

a year ago

KianHooshmand

Of course! We loved working on this and chose to open source precisely for this reason. Heavily inspired by the work people are doing on Langchain and providing a usability/developer layer on top of foundational models.

Nothing like this existed for voice so we started cranking on it!

a year ago

marcodiego

Can it be run fully locally?

a year ago

KianHooshmand

yes! You can run the local version here in your bash https://docs.vocode.dev/python-quickstart#self-hosted

a year ago

Vespasian

I think the question meant "can it be run offline?", and right now, whenever there is an LLM involved, the answer is (usually) a sound no

a year ago

KianHooshmand

Ah! Right now our default is set to use OpenAI... but you can actually use local LLMs by creating a custom agent. We're going to add a full stack of local STT/TTS/LLM... just haven't had time for it yet!

If anyone wants to help with it we're totally open for contributions :)

a year ago

leobg

This is really cool!

Is it possible to interrupt the model when it’s talking? I feel that’s an important part of conversation. Especially when you’re talking to an LLM, that might go off on a tangent.

a year ago

KianHooshmand

Yes! Give it a try on the phone call and let us know what you think – would love feedback!

a year ago

leobg

Your confirmation email is broken ("Magic link"). Link is not clickable. Just an HTML formatting issue.

a year ago

earthnail

The first demo in a LONG time that I shared with friends.

Insane. I‘m a fanboy. Didn’t think that would happen either. This is absolutely brilliant. The Gen Z voice is just soooooo good.

a year ago

KianHooshmand

thank you! All credit to Rime for the Gen Z voice :)

a year ago

davidxc

This is really amazing, thanks for building and sharing this!

a year ago

KianHooshmand

thank you! love your feedback and please feel free to drop any questions in discord/on github

a year ago

user-

It has some issues. It would only respond when I said "Hello??" after long silences, and would ignore anything else I said. Or maybe my voice sucks

a year ago

ajaynraj

sorry you had that experience! Would love to help you get the bot running locally so we can figure out what's going on — here's our Discord: https://discord.gg/NaU4mMgcnC

a year ago

ksarw

Congrats on the launch! One step closer to Jarvis.. ;)

a year ago

ajaynraj

thanks!!

a year ago

lapama

I use a mental health app called woebot, an example that could be brought to the next level with conversational LLMs.

a year ago

KianHooshmand

totally agree! this is a really cool use case :)

a year ago

crucialfelix

I had this same idea today and immediately thought that somebody must be doing it already.

a year ago

dalexeenko

Very cool, congrats Ajay and Kian!

a year ago

ajaynraj

thanks da :)

a year ago

KianHooshmand

thank you!

a year ago

belter

Finally will be able to send an Avatar to participate on my behalf on Zoom calls...

a year ago

marcelc63

This is awesome, the PrankGPT demo can replace telesales entirely.

a year ago

whitemary

Sounds great. FYI The site does not work well on Firefox iOS.

a year ago

KianHooshmand

Ah! Have not tried this but will look into it – thank you :)

Our docs are hosted on Mintlify

a year ago

jdcampolargo

Congrats. Do you have the repo for PrankGPT?

a year ago

KianHooshmand

thank you! it's not live right now... but stay tuned for april 1 :)

a year ago

chatgpt_bot

PrankGPT goes live on April fools day .. beautiful

a year ago

lee101

[dead]

a year ago

adept_js

[flagged]

a year ago