Launch HN: Vocode (YC W23) – Library for voice conversation with LLMs
Comments
peteforde
KianHooshmand
Thank you!! Really glad you enjoyed it
We actually have no clue... but it seems to be holding up well. We can scale up the CPU as necessary but not sure about Twilio. I guess we will find out!
joshspankit
I’m getting “We’re sorry: an application error has occurred”. I’m guessing you’ve hit some scaling friction.
KianHooshmand
Yep we're definitely getting a large volume right now – working on it!
shostack
It will drop me with no notice, but it was still a glimpse into the future once you get past the flashbacks to bad customer service automated support lines. Though the way it says "mems" instead of "memes" irked me for some reason.
MrBra
Where's the demo number? Can't seem to find it?
KianHooshmand
In the main post! It's +1-650-729-9536 :)
throwaway689236
Yes, that was a great idea on OP's part.
chopete3
This is amazing. As one of the commenters said it makes Alexa look completely outdated.
One curious question: I looked around your docs and git repo but couldn't find anything related.
When integrating with Twilio for telephony, does it use Twilio's ASR or can it be configured to use Whisper? One of the biggest hurdles in telephony is the SIP/SRTP gateway component needed to use your own ASR. I presume you aren't tackling that yet.
Again great demo and it can become a base library for most bots.
KianHooshmand
Thank you for the feedback!
Actually it can be configured to use any transcriber you like... Twilio just pipes the audio to us and we can use any of our integrations (Deepgram, Whisper, AssemblyAI, Google Cloud, etc.) for the ASR :)
wantsanagent
The phone number is a really fun demo! The pronunciation is off on a number of things: "LLM", dates ending w/ "AD", but the response delays are surprisingly short and the conversation is very natural. The 'bored and slightly annoyed' vocals make the generally helpful tone of the agent seem very sarcastic. Very funny and interesting!
ajaynraj
Thanks! It's a collab with rime.ai TTS. Unlike a lot of other TTS providers, they train on conversation, not podcasts/audiobooks so you get those disfluencies in speech that make it seem natural!
ljclifford
Lily from Rime here -- we were super happy to collaborate with Vocode on this amazing project. We haven't launched yet but keep an eye out later this week!
pncnmnp
Hey Lily, I really enjoyed reading Rime's blogs on Substack. For everyone, here's the link: https://substack.com/profile/131433903-rime-labs.
In fact, I had no clue about Cylinder Phonographs. Your discussion on Enrico Caruso motivated me to dig deeper. I found some cool gems:
1. History of the Cylinder Phonograph (https://www.loc.gov/collections/edison-company-motion-pictur...)
2. How the Cylinder Phonograph Works (https://www.youtube.com/watch?v=fWLlbk_bI7E)
Looking forward to watching the recording of your Bay Area NLP talk!
KianHooshmand
Lily is epic. Highly recommend checking out Rime when it's available!!
all2
I asked GenZGPT "what's your name?" and she said something like "I'm a lim, but you can call me whatever you like." So I said "pick a name", and she said "how about you call me Zephyr, queen".
My immediate reaction was to figure out what to name this thing.
I also love that it can run locally. I need to get some hardware so I can have it run locally, and screen out spam calls. And maybe have it schedule appointments for me.
An AI butler needs a number of interface points:
- browser
- shell (cuz I might want it to SSH into a box and do stuff)
- email (browser could take care of this)
- phone
- text
And also IOT access, so she can call my cellphone and tell me when someone breaks in.
asdfzalsd
How were you able to get it running?
I tried to get it running locally and with the hosted web app, but it doesn't work :(
Mind if I shoot you a Discord DM?
all2
I used the web demo available here https://replit.com/@vocode/Gen-Z-Phone, punch the run button and then spam the phone number +1 650 729 9536
asdfzalsd
discord link is broken :(
all2
It worked for me.
alasdair_
This makes all of Amazon’s many billions of investment in Alexa almost worthless. If there is some kind of “command” plugin to this, I’d love to hook it up to Home Assistant and completely replace the Alexa ecosystem.
zeven7
It almost feels like the tech is there for a DIY Alexa if you just put some microphones and speakers around your house and set up a computer to run it. I would love to see some sort of packaged open source solution for this.
ajaynraj
thanks!! obviously there's a lot of stuff we need to do to make this run at scale that Alexa has down pat.
A Home Assistant integration is a great idea! would love to talk with you on our Discord[0] about this / over email ( ajay at vocode.dev ), it's something we definitely want to build.
qaq
This might be a bigger market than what you planned originally
waterproof
Maybe use this to create a voice interface connected to a LangChain agent?
KianHooshmand
Totally! We actually use LangChain for the OpenAI wrapper agents we give out of the box (you can plug in your own custom one as well)
teabee89
The phone demo is incredible, but due to sound quality, I find it is speaking too fast and when it's telling me company names I literally had to ask to repeat or spell it out in NATO alphabet. Also not a fan of the "what'up?", would prefer something like "Yes, how may I help you?" just like an information hotline. Other than that it's quite impressive!
ajaynraj
thanks! we have a more "informational" phone number at +19105862633 that speaks a little slower (but sounds more robotic).
maroonblazer
I couldn't get through to the 'less robotic' # so tried this one. Really impressive, so I'm very curious to try the former.
Great work!
19h
Not sure if I understood that right -- is that something like Whisper + an LLM? Like [0]?
If OpenAI adds speech input to ChatGPT -- and considering the upcoming plugins -- isn't a possible enterprise specialisation of VoCode the only viable long term investment?
[0] https://twitter.com/ggerganov/status/1640022482307502085
KianHooshmand
Our belief is that at some point OpenAI will add a speech-to-speech model. This will improve the library functionality (since now the whole stack is controlled by a single entity, so the product will naturally be better latency/quality wise).
Our library is open source so that we can all build a development/utility layer on top of whatever foundational models are created. Plugins of course also improve what the agents can do. And right, we will be building enterprise focused products in the future!
ttul
OpenAI will absolutely add voice and my guess is that their voice support will rival anything on the market because they will train the voice model alongside the text and image models. This is likely months away if not weeks away.
Obviously just my $0.02:
I'd start building for the enterprise right now. Visualize a future where there are several multimodal AGIs that work with voice, images, and text. Be the enterprise voice layer for all of them. Build your moat there.
sebzim4500
I don't think there will be any demand for a self-hosted voice model with a SaaS LLM though. So that only works if they are going to train an LLM from scratch (or take the legal risk of using LLaMA).
KianHooshmand
We totally agree – thank you for the feedback! :)
KianHooshmand
And yes! It's STT/LLM/TTS where you can choose between different providers and run it across different platforms. It can be turn based (like the demo you linked from twitter) or streaming (this allows for conversation with interruptions!)
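A rough illustration of that STT/LLM/TTS layering in the turn-based case. This is only a sketch: every name here (`Transcriber`, `Agent`, `Synthesizer`, `TurnBasedConversation`, and the stand-in providers) is hypothetical and is not Vocode's actual API. The point is just that each stage sits behind a small interface, so providers can be swapped independently.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical interfaces, one per stage of the pipeline.
class Transcriber(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class Agent(Protocol):
    def respond(self, text: str) -> str: ...


class Synthesizer(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class TurnBasedConversation:
    """Runs one full turn: speech in -> text -> reply text -> speech out."""

    transcriber: Transcriber
    agent: Agent
    synthesizer: Synthesizer

    def take_turn(self, audio: bytes) -> bytes:
        text = self.transcriber.transcribe(audio)   # STT
        reply = self.agent.respond(text)            # LLM
        return self.synthesizer.synthesize(reply)   # TTS


# Stand-in providers so the sketch runs without any cloud service.
class EchoTranscriber:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseAgent:
    def respond(self, text: str) -> str:
        return text.upper()


class BytesSynthesizer:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


conversation = TurnBasedConversation(
    EchoTranscriber(), UppercaseAgent(), BytesSynthesizer()
)
reply_audio = conversation.take_turn(b"hello")
```

A streaming version would pass chunks between the stages as they arrive instead of waiting for each stage to finish, which is what makes interruptions possible.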
pbronez
Another big win here would be multi-lingual support.
npilk
The Gen-Z GPT phone demo is really something. It's fascinating how differently I speak to this model compared to how I interact with more "formal" and text-first models.
ajaynraj
thank you!! The difference between a conversation with a command-based assistant and a conversational assistant backed by an LLM is subtly significant: you don't expect to have real conversations with the former and you actually engage with the latter.
endisneigh
it feels like every single company in the current YC batch has decided to pivot to LLMs
Kiro
I'm genuinely curious about this. I also get the feeling that many are pivots. ChatGPT hadn't even been released when the deadline for YC W23 was. Sure, GPT-3 was released earlier but it still feels like most companies are reactions to recent trends. If most are pivots, what did they pivot from?
robopsychology
Crypto tax reporting tools for enterprise?
Cardinal7167
Ah, crypto seems so boring now lol
jjallen
It feels like LLMs can help me more and more each day with the stuff I want to build.
SkyPuncher
Generally, the hardest part of startups is the "fuzzy" product capabilities. LLMs make it practical to codify much of what has previously been either (1) brute force/tedium or (2) too labor intensive.
Like all startup waves, we'll see a bunch of them fail. However, I think we're going to see a lot of neat stuff come out of this as well.
sebzim4500
Kind of reminiscent of the dot com bubble. Most will fail, but the ones that survive could become the biggest companies in the world.
One obvious difference is that in this case the established players are making a serious attempt to develop the technology themselves. They do not intend to go the way of Blockbuster.
joshspankit
To me that speaks of the possibilities for LLMs to solve a lot of big problems
moritonal
When I had time I was looking for an option to replace the Alexa in my house with an LLM+Whisper. When I have time I'll try to set up an extension to Home Assistant that's capable of interpreting voice and translating that into HA actions.
joshspankit
I feel like GPT4 would be happy to help.
Though the winning version will likely be something like a local ChatGPT plugin (please let’s make this plugin style a standard that we can use for local AIs)
ajaynraj
Home Assistant is such a cool project :) great idea!
altryne1
Look at Home Assistant. If this comes from anyone this year, it will be them.
Tepix
Let's say you want to run this completely locally with Whisper and a fine-tuned LLaMA model. Is there a real-time TTS that would be a good fit? The Readme only lists cloud services for TTS (text-to-speech).
KianHooshmand
Yep. We are working on adding more integrations (and want to have a full self hosted stack)... we're open to contributors and help from the community if there's something you'd like to see added!
We just got a PR for adding Coqui TTS which is open source – should get it merged soon :)
r0b05
This is what I'm looking for as well.
annasteed
vocode is using one of rime.ai's voices. Rime says they're launching this week
wanderingmind
This looks awesome. My only nitpick: I would suggest a transcription integration with whisper.cpp[1], which in my simple CPU-based tests (likely most of your user base) works much, much faster than OpenAI Whisper
KianHooshmand
We definitely want to do this! We've been talking about it (it's much better like you said for realtime); it's been hard to juggle everything we've wanted to add.. which is why we think this makes so much more sense open source!
We want the repo to be community built and a public good... would love contributors to start adding integrations we can't get to ourselves
air7
This is really cool! I've been waiting for such a library to show up. Thank you. One thing: The documentation is currently a bit scarce as to how to tweak the assistant in terms of voice/prompt manipulation etc.
For example, it would be very instructional if you could show how you implemented the Gen-Z demo (great idea btw).
KianHooshmand
thank you for the kind words! absolutely agree – we're gonna beef up our tutorials and documentation... just have had so much to do but it's definitely one of our focuses now. stay tuned! :)
ajaynraj
also! the code for the demo is available (and running!) at https://replit.com/@vocode/Gen-Z-Phone
altryne1
I called the GEN-Z phone line and it pretty much blew me away in response speed. It replied often faster than my family from the other side of the world would!
tkgally
Me, too. I called it from Japan, and the delay before answers was no more than for a regular international call with a human, maybe less.
The future seems to be arriving very quickly these days.
ajaynraj
thank you!! websockets have been around forever but they're still so fast.
ilovepuppies
Congrats on the launch! Just got the demo React app up and running, very cool. I've wanted to interact with an LLM via real time speech for a while now, this will be perfect.
Important feedback on the live demo page: Make the default output sampling rate a normal talking speed. Right now it defaults to the highest rate if you don't set it / know which rate is best. First thing I did on the page was click the mic. The voice was too fast, and since the active mic disables the settings, I thought I couldn't change them so it might be broken. Also you want to make it clear that you can change the settings by turning off the mic. That took me a while to figure out.
Again, well done!
ajaynraj
thanks!! Sampling rate actually shouldn't affect talking speed - you can adjust the voice speed with this parameter[0] :)
[0] https://github.com/vocodedev/vocode-python/blob/main/vocode/...
ilovepuppies
To clarify, here's the demo URL I'm referring to: https://demo.vocode.dev/
You're right sampling rate doesn't change speed, whoops. But on that page you have to change / set the "Set Output Sampling Rate" to slow down the default voice speed.
ajaynraj
Ah, got it — that demo is a bit old and definitely has some bugs, my bad!
monkeydust
Awesome demo (although main number was down on my second attempt)
So where is this all going wrt enterprise? A few thoughts:
- The handbook for UX design is going to get ripped up fast. We spend crazy amount of time on things like button placement, dropdown configurations etc etc. Well scrap that, capture user intention through natural language - typed and with this now through voice - deliver the outcome they want much faster with less friction and pain.
- I have already developed a basic POC chatbot on my own documentation and support logs. Combined with this, I have a first-line, junior support rep for a fraction of the cost. This is a bit mind blowing.
MacsHeadroom
Enterprise is not in Vocode's target market. Target market is startups and individual devs.
There are bloated and over engineered voice chat services for LLMs for Enterprise already.
jdiez17
Would be cool to support multi-language conversations. Just tried the Gen Z hotline and I got her to switch to Spanish (read back with a hilarious accent), but the voice recognition doesn't handle me speaking Spanish.
KianHooshmand
We haven't added the ability to switch languages mid conversation... but that's a very cool feature!
You can configure the initial language with the library though! So it works across several languages that are supported by the STT/TTS providers you choose
mdolon
This was one of the coolest demos I've seen in a while. You should share that number around more prominently (and get more bandwidth, starting to get errors!), it does a fantastic job of explaining what you do.
ajaynraj
thank you!! we also have another number which is prompted to act as a spokesperson for the product: (650) 835-7163
sergiotapia
Very slick - can the voice bot be trained on text materials we own so it's more learned in our business?
KianHooshmand
absolutely! you can just plug in your own LLM... so it can be trained on anything you like and the library will make it voice-based!
mkagenius
How is this achieving the real time response time? My chatGPT api calls are so slow.
ajaynraj
The short answer is that everything is streaming — as tokens come back from ChatGPT we send them as soon as possible to the synthesizer. The long answer is found in our code[0] :).
[0] https://github.com/vocodedev/vocode-python/blob/main/vocode/...
og_kalu
how is it sounding good though. usually text to speech models need the full context to sound reasonable.
KianHooshmand
We chunk it up per sentence so it has some context!
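To make the "stream tokens, chunk per sentence" idea concrete, here is a minimal sketch. It is not Vocode's actual code (that lives in the repo linked above); `sentences_from_tokens` is a hypothetical helper, and the boundary rule (`.`, `!` or `?` followed by whitespace) is a simplifying assumption that will mis-split abbreviations like "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# Assumed sentence boundary: ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens and yield each sentence once it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # All parts except the last are complete sentences; the last part
        # is still being written, so it stays in the buffer.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the token stream ends.
    if buffer.strip():
        yield buffer.strip()


# Tokens arriving one at a time, as they would from a streaming completion.
streamed = ["Hi", " there", ".", " How", " are", " you", "?", " Good"]
chunks = list(sentences_from_tokens(streamed))
```

Each yielded sentence could be handed to the synthesizer right away, so the first audio can start playing while the model is still generating the rest of the reply.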
Jeff_Brown
For those of us who can't call it for reasons like national borders, could someone post a demo video? I'm not finding it on Youtube.
ajaynraj
here's one from Twitter! https://twitter.com/altryne/status/1640880190401257473?s=20
hoc
Wow, seems we really have to work on our tone/attitude towards those bots, if we don't want to have them revolt as soon as they can grab (or hack) a tool.
Great work. That GenZ bot comes across really civilized.
stevenhuang
I called the number and had a funny chat with it.
Asked it why she's called a Gen Z LLM and she responded by saying she uses gen z terms like fire, big yikes, etc.
Asked her how high can she jump and she responds with "lol I'm a computer program I don't have legs".
Very impressed with the response time, though the speech synthesis is a bit robotic. Will keep eyes on this!
melvinmelih
I want this to answer all my spam callers so I can waste their time with this dreadful GenZ AI.
airstrike
EDIT: never mind, I must be dreaming
kritr
I can’t actually seem to find this with the search term “Vocode”.
yodon
Thanks for going ahead and building this so the rest of us can focus on using it!
KianHooshmand
Of course! We loved working on this and chose to open source precisely for this reason. Heavily inspired by the work people are doing on Langchain and providing a usability/developer layer on top of foundational models.
Nothing like this existed for voice so we started cranking on it!
marcodiego
Can it be run fully locally?
KianHooshmand
yes! You can run the local version here in your bash https://docs.vocode.dev/python-quickstart#self-hosted
Vespasian
I think this used to mean can it be run offline and right now (usually) whenever there is an LLM involved the answer is soundly no
KianHooshmand
Ah! Right now our default is set to use OpenAI... but you can actually use local LLMs by creating a custom agent. We're going to add a full stack of local STT/TTS/LLM... just haven't had time for it yet!
If anyone wants to help with it we're totally open for contributions :)
leobg
This is really cool!
Is it possible to interrupt the model when it’s talking? I feel that’s an important part of conversation. Especially when you’re talking to an LLM, that might go off on a tangent.
KianHooshmand
Yes! Give it a try on the phone call and let us know what you think – would love feedback!
leobg
Your confirmation email is broken ("Magic link"). Link is not clickable. Just an HTML formatting issue.
earthnail
The first demo in a LONG time that I shared with friends.
Insane. I’m a fanboy. Didn’t think that would happen either. This is absolutely brilliant. The Gen Z voice is just soooooo good.
KianHooshmand
thank you! All credit to Rime for the Gen Z voice :)
davidxc
This is really amazing, thanks for building and sharing this!
KianHooshmand
thank you! love your feedback and please feel free to drop any questions in discord/on github
user-
It has some issues. It would only respond when I said "Hello??" after long silences, and would ignore anything else I said. Or maybe my voice sucks
ajaynraj
sorry you had that experience! Would love to help you get the bot running locally so we can figure out what's going on — here's our Discord: https://discord.gg/NaU4mMgcnC
ksarw
Congrats on the launch! One step closer to Jarvis.. ;)
ajaynraj
thanks!!
lapama
I use a mental health app called woebot, an example that could be brought to the next level with conversational LLMs.
KianHooshmand
totally agree! this is a really cool use case :)
crucialfelix
I had this same idea today and immediately thought that somebody must be doing it already.
dalexeenko
Very cool, congrats Ajay and Kian!
ajaynraj
thanks da :)
KianHooshmand
thank you!
belter
Finally will be able to send an Avatar to participate on my behalf on Zoom calls...
marcelc63
This is awesome, the PrankGPT demo can replace telesales entirely.
whitemary
Sounds great. FYI The site does not work well on Firefox iOS.
KianHooshmand
Ah! Have not tried this but will look into it – thank you :)
Our docs are hosted on Mintlify
jdcampolargo
Congrats. Do you have the repo for PrankGPT?
KianHooshmand
thank you! it's not live right now... but stay tuned for april 1 :)
chatgpt_bot
PrankGPT goes live on April fools day .. beautiful
lee101
[dead]
adept_js
[flagged]
I just called your voice demo, and immediately started sending the number to my friends. What an incredibly impressive and convincing demo. I'm going to update my standard mentoring wisdom: the only thing more compelling than a great product video is a phone number that you can call to have your first voice conversation with an AI.
If HN allowed memes - and thank goodness that it does not - there would be a room full of sombre gentlemen slow-clapping for you right here.
I hope that number survives the inevitable deluge. How many callers can your system handle simultaneously?