Based on our records, Coqui STT should be more popular than Silero VAD. It has been mentioned 13 times since March 2021. We track product recommendations and mentions across various public social media platforms and blogs; they can help you identify which product is more popular and what people think of it.
I just noticed that https://coqui.ai/ is "Shutting down". I'm building a web app (React/Django) which takes a list of affirmations & goals (in Markdown files), puts them into a database (SQLite), and uses voice synthesis to create audio files of the phrases. These are combined with a relaxed backing track (ffmpeg) and made into playlists of 10-20 phrases (randomly sampled, or according to a theme: "mind"... - Source: Hacker News / about 3 hours ago
Not sure how relevant this is but note that Coqui TTS (the realistic TTS) has already shut down https://coqui.ai. - Source: Hacker News / about 1 month ago
You can take a look at https://coqui.ai. Source: 8 months ago
I haven't messed with anything more fancy than Festival but I would look at coqui.ai. Source: 11 months ago
This. You can create voice models for TTS with a variety of systems - commercial and free, e.g. https://www.resemble.ai, https://coqui.ai, etc. - and use that with GPT text. But I don't think you can get GPT to directly do the TTS. My guess is OP accidentally made this confusing in their post title. Source: 11 months ago
>How do you detect speech starting and stopping? https://github.com/snakers4/silero-vad. - Source: Hacker News / 6 months ago
You could look into https://github.com/guillaumekln/faster-whisper especially the VAD section (Voice Activity Detector) using https://github.com/snakers4/silero-vad. Source: 10 months ago
I also had the same synchronization issue, so I wrote a WebUI/CLI that uses Silero-VAD to first split the audio whenever there is a silent portion (or every 30 seconds), and I haven't experienced it since. Source: 11 months ago
By the way, I've updated the WebUI to now also support using Silero VAD to break up the audio into distinct sections, and run Whisper on each section and then combine them into one single transcript/SRT file. Source: over 1 year ago
And while googling this, I stumbled upon this discussion on the Whisper GitHub repository, which seems to suggest that the issue is that the current VAD (Voice Activity Detection) is quite poor, and that it can be resolved by using another VAD (like silero-vad). This might be something I want to add to my WebUI in the future. Source: over 1 year ago
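The splitting strategy described in the mentions above can be sketched in plain Python. This is only an illustration of the split-on-silence idea, not Silero VAD itself: a simple amplitude threshold stands in for the neural VAD model (an assumption for clarity), and chunks are capped at 30 seconds as in the quoted WebUI. The function name and parameters are hypothetical.

```python
def split_on_silence(samples, sample_rate, threshold=0.01,
                     min_silence_s=0.5, max_chunk_s=30.0):
    """Return (start, end) sample-index pairs for speech chunks.

    A chunk ends when a run of `min_silence_s` seconds of low-amplitude
    samples is seen (stand-in for a real VAD decision), or when the
    chunk reaches `max_chunk_s` seconds, whichever comes first.
    """
    min_silence = int(min_silence_s * sample_rate)
    max_chunk = int(max_chunk_s * sample_rate)
    chunks, start, silent_run = [], 0, 0
    for i, s in enumerate(samples):
        # Count consecutive "silent" samples; reset on any loud sample.
        silent_run = silent_run + 1 if abs(s) < threshold else 0
        if (silent_run == min_silence or i - start >= max_chunk) and i > start:
            chunks.append((start, i))
            start, silent_run = i, 0
    if start < len(samples):
        chunks.append((start, len(samples)))
    return chunks
```

Each resulting chunk could then be transcribed separately (e.g. by Whisper) and the per-chunk transcripts concatenated into one transcript/SRT, which is the workflow the WebUI mentions describe.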
Rask AI - Say goodbye to expensive translators. Our goal is to provide a dubbing and translation experience with AI that is as good as a human
Whisper Memos - Whisper Memos turns your ramblings into paragraphed articles, and emails them to you.
The Parodist App - Super-realistic celebs' voices made by AI
Lovo.ai - AI Voice Creation Platform for marketing, HR, audiobook, e-learning, movies and games.
MacWhisper - High Quality Text Transcription with OpenAI's Whisper on Mac
Fourie - A GenAI Multimodal Content Localization Platform