Introducing SpeechLLM 2.0: Our Most Accurate Model Yet
We're excited to announce the release of SpeechLLM 2.0, featuring improved accuracy, faster processing, and support for 20 new languages.
Sarah Chen
VP of Engineering
Today marks a major milestone in VM Hunter's journey: the release of SpeechLLM 2.0, our most advanced voicemail detection model to date. After 18 months of research and development, we're proud to share what we've built with the world.
What's New in SpeechLLM 2.0
Improved Accuracy
Our new model achieves 99.7% accuracy in distinguishing between live humans and voicemail systems—a significant improvement from the 97.2% accuracy of our previous version. A 2.5-point jump may look small, but for high-volume call centers processing millions of calls per month, it translates to tens of thousands of additional correctly routed calls (roughly 25,000 per million calls).
The improvement comes from several innovations:
- **Transformer-based architecture**: We've moved from our previous CNN-LSTM hybrid to a pure transformer architecture, allowing the model to capture longer-range dependencies in audio patterns.
- **Multi-task learning**: The model now simultaneously learns to detect voicemail greetings, beep tones, and silence patterns, sharing representations across tasks.
- **Adversarial training**: We trained against synthetic voicemail recordings specifically designed to fool our detector, making the production model more robust.
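To make the multi-task idea concrete, here is a toy sketch of a shared encoder feeding three task heads. This is not the production model—the dimensions match the numbers quoted later in the post, but the single-layer "encoder" and random weights are illustrative stand-ins only.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 768   # shared encoder output size (matches the model described below)
FRAMES = 100   # audio frames in one utterance (arbitrary for this sketch)

def shared_encoder(features: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer encoder: one linear layer + ReLU."""
    w = rng.standard_normal((features.shape[-1], HIDDEN)) * 0.01
    return np.maximum(features @ w, 0.0)

def task_head(hidden: np.ndarray, n_classes: int) -> np.ndarray:
    """Per-task classifier over the mean-pooled shared representation."""
    pooled = hidden.mean(axis=0)
    w = rng.standard_normal((HIDDEN, n_classes)) * 0.01
    logits = pooled @ w
    return np.exp(logits) / np.exp(logits).sum()  # softmax probabilities

features = rng.standard_normal((FRAMES, 80))  # e.g. 80 mel bins per frame
hidden = shared_encoder(features)

# Three heads share one representation, so gradients from each task
# would shape the same encoder during training.
greeting = task_head(hidden, 2)  # voicemail greeting vs. not
beep     = task_head(hidden, 2)  # beep tone vs. not
silence  = task_head(hidden, 2)  # silence pattern vs. not
```

The key property is that all three heads read the same `hidden` tensor; in training, each task's loss updates the shared encoder, which is where the cross-task transfer comes from.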
40% Faster Processing
Speed matters in real-time call routing. SpeechLLM 2.0 processes audio streams in under 30 milliseconds, down from 50ms in the previous version. This improvement comes from:
- Optimized attention mechanisms that reduce computational complexity
- Better use of hardware acceleration on modern GPUs
- Streaming inference that begins analysis before the audio buffer is complete
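The streaming point is the easiest to illustrate: instead of waiting for the full audio buffer, the detector scores each chunk as it arrives and can return early. The sketch below uses a made-up energy threshold as the "model"—the chunk size and threshold are hypothetical, not our actual values.

```python
from typing import Iterator, List

def stream_chunks(audio: List[float], chunk: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks as they 'arrive' from the call."""
    for i in range(0, len(audio), chunk):
        yield audio[i:i + chunk]

def streaming_detect(audio: List[float], chunk: int = 200) -> str:
    """Keep a running score per chunk and exit early once confident."""
    energy = 0.0
    seen = 0
    for frame in stream_chunks(audio, chunk):
        energy += sum(x * x for x in frame)
        seen += len(frame)
        if energy / seen > 0.5:  # hypothetical decision threshold
            return "voicemail"   # early exit: full buffer never needed
    return "human"
```

Because the decision can fire on an early chunk, worst-case latency is bounded by the buffer but typical latency is much lower—the same principle behind the sub-30ms figure above.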
20 New Languages
We've expanded our language support from 30 to 50 languages, including:
- Thai, Vietnamese, and Indonesian
- Polish, Czech, and Hungarian
- Hebrew, Arabic, and Farsi
- Swahili, Yoruba, and Amharic
Each language model was trained on thousands of hours of real voicemail recordings, with native speakers helping to verify accuracy across regional dialects.
Technical Deep Dive
For those interested in the technical details, our model uses a 12-layer transformer encoder with 768 hidden dimensions and 12 attention heads. The audio input is processed through a mel-spectrogram frontend with 80 mel bins and a 25ms window size.
We use a novel "audio tokenization" approach where continuous audio features are discretized into a vocabulary of 8,192 tokens, allowing us to leverage techniques from NLP for audio processing.
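A minimal way to picture the tokenization step is nearest-neighbor vector quantization: each 80-dimensional spectrogram frame is mapped to the index of its closest entry in a codebook of 8,192 vectors. The sketch below uses a random codebook purely for illustration; in practice such a codebook would be learned (e.g., via k-means or a VQ-style objective), and this is not our actual training code.

```python
import numpy as np

rng = np.random.default_rng(42)

VOCAB = 8192   # token vocabulary size from the post
N_MELS = 80    # mel bins from the post

# Hypothetical codebook: random here, learned in a real system.
codebook = rng.standard_normal((VOCAB, N_MELS))

def tokenize(frames: np.ndarray) -> np.ndarray:
    """Map each 80-dim frame to the index of its nearest codebook entry."""
    # (T, 1, 80) vs (VOCAB, 80) -> (T, VOCAB) squared distances
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

frames = rng.standard_normal((10, N_MELS))  # 10 spectrogram frames
tokens = tokenize(frames)                   # 10 discrete token ids
```

Once audio is a sequence of integer token ids, standard NLP machinery—embedding tables, transformer encoders, masked prediction—applies directly.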
Training was conducted on our distributed GPU cluster using 128 A100 GPUs over 3 weeks, with a total training cost of approximately $400,000. The resulting model is 85MB, small enough to be deployed at the edge for on-premise customers.
Migration Guide
Existing customers can upgrade to SpeechLLM 2.0 by simply updating their API endpoint:
```
// Old endpoint
https://api.vmhunter.com/v1/analyze

// New endpoint
https://api.vmhunter.com/v2/analyze
```
The request and response formats remain backward compatible, so no code changes are required beyond the URL update. We'll continue supporting the v1 API for 12 months to give everyone time to migrate.
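Since only the URL changes, the migration can be a one-line string rewrite wherever your client builds the endpoint. A minimal sketch (the helper name is ours, not part of the SDK):

```python
def upgrade_endpoint(url: str) -> str:
    """Rewrite a v1 analyze URL to the v2 endpoint; leave other URLs untouched."""
    return url.replace("/v1/analyze", "/v2/analyze")

old = "https://api.vmhunter.com/v1/analyze"
new = upgrade_endpoint(old)
```

Applying this at the point where the URL is configured—rather than scattering `v2` literals through the codebase—also makes the eventual v1 shutoff a non-event.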
What's Next
We're already working on SpeechLLM 3.0, which will introduce:
- Real-time transcription of voicemail greetings
- Sentiment analysis for detected human speech
- Custom model fine-tuning for enterprise customers
Thank you to everyone who has been part of this journey. We're excited to see what you build with SpeechLLM 2.0.