
Introducing SpeechLLM 2.0: Our Most Accurate Model Yet

We're excited to announce the release of SpeechLLM 2.0, featuring improved accuracy, faster processing, and support for 20 new languages.

Sarah Chen

VP of Engineering

February 25, 2026
5 min read

Today marks a major milestone in VM Hunter's journey: the release of SpeechLLM 2.0, our most advanced voicemail detection model to date. After 18 months of research and development, we're proud to share what we've built with the world.

What's New in SpeechLLM 2.0

Improved Accuracy

Our new model achieves 99.7% accuracy in distinguishing between live humans and voicemail systems, up from 97.2% in our previous version. That might look like a small jump in percentage points, but it cuts the error rate from 2.8% to 0.3%—nearly a tenfold reduction. For high-volume call centers processing millions of calls per month, this translates to tens of thousands of correctly routed calls.
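To make the improvement concrete, here is the arithmetic for a hypothetical call center; the one-million-calls-per-month volume is an illustrative assumption, while the accuracy figures come from the numbers above.

```python
# Error-rate math behind the accuracy claim, using an assumed
# volume of 1 million calls per month for illustration.
old_accuracy = 0.972
new_accuracy = 0.997
calls_per_month = 1_000_000

old_errors = calls_per_month * (1 - old_accuracy)   # ~28,000 misrouted calls
new_errors = calls_per_month * (1 - new_accuracy)   # ~3,000 misrouted calls

print(f"Misroutes before: {old_errors:,.0f}")
print(f"Misroutes after:  {new_errors:,.0f}")
print(f"Additional correctly routed calls: {old_errors - new_errors:,.0f}")
```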

The improvement comes from several innovations:

  • **Transformer-based architecture**: We've moved from our previous CNN-LSTM hybrid to a pure transformer architecture, allowing the model to capture longer-range dependencies in audio patterns.
  • **Multi-task learning**: The model now simultaneously learns to detect voicemail greetings, beep tones, and silence patterns, sharing representations across tasks.
  • **Adversarial training**: We trained against synthetic voicemail recordings specifically designed to fool our detector, making the production model more robust.
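The multi-task setup can be sketched as one shared encoder feeding several task-specific heads. Everything below—the single-layer "encoder", the dimensions, the random weights—is illustrative, not the production architecture; only the three task names come from the list above.

```python
import numpy as np

# Minimal sketch of multi-task learning: one shared encoder feeds three
# task-specific heads (greeting, beep, silence), so all tasks learn from
# the same underlying representation.
rng = np.random.default_rng(0)

feature_dim, hidden_dim = 80, 64  # e.g. 80 mel bins in, 64-dim shared space
W_shared = rng.standard_normal((feature_dim, hidden_dim)) * 0.1

heads = {  # one output weight vector per task
    "greeting": rng.standard_normal(hidden_dim) * 0.1,
    "beep": rng.standard_normal(hidden_dim) * 0.1,
    "silence": rng.standard_normal(hidden_dim) * 0.1,
}

def forward(frame):
    """Shared representation, then one sigmoid score per task."""
    h = np.tanh(frame @ W_shared)  # computed once, shared across all tasks
    return {task: 1 / (1 + np.exp(-(h @ w))) for task, w in heads.items()}

scores = forward(rng.standard_normal(feature_dim))
print(scores)  # three probabilities derived from one shared encoding
```

Because the encoder is shared, a gradient update from any one task's loss also improves the representation the other two tasks see.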

40% Faster Processing

Speed matters in real-time call routing. SpeechLLM 2.0 processes audio streams in under 30 milliseconds, down from 50ms in the previous version. This improvement comes from:

  • Optimized attention mechanisms that reduce computational complexity
  • Better use of hardware acceleration on modern GPUs
  • Streaming inference that begins analysis before the audio buffer is complete
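The streaming-inference idea in the last bullet can be sketched as a detector that consumes chunks as they arrive and commits to a decision as soon as it is confident, instead of waiting for a complete buffer. The chunk size, scoring stand-in, and threshold are all illustrative assumptions.

```python
# Sketch of streaming inference: analysis starts on each audio chunk as it
# arrives, rather than after the full buffer is received.
def audio_chunks(n_chunks, chunk_ms=10):
    """Simulate an incoming 16 kHz audio stream, chunk by chunk."""
    for _ in range(n_chunks):
        yield [0.0] * (16 * chunk_ms)  # 16 samples per millisecond

def streaming_detector(stream, threshold=0.9):
    """Emit a decision as soon as confidence crosses the threshold."""
    confidence = 0.0
    for i, chunk in enumerate(stream):
        confidence += 0.25  # stand-in for a real per-chunk model score
        if confidence >= threshold:
            return i + 1    # decided after this many chunks
    return None             # stream ended without a confident decision

chunks_needed = streaming_detector(audio_chunks(10))
print(f"Decision after {chunks_needed} chunks, before the stream ended")
```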

20 New Languages

We've expanded our language support from 30 to 50 languages, including:

  • Thai, Vietnamese, and Indonesian
  • Polish, Czech, and Hungarian
  • Hebrew, Arabic, and Farsi
  • Swahili, Yoruba, and Amharic

Each language model was trained on thousands of hours of real voicemail recordings, with native speakers helping to verify accuracy across regional dialects.

Technical Deep Dive

For those interested in the technical details, our model uses a 12-layer transformer encoder with 768 hidden dimensions and 12 attention heads. The audio input is processed through a mel-spectrogram frontend with 80 mel bins and a 25ms window size.
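The frame geometry of that frontend works out as follows. The 25ms window and 80 mel bins come from the description above; the 10ms hop and 16 kHz sample rate are common defaults assumed here for illustration.

```python
# Frame geometry for the mel-spectrogram frontend described above.
sample_rate = 16_000
window_ms, hop_ms = 25, 10  # hop is an assumed default
n_mels = 80

window_samples = sample_rate * window_ms // 1000  # 400 samples per window
hop_samples = sample_rate * hop_ms // 1000        # 160 samples between frames

audio_seconds = 2.0  # a 2-second clip
n_samples = int(sample_rate * audio_seconds)
n_frames = 1 + (n_samples - window_samples) // hop_samples

print(f"{n_frames} frames x {n_mels} mel bins")  # spectrogram shape
```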

We use a novel "audio tokenization" approach where continuous audio features are discretized into a vocabulary of 8,192 tokens, allowing us to leverage techniques from NLP for audio processing.
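One way to realize this kind of tokenization is nearest-neighbor vector quantization: each continuous feature vector is snapped to its closest entry in a fixed codebook, producing a discrete token id. This is a sketch of the general technique, not SpeechLLM's actual tokenizer, and the codebook here is random where a real one would be learned.

```python
import numpy as np

# Sketch of audio tokenization via vector quantization: each frame is
# mapped to the index of its nearest codebook vector.
rng = np.random.default_rng(0)

vocab_size, feature_dim = 8_192, 80
codebook = rng.standard_normal((vocab_size, feature_dim))

def tokenize(frames):
    """Return one token id in [0, vocab_size) per input frame."""
    # squared Euclidean distance from every frame to every codebook entry
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

frames = rng.standard_normal((5, feature_dim))
tokens = tokenize(frames)
print(tokens)  # five discrete token ids
```

Once audio is a sequence of discrete ids, standard NLP machinery—embedding tables, language-model-style pretraining—applies directly.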

Training was conducted on our distributed GPU cluster using 128 A100 GPUs over 3 weeks, with a total training cost of approximately $400,000. The resulting model is 85MB, small enough to be deployed at the edge for on-premise customers.
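As a sanity check, the stated dimensions are consistent with the 85MB figure. Only the 12 layers and 768 hidden dimensions come from the post; the 4x feed-forward expansion and one-byte (int8) weights are assumptions in this back-of-the-envelope estimate.

```python
# Rough parameter count for a 12-layer, 768-dim transformer encoder,
# ignoring biases, layer norms, and embeddings.
d, layers = 768, 12
attention = 4 * d * d   # Q, K, V, and output projections
ffn = 2 * d * (4 * d)   # two linear layers with an assumed 4x expansion
per_layer = attention + ffn
total = layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")
print(f"~{total / 1e6:.0f}MB at one byte per weight (int8 quantization)")
```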

Migration Guide

Existing customers can upgrade to SpeechLLM 2.0 by simply updating their API endpoint:

```
// Old endpoint
https://api.vmhunter.com/v1/analyze

// New endpoint
https://api.vmhunter.com/v2/analyze
```

The request and response formats remain backward compatible, so no code changes are required beyond the URL update. We'll continue supporting the v1 API for 12 months to give everyone time to migrate.
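In practice the upgrade can be as small as one line in a client's configuration; this sketch assumes the endpoint URL is stored as a plain string, which is the only change the migration requires.

```python
# The only change the v2 migration requires: the version segment of the URL.
OLD_ENDPOINT = "https://api.vmhunter.com/v1/analyze"
NEW_ENDPOINT = OLD_ENDPOINT.replace("/v1/", "/v2/")

print(NEW_ENDPOINT)  # request and response handling stay as they were
```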

What's Next

We're already working on SpeechLLM 3.0, which will introduce:

  • Real-time transcription of voicemail greetings
  • Sentiment analysis for detected human speech
  • Custom model fine-tuning for enterprise customers

Thank you to everyone who has been part of this journey. We're excited to see what you build with SpeechLLM 2.0.