Engineering

The Challenges of Multilingual Voicemail Detection

How we trained our models to understand voicemails in 50+ languages while maintaining high accuracy across regional accents.

Dr. Aisha Patel

ML Research Lead

January 28, 2026
10 min read

Supporting 50+ languages isn't just about translating "Please leave a message." It requires understanding the linguistic and cultural nuances of how people around the world interact with voicemail systems.

The Multilingual Challenge

When we started VM Hunter, we focused on English—specifically American English. Expanding to other languages revealed challenges we hadn't anticipated.

Linguistic Diversity

Languages differ in fundamental ways that affect voicemail detection:

Word Order: English follows Subject-Verb-Object order, but Japanese uses Subject-Object-Verb. This affects where key phrases like "leave a message" appear in the audio stream.

Phonetics: Mandarin Chinese is tonal—the same syllable with different tones has different meanings. Our models needed to understand tonal patterns specific to voicemail greetings.

Formality Levels: Japanese has multiple politeness levels. Formal voicemail greetings use different vocabulary and speech patterns than casual ones.

Cultural Differences

Voicemail conventions vary by culture:

  • **Germany**: Greetings often include the caller's expected callback time
  • **Japan**: Apologies for not answering are common
  • **Brazil**: Greetings tend to be longer and more personal
  • **India**: Multiple languages may appear in a single greeting (code-switching)

Technical Challenges

Beyond linguistics, we faced technical hurdles:

  • **Data scarcity**: Some languages have very few available voicemail recordings
  • **Accent variation**: Hindi alone has dozens of regional accents
  • **Code-switching**: Speakers often mix languages (Spanglish, Hinglish)

Our Approach

We developed a multi-pronged strategy to address these challenges.

Universal Audio Representations

Instead of training separate models for each language, we developed a shared audio representation that captures speech patterns across languages.

Our approach uses self-supervised learning on 100,000 hours of unlabeled audio from 100+ languages. The model learns to:

  • Distinguish speech from non-speech sounds
  • Identify speaker changes
  • Recognize prosodic patterns (rhythm, stress, intonation)

This pre-trained representation transfers remarkably well to voicemail detection in new languages, even with limited labeled data.
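The pre-train-then-transfer idea can be sketched in miniature: a frozen encoder produces fixed embeddings, and only a small classification head is trained on the limited labeled data. This is an illustrative toy (the "encoder" here is just mean/variance statistics, and all names are hypothetical), not VM Hunter's actual model code.

```python
import math
import random

def encoder(audio):
    """Frozen pretrained encoder stand-in: maps raw samples to a fixed
    embedding (here just mean, variance, and a bias feature)."""
    n = len(audio)
    mean = sum(audio) / n
    var = sum((x - mean) ** 2 for x in audio) / n
    return [mean, var, 1.0]

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme logits.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train_head(examples, epochs=100, lr=0.1):
    """Fit a logistic-regression head on (audio, label) pairs while the
    encoder stays frozen -- the transfer-learning step."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for audio, label in examples:
            x = encoder(audio)
            g = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - label
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def predict(w, audio):
    z = sum(wi * xi for wi, xi in zip(w, encoder(audio)))
    return 1 if sigmoid(z) > 0.5 else 0

# Toy labeled set: "voicemail" clips have higher energy than "human" ones.
rng = random.Random(0)
voicemail = [[rng.gauss(0, 2.0) for _ in range(100)] for _ in range(20)]
human = [[rng.gauss(0, 0.5) for _ in range(100)] for _ in range(20)]
w = train_head([(a, 1) for a in voicemail] + [(a, 0) for a in human])
```

The key point the sketch captures: because the encoder is shared and frozen, each new language only needs enough labels to fit the small head.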

Language-Specific Fine-Tuning

While the base representation is universal, voicemail detection requires language-specific knowledge. We fine-tune on labeled data for each language, with a minimum of:

  • 10,000 labeled voicemail recordings
  • 10,000 labeled human answer recordings
  • Coverage of major regional accents

For languages with less available data, we use data augmentation and synthetic data generation.
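Waveform-level augmentation can be sketched with a few simple transforms: gain scaling, additive noise, and naive speed perturbation. The function names are illustrative; production pipelines typically use richer augmentations (SpecAugment, room impulse responses, codec simulation).

```python
import random

def with_gain(samples, factor):
    """Scale amplitude, simulating louder or quieter recordings."""
    return [s * factor for s in samples]

def with_noise(samples, noise_scale, rng):
    """Add light Gaussian noise, simulating line/background noise."""
    return [s + rng.gauss(0, noise_scale) for s in samples]

def speed_perturb(samples, rate):
    """Naive resampling by index scaling (rate > 1 speeds up)."""
    n = int(len(samples) / rate)
    return [samples[min(int(i * rate), len(samples) - 1)] for i in range(n)]

def augment(samples, rng):
    """Produce several labeled variants from one real recording."""
    return [
        with_gain(samples, rng.uniform(0.7, 1.3)),
        with_noise(samples, 0.01, rng),
        speed_perturb(samples, rng.uniform(0.9, 1.1)),
    ]

rng = random.Random(42)
clip = [0.01 * i for i in range(100)]
variants = augment(clip, rng)
```

Each real recording yields several training variants, which matters most exactly where labeled data is scarce.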

Accent Adaptation

Within each language, we account for accent variation through:

Accent Embeddings: Similar to speaker embeddings, we learn a representation of accent that helps the model adapt.

Regional Models: For high-volume languages (English, Spanish, Mandarin), we train regional variants:

  • **English**: US, UK, Australian, Indian, South African
  • **Spanish**: Mexican, Castilian, Argentine, Caribbean
  • **Mandarin**: Standard, Taiwanese, Singaporean
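Serving regional variants implies a routing step: prefer the regional model when one exists, fall back to the base language model, and finally to a multilingual default. A minimal sketch, with an entirely hypothetical model registry:

```python
# Hypothetical registry of deployed models; names are illustrative only.
REGISTRY = {
    "en": "en-base", "en-US": "en-us-v3", "en-GB": "en-gb-v3",
    "en-IN": "en-in-v3", "es": "es-base", "es-MX": "es-mx-v3",
}

def pick_model(locale):
    """Prefer the regional variant; fall back to the base language
    model, then to a shared multilingual model."""
    if locale in REGISTRY:
        return REGISTRY[locale]
    base = locale.split("-")[0]
    return REGISTRY.get(base, "multilingual-base")
```

The fallback chain means a new region gets reasonable coverage from day one, before a dedicated regional model exists.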

Handling Code-Switching

Many speakers naturally switch between languages. Our approach:

  1. Detect language switches in the audio stream
  2. Apply the appropriate language model for each segment
  3. Combine predictions across segments

For common code-switching pairs (English-Spanish, Hindi-English), we train dedicated models on mixed-language data.
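The three-step pipeline can be sketched over a word stream, with a stubbed language identifier and stubbed per-language scorers (the toy lexicon and keyword cues are hypothetical stand-ins for acoustic models):

```python
def detect_segments(words):
    """Step 1: split the stream into same-language runs.
    Stub: tag each word via a toy lexicon; unknowns keep the current language."""
    lexicon = {"hola": "es", "deja": "es", "un": "es", "mensaje": "es",
               "please": "en", "leave": "en", "a": "en", "message": "en"}
    segments, cur_lang, cur = [], None, []
    for w in words:
        lang = lexicon.get(w, cur_lang or "en")
        if lang != cur_lang and cur:
            segments.append((cur_lang, cur))
            cur = []
        cur_lang = lang
        cur.append(w)
    if cur:
        segments.append((cur_lang, cur))
    return segments

def score_segment(lang, words):
    """Step 2: per-language voicemail score. Stub: keyword spotting."""
    cues = {"en": {"leave", "message"}, "es": {"deja", "mensaje"}}
    hits = sum(1 for w in words if w in cues.get(lang, set()))
    return hits / max(len(words), 1)

def voicemail_score(words):
    """Step 3: length-weighted combination across segments."""
    segs = detect_segments(words)
    total = sum(len(ws) for _, ws in segs)
    return sum(len(ws) * score_segment(l, ws) for l, ws in segs) / total
```

Each segment is scored by the model matching its language, so a Spanish opener followed by an English prompt both contribute evidence.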

Data Collection

High-quality training data is the foundation of multilingual support.

Partnership Program

We partner with call centers in 30+ countries to collect labeled voicemail recordings. Partners receive:

  • Free VM Hunter access during the data collection period
  • Revenue share for high-quality contributions
  • Early access to new language support

Annotation Process

Each recording goes through:

  1. **Automatic pre-labeling**: Our existing models provide initial labels
  2. **Human review**: Native speakers verify and correct labels
  3. **Quality assurance**: A separate team audits a random sample
  4. **Dispute resolution**: Disagreements are resolved by senior linguists

We maintain a network of 500+ annotators covering all supported languages.

Synthetic Data Generation

For rare languages, we augment real data with synthetic voicemails:

  1. **Text-to-Speech**: Generate greetings using neural TTS systems
  2. **Voice Conversion**: Transform English voicemails into other languages while preserving acoustic patterns
  3. **Template Combination**: Mix and match greeting components

Synthetic data helps bootstrap models for new languages, though real data always produces better results.
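Template combination is the simplest of the three to illustrate: take per-language greeting components and enumerate their combinations to produce text for TTS. The German phrases below are illustrative examples, not our actual templates.

```python
import itertools

# Hypothetical greeting components for German; real template banks
# would cover many openings, statuses, and prompts per language.
COMPONENTS = {
    "de": {
        "opening": ["Hallo, hier ist Anna.", "Guten Tag."],
        "status": ["Ich bin gerade nicht erreichbar."],
        "prompt": ["Bitte hinterlassen Sie eine Nachricht.",
                   "Ich rufe Sie nach 17 Uhr zurück."],
    },
}

def synth_greetings(lang):
    """Enumerate every opening/status/prompt combination as TTS input."""
    parts = COMPONENTS[lang]
    return [" ".join(combo) for combo in
            itertools.product(parts["opening"], parts["status"], parts["prompt"])]
```

A few dozen components per slot already yield thousands of distinct greetings, which is why this works well for bootstrapping.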

Evaluation Methodology

Measuring multilingual performance requires careful methodology.

Per-Language Metrics

We track accuracy, precision, recall, and F1 score for each language independently. Our release criteria:

  • Minimum 95% accuracy on held-out test set
  • Balanced performance across voicemail and human classes
  • Coverage of major regional accents
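The per-language release gate can be made concrete: compute the four metrics from a labeled test set and check the thresholds. The 95% accuracy floor is stated above; the "balanced performance" check below (per-class recall within 3 points) is an assumed formalization for illustration.

```python
def metrics(y_true, y_pred):
    """Binary metrics for voicemail (1) vs. human (0) predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    human_recall = tn / (tn + fp) if tn + fp else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1,
            # Assumed balance criterion: per-class recall within 3 points.
            "balanced": abs(recall - human_recall) <= 0.03}

def release_ok(m):
    """Gate a language release on the stated accuracy floor plus balance."""
    return m["accuracy"] >= 0.95 and m["balanced"]
```

Gating on balance as well as accuracy prevents shipping a model that hits 95% by over-predicting one class.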

Accent Fairness

We specifically test for accent bias:

  • Models must achieve within 2% accuracy across all major accents
  • No systematic errors for specific demographic groups
  • Regular audits by external fairness researchers
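The 2% accent criterion amounts to a max-minus-min accuracy spread over per-accent groups. A minimal sketch of that audit, assuming test results tagged with an accent label (the grouping helper is illustrative):

```python
def accent_gap(results):
    """results: list of (accent, correct) pairs from the test set.
    Returns the max-min accuracy spread and per-accent accuracies."""
    by_accent = {}
    for accent, correct in results:
        by_accent.setdefault(accent, []).append(correct)
    accs = {a: sum(v) / len(v) for a, v in by_accent.items()}
    return max(accs.values()) - min(accs.values()), accs

def fairness_ok(results, budget=0.02):
    """Enforce the 2-point accuracy budget across accents."""
    gap, _ = accent_gap(results)
    return gap <= budget
```

Tracking the full per-accent breakdown, not just the gap, makes it easy to see which accent needs more training data when the check fails.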

Real-World Validation

Lab metrics don't always reflect production performance. We validate with:

  • Beta testing with native-speaking customers
  • A/B testing against our previous models
  • Continuous monitoring after launch

Results and Lessons Learned

After three years of multilingual development, here's what we've learned:

What Worked

  • **Transfer learning**: Pre-training on unlabeled audio dramatically reduced data requirements
  • **Native annotators**: Quality improved 20% when using native speakers vs. non-native
  • **Regional models**: Dedicated models for major accents outperformed one-size-fits-all

What Didn't Work

  • **Machine translation of greetings**: Synthetic greetings generated by translating English sounded unnatural
  • **Accent normalization**: Trying to map all accents to a "standard" hurt accuracy
  • **Rushed launches**: Early releases without adequate testing damaged customer trust

Surprising Findings

  • Some languages (Japanese, Korean) had inherently higher voicemail rates due to cultural norms
  • Code-switching was more common than expected, even in "monolingual" regions
  • Carrier-specific voicemail systems varied more than we anticipated

Future Directions

Our multilingual roadmap includes:

  1. **20 new languages by end of 2026**: Focus on African and Southeast Asian languages
  2. **Real-time language identification**: Automatically detect and adapt to the speaker's language
  3. **Dialect-level support**: Move beyond country-level to regional dialect support
  4. **Low-resource language toolkit**: Enable customers to add their own languages

Language is fundamental to human communication. We're committed to making VM Hunter work for everyone, regardless of which language they speak.