Enterprise Speech-to-Text API for English, Cantonese & Mandarin
Speech to Text model
Fano is the world-leading multilingual speech-to-text model designed for conversation, not just transcription. With built-in turn detection, ultra-low latency, and natural interruption handling, Fano enables real-time, human-like voice agents.
- Unmatched accuracy for multilingual conversations
- Ultra-low latency for real-time applications
- Seamless integration with voice agents
One API for the Real World of Mixed Speech
No more juggling models for Hong Kong’s mixed Cantonese, English, and Mandarin conversations, or for any other multilingual region. Just send audio and get an accurate transcript back.
Built for Versatility
Switch context to see how our API adapts to different workflows.
Customer service & compliance
Analyze 100% of customer calls. Our API accurately separates speakers and handles noisy audio environments typical of call centers.
- Asynchronous batch processing
- High accuracy in challenging audio conditions
- Multi-speaker diarization
Meeting intelligence
No more language barriers. Our API perfectly captures code-switching between Cantonese, English, and Mandarin, ensuring every detail is recorded accurately without manual language selection.
- Accurately identify and label 10+ different speakers
- Understands context across language switches
- Real-time transcription
Voice agents and applications
Power voicebot experiences with Fano’s Speech API, designed for fast, accurate speech recognition across multilingual and mixed-language conversations. From customer service to appointment booking and support workflows, our API helps developers build voicebots that respond naturally without forcing users to change how they speak.
- Low-latency speech recognition for live voice workflows
- Built for multilingual and mixed-language conversations
- Strong performance on contact center and phone-quality audio
Speaker Diarization
Distinguish between speakers in a single audio stream.
Auto Punctuation
Adds punctuation and casing for readable transcripts.
Timestamps
Precise start/end times for every segment recognized.
Custom Vocabulary
Boost accuracy for product names and jargon.
Multilingual Support
Auto mixed-language detection with seamless code-switching capabilities.
Format Support
Supports WAV, MP3, FLAC, OGG, and AAC files, plus telephony audio.
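To make the diarization and timestamp features concrete, here is a minimal sketch of how a client might turn labeled, timed segments into a readable transcript. The `Segment` shape (speaker label, start and end in seconds, text) and the helper names are assumptions made for illustration; they are not taken from Fano's API reference, and the actual response format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One recognized segment. Field names are hypothetical, not Fano's schema."""
    speaker: str   # diarization label, e.g. "S1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def merge_turns(segments: List[Segment]) -> List[Segment]:
    """Join consecutive segments from the same speaker into one turn."""
    turns: List[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the previous turn instead of starting a new one.
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


def render_transcript(segments: List[Segment]) -> str:
    """Produce one line per speaker turn with start and end times."""
    return "\n".join(
        f"[{format_timestamp(t.start)} - {format_timestamp(t.end)}] "
        f"{t.speaker}: {t.text}"
        for t in merge_turns(segments)
    )


if __name__ == "__main__":
    sample = [
        Segment("S1", 0.0, 1.2, "Hello,"),
        Segment("S1", 1.2, 2.5, "how can I help?"),
        Segment("S2", 2.8, 4.0, "我想查詢帳單。"),
    ]
    print(render_transcript(sample))
```

Merging consecutive same-speaker segments keeps the transcript readable when the recognizer emits several short segments within a single turn.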
Frequently Asked Questions
We specialize in English, Cantonese, and Mandarin, and also support 10+ ASEAN languages. Our model is designed to handle mixed-language speech (code-switching) within a single audio stream, with no language-switching hints required.
Yes, for enterprise customers with strict data sovereignty or security requirements, we offer on-premise deployment options.
Our streaming API typically achieves latency under 300 ms, making it suitable for live voice assistants and real-time captioning.
Yes, you can upload custom vocabulary lists via the API to improve recognition of product names, acronyms, and industry-specific jargon.
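As an illustration of preparing a custom vocabulary list before upload, the sketch below normalizes whitespace, drops empty entries and case-insensitive duplicates, and serializes the result to JSON. The payload fields (`name`, `phrases`) and the function name are hypothetical; the real upload format is defined by Fano's API documentation, which this page does not reproduce.

```python
import json
from typing import Iterable


def build_vocabulary_payload(terms: Iterable[str], name: str = "default") -> str:
    """Clean a list of terms and serialize it as a JSON payload.

    The "name" and "phrases" keys are assumptions for illustration only.
    """
    seen = set()
    cleaned = []
    for term in terms:
        # Collapse runs of whitespace and trim both ends.
        phrase = " ".join(term.split())
        key = phrase.casefold()
        # Skip empty entries and case-insensitive duplicates.
        if phrase and key not in seen:
            seen.add(key)
            cleaned.append(phrase)
    # ensure_ascii=False keeps Chinese terms readable in the output.
    return json.dumps({"name": name, "phrases": cleaned}, ensure_ascii=False)


if __name__ == "__main__":
    print(build_vocabulary_payload(["KYC", "kyc", "  eStatement ", "", "八達通"]))
```

Deduplicating on a case-folded key while keeping the first spelling preserves the casing you want to appear in transcripts.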
Contact Us
Try It Free. Scale When You’re Ready.
Get started without limits. Explore all features at your own pace — upgrade only when your business grows.
