Speechmatics Review: The Speech-to-Text API Builders Are Reaching For in 2026

Speechmatics traces its roots to speech recognition research from the 1980s. Here is an honest look at what it actually does, who it is built for, how to start for free, and how it stacks up against the other speech-to-text APIs developers reach for first.


Most people discover Speechmatics the same way: they need speech-to-text that actually works across accents, and Whisper alone is not cutting it anymore. Or they are building a voice agent and they want something fast enough to feel like a real conversation, not a walkie-talkie.

The company has been at this longer than most. Dr. Tony Robinson, the founder, started applying neural networks to speech recognition at Cambridge University in the 1980s. That is not a footnote. That is decades of compounding domain knowledge that shows up in the accuracy numbers today.

This post covers what Speechmatics actually is, what makes it different from Deepgram and AssemblyAI, where the pricing lands, how to get started in under ten minutes, and the honest limitations worth knowing before you commit.


What Speechmatics Actually Is

Speechmatics is an enterprise-grade speech API. Not a consumer app. Not a transcription service you paste audio into on a website. It is infrastructure you plug into your product or workflow via API, and it handles the hard parts: multiple languages, heavy accents, speaker diarization, real-time transcription, and on-device deployment for privacy-critical environments.

The three core things it does:

  • Speech-to-text (STT): Batch and real-time transcription across 55+ languages, with both a Standard model (faster and cheaper) and an Enhanced model (highest accuracy).
  • Text-to-speech (TTS): Low-latency voice synthesis, optimized for voice agents. English now, with more languages coming.
  • Voice Agent API: A full stack for building conversational voice agents, with native integrations into Pipecat, LiveKit, and Vapi.

It is used by companies like Adobe (for on-device transcription inside Premiere Pro), AI Media (delivering 120x more live captioning throughput than before), and LiveKit (powering more than 100,000 developers building AI agents on their platform).


What Actually Sets It Apart

There are plenty of speech APIs. Here is what makes Speechmatics different in ways that actually matter at the product level.

Accuracy across accents and languages

Most speech models were trained on clean English audio from a narrow demographic. They fall apart the moment you introduce a regional accent, non-native speaker, or language other than English. Speechmatics was built around the goal of understanding every voice. Its 55+ language coverage reaches over four billion people. According to independent Pipecat benchmarks published in April 2026, Speechmatics scored 83.2% perfect transcripts at a 1.07% pooled word error rate on voice agent test sets, the best accuracy in the benchmark. Deepgram was at 1.62%. AssemblyAI at 3.02%.

The trade-off: Speechmatics’ median Time to Final Segment (TTFS) in the same benchmark was 495ms, compared to Deepgram’s 247ms. That is the honest version of the story. It is more accurate and roughly 250ms slower to return a final transcript. Whether that matters for your use case depends on whether you would rather wait the extra 250ms or have users repeat themselves.
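To make that trade-off concrete, here is a back-of-the-envelope sketch that uses only the benchmark figures quoted above. It assumes errors scale linearly with the number of words spoken, which is a simplification, and the function names are illustrative.

```python
# Benchmark figures quoted above: pooled WER in percent, median TTFS in ms.
BENCHMARK = {
    "Speechmatics": {"wer_pct": 1.07, "ttfs_ms": 495},
    "Deepgram": {"wer_pct": 1.62, "ttfs_ms": 247},
}


def expected_errors(wer_pct: float, words: int) -> float:
    """Expected word errors in `words` words, assuming errors scale linearly."""
    return wer_pct / 100 * words


def compare(slower: str, faster: str, words: int = 1000) -> dict:
    """Extra latency of `slower` over `faster`, and errors it avoids per `words` words."""
    a, b = BENCHMARK[slower], BENCHMARK[faster]
    saved = expected_errors(b["wer_pct"], words) - expected_errors(a["wer_pct"], words)
    return {
        "extra_latency_ms": a["ttfs_ms"] - b["ttfs_ms"],
        "errors_avoided": round(saved, 2),
    }


print(compare("Speechmatics", "Deepgram"))
```

On these numbers the cost is about a quarter of a second per final segment, and the benefit is a handful of avoided word errors per thousand words. Which side wins depends on how costly a misheard word is in your product.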

Deployment flexibility

You can run Speechmatics in the cloud, on-premise, or on-device. That last one matters a lot in healthcare, legal, and government contexts where audio cannot leave a specific environment. Adobe Premiere uses the on-device version so editors can transcribe footage locally without sending audio to any server.

No data logging by default

Speechmatics does not log your audio data as a default. It is GDPR compliant, HIPAA compliant, SOC 2 Type II certified, and ISO 27001:2022 accredited. For medical, legal, and financial use cases, this is not a nice-to-have. It is the whole reason companies choose it over cheaper alternatives.

Medical model

There is a dedicated medical transcription model that reduces errors on clinical terminology by up to 50% compared to the general model. If you are building ambient scribing or clinical dictation products, this is a meaningful difference.


Pricing: What It Actually Costs

Speechmatics has three tiers. Here is what each one gives you.

Plan | Price | Free Allowance | Best For
Free | $0/month | 2,400 minutes STT + 1M characters TTS | Developers testing the API
Pro | From $0.24/hr (STT) | Same free tier, then pay-as-you-go | Growing products, serious usage
Enterprise | Custom pricing | No rate limits, custom models, on-prem | Scale, privacy-critical environments

A few things worth knowing about the free tier: it gives you 2,400 minutes (40 hours) of speech-to-text per month and 1 million characters of TTS. That is genuinely enough to build and test a real product. No credit card required to start.

Pro tier usage is capped at 6,000 hours per month. Volume discounts kick in automatically above 500 hours per month. For very high-volume users, additional discounts are available from 24,000 hours per year.
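A rough way to sanity-check a monthly bill from the figures above is sketched below. It assumes a flat $0.24 per hour (the "from" price) after the 40 free hours and ignores volume discounts, so treat the result as an upper-bound estimate rather than a quote. The function name is illustrative.

```python
FREE_HOURS_PER_MONTH = 40       # 2,400 free minutes, as stated above
PRO_RATE_USD_PER_HOUR = 0.24    # "from" price; assumed flat in this sketch
PRO_CAP_HOURS_PER_MONTH = 6000  # Pro tier monthly usage cap


def estimate_monthly_stt_cost(hours: float) -> float:
    """Estimate the Pro STT cost in USD for one month, before any volume discount."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    if hours > PRO_CAP_HOURS_PER_MONTH:
        raise ValueError("above the Pro cap; this is Enterprise territory")
    billable_hours = max(0.0, hours - FREE_HOURS_PER_MONTH)
    return round(billable_hours * PRO_RATE_USD_PER_HOUR, 2)


for hours in (40, 100, 540):
    print(hours, "hours ->", estimate_monthly_stt_cost(hours), "USD")
```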

There is also a Startup Program offering up to $50,000 in credits for qualifying early-stage companies, plus full API access, all 55+ languages, and dedicated technical onboarding.

The Standard and Enhanced model distinction matters for cost control. Standard is faster and cheaper. Enhanced delivers the highest accuracy. You can mix them across different jobs based on what each task requires.
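One way to mix models per job is to build the job config from a small helper. This is a sketch: the `operating_point` field name and its `standard` / `enhanced` values are an assumption that should be checked against the current API reference, and `build_job_config` is a hypothetical helper name.

```python
import json


def build_job_config(language: str = "en", enhanced: bool = False) -> str:
    """Return a batch job config as a JSON string, choosing the model per job."""
    config = {
        "type": "transcription",
        "transcription_config": {
            "language": language,
            # Assumed field name and values; verify against the API reference.
            "operating_point": "enhanced" if enhanced else "standard",
        },
    }
    return json.dumps(config)


# Cheaper model for bulk archive audio, higher accuracy for customer calls.
print(build_job_config("en", enhanced=False))
print(build_job_config("en", enhanced=True))
```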


How to Get Started: Step by Step

Here is exactly how to go from zero to your first transcription in about ten minutes.

  1. Create a free account. Go to portal.speechmatics.com/signup. No credit card needed. You get 40 hours of STT and 1 million TTS characters to work with immediately.
  2. Grab your API key. Once logged in, navigate to your account settings. Your API key is there. Copy it. This is what authenticates every request you make to the API.
  3. Read the docs. The Speechmatics documentation is well-structured. Start with the Batch transcription quickstart if you have pre-recorded audio, or the Real-time quickstart if you are building something live.
  4. Run your first batch transcription. The simplest call sends an audio file to the batch API and gets a JSON transcript back. Here is what that looks like in Python:
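import json
import requests

url = "https://asr.api.speechmatics.com/v2/jobs/"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
# The job config is sent as a JSON string in the "config" form field.
config = {"type": "transcription", "transcription_config": {"language": "en"}}

with open("audio.mp3", "rb") as f:
    # Send the request inside the with-block so the file is still open.
    response = requests.post(
        url,
        headers=headers,
        files={"data_file": f},
        data={"config": json.dumps(config)},
    )

response.raise_for_status()  # fail loudly on auth or validation errors
print(response.json())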

That gets you back a job ID. You then poll the job endpoint to retrieve the completed transcript when it is ready.
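The polling step can be sketched as a small loop with a timeout. The status check is passed in as a function so the loop can be tested without a network call. The set of terminal states and the response shape mentioned in the comments are assumptions to confirm against the API reference.

```python
import time
from typing import Callable

# Assumed terminal states; confirm the exact status values in the API reference.
TERMINAL_STATES = {"done", "rejected", "deleted", "expired"}


def wait_for_job(
    get_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it returns a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s}s")
        sleep(interval_s)


# Offline demo with a fake status source. In real use, get_status would wrap a
# GET on the job URL and read the status field (response shape is an assumption).
fake_statuses = iter(["running", "running", "done"])
print(wait_for_job(lambda: next(fake_statuses), sleep=lambda s: None))
```

Once the status is terminal and successful, the transcript itself is fetched with a separate request, as described in the Batch quickstart.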

  5. Try real-time transcription. For live audio, Speechmatics uses a WebSocket connection. The docs have ready-to-run examples in Python and JavaScript. Real-time sessions are limited to 2 concurrent connections on the free tier and 50 on Pro.
  6. Build a voice agent. Plug into one of the native integrations: Vapi (no-code, fastest to deploy), LiveKit (WebRTC-based, good for engineers), or Pipecat (open-source, full control of the pipeline). Speechmatics maintains a public Academy repo on GitHub with example Pipecat voice bots you can clone and run.
  7. Monitor your usage. Check your account dashboard to see how many minutes and characters you have consumed. Set spending controls before running anything at volume on Pro.
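Because real-time sessions are capped at 2 concurrent connections on the free tier and 50 on Pro, it can help to enforce that limit on the client side rather than discover it through rejected connections. Below is a minimal sketch using an asyncio semaphore. `SessionLimiter` and `fake_session` are hypothetical names, and the fake session stands in for a real WebSocket session.

```python
import asyncio

# Concurrent real-time session limits stated above.
MAX_SESSIONS = {"free": 2, "pro": 50}


class SessionLimiter:
    """Caps how many real-time sessions run at once for a given plan."""

    def __init__(self, plan: str) -> None:
        self._sem = asyncio.Semaphore(MAX_SESSIONS[plan])
        self.active = 0
        self.peak = 0  # highest number of simultaneous sessions observed

    async def run(self, session_factory):
        """Wait for a free slot, then run one session to completion."""
        async with self._sem:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await session_factory()
            finally:
                self.active -= 1


async def fake_session() -> str:
    """Stand-in for a real WebSocket transcription session."""
    await asyncio.sleep(0.01)
    return "ok"


async def demo():
    limiter = SessionLimiter("free")
    results = await asyncio.gather(*(limiter.run(fake_session) for _ in range(5)))
    return limiter.peak, results


print(asyncio.run(demo()))
```

Five sessions are requested, but no more than two ever run at the same time; the rest wait their turn.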

Real-World Use Cases (With Specifics)

The use cases below are documented from Speechmatics’ own published case studies and integrations, not marketing claims.

Live captioning at scale. AI Media used Speechmatics to deliver 120x more captioning volume than their previous setup. They cover live events, sports broadcasts, and news in real time. The accuracy requirements for live captioning are unforgiving: there is no second chance to fix a wrong word before it appears on screen.

Video editing with on-device speech recognition. Adobe integrated Speechmatics’ on-device model into Premiere Pro. The goal was cloud-grade accuracy running locally on a laptop, without sending footage to any external server. The engineering challenge was fitting the model into a form factor efficient enough for battery-powered devices. They did it.

Voice agents for contact centers. NCI, a captioning service, used Speechmatics and saw a 99% increase in the use of automated captioning across their platform. Contact center use cases benefit from the combination of accuracy and diarization: knowing not just what was said, but who said it.
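To show what "who said it" looks like in practice, here is a sketch that groups a diarized transcript into speaker turns. The sample data is made up, and its shape (a list of word results, each with an `alternatives` entry carrying `content` and `speaker`) is an assumption about the JSON transcript format that should be checked against the docs.

```python
from itertools import groupby

# Hypothetical sample; the result shape is an assumption, not the documented schema.
sample_results = [
    {"type": "word", "alternatives": [{"content": "Hello", "speaker": "S1"}]},
    {"type": "word", "alternatives": [{"content": "there", "speaker": "S1"}]},
    {"type": "word", "alternatives": [{"content": "Hi", "speaker": "S2"}]},
    {"type": "word", "alternatives": [{"content": "Thanks", "speaker": "S1"}]},
]


def speaker_turns(results):
    """Collapse consecutive words from the same speaker into (speaker, text) turns."""
    words = [
        (r["alternatives"][0].get("speaker", "unknown"), r["alternatives"][0]["content"])
        for r in results
        if r.get("type") == "word" and r.get("alternatives")
    ]
    return [
        (speaker, " ".join(word for _, word in group))
        for speaker, group in groupby(words, key=lambda pair: pair[0])
    ]


for speaker, text in speaker_turns(sample_results):
    print(f"{speaker}: {text}")
```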

Meeting platforms and note-taking. Any product that transcribes meetings across multiple speakers and languages is a natural fit. The 55+ language support means a single API integration handles a globally distributed team without having to stitch together different models per region.

Medical scribing. The Medical Model cuts errors on clinical terminology by up to 50%. For ambient scribing applications where doctors speak naturally and the system converts that into structured notes, accuracy on drug names, procedures, and anatomy is the whole product.

Legal transcription. Court reporters and law firms need verbatim accuracy and speaker identification. Speechmatics’ approach to handling multiple simultaneous speakers makes it viable for deposition and courtroom transcription.


How It Compares to Deepgram and AssemblyAI

These three names come up together constantly. Here is the honest version of how they sit relative to each other.

Feature | Speechmatics | Deepgram | AssemblyAI
Accuracy (Pipecat benchmark WER) | 1.07% (best) | 1.62% | 3.02%
TTFS latency (median) | 495ms | 247ms | 256ms
Languages | 55+ | 30+ | 20+
On-device deployment | Yes | No | No
Free tier | 40 hrs/month | Paid plans from $0.0059/min | Free tier available
Medical model | Yes | Yes (Nova Medical) | No
HIPAA compliant | Yes | Yes | Yes
Startup grants | Up to $50k in credits | Via partnerships | Via partnerships

Here is the thing: Deepgram is faster. If you are building a real-time voice agent where every 100ms of latency changes the conversation feel, Deepgram’s speed advantage is real. Speechmatics will tell you accuracy matters more than speed, and they make a compelling argument. But if speed-to-first-response is your primary constraint, that trade-off is worth knowing upfront.

AssemblyAI has a different product philosophy. It invests more in LLM-powered features sitting on top of transcription: auto chapters, sentiment analysis, entity detection. If you want packaged intelligence rather than raw accuracy, AssemblyAI has more of that out of the box. Speechmatics is more infrastructure-first.


Limitations Worth Knowing

No product review is worth reading if it skips this part.

TTS is English-only for now. The speech-to-text covers 55+ languages. Text-to-speech currently supports English only, with more languages listed as coming soon. If you are building a multilingual voice agent that also speaks back to users, you will need to supplement TTS with another provider for non-English outputs.

Real-time latency is higher than Deepgram. At 495ms median TTFS versus Deepgram’s 247ms, Speechmatics is noticeably slower at returning final transcripts in voice agent contexts. It scores better on accuracy, but the speed gap is documented and real. For voice agents where perceived responsiveness is a key product metric, this matters.

It is an API, not a finished product. You are not getting a UI-based transcription tool or a ready-to-use meeting recorder. Speechmatics is infrastructure. If you want something you can hand to a non-technical user without any setup, this is not it. You will need developers to integrate the API into your product.

Free tier is for developers, not production. The 2 concurrent real-time sessions on the free plan work fine for testing. They are not production capacity. Pro gives you 50 concurrent sessions, which covers most early-stage products.

Custom model training requires enterprise. Domain-specific fine-tuning and custom model development are enterprise-only features. If you have highly specialized vocabulary (specific industrial equipment, niche medical subspecialty, proprietary product names) and want the model trained on your data, that conversation happens at the enterprise tier.


Who Should Actually Use This

Skip Speechmatics if you are looking for a plug-and-play meeting recorder app or a consumer transcription tool. That is not what it is.

Use it if you are building any of the following:

  • A voice agent that needs to handle multiple languages and accents without falling over
  • A healthcare or legal product where audio privacy and accuracy on domain-specific terminology are non-negotiable
  • A media or broadcasting product that needs live captioning at scale
  • A meeting intelligence platform that needs speaker-aware transcription across a globally distributed team
  • A product where audio cannot leave the device or on-premise environment
  • Anything where accuracy is the primary selection criterion and you can absorb slightly higher latency

If you are an engineer evaluating STT APIs, the fact that Speechmatics offers 40 hours free per month with no credit card means there is no reason not to run the benchmark yourself before committing to anything.
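If you do run your own benchmark, the core metric is simple to compute. The sketch below calculates a pooled word error rate: total word-level edits (substitutions, insertions, deletions) divided by total reference words across all test clips. It only lowercases and splits on whitespace. Published benchmarks typically apply more careful text normalization, and this is not necessarily how the Pipecat benchmark computes its figure.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def pooled_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors over total reference words for (reference, hypothesis) pairs."""
    errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        errors += word_edit_distance(ref_words, hypothesis.lower().split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return errors / total_words


test_set = [
    ("the cat sat on the mat", "the cat sat on a mat"),
    ("hello world", "hello world"),
]
print(f"pooled WER: {pooled_wer(test_set):.2%}")
```

Pooling weights every word equally, so a few long clips cannot hide behind many short perfect ones. Run the same reference set through each provider and compare the numbers on your own audio.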


The Bigger Picture: Why Speech Accuracy Is Finally a Competitive Moat

Voice agents were a party trick two years ago. They are becoming product infrastructure now. Every major AI platform is shipping a voice layer: ChatGPT’s voice mode, Claude’s voice features, Gemini Live. The interfaces people use for AI are shifting from typing to talking.

What this means practically: the accuracy gap between speech providers is about to matter a lot more than it did when voice was an add-on feature. If your voice agent mishears one in every ten words, your users will use the text interface instead. The product fails at the voice layer, and the underlying LLM quality is irrelevant.

Speechmatics’ entire positioning, captured in their phrase “speed you can trust,” is a bet that enterprise builders will eventually choose accuracy over raw latency once their products are live and they see what repeated misrecognitions do to user retention. Based on the Pipecat benchmark data and the customer list they have built, including Adobe, AI Media, and LiveKit, that bet looks reasonable.

The company was named one of Fast Company’s Most Innovative Companies in AI in 2023, and ranked on the FT 1000 list of Europe’s Fastest Growing Companies five years running through 2023. It is not a startup finding its footing. It has been doing this long enough to have the receipts.


If you are building anything with voice and you have not yet stress-tested your STT accuracy across accents and noise conditions, that is the test worth running before you ship. The free tier is a free benchmark.


Sources and further reading

  1. Speechmatics official website
  2. Speechmatics Pricing
  3. About Speechmatics
  4. Speed you can trust: The STT metrics that matter for voice agents, Speechmatics blog (April 2026)
  5. Adobe and Speechmatics deliver cloud-grade speech recognition on-device for Premiere, Speechmatics (2026)
  6. AI Media Case Study: Delivering 120X more with voice AI, Speechmatics
  7. NCI Case Study: Redefining real-time captioning, Speechmatics
  8. Pipecat STT Benchmark results, GitHub
  9. LiveKit and Speechmatics: Enabling 100,000+ developers with leading speech recognition, Speechmatics
  10. Speechmatics Startup Program
