Transcript: Building Production Audio AI with Agents, Automated Transcription & Diarization

Diarmuid (Co-Host, Lead Customer Support Engineer, Encord):
Hello, everyone. Welcome to this Behind the Bounding Box webinar. Today we're going to speak about building production audio AI.
I am your co-host, Diarmuid, the lead customer support engineer here at Encord, and I'm flanked by Merrick, who will introduce himself.
Merrick (Senior Software Engineer, Encord):
Hi, y'all. I'm Merrick. I'm a senior software engineer at Encord. I joined about four months ago, and I'm going to be working on the Agents Catalog, which will enhance the automation tools that you might see today.
Diarmuid:
Yes, something we'll most likely cover in a future webinar. The work he's doing is very interesting—basically “bring your own keys” and have Encord provide models ready to use out of the box, enabling many workflows. What we're demoing today will also likely be incorporated into that at some stage. I can speak more about the agenda.
Just to give you a brief overview, we'll start by talking about what makes audio a unique and interesting format and the challenges it brings. Then we'll address how Encord tackles this with our architecture. I’ll show some of the automation we have today, and finally, we'll take some questions at the end.
What is Encord?
Encord helps in three major ways: curation, annotation, and active pipelines.
Curation: Most audio teams have thousands, hundreds of thousands, maybe millions of audio files. You need to know which ones you actually want to train your model on. Using the same text, voice, or accent repeatedly won’t improve your model. Diverse data will. Encord Index lets you visualize your data and add custom metadata, like tagging different files with languages or other attributes.
Annotation: This is what I will demo later—adding audio labels, timestamps, transcription, and attributes like accent, cadence, and emotion.
Active Pipelines: This allows you to bring in predictions from the model, compare them to ground truth, identify mismatches, and bring edge cases back into annotation. This helps fine-tune models and improve them continuously.
Why Audio is Unique and Challenging
Most audio snippets are long-form, noisy, and multi-speaker by default. Conversations often include background noise, interruptions, and overlapping speech. It can be surprisingly challenging to define the boundaries of what audio we want to capture and how to label it.
Labeling costs quickly outpace access to data. The bottleneck becomes how much we can label and how well, and making sure that mistakes made early on don't grow into larger problems.
Encord ingests raw audio, which can be call recordings, meetings, interviews, or even environmental sounds like birdsong. We use waveform labeling over text to highlight the temporal nature of audio—so we can identify exactly what is being said at a certain frame. Our ontology ensures that speech segments, utterances, and background noise are correctly labeled.
Speaker diarization allows us to identify who is speaking and when, which is critical for multi-speaker recordings.
Demo: Automation in Action
Diarmuid:
This is the Encord UI with a standard project.
Here I have these different audio tasks, and you can see it's just the pure waveform: no transcription, no labels. I can change the playback speed, mute it, and play certain sections on repeat. You can see that this is a 5-minute audio task.
What we see often when working with audio teams is that manual annotation is roughly a 10x task: if an annotator starts with no transcription at all, a 5-minute audio file like this one takes at least 50 minutes to diarize and transcribe, and it can go up to 15 or 20x for a really difficult file. Obviously time is money, so we want to be able to do this much quicker.
So what can we do? In the Encord documentation, we have a great example: I'm currently in the Encord docs, under Agents, Task Agents, Task Agent Examples. Going back to the project, we can open the workflow, and you can see I have this agent stage for diarization. This agent is going to do both diarization and transcription.
If we're very confident in the output, we send it straight to review, so the annotators just need to approve it and make sure everything looks good. If confidence is low, it goes to the annotate stage, where it's flagged as something we should spend more time on. This is very important.
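For readers following along in code, here is a minimal sketch of what such a routing agent stage could look like, assuming the encord-agents task runner interface. The stage name, pathway names ("review", "annotate"), the threshold, and the confidence helper are placeholders; your workflow's names and scoring logic will differ, and the actual transcription step is covered in the notebook discussed below.

```python
# Sketch of a workflow task agent that routes tasks by model confidence.
# Assumes the encord-agents task runner; stage and pathway names below are
# placeholders that must match the names defined in your Encord workflow.
from encord.objects import LabelRowV2
from encord_agents.tasks import Runner

runner = Runner(project_hash="<your-project-hash>")

CONFIDENCE_THRESHOLD = 0.85  # illustrative value, tune for your data


def transcribe_with_confidence(lr: LabelRowV2) -> float:
    """Hypothetical helper: run diarization + transcription on the task's
    audio (see the notebook sketch below) and return an aggregate score."""
    raise NotImplementedError


@runner.stage("Diarization")  # the agent stage shown in the workflow
def diarize_and_route(lr: LabelRowV2) -> str:
    confidence = transcribe_with_confidence(lr)
    # High-confidence output goes straight to review; anything else is
    # flagged for an annotator to spend more time on.
    return "review" if confidence >= CONFIDENCE_THRESHOLD else "annotate"


if __name__ == "__main__":
    runner.run()
```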
So again, if we go back to the docs, we have this audio transcription example, and once I click on that, I'm brought to a Colab notebook that you can copy, adapt, and use.
It walks you through the installation steps. We're going to use Hugging Face.
This particular example uses Pyannote as well as Whisper: the first for diarization, the second for the actual transcription.
Then we use the Encord SDK to pass in the audio file, run it through those models, save the results, and return the labels.
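As a rough illustration of what the notebook does, here is a minimal sketch that runs Pyannote for diarization and Whisper for transcription, then naively matches each transcribed segment to a speaker turn. The audio path, model size, and Hugging Face token are placeholders, and writing the resulting labels back through the Encord SDK is left to the notebook itself.

```python
# Sketch of the diarization + transcription step: pyannote answers "who spoke
# when", Whisper answers "what was said". Paths and tokens are placeholders.
import whisper
from pyannote.audio import Pipeline

AUDIO_PATH = "call_recording.wav"  # placeholder audio file

# Speaker diarization (requires a Hugging Face token with access to the model)
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="<hf-token>"
)
diarization = diarization_pipeline(AUDIO_PATH)

# Transcription with segment-level timestamps
whisper_model = whisper.load_model("base")
transcription = whisper_model.transcribe(AUDIO_PATH)

# Naive merge: assign each transcribed segment to the speaker whose turn
# contains the segment's midpoint. Production pipelines do something smarter.
speaker_turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

for segment in transcription["segments"]:
    midpoint = (segment["start"] + segment["end"]) / 2
    speaker = next(
        (spk for start, end, spk in speaker_turns if start <= midpoint <= end),
        "UNKNOWN",
    )
    print(f"[{segment['start']:.2f}-{segment['end']:.2f}] "
          f"{speaker}: {segment['text'].strip()}")
    # From here, the notebook writes these segments back to Encord as labels
    # on the audio task via the Encord SDK.
```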
Once we do that, we go back to our example, open the queue, and go to review, since that's where we sent it, and we can initiate this task. You can now see that we have labels and the transcription, and we can see everything that's happening in this particular use case: it's a customer service conversation, and you can see the representative as well as the customer.
I can now right-click and go in as close as I want to these timestamps. Depending on the model, and on how tight we need this to be, I can go to edit labels, grab a boundary, and quickly drag it.
You can see I can get as close as I want and make this really precise. I can also listen back: perhaps that's actually the customer service representative stuttering, or he's not really saying anything and is putting the phone down. I can play it on a loop to catch those last few milliseconds and determine exactly where the person stops speaking. So we can be extremely fine-grained.
Likewise, perhaps in this block both of them are speaking over each other. I can have labels that overlap, and we can check that we still get the transcription and that it's representative of the conversation going on. Especially for a lot of these phone conversations, with things going back and forth, the output can be a little imprecise because of that.
Again, I have this button here to show the entire transcription, and I can edit it as I want. Maybe there's an issue: you can see here that "theater" is spelled the American way. Perhaps we want to standardize on one spelling, or switch to the British English form. We can modify this right out of the box.
Improving the System
Diarmuid:
Pipelines improve when corrections change future behavior. Corrections from review feed back into the system, reducing repetitive errors and improving accuracy. The biggest gains often come from workflow improvements rather than new models or more annotators.
Encord supports running models for individual words, including handling stutters and filler words. You can play audio loops alongside the waveform to extract precise speech segments.
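As an aside for technical readers, word-level timestamps are one way to locate stutters and filler words programmatically before tightening segment boundaries. Here is a minimal sketch using Whisper's word_timestamps option; the filler list and file path are illustrative only.

```python
# Sketch: word-level timestamps with Whisper, useful for spotting filler
# words and stutters when trimming segment boundaries. Placeholder inputs.
import whisper

FILLERS = {"um", "uh", "erm", "hmm"}  # illustrative filler-word list

model = whisper.load_model("base")
result = model.transcribe("call_recording.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment.get("words", []):
        token = word["word"].strip().lower().strip(".,?!")
        if token in FILLERS:
            print(f"filler '{token}' at {word['start']:.2f}s-{word['end']:.2f}s")
```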
Agents Catalog Preview
Merrick:
Our team has been working on the Agents Catalog, designed to make automation plug-and-play. Within Encord, you’ll see many agents you can search for—diarization, transcription, and more. Click to trigger an agent without customizing it yourself.
The goal is to make it easy to find valuable automation tools at low cost. Users provide the keys, Encord provides the tooling. This helps teams that aren’t deeply technical get started quickly.
Diarmuid:
This system allows annotation teams to work efficiently while still enabling technical users to optimize workflows.
Closing Remarks
Diarmuid:
That wraps up today’s webinar. Any questions can be sent via email.
Next week, we have a 3D LiDAR Point Cloud demo with our project manager and Oliver Veal, one of our leading LiDAR engineers. It promises to be informative and fun. Thank you for joining, and we hope to see you at the next session.