
7 APIs to get Zoom transcripts: A comprehensive guide

Amanda Zhu

January 18, 2024

Getting a transcript is the first step to analyzing a call with LLMs.


So if you’re building a sales coaching tool, an interview coaching tool, a meeting summarization app, or any sort of app that analyzes calls, you’ll need to figure out how to get the transcript of the call.

Zoom is one of the most popular platforms for video calls, so you might be wondering: “How can I get transcripts from Zoom programmatically? Is there a Zoom transcript API?”


Unfortunately, Zoom does not have a transcription API, but don’t worry — there are many other ways to get a transcript from Zoom programmatically, which we’ll go through in this blog post.


In this post, we’ll also walk through the pros and cons of each method and flag factors you might not have considered, such as which method is best for real-time transcription, which gives the most accurate speaker diarization, and which costs the least.


There are 7 ways to get transcripts from a Zoom call programmatically:

  1. Method 1: Download Transcripts from the Zoom Cloud Recording API
  2. Method 2: Zoom Cloud Recording + Transcription API
  3. Method 3: RTMP live streaming + Transcription API
  4. Method 4: Desktop app to capture system audio + Transcription API
  5. Method 5: Web app to capture microphone + Transcription API
  6. Method 6: Build a Meeting Bot + Transcription API
  7. Method 7: Use Recall.ai

At the bottom of this blog post, we’ve also included a comparison chart of all the different methods to get transcription from Zoom.

Now, let’s dive in.

Method 1: Download Transcripts from the Zoom Cloud Recording API

Zoom Cloud Recording is a feature built natively into Zoom that allows the meeting host to record the call directly into Zoom’s cloud storage. Not only are the audio and video from the meeting recorded, but Zoom Cloud Recording can also produce a transcript of the meeting in VTT format. The transcript file can be retrieved from Zoom’s Get Meeting Recording endpoint.


On the surface this seems very straightforward - just call the endpoint to get transcripts from Zoom, right? Unfortunately, it’s not so simple.
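For reference, the call itself can be sketched roughly as follows. This is a minimal sketch, not Zoom's official client: the helper names `fetch_recordings` and `find_transcript_url` are hypothetical, and the path `/meetings/{meetingId}/recordings` and the `recording_files`, `file_type`, and `download_url` fields reflect our understanding of the response shape, so check them against Zoom's current API reference before relying on them.

```python
import json
import urllib.request
from typing import Optional

ZOOM_API_BASE = "https://api.zoom.us/v2"


def fetch_recordings(meeting_id: str, access_token: str) -> dict:
    """Fetch the recording metadata for one meeting using the user's OAuth token."""
    request = urllib.request.Request(
        f"{ZOOM_API_BASE}/meetings/{meeting_id}/recordings",
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_transcript_url(recordings: dict) -> Optional[str]:
    """Return the download URL of the transcript file, or None if there isn't one.

    Assumption: the transcript is listed with file_type "TRANSCRIPT".
    None usually means transcription is disabled or still processing.
    """
    for recording_file in recordings.get("recording_files", []):
        if recording_file.get("file_type") == "TRANSCRIPT":
            return recording_file.get("download_url")
    return None
```

Returning `None` rather than raising keeps the "no transcript yet" case explicit, which matters because of the processing delay and settings requirements described in the cons.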

Let’s start with the benefits of using this method, and then we’ll dive into what you should watch out for.

Pros

  • Speaker separated transcripts

    The transcript you get is annotated with the speaker’s name. This is especially important if you’re analyzing the transcript with an LLM because knowing who spoke each word allows the LLM to produce much better results.

  • Transcripts are included free of cost to you

    If your users are already on the Pro, Business, or Enterprise tier, transcripts are included at no extra cost.
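To make the speaker annotation concrete, here is a small sketch of turning the VTT file into structured cues. It assumes each cue's text is prefixed with the speaker's name followed by a colon (for example `Amanda Zhu: Hello everyone.`); that prefix format is an assumption about Zoom's output, and the `Cue` type and `parse_vtt` function are hypothetical names. The split on `": "` is a heuristic, so a cue without a speaker prefix that happens to contain a colon would be mislabeled.

```python
import re
from typing import List, NamedTuple, Optional

# Matches a WebVTT cue timing line such as "00:00:00.000 --> 00:00:03.500".
CUE_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")


class Cue(NamedTuple):
    start: str
    end: str
    speaker: Optional[str]
    text: str


def parse_vtt(vtt: str) -> List[Cue]:
    """Parse WebVTT text into cues, splitting off a leading 'Speaker: ' prefix."""
    cues = []
    lines = vtt.splitlines()
    i = 0
    while i < len(lines):
        timing = CUE_TIMING.match(lines[i].strip())
        if not timing:
            i += 1  # header, cue number, or blank line
            continue
        i += 1
        body = []
        # A cue's text runs until the next blank line.
        while i < len(lines) and lines[i].strip():
            body.append(lines[i].strip())
            i += 1
        text = " ".join(body)
        speaker, separator, rest = text.partition(": ")
        if separator:
            cues.append(Cue(timing[1], timing[2], speaker, rest))
        else:
            cues.append(Cue(timing[1], timing[2], None, text))
    return cues
```

Each cue carries one start and end time for the whole segment, which is why this method cannot give you per-word timestamps.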

Cons

  • Only available on paid Zoom plans

    To take advantage of this feature, your users need to be on the Pro, Business or Enterprise tier.

  • Your users must record to Zoom Cloud

    A transcript is only produced if your user records their meeting using Zoom Cloud Recording (not Zoom local recording). If your users aren’t used to recording on Zoom Cloud this will be a behavior change they will need to make.

  • Only the host can record

    Only the host of the meeting can record to Zoom Cloud, so if your user is not the host of the meeting, this method won’t be applicable.

  • Users must turn on transcription in the settings

    The audio transcript feature in Zoom is disabled by default. Your users must enable this setting in Zoom for transcription to be produced. If the toggle is grayed out, they must contact their workspace admin to enable it. If they can’t find the toggle, double-check that they are on a paid Zoom plan. The good news is that your users only need to turn it on once, and it will apply to all future meetings.

  • Long wait time

    Zoom Cloud Recordings typically take about 2 times the recorded duration to process, but can occasionally take up to 24 hours when processing loads are high. For example, the recording of a 30-minute meeting will typically be available about 1 hour after the meeting ends. This means that your analysis will be delayed.

  • English only

    English is the only language supported by Zoom’s transcript feature right now.

  • No customization

    You don’t have any control over the quality of the transcripts, because they’re produced automatically by Zoom. If there is specific vocabulary you want transcribed (e.g. company names, person names, medical terms, etc.), you won’t be able to fine-tune the transcription to your needs.

  • OAuth Integration needed

    To access the Zoom Cloud Recording API, your users need to connect their Zoom accounts via OAuth. This may need to go through the Zoom workspace admin and can cause additional friction during onboarding.

  • Zoom will need to review your app

    Because an OAuth integration is needed, you must build a Zoom app. To use your Zoom app in production, the app will need to go through a Zoom app review process, which takes around 4 weeks.

  • No per-word time stamps

    Zoom’s VTT transcript carries timestamps per caption segment, not per word. If you want to build a UI where clicking a word in the transcript jumps to that moment in the recording, you will need a timestamp for each word. If you don’t need that, this won’t be a problem.

  • No real-time transcripts

    Zoom Cloud transcripts are only available after the meeting is done, so you won’t be able to get the transcription in real-time.

Method 2: Zoom Cloud Recording + Transcription API

Suppose you require a higher quality transcription than the default transcription produced by Zoom Cloud Recording. In that case, another option is to use a transcription API like AWS Transcribe, Google Speech To Text, OpenAI Whisper, or others, to transcribe the video produced by Zoom Cloud Recording.


Concretely, here is how this method would work:

  1. Record the meeting using Zoom Cloud Recording (same as in Method 1).
  2. After the recording is complete, call Zoom’s Get Meeting Recording endpoint and get the download_url of the recording, which will let you download an MP4 of the recording.
  3. Then, pass the MP4 file to the transcription provider of your choice. The transcription provider will give you back the transcript of the file, typically in JSON format.
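The three steps above can be sketched as a small pipeline. This is a sketch under stated assumptions: `find_mp4_url`, `make_downloader`, and `transcribe_cloud_recording` are hypothetical names, the `file_type == "MP4"` check reflects our understanding of Zoom's response, and `transcribe` is a placeholder for whichever provider SDK you choose (its signature here is invented for illustration).

```python
import urllib.request
from typing import Callable, Optional


def find_mp4_url(recordings: dict) -> Optional[str]:
    """Return the download_url of the first MP4 file in the recordings response."""
    for recording_file in recordings.get("recording_files", []):
        if recording_file.get("file_type") == "MP4":
            return recording_file.get("download_url")
    return None


def make_downloader(access_token: str) -> Callable[[str], bytes]:
    """Build a download function that authenticates with the user's OAuth token."""
    def download(url: str) -> bytes:
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {access_token}"}
        )
        with urllib.request.urlopen(request) as response:
            # Reads the whole file into memory; stream to disk for long meetings.
            return response.read()
    return download


def transcribe_cloud_recording(
    recordings: dict,
    download: Callable[[str], bytes],
    transcribe: Callable[[bytes], dict],
) -> Optional[dict]:
    """Download the MP4 and hand it to a provider-specific transcribe function."""
    url = find_mp4_url(recordings)
    if url is None:
        return None  # recording is still processing, or no video was recorded
    return transcribe(download(url))
```

Passing `download` and `transcribe` in as functions keeps the pipeline independent of any one transcription provider and makes it easy to test without network access.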


Now that you understand at a high level how this works, let’s go through the pros and cons.

Pros

  • Higher quality transcription

    By using a third-party transcription provider, you can get higher quality transcription than Zoom Cloud Recording produces by default.

  • Custom vocabulary support

    By using a third-party transcription provider, you can customize the results of the transcription to better suit your specific use case. For example, many transcription providers support “word boost” or “custom vocabulary” to enable accurate transcription of industry-specific terms or company names.

  • Multi-language support

    Most transcription APIs, like Google STT, support multiple languages.

Cons

  • Third-party transcription is required, which is an additional cost

    You’ll need to pay an additional cost to the third-party transcription provider, which can be expensive at scale.

  • No speaker separation

    Because Zoom only provides a mixed audio stream from Cloud Recordings, you’re not able to get the transcript annotated with speaker names.


  • Your users must record to Zoom Cloud

    In this method, a transcript is only produced if your user records their meeting using Zoom Cloud Recording. So if your users aren’t used to recording on Zoom Cloud, this will be a behavior change they will need to make.

  • Only the host can record

    Only the host of the meeting can record to Zoom Cloud, so if your user is not the host of the meeting, this method won’t be applicable.

  • Long wait time

    This method relies on Zoom Cloud Recording, so you will need to wait for the recording to finish processing, which typically takes about twice the meeting’s duration (roughly 2 hours for a 1-hour call). On top of that, you will also need to pass the recording to the transcription provider, which adds further latency.

  • Only available on paid Zoom plans

    Just like with getting transcripts directly from Zoom Cloud, your users must be on a paid Zoom plan, otherwise this functionality is not available.

  • OAuth Integration needed

    Just like with getting transcripts directly from Zoom Cloud, your users need to connect their Zoom accounts via OAuth, which can cause user onboarding friction.

  • Zoom will need to review your app

    Just like with getting transcripts directly from Zoom Cloud, you will need to create a Zoom app to access your user’s Zoom cloud recordings. The Zoom app will need to go through a review before it can be used in production. This review takes 4 weeks on average.

  • No real-time transcripts

    Zoom Cloud recordings are only available after the meeting is done, so you won’t be able to get the transcription in real time.

Method 3: RTMP live streaming + Transcription API

Another option that uses the Zoom API, but can provide real-time data, is Zoom’s RTMP live streaming feature.


RTMP, or the Real-Time Messaging Protocol, is a protocol for streaming audio and video in real time over the internet. Zoom supports streaming Zoom meetings through RTMP, and you can set up an RTMP endpoint to receive and process these streams to get a transcription.


Here is how this method would work:

  1. Initiate a live stream using Zoom RTMP, providing your Stream URL and Stream Key.
  2. The audio will start live streaming to the Stream URL you provided.
  3. When you receive the audio, you can either:
    1. Store the audio until the meeting is done, and give the transcription provider the full recording. Or,
    2. Stream the audio to the transcription provider in chunks. The transcription provider will give you back the live transcription.
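The second option, streaming in chunks, can be sketched as follows. The sketch assumes the audio has already been extracted from the RTMP stream and decoded to raw 16-bit mono PCM at 16 kHz (for example by a media server in front of your code); that format, the 100 ms chunk size, and the name `chunk_pcm` are all assumptions for illustration, and your transcription provider may expect something different.

```python
from typing import Iterable, Iterator

SAMPLE_RATE_HZ = 16_000  # assumed sample rate of the decoded audio
BYTES_PER_SAMPLE = 2     # 16-bit PCM, mono


def chunk_pcm(stream: Iterable[bytes], chunk_ms: int = 100) -> Iterator[bytes]:
    """Re-buffer arbitrarily sized PCM blocks into fixed-duration chunks."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    buffer = bytearray()
    for block in stream:
        buffer.extend(block)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]
    if buffer:
        yield bytes(buffer)  # final partial chunk when the stream ends
```

Each yielded chunk would be sent to the provider's streaming endpoint as it is produced. For the first option, you would instead collect the whole stream (for example with `b"".join(stream)`) and submit it once the meeting ends.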


Now that you understand at a high level how this works, let’s go through the pros and cons.

Pros

  • No wait time

    Because the data is sent in real-time, you don’t need to wait for a Cloud Recording to complete. You can produce the transcript while the call is in progress to have immediate results after the call is done.

  • Real-time support

    By streaming the audio to the transcription provider in real-time, you could get the transcription in real-time too.

  • Custom vocabulary & multi-language support

    Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.

Cons

  • Live streaming badge can make participants feel uncomfortable

    When you live stream, a “live streaming” badge shows up in your Zoom meeting so all participants know the call is being streamed. This is natively built into Zoom for compliance reasons, and can’t be removed. Understandably, this badge can cause some participants to feel uncomfortable.

  • Only the host can start a stream

    Only the meeting host can start a live stream, so if your user is not the host, this method won’t be applicable.

  • The meeting host must be on a paid Zoom account

    The meeting organizer must have a paid Zoom account to use live streaming.

  • Users must turn on live streaming in settings

    The live streaming feature in Zoom is disabled by default. Your users must enable the setting Allow livestreaming of meetings in Zoom for live streaming to be available. If the option is grayed out, it has been locked at either the group or account level, and your user will need to contact their Zoom admin to make changes.

    The good news is that your users only need to turn it on once, and it will apply to all future meetings.

  • Set up can be a hassle

    Setting up live streaming can be a hassle: only the meeting host can initiate the live stream, and they must do so manually for every meeting they host. Alternatively, you can use the Zoom API to start the live stream automatically, but in that case users will need to connect their Zoom accounts via OAuth.

  • Third-party transcription is required, which is an additional cost

    There is no built-in transcript with Zoom RTMP – you’ll need to work with a third-party transcription provider which will come with an additional cost.

  • No speaker separation

    Zoom live streaming only provides a mixed audio stream, with no speaker metadata, so you cannot get a transcript annotated with speaker names. Some transcription providers can use AI to separate the transcription by speaker, but in those cases you’d get speakers labeled as “Speaker 1, Speaker 2, …” instead of each person’s actual name.

  • High latency

    Because RTMP is an inherently high latency protocol, latencies of 10-30s are expected. However, if you aren’t working with the audio in real-time, this will not be a problem for you.

  • You don’t get per-word time stamps

    Whether or not this is a problem depends on your use case. For example, if you want to build a UI where you can click on a word to jump to a timestamp, you will need a timestamp for each word. But if this isn’t a product need for you, this won’t be a problem.

For additional information on this option, here’s a link to our blog post on the pros and cons of Zoom RTMP streaming.

Method 4: Desktop app to capture system audio + Transcription API

An alternative way to get access to audio streams that doesn’t require Zoom APIs is to build a desktop app that records audio from the user’s computer. You can then use a transcription API to transcribe that audio, or you could even run an open-source transcription model like Whisper locally on your user’s computer.


Let’s weigh the pros and cons of this method.

Pros

  • No wait time

    Because the data can be sent in real time, you don’t need to wait for a Cloud Recording to complete. You can produce the transcript while the call is in progress to have immediate results after the call is done.

  • Real-time support

    By streaming live audio to the transcription provider, you could get the transcription in real time too.

  • Works on free Zoom plans

    Unlike Methods 1, 2, and 3, a desktop app can record calls hosted on free Zoom accounts.

  • Your user can record the meeting, even if they are not the host

    Also unlike Methods 1, 2, and 3, a desktop app can record the system audio even if your user is not the host of the meeting.

  • Works on all meeting platforms

    This method works the same across all video conferencing platforms (Zoom, Google Meet, Microsoft Teams, etc), which means your users will have a more consistent experience, and you also save the engineering effort of building a new integration for each platform you need to work with.

  • Custom vocabulary & multi-language support

    Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.

Cons

  • Significant engineering burden

    If you don’t already have a desktop app, building one is a significant engineering challenge. Shipping a high-quality, stable desktop recorder can take a skilled engineering team several months, and continuous maintenance is required. Desktop apps are particularly challenging because end-user computers are much more varied environments than your servers, which can trigger more bugs, and you need to support multiple platforms such as macOS, Windows, and Linux.

  • Users need to install a desktop app

    Having your users install a desktop app can be a barrier to adoption as it leads to higher friction. Many companies disallow employees from installing new desktop apps, and many users don’t want to install an app if they’re just trying out your product.

  • No speaker separation

    Because you can only get 2 audio streams from a desktop app, the speakers and the microphone, if there are more than 2 participants in a meeting you won’t be able to separate the transcript by speaker.

    Just like in Method 3, you could leverage your transcription API’s machine diarization capabilities to split speakers out into “Speaker 1”, “Speaker 2”, etc.

  • Can make your user’s computer slow

    Running the desktop app can put additional load on your users’ computers, causing them to heat up, perform more slowly, and have a shorter battery life. This can be especially severe if your desktop app is not highly optimized or if you’re running heavy computations like media encoding or transcription locally.

  • Third-party transcription is required, which is an additional cost

    You’ll need to pay an additional cost to the third-party transcription provider, which can be expensive at scale.
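To illustrate the two-stream limitation described under “No speaker separation” above: if you transcribe the microphone and the system audio separately, the most you can label without machine diarization is “your user” versus “everyone else”. The sketch below assumes you already have timestamped segments for each stream; the `Segment` and `LabeledSegment` types, the `merge_channels` function, and the label strings are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # seconds from the start of the recording
    text: str


class LabeledSegment(NamedTuple):
    start: float
    speaker: str
    text: str


def merge_channels(
    mic: List[Segment],
    system: List[Segment],
    user_name: str = "Me",
) -> List[LabeledSegment]:
    """Interleave microphone and system-audio segments in time order.

    Microphone audio is attributed to the local user. System audio mixes
    all remote participants, so it gets a single shared label.
    """
    labeled = [LabeledSegment(s.start, user_name, s.text) for s in mic]
    labeled += [LabeledSegment(s.start, "Other participants", s.text) for s in system]
    return sorted(labeled, key=lambda segment: segment.start)
```

In a two-person call this is effectively full speaker separation; with three or more participants, every remote speaker collapses into the same label.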

Method 5: Web app to capture microphone + Transcription API

In this method, a web app records the audio coming from the user’s microphone and passes it to a transcription provider. For this to work, the Zoom meeting audio needs to be played out loud so the computer’s microphone picks it up. This is much easier to build than a desktop app and doesn’t require much maintenance either.

Pros

  • No wait time

    Because the data can be sent in real time, you can produce the transcript while the call is in progress to have immediate results after the call is done.

  • Real-time support

    By streaming the live audio to the transcription provider, you could get the transcription in real time too.

  • Works on free Zoom plans

    A web app can record the microphone audio even if your user is on a free Zoom account.

  • Your user can record the meeting, even if they are not the host

    A web app can record the microphone audio even if your user is not the host of the meeting.

  • Works on all meeting platforms

    This method works the same across all video conferencing platforms (Zoom, Google Meet, Microsoft Teams, etc), which means your users will have a more consistent experience, and you also save the engineering effort of building a new integration for each platform you need to work with.

  • Custom vocabulary & Multi-language support

    Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.

Cons

  • Method won’t work if your user has headphones in

    The web app won’t be able to capture what the other participants in the meeting are saying if your user is using headphones. If you want to get the audio from the meeting, your user needs to play the meeting audio out loud so their microphone picks up on it.

  • Method won’t work if the computer speaker volume is too quiet

    Similarly, if the user’s computer speaker volume is too low, the microphone won’t pick up the audio.

  • No speaker separation

    Because you can only get 1 audio stream from the microphone, you won’t be able to separate the transcript by speaker.

    Just like in the previous methods, you could leverage your transcription API’s machine diarization capabilities to split speakers out into “Speaker 1”, “Speaker 2”, etc.

  • Third-party transcription is required, which is an additional cost

    You’ll need to pay an additional cost to the third-party transcription provider, which can be expensive at scale.

Method 6: Build a Meeting Bot + Transcription API

Building a meeting bot is an option that doesn’t involve a change in user behavior, and also doesn’t put additional load on your users’ computers. A meeting bot is essentially an instance of Zoom that you run on your servers, which you can use to capture the audio and video data from the meeting in real time. You can then pass the audio and video to a transcription API. Note that the bot will show up as another participant in the meeting.

Pros

  • Speaker separated transcripts

    Meeting bots have access to data on when each person in the meeting is speaking. This means you’ll be able to diarize your transcript accurately and figure out what words were said by which person, no matter which transcription API or model you’re using.

  • Works on free Zoom plans

    A meeting bot can record the meeting even if your user is on a free Zoom account.

  • Your user can record the meeting, even if they are not the host

    A meeting bot can record the meeting even if your user is not the host of the meeting.

  • No wait time

    Because the audio can be streamed in real time, you can produce the transcript while the call is in progress to have immediate results after the call is done.

  • Low latency, real-time support

    Meeting bots are a real-time and low-latency form of data capture because they connect directly to the meeting. You can expect minimal latency of 200-500 ms while using a meeting bot, and you’ll be able to do real-time analysis of the transcription.

  • Consistent user experience

    Meeting bots have the same user experience across all the major platforms of Zoom, Meet, Teams, and Webex. Therefore, if you go the meeting bot route, your users will have a consistent experience across all platforms (though you’ll need to maintain separate bot implementations for each platform).

  • Custom vocabulary & multi-language support

    Because you are using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.
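The speaker separation described at the top of this list can be sketched as a simple join between two timelines: the active-speaker events the bot observes, and the word timestamps returned by the transcription provider. The `SpeakerEvent` and `Word` types and both function names are hypothetical, and the sketch assumes one active speaker at a time, so overlapping speech would need extra handling.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional, Tuple


class SpeakerEvent(NamedTuple):
    start: float  # time (seconds) at which this participant became the active speaker
    name: str


class Word(NamedTuple):
    start: float  # time (seconds) at which the word was spoken
    text: str


def assign_speakers(
    words: List[Word], events: List[SpeakerEvent]
) -> List[Tuple[Optional[str], str]]:
    """Label each word with whoever was the active speaker when it started."""
    ordered = sorted(events)
    starts = [event.start for event in ordered]
    labeled = []
    for word in words:
        # Index of the last speaker change at or before this word.
        index = bisect_right(starts, word.start) - 1
        speaker = ordered[index].name if index >= 0 else None
        labeled.append((speaker, word.text))
    return labeled


def to_utterances(
    labeled: List[Tuple[Optional[str], str]]
) -> List[Tuple[Optional[str], str]]:
    """Group consecutive words from the same speaker into one utterance."""
    utterances: List[Tuple[Optional[str], str]] = []
    for speaker, text in labeled:
        if utterances and utterances[-1][0] == speaker:
            utterances[-1] = (speaker, utterances[-1][1] + " " + text)
        else:
            utterances.append((speaker, text))
    return utterances
```

Because the speaker names come from the meeting itself rather than from the audio, this works the same way regardless of which transcription API or model produced the words.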

Cons

  • Significant engineering burden

    Meeting bots are a major effort to build, and the effort must be repeated for each platform you want to support, as a bot built for Zoom won’t work on any other platform. It takes a skilled engineering team over a year to build stable, scalable, and cost-effective bots for the 3 major platforms: Zoom, Meet, and Teams.

  • Ongoing maintenance burden

    Meeting bots come with a lot of maintenance and infrastructure toil to operate at scale. Because meeting bots connect to meeting platforms that in many cases don’t have official APIs, there’s a significant amount of maintenance required to ensure that your bots keep up with any platform-level changes that could break them. Additionally, because meeting bots run in your infrastructure, you’ll need to scale, monitor, and debug issues with bots as your customer base grows. The maintenance work required to run meeting bots at scale can take the full-time effort of 3-6 senior engineers.

  • Can be costly to operate

    You also need to pay for the infrastructure you’re using to host the meeting bot. Not only are meeting bots difficult to build and maintain, but they can also be costly to run. Each bot is running an instance of Zoom, which means you end up managing a fleet of servers, which can be costly until you reach economies of scale.

  • Third-party transcription is required, which is an additional cost

    Because transcription isn’t built-in, you’ll need to pay for transcription whether it comes from a third-party provider, or from an open-source model you’re hosting yourself.

  • Zoom will need to review your app

    Zoom meeting bots typically join the Zoom meeting by running an instance of the Zoom SDK. Before your Zoom SDK credentials can be used in production, Zoom will need to review your meeting bot. This takes around 4 weeks.

Method 7: Use Recall.ai

Recall.ai is a hosted meeting bot service. Recall manages the infrastructure, monitors and updates the bot implementations to handle updates on each platform, and allows you to use meeting bots with minimal engineering effort.

Pros

  • Works on all meeting platforms

    Recall is a unified API, which means you integrate once, and you’re able to integrate with all the video conferencing platforms (Zoom, Microsoft Teams, Google Meet, etc). Your users will have a more consistent experience, and you also save the engineering effort of building a new integration for each platform you need to work with.

  • Works on free Zoom plans

    Recall can record the meeting even if your user is on a free Zoom account.

  • Your user can record the meeting, even if they are not the host

    Recall can record the meeting even if your user is not the host of the meeting.

  • Speaker separated transcripts

    Recall has access to data on when each person in the meeting is speaking. This means you’ll be able to diarize your transcript accurately and figure out what words were said by which person, no matter which transcription API or model you’re using.

  • No wait time

    Recall gives you the complete transcription within seconds after the meeting is done.

  • Low latency, real-time support

    Recall is an API for meeting bots, and meeting bots are a real-time and low-latency form of data capture because they connect directly to the meeting. You can expect minimal latency of 200-500 ms while using Recall, and you’ll be able to do real-time analysis of the transcription.

  • Consistent user experience

    Recall gives you the same user experience across all the major platforms of Zoom, Meet, Teams, and Webex. Therefore, if you go the Recall route, your users will have a consistent experience across all platforms, which is a bot joining the meeting. Also, once you integrate with Recall, your implementation automatically works across the other platforms — no extra code is required.

  • You can use third-party transcription APIs OR Zoom’s native transcription

    The Recall bot API can scrape captions from the meeting platforms, so you can use Zoom’s native transcription without needing your user to enable any settings.

    Recall is also integrated with major transcription API providers such as AWS Transcribe, Deepgram, AssemblyAI, Rev, and more. So if you choose to use a third-party provider for more advanced features, such as custom vocabulary recognition, Recall has that support natively built in.

    Both options are available with Recall, so you can pick whichever makes the most sense for your use case.

  • Custom vocabulary & multi-language support

    If you do end up using a third-party transcription API, you can specify a “dictionary” of industry jargon, people’s names, and other uncommon words to make sure they get transcribed correctly. Most transcription APIs also support multiple languages.

  • Fast build time

    Out of all the options here, Recall is the fastest to build with. On average it takes a developer 72 hours to fully integrate Recall.

Cons

  • Additional cost

    It costs money to use the Recall API, compared to options such as the Zoom Cloud Recording transcript, which is free for you to access. However, because the Recall platform is highly optimized, it is generally cheaper to use Recall than to run bots yourself.

  • Zoom will need to review your app

    Zoom meeting bots join the Zoom meeting by running an instance of the Zoom SDK. Before your Zoom SDK credentials can be used in production, Zoom will need to review your meeting bot. This typically takes around 4 weeks. Although Recall can’t control Zoom’s review timelines, Recall can make the review process less stressful and guide you, as they have helped hundreds of customers through it.

Comparison chart

Method Method 1: Download Transcripts from the Zoom Cloud Recording API Method 2: Zoom Cloud Recording + Transcription API Method 3: RTMP live streaming + Transcription API Method 4: Desktop app to capture system audio + Transcription API Method 5: Web app to capture microphone + Transcription API Method 6: Build a Meeting Bot + Transcription API Method 7: Use Recall.ai
Speaker separated transcripts x Speaker-separated transcripts are available if the user turns on the “create audio transcript” setting in Zoom. x x
Works on any Zoom Plan x x x x
Users can record, even if they are not the host x x x x
Transcripts are available instantly after the meeting x x x x x
Transcripts are available in real-time Real-time transcription is available but with high latency (10-30 seconds). x x x x
Transcripts support multiple languages x x x x x x
Transcript vocabulary can be customized x x x x x x
Transcripts have per-word timestamps x x x x x x
Zoom app review not required x x
Fast to integrate with x x x
No maintenance required x x x x
Don’t need to pay a 3rd party for additional costs x You don’t need to pay for third-party transcription if you’re transcribing on-device.
Doesn’t slow down your user’s computer x x x x x x
Doesn’t require users to install a desktop app x x x x x x
Doesn’t require users to OAuth their Zoom account x x x x

Still don’t know which one to go with?

If you’ve read all the options and are still unsure which one makes the most sense for you, we’re happy to help. We’ve seen hundreds of use cases and we’ll be 100% honest with you on which method makes the most sense for you - no BS.

Request a chat with an expert, and see you soon!