Auto speech-to-text (Indonesian) with AWS Transcribe and Python

Amazon Web Services (AWS) Transcribe provides an API that automatically converts a speech audio file (mp3/wav) into text (a transcript). It supports not only English but also Indonesian and several other languages. It even offers 60 free minutes of transcription per month for the first 12 months to new users.

If you are using Python, the good news is that there is an official, well-documented module for accessing AWS APIs (including Transcribe) named boto3 (pip install boto3). With it, writing a script that automatically transcribes speech audio files is a breeze! 🏖️

In this post, I’m sharing a short tutorial on writing a simple script that transcribes an audio file extracted from a YouTube video, Pidato Kenegaraan Presiden Jokowi (President Joko Widodo’s State Address) (2:21-3:42). IMO the final result is surprisingly good:

bapak, ibu, saudara saudara sebangsa dan setanah air. Mimpi kita, cita cita kita di tahun dua ribu empat puluh lima. Pada satu abad Indonesia merdeka, mestinya Insya Allah Indonesia telah keluar dari jebakan pendapatan kelas menengah. Indonesia telah menjadi negara maju dengan pendapatan menurut hitung hitungan tiga ratus dua puluh juta rupiah per kapita per tahun atau dua puluh tujuh juta per kapita per bulan. Itulah target kita. Itulah target kita bersama. Mimpi kita di tahun dua ribu empat puluh lima, produk domestik bruto Indonesia mencapai tujuh triliun dolar dan Indonesia sudah masuk ke lima besar ekonomi dunia. Dengan kemiskinan mendekati nol persen, kita harus menuju ke sana.

First, we need to create credentials to access AWS APIs. We can do this by creating an IAM user in the IAM (Identity and Access Management) section of the AWS Console and attaching the AmazonTranscribeFullAccess and AmazonS3FullAccess policies to it. Once the IAM user is created, we can generate its access keys (Access Key ID and Secret Access Key) under Security Credentials.

Take note of these three values, which boto3 needs in order to call AWS APIs:

  1. Access Key ID
  2. Secret Access Key
  3. Region Code (e.g. us-east-1, ap-southeast-1, ca-central-1)

[Screenshot: IAM user with AmazonTranscribeFullAccess & AmazonS3FullAccess policies]

Now, we can start to code!

First, initialize the boto3 clients using the values above:

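import configparser, boto3

# load the three values from an external file
# ('config.ini' and its [aws] section are placeholder names)
config = configparser.ConfigParser()
config.read('config.ini')
aws = config['aws']

# init AWS session
session = boto3.session.Session(
    aws_access_key_id=aws['access_key_id'],
    aws_secret_access_key=aws['secret_access_key'],
    region_name=aws['region'],
)
s3 = session.client('s3')
transcribe = session.client('transcribe')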

Then, upload the input audio file to our S3 bucket. If we don’t have a bucket yet, we can create one first using create_bucket():

# file_path is the local path, file_name is the object key in the bucket
s3.upload_file(file_path, bucket_name, file_name)
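
For the create_bucket() case, one detail that is easy to miss: outside us-east-1, S3 expects the target region to be passed as a LocationConstraint, while for us-east-1 the constraint has to be left out. A minimal sketch, assuming a hypothetical helper named create_bucket_kwargs() (it is not part of boto3) and a made-up bucket name:

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3.create_bucket().

    Hypothetical helper for illustration; not part of boto3.
    """
    kwargs = {'Bucket': bucket_name}
    # us-east-1 is the default location and must not be passed explicitly
    if region != 'us-east-1':
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs


# Usage with the s3 client from above (needs real credentials):
# s3.create_bucket(**create_bucket_kwargs('my-transcribe-bucket', 'ap-southeast-1'))
```

Keeping the argument building separate from the API call makes it easy to check without touching AWS.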

Next, start the transcription job by providing a job name (in this case, I use the file name), the language code (check the complete list here), the S3 URL of the input file, and the name of the bucket where the transcription result will be saved:

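job_name = file_name
s3_file = f's3://{bucket_name}/{file_name}'
res = transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode='id-ID',  # Indonesian
    Media={'MediaFileUri': s3_file},
    OutputBucketName=bucket_name,  # where the result JSON is saved
)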
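
One thing to watch when reusing the file name as the job name: as far as I know, Transcribe only accepts letters, digits, dots, underscores, and hyphens in job names (up to 200 characters), and each name must be unique within the account. A small sketch of a hypothetical make_job_name() helper that cleans up a file name accordingly; the helper, its suffix parameter, and the sample file name are my own assumptions:

```python
import re

# characters that are NOT allowed in a job name (assumed rule: [0-9a-zA-Z._-])
_INVALID = re.compile(r'[^0-9a-zA-Z._-]')


def make_job_name(file_name, suffix=''):
    """Turn a file name into a string usable as a Transcribe job name."""
    name = _INVALID.sub('-', file_name)
    if suffix:
        # an optional suffix (e.g. a timestamp) helps keep names unique
        suffix = '-' + _INVALID.sub('-', suffix)
    # keep the total length within 200 characters without cutting the suffix
    return name[:200 - len(suffix)] + suffix


print(make_job_name('pidato jokowi (2019).mp3'))
print(make_job_name('pidato.mp3', suffix='run 1'))
```

If the job name ends up different from the file name, remember that the result file is named after the job name, not the file name.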

The job will take some time to complete, depending on the length of the audio file, so we need to check its status periodically:

import time

# wait until the job completes
completed = False
while not completed:
    # look up our job by name
    res = transcribe.list_transcription_jobs(JobNameContains=job_name)
    if 'TranscriptionJobSummaries' in res:
        if len(res['TranscriptionJobSummaries']) > 0:
            job = res['TranscriptionJobSummaries'][0]
            status = job['TranscriptionJobStatus']
            if status == 'FAILED':
                raise RuntimeError(f'Job {job_name} failed')
            completed = status == 'COMPLETED'
    if completed:
        print('Job is completed')
    else:
        print('Waiting for job to complete...')
        time.sleep(5)  # pause between checks
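
An alternative way to do the waiting is to ask for the job directly with get_transcription_job() and to stop on failure or after a time limit. This is only a sketch under my own assumptions: wait_for_job(), its parameters, and the default intervals are hypothetical names and values, not something taken from the complete code.

```python
import time


def wait_for_job(transcribe, job_name, poll_seconds=5, timeout_seconds=900,
                 sleep=time.sleep):
    """Poll get_transcription_job() until the job completes or fails.

    Returns the final TranscriptionJob dict; raises on failure or timeout.
    """
    waited = 0
    while True:
        res = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        job = res['TranscriptionJob']
        status = job['TranscriptionJobStatus']
        if status == 'COMPLETED':
            return job
        if status == 'FAILED':
            # FailureReason is only present on failed jobs
            raise RuntimeError(job.get('FailureReason', 'transcription failed'))
        if waited >= timeout_seconds:
            raise TimeoutError(f'{job_name} still {status} after {waited}s')
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the transcribe client from above (needs real credentials):
# job = wait_for_job(transcribe, job_name)
```

Passing the sleep function as a parameter is just a convenience so the loop can be exercised with a fake client and no real waiting.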

Once the job is completed, we need to download the transcription result from the bucket defined when calling start_transcription_job(). The result file will be in JSON format and the name will be {job_name}.json.

import json

# the result object is named after the job: {job_name}.json
result_file = f'{job_name}.json'
s3.download_file(bucket_name, result_file, result_file)
with open(result_file, 'r') as f:
    res_file = json.load(f)

Besides the transcript itself, the result file also contains information about each token detected in the audio file: its timing, its confidence score, and whether it is a spoken word or a punctuation mark.
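
To make the structure more concrete, here is a small sketch that pulls the transcript and the per-token details out of a loaded result. The summarize_result() helper is a hypothetical name of mine, and the sample dict is a hand-written, shortened stand-in in the shape of a result file, not real output from the job above:

```python
def summarize_result(result):
    """Return (transcript, tokens) from a loaded Transcribe result dict."""
    transcript = result['results']['transcripts'][0]['transcript']
    tokens = []
    for item in result['results']['items']:
        best = item['alternatives'][0]
        tokens.append({
            'content': best['content'],
            'type': item['type'],  # 'pronunciation' or 'punctuation'
            # numbers are stored as strings; punctuation items have no timing
            'start': float(item['start_time']) if 'start_time' in item else None,
            'end': float(item['end_time']) if 'end_time' in item else None,
            'confidence': float(best['confidence']),
        })
    return transcript, tokens


# Hand-written sample for illustration only (values are made up)
sample = {
    'jobName': 'sample',
    'results': {
        'transcripts': [{'transcript': 'Itulah target kita.'}],
        'items': [
            {'start_time': '0.04', 'end_time': '0.51', 'type': 'pronunciation',
             'alternatives': [{'confidence': '0.98', 'content': 'Itulah'}]},
            {'start_time': '0.51', 'end_time': '0.93', 'type': 'pronunciation',
             'alternatives': [{'confidence': '0.99', 'content': 'target'}]},
            {'start_time': '0.93', 'end_time': '1.20', 'type': 'pronunciation',
             'alternatives': [{'confidence': '0.97', 'content': 'kita'}]},
            {'type': 'punctuation',
             'alternatives': [{'confidence': '0.0', 'content': '.'}]},
        ],
    },
}

transcript, tokens = summarize_result(sample)
print(transcript)
for token in tokens:
    print(token)
```

In the flow of this tutorial, the same function would be called with res_file, the dict loaded from the downloaded JSON.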


That’s all! You can check the complete code here.

Enjoy! 🍻
