Amazon Web Services (AWS) Transcribe provides an API to automatically convert an audio speech file (mp3/wav) into text (a transcript). It supports not only English but also Indonesian and several other languages. It even offers a free tier of 60 minutes of transcription per month for the first 12 months for new users.
If you are using Python, the good news is that there is an official, well-documented module for accessing AWS APIs (including Transcribe) named boto3 (`pip install boto3`). With it, writing a script to automatically transcribe audio speech files is a breeze!
In this post, I’m sharing a simple tutorial on creating a script to transcribe an audio file extracted from a YouTube video, Pidato Kenegaraan Presiden Jokowi (Joko Widodo’s Presidential Speech) (2:21-3:42). IMO, the final result is surprisingly good:
bapak, ibu, saudara saudara sebangsa dan setanah air. Mimpi kita, cita cita kita di tahun dua ribu empat puluh lima. Pada satu abad Indonesia merdeka, mestinya Insya Allah Indonesia telah keluar dari jebakan pendapatan kelas menengah. Indonesia telah menjadi negara maju dengan pendapatan menurut hitung hitungan tiga ratus dua puluh juta rupiah per kapita per tahun atau dua puluh tujuh juta per kapita per bulan. Itulah target kita. Itulah target kita bersama. Mimpi kita di tahun dua ribu empat puluh lima, produk domestik bruto Indonesia mencapai tujuh triliun dolar dan Indonesia sudah masuk ke lima besar ekonomi dunia. Dengan kemiskinan mendekati nol persen, kita harus menuju ke sana.
First, we need to create credentials to access AWS APIs. We can do this by creating an IAM user in the IAM (Identity and Access Management) section of the AWS Console, with the AmazonTranscribeFullAccess and AmazonS3FullAccess policies attached. Once the IAM user is created, we can generate access keys (Access Key ID & Secret Access Key) under its Security Credentials tab.
Finally, take note of these 3 values required by boto3 to call AWS APIs:

- Access Key ID
- Secret Access Key
- Region (e.g. us-east-1, ap-southeast-1, ca-central-1)

Now, we can start to code!
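Since the script below reads these values with configparser, we can keep them in an external file instead of hard-coding them. A minimal sketch of such a file (the filename config.ini and the default section name are my own choices, not anything required by AWS):

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
region = ap-southeast-1
```

Keeping credentials out of the script also makes it safer to commit the code to a public repository.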
First, initialize the boto3 clients using the values above:

```python
import configparser
import boto3

# load the credential values from an external config file
config = configparser.ConfigParser()
config.read('config.ini')

# init AWS session
session = boto3.session.Session(
    aws_access_key_id=config['default']['aws_access_key_id'],
    aws_secret_access_key=config['default']['aws_secret_access_key'],
    region_name=config['default']['region']
)
s3 = session.client('s3')
transcribe = session.client('transcribe')
```
Then, upload the input audio file to our S3 bucket. If we don’t have an existing bucket, we can create one first using create_bucket():

```python
# file_path: local path of the audio file
# file_name: the object key to use in the bucket
s3.upload_file(file_path, bucket_name, file_name)
```
Finally, start the transcription job, providing a job name (in this case, I use the file name), the language code (check the complete list here), the S3 URL of the input file, and a bucket name (where the transcription result will be saved):

```python
job_name = file_name
s3_file = f's3://{bucket_name}/{file_name}'
res = transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    LanguageCode='id-ID',
    Media={'MediaFileUri': s3_file},
    OutputBucketName=bucket_name
)
```
The job will require some time to complete, depending on the length of the audio file, so we need to periodically check its status:

```python
import time

# wait for the job to complete
completed = False
while not completed:
    res = transcribe.list_transcription_jobs(
        JobNameContains=job_name,
        MaxResults=1
    )
    summaries = res.get('TranscriptionJobSummaries', [])
    if summaries:
        completed = summaries[0]['TranscriptionJobStatus'] == 'COMPLETED'
    if completed:
        print('Job is completed')
    else:
        print('Waiting for job to complete...')
        time.sleep(5)
```
Once the job is completed, we need to download the transcription result from the bucket defined when calling start_transcription_job(). The result file is in JSON format, named `{job_name}.json`:

```python
import json

result_file = f'{job_name}.json'
s3.download_file(bucket_name, result_file, result_file)
with open(result_file, 'r') as f:
    res_file = json.load(f)
print(res_file['results']['transcripts'][0]['transcript'])
```
Besides the transcript itself, the result file also contains information (timing, confidence score, token type) about each token detected in the audio file. For example:

```json
[
  {"start_time": "2.44", "end_time": "2.73", "alternatives": [{"confidence": "1.0", "content": "bapak"}], "type": "pronunciation"},
  {"alternatives": [{"confidence": "0.0", "content": ","}], "type": "punctuation"},
  {"start_time": "2.73", "end_time": "3.28", "alternatives": [{"confidence": "1.0", "content": "ibu"}], "type": "pronunciation"},
  {"alternatives": [{"confidence": "0.0", "content": ","}], "type": "punctuation"}
]
```
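As a quick illustration of how these per-token items could be consumed, here is a minimal sketch that pairs each spoken word with its timing, using a hard-coded fragment shaped like the items array above (punctuation items carry no timing, so they are skipped):

```python
import json

# a small fragment shaped like the `items` array of the result file
items_json = '''
[
  {"start_time": "2.44", "end_time": "2.73", "alternatives": [{"confidence": "1.0", "content": "bapak"}], "type": "pronunciation"},
  {"alternatives": [{"confidence": "0.0", "content": ","}], "type": "punctuation"},
  {"start_time": "2.73", "end_time": "3.28", "alternatives": [{"confidence": "1.0", "content": "ibu"}], "type": "pronunciation"}
]
'''

# keep only spoken tokens; each item's best guess is alternatives[0]
timed_words = [
    (item['start_time'], item['end_time'], item['alternatives'][0]['content'])
    for item in json.loads(items_json)
    if item['type'] == 'pronunciation'
]
for start, end, word in timed_words:
    print(f'{start}s-{end}s: {word}')
```

This kind of per-word timing is handy for generating subtitles or aligning the transcript back to the audio.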
That’s all! You can check the complete code here.

Enjoy!