Text to Speech with AWS Polly

For a slightly different experiment, I decided to try out TTS (Text to Speech).

There are quite a few options here, but basically you've got:

Big Cloud offerings from AWS, Google and Microsoft
Dedicated web services specializing in TTS
Local models

Since I already have lots of experience with AWS, I decided to try Polly.

If you're already set up on AWS, using Polly is incredibly simple. Literally 10 lines of code:

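import boto3

# Assumes AWS credentials are already configured locally.
polly = boto3.client("polly", region_name="us-east-1")

# Emma is one of Polly's British English voices.
response = polly.synthesize_speech(
    Text="Hello, this is synthetic speech",
    OutputFormat="mp3",
    VoiceId="Emma",
    Engine="standard",
)

# AudioStream is a streaming body; read it and save the MP3 bytes.
with open("speech_en_uk.mp3", "wb") as f:
    f.write(response["AudioStream"].read())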

This creates an MP3 file containing the synthesized voice.

A few things that I quickly found out:

Several voices are available for each language/variant

Many sound quite good, but some sound pretty unnatural, at least to me. It's worth experimenting to find which ones you prefer. I did that by programmatically generating an MP3 for each voice, so that I could listen and compare. You can also do ad-hoc TTS with different voices in the AWS console.
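The comparison step could look something like the sketch below. It is not my exact script, just a minimal version under a few assumptions: the function name, the output file naming, and the sample sentence are made up for illustration, and pagination of the voice list is ignored. The `polly` argument is a boto3 Polly client like the one created earlier.

```python
from pathlib import Path


def generate_voice_samples(polly, language_code, text, out_dir, engine="standard"):
    """Write one MP3 per voice that supports `engine`; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # describe_voices lists the voices Polly offers for a language code.
    # (Pagination via NextToken is ignored in this sketch.)
    voices = polly.describe_voices(LanguageCode=language_code)["Voices"]

    written = []
    for voice in voices:
        # Not every voice is available on every engine, so skip mismatches.
        if engine not in voice.get("SupportedEngines", []):
            continue
        response = polly.synthesize_speech(
            Text=text,
            OutputFormat="mp3",
            VoiceId=voice["Id"],
            Engine=engine,
        )
        # Hypothetical naming scheme: language, voice, engine.
        path = out_dir / f"{language_code}_{voice['Id']}_{engine}.mp3"
        path.write_bytes(response["AudioStream"].read())
        written.append(path)
    return written


# Example usage (needs boto3 and AWS credentials):
#   import boto3
#   polly = boto3.client("polly", region_name="us-east-1")
#   generate_voice_samples(polly, "es-ES", "Hola, esto es una prueba.", "samples")
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before spending any real Polly characters.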

There are different tiers for each voice

Polly has tiers (the API calls them engines) named "Standard", "Neural", "Long-form" and "Generative". The higher tiers really do sound better, but they're more expensive. For small experiments, the price difference doesn't matter: this cost me $0.00 because I didn't exceed the "free" usage level.
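To hear the tiers side by side, one option is to synthesize the same sentence on every engine that a given voice supports. The sketch below assumes a boto3 Polly client is passed in as `polly`; the function names are mine, and engine availability can also depend on the region, so a real run may need error handling that is left out here.

```python
def supported_engines(polly, voice_id, language_code):
    """Return the engines Polly reports for one voice, or [] if it isn't found."""
    voices = polly.describe_voices(LanguageCode=language_code)["Voices"]
    for voice in voices:
        if voice["Id"] == voice_id:
            return list(voice.get("SupportedEngines", []))
    return []


def synthesize_on_each_engine(polly, voice_id, language_code, text):
    """Return {engine: mp3_bytes} for every engine the voice supports."""
    clips = {}
    for engine in supported_engines(polly, voice_id, language_code):
        response = polly.synthesize_speech(
            Text=text,
            OutputFormat="mp3",
            VoiceId=voice_id,
            Engine=engine,  # e.g. "standard" or "neural"
        )
        clips[engine] = response["AudioStream"].read()
    return clips
```

Writing each entry of the returned dict to its own file gives a quick A/B test for deciding whether the pricier tier is worth it for a particular voice.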

My Experiment

My full experiment here was for Spanish vocab learning. I wanted to create an audio file that I could listen to in the car. Nothing fancy. It would simply:

read a vocab word in English
read the same word translated to Spanish
read an example sentence in English
read the same sentence translated to Spanish (twice, and 20% slower)

Repeat for a long list of vocab words. Keep it quick, with only 1-2 seconds between each one.

That's it. That's all this program does. It makes a custom MP3 of vocab words and sentences.
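For the curious, here is roughly how the steps above could be wired together with Polly's SSML support. Treat it as a sketch under assumptions rather than the actual program: the dictionary keys, the function names, and the choice of "Lucia" as the Spanish voice are all illustrative, I'm reading "twice, and 20% slower" as both repetitions being slowed to 80% speed, and gluing the MP3 chunks together byte-for-byte is assumed to play fine in typical players.

```python
from xml.sax.saxutils import escape

# Hypothetical voice choices; any English/Spanish Polly voices would do.
ENGLISH_VOICE = "Emma"
SPANISH_VOICE = "Lucia"


def ssml(text, rate_percent=100, pause_seconds=1.0):
    """Wrap text in SSML at the given speaking rate, followed by a pause."""
    body = escape(text)  # escape &, <, > so the markup stays valid
    if rate_percent != 100:
        body = f'<prosody rate="{rate_percent}%">{body}</prosody>'
    return f'<speak>{body}<break time="{int(pause_seconds * 1000)}ms"/></speak>'


def segments_for_entry(entry):
    """Return (voice_id, ssml) pairs for one vocab entry, in playback order."""
    slow = ssml(entry["sentence_es"], rate_percent=80)  # 20% slower
    return [
        (ENGLISH_VOICE, ssml(entry["word_en"])),
        (SPANISH_VOICE, ssml(entry["word_es"])),
        (ENGLISH_VOICE, ssml(entry["sentence_en"])),
        (SPANISH_VOICE, slow),
        (SPANISH_VOICE, slow),  # the Spanish sentence is read twice
    ]


def build_lesson(polly, entries, out_path, engine="standard"):
    """Synthesize every segment and append the MP3 bytes to one file."""
    with open(out_path, "wb") as out:
        for entry in entries:
            for voice_id, markup in segments_for_entry(entry):
                response = polly.synthesize_speech(
                    Text=markup,
                    TextType="ssml",
                    OutputFormat="mp3",
                    VoiceId=voice_id,
                    Engine=engine,
                )
                out.write(response["AudioStream"].read())
```

Each segment ends with a one-second break, which keeps the gaps inside the 1-2 second range. Changing `pause_seconds` or `rate_percent` is all it takes to retune the pacing.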

I've never been very happy with the audio materials for language learning apps. They're always too slow. Or they have long gaps between each word/sentence. Or some long introduction before getting to the vocab. I like that my MP3 is fully customizable with nothing extraneous.

We'll see if this suits me any better.

Either way, this was fun to make using Claude Code. It took maybe 30 minutes. You could zero-shot this with a single prompt, but I spent a little bit of extra time playing with the different voices and setting up a file format that I can use in the future.