*I recently updated my website's structure, which impacted the blog and the descriptions I wrote about my projects. So I am now writing blog posts to present my old projects, until I catch up with the present.*
So, back in spring 2019, an American show aired and became quite successful. Some (French) family members wanted to watch it, but their English was quite poor. I found some subtitles for them online, but the timecodes did not match the actual video file. And I have to say, manual synchronisation is a cumbersome process. So I tried to come up with an algorithm to do it for me. More specifically, I developed a Python module that automatically syncs an external .SRT file with a video. Here is how it works:
1. It extracts the audio track from the video.
2. Using speech-to-text, it generates a partial transcript of the video.
3. The target subtitles are translated into the video's language.
4. The two are matched together, and the subtitles are aligned accordingly.
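To give an idea of what the last step can look like, here is a minimal sketch. It is not the module's actual code. It assumes the matching step has already produced pairs of (subtitle time, transcript time) in seconds, and that a single constant offset is enough to fix the subtitles. All function names are made up for the illustration.

```python
import re
from datetime import timedelta
from statistics import median

# SRT timecodes look like 00:01:02,500 (hours:minutes:seconds,milliseconds).
TIMECODE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def parse_timecode(text):
    """Convert an SRT timecode string into a timedelta."""
    hours, minutes, seconds, millis = map(int, TIMECODE.fullmatch(text).groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis)


def format_timecode(delta):
    """Convert a timedelta back into an SRT timecode, clamping at zero."""
    total_ms = max(0, round(delta.total_seconds() * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def estimate_offset(matches):
    """Median gap, in seconds, between transcript times and subtitle times.

    `matches` is a list of (subtitle_seconds, transcript_seconds) pairs
    (a hypothetical output of the matching step).
    """
    return median(audio - sub for sub, audio in matches)


def shift_srt(srt_text, offset_seconds):
    """Shift every timecode found in an SRT document by a constant offset."""
    offset = timedelta(seconds=offset_seconds)
    return TIMECODE.sub(
        lambda m: format_timecode(parse_timecode(m.group(0)) + offset),
        srt_text,
    )
```

The median is used rather than the mean so that a few wrong matches do not drag the estimated offset away. A constant shift cannot fix subtitles made for a different frame rate, which would need a scaling factor as well.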
The first step was a survey of the available software for the speech-to-text and translation steps. I was looking for free, preferably open-source, offline software with multi-language support.
The first search result is of course the [Google Speech-to-Text API](https://cloud.google.com/speech-to-text?hl=fr). And of course, it works well: transcripts are good and a lot of languages are supported. However, it requires online access to the Google API, which is very limited if it has to stay *free*. I therefore decided not to go with Google. Instead, I focused on the [CMUSphinx Project](https://cmusphinx.github.io/). It still supports a lot of languages (see [here](https://github.com/ychalier/subalign#supported-languages)), and it is free and works offline. But it is not as easy to use as the Google API.
First, voluminous language models must be downloaded for each language to be transcribed. Then, I faced some **encoding** issues. My first attempts on clear audio recordings yielded gibberish transcripts. I soon suspected that the issue came from the audio encoding, but I had quite some trouble finding out exactly which input format was needed. I finally found it on the [CMUSphinx FAQ page](https://cmusphinx.github.io/wiki/faq/):
> CMUSphinx decoders do not include format converters, they have no idea how to decode encoded audio. Before processing audio must be converted to PCM format. Recommended format is 16khz 16bit little-endian mono. If you are decoding telephone quality audio you can also decode 8khz 16bit little-endian mono, but you usually need to reconfigure the decoder to input 8khz audio. For example, pocketsphinx has -samprate 8000 option in configuration.
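A quick way to catch this kind of problem before blaming the recogniser is to inspect the audio file itself. The sketch below uses Python's standard `wave` module to compare a WAV file against the recommended format. The function name `check_sphinx_wav` is made up for the illustration, and the check only covers WAV files (which store PCM samples in little-endian order).

```python
import wave


def check_sphinx_wav(path, expected_rate=16000):
    """Return a list of mismatches with the recommended format (empty if fine).

    The wave module only reads uncompressed PCM WAV files and raises
    wave.Error on anything else, so opening the file is already a first check.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {wav.getframerate()} Hz")
    return problems
```

An empty list means the file matches the 16 kHz, 16-bit, mono recommendation. For telephone-quality audio, `expected_rate=8000` can be passed instead.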
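The command I ended up using to extract the audio from video files is the following, where INPUT_VIDEO and OUTPUT_AUDIO.wav stand for the path of the video file and the path of the audio file to write:
ffmpeg -i INPUT_VIDEO -ar 16000 -ac 1 -f wav -bitexact -acodec pcm_s16le OUTPUT_AUDIO.wav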
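Since the rest of the project is a Python module, the same flags can be driven from Python with `subprocess`. This is only a sketch of one way to do it: the function names and the `video_path` / `audio_path` parameters are made up for the illustration, and it assumes `ffmpeg` is installed and available on the `PATH`.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate=16000):
    """Build the argument list for the audio extraction."""
    return [
        "ffmpeg",
        "-i", str(video_path),    # input video file
        "-ar", str(sample_rate),  # audio sampling rate, in Hz
        "-ac", "1",               # one audio channel (mono)
        "-f", "wav",              # WAV container
        "-bitexact",              # same flag as in the command above
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM samples
        str(audio_path),          # output audio file
    ]


def extract_audio(video_path, audio_path, sample_rate=16000):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path, sample_rate), check=True)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.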