In this tutorial, we’ll learn how to build an application with Python and FFmpeg that allows us to turn voice recordings into cool videos that can be easily shared on social media.

By the end of this tutorial, we’ll have turned a voice recording into a video that looks similar to the one below.

Tutorial requirements

To follow this tutorial, you will need the following components.

  • One or more audio recordings that you want to convert to video. Programmable voice recordings stored in your Twilio account are very useful for this tutorial.
  • Python 3.6 or newer installed.
  • FFmpeg version 4.3.1 or later installed.

Creating a project structure

In this section, we will create our project directory, where we will create subdirectories to hold the recordings, images, fonts, and videos we will use in this tutorial. Finally, we’ll create a Python file that will contain code that allows us to create and edit videos using FFmpeg.

Open a terminal window and enter the following commands to create the project directory and change into it.

mkdir twilio-turn-recording-to-video
cd twilio-turn-recording-to-video

Use the following commands to create four subdirectories.

mkdir images
mkdir fonts
mkdir videos
mkdir recordings

The images directory is where we will store the background image for the video. Download this image and save it in the images directory as bg.png. This image was originally downloaded from Freepik.com.

In the fonts directory, we will store the font files used to write text in our videos. Download this font and save it in the fonts directory as LeagueGothic-CondensedRegular.otf. This font was originally downloaded from Fontsquirrel.com.

The videos directory will contain the videos and animations that will be added on top of the background image. Download this spinning record video with the Twilio logo in the middle and save it in the videos directory as spinningRecord.mp4. The source image used in this video was downloaded from Flaticon.com.

The recordings directory is where we will store the audio recordings that will be turned into videos. Add one or more of your own voice recordings to this directory.

Now that we have created all the directories we need, open your favorite code editor and create a file called main.py in your project’s top-level directory. This file will contain the code responsible for turning our recordings into videos.
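At this point the project layout should look like this:

```text
twilio-turn-recording-to-video/
├── fonts/
├── images/
├── recordings/
├── videos/
└── main.py
```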

If you don’t want to follow every step of the tutorial, you can get the full project source code here.

Turn an audio file into a video

In this section, we will add code that allows us to turn a recording into a video, displaying the sound waves of the recording.

We will use FFmpeg to generate a video from an audio file. To call FFmpeg and related programs from Python, we'll use Python's subprocess module.
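Since every step shells out to ffmpeg and ffprobe, it can be worth failing fast when they are missing. This optional sketch (not part of the tutorial's code) uses Python's shutil.which to check the PATH:

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Hypothetical usage: abort early if anything is missing.
# if missing_tools():
#     raise SystemExit("Please install: " + ", ".join(missing_tools()))
```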

Run a command

Add the following code to the main.py file.

import subprocess


def run_command(command):
    p = subprocess.run(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    print('Done!')
    print('stdout:\n{}'.format(p.stdout.decode()))

    return p.stdout.decode().strip()

In the code block above, we import the subprocess module and create a run_command() function. As the name implies, this function runs the command passed to it. When the command completes, we print its output and return it to the caller.
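To sanity-check the helper before wiring FFmpeg into it, you can run a harmless shell command. This is a trimmed copy of run_command() (print statements omitted), and it assumes a POSIX shell where echo is available:

```python
import subprocess


def run_command(command):
    # Run a shell command and capture stdout and stderr together.
    p = subprocess.run(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    return p.stdout.decode().strip()


print(run_command("echo hello"))  # → hello
```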

Get the duration of a recording

Add the following code under the run_command() function.

def get_rec_duration(rec_name):
    rec_path = "./recordings/{}".format(rec_name)

    command = (
        "ffprobe -i {rec_path} -show_entries format=duration "
        "-v quiet -of csv=\"p=0\""
    ).format(rec_path=rec_path)

    rec_duration = run_command(command)
    print("rec duration", rec_duration)

    return rec_duration

Here, we create a function called get_rec_duration(), which retrieves the duration of a recording. It takes the recording name (rec_name) as an argument, prefixes it with the name of the recordings directory, and stores the resulting path in the rec_path local variable.

Next, we build a command string that uses ffprobe, a program that is part of FFmpeg, to get the duration of the recording. We pass this command to the run_command() function and store the returned value in rec_duration.

Finally, we print and then return the recording duration.

The recording duration is needed so that we can give the generated video the same length as the recording.
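ffprobe returns the duration as a decimal string of seconds (for example "12.936000"), which FFmpeg's -t option accepts directly. If you ever want the hh:mm:ss.xxx form instead, a small hypothetical helper like this one (not part of the tutorial's code) can convert it:

```python
def to_timestamp(raw_seconds):
    """Convert an ffprobe duration string like '125.500000'
    into FFmpeg's hh:mm:ss.xxx syntax."""
    total = float(raw_seconds)
    minutes, seconds = divmod(total, 60)
    hours, minutes = divmod(int(minutes), 60)
    return "{:02d}:{:02d}:{:06.3f}".format(hours, int(minutes), seconds)


print(to_timestamp("125.500000"))  # → 00:02:05.500
```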

Convert audio to video

Add the following code under the get_rec_duration() function.

def turn_audio_to_video(rec_name, rec_duration):
    rec_path = "./recordings/{}".format(rec_name)
    bg_image_path = "./images/bg.png"
    video_name = "video_with_sound_waves.mp4"

    command = (
        'ffmpeg -y -i {rec_path} -loop 1 -i {bg_image_path} -t {rec_duration} '
        '-filter_complex "[0:a]showwaves=s=1280x150:mode=cline:colors=00e5ff[fg]; '
        'drawbox=x=0:y=285:w=1280:h=150:color=black@0.8:t=fill[bg]; '
        '[bg][fg]overlay=format=auto:x=(W-w)/2:y=(H-h)/2" '
        '-map 0:a -c:v libx264 -preset fast -crf 18 -c:a aac '
        '-shortest ./videos/{video_name}'
    ).format(
        rec_path=rec_path,
        bg_image_path=bg_image_path,
        rec_duration=rec_duration,
        video_name=video_name,
    )

    print(video_name)
    run_command(command)
    return video_name

The turn_audio_to_video() function turns the recording into a video that displays the sound waves of the recording. It takes the recording name (rec_name) and the recording duration (rec_duration) as parameters.

The FFmpeg command to generate video from audio uses the recording path (rec_path), the path of the background image (bg_image_path), and the output file name of the video (video_name).

Let’s take a closer look at the FFmpeg command.

ffmpeg -y -i {rec_path} -loop 1 -i {bg_image_path} -t {rec_duration} \
-filter_complex "[0:a]showwaves=s=1280x150:mode=cline:colors=00e5ff[fg]; \
drawbox=x=0:y=285:w=1280:h=150:color=black@0.8:t=fill[bg]; \
[bg][fg]overlay=format=auto:x=(W-w)/2:y=(H-h)/2" \
-map 0:a -c:v libx264 -preset fast -crf 18 -c:a aac -shortest ./videos/{video_name}

The [-y](https://ffmpeg.org/ffmpeg-all.html#Main-options) option tells FFmpeg to overwrite the output file if it already exists on disk.

The [-i](https://ffmpeg.org/ffmpeg-all.html#Main-options) option specifies an input. In this example, we have two inputs: the recording file, whose path is stored in rec_path, and the background image, whose path is stored in bg_image_path.

The [-loop](https://ffmpeg.org/ffmpeg-all.html#loop) option generates video by repeating (looping) an input file. In this case, we loop over the image passed in bg_image_path. The default is 0 (no loop), so we set it to 1 (loop) to repeat the image across all video frames.

The [-t](https://ffmpeg.org/ffmpeg-all.html#Main-options) option specifies a duration, either in seconds or using the "hh:mm:ss[.xxx]" syntax. Here we use the recording duration (rec_duration) to set the duration of the output video.

The [-filter_complex](https://ffmpeg.org/ffmpeg.html#filter_005fcomplex_005foption) option allows us to define a complex filtergraph: one with an arbitrary number of inputs and/or outputs. This option takes a number of parameters, discussed below.

First, we use the showwaves filter to convert the voice recording, referenced as [0:a], to video output. The s parameter specifies the output video size, which we set to 1280x150. The mode parameter defines how the audio wave is drawn; the available values are point, line, p2p, and cline. The colors parameter specifies the color of the waveform. The drawn waveform is given the label [fg].

We use the drawbox filter to draw a colored box on the background image to help the waveform stand out. The x and y parameters specify the top-left coordinates of the box, while w and h set its width and height. The color parameter sets the color of the box to black with 80% opacity. The t parameter sets the thickness of the box border; by setting it to the special value fill, we create a solid box.

To complete the filter definition, we use overlay to place the waveform on top of the black box. The overlay filter is configured with format, which automatically sets the pixel format, and x and y, which specify the coordinates of the overlay within the video frame. We use a little math to place the overlay at the center of the video.
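Plugging in the sizes used above (and assuming the background renders at 1280x720, the size the video is scaled to later in the tutorial), the centering math works out like this, which is also why drawbox draws its box at y=285:

```python
W, H = 1280, 720  # main video (background) size, assumed 1280x720
w, h = 1280, 150  # waveform overlay size, as set in showwaves

x = (W - w) // 2  # horizontal center → 0, flush with the left edge
y = (H - h) // 2  # vertical center → 285, matching the drawbox y value

print(x, y)  # → 0 285
```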

The [-map](https://ffmpeg.org/ffmpeg.html#Stream-selection) option is used to select which input streams should be included in or excluded from the output. Here we select the audio stream of the first input (0:a) so it is included in the output video along with the filtered video.

The [-c:v](https://ffmpeg.org/ffmpeg.html#Stream-specifiers-1) option selects the codec used to encode the video stream. We tell FFmpeg to use the libx264 encoder.

The [-preset](https://trac.ffmpeg.org/wiki/Encode/H.264) option selects a collection of options that provide a certain trade-off between encoding speed and compression ratio. We use fast here, but feel free to change it to slower (better quality) or faster (lower quality) if you prefer.

The [-crf](https://trac.ffmpeg.org/wiki/Encode/H.264) option stands for Constant Rate Factor. This rate control mode determines how many bits are used for each frame, which in turn determines both the file size and the quality of the output video. A value of 18 is recommended for visually lossless quality.

The [-c:a](https://ffmpeg.org/ffmpeg.html#Stream-specifiers-1) option selects the codec used to encode the audio stream. We encode the audio with the AAC codec.

The [-shortest](https://ffmpeg.org/ffmpeg.html#toc-Advanced-options) option tells FFmpeg to stop writing the output as soon as the shortest input stream ends.

The ./videos/{video_name} argument at the end of the command specifies the path of our output file.

If you’re curious, here’s what all the FFmpeg waveform modes discussed above do and what they look like.

point draws a point for each sample.

line draws a vertical line for each sample.

p2p draws a point for each sample and a line between them.

cline draws a centered vertical line for each sample. This is the mode we use in this tutorial.

Add the following code under the turn_audio_to_video() function.

def main():
    rec_name = "rec_1.mp3"
    rec_duration = get_rec_duration(rec_name)
    video_with_sound_waves = turn_audio_to_video(rec_name, rec_duration)


main()

In this newly introduced code, we have a function called main(). In it, we store the recording name in a variable named rec_name. You should update this line to include the name of your own recording file.

After that, we call get_rec_duration() to get the recording duration.

We then call the turn_audio_to_video() function with the recording name and duration, and store the returned value in a variable called video_with_sound_waves.

Finally, we call the main() function to run the whole process.

Go back to your terminal and run the following command to generate the video.

python main.py

Look in the videos directory for a file called video_with_sound_waves.mp4. Open it and you should see something similar to the following.

Add video on top of background

In this section, we will add a spinning record video to the bottom left of the generated video. The video we want to add is stored in a file called spinningRecord.mp4 in the videos directory.

Go back to your code editor, open the main.py file, and add the following code under the turn_audio_to_video() function.

def add_spinning_record(video_name, rec_duration):
    video_path = "./videos/{}".format(video_name)
    spinning_record_video_path = "./videos/spinningRecord.mp4"
    new_video_name = "video_with_spinning_record.mp4"

    command = (
        'ffmpeg -y -i {video_path} -stream_loop -1 -i {spinning_record_video_path} '
        '-t {rec_duration} -filter_complex "[1:v]scale=w=200:h=200[fg]; '
        '[0:v]scale=w=1280:h=720[bg]; [bg][fg]overlay=x=25:y=(H-225)" '
        '-c:v libx264 -preset fast -crf 18 -c:a copy '
        './videos/{new_video_name}'
    ).format(
        video_path=video_path,
        spinning_record_video_path=spinning_record_video_path,
        rec_duration=rec_duration,
        new_video_name=new_video_name,
    )

    print(new_video_name)
    run_command(command)
    return new_video_name

Here, we have created a function called add_spinning_record(). This function is responsible for adding the spinningRecord.mp4 video to the video that displays the sound waves. Its arguments are the name of the previously generated video (video_name) and the recording duration (rec_duration).

This function also runs FFmpeg. Here are the details of the command.

ffmpeg -y -i {video_path} -stream_loop -1 -i {spinning_record_video_path} \
-t {rec_duration} -filter_complex "[1:v]scale=w=200:h=200[fg]; \
[0:v]scale=w=1280:h=720[bg]; [bg][fg]overlay=x=25:y=(H-225)" \
-c:v libx264 -preset fast -crf 18 -c:a copy ./videos/{new_video_name}

The command above has the following options.

The -y, -t, -c:v, -preset, and -crf options work the same as in the FFmpeg command that generated the sound wave video.

The -i option has been used before, but in this case we have two input files: the video generated in the previous step and the spinning record video file.

The [-stream_loop](https://ffmpeg.org/ffmpeg.html#Stream-copy) option allows us to set the number of times an input stream is looped. A value of 0 disables looping, while -1 loops infinitely. We set the spinning record video to loop infinitely. On its own this would make FFmpeg encode the output without limit, but since we also specify the duration of the output video, FFmpeg stops encoding when the video reaches that duration.

The -filter_complex option works the same as before, but here it takes two videos as input: the video generated in the previous section, referenced as [0:v], and the spinning record video, referenced as [1:v].

The filter first uses [scale](https://ffmpeg.org/ffmpeg-filters.html#scale) to resize the spinning record video to 200x200 and labels it [fg]. Then, using the scale filter again, we resize the video created in the previous section to 1280x720 and label it [bg]. Finally, we use the [overlay](https://ffmpeg.org/ffmpeg-filters.html#overlay) filter to place the spinning record video on top of the video created in the previous section, at the coordinates x=25 and y=H-225, where H represents the height of the main video.
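With the background scaled to 1280x720, the y=H-225 expression leaves the 200x200 spinning record 25 pixels above the bottom edge, mirroring the 25-pixel left margin set by x=25:

```python
H = 720             # height of the scaled background video
overlay_size = 200  # spinning record scaled to 200x200
margin = 25         # same margin as x=25 on the left

y = H - (overlay_size + margin)
print(y, H - 225)  # → 495 495  (the two expressions agree)
```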

The -c:a option was also introduced in the previous section, but in this case we use the special value copy to tell FFmpeg to copy the audio stream from the source video into the output without re-encoding it.

The last part of the command, ./videos/{new_video_name}, sets the path of the output file.

Replace the code inside the main() function with the following, which adds a call to the add_spinning_record() function.

def main():
    rec_name = "rec_1.mp3"
    rec_duration = get_rec_duration(rec_name)
    video_with_sound_waves = turn_audio_to_video(rec_name, rec_duration)
    add_spinning_record(video_with_sound_waves, rec_duration)

Run the following command on your terminal to generate a video.

python main.py

Look in the videos directory for a file called video_with_spinning_record.mp4. Open it and you should see something similar to the following.

Add text to the video

In this section, we will add a title to the top section of the video. As part of that, we’ll learn how to use FFmpeg to draw text, change color, size, font, and position.

Go back to your code editor, open the main.py file, and add the following code under the add_spinning_record() function.

def add_text_to_video(video_name):
    video_path = "./videos/{}".format(video_name)
    new_video_name = "video_with_text.mp4"
    font_path = "./fonts/LeagueGothic-CondensedRegular.otf"

    command = (
        "ffmpeg -y -i {video_path} -vf \"drawtext=fontfile={font_path}:"
        "text='Turning your Twilio voice recordings into videos':fontcolor=black:"
        "fontsize=90:box=1:boxcolor=white@0.5:boxborderw=5:"
        "x=(W/2)-(tw/2):y=100\" -c:a copy ./videos/{new_video_name}"
    ).format(
        video_path=video_path,
        font_path=font_path,
        new_video_name=new_video_name
    )

    print(new_video_name)
    run_command(command)
    return new_video_name

Here, we create a function called add_text_to_video(), which runs a new FFmpeg command to draw text on the video. Let's take a closer look at the FFmpeg command.

ffmpeg -y -i {video_path} -vf "drawtext=fontfile={font_path}:\
text='Turning your Twilio voice recordings into videos':fontcolor=black:\
fontsize=90:box=1:boxcolor=white@0.5:boxborderw=5:x=(W/2)-(tw/2):y=100" \
-c:a copy ./videos/{new_video_name}

The -y and -c:a options are used exactly as before.

The -i option, which defines the input, now has only one input file, the video file generated in the previous section.

The [-vf](https://ffmpeg.org/ffmpeg.html#Stream-copy) option allows us to create a simple filtergraph and use it to filter the stream. Here we use the [drawtext](https://ffmpeg.org/ffmpeg-filters.html#drawtext-1) filter to draw text at the top of the video. It takes several parameters: fontfile is the font file used to draw the text; text defines the text to draw (you can change it as you like); fontcolor sets the text color to black; fontsize sets the text size to 90; box enables a box around the text; boxcolor sets the color of this box to white with 50% opacity; boxborderw sets the width of the box border; and x and y set the position where the text is drawn in the video. We use a little math to center the text horizontally.
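The x=(W/2)-(tw/2) expression is the usual centering formula rearranged: at draw time FFmpeg substitutes the video width for W and the rendered text width for tw. With a hypothetical text width of 900 pixels on a 1280-pixel-wide video:

```python
W = 1280  # video width
tw = 900  # hypothetical rendered text width (FFmpeg computes the real value)

x = (W / 2) - (tw / 2)
print(x, (W - tw) / 2)  # → 190.0 190.0  (same formula, rearranged)
```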

The final ./videos/{new_video_name} argument sets the output file, just like in the previous FFmpeg commands.

Replace the code in the main() function with the following version, which adds the header step.

def main():
    rec_name = "rec_1.mp3"
    rec_duration = get_rec_duration(rec_name)
    video_with_sound_waves = turn_audio_to_video(rec_name, rec_duration)
    video_with_spinning_record = add_spinning_record(video_with_sound_waves, rec_duration)
    video_with_text = add_text_to_video(video_with_spinning_record)

Go back to your terminal and run the following command to generate the video with the title text.

python main.py

Look in the videos directory for a file called video_with_text.mp4. Open it and you should see something similar to the following.

Conclusion

In this tutorial, we learned how to turn a voice recording into a video that can be shared on social media, using some of the more advanced options of FFmpeg. I hope this encourages you to learn more about FFmpeg.