(Original content of Hornet’s Nest Technology Official account, ID: MFWtech)

If you are familiar with Mafengwo (Hornet’s Nest), you may have noticed that when you tap the publish button on the home page of the Hornet’s Nest App, the types of content you can publish have been simplified to “image-and-text” or “video”.

For a long time, graphic-and-text formats such as travel notes, Q&A, and travel guides have been Hornet’s Nest’s strength. Short video has now been promoted to a position alongside image-and-text content because, for today’s mobile Internet users, short videos offer more authentic and intuitive content, higher information density, and stronger immersion, and have become something users expect. To give travel users a better content interaction experience and to enrich and complete the existing content ecosystem, Hornet’s Nest has stepped up its investment in short video.

Today, a large number of short videos are produced on Hornet’s Nest every day, covering local scenes such as food, restaurant visits, scenic spots, and accommodation experiences. Hornet’s Nest hopes that the platform’s short video content is not only “good to watch” but also “easy to use”. “Easy to use” means not only providing users with useful travel information, but also using technology to make creating short videos easier.

To this end, the video editing function of the Hornet’s Nest Travel App provides two editing modes: “custom editing” and “template creation”. Users can quickly create a polished video with the same effects as a template through “template creation”, or enter “custom editing” mode to exercise their own creativity and produce a personalized video.

This article shares the design and practice of our team’s video editing framework, focusing on the video editing function in the iOS version of the Hornet’s Nest Travel App.

Part.1 Requirements analysis

As mentioned in the introduction, what we need to do is support two video editing modes: “custom editing” and “template creation”.

Figure 1: Product schematic

First, let’s review the functions that need to be provided in “custom editing” mode:

  • Video splicing: splice multiple videos into one video in sequence

  • Picture slideshow: combine multiple pictures into a video

  • Video clipping: delete the content of a certain period of time from a video

  • Video speed change: adjust the video playback speed

  • Background music: add background music that can be mixed with the original video’s audio

  • Video reverse: play a video in reverse order

  • Transition: add transition effects when switching between two video segments

  • Screen editing: screen rotation, canvas partition, setting the background color, and adding filters, stickers, text, and other overlay information

With these features we can meet the needs of “custom editing” mode, allowing users to complete their own creations through our video editing capabilities. But to further lower the bar for video editing and make it easier to create polished videos, we also need to support template creation: that is, to provide users with “template videos”. A user only needs to select videos or pictures to create a video with the same editing effects as the “template video”, realizing “one-click editing”.

After supporting “template creation” mode, the overall flow of our video editing function is as follows:

Figure 2: Complete flowchart

As shown, in addition to the media files, there is another input, template A, which describes what edits to apply to the media files selected by the user. At the same time, the editor’s output includes a template B, which describes the edits the user actually made once editing is complete. This output, template B, solves the problem of where “template videos” come from: a “template video” can be produced by the operations team, or it can be a video a user created in “custom editing” mode that is then used as a template, so that other users who see the published video can quickly create one in the same style.

From the requirement analysis above, we can conclude that our video editing function mainly needs to support two capabilities:

  1. General video editing capability

  2. The capability to describe edits

The division into these two capabilities gives a direction for the design of the video editing framework.

Part.2 Framework design

General video editing capability is the basic capability the video editing framework must provide to support the business’s “custom editing” mode. The capability to describe edits abstracts the general editing capability into a description of “what edits to apply to the video”, and then turns this description model back into concrete editing operations to support the business’s “template creation” mode. So our editing framework can be divided into two main modules:

  • Editing module

  • Description module

Between the two, a conversion module is needed to perform the bidirectional conversion between the editing module and the description module. Below is a schematic diagram of the video editing framework we need:

Figure 3: Schematic diagram of video editing framework

  • The specific functions required by the editing module can be added iteratively along with business requirements; the functions we currently need to support are listed in the figure.

  • The description module needs a description model that can completely describe the media materials and the various editing functions. The model also needs to be saved to a file that can be transferred and distributed, which we call a description file.

  • In addition to the description file, a “template” in “template creation” mode also needs a title, cover image, and other operational information. So an operations-processing function is needed that allows operations colleagues to turn description files into templates.

  • The conversion module is responsible for abstracting concrete editing operations into a description file and for parsing a description file back into concrete editing operations, so ensuring the correctness of this abstraction and parsing is critical.

Video editing modules are well supported on the various development platforms, for example AVFoundation, which is provided natively by iOS, the third-party open source library GPUImage, and the widely used FFmpeg. The concrete implementation can be selected based on the business scenario and project planning. Currently, the scheme we use on iOS is Apple’s native AVFoundation; how to implement our video editing framework with AVFoundation is described below. Next, we look at the design and implementation of the individual modules.

Part.3 Module functions and implementation

3.1 Description Module

3.1.1 Function division

First, we analyzed the specific functions that need to be supported in “custom editing” mode and found that, taking the editing object as the criterion, the editing functions can be divided into two categories: paragraph editing and screen editing.

  • Paragraph editing: a video segment is treated as the editing object; the content of the picture does not matter, and editing happens only at the level of the video segment. It includes the following functions:

Figure 4: Paragraph editing

  • Screen editing: the picture content is treated as the editing object. It includes the following functions:

Figure 5: Screen editing

3.1.2 Video editing description model

With the editing functions divided, we also need a video editing description model to describe “what edits to apply to the video”. We define the following concepts:

  • Timeline: a one-way increasing line consisting of time points, starting at 0

  • Track: a container that uses the timeline as its coordinate system and stores the content material and “screen editing” functions needed at each time point

  1. Tracks have types, and a track supports only one type
  • Paragraph: a part of a track, the portion between two time points on the timeline to which the track belongs
  1. Paragraphs also have types, consistent with the type of the track they belong to

List of track types:

Among them, tracks of the “video”, “picture”, and “audio” types are the tracks that provide picture and sound content. The remaining track types describe exactly which screen editing functions are applied; an effects-type track, for example, can specify several screen editing effects such as rotation and partition.

Combined with the division of editing functions above, we can see that the editing objects of the paragraph editing functions are the paragraphs in the tracks, while the editing objects of the screen editing functions are the content materials stored in the tracks.
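
To make the model more concrete, below is a minimal sketch of what the track and paragraph concepts might look like as code. The class and property names (MFWTrack, MFWSegment, etc.) are hypothetical, for illustration only, and are not the framework’s actual model.

#import <Foundation/Foundation.h>

// Hypothetical description-model classes; times are in seconds on the
// timeline that starts at 0.
typedef NS_ENUM(NSInteger, MFWTrackType) {
    MFWTrackTypeVideo,      // picture content
    MFWTrackTypePhoto,      // picture content
    MFWTrackTypeAudio,      // sound content
    MFWTrackTypeTransition, // screen editing: transitions
    MFWTrackTypeFilter,     // screen editing: filters
    MFWTrackTypeEffects     // screen editing: rotation, partition, ...
};

@interface MFWSegment : NSObject
@property (nonatomic) NSTimeInterval position;            // start point on the timeline
@property (nonatomic) NSTimeInterval duration;            // length of the paragraph
@property (nonatomic, copy, nullable) NSString *subtype;  // e.g. "fade_black" for a transition
@end

@interface MFWTrack : NSObject
@property (nonatomic) MFWTrackType type;                      // a track supports only one type
@property (nonatomic, copy) NSString *name;                   // e.g. "track_1"
@property (nonatomic, copy) NSArray<MFWSegment *> *segments;  // paragraphs share the track's type
@end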

Having defined the three concepts of timeline, track, and paragraph and the two editing behaviors of paragraph editing and screen editing, we can describe the video editing process at an abstract level as follows:

Figure 6: Schematic diagram of video editing description model

As shown in the figure above, this model can already fully describe “what edits to apply to the video”:

  • Create a 60-second video made of video, pictures, and music, corresponding to track 1, track 2, and track 3 respectively, with transitions and filters specified by track 4 and track 5 (the other effects are not described separately; transitions and filters serve as representatives).

  • The video consists of the [0-20] paragraph of track 1, the [15-35] paragraph of track 2, the [30-50] paragraph of track 1, and the [45-60] paragraph of track 2.

  • The whole video [0-60] has background music, specified by track 3.

  • There are transition effects in the three sections [15-20], [30-35], and [45-50], with the transition effects specified by track 4.

  • The [15-35] section has a filter effect, specified by track 5.

3.1.3 Description files and templates

With the video editing description model above, we also need a concrete file to store and distribute the model, namely the description file, which we implement as a JSON file. We also need to provide operations-processing capability so that operations colleagues can add operational information to a description file and generate a template.

  • Description file: a JSON file generated from the video editing description model

An example 🌰:

{
    "tracks": [{
        "type": "video",
        "name": "track_1",
        "duration": 20,
        "segments": [{
            "position": 0,
            "duration": 20
        }, ...]
    }, {
        "type": "photo",
        "name": "track_2",
        "duration": 20,
        "segments": [{
            "position": 15,
            "duration": 20
        }, ...]
    }, {
        "type": "audio",
        "name": "track_3",
        "duration": 60,
        "segments": [{
            "position": 0,
            "duration": 60
        }]
    }, {
        "type": "transition",
        "name": "track_4",
        "duration": 5,
        "segments": [{
            "subtype": "fade_black",
            "position": 15,
            "duration": 5
        }, ...]
    }, {
        "type": "filter",
        "name": "track_5",
        "duration": 20,
        "segments": [{
            "position": 15,
            "duration": 20
        }]
    }, ...]
}
  • Template: a JSON file consisting of a description file plus some business information

An example 🌰:

{
    "title": "Template title",
    "thumbnail": "Cover address",
    "description": "Template introduction",
    "profile": {
        // the description file
        "tracks": [...]
    }
}

Through the video editing description model, description file, and template defined above, combined with the converter, we can generate a description file that records the user’s final editing behavior from the user’s “custom editing” operations. Conversely, we can also parse a description file and edit the material selected by the user according to it, quickly generating a video “in the same style” as the editing behavior recorded in the description file.

3.2 Editing Module

3.2.1 AVFoundation introduction

Audio and video editing with AVFoundation is divided into four steps: material composition, audio mixing, video rendering, and video export.

(1) Material mixer AVMutableComposition

AVMutableComposition is a collection of one or more AVCompositionTracks. Each track stores the file information of source media such as audio and video along the timeline.

// AVMutableComposition API for creating a new AVMutableCompositionTrack
- (nullable AVMutableCompositionTrack *)addMutableTrackWithMediaType:(AVMediaType)mediaType
                                                    preferredTrackID:(CMPersistentTrackID)preferredTrackID;

Each track consists of a series of track segments. Each track segment stores part of the media data of a source file, such as its URL, track identifier, and time mapping.

// Some properties of AVMutableCompositionTrack

/* provides a reference to the AVAsset of which the AVAssetTrack is a part */
AVAsset *asset;

/* indicates the persistent unique identifier for this track of the asset */
CMPersistentTrackID trackID;

NSArray<AVCompositionTrackSegment *> *segments;

Here, the URL identifies the source file (container), the track ID specifies which track of the source file to use, and the time mapping specifies the time range of the source track as well as its time range on the composition track.

// AVCompositionTrackSegment time mapping
CMTimeMapping timeMapping;

// CMTimeMapping definition
typedef struct {
    CMTimeRange source; // eg, media. source.start is kCMTimeInvalid for empty edits.
    CMTimeRange target; // eg, track.
} CMTimeMapping;

Figure 7: AVMutableComposition compositing a new video

(Source: Official Apple developer documentation)
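
As a minimal sketch (not the framework’s actual code), splicing two clips end to end with AVMutableComposition looks roughly like this; firstAsset and secondAsset are assumed to be AVAssets loaded elsewhere, and error handling is omitted:

#import <AVFoundation/AVFoundation.h>

AVMutableComposition *composition = [AVMutableComposition composition];
AVMutableCompositionTrack *videoTrack =
    [composition addMutableTrackWithMediaType:AVMediaTypeVideo
                             preferredTrackID:kCMPersistentTrackID_Invalid];

// Append the whole video track of each source asset in sequence.
CMTime cursor = kCMTimeZero;
for (AVAsset *asset in @[firstAsset, secondAsset]) {
    AVAssetTrack *sourceTrack = [[asset tracksWithMediaType:AVMediaTypeVideo] firstObject];
    [videoTrack insertTimeRange:CMTimeRangeMake(kCMTimeZero, asset.duration)
                        ofTrack:sourceTrack
                         atTime:cursor
                          error:nil];
    cursor = CMTimeAdd(cursor, asset.duration);
}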

(2) Audio mix AVMutableAudioMix

Through AVMutableAudioMixInputParameters, AVMutableAudioMix can specify the volume of any track over any time period.

// AVMutableAudioMixInputParameters related API
CMPersistentTrackID trackID;

- (void)setVolumeRampFromStartVolume:(float)startVolume
                         toEndVolume:(float)endVolume
                           timeRange:(CMTimeRange)timeRange;

Figure 8: Audio mixing diagram

(Source: Official Apple developer documentation)
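
For example, a minimal sketch of fading background music in over its first two seconds; musicTrack is assumed to be the composition track that carries the music:

AVMutableAudioMixInputParameters *params =
    [AVMutableAudioMixInputParameters audioMixInputParametersWithTrack:musicTrack];

// Ramp the music from silent to full volume over the first 2 seconds.
[params setVolumeRampFromStartVolume:0.0
                         toEndVolume:1.0
                           timeRange:CMTimeRangeMake(kCMTimeZero, CMTimeMake(2, 1))];

AVMutableAudioMix *audioMix = [AVMutableAudioMix audioMix];
audioMix.inputParameters = @[params];
// audioMix is later assigned to AVPlayerItem.audioMix (preview)
// or AVAssetExportSession.audioMix (export).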

(3) Video rendering AVMutableVideoComposition

We can also use AVMutableVideoComposition to work directly with the video tracks of the composition. When processing a video composition, you can specify its render size, scale, frame rate, and other parameters and output the final video file. Through video composition instructions (AVMutableVideoCompositionInstruction, etc.), we can change the background color of the video and apply layer instructions.

These layer instructions (AVMutableVideoCompositionLayerInstruction, etc.) can apply transforms and transform ramps as well as opacity and opacity ramps to the video tracks within the composition. In addition, you can apply animations from the Core Animation framework by setting the animationTool property of the video composition.

Figure 9: AVMutableVideoComposition video processing

(Source: Official Apple developer documentation)
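
A minimal sketch of a video composition with one instruction and one layer instruction that fades the video track out over its last second; composition and videoTrack come from the earlier composition step, and the render parameters are illustrative:

AVMutableVideoComposition *videoComposition = [AVMutableVideoComposition videoComposition];
videoComposition.renderSize = CGSizeMake(720, 1280);
videoComposition.frameDuration = CMTimeMake(1, 30); // 30 fps

// One instruction covering the whole composition.
AVMutableVideoCompositionInstruction *instruction =
    [AVMutableVideoCompositionInstruction videoCompositionInstruction];
instruction.timeRange = CMTimeRangeMake(kCMTimeZero, composition.duration);

// A layer instruction that fades the track's opacity from 1 to 0
// over the final second.
AVMutableVideoCompositionLayerInstruction *layerInstruction =
    [AVMutableVideoCompositionLayerInstruction videoCompositionLayerInstructionWithAssetTrack:videoTrack];
CMTime fadeStart = CMTimeSubtract(composition.duration, CMTimeMake(1, 1));
[layerInstruction setOpacityRampFromStartOpacity:1.0
                                    toEndOpacity:0.0
                                       timeRange:CMTimeRangeMake(fadeStart, CMTimeMake(1, 1))];

instruction.layerInstructions = @[layerInstruction];
videoComposition.instructions = @[instruction];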

(4) Export AVAssetExportSession

The export step is relatively simple: just assign the objects produced by the previous steps to an export session object to export the final product.

Figure 10: Export process

(Source: Official Apple developer documentation)
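
A minimal sketch of this step; outputURL is assumed, and composition, videoComposition, and audioMix are the objects built in the previous steps:

AVAssetExportSession *exportSession =
    [[AVAssetExportSession alloc] initWithAsset:composition
                                     presetName:AVAssetExportPresetHighestQuality];
exportSession.outputURL = outputURL;
exportSession.outputFileType = AVFileTypeMPEG4;
exportSession.videoComposition = videoComposition; // rendering instructions
exportSession.audioMix = audioMix;                 // volume ramps

[exportSession exportAsynchronouslyWithCompletionHandler:^{
    if (exportSession.status == AVAssetExportSessionStatusCompleted) {
        // The final product has been written to outputURL.
    }
}];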

3.2.2 Implementation of editing module

Building on the AVFoundation framework, we implemented the following roles in the video editing module:

  • Track: there are two types, video and audio, which store frame pictures and sound respectively.
  1. On top of the video-type track we extend a picture-type track: a video track is generated from an empty video file, and the selected pictures are fed to the mixer as frame images.

  2. Additional track: AVFoundation provides AVVideoCompositionCoreAnimationTool, which makes it convenient to apply content from the Core Animation framework to video frames. So for the text function, we build a series of preview views with UIKit on the preview side and export them to this tool as CALayers (see the sketch after this list).

  • Paragraph: a period of time in a track, which is the object of paragraph editing.

  • Instruction: associated with specified video paragraphs; it processes the picture and draws each frame.

  1. A single instruction can be associated with multiple video tracks and obtain frames from those video tracks within a specified period of time as the objects of screen editing.

  2. The concrete implementation of screen editing inside instructions uses the CoreImage framework. CoreImage itself provides some built-in real-time image processing capabilities; effects that CoreImage does not support are implemented with custom CIKernels.

  • Audio mixer: for adding music, we use AVMutableAudioMix.

  • Video mixer: the final video file we want usually contains one video track and one audio track. The mixer converts the input media resources into tracks and, driven either by user operations or by the converted description model, edits video paragraphs, assembles instructions for screen editing, and mixes the audio track; combined with AVPlayerItem and AVAssetExportSession, it provides real-time preview and final composition.
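
As referenced in the “additional track” item above, here is a minimal sketch of exporting UIKit/Core Animation content (a caption, for example) through AVVideoCompositionCoreAnimationTool; videoComposition and renderSize are assumed to exist already, and this tool takes effect at export time:

#import <QuartzCore/QuartzCore.h>

// Layer that receives the rendered video frames.
CALayer *videoLayer = [CALayer layer];
videoLayer.frame = CGRectMake(0, 0, renderSize.width, renderSize.height);

// Parent layer: video at the bottom, captions/stickers on top.
CALayer *parentLayer = [CALayer layer];
parentLayer.frame = videoLayer.frame;
[parentLayer addSublayer:videoLayer];

CATextLayer *captionLayer = [CATextLayer layer];
captionLayer.string = @"Sample caption";
captionLayer.frame = CGRectMake(0, 40, renderSize.width, 60);
[parentLayer addSublayer:captionLayer];

// Hand the layer tree to AVFoundation; it is composited onto every frame.
videoComposition.animationTool =
    [AVVideoCompositionCoreAnimationTool videoCompositionCoreAnimationToolWithPostProcessingAsVideoLayer:videoLayer
                                                                                                  inLayer:parentLayer];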

With the above roles, the video editing module on iOS is implemented as follows:

Figure 11: Schematic diagram of video editing module

As shown in the figure above, the mixer contains two video tracks and one audio track. Generally speaking, each input video or image file would generate a corresponding video track, so in theory there should be many video tracks in the mixer. The mixer in our figure maintains only two video tracks and one audio track: first, to deal with the limited number of video decoders, which is described in detail later; second, to support the transition function.

The instruction sequence consists of a number of instructions ordered in time; each instruction consists of a time range, the tracks that supply its frames, and the screen editing effects. The paragraph editing functions of the video editor correspond to splicing instruction segments, while the screen editing functions are the processing each instruction segment applies to its frames. The mixer provides a preview function to show the user the editing changes in real time; once the editing effects are finalized, the final video file is synthesized through the compositing function provided by the mixer.
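
A hedged sketch of how the two-video-track structure enables transitions: adjacent clips are placed on alternating tracks so that both tracks supply frames during the overlap, and the instruction for that range can blend them. Here videoTrackA/videoTrackB are the mixer’s two composition tracks, clipA/clipB are source AVAssetTracks, and the durations are illustrative:

CMTime transitionDuration = CMTimeMake(1, 1); // 1-second overlap, illustrative

// Clip A occupies track A starting at 0.
[videoTrackA insertTimeRange:CMTimeRangeMake(kCMTimeZero, clipADuration)
                     ofTrack:clipA
                      atTime:kCMTimeZero
                       error:nil];

// Clip B goes on track B and starts before clip A ends, so both tracks
// provide frames during the overlap range.
CMTime clipBStart = CMTimeSubtract(clipADuration, transitionDuration);
[videoTrackB insertTimeRange:CMTimeRangeMake(kCMTimeZero, clipBDuration)
                     ofTrack:clipB
                      atTime:clipBStart
                       error:nil];

// The instruction covering [clipBStart, clipADuration] references both
// tracks and draws the transition; later clips keep alternating between
// track A and track B in the same way.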

3.2.3 Converter

With the video editing module implemented, we can already support “custom editing” mode. Finally, the converter connects the description model and the editing module to complete support for “template creation” mode. The converter’s implementation is relatively simple: it parses the JSON description file into a data model, and the mixer builds its own internal track model from the materials selected by the user and the description model, and splices the instruction segments.

In the other direction, when editing finishes and the result is exported, the mixer assembles its internal track model and instruction information into a data model and generates a description file in JSON format.
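
A minimal sketch of the parsing direction, reading the JSON description file into the hypothetical model classes sketched in 3.1.2 (MFWTrack/MFWSegment; descriptionFileURL is assumed):

NSData *data = [NSData dataWithContentsOfURL:descriptionFileURL];
NSDictionary *json = [NSJSONSerialization JSONObjectWithData:data options:0 error:nil];

NSMutableArray<MFWTrack *> *tracks = [NSMutableArray array];
for (NSDictionary *trackDict in json[@"tracks"]) {
    MFWTrack *track = [MFWTrack new];
    track.name = trackDict[@"name"];
    // Mapping the JSON "type" string to MFWTrackType is omitted for brevity.

    NSMutableArray<MFWSegment *> *segments = [NSMutableArray array];
    for (NSDictionary *segmentDict in trackDict[@"segments"]) {
        MFWSegment *segment = [MFWSegment new];
        segment.position = [segmentDict[@"position"] doubleValue];
        segment.duration = [segmentDict[@"duration"] doubleValue];
        segment.subtype  = segmentDict[@"subtype"];
        [segments addObject:segment];
    }
    track.segments = segments;
    [tracks addObject:track];
}
// The mixer then builds its internal composition tracks and instruction
// segments from `tracks` plus the media files selected by the user.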

Figure 12: Conversions between the description model and the editing module

Part.4 Recent optimization direction

4.1 Pitfalls

In the process of implementing the editing framework above, we encountered many problems, most of which were caused by unclear error information in AVFoundation and were time-consuming to localize. In summary, most of them came down to track timeline alignment. Besides the timeline alignment problem, we have summarized a few implementation considerations to share, so you can avoid the same pits.

(1) Limit on the number of mixer tracks

  • Problem: AVMutableComposition can hold many tracks at the same time, that is, multiple video tracks can exist in one composition, and the result can be previewed normally through AVPlayer. So the editing module we first implemented supported multiple video tracks in the mixer, as shown in the figure below. This multi-track structure previews without problems, but exporting fails with a “cannot decode” error. The mixer structure before the conversion:

Figure 13: Mixer structure diagram before conversion

  • Cause: After verification, we found that Apple devices limit the number of decoders available for video playback. When exporting a video, each video track uses one decoder, so if the number of video tracks exceeds the decoder limit, the video cannot be exported.

  • Solution: Track model conversion: convert the multi-track structure of the original mixer into the current two-track structure, so that the decoder limit is not exceeded at export time.

(2) Performance optimization: implementation of the reverse playback function

  • Problem: The initial implementation exported a new video file whose frame sequence was the reverse of the original video file’s. If the original file is large and the user clips only part of it, the entire original file still has to be processed in reverse order when the video is reversed, which is time-consuming.

  • Solution: Fetch the frame of the video file corresponding to a given time point, and only map normal time points to reversed time points, without touching the video file itself. It is like reversing an array by manipulating only the indices instead of reordering the elements.
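
A sketch of that index-style mapping (times in seconds; clipEnd is the end of the clipped range within the original file, and the function name is illustrative):

// Map a playback time in the reversed clip back to a time in the original
// (forward) video file, without rewriting the file.
// playbackTime runs from 0 to (clipEnd - clipStart); e.g. for a clip of
// [10, 25] s, playbackTime 0 -> source 25 s, playbackTime 15 -> source 10 s.
static NSTimeInterval MFWReversedSourceTime(NSTimeInterval playbackTime,
                                            NSTimeInterval clipEnd) {
    return clipEnd - playbackTime;
}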

(3) Performance optimization: reduce memory usage

  • Problem: The preview used images at their original size, which consumed a lot of memory; after several HD images were added, a memory warning occurred during preview.

  • Solution: Without affecting the user experience, use a low-resolution image for preview and use the original image only when exporting.
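
A minimal sketch of generating a downscaled preview image with ImageIO while leaving the original file untouched for export; imageURL and the 720-pixel limit are illustrative:

#import <ImageIO/ImageIO.h>

CGImageSourceRef source = CGImageSourceCreateWithURL((__bridge CFURLRef)imageURL, NULL);
NSDictionary *options = @{
    (id)kCGImageSourceCreateThumbnailFromImageAlways: @YES,
    (id)kCGImageSourceCreateThumbnailWithTransform:   @YES,  // respect EXIF orientation
    (id)kCGImageSourceThumbnailMaxPixelSize:          @(720) // preview-sized, not the original
};
// Caller releases previewImage with CGImageRelease when done.
CGImageRef previewImage =
    CGImageSourceCreateThumbnailAtIndex(source, 0, (__bridge CFDictionaryRef)options);
CFRelease(source);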

4.2 Short-term Planning

At present, this video editing framework works well in the iOS version of the Hornet’s Nest Travel App. It supports continuous business iteration and allows more screen editing functions to be added quickly. Of course, there are still details to be optimized and improved.

In the near future, we will combine machine learning and AR technology to explore some interesting video editing scenarios and provide users with more personalized travel recording tools.

Authors: Li Xu and Zhao Chengfeng, iOS R&D engineers at Hornet’s Nest Content Center.