Preface

When it comes to short video editing, we may immediately think of FFmpeg and OpenGL ES, which are daunting, sprawling frameworks. It is true that getting started with audio and video development requires a certain foundation, but we can still do a lot in the field of short video editing relying only on the AVFoundation framework provided by Apple. This series of articles focuses on the role of AVFoundation in short video editing on the iOS platform. This article mainly introduces the basic modules, related data types, and precautions needed to learn the AVFoundation framework.

I. Overview of the AVFoundation framework

AVFoundation is a fully functional framework for processing multimedia data on iOS, macOS, watchOS, and tvOS. Using AVFoundation, we can play, create, and edit QuickTime Movie and MPEG-4 files, play HLS streams, and build powerful media editing capabilities into our applications.

1.1 iOS multimedia framework system

Let's start by looking at AVFoundation's place in Apple's multimedia framework stack. In the iOS multimedia system, the high-level AVKit framework provides AVPlayerViewController, a highly encapsulated player controller; AVRoutePickerView for switching playback routes (screen casting); and AVPictureInPictureController for Picture in Picture playback. The low-level frameworks mainly expose C interfaces, where:

  • Core Audio is the lowest-level audio processing interface. It drives the device's audio hardware directly and provides comprehensive support for music games and professional audio editing software. Audio Unit provides interfaces for synthesizing instrument sounds, echo cancellation, mixing, and sound balancing.

Audio Unit has been migrated to the Audio Toolbox Framework.

  • Core Video provides image buffer (CVPixelBuffer) and buffer pool (CVPixelBufferPool) support for its counterpart Core Media, offers a frame-by-frame access interface to digital video, and supports Metal (CVMetalTexture), OpenGL (CVOpenGLTexture), and OpenGL ES (CVOpenGLESTexture).
  • Core Media defines and encapsulates the media processing pipeline (including time information) required by higher-level media frameworks such as AVFoundation, along with the interfaces and data types used in it (CMSampleBuffer, CMTime). Core Media layer interfaces and data types can be used to efficiently process media sample data and manage sample data queues (CMSimpleQueue, CMBufferQueue).
  • Core Animation is the animation framework on iOS. AVFoundation integrates with Core Animation to let developers add animation and sticker effects during video editing and playback.

AVFoundation, on the other hand, sits between the high-level and low-level frameworks. It encapsulates functionality that previously could only be achieved with the low-level frameworks and exposes Objective-C and Swift interfaces. Meanwhile, Apple continuously optimizes the performance of AVFoundation, the middle-layer framework, as it iterates, and keeps it well supported on new devices and video formats.

1.2 Introduction to each module of AVFoundation

The AVFoundation framework combines six major technical areas that cover the main functions of recording, processing, compositing, controlling, importing, and exporting audiovisual media on Apple platforms. The official API documentation divides AVFoundation into six functional modules: Assets, Playback, Capture, Editing, Audio, and Speech.

  • Assets: allows you to load, inspect, and export media resources and metadata; use AVAssetReader and AVAssetWriter for sample-level reading and writing of media data; use AVAssetImageGenerator to get video thumbnails; use AVCaption for subtitle creation (macOS); and so on.
  • Playback: provides playback and playback-control functionality for AVAsset. You can use AVPlayer to play a single item, AVQueuePlayer to play multiple items, and AVSynchronizedLayer to combine Core Animation by synchronizing an animation layer with the playback view layer, achieving effects such as stickers and text during playback.
  • Capture: used to take photos and record audio and video. You can configure a built-in camera, microphone, or external recording device to build a customized camera feature, control the output format of photos and video, or directly modify the audio and video data streams as customized output.
  • Editing: used to combine, edit, and remix audio and video tracks from multiple sources into an AVMutableComposition; AVAudioMix and AVVideoComposition can be used to control the details of audio mixing and video composition, respectively.
  • Audio: play, record, and process audio, and configure the application's system audio behavior. In iOS 14.5 Apple introduced a separate AVFAudio framework whose content is exactly the same as this module, so audio will probably be treated separately in the future.
  • Speech: converts text to speech for reading aloud.

In short video editing, we are always dealing with AVAsset or its subclasses, whether it is the material used in editing, the semi-finished product being processed, or the final exported result. The Assets module where AVAsset lives is the basis of AVFoundation media processing and the first thing to learn.

II. Basic module: Assets

2.1 AVAsset & AVAssetTrack

AVAsset is an abstract, immutable class that defines how media resources are modeled, representing the static aspects of a media resource as a whole. It provides an abstraction of the underlying media format, which means that whether you are dealing with a QuickTime movie or MP3 audio, the only thing facing developers and the rest of the framework is the concept of a resource.

AVAsset is usually instantiated through its subclass AVURLAsset, where the URL parameter can point to a remote or local file or even streaming media, so we don't have to focus on the source, just on the AVAsset itself.

Both methods in the following code example actually return AVURLAsset instances. The options parameter in the second method can be used to customize the initialization of an AVURLAsset for specific requirements. For example, when creating an AVURLAsset from an HLS stream, passing {AVURLAssetAllowsCellularAccessKey: @NO} as the options prevents the media data from being retrieved over a cellular network. Closely related to video editing is AVURLAssetPreferPreciseDurationAndTimingKey, which indicates whether the resource should provide an accurate duration and precise random access by time. Video editing requires accurate values, so @YES is recommended; however, this precision may require additional parsing, resulting in longer load times.

Many container formats, such as QuickTime Movie and MPEG-4 files, provide enough summary information for accurate timing and do not require additional parsing to prepare.

// Method 1: convenience constructor, actually returns an AVURLAsset instance
AVAsset *asset = [AVAsset assetWithURL:url];
// Method 2: create an AVURLAsset with options
AVURLAsset *urlAsset = [AVURLAsset URLAssetWithURL:url options:@{AVURLAssetPreferPreciseDurationAndTimingKey: @YES}];
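
As a sketch of the options mentioned above, an AVURLAsset for an HLS stream that must not be fetched over a cellular network could be created like this (the URL here is hypothetical):

// Hypothetical HLS URL, for illustration only
NSURL *hlsURL = [NSURL URLWithString:@"https://example.com/stream/index.m3u8"];
NSDictionary *options = @{AVURLAssetAllowsCellularAccessKey: @NO,
                          AVURLAssetPreferPreciseDurationAndTimingKey: @YES};
AVURLAsset *hlsAsset = [AVURLAsset URLAssetWithURL:hlsURL options:options];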

An AVAsset instance is a container for one or more AVAssetTrack instances, each of which models a uniformly typed "track" of media. A simple video file usually contains an audio track and a video track, and may also contain supplementary content such as closed captions, subtitles, or metadata (AVMetadataItem) that describes the media content.

Closed Caption, CC for short. Most CC captions read like a script: in addition to dialogue, they describe the sounds and music occurring in the scene, mainly for the hearing-impaired. "Closed" indicates that the captions can be switched off, as opposed to Open Captions, which are always displayed. If the caption text is in the same language as the dialogue it is called a "Caption"; if it is in a different language it is called a "Subtitle".

Creating an AVAsset is a lightweight operation because the underlying media data of an AVAsset is lazily loaded and not loaded until it is accessed, and its properties are fetched synchronously. Accessing a property without asynchronously loading it beforehand may block the calling thread, depending on the size and location of the media data to be accessed. To avoid blocking, it is best to load properties asynchronously before using them. AVAsset and AVAssetTrack conform to the AVAsynchronousKeyValueLoading protocol, which makes it possible to load properties asynchronously and query their loading status.

@protocol AVAsynchronousKeyValueLoading
// Asynchronously load the properties whose names are included in the keys array; in the handler,
// use statusOfValueForKey:error: to determine whether loading has completed.
- (void)loadValuesAsynchronouslyForKeys:(NSArray<NSString *> *)keys completionHandler:(nullable void (^)(void))handler;
// Obtain the loading status of a key's property; it is loaded when status is AVKeyValueStatusLoaded.
- (AVKeyValueStatus)statusOfValueForKey:(NSString *)key error:(NSError * _Nullable * _Nullable)outError;
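
A minimal usage sketch of this protocol, assuming the asset created above — load the duration and tracks before reading them:

// Load properties asynchronously, then check their status before use
[asset loadValuesAsynchronouslyForKeys:@[@"duration", @"tracks"] completionHandler:^{
    NSError *error = nil;
    AVKeyValueStatus status = [asset statusOfValueForKey:@"duration" error:&error];
    if (status == AVKeyValueStatusLoaded) {
        CMTimeShow(asset.duration); // safe to read synchronously now
    } else {
        NSLog(@"Failed to load duration: %@", error);
    }
}];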

WWDC 2021's "What's New in AVFoundation" mentions that the introduction of async/await in the Swift API allows us to write asynchronous code with control flow similar to synchronous programming.

let asset = AVAsset(url: assetURL)
let duration = try await asset.load(.duration)
let (duration, tracks) = try await asset.load(.duration, .tracks)

AVAsset attributes:

The tracks property returns an array of all the AVAssetTrack instances contained in an AVAsset instance. Apple also provides the following methods for retrieving a subset of tracks based on specific criteria such as track ID, media type, or media characteristic. This is also a common way to retrieve a certain type of track in the Editing module.

// Retrieve a track by track ID
- (void)loadTrackWithTrackID:(CMPersistentTrackID)trackID completionHandler:(void (^)(AVAssetTrack * _Nullable_result, NSError * _Nullable))completionHandler;
// Retrieve a subset of tracks by media type
- (void)loadTracksWithMediaType:(AVMediaType)mediaType completionHandler:(void (^)(NSArray<AVAssetTrack *> * _Nullable, NSError * _Nullable))completionHandler;
// Retrieve a subset of tracks by media characteristic
- (void)loadTracksWithMediaCharacteristic:(AVMediaCharacteristic)mediaCharacteristic completionHandler:(void (^)(NSArray<AVAssetTrack *> * _Nullable, NSError * _Nullable))completionHandler;

Commonly used AVMediaType values include audio (AVMediaTypeAudio), video (AVMediaTypeVideo), subtitles (AVMediaTypeSubtitle), metadata (AVMediaTypeMetadata), and so on.

AVMediaCharacteristic is used to describe characteristics of the media data, such as whether a track contains HDR video (AVMediaCharacteristicContainsHDRVideo) or audible content (AVMediaCharacteristicAudible), and so on.
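
As a sketch of retrieving tracks by media type with the newer asynchronous API declared above (assuming an existing asset):

[asset loadTracksWithMediaType:AVMediaTypeVideo completionHandler:^(NSArray<AVAssetTrack *> * _Nullable tracks, NSError * _Nullable error) {
    if (error != nil || tracks.count == 0) {
        NSLog(@"Failed to load video tracks: %@", error);
        return;
    }
    AVAssetTrack *videoTrack = tracks.firstObject;
    NSLog(@"Video track natural size: %@", NSStringFromCGSize(videoTrack.naturalSize));
}];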

2.2 Metadata

Media container formats store descriptive metadata about their media, and each container format has its own unique metadata format. AVFoundation simplifies metadata handling with its AVMetadataItem class. In its most basic form, an instance of AVMetadataItem is a key-value pair representing a single metadata value, such as a movie title or album artwork.

To use AVMetadataItem effectively, we need to understand how AVFoundation organizes metadata. To simplify finding and filtering metadata items, the framework groups related metadata into key spaces:

  • Format-specific key spaces. The AVFoundation framework defines several format-specific key spaces that roughly correspond to particular container or file formats, such as QuickTime (QuickTime metadata and user data) or MP3 (ID3). A single resource may contain metadata values spanning multiple key spaces. To retrieve the complete collection of format-specific metadata for a resource, use the metadata property.
  • Common key space. Several common metadata values, such as a movie's creation date or description, can exist in multiple key spaces. To help normalize access to this common metadata, the framework provides a common key space that gives access to a limited set of metadata values shared by several key spaces. To retrieve the collection of common metadata for a resource, use the commonMetadata property directly.

In addition, we can determine which metadata formats a resource contains through AVAsset's availableMetadataFormats property, which returns an array of string identifiers for each metadata format it contains. Then use the metadataForFormat: method, passing the appropriate format identifier, to retrieve the format-specific metadata values.
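
A small sketch of pulling metadata out of an asset with these APIs (assuming the asset's metadata-related properties have already been loaded):

// Enumerate every available metadata format and dump its items
for (AVMetadataFormat format in asset.availableMetadataFormats) {
    NSArray<AVMetadataItem *> *items = [asset metadataForFormat:format];
    for (AVMetadataItem *item in items) {
        NSLog(@"%@ = %@", item.identifier, item.value);
    }
}
// Metadata shared across key spaces via the common key space
NSArray<AVMetadataItem *> *commonItems = asset.commonMetadata;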

Metadata of an HDR video file shot with an iPhone 13 Pro:

creationDate: 2022-03-01T18:16:17+0800
location: +39.9950+116.4749+044.903/
make: Apple
model: iPhone 13 Pro
software: 15.3.1

Although the purpose of this series is not to focus on audio and video codec formats, suppose you have a video file (.mov) and want to know the encoding type (H.264 / HEVC) and transfer function (ITU_R_709_2 / ITU_R_2100_HLG) of the video samples, or format information of the audio samples such as sample rate, channel count, and bit depth — where should we start? We introduced earlier that audio and video data are modeled separately as tracks inside an AVAsset. To obtain the video sample format information, we only need to retrieve the corresponding video track by media type and read the formatDescriptions property of the AVAssetTrack, which returns a collection of CMVideoFormatDescription objects describing the video sample format; likewise, CMAudioFormatDescription and CMClosedCaptionFormatDescription describe the sample data formats of their respective tracks.

// Get the sample format information of the video track
AVAssetTrack *videoTrack = [[asset tracksWithMediaType:AVMediaTypeVideo] firstObject];
NSArray *videoFormats = videoTrack.formatDescriptions;

The description of the video track format in the video file is as follows:

"<CMVideoFormatDescription 0x2834305a0 [0x1dbce41b8]> { mediaType:'vide' mediaSubType:'hvc1' mediaSpecific: { codecType: 'hvc1' dimensions: 1920 x 1080 } extensions: {{ AmbientViewingEnvironment = {length = 8, bytes = 0x002fe9a03d134042}; BitsPerComponent = 10; CVFieldCount = 1; CVImageBufferChromaLocationBottomField = Left; CVImageBufferChromaLocationTopField = Left; CVImageBufferColorPrimaries = \"ITU_R_2020\"; CVImageBufferTransferFunction = \"ITU_R_2100_HLG\"; CVImageBufferYCbCrMatrix = \"ITU_R_2020\"; Depth = 24; FormatName = HEVC; FullRangeVideo = 0; RevisionLevel = 0; SampleDescriptionExtensionAtoms = { dvvC = { length = 24, bytes = 0x010010254000000000000000000000000000000000000000 }; hvcC = { length = 125, bytes = 0x01022000 0000b000 00000000 78f000fc ... 2fe9a03d 13404280 }; }; SpatialQuality = 512; TemporalQuality = 512; VerbatimSampleDescription = { length = 289, bytes = 0x00000121 68766331 00000000 00000001 ... 3d134042 00000000 }; Version = 0; }}}"Copy the code

2.3 Video Preview

In short video editing, playing the video resource before export is called previewing. This playback functionality actually belongs to the Playback module of AVFoundation, but this series is not focused on the player, so let's just take a quick look at AVPlayer, the class used to preview an AVAsset.

Initializing an AVPlayer requires an AVPlayerItem object, which manages the resource and provides the data source for playback. If you only use AVPlayer to play a video, you get sound but no picture; to display the picture we also need an AVPlayerLayer.

// 1. Create an AVAsset
AVAsset *asset = [AVAsset assetWithURL:url];
// 2. Create an AVPlayerItem
AVPlayerItem *item = [[AVPlayerItem alloc] initWithAsset:asset];
// 3. Create an AVPlayer
AVPlayer *player = [AVPlayer playerWithPlayerItem:item];
// 4. Create an AVPlayerLayer to display the video
AVPlayerLayer *playerLayer = [AVPlayerLayer playerLayerWithPlayer:player];
// 5. Add the AVPlayerLayer to the view's layer
[self.view.layer addSublayer:playerLayer];
// 6. Play
[player play];

2.4 Obtaining video thumbnails

Before exporting the video, there is usually a feature for selecting the video cover, which requires a list of video thumbnails. To obtain video thumbnails, we need AVAssetImageGenerator.

+ (instancetype)assetImageGeneratorWithAsset:(AVAsset *)asset;
- (instancetype)initWithAsset:(AVAsset *)asset;

If you need frame-accurate capture, you can set the time tolerances to kCMTimeZero as shown below, and the maximumSize property can be set to specify the dimensions of the generated images.

// Capture thumbnails at precise times
imageGenerator.requestedTimeToleranceBefore = kCMTimeZero;
imageGenerator.requestedTimeToleranceAfter = kCMTimeZero;
// Limit the maximum size of the generated thumbnails
imageGenerator.maximumSize = CGSizeMake(100, 100);

Then call the following methods to get a thumbnail at a single moment, or at multiple moments:

// Get a thumbnail at a single moment; actualTime returns the real time of the generated image
- (nullable CGImageRef)copyCGImageAtTime:(CMTime)requestedTime actualTime:(nullable CMTime *)actualTime error:(NSError * _Nullable * _Nullable)outError;
// Get thumbnails at multiple moments; the handler is called once for each generated image
- (void)generateCGImagesAsynchronouslyForTimes:(NSArray<NSValue *> *)requestedTimes completionHandler:(AVAssetImageGeneratorCompletionHandler)handler;
// The handler type definition for the method above; actualTime is the real time of the thumbnail
typedef void (^AVAssetImageGeneratorCompletionHandler)(CMTime requestedTime, CGImageRef _Nullable image, CMTime actualTime, AVAssetImageGeneratorResult result, NSError * _Nullable error);
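
A usage sketch of the asynchronous variant — generating thumbnails at one-second intervals for a cover-selection list (the interval and count here are arbitrary):

// Build an array of CMTimes wrapped in NSValues, one per second, using a 600 timescale
NSMutableArray<NSValue *> *times = [NSMutableArray array];
for (int i = 0; i < 5; i++) {
    [times addObject:[NSValue valueWithCMTime:CMTimeMake(i * 600, 600)]];
}
[imageGenerator generateCGImagesAsynchronouslyForTimes:times completionHandler:^(CMTime requestedTime, CGImageRef _Nullable image, CMTime actualTime, AVAssetImageGeneratorResult result, NSError * _Nullable error) {
    if (result == AVAssetImageGeneratorSucceeded && image != NULL) {
        UIImage *thumbnail = [UIImage imageWithCGImage:image];
        // Deliver the thumbnail to the UI on the main thread
    }
}];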

III. Common data types

3.1 CMTime & CMTimeRange

Getting a video thumbnail requires passing in a CMTime that represents a moment in time. When we think of time we usually think of NSTimeInterval (a double), which AVFoundation does use in AVAudioPlayer and AVAudioRecorder for handling time, but floating-point imprecision (simple rounding can result in frame loss) makes it unsuitable for more advanced time-based media development. So Core Media provides a structure to represent time:

typedef struct
{
   CMTimeValue    value;        
   CMTimeScale    timescale;
   CMTimeFlags    flags;
   CMTimeEpoch    epoch;
} CMTime;

CMTimeFlags is a bit mask indicating the state of the time, such as whether the value is valid, and CMTimeEpoch represents the epoch, usually 0. We focus on CMTimeValue and CMTimeScale: a CMTime represents time = value / timescale, where timescale indicates how many parts one second is divided into and value indicates how many of those parts the time contains. To accommodate the most commonly used video frame rates of 24 FPS, 25 FPS, and 30 FPS, we usually set the timescale to 600, a common multiple of them.

We can create a CMTime using the following method:

CMTime time1 = CMTimeMake(3, 1);             // 3 / 1 = 3s
CMTime time2 = CMTimeMakeWithSeconds(5, 1);  // 5s, timescale = 1
NSDictionary *timeData = @{(id)kCMTimeValueKey : @2,
                           (id)kCMTimeScaleKey : @1,
                           (id)kCMTimeFlagsKey : @1,
                           (id)kCMTimeEpochKey : @0};
CMTime time3 = CMTimeMakeFromDictionary((__bridge CFDictionaryRef)timeData);
// Special values
CMTime time4 = kCMTimeZero;      // time 0
CMTime time5 = kCMTimeInvalid;   // an invalid time

CMTime operations:

// Subtract one CMTime from another
CMTimeSubtract(<#CMTime lhs#>, <#CMTime rhs#>)
// Compare two CMTimes
CMTimeCompare(<#CMTime time1#>, <#CMTime time2#>)
// Check validity
CMTIME_IS_INVALID(<#time#>)
// Print a CMTime
CMTimeShow(<#CMTime time#>)

CMTimeRange is used to represent a time range and consists of two CMTime values, the first defining the start time of the time range and the second defining the duration of the time range.

typedef struct
{
    CMTime          start;
    CMTime          duration;
} CMTimeRange;

We can create a CMTimeRange using the following method:

CMTime beginTime = CMTimeMake(5, 1);
CMTime endTime = CMTimeMake(12, 1);
// CMTimeRangeMake takes a start time and a duration
CMTimeRange timeRange1 = CMTimeRangeMake(beginTime, CMTimeSubtract(endTime, beginTime));
// CMTimeRangeFromTimeToTime takes a start time and an end time
CMTimeRange timeRange2 = CMTimeRangeFromTimeToTime(beginTime, endTime);
// Special values
CMTimeRange timeRange3 = kCMTimeRangeZero;
CMTimeRange timeRange4 = kCMTimeRangeInvalid;

CMTimeRange operations:

// Get the intersection of two CMTimeRanges
CMTimeRangeGetIntersection(<#CMTimeRange range#>, <#CMTimeRange otherRange#>)
// Get the union of two CMTimeRanges
CMTimeRangeGetUnion(<#CMTimeRange range#>, <#CMTimeRange otherRange#>)
// Whether the range contains a given time
CMTimeRangeContainsTime(<#CMTimeRange range#>, <#CMTime time#>)
// Whether the range contains another time range
CMTimeRangeContainsTimeRange(<#CMTimeRange range#>, <#CMTimeRange otherRange#>)
// Validity checks
CMTIMERANGE_IS_VALID(<#range#>)
CMTIMERANGE_IS_INVALID(<#range#>)
// Print a CMTimeRange
CMTimeRangeShow(<#CMTimeRange range#>)

3.2 CMSampleBuffer

CMSampleBuffer is often used when processing sample-level data with AVFoundation; for example, it is the output data type of the camera in AVFoundation's Capture module, and the data type that AVAssetReader and AVAssetWriter operate on when reading and writing.

CMSampleBuffer also comes from the Core Media framework. It is the core base object that the system uses to move media sample data through the media pipeline. The role of CMSampleBuffer is to encapsulate the underlying sample data together with its format description and timing information.

CMSampleBuffer contains zero or more compressed (or uncompressed) samples of a specific media type (audio, video, hybrid, etc.). CMSampleBuffer can contain:

  • Sample data. Contains one of the following:
    • A CMBlockBuffer of one or more media samples. A CMBlockBuffer holds encoded data that has not yet been decoded.
    • A CVPixelBuffer of one or more media samples. A CVPixelBuffer holds uncompressed data, either before encoding or after decoding.

  • Time information. CMSampleBuffer also contains the Presentation Time Stamp (PTS), which represents the display time of the current sample, and the Decode Time Stamp (DTS), which represents the decoding time of the sample; the DTS is mainly used for video decoding, and if the decoding order matches the display order it can be set to kCMTimeInvalid. CMSampleBufferGetPresentationTimeStamp and CMSampleBufferGetDecodeTimeStamp can be used to get the PTS and DTS respectively.
  • Format information. Format information is encapsulated in a CMFormatDescription object. CMVideoFormatDescriptionGetCodecType and CMVideoFormatDescriptionGetDimensions can be used to get the codec type and the video dimensions respectively. CMGetAttachment can also be used to retrieve metadata from the attachment dictionary:
CFDictionaryRef metadataDictionary = (CFDictionaryRef)CMGetAttachment(sampleBuffer, CFSTR("MetadataDictionary"), NULL);
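
A sketch of reading the time and format information from a CMSampleBuffer (for example, one delivered by a capture output):

CMTime pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
CMTime dts = CMSampleBufferGetDecodeTimeStamp(sampleBuffer);
CMFormatDescriptionRef formatDesc = CMSampleBufferGetFormatDescription(sampleBuffer);
if (formatDesc != NULL && CMFormatDescriptionGetMediaType(formatDesc) == kCMMediaType_Video) {
    FourCharCode codec = CMVideoFormatDescriptionGetCodecType(formatDesc);
    CMVideoDimensions dims = CMVideoFormatDescriptionGetDimensions(formatDesc);
    NSLog(@"codec: %u, size: %d x %d", (unsigned int)codec, dims.width, dims.height);
}
CMTimeShow(pts);
CMTimeShow(dts);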

3.3 CVPixelBuffer

CVPixelBufferRef is the pixel buffer type, defined on top of the image buffer type, and is a data type from the Core Video framework.

typedef CVImageBufferRef CVPixelBufferRef;

CVPixelBufferRef carries many image-related attributes, such as width, height, and pixel format type. In addition to common RGB32 formats, it also supports multi-planar YUV data formats such as kCVPixelFormatType_420YpCbCr8BiPlanarFullRange. The data pointer of each plane can be obtained through CVPixelBufferGetBaseAddressOfPlane; call CVPixelBufferLockBaseAddress before getting the address.

When a CMSampleBuffer instance contains video frame data, the corresponding CVPixelBuffer can be accessed with CMSampleBufferGetImageBuffer:

CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(<#A CMSampleBuffer#>);
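
A minimal sketch of reading data out of a CVPixelBuffer — lock it, inspect its attributes, then unlock (the plane handling assumes one of the bi-planar YUV formats mentioned above):

CVPixelBufferLockBaseAddress(pixelBuffer, kCVPixelBufferLock_ReadOnly);
size_t width = CVPixelBufferGetWidth(pixelBuffer);
size_t height = CVPixelBufferGetHeight(pixelBuffer);
OSType pixelFormat = CVPixelBufferGetPixelFormatType(pixelBuffer);
NSLog(@"%zu x %zu, format: %u", width, height, (unsigned int)pixelFormat);
if (CVPixelBufferIsPlanar(pixelBuffer)) {
    // Plane 0 is the luma (Y) plane for bi-planar YUV formats
    void *yPlane = CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 0);
    size_t yBytesPerRow = CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 0);
    // ... read or process the plane data here ...
}
CVPixelBufferUnlockBaseAddress(pixelBuffer, kCVPixelBufferLock_ReadOnly);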

IV. Notes

4.1 Privacy permissions

The Capture module in AVFoundation involves using the device's camera and microphone, and reading and saving photos and videos involves access to the user's photo library. Apple treats these as user privacy permissions, so we need to add usage descriptions for the corresponding permission requests to the app's Info.plist in Xcode. The following lists some permissions and their corresponding keys.

Private data | Info.plist key
Photo library (read permission) | Privacy – Photo Library Usage Description
Photo library (write permission) | Privacy – Photo Library Additions Usage Description
Microphone | Privacy – Microphone Usage Description
Camera | Privacy – Camera Usage Description
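
Besides declaring the keys in Info.plist, the app should check and request permission at runtime. A sketch for the camera, using AVCaptureDevice's authorization API:

AVAuthorizationStatus status = [AVCaptureDevice authorizationStatusForMediaType:AVMediaTypeVideo];
if (status == AVAuthorizationStatusNotDetermined) {
    [AVCaptureDevice requestAccessForMediaType:AVMediaTypeVideo completionHandler:^(BOOL granted) {
        // Called on an arbitrary queue; dispatch back to the main queue before touching the UI
        NSLog(@"Camera access %@", granted ? @"granted" : @"denied");
    }];
}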

4.2 Configuring AVAudioSession

The audio environment of iOS devices is more complex than that of macOS. Apple provides AVAudioSession, an audio session that acts as an intermediary between the application and the operating system: we only need to declare the application's audio behavior and delegate the management of that behavior to AVAudioSession. The default audio session is preconfigured with the following behavior:

  • Audio playback is supported, but audio recording is not allowed.
  • Silent mode silences any audio played by the application.
  • Locking the device will mute the app’s audio.
  • When the app plays the audio, it silences any other background audio.

iOS offers six categories to choose from:

AVAudioSessionCategoryAudioProcessing has been deprecated since iOS 10.0.

Category | Playback/recording | Interrupts other audio | Muted by silent mode or screen lock
SoloAmbient | Playback only | Yes | Yes
Ambient | Playback only | No | Yes
MultiRoute | Playback and recording | Yes | No
PlayAndRecord | Playback and recording | YES by default, can be overridden to NO | No
Playback | Playback only | YES by default, can be overridden to NO | No
Record | Recording only | Yes | No (recording continues even when the screen is locked; the UIBackgroundModes background mode must be configured)
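
A minimal sketch of configuring the shared audio session for simultaneous playback and recording (error handling kept to a bare minimum):

NSError *error = nil;
AVAudioSession *session = [AVAudioSession sharedInstance];
[session setCategory:AVAudioSessionCategoryPlayAndRecord error:&error];
if (error == nil) {
    [session setActive:YES error:&error];
}
if (error != nil) {
    NSLog(@"Failed to configure the audio session: %@", error);
}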

In addition, the audio session provides notifications for monitoring, such as AVAudioSessionInterruptionNotification for audio interruptions caused by an incoming phone call or an alarm going off, and AVAudioSessionRouteChangeNotification (with an AVAudioSessionRouteChangeReason) for route changes caused by, for example, plugging in headphones.
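
Observing these notifications is plain NSNotificationCenter work; a sketch for the interruption notification (handleInterruption: is a hypothetical selector on the observer):

[[NSNotificationCenter defaultCenter] addObserver:self
                                         selector:@selector(handleInterruption:)
                                             name:AVAudioSessionInterruptionNotification
                                           object:[AVAudioSession sharedInstance]];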

4.3 Multithreading

AVFoundation is designed with the current hardware environment and applications in mind, and it relies heavily on multithreading. When using it, we need to be clear about which thread an API is called on by default, or which thread it needs to run on, make sure UI updates are dispatched back to the main thread in time, and avoid blocking the main thread with time-consuming operations.

Conclusion

As the beginning of the AVFoundation series, this article mainly introduced an overview of AVFoundation and the basic functions of each module. By learning the Assets module, we built a picture of audio and video files from the perspective of tracks, and we also covered common data types and precautions when using AVFoundation. In the next article we will cover the support the AVFoundation framework provides for the first step of short video editing: adding and handling material.
