Preface
When we think of short video editing, FFmpeg and OpenGL ES, two large and daunting frameworks, may immediately come to mind. It is true that getting started with audio and video development requires a certain foundation, but we can still accomplish a great deal in short video editing by relying only on the AVFoundation framework provided by Apple. This series of articles focuses on the role of AVFoundation in short video editing on the iOS platform. This article introduces the basic modules of AVFoundation, the related data types, and the precautions to keep in mind when learning the framework.
I. Overview of the AVFoundation framework
AVFoundation is a fully functional framework for processing multimedia data on iOS, macOS, watchOS, and tvOS. Using AVFoundation, we can play, create, and edit QuickTime Movie and MPEG-4 files, play HLS streams, and build powerful media editing capabilities into our applications.
1.1 iOS multimedia framework system
Let’s start by looking at AVFoundation’s place in Apple’s multimedia framework stack. In the iOS multimedia system, the high-level AVKit framework provides the highly encapsulated player controller AVPlayerViewController, AVRoutePickerView for switching playback routes (screen casting), and AVPictureInPictureController for Picture in Picture playback. The low-level frameworks expose mainly C interfaces, where:
- Core Audio is the lowest-level audio processing interface. It drives the device’s audio hardware directly and provides comprehensive support for music games and professional audio editing software. Its Audio Unit component provides interfaces for synthesizing instrument sounds, echo cancellation, mixing, and sound balancing.
Audio Unit has been migrated to the Audio Toolbox Framework.
- Core Video provides image buffer (CVPixelBuffer) and buffer pool (CVPixelBufferPool) support for its counterpart Core Media, offers a frame-by-frame access interface to digital video, and supports Metal (CVMetalTexture), OpenGL (CVOpenGLTexture), and OpenGL ES (CVOpenGLESTexture).
- Core Media defines and encapsulates the media processing pipeline (including timing information) required by higher-level media frameworks such as AVFoundation, along with the interfaces and data types used in it (CMSampleBuffer, CMTime). The Core Media interfaces and data types can be used to process media sample data efficiently and to manage queues of sample data (CMSimpleQueue, CMBufferQueue).
- Core Animation is the animation framework on iOS. AVFoundation works together with Core Animation so that developers can add animation and sticker effects during video editing and playback.
AVFoundation, on the other hand, sits between the high-level and low-level frameworks. It encapsulates functionality that previously could only be achieved with the low-level frameworks and exposes Objective-C and Swift interfaces. Meanwhile, Apple continuously optimizes the performance of this middle-layer framework with each iteration and keeps it well supported on new devices and video formats.
1.2 Introduction to each module of AVFoundation
The AVFoundation framework combines six major technical areas that together cover the main functions of recording, processing, compositing, controlling, importing, and exporting audiovisual media on Apple platforms. The official API documentation divides AVFoundation into six functional modules: Assets, Playback, Capture, Editing, Audio, and Speech.
- Assets: lets you load, inspect, and export media resources and their metadata; read and write media sample data at the sample level with AVAssetReader and AVAssetWriter; generate video thumbnails with AVAssetImageGenerator; create captions with AVCaption (macOS); and so on.
- Playback: provides playback and playback control of an AVAsset. AVPlayer plays a single item, AVQueuePlayer plays multiple items in sequence, and AVSynchronizedLayer, combined with Core Animation, synchronizes an animation layer with the playback view layer to achieve effects such as stickers and text during playback.
- Capture: used to take photos and record audio and video. You can configure the built-in cameras and microphones or external capture devices to build a customized camera, control the output format of photos and video, or modify the audio and video data streams directly as custom output.
- Editing: used to combine, edit, and remix audio and video tracks from multiple sources into an AVMutableComposition. AVAudioMix and AVVideoComposition control the details of audio mixing and video composition, respectively.
- Audio: plays, records, and processes audio, and configures the application’s system audio behavior. In iOS 14.5, Apple split the same content out into a separate AVFAudio framework, so the audio part will probably be treated separately in the future.
- Speech: converts text to spoken audio.
In short video editing, whether we are dealing with the source material used in editing, the intermediate products produced during editing, or the final exported result, we are always dealing with AVAsset or its subclasses. The Assets module where AVAsset lives is therefore the foundation of AVFoundation media processing and the first thing to learn.
II. Basic module: Assets
2.1 AVAsset & AVAssetTrack
AVAsset is an abstract, immutable class that defines how media resources are presented and models the static properties of a media resource as a whole. It provides an abstraction over the underlying media format, which means that whether we are dealing with a QuickTime movie or an MP3 file, the only thing facing developers and the rest of the framework is the concept of a resource.
AVAsset is usually instantiated through its subclass AVURLAsset, where the URL parameter can point to a remote file, a local file, or even a stream, so we don’t have to focus on the source, just on the AVAsset itself.
The methods in the following code example both return an AVURLAsset instance. The options parameter in method 2 can be used to customize the initialization of an AVURLAsset for specific requirements. For example, when creating an AVURLAsset from an HLS stream, passing {AVURLAssetAllowsCellularAccessKey: @NO} as the options prevents the media data from being retrieved over a cellular network. AVURLAssetPreferPreciseDurationAndTimingKey is closely related to video editing: it indicates whether the resource should provide an accurate duration and precise random access by time. Video editing requires accurate values, so it is recommended to pass @YES; however, this precision may require additional parsing and therefore longer loading times.
Many container formats, such as QuickTime Movie and MPEG-4 files, provide enough summary information for accurate timing and do not require additional parsing to prepare.
```objc
// Method 1
AVAsset *asset = [AVAsset assetWithURL:url];
// Method 2: pass options to customize the initialization
AVURLAsset *urlAsset = [AVURLAsset URLAssetWithURL:url options:@{AVURLAssetPreferPreciseDurationAndTimingKey: @YES}];
```
An AVAsset instance is a container for one or more AVAssetTrack instances, each of which models a uniformly typed “track” of media. A simple video file usually contains one audio track and one video track, and may also contain supplementary content such as closed captions, subtitles, or metadata (AVMetadataItem) describing the media content.
Closed Caption, abbreviated CC. Most CC captions read like a transcript: in addition to the dialogue, they describe the sounds and music occurring in the scene, mainly for hearing-impaired viewers. The word “Closed” indicates that the captions can be turned off, as opposed to “Open” captions, which are always visible. If the captions are in the same language as the dialogue they are usually called captions; if in a different language, subtitles.
Creating an AVAsset is a lightweight operation because the underlying media data is lazily loaded: it is not loaded until a property is accessed, and property access is synchronous. Accessing a property that has not been loaded asynchronously beforehand may block the calling thread, depending on the size and location of the media data. To avoid blocking, it is best to load properties asynchronously before using them. AVAsset and AVAssetTrack conform to the AVAsynchronousKeyValueLoading protocol, which lets us load properties asynchronously and query their loading state.
```objc
@protocol AVAsynchronousKeyValueLoading
// Asynchronously load the properties named in the keys array. Inside the handler,
// use statusOfValueForKey:error: to check whether loading completed.
- (void)loadValuesAsynchronouslyForKeys:(NSArray<NSString *> *)keys completionHandler:(nullable void (^)(void))handler;
// Query the loading status of a property; AVKeyValueStatusLoaded means it has been loaded.
- (AVKeyValueStatus)statusOfValueForKey:(NSString *)key error:(NSError * _Nullable * _Nullable)outError;
@end
```
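As a usage sketch of the protocol above, the following loads the duration and tracks keys asynchronously before reading them; the `url` variable and the choice of keys are assumptions made for illustration.

```objc
AVURLAsset *asset = [AVURLAsset URLAssetWithURL:url
                                        options:@{AVURLAssetPreferPreciseDurationAndTimingKey: @YES}];
[asset loadValuesAsynchronouslyForKeys:@[@"duration", @"tracks"] completionHandler:^{
    NSError *error = nil;
    AVKeyValueStatus status = [asset statusOfValueForKey:@"duration" error:&error];
    if (status == AVKeyValueStatusLoaded) {
        // The property is loaded; reading it now will not block the calling thread.
        CMTimeShow(asset.duration);
    }
}];
```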
The WWDC 2021 session “What’s New in AVFoundation” mentions that the introduction of async/await in the Swift API lets us write asynchronous loading code with control flow similar to synchronous programming.
```swift
let asset = AVAsset(url: assetURL)
let duration = try await asset.load(.duration)
let (duration, tracks) = try await asset.load(.duration, .tracks)
```
AVAsset properties:
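Below is a hedged sketch of a few commonly used AVAsset properties; the variable `asset` is assumed to be the AVURLAsset created earlier, and each value should be loaded asynchronously (as described above) before being read.

```objc
CMTime duration = asset.duration;                                    // total duration of the asset
NSArray<AVAssetTrack *> *tracks = asset.tracks;                      // all tracks contained in the asset
NSArray<AVMetadataItem *> *commonMetadata = asset.commonMetadata;    // metadata in the common key space
NSArray<AVMetadataFormat> *formats = asset.availableMetadataFormats; // metadata formats present in the asset
```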
The tracks property returns an array of all AVAssetTrack instances contained in an AVAsset instance. Apple also provides the following methods for retrieving a subset of tracks based on specific criteria such as track ID, media type, or media characteristic. These are also the common ways to retrieve tracks of a certain type in the Editing module.
```objc
// Retrieve a track by track ID
- (void)loadTrackWithTrackID:(CMPersistentTrackID)trackID completionHandler:(void (^)(AVAssetTrack * _Nullable_result, NSError * _Nullable))completionHandler;
// Retrieve a subset of tracks by media type
- (void)loadTracksWithMediaType:(AVMediaType)mediaType completionHandler:(void (^)(NSArray<AVAssetTrack *> * _Nullable, NSError * _Nullable))completionHandler;
// Retrieve a subset of tracks by media characteristic
- (void)loadTracksWithMediaCharacteristic:(AVMediaCharacteristic)mediaCharacteristic completionHandler:(void (^)(NSArray<AVAssetTrack *> * _Nullable, NSError * _Nullable))completionHandler;
```
Commonly used AVMediaType values include AVMediaTypeAudio (audio), AVMediaTypeVideo (video), AVMediaTypeSubtitle (subtitles), AVMediaTypeMetadata (metadata), and so on.
AVMediaCharacteristic describes characteristics of the media data, such as whether a track contains HDR video (AVMediaCharacteristicContainsHDRVideo) or audible content (AVMediaCharacteristicAudible).
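As a usage sketch of the asynchronous retrieval methods above (available since iOS 15), assuming `asset` is the AVURLAsset created earlier, the following fetches the first video track by media type and reads a couple of its properties:

```objc
[asset loadTracksWithMediaType:AVMediaTypeVideo
             completionHandler:^(NSArray<AVAssetTrack *> * _Nullable tracks, NSError * _Nullable error) {
    AVAssetTrack *videoTrack = tracks.firstObject;
    if (videoTrack) {
        CGSize naturalSize = videoTrack.naturalSize;   // video dimensions
        float frameRate = videoTrack.nominalFrameRate; // nominal frame rate
        NSLog(@"video track: %.0f x %.0f at %.2f fps", naturalSize.width, naturalSize.height, frameRate);
    }
}];
```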
2.2 Metadata
Media container formats store descriptive metadata about their media, and each container format has its own metadata format. AVFoundation simplifies working with metadata through its AVMetadataItem class. In its most basic form, an AVMetadataItem instance is a key-value pair representing a single metadata value, such as a movie’s title or an album’s artwork.
To use AVMetadataItem effectively, we need to understand how AVFoundation organizes metadata. To simplify finding and filtering metadata items, the framework groups related metadata into key spaces:
- Format-specific key spaces. The framework defines several format-specific key spaces that roughly correspond to particular container or file formats, such as QuickTime (QuickTime metadata and user data) or MP3 (ID3). A single resource may contain metadata values spanning multiple key spaces. To retrieve the complete collection of a resource’s format-specific metadata, use the metadata property.
- Common key space. Several common metadata values, such as a movie’s creation date or description, can exist in multiple key spaces. To help normalize access to this common metadata, the framework provides a common key space that gives access to a limited set of metadata values shared by several key spaces. To retrieve the collection of a resource’s common metadata, use the commonMetadata property directly.
In addition, we can find out which metadata formats a resource contains through AVAsset’s availableMetadataFormats property, which returns an array of string identifiers, one per metadata format. We can then retrieve format-specific metadata values with the metadataForFormat: method by passing in the appropriate format identifier.
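A hedged sketch of reading metadata with the property and method just described; `asset` is assumed to have its metadata loaded already, and filtering for the title in the common key space is only an illustration.

```objc
// Walk every metadata format the asset contains.
for (AVMetadataFormat format in asset.availableMetadataFormats) {
    NSArray<AVMetadataItem *> *items = [asset metadataForFormat:format];
    NSLog(@"format %@ contains %lu metadata items", format, (unsigned long)items.count);
}
// Filter the common key space for a single identifier, e.g. the title.
NSArray<AVMetadataItem *> *titleItems =
    [AVMetadataItem metadataItemsFromArray:asset.commonMetadata
                      filteredByIdentifier:AVMetadataCommonIdentifierTitle];
NSString *title = titleItems.firstObject.stringValue;
```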
Metadata of an HDR video file shot with an iPhone 13 Pro:
```
CreationDate: 2022-03-01T18:16:17+0800
location:     +39.9950+116.4749+044.903/
make:         Apple
Model:        iPhone 13 Pro
Software:     15.3.1
```
Although the purpose of this series is not to dig into audio and video codec formats, suppose we have a video file (.mov) and want to know the encoding type of its video samples (H.264 / HEVC) and their transfer function (ITU_R_709_2 / ITU_R_2100_HLG), or the format of its audio samples, such as sample rate, channel count, and bit depth. Where should we start? Earlier we introduced how audio and video data are modeled separately as tracks within an AVAsset. To get the format information of the video samples, we only need to retrieve the corresponding video track by media type and read the track’s formatDescriptions property, which returns a collection of CMVideoFormatDescription objects describing the video sample format. Likewise, CMAudioFormatDescription and CMClosedCaptionFormatDescription describe the sample data formats of their respective tracks.
```objc
// Get the format descriptions of the video track's sample data
AVAssetTrack *videoTrack = [[asset tracksWithMediaType:AVMediaTypeVideo] firstObject];
NSArray *videoFormats = videoTrack.formatDescriptions;
```
The description of the video track format in the video file is as follows:
"<CMVideoFormatDescription 0x2834305a0 [0x1dbce41b8]> { mediaType:'vide' mediaSubType:'hvc1' mediaSpecific: { codecType: 'hvc1' dimensions: 1920 x 1080 } extensions: {{ AmbientViewingEnvironment = {length = 8, bytes = 0x002fe9a03d134042}; BitsPerComponent = 10; CVFieldCount = 1; CVImageBufferChromaLocationBottomField = Left; CVImageBufferChromaLocationTopField = Left; CVImageBufferColorPrimaries = \"ITU_R_2020\"; CVImageBufferTransferFunction = \"ITU_R_2100_HLG\"; CVImageBufferYCbCrMatrix = \"ITU_R_2020\"; Depth = 24; FormatName = HEVC; FullRangeVideo = 0; RevisionLevel = 0; SampleDescriptionExtensionAtoms = { dvvC = { length = 24, bytes = 0x010010254000000000000000000000000000000000000000 }; hvcC = { length = 125, bytes = 0x01022000 0000b000 00000000 78f000fc ... 2fe9a03d 13404280 }; }; SpatialQuality = 512; TemporalQuality = 512; VerbatimSampleDescription = { length = 289, bytes = 0x00000121 68766331 00000000 00000001 ... 3d134042 00000000 }; Version = 0; }}}"Copy the code
2.3 Video Preview
In short video editing, playing the video content before export is called previewing. Playback actually belongs to AVFoundation’s Playback module, and this series is not focused on the player, so let’s just take a quick look at AVPlayer, the class used to preview an AVAsset.
AVPlayer is initialized with an AVPlayerItem object, which manages the resource being played and provides the player’s data source. If you play a video with AVPlayer alone, you only get sound and no picture; to display the picture we also need an AVPlayerLayer.
```objc
// 1. Create the AVAsset
AVAsset *asset = [AVAsset assetWithURL:url];
// 2. Create the AVPlayerItem
AVPlayerItem *item = [[AVPlayerItem alloc] initWithAsset:asset];
// 3. Create the AVPlayer
AVPlayer *player = [AVPlayer playerWithPlayerItem:item];
// 4. Create the AVPlayerLayer used to display the video
AVPlayerLayer *playerLayer = [AVPlayerLayer playerLayerWithPlayer:player];
playerLayer.frame = self.view.bounds; // the layer needs a frame to be visible
// 5. Add the AVPlayerLayer to the view's layer
[self.view.layer addSublayer:playerLayer];
// 6. Play
[player play];
```
2.4 Obtaining video thumbnails
Before exporting the video, there is usually a feature for choosing its cover, which needs to present a list of video thumbnails. To generate video thumbnails we use AVAssetImageGenerator, which can be created with either of the following methods:
```objc
+ (instancetype)assetImageGeneratorWithAsset:(AVAsset *)asset;
- (instancetype)initWithAsset:(AVAsset *)asset;
```
If you need frame-accurate captures, set both time tolerances to kCMTimeZero as shown below, and set the maximumSize property to limit the size of the generated images.
```objc
// Capture thumbnails at precise times
imageGenerator.requestedTimeToleranceBefore = kCMTimeZero;
imageGenerator.requestedTimeToleranceAfter = kCMTimeZero;
// Limit the size of the generated thumbnails
imageGenerator.maximumSize = CGSizeMake(100, 100);
```
Then call one of the following methods to get a thumbnail at a single time, or thumbnails at multiple times:
```objc
// Get a thumbnail at a single time
- (nullable CGImageRef)copyCGImageAtTime:(CMTime)requestedTime actualTime:(nullable CMTime *)actualTime error:(NSError * _Nullable * _Nullable)outError;
// Get thumbnails at multiple times; the handler is called once for each generated image
- (void)generateCGImagesAsynchronouslyForTimes:(NSArray<NSValue *> *)requestedTimes completionHandler:(AVAssetImageGeneratorCompletionHandler)handler;
// Handler type for the method above; actualTime is the actual time of the generated thumbnail
typedef void (^AVAssetImageGeneratorCompletionHandler)(CMTime requestedTime, CGImageRef _Nullable image, CMTime actualTime, AVAssetImageGeneratorResult result, NSError * _Nullable error);
```
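Putting the pieces together, here is a minimal sketch of generating ten evenly spaced thumbnails for a cover-selection strip; the count of ten and the 100 x 100 size are arbitrary choices made for illustration, and the duration is assumed to be loaded already.

```objc
AVAssetImageGenerator *imageGenerator = [AVAssetImageGenerator assetImageGeneratorWithAsset:asset];
imageGenerator.maximumSize = CGSizeMake(100, 100);

NSMutableArray<NSValue *> *times = [NSMutableArray array];
CMTime duration = asset.duration; // load asynchronously beforehand in real code
for (int i = 0; i < 10; i++) {
    CMTime time = CMTimeMultiplyByFloat64(duration, i / 10.0);
    [times addObject:[NSValue valueWithCMTime:time]];
}

[imageGenerator generateCGImagesAsynchronouslyForTimes:times
                                     completionHandler:^(CMTime requestedTime, CGImageRef _Nullable image, CMTime actualTime, AVAssetImageGeneratorResult result, NSError * _Nullable error) {
    if (result == AVAssetImageGeneratorSucceeded && image != NULL) {
        UIImage *thumbnail = [UIImage imageWithCGImage:image];
        // Hand the thumbnail back to the UI on the main thread.
    }
}];
```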
III. Common data types
3.1 CMTime & CMTimeRange
To get a video thumbnail we need to pass in a CMTime that represents a moment in time. We usually think of time as an NSTimeInterval (a double), which is what AVAudioPlayer and AVAudioRecorder in AVFoundation use, but floating-point imprecision (simple rounding can lead to dropped frames) makes it unsuitable for more advanced time-based media development. So Core Media provides a structure to represent time:
```objc
typedef struct
{
    CMTimeValue value;
    CMTimeScale timescale;
    CMTimeFlags flags;
    CMTimeEpoch epoch;
} CMTime;
```
CMTimeFlags is a bitmask that indicates the state of the time value, such as whether it is valid, and CMTimeEpoch represents the epoch, usually 0. We focus on CMTimeValue and CMTimeScale: a CMTime represents time = value / timescale, where timescale is the number of parts one second is divided into and value is the number of those parts the time contains. To accommodate the most common frame rates of 24, 25, and 30 fps, we usually set the timescale to 600, a common multiple of all three.
We can create a CMTime using the following method:
```objc
CMTime time1 = CMTimeMake(3, 1);            // 3 / 1 = 3s
CMTime time2 = CMTimeMakeWithSeconds(5, 1); // 5s, timescale = 1
NSDictionary *timeData = @{(id)kCMTimeValueKey : @2,
                           (id)kCMTimeScaleKey : @1,
                           (id)kCMTimeFlagsKey : @1,
                           (id)kCMTimeEpochKey : @0};
CMTime time3 = CMTimeMakeFromDictionary((__bridge CFDictionaryRef)timeData);
// Special values
CMTime time4 = kCMTimeZero;    // zero time
CMTime time5 = kCMTimeInvalid; // invalid time
```
CMTime operations:
```objc
// Subtract two times
CMTimeSubtract(<#CMTime lhs#>, <#CMTime rhs#>)
// Compare two times
CMTimeCompare(<#CMTime time1#>, <#CMTime time2#>)
// Validity check
CMTIME_IS_INVALID(<#time#>)
// Print
CMTimeShow(<#CMTime time#>)
```
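A small worked example of the value/timescale arithmetic with the recommended timescale of 600; the specific values are chosen only for illustration.

```objc
CMTime oneSecond  = CMTimeMake(600, 600); // 600 / 600 = 1s
CMTime oneFrame30 = CMTimeMake(20, 600);  // one frame at 30 fps: 1/30 s = 20/600
CMTime total      = CMTimeAdd(oneSecond, oneFrame30);
CMTimeShow(total);                        // prints something like {620/600 = 1.033}
```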
CMTimeRange is used to represent a time range and consists of two CMTime values, the first defining the start time of the time range and the second defining the duration of the time range.
```objc
typedef struct
{
    CMTime start;
    CMTime duration;
} CMTimeRange;
```
We can create a CMTimeRange using the following method:
```objc
CMTime beginTime = CMTimeMake(5, 1);
CMTime duration = CMTimeMake(7, 1);
CMTime endTime = CMTimeMake(12, 1);
// CMTimeRangeMake takes a start time and a duration
CMTimeRange timeRange1 = CMTimeRangeMake(beginTime, duration);
// CMTimeRangeFromTimeToTime takes a start time and an end time
CMTimeRange timeRange2 = CMTimeRangeFromTimeToTime(beginTime, endTime);
// Special values
CMTimeRange timeRange3 = kCMTimeRangeZero;
CMTimeRange timeRange4 = kCMTimeRangeInvalid;
```
CMTimeRange operations:
```objc
// Intersection of two CMTimeRanges
CMTimeRangeGetIntersection(<#CMTimeRange range#>, <#CMTimeRange otherRange#>)
// Union of two CMTimeRanges
CMTimeRangeGetUnion(<#CMTimeRange range#>, <#CMTimeRange otherRange#>)
// Whether the range contains a given time
CMTimeRangeContainsTime(<#CMTimeRange range#>, <#CMTime time#>)
// Whether the range contains another time range
CMTimeRangeContainsTimeRange(<#CMTimeRange range#>, <#CMTimeRange otherRange#>)
// Validity checks
CMTIMERANGE_IS_VALID(<#range#>)
CMTIMERANGE_IS_INVALID(<#range#>)
// Print
CMTimeRangeShow(<#CMTimeRange range#>)
```
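As a usage sketch, assuming `asset` has its duration loaded, we can check that a requested trim range lies inside the asset before using it for editing; the 2 s start and 5 s duration are arbitrary illustration values.

```objc
CMTimeRange fullRange = CMTimeRangeMake(kCMTimeZero, asset.duration);
CMTimeRange trimRange = CMTimeRangeMake(CMTimeMake(2 * 600, 600), CMTimeMake(5 * 600, 600)); // start 2s, duration 5s
if (CMTimeRangeContainsTimeRange(fullRange, trimRange)) {
    // The trim range lies entirely inside the asset, so it is safe to use for editing.
}
```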
3.2 CMSampleBuffer
CMSampleBuffer appears frequently when processing sample-level data with AVFoundation: for example, it is the output type of the data captured by the camera in the Capture module, and the type that AVAssetReader and AVAssetWriter operate on when reading and writing.
CMSampleBuffer also comes from the Core Media framework. It is the core base object the system uses to move media sample data through the media pipeline. Its role is to wrap the underlying sample data and attach a format description and timing information to it.
A CMSampleBuffer contains zero or more compressed (or uncompressed) samples of a specific media type (audio, video, muxed, etc.). A CMSampleBuffer can contain:
- Sample data, as one of the following:
  - One or more media samples in a CMBlockBuffer, which holds encoded data that has not yet been decoded.
  - One or more media samples in a CVPixelBuffer, which holds data before encoding or after decoding.
- Time information. A CMSampleBuffer carries the presentation time stamp (PTS) of the current sample and its decode time stamp (DTS). The DTS is mainly used for video decoding; if the decode order matches the presentation order, it can be set to kCMTimeInvalid. CMSampleBufferGetPresentationTimeStamp and CMSampleBufferGetDecodeTimeStamp retrieve the PTS and DTS, respectively.
- Format information, encapsulated in a CMFormatDescription object. CMVideoFormatDescriptionGetCodecType and CMVideoFormatDescriptionGetDimensions retrieve the encoding type and video dimensions, and CMGetAttachment can retrieve metadata from the attachment dictionary:
```objc
CFDictionaryRef metadataDictionary = (CFDictionaryRef)CMGetAttachment(sampleBuffer, CFSTR("MetadataDictionary"), NULL);
```
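A minimal sketch that reads the timing and format information described above out of a sample buffer; `sampleBuffer` is assumed to come from a capture output or an AVAssetReader.

```objc
CMTime pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
CMTime dts = CMSampleBufferGetDecodeTimeStamp(sampleBuffer); // may be kCMTimeInvalid
CMFormatDescriptionRef desc = CMSampleBufferGetFormatDescription(sampleBuffer);
if (desc != NULL && CMFormatDescriptionGetMediaType(desc) == kCMMediaType_Video) {
    FourCharCode codec = CMVideoFormatDescriptionGetCodecType(desc);
    CMVideoDimensions dims = CMVideoFormatDescriptionGetDimensions(desc);
    NSLog(@"video sample: pts %.3fs, dts %.3fs, codec %u, %d x %d",
          CMTimeGetSeconds(pts), CMTimeGetSeconds(dts), (unsigned int)codec, dims.width, dims.height);
}
```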
3.3 CVPixelBuffer
CVPixelBufferRef is the pixel buffer type. It is based on the image buffer type and belongs to the Core Video framework:
```objc
typedef CVImageBufferRef CVPixelBufferRef;
```
A CVPixelBufferRef carries many image-related attributes, such as width, height, and pixel format type. Besides the common 32-bit RGB formats, it also supports multi-planar YUV formats such as kCVPixelFormatType_420YpCbCr8BiPlanarFullRange. The data pointer of each plane can be obtained with CVPixelBufferGetBaseAddressOfPlane, and CVPixelBufferLockBaseAddress must be called before accessing the addresses.
When a CMSampleBuffer instance contains video frame data, the corresponding CVPixelBuffer can be obtained with CMSampleBufferGetImageBuffer:
```objc
CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(<#A CMSampleBuffer#>);
```
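Continuing from the pixel buffer obtained above, here is a sketch of reading per-plane data; remember to lock the base address before touching the memory and unlock it afterwards.

```objc
CVPixelBufferLockBaseAddress(pixelBuffer, kCVPixelBufferLock_ReadOnly);
size_t planeCount = CVPixelBufferGetPlaneCount(pixelBuffer); // 2 for the bi-planar YUV formats
NSLog(@"%zu x %zu, %zu plane(s)",
      CVPixelBufferGetWidth(pixelBuffer), CVPixelBufferGetHeight(pixelBuffer), planeCount);
for (size_t i = 0; i < planeCount; i++) {
    void *base = CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, i);
    size_t bytesPerRow = CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, i);
    // Read or copy the plane data here; `base` points to the first byte of plane i.
    (void)base; (void)bytesPerRow;
}
CVPixelBufferUnlockBaseAddress(pixelBuffer, kCVPixelBufferLock_ReadOnly);
```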
IV. Notes
4.1 Privacy permissions
The Capture module in AVFoundation uses the device’s camera and microphone, and reading and saving photos and videos requires access to the user’s photo library. Apple treats all of these as user privacy permissions, so we need to add the corresponding usage-description entries to the app’s Info.plist in Xcode. The table below lists some of the permissions and their keys.
Private data | Key |
---|---|
Photo library (read) | Privacy – Photo Library Usage Description |
Photo library (add/write) | Privacy – Photo Library Additions Usage Description |
Microphone | Privacy – Microphone Usage Description |
Camera | Privacy – Camera Usage Description |
4.2 Configuring AVAudioSession
The audio environment of an iOS device is more complex than that of macOS. Apple provides AVAudioSession to act as an intermediary between the application and the operating system: we only need to declare the application’s audio behavior and delegate the management of that behavior to AVAudioSession. The default audio session is preconfigured with the following behavior:
- Audio playback is supported, but audio recording is not allowed.
- Silent mode silences any audio played by the application.
- Locking the device will mute the app’s audio.
- When the app plays the audio, it silences any other background audio.
iOS offers six categories to choose from (a seventh, AVAudioSessionCategoryAudioProcessing, was deprecated in iOS 10.0):
Category | Playback/recording | Interrupts other audio | Silenced by silent switch / screen lock |
---|---|---|---|
SoloAmbient | Playback only | Yes | Yes |
Ambient | Playback only | No | Yes |
MultiRoute | Playback and recording | Yes | No |
PlayAndRecord | Playback and recording | Yes by default, can be overridden to No | No |
Playback | Playback only | Yes by default, can be overridden to No | No |
Record | Recording only | Yes | No (recording can continue even when the screen is locked, provided the audio background mode is configured in UIBackgroundModes) |
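As a minimal configuration sketch (error handling elided), the following sets the PlayAndRecord category and mixes with other audio rather than interrupting it; the choice of category and option is just an illustration.

```objc
AVAudioSession *session = [AVAudioSession sharedInstance];
NSError *error = nil;
[session setCategory:AVAudioSessionCategoryPlayAndRecord
         withOptions:AVAudioSessionCategoryOptionMixWithOthers // do not interrupt other audio
               error:&error];
[session setActive:YES error:&error];
```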
In addition, the audio session posts notifications we can observe: AVAudioSessionInterruptionNotification for audio interruptions caused by, for example, an incoming phone call or an alarm, and AVAudioSessionRouteChangeNotification (whose userInfo carries an AVAudioSessionRouteChangeReason) for route changes such as plugging in headphones.
4.3 Multithreading
AVFoundation is designed with today’s hardware and applications in mind, and it relies heavily on multithreading. When using it, we need to be clear about which thread an API is called on by default, or which thread it needs to run on, make sure UI updates are dispatched back to the main thread in time, and avoid blocking the main thread with time-consuming operations.
Conclusion
As the opening of this AVFoundation series, this article introduced an overview of AVFoundation and the basic functions of each module, built a track-oriented view of audio and video files by studying the Assets module, and covered the common data types and precautions for using AVFoundation. In the next article we will look at the support AVFoundation provides for adding and handling material, the first step of short video editing.
References
- AVFoundation
- AVAudioSession
- CMSampleBuffer