This article analyzes a real MP4 file layer by layer to show how an MP4 file is composed.
We used mp4Parse, an online MP4 parsing tool, for analysis and validation.
Terminology
Before parsing the MP4 container format, we need to define the terms involved:
- Box: Also known as an atom, the basic unit of an MP4 file. It is an object-oriented building block identified by a unique type identifier (type) and a length (size), and it consists of a header and a data section
- FullBox: An extension of Box whose header also carries a version (1 byte) and flags (3 bytes)
- Container box: A box whose data consists of other boxes
- Header: The box header, containing size and type and, for a full box, version (1 byte) and flags (3 bytes)
- Data: The actual data of the box, which can be concrete field values or nested boxes
- Sample: The basic unit of media data. A video sample is one frame or a group of consecutive frames. An audio sample is a continuous piece of compressed audio that contains a certain number of audio sampling points
- Chunk: A collection of consecutive samples of the same media type. The number of samples per chunk is not fixed; it can be the same for every chunk or differ between chunks
- Track: A complete media resource with continuous timestamps. An MP4 file usually contains an audio track and a video track
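The header definition above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function name `read_box_header` is made up for this example, and the handling of `size == 1` (a 64-bit size follows the type) and `size == 0` (the box runs to the end of the file) follows the ISO base media file format rather than anything shown in this article. The sample bytes are the mdat header quoted later in the article.

```python
import struct

def read_box_header(buf: bytes, offset: int = 0):
    """Return (box_type, box_size, header_size) for the box starting at offset."""
    # Every box starts with a 32-bit big-endian size and a 4-byte type code.
    size, raw_type = struct.unpack_from(">I4s", buf, offset)
    header_size = 8
    if size == 1:
        # size == 1: a 64-bit "largesize" field follows the type.
        (size,) = struct.unpack_from(">Q", buf, offset + 8)
        header_size = 16
    elif size == 0:
        # size == 0: the box extends to the end of the buffer/file.
        size = len(buf) - offset
    return raw_type.decode("latin-1"), size, header_size

# Header bytes of the mdat box from this article: 00 42 1D 4D, then "mdat".
header = bytes.fromhex("00421D4D") + b"mdat"
print(read_box_header(header))  # ('mdat', 4332877, 8)
```

A full box would additionally read one version byte and three flag bytes right after this header.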
The box structure is shown below:
[Figure: box structure. Image from "Detailed description of MP4 file format – Structure overview"]
MP4 structure
An MP4 file is made up of boxes, which can be viewed as its basic units. See Table 1 in the ISO/IEC 14496-12 document for the full list of boxes, or refer to this article
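Because every box announces its own size, a file can be walked box by box without understanding each box's payload. The sketch below assumes a hypothetical helper `walk_boxes` and a hand-picked set `CONTAINERS` listing the boxes this article treats as pure containers; the test data is synthetic, built with another illustrative helper `make_box`, and is not taken from the article's video.

```python
import struct

# Boxes whose payload is just a sequence of child boxes (assumed list for this sketch).
CONTAINERS = {"moov", "trak", "mdia", "minf", "dinf", "stbl"}

def walk_boxes(buf, start=0, end=None, depth=0):
    """Yield (depth, type, offset, size) for every box found in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:            # 64-bit size follows the type
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:          # box runs to the end of the enclosing range
            size = end - pos
        if size < header:
            break                # malformed box, stop instead of looping forever
        box_type = raw_type.decode("latin-1")
        yield depth, box_type, pos, size
        if box_type in CONTAINERS:
            yield from walk_boxes(buf, pos + header, pos + size, depth + 1)
        pos += size

def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a synthetic box for demonstration purposes."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

sample = make_box(b"ftyp", b"isom" + bytes(4)) + make_box(b"moov", make_box(b"mvhd", bytes(100)))
for depth, box_type, offset, size in walk_boxes(sample):
    print("  " * depth + f"{box_type} offset={offset} size={size}")
```

On a real file, the same loop can be run over the bytes read from disk; full-box containers such as stsd need extra handling because their children start after additional header fields.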
Example analysis
Let’s analyze each box using a concrete MP4 file. Each box is introduced through its binary data, its box structure, and a field parsing table. The video used for the analysis can be downloaded here.
ftyp (File Type Box)
ftyp is the identifying box of an MP4 file. There is exactly one in the file, and it is placed at the beginning. It states which specification the file follows and the version of that specification.
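As a rough illustration, the ftyp payload can be decoded as follows. The field layout (a 4-byte major brand, a 32-bit minor version, then a list of 4-byte compatible brands) comes from the ISO base media file format; the function name `parse_ftyp` and the sample values are made up for this sketch and are not read from the article's file.

```python
import struct

def parse_ftyp(payload: bytes):
    """Decode an ftyp payload (the bytes after the 8-byte box header)."""
    major_brand, minor_version = struct.unpack_from(">4sI", payload, 0)
    # The rest of the payload is a list of 4-byte compatible brand codes.
    brands = [payload[i:i + 4].decode("latin-1") for i in range(8, len(payload) - 3, 4)]
    return major_brand.decode("latin-1"), minor_version, brands

# Hypothetical payload for demonstration only.
payload = b"isom" + struct.pack(">I", 512) + b"isomiso2avc1mp41"
print(parse_ftyp(payload))  # ('isom', 512, ['isom', 'iso2', 'avc1', 'mp41'])
```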
moov (Movie Box)
moov is a container box; its actual data is stored in its sub-boxes. There is only one moov box per file. It mainly stores the MP4 metadata, such as the audio and video duration and the timescale. For on-demand playback, moov is placed right after ftyp and before mdat so that the player can open the file quickly. When the file is written while it is being recorded, moov is usually placed after mdat.
mvhd (Movie Header Box)
mvhd is a sub-box of moov. It defines information about the presentation as a whole, such as the file creation and modification times, the duration, and the timescale.
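The duration field is expressed in timescale units, so the duration in seconds is duration / timescale. The sketch below reads only the leading fields of an mvhd payload; the layout (32-bit fields for version 0, 64-bit times and duration for version 1) is taken from the ISO base media file format, while `parse_mvhd_times` and the sample numbers are assumptions for illustration.

```python
import struct

def parse_mvhd_times(payload: bytes):
    """Read timescale and duration from an mvhd payload (bytes after the box header)."""
    version = payload[0]  # first byte of the full-box header; the next 3 bytes are flags
    if version == 1:
        _created, _modified, timescale, duration = struct.unpack_from(">QQIQ", payload, 4)
    else:
        _created, _modified, timescale, duration = struct.unpack_from(">IIII", payload, 4)
    seconds = duration / timescale if timescale else 0.0
    return {"timescale": timescale, "duration": duration, "seconds": seconds}

# Hypothetical version-0 payload: timescale 1000, duration 60500 units.
payload = bytes(4) + struct.pack(">IIII", 0, 0, 1000, 60500)
print(parse_mvhd_times(payload))  # {'timescale': 1000, 'duration': 60500, 'seconds': 60.5}
```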
trak (Track Box)
trak is a container box and a sub-box of moov. A moov box can contain one or more trak boxes, and each trak holds the metadata for one track.
tkhd (Track Header Box)
tkhd is a sub-box of trak and is a full box. It holds the overall information of the track, such as its duration and the image width and height. The default value of the flags field is 7, which is the bitwise OR of track_enabled (0x000001), track_in_movie (0x000002), and track_in_preview (0x000004). The flags can take the following values:
- track_enabled: value 0x000001. When this bit is set the track is enabled; when it is clear the track is not enabled.
- track_in_movie: value 0x000002, indicating that the track is used during playback.
- track_in_preview: value 0x000004, indicating that the track is used in preview mode.
- track_size_is_aspect_ratio: value 0x000008, indicating that the width and height are not expressed in pixels
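The flag values listed above combine with a bitwise OR and can be tested with a bitwise AND. A small sketch, where `TRACK_FLAGS` and `decode_tkhd_flags` are names invented for this example:

```python
# Flag bits as listed in the text above.
TRACK_FLAGS = {
    0x000001: "track_enabled",
    0x000002: "track_in_movie",
    0x000004: "track_in_preview",
    0x000008: "track_size_is_aspect_ratio",
}

def decode_tkhd_flags(flags: int):
    """Return the names of all flag bits set in a tkhd flags value."""
    return [name for bit, name in TRACK_FLAGS.items() if flags & bit]

print(0x000001 | 0x000002 | 0x000004)  # 7
print(decode_tkhd_flags(7))  # ['track_enabled', 'track_in_movie', 'track_in_preview']
```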
mdia (Media Box)
mdia is a sub-box of trak. It is a container box, so it has only a box header; its data part consists of sub-boxes, which need further parsing
mdhd (Media Header Box)
mdhd is a sub-box of mdia and is a full box. It mainly defines overall, media-independent information. The two fields of most interest are timescale and duration, which give the time scale and the duration of the track
hdlr (Handler Reference Box)
hdlr is a sub-box of mdia and is a full box. Its handler_type field identifies the type of the track.
aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
    unsigned int(32) pre_defined = 0;
    unsigned int(32) handler_type;
    const unsigned int(32)[3] reserved = 0;
    string name;
}
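Following the HandlerBox definition above, the handler type sits at a fixed position in the payload: 4 bytes of version and flags, 4 bytes of pre_defined, then the 4-byte handler_type, 12 reserved bytes, and a null-terminated name. The sketch below uses a made-up function name `parse_hdlr` and a synthetic payload.

```python
def parse_hdlr(payload: bytes):
    """Return (handler_type, name) from an hdlr payload (bytes after the box header)."""
    handler_type = payload[8:12].decode("latin-1")   # skip version/flags and pre_defined
    raw_name = payload[24:].split(b"\x00", 1)[0]     # name follows 12 reserved bytes
    return handler_type, raw_name.decode("utf-8", errors="replace")

# Synthetic payload for a video track.
payload = bytes(4) + bytes(4) + b"vide" + bytes(12) + b"VideoHandler\x00"
print(parse_hdlr(payload))  # ('vide', 'VideoHandler')
```

Typical handler types are 'vide' for video, 'soun' for audio, and 'hint' for hint tracks.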
minf (Media Information Box)
minf is a sub-box of mdia. It is a container box that holds all the key information of the track, in the vmhd, dinf, and stbl boxes
vmhd (Video Media Header Box)
vmhd is a sub-box of minf and is a full box. It defines the graphics mode (graphicsmode) and the operation color (opcolor) of the video
aligned(8) class VideoMediaHeaderBox extends FullBox('vmhd', version = 0, 1) {
    template unsigned int(16) graphicsmode = 0;
    template unsigned int(16)[3] opcolor = {0, 0, 0};
}
dinf (Data Information Box)
dinf is a container box that describes how to locate the media data. It usually contains one dref (Data Reference Box), and dref contains one or more 'url ' or 'urn ' boxes. These boxes form a table used to locate the track data. Put simply, a track can be split into several segments, and the data of each segment is fetched from the address given by its 'url ' or 'urn ' entry. The sample description refers to these entries by index to assemble the complete track. Typically, when the data is contained entirely in the file itself, the location string in the 'url ' or 'urn ' entry is empty.
As a container box, dinf has only size and type in its header; the interesting part is its sub-box dref.
dref (Data Reference Box)
dref is a full box
aligned(8) class DataReferenceBox extends FullBox('dref', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        DataEntryBox(entry_version, entry_flags) data_entry;
    }
}
url / urn
'url ' and 'urn ' are full boxes. In the binary data of this file the location field is empty, so the media data of the track is contained entirely in the file itself.
aligned(8) class DataEntryUrlBox (bit(24) flags) extends FullBox('url ', version = 0, flags) {
    string location;
}
aligned(8) class DataEntryUrnBox (bit(24) flags) extends FullBox('urn ', version = 0, flags) {
    string name;
    string location;
}
stbl (Sample Table Box)
stbl contains the offset, PTS, duration, and other information for each sample in the media stream. It is a container box whose sub-boxes include the Sample Description Box (stsd), Time to Sample Box (stts), Composition Time to Sample Box (ctts), Sync Sample Box (stss), Sample to Chunk Box (stsc), Sample Size Box (stsz or stz2), and Chunk Offset Box (stco or co64). To find the data of a sample inside mdat, you need to derive the sample's offset from these tables
Here we only look at the header; the data part consists of sub-boxes, which are analyzed in the following sections.
stsd (Sample Description Box)
The sample description table is a full box and also a container box. It provides detailed information about the coding type used and any initialization information that the codec needs. The codec-specific initialization information is stored in its sub-boxes.
aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type) extends FullBox('stsd', 0, 0) {
    int i;
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        switch (handler_type) {
            case 'soun': // audio track
                AudioSampleEntry();
                break;
            case 'vide': // video track
                VisualSampleEntry();
                break;
            case 'hint': // hint track
                HintSampleEntry();
                break;
        }
    }
}
avc1
The video data in this MP4 is stored in avc1 format, in which NAL units are delimited by fixed-size length fields. The box contains the width and height of the video.
aligned(8) abstract class SampleEntry (unsigned int(32) format) extends Box(format){
const unsigned int(8)[6] reserved = 0;
unsigned int(16) data_reference_index;
}
class VisualSampleEntry(codingname) extends SampleEntry (codingname){
unsigned int(16) pre_defined = 0;
const unsigned int(16) reserved = 0;
unsigned int(32)[3] pre_defined = 0;
unsigned int(16) width;
unsigned int(16) height;
template unsigned int(32) horizresolution = 0x00480000; // 72 dpi
template unsigned int(32) vertresolution = 0x00480000; // 72 dpi
const unsigned int(32) reserved = 0;
template unsigned int(16) frame_count = 1;
string[32] compressorname;
template unsigned int(16) depth = 0x0018;
int(16) pre_defined = -1;
// other boxes from derived specifications
CleanApertureBox clap; // optional
PixelAspectRatioBox pasp; // optional
}
class AVCSampleEntry() extends VisualSampleEntry ('avc1'){
AVCConfigurationBox config;
MPEG4BitRateBox (); // optional
MPEG4ExtensionDescriptorsBox (); // optional
}
avcC
avcC is a sub-box of avc1. It contains the actual SPS and PPS data along with other video codec parameters
class AVCConfigurationBox extends Box('avcC') {
    AVCDecoderConfigurationRecord() AVCConfig;
}
aligned(8) class AVCDecoderConfigurationRecord {
    unsigned int(8) configurationVersion = 1;
    unsigned int(8) AVCProfileIndication;
    unsigned int(8) profile_compatibility;
    unsigned int(8) AVCLevelIndication;
    bit(6) reserved = '111111'b;
    unsigned int(2) lengthSizeMinusOne;
    bit(3) reserved = '111'b;
    unsigned int(5) numOfSequenceParameterSets;
    for (i=0; i< numOfSequenceParameterSets; i++) {
        unsigned int(16) sequenceParameterSetLength;
        bit(8*sequenceParameterSetLength) sequenceParameterSetNALUnit;
    }
    unsigned int(8) numOfPictureParameterSets;
    for (i=0; i< numOfPictureParameterSets; i++) {
        unsigned int(16) pictureParameterSetLength;
        bit(8*pictureParameterSetLength) pictureParameterSetNALUnit;
    }
    if( profile_idc == 100 || profile_idc == 110 || profile_idc == 122 || profile_idc == 144 ) {
        bit(6) reserved = '111111'b;
        unsigned int(2) chroma_format;
        bit(5) reserved = '11111'b;
        unsigned int(3) bit_depth_luma_minus8;
        bit(5) reserved = '11111'b;
        unsigned int(3) bit_depth_chroma_minus8;
        unsigned int(8) numOfSequenceParameterSetExt;
        for (i=0; i< numOfSequenceParameterSetExt; i++) {
            unsigned int(16) sequenceParameterSetExtLength;
            bit(8*sequenceParameterSetExtLength) sequenceParameterSetExtNALUnit;
        }
    }
}
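The record above can be read with a few byte operations. The sketch below covers only the fields up to the PPS list and ignores the optional high-profile extension; `parse_avcc` is a name chosen for this example, and the payload is a synthetic one with shortened, non-decodable SPS and PPS bytes.

```python
import struct

def parse_avcc(payload: bytes):
    """Parse the start of an AVCDecoderConfigurationRecord (the avcC payload)."""
    _version, profile, _compat, level = payload[0:4]
    nalu_length_size = (payload[4] & 0x03) + 1   # lengthSizeMinusOne is the low 2 bits
    sps_count = payload[5] & 0x1F                # low 5 bits after 3 reserved bits
    pos = 6
    sps = []
    for _ in range(sps_count):
        (length,) = struct.unpack_from(">H", payload, pos)
        sps.append(payload[pos + 2:pos + 2 + length])
        pos += 2 + length
    pps_count = payload[pos]
    pos += 1
    pps = []
    for _ in range(pps_count):
        (length,) = struct.unpack_from(">H", payload, pos)
        pps.append(payload[pos + 2:pos + 2 + length])
        pos += 2 + length
    return {"profile": profile, "level": level,
            "nalu_length_size": nalu_length_size, "sps": sps, "pps": pps}

# Synthetic record: profile 100, level 31, 4-byte NALU lengths, one SPS, one PPS.
payload = (bytes([1, 0x64, 0x00, 0x1F, 0xFF, 0xE1])
           + struct.pack(">H", 3) + b"\x67\x64\x00"
           + bytes([1]) + struct.pack(">H", 2) + b"\x68\xEE")
info = parse_avcc(payload)
print(info["profile"], info["level"], info["nalu_length_size"])  # 100 31 4
```

The `nalu_length_size` value is what a reader needs later to split each sample in mdat into NAL units.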
stts (Decoding Time to Sample Box)
stts is a sub-box of stbl and is a full box. The table maps samples to their DTS (decoding timestamp), so the DTS of every sample can be calculated from it. To save space, it does not record a DTS for each frame; instead, consecutive samples with the same duration are grouped into one entry.
aligned(8) class TimeToSampleBox extends FullBox('stts', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_count;
        unsigned int(32) sample_delta;
    }
}
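To recover per-sample DTS values, the grouped entries are expanded and the deltas accumulated, starting from 0. A minimal sketch, with `dts_from_stts` as an illustrative name and made-up entries:

```python
def dts_from_stts(entries):
    """entries: list of (sample_count, sample_delta). Returns the DTS of each
    sample in timescale units, with the first sample at 0."""
    dts, current = [], 0
    for sample_count, sample_delta in entries:
        for _ in range(sample_count):
            dts.append(current)
            current += sample_delta
    return dts

# Hypothetical table: three samples lasting 512 units, then two lasting 1024.
print(dts_from_stts([(3, 512), (2, 1024)]))  # [0, 512, 1024, 1536, 2560]
```

Dividing each value by the timescale from mdhd converts it to seconds.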
stss (Sync Sample Box)
stss is a sub-box of stbl and is a full box. It lists the sample numbers of the keyframes among all frames. Sample numbers start at 1, so each entry is the frame's array index plus 1. Audio tracks do not have this box
aligned(8) class SyncSampleBox extends FullBox('stss', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i=0; i < entry_count; i++) {
        unsigned int(32) sample_number;
    }
}
ctts (Composition Time to Sample Box)
ctts is a full box that stores the difference between PTS and DTS. If the stream has no B-frames, the box is absent. Its storage format is similar to that of stts, except that the stored value is the difference between the PTS and DTS of the sample instead of the sample duration. The PTS of each sample can be calculated from this table together with stts:
PTS(n) = DTS(n) + offset(n)
aligned(8) class CompositionOffsetBox extends FullBox('ctts', version, 0) {
    unsigned int(32) entry_count;
    int i;
    if (version==0) {
        for (i=0; i < entry_count; i++) {
            unsigned int(32) sample_count;
            unsigned int(32) sample_offset;
        }
    } else if (version == 1) {
        for (i=0; i < entry_count; i++) {
            unsigned int(32) sample_count;
            signed int(32) sample_offset;
        }
    }
}
Note: only part of the data is shown
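The PTS relation can be checked with a short calculation that expands both run-length tables and adds them sample by sample. The helper names `expand_runs` and `pts_from_tables` and the four-frame example (decode order I, P, B, B) are assumptions made for this sketch.

```python
def expand_runs(entries):
    """Expand [(count, value), ...] into one value per sample."""
    return [value for count, value in entries for _ in range(count)]

def pts_from_tables(stts_entries, ctts_entries):
    """Compute PTS(n) = DTS(n) + offset(n) for every sample."""
    deltas = expand_runs(stts_entries)     # per-sample durations from stts
    offsets = expand_runs(ctts_entries)    # per-sample PTS-DTS offsets from ctts
    pts, dts = [], 0
    for delta, offset in zip(deltas, offsets):
        pts.append(dts + offset)
        dts += delta
    return pts

# Hypothetical stream in decode order I, P, B, B, each frame lasting 512 units.
stts = [(4, 512)]
ctts = [(1, 1024), (1, 2048), (2, 512)]
print(pts_from_tables(stts, ctts))  # [1024, 2560, 1536, 2048]
```

Sorting the samples by the resulting PTS gives the display order I, B, B, P.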
stsc (Sample To Chunk Box)
This table records how many samples each chunk contains. Consecutive chunks that contain the same number of samples are grouped into one entry. Each entry has the following three fields:
first_chunk: index of the first chunk in the current entry, starting from 1
samples_per_chunk: the number of samples in each chunk of the entry
sample_description_index: index of the sample description used by these samples. The value ranges from 1 to the number of entries in the sample description table
Here’s an example:
The first chunk contains 10 samples, the second chunk contains 10 samples, and the third chunk contains 12 samples
The first and second chunks have the same number of samples, so they go into one entry whose first_chunk is 1. The third chunk has a different number of samples, so it forms another entry; its first_chunk is 3 because the index of the first chunk in that entry is 3.
The values of each entry are shown in the following table:
first_chunk | samples_per_chunk | sample_description_index |
---|---|---|
1 | 10 | 1 |
3 | 12 | 1 |
aligned(8) class SampleToChunkBox extends FullBox('stsc', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) first_chunk;
        unsigned int(32) samples_per_chunk;
        unsigned int(32) sample_description_index;
    }
}
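The compressed table can be expanded back into one sample count per chunk. The sketch below uses the example from the text (10, 10, 12); `samples_per_chunk` is an illustrative function name, and the total number of chunks is assumed to come from the entry_count of the chunk offset box.

```python
def samples_per_chunk(stsc_entries, chunk_count):
    """stsc_entries: list of (first_chunk, samples_per_chunk, sample_description_index).
    Returns the number of samples in each chunk, in chunk order."""
    counts = []
    for i, (first_chunk, per_chunk, _desc_index) in enumerate(stsc_entries):
        # An entry applies until the next entry's first_chunk, or until the last chunk.
        if i + 1 < len(stsc_entries):
            next_first = stsc_entries[i + 1][0]
        else:
            next_first = chunk_count + 1
        counts.extend([per_chunk] * (next_first - first_chunk))
    return counts

# The two entries from the table above, for a track with 3 chunks.
print(samples_per_chunk([(1, 10, 1), (3, 12, 1)], chunk_count=3))  # [10, 10, 12]
```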
stsz (Sample Size Box)
This box gives the size of each sample in bytes. It contains the total number of samples in the media and a table with the size of each sample, so it is a relatively large box. When sample_size is non-zero, all samples share that size and no table is stored.
aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
    unsigned int(32) sample_size;
    unsigned int(32) sample_count;
    if (sample_size==0) {
        for (i=1; i <= sample_count; i++) {
            unsigned int(32) entry_size;
        }
    }
}
Note: only part of the data is shown
stco (Chunk Offset Box)
stco is a full box. It stores the chunk offsets, that is, the position of each chunk in the file. With a chunk's offset in hand, we can use the relationships between the other tables to locate and read each sample in that chunk.
The chunk offset box comes in two forms. If the video is very large, a chunk offset may exceed the 32-bit limit, so there is an additional co64 box for large files. It serves the same purpose as stco and also indicates where samples are located in the mdat box, but its chunk_offset is 64 bits wide.
aligned(8) class ChunkOffsetBox extends FullBox('stco', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i=1; i <= entry_count; i++) {
        unsigned int(32) chunk_offset;
    }
}
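Putting the tables together, the file offset of each sample is the offset of its chunk plus the sizes of the samples that precede it in the same chunk. The sketch below is one straightforward way to do this; `sample_offsets` is a made-up name, and apart from the offset 48 and size 766 used later in this article, the numbers are invented.

```python
def sample_offsets(chunk_offsets, stsc_entries, sample_sizes):
    """Return the file offset of every sample.
    chunk_offsets: values from stco/co64, in chunk order
    stsc_entries:  (first_chunk, samples_per_chunk, sample_description_index), sorted
    sample_sizes:  per-sample sizes from stsz
    """
    offsets = []
    sample_index = 0
    for chunk_number, chunk_offset in enumerate(chunk_offsets, start=1):
        # The applicable stsc entry is the last one whose first_chunk <= chunk_number.
        per_chunk = 0
        for first_chunk, count, _desc_index in stsc_entries:
            if first_chunk <= chunk_number:
                per_chunk = count
        position = chunk_offset
        for _ in range(per_chunk):
            if sample_index >= len(sample_sizes):
                break
            offsets.append(position)
            position += sample_sizes[sample_index]
            sample_index += 1
    return offsets

# Two chunks with two samples each; the first sample is 766 bytes at offset 48.
print(sample_offsets([48, 2000], [(1, 2, 1)], [766, 100, 50, 60]))  # [48, 814, 2000, 2050]
```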
mdat (Media Data Box)
The mdat box stores the actual media data. In mp4Parse you can see that mdat starts at offset 40.
Field | Bytes | Data | Value | Meaning |
---|---|---|---|---|
Box size | 4 | 00 42 1D 4D | 4332877 | The box size |
Box type | 4 | 6D 64 61 74 | mdat | ASCII value |
Data | | | | |
How do we find the data of a given sample? Let’s try it out.
From stco we know that the offset of the first sample is 48, and from stsz we know that the first sample is 766 bytes long. So the 766 bytes starting at offset 48, which begin with 00 00 02 7F, are the first frame of the video.
This video stores avc1 data, which has the following format:
avc1 data = NALU length + NALU header + NALU data ... NALU length + NALU header + NALU data
The lengthSizeMinusOne field in avcC gives the size in bytes of the NALU length field, minus one; here the length field is 4 bytes. The value of the length field is the length of the NAL unit that follows (NALU header length + NALU data length).
So the length of the first NAL unit is 00 00 02 7F, which is 639 bytes. But the size of the first sample mentioned above is 766. Why the mismatch? Let’s see.
The first NAL unit is 639 bytes long. Adding the initial offset of 48 and the 4 bytes of the NALU length field, the second NAL unit starts at 48 + 4 + 639 = 691. At offset 691 (0x2B3) we find 00 00 00 77, which is 119. (4 + 119) + (4 + 639) = 766. We can conclude that the first sample contains two NAL units.
In the same way, you can find the data corresponding to each sample
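The walkthrough above can be reproduced in code by repeatedly reading a length field and skipping that many bytes. The sketch below builds a synthetic 766-byte sample with the two NAL lengths found above (639 and 119) and zero-filled payloads; `split_nal_units` is an illustrative name, and the 4-byte length size is the value used in this article's file.

```python
def split_nal_units(sample: bytes, length_size: int = 4):
    """Split a length-prefixed (avc1-style) sample into its NAL units."""
    units, pos = [], 0
    while pos + length_size <= len(sample):
        # Big-endian NALU length, covering NALU header + NALU data.
        nal_length = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        units.append(sample[pos:pos + nal_length])
        pos += nal_length
    return units

# Synthetic sample: length 639 (00 00 02 7F) + payload, length 119 (00 00 00 77) + payload.
sample = (639).to_bytes(4, "big") + bytes(639) + (119).to_bytes(4, "big") + bytes(119)
units = split_nal_units(sample)
print(len(sample), [len(u) for u in units])  # 766 [639, 119]
```

On real data, `sample` would be the bytes read from the file at the sample's offset with the size given by stsz.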
Conclusion
This article analyzed a real MP4 file to become familiar with the structure of MP4. For beginners, working through a concrete file like this makes the format easier to learn.