This article analyzes a real MP4 file layer by layer to understand how an MP4 file is put together.

We used mp4Parse, an online MP4 parsing tool, for analysis and validation.

Terminology

Before parsing the MP4 container format, we need to define the terms involved:

Box: Also known as an atom, the basic unit of an MP4 file. It is an object-oriented building block defined by a unique type identifier (type) and a length (size), and it consists of a header and a data section
Full box: An extension of a box whose header adds version (1 byte) and flags (3 bytes)
Container box: A box whose data section contains other boxes
Header: The box header, containing size and type and, for a full box, version (1 byte) and flags (3 bytes)
Data: The actual data of the box, which can be concrete data or nested boxes
Sample: The basic unit of media data. A video sample is one frame or a group of consecutive video frames. An audio sample is a contiguous piece of compressed audio that contains a certain number of audio samples
Chunk: A collection of consecutive samples of the same media type. The number of samples per chunk is not fixed; it can be the same or differ from chunk to chunk
Track: A complete media resource with continuous timestamps. An MP4 file usually contains an audio track and a video track

The box structure can be seen below:

(Figure: box structure. Image source: "Detailed description of MP4 file format – Structure overview".)

MP4 structure

MP4 files are made up of boxes, which can be viewed as the basic unit of an MP4 file. For the full list of boxes, see Table 1 in the ISO_IEC_14496-12 document, or refer to this article.
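As a rough illustration of the box layout described above, the following Python sketch walks the boxes in a byte buffer. The function name iter_boxes and the tiny synthetic buffer are assumptions made for this example; the handling of size 1 (64-bit size follows) and size 0 (box runs to the end) follows the box header rules in ISO/IEC 14496-12.

```python
import struct

def iter_boxes(buf, start=0, end=None):
    """Yield (type, payload_start, box_end) for each box found in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        # Every box starts with a 32-bit big-endian size and a 4-byte type.
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        header_len = 8
        if size == 1:
            # size == 1 means a 64-bit "largesize" follows the type field.
            if pos + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header_len = 16
        elif size == 0:
            # size == 0 means the box runs to the end of the enclosing range.
            size = end - pos
        if size < header_len or pos + size > end:
            break  # truncated or malformed: stop instead of guessing
        yield box_type.decode("latin-1"), pos + header_len, pos + size
        pos += size

# Synthetic buffer (not the article's file): a 16-byte ftyp and an empty free box.
demo = struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 512) + struct.pack(">I4s", 8, b"free")
boxes = list(iter_boxes(demo))
print(boxes)
```

A container box can be walked by calling iter_boxes again with the payload_start and box_end of that box. For a full box, the first 4 bytes of the payload are version and flags.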

Example analysis

Let's analyze each box using a concrete MP4 file. Each box is introduced through its binary data, its box structure, and a field-parsing table. The video used for the analysis can be downloaded here.

ftyp (File Type Box)

ftyp is the signature box of an MP4 file. There is exactly one in the file, and it is placed at the beginning. It records the specification (brand) that the file follows and its version number.
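To make the ftyp contents concrete, here is a minimal sketch that decodes an ftyp payload (the bytes after the 8-byte box header). The field names major_brand, minor_version, and compatible_brands come from ISO/IEC 14496-12; the function name parse_ftyp and the sample bytes are illustrative assumptions, not data taken from the analyzed file.

```python
import struct

def parse_ftyp(payload):
    """Decode an ftyp payload: major brand, minor version, compatible brands."""
    major_brand, minor_version = struct.unpack_from(">4sI", payload, 0)
    # The rest of the payload is a list of 4-byte compatible brands.
    compatible = [
        payload[i:i + 4].decode("latin-1")
        for i in range(8, len(payload) - 3, 4)
    ]
    return {
        "major_brand": major_brand.decode("latin-1"),
        "minor_version": minor_version,
        "compatible_brands": compatible,
    }

# Synthetic payload for illustration only.
payload = b"isom" + struct.pack(">I", 512) + b"isomiso2avc1mp41"
info = parse_ftyp(payload)
print(info)
```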

moov (Movie Box)

moov is a container box: its data is stored in child boxes. There is exactly one per file. It mainly stores the MP4 metadata, such as the audio and video duration and the timescale. For on-demand playback, moov is placed before mdat, right after ftyp, so that the file can start playing quickly. When the file is written while recording, moov is usually placed after mdat.

mvhd (Movie Header Box)

mvhd is a child box of moov. It defines overall information about the whole presentation, such as file creation and modification time, duration, and timescale.

trak (Track Box)

trak is a container box and a child box of moov. A moov can contain one or more trak boxes. Each trak contains the metadata for one track.

tkhd (Track Header Box)

tkhd is a child box of trak and is a full box. It holds the track's overall information, such as duration and image width and height. The flags field of the header defaults to 7, which is the bitwise OR of track_enabled (0x000001), track_in_movie (0x000002), and track_in_preview (0x000004). The flags can take the following values:

track_enabled: the value is 0x000001, indicating that the track is enabled. If the bit is not set (0x000000), the track is not enabled.

track_in_movie: the value is 0x000002, indicating that the track is used during playback.

track_in_preview: the value is 0x000004, indicating that the track is used in preview mode.

track_size_is_aspect_ratio: the value is 0x000008, indicating that width and height are not expressed in pixels but describe an aspect ratio.
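The bitwise combination of the tkhd flags can be sketched as follows. The constant values are the ones listed above; the helper names split_version_flags and decode_tkhd_flags are made up for this example.

```python
# Flag bits of the tkhd box, as listed above.
TKHD_FLAGS = [
    ("track_enabled", 0x000001),
    ("track_in_movie", 0x000002),
    ("track_in_preview", 0x000004),
    ("track_size_is_aspect_ratio", 0x000008),
]

def split_version_flags(word):
    """Split the 32-bit full-box word into version (1 byte) and flags (3 bytes)."""
    return word >> 24, word & 0xFFFFFF

def decode_tkhd_flags(flags):
    """Return the names of all flag bits that are set."""
    return [name for name, bit in TKHD_FLAGS if flags & bit]

version, flags = split_version_flags(0x00000007)
print(version, decode_tkhd_flags(flags))
```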

mdia (Media Box)

mdia is a child box of trak. It is a container box, so it has only a box header; its data section consists of child boxes, which need further parsing.

mdhd (Media Header Box)

mdhd is a child box of mdia and is a full box. It defines overall, media-independent information about the track. The two fields of most interest are timescale (the number of time units per second for this track) and duration (the track's length expressed in those units).
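The relationship between timescale and duration is simply seconds = duration / timescale. The sketch below reads both fields from an mdhd payload, assuming the field layout from ISO/IEC 14496-12 (32-bit times in version 0, 64-bit creation time, modification time, and duration in version 1). The function name parse_mdhd and the sample numbers are hypothetical, not values from the analyzed file.

```python
import struct

def parse_mdhd(payload):
    """Return (timescale, duration) from an mdhd payload (after the box header)."""
    version = payload[0]  # first byte of the full-box version/flags word
    if version == 1:
        _created, _modified, timescale, duration = struct.unpack_from(">QQIQ", payload, 4)
    else:
        _created, _modified, timescale, duration = struct.unpack_from(">IIII", payload, 4)
    return timescale, duration

# Synthetic version-0 payload: version/flags, creation, modification,
# timescale = 90000, duration = 450000, then language and pre_defined.
payload = struct.pack(">IIIII", 0, 0, 0, 90000, 450000) + bytes(4)
timescale, duration = parse_mdhd(payload)
seconds = duration / timescale
print(timescale, duration, seconds)
```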

hdlr (Handler Reference Box)

hdlr is a child box of mdia and is a full box. Its handler_type field identifies the type of the track, for example 'vide' for video, 'soun' for audio, and 'hint' for a hint track.

aligned(8) class HandlerBox extends FullBox('hdlr', version = 0, 0) {
    unsigned int(32) pre_defined = 0;
    unsigned int(32) handler_type;
    const unsigned int(32)[3] reserved = 0;
    string name;
}

minf (Media Information Box)

minf is a child box of mdia. It is a container box that holds all the key information about the track, including the vmhd, dinf, and stbl boxes.

vmhd (Video Media Header Box)

vmhd is a child box of minf and is a full box. It defines the graphics (composition) mode and the color used by it (opcolor).

aligned(8) class VideoMediaHeaderBox extends FullBox('vmhd', version = 0, 1) {
    template unsigned int(16) graphicsmode = 0;
    template unsigned int(16)[3] opcolor = {0, 0, 0};
}

dinf (Data Information Box)

dinf is a container box that describes how to locate the media data. It usually contains one dref (Data Reference Box). dref contains one or more "url" or "urn" entries, which together form a table used to locate the track data. Put simply, a track can be split into several segments, and each segment fetches its data from the address given by a "url" or "urn" entry. The sample description refers to these entries by index, and together they make up the complete track. When the data is fully contained in the file itself, the location string in the "url" or "urn" entry is typically empty.

Because dinf is a container box, it holds only size and type; the substance is in its child box dref.

dref (Data Reference Box)

dref is a full box.

aligned(8) class DataReferenceBox extends FullBox('dref', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        DataEntryBox(entry_version, entry_flags) data_entry;
    }
}

url / urn

Both url and urn are full boxes. In the binary data, the location field is empty, so the data of the current track is fully contained in this file.

aligned(8) class DataEntryUrlBox (bit(24) flags) extends FullBox('url ', version = 0, flags) {
    string location;
}

aligned(8) class DataEntryUrnBox (bit(24) flags) extends FullBox('urn ', version = 0, flags) {
    string name;
    string location;
}

stbl (Sample Table Box)

stbl contains the offset, PTS, duration, and other information for each sample in the media stream. It is a container box whose child boxes include the Sample Description Box (stsd), Time to Sample Box (stts), Composition Time to Sample Box (ctts), Sample to Chunk Box (stsc), Sample Size Box (stsz or stz2), and Chunk Offset Box (stco or co64). To find a sample's data in mdat, you first need to derive the sample's offset from these tables.

Here we look only at the header. The data section consists of child boxes, which are analyzed next.

stsd (Sample Description Box)

The sample description table is a full box and also a container box. It describes the coding type in use and any initialization information that the codec needs. The codec-specific initialization information is stored in child boxes.

aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type)
    extends FullBox('stsd', 0, 0) {
    int i;
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        switch (handler_type) {
            case 'soun': // audio track
                AudioSampleEntry();
                break;
            case 'vide': // video track
                VisualSampleEntry();
                break;
            case 'hint': // hint track
                HintSampleEntry();
                break;
        }
    }
}

avc1

The video data in this MP4 uses the avc1 sample entry: H.264 video whose NAL units (NALUs) are delimited by fixed-size length fields. The box also contains the video's width and height.

aligned(8) abstract class SampleEntry (unsigned int(32) format) extends Box(format){
     const unsigned int(8)[6] reserved = 0;
     unsigned int(16) data_reference_index; 
} 

 

class VisualSampleEntry(codingname) extends SampleEntry (codingname){    
    unsigned int(16) pre_defined = 0;    
    const unsigned int(16) reserved = 0;    
    unsigned int(32)[3]  pre_defined = 0;    
    unsigned   int(16)   width;      
    unsigned   int(16)   height;      
    template unsigned int(32)  horizresolution = 0x00480000; // 72 dpi    
    template unsigned int(32)  vertresolution  = 0x00480000; // 72 dpi    
    const unsigned int(32)  reserved = 0;    
    template unsigned int(16)  frame_count = 1;    
    string[32]   compressorname;      
    template unsigned int(16)  depth = 0x0018;    
    int(16)  pre_defined = -1;    
    // other boxes from derived specifications    
    CleanApertureBox      clap;      //   optional      
    PixelAspectRatioBox   pasp;      //   optional   
}

class AVCSampleEntry() extends VisualSampleEntry ('avc1'){ 
    AVCConfigurationBox config; 
    MPEG4BitRateBox (); // optional 
    MPEG4ExtensionDescriptorsBox (); // optional 
}

avcC

avcC is a child box of avc1. It carries the actual SPS and PPS along with the other video codec parameters.

class AVCConfigurationBox extends Box('avcC') {
    AVCDecoderConfigurationRecord() AVCConfig;
}

aligned(8) class AVCDecoderConfigurationRecord {
    unsigned int(8) configurationVersion = 1;
    unsigned int(8) AVCProfileIndication;
    unsigned int(8) profile_compatibility;
    unsigned int(8) AVCLevelIndication;
    bit(6) reserved = '111111'b;
    unsigned int(2) lengthSizeMinusOne;
    bit(3) reserved = '111'b;
    unsigned int(5) numOfSequenceParameterSets;
    for (i = 0; i < numOfSequenceParameterSets; i++) {
        unsigned int(16) sequenceParameterSetLength;
        bit(8*sequenceParameterSetLength) sequenceParameterSetNALUnit;
    }
    unsigned int(8) numOfPictureParameterSets;
    for (i = 0; i < numOfPictureParameterSets; i++) {
        unsigned int(16) pictureParameterSetLength;
        bit(8*pictureParameterSetLength) pictureParameterSetNALUnit;
    }
    if (profile_idc == 100 || profile_idc == 110 ||
        profile_idc == 122 || profile_idc == 144) {
        bit(6) reserved = '111111'b;
        unsigned int(2) chroma_format;
        bit(5) reserved = '11111'b;
        unsigned int(3) bit_depth_luma_minus8;
        bit(5) reserved = '11111'b;
        unsigned int(3) bit_depth_chroma_minus8;
        unsigned int(8) numOfSequenceParameterSetExt;
        for (i = 0; i < numOfSequenceParameterSetExt; i++) {
            unsigned int(16) sequenceParameterSetExtLength;
            bit(8*sequenceParameterSetExtLength) sequenceParameterSetExtNALUnit;
        }
    }
}
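A minimal sketch of reading the first part of this record in Python is shown below. It covers only the fields up to the PPS list and ignores the optional high-profile extension. The function name parse_avcc and the sample bytes are assumptions for illustration; the SPS and PPS bytes are placeholders, not real parameter sets.

```python
import struct

def parse_avcc(payload):
    """Decode the leading fields of an AVCDecoderConfigurationRecord."""
    version, profile, compat, level, b4, b5 = struct.unpack_from(">6B", payload, 0)
    nal_length_size = (b4 & 0x03) + 1   # low 2 bits hold lengthSizeMinusOne
    num_sps = b5 & 0x1F                 # low 5 bits hold numOfSequenceParameterSets
    pos = 6
    sps_list = []
    for _ in range(num_sps):
        (n,) = struct.unpack_from(">H", payload, pos)
        sps_list.append(payload[pos + 2:pos + 2 + n])
        pos += 2 + n
    num_pps = payload[pos]
    pos += 1
    pps_list = []
    for _ in range(num_pps):
        (n,) = struct.unpack_from(">H", payload, pos)
        pps_list.append(payload[pos + 2:pos + 2 + n])
        pos += 2 + n
    return {
        "configurationVersion": version,
        "profile": profile,
        "profile_compatibility": compat,
        "level": level,
        "nal_length_size": nal_length_size,
        "sps": sps_list,
        "pps": pps_list,
    }

# Synthetic record: one 3-byte placeholder SPS and one 2-byte placeholder PPS.
record = (bytes([1, 100, 0, 31, 0xFF, 0xE1])
          + struct.pack(">H", 3) + b"\x67\x64\x00"
          + bytes([1]) + struct.pack(">H", 2) + b"\x68\xee")
cfg = parse_avcc(record)
print(cfg["nal_length_size"], len(cfg["sps"]), len(cfg["pps"]))
```

The nal_length_size value is what a reader needs later in order to split a sample in mdat into individual NALUs.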

stts (Decoding Time to Sample Box)

stts is a child box of stbl and is a full box. The table maps samples to their DTS (decoding timestamp), so the DTS of every sample can be computed from it. To save space, it does not record a DTS per frame. Instead, consecutive samples with the same duration are grouped into one entry.

aligned(8) class TimeToSampleBox extends FullBox('stts', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i = 0; i < entry_count; i++) {
        unsigned int(32) sample_count;
        unsigned int(32) sample_delta;
    }
}
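The run-length grouping in stts can be expanded back into one DTS per sample as sketched below. The function name dts_from_stts and the entry values are hypothetical; the result is in timescale units, with the first sample at DTS 0.

```python
def dts_from_stts(entries):
    """Expand (sample_count, sample_delta) entries into a DTS per sample."""
    dts_list = []
    dts = 0
    for sample_count, sample_delta in entries:
        for _ in range(sample_count):
            dts_list.append(dts)
            dts += sample_delta  # each sample lasts sample_delta time units
    return dts_list

# Hypothetical table: three samples of 1000 units, then two samples of 500 units.
dts_values = dts_from_stts([(3, 1000), (2, 500)])
print(dts_values)
```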

stss (Sync Sample Box)

stss is a child box of stbl and is a full box. It lists the sample numbers of all keyframes. Sample numbers are the array index plus 1, meaning that numbering starts at 1. The box does not appear in the audio track.

aligned(8) class SyncSampleBox extends FullBox('stss', version = 0, 0) {
    unsigned int(32) entry_count;
    int i;
    for (i = 0; i < entry_count; i++) {
        unsigned int(32) sample_number;
    }
}
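One common use of this table, offered here only as an illustration, is finding the nearest keyframe at or before a given sample. The sketch assumes the sample numbers are stored in increasing order; the function name keyframe_at_or_before and the sample numbers are hypothetical.

```python
from bisect import bisect_right

def keyframe_at_or_before(sync_samples, sample_number):
    """Return the last keyframe sample number <= sample_number, or None."""
    i = bisect_right(sync_samples, sample_number)
    return sync_samples[i - 1] if i else None

# Hypothetical stss contents: keyframes at samples 1, 31, and 61 (1-based).
sync_samples = [1, 31, 61]
print(keyframe_at_or_before(sync_samples, 45))
```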

ctts (Composition Time to Sample Box)

ctts is a full box that stores the difference between PTS and DTS. If the stream has no B-frames, the box is absent. The storage format is similar to that of stts, except that the stored value is the difference between the sample's PTS and DTS rather than the sample duration. The PTS of each sample can be computed from this table together with stts.

PTS(n) = DTS(n) + offset(n)

aligned(8) class CompositionOffsetBox extends FullBox('ctts', version, 0) {
    unsigned int(32) entry_count;
    int i;
    if (version == 0) {
        for (i = 0; i < entry_count; i++) {
            unsigned int(32) sample_count;
            unsigned int(32) sample_offset;
        }
    } else if (version == 1) {
        for (i = 0; i < entry_count; i++) {
            unsigned int(32) sample_count;
            signed int(32) sample_offset;
        }
    }
}
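The PTS formula can be applied per sample by expanding both run-length tables, as in this sketch. The function names expand_runs and pts_from_tables and the table values are made up for illustration; they model four frames in decode order I, P, B, B.

```python
def expand_runs(entries):
    """Expand (count, value) entries into one value per sample."""
    return [value for count, value in entries for _ in range(count)]

def pts_from_tables(stts_entries, ctts_entries):
    """Compute PTS(n) = DTS(n) + offset(n) for every sample."""
    deltas = expand_runs(stts_entries)
    offsets = expand_runs(ctts_entries)
    pts_list = []
    dts = 0
    for delta, offset in zip(deltas, offsets):
        pts_list.append(dts + offset)
        dts += delta
    return pts_list

# Hypothetical tables: every sample lasts 1000 units; offsets differ per frame.
stts_entries = [(4, 1000)]
ctts_entries = [(1, 1000), (1, 3000), (2, 0)]
pts_values = pts_from_tables(stts_entries, ctts_entries)
print(pts_values)
```

Sorting the samples by PTS gives the presentation order, which differs from the decode order whenever B-frames are present.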

Note: only part of the data is shown.

stsc (Sample To Chunk Box)

This table records how many samples each chunk contains. Consecutive chunks that contain the same number of samples are grouped into one entry. Each entry has the following three fields:

first_chunk: index of the first chunk in the current entry, starting at 1

samples_per_chunk: number of samples in each chunk of the entry

sample_description_index: index of the sample description that applies to these samples. The value ranges from 1 to the number of entries in the sample description table

Here's an example:

The first chunk contains 10 samples, the second chunk contains 10 samples, and the third chunk contains 12 samples.

The first and second chunks contain the same number of samples, so they are grouped into one entry with first_chunk equal to 1. The third chunk contains a different number of samples, so it forms another entry. The first_chunk of that entry is 3, because the first chunk it covers has index 3.

The values for each entry are shown in the following table:

first_chunk	samples_per_chunk	sample_description_index
1	10	1
3	12	1

aligned(8) class SampleToChunkBox extends FullBox('stsc', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        unsigned int(32) first_chunk;
        unsigned int(32) samples_per_chunk;
        unsigned int(32) sample_description_index;
    }
}
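The grouping in the example can be expanded back into a per-chunk sample count as sketched here. The function name samples_per_chunk_list is made up; chunk_count is assumed to come from the entry_count of stco, because stsc alone does not say where the last entry ends.

```python
def samples_per_chunk_list(stsc_entries, chunk_count):
    """Expand stsc entries into the number of samples in each chunk."""
    counts = []
    for i, (first_chunk, samples_per_chunk, _desc_index) in enumerate(stsc_entries):
        # An entry applies until the next entry's first_chunk (or the last chunk).
        if i + 1 < len(stsc_entries):
            next_first = stsc_entries[i + 1][0]
        else:
            next_first = chunk_count + 1
        counts.extend([samples_per_chunk] * (next_first - first_chunk))
    return counts

# The two entries from the table above, for a track with three chunks.
counts = samples_per_chunk_list([(1, 10, 1), (3, 12, 1)], chunk_count=3)
print(counts)
```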

stsz (Sample Size Box)

stsz records the size of each sample in bytes. It contains the total number of samples in the media and a table giving the size of each one. If sample_size is nonzero, all samples have that size and the table is omitted. This box is usually relatively large.

aligned(8) class SampleSizeBox extends FullBox('stsz', version = 0, 0) {
    unsigned int(32) sample_size;
    unsigned int(32) sample_count;
    if (sample_size == 0) {
        for (i = 1; i <= sample_count; i++) {
            unsigned int(32) entry_size;
        }
    }
}

Note: only part of the data is shown.

stco (Chunk Offset Box)

stco is a full box. It stores the chunk offsets, which give the position of each chunk in the file. With a chunk's offset in hand, each sample inside it can be located and read by combining the sizes and counts from the other tables.

The chunk offset box comes in two forms. If the video is very large, a chunk offset may exceed the 32-bit limit, so there is an additional co64 box for large files. It serves the same purpose as stco and also gives chunk positions in the mdat box, but its chunk_offset field is 64 bits wide.

aligned(8) class ChunkOffsetBox extends FullBox('stco', version = 0, 0) {
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        unsigned int(32) chunk_offset;
    }
}
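Putting the tables together, the file offset of every sample can be derived roughly as follows: start at the chunk offset and add the sizes of the samples that precede it in the same chunk. The function name sample_offsets and most of the numbers are hypothetical; only the first offset (48) and the first sample size (766) are the values read from the analyzed file.

```python
def sample_offsets(chunk_offsets, per_chunk_counts, sample_sizes):
    """Return the absolute file offset of each sample.

    chunk_offsets: offsets from stco or co64
    per_chunk_counts: number of samples in each chunk (derived from stsc)
    sample_sizes: size of each sample in bytes (from stsz)
    """
    offsets = []
    sample_index = 0
    for chunk_offset, count in zip(chunk_offsets, per_chunk_counts):
        pos = chunk_offset
        for _ in range(count):
            if sample_index >= len(sample_sizes):
                return offsets  # tables disagree; stop rather than guess
            offsets.append(pos)
            pos += sample_sizes[sample_index]  # samples in a chunk are contiguous
            sample_index += 1
    return offsets

# Two chunks: the first holds two samples, the second holds one.
offsets = sample_offsets([48, 5000], [2, 1], [766, 100, 300])
print(offsets)
```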

mdat (Media Data Box)

The mdat box stores the actual media data. In mp4Parse, you can see that mdat starts at offset 40.

Field	Bytes	Data	Value	Meaning
Box size	4	00 42 1D 4D	4332877	Size of the box
Box type	4	6D 64 61 74	mdat	ASCII value
Data

How do we find the data for a given sample? Let's try it out.

From stco, the offset of the first sample is 48. From stsz, the size of the first sample is 766 bytes. So

the data beginning with 00 00 02 7F at offset 48 is the first frame of the video.

This video stores avc1 data, which is laid out as follows:

avc1 data = NALU length + NALU header + NALU data ... NALU length + NALU header + NALU data

The lengthSizeMinusOne field in avcC gives the size in bytes of the NALU length field, minus one. The NALU length value itself counts the NALU header plus the NALU data.

So the length of the first NALU is 00 00 02 7F, which is 639 bytes. But the size of the first sample mentioned above is 766. Why the mismatch? Let's see.

The first NALU is 639 bytes long, the sample starts at offset 48, and the NALU length field takes 4 bytes, so the second NALU starts at 48 + 4 + 639 = 691. At offset 691 (0x2B3) we find 00 00 00 77, which is 119. (4 + 119) + (4 + 639) = 766, so the first sample contains two NALUs.

The data for every other sample can be found in the same way.
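The walk through the first sample can be reproduced with a short sketch. The function name split_nalus is made up, and the sample built here is synthetic: it has the same lengths as in the analysis (639 and 119 bytes behind 4-byte length fields) but zero-filled payloads instead of real video data.

```python
def split_nalus(sample, length_size=4):
    """Split one avc1 sample into NALUs using the length prefixes."""
    nalus = []
    pos = 0
    while pos + length_size <= len(sample):
        nalu_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        if pos + nalu_len > len(sample):
            break  # truncated sample; stop rather than read past the end
        nalus.append(sample[pos:pos + nalu_len])
        pos += nalu_len
    return nalus

# Synthetic 766-byte sample: (4 + 639) + (4 + 119) bytes.
sample = ((639).to_bytes(4, "big") + bytes(639)
          + (119).to_bytes(4, "big") + bytes(119))
lengths = [len(nalu) for nalu in split_nalus(sample)]
print(len(sample), lengths)
```

Here length_size corresponds to lengthSizeMinusOne + 1 from avcC.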

Conclusion

This article analyzed a real MP4 file to become familiar with the structure of MP4. For beginners, working through a concrete file is an easier way to get to know the format.
