FLV protocol Overview
Flash Video (FLV) is a streaming media format. Thanks to its small size and relatively simple protocol, it quickly became popular and is widely supported.
The widely used HTTP-FLV live streaming protocol transmits FLV-encapsulated audio and video data over HTTP. To understand HTTP-FLV, you first need to understand the FLV protocol itself.
In a nutshell, an FLV file consists of an FLV header and an FLV file body, and the FLV file body consists of a sequence of FLV tags.
FLV = FLV header + FLV file body
FLV file body = PreviousTagSize0 + Tag1 + PreviousTagSize1 + Tag2 + … + PreviousTagSizeN-1 + TagN + PreviousTagSizeN
FLV tags are divided into three types:
- Video Tag: stores video-related data;
- Audio Tag: stores audio-related data;
- Script Tag: stores audio and video metadata.
Before going into the details of the FLV protocol, here are the type notations used in the tables below:
Type | Definition |
---|---|
0x… | Hexadecimal data |
SI8 | Signed 8-bit integer |
SI16 | Signed 16-bit integer |
SI24 | Signed 24-bit integer |
SI32 | Signed 32-bit integer |
STRING | Sequence of Unicode 8-bit characters (UTF-8), terminated with 0x00 (unless otherwise specified) |
UI8 | Unsigned 8-bit integer |
UI16 | Unsigned 16-bit integer |
UI24 | Unsigned 24-bit integer |
UI32 | Unsigned 32-bit integer |
UB[n] | Bit field holding an unsigned n-bit integer |
xxx [ ] | An array of type xxx |
xxx [n] | An array of type xxx and length n |
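The 24-bit types have no direct counterpart in most languages, so a small helper is usually needed. The sketch below shows one possible way to read them in Python, assuming big-endian byte order (which this article notes in its closing section). The function names `read_ui24`, `read_si24` and `read_string` are made up for this illustration.

```python
def read_ui24(buf: bytes, offset: int = 0) -> int:
    """Read an unsigned 24-bit big-endian integer (UI24)."""
    return int.from_bytes(buf[offset:offset + 3], "big", signed=False)


def read_si24(buf: bytes, offset: int = 0) -> int:
    """Read a signed 24-bit big-endian integer (SI24)."""
    return int.from_bytes(buf[offset:offset + 3], "big", signed=True)


def read_string(buf: bytes, offset: int = 0) -> str:
    """Read a UTF-8 STRING that is terminated with 0x00."""
    end = buf.index(b"\x00", offset)  # raises ValueError if no terminator
    return buf[offset:end].decode("utf-8")
```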
FLV header
The FLV header consists of the following fields. Two points are worth noting:
- The first three bytes are always the characters F, L, V;
- The last four bytes (DataOffset) hold the header size, which is always 9 for FLV version 1.
Field | Type | Meaning |
---|---|---|
Signature | UI8 | Signature byte, always ‘F’ (0x46) |
Signature | UI8 | Signature byte, always ‘L’ (0x4C) |
Signature | UI8 | Signature byte, always ‘V’ (0x56) |
Version | UI8 | File version, e.g. 0x01 for FLV version 1 |
TypeFlagsReserved | UB[5] | Must be 0 |
TypeFlagsAudio | UB[1] | 1 = audio tags are present, 0 = no audio tags |
TypeFlagsReserved | UB[1] | Must be 0 |
TypeFlagsVideo | UB[1] | 1 = video tags are present, 0 = no video tags |
DataOffset | UI32 | Size of the FLV header, in bytes |
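To make the header layout concrete, here is a minimal parsing sketch. It assumes the 9-byte layout from the table above and big-endian byte order. The function name `parse_flv_header` and the dictionary keys are invented for this example and are not part of any library.

```python
import struct


def parse_flv_header(buf: bytes) -> dict:
    """Parse the 9-byte FLV header described in the table above."""
    if len(buf) < 9:
        raise ValueError("an FLV header needs at least 9 bytes")
    # 3-byte signature, UI8 version, UI8 flags, UI32 DataOffset (big-endian)
    signature, version, flags, data_offset = struct.unpack(">3sBBI", buf[:9])
    if signature != b"FLV":
        raise ValueError("signature is not 'FLV'")
    return {
        "version": version,
        "has_audio": bool(flags & 0x04),  # TypeFlagsAudio, bit 2
        "has_video": bool(flags & 0x01),  # TypeFlagsVideo, bit 0
        "data_offset": data_offset,
    }
```

For example, the bytes `46 4C 56 01 05 00 00 00 09` describe a version 1 file with both audio and video tags and a 9-byte header.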
FLV file body
The FLV file body has a regular structure: an alternating series of PreviousTagSize fields and tags:
- PreviousTagSize0 is always 0;
- A tag consists of a tag header and a tag body;
- For FLV version 1, the tag header is always 11 bytes, so every PreviousTagSize except the first equals 11 plus the size of the previous tag body.
Field | Type | Meaning |
---|---|---|
PreviousTagSize0 | UI32 | Always 0 |
Tag1 | FLVTAG | The first tag |
PreviousTagSize1 | UI32 | Size of the previous tag (Tag1), including the tag header |
Tag2 | FLVTAG | The second tag |
… | … | … |
PreviousTagSizeN-1 | UI32 | Size of tag N-1, including the tag header |
TagN | FLVTAG | The Nth (last) tag |
PreviousTagSizeN | UI32 | Size of tag N, including the tag header |
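The alternating layout can be walked with a simple loop. The following is only a sketch, assuming the whole file body is already in memory and that the tag header is 11 bytes (FLV version 1). The name `iter_tags` is hypothetical, and the PreviousTagSize check is an optional sanity test added for illustration.

```python
import struct

TAG_HEADER_SIZE = 11  # fixed for FLV version 1


def iter_tags(body: bytes):
    """Yield (tag_type, tag_body) pairs from an FLV file body."""
    pos = 0
    expected_prev = 0  # PreviousTagSize0 is always 0
    while pos + 4 <= len(body):
        (prev_size,) = struct.unpack(">I", body[pos:pos + 4])
        if prev_size != expected_prev:
            raise ValueError(f"unexpected PreviousTagSize at offset {pos}")
        pos += 4
        if pos + TAG_HEADER_SIZE > len(body):
            break  # that was the trailing PreviousTagSizeN
        tag_type = body[pos]
        data_size = int.from_bytes(body[pos + 1:pos + 4], "big")  # UI24
        start = pos + TAG_HEADER_SIZE
        yield tag_type, body[start:start + data_size]
        expected_prev = TAG_HEADER_SIZE + data_size
        pos = start + data_size
```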
FLV tags
An FLV tag consists of a tag header and a tag body.
The tag header is 11 bytes long and is followed by the tag body (Data):
Field | Type | Meaning |
---|---|---|
TagType | UI8 | Tag type: 8 = audio; 9 = video; 18 = script data; all other values are reserved |
DataSize | UI24 | Size of the tag body |
Timestamp | UI24 | Timestamp in milliseconds, relative to the first tag. The first tag always has a timestamp of 0 |
TimestampExtended | UI8 | Extension of the Timestamp field: it holds the upper 8 bits, so that the two fields together form a 32-bit timestamp when 24 bits are not enough |
StreamID | UI24 | Always 0 |
Data | Depends on TagType | If TagType=8, it is AUDIODATA; if TagType=9, it is VIDEODATA; if TagType=18, it is SCRIPTDATAOBJECT |
In playback, the time sequencing of FLV tags depends on the FLV timestamps only. Any timing mechanisms built into the payload data format are ignored.
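As an illustration of how the 11-byte tag header fits together, the sketch below decodes it and combines Timestamp with TimestampExtended into one 32-bit value. The function name `parse_tag_header` and the result keys are assumptions made for this example.

```python
def parse_tag_header(buf: bytes) -> dict:
    """Decode the 11-byte FLV tag header (all fields big-endian)."""
    if len(buf) < 11:
        raise ValueError("a tag header needs 11 bytes")
    timestamp_low = int.from_bytes(buf[4:7], "big")  # Timestamp, lower 24 bits
    timestamp_ext = buf[7]                           # TimestampExtended, upper 8 bits
    return {
        "tag_type": buf[0],
        "data_size": int.from_bytes(buf[1:4], "big"),
        "timestamp_ms": (timestamp_ext << 24) | timestamp_low,
        "stream_id": int.from_bytes(buf[8:11], "big"),
    }
```

A player would then order tags by `timestamp_ms` only, as the paragraph above describes.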
Audio tags
The body of an audio tag is defined as follows:
Field | Type | Meaning |
---|---|---|
SoundFormat | UB[4] | Audio format (10 = AAC is the one to focus on): 0 = Linear PCM, platform endian; 1 = ADPCM; 2 = MP3; 3 = Linear PCM, little endian; 4 = Nellymoser 16-kHz mono; 5 = Nellymoser 8-kHz mono; 6 = Nellymoser; 7 = G.711 A-law logarithmic PCM; 8 = G.711 mu-law logarithmic PCM; 9 = reserved; 10 = AAC; 11 = Speex; 14 = MP3 8-kHz; 15 = Device-specific sound |
SoundRate | UB[2] | Sampling rate: 0 = 5.5 kHz; 1 = 11 kHz; 2 = 22 kHz; 3 = 44 kHz. For AAC, always 3 |
SoundSize | UB[1] | Sample size: 0 = snd8Bit; 1 = snd16Bit. For compressed audio, always 16 bits |
SoundType | UB[1] | Channel type: 0 = sndMono; 1 = sndStereo. Nellymoser is always mono; for AAC, the field is always 1 (stereo) |
SoundData | UI8[size of sound data] | AACAUDIODATA if the format is AAC; for other formats, please refer to the specification |
Remark:
If SoundFormat indicates AAC, SoundType should be set to 1 (stereo) and SoundRate should be set to 3 (44 kHz). However, this does not mean that AAC audio in FLV is always stereo, 44 kHz data. Instead, Flash Player ignores these values and extracts the channel and sample rate data encoded in the AAC bitstream.
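The four bit fields before SoundData all live in the first byte of the audio tag body. A minimal sketch of unpacking that byte follows. The function name `parse_audio_flags` is invented for this example, and the bit positions assume the fields are packed from the most significant bit down, in the order of the table above.

```python
def parse_audio_flags(first_byte: int) -> dict:
    """Split the first byte of an audio tag body into its bit fields."""
    return {
        "sound_format": (first_byte >> 4) & 0x0F,  # UB[4], 10 = AAC
        "sound_rate": (first_byte >> 2) & 0x03,    # UB[2], 3 = 44 kHz
        "sound_size": (first_byte >> 1) & 0x01,    # UB[1], 1 = snd16Bit
        "sound_type": first_byte & 0x01,           # UB[1], 1 = sndStereo
    }
```

For AAC audio this byte is therefore normally 0xAF (format 10, rate 3, 16-bit, stereo), regardless of the real channel count and sample rate.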
AACAUDIODATA
When SoundFormat is 10, the audio is AAC-encoded, and SoundData is defined as follows:
Field | Type | Meaning |
---|---|---|
AACPacketType | UI8 | 0 = AAC sequence header; 1 = AAC raw |
Data | UI8[n] | If AACPacketType is 0, it is an AudioSpecificConfig; if AACPacketType is 1, it is AAC frame data |
The AudioSpecificConfig is explained in ISO 14496-3. Note that it is not the same as the contents of the esds box from an MP4/F4V file. This structure is more deeply embedded.
About AudioSpecificConfig
The layout can be described with the following pseudocode:
5 bits: object type
if (object type == 31)
    6 bits + 32: object type
4 bits: frequency index
if (frequency index == 15)
    24 bits: frequency
4 bits: channel configuration
var bits: AOT Specific Config
The leading fields are defined as follows:
Field | Type | Meaning |
---|---|---|
AudioObjectType | UB[5] | Audio object type, e.g. 2 = AAC-LC |
SamplingFrequencyIndex | UB[4] | Sampling rate index, e.g. 4 = 44100 Hz |
ChannelConfiguration | UB[4] | Channel configuration, e.g. 2 = two channels (front-left and front-right) |
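The bit layout of AudioSpecificConfig can be turned into a small bit reader. The sketch below covers only the leading fields listed above and ignores the trailing AOT-specific config. The name `parse_audio_specific_config` is hypothetical. The `SAMPLE_RATES` table is the standard MPEG-4 sampling frequency index table and is not taken from this article, apart from index 4 = 44100.

```python
# MPEG-4 sampling frequencies, indexed by SamplingFrequencyIndex (assumption:
# standard table from ISO 14496-3; the article only mentions index 4).
SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                22050, 16000, 12000, 11025, 8000, 7350]


def parse_audio_specific_config(data: bytes) -> dict:
    """Read the leading fields of an AudioSpecificConfig."""
    bits = int.from_bytes(data, "big")
    remaining = len(data) * 8  # number of unread bits

    def read(n: int) -> int:
        nonlocal remaining
        if n > remaining:
            raise ValueError("AudioSpecificConfig is truncated")
        remaining -= n
        return (bits >> remaining) & ((1 << n) - 1)

    object_type = read(5)
    if object_type == 31:  # escape value: 6 more bits, offset by 32
        object_type = 32 + read(6)
    frequency_index = read(4)
    if frequency_index == 15:  # explicit 24-bit frequency follows
        frequency = read(24)
    elif frequency_index < len(SAMPLE_RATES):
        frequency = SAMPLE_RATES[frequency_index]
    else:
        frequency = None  # reserved index
    channel_configuration = read(4)
    return {
        "object_type": object_type,
        "frequency": frequency,
        "channel_configuration": channel_configuration,
    }
```

For example, the two bytes `12 10` decode to AAC-LC (object type 2), 44100 Hz, two channels.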
Video tags
The body of a video tag is defined as follows:
Field | Type | Meaning |
---|---|---|
FrameType | UB[4] | Frame type (1 and 2 are the ones to focus on): 1 = keyframe (for AVC, a seekable frame), i.e. an H.264 IDR frame; 2 = inter frame (for AVC, a non-seekable frame), typically an H.264 P or B frame; 3 = disposable inter frame (H.263 only); 4 = generated keyframe (reserved for server use only); 5 = video info/command frame |
CodecID | UB[4] | Codec (7 = AVC is the one to focus on): 1 = JPEG (currently unused); 2 = Sorenson H.263; 3 = Screen video; 4 = On2 VP6; 5 = On2 VP6 with alpha channel; 6 = Screen video version 2; 7 = AVC |
VideoData | Depends on CodecID | The actual media payload (7 is the one to focus on): 2 = H263VIDEOPACKET; 3 = SCREENVIDEOPACKET; 4 = VP6FLVVIDEOPACKET; 5 = VP6FLVALPHAVIDEOPACKET; 6 = SCREENV2VIDEOPACKET; 7 = AVCVIDEOPACKET |
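FrameType and CodecID share the first byte of the video tag body. A short sketch of splitting that byte is shown below. The function name `parse_video_flags` is made up for illustration, and it assumes FrameType occupies the upper four bits, following the field order in the table.

```python
def parse_video_flags(first_byte: int) -> dict:
    """Split the first byte of a video tag body into FrameType and CodecID."""
    return {
        "frame_type": (first_byte >> 4) & 0x0F,  # 1 = keyframe, 2 = inter frame
        "codec_id": first_byte & 0x0F,           # 7 = AVC
    }
```

With this layout, 0x17 is an AVC keyframe and 0x27 is an AVC inter frame.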
AVCVIDEOPACKET
When CodecID is 7, VideoData is an AVCVIDEOPACKET, that is, H.264 media data.
AVCVIDEOPACKET is defined as follows:
Field | Type | Meaning |
---|---|---|
AVCPacketType | UI8 | 0 = AVC sequence header; 1 = AVC NALU; 2 = AVC end of sequence |
CompositionTime | SI24 | If AVCPacketType=1, the composition time (CTS) offset; otherwise 0 |
Data | UI8[n] | If AVCPacketType=0, an AVCDecoderConfigurationRecord; if AVCPacketType=1, one or more NALUs; if AVCPacketType=2, empty |
A few terms need explaining:
- NALU: In H.264, encoded data is packaged according to specific rules into Network Abstraction Layer Units (NALUs). This data includes not only the encoded video itself but also the parameter sets (SPS, PPS) needed for decoding.
- AVCDecoderConfigurationRecord: carries the parameter sets (SPS, PPS) required to decode the H.264 video.
- CTS: When B frames are present, the decoding timestamp (DTS) and presentation timestamp (PTS) of a frame may differ. CTS is the difference between them in milliseconds: CTS = (PTS - DTS) / 90, where the division by 90 converts from 90 kHz clock ticks to milliseconds. If there are no B frames, CTS is always 0.
SPS and PPS are not covered in detail here.
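The AVCVIDEOPACKET layout and the CTS formula can be sketched as follows. This assumes the input starts right after the FrameType/CodecID byte and that PTS and DTS are given in 90 kHz clock ticks. Both function names are hypothetical.

```python
def parse_avc_video_packet(video_data: bytes) -> dict:
    """Decode AVCPacketType, CompositionTime and payload of an AVC packet."""
    if len(video_data) < 4:
        raise ValueError("an AVC video packet needs at least 4 bytes")
    return {
        "avc_packet_type": video_data[0],  # 0 = sequence header, 1 = NALU, 2 = end
        # CompositionTime is SI24: signed, big-endian, in milliseconds
        "composition_time_ms": int.from_bytes(video_data[1:4], "big", signed=True),
        "data": video_data[4:],
    }


def cts_ms(pts_90k: int, dts_90k: int) -> int:
    """CTS in milliseconds from PTS and DTS in 90 kHz ticks (integer division)."""
    return (pts_90k - dts_90k) // 90
```

For a frame with PTS 7200 and DTS 3600 (in 90 kHz ticks), `cts_ms` gives 40, which is the value a muxer would write into CompositionTime.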
Script Data Tags
Script Data Tags are usually used to store the onMetaData information about the audio and video in the FLV file, such as duration, width and height. Their definition is relatively complex: they use AMF (Action Message Format) to encapsulate a series of data types, such as strings, numbers and arrays.
Field | Type | Meaning |
---|---|---|
Objects | SCRIPTDATAOBJECT[] | Any number of ScriptDataObjects |
SCRIPTDATAOBJECTEND | UI24 | Always 9, marking the end of Script Data |
A SCRIPTDATAOBJECT is defined as follows:
Field | Type | Meaning |
---|---|---|
ObjectName | SCRIPTDATASTRING | Object name |
ObjectData | SCRIPTDATAVALUE | The value of the object |
The definition of SCRIPTDATAVALUE is as follows:
Field | Type | Meaning |
---|---|---|
Type | UI8 | Type of the value: 0 = Number; 1 = Boolean; 2 = String; 3 = Object; 4 = MovieClip; 5 = Null; 6 = Undefined; 7 = Reference; 8 = ECMA array; 10 = Strict array; 11 = Date; 12 = Long string |
ECMAArrayLength | UI32, present only if Type is 8 (ECMA array) | The length of the array |
ScriptDataValue | If Type == 0, DOUBLE; if Type == 1, UI8; if Type == 2, SCRIPTDATASTRING; … (the list is long, please refer to the specification) | The value of the variable |
ScriptDataValueTerminator | If Type == 3, SCRIPTDATAOBJECTEND; if Type == 8, SCRIPTDATAVARIABLEEND | Terminator of the Object or ECMA array |
As you can see, the definition of Script Data Tag is relatively complex.
onMetaData
onMetaData contains audio- and video-related metadata. It is carried in a Script Data tag, which holds two AMF values.
The first AMF value:
- Byte 1: 0x02, indicating a string;
- Bytes 2-3: a UI16 with the value 0x000A, the string length of 10 (the length of "onMetaData");
- Bytes 4-13: the bytes of the string onMetaData (0x6F 0x6E 0x4D 0x65 0x74 0x61 0x44 0x61 0x74 0x61).
The second AMF value:
- Byte 1: 0x08, indicating the ECMA array type;
- Bytes 2-5: a UI32 giving the length of the array. The set of onMetaData properties is not fixed;
- Byte 6 onward: the properties, one after another. Taking duration as an example:
  - Bytes 6-7: 0x0008, a UI16 giving the length of the property name (8 bytes);
  - Bytes 8-15: 0x64 0x75 0x72 0x61 0x74 0x69 0x6F 0x6E, the string duration;
  - Byte 16: 0x00, indicating a Number type;
  - Bytes 17-24: 0x…, the duration value itself, as a DOUBLE.
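To show how the two AMF values fit together, here is a deliberately limited parsing sketch. It assumes property names are SCRIPTDATASTRINGs with a UI16 length prefix (as with the onMetaData string itself), handles only Number, Boolean and String values, and stops at the `00 00 09` end marker. The function name `parse_on_metadata` is invented for this example, and real files may contain value types it rejects.

```python
import struct


def parse_on_metadata(body: bytes) -> dict:
    """Parse a script tag body holding "onMetaData" plus an ECMA array."""
    if body[0] != 0x02:
        raise ValueError("first AMF value is not a string")
    name_len = int.from_bytes(body[1:3], "big")
    name = body[3:3 + name_len].decode("utf-8")
    pos = 3 + name_len
    if name != "onMetaData" or body[pos] != 0x08:
        raise ValueError("expected onMetaData followed by an ECMA array")
    pos += 5  # skip the type byte and the UI32 array length
    result = {}
    while pos + 3 <= len(body) and body[pos:pos + 3] != b"\x00\x00\x09":
        key_len = int.from_bytes(body[pos:pos + 2], "big")
        key = body[pos + 2:pos + 2 + key_len].decode("utf-8")
        pos += 2 + key_len
        value_type = body[pos]
        pos += 1
        if value_type == 0x00:    # Number: 8-byte big-endian DOUBLE
            (value,) = struct.unpack(">d", body[pos:pos + 8])
            pos += 8
        elif value_type == 0x01:  # Boolean: UI8
            value = body[pos] != 0
            pos += 1
        elif value_type == 0x02:  # String: UI16 length + bytes
            n = int.from_bytes(body[pos:pos + 2], "big")
            value = body[pos + 2:pos + 2 + n].decode("utf-8")
            pos += 2 + n
        else:
            raise ValueError(f"unsupported AMF type {value_type}")
        result[key] = value
    return result
```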
Common onMetaData properties:
Property | Type | Meaning |
---|---|---|
duration | DOUBLE | Duration of the file, in seconds |
width | DOUBLE | Video width, in pixels |
height | DOUBLE | Video height, in pixels |
videodatarate | DOUBLE | Video bit rate, in kilobits per second |
framerate | DOUBLE | Video frame rate, in frames per second |
videocodecid | DOUBLE | Video codec ID (see Video Tag) |
audiosamplerate | DOUBLE | Audio sampling rate |
audiosamplesize | DOUBLE | Audio sample size (see Audio Tag) |
stereo | BOOL | Whether the audio is stereo |
audiocodecid | DOUBLE | Audio codec ID (see Audio Tag) |
filesize | DOUBLE | Total file size, in bytes |
Closing notes
The FLV protocol itself is not complicated. The difficulty in understanding it more often comes from the related audio and video codec knowledge, such as H.264 and AAC, so it is worth looking things up whenever something is unclear. In addition, FLV uses big-endian byte order, which must be kept in mind when parsing the protocol.
For ease of explanation, some parts of this article may not be fully rigorous. If you find any mistakes, please point them out.
Related links
video_file_format_spec_v10.pdf: www.adobe.com/content/dam…
MPEG-4 Part 3: en.wikipedia.org/wiki/MPEG-4…
FLV file analysis: www.jianshu.com/p/e290dca02…
H.264 NALU syntax: blog.csdn.net/qq_29350001…