There are times when you need to process binary data in Python, for example when reading files or working with sockets. Python's struct module handles this: it converts between Python values and C structures represented as bytes.
The three most important functions in the struct module are pack(), unpack(), and calcsize():
pack(fmt, v1, v2, ...) packs the values into a bytes object (a byte stream laid out like a C structure) according to the given format (fmt)
unpack(fmt, buffer) parses the bytes according to the given format (fmt) and returns the parsed values as a tuple
calcsize(fmt) calculates how many bytes a given format (fmt) occupies
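The three functions can be tried together in a few lines. This is a minimal sketch; the format "<HI" and the values 1 and 2 are arbitrary choices for illustration, not taken from any particular protocol.

```python
import struct

# Pack an unsigned short (H) and an unsigned int (I), little-endian, no padding.
packed = struct.pack("<HI", 1, 2)

# calcsize() reports how many bytes the format occupies: 2 + 4 = 6.
size = struct.calcsize("<HI")

# unpack() always returns a tuple, one item per field.
values = struct.unpack("<HI", packed)

print(packed)  # b'\x01\x00\x02\x00\x00\x00'
print(size)    # 6
print(values)  # (1, 2)
```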
struct supports the following formats (Python types are those of Python 3; sizes are the standard sizes):
Format   C type               Python type         Standard size (bytes)
x        pad byte             no value            1
c        char                 bytes of length 1   1
b        signed char          int                 1
B        unsigned char        int                 1
?        _Bool                bool                1
h        short                int                 2
H        unsigned short       int                 2
i        int                  int                 4
I        unsigned int         int                 4
l        long                 int                 4
L        unsigned long        int                 4
q        long long            int                 8
Q        unsigned long long   int                 8
f        float                float               4
d        double               float               8
s        char[]               bytes               1 per byte of count
p        char[]               bytes               1 per byte of count
P        void *               int                 platform-dependent (native mode only)
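The sizes can be checked directly with calcsize(). The sketch below assumes the "=" prefix, which selects the standard sizes; without a prefix the native sizes are used and some of them (for example l and L) may differ by platform.

```python
import struct

codes = "xcbB?hHiIlLqQfd"

# "=" selects standard sizes, which are the same on every platform.
standard_sizes = {code: struct.calcsize("=" + code) for code in codes}

# P exists only in native mode; its size follows the machine word.
pointer_size = struct.calcsize("P")

for code, size in standard_sizes.items():
    print(code, size)
print("P", pointer_size)
```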
Note 1. q and Q are only of interest if the machine supports 64-bit operations
Note 2. Each format character may be preceded by a number that indicates a repeat count
Note 3. For the s format the number is a length, not a repeat count: 4s represents one string of 4 bytes. p represents a Pascal string, whose first byte stores the length
Note 4. P converts a pointer, whose size depends on the machine word length
Note 5. The last entry, P, indicates a pointer type; it occupies 4 bytes on a 32-bit system
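The difference between a repeat count and the length used by s and p is easiest to see in code. A small sketch with made-up values:

```python
import struct

# A count before most codes repeats the field: "3h" is three shorts.
three_shorts = struct.pack("<3h", 1, 2, 3)

# For "s" the count is a byte length: "4s" is ONE 4-byte field.
# Shorter input is padded with null bytes, longer input is truncated.
padded = struct.pack("4s", b"ab")
truncated = struct.pack("4s", b"abcdef")

# "p" is a Pascal string: the first byte stores the length,
# and the count (4 here) includes that length byte.
pascal = struct.pack("4p", b"ab")

print(three_shorts)  # b'\x01\x00\x02\x00\x03\x00'
print(padded)        # b'ab\x00\x00'
print(truncated)     # b'abcd'
print(pascal)        # b'\x02ab\x00'
```

Unpacking "4p" returns only the stored string, without the length byte or the padding.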
When exchanging data with C structures, keep in mind that C and C++ compilers usually align struct members, typically on 4-byte boundaries on a 32-bit system. By default, struct therefore converts using the local machine's native byte order, sizes, and alignment. The first character of the format string can be used to change the byte order and alignment. It is defined as follows:
Character   Byte order               Size and alignment
@           native                   native sizes, native alignment (for example, padded to 4-byte boundaries)
=           native                   standard sizes, no alignment
<           little-endian            standard sizes, no alignment
>           big-endian               standard sizes, no alignment
!           network (= big-endian)   standard sizes, no alignment
Put it in the first position of fmt, as in '@5s6sif'. If it is omitted, '@' is assumed.
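The effect of the prefix character shows up both in the byte order and in the total size. The following is an illustrative sketch; the value 0x01020304 and the format "bI" are chosen only to make the difference visible, and the native size depends on the platform.

```python
import struct

value = 0x01020304

little = struct.pack("<I", value)   # least significant byte first
big = struct.pack(">I", value)      # most significant byte first
network = struct.pack("!I", value)  # network order is big-endian

# Standard modes never insert padding: 1 + 4 = 5 bytes.
standard_size = struct.calcsize("<bI")

# Native mode may pad after the first byte so that the int is aligned.
# The result is platform-dependent (commonly 8).
native_size = struct.calcsize("@bI")

print(little)   # b'\x04\x03\x02\x01'
print(big)      # b'\x01\x02\x03\x04'
print(standard_size, native_size)
```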
Example 1:
Let's say we have this structure:
struct Header
{
    unsigned short id;
    char tag[4];
    unsigned int version;
    unsigned int count;
};
Suppose recv() has returned one such struct in the bytes object s, and we now need to parse it with the unpack() function.
import struct
id, tag, version, count = struct.unpack("!H4s2I", s)
In the format string above, ! means the data is parsed in network byte order, because it was received from the network, where it is sent in network byte order. The H that follows represents id as an unsigned short, 4s represents a 4-byte string, and 2I represents two unsigned ints.
With a single unpack call, the information is now stored in id, tag, version, and count.
Similarly, it is easy to pack local data into struct format.
ss = struct.pack("!H4s2I", id, tag, version, count)
The pack function converts id, tag, version, and count into a Header in the specified format. ss is now a bytes object (a byte stream laid out like the C structure) that can be sent via socket.send(ss).
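A complete round trip of the Header can be sketched without a real socket. The field values below (7, b"DATA", 1, 42) and the names HEADER_FORMAT and header_id are invented for illustration.

```python
import struct

HEADER_FORMAT = "!H4s2I"  # id, tag, version, count in network byte order

# Pack hypothetical header values into bytes, as a sender would.
packet = struct.pack(HEADER_FORMAT, 7, b"DATA", 1, 42)

# With "!" there is no padding: 2 + 4 + 4 + 4 = 14 bytes.
header_size = struct.calcsize(HEADER_FORMAT)

# Parse the bytes back into Python values, as a receiver would.
header_id, tag, version, count = struct.unpack(HEADER_FORMAT, packet)

print(header_size)                    # 14
print(header_id, tag, version, count) # 7 b'DATA' 1 42
```

Two details are worth keeping in mind. A C compiler would typically pad this struct to 16 bytes in memory, so the sender has to write the fields in the packed 14-byte layout for "!H4s2I" to match. Also, unpack() requires the buffer length to equal calcsize(fmt) exactly, so if the received data contains more than the header, slice it first, for example s[:header_size].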
Example 2:
import struct
a = 12.34
# Convert a to binary
data = struct.pack('d', a)
data is now a bytes object whose 8 bytes are identical to the binary storage of a as a C double. The format must match the type of the value: use 'd' or 'f' for a float, and 'i' only for an integer.
Now do the reverse operation: take the existing binary data and convert it back to a Python data type:
a, = struct.unpack('d', data)
Notice that unpack returns a tuple, so even when only one value was packed, decoding needs the trailing comma:
a, = struct.unpack('d', data) or (a,) = struct.unpack('d', data)
If you write a = struct.unpack('d', data), then a == (12.34,) is a tuple instead of the original float.
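The choice of format code matters for a float. The sketch below compares the 'd' and 'f' codes on the same value and shows that an integer code rejects it; the variable names are illustrative only.

```python
import struct

a = 12.34

# 'd' stores a full C double, so the round trip is exact.
(as_double,) = struct.unpack("d", struct.pack("d", a))

# 'f' stores a 4-byte single-precision float, so some precision is lost.
(as_float,) = struct.unpack("f", struct.pack("f", a))

# Without the tuple unpacking, the result stays a 1-tuple.
as_tuple = struct.unpack("d", struct.pack("d", a))

# Integer codes such as 'i' do not accept a float at all.
try:
    struct.pack("i", a)
    int_code_rejects_float = False
except struct.error:
    int_code_rejects_float = True

print(as_double == a)          # True
print(as_float == a)           # False
print(as_tuple)                # (12.34,)
print(int_code_rejects_float)  # True
```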
If the data consists of several values, it can be packed like this:
a = b'hello'
b = b'world!'
c = 2
d = 45.123
data = struct.pack('5s6sif', a, b, c, d)
data can now be written to a file in binary form, for example with binfile.write(data). To read it back:
data = binfile.read()
Then use unpack() to decode it into Python variables:
a, b, c, d = struct.unpack('5s6sif', data)
Because 'f' is single precision, d comes back as approximately 45.123.
'5s6sif' is called the fmt: a format string made up of numbers and format characters, where 5s means a string of five bytes, 2i means two integers, and so on. The table above lists the available format characters together with the corresponding C and Python types.
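Writing the packed record to a file and reading it back can be sketched as follows. The temporary directory, the file name record.bin, and the "<" prefix (chosen so the size is the same on every platform) are assumptions made for this example.

```python
import os
import struct
import tempfile

RECORD_FORMAT = "<5s6sif"  # little-endian, standard sizes, no padding

record = struct.pack(RECORD_FORMAT, b"hello", b"world!", 2, 45.123)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "record.bin")
    # Write the bytes in binary mode.
    with open(path, "wb") as binfile:
        binfile.write(record)
    # Read the same bytes back in binary mode.
    with open(path, "rb") as binfile:
        raw = binfile.read()

a, b, c, d = struct.unpack(RECORD_FORMAT, raw)

print(len(raw))  # 19
print(a, b, c)   # b'hello' b'world!' 2
print(d)         # close to 45.123 (single precision)
```

The strings come back as bytes; call a.decode() if a str is needed.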
Note: problems encountered when handling binary files
When processing binary files, open them as follows:
binfile = open(filepath, 'rb')  # read a binary file
binfile = open(filepath, 'wb')  # write a binary file
How does this differ from binfile = open(filepath, 'r')?
There are two differences:
First, on Windows, text mode 'r' can treat the byte 0x1A (Ctrl-Z) as the end of the file (EOF); this is the behavior of Python 2's text mode on Windows. This problem does not exist with 'rb'. That is, if you write in binary mode and read in text mode, only part of the file is read when it contains 0x1A, whereas 'rb' reads all the way to the end of the file. In Python 3, 'r' additionally decodes the bytes into str, which struct.unpack() does not accept.
Second, for the string x = 'ABC\ndef', len(x) returns 7. \n is the newline character, which is actually 0x0A. When it is written in 'w' (text) mode, Windows automatically converts 0x0A into the two bytes 0x0D 0x0A, so the file length becomes 8. When the file is read in 'r' (text) mode, the pair is automatically converted back to the original newline character. When written in 'wb' (binary) mode, the character is kept unchanged and is read back as is. So if you write in text mode and read in binary mode, you have to account for the extra byte. 0x0D is also known as carriage return. None of this happens on Linux, because Linux uses only 0x0A for line breaks.
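The newline translation can be reproduced on any platform in Python 3. In this sketch, passing newline="\r\n" to open() is an assumption used to force the Windows-style translation even on Linux; the file name is arbitrary.

```python
import os
import tempfile

x = "ABC\ndef"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "newline.txt")
    # Text mode with newline="\r\n" turns each "\n" into "\r\n" on write.
    with open(path, "w", encoding="ascii", newline="\r\n") as text_file:
        text_file.write(x)
    # Binary mode shows the bytes exactly as stored on disk.
    with open(path, "rb") as binary_file:
        raw = binary_file.read()
    # Text mode with the default newline handling maps "\r\n" back to "\n".
    with open(path, "r", encoding="ascii") as text_file:
        text = text_file.read()

print(len(x))    # 7
print(raw)       # b'ABC\r\ndef'
print(len(raw))  # 8
print(text == x) # True
```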