In my work I run into UTF-8 BOMs from time to time, and sometimes I have to remove them because a third-party tool does not support them. For example, SQL files exported from Alibaba Cloud start with a BOM, but Navicat does not support it, so the BOM has to be stripped first.

The test file used below is a 265M SQL file exported from Alibaba Cloud. It was already cached during the tests (the "File system inputs" reported by time are close to 0).

Removing the BOM with sed

sed -e '1s/^\xef\xbb\xbf//' file

Timing the sed approach with time:

$ /usr/bin/time -v sed -e '1s/^\xef\xbb\xbf//' sqlResult_1601835.sql > /dev/null
...
User time (seconds): 0.33
System time (seconds): 0.11
...

User time is relatively high because sed processes every line, even though only the first line can contain a BOM, so CPU time is wasted.

sed also supports in-place updates (-i):

$ /usr/bin/time -v sed -e '1s/^\xef\xbb\xbf//' sqlResult_1601835.sql -i
...
User time (seconds): 1.31
System time (seconds): 3.89
...
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.32
...

This is slower because the file actually has to be written. As strace shows, sed performs the update by writing to a temporary file and then renaming it over the original file:

open("sqlResult_1601835.sql", O_RDONLY) = 3
open("./sedGlXm60", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
...
rename("./sedGlXm60", "sqlResult_1601835.sql")

Removing the BOM with tail

tail --bytes=+4 file

tail can skip the BOM and copy the rest of the file as-is, which avoids unnecessary CPU work:

$ /usr/bin/time -v tail --bytes=+4 sqlResult_1601835.sql > /dev/null
...
User time (seconds): 0.01
System time (seconds): 0.12
...
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00...

However, tail cannot update a file in place: its output has to be redirected to a new file, which then replaces the old one.
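One thing to keep in mind is that tail --bytes=+4 starts output at the fourth byte no matter what the first three bytes are, so it also eats real content if the file has no BOM. As a rough illustration only (the function name is made up for this example), the same "skip the head, copy the rest" idea with a BOM check could be sketched in Python like this:

```python
import shutil

UTF8_BOM = b"\xef\xbb\xbf"


def copy_without_bom(src, dst):
    """Copy the binary stream src to dst, dropping a leading UTF-8 BOM if present."""
    head = src.read(len(UTF8_BOM))
    if head != UTF8_BOM:
        # No BOM: these bytes are real content, so they must be kept.
        dst.write(head)
    # Everything after the first three bytes is copied untouched,
    # without looking at line boundaries.
    shutil.copyfileobj(src, dst)
```

It could be used with two files opened in binary mode, for example copy_without_bom(open("in.sql", "rb"), open("out.sql", "wb")), where both file names are placeholders.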

strip-bom

To combine the best of sed and tail, I wrote strip-bom, which also supports updating files in place.

Test the redirection first:

$ /usr/bin/time -v php strip-bom.phar sqlResult_1601835.sql > /dev/null
...
User time (seconds): 0.11
System time (seconds): 0.22
...

That is only about 20% faster than sed, with less user time but more system time. The tool works as a read/write loop, and every iteration issues one read call and one write call, so I added an option to set the block size of each read. A larger block means fewer iterations and fewer system calls, which makes it about 60% faster than sed:

$ /usr/bin/time -v php strip-bom.phar -b 16384 sqlResult_1601835.sql > /dev/null
...
User time (seconds): 0.06
System time (seconds): 0.12
...
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.19
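To make the effect of the block size concrete, here is a minimal Python sketch of such a read/write loop. It is not the actual strip-bom code; the function name and the 4096-byte default are assumptions made for the example, and only the 16384 value comes from the command above. The function returns the number of loop iterations so the two block sizes can be compared.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom_blocks(src, dst, block_size=4096):
    """Copy src to dst block by block, dropping a leading UTF-8 BOM.

    Returns the number of blocks copied. The default block size is an
    assumption for this sketch, not strip-bom's real default.
    """
    head = src.read(len(UTF8_BOM))
    if head != UTF8_BOM:
        dst.write(head)  # no BOM, so the first bytes are ordinary content
    iterations = 0
    while True:
        block = src.read(block_size)  # one read per iteration
        if not block:
            break
        dst.write(block)              # one write per iteration
        iterations += 1
    return iterations
```

With unbuffered file objects, each iteration corresponds roughly to one read and one write system call, so for the same input a 16384-byte block needs about a quarter of the iterations of a 4096-byte block.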

In-place updating is about 30% faster than sed:

$ /usr/bin/time -v php strip-bom.phar -i -b 16384 sqlResult_1601835.sql
...
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.11
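The post does not show how strip-bom does its in-place update, so the following is only one possible way to avoid a temporary file, sketched in Python under that assumption: shift the contents three bytes towards the start of the same file, block by block, and truncate the leftover tail. The function name is hypothetical, and os.pread / os.pwrite are only available on Unix-like systems.

```python
import os

UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom_in_place(path, block_size=16384):
    """Remove a leading UTF-8 BOM by shifting the file contents left.

    Returns True if a BOM was removed, False if the file had none.
    """
    with open(path, "r+b", buffering=0) as f:
        if f.read(len(UTF8_BOM)) != UTF8_BOM:
            return False
        fd = f.fileno()
        read_pos = len(UTF8_BOM)  # where the real content starts
        write_pos = 0             # where it has to end up
        while True:
            block = os.pread(fd, block_size, read_pos)
            if not block:
                break
            read_pos += len(block)
            view = memoryview(block)
            while view:
                # pwrite may write fewer bytes than requested, so loop.
                written = os.pwrite(fd, view, write_pos)
                write_pos += written
                view = view[written:]
        # Drop the last three bytes, which are now duplicates.
        os.ftruncate(fd, write_pos)
    return True
```

The write position always stays three bytes behind the read position, so a block is only overwritten after it has been read. Unlike the temporary-file-and-rename approach, this is not atomic: if the process is interrupted halfway, the file is left partly shifted.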

copy_file_range

Linux 4.5 added a system call:

ssize_t copy_file_range(int fd_in, loff_t *off_in,
                               int fd_out, loff_t *off_out,
                               size_t len, unsigned int flags);

It copies data directly between two file descriptors, usually with a single system call, so it can be used to implement sed's approach of copying to a temporary file and then overwriting the old file: Gist

Testing:

$ /usr/bin/time -v ./copy_file_range sqlResult_1601835.sql
...
User time (seconds): 0.00
System time (seconds): 2.47
...
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.52

Despite the fewer system calls, this is only slightly faster than sed, and copying to a temporary file is still slower than strip-bom's in-place update.
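For readers who do not want to read the C version, the same idea can be sketched in Python, which exposes the system call as os.copy_file_range (Python 3.8+, Linux only). This is not the code from the Gist; the function names are made up, and the fallback to plain pread/write is an assumption added so the sketch also runs where the call is unavailable.

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"


def copy_range(fd_in, fd_out, offset, count):
    """Copy count bytes from fd_in, starting at offset, to fd_out."""
    while count > 0:
        try:
            copied = os.copy_file_range(fd_in, fd_out, count, offset_src=offset)
        except (AttributeError, OSError):
            # Assumption: fall back to read/write if the call is unsupported.
            data = os.pread(fd_in, min(count, 1 << 20), offset)
            copied = os.write(fd_out, data) if data else 0
        if copied == 0:
            break  # unexpected end of file
        offset += copied
        count -= copied


def strip_bom_via_temp_file(path):
    """Copy everything after the BOM to a temp file, then rename it over path."""
    with open(path, "rb") as src:
        if src.read(len(UTF8_BOM)) != UTF8_BOM:
            return False
        st = os.fstat(src.fileno())
        # Same directory, so the final rename stays on one filesystem.
        directory = os.path.dirname(os.path.abspath(path))
        fd_out, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            os.fchmod(fd_out, st.st_mode & 0o7777)  # keep the permission bits
            copy_range(src.fileno(), fd_out, len(UTF8_BOM), st.st_size - len(UTF8_BOM))
        except BaseException:
            os.unlink(tmp_path)
            raise
        finally:
            os.close(fd_out)
    os.replace(tmp_path, path)
    return True
```

The loop is there because the kernel may copy fewer bytes than requested in one call, even though a single call is often enough.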

Removing the BOM with dos2unix

I always thought dos2unix only converted CRLF line endings. After reading Feng_Yu's comment I read the man page, and it turns out that dos2unix has many features, including an option to remove the BOM (-r):

$ /usr/bin/time -v dos2unix -r sqlResult_1601835.sql
dos2unix: converting file sqlResult_1601835.sql ...
Command being timed: "dos2unix -r sqlResult_1601835.sql"
User time (seconds): 10.01
...
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:18.20

The dos2unix implementation is similar to sed: it writes to a temporary file and then overwrites the original, and it also processes every line, so its performance is not great either.