UTF-8 encoding verification

393. UTF-8 encoding validation


Given an integer array data representing the data, return whether it is a valid UTF-8 encoding.

A character in UTF-8 can be 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, followed by the 7 bits of its Unicode code.
For an n-byte character (n > 1), the first n bits of the first byte are all 1 and the (n+1)-th bit is 0; each of the following n - 1 bytes starts with the two bits 10. The remaining bits, not mentioned above, hold the Unicode code of the character.

Here’s how UTF-8 encoding works:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
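
To make the table above concrete, here is a minimal sketch that goes in the opposite direction and encodes one code point into bytes. The name encodeUtf8 is hypothetical and not part of the original solution; the sketch follows only the four rows of the table and does not reject the surrogate range.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (an assumption, not from the original post): encodes one
// Unicode code point into UTF-8 bytes by following the four rows of the table.
// Returns an empty vector for values above 0x10FFFF.
std::vector<int> encodeUtf8(unsigned int cp) {
    if (cp <= 0x7F) {                                   // 0xxxxxxx
        return {static_cast<int>(cp)};
    }
    if (cp <= 0x7FF) {                                  // 110xxxxx 10xxxxxx
        return {static_cast<int>(0xC0 | (cp >> 6)),
                static_cast<int>(0x80 | (cp & 0x3F))};
    }
    if (cp <= 0xFFFF) {                                 // 1110xxxx 10xxxxxx 10xxxxxx
        return {static_cast<int>(0xE0 | (cp >> 12)),
                static_cast<int>(0x80 | ((cp >> 6) & 0x3F)),
                static_cast<int>(0x80 | (cp & 0x3F))};
    }
    if (cp <= 0x10FFFF) {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return {static_cast<int>(0xF0 | (cp >> 18)),
                static_cast<int>(0x80 | ((cp >> 12) & 0x3F)),
                static_cast<int>(0x80 | ((cp >> 6) & 0x3F)),
                static_cast<int>(0x80 | (cp & 0x3F))};
    }
    return {};                                          // outside the table
}

// Usage example: U+0142 falls in the second row and needs two bytes.
void encodeUtf8Example() {
    assert((encodeUtf8(0x142) == std::vector<int>{197, 130}));
}
```

The pair 197, 130 is the same 2-byte character that opens Example 1 of the problem.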

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data, so each integer represents exactly 1 byte of data.

Example 1:

Input: data = [197,130,1]
Output: true
Explanation: data represents the byte sequence 11000101 10000010 00000001. This is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.

Example 2:

Input: data = [235,140,4]
Output: false
Explanation: data represents the byte sequence 11101011 10001100 00000100. The first three bits are all 1 and the fourth bit is 0, so this byte starts a 3-byte character. The next byte is a continuation byte that starts with 10, which is correct. The second continuation byte does not start with 10, so the sequence is invalid.
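
When working through examples like these by hand, it can help to print the input as bits. The following is a small sketch using std::bitset; the helper name toBinary is an assumption made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: renders the lowest 8 bits of every value as a binary
// string, with the bytes separated by single spaces.
std::string toBinary(const std::vector<int>& data) {
    std::string out;
    for (int value : data) {
        if (!out.empty()) out += ' ';
        out += std::bitset<8>(static_cast<unsigned long long>(value & 0xFF)).to_string();
    }
    return out;
}

// Usage example: the byte sequence quoted in Example 2.
void toBinaryExample() {
    assert(toBinary({235, 140, 4}) == "11101011 10001100 00000100");
}
```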

Constraints:

1 <= data.length <= 2 * 10^4
0 <= data[i] <= 255

Simulation

Simulate the process exactly as the problem statement describes.

First work out how many bytes the character starting at data[i] occupies.
- This is done with a chain of if checks; see the calculationByte() function for the details.
Once the total byte count n is known, verify in order that the n - 1 bytes that follow are continuation bytes.
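
The first step can also be written by shifting the byte and comparing the result with the prefix from each row of the table. This is only a sketch of an alternative to the mask-based calculationByte() in the solution; the name sequenceLength is hypothetical.

```cpp
#include <cassert>

// Hypothetical alternative to calculationByte(): returns how many bytes the
// character starting with this lead byte occupies, or -1 if the byte cannot
// start a character.
int sequenceLength(int value) {
    int b = value & 0xFF;              // only the lowest 8 bits carry data
    if ((b >> 7) == 0x00) return 1;    // 0xxxxxxx
    if ((b >> 5) == 0x06) return 2;    // 110xxxxx
    if ((b >> 4) == 0x0E) return 3;    // 1110xxxx
    if ((b >> 3) == 0x1E) return 4;    // 11110xxx
    return -1;                         // 10xxxxxx or 11111xxx
}

// Usage example: the lead bytes of Example 1 and Example 2.
void sequenceLengthExample() {
    assert(sequenceLength(197) == 2);
    assert(sequenceLength(235) == 3);
}
```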

Pay attention to the boundary conditions along the way:

- Check whether enough bytes remain in the array to complete the encoding.
- As soon as a byte violates the encoding rules, return false immediately.
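
Both boundary conditions can be sketched as one small check. The helper name continuationsOk and its parameters are assumptions for illustration; the bounds test comes first so that the loop never reads past the end of the array.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: true when a character of `length` bytes starting at
// index `start` fits inside data and all of its trailing bytes are 10xxxxxx.
bool continuationsOk(const std::vector<int>& data, int start, int length) {
    if (length < 1) return false;                                       // invalid lead byte
    if (start + length > static_cast<int>(data.size())) return false;   // not enough bytes left
    for (int i = start + 1; i < start + length; ++i) {
        if ((data[i] & 0xC0) != 0x80) return false;                     // return at the first bad byte
    }
    return true;
}

// Usage example: Example 2 fails on its second continuation byte, and a
// truncated 3-byte character fails the bounds check.
void continuationsOkExample() {
    assert(!continuationsOk({235, 140, 4}, 0, 3));
    assert(!continuationsOk({235, 140}, 0, 3));
}
```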

    #include <vector>
    using namespace std;

    class Solution {
        // Masks covering the top 1, 2, 3 and 4 bits of a byte.
        static constexpr int oneMask = 1 << 7;
        static constexpr int twoMask = oneMask + (1 << 6);
        static constexpr int threeMask = twoMask + (1 << 5);
        static constexpr int fourMask = threeMask + (1 << 4);

    public:
        bool validUtf8(vector<int>& data) {
            if (data.empty()) return false;
            int n = data.size();
            int i = 0;
            while (i < n) {
                int b = calculationByte(data[i]);
                if (!_validUtf8(data, i, i + b - 1)) return false;
                i += b;
            }
            return true;
        }

        // Checks that data[left + 1 .. right] are all continuation bytes (10xxxxxx).
        bool _validUtf8(vector<int>& data, int left, int right) {
            if (left > right) return false;               // calculationByte() returned -1
            if (left == right) return true;               // 1-byte character
            if (right >= (int)data.size()) return false;  // not enough bytes left
            for (int i = left + 1; i <= right; i++) {
                int value = data[i];
                if ((value & oneMask) != oneMask) return false;  // top bit must be 1
                if ((value & (1 << 6)) != 0) return false;       // second bit must be 0
            }
            return true;
        }

        // Returns the byte length of the character whose first byte is value,
        // or -1 if value cannot start a character.
        int calculationByte(int value) {
            if ((value & oneMask) == 0) return 1;                                            // 0xxxxxxx
            if ((value & twoMask) == twoMask && (value & (1 << 5)) == 0) return 2;           // 110xxxxx
            if ((value & threeMask) == threeMask && (value & (1 << 4)) == 0) return 3;       // 1110xxxx
            if ((value & fourMask) == fourMask && (value & (1 << 3)) == 0) return 4;         // 11110xxx
            return -1;
        }
    };
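
As a point of comparison, the same simulation can be done in a single pass that only tracks how many continuation bytes are still expected. This is a sketch of a different formulation, not the solution above; the name validUtf8SinglePass is hypothetical. Unlike the solution above it returns true for an empty array, which the constraints rule out anyway.

```cpp
#include <cassert>
#include <vector>

// Hypothetical single-pass variant: `remaining` counts the continuation bytes
// that the current character still needs.
bool validUtf8SinglePass(const std::vector<int>& data) {
    int remaining = 0;
    for (int value : data) {
        int b = value & 0xFF;                           // only the lowest 8 bits carry data
        if (remaining > 0) {
            if ((b & 0xC0) != 0x80) return false;       // expected 10xxxxxx
            --remaining;
        } else if ((b & 0x80) == 0x00) {
            remaining = 0;                              // 0xxxxxxx
        } else if ((b & 0xE0) == 0xC0) {
            remaining = 1;                              // 110xxxxx
        } else if ((b & 0xF0) == 0xE0) {
            remaining = 2;                              // 1110xxxx
        } else if ((b & 0xF8) == 0xF0) {
            remaining = 3;                              // 11110xxx
        } else {
            return false;                               // 10xxxxxx or 11111xxx as lead byte
        }
    }
    return remaining == 0;                              // the last character must be complete
}

// Usage example: the two sample inputs from the problem statement.
void validUtf8SinglePassExample() {
    assert(validUtf8SinglePass({197, 130, 1}));
    assert(!validUtf8SinglePass({235, 140, 4}));
}
```

The final remaining == 0 check plays the role of the "enough bytes left" test: a character that is cut off at the end of the array leaves the counter above zero.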
