Integrated Oracle data into MaxCompute, garbled code problem analysis; Why, while Oracle data is not garbled, integration into MaxCompute is garbled? What’s the problem?
1.1 Garbled characters
DataWorks’ data offline integration (DataX) integrates Oracle data into MaxCompute, but the source library is not garbled.
Symptom: [Oracle; tool: plsqL-dev]
[MaxCompute, DataWorks]
select OP_USER from test.mdtsb where
uuid=’161A45E75BC88040E053441074848040′;
1.2 Problem Analysis
Use two Oracle functions:
【 DUMP; CONVERT;
Docs.oracle.com/database/12…
Docs.oracle.com/database/12…
In order to avoid the correct transcoding because the real character set of the column with garbled characters is the same as that of the operating system, the Linux environment is selected, the client character set is SET to UTF8, and SQLplus is used for query analysis.
1. Client environment:
SQL> select userenv(‘language’) from dual;
USERENV(‘LANGUAGE’)
—————————————————-
SIMPLIFIED CHINESE_CHINA.AL32UTF8
2. Use the dump function to print the encoding of the text:
SQL> select dump(OP_USER,1016) from test.mdtsb where
uuid=’161A45E75BC88040E053441074848040′ ;
DUMP(OP_USER,1016)
—————————————————-
Typ=1 Len=6CharacterSet=AL32UTF8: c0,ee,be,b0,ea,bb
This function outputs three messages
1. The length is 6 bytes
2. Character set is AL32UTF8
3. The character encoding is C0, EE,be, B0, EA,bb
Based on this information, we guess that the original character set may be GBK or GB18030 or GB2312. Because three characters, if it’s UTF8 their encoding is mostly one character for three bytes. The encoding of the GBK character set is double-byte.
3. View the GBK characters corresponding to C0, EE,be,b0, EA, and BB
www.qqxiuzi.cn/zh/hanzi-gb…
c0,ee
be,b0
ea,bb
Therefore, it can be identified that this character encoding belongs to GBK class.
4. Corresponding to the Man’s GBK class code
www.qqxiuzi.cn/bianma/zifu…
li
GB2312 code: C0EE
BIG5 code: A7F5
GBK code: C0EE
GB18030 code: C0EE
Unicode encoding: 674E
scene
GB2312 code: BEB0
BIG5 code: B4BA
GBK code: BEB0
GB18030 code: BEB0
Unicode encoding: 666F
son
GB2312 Code: EABB
BIG5 code: A9FE
GBK code: EABB
GB18030 Code: EABB
Unicode encoding: 660A
As you can see, the possible codes for these three characters are GB2312, GB18030, GBK.
5. Convert the string encoding to AL32UTF8 using the convert function
SQL> select convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) from test.mdtsb where
uuid=’161A45E75BC88040E053441074848040′ ;
CONVERT(OP_USER,’ AL32UTF8′,’ZHS16GBK’)
——————————————————-
Li Jinghao
This result verifies our guess that the character set should be the GBK character set, but because GBK and GB18030 and GB2312 are not completely parent-child set relations, this can only be one of them. However, the GBK character set is currently more commonly used.
SQL> select convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) from test.mdtsb where
uuid=’161A45E75BC88040E053441074848040′ ;
CONVERT(OP_USER,’ AL32UTF8′,’ZHS16CGB231280′)
——————————————————-
Li Jinghao
SQL> select convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) from test.mdtsb where
uuid=’161A45E75BC88040E053441074848040′ ;
CONVERT(OP_USER,’ AL32UTF8′,’ZHS32GB18030′)
——————————————————-
Li Jinghao
6. Query the string encoding of UTF8 corresponding to these three Characters
Website: www.qqxiuzi.cn/bianma/Unic…
li
Unicode encoding: 0000674E
UTF8 encoding: E69D8E
UTF16: FEFF674E
UTF32:0000FEFF0000674E
scene
Unicode encoding: 0000666F
UTF8 encoding: E699AF
UTF16 code: FEFF666F
UTF32:0000FEFF0000666F
son
Unicode code: 0000660A
UTF8 encoding: E6988A
UTF16 code: FEFF660A
UTF32:0000FEFF0000660A
So, if the original string is encoding UTF8, theoretically we should through the dump function to obtain the string coding for: e6, 9 d, e, the e6, 99, af, e6, 98, 8 a
1.3 Troubleshooting
Convert from:
SQL> select dump(convert(OP_USER,’AL32UTF8′,’ZHS16GBK’),1016) from test.mdtsb
where uuid=’161A45E75BC88040E053441074848040′ ;
DUMP(CONVERT(OP_USER,’ AL32UTF8′,’ZHS16GBK’),1016)
——————————————————-
Typ = 1 9 characterset Len = = AL32UTF8: e6, 9 d, e, the e6, 99, af, e6, 98, 8 a
The UTF8 character set should display the encoding:
E6, 9 d, 8 e, the e6, 99, af, e6, 98, 8 a.
Conclusion:
The actual encoding of the column stored in the Oracle database is the GBK character set, which is the same as the character encoding in the client environment, so it just shows up. But because Oracle stores the character set as UTF8, it should actually store the character set as UTF8. Therefore, the code of the annotation is inconsistent with the actual code – garbled code.
To deal with:
The convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) function is used to convert the real character set obtained by testing the character set on the corresponding column of the data synchronization task. After converting to ODPS, the data will no longer be garbled.
SELECT convert(KEY1,’ZHS16GBK’,’UTF8′) FROM MATDOC;
[MaxCompute, DataWorks]
select OP_USER from MaxCompute.MDTSB where uuid=’161A45E75BC88040E053441074848040′;
1.4 Target end identification method
In MaxCompute, there is a function that can parse and convert character encoding, but MaxCompute can only store one character set, “UTF-8”, so it is best to convert to UTF-8 at the source or during transmission. The following two functions are character-encoding related functions for MaxCompute.
ENCODE,
The command format
binary encode(string , string )
The command that
Encode STR in charset format.
Parameters that
STR: Mandatory. The STRING type. The string to be recoded.
Charset: Mandatory. The STRING type. Encoding format. The value can be UTF-8, UTF-16, UTF-16LE, UTF-16BE, ISO-8859-1, or US-ASCII.
Return Value Description
Return BINARY. If any input parameter is NULL, NULL is returned.
The sample
Example 1: Encode the string ABC in UTF-8 format. The following command is used as an example.
select encode(“abc”, “UTF-8”);
— return ABC.
· IS_ENCODING
The command format
boolean is_encoding(string , string , string )
The command that
Determines whether the input string STR can be converted from the specified character set from_encoding to another character set to_encoding. It can also be used to determine whether the input is garbled. In general, you can set from_encoding to UTF-8 and to_encoding to GBK.
Parameters that
STR: Mandatory. The STRING type. An empty string can be considered to belong to any character set.
From_encoding, TO_encoding: mandatory. STRING, source and target character sets.
Return Value Description
Return BOOLEAN type. Return True if STR can be successfully converted, False otherwise. If either input parameter is NULL, NULL is returned.
The sample
Example 1: Determine whether a character test or test can be converted from the UTF-8 character set to the GBK character set. The following command is used as an example.
Select is_encoding(‘ test ‘, ‘UTF-8 ‘,’ GBK ‘);
— Returns true.
Select is_encoding(‘ test ‘, ‘UTF-8 ‘,’ GBK ‘);
— Returns true.
Help.aliyun.com/document\_d…
1.5 the appendix
The Convert function
Purpose
CONVERT converts a character string from one character set to another. The datatype of the returned value is VARCHAR2.
-
The
char
argument is the value to be converted. It can be any of the datatypes CHAR, VARCHAR2, NCHAR, NVARCHAR2, CLOB, or NCLOB.
-
The
dest_char_set
argument is the name of the character set to which
char
is converted.
-
The
source_char_set
argument is the name of the character set in which
char
is stored in the database. The default value is the database character set.
Both the destination and source character set arguments can be either literals or columns containing the name of the character set.
For complete correspondence in character conversion, it is essential that the destination character set contains a representation of all the characters defined in the source character set. Where a character does not exist in the destination character set, a replacement character appears. Replacement characters can be defined as part of a character set definition.
Examples
The following example illustrates character set conversion by converting a Latin-1 string to ASCII. The result is the same as importing the same string from a WE8ISO8859P1 database to a US7ASCII database.
SELECT CONVERT(‘ A E I O ø A B C D E ‘, ‘US7ASCII’, ‘WE8ISO8859P1’)
FROM DUAL;
CONVERT (‘ AEIO Ø ABCDE ‘
———————
A E I ? ? A B C D E ?
‘US7ASCII’ is the current Oracle database character set, and ‘WE8ISO8859P1’ is the converted character set.
Common character sets include:
· US7ASCII: US 7-bit ASCII character set
· WE8DEC: West European 8-bit character set
· F7DEC: Digital French 7-bit character set
· WE8EBCDIC500: IBM West European EBCDIC Code Page 500
· WE8ISO8859P1: ISO 8859-1 West European 8-bit character set
· UTF8: Unicode 4.0 UTF-8 Universal character set, CESU-8 compliant
· AL32UTF8: Unicode 4.0 UTF-8 Universal character set
The original link
This article is the original content of Aliyun and shall not be reproduced without permission.