While using Spring to parse an XML bean file, I ran into an encoding problem. The error stack is shown below:

Exception in thread "main" org.springframework.beans.factory.BeanDefinitionStoreException: IOException parsing XML document from resource loaded from byte array; nested exception is com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence.
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.doLoadBeanDefinitions(XmlBeanDefinitionReader.java:416)
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:342)
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:310)
    at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:143)
    at org.springframework.context.support.AbstractXmlApplicationContext.loadBeanDefinitions(AbstractXmlApplicationContext.java:109)
    at org.springframework.context.support.AbstractXmlApplicationContext.loadBeanDefinitions(AbstractXmlApplicationContext.java:80)
    at org.springframework.context.support.AbstractRefreshableApplicationContext.refreshBeanFactory(AbstractRefreshableApplicationContext.java:123)
    at org.springframework.context.support.AbstractApplicationContext.obtainFreshBeanFactory(AbstractApplicationContext.java:422)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:352)
    at me.guimy.XmlApplicationContext.<init>(XmlApplicationContext.java:22)
    at me.guimy.XmlApplicationContext.main(XmlApplicationContext.java:42)
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:701)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:372)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1895)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanData(XMLEntityScanner.java:1375)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(XMLDocumentFragmentScannerImpl.java:1654)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3014)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at org.springframework.beans.factory.xml.DefaultDocumentLoader.loadDocument(DefaultDocumentLoader.java:75)
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.doLoadBeanDefinitions(XmlBeanDefinitionReader.java:396)
    ... 10 more

The critical error message is "Invalid byte 2 of 2-byte UTF-8 sequence". Many articles suggest changing the declared encoding to GBK or some other encoding, but why does that help? What is the root cause of the problem? To find out, let's reduce the code to its essentials. The key code is as follows:

package me.guimy;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.AbstractXmlApplicationContext;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;

public class XmlApplicationContext extends AbstractXmlApplicationContext {
    private Resource configResource;
    private ClassLoader cl;

    public XmlApplicationContext(String str) {
        // getBytes() without arguments encodes with the platform default charset
        configResource = new ByteArrayResource(str.getBytes());
        cl = this.getClassLoader();
        refresh();
    }

    @Override
    protected Resource[] getConfigResources() {
        return new Resource[]{this.configResource};
    }

    public static void main(String[] args) {
        // The XML declares UTF-8 and carries a Chinese value inside a CDATA section
        String data = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                "<!DOCTYPE beans PUBLIC \"-//SPRING//DTD BEAN//EN\" \"http://www.springframework.org/dtd/spring-beans" +
                ".dtd\">\n" +
                "<beans>\n" +
                "    <bean class=\"me.guimy.Student\" id=\"student\">\n" +
                "        <property name=\"name\">\n" +
                "            <value><![CDATA[中文]]></value>\n" +
                "        </property>\n" +
                "    </bean>\n" +
                "</beans>";
        ApplicationContext applicationContext = new XmlApplicationContext(data);
    }
}

This code loads a Spring XML bean definition whose value contains Chinese text, and the "Invalid byte 2 of 2-byte UTF-8 sequence" error is evidently raised when the parser reaches the Chinese characters. The XML already declares encoding="UTF-8", so in theory parsing Chinese text should not fail. Have we hit a JDK bug?

The exception stack shows that Spring delegates XML parsing to the JDK classes under the com.sun.org.apache.xerces.internal package, so the first step is to debug and find out which encoding is actually used during parsing. The UTF8Reader frames in the stack show that the parser decodes the input as UTF-8. Stepping through UTF8Reader also shows that parsing XML is, in effect, decoding the byte sequence that represents the XML text.

So the parser decodes the bytes as UTF-8, and the XML text itself declares UTF-8. That shifts suspicion to how the byte sequence was produced: is it really a UTF-8 byte sequence? To answer that, we need to see how the XML text is converted to bytes. In the constructor of the XmlApplicationContext class, str.getBytes() converts the String to a byte array using the default encoding. Here is the getBytes method:

// java.lang.String
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}

// java.lang.StringCoding
static byte[] encode(char[] ca, int off, int len) {
    // Name of the platform default charset
    String csn = Charset.defaultCharset().name();
    try {
        return encode(csn, ca, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    // Fall back to ISO-8859-1 if the default charset is unsupported
    try {
        return encode("ISO-8859-1", ca, off, len);
    } catch (UnsupportedEncodingException x) {
        System.exit(1);
        return null;
    }
}

The line String csn = Charset.defaultCharset().name() obtains the default encoding, which is determined by the file.encoding system property, so it is easy to check: System.out.println(System.getProperty("file.encoding")) prints GBK.

Now the problem is clear. When the XML text is converted to a byte sequence, it is encoded with GBK, but when the byte sequence is converted back to text, it is decoded as UTF-8. The two encodings are entirely different, so it is not surprising that "Invalid byte 2 of 2-byte UTF-8 sequence" is reported.

But why does the message say "Invalid byte 2 of 2-byte UTF-8 sequence" (or, in other cases, "Invalid byte 3 of 3-byte UTF-8 sequence")? Put simply, a GBK-encoded byte sequence is being decoded under UTF-8 rules. From one particular byte, the UTF-8 decoder concludes that the next two bytes form a single character, so it goes on to read the second byte, only to find that it does not follow the rule for the second byte of a two-byte UTF-8 character. It therefore reports "Invalid byte 2 of 2-byte UTF-8 sequence". To explain in more detail, we need the UTF-8 encoding rules, which are listed below.
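The mismatch can be reproduced without Spring. The following is a minimal sketch, not the author's code: it feeds the JDK's DOM parser a small XML document that declares UTF-8, once with the text encoded as GBK and once encoded as UTF-8. The class name EncodingMismatchDemo and its helper methods are hypothetical, the DOCTYPE is omitted so that no DTD has to be fetched, and the GBK bytes for 中文 are hard-coded so the sketch does not depend on the GBK charset being installed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class EncodingMismatchDemo {
    // The two characters of the example value, written as escapes so that the
    // source file's own encoding cannot interfere.
    static final String TEXT = "\u4e2d\u6587";
    // The same two characters encoded as GBK (hard-coded assumption: D6 D0 CE C4).
    static final byte[] GBK_TEXT = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    /** Wraps already-encoded text bytes in an XML document that declares UTF-8. */
    static byte[] xmlAround(byte[] textBytes) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><value><![CDATA["
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "]]></value>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(textBytes, 0, textBytes.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    /** Returns true if the JDK's DOM parser accepts the given bytes. */
    static boolean parses(byte[] xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            System.out.println("parse failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        // GBK bytes under a UTF-8 declaration: the parser rejects them.
        System.out.println("GBK bytes parse:   " + parses(xmlAround(GBK_TEXT)));
        // Bytes produced with an explicit UTF-8 charset match the declaration.
        System.out.println("UTF-8 bytes parse: "
                + parses(xmlAround(TEXT.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Under these assumptions, one way to make the constructor shown earlier independent of the platform default is to pass the charset explicitly, for example str.getBytes(StandardCharsets.UTF_8), so that the bytes agree with the encoding declared in the XML.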

1. Unicode characters in the range 0x00-0x7F are encoded in one byte, whose highest bit is 0;
2. In every multi-byte encoding, each byte after the first has 1 as its first bit and 0 as its second bit;
3. Unicode characters in the range 0x0080-0x07FF are encoded in two bytes; the first two bits of the first byte are 1 and the third bit is 0;
4. Unicode characters in the range 0x0800-0xFFFF are encoded in three bytes; the first three bits of the first byte are 1 and the fourth bit is 0;
5. Unicode characters in the range 0x010000-0x10FFFF are encoded in four bytes; the first four bits of the first byte are 1 and the fifth bit is 0.
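The rules above can be sketched directly in code. This is an illustrative encoder only (the class name Utf8Rules and its encode method are hypothetical, and surrogate code points are not rejected as a real encoder would); it applies the bit patterns from the list and compares the result with the JDK's own UTF-8 encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Rules {
    /** Encodes one Unicode code point by applying the bit patterns of the rules directly. */
    static byte[] encode(int cp) {
        if (cp <= 0x7F) {
            // Rule 1: 0xxxxxxx
            return new byte[]{(byte) cp};
        } else if (cp <= 0x7FF) {
            // Rule 3: 110xxxxx 10xxxxxx
            return new byte[]{
                    (byte) (0xC0 | (cp >> 6)),
                    (byte) (0x80 | (cp & 0x3F))};
        } else if (cp <= 0xFFFF) {
            // Rule 4: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[]{
                    (byte) (0xE0 | (cp >> 12)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                    (byte) (0x80 | (cp & 0x3F))};
        } else {
            // Rule 5: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[]{
                    (byte) (0xF0 | (cp >> 18)),
                    (byte) (0x80 | ((cp >> 12) & 0x3F)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                    (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        // One sample per rule: 'A', U+00E9, U+4E2D, U+1F600
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            byte[] manual = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, manual.length, Arrays.equals(manual, jdk));
        }
    }
}
```

Rule 2 shows up as the 0x80 | (... & 0x3F) expressions: every byte after the first carries six payload bits behind the fixed prefix 10.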

For two-byte characters, the first byte has the form 110x xxxx and the second byte has the form 10xx xxxx. So when the decoder sees a first byte of the form 110x xxxx, it checks whether the first two bits of the second byte are 10, and if they are not, the sequence is not a valid UTF-8 character. Take the character 中 from 中文 as an example: its GBK encoding is 1101 0110 1101 0000. Under the UTF-8 rules, the first byte 1101 0110 is recognized as the first byte of a two-byte character, so the second byte should have the form 10xx xxxx. The actual second byte is 1101 0000, which starts with 11 rather than 10, so the second byte is invalid and the parser reports "Invalid byte 2 of 2-byte UTF-8 sequence".
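The check described above can be traced in a few lines. This is a minimal sketch with hypothetical names (GbkAsUtf8, sequenceLength, isContinuation, isValidUtf8), not the parser's actual implementation: it classifies the two GBK bytes 0xD6 0xD0 by their leading bits and then confirms the result with the JDK's strict UTF-8 decoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class GbkAsUtf8 {
    /** Sequence length implied by a UTF-8 first byte, or 0 if it cannot start a sequence. */
    static int sequenceLength(int b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;                         // 10xxxxxx or an unused pattern
    }

    /** True if the byte has the 10xxxxxx form required of every non-first byte. */
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    /** Strict check: true only if the whole array decodes as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0}; // the GBK bytes discussed above
        System.out.println("length implied by byte 1: " + sequenceLength(gbk[0] & 0xFF));
        System.out.println("byte 2 is 10xxxxxx:       " + isContinuation(gbk[1] & 0xFF));
        System.out.println("valid UTF-8 overall:      " + isValidUtf8(gbk));
    }
}
```

The first byte announces a two-byte sequence, the second byte fails the continuation test, and the strict decoder rejects the pair, which mirrors the parser's "byte 2 of 2-byte" wording.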
