
Parsing XML files with SAX

Basic usage

import java.io.IOException;
import java.util.ArrayList;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import com.garlick.xml.decode.Decode;

public class SaxXmlDecode extends Decode {
    public void decode() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            factory.newSAXParser().parse(COMPANY_FILE_NAME, new MyHandler());
        } catch (IOException | SAXException | ParserConfigurationException e) {
            e.printStackTrace();
        }
    }

    private class MyHandler extends DefaultHandler {
        private ArrayList<Group> groups;
        private Group group;
        private boolean staff = false;
        
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            super.characters(ch, start, length);
            if (staff && (group != null) && group.staffs != null) {
                group.staffs.add(new String(ch, start, length));
            }
        }

        @Override
        public void endDocument() throws SAXException {
            super.endDocument();
            print(groups);
        }
        
        @Override
        public void startDocument() throws SAXException {
            super.startDocument();
            groups = new ArrayList<Group>();
        }
        
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes)
                throws SAXException {
            super.startElement(uri, localName, qName, attributes);
            if (GROUP_ELEMENT_TAG_NAME.equals(qName)) {
                group = new Group();
            } else if (LEADER_ELEMENT_TAG_NAME.equals(qName)) {
                if (group != null) {
                    if (group.leaders == null) {
                        group.leaders = new ArrayList<String>();
                    }
                    if (attributes.getValue("name") != null) {
                        group.leaders.add(attributes.getValue("name"));
                    }
                }
            } else if (STAFF_ELEMENT_TAG_NAME.equals(qName)) {
                if (group != null && group.staffs == null) {
                    group.staffs = new ArrayList<String>();
                }
                staff = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            super.endElement(uri, localName, qName);
            if (GROUP_ELEMENT_TAG_NAME.equals(qName)) {
                if (group != null) {
                    groups.add(group);
                }
            } else if (STAFF_ELEMENT_TAG_NAME.equals(qName)) {
                staff = false;
            }
        }
    }

    private class Group {
        ArrayList<String> leaders;
        ArrayList<String> staffs;
    }
    
    private void print(ArrayList<Group> groups) {
        if (groups != null && groups.size() > 0) {
            System.out.println(COMPANY_ELEMENT_TAG_NAME);
            for (int index = 0; index < groups.size(); index++) {
                System.out.println("\t" + GROUP_ELEMENT_TAG_NAME + "" + (index + 1));
                Group group = groups.get(index);
                if (group.leaders != null && group.leaders.size() > 0) {
                    for (String leader : group.leaders) {
                        System.out.println("\t\t" + LEADER_ELEMENT_TAG_NAME + ":\t" + leader);
                    }
                }
                if (group.staffs != null && group.staffs.size() > 0) {
                    for (String staff : group.staffs) {
                        System.out.println("\t\t" + STAFF_ELEMENT_TAG_NAME + ":\t" + staff);
                    }
                }
            }
        }
    }
}
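The listing above relies on a Decode base class and on constants such as COMPANY_FILE_NAME and GROUP_ELEMENT_TAG_NAME that are not shown in this post. As a minimal self-contained sketch, assuming a hypothetical company/group/leader/staff layout guessed from those constant names, the same handler logic can be run against an in-memory string. The class name, the sample document and the output format are all made up for illustration. Staff text is collected in a StringBuilder because SAX is allowed to deliver one text node through several characters calls.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone variant of SaxXmlDecode; the XML layout is an assumption.
class CompanySaxDemo {
    // Assumed sample: leaders carry a "name" attribute, staff names are element text.
    static final String SAMPLE =
            "<company>"
            + "<group><leader name=\"Alice\"/><staff>Bob</staff><staff>Carol</staff></group>"
            + "<group><leader name=\"Dave\"/><staff>Eve</staff></group>"
            + "</company>";

    /** Returns one summary line per group element found in the document. */
    static List<String> parse(String xml) {
        final List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private List<String> leaders;
            private List<String> staffs;
            private StringBuilder text; // non-null only while inside <staff>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("group".equals(qName)) {
                    leaders = new ArrayList<>();
                    staffs = new ArrayList<>();
                } else if ("leader".equals(qName) && leaders != null) {
                    String name = attrs.getValue("name");
                    if (name != null) {
                        leaders.add(name);
                    }
                } else if ("staff".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("staff".equals(qName)) {
                    if (staffs != null && text != null) {
                        staffs.add(text.toString());
                    }
                    text = null;
                } else if ("group".equals(qName)) {
                    result.add("leaders=" + leaders + " staffs=" + staffs);
                    leaders = null;
                    staffs = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        parse(SAMPLE).forEach(System.out::println);
    }
}
```

Running main prints one line per group, which makes it easy to compare against the tab-indented output of the print function above.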

Detailed source code analysis

Initialization of the SAXParserImpl object

To parse an XML file using SAX, you first create a SAXParserFactory object with its newInstance function (the listings below are from the Android implementation):

public static SAXParserFactory newInstance() {
    // instantiate the class directly rather than using reflection
    return new SAXParserFactoryImpl();
}

This simply creates a SAXParserFactoryImpl object, a subclass of SAXParserFactory. Its newSAXParser function is called next:

@Override
public SAXParser newSAXParser() throws ParserConfigurationException {
    // ... conditional check; the branch is not taken here, omitted
    try {
        return new SAXParserImpl(features);
    }
    // ... catch blocks omitted
}

That is, a SAXParserImpl object is created directly here:

SAXParserImpl(Map<String, Boolean> initialFeatures)
        throws SAXNotRecognizedException, SAXNotSupportedException {
    this.initialFeatures = initialFeatures.isEmpty()
            ? Collections.<String, Boolean>emptyMap()
            : new HashMap<String, Boolean>(initialFeatures);
    resetInternal();
}

private void resetInternal()
        throws SAXNotSupportedException, SAXNotRecognizedException {
    reader = new ExpatReader();
    for (Map.Entry<String,Boolean> entry : initialFeatures.entrySet()) {
        reader.setFeature(entry.getKey(), entry.getValue());
    }
}
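The factory, parser and reader chain described above can be observed at runtime through the public JAXP API. The following is only a sketch (the class name SaxImplementationProbe is made up): it prints whichever concrete classes the current runtime supplies. According to the source walked through here, that would be SAXParserFactoryImpl, SAXParserImpl and ExpatReader on Android, while a desktop JVM normally reports different, Xerces-based classes.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper for inspecting the SAX implementation in use.
class SaxImplementationProbe {
    /** Returns the concrete class names of the factory, the parser and its XMLReader. */
    static String[] implementationClasses() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            return new String[] {
                factory.getClass().getName(),
                parser.getClass().getName(),
                reader.getClass().getName()
            };
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : implementationClasses()) {
            System.out.println(name);
        }
    }
}
```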

Parsing XML files

public void parse(String uri, DefaultHandler dh)
    throws SAXException, IOException {
    // ... null check omitted
    InputSource input = new InputSource(uri);
    this.parse(input, dh);
}

This creates an InputSource object from the URI and passes it as an argument to the overloaded parse function:

public void parse(InputSource is, DefaultHandler dh)
    throws SAXException, IOException {
    // ...
    XMLReader reader = this.getXMLReader();
    if (dh != null) {
        reader.setContentHandler(dh);
        reader.setEntityResolver(dh);
        reader.setErrorHandler(dh);
        reader.setDTDHandler(dh);
    }
    reader.parse(is);
}
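The same DefaultHandler can be registered four times because DefaultHandler implements ContentHandler, EntityResolver, ErrorHandler and DTDHandler. As a rough sketch of the same wiring done by hand through the public XMLReader API (the class name ManualHandlerWiring and the element-counting handler are illustrative only):

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: wire one DefaultHandler into all four handler slots manually.
class ManualHandlerWiring {
    /** Counts the elements in the given XML text. */
    static int countElements(String xml) {
        final int[] count = {0};
        DefaultHandler dh = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                count[0]++;
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // One object plays all four roles, just as in SAXParser.parse.
            reader.setContentHandler(dh);
            reader.setEntityResolver(dh);
            reader.setErrorHandler(dh);
            reader.setDTDHandler(dh);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><c><d/></c></a>"));
    }
}
```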

The reader here is the ExpatReader object that was created during SAXParserImpl initialization, so the parse function of ExpatReader is called directly:

public void parse(InputSource input) throws IOException, SAXException {
    // ... null check omitted
    Reader reader = input.getCharacterStream();
    if (reader != null) {
        try {
            parse(reader, input.getPublicId(), input.getSystemId());
        }
        // ...
        return;
    }

    // Try the byte stream.
    InputStream in = input.getByteStream();
    String encoding = input.getEncoding();
    // in is null here, since no byte stream was set
    if (in != null) {
        try {
            parse(in, encoding, input.getPublicId(), input.getSystemId());
        }
        // ...
        return;
    }

    String systemId = input.getSystemId();
    // ...
    // Try the system id.
    // Create URLConnection and call the overloaded function
    in = ExpatParser.openUrl(systemId);
    try {
        parse(in, encoding, input.getPublicId(), systemId);
    } finally {
        IoUtils.closeQuietly(in);
    }
}
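ExpatReader.parse tries the character stream first, then the byte stream, and finally the system id. From the caller's side these correspond to three ways of building an InputSource. The sketch below uses only the standard org.xml.sax.InputSource API; the class name, the sample document and the temporary file are assumptions made for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of the three InputSource forms a SAX reader can consume.
class InputSourceForms {
    static final String XML = "<company><group/></company>";

    /** Parses the source and returns the qName of the root element. */
    static String rootOf(InputSource source) {
        final String[] root = {null};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if (root[0] == null) {
                    root[0] = qName;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return root[0];
    }

    /** Form 1: a character stream (java.io.Reader). */
    static String fromCharacterStream() {
        return rootOf(new InputSource(new StringReader(XML)));
    }

    /** Form 2: a byte stream plus an optional encoding. */
    static String fromByteStream() {
        InputSource source = new InputSource(
                new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        source.setEncoding("UTF-8");
        return rootOf(source);
    }

    /** Form 3: only a system id (a URI string), as in parse(String uri, DefaultHandler dh). */
    static String fromSystemId() {
        try {
            Path file = Files.createTempFile("company", ".xml");
            try {
                Files.write(file, XML.getBytes(StandardCharsets.UTF_8));
                return rootOf(new InputSource(file.toUri().toString()));
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromCharacterStream());
        System.out.println(fromByteStream());
        System.out.println(fromSystemId());
    }
}
```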

An InputSource built from a URI string has neither a character stream nor a byte stream, so ExpatReader.parse opens a URLConnection for the system id and calls the overloaded parse function:

private void parse(InputStream in, String charsetName, String publicId, String systemId)
        throws IOException, SAXException {
    // Initialize ExpatParser object
    ExpatParser parser = new ExpatParser(charsetName, this, processNamespaces, publicId, systemId);
    parser.parseDocument(in);
}

This initializes an ExpatParser object and calls its parseDocument function.

a. The ExpatParser object is initialized using new:

/*package*/ ExpatParser(String encoding, ExpatReader xmlReader,
        boolean processNamespaces, String publicId, String systemId) {
    // ...
    this.encoding = encoding == null ? DEFAULT_ENCODING : encoding;
    this.pointer = initialize(
        this.encoding,
        processNamespaces
    );
}
// native initialize function
private native long initialize(String encoding, boolean namespacesEnabled);

This calls the native method initialize, implemented in org_apache_harmony_xml_ExpatParser.cpp:

static jlong ExpatParser_initialize(JNIEnv* env, jobject object, jstring javaEncoding,
        jboolean processNamespaces) {
    // Allocate parsing context.
    std::unique_ptr<ParsingContext> context(new ParsingContext(object));
    // ...
    context->processNamespaces = processNamespaces;

    // Create a parser.
    XML_Parser parser;
    ScopedUtfChars encoding(env, javaEncoding);
    // ...
    if (processNamespaces) {
        // Use '|' to separate URIs from local names.
        parser = XML_ParserCreateNS(encoding.c_str(), '|');
    } else {
        parser = XML_ParserCreate(encoding.c_str());
    }
    // ...
    return fromXMLParser(parser);
}
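The processNamespaces flag in ExpatParser_initialize selects between XML_ParserCreateNS and XML_ParserCreate. From the Java side, namespace processing is normally requested through the standard SAXParserFactory.setNamespaceAware call. How that setting reaches ExpatReader is not shown in this post, so treat the connection as an assumption. A minimal sketch of what a handler receives once namespaces are enabled (the class name and the urn:example namespace are made up):

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: namespace-aware parsing fills in uri and localName.
class NamespaceAwareDemo {
    /** Returns "uri|localName" for the root element of the given XML text. */
    static String rootName(String xml) {
        final String[] name = {null};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // ask the parser to resolve prefixes
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    if (name[0] == null) {
                        name[0] = uri + "|" + localName;
                    }
                }
            });
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return name[0];
    }

    public static void main(String[] args) {
        System.out.println(rootName("<c:company xmlns:c=\"urn:example\"/>"));
    }
}
```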

With namespace processing enabled, the native parser is created by the XML_ParserCreateNS function (some default handlers are also set during this initialization). XML_ParserCreateNS is located in the xmlparse.c file under external/expat/:

XML_Parser XMLCALL
XML_ParserCreateNS(const XML_Char *encodingName, XML_Char nsSep)
{
  XML_Char tmp[2];
  *tmp = nsSep;
  return XML_ParserCreate_MM(encodingName, NULL, tmp);
}

That is, it calls the XML_ParserCreate_MM function:

XML_Parser XMLCALL
XML_ParserCreate_MM(const XML_Char *encodingName,
                    const XML_Memory_Handling_Suite *memsuite,
                    const XML_Char *nameSep)
{
  return parserCreate(encodingName, memsuite, nameSep, NULL);
}

// Then call parserCreate
static XML_Parser
parserCreate(const XML_Char *encodingName,
             const XML_Memory_Handling_Suite *memsuite,
             const XML_Char *nameSep,
             DTD *dtd)
{
  XML_Parser parser;
  // ...
  {
    XML_Memory_Handling_Suite *mtemp;
    parser = (XML_Parser)malloc(sizeof(struct XML_ParserStruct));
    if (parser != NULL) {
      mtemp = (XML_Memory_Handling_Suite *)&(parser->m_mem);
      mtemp->malloc_fcn = malloc;
      mtemp->realloc_fcn = realloc;
      mtemp->free_fcn = free;
    }
  }
  // ... Initialize some default parameters
  return parser;
}

This allocates the parser and initializes some of its parameters, and then parserInit is called to finish the initialization:

static void
parserInit(XML_Parser parser, const XML_Char *encodingName)
{
    // ... Initialize default parameters
}

As above, the XML_Parser object is initialized directly and given some default values. After that, functions such as XML_SetNamespaceDeclHandler are called to register its initial handlers. They are not covered here, but you can trace them in the source yourself. This completes the initialization of the XML_Parser.

b. After that, the document is parsed by calling the parseDocument function of ExpatParser:

/*package*/ void parseDocument(InputStream in) throws IOException,
        SAXException {
    startDocument();
    parseFragment(in);
    finish();
    endDocument();
}

parseDocument calls four functions. It starts with startDocument and ends with endDocument, which end up invoking the startDocument and endDocument functions of the DefaultHandler that was passed in. In between, parseFragment does the main work of parsing the XML file:

private void parseFragment(InputStream in)
        throws IOException, SAXException {
    byte[] buffer = new byte[BUFFER_SIZE];
    int length;
    while ((length = in.read(buffer)) != -1) {
        try {
            appendBytes(this.pointer, buffer, 0, length);
        }
        // ... catch block omitted
    }
}

private native void appendBytes(long pointer, byte[] xml, int offset,
        int length) throws SAXException, ExpatException;
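The read loop in parseFragment can be mimicked in plain Java to see how a document is cut into chunks. This is only a sketch: the ChunkSink interface is a hypothetical stand-in for the native appendBytes call, and the buffer is deliberately tiny so that the splitting is visible.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of the fixed-size read loop used by parseFragment.
class ChunkedReadDemo {
    /** Stand-in for the native appendBytes call (an assumption, not the real API). */
    interface ChunkSink {
        void append(byte[] data, int offset, int length);
    }

    /** Reads the stream in chunks of at most bufferSize bytes and forwards each chunk. */
    static void feed(InputStream in, int bufferSize, ChunkSink sink) throws IOException {
        byte[] buffer = new byte[bufferSize]; // reused for every chunk
        int length;
        while ((length = in.read(buffer)) != -1) {
            sink.append(buffer, 0, length);
        }
    }

    /** Returns the size of every chunk delivered for the given text. */
    static List<Integer> chunkSizes(String text, int bufferSize) {
        final List<Integer> sizes = new ArrayList<>();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try {
            feed(new ByteArrayInputStream(bytes), bufferSize,
                    (data, offset, length) -> sizes.add(length));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 12 bytes read through a 5-byte buffer
        System.out.println(chunkSizes("<a>hello</a>", 5)); // prints [5, 5, 2]
    }
}
```

Because the same buffer is reused for every chunk, the memory needed for reading does not grow with the size of the document.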

As parseFragment shows, the contents of the XML document are read into memory sequentially, at most BUFFER_SIZE bytes at a time, and each chunk is then parsed through the native appendBytes function:

static void ExpatParser_appendBytes(JNIEnv* env, jobject object, jlong pointer, jbyteArray xml, jint byteOffset, jint byteCount) {
    ScopedByteArrayRO byteArray(env, xml);
    // ...
    const char* bytes = reinterpret_cast<const char*>(byteArray.get());
    append(env, object, pointer, bytes, byteOffset, byteCount, XML_FALSE);
}

static void append(JNIEnv* env, jobject object, jlong pointer,
        const char* bytes, size_t byteOffset, size_t byteCount, jboolean isFinal) {
    XML_Parser parser = toXMLParser(pointer);
    ParsingContext* context = toParsingContext(parser);
    context->env = env;
    context->object = object;
    if (!XML_Parse(parser, bytes + byteOffset, byteCount, isFinal) && !env->ExceptionCheck()) {
        // ...
    }
    context->object = NULL;
    context->env = NULL;
}

The actual parsing is done by the XML_Parse function:

enum XML_Status XMLCALL
XML_Parse(XML_Parser parser, const char *s, int len, int isFinal)
{
    if ((parser == NULL) || (len < 0) || ((s == NULL) && (len != 0))) {
        if (parser != NULL)
            parser->m_errorCode = XML_ERROR_INVALID_ARGUMENT;
        return XML_STATUS_ERROR;
    }
    switch (parser->m_parsingStatus.parsing) {
        case XML_SUSPENDED:
            parser->m_errorCode = XML_ERROR_SUSPENDED;
            return XML_STATUS_ERROR;
        case XML_FINISHED:
            parser->m_errorCode = XML_ERROR_FINISHED;
            return XML_STATUS_ERROR;
        // initialize to this value
        case XML_INITIALIZED:
            if (parser->m_parentParser == NULL && !startParsing(parser)) {
                parser->m_errorCode = XML_ERROR_NO_MEMORY;
                return XML_STATUS_ERROR;
            }
        /* fall through */
        default:
            // Start parsing
            parser->m_parsingStatus.parsing = XML_PARSING;
    }
    // ...
    {
        void *buff = XML_GetBuffer(parser, len);
        if (buff == NULL)
            return XML_STATUS_ERROR;
        else {
            memcpy(buff, s, len);
            // buffer
            return XML_ParseBuffer(parser, len, isFinal);
        }
    }
}

Finally, the XML_ParseBuffer function is called to parse the data

enum XML_Status XMLCALL
XML_ParseBuffer(XML_Parser parser, int len, int isFinal)
{
  const char *start;
  enum XML_Status result = XML_STATUS_OK;

  if (parser == NULL)
      return XML_STATUS_ERROR;
  switch (parser->m_parsingStatus.parsing) {
      case XML_SUSPENDED:
          parser->m_errorCode = XML_ERROR_SUSPENDED;
          return XML_STATUS_ERROR;
      case XML_FINISHED:
          parser->m_errorCode = XML_ERROR_FINISHED;
          return XML_STATUS_ERROR;
      case XML_INITIALIZED:
          if (parser->m_parentParser == NULL && !startParsing(parser)) {
              parser->m_errorCode = XML_ERROR_NO_MEMORY;
              return XML_STATUS_ERROR;
          }
      /* fall through */
      default:
          parser->m_parsingStatus.parsing = XML_PARSING;
  }
  // Initialize the data
  start = parser->m_bufferPtr;
  parser->m_positionPtr = start;
  parser->m_bufferEnd += len;
  parser->m_parseEndPtr = parser->m_bufferEnd;
  parser->m_parseEndByteIndex += len;
  parser->m_parsingStatus.finalBuffer = (XML_Bool)isFinal;
  // Call the m_processor function to parse the XML data. This value is set to prologInitProcessor when the XML_Parser object is initialized
  parser->m_errorCode = parser->m_processor(parser, start, parser->m_parseEndPtr, &parser->m_bufferPtr);

  if (parser->m_errorCode != XML_ERROR_NONE) {
      parser->m_eventEndPtr = parser->m_eventPtr;
      parser->m_processor = errorProcessor;
      return XML_STATUS_ERROR;
  } else {
      switch (parser->m_parsingStatus.parsing) {
          case XML_SUSPENDED:
              result = XML_STATUS_SUSPENDED;
              break;
          case XML_INITIALIZED:
          case XML_PARSING:
              if (isFinal) {
                  parser->m_parsingStatus.parsing = XML_FINISHED;
                  return result;
              }
          default: ; /* should not happen */
      }
  }

  XmlUpdatePosition(parser->m_encoding, parser->m_positionPtr, parser->m_bufferPtr, &parser->m_position);
  parser->m_positionPtr = parser->m_bufferPtr;
  return result;
}

The first call goes to the prologInitProcessor function, which parses the buffered data. The while loop in parseFragment then reads the next chunk into memory and repeats this step until all of the data has been parsed.
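Seen from the handler, the pipeline traced above shows up as callbacks in document order, bracketed by startDocument and endDocument. The following sketch records that order using only the standard DefaultHandler callbacks. The class name and the sample input are illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: log the order in which SAX callbacks arrive.
class SaxEventOrderDemo {
    /** Returns the document and element events in the order they were delivered. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                log.add("startDocument");
            }

            @Override
            public void endDocument() {
                log.add("endDocument");
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                log.add("startElement:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("endElement:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<group><staff>Bob</staff></group>").forEach(System.out::println);
    }
}
```

The characters callback is left out of the log on purpose, since a parser may deliver one text node in several pieces.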

Summary of SAX XML parsing

Usage

1) Create a SAXParserFactory with SAXParserFactory.newInstance, create a SAXParserImpl with newSAXParser, and call its parse function.
2) In a class that extends DefaultHandler, override the startDocument/endDocument, startElement/endElement and characters functions to process the XML file step by step as the events arrive.

Source code analysis

1) SAXParserFactory.newInstance creates a SAXParserFactoryImpl object, and its newSAXParser function then creates a SAXParserImpl object.
2) An ExpatReader object is initialized when the SAXParserImpl object is created.
3) The parse function of the SAXParserImpl object calls the parse function of the ExpatReader object directly.
4) During that parse call, an ExpatParser object is initialized.
5) Initializing the ExpatParser object calls the native initialize function, which creates the parser through the libexpat library.
6) parseDocument then reads the XML file chunk by chunk (at most BUFFER_SIZE bytes at a time) and parses each chunk in turn.

Advantages and disadvantages of SAX parsing XML

  1. SAX parsing is carried out in the native layer, so it is relatively fast
  2. SAX reads and parses the data one part at a time, so memory use is low and relatively constant
  3. SAX requires you to override several DefaultHandler functions and handle the events step by step, so you need a corresponding understanding of the structure of the XML file
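The third point is the main cost of SAX: the handler has to keep track of where it is in the document. One common way to do this, shown here only as a sketch with a made-up class name and sample input, is to maintain a stack of open element names and to collect text in a StringBuilder until the element ends.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: track the current element path while handling SAX events.
class PathTrackingDemo {
    /** Returns "path=text" for every element that directly contains non-blank text. */
    static List<String> textByPath(String xml) {
        final List<String> result = new ArrayList<>();
        final Deque<String> path = new ArrayDeque<>();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                path.addLast(qName); // remember where we are
                text.setLength(0);   // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may arrive in several pieces
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                if (!value.isEmpty()) {
                    result.add(String.join("/", path) + "=" + value);
                }
                text.setLength(0);
                path.removeLast();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<company><group><staff>Bob</staff><staff>Carol</staff></group></company>";
        textByPath(xml).forEach(System.out::println);
    }
}
```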

Extension

SAXParserFactory provides two overloads of newInstance. The two-argument overload, newInstance(String factoryClassName, ClassLoader classLoader), lets you supply a custom factory implementation.
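As a small sketch of the two-argument overload, the example below asks the runtime for its default factory, reads that factory's class name, and then requests the same class explicitly. The class name ExplicitFactoryDemo is made up; in real use you would pass the name of your own SAXParserFactory subclass, which is assumed here to be available on the class path.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical demo of SAXParserFactory.newInstance(String, ClassLoader).
class ExplicitFactoryDemo {
    /** Creates a factory by explicit class name through the two-argument overload. */
    static SAXParserFactory explicitFactory() {
        // Discover the default implementation name instead of hard-coding one.
        String className = SAXParserFactory.newInstance().getClass().getName();
        // Passing null as the ClassLoader leaves the choice of class loader to the platform.
        return SAXParserFactory.newInstance(className, null);
    }

    public static void main(String[] args) {
        System.out.println(explicitFactory().getClass().getName());
    }
}
```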