
Man lives for life itself, not for anything besides life. — Yu Hua, To Live

Previously, I talked about the risks of web crawlers, namely the importance of protecting personal data and information.

Of course, I am not telling everyone to go and crawl personal information; rather, precisely because that possibility exists, you should be all the more careful about protecting your own privacy.

So how do we find the information that is actually useful to us when we crawl a web page? In other words, how do we extract and print just the set of information we want in Python?

1. Why should information be extracted?

First of all, when a Python crawler fetches a web page, simply printing out the entire page is not practical. If you have ever looked at a page's source code (press F12, or right-click and choose to view or inspect the source), you know that a single page contains an enormous amount of content.

Much of it is front-end HTML and code in other languages, which is cumbersome to work through, and I doubt anyone would want to read it all by hand.

Before extracting information, we should first understand how information is marked up. By analogy: we have many items at home, and to let others know what each one is for, we write its function on a small slip of paper and stick the slip onto the item.

The benefits of information tagging are obvious:

  • Tagged information forms an organizational structure and adds a dimension to the information.
  • Tagged information can be communicated, stored, and displayed.
  • The tag structure is as valuable as the information itself.
  • Tagged information is easier for programs to understand and use.

Take HTML, the markup used for information in web pages, as an example:

HTML: H for Hyper, T for Text, M for Markup, L for Language.

HTML is the information-organization format of the WWW: it lets you embed hypermedia such as sound, images, and video into text.

HTML organizes different types of information through predefined tags of the form <>…</>.

2. Three forms of information markup

There are three internationally recognized forms of information markup: XML, JSON, and YAML.

XML

XML (eXtensible Markup Language): a standard language, similar in form to HTML, that uses tags as the main way to construct and express information. For example:

    <img src="china.jpg" size="10"> ... </img>

Here the tag name is img, its attributes are src="china.jpg" and size="10", and comments are written as <!-- ... -->.

Example:

    <person>
        <firstName>Tian</firstName>
        <address>
            <streetAddr>Hunan Province</streetAddr>
            <city>Changsha</city>
        </address>
        <prof>Com</prof>
        <prof>ser</prof>
    </person>

HTML came before XML; XML grew out of the same tag-based approach and generalizes it into an extensible format.
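To make the idea concrete, here is a minimal Python sketch that parses the example above with the standard-library xml.etree.ElementTree module and pulls out a few fields:

    import xml.etree.ElementTree as ET

    xml_text = """
    <person>
        <firstName>Tian</firstName>
        <address>
            <streetAddr>Hunan Province</streetAddr>
            <city>Changsha</city>
        </address>
        <prof>Com</prof>
        <prof>ser</prof>
    </person>
    """

    # Parse the XML string into an element tree
    root = ET.fromstring(xml_text)

    # Walk the tree and extract a few fields
    print(root.find("firstName").text)             # Tian
    print(root.find("address/city").text)          # Changsha
    print([p.text for p in root.findall("prof")])  # ['Com', 'ser']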

JSON

JSON (JavaScript Object Notation) expresses information as typed key-value pairs of the form key: value. For instance:

A single value: "name": "Beijing"

Multiple values: "name": ["Beijing", "Hunan"]

Nested values: "name": {"newName": "Beijing", "oldName": "Hunan"}

Example:

    {
        "firstName": "Tian",
        "address": {
            "streetAddr": "Hunan Province",
            "city": "Changsha"
        },
        "prof": ["Com", "ser"]
    }
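A minimal Python sketch with the standard-library json module shows how these typed key-value pairs map directly onto Python dicts and lists:

    import json

    json_text = '''
    {
        "firstName": "Tian",
        "address": {
            "streetAddr": "Hunan Province",
            "city": "Changsha"
        },
        "prof": ["Com", "ser"]
    }
    '''

    # json.loads turns the text into nested Python objects
    data = json.loads(json_text)

    print(data["firstName"])        # Tian (string)
    print(data["address"]["city"])  # Changsha (nested object -> nested dict)
    print(data["prof"])             # ['Com', 'ser'] (array -> list)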

YAML

YAML: Yet Another Markup Language

YAML uses untyped key-value pairs of the form key: value, for example:

name: Beijing

Nesting is expressed through indentation.

Other features: a leading - marks items in a parallel (list) relationship, | introduces a whole block of text, and # starts a comment.

Example:

    firstName: Tian
    address:
        streetAddr: Hunan Province
        city: Changsha
    prof:
        - Com
        - ser
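A minimal Python sketch, assuming the third-party PyYAML package is installed (pip install pyyaml), loads the example above and shows how indentation becomes nesting:

    import textwrap
    import yaml  # third-party package: pip install pyyaml

    # dedent strips the leading indentation so the YAML keys start at column 0
    yaml_text = textwrap.dedent("""\
        firstName: Tian
        address:
          streetAddr: Hunan Province
          city: Changsha
        prof:
          - Com
          - ser
    """)

    # safe_load parses the YAML text into plain Python objects
    data = yaml.safe_load(yaml_text)

    print(data["address"]["city"])  # Changsha
    print(data["prof"])             # ['Com', 'ser']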

3. Comparison of the three markup forms

A simple comparison of information markup forms:

XML marks up information with paired <>…</> tags.

JSON marks up information with typed key-value pairs.

YAML marks up information with untyped key-value pairs.

How the three mainstream markup forms compare, and where each is used:

XML: the earliest general-purpose information markup language; extensible, but relatively verbose. Mainly used for information exchange, transmission, and expression on the Internet.

JSON: the information is typed and well suited to processing by programs (especially JavaScript); simpler than XML. Mainly used for information exchange between the cloud and the nodes of mobile applications. Because it has no comment syntax, it is mostly used in programs and at interfaces.

YAML: the information is untyped, the proportion of the text devoted to the information itself is the highest, and readability is good. Mainly used for the configuration files of various systems, where comments make it easy to read.

4. Three methods of information extraction

1. Fully parse the markup of the document, then extract the key information (parsing).

This requires a markup parser, for example the tag-tree traversal provided by the bs4 (Beautiful Soup) library.

Advantages: Accurate information parsing.

Disadvantages: the extraction process is tedious and slow.
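For example, a minimal sketch of this parsing approach using bs4 (Beautiful Soup); the HTML snippet here is made up purely for illustration:

    from bs4 import BeautifulSoup  # third-party package: pip install beautifulsoup4

    # A made-up snippet standing in for a downloaded page
    html = """
    <html><body>
      <a href="https://example.com/page1">Page 1</a>
      <a href="https://example.com/page2">Page 2</a>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Fully parse the tag tree, then traverse it to collect every link
    for a in soup.find_all("a"):
        print(a.get("href"))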

2. Ignore the markup and search directly for the key information (searching).

This treats the information document as text and uses search (find) functions on it.

Advantages: The extraction process is simple and fast.

Disadvantages: the accuracy of the extracted results depends on the content of the information itself.
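For contrast, a minimal sketch of the search approach: treat the same made-up snippet as plain text and scan it with the standard-library re module, ignoring the markup structure entirely:

    import re

    # The same made-up snippet, treated purely as text
    html = """
    <html><body>
      <a href="https://example.com/page1">Page 1</a>
      <a href="https://example.com/page2">Page 2</a>
    </body></html>
    """

    # Search the raw text for href="..." patterns without parsing any tags
    links = re.findall(r'href="(.*?)"', html)
    print(links)  # ['https://example.com/page1', 'https://example.com/page2']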

3. Fusion method (searching + parsing):

Key information is extracted by combining structural parsing of the markup with text searching.

This requires both a markup parser and a text-search function.

Combining the above two methods is the best choice.
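A minimal sketch of the fusion approach on the same kind of made-up snippet: Beautiful Soup parses the tag tree, and a search condition filters the parsed results:

    import re
    from bs4 import BeautifulSoup  # third-party package: pip install beautifulsoup4

    html = """
    <html><body>
      <a href="https://example.com/page1">Page 1</a>
      <a href="https://other.org/about">About</a>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Parse the tree for <a> tags, then search each href for a pattern we care about
    for a in soup.find_all("a", href=re.compile(r"example\.com")):
        print(a.get("href"))  # prints only the links on example.com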

Python crawler series, to be continued…