“This is the 26th day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge of 2021.”
How much regular expression do you need to learn for Python crawler scraping? And likewise, how much XPath do you need to learn before you can use it? These are two questions new crawler learners often ask, and neither has a standard answer…
This blog gives you an overview of how far you should take both of them in the beginner crawler stage, and sets a small goal: the beginner's scope.
Regular expression
In Python, regular expressions are used mainly through the re module. The module itself is not difficult to use, but writing the expressions confuses many people.
Regular expressions are hard to write for the following reasons:
- A regular expression is effectively a small language of its own, with its own specification;
- Regular expressions are independent of any programming language, which means they can be combined with any programming language;
- Everyone writes expressions differently, so there is no single correct answer to any given matching problem;
- Regular expressions involve syntax, modifiers, metacharacters, and operator precedence, so the concepts are a little complicated and hard to get started with.
Now that we have analyzed why regular expressions are difficult to write, we can overcome these obstacles one by one.
First of all, what is a re?
Simply put, a regular expression is a text (string) retrieval pattern used to match a target substring inside a longer string.
For example, extract 1234 from ABC1234dferTG.
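To make this concrete, here is a minimal sketch using Python's built-in re module (it uses the \d metacharacter, meaning "a digit", which is introduced below):

```python
import re

# Extract the digits 1234 from the longer string ABC1234dferTG.
match = re.search(r"\d+", "ABC1234dferTG")
if match:
    print(match.group())  # 1234
```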
Beginner’s regular expressions
In the beginning stage, the first thing to do is to understand the basic syntax of regular expressions, starting with characters.
Ordinary characters: this part is very simple. For example, a and 1 are ordinary characters; used in a regular expression, they match the literal string a or 1.
After ordinary characters comes the core of regular expressions: metacharacters.
Metacharacters are syntactic tokens that carry special meanings in a regular expression.
Common metacharacters are as follows:
- \d : matches a digit;
- \w : matches a letter, digit, or underscore (_).
The threshold at this learning stage is memorizing the metacharacters.
In the beginner stage, especially the beginner Python crawler scraping stage, these are the ones to memorize (must master):
- . : matches any single character other than newline characters (\n, \r), so it covers most content;
- * : matches the preceding expression 0 or more times;
- + : matches the preceding expression 1 or more times;
- ? : matches the preceding sub-expression 0 or 1 times. It has another use: when ? immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy, meaning it matches as little of the searched string as possible;
- \ : the escape character; for example, to match a literal . you need to write \. ;
- \s : note that it is a lowercase s; matches any whitespace character, including spaces, tabs, and form feeds. It is very common when parsing HTML, because newlines appear frequently in web page source;
- [xyz] : matches any one character inside the brackets;
- (pattern) : groups and matches pattern.
Once you have mastered and can fluently use the above 8 metacharacters, you will be able to parse most ordinary web pages without obstacles in the beginner crawler stage.
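To check yourself on the eight metacharacters, here is a small sketch using Python's built-in re module (the sample patterns and strings are my own illustrations, not from the series):

```python
import re

# One test string per metacharacter; each pattern fully matches its sample.
samples = {
    r"a.c":     "abc",    # . matches any single non-newline character
    r"ab*":     "abbb",   # * matches the preceding expression 0 or more times
    r"ab+":     "ab",     # + matches the preceding expression 1 or more times
    r"colou?r": "color",  # ? matches the preceding expression 0 or 1 times
    r"3\.14":   "3.14",   # \ escapes a metacharacter so it matches literally
    r"a\sb":    "a b",    # \s matches any whitespace character
    r"[xyz]":   "y",      # [xyz] matches any one character in the brackets
    r"(ab)+":   "abab",   # (pattern) groups a sub-expression
}

for pattern, text in samples.items():
    assert re.fullmatch(pattern, text), f"{pattern!r} should match {text!r}"
print("all 8 metacharacters matched")
```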
Let me explain greedy mode.
Given the string www.csdn.com, if you write the pattern w+, it matches www: the regex engine matches as many w characters as possible. If you change the pattern to w+?, the engine matches as few w characters as possible; that trailing ? metacharacter switches the match into non-greedy mode.
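The greedy vs non-greedy difference can be seen directly in Python's re module:

```python
import re

text = "www.csdn.com"

# Greedy: w+ grabs as many w characters as it can.
greedy = re.search(r"w+", text).group()
print(greedy)  # www

# Non-greedy: w+? stops as soon as a single w satisfies the pattern.
lazy = re.search(r"w+?", text).group()
print(lazy)  # w
```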
Getting started with regular expressions in crawlers
With these metacharacters in mind, if you search the earlier articles in the “120 Cases of Crawler” series, you will find that one pattern appears many times: (.*?). As you can see, this is the most common regular expression; it is lazy, but it is genuinely easy to write.
If the page contains newlines or spaces, it evolves into (.|\s)*? . Combined with the metacharacters memorized above, can you work out what it means?
If you understand it, you are a step in the right direction. After all, we just introduced another metacharacter, | (alternation): .|\s matches either . (any non-newline character) or \s (any whitespace character).
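A sketch of both patterns in action (the HTML snippets below are my own illustrations):

```python
import re

html = '<div class="title">Eraser blog</div>'
# (.*?) lazily captures whatever sits between two fixed anchors.
title = re.search(r'<div class="title">(.*?)</div>', html).group(1)
print(title)  # Eraser blog

# When the target text spans a newline, . alone cannot cross it,
# but (.|\s) also accepts whitespace characters such as \n.
html2 = '<div class="title">Eraser\nblog</div>'
title2 = re.search(r'<div class="title">((.|\s)*?)</div>', html2).group(1)
print(title2)
```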
Once you’re familiar with the basic metacharacters, you can expand on them and learn other metacharacters to make your regular expression writing standard and efficient.
In addition to metacharacters, we also need to learn the modifiers (flags) of regular expressions. There is not much content here; the common ones are:
- i : ignore case;
- g : global matching;
- m : multi-line matching;
- s : makes the dot (.) metacharacter also match newlines and other whitespace.
The reason for not focusing on modifiers here is that each programming language has its own specific implementation of them, so they need to be learned in the context of the language you use. For example, Python's re module has its own flags; you can search for the re module's usage.
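Since this series is about Python, here is a hedged sketch of how these modifiers map onto the re module's flags (re.I, re.S, and re.M are real flags; Python has no g flag, and re.findall plays that "global" role):

```python
import re

# i -> re.I : ignore case.
assert re.search(r"python", "I love Python", re.I)

# s -> re.S : the dot also matches newline characters.
assert re.search(r"<p>.+</p>", "<p>line1\nline2</p>", re.S)

# m -> re.M : ^ and $ match at the start/end of every line.
assert re.findall(r"^\d+", "1 a\n2 b", re.M) == ["1", "2"]

print("all flags behave as described")
```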
As for the remaining advanced content, please focus on learning regular expression grouping, which will appear in later summaries.
The XPath expression
XPath stands for XML Path Language, a language for locating node elements in an XML document.
If you further study XPath, there are still many knowledge points that need to be added, but as a beginner of crawler collection, you can master the following contents first.
Must-learn syntax for beginners
XPath path expression
An XPath path expression works basically the same way as a file path on your computer's hard disk.
Distinguish between / and //, which select from the root node and from a node at any position, respectively.
For example, given the following XML document, the root node is root, and the rest of the content is shown below:
<root>
<book bid="1">
<author>Eraser</author>
</book>
<book bid="2">
<author>Eraser</author>
</book>
</root>
For example, /root/book selects the book nodes starting from the root node. If you write /book instead, no data is matched, because book is not the root.
With //book, all book elements at any depth are matched.
All book nodes can also be matched by using book directly, as a relative path.
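As a sketch of the / vs // distinction in Python's standard library (xml.etree.ElementTree supports only a limited XPath subset, so relative paths stand in for the absolute forms above; the full syntax is available in third-party lxml):

```python
import xml.etree.ElementTree as ET

xml_doc = """<root>
  <book bid="1"><author>Eraser</author></book>
  <book bid="2"><author>Eraser</author></book>
</root>"""
root = ET.fromstring(xml_doc)

# "book" is a relative path: direct children of root only (like /root/book).
direct_books = root.findall("book")
# ".//author" searches at any depth (like //author).
all_authors = root.findall(".//author")
# "author" finds nothing, because author is not a direct child of root.
direct_authors = root.findall("author")
print(len(direct_books), len(all_authors), len(direct_authors))  # 2 2 0
```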
To test XPath expressions, you can create an HTML file, open the browser's developer tools, and press Ctrl+F in the Elements panel to bring up the search box, where XPath can be tested, as shown in the figure below.
Of course, the browser automatically generates the html, head, and body nodes.
Once you have the concept of the root node, you can move on. A single dot (.) represents the current node, and two dots (..) represent the parent of the current node. The @ symbol selects an attribute; for example, the following XPath extracts, from the book nodes, the node whose bid attribute equals 1:
/html/body/root/book[@bid=1]
The extraction syntax is as follows:
[@attribute='attribute value']  # if the attribute value is a number, the quotes can also be dropped
If @attribute is used on its own, every node that has that attribute is extracted.
Other ways to select elements
Select unknown elements
In XPath, you can use * to select unknown nodes. For example, /book/*/name selects the name nodes under all child nodes of the book node.
Predicates
A predicate, used to find a particular node or a node containing a specified value, is nested inside []. For example:
Select the first element /root/book[1] and the last element /root/book[last()].
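Position predicates also work in ElementTree's XPath subset (remember that XPath counts from 1, not 0):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><book bid="1"/><book bid="2"/><book bid="3"/></root>')

first_book = root.find("book[1]")      # first book element
last_book = root.find("book[last()]")  # last book element
print(first_book.get("bid"), last_book.get("bid"))  # 1 3
```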
Extract the attribute value or text value in the tag
In crawler scraping, extracting a tag's attribute values or the text inside a tag is very common. Refer to these cases: extract a tag's attribute value with //book/@bid, and extract a tag's text with //book/text().
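ElementTree does not accept /@attr or /text() inside the path itself (lxml does), so in the standard library the equivalents are the element's .get() method and .text attribute:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><book bid="1"><author>Eraser</author></book></root>')
book = root.find("book")

# Equivalent of //book/@bid: read the attribute from the matched element.
bid_value = book.get("bid")
# Equivalent of //book/author/text(): read the element's text.
author_text = book.find("author").text
print(bid_value, author_text)  # 1 Eraser
```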
That’s all a beginner needs to know about XPath. However, some tutorials tell you to copy the XPath directly from the developer tools, like this:
An XPath expression copied this way carries a lot of redundancy, so it is recommended to write it yourself.
The following is a directly copied expression:
/html/body/div[2]/div[5]
Time to bookmark
This article is the 19th blog in the “120 Cases of Crawler” series. Although it did not actually scrape a site, it sorted out what you should learn about regular expressions and XPath in the early stage of learning crawlers. I believe it will help with the next stage of learning ~
The code for “120 Cases of Crawler” can be downloaded at: codechina.csdn.net/hihell/pyth… Could you give it a Star?
You made it all the way here. Won't you comment, like, or bookmark?
Today is day 199/200 of continuous writing. You can follow me, like, comment, and bookmark.