One, Introduction

What is a regular expression?

When writing programs or web pages that work with strings, there is often a need to find strings that conform to some complex rules. Regular expressions are a tool for describing these rules. In other words, regular expressions are code that records rules for text.

Chances are you’ve used the Windows/DOS wildcards for file lookups, * and ?. If you want to find all Word documents in a directory, you search for *.doc. Here, * is interpreted as an arbitrary string. Like wildcards, regular expressions are a tool for matching text, but they can describe your needs far more precisely than wildcards can. The price, of course, is greater complexity. For example, you can write a regular expression that finds all strings that begin with 0, followed by two or three digits, followed by a hyphen “-”, and ending with 7 or 8 digits (like 010-12345678 or 0376-7654321).
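To make the comparison concrete, here is a minimal sketch in Python (the language is chosen for illustration; the article itself is tool-neutral). It uses the standard fnmatch module for the wildcard and the re module for the phone-number rule. The file names are hypothetical, and the {2,3} and {7,8} range syntax is one plausible way to write the rule described above, not something introduced so far.

```python
import fnmatch
import re

# Hypothetical file names, used only for illustration.
names = ["report.doc", "notes.txt", "thesis.doc"]

# Wildcard: * stands for an arbitrary string.
word_docs = fnmatch.filter(names, "*.doc")

# Regular expression for the rule in the text: a 0, then two or three
# digits, a hyphen, and finally 7 or 8 digits.
phone = re.compile(r"0\d{2,3}-\d{7,8}")

samples = ["010-12345678", "0376-7654321", "12-345"]
matches = [s for s in samples if phone.fullmatch(s)]

print(word_docs)  # ['report.doc', 'thesis.doc']
print(matches)    # ['010-12345678', '0376-7654321']
```

The wildcard can only say "anything, then .doc", while the regular expression pins down which characters may appear and how many of them.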

Text format conventions: this article distinguishes metacharacters/syntax, regular expressions, parts of a regular expression (shown for analysis), the source strings being matched, and descriptions of a regular expression or part of one.

Characters are the most basic units in which computer software processes text. They may be letters, digits, punctuation marks, spaces, line breaks, Chinese characters, and so on. A string is a sequence of 0 or more characters. Text is simply written words, that is, a string. Saying that a string matches a regular expression usually means that some part of the string (or several separate parts) meets the conditions given by the expression.

Two, Getting Started with Examples

The best way to learn regular expressions is to start with examples, understand them, and then modify and experiment with them yourself. A number of simple examples are given below, and they are explained in detail.

If you are looking for hi in an English novel, you can use the regular expression hi.

This is about as simple as a regular expression gets. It matches exactly one kind of string: two characters, h followed by i. Tools that process regular expressions typically provide an option to ignore case; if it is selected, the expression can match any of the four variants hi, HI, Hi, and hI.
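As a small illustration, assuming Python's re module as the tool, the ignore-case option corresponds to the re.IGNORECASE flag. The sample text is made up for this sketch.

```python
import re

# Made-up sample containing all four case variants.
text = "Hi, HI, hi and hI"

# Without any option, only the exact lowercase form matches.
exact = re.findall(r"hi", text)

# With the ignore-case option, all four variants match.
any_case = re.findall(r"hi", text, flags=re.IGNORECASE)

print(exact)     # ['hi']
print(any_case)  # ['Hi', 'HI', 'hi', 'hI']
```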

Unfortunately, many words contain the two consecutive characters hi: him, history, high, and so on. If you search for hi, you’ll also find the hi inside these words. To find the word hi exactly, we should use \bhi\b.

\b is a special code defined by regular expressions (some people call it a metacharacter) that represents the beginning or end of a word, that is, a word boundary. Although words in English are usually separated by spaces, punctuation marks, or line breaks, \b does not match any of these separator characters; it matches only a position.
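The difference between the two patterns can be sketched as follows, again assuming Python's re module. The sample sentence is invented for the example.

```python
import re

# Invented sample: the word "hi" appears twice on its own, and three
# more times inside longer words.
text = "hi, him, history, high: say hi"

# Plain hi also finds the hi inside him, history and high.
loose = re.findall(r"hi", text)

# \bhi\b only finds hi when it stands alone as a word.
strict = re.findall(r"\bhi\b", text)

# \b consumes no characters: each match is still just two characters long.
spans = [m.span() for m in re.finditer(r"\bhi\b", text)]

print(len(loose))   # 5
print(len(strict))  # 2
print(spans)        # [(0, 2), (28, 30)]
```

The comma after the first hi is not part of the match, which shows that the boundary is a position rather than a character.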

If you are looking for hi followed, not far behind, by Lucy, you should use \bhi\b.*\bLucy\b.

Here, . is another metacharacter; it matches any character except a newline. * is also a metacharacter, but instead of representing a character or a position, it represents a quantity: it specifies that the content before * may be repeated any number of times in a row (including zero) for the entire expression to match. Therefore, .* together means any number of characters, none of which is a newline. Now it is clear what \bhi\b.*\bLucy\b means: first the word hi, then any number of arbitrary characters (but not newlines), and finally the word Lucy.
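A short sketch of this behavior, assuming Python's re module and made-up sample strings. The re.DOTALL flag in the last step is specific to Python and goes beyond what the text covers; it is included only to show that the newline restriction is what blocks the second match.

```python
import re

pattern = re.compile(r"\bhi\b.*\bLucy\b")

same_line = "hi, my name is Lucy"
split_lines = "hi there\nLucy is here"

# Both words are on one line, so .* can bridge the gap between them.
m1 = pattern.search(same_line)

# A newline sits between the words, and . does not match it.
m2 = pattern.search(split_lines)

# Python-specific: with re.DOTALL, . matches newlines as well.
m3 = re.search(r"\bhi\b.*\bLucy\b", split_lines, flags=re.DOTALL)

print(m1.group())      # hi, my name is Lucy
print(m2)              # None
print(m3 is not None)  # True
```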

More precisely, \b matches a position where the preceding and following characters are not both \w (a word character such as a letter, digit, or underscore): one of them is \w and the other is not, or does not exist.
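The precise definition can be checked by hand. The sketch below, assuming Python's re module, compares the positions where \b matches with the positions found by a direct implementation of the rule; is_word_char and is_boundary are hypothetical helper names written for this example.

```python
import re

def is_word_char(ch):
    """True if the single character ch is matched by \\w."""
    return re.fullmatch(r"\w", ch) is not None

def is_boundary(text, i):
    """Apply the rule directly: exactly one neighbor of position i is \\w."""
    before = i > 0 and is_word_char(text[i - 1])
    after = i < len(text) and is_word_char(text[i])
    return before != after

text = "hi, Lucy"

# Positions where the regex engine reports a word boundary.
by_regex = [m.start() for m in re.finditer(r"\b", text)]

# Positions found by checking the neighbors of every position ourselves.
by_hand = [i for i in range(len(text) + 1) if is_boundary(text, i)]

print(by_regex)  # [0, 2, 4, 8]
print(by_hand)   # [0, 2, 4, 8]
```

Positions 0 and 8 are the start and end of the string, where one neighbor does not exist at all.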

If we use other metacharacters together, we can construct more powerful regular expressions. Take this example:

0\d\d-\d\d\d\d\d\d\d\d matches a string that begins with a 0, followed by two digits, then a hyphen “-”, and finally eight digits (that is, a Chinese telephone number). Of course, this example only matches numbers with a 3-digit area code.

Here \d is a new metacharacter that matches one digit (0, or 1, or 2, or …). The - is not a metacharacter; it simply matches itself: a hyphen (or a minus sign, or a dash, or whatever you want to call it).
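For illustration, the pattern with a 0, two digits, a hyphen, and eight digits can be tried out like this, assuming Python's re module. The sample text and its numbers are invented.

```python
import re

# Invented sample: one well-formed number, one with only seven digits
# after the hyphen, and one with a letter where a digit is required.
text = "Office: 010-12345678, fax: 021-8765432, code: 0x0-12345678"

# A 0, two digits, a literal hyphen, then eight digits.
numbers = re.findall(r"0\d\d-\d\d\d\d\d\d\d\d", text)

print(numbers)  # ['010-12345678']
```

Each \d accounts for exactly one digit, so a number that is one digit short, or that has a non-digit in a digit position, is not found.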

The newline character is ‘\n’, the character with ASCII code 10 (hex 0x0A).

To avoid so many annoying repetitions, we can also write this expression as 0\d{2}-\d{8}. Here the {2} ({8}) after \d means that the preceding \d must be matched exactly 2 times (8 times) in a row.
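As a quick check that the shorthand means the same thing as the spelled-out form, the following sketch (assuming Python's re module, with invented sample strings) runs both patterns over the same inputs.

```python
import re

# Spelled-out form and the equivalent form using {n} repetition counts.
long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

# Invented samples: only the first and last contain a 3-digit area code
# followed by eight digits.
samples = [
    "010-12345678",
    "0376-7654321",
    "010-1234567",
    "call 021-87654321 now",
]

found_long = [s for s in samples if long_form.search(s)]
found_short = [s for s in samples if short_form.search(s)]

print(found_long == found_short)  # True
print(found_short)  # ['010-12345678', 'call 021-87654321 now']
```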

OK, that’s it for now. You now have a basic understanding of regular expressions; subsequent articles will give you a deeper understanding of how to use them.