Introduction

  • Regular expressions are commonly used to match, find, and replace strings. Don't underestimate them: used flexibly, in work or study, regular expressions can make string processing far more efficient and programming more enjoyable.
  • A long list of matching rules dumped all at once can be intimidating, so this article introduces regular expressions step by step, with practical cases along the way.

Learn regular expression matching from a simple example

  • Let's start with the code:
public class Demo1 {
    public static void main(String[] args) {
        // Does the string "abc" match the regular expression "..."? Each "." stands for one character,
        // so "..." stands for exactly three characters
        System.out.println("abc".matches("..."));

        System.out.println("abcd".matches("..."));
    }
}
// Output
true
false
  • The String class has a matches(String regex) method that returns a boolean telling whether the whole string matches the given regular expression.
  • In this example the regular expression is "...", in which each . stands for one character, so the whole expression means exactly three characters. Matching abc therefore gives true, and matching abcd gives false.
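  • To make the whole-string rule concrete, here is a minimal sketch. The class and helper names are made up for illustration and are not part of the original example. It also shows one detail the example does not cover: by default . does not match a line terminator.

```java
public class WholeStringMatchSketch {
    // Hypothetical helper: true only when the entire input is exactly three characters.
    public static boolean isExactlyThreeChars(String input) {
        return input.matches("...");
    }

    public static void main(String[] args) {
        System.out.println(isExactlyThreeChars("abc"));  // true
        System.out.println(isExactlyThreeChars("abcd")); // false: one character is left over
        System.out.println(isExactlyThreeChars("ab"));   // false: one character is missing
        System.out.println(isExactlyThreeChars("a\nc")); // false: by default "." does not match a line terminator
    }
}
```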

Regular expression support in Java (other languages have their own implementations)

  • The java.util.regex package contains two main classes for regular expressions: Matcher and Pattern. The official Java documentation gives the following typical usage of the two classes:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo2 {
    public static void main(String[] args) {
        // [a-z] means any character from a to z, and {3} means three of them:
        // the pattern matches a string of exactly three characters, each in the range a to z
        Pattern p = Pattern.compile("[a-z]{3}");
        Matcher m = p.matcher("abc");
        System.out.println(m.matches());
    }
}
// Output
true
  • Exploring the principles behind regular expressions leads into automata theory and other topics from compiler construction, which are not covered here. To keep things easy to follow, a more informal description is used instead.
  • Pattern can be thought of as a pattern that a string has to match. In Demo2, the pattern we define is: a string of length 3 in which every character is one of a to z.
  • Notice that the Pattern object is created by calling the compile method of the Pattern class; that is, the regular expression we pass in is compiled into a pattern object. A compiled pattern can be reused, which makes repeated matching much more efficient, and because it is immutable it is safe for multiple threads to use concurrently.
  • Matcher can be understood as the result of matching a pattern against a string. Matching a string against a pattern can produce many results, as later examples will show.
  • Finally, calling m.matches() returns whether the entire string matches the pattern.
  • The three lines of code above can be reduced to one:

    System.out.println("abc".matches("[a-z]{3}"));
  • However, if the same regular expression needs to be matched repeatedly, this form is inefficient, because the expression is compiled again on every call.
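  • One common way to avoid the repeated compilation is to keep the compiled Pattern in a constant. The sketch below assumes a hypothetical helper class; only Pattern.compile, matcher and matches come from the JDK.

```java
import java.util.regex.Pattern;

public class PrecompiledPatternSketch {
    // Compiled once and shared: Pattern instances are immutable and safe for concurrent use.
    private static final Pattern THREE_LOWERCASE = Pattern.compile("[a-z]{3}");

    public static boolean isThreeLowercase(String input) {
        // A Matcher is NOT thread-safe, so a fresh one is created for every call.
        return THREE_LOWERCASE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"abc", "abcd", "aBc", "xyz"};
        for (String input : inputs) {
            System.out.println(input + " -> " + isThreeLowercase(input));
        }
    }
}
```

  • The pattern is compiled a single time when the class is loaded, however many strings are checked afterwards.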

A first look at . + * ?

  • Before we begin: there is no need to memorize the meaning of every regular expression symbol. Each one is defined in detail in the class description of Pattern in the official Java documentation, and can also be looked up online. Of course, it is even better if you know them by heart.
public class Demo3 {
    /**
     * Wraps the print statement so that it does not have to be written out each time.
     * @param o the object to print
     */
    private static void p(Object o) {
        System.out.println(o);
    }

    /**
     * .       Any character (may or may not match line terminators)
     * X?      X, once or not at all
     * X*      X, zero or more times
     * X+      X, one or more times
     * X{n}    X, exactly n times
     * X{n,}   X, at least n times
     * X{n,m}  X, at least n but not more than m times
     * @param args unused
     */
    public static void main(String[] args) {
        p("a".matches("."));
        p("aa".matches("aa"));
        p("aaaa".matches("a*"));
        p("aaaa".matches("a+"));
        p("".matches("a*"));
        p("a".matches("a?"));

        // \d  A digit: [0-9]. In a Java string literal the backslash itself must be escaped, hence \\d
        p("2345".matches("\\d{2,5}"));
        // \\. matches a literal "."
        p("192.168.0.123".matches("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}"));
        // [0-2] must be one digit in the range 0 to 2
        p("192".matches("[0-2][0-9][0-9]"));
    }
}
// Output
// all true
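  • A note on the IP address expression above: \d{1,3} only checks the shape, so a string such as 999.1.1.1 also matches. The following sketch, with hypothetical helper names, pairs the same expression with a numeric range check. It is one possible approach under that assumption, not a complete IP address validator.

```java
import java.util.regex.Pattern;

public class Ipv4ShapeSketch {
    // Same expression as in the example above, compiled once.
    private static final Pattern IPV4_SHAPE =
            Pattern.compile("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");

    // Checks only the shape: four runs of one to three digits separated by dots.
    public static boolean looksLikeIpv4(String input) {
        return IPV4_SHAPE.matcher(input).matches();
    }

    // Hypothetical stricter check: additionally requires every part to be at most 255.
    public static boolean isIpv4InRange(String input) {
        if (!looksLikeIpv4(input)) {
            return false;
        }
        for (String part : input.split("\\.")) {
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIpv4("192.168.0.123")); // true
        System.out.println(looksLikeIpv4("999.1.1.1"));     // true: the shape is fine
        System.out.println(isIpv4InRange("999.1.1.1"));     // false: 999 is out of range
    }
}
```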

Ranges

  • [] describes a range of characters. Here are some examples:
public class Demo4 {
    private static void p(Object o){
        System.out.println(o);
    }

    public static void main(String[] args) {
        // [abc] means one of the characters a, b, c
        p("a".matches("[abc]"));
        // [^abc] means any character other than a, b, c
        p("1".matches("[^abc]"));
        // A character from a to z or from A to Z; all three forms below accept it.
        // Note that inside [] the | is a literal character, so the second form also matches "|"
        p("A".matches("[a-zA-Z]"));
        p("A".matches("[a-z|A-Z]"));
        p("A".matches("[a-z[A-Z]]"));
        // [A-Z&&[REQ]] means a character from A to Z that is also one of R, E, Q
        p("R".matches("[A-Z&&[REQ]]"));
    }
}
// Output: all true
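  • One detail about the three "or" forms in the example is worth checking: inside brackets, | is an ordinary character rather than an alternation operator, so [a-z|A-Z] accepts a little more than [a-zA-Z]. A small sketch with made-up helper names:

```java
public class CharClassPitfallSketch {
    // Plain union of the two ranges.
    public static boolean inUnionClass(String s) {
        return s.matches("[a-zA-Z]");
    }

    // Same ranges, but the | inside the brackets is itself a member of the class.
    public static boolean inPipeClass(String s) {
        return s.matches("[a-z|A-Z]");
    }

    public static void main(String[] args) {
        System.out.println(inUnionClass("A")); // true
        System.out.println(inPipeClass("A"));  // true
        System.out.println(inUnionClass("|")); // false
        System.out.println(inPipeClass("|"));  // true: the pipe is matched literally
    }
}
```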

Getting to know \s \w \d \

  • This section covers the regular expression notation for digits and letters, the characters most often used in programming.

About \

  • The hardest part to understand is \, so it gets special attention here. In a Java string, a special character has to be escaped by putting \ in front of it.
  • As an example, consider the string The teacher said loudly: "Students, hand in your homework!". Without an escape character, the opening double quote of the string literal would be closed right after said loudly:, yet we need double quotes inside the string itself, so they have to be escaped.
  • With the escape characters the string literal becomes "The teacher said loudly: \"Students, hand in your homework!\"", and our intended meaning is recognized correctly.
  • In the same way, to use \ itself in a string we put another \ in front of it, so it is written as "\\" in a string literal.
  • So how do we match \ in a regular expression? The answer is "\\\\".
  • Consider the two layers separately. The Java compiler turns each \\ in the string literal into a single \, so the regex engine receives the two characters \\. Since \ must also be escaped inside a regular expression, the engine reads the first \ as the escape character and the second as the \ being escaped, so together they match one literal \.
  • If this feels a little convoluted, look at the code below:
public class Demo5 {
    private static void p(Object o){
        System.out.println(o);
    }

    public static void main(String[] args) {
        /**
         * \d  A digit: [0-9]
         * \D  A non-digit: [^0-9]
         * \s  A whitespace character: [ \t\n\x0B\f\r]
         * \S  A non-whitespace character: [^\s]
         * \w  A word character: [a-zA-Z_0-9]
         * \W  A non-word character: [^\w]
         */
        // \\s{4} means four whitespace characters
        p(" \n\r\t".matches("\\s{4}"));
        // \\S means one non-whitespace character
        p("a".matches("\\S"));
        // \\w{3} means three word characters (letters, digits, or underscore)
        p("a_8".matches("\\w{3}"));
        // one to three lowercase letters, then digits, then one or more of % ^ & *
        p("abc888&^%".matches("[a-z]{1,3}\\d+[%^&*]+"));
        // match a literal backslash
        p("\\".matches("\\\\"));
    }
}
// Output: all true
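  • The two layers of escaping can be checked directly by looking at string lengths. The sketch below uses hypothetical helper names; Pattern.quote is a JDK method that builds a regular expression matching its argument literally, which is an alternative to escaping by hand.

```java
import java.util.regex.Pattern;

public class BackslashSketch {
    // True when the input is exactly one backslash character.
    public static boolean isSingleBackslash(String input) {
        return input.matches("\\\\");
    }

    // Pattern.quote() builds a regex that matches the given text literally,
    // so the backslashes in it need no extra regex-level escaping.
    public static boolean equalsLiterally(String input, String literal) {
        return input.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        System.out.println("\\".length());           // 1: this literal is a single backslash
        System.out.println("\\\\".length());         // 2: the regex engine receives two backslashes
        System.out.println(isSingleBackslash("\\")); // true
        System.out.println(equalsLiterally("C:\\temp", "C:\\temp")); // true
    }
}
```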

Boundary processing

  • Inside brackets, ^ means negation, as in [^abc]; outside brackets, ^ marks the beginning of the string.
public class Demo6 {
    private static void p(Object o){
        System.out.println(o);
    }

    public static void main(String[] args) {
        /**
         * ^   The beginning of a line
         * $   The end of a line
         * \b  A word boundary
         */
        p("hello sir".matches("^h.*"));                // true
        p("hello sir".matches(".*r$"));                // true
        p("hello sir".matches("^h[a-z]{1,3}o\\b.*"));  // true: a word boundary follows "hello"
        p("hellosir".matches("^h[a-z]{1,3}o\\b.*"));   // false: no word boundary after the "o"
    }
}
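  • Because matches() always tests the entire string, ^ and $ add little in the example above. Anchors matter more with find(), which is covered in the Matcher sections of this article and searches for the pattern anywhere in the input. A minimal sketch with made-up helper names:

```java
import java.util.regex.Pattern;

public class AnchorSketch {
    private static final Pattern STARTS_WITH_H = Pattern.compile("^h");
    private static final Pattern CONTAINS_H = Pattern.compile("h");
    private static final Pattern WORD_SIR = Pattern.compile("\\bsir\\b");

    // find() searches anywhere, so the ^ anchor is what restricts the match to the start.
    public static boolean startsWithH(String s) {
        return STARTS_WITH_H.matcher(s).find();
    }

    public static boolean containsH(String s) {
        return CONTAINS_H.matcher(s).find();
    }

    // \b on both sides: "sir" must stand alone as a word.
    public static boolean hasWordSir(String s) {
        return WORD_SIR.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithH("hello sir")); // true
        System.out.println(startsWithH("oh hello"));  // false
        System.out.println(containsH("oh hello"));    // true
        System.out.println(hasWordSir("hello sir"));  // true
        System.out.println(hasWordSir("hellosir"));   // false
    }
}
```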

Exercise: matching blank lines and email addresses

  • Given an article, how do you determine how many blank lines it contains? Regular expressions make this easy. Note that a blank line may still contain spaces, tabs, and so on.
p(" \n".matches("^[\\s&&[^\n]]*\\n$"));
  • Explanation: ^[\\s&&[^\n]]* means whitespace characters other than a newline, repeated zero or more times, and \\n$ means the line ends with a newline character.
  • Next, matching an email address:
p("[email protected]".matches("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+"));
  • Explanation: [\\w[.-]]+ matches one or more characters that are letters, digits, underscores, . or -; @ matches the @ sign itself; then comes [\\w[.-]]+ again; \\. matches a literal .; and finally [\\w]+ matches the last part.
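  • The blank-line question can be answered for a whole text by applying the expression line by line. The sketch below is one way to do it, assuming the text is already in memory as a String; the class and method names are hypothetical. Because readLine() strips the line terminator, the trailing \\n is dropped from the expression.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BlankLineCounterSketch {
    // Counts lines that are empty or contain only whitespace.
    public static int countBlankLines(String text) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // zero or more whitespace characters that are not a newline
                if (line.matches("[\\s&&[^\\n]]*")) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String article = "first line\n \t\nsecond line\n\nthird line";
        System.out.println(countBlankLines(article)); // 2
    }
}
```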

The Matcher class: matches(), find(), and lookingAt()

  • matches() matches the entire string against the pattern.
  • find() looks for the next match starting from the current position. If find() is the first method called after the string is passed in, the current position is the beginning of the string. The code example below examines the current position in detail.
  • lookingAt() always starts matching at the beginning of the string, but does not require the whole string to match.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo8 {
    private static void p(Object o){
        System.out.println(o);
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\d{3,5}");
        String s = "123-34345-234-00";
        Matcher m = pattern.matcher(s);

        // First, matches(): it tries to match the entire string
        p(m.matches());
        // false: the pattern wants 3 to 5 digits, and matching fails at the first -

        // Next, find(); reset() first puts the current position back at the beginning of the string
        m.reset();
        p(m.find()); // true, matched 123
        p(m.find()); // true, matched 34345
        p(m.find()); // true, matched 234
        p(m.find()); // false, 00 is too short to match

        // Now call matches() without a following reset() and see what happens to the current position
        m.reset(); // start over
        p(m.matches()); // false, the whole string does not match; the current position is left at the first -
        p(m.find()); // true, matched 34345
        p(m.find()); // true, matched 234
        p(m.find()); // false, 00 does not match
        p(m.find()); // false, nothing left to match

        // Finally, lookingAt(): it always starts from the beginning
        p(m.lookingAt()); // true, found 123
    }
}
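  • The "current position" makes results depend on earlier calls to the same Matcher. When only a yes/no answer is needed, creating a fresh Matcher per question sidesteps that state. A minimal sketch with hypothetical helper names:

```java
import java.util.regex.Pattern;

public class MatchModesSketch {
    private static final Pattern DIGITS = Pattern.compile("\\d{3,5}");

    // Whole input must be 3 to 5 digits.
    public static boolean wholeInput(String s) {
        return DIGITS.matcher(s).matches();
    }

    // Input must begin with 3 to 5 digits; anything may follow.
    public static boolean prefix(String s) {
        return DIGITS.matcher(s).lookingAt();
    }

    // 3 to 5 digits must occur somewhere in the input.
    public static boolean anywhere(String s) {
        return DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "123-34345-234-00";
        System.out.println(wholeInput(s)); // false
        System.out.println(prefix(s));     // true
        System.out.println(anywhere(s));   // true
        System.out.println(prefix("ab-1234"));   // false
        System.out.println(anywhere("ab-1234")); // true
    }
}
```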

The Matcher class: start() and end()

  • After a successful match, start() returns the index where the match begins, and end() returns the index just past the last matched character.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo9 {
    private static void p(Object o){
        System.out.println(o);
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\d{3,5}");
        String s = "123-34345-234-00";
        Matcher m = pattern.matcher(s);

        p(m.find()); // true, matched 123
        p("start: " + m.start() + " - end:" + m.end());
        p(m.find()); // true, matched 34345
        p("start: " + m.start() + " - end:" + m.end());
        p(m.find()); // true, matched 234
        p("start: " + m.start() + " - end:" + m.end());
        p(m.find()); // false, 00 does not match
        try {
            p("start: " + m.start() + " - end:" + m.end());
        } catch (Exception e) {
            // start() throws IllegalStateException when the previous match attempt failed
            System.out.println("An error was reported...");
        }
        p(m.lookingAt());
        p("start: " + m.start() + " - end:" + m.end());
    }
}
// Output
true
start: 0 - end:3
true
start: 4 - end:9
true
start: 10 - end:13
false
An error was reported...
true
start: 0 - end:3
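  • Since end() is exclusive, start() and end() plug straight into substring(). The following sketch, with a hypothetical helper, collects every match together with its span:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpansSketch {
    // Returns one "start-end:text" entry per match of regex in input.
    public static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // end() is exclusive, so substring(start, end) is exactly the matched text
            String matched = input.substring(m.start(), m.end());
            result.add(m.start() + "-" + m.end() + ":" + matched);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("\\d{3,5}", "123-34345-234-00"));
        // [0-3:123, 4-9:34345, 10-13:234]
    }
}
```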

Replacing strings

  • To replace a string, you first need to find the string to be replaced. This introduces a new method of the Matcher class, group(), which returns the matched string.
  • Let's look at an example that converts the string java to uppercase.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo10 {
    private static void p(Object o){
        System.out.println(o);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("java");
        Matcher m = p.matcher("java Java JAVA JAva I love Java and you");
        p(m.replaceAll("JAVA")); // replaceAll() replaces every matched substring
    }
}
// Output
JAVA Java JAVA JAva I love Java and you

Going further: case-insensitive search and replace

  • To make matching case-insensitive, specify this when creating the pattern:
public static void main(String[] args) {
    Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);// Specify case insensitive
    Matcher m = p.matcher("java Java JAVA JAva I love Java and you");
    p(m.replaceAll("JAVA"));
}
// Output the result
JAVA JAVA JAVA JAVA I love JAVA and you
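  • One thing to keep in mind with replaceAll(): in the replacement string, $ and \ have special meanings (group references and escapes). When the replacement text comes from outside the program, Matcher.quoteReplacement, a JDK method, makes it literal. The helper below is a hypothetical illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafeReplaceSketch {
    private static final Pattern JAVA_ANY_CASE =
            Pattern.compile("java", Pattern.CASE_INSENSITIVE);

    // Replaces every case-insensitive "java" with the given text, taken literally.
    public static String replaceJavaWith(String input, String literalReplacement) {
        String safe = Matcher.quoteReplacement(literalReplacement);
        return JAVA_ANY_CASE.matcher(input).replaceAll(safe);
    }

    public static void main(String[] args) {
        System.out.println(replaceJavaWith("I love Java", "C$"));  // I love C$
        System.out.println(replaceJavaWith("java JAVA", "a\\b"));  // a\b a\b
    }
}
```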

Another step up: case-insensitive replacement of specific matches

  • This example converts the odd-numbered matches to uppercase and the even-numbered matches to lowercase.
  • It introduces the Matcher method appendReplacement(StringBuffer sb, String replacement), which needs a StringBuffer into which the result string is assembled.
public static void main(String[] args) {
    Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher("java Java JAVA JAva I love Java and you ?");
    StringBuffer sb = new StringBuffer();
    int index = 1;
    while (m.find()) {
        // A shorter way to write it: m.appendReplacement(sb, (index++ & 1) == 0 ? "java" : "JAVA");
        if ((index & 1) == 0) { // even
            m.appendReplacement(sb, "java");
        } else {
            m.appendReplacement(sb, "JAVA");
        }
        index++;
    }
    m.appendTail(sb); // append the rest of the string
    p(sb);
}
// Output
JAVA java JAVA java I love JAVA and you ?
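  • On Java 9 or later, the same alternating replacement can be written without the explicit loop, using the Matcher.replaceAll overload that takes a function. This is a sketch under that version assumption, with a made-up class and method name; on older JDKs the appendReplacement loop above remains the way to go.

```java
import java.util.regex.Pattern;

public class AlternatingCaseSketch {
    private static final Pattern JAVA_ANY_CASE =
            Pattern.compile("java", Pattern.CASE_INSENSITIVE);

    // Odd-numbered matches become "JAVA", even-numbered ones "java".
    public static String alternateCase(String input) {
        int[] seen = {0}; // a one-element array so the lambda can update the counter
        return JAVA_ANY_CASE.matcher(input)
                .replaceAll(match -> (++seen[0] % 2 == 1) ? "JAVA" : "java");
    }

    public static void main(String[] args) {
        System.out.println(alternateCase("java Java JAVA JAva I love Java and you ?"));
        // JAVA java JAVA java I love JAVA and you ?
    }
}
```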

Grouping

  • Let's start from a problem. Look at this code:
public static void main(String[] args) {
    Pattern p = Pattern.compile("\\d{3,5}[a-z]{2}");
    String s = "123aa-5423zx-642oi-00";
    Matcher m = p.matcher(s);
    while (m.find()) {
        p(m.group());
    }
}
// Output
123aa
5423zx
642oi
  • The regular expression "\\d{3,5}[a-z]{2}" means three to five digits followed by two letters; each matched string is printed.
  • What if we want to print only the digits in each matched string?
  • The first idea might be to run another match on each matched string, but that is cumbersome. The grouping mechanism lets us define groups inside the regular expression instead.
  • Groups are written with (). Here we put the digits and the letters into separate groups: "(\\d{3,5})([a-z]{2})"
  • Then pass the group number to m.group(int group).
  • Note that group numbers start at 0, and group 0 stands for the entire regular expression. After 0, each opening parenthesis, counted from left to right, corresponds to one group. In this expression, group 1 is the digits and group 2 is the letters.
public static void main(String[] args) {
    // three to five digits followed by two letters, each part in its own group
    Pattern p = Pattern.compile("(\\d{3,5})([a-z]{2})");
    String s = "123aa-5423zx-642oi-00";
    Matcher m = p.matcher(s);
    while (m.find()) {
        p(m.group(1));
    }
}
// Output
123
5423
642
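  • Counting parentheses gets error-prone as expressions grow. Java also supports named groups, written (?<name>...), which can be read back with m.group("name"). The sketch below applies that to the same expression; the group names and the helper are chosen here purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupSketch {
    // (?<name>...) gives a group a name in addition to its number.
    private static final Pattern CODE =
            Pattern.compile("(?<digits>\\d{3,5})(?<letters>[a-z]{2})");

    // Returns "digits/letters" for every match in the input.
    public static List<String> splitCodes(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = CODE.matcher(input);
        while (m.find()) {
            // m.group("digits") returns the same text as m.group(1)
            result.add(m.group("digits") + "/" + m.group("letters"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitCodes("123aa-5423zx-642oi-00"));
        // [123/aa, 5423/zx, 642/oi]
    }
}
```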

Practice 1: extracting email addresses from a web page (crawler)

  • Suppose we have some high-quality resources on hand and want to share them, so we post on a forum asking people to leave their email addresses. The response is unexpectedly enthusiastic and nearly a hundred addresses are left. Copying them and sending mail one by one would be tedious, so let's do it programmatically.
  • Sending the mail is not covered here; the focus is on using the regular expressions we have learned to extract all the email addresses from a web page.
  • First get the HTML of a post: pick any one, open it, and use the browser's right-click menu to save the HTML file.
  • Then look at the code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo12 {
    public static void main(String[] args) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("C:\\emailTest.html"))) {
            String line;
            while ((line = br.readLine()) != null) { // read the file line by line
                parse(line);
            }
        } catch (IOException e) { // also covers FileNotFoundException
            e.printStackTrace();
        }
    }

    private static void parse(String line) {
        Pattern p = Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");
        Matcher m = p.matcher(line);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
// Output
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
...
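  • A page often repeats the same address several times. The variant below is a sketch that works on a String instead of a file and collects the matches into a LinkedHashSet to drop duplicates while keeping first-seen order. The class name, the method name, and the example.com and example.org addresses are placeholders invented for this illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractorSketch {
    // Same expression as in the example above, compiled once.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w[.-]]+@[\\w[.-]]+\\.[\\w]+");

    // Returns each distinct address found in the text, in order of first appearance.
    public static Set<String> extractEmails(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Mail alice@example.com please.</p>"
                + "<div>bob.smith@mail.example.org</div>"
                + "<p>alice@example.com again</p>";
        System.out.println(extractEmails(html));
        // [alice@example.com, bob.smith@mail.example.org]
    }
}
```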

Practice 2: a small code-statistics program

  • One final practical example: count how many lines of code, comment lines, and blank lines a project contains. You might try it on your own projects and discover that, without noticing, you have already written thousands of lines of code…
  • For these statistics I picked a small project from GitHub written purely in Java.
  • Here is the code. Only the blank-line check uses a regular expression; code lines and comment lines are detected with the String API.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class Demo13 {
    private static long codeLines = 0;
    private static long commentLines = 0;
    private static long whiteLines = 0;
    private static String filePath = "C:\\TankOnline";

    public static void main(String[] args) {
        process(filePath);
        System.out.println("codeLines : " + codeLines);
        System.out.println("commentLines : " + commentLines);
        System.out.println("whiteLines : " + whiteLines);
    }

    /**
     * Recursively walks the given path and parses every .java file found.
     * @param pathStr the file or directory to process
     */
    public static void process(String pathStr) {
        File file = new File(pathStr);
        if (file.isDirectory()) { // a directory: recurse into its entries
            File[] fileList = file.listFiles();
            if (fileList == null) { // listFiles() returns null on an I/O error
                return;
            }
            for (File f : fileList) {
                process(f.getAbsolutePath());
            }
        } else if (file.isFile()) { // a file: check whether it is a .java file
            if (file.getName().matches(".*\\.java$")) {
                parse(file);
            }
        }
    }

    private static void parse(File file) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim(); // strip whitespace at both ends of the line
                if (line.matches("^[\\s&&[^\\n]]*$")) {
                    // no trailing \n in the pattern, because br.readLine() already removes it
                    whiteLines++;
                } else if (line.startsWith("/*") || line.startsWith("*") || line.endsWith("*/")) {
                    commentLines++;
                } else if (line.contains("//")) { // also covers lines that start with //
                    commentLines++;
                } else {
                    if (line.startsWith("import") || line.startsWith("package")) {
                        continue; // import and package lines are not counted
                    }
                    codeLines++;
                }
            }
        } catch (IOException e) { // also covers FileNotFoundException
            e.printStackTrace();
        }
    }
}
// Output
codeLines : 1139
commentLines : 124
whiteLines : 172
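  • The per-line rules are easier to check when they are pulled out into a function with no file access. The sketch below restates the same rules as a hypothetical classify method returning an enum; it is a simplified restatement for experimenting, not a replacement for the program above.

```java
public class LineClassifierSketch {
    public enum Kind { BLANK, COMMENT, IGNORED, CODE }

    // Applies the same rules, in the same order, as the parse() method above.
    public static Kind classify(String rawLine) {
        String line = rawLine.trim();
        if (line.matches("[\\s&&[^\\n]]*")) {
            return Kind.BLANK; // after trim() only the empty string can get here
        }
        if (line.startsWith("/*") || line.startsWith("*") || line.endsWith("*/")) {
            return Kind.COMMENT;
        }
        if (line.contains("//")) {
            return Kind.COMMENT; // includes code lines with a trailing comment
        }
        if (line.startsWith("import") || line.startsWith("package")) {
            return Kind.IGNORED;
        }
        return Kind.CODE;
    }

    public static void main(String[] args) {
        System.out.println(classify("   \t"));                   // BLANK
        System.out.println(classify(" * part of a javadoc"));    // COMMENT
        System.out.println(classify("import java.util.List;"));  // IGNORED
        System.out.println(classify("int x = 1;"));              // CODE
        System.out.println(classify("String u = \"http://example.com\";")); // COMMENT
    }
}
```

  • The last call shows a limitation of the contains("//") rule: a string literal that happens to contain // is counted as a comment line.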

Greedy and non-greedy modes

  • After the two practical examples you should have a grip on the basic use of regular expressions. Now for greedy and non-greedy matching. The official API documentation of the Pattern class contains the following definitions:
Greedy quantifiers
X?        X, once or not at all
X*        X, zero or more times
X+        X, one or more times
X{n}      X, exactly n times
X{n,}     X, at least n times
X{n,m}    X, at least n but not more than m times

Reluctant quantifiers
X??       X, once or not at all
X*?       X, zero or more times
X+?       X, one or more times
X{n}?     X, exactly n times
X{n,}?    X, at least n times
X{n,m}?   X, at least n but not more than m times

Possessive quantifiers
X?+       X, once or not at all
X*+       X, zero or more times
X++       X, one or more times
X{n}+     X, exactly n times
X{n,}+    X, at least n times
X{n,m}+   X, at least n but not more than m times
  • The three kinds of quantifier accept the same repetition counts; they differ in how they go about matching. Everything so far has used greedy mode. So how do the other two differ? The following code examples illustrate.
public static void main(String[] args) {
    Pattern p = Pattern.compile(".{3,10}[0-9]");
    String s = "aaaa5bbbb6";
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println(m.start() + "-" + m.end());
    } else {
        System.out.println("not match!");
    }
}
// Output
0-10
  • The regular expression means 3 to 10 characters followed by a digit. In greedy mode, the engine first swallows 10 characters and then checks whether a digit follows; there are no characters left, so it gives one character back and tries the digit again. This time the match succeeds, giving 0-10.
  • Next, non-greedy mode (reluctant):
public static void main(String[] args) {
    Pattern p = Pattern.compile(".{3,10}?[0-9]"); // a ? has been added
    String s = "aaaa5bbbb6";
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println(m.start() + "-" + m.end());
    } else {
        System.out.println("not match!");
    }
}
// Output
0-5
  • In non-greedy mode the engine first swallows only 3 characters (the minimum) and checks whether the next one is a digit. It is not, so the engine swallows one more character and checks again. This time the next character is a digit, so the result is 0-5.
  • Finally, possessive mode. It is usually used only when chasing efficiency, and is seen less often.
public static void main(String[] args) {
    Pattern p = Pattern.compile(".{3,10}+[0-9]"); // a + has been added
    String s = "aaaa5bbbb6";
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println(m.start() + "-" + m.end());
    } else {
        System.out.println("not match!");
    }
}
// Output
not match!
  • Possessive mode swallows 10 characters at once and checks whether a digit follows; whether or not that succeeds, it never swallows or gives back any more characters.
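  • The difference between the three modes shows up clearly when extracting delimited text. The sketch below runs the three variants of .* against a small made-up HTML fragment; firstMatch is a hypothetical helper that returns the first match or null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModesSketch {
    // Returns the first match of regex in input, or null when there is none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>one</b> and <b>two</b>";
        // Greedy: .* runs to the end and backs off to the LAST </b>
        System.out.println(firstMatch("<b>.*</b>", html));  // <b>one</b> and <b>two</b>
        // Reluctant: .*? grows one character at a time and stops at the FIRST </b>
        System.out.println(firstMatch("<b>.*?</b>", html)); // <b>one</b>
        // Possessive: .*+ runs to the end and never gives anything back
        System.out.println(firstMatch("<b>.*+</b>", html)); // null
    }
}
```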

Conclusion

  • May regular expressions make your programming experience more enjoyable.