Regular expressions in Java
- Regular expression syntax
- \
- ^
- $
- *
- +
- ?
- {n}
- {n,}
- {n,m}
- ?
- .
- (pattern)
- (?:pattern)
- (?=pattern)
- (?!pattern)
- (?<=pattern)
- (?<!pattern)
- x|y
- [xyz]
- [^xyz]
- [a-z]
- [^a-z]
- [:name:]
- [=elt=]
- [.elt.]
- \b
- \B
- \cx
- \d
- \D
- \f
- \n
- \r
- \s
- \S
- \t
- \v
- \w
- \W
- \xnn
- \num
- \n
- \nm
- \nml
- \un
- \p{P}
- \< \>
- ( )
- |
- Regular expression usage scenarios
- String substitution
- Matcher
- appendReplacement()
- appendTail()
- Example
- Character checking
Regular expression syntax
\
Marks the next character as one of the following:
- A special character
- A literal character, for any of the 12 metacharacters: ^ $ ( ) * + ? . [ \ { |
- A backreference
- An octal escape
Example:
- \n matches a newline character
- \\ matches a literal \
- \( matches a literal (
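In Java source code the backslash is also the string-literal escape character, so every regex backslash has to be doubled. The sketch below illustrates this; the class name EscapeDemo and the sample strings are illustrative assumptions, not part of the original notes.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The regex \( is written as the Java string literal "\\(".
        System.out.println(fullMatch("\\(", "("));          // true
        // The regex \\ (one literal backslash) needs four backslashes in source.
        System.out.println(fullMatch("\\\\", "\\"));        // true
        // Unescaped, + is a quantifier, so the literal text does not match.
        System.out.println(fullMatch("1+1=2", "1+1=2"));    // false
        // Pattern.quote turns arbitrary text into a literal pattern.
        System.out.println(fullMatch(Pattern.quote("1+1=2"), "1+1=2")); // true
    }
}
```

Pattern.quote is convenient when the text to match comes from user input and may contain metacharacters.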
^
- Matches the start of the input string
- With multiline mode enabled (Pattern.MULTILINE in Java), ^ also matches the position after \n or \r
$
- Matches the end of the input string
- With multiline mode enabled (Pattern.MULTILINE in Java), $ also matches the position before \n or \r
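A minimal sketch of how the two anchors behave with and without multiline mode. The class name AnchorDemo, the helper findAll, and the three-line sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Collects every match of the regex (compiled with the given flags) in the input.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // Default mode: ^ only matches at the start of the whole input.
        System.out.println(findAll("^\\w+", 0, text));                 // [one]
        // Multiline mode: ^ also matches after each line terminator.
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, text)); // [one, two, three]
        // Default mode: $ only matches at the end of the whole input.
        System.out.println(findAll("\\w+$", 0, text));                 // [three]
    }
}
```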
*
- Matches the preceding subexpression zero or more times
- Equivalent to {0,}
Example:
- zo* matches z, zo, or zoo
+
- Matches the preceding subexpression one or more times
- Equivalent to {1,}
Example:
- zo+ matches zo or zoo, but not z
?
- Matches the preceding subexpression zero times or once
- Equivalent to {0,1}
Example:
- do(es)? matches do or does
{n}
- Matches exactly n times. n is a non-negative integer
Example:
- o{2} does not match the o in Bob, but matches the two o's in food
{n,}
- Matches at least n times. n is a non-negative integer
- o{0,} is equivalent to o*
- o{1,} is equivalent to o+
Example:
- o{2,} does not match the o in Bob, but matches all the o's in looooog
{n,m}
- Matches at least n and at most m times. n and m are non-negative integers with n <= m
- Write the bounds without spaces around the comma
- o{0,1} is equivalent to o?
Example:
- o{1,3} matches the first three o's of loooooog
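The quantifiers above can be checked with a few full-string matches. This is a small sketch; the class name QuantifierDemo and the sample words are illustrative assumptions.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("zo*", "z"));          // true: zero o's are allowed
        System.out.println(fullMatch("zo+", "z"));          // false: at least one o is required
        System.out.println(fullMatch("do(es)?", "does"));   // true: the group is optional
        System.out.println(fullMatch("fo{2}d", "food"));    // true: exactly two o's
        System.out.println(fullMatch("fo{2,}d", "fooood")); // true: two or more o's
        System.out.println(fullMatch("fo{1,3}d", "fooood"));// false: four o's exceed the maximum
    }
}
```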
?
Non-greedy quantifier: when ? immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy
- A non-greedy pattern matches as little of the string as possible
- Regular expressions are greedy by default and match as much of the string as possible
Example:
- o+? matches a single o in loooooog
- o+ matches all the o's in loooooog
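To make the difference between greedy and non-greedy matching concrete, here is a short sketch. The class name GreedyDemo, the helper firstMatch, and the inputs are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("o+", "looog"));      // ooo (greedy: as many as possible)
        System.out.println(firstMatch("o+?", "looog"));     // o   (non-greedy: as few as possible)
        System.out.println(firstMatch("<.+>", "<b>x</b>")); // <b>x</b>
        System.out.println(firstMatch("<.+?>", "<b>x</b>"));// <b>
    }
}
```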
.
- Matches any single character except \r and \n
- To match any character including \r and \n, use a pattern such as (.|\r|\n)
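A sketch of the dot's behavior around line terminators. In Java, the Pattern.DOTALL flag is an alternative to the explicit alternation; the class name DotDemo and the inputs are illustrative assumptions.

```java
import java.util.regex.Pattern;

class DotDemo {
    // Returns true when the whole input matches the regex compiled with the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // By default the dot does not match a newline.
        System.out.println(fullMatch("a.b", 0, "a\nb"));              // false
        // The explicit alternation covers \r and \n as well.
        System.out.println(fullMatch("a(.|\\r|\\n)b", 0, "a\nb"));    // true
        // With DOTALL the dot matches line terminators too.
        System.out.println(fullMatch("a.b", Pattern.DOTALL, "a\nb")); // true
    }
}
```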
(pattern)
- Matches pattern and captures the matched substring for later backreference
- The captured substring can be retrieved from the match result; in Java, use Matcher.group(n)
(?:pattern)
- Matches pattern but does not capture the matched substring
- This is a non-capturing match: the matched substring is not stored for later backreference
- Useful for combining parts of a pattern with the alternation character |
Example:
- industr(?:y|ies) is a more concise form of industry|industries
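The difference between capturing and non-capturing groups shows up in the group count. A minimal sketch follows; the class name GroupDemo and the date-like sample input are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns capture group i of a full match, or null if the input does not match.
    static String group(String regex, String input, int i) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(i) : null;
    }

    // Number of capturing groups declared in the pattern.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(group("(\\d{4})-(\\d{2})", "2024-05", 1)); // 2024
        System.out.println(group("(\\d{4})-(\\d{2})", "2024-05", 2)); // 05
        System.out.println(groupCount("industr(y|ies)"));             // 1
        System.out.println(groupCount("industr(?:y|ies)"));           // 0
    }
}
```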
(?=pattern)
Positive lookahead: matches the search string at any position where the text that follows matches pattern
- This is a non-capturing match: the match is not stored for later use
- Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead
Example:
- Windows(?=95|98|NT|2000) matches Windows in Windows2000, but not Windows in Windows10
(?!pattern)
Negative lookahead: matches the search string at any position where the text that follows does not match pattern
- This is a non-capturing match: the match is not stored for later use
- Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead
Example:
- Windows(?!95|98|NT|2000) matches Windows in Windows10, but not Windows in Windows2000
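The two lookahead forms can be tried side by side. This is a sketch with an assumed class name LookaheadDemo; note that the returned match never includes the version suffix, because lookahead does not consume characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String find(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(find("Windows(?=95|98|NT|2000)", "Windows2000")); // Windows
        System.out.println(find("Windows(?=95|98|NT|2000)", "Windows10"));   // null
        System.out.println(find("Windows(?!95|98|NT|2000)", "Windows10"));   // Windows
        System.out.println(find("Windows(?!95|98|NT|2000)", "Windows2000")); // null
    }
}
```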
(?<=pattern)
Positive lookbehind: matches the search string at any position where the text that precedes matches pattern
Example:
- (?<=95|98|NT|2000)Windows matches Windows in 2000Windows, but not Windows in 10Windows
(?<!pattern)
Negative lookbehind: matches the search string at any position where the text that precedes does not match pattern
Example:
- (?
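Both lookbehind forms can be exercised the same way as lookahead. The sketch below uses an assumed class name LookbehindDemo and inputs chosen by analogy with the lookahead examples; they are illustrations, not taken from the original notes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String find(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(find("(?<=95|98|NT|2000)Windows", "2000Windows")); // Windows
        System.out.println(find("(?<=95|98|NT|2000)Windows", "10Windows"));   // null
        System.out.println(find("(?<!95|98|NT|2000)Windows", "10Windows"));   // Windows
        System.out.println(find("(?<!95|98|NT|2000)Windows", "2000Windows")); // null
    }
}
```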
x|y
Alternation: matches either x or y
- Without enclosing parentheses, the alternation spans the entire regular expression; inside parentheses, it applies only to the parenthesized part
Example:
- z|food matches z or food
- (z|f)oo matches zoo or foo
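The scope rule for alternation is easy to verify with full-string matches. A minimal sketch, with the class name AlternationDemo as an assumption:

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Without parentheses the alternatives are "z" and "food".
        System.out.println(fullMatch("z|food", "z"));    // true
        System.out.println(fullMatch("z|food", "zood")); // false
        // With parentheses only the first letter alternates.
        System.out.println(fullMatch("(z|f)oo", "zoo")); // true
        System.out.println(fullMatch("(z|f)oo", "foo")); // true
    }
}
```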
[xyz]
Character set: matches any one of the enclosed characters
- Only the backslash \ keeps its special meaning as the escape character; other symbols such as *, +, ( and ) are ordinary characters
- A caret ^ in first position negates the set; anywhere else it is an ordinary character
- A hyphen - between two characters denotes a range; at the beginning or end of the set it is an ordinary character
- In many engines, a closing bracket ] is also an ordinary character if it appears first
Example:
- [abc] matches the a in plain
[^xyz]
Negated character set: matches any character not listed
Example:
- [^abc] matches p, l, i, and n in plain
[a-z]
Character range: matches any character in the specified range
Example:
- [a-z] matches any lowercase letter from a to z
[^a-z]
Negated character range: matches any character that is not in the specified range
Example:
- [^a-z] matches any character outside the range a to z
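The four bracket forms above can be observed by deleting whatever they match. This is a sketch; the class name CharClassDemo, the helper strip, and the inputs are illustrative assumptions.

```java
class CharClassDemo {
    // Removes every character matched by the regex from the input.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("[abc]", "plain"));           // plin
        System.out.println(strip("[^abc]", "plain"));          // a
        System.out.println(strip("[^a-z]", "Hello World 42")); // elloorld
        // A leading hyphen and a non-leading caret are ordinary characters.
        System.out.println(strip("[-^]", "a-b^c"));            // abc
    }
}
```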
[:name:]
- Adds the characters of a named character class to an expression. Can only be used inside a bracket expression
[=elt=]
- Adds the characters that are equivalent to elt in the current locale. Can only be used inside a bracket expression
[.elt.]
- Adds the collating element elt to the expression. Can only be used inside a bracket expression
- This syntax exists because some collating elements consist of more than one character. For example, in the 29-letter Spanish alphabet, ch is a single letter that sorts after c, which gives the order cinco, credo, chispa
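These three bracket forms come from POSIX regular expressions. java.util.regex does not implement them; its documented counterparts for named classes are the \p{...} POSIX classes such as \p{Alpha}, \p{Digit}, \p{Upper}, and \p{Punct}. A small sketch, with the class name PosixClassDemo as an assumption:

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\p{Alpha}+", "abcXYZ")); // true: ASCII letters only
        System.out.println(fullMatch("\\p{Digit}+", "2024"));   // true
        System.out.println(fullMatch("\\p{Punct}", "!"));       // true
        System.out.println(fullMatch("\\p{Upper}", "a"));       // false
    }
}
```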
\b
Matches a word boundary, that is, the position between a word character and a non-word character such as a space
Example:
- er\b matches the er in never, but not the er in verb
\B
Matches a non-word boundary
Example:
- er\B matches the er in verb, but not the er in never
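The never/verb examples translate directly into Java. A minimal sketch, with the class name BoundaryDemo and the helper found as assumptions:

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Returns true when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("er\\b", "never")); // true: er ends the word
        System.out.println(found("er\\b", "verb"));  // false: er is followed by b
        System.out.println(found("er\\B", "verb"));  // true
        System.out.println(found("er\\B", "never")); // false
    }
}
```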
\cx
Matches the control character specified by x
- x must be a letter in the range A-Z or a-z; otherwise c is treated as a literal c character
- The value of the control character is the lowest 5 bits of the value of x (the remainder after dividing by 32)
Example:
- \cM matches Control-M, the carriage return
- \cA matches \u0001
- \cB matches \u0002
\d
- Matches a digit character
- Equivalent to [0-9]
\D
- Matches a non-digit character
- Equivalent to [^0-9]
\f
- Matches a form feed character
- Equivalent to \x0c and \cL
\n
- Matches a newline character
- Equivalent to \x0a and \cJ
\r
- Matches a carriage return
- Equivalent to \x0d and \cM
\s
- Matches any whitespace character
- Includes spaces, tabs, form feeds, and so on
- Equivalent to [ \f\n\r\t\x0B]
\S
- Matches any non-whitespace character
- Equivalent to [^ \f\n\r\t\x0B]
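\t
- Matches a tab character
- Equivalent to \x09 and \cI
\v
- Matches a vertical tab character
- Equivalent to \x0b and \cK
- In Java 8 and later, \v instead matches any vertical whitespace character; use \x0B to match only the vertical tab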
\w
- Matches any word character, including the underscore
- Equivalent to [A-Za-z0-9_]
\W
- Matches any non-word character
- Equivalent to [^A-Za-z0-9_]
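The shorthand classes are often used to filter text. The sketch below deletes whatever each class matches; the class name ShorthandDemo and the inputs are illustrative assumptions.

```java
class ShorthandDemo {
    // Removes every character matched by the regex from the input.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("\\D", "Order 66 shipped")); // 66
        System.out.println(strip("\\d", "Order 66"));         // "Order " (trailing space kept)
        System.out.println(strip("\\s", "a b\tc\n"));         // abc
        System.out.println(strip("\\W", "user_name-01!"));    // user_name01
    }
}
```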
\xnn
Hexadecimal escape: matches the character whose code is given by the two hexadecimal digits nn
- This allows ASCII codes to be used in regular expressions
- Exactly two digits are read: \x041 is equivalent to \x04 followed by the literal 1
Example:
- \x41 matches A
\num
A backreference to the substring matched by the num-th parenthesized subexpression
- num is a positive decimal integer starting from 1; the upper limit depends on the engine and may be 9, 31, 99, or unlimited
Example:
- (.)\1 matches two consecutive identical characters
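Backreferences are handy for spotting repetition. A minimal sketch follows; the class name BackrefDemo and the repeated-word pattern are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String find(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Two consecutive identical characters.
        System.out.println(find("(.)\\1", "hello"));                     // ll
        System.out.println(find("(.)\\1", "world"));                     // null
        // A word repeated after a single space.
        System.out.println(find("\\b(\\w+) \\1\\b", "it is is fine"));   // is is
    }
}
```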
\n
Identifies either an octal escape value or a backreference:
- If \n is preceded by at least n captured subexpressions, n is a backreference
- Otherwise, if n is an octal digit (0-7), n is an octal escape value
\nm
Identifies either an octal escape value or a backreference:
- If \nm is preceded by at least nm captured subexpressions, nm is a backreference
- If \nm is preceded by at least n captured subexpressions, n is a backreference followed by the literal m
- Otherwise, if n and m are both octal digits (0-7), \nm matches the octal escape value nm
\nml
- If n is an octal digit 0-3 and m and l are both octal digits 0-7, \nml matches the octal escape value nml
- Java writes octal escapes with a leading zero: \0n, \0nn, \0mnn
\un
Unicode escape sequence
- n is a Unicode character expressed as four hexadecimal digits
Example:
- \u00A9 matches the copyright symbol (©)
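The hexadecimal, octal, and Unicode escapes can all name the same kind of thing: a character by its code. A short sketch in Java, using the leading-zero form for the octal case; the class name CodeEscapeDemo is an assumption.

```java
import java.util.regex.Pattern;

class CodeEscapeDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hexadecimal 41 is the code of A.
        System.out.println(fullMatch("\\x41", "A"));        // true
        // Octal 101 is also 65, the code of A.
        System.out.println(fullMatch("\\0101", "A"));       // true
        // Four hexadecimal digits select the copyright symbol.
        System.out.println(fullMatch("\\u00A9", "\u00A9")); // true
    }
}
```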
\p{P}
A Unicode property escape
- The lowercase p stands for property and introduces a Unicode property
- The P in braces is the punctuation category, one of seven character categories in the Unicode character set. The other six are:
- L: letters
- M: marks, which do not usually appear alone
- Z: separators, such as spaces and line separators
- S: symbols, such as mathematical and currency symbols
- N: numbers, such as Arabic and Roman numerals
- C: other characters
Note: JavaScript supports this syntax only from ES2018 onward, and only with the u flag
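A sketch of the Unicode categories in Java, deleting whatever each category matches. The class name UnicodePropertyDemo and the inputs are illustrative assumptions.

```java
class UnicodePropertyDemo {
    // Removes every character matched by the regex from the input.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("\\p{P}", "Hello, world! (ok)")); // Hello world ok
        System.out.println(strip("\\p{L}", "abc123"));             // 123
        System.out.println(strip("\\p{N}", "abc123"));             // abc
        // + and = are math symbols, $ is a currency symbol.
        System.out.println(strip("\\p{S}", "1+1=$2"));             // 112
    }
}
```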
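\< \>
\< matches the beginning of a word and \> matches the end of a word
- Not every engine supports these; java.util.regex does not, and \b is the usual substitute there
Example:
- \<the matches the in the string for the wise, but not the in otherwise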
- \
( )
- Defines the expression between ( and ) as a group and saves the characters matched by it to a temporary area. Many engines keep up to nine such areas, referenced as \1 through \9
|
Performs a logical OR between two match conditions
Example:
- (him|her) matches it belongs to him and it belongs to her, but not it belongs to them
Regular expression usage scenarios
String substitution
- Example: converting a date format
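One way to convert a date format is to capture the parts and reorder them in the replacement string. The sketch below assumes an input of the form yyyy-MM-dd and an output of the form dd/MM/yyyy; both formats, the class name DateFormatDemo, and the method name are hypothetical choices, since the original notes do not give the formats.

```java
class DateFormatDemo {
    // Rewrites every yyyy-MM-dd date in the input as dd/MM/yyyy.
    // $1, $2 and $3 in the replacement refer to the three captured groups.
    static String toDayFirst(String input) {
        return input.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(toDayFirst("2024-05-17"));           // 17/05/2024
        System.out.println(toDayFirst("from 2023-01-02 on"));   // from 02/01/2023 on
    }
}
```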
Matcher
The Matcher class provides four methods for replacing matched substrings with a given string:
- replaceAll()
- replaceFirst()
- appendReplacement()
- appendTail()
The rest of this section focuses on appendReplacement() and appendTail()
appendReplacement()
appendReplacement(StringBuffer sb, String replacement):
- Replaces the current match with the given replacement string
- Appends to sb the text between the previous match and the current one, followed by the replacement
appendTail()
appendTail(StringBuffer sb):
- Appends the text remaining after the last match to sb
Example
Input string fatcatfatcatfat, pattern cat:
- After the first match, appendReplacement(sb, "dog") is called and sb contains fatdog: the cat in fatcat is replaced with dog and appended to sb together with the text before the match
- After the second match, appendReplacement(sb, "dog") is called again and sb contains fatdogfatdog
- Finally, appendTail(sb) is called and sb contains fatdogfatdogfat
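The walkthrough above corresponds to the usual find loop. The sketch below follows it step by step; the class name AppendDemo and the second method, which computes a different replacement for each match, are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendDemo {
    // Replaces every "cat" with "dog" using appendReplacement and appendTail.
    static String catsToDogs(String input) {
        Matcher m = Pattern.compile("cat").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Appends the text since the previous match, then the replacement.
            m.appendReplacement(sb, "dog");
        }
        // Appends whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    // Upper-cases every match; the replacement is computed per match.
    static String upperCaseMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps $ and \ in the match from being interpreted.
            m.appendReplacement(sb, Matcher.quoteReplacement(m.group().toUpperCase()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(catsToDogs("fatcatfatcatfat"));              // fatdogfatdogfat
        System.out.println(upperCaseMatches("cat", "fatcatfatcatfat")); // fatCATfatCATfat
    }
}
```

The loop form is mainly useful when the replacement depends on the match itself; for a fixed replacement, replaceAll() gives the same result in one call.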