Public account: You and the cabin by: Peter Editor: Peter
Hello, I’m Peter
pandas provides two functions for extracting information from text: extract and extractall
The extract function
Syntax
The extract function takes only three arguments:
Series.str.extract(pat, flags=0, expand=True)
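The parameters are as follows:
- pat: a regular expression pattern containing at least one capture group
- flags: an integer combining flags from the re module, such as re.IGNORECASE; the default 0 means no flags
- expand: Boolean. If True, return a DataFrame with one column per capture group. If False, return a Series or Index when the pattern has one capture group, and a DataFrame when it has several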
Simulated data
Let’s look at a simple example from the official pandas documentation, which uses a small sample Series.
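The sample Series can be created as in the sketch below. The values are the ones used in the pandas documentation for str.extract, and they agree with the c3 entry discussed later; the variable name s is just a convenient choice.

```python
import pandas as pd

# Sample Series from the pandas documentation example for str.extract
s = pd.Series(["a1", "b2", "c3"])
print(s)
```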
Match 1
The first pattern, ([ab])(\d), contains two capture groups; each pair of parentheses () defines one group:
- [ab]: matches either the letter a or the letter b
- \d: matches a single digit
The results show two things:
- When the pattern has several groups, a row with no match is filled with NaN
- When the first group does not match, the whole pattern fails, so the second group is NaN as well
In c3, \d would match the digit 3, but [ab] does not match because c is neither a nor b, so the entire row is NaN
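A minimal sketch of this first match, assuming the sample Series a1, b2, c3 from the pandas documentation:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# Two capture groups: one letter from [ab], followed by one digit
result = s.str.extract(r"([ab])(\d)")
print(result)
#      0    1
# 0    a    1
# 1    b    2
# 2  NaN  NaN
```

The columns are numbered 0 and 1 because the groups are unnamed.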
Match 2
The only difference from the previous pattern is the question mark (?) added after the first group, and the result changes.
In a regular expression, the question mark (?) matches the preceding element 0 or 1 times. So in c3, [ab] matches zero times: the first group is NaN, but the pattern as a whole still matches and the digit 3 is extracted.
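The effect of the optional group can be sketched like this, again assuming the a1, b2, c3 sample Series:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# The first group is optional, so a row can match on the digit alone
result = s.str.extract(r"([ab])?(\d)")
print(result)
#      0  1
# 0    a  1
# 1    b  2
# 2  NaN  3
```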
Match 3
Capture groups can be named, and the names become the column names of the resulting DataFrame.
Names are specified with the named-group syntax (?P<name>...).
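A short sketch of named groups on the same sample Series. The group names letter and digit are illustrative; any valid Python identifier works.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# (?P<letter>...) and (?P<digit>...) name the groups; names become columns
result = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")
print(result)
#   letter digit
# 0      a     1
# 1      b     2
# 2    NaN   NaN
```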
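The expand parameter
The expand parameter controls the return type:
- expand=True: a DataFrame is always returned
- expand=False: a Series or Index is returned when the pattern has a single capture group; with several groups the result is still a DataFrame
Comparing the two settings on the same single-group pattern shows expand in action.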
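The difference can be checked with a single-group pattern. This is a small sketch on the a1, b2, c3 sample Series; the variable names as_frame and as_series are only for illustration.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])

# One capture group: expand decides between DataFrame and Series
as_frame = s.str.extract(r"[ab](\d)", expand=True)
as_series = s.str.extract(r"[ab](\d)", expand=False)

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```

Both results hold the same values (1, 2, NaN); only the container differs.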
The extractall function
Syntax
extract returns only the first match in each string; extractall returns every match
Series.str.extractall(pat, flags=0)
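The parameters are as follows:
- pat: a regular expression pattern containing at least one capture group
- flags: an integer combining flags from the re module; the default 0 means no flags
The return value is always a DataFrame, with one row per match. Its index is a MultiIndex whose last level, named match, numbers the matches within each string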
Simulated data
Here is a new set of sample data.
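The original data is not shown here, so the sketch below assumes the sample Series used in the pandas documentation for extractall. Its first string contains two matches, which is what makes the two functions behave differently.

```python
import pandas as pd

# Assumed sample data (taken from the pandas extractall documentation)
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
print(s)
```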
Here are three examples to compare the two functions:
Comparison 1
Matching with a single capture group
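A sketch of the single-group comparison, assuming the a1a2, b1, c1 sample Series from the pandas documentation:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
pattern = r"[ab](\d)"

# extract keeps only the first match per string: one row per input row
first = s.str.extract(pattern)
# extractall keeps every match: one row per match, indexed by (label, match)
all_matches = s.str.extractall(pattern)

print(first)
print(all_matches)
```

With extract, row A yields only the digit 1 and row C is NaN. With extractall, row A contributes two rows (digits 1 and 2) and row C does not appear at all.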
Comparison 2
Matching with multiple capture groups
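The same comparison with two capture groups might look like this (the sample Series is again the one assumed from the pandas documentation):

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
pattern = r"([ab])(\d)"

first = s.str.extract(pattern)           # first match only, two columns
all_matches = s.str.extractall(pattern)  # every match, two columns

print(first)
print(all_matches)
```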
Comparison 3
Matching with multiple capture groups, plus column names
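Named groups work the same way in both functions. In this sketch the group names letter and digit are illustrative, and the data is the assumed sample Series from the pandas documentation.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
pattern = r"(?P<letter>[ab])(?P<digit>\d)"

first = s.str.extract(pattern)
all_matches = s.str.extractall(pattern)

print(first)
print(all_matches)
```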
Practical cases
Here’s an example of how to use the extract function:
Simulated data
The name field contains both name and gender, and the address field contains both province and city:
import pandas as pd

df = pd.DataFrame({
    "name": ["Tom-male", "Peter-male", "Jimmy-female", "Mike-male", "John-female"],
    "address": ["Shenzhen, Guangdong Province", "Guangzhou, Guangdong Province", "Hangzhou, Zhejiang Province", "Nanjing, Jiangsu Province", "Changsha, Hunan Province"],
})
df
Extract the provinces
Quickly extract the province from address. In the pattern, .*? matches any sequence of characters, as short as possible (non-greedy).
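A rough sketch of the province extraction. The exact pattern used in the original is not shown, so this one is an assumption that relies on the English address format "City, X Province".

```python
import pandas as pd

df = pd.DataFrame({
    "address": ["Shenzhen, Guangdong Province", "Guangzhou, Guangdong Province",
                "Hangzhou, Zhejiang Province", "Nanjing, Jiangsu Province",
                "Changsha, Hunan Province"],
})

# Capture the text between ", " and " Province" (pattern is an assumption)
province = df["address"].str.extract(r", (.*?) Province")
print(province)
```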
Extract province + city
Extract the province and the city at the same time; column names can also be specified through named groups.
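Extracting both parts with named columns could look like the sketch below. The pattern and the column names city and province are assumptions based on the English address format "City, X Province".

```python
import pandas as pd

df = pd.DataFrame({
    "address": ["Shenzhen, Guangdong Province", "Guangzhou, Guangdong Province",
                "Hangzhou, Zhejiang Province", "Nanjing, Jiangsu Province",
                "Changsha, Hunan Province"],
})

# Two named groups: text before the comma, and text before " Province"
parts = df["address"].str.extract(r"(?P<city>.*?), (?P<province>.*?) Province")
print(parts)
```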
Extract the name + gender
Extract both the name and the gender from the name field: \w matches a single word character (a letter, digit, or underscore), and + repeats the preceding element one or more times.
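A sketch of the name and gender split, assuming every value uses a hyphen as the separator. The group names name and gender are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Tom-male", "Peter-male", "Jimmy-female", "Mike-male", "John-female"],
})

# \w+ on each side of the hyphen captures the name and the gender
parts = df["name"].str.extract(r"(?P<name>\w+)-(?P<gender>\w+)")
print(parts)
```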
Regular expression basics
Here’s a quick primer on regular expressions, adapted from the Google Analytics documentation:
Wildcards
Character | Meaning | Example |
---|---|---|
. | Matches any single character (letter, number, or symbol) | 1. matches 10, 1A; 1.1 matches 111, 1A1 |
? | Matches the preceding character 0 or 1 times | 10? matches 1, 10 |
+ | Matches the preceding character 1 or more times | 10+ matches 10, 100 |
* | Matches the preceding character 0 or more times | 10* matches 1, 10, 100 |
\| | Creates an OR match; do not use at the end of an expression | 1\|10 matches 1, 10 |
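The dot and the OR operator from the table can be tried with Python's re module. This is only a sketch: full_matches is a hypothetical helper written for this example, and it uses re.fullmatch so that the whole string has to match the pattern.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(full_matches(r"1.", ["10", "1A", "1", "100"]))  # ['10', '1A']
print(full_matches(r"1.1", ["111", "1A1", "11"]))     # ['111', '1A1']
print(full_matches(r"1|10", ["1", "10", "100"]))      # ['1', '10']
```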
Anchors
Character | Meaning | Example |
---|---|---|
^ | Matches adjacent characters at the beginning of the string | ^10 matches 10, 100, 10x; does not match 110, 110x |
$ | Matches adjacent characters at the end of the string | 10$ matches 110, 1010; does not match 100, 10x |
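The two anchors can be checked with re.search, which looks for the pattern anywhere in the string unless an anchor pins it down. A small sketch using the values from the table:

```python
import re

candidates = ["10", "100", "10x", "110", "110x", "1010"]

# ^10: the string must start with 10
starts_with_10 = [c for c in candidates if re.search(r"^10", c)]
# 10$: the string must end with 10
ends_with_10 = [c for c in candidates if re.search(r"10$", c)]

print(starts_with_10)  # ['10', '100', '10x', '1010']
print(ends_with_10)    # ['10', '110', '1010']
```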
Question mark (?)
The question mark (?) matches the preceding character 0 or 1 times. For example, 10? can match:
- 1: The 0 before the question mark matches 0 times
- 10: The 0 before the question mark matches once
Plus sign (+)
The plus sign (+) matches the preceding character 1 or more times. For example, 10+ can match:
- 10: the 0 matches once
- 100: the 0 matches twice
- 1000: the 0 matches three times
Asterisk (*)
The asterisk (*) matches the preceding character 0 or more times. For example, 10* can match:
- 1: the 0 matches 0 times
- 10: the 0 matches once
- 100: the 0 matches twice
- 1000: the 0 matches three times
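The three quantifiers can be compared side by side with Python's re module. In this sketch, full_matches is a hypothetical helper; it uses re.fullmatch so that the entire string must match the pattern.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["1", "10", "100", "1000"]

print(full_matches(r"10?", candidates))  # ['1', '10']
print(full_matches(r"10+", candidates))  # ['10', '100', '1000']
print(full_matches(r"10*", candidates))  # ['1', '10', '100', '1000']
```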
I will write a separate, detailed article on regular expressions based on Python’s re module.