Pandas text processing: the advanced functions extract and extractall

Public account: You and the cabin | Author: Peter | Editor: Peter

Hello, I’m Peter

Pandas provides two functions for extracting information from text: extract and extractall.

The extract function

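Syntax

The extract function takes only three arguments:

Series.str.extract(pat, flags=0, expand=True)

The parameters are interpreted as follows:

pat: a regular expression pattern containing at least one capture group
flags: integer; flags from the re module, such as re.IGNORECASE (default 0, meaning no flags)
expand: Boolean; True returns a DataFrame with one column per capture group, False returns a Series or Index when the pattern has a single group. Recent pandas releases default to True; older releases listed the default as None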

Sample data

Let's look at a simple example from the official documentation, which works on a small simulated Series.
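The original listing is not reproduced here. As a minimal sketch, assuming the three-element Series used in the pandas documentation example (the later discussion of c3 is consistent with it), the sample data could be built like this:

```python
import pandas as pd

# Assumed sample data: the Series from the pandas documentation example
s = pd.Series(["a1", "b2", "c3"])
print(s)
```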

Match 1

In this example, the pattern contains two capture groups; each pair of parentheses () forms one group:

[ab]: matches either letter, a or b
\d: matches a single digit

The result shows two things:

When the pattern contains several groups and an element does not match, NaN is returned for every group
The pattern must match as a whole: if the first group fails to match, the second group's match is discarded as well

For c3, \d can match the digit 3, but [ab] cannot match c, which is neither a nor b, so the entire row is NaN
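A minimal sketch of this match, assuming the sample Series a1, b2, c3 from the pandas documentation:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# Two capture groups: one letter from [ab], followed by one digit
result = s.str.extract(r"([ab])(\d)")
print(result)
#      0    1
# 0    a    1
# 1    b    2
# 2  NaN  NaN
```

The columns are numbered 0 and 1 because the groups are unnamed.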

Match 2

The only difference from the previous pattern is an added question mark ?, yet the result changes

In a regular expression, the question mark ? matches the preceding element zero or one time. For c3, [ab] therefore matches zero times: the first group is NaN, but the pattern as a whole still matches and the second group captures 3
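As a sketch under the same assumption about the sample Series, making the first group optional looks like this:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# The ? after the first group makes that group optional
result = s.str.extract(r"([ab])?(\d)")
print(result)
#      0  1
# 0    a  1
# 1    b  2
# 2  NaN  3
```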

Match 3

Column names for the resulting DataFrame can be specified in the pattern itself.

Column names are given with the named-group syntax (?P<name>...)
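A short sketch of named groups on the assumed sample Series; the group names letter and digit are illustrative choices, not taken from the original listing:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# Each (?P<name>...) group becomes a column with that name
result = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")
print(result.columns.tolist())  # ['letter', 'digit']
print(result)
```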

The expand parameter

The expand parameter controls the return type:

expand=True: a DataFrame is returned
expand=False: a Series or Index is returned when the pattern has a single group

Comparing the two settings on the same pattern shows expand in action.
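The comparison can be sketched as follows, again assuming the sample Series a1, b2, c3 and a pattern with a single capture group:

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"])  # assumed sample data

# Same single-group pattern, two different return types
as_frame = s.str.extract(r"[ab](\d)", expand=True)
as_series = s.str.extract(r"[ab](\d)", expand=False)

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```

With more than one capture group, a DataFrame is returned regardless of expand, since a Series cannot hold several columns.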

The extractall function

Syntax

extract returns only the first match in each string; extractall returns every match

Series.str.extractall(pat, flags=0)

The parameters are interpreted as follows:

pat: a regular expression pattern containing at least one capture group
flags: integer; flags from the re module (default 0, meaning no flags)

The return value is always a DataFrame with one row per match; the last level of its MultiIndex is named match and numbers the matches within each string

Sample data

A new simulated Series is used for this part.
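The original listing is missing. A minimal sketch, assuming the Series from the pandas documentation example for extractall, in which the first element contains two matches:

```python
import pandas as pd

# Assumed sample data: "a1a2" contains two letter-digit pairs
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
print(s)
```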

The following three examples compare the two functions.

Comparison 1

Matching with a single group:
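A sketch of the single-group comparison, assuming the sample Series a1a2, b1, c1 indexed by A, B, C:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])  # assumed sample data

first = s.str.extract(r"[ab](\d)")     # first match per element only
every = s.str.extractall(r"[ab](\d)")  # one row per match

print(first)  # A -> 1, B -> 1, C -> NaN
print(every)  # rows (A, 0) -> 1, (A, 1) -> 2, (B, 0) -> 1; C has no row
```

Note that extract keeps a row for C filled with NaN, while extractall simply omits elements without a match.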

Comparison 2

Matching with multiple groups:
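The same comparison with two capture groups, sketched on the assumed sample Series:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])  # assumed sample data

first = s.str.extract(r"([ab])(\d)")     # one row per element, two columns
every = s.str.extractall(r"([ab])(\d)")  # one row per match, two columns

print(first)  # A -> (a, 1), B -> (b, 1), C -> (NaN, NaN)
print(every)  # (A, 0) -> (a, 1), (A, 1) -> (a, 2), (B, 0) -> (b, 1)
```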

Comparison 3

Matching with multiple groups, plus column names:
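With named groups, both functions label their columns. A sketch on the assumed sample Series; the names letter and digit are illustrative:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])  # assumed sample data
pattern = r"(?P<letter>[ab])(?P<digit>\d)"

first = s.str.extract(pattern)
every = s.str.extractall(pattern)

print(first.columns.tolist())  # ['letter', 'digit']
print(every.columns.tolist())  # ['letter', 'digit']
print(every)
```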

Practical cases

Here’s an example of how to use the extract function:

Simulated data

The name field contains both name and gender, and the address field contains both province and city:

import pandas as pd

# Each name holds "name-gender"; each address holds "city, province"
df = pd.DataFrame({
    "name": ["Tom-male", "Peter-male", "Jimmy-female", "Mike-male", "John-female"],
    "address": ["Shenzhen, Guangdong Province", "Guangzhou, Guangdong Province", "Hangzhou, Zhejiang Province", "Nanjing, Jiangsu Province", "Changsha, Hunan Province"]
})
df

Extract the province

Extract the province from address quickly; in the pattern, .*? lazily matches any sequence of characters
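The original pattern is not shown. As a sketch, assuming the addresses are in the "City, X Province" form and using a hypothetical pattern written for that form:

```python
import pandas as pd

# Assumed data: a subset of the addresses in "City, X Province" form
df = pd.DataFrame({
    "address": ["Shenzhen, Guangdong Province",
                "Hangzhou, Zhejiang Province",
                "Nanjing, Jiangsu Province"]
})

# The match starts at the comma; (.*?) lazily captures up to " Province"
province = df["address"].str.extract(r",\s*(.*?) Province", expand=False)
print(province.tolist())  # ['Guangdong', 'Zhejiang', 'Jiangsu']
```

Passing expand=False returns a Series here, which is convenient for assigning the result to a new column.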

Extract province + city

The province and the city can be extracted at the same time, and column names can be specified as well.
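A sketch of extracting both parts with named groups, assuming addresses in the "City, X Province" form; the pattern and the column names city and province are illustrative assumptions:

```python
import pandas as pd

# Assumed data: a subset of the addresses in "City, X Province" form
df = pd.DataFrame({
    "address": ["Shenzhen, Guangdong Province",
                "Hangzhou, Zhejiang Province",
                "Changsha, Hunan Province"]
})

# First group: everything before the comma; second group: text before " Province"
pattern = r"(?P<city>.*?),\s*(?P<province>.*?) Province"
parts = df["address"].str.extract(pattern)
print(parts)
```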

Extract the name + gender

Extract both the name and the gender from the name field; \w matches one word character (a letter, digit, or underscore) and + repeats the preceding element one or more times
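A sketch of this extraction, assuming every value uses a hyphen between name and gender; the column names are illustrative:

```python
import pandas as pd

# Assumed data: values in "name-gender" form
df = pd.DataFrame({
    "name": ["Tom-male", "Peter-male", "Jimmy-female", "Mike-male", "John-female"]
})

# \w+ captures a run of word characters on each side of the hyphen
parts = df["name"].str.extract(r"(?P<name>\w+)-(?P<gender>\w+)")
print(parts)
```

Because the hyphen is not a word character, \w+ stops at it and the two groups split cleanly.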

Regular expression basics

Here is a quick primer on regular expressions, courtesy of Google Analytics:

Wildcards

.	Matches any single character (letter, digit, or symbol)	1. matches 10, 1A; 1.1 matches 111, 1A1
?	Matches the preceding character 0 or 1 times	10? matches 1, 10
+	Matches the preceding character 1 or more times	10+ matches 10, 100
*	Matches the preceding character 0 or more times	1* matches 1, 10
|	Creates an OR match; do not use at the end of an expression	1|10 matches 1, 10

Anchors

^	Matches the adjacent characters at the beginning of the string	^10 matches 10, 100, 10x; does not match 110, 110x
$	Matches the adjacent characters at the end of the string	10$ matches 110, 1010; does not match 100, 10x
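The anchor rows can be checked with Python's re module. This sketch uses re.search, which looks for the pattern anywhere in the string, so only the anchors restrict where a match may occur:

```python
import re

candidates_start = ["10", "100", "10x", "110", "110x"]
candidates_end = ["110", "1010", "100", "10x"]

# ^10 requires the string to begin with "10"
starts_with_10 = [t for t in candidates_start if re.search(r"^10", t)]
# 10$ requires the string to end with "10"
ends_with_10 = [t for t in candidates_end if re.search(r"10$", t)]

print(starts_with_10)  # ['10', '100', '10x']
print(ends_with_10)    # ['110', '1010']
```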

The question mark (?)

The question mark (?) matches the preceding character 0 or 1 times. For example, 10? matches:

1: the 0 before the question mark matches 0 times
10: the 0 before the question mark matches once

The plus sign (+)

The plus sign (+) matches the preceding character 1 or more times. For example, 10+ matches:

10: the 0 matches once
100: the 0 matches twice
1000: the 0 matches three times

The asterisk (*)

The asterisk (*) matches the preceding character 0 or more times. For example, 10* matches:

1: the 0 matches 0 times
10: the 0 matches once
100: the 0 matches twice
1000: the 0 matches three times
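The three quantifiers can be compared side by side with Python's re module. This sketch uses re.fullmatch so that the whole string has to match the pattern, which makes the difference between ?, + and * visible:

```python
import re

candidates = ["1", "10", "100", "1000"]

def full_matches(pattern):
    """Return the candidates that match the pattern in their entirety."""
    return [t for t in candidates if re.fullmatch(pattern, t)]

print(full_matches(r"10?"))  # ['1', '10']
print(full_matches(r"10+"))  # ['10', '100', '1000']
print(full_matches(r"10*"))  # ['1', '10', '100', '1000']
```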

I will write a detailed article on regular expressions based on Python's re module.
