Pandas series: It all starts with the explode function
Some time ago, a big data developer on our team left the department. While he was handing his projects over to the other developers, I sat in on the meetings, because some of the company's business logic was involved. When discussing one project, he said:
The business logic is… I implemented this using the Hive explode function.
He gave a simple example to illustrate what the explode function does, and I wrote down its name on the spot. I don't use Hive much in my work, so I wondered: can Pandas implement this feature?
The explode function
What requirement does the explode function fulfill? Let me recall his example:
Suppose you have data that contains an order number and the prices of the items in each order (3 items per order). Applying the Hive explode function turns the single cell of prices into several rows, one per item. In other words, it converts a column into rows, which makes subsequent aggregation possible.
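Hive implements this with two constructs, which you can look up:
- explode(col): a table-generating function that emits one row per element of an array (or per entry of a map)
- LATERAL VIEW: used together with explode to join the generated rows back to the other columns of the source row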
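To make the column-to-row idea concrete, here is a minimal sketch of the same reshaping in Pandas rather than Hive. The order numbers, prices, and column names are made up for illustration; they are not the data from my colleague's example.

```python
import pandas as pd

# Made-up sample data: each order keeps the prices of its 3 items in one cell.
orders = pd.DataFrame({
    "order_no": [1, 2],
    "prices": [[10, 20, 30], [5, 15, 25]],
})

# Turn the list column into rows: one row per (order, price) pair.
exploded = orders.explode("prices")

# explode leaves the column with object dtype, so cast before doing arithmetic.
exploded["prices"] = exploded["prices"].astype(int)

# Aggregation is now straightforward, for example the total per order.
totals = exploded.groupby("order_no")["prices"].sum()
print(totals)
```

Each order number is repeated once per item after the explode step, which is what lets an ordinary groupby do the aggregation.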
Pandas implementation
The specific requirements
A recent requirement of mine happened to need exactly this explode functionality, and I implemented it in Pandas. On the left side of the table below are the sales records for each order, including the salesperson, the merchandiser, and the cashier, who may or may not be the same person.
Now we need to compute the sales performance of each employee, where sales performance means the number of orders the employee participated in. The rules are:
- An employee who acts as salesperson, merchandiser, or cashier on an order counts as having participated in that order;
- Within the same order, an employee who fills several roles is counted only once
What exactly should the explode function produce here? The right side shows the result we want:
- Zhang San: participated in order No. 1 (salesperson) and order No. 2 (merchandiser, cashier); count 2
- Li Si: participated in order No. 1 (merchandiser, cashier), order No. 2 (salesperson), and order No. 3 (merchandiser); count 3
- Wang Wu: participated in order No. 3 (salesperson, cashier); count 1
The solution process
1. Simulate the data in Pandas
2. Create a new field, Employee, that holds the employees who took part in each order
3. Explode the new field in Pandas, so that each order number is repeated once per employee
4. Count the results
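The four steps can be sketched roughly as follows. The table is reconstructed from the expected results listed earlier, so treat it as an assumption rather than the original data. The column names (`order_no`, `salesperson`, `merchandiser`, `cashier`, `employee`) and the spelling "Wang Wu" for the third employee are likewise my own choices for this sketch.

```python
import pandas as pd

# Step 1: simulate the data (reconstructed from the expected results; names are assumed).
orders = pd.DataFrame({
    "order_no": [1, 2, 3],
    "salesperson": ["Zhang San", "Li Si", "Wang Wu"],
    "merchandiser": ["Li Si", "Zhang San", "Li Si"],
    "cashier": ["Li Si", "Zhang San", "Wang Wu"],
})
roles = ["salesperson", "merchandiser", "cashier"]

# Step 2: new field with the distinct employees of each order.
# Using a set drops duplicates, so someone with two roles is counted once.
orders["employee"] = pd.Series(
    [sorted(set(names)) for names in orders[roles].to_numpy().tolist()],
    index=orders.index,
)

# Step 3: explode the list column, giving one row per (order, employee) pair.
exploded = orders[["order_no", "employee"]].explode("employee")

# Step 4: count how many distinct orders each employee took part in.
performance = exploded.groupby("employee")["order_no"].nunique()
print(performance)
```

Deduplicating inside each order before exploding is what enforces the "counted only once" rule. Without the set, Li Si would appear twice for order No. 1.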
In this way, performance can be counted by whichever field you need, such as employee, salesperson, or merchandiser.
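Counting by a single role needs no explode step at all, since each order has exactly one value in that column. A minimal sketch, again using data reconstructed from the expected results and assumed column names:

```python
import pandas as pd

# Same reconstructed sample data as before (column names are assumed).
orders = pd.DataFrame({
    "order_no": [1, 2, 3],
    "salesperson": ["Zhang San", "Li Si", "Wang Wu"],
    "merchandiser": ["Li Si", "Zhang San", "Li Si"],
    "cashier": ["Li Si", "Zhang San", "Wang Wu"],
})

# Number of distinct orders per salesperson and per merchandiser.
by_salesperson = orders.groupby("salesperson")["order_no"].nunique()
by_merchandiser = orders.groupby("merchandiser")["order_no"].nunique()
print(by_salesperson)
print(by_merchandiser)
```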
What is Pandas
Pandas is a core third-party Python library for data processing and analysis. It provides fast, flexible, and expressive data structures.
Pandas is a powerful toolset for analyzing structured data. It is built on NumPy (another Python library, which provides high-performance array and matrix operations) and can be used for data mining and analysis, as well as data cleaning.
What is Pandas for
The specific uses of Pandas will be covered in a future series of posts, so stay tuned!
Conclusion: embrace Pandas and say farewell to Excel!
WeChat official account
WeChat official account: Youerhuts, welcome to follow!
It's a lovely little cottage. Its owner codes with one hand to make a living and cooks with the other to enjoy life. You are welcome to drop by.