Welcome to the Tencent Cloud+ community, where you can find more of Tencent's practical technical articles.
This article was published by Brzhang
Data cleaning
First, why you need to clean the data
Data cleaning is tedious, but anyone who does data research cannot avoid it. The fundamental reason is that data gathered from various channels may have the following problems:
1. Unreasonable data. For example, some people in the sample are more than 120 years old, a building has 1000 floors, or other values are similarly implausible.
2. Wrong data types. For example, almost all of the values in a column are integers, but some are strings. If you feed such data directly to an algorithm without processing it, the program will generally crash.
3. Strings are hard for a computer to process, so we sometimes need to convert them to numeric types, which means designing a mapping. For example, the gender values [male, female] can be converted to 1 and 2, and house types such as [single room, one bedroom and one living room, three bedrooms and one living room, shop] can be mapped to a corresponding enumeration. For example, here is how I handle the house orientation:
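```python
def parse_orientation(row):
    # Map an orientation string to an integer code.
    # Longer phrases are tested before the shorter phrases they contain
    # (e.g. 'face southeast' before 'face south'), so every branch is reachable.
    if 'southwest' in row:
        return 1
    elif 'northeast' in row:
        return 2
    elif 'face southeast' in row:
        return 7
    elif 'face northwest' in row:
        return 5
    elif 'face north and south' in row:
        return 8
    elif 'face south' in row:
        return 4
    elif 'face north' in row:
        return 6
    elif 'face west' in row:
        return 9
    elif 'east' in row:
        return 3
    else:
        return 10
```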
And so on for the other fields.
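Points 1 and 2 above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the column names age and floor, the sample values, and the plausibility thresholds are all hypothetical and not taken from the author's data.

```python
import pandas as pd

# Hypothetical sample; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [34, "27", 150, "unknown"],
    "floor": [3, 12, 1000, 8],
})

# Point 2: coerce mixed string/integer values to numbers.
# Values that cannot be parsed become NaN instead of raising an error.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Point 1: keep only rows inside plausible ranges (thresholds are assumptions).
# NaN never satisfies between(), so unparseable ages are filtered out as well.
plausible = df["age"].between(0, 120) & df["floor"].between(1, 200)
clean = df[plausible]

print(clean)
```

Whether to drop implausible rows or correct them depends on the data set; dropping is simply the shortest option to show.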
What tools you need to know for data cleaning
The data we get can usually be reduced to a table. Whether it is XLS, CSV, or a JSON array, pandas can read it. After reading, the remaining work is basically using pandas APIs to clean the data. Below, I read a table of house price information; this data, of course, comes from my own crawler described in the previous article.
pandas is a powerful tool for data cleaning, and I'm sure you can find a pandas cheat sheet that works for you.
Here are some of the pandas APIs you will use most often:
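As a rough sketch of the reading step, the snippet below loads CSV and JSON text with pandas. The in-memory strings and their column names are hypothetical stand-ins for the crawler output, which is not shown in this article.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files produced by a crawler.
csv_text = "title,price,area\nroom A,3500,45\nroom B,5200,80\n"
json_text = '[{"title": "room A", "price": 3500}, {"title": "room B", "price": 5200}]'

# read_csv and read_json accept a path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

# For a spreadsheet on disk, pd.read_excel plays the same role
# (it needs an Excel engine package to be installed).

print(df_csv.head())
print(df_csv.dtypes)
```

Printing dtypes right after loading is a quick way to notice columns that were read as strings when numbers were expected.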
1. Common operations for taking subsets
loc selects subsets by label (for example, column name strings), while iloc selects subsets by integer position (starting at 0). In both, the part before the comma constrains the rows and the part after the comma constrains the columns.
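A minimal sketch of the two selectors, using a small hypothetical table (the column names title, price, and area are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["room A", "room B", "room C"],
    "price": [3500, 5200, 4100],
    "area": [45, 80, 60],
})

# loc: a row condition before the comma, column labels after it.
expensive = df.loc[df["price"] > 4000, ["title", "price"]]

# iloc: integer positions starting at 0; here the first two rows and columns.
top_left = df.iloc[0:2, 0:2]

print(expensive)
print(top_left)
```

Note that slices in iloc exclude the end position, as in ordinary Python slicing, whereas label slices in loc include it.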
2. Handling rows with blank data
This is nice and simple: a single API call can drop or fill samples that contain blank data.
I won't demonstrate it on my own data, because that data comes from my crawler and I already did some basic processing while crawling so that blank values cannot appear. I recommend doing the same when you write a crawler to collect data, since it reduces the pressure on data cleaning later.
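For data that does contain gaps, the two calls might look like the sketch below. The table is hypothetical, and filling with the column mean is just one possible strategy, not a recommendation from the author.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [3500, np.nan, 4100],
    "area": [45, 80, np.nan],
})

# Drop every row that has at least one missing value.
dropped = df.dropna()

# Or fill each gap with the mean of its column.
filled = df.fillna(df.mean())

print(dropped)
print(filled)
```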
3. The apply family
apply() applies a function along a column or row of a DataFrame, and applymap() applies a function to every element of the DataFrame (it was renamed DataFrame.map() in pandas 2.1). map(), on the other hand, operates on each element of a Series. Here I processed the ege column to normalize a mix of numbers and text into numbers.
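The author's parse_house_age function is not shown, so the version below is a hypothetical stand-in that pulls the first integer out of each value. The sample values are also assumptions. It is only meant to illustrate how apply() works on a column and on rows.

```python
import re
import pandas as pd

def parse_house_age(value):
    # Hypothetical parser: extract the first integer from values such as
    # '5 years' or 12; return -1 when no digits are present.
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else -1

df = pd.DataFrame({
    "ege": ["5 years", 12, "about 20 years"],
    "price": [3500, 5200, 4100],
    "area": [50, 80, 41],
})

# apply() on a single column (a Series) calls the function once per element.
df["ege"] = df["ege"].apply(parse_house_age)

# apply() on the DataFrame with axis=1 calls the function once per row.
df["unit_price"] = df.apply(lambda row: row["price"] / row["area"], axis=1)

print(df.head(5))
```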
In fact, this operation can be done entirely with map:
```python
df['ege'] = df['ege'].map(parse_house_age)
df.head(5)
```
The result is exactly the same, because we are operating on a single column.
More advanced data cleaning: using charts
1. Use scatter plots
2. Heat map of housing prices:
The chart shows the intervals over which the listings are distributed, which makes some problems that need cleaning easy to see.
3. The frequency histogram helps us quickly find outliers; their low frequency makes it hard not to doubt the authenticity of such data.
OK, so basically this process means using your brain to take the raw data you collected and gradually turn it into the data that the later algorithms need.
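The counts behind a frequency histogram can also be inspected directly. The sketch below uses hypothetical prices, bin edges, and a 20% rarity threshold, all chosen purely for illustration, to flag bins that hold very few samples.

```python
import pandas as pd

# Hypothetical prices; one value lies far outside the usual range.
prices = pd.Series([3200, 3500, 3600, 4100, 4300, 4800, 5200, 98000])

# Count how many samples fall into each bin, as a histogram would.
bins = [0, 2000, 4000, 6000, 8000, 100000]
counts = pd.cut(prices, bins=bins).value_counts().sort_index()

# Non-empty bins holding a small share of the samples are candidates for review.
rare = counts[(counts > 0) & (counts / len(prices) < 0.2)]

print(counts)
print(rare)
```

A rare bin is only a hint that the underlying rows deserve a closer look; it does not prove the data is wrong.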
This article has been authorized by the author for publication in the Tencent Cloud+ community.