1. What is data analysis
Data analysis refers to the use of appropriate statistical methods to analyze large amounts of collected data, summarizing, understanding, and digesting it in order to extract as much value from the data as possible.
2. What does data analysis do
Data analysis is the process of studying and summarizing data in detail in order to extract useful information and form conclusions. The mathematical foundations of data analysis were established in the early 20th century, but it was not until the advent of computers that practical operations were made possible and data analysis became widespread.
- Analyze users’ consumption behavior
- Develop plans for promotional activities
- Develop promotion timing and granularity
- Calculate user activity
- Analyze product repurchase rates
- Analyze click-through rates
- Determine launch time
- Plan ad target audiences
- Decide which platforms to publish on
In short, data analysis is the use of appropriate methods to analyze large amounts of collected data to help people make judgments and take appropriate actions. For example:
- Insurance companies use large amounts of claims data to determine which claims are likely to be fraudulent
- Alipay automatically adjusts users’ Huabei credit limits based on large amounts of consumption records and behavior
- The short video platform pushes favorite videos to users based on users’ clicking and viewing behavior data
3. Why study data analysis
- Job demand
- It is the foundation of Python data science
- It’s the foundation of machine learning
4. The data analysis workflow
- Ask questions
- Prepare the data
- Analyze the data
- Draw conclusions
- Visualize the results
5. Set up the data analysis environment
1. Anaconda
- Download the installation package from www.anaconda.com
- Note: the installation directory must not contain Chinese characters or special symbols
Anaconda integrates all the environments needed for data analysis and machine learning
2. Jupyter
- Jupyter is a web-based visual development tool that ships with Anaconda
3. Basic use of Jupyter
- Start: type jupyter notebook in the terminal and press Enter
- Create a new file: New -> Python3
- A cell (block of code) has two modes
  - Code: write code
  - Markdown: write notes
- Shortcuts
  - Add a cell: a (above) or b (below)
  - Delete a cell: x
  - Change the cell mode
    - Change to Markdown mode: m
    - Change to Code mode: y
  - Run a cell: Shift + Enter
  - Auto-completion: Tab
  - Open the help documentation: Shift + Tab
6. How to use Python for data analysis
We can’t do data analysis in Python without the following “three musketeers”:
- Numpy
- Pandas
- Matplotlib
Numpy module
- Numerical Python (Numpy) is the basic library for scientific computing in Python. It focuses on numerical computation, underpins most of Python’s scientific computing libraries, and is used for numerical operations on large, multidimensional arrays.
1. Creating Numpy arrays
- Use array() to create a one-dimensional array
- Use array() to create a multidimensional array
- Use zeros() to create a multidimensional array
- Use ones() to create a multidimensional array
- Use linspace() to create a one-dimensional evenly spaced array
- Use arange() to create a one-dimensional evenly spaced array
- Use random.randint() to create a random multidimensional array
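A quick sketch of each creation routine (the shapes and values here are arbitrary):

```python
import numpy as np

arr1 = np.array([1, 2, 3])                    # one-dimensional array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])       # multidimensional array
zeros = np.zeros((3, 4))                      # 3x4 array of zeros
ones = np.ones((2, 3))                        # 2x3 array of ones
lin = np.linspace(0, 100, num=20)             # 20 evenly spaced values from 0 to 100
ar = np.arange(0, 100, 2)                     # 0, 2, 4, ... 98
rnd = np.random.randint(0, 100, size=(4, 5))  # 4x5 random integers in [0, 100)
```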
2. Common attributes of Numpy
- shape
- ndim
- size
- dtype
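For example:

```python
import numpy as np

arr = np.random.randint(0, 100, size=(4, 5))
print(arr.shape)  # the dimensions, e.g. (4, 5)
print(arr.ndim)   # the number of dimensions, e.g. 2
print(arr.size)   # the total number of elements, e.g. 20
print(arr.dtype)  # the element type, e.g. int64
```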
3. Numpy indexing and slicing
- Indexing works the same way as for lists
- Slicing operations
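For example:

```python
import numpy as np

arr = np.random.randint(0, 100, size=(5, 6))
print(arr[1])       # a single row, indexed like a list
print(arr[0:2])     # slice the first two rows
print(arr[:, 0:2])  # slice the first two columns
print(arr[::-1])    # reverse the rows
```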
4. Matrix operations for Numpy
- Reshaping: reshape()
- Concatenation
  - Splices multiple Numpy arrays together horizontally or vertically
  - The axis parameter indicates the direction
    - 0: along the columns (vertical)
    - 1: along the rows (horizontal)
- Common aggregation operations
  - sum, max, min, mean
- Commonly used statistical functions
  - Standard deviation: a measure of how spread out a set of data is around its mean
  - Variance: the mean of the squared differences between each sample and the sample mean, i.e. mean((x - x.mean())**2). In other words, the standard deviation is the square root of the variance.
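A sketch covering reshaping, concatenation, aggregation, and the statistical functions (the arrays are arbitrary):

```python
import numpy as np

mat = np.arange(12).reshape((3, 4))  # reshape a 1-D array into a 3x4 matrix

a = np.random.randint(0, 10, size=(2, 3))
v = np.concatenate((a, a), axis=0)   # vertical concatenation -> shape (4, 3)
h = np.concatenate((a, a), axis=1)   # horizontal concatenation -> shape (2, 6)

print(mat.sum(), mat.max(), mat.min(), mat.mean())  # aggregations

x = np.array([1, 2, 3, 4, 5])
print(x.std())                     # standard deviation
print(x.var())                     # variance
print(np.mean((x - x.mean())**2))  # the same value as x.var()
```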
Pandas module
1. Why learn Pandas
Numpy will help us deal with numerical data, while Pandas will help us deal with other types of data.
2. Data structures in Pandas
- Series
  - An object similar to a one-dimensional array, consisting of two parts:
    - values: the data (ndarray type)
    - index: the associated index labels
- DataFrame
  - A tabular data structure with both row and column indexes
    - Row index: index
    - Column index: columns
    - Values: values
3. Series operations
3.1 Creation of Series
The index parameter specifies an explicit index to make the Series more readable.
You can also use a dictionary as a data source.
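A sketch of both creation styles (the subjects and scores are made up):

```python
import pandas as pd

s1 = pd.Series([150, 120, 99], index=['math', 'english', 'chinese'])  # explicit index
s2 = pd.Series({'math': 150, 'english': 120})                         # dict as the data source
```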
3.2 Series indexing and slicing
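A sketch of the common access patterns:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.iloc[0])       # implicit (positional) index
print(s.loc['b'])      # explicit (label) index
print(s.iloc[0:2])     # slice by position (endpoint excluded)
print(s.loc['a':'c'])  # slice by label (endpoint included)
```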
3.3 Series common properties
- shape
- size
- index
- values
- dtypes
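For example:

```python
import pandas as pd

s = pd.Series({'math': 150, 'english': 120})
print(s.shape)   # (2,)
print(s.size)    # 2
print(s.index)   # the labels
print(s.values)  # the underlying ndarray
print(s.dtypes)  # the element type
```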
3.4 Series common methods
- head()
- tail()
- unique()
- isnull()
- notnull()
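A sketch of these methods:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, np.nan, 3])
print(s.head(3))    # the first 3 elements
print(s.tail(2))    # the last 2 elements
print(s.unique())   # the distinct values
print(s.isnull())   # True where a value is missing
print(s.notnull())  # True where a value is present
```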
3.5 Series arithmetic operations
Elements whose indexes match are operated on together; indexes without a match are filled with NaN.
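For example:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'd'])
print(s1 + s2)  # a and b are added; c and d have no match and become NaN
```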
4. DataFrame operations
4.1 Creating a DataFrame
A DataFrame can be created from an ndarray.
You can also use a dictionary as a data source.
The index parameter specifies an explicit row index to make the DataFrame more readable.
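A sketch of both creation styles:

```python
import numpy as np
import pandas as pd

# from an ndarray
df1 = pd.DataFrame(np.random.randint(0, 100, size=(3, 4)))

# from a dict, with an explicit row index
df2 = pd.DataFrame({'math': [99, 80], 'english': [70, 88]}, index=['tom', 'jay'])
```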
4.2 DataFrame indexing and slicing
- iloc: select rows by the implicit (positional) index
- loc: select rows by the explicit (label) index
- Slicing rows
- Slicing columns
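A sketch of these selections on a mocked-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'math': [99, 80, 65], 'english': [70, 88, 92]},
                  index=['tom', 'jay', 'ann'])
print(df.iloc[0])         # a row, by implicit (positional) index
print(df.loc['jay'])      # a row, by explicit (label) index
print(df[0:2])            # slice rows
print(df.loc[:, 'math'])  # select a column
```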
4.3 Common DataFrame attributes
- shape
- values
- columns
- index
4.4 Common methods for DataFrame
The same as for Series.
4.5 Arithmetic operations for DataFrame
The same as for Series.
4.6 Concatenating and Merging DataFrames
Concatenation
- pd.concat
- pd.append
Next we’ll mock up two DataFrames.
Using pd.concat()
- Matched concatenation
- Horizontal concatenation
- Mismatched concatenation
  - A mismatch occurs when the dimensions and indexes being concatenated are inconsistent: for example, the column indexes differ when concatenating vertically, or the row indexes differ when concatenating horizontally
  - There are two join modes
    - Outer join: fill with NaN (the default)
    - Inner join: keep only the matched items
- PS: if you want to preserve data integrity, use the parameter join='outer' (outer join)
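A sketch of these modes with two mocked-up DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

pd.concat([df1, df2], axis=0)                # mismatched vertical concat: outer join fills NaN
pd.concat([df1, df2], axis=0, join='inner')  # inner join: only the shared column 'a' survives
pd.concat([df1, df1], axis=1)                # horizontal concatenation
```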
Using pd.append()
- append() can only concatenate vertically, with an outer or inner join (it is rarely used)
Merging: pd.merge()
- The difference between merge and concat is that merge joins on a common column
- When merging with pd.merge(), columns that share the same name are automatically used as the join key
- Note: the elements in each column do not need to be in the same order
One-to-one merge
First let’s mock up two DataFrames.
Using pd.merge()
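A sketch with two mocked-up DataFrames that share an employee column:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['bob', 'jake'], 'group': ['sales', 'dev']})
df2 = pd.DataFrame({'employee': ['bob', 'jake'], 'hire_date': [2009, 2012]})
pd.merge(df1, df2)  # joins automatically on the shared 'employee' column
```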
One-to-many merge
First let’s mock up two DataFrames.
Using pd.merge()
Many-to-many merge
First let’s mock up two DataFrames.
Using pd.merge()
The merge() method can also take left_on and right_on arguments, and the how argument specifies the join mode.
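For example, with differently named key columns:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['bob', 'jake'], 'salary': [7000, 8000]})
df2 = pd.DataFrame({'employee': ['bob', 'lisa'], 'group': ['sales', 'hr']})

# join on differently named columns, keeping all rows from both sides
pd.merge(df1, df2, left_on='name', right_on='employee', how='outer')
```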
5. Data cleaning based on Pandas
5.1 Why data needs to be cleaned
- The original data may contain missing values (null values)
  - These values are meaningless and interfere with our analysis results
- Duplicate values
  - Repeated values do not need to be analyzed and processed multiple times
- Outliers
  - Depending on how the data was collected, outliers may appear in it, and they also interfere with our analysis results
5.2 Handling Missing Values
- There are two kinds of missing value:
- None
- np.nan(NaN)
- The difference between the two
  - None: an object type (NoneType)
  - np.nan: a floating-point type (float)
Why does data analysis use floating-point nulls rather than object nulls?
- None + 1 raises a TypeError, while np.nan + 1 evaluates to nan, so it does not interfere with or interrupt the computation
- NaN can participate in calculations
- None cannot participate in calculations
In Pandas, if a null value of the form None is encountered in the data, it is forcibly converted to NaN.
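A quick check of the difference:

```python
import numpy as np
import pandas as pd

print(np.nan + 1)  # nan: the computation keeps going
# None + 1         # would raise a TypeError

s = pd.Series([1, None, 3.0])
print(s)           # the None is stored as NaN (float64)
```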
Missing value handling
Let’s mock up a set of data with missing values.
- Method 1: filter out the missing values (drop the rows that contain them)
  - isnull() combined with any()
  - notnull() combined with all()
  - Use dropna() to drop missing rows or columns directly
- Method 2: fillna() fills in the missing values
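A sketch of both methods on a mocked-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})

# Method 1: filter out rows that contain missing values
clean = df.loc[~df.isnull().any(axis=1)]  # isnull() + any()
clean = df.loc[df.notnull().all(axis=1)]  # notnull() + all()
clean = df.dropna(axis=0)                 # drop missing rows directly

# Method 2: fill in the missing values
filled = df.fillna(value=0)  # fill with a fixed value
filled = df.ffill()          # fill each hole with the previous value in its column
```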
5.3 Processing Duplicate Data
Let’s mock up a set of data with duplicate values.
- Use drop_duplicates()
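For example:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4]})
df.drop_duplicates(keep='first')  # keep only the first of each duplicated row
```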
5.4 Handling Outliers
What are outliers?
- Outliers are values that deviate markedly from the rest of the sample; they bias estimates and inflate the error variance.
Next we’ll mock up a set of data with outliers.
Then we clean the outliers out.
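A sketch of one common approach, treating values greater than twice the standard deviation as outliers (the threshold is only an example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.random(100)})  # mocked-up data
limit = df['value'].std() * 2        # example rule: twice the standard deviation
df = df.loc[~(df['value'] > limit)]  # drop the rows flagged as outliers
```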
6. Pandas advanced operations
6.1 Replacement Operations
- Replacement works the same way for a Series and a DataFrame
- Single-value replacement
  - Plain replacement: replaces every element that matches: to_replace=15, value='value'
  - Column-specific single-value replacement: to_replace={column label: value to replace}, value='value'
- Multi-value replacement
  - List replacement: to_replace=[], value=[]
  - Dictionary replacement (recommended): to_replace={to_replace: value, to_replace: value}
First let’s mock up a DataFrame.
Using replace()
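A sketch of each replacement form on a mocked-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [15, 2], 'b': [3, 15]})

df.replace(to_replace=15, value='hello')         # replace every 15
df.replace(to_replace={'b': 15}, value='hello')  # replace 15 only in column 'b'
df.replace(to_replace=[2, 3], value=[20, 30])    # list replacement
df.replace(to_replace={2: 20, 3: 30})            # dictionary replacement (recommended)
```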
6.2 Mapping operations
- Concept: create a mapping that binds each element of values to a specific label or string (providing a different representation of an element’s value)
- map() is a Series method and can only be called on a Series
First let’s mock up a DataFrame.
Using map()
Example: a 50% tax is paid on the portion of any salary over 3000. Calculate each person’s take-home pay, as sketched below.
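A sketch of both uses of map(), with made-up names and salaries:

```python
import pandas as pd

df = pd.DataFrame({'name': ['tom', 'jay'], 'salary': [10000, 2000]})

# bind values to other representations via a dict
df['nickname'] = df['name'].map({'tom': 'Tom', 'jay': 'Jay'})

# map() also accepts a function: 50% tax on the portion of salary above 3000
def after_tax(s):
    return s if s <= 3000 else s - (s - 3000) * 0.5

df['after_tax'] = df['salary'].map(after_tax)
```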
6.3 Group Aggregation
- The core of grouped data processing is the groupby() function
- The groups property lets you view the resulting grouping
Grouping
Next we’ll mock up a DataFrame and use groupby() together with groups.
Aggregation
Advanced data aggregation
- After grouping with groupby(), you can also pass custom functions to transform() and apply() to perform further operations:
  df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
- The function passed into transform() or apply() can also be a lambda expression
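A sketch of grouping and aggregation (the item/price data is made up):

```python
import pandas as pd

df = pd.DataFrame({'item': ['apple', 'apple', 'banana'],
                   'price': [4.0, 3.0, 2.5]})

g = df.groupby('item')
print(g.groups)                                  # view the grouping
print(g['price'].sum())                          # aggregation
print(g['price'].apply(sum))                     # the equivalent via apply()
print(g['price'].transform(lambda x: x.mean()))  # results aligned to the original rows
```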
6.4 Loading Data
- Read data from a CSV file
- Read data from the database
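A minimal sketch; the file, database, and table names (data.csv, data.db, some_table) are all placeholders:

```python
import sqlite3

import pandas as pd

# read data from a CSV file
df_csv = pd.read_csv('./data.csv')

# read data from a database (SQLite here)
conn = sqlite3.connect('./data.db')
df_sql = pd.read_sql('select * from some_table', conn)
```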
Matplotlib module
- The Matplotlib module helps us visualize and chart the data easily.
First we import the required modules globally.
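For example:

```python
import matplotlib.pyplot as plt
import numpy as np
```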
1. Draw a line graph
1.1 Drawing single and multiple line graphs
1.2 Setting the coordinate ticks
1.3 Setting the legend
1.4 Setting the axis labels
1.5 Saving the figure
1.6 Curve styles
There are many other style parameters; see the library source code for details.
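A sketch that touches each of the points above (the sin/cos curves are arbitrary):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
plt.plot(x, np.sin(x), label='sin', color='red', ls='--')  # one curve, with a custom style
plt.plot(x, np.cos(x), label='cos')                        # a second curve
plt.xticks(np.arange(0, 11, 2))                            # coordinate ticks
plt.xlabel('x')                                            # axis labels
plt.ylabel('y')
plt.legend()                                               # legend
plt.savefig('./lines.png')                                 # save the figure
plt.show()
```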
2. Draw a bar chart
The rest of the usage is similar to line charts.
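For example, with made-up category data:

```python
import matplotlib.pyplot as plt

plt.bar(['apple', 'banana', 'pear'], [10, 25, 17])  # categories and their heights
plt.show()
```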
3. Draw a histogram
- It’s a special bar graph, also called a density graph.
Parameters of plt.hist():
- bins: the number of bins (an integer) or a sequence of bin edges. The default is 10
- normed: if True, the histogram values are normalized to form a probability density. The default is False
- color: the color of the histogram, either a single color value or a sequence of colors. If multiple data sets are given, such as a DataFrame object, the color sequence is matched to them in order. If not specified, a default line color is used
- orientation: setting orientation to horizontal draws a horizontal histogram. The default is vertical
The rest of the usage is similar to line charts.
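A sketch with random data; note that recent Matplotlib versions use density in place of the older normed parameter:

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)
plt.hist(data, bins=20, density=True, color='green', orientation='vertical')
plt.show()
```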
4. Draw a pie chart
- pie() requires only one argument, x
- A pie chart is good for showing each part’s proportion of the whole, while a bar chart is good for comparing the sizes of the parts
The rest of the usage is similar to line charts.
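For example, with made-up proportions:

```python
import matplotlib.pyplot as plt

plt.pie([0.4, 0.3, 0.2, 0.1], labels=['a', 'b', 'c', 'd'], autopct='%.1f%%')
plt.show()
```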
5. Scatter plot
- scatter() shows the general trend of the dependent variable as the independent variable changes
The rest of the usage is similar to line charts.
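A sketch with randomly generated, roughly linear data:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.random.random(50)
y = x * 2 + np.random.random(50) * 0.3  # a noisy linear relationship
plt.scatter(x, y)
plt.show()
```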