Python is now the standard language, and one of the standard platforms, for data analysis and data science. So how can a novice get started with Python data analysis quickly?

Following the general workflow of data analysis, the relevant knowledge, skills, and learning resources are summarized below.

The general workflow of data analysis is as follows:

  1. Data collection
  2. Data storage and extraction
  3. Data cleaning and preprocessing
  4. Data modeling and analysis
  5. Data visualization

1. Data collection

Data sources fall into two categories: internal and external. Internal data mainly comes from an enterprise's own databases, while external data is mainly obtained by downloading public datasets or using a web crawler. (If your analysis deals only with internal data, this step can be skipped.)

Since open datasets can simply be downloaded, the key knowledge in this step is web crawling. The skills to master are basic Python syntax and writing Python crawlers.

Basic Python syntax: master the core data types (lists, dictionaries, tuples, etc.), variables, loops, functions, and so on, to the point where you can write code fluently, or at least without syntax errors.
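As a quick check of those basics, here is a small sketch covering lists, dictionaries, tuples, a loop, and a function (all names are illustrative):

```python
# A minimal tour of the basics: lists, dictionaries, tuples,
# variables, loops, and functions. All names are made up.

def word_lengths(words):
    """Return a dict mapping each word to its length."""
    lengths = {}               # dictionary
    for w in words:            # loop over a list
        lengths[w] = len(w)
    return lengths

langs = ["python", "sql", "pandas"]   # list
point = (3, 4)                        # tuple (immutable)

print(word_lengths(langs))   # {'python': 6, 'sql': 3, 'pandas': 6}
print(point[0] + point[1])   # 7
```

If you can read and modify a snippet like this without hesitation, you are ready for the crawler libraries below.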

Python crawlers: learn to implement web crawlers using mature Python libraries such as urllib, BeautifulSoup, Requests, and Scrapy.

Since most sites have anti-crawling mechanisms, you also need techniques for dealing with the anti-crawling strategies of different sites. These mainly include: regular expressions, simulating user login, using proxies, limiting crawl frequency, and handling cookie information.
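Two of these courtesies can be sketched with the standard library alone: sending a browser-like User-Agent header and throttling request frequency. This is a minimal illustration, not a complete crawler; the URL and delay value are made up, and no request is actually sent:

```python
import time
import urllib.request

# Sketch: attach a browser-like User-Agent and space requests out.
# The header string, URL, and delay are illustrative assumptions.

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}
CRAWL_DELAY = 1.0  # seconds between requests, to be polite to the site

def build_request(url):
    """Build a Request that carries our headers instead of the default ones."""
    return urllib.request.Request(url, headers=HEADERS)

def polite_requests(urls):
    """Yield Request objects, sleeping between them to limit crawl frequency."""
    for url in urls:
        yield build_request(url)
        time.sleep(CRAWL_DELAY)

req = build_request("https://example.com/")
print(req.get_header("User-agent"))  # urllib stores header keys capitalized
```

To actually fetch a page you would pass each request to `urllib.request.urlopen`; for login sessions and cookies, the Requests library's `Session` object is the usual next step.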

Recommended Resources:

  • Python 3 Concise Tutorial
  • Learn Python 3 the Hard Way

2. Data storage and extraction

When it comes to data storage, the one thing you can’t get away from is databases. SQL, as the most basic database tool, is indispensable. You should also understand the common relational and non-relational databases.

SQL: the four basic operations — create, read, update, and delete (CRUD) — should be second nature. Since analysis often starts by extracting a specific subset of data, you need to be able to write the SQL statements that extract it. Dealing with more complex data also involves grouping and aggregation, and joining multiple tables.
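These operations can be sketched with Python's built-in sqlite3 module, using an in-memory database as a stand-in for MySQL (the table and column names here are made up):

```python
import sqlite3

# CRUD plus GROUP BY and a JOIN, on an in-memory SQLite database.
# Table names, columns, and rows are illustrative.

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE depts (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE staff (id INTEGER, name TEXT, dept_id INTEGER, salary REAL)")

# Create (INSERT)
cur.executemany("INSERT INTO depts VALUES (?, ?)", [(1, "sales"), (2, "tech")])
cur.executemany("INSERT INTO staff VALUES (?, ?, ?, ?)",
                [(1, "an", 1, 5000.0), (2, "bo", 2, 8000.0), (3, "cai", 2, 9000.0)])

# Update and Delete
cur.execute("UPDATE staff SET salary = salary + 500 WHERE name = 'an'")
cur.execute("DELETE FROM staff WHERE id = 3")

# Read (SELECT) with a JOIN and GROUP BY aggregation
cur.execute("""
    SELECT d.name, COUNT(*), AVG(s.salary)
    FROM staff s JOIN depts d ON s.dept_id = d.id
    GROUP BY d.name
    ORDER BY d.name
""")
rows = cur.fetchall()
print(rows)   # [('sales', 1, 5500.0), ('tech', 1, 8000.0)]
```

The same SQL runs against MySQL with only minor dialect changes, which is why the statements, not the driver, are the skill worth drilling.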

MySQL and MongoDB: understand the basic use of MySQL and MongoDB, and the differences between the two. Once you have learned these two databases, you can pick up others very quickly on the same foundation.

Recommended Resources:

  • MySQL Foundation Course
  • MongoDB Basics

3. Data cleaning and preprocessing

The data we get is often dirty: duplicates, missing values, outliers, and so on. We need to clean and preprocess it to remove these interfering factors, so that the analysis results are more accurate.

For data preprocessing, we mainly use Python’s Pandas library.

Pandas: a data-processing library that provides rich data structures and functions for manipulating tables and time series.

Master data selection; handling of missing values, duplicates, whitespace, and outliers; and related operations such as merging and grouping.
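The cleaning steps just listed can be sketched in a few lines of Pandas on made-up data (the column names and values are illustrative):

```python
import pandas as pd

# Sketch of common cleaning steps: stray whitespace, missing values,
# duplicate rows, then a groupby aggregation. Data is made up.

df = pd.DataFrame({
    "city":  ["  beijing", "shanghai", "shanghai", "beijing", None],
    "sales": [10.0, None, 12.0, 10.0, 8.0],
})

df["city"] = df["city"].str.strip()                   # trim whitespace
df["sales"] = df["sales"].fillna(df["sales"].mean())  # fill missing with the mean
df = df.dropna(subset=["city"])                       # drop rows missing the key column
df = df.drop_duplicates()                             # remove exact duplicate rows

summary = df.groupby("city")["sales"].sum()           # grouping and aggregation
print(summary)
```

Filling with the column mean is only one of several reasonable strategies; dropping the rows or interpolating may fit your data better.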

Recommended Resources:

  • Pandas Basic data processing
  • 100 pandas exercises
  • Tutorials – pandas 0.25.1 documentation
  • Python for Data Analysis

4. Data modeling and analysis

Data analysis is not only about data processing, but also about mathematics and machine learning.

Probability and statistics: basic statistics (mean, median, mode, etc.), descriptive statistics (variance, standard deviation, etc.), statistical concepts (population and sample, parameters and statistics, etc.), probability distributions and hypothesis testing (the common distributions and the testing procedure), conditional probability, Bayes’ theorem, and other probability fundamentals.
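The basic descriptive statistics named above can all be computed with Python's standard-library `statistics` module; here is a sketch on a small made-up sample:

```python
import statistics

# Descriptive statistics on a made-up sample of eight observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(sample))       # 5.0  (arithmetic mean)
print(statistics.median(sample))     # 4.5  (middle value of the sorted data)
print(statistics.mode(sample))       # 4    (most frequent value)
print(statistics.pvariance(sample))  # 4.0  (population variance)
print(statistics.pstdev(sample))     # 2.0  (population standard deviation)
```

For sample (rather than population) variance use `statistics.variance`, which divides by n − 1 instead of n; distributions and hypothesis tests live in `scipy.stats`.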

Machine learning: master the commonly used classification, regression, and clustering algorithms and their principles; understand the basics of feature engineering and parameter tuning; and learn the Python packages NumPy, SciPy, scikit-learn, etc.

  • NumPy: A general-purpose library that not only supports commonly used numeric arrays, but also provides functions for efficiently processing these arrays.
  • SciPy: A scientific computing library for Python that greatly extends NumPy’s capabilities, with some overlap. NumPy and SciPy used to share underlying code, but have since parted ways.
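As a first taste of regression with NumPy alone, here is a sketch that fits a straight line y = a·x + b by least squares with `numpy.polyfit` (the data is made up and noise-free, so the fit recovers the line exactly):

```python
import numpy as np

# Minimal least-squares regression: fit y = a*x + b to made-up data.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                # exact line, no noise

a, b = np.polyfit(x, y, deg=1)   # degree-1 polynomial fit, highest power first
print(round(float(a), 6), round(float(b), 6))   # 2.0 1.0

y_pred = a * 5.0 + b             # predict at a new point
print(round(float(y_pred), 6))
```

scikit-learn's `LinearRegression` does the same job with a uniform `fit`/`predict` interface that extends to the classification and clustering algorithms mentioned above.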

Recommended Resources:

  • Simple statistics
  • Statistical Learning Methods
  • NumPy basis for numerical computation
  • 100 NumPy exercises
  • SciPy fundamentals of scientific computing

5. Data visualization

Data visualization relies heavily on Python’s Matplotlib and Seaborn libraries.

  • Matplotlib: A 2D plotting library that provides good support for drawing graphs and images. It integrates closely with NumPy and the rest of the SciPy ecosystem.
  • Seaborn: A statistical visualization package built on top of Matplotlib. It provides a high-level interface for creating attractive statistical charts.
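A minimal Matplotlib sketch looks like this: plot a sine curve, label the axes, and save the figure to a PNG file (the filename is arbitrary; the `Agg` backend renders to a file without needing a display):

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend: render to a file
import matplotlib.pyplot as plt
import numpy as np

# Plot sin(x) over one period and save it as a PNG. Filename is made up.
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("sine.png")
plt.close(fig)                   # free the figure once it is saved
```

Seaborn reuses this same figure/axes machinery, so everything learned here carries over directly.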

Recommended Resources:

  • Matplotlib Data graphing basics

Working through the above content step by step will basically meet the requirements for a junior data analyst. But don’t forget: after mastering the basic skills, you need plenty of practice — hands-on projects are the best way to keep improving.

Here are some examples of good projects:

  • Basic data analysis of China’s insurance industry over the past five years
  • Analysis of the current state of data-analyst jobs in Hangzhou amid the Internet industry downturn
  • Price prediction with a regression decision tree, based on JD.com mobile phone sales data

The above cases come from students of the Shiyanlou course “Lou+ Data Analysis and Mining in Practice.”