Data analysis is everywhere
Google can analyze search data to predict an upcoming flu outbreak in an area, enabling targeted prevention; Taobao analyzes your browsing and purchase history to recommend products with uncanny accuracy; NetEase Cloud Music, famous for its recommendations, uses similarity algorithms to build a daily playlist tailored to each listener…
Data is everywhere: from an individual's social connections, spending records and movements, to a company's sales and operations figures, production data and traffic-network data…
Data analysis is how that data is turned into value: extracting hidden knowledge from massive data sets, and using it to sharpen marketing, optimize products, conduct user research and support decisions.
On the one hand, the volume of data companies hold is growing rapidly and the demand for data analysis rises by the day; on the other hand, qualified candidates for data analyst roles are far scarcer than for other technical positions.
So how does a complete beginner quickly acquire data analysis skills? Zhihu is full of recommended books, and you have probably heard of plenty of learning methods, but if you have tried them you know that most never translate into real, usable skill.
What skills should a data analyst have
The most effective way to identify a learning path is to look at the specific skills required by a specific occupation or job.
We pulled together some of the most representative data analyst positions to see what skills are needed to be a well-paid data analyst.
In fact, the basic skills companies expect of data analysts vary little from posting to posting, and can be summarized as follows:
- SQL database basic operations, basic data management
- Excel/SQL for basic data analysis and presentation
- Data analysis in scripting languages, Python or R
- Ability to obtain external data, such as crawlers
- Basic data visualization skills and ability to write data reports
- Familiar with common data mining algorithms: mainly regression analysis
Next comes the data analysis workflow itself. A typical project follows the steps “data acquisition – data storage and extraction – data preprocessing – data modeling and analysis – data visualization”, and the skills to master break down along these same steps.
That workflow is also the most effective learning path: for each step you will know what needs to be accomplished, what to learn, and what you can safely skip.
Let’s talk about what to learn and how to learn from each part.
– ❶ –
Data acquisition: open data sets, Python crawlers
There are two main ways to obtain external data.
The first is to use external public data sets. Research institutions, companies and governments release open data, which you download from their specific websites. These data sets are usually well curated and of relatively high quality. Here are some common sites where data sets can be obtained:
UCI: the classic open data set repository of the University of California, Irvine, used by many data mining labs.
http://archive.ics.uci.edu/ml/datasets.html
National Data: data from the National Bureau of Statistics of China, covering the economy, people's livelihood and more.
http://data.stats.gov.cn/
CEIC: economic data for more than 128 countries, where you can pinpoint in-depth indicators such as GDP, imports and exports, retail, sales and so on.
http://www.ceicdata.com/zh-hans
China Statistics Information Network: the official website of the National Bureau of Statistics, which collects statistical information on national economic and social development.
http://www.tjcn.org/
Uyi Data: initiated by the State Information Center, it is a leading data trading platform in China with a lot of free data.
http://www.youedata.com/
Another way to get external data is through crawlers.
For example, a crawler can collect the postings for a given role from recruitment sites, rental listings for a given city from housing sites, the highest-rated movies on Douban, upvote lists on Zhihu, or comment threads on NetEase Cloud Music. With data crawled from the Internet, you can analyze a particular industry or a particular group of people.
Before you can crawl, you need some Python basics: data structures (lists, dictionaries, tuples, etc.), variables, loops, functions…
Then learn how to implement a web crawler with Python libraries (urllib, BeautifulSoup, requests, Scrapy). urllib + BeautifulSoup is a good combination for beginners.
Popular e-commerce sites, Q&A sites, second-hand trading sites, dating sites, recruitment sites and so on all hold very valuable data you can crawl.
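As a minimal illustration, here is a sketch of a small crawler using requests + BeautifulSoup (both listed above); the URL and the tag/class names are placeholders for whatever page you actually target.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical listing page; replace with the real URL you want to crawl
url = "https://example.com/top-movies"
headers = {"User-Agent": "Mozilla/5.0"}  # many sites reject requests with no User-Agent

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

# The tag and class names below are assumptions about the page's HTML structure
for item in soup.find_all("div", class_="movie-item"):
    title = item.find("span", class_="title")
    rating = item.find("span", class_="rating")
    if title and rating:
        print(title.get_text(strip=True), rating.get_text(strip=True))
```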
– ❷ –
Data storage and extraction: SQL
Excel handles general analysis fine when the data runs to tens of thousands of rows; once the volume grows beyond that, a database handles the problem far better. Most companies store their data in SQL databases, so as an analyst you need at least enough SQL to query and extract the company's data.
SQL, the classic database language, makes storing and managing massive data practical and greatly speeds up data extraction. You need to master the following skills:
- Extract data for a specific context: an enterprise database is large and complex, and you need to pull out only what you need, for example all sales data for 2017, the 50 best-selling products this year, or the spending of users in Shanghai and Guangdong… SQL does all of this with simple statements.
- Add, delete, query and update records: these are the most basic database operations, each implemented with a simple statement, so you mainly need to remember the syntax.
- Group and aggregate data, and join multiple tables: these are the more advanced SQL operations. Joins become very useful once you work across several dimensions or data sets, and they let you handle much more complex data.
The SQL part is relatively easy, mostly a matter of mastering a handful of basic statements. Still, it is worth finding a few data sets and practicing on them yourself, even if only the most basic queries and extractions; see the sketch below.
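A minimal, hedged sketch of these operations using Python's built-in sqlite3 module and a made-up sales table; in a company you would connect to MySQL, PostgreSQL or similar instead, but the SQL statements look the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for practice
cur = conn.cursor()

# Create a small, made-up sales table and insert a few rows
cur.execute("CREATE TABLE sales (product TEXT, city TEXT, amount REAL, year INTEGER)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("A", "Shanghai", 120.0, 2017), ("B", "Guangdong", 80.0, 2017), ("A", "Beijing", 60.0, 2018)],
)

# Extract context-specific data: 2017 sales in Shanghai or Guangdong
cur.execute(
    "SELECT product, city, amount FROM sales "
    "WHERE year = 2017 AND city IN ('Shanghai', 'Guangdong')"
)
print(cur.fetchall())

# Grouping and aggregation: total amount per product, highest first
cur.execute(
    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product ORDER BY total DESC"
)
print(cur.fetchall())

conn.close()
```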
– ❸ –
Data preprocessing: Python (Pandas)
Most of the time the data we get is not clean: there are duplicates, missing values, outliers and so on. It has to be cleaned, and the records that would distort the analysis dealt with, before the results can be trusted.
For example, some sales channels fail to record data on time while others record it twice; user behavior logs are full of invalid actions that mean nothing for the analysis and need to be removed.
Each problem then needs an appropriate treatment: for incomplete records, do you drop them outright or fill them in from neighbouring values? These are the decisions you have to make.
Learn pandas for data preprocessing and you will be able to handle everyday data cleaning. Here are the things to know:
- Selection: accessing data by label, by specific value, by Boolean index, etc.
- Missing values: deleting rows with missing data, or filling them in
- Duplicates: detecting and removing duplicate values
- Outliers: stripping unwanted whitespace and removing extreme, abnormal values
- Common operations: descriptive statistics, apply, histograms, etc.
- Merging: combining tables according to various logical relationships
- Grouping: splitting the data, applying a function to each group, recombining the results
- Reshaping: quickly generating pivot tables
pandas has plenty of tutorials on the web and is very easy to pick up.
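A minimal sketch of the operations listed above on a tiny made-up DataFrame (the names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# A small made-up dataset with a missing value and a duplicate row
df = pd.DataFrame({
    "user": ["a", "b", "b", "c"],
    "city": ["Shanghai", "Beijing", "Beijing", "Guangzhou"],
    "spend": [100.0, np.nan, 250.0, 250.0],
})

# Selection: by label and by Boolean index
print(df.loc[0, "spend"])
print(df[df["spend"] > 150])

# Missing values: fill from a typical value (or drop rows with df.dropna())
df["spend"] = df["spend"].fillna(df["spend"].median())

# Duplicates: detect and remove
print(df.duplicated().sum())
df = df.drop_duplicates()

# Descriptive statistics and grouping
print(df.describe())
print(df.groupby("city")["spend"].sum())

# Reshaping: a quick pivot table
print(pd.pivot_table(df, index="city", values="spend", aggfunc="mean"))
```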
– ❹ –
Knowledge of probability theory and statistics
What does the overall distribution of the data look like? What are populations and samples? How do you apply basic statistics such as the median, mode, mean and variance? How do you run a hypothesis test in different scenarios? Most data analysis methods grow out of statistical concepts, so statistical knowledge is essential. Here are the things to know:
- Basic statistics: mean, median, mode, percentiles, extremes, etc.
- Other descriptive statistics: skewness, variance, standard deviation, significance, etc.
- Other statistical concepts: population and sample, parameters and statistics, error bars
- Probability distributions and hypothesis testing: the common distributions and the hypothesis testing workflow
- Other probability theory: conditional probability, Bayes' theorem, etc.
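As a concrete illustration of the basic statistics and hypothesis testing above, here is a minimal sketch using numpy and scipy on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=10, size=200)  # synthetic sample

# Basic descriptive statistics
print("mean:", np.mean(sample))
print("median:", np.median(sample))
print("std:", np.std(sample, ddof=1))
print("skewness:", stats.skew(sample))

# One-sample t-test: is the population mean equal to 50?
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print("t =", t_stat, "p =", p_value)  # a small p-value argues against the null hypothesis
```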
With this basic statistical knowledge you can already do meaningful analysis. Simply computing and visualizing these indicators yields plenty of conclusions: what makes the top 100, what the average level is, how things have trended in recent years…
You can use Python packages such as Seaborn and Matplotlib for this visual analysis, walking through the various statistics as charts and arriving at instructive conclusions.
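For example, a quick histogram with Seaborn and Matplotlib (the data here is synthetic) already shows the shape of a distribution at a glance:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(1)
values = rng.normal(loc=100, scale=15, size=500)  # synthetic metric values

# Histogram plus kernel density estimate to see the overall distribution
sns.histplot(values, kde=True)
plt.title("Distribution of a synthetic metric")
plt.xlabel("value")
plt.ylabel("count")
plt.show()
```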
– ❺ –
Python data analysis
If you have looked around, you know there are plenty of Python data analysis books, all of them thick and daunting to work through. Yet the genuinely useful material makes up only a small fraction of each.
For example, once you have mastered regression analysis, linear and logistic regression alone let you analyze most data sets and reach reasonably accurate conclusions. The points to master in this part are:
- Regression analysis: linear regression, logistic regression
- Basic classification algorithms: decision tree, random forest…
- Basic clustering algorithm: K-means……
- Feature engineering basics: how to improve a model through feature selection
- Parameter tuning: how to optimize a model by adjusting its parameters
- Python data analysis packages: scipy, numpy, scikit-learn, etc
At this stage, focus on understanding regression analysis; with descriptive statistics plus regression you can already solve most problems and reach solid conclusions.
Later you will learn which model suits which type of problem, and how to improve predictive accuracy through feature extraction and parameter tuning. This starts to taste of data mining and machine learning; in fact, a good data analyst should be a junior data mining engineer.
You can use Python's scikit-learn library for this data analysis and data mining modeling.
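A minimal scikit-learn sketch of this regression workflow on synthetic data, covering both linear and logistic regression with a train/test split and a simple evaluation metric:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: a numeric target for linear regression,
# and a binary target derived from it for logistic regression
X = rng.normal(size=(300, 3))
y_numeric = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=300)
y_binary = (y_numeric > 0).astype(int)

# Linear regression
X_train, X_test, y_train, y_test = train_test_split(X, y_numeric, test_size=0.3, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)
print("R^2 on test set:", r2_score(y_test, linreg.predict(X_test)))

# Logistic regression
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=0)
logreg = LogisticRegression().fit(X_train, y_train)
print("accuracy on test set:", accuracy_score(y_test, logreg.predict(X_test)))
```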
– ❻ –
Hands-on practice and data thinking
At this point you have the basic data analysis skills. You still need to practice on different cases and business scenarios to build the ability to solve real problems.
Starting from the open data sets mentioned above, find data in a direction that interests you and try analyzing it from different angles to see what valuable conclusions you can reach.
You can also find analysis-worthy problems in everyday life and at work; the e-commerce, recruitment, social and other platforms mentioned above are full of questions waiting to be mined.
At first you may struggle to frame good questions, but as experience accumulates you will find directions to analyze along the common dimensions: top rankings, averages, regional distribution, year-over-year change, correlation analysis, forecasting future trends, and so on. With enough practice you develop a feel for data, which is what we call data thinking.
Learning data analysis from zero is full of pitfalls, summarized below:
- 1. Environment configuration: installing tools and setting environment variables is unfriendly to beginners;
- 2. Without a sensible learning path, slogging through Python and HTML makes it extremely easy to give up;
- 3. Python has many packages and frameworks to choose from, and it is unclear which are the friendliest;
- 4. When problems have no obvious solution, learning stalls;
- 5. Information online is scattered, beginner-unfriendly, and much of it reads as vague and confusing;
- 6. The individual skills are there, but thinking through and analyzing a concrete problem systematically is still hard;
- ……
Want to learn about Python data analysis?