This guide links my scattered tutorial articles together and introduces each one in detail, to help you get started in data science more effectively.
The problem
Since June 2017, I have been writing a series of data science tutorials in my Jianshu column, Yushu Zhilan.
It started with a programming workshop for graduate students. Inspired by Coach Yan’s innovative thinking training camp, I wrote up the process of making a word cloud after class and shared it with everyone.
That article, “How to Make Word Clouds in Python?”, was very warmly received by readers.
After that, I could not stop.
At readers’ request, and drawing on my own learning, research, and teaching practice, I have kept sharing more articles related to data science.
As the readership grew, the questions I received became more varied.
Many readers’ questions have already been answered in other articles, so I sometimes reply with “Please refer to my other article ‘…’; the link is …” to help the reader solve the problem.
Before I learned to empathize, I would have asked:
Why don’t they read my other articles?
But now, I can feel their doubts:
How was I supposed to know you had written another article?
Scattered articles are hard to study systematically and hard to search. So in November 2017, I put together an index post for the data science tutorials I had written.
I link to this index post at the end of every new tutorial and keep it updated.
However, a simple index of titles still cannot meet the needs of many readers.
Some readers follow the word cloud tutorial and find that when they analyze Chinese text, the output is garbled:
What should they do?
Furthermore, what if you want to change the outline of the word cloud to a specified shape?
From the titles alone, you may not easily see which article solves these problems, and you may even give up looking.
So I decided to write this guide.
This article is no longer a simple list of titles and links. It starts from tasks, follows the cognitive habit of moving from easy to difficult, reorders the articles, briefly introduces their content, and points out problems you may run into.
I hope it will be helpful to your study.
Basic environment
Most of the tutorials are run and demonstrated in Jupyter Notebook, a Python runtime environment.
The easiest way to install this environment is to install the Anaconda distribution.
Try installing Anaconda, running your first Jupyter Notebook, and printing “Hello World!”.
With this foundation, you can try different data science tasks.
My advice is to make word clouds first.
Because it is simple and gives you a sense of accomplishment.
Word clouds
Follow the tutorial “How to Make Word Clouds in Python?” step by step. With a few lines of Python code, you can create a word cloud like this.
I have also added a video version of “How to Make Word Clouds in Python?” for you to watch.
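A word cloud is, at heart, a picture of word frequencies. The following is a minimal sketch of the counting step only, using the standard library; it is not the tutorial's code, and the function name `word_frequencies` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample text, used only to show the shape of the result.
sample = "Data science is fun. Data is everywhere, and science needs data."
print(word_frequencies(sample, top_n=3))
```

A word cloud package then draws the more frequent words in a larger font; for Chinese text, the words first have to be separated by a segmentation step.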
After reading “How to Do Chinese Word Segmentation in Python?”, you can make Chinese word clouds like this.
If you want to change the outline of the word cloud, refer to the last part of “How Do Liberal Arts Students Deal with Python Programming Problems?”.
At this point, you have mastered the basics: installing the Python runtime environment, reading text files, calling common packages, visualizing and rendering results, and segmenting Chinese words.
Looking back, are you overflowing with a sense of accomplishment?
Virtual environments
If you are observant, you may have noticed that the written tutorial and the video are not exactly the same.
The series as a whole currently uses Python 3.6, while the written word cloud tutorial actually shows Python 2.7.
Why is that?
As the technology has evolved, Python has gradually moved to version 3.x.
Many third-party packages have announced schedules to support 3.x as soon as possible and to drop 2.x support.
In just six months, you can see how quickly the technology, the community, and the environment are changing.
Some packages, however, still support only Python 2.x, although there are fewer and fewer of them.
For a while, you need to be an “amphibian”. Do not restrict yourself to an older version of Python on principle; you would only be hurting yourself.
What does it take to be an amphibian?
One way is to use Anaconda’s virtual environments. See “How to Use a Python Virtual Environment in Jupyter Notebook?”.
Even if you initially installed Anaconda for Python 2.7, you can still quickly set up a Python 3.6 virtual environment.
With this trick, you can switch freely between different versions of Python.
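When you switch between environments, it helps to confirm which interpreter a notebook is actually running. A small sketch, assuming you paste it into a notebook cell; the helper name `describe_interpreter` is hypothetical:

```python
import sys


def describe_interpreter(version_info=sys.version_info):
    """Summarize the major and minor version of a Python interpreter."""
    major, minor = version_info[0], version_info[1]
    family = "3.x" if major >= 3 else "2.x"
    return "Python {}.{} ({} family)".format(major, minor, family)


print(describe_interpreter())
# The path of the running interpreter shows which environment the kernel uses.
print(sys.executable)
```

If the printed version is not the one you expected, the notebook is probably still attached to the kernel of a different environment.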
Natural language processing
Next, let’s try Natural Language Processing (NLP).
Sentiment analysis is a popular application of NLP in many social science fields.
The article “How to Do Sentiment Analysis in Python?” works through an English case and a Chinese case, using a different software package for each to meet the specific needs of that language.
You need only a few lines of code for Python to tell you which way a text’s sentiment leans. Isn’t that impressive?
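To make the idea of polarity concrete, here is a toy lexicon-based scorer. It is only a sketch under loud assumptions: the word lists are hand-picked for illustration, and the approach is far cruder than the packages the article uses.

```python
import re

# Hypothetical, hand-picked word lists used only for illustration.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def polarity(text):
    """Return a score in [-1, 1]: above 0 leans positive, below 0 negative."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    # Text with no known sentiment words is treated as neutral.
    return 0.0 if total == 0 else (pos - neg) / total


print(polarity("I love this great movie"))
print(polarity("What a terrible, awful day"))
```

A real sentiment package replaces the two word lists with a trained model, but the output has the same flavor: one number per text that says which way it leans.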
With sentiment analysis as the foundation, you can add dimensions and analyze larger volumes of data.
By adding a time dimension, you can continuously track how public opinion changes.
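Adding a time dimension usually means grouping the scores by date before plotting them. A minimal sketch of that grouping step, assuming each record is a `(date, score)` pair; the function name `daily_average` and the sample values are made up:

```python
from collections import defaultdict
from datetime import date


def daily_average(records):
    """Average sentiment scores per day; records are (date, score) pairs."""
    buckets = defaultdict(list)
    for day, score in records:
        buckets[day].append(score)
    # Sort by date so the result is ready to plot as a time series.
    return [(day, sum(scores) / len(scores)) for day, scores in sorted(buckets.items())]


# Made-up scores, for illustration only.
records = [
    (date(2017, 11, 2), -0.5),
    (date(2017, 11, 1), 1.0),
    (date(2017, 11, 1), 0.5),
]
for day, average in daily_average(records):
    print(day.isoformat(), average)
```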
The article “How to Use Python to Visualize Public Opinion Time Series?” gives you a step-by-step guide to visualizing sentiment analysis results on a time scale:
This chart is a little ugly.
But we need to tolerate clumsiness at first, and then iterate and refine.
Expecting a perfect score on the first attempt is routine for a few geniuses.
But for most people, it is the beginning of procrastination.
You may be eager to try time series visualization on your own data.
However, if your date data is formatted differently from the sample, problems may arise.
If that happens, do not panic. Refer to the second part of “How Do Liberal Arts Students Deal with Python Programming Problems?”, which analyzes the cause of the error in detail and shows how to fix it.
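Date trouble typically comes from a column whose text format differs from what the parsing code expects. One hedged way to cope is to try several formats in turn; the list of candidate formats below is an assumption, so replace it with the formats your own data uses.

```python
from datetime import datetime

# Candidate formats are assumptions; list the ones your own data uses.
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y/%m/%d")


def parse_date(raw):
    """Try each known format in turn and return the first successful parse."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: {!r}".format(raw))


print(parse_date("2017-11-05 08:30:00"))
print(parse_date("2017/11/05"))
```

Failing loudly on an unknown format is deliberate: a silently skipped row would leave a gap in the time series that is hard to notice in the chart.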
After working through it, the chart iterates into this:
Now, do you get a feel for sentiment analysis?
If you do not want to rely on a third-party sentiment classification algorithm and would rather train a more accurate model yourself, see “How to Train a Chinese Text Sentiment Classification Model with Python and Machine Learning?”.
All of the sentiment analysis so far is really just polarity analysis (positive versus negative). But as we all know, human emotion has many facets.
How can we decompose the multi-dimensional changes in emotion from a text?
“How to Use Python and R for Emotional Analysis of Game of Thrones Storylines?” analyzes the script of one episode of Game of Thrones, and here is what you get:
If you are a fan of Game of Thrones, can you tell which episode is depicted here?
Make a guess, then open the article and check your answer against its ending.
The visualization part of that article uses R.
R is also a very popular open source tool in data science. It may not be as versatile or as widely used as Python (after all, Python can do a lot besides data science), but it has a very good ecosystem thanks to the support and contributions of many statisticians.
What if you want to extract several important keywords from a single long text?
Please read “How to Extract Chinese Keywords with Python?”. It uses mature keyword extraction algorithms, such as term vectorization and TextRank, to solve the problem.
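To give a feel for vectorization-based keyword extraction, here is a small sketch that assumes TF-IDF weighting, one common vectorization scheme. It is not the article's code: the function name `tfidf_keywords` and the three toy documents are made up, and Chinese text would need word segmentation before this step.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_n=3):
    """Rank the words of one tokenized document by TF-IDF against the others."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Toy documents, already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang a song".split(),
]
print(tfidf_keywords(docs, 0, top_n=3))
```

Words that appear in every document, such as "the", score zero, while words unique to one document rise to the top.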
Q&A break
As you accumulate knowledge, skills, and experience, your questions probably increase as well.
Some students ask about this teaching method: the cases are interesting and easy to follow, but how can I apply them to my own study, work, and research?
I wrote a Q&A article for you called “How Can Liberal Arts Students Learn Data Science Effectively?”. It covers the following aspects:
- How to set a goal?
- How to decide on depth?
- How to enhance collaboration?
Speaking of collaboration, GitHub is the world’s largest host of open source code.
GitHub is also used several times throughout the tutorials to store code and data, so that you can reproduce the results yourself.
The article “How to Get Started on GitHub Effectively?” provides documentation and video tutorial resources to help you make the most of this treasure trove.
Many readers ask this question at this stage: Teacher, I want to learn Python. Can you recommend a book?
So you have already seen the benefits of Python, right?
The article “How to Learn Python Effectively?” helps you classify your learning style. Depending on the result, you can choose the learning path that suits you best.
The recommended materials include not only books but also MOOCs. Hopefully, this interactive approach will help you get started with data science.
Machine learning
From here, you can try more advanced analysis.
Take Machine Learning, for example.
The great thing about machine learning is this: for problems whose solution steps you (or any human) cannot describe precisely, a machine can observe a large number of cases (data), learn by trial and error, and build a reasonably useful model that automates the task or gives humans a basis for decisions.
In general, machine learning falls into three main categories:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
So far, this column has covered examples of the first two categories.
The biggest difference between supervised and unsupervised learning lies in the data.
When the data has been labeled (usually by hand), supervised learning is generally used;
when the data is unlabeled, unsupervised learning is usually used.
For supervised learning, we used a classification task as the example.
“To Lend or Not to Lend: How to Use Python and Machine Learning to Help You Make Decisions?” works through a loan approval case.
The specific machine learning algorithm it uses is the decision tree.
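A decision tree is built by repeatedly choosing the split that best separates the labels. The sketch below shows one such choice on a single numeric feature, assuming Gini impurity as the split criterion (the article may use a different one); the income figures and outcomes are made-up illustration data.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Find the threshold on one numeric feature with the lowest weighted Gini."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up illustration data: applicant income (in thousands) and loan outcome.
incomes = [20, 25, 30, 60, 70, 80]
outcomes = ["deny", "deny", "deny", "approve", "approve", "approve"]
print(best_threshold(incomes, outcomes))
```

A full decision tree applies the same search to every feature and then recurses into the left and right groups until they are pure enough.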
Some students said they ran into problems when drawing the decision tree.
This is mainly caused by differences in the operating environment and by dependencies that were not installed correctly.
The first part of “How Do Liberal Arts Students Deal with Python Programming Problems?” describes these problems in detail. Please follow the steps listed there to solve them.
Beyond that, the article shows you a task-oriented approach to learning, which will hopefully make your study of Python and data science more efficient.
For unsupervised learning, we covered “How to Extract Topics from Massive Text in Python?”.
It uses LDA, a clustering-style topic modeling method, to help you see the likely topic categories and their key terms in a large pile of documents that may interest you.
That article mentions stop word handling but does not show how to apply Chinese stop words.
In “How to Train a Chinese Text Sentiment Classification Model with Python and Machine Learning?”, I not only explain stop word handling in detail but also apply Naive Bayes, a supervised learning model, to sentiment analysis, and teach you how to train your own sentiment classification model.
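To show how stop word removal and Naive Bayes fit together, here is a compact sketch in plain Python. It is an illustration under stated assumptions, not the article's code: the stop word list, the four training sentences, and the helper names are all made up, and English text is used so that no segmentation package is needed.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "was", "and", "a"}  # hypothetical stop word list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns the counts needed to predict."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in samples for word in tokens}
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Pick the label with the highest log-probability (Laplace smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training sentences, for illustration only.
samples = [
    (tokenize("the food was great and the staff friendly"), "positive"),
    (tokenize("delicious and great value"), "positive"),
    (tokenize("the service was terrible and the food cold"), "negative"),
    (tokenize("awful and slow and terrible"), "negative"),
]
model = train_naive_bayes(samples)
print(predict(model, tokenize("great friendly service")))
print(predict(model, tokenize("terrible and cold")))
```

Removing stop words first keeps very common words from crowding the counts, which is why the two steps are usually taught together.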
Deep learning
Deep learning refers to machine learning that uses deep neural networks.
Compared with traditional machine learning methods, it uses a more complex model structure, needs more data, and consumes more computing resources and time to train.
Common deep learning applications include speech recognition, computer vision, and machine translation.
Of course, the application mentioned most often in the news is playing Go (weiqi):
The cases we offer are less about challenging the limits of human intelligence and more about everyday work and life.
In “How to Find Lost Customers with Python and Deep Neural Networks?”, I introduce the basic structure of a deep neural network.
The article presents a basic example of supervised learning with a feedforward neural network, using customer churn warning as the case.
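In a feedforward network, data flows one way through a stack of layers, and each neuron computes a weighted sum followed by an activation. The sketch below shows only that forward pass in plain Python; the weights are made-up, untrained numbers, and the helper names are hypothetical rather than part of any framework.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Pass positive values through and clip negative values to zero."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each output neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, layers):
    """Pass the features through each (weights, biases, activation) layer in turn."""
    output = features
    for weights, biases, activation in layers:
        output = dense(output, weights, biases, activation)
    return output


# Made-up, untrained weights: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
print(forward([1.0, 2.0], layers))
```

Training consists of adjusting the weights and biases so that the final output, read here as a churn probability, matches the labeled examples; a framework handles that part for you.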
For the hands-on part, we use TensorFlow as the back end and TFLearn as the front end to build your first deep neural network.
At the end of that article, you will find resources for further learning.
If you need to install Google’s TensorFlow deep learning framework, please read “The Pitfalls of Installing TensorFlow via a pip Upgrade”.
Armed with the basics of deep neural networks, we can start playing with computer vision.
The article “How to Recognize Images with Python and Deep Neural Networks?” uses as its example the classification of a varied collection of images of two robots, Doraemon and WALL·E.
This is where the Convolutional Neural Network (CNN) comes in.
The article helps you analyze what the different layers in a convolutional neural network do.
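Two of the layer types a CNN stacks are convolution and pooling. As a rough sketch of what each one computes (plain Python on a tiny made-up image, not how a framework implements them), consider:

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core of a convolution layer."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Made-up 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
edges = convolve2d(image, vertical_edge)
print(edges)
print(max_pool(edges))
```

The convolution output is largest where the edge sits, and pooling then shrinks the map while keeping that strongest response. In a trained network, the kernels are learned from data rather than written by hand.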
We try to avoid formulas and instead use images, GIFs, and plain, concise language to explain concepts to you.
The deep learning framework we use is Apple’s TuriCreate. You will call on a very deep convolutional neural network that lets us take a shortcut, reaching very high classification accuracy with very little training data.
Some readers who tried it found that accuracy on the test set reached 100% (results vary with the running environment) and cheered. At the same time, they found it hard to believe.
To explain this “miracle”, and to answer readers’ questions about how to search for images in their own data sets, I wrote “How to Find Similar Images Using Python and Deep Neural Networks?”.
I hope that after reading it, you will have a deeper understanding of Transfer Learning.
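With transfer learning, a pre-trained network turns each image into a feature vector, and finding similar images becomes a matter of comparing those vectors. The sketch below assumes cosine similarity as the comparison and uses made-up three-number vectors; real feature vectors are much longer, and the framework may use a different distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, top_n=2):
    """Rank catalog items by cosine similarity to the query feature vector."""
    ranked = sorted(
        catalog,
        key=lambda name: cosine_similarity(query, catalog[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical feature vectors standing in for the output of a pre-trained network.
catalog = {
    "doraemon_01": [0.9, 0.1, 0.0],
    "doraemon_02": [0.8, 0.2, 0.1],
    "walle_01": [0.1, 0.9, 0.3],
}
print(most_similar([0.85, 0.15, 0.05], catalog))
```

Because the heavy lifting was done when the network was originally trained, very little new data is needed, which helps explain the surprisingly high accuracy readers saw.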
If you are still a little confused about the basics of convolutional neural networks after these two articles, that is okay, because my graduate students have the same problem.
For this reason, I recorded a video Q&A session.
In this video, I mainly talk about the following aspects:
- The basic structure of a deep neural network;
- How a neuron’s computation is implemented;
- How to train a deep neural network;
- How to choose the optimal model (hyperparameter tuning);
- The basic principle of convolutional neural networks;
- How transfer learning is implemented;
- Q&A.
Hopefully, after watching it, the computer vision neural network models in the articles will be clear to you.
Another group of readers came to ask:
Teacher, I use Windows and cannot get TuriCreate installed no matter what I try. What should I do?
While I was worrying on their behalf, I happened upon a treasure, which led to “How to Run a Python Deep Learning Framework in the Cloud for Free?”.
You can deploy and run Apple’s deep learning framework on a Google Cloud Linux host with minimal effort, using a free GPU…
Does that sound like a dream?
Thank you, Google, for contributing to the accumulation of human knowledge.
Data acquisition
After trying deep learning, you will find yourself hungry for data.
Without a lot of data, you cannot feed your deep neural network.
How do you get the data?
First, we need to distinguish between data sources.
There are many sources of data. For researchers, however, network data and literature data are the most commonly used.
At present, the mainstream (legal) ways of obtaining network data fall into three categories:
- Downloading open data sets;
- Reading through an API;
- Crawling.
In “How to Read Open Data in Python?”, I show you how to download open data sets and use them in Python.
The article introduces how to read, preprocess, and visualize open data in file formats such as CSV/Excel, JSON, and XML.
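As a rough sketch of the reading step for three of these formats, the following uses only the Python standard library. The records are made up and embedded as strings so that the example runs without downloading anything; the article itself may use other packages.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Tiny made-up records, embedded as strings so the sketch runs without files.
CSV_TEXT = "name,value\na,1\nb,2.5\n"
JSON_TEXT = '[{"name": "a", "value": 1}, {"name": "b", "value": 2.5}]'
XML_TEXT = "<items><item name='a' value='1'/><item name='b' value='2.5'/></items>"


def read_csv(text):
    """Parse CSV text with a header row into (name, value) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["name"], float(row["value"])) for row in reader]


def read_json(text):
    """Parse a JSON array of objects into (name, value) pairs."""
    return [(item["name"], float(item["value"])) for item in json.loads(text)]


def read_xml(text):
    """Parse XML whose <item> elements carry name and value attributes."""
    root = ET.fromstring(text)
    return [(node.get("name"), float(node.get("value"))) for node in root.iter("item")]


print(read_csv(CSV_TEXT))
print(read_json(JSON_TEXT))
print(read_xml(XML_TEXT))
```

All three readers return the same list of pairs, which is the point: once the data is in a common in-memory shape, the later processing and visualization steps no longer care which file format it came from.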
What do you do if nobody has organized an open data set for you to download, and the site only provides an API?
In “How to Get Web Data for Free with R and API?”, we use R to call the Wikipedia API, retrieve the visit counts for a given entry, and visualize them.
If nobody has organized the data for you and there is no open API either, you have to reach straight for the sledgehammer and crawl the site.
The requests_html package is user-friendly and easy to use for crawling web pages. You can try crawling specific types of links in a web page.
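To illustrate what "crawling specific types of links" means, here is a sketch that uses the standard library's `html.parser` instead of requests_html, so it runs without installing anything or touching the network. The HTML snippet, the example.com URLs, and the helper names are placeholders.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, prefix=""):
    """Return the links in the page that start with the given prefix."""
    parser = LinkCollector()
    parser.feed(html)
    return [link for link in parser.links if link.startswith(prefix)]


# Placeholder page content; a real crawler would download this first.
PAGE = """
<html><body>
  <a href="https://example.com/p/001">Article one</a>
  <a href="https://example.com/p/002">Article two</a>
  <a href="https://example.com/about">About</a>
  <a>No link here</a>
</body></html>
"""
print(extract_links(PAGE, prefix="https://example.com/p/"))
```

Filtering by prefix is a simple way to keep only one type of link, for example article pages, and to skip navigation links.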
Literature data may be stored in a variety of formats, but PDF is the most common.
In response to a number of reader requests, I wrote “How to Batch Extract PDF Text Content with Python?”.
With it, you can extract the text of PDF documents in bulk and run various analyses on it.
The analysis in that article is relatively simple: we only count the number of characters in each document.
But use your imagination, and you may be able to produce very valuable results.
Hopefully, these articles will help you efficiently obtain good data to support your own machine learning models.
Summary
This article sorts out and links the data science articles currently in Yushu Zhilan, to help you see the logical dependencies between them.
The data science articles in the column mainly cover the following aspects:
- Environment building;
- Basic introduction;
- Natural language processing;
- Machine learning;
- Deep learning;
- Data acquisition;
- Answering questions.
As you may have noticed, we have a lot more to cover.
Don’t worry.
This column will keep adding new content, and this guide will be updated from time to time. You are welcome to follow along.
Discussion
Which part of this column’s data science series do you like best? What else would you like to read? Please leave a comment to share your experience and thoughts, and let us discuss together.
If you liked this article, please give it a thumbs up. You can also follow and pin my official account “Nkwangshuyi” on WeChat.
If you are interested in data science, check out my tutorial index post, “How to Get Started in Data Science Effectively?”, where you will find more interesting problems and solutions.