Recommended: ApacheCN’s open source machine learning roadmap:

Github.com/apachecn/Ai…

Note: to open the links in this article, go to the URL directly or click “Read the original”.

The roadmap

Follow the steps 1 => 2 => 3 and you can become an expert!

1. Machine learning – Basics

  • Machine Learning in Action | ApacheCN (Apache Chinese community site)

  • Machine Learning in Action, Chinese edition with table of contents (PDF)

  • Thanks to Feilong for generating the e-book “Machine Learning Practice-Apachecn.pdf”

  • The videos have been updated. If you find them valuable, please star the repository. [Follow-up study activities: Sklearn, Kaggle, PyTorch, and TensorFlow]

  • Video sites: Youku / Bilibili / AcFun / NetEase Cloud Classroom; all can be played online directly. (Links below)

  • Red Stone: notes on the machine learning course by Lin Xuantian, National Taiwan University

  • A recommended set of machine learning notes:

    feisky.xyz/machine-lea…

    Machine learning Chapter 1: Fundamentals of machine learning
    Machine learning Chapter 2: k-nearest neighbors (KNN)
    Machine learning Chapter 3: Decision trees
    Machine learning Chapter 4: Naive Bayes
    Machine learning Chapter 5: Logistic regression
    Machine learning Chapter 6: Support vector machines (SVM)
    Online compiled content Chapter 7: Ensemble methods (random forest and AdaBoost)
    Machine learning Chapter 8: Regression
    Machine learning Chapter 9: Tree regression
    Machine learning Chapter 10: K-means clustering
    Machine learning Chapter 11: Association analysis with the Apriori algorithm
    Machine learning Chapter 12: Finding frequent itemsets efficiently with FP-growth
    Machine learning Chapter 13: Using PCA to simplify data
    Machine learning Chapter 14: Using SVD to simplify data
    Machine learning Chapter 15: Big data and MapReduce
    Hands-on ML project Chapter 16: Recommender systems (migrated)
    First-session summary 2017-04-08: Summary of the first session

How to get into machine learning?

Which videos should you watch?

  1. Theory-focused learners: Andrew Ng’s videos are recommended (they are highly authoritative, no doubt about it).
  2. Strong coding ability: watch our “Machine Learning Practice – Teaching Edition”.
  3. Weak coding ability: we suggest our “Machine Learning Practice – Discussion Edition”, but for the theory, watch the theory section of the Teaching Edition. The Discussion Edition contains a lot of idle chatter, but it walks through the code line by line, so mix the two freely according to your own needs.

Khan Academy introductory courses

  • Khan Academy – NetEase Open Courses
  • Probability: Khan Academy (Probability)
  • Statistics: Khan Academy (Statistics)
  • Linear algebra: Khan Academy (Linear Algebra)

Machine Learning Video – ApacheCN Teaching Edition

  • AcFun
  • Bilibili
  • Youku
  • NetEase Cloud Classroom

Machine/Deep Learning Video by Andrew Ng

  • Machine learning: Andrew Ng’s Machine Learning
  • Deep learning: Neural Networks and Deep Learning

2. Deep learning – Basics

Required topics for deep learning

  1. Backpropagation: www.cnblogs.com/charlotte77…
  2. CNN principles: www.cnblogs.com/charlotte77…
  3. RNN principles: Blog.csdn.net/qq_39422642…
  4. LSTM: Blog.csdn.net/roslei/arti…

3. Natural language processing

The learning process: mixed feelings!

Since I started learning NLP, I have noticed some typical differences between China and other countries:

  1. Attitudes toward resources are completely opposite:
     1) Domestic: conferences seem to be held for fame and for showing off, with no real substance; everything is a token PPT presentation that is not aimed at practitioners.
     2) Abroad: as if to push NLP forward, people share all kinds of solid material and concrete implementations. (In particular: Python natural language processing.)
  2. Implementations of papers:
     1) Domestic: all kinds of lofty papers get implemented, yet I still have not seen a decent GitHub project! (Maybe I am not good at searching, so I have not found one.)
     2) Abroad: no examples here; I do not understand them!
  3. Open source frameworks:
     1) Abroad: TensorFlow/PyTorch.
     2) Domestic: well, I really cannot name one! Yet the boasting is no less than abroad! (Although many Chinese developers contribute to MXNet, it does not count as a domestic open source framework. The MXNet-based Chinese tutorial Dive into Deep Learning (zh.diveintodeeplearning.org) has been taught and recorded by Mu Li and Aston Zhang and released publicly: documentation + season 1 tutorial + videos.)

Every time I dig deeper I have to get around the firewall, and every time I dig deeper I have to use Google. People at home keep saying how impressive Harbin Institute of Technology, iFlytek, USTC, Baidu, and Alibaba are, yet the material still has to be found abroad! Sometimes it is really frustrating, and it is hard not to look down on our own domestic technical environment!

Of course, thanks go to the many bloggers in China, especially for their introductory demos and explanations of basic concepts. [My own depth is limited, so there is much I have not understood.]

  • Must-read materials for the introductory tutorial [competition links to be added]:

    Github.com/apachecn/Ai…

  • Natural Language Processing with Python, 2nd edition:

    usyiyi.github.io/nlp-py-2e-z…

  • Recommended: a fairly comprehensive NLP knowledge system compiled by Liuhuanyong:

    liuhuanyong.github.io

1. Application Scenarios (Baidu Open Course)

Part 1: Introduction

  • 1) Introduction to natural language processing

Part 2: Machine translation

  • 2) Machine translation

Part 3: Discourse analysis

  • 3.1) Discourse analysis: content overview
  • 3.2) Discourse analysis: content tags
  • 3.3) Discourse analysis: sentiment analysis
  • 3.4) Discourse analysis: automatic summarization

Part 4: UNIT, language understanding and interaction technology

  • 4) UNIT: language understanding and interaction technology

Application areas

Chinese word segmentation:

  • Build a DAG (directed acyclic graph) of all dictionary words found in the sentence
  • Search it with dynamic programming, combining the forward and backward directions (forward weighting, backward output), to obtain the maximum-probability path through the DAG
  • Train an HMM + Viterbi model on an SBME-tagged corpus to handle unknown (out-of-vocabulary) words
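The first two steps above can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular segmenter: the dictionary `FREQ` and its counts are made up, the function names are hypothetical, and the HMM + Viterbi step for unknown words is left out.

```python
import math

# Toy word-frequency dictionary; words and counts are invented for illustration.
FREQ = {"研究": 5, "研究生": 3, "生命": 4, "的": 8, "起源": 3,
        "研": 1, "究": 1, "生": 1, "命": 1, "起": 1, "源": 1}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # fall back to the single character
    return dag


def best_route(sentence, dag):
    """Dynamic programming from right to left over log-probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(sentence):
    """Read the best path from left to right and emit the words on it."""
    route = best_route(sentence, build_dag(sentence))
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("研究生命的起源"))
```

Scoring in log space avoids underflow when many small probabilities are multiplied along a long sentence.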

1. Text Classification

Text classification assigns labels to sentences or documents, as in e-mail spam classification and sentiment analysis.

Here are some great text classification datasets for beginners.

  1. Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents that appeared on Reuters in 1987, indexed by category. See also RCV1, RCV2, and TRC2.
  2. IMDB Movie Review Sentiment Classification (Stanford). A collection of movie reviews from the website imdb.com with their positive or negative sentiment.
  3. Newsgroup Movie Review Sentiment Classification (Cornell). A collection of movie reviews from the website imdb.com with their positive or negative sentiment.

For more information, see the post: Data Sets for Single-label Text Categorization.

Sentiment analysis

Competition Address:

www.kaggle.com/c/word2vec-…

  • Scheme 1 (0.86): word count + naive Bayes
  • Scheme 2 (0.94): LDA + classification model (KNN/decision tree/logistic regression/SVM/XGBoost/random forest)
    • a) Decision trees do not perform well; they are a poor fit for these continuous features
    • b) Parameter tuning settled on 200 topics, which preserves information relatively well (number of computed topics)
  • Scheme 3 (0.72): word2vec + CNN
    • To be honest: without a good machine, you cannot tune it to a good result (runs away)

Model performance is evaluated by AUC.
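To make Scheme 1 and the AUC metric concrete, here is a minimal sketch of a word-count (bag-of-words) multinomial naive Bayes classifier with add-one smoothing, plus a pairwise AUC. It is not the competition code: the four reviews in `DOCS` are made up, tokenization is a plain whitespace split, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up toy reviews: 1 = positive, 0 = negative.
DOCS = ["a great and moving film", "great acting great story",
        "boring plot and bad acting", "a bad boring film"]
LABELS = [1, 1, 0, 0]


def train_nb(docs, labels):
    """Count words per class (the 'word count' features)."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def log_scores(doc, model):
    """Log prior plus smoothed log likelihood for every class."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in doc.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores


def predict(doc, model):
    scores = log_scores(doc, model)
    return max(scores, key=scores.get)


def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


model = train_nb(DOCS, LABELS)
print(predict("great story", model), predict("boring and bad", model))
```

The pairwise AUC is quadratic in the number of examples, which is fine for a sketch; a rank-based formula would be the usual choice on a full dataset.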

2. Language Modeling

Language modeling involves developing a statistical model for predicting the next word in a sentence, or the next letter within a word, given what has come before. It is a prerequisite for tasks such as speech recognition and machine translation.
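As a minimal sketch of such a model, a bigram language model estimates the probability of the next word from counts of adjacent word pairs. The three-sentence corpus and the function names below are made up for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


def predict_next(counts, prev):
    """Most likely next word, or None if prev was never seen."""
    probs = next_word_probs(counts, prev)
    return max(probs, key=probs.get) if probs else None


# Made-up toy corpus.
corpus = ["the cat sat on the mat", "the cat ate the fish", "the dog sat on the rug"]
counts = train_bigram(corpus)
print(predict_next(counts, "the"), next_word_probs(counts, "sat"))
```

On a real corpus, unseen word pairs get zero probability under this estimate, so some form of smoothing or back-off is normally added.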

Here are some good beginner language modeling data sets.

  1. Project Gutenberg, a large collection of free books that can be retrieved in plain text for a variety of languages.
  2. There are more formal corpora that are well studied, for example: the Brown University Standard Corpus of Present-Day American English (a large sample of English words) and Google’s 1 Billion Word Corpus.

New word discovery

  • New word discovery for Chinese word segmentation
  • A Python 3 implementation that uses mutual information and left/right information entropy to discover new words for Chinese word segmentation
  • Github.com/zhanzecheng…
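The two signals named above can be sketched as follows. This is not the code of the linked project; it is a simplified illustration restricted to two-character candidates, with hypothetical function names and a made-up sample text. Pointwise mutual information measures how strongly the two characters stick together, and the smaller of the left and right neighbor entropies measures how freely the candidate combines with its context.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())


def score_candidates(text, min_count=2):
    """Return {candidate: (pmi, boundary_entropy)} for two-character strings."""
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(text) - 1):
        pair = text[i:i + 2]
        if i > 0:
            left[pair][text[i - 1]] += 1
        if i + 2 < len(text):
            right[pair][text[i + 2]] += 1
    n_chars, n_pairs = len(text), len(text) - 1
    scores = {}
    for pair, count in pairs.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_a, p_b = chars[pair[0]] / n_chars, chars[pair[1]] / n_chars
        pmi = math.log(p_pair / (p_a * p_b))
        scores[pair] = (pmi, min(entropy(left[pair]), entropy(right[pair])))
    return scores


# Made-up sample text (a tongue twister).
text = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
for cand, (pmi, ent) in score_candidates(text).items():
    print(cand, round(pmi, 3), round(ent, 3))
```

A candidate is usually accepted only when both scores clear a threshold: high mutual information alone also rewards fragments such as a word plus one fixed neighbor, which the boundary entropy filters out.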

Sentence similarity recognition

  • Project address: www.kaggle.com/c/quora-que…
  • Solution: word2vec + Bi-GRU

Text error correction

  • bigram + Levenshtein distance
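One plausible way to combine the two ingredients, shown here only as a sketch under stated assumptions: generate correction candidates from a vocabulary by Levenshtein (edit) distance, then rank them by how often they follow the previous word according to bigram counts. The vocabulary, the counts, and the `correct` helper are made up for illustration.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def correct(prev_word, word, vocab, bigram_counts, max_dist=2):
    """Replace an out-of-vocabulary word with the best-supported near neighbor."""
    if word in vocab:
        return word
    candidates = [w for w in vocab if levenshtein(word, w) <= max_dist]
    if not candidates:
        return word
    # Prefer higher bigram support, then smaller edit distance.
    return max(candidates,
               key=lambda w: (bigram_counts[(prev_word, w)], -levenshtein(word, w)))


# Made-up vocabulary and bigram counts.
vocab = {"machine", "learning", "leaning", "is", "fun"}
bigram_counts = Counter({("machine", "learning"): 5, ("the", "leaning"): 2})
print(correct("machine", "learnin", vocab, bigram_counts))
```

Scanning the whole vocabulary per word is quadratic in practice; real systems typically index candidates (for example by deletions) before ranking them.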

3. Image Captioning

Image captioning is the task of generating a text description for a given image.

Here are some good image captioning datasets for beginners.

  1. Common Objects in Context (COCO). A collection of over 120,000 images with descriptions.
  2. Flickr 8K. A collection of 8,000 described images taken from flickr.com.
  3. Flickr 30K. A collection of 30,000 described images taken from flickr.com.

For more, see the post: Exploring Image Captioning Datasets, 2016

4. Machine Translation

Machine translation is the task of translating text from one language to another.

Here are some good machine translation data sets for beginners.

  1. Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and French.
  2. European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs for a suite of European languages.

A large number of standard datasets are used for the annual machine translation challenges; see: Statistical Machine Translation

Machine translation

  • Encoder + decoder (with attention)
  • Reference example:
  • Pytorch.apachecn.org/cn/tutorial…

5. Question Answering

Question answering is a task in which a sentence or sample of text is provided, from which questions are asked and must be answered.

Here are some good question answering datasets for beginners.

  1. Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia articles.
  2. DeepMind Question Answering Corpus. Question answering about news articles from the Daily Mail.
  3. Amazon question/answer data. Question answering about Amazon products.

For more information, see the post: Datasets: How can I get a corpus of a question-answering website such as Quora, Yahoo Answers, or Stack Overflow for analyzing answer quality?

6. Speech Recognition

Speech recognition is the task of converting spoken audio into human-readable text.

Here are some great beginner speech recognition data sets.

  1. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its wide use. Spoken American English with associated transcriptions.
  2. VoxForge. A project to build an open source database for speech recognition.
  3. LibriSpeech ASR Corpus. A large collection of English audiobooks taken from LibriVox.

7. Automatic Document Summarization

Document summarization is the task of creating a short, meaningful description of a larger document.

Here are some good document summarization datasets for beginners.

  1. Legal Case Reports Data Set. A collection of 4,000 legal cases and their summaries.
  2. TIPSTER Text Summarization Evaluation Conference Corpus. A collection of nearly 200 documents and their summaries.
  3. AQUAINT Corpus of English News Text. Not free, but widely used. A corpus of news articles.

For more information: Document Understanding Conference (DUC) tasks; Where can I find a good dataset for text summarization?

Named entity recognition

  • Bi-LSTM + CRF

  • Reference example:

    Pytorch.apachecn.org/cn/tutorial…

  • Recommended reading on CRF:

    www.jianshu.com/p/55755fc64…

Text summarization

  • Extractive

  • word2vec + TextRank

  • Recommended reading on word2vec:

    www.zhihu.com/question/44…

  • Recommended reading on TextRank:

    Blog.csdn.net/BaiHuaXiu12…
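A minimal sketch of extractive summarization with TextRank follows. It builds a sentence graph, runs a PageRank-style iteration, and keeps the top-ranked sentences. One deliberate substitution: sentence similarity here is normalized word overlap rather than the cosine similarity of word2vec sentence vectors, so that the example runs without a trained embedding model. The sentences and function names are made up.

```python
import math


def similarity(s1, s2):
    """Word-overlap similarity; a stand-in for word2vec cosine similarity."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by a PageRank-style iteration over the similarity graph."""
    n = len(sentences)
    weights = [[similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=1):
    """Return the k best-scored sentences in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Made-up example document, already split into sentences.
sentences = ["the cat sat on the mat",
             "the cat chased the mouse",
             "the mouse hid under the mat",
             "stock prices fell sharply today"]
print(summarize(sentences, k=1))
```

To follow the word2vec + TextRank recipe more closely, `similarity` would be replaced by the cosine similarity between averaged word vectors of the two sentences; the ranking loop stays the same.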

Graph computing

  • Dataset: data/NLP/graph
  • Spark graphX Practice.pdf

Further reading

If you wish to go further, this section provides a list of additional datasets.

  1. Text datasets used in research on Wikipedia

  2. Datasets: What are the major text corpora used by computational linguists and natural language processing researchers?

  3. Stanford Statistical Natural Language Processing Corpora

  4. Alphabetical list of NLP datasets

  5. NLTK corpora

  6. Open data for deep learning on DL4J

  7. NLP datasets

  8. Open datasets from China:

    Bosonnlp.com/dev/resourc…

Original article:

Github.com/apachecn/Ai…
