MLMemoirs: The 50 best public data sets for machine learning, compiled from GitHub, Forbes, and CMU

Written by MLMemoirs, Guo Yipu

MLMemoirs has compiled a list of the 50 best public data sets for machine learning, drawing on information from GitHub, Forbes, and CMU.

A few notes before the list:

First, what to look for in a data set

According to CMU, a usable data set has a few properties:

The data set should not be messy; otherwise you will spend a lot of time cleaning it.

The data set should not have too many rows or columns; otherwise it will be difficult to work with.

The cleaner the data, the better: cleaning a large data set can be very time-consuming.

There should be an interesting question that can be answered with the data.
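
The criteria above (not messy, not too many rows or columns) can be turned into a quick first check before committing to a data set. The following is a minimal sketch using only Python's standard library. The function name profile_csv, the thresholds, and the ramen-style sample columns are illustrative assumptions, not anything CMU or MLMemoirs specifies.

```python
import csv
import io


def profile_csv(text, max_rows=100_000, max_cols=100, max_missing=0.05):
    """Summarize the size and emptiness of a CSV data set given as a string.

    The default thresholds are arbitrary placeholders (assumptions);
    tune them to your own tolerance for size and missing data.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("empty file")
    header, records = rows[0], rows[1:]
    n_cols = len(header)

    # A cell counts as missing if it is blank or if the row is too
    # short to reach that column.
    missing = 0
    for record in records:
        padded = record + [""] * (n_cols - len(record))
        missing += sum(1 for cell in padded[:n_cols] if not cell.strip())

    total_cells = len(records) * n_cols
    missing_ratio = missing / total_cells if total_cells else 0.0
    return {
        "rows": len(records),
        "columns": n_cols,
        "missing_ratio": missing_ratio,
        "manageable_size": len(records) <= max_rows and n_cols <= max_cols,
        "looks_clean": missing_ratio <= max_missing,
    }


# Hypothetical three-row sample; real data would be read from a file.
sample = "brand,style,stars\nA,Cup,4.5\nB,,3.0\nC,Pack,\n"
report = profile_csv(sample)
print(report)
```

A check like this only catches blank cells and oversized tables. Inconsistent labels, duplicated rows, and wrong units still need a manual look.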

Where to find data sets

  • Kaggle: Anyone who enters competitions will know Kaggle, which hosts all sorts of interesting data sets, from ramen ratings to basketball statistics and even Seattle pet licenses. www.kaggle.com/
  • UCI Machine Learning Repository: One of the oldest sources of data sets and a good first stop when looking for interesting ones. The data sets are user-contributed and therefore vary in cleanliness, but the vast majority are clean and can be downloaded directly from the repository without registration. mlr.cs.umass.edu/ml/
  • VisualData: Computer vision data sets grouped by category and searchable. www.visualdata.io/

Here are the 50 data sets; thanks to some late additions, the total actually comes to more than 50.

Machine learning data sets

Images

  • LabelMe: A large data set of annotated images. labelme.csail.mit.edu/Release3.0/…
  • ImageNet: The well-known data set that Fei-Fei Li helped create; the competition of the same name has shaped the entire computer vision community. image-net.org/
  • LSUN: Scene understanding with many auxiliary tasks (room layout estimation, saliency prediction, etc.). lsun.cs.princeton.edu/2016/
  • MS COCO: Another well-known computer vision data set; the competition of the same name is dominated by Chinese teams every year. mscoco.org/
  • COIL-100: 100 different objects imaged at every angle of a 360-degree rotation. www1.cs.columbia.edu/CAVE/softwa…
  • Visual Genome: A very detailed visual knowledge base. visualgenome.org/
  • Google Open Images: A collection of 9 million image URLs under Creative Commons licenses, annotated with labels spanning more than 6,000 categories. research.googleblog.com/2016/09/int…
  • Labeled Faces in the Wild: 13,000 labeled face images for developing applications involving facial recognition. vis-www.cs.umass.edu/lfw/
  • Stanford Dogs Dataset: 20,580 images of dogs covering 120 different breeds. vision.stanford.edu/aditya86/Im…
  • Indoor Scene Recognition: Contains 67 indoor categories and 15,620 images. web.mit.edu/torralba/ww…

Sentiment analysis

  • Multi-domain Sentiment Analysis dataset: A slightly older data set of product reviews from Amazon. www.cs.jhu.edu/~mdredze/da…
  • IMDB Reviews: A data set for binary sentiment classification, also somewhat old and small, with 25,000 movie reviews for training and 25,000 for testing. ai.stanford.edu/~amaas/data…
  • Stanford Sentiment Treebank: A standard sentiment data set with sentiment annotations. nlp.stanford.edu/sentiment/c…
  • Sentiment140: A popular data set of 1.6 million tweets with the emoticons removed in advance. help.sentiment140.com/for-student…
  • Twitter US Airline Sentiment: Twitter data about US airlines from February 2015, classified as positive, negative, or neutral. www.kaggle.com/crowdflower…

Natural language processing

  • HotpotQA dataset: A question answering data set with natural, multi-hop questions and strong supervision for supporting facts, intended to enable more explainable question answering systems. hotpotqa.github.io/
  • Enron data set: Email data from Enron's senior management. www.cs.cmu.edu/~./enron/
  • Amazon Reviews: Approximately 35 million Amazon reviews spanning 18 years, including product and user information, ratings, and review text. snap.stanford.edu/data/web-Am…
  • Google Books Ngrams: A collection of n-grams (word sequences) from Google Books. aws.amazon.com/datasets/go…
  • Blogger Corpus: 681,288 blog posts collected from Blogger.com, each containing at least 200 commonly used English words. u.cs.biu.ac.il/~koppel/bio…
  • Wikipedia Links data: The full text of Wikipedia, containing nearly 1.9 billion words from more than 4 million articles. It can be searched by word, phrase, or part of a paragraph. code.google.com/p/wiki-link…
  • Gutenberg Ebook List: An annotated list of ebooks from Project Gutenberg. www.gutenberg.org/wiki/Gutenb…
  • Hansards Parliamentary Texts: 1.3 million pairs of texts from the records of the 36th Canadian Parliament. www.isi.edu/natural-lan…
  • Jeopardy: An archive of over 200,000 questions from the quiz show Jeopardy. www.reddit.com/r/datasets/…
  • SMS Spam Collection in English: A data set of 5,574 English SMS messages, labeled as spam or legitimate. www.dt.fee.unicamp.br/~tiago/smss…
  • Yelp Reviews: An open data set released by Yelp (a US counterpart to China's Dianping review site), containing over 5 million reviews. www.yelp.com/dataset
  • UCI's Spambase: A large spam email data set that is useful for spam filtering. archive.ics.uci.edu/ml/datasets…

Autonomous driving


  • Berkeley DeepDrive BDD100k: The largest autonomous driving data set to date, containing over 100,000 videos and more than 1,100 hours of driving across different times of day and weather conditions. The annotated images come from the New York and San Francisco areas. bdd-data.berkeley.edu/
  • Baidu Apolloscapes: Baidu's large data set defines 26 different kinds of objects, such as cars, bicycles, pedestrians, buildings, and street lamps. apolloscape.auto/
  • Comma.ai: Over 7 hours of highway driving, with details including the car's speed, acceleration, steering angle, and GPS coordinates. archive.org/details/com…
  • Oxford's Robotic Car: This data set comes from Oxford's robotic car, which drove the same route in Oxford, England, more than 100 times over the course of a year, capturing different combinations of weather, traffic, and pedestrians, as well as long-term changes such as construction and road works. robotcar-dataset.robots.ox.ac.uk/
  • Cityscapes Dataset: A large data set of urban street scenes recorded in 50 different cities. www.cityscapes-dataset.com/
  • CSSAD data set: Useful for perception and navigation in autonomous vehicles. However, the data set is heavily skewed toward roads in developed countries. aplicaciones.cimat.mx/Personal/jb…
  • KUL Belgian Traffic Signs Dataset: More than 10,000 annotations of thousands of physically distinct traffic signs in the Flanders region of Belgium. www.vision.ee.ethz.ch/~timofter/t…
  • MIT AgeLab: A sample of the more than 1,000 hours of multi-sensor driving data collected at AgeLab. lexfridman.com/automated-s…
  • LISA: Data sets from UC San Diego's Laboratory for Intelligent and Safe Automobiles, covering traffic signs, vehicle detection, traffic lights, and trajectory patterns. cvrr.ucsd.edu/LISA/datase…
  • Bosch Small Traffic Lights Dataset: A data set of small traffic lights for deep learning. hci.iwr.uni-heidelberg.de/node/6132
  • LaRA Traffic Light Recognition: A traffic light data set recorded in Paris. www.lara.prd.fr/benchmarks/…
  • WPI data sets: Data sets for traffic light, pedestrian, and lane detection. computing.wpi.edu/dataset.htm…

Clinical


  • MIMIC-III: A publicly available data set from the MIT Lab for Computational Physiology containing de-identified health data for approximately 40,000 intensive care patients, including demographics, vital signs, laboratory tests, medications, and more. mimic.physionet.org/

General data sets

In addition to data sets built specifically for machine learning, there are some general data sets that may be interesting.

Public government data sets

  • Data.gov: This site lets you download data from multiple US government agencies, covering everything from government budgets to school test scores. However, much of the data requires additional research before it can be used. www.data.gov/
  • The Food Environment Atlas: Data on how local food choices affect diet in the US. catalog.data.gov/dataset/foo…
  • School System Finances: A survey of the finances of school systems in the US. catalog.data.gov/dataset/Ann…
  • Chronic disease data: Chronic disease indicators by region across the US. catalog.data.gov/dataset/u-s…
  • National Center for Education Statistics: Data on educational institutions and education demographics, covering not only the US but also other parts of the world. nces.ed.gov/
  • UK Data Service: The UK's largest collection of social, economic, and population data. www.ukdataservice.ac.uk/
  • Data USA: A comprehensive visualization of US public data. datausa.io/
  • National Bureau of Statistics of China. www.stats.gov.cn/

Finance and Economics

  • Quandl: A good source of economic and financial data to help build models for predicting economic indicators or stock prices. www.quandl.com/
  • World Bank Open Data: Data sets covering global demographics, along with a large number of economic and development indicators. data.worldbank.org/
  • IMF Data: Data published by the International Monetary Fund on international finance, debt rates, foreign exchange reserves, commodity prices, and investments. www.imf.org/en/Data
  • Financial Times Market Data: The latest information on financial markets from around the world, including stock price indices, commodities and foreign exchange. markets.ft.com/data/
  • Google Trends: Data on Internet search behavior and trending news stories around the world. www.google.com/trends?q=go…
  • American Economic Association: US macroeconomic data. www.aeaweb.org/resources/d…


Original article: MLMemoirs, “The 50 Best Public Data Sets for Machine Learning”. medium.com/datadriveni…

Note: some of these sites may require a VPN to open.
