0x00 Preface

Data is king: with the same machine learning algorithm, training on data of different quality produces models that perform differently. This article introduces some of the classic open source datasets in data science.

The text is divided into three parts:

  1. Some of the most commonly used classical data sets are described in detail
  2. How to use Python to gracefully view data sets
  3. Access to other open source datasets

0x01 Classic data set

1. Overview

The table below lists some of the most commonly used datasets, compiled by Layman, that can be used throughout machine learning. They also appear frequently in the official examples for sklearn, Spark ML, and TensorFlow.

Data set name | Description | Records | Use | Download address
--- | --- | --- | --- | ---
Iris | Iris flower data set | 150 | Classification and clustering | Archive.ics.uci.edu/ml/datasets…
Adult | US census data | 48842 | Classification and clustering | Archive.ics.uci.edu/ml/datasets…
Wine | Wine data | 178 | Classification and clustering | Archive.ics.uci.edu/ml/datasets…
20 Newsgroups | News data set | 19997 | Text classification and clustering | Qwone.com/~jason/20Ne…
MovieLens | Movie ratings data set | 26000000 | Recommendation system | Grouplens.org/datasets/mo…
MNIST | Handwriting recognition data set | 70000 | Handwriting recognition | yann.lecun.com/exdb/mnist/

2. Iris

This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Iris, also known as the Iris flower data set, is a data set for multivariate analysis. Introduced by the distinguished statistician R. A. Fisher in the mid-1930s, it is one of the best-known data sets in data mining. It contains three plant species (Iris Setosa, Iris Versicolor and Iris Virginica) with 50 samples each. Each sample has four attributes, measured in cm: sepal length, sepal width, petal length, and petal width.
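
The separability claim in the quoted description can be checked with a short sketch. It assumes the copy of Iris bundled with sklearn (`load_iris`), and the 2.5 cm cut-off on petal length is an illustrative value chosen here, not something stated in the original description.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Column 2 holds petal length (cm); label 0 is setosa.
petal_length = X[:, 2]
THRESHOLD_CM = 2.5  # illustrative cut-off assumed for this sketch

# A single threshold on one feature is the simplest linear separator.
setosa_below = bool((petal_length[y == 0] < THRESHOLD_CM).all())
others_above = bool((petal_length[y != 0] > THRESHOLD_CM).all())

print("classes:", ", ".join(iris.target_names))
print("all setosa below threshold:", setosa_below)
print("all other samples above threshold:", others_above)
```

If both checks print True, one threshold on a single feature is enough to split setosa from the other two species, which is a special case of linear separability.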

3. Adult

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

This data is extracted from the 1994 US census database and can be used to predict whether a person's income exceeds $50K a year. The class variable is whether annual income exceeds $50K, and the attribute variables cover information such as age, job type, education, occupation, and race. It is worth mentioning that 8 of the 14 attribute variables are categorical; the other 6 are continuous.
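
The extraction condition quoted above translates directly into a pandas boolean mask. The sketch below is only an illustration: the four rows are hypothetical values made up for this example, not real census records, and the column names are taken from the quoted condition.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; NOT real census records.
records = pd.DataFrame({
    "AAGE": [39, 15, 52, 28],               # age
    "AGI": [50000, 20000, 0, 32000],        # adjusted gross income
    "AFNLWGT": [77516, 83311, 209642, 0],   # final sampling weight
    "HRSWK": [40, 13, 45, 40],              # hours worked per week
})

# (AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)
# The C-style && becomes the element-wise & in pandas.
mask = (
    (records["AAGE"] > 16)
    & (records["AGI"] > 100)
    & (records["AFNLWGT"] > 1)
    & (records["HRSWK"] > 0)
)
clean = records[mask]
print(clean)
```

Each comparison needs its own parentheses because `&` binds more tightly than `>` in Python.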

4. Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.

This dataset contains a total of 178 records for wines from three different cultivars grown in the same region. The 13 attributes are 13 chemical constituents of the wine, so the cultivar of a wine can be inferred from its chemical analysis. It is worth noting that all attribute variables are continuous.
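
A quick way to confirm these figures is sklearn's bundled copy of the data. This is a minimal sketch that assumes `load_wine` corresponds to the UCI data set described here; the column name `cultivar` is a label introduced for this example.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["cultivar"] = wine.target  # integer class labels 0, 1, 2

# 178 rows; 13 chemical attributes plus the label column added above.
print(wine_df.shape)

# Number of records per class.
print(wine_df["cultivar"].value_counts().sort_index())

# Every attribute is stored as a floating-point (continuous) value.
print(wine_df[wine.feature_names].dtypes.unique())
```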

5. 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

The data set contains about 20,000 newsgroup documents, spread almost evenly across 20 different newsgroups. It is a classic benchmark for text applications of machine learning, such as text classification and text clustering.

6. MovieLens

The MovieLens data set is a data set of movie ratings. It records how users rate movies, with the movies linked to IMDb and The Movie Database. This data set can be used in recommendation systems.

7. MNIST

The MNIST data set is a handwriting recognition data set widely used in machine learning. It contains 60,000 training samples and 10,000 test samples, and each sample image is 28*28 pixels. The images have been size-normalized to a fixed size, so most of the pre-processing work is already done. Many mainstream machine learning tools (including sklearn) use this data set in their introductory examples.
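
The sizes above determine the array shapes a model sees. The sketch below uses a placeholder array of zeros standing in for one digit image (it does not load real MNIST data) to show how a 28*28 image becomes a flat feature vector.

```python
import numpy as np

TRAIN_SIZE, TEST_SIZE = 60_000, 10_000
HEIGHT, WIDTH = 28, 28

# Placeholder image of zeros; real MNIST pixels are grayscale values 0-255.
image = np.zeros((HEIGHT, WIDTH), dtype=np.uint8)

# Many classifiers expect one flat feature vector per sample.
features = image.reshape(-1)

print("total samples:", TRAIN_SIZE + TEST_SIZE)
print("features per image:", features.shape[0])
```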

0x02 Data exploration

The best way to understand the details of the data is not to look at the documentation, but to look at the distribution and characteristics of the data yourself.

Understand the data

To obtain the Iris data set, we use the loader API provided by sklearn instead of downloading the data ourselves.

1. Data acquisition and description

import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled Iris data and wrap the feature matrix in a DataFrame
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.info()

# Output of df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm) 150 non-null float64
sepal width (cm) 150 non-null float64
petal length (cm) 150 non-null float64
petal width (cm) 150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB

2. Data examples

df.head()

num | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
--- | --- | --- | --- | ---
0 | 5.1 | 3.5 | 1.4 | 0.2
1 | 4.9 | 3.0 | 1.4 | 0.2
2 | 4.7 | 3.2 | 1.3 | 0.2
3 | 4.6 | 3.1 | 1.5 | 0.2
4 | 5.0 | 3.6 | 1.4 | 0.2

3. Data description

You can use describe() to summarize each column of a data set: the number of values, the mean, the standard deviation, the minimum, the quartiles, and the maximum.

df.describe()

type | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
--- | --- | --- | --- | ---
count | 150.000000 | 150.000000 | 150.000000 | 150.000000
mean | 5.843333 | 3.054000 | 3.758667 | 1.198667
std | 0.828066 | 0.433594 | 1.764420 | 0.763161
min | 4.300000 | 2.000000 | 1.000000 | 0.100000
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000
max | 7.900000 | 4.400000 | 6.900000 | 2.500000
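
The summary above pools all three species together. One possible next step, sketched here under the assumption that the labels from `load_iris` are attached as an extra column (the name `species` is chosen for this example), is to compare the classes with groupby.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Map the integer labels (0, 1, 2) to species names.
df["species"] = data.target_names[data.target]

# Number of samples per species.
counts = df.groupby("species").size()
print(counts)

# Mean of each measurement per species.
means = df.groupby("species").mean()
print(means)
```

Exact values can differ slightly between sklearn versions, because the bundled copy of the data has been corrected over time.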

This is just a quick example, but if you want to go deeper, you can check out the API details on the official website.

0x03 Other

1. UCI data set

The UCI repository hosts more than 400 data sets for supervised and unsupervised learning. Many of them, such as Iris, Wine, Adult, Car Evaluation, and Forest Fires, are used repeatedly as examples in other data tools.

Address: archive.ics.uci.edu/ml/

2. sklearn datasets

sklearn ships with many built-in datasets. The datasets.load_iris() call used earlier loads one of them.
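
As a rough illustration, the sketch below calls a few of the loaders in sklearn.datasets and prints the size of each data set. The selection of loaders is arbitrary; sklearn provides more than these four.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris, load_wine

# A small selection of the bundled loaders; none of them needs a download.
loaders = {
    "iris": load_iris,
    "wine": load_wine,
    "digits": load_digits,
    "breast_cancer": load_breast_cancer,
}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()  # Bunch object with .data, .target and metadata
    shapes[name] = bunch.data.shape
    print(f"{name}: {bunch.data.shape[0]} samples, {bunch.data.shape[1]} features")
```

Note that digits is a small 8*8 digit data set bundled with sklearn, not the 28*28 MNIST data described earlier.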

Address: scikit-learn.org/stable/modu…
