0x00 Preface
Data is king: with the same machine learning algorithm, training on data of different quality produces models that perform differently. This article shares some of the classic open-source datasets in data science.
The article is divided into three parts:
- A detailed description of some of the most commonly used classic data sets
- How to explore a data set gracefully with Python
- Where to find other open-source datasets
0x01 Classic data set
First, an overview
The table below lists some of the most commonly used datasets, compiled by Layman, that can be used throughout machine learning. These datasets also appear frequently in the official examples for Sklearn, Spark ML, and TensorFlow.
Data set name | Data description | Data record number | Data use | Download address |
---|---|---|---|---|
Iris | Iris flower data set | 150 | Classification and clustering | Archive.ics.uci.edu/ml/datasets… |
Adult | US Census data | 48842 | Classification and clustering | Archive.ics.uci.edu/ml/datasets… |
Wine | Wine data | 178 | Classification and clustering | Archive.ics.uci.edu/ml/datasets… |
20 Newsgroups | News data set | 19997 | Text classification and clustering | Qwone.com/~jason/20Ne… |
MovieLens | A data set of movie ratings | 26000000 | Recommendation system | Grouplens.org/datasets/mo… |
MNIST | Handwriting recognition data set | 70000 | Handwriting recognition | yann.lecun.com/exdb/mnist/ |
Second, Iris
This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Iris, also known as the Iris flower data set, is a data set for multivariate analysis. Introduced by the distinguished statistician R. A. Fisher in the mid-1930s, it is widely regarded as one of the most famous data sets in data mining. It contains three plant species (Iris setosa, Iris versicolor and Iris virginica) with 50 samples each. Each sample has four attributes: sepal length, sepal width, petal length and petal width, all in cm.
Third, Adult
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
Prediction task is to determine whether a person makes over 50K a year.
This data is extracted from the 1994 US census database and can be used to predict whether a person's income exceeds $50K per year. The class variable is whether annual income exceeds $50K, and the attribute variables cover information such as age, job type, education background, occupation and race. It is worth mentioning that 8 of the 14 attribute variables are categorical; the other 6 are continuous.
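The extraction condition quoted above can be written as a pandas filter. This is only a minimal sketch: the column names are taken from the quoted condition, the comments on their meaning are my reading of the abbreviations, and the four rows are invented toy values, not real census records.

```python
import pandas as pd

# Hypothetical toy records: column names follow the quoted condition,
# values are made up for illustration only.
records = pd.DataFrame({
    "AAGE":    [15, 39, 52, 28],          # age (assumed meaning)
    "AGI":     [0, 40000, 50, 30000],     # adjusted gross income (assumed meaning)
    "AFNLWGT": [100, 77516, 83311, 2],    # final sampling weight (assumed meaning)
    "HRSWK":   [0, 40, 13, 0],            # hours worked per week (assumed meaning)
})

# Same logic as ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)),
# using pandas' element-wise & operator instead of &&.
mask = (
    (records["AAGE"] > 16)
    & (records["AGI"] > 100)
    & (records["AFNLWGT"] > 1)
    & (records["HRSWK"] > 0)
)
clean = records[mask]
print(clean)
```

In this toy table only the second row satisfies all four conditions, so it is the only one kept.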
Fourth, Wine
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.
This dataset contains a total of 178 records for wines made from three different cultivars. The 13 attributes are 13 chemical constituents of the wine, so the cultivar of a wine can be inferred from its chemical analysis. It is worth noting that all attribute variables are continuous.
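A quick way to check these numbers is the copy of the Wine data that ships with Sklearn. The sketch below assumes that load_wine in your installed version is the same 178-record data set described here; the variable names are my own.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Number of records and number of attributes.
print(wine_df.shape)

# Every attribute column is stored as a floating-point number.
print(wine_df.dtypes.unique())

# Number of records per class (one class per cultivar).
class_counts = pd.Series(wine.target).value_counts().sort_index()
print(class_counts)
```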
Fifth, 20 Newsgroups
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
The data set contains about 20,000 newsgroup documents spread nearly evenly across 20 different newsgroups. It is a classic data set for experiments in text applications of machine learning, such as text classification and text clustering.
Sixth, MovieLens
MovieLens is a data set of movie ratings. It records how users rated movies, and the movies are cross-referenced to IMDb and The Movie Database. This data set can be used in recommendation systems.
Seventh, MNIST
MNIST is a data set for handwritten digit recognition in machine learning. It contains 60,000 training samples and 10,000 test samples, and each sample image is 28×28 pixels. The images have already been size-normalized to a fixed size, so most of the pre-processing work is done. Many mainstream machine learning tools (including Sklearn) use this data set in their introductory examples.
0x02 Data exploration
The best way to understand the details of the data is not to look at the documentation, but to look at the distribution and characteristics of the data yourself.
Understand the data
To get the Iris dataset, we use the API provided by Sklearn instead of downloading the data ourselves.
1. Data acquisition and description
import pandas as pd
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.info()
# Output of df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm) 150 non-null float64
sepal width (cm) 150 non-null float64
petal length (cm) 150 non-null float64
petal width (cm) 150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB
2. Data examples
df.head()
num | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
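The DataFrame above holds only the four measurements. The object returned by load_iris also carries the class labels in its target and target_names fields, so a species column can be attached as in the following sketch (the column name species is my own choice):

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# data.target holds integer class codes (0, 1, 2);
# data.target_names maps those codes to species names.
df["species"] = pd.Categorical.from_codes(data.target, data.target_names)

# Number of samples per species.
species_counts = df["species"].value_counts()
print(species_counts)
```

Each of the three species should appear 50 times, matching the description of the data set.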
3. Data description
You can use describe() to summarize each column of a data set: the number of records, the mean, the standard deviation, the minimum and maximum, and the quartiles.
df.describe()
type | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
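The Iris description earlier says that one class is linearly separable from the other two. A per-species summary makes this visible. The sketch below is one possible check, using petal length only; the variable names are my own.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Turn the integer class codes into species names, one per row.
species = pd.Series(data.target_names[data.target], name="species")

# Minimum and maximum petal length for each species.
petal_range = df.groupby(species)["petal length (cm)"].agg(["min", "max"])
print(petal_range)

# If the longest setosa petal is shorter than the shortest petal of the
# other two species, a single threshold on petal length separates setosa.
setosa_max = petal_range.loc["setosa", "max"]
others_min = petal_range.drop(index="setosa")["min"].min()
print(setosa_max < others_min)
```

The ranges of the other two species overlap on this attribute, which is consistent with the statement that they are not linearly separable from each other.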
This is just a quick example, but if you want to go deeper, you can check out the API details on the official website.
0x03 Other
1. UCI data set
The UCI repository hosts more than 400 data sets for supervised and unsupervised learning. Many of them, such as Iris, Wine, Adult, Car Evaluation and Forest Fires, are used again and again in other data tools.
Address: archive.ics.uci.edu/ml/
2. Sklearn datasets
Sklearn has many built-in datasets. The load_iris() function used earlier, for example, loads one of them.
Address: scikit-learn.org/stable/modu…
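As a rough illustration, the sketch below loads a few of the built-in datasets and prints their sizes. The selection of four loaders is my own; none of them needs a download, but which loaders are available can vary with the Sklearn version.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_digits,
    load_iris,
    load_wine,
)

# A small, hand-picked selection of loaders bundled with Sklearn.
loaders = {
    "iris": load_iris,
    "wine": load_wine,
    "digits": load_digits,
    "breast_cancer": load_breast_cancer,
}

# Call each loader and record the shape of its feature matrix.
shapes = {name: loader().data.shape for name, loader in loaders.items()}

for name, (n_samples, n_features) in shapes.items():
    print(f"{name}: {n_samples} samples, {n_features} features")
```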