The Massachusetts Institute of Technology (MIT) has permanently taken the Tiny Images Dataset offline due to allegations of racism and misogyny.

The Massachusetts Institute of Technology (MIT) has issued an apology, announcing that the Tiny Images Dataset has been permanently taken offline, calling on the community to stop using the dataset and delete any copies, and asking users not to provide the dataset to others.

In the past year, several well-known datasets published by companies and research institutions have been taken down or permanently retired. These include Microsoft’s MS Celeb 1M celebrity dataset, Duke University’s Duke MTMC dataset for pedestrian recognition, and Stanford University’s Brainwash dataset for human head detection.

The Tiny Images Dataset, now withdrawn, was created and released by MIT in 2006. As its name suggests, it is a dataset of tiny images.

It contains 79.3 million 32-by-32-pixel color images, mostly collected from Google Images.

The dataset is large: the image files, metadata, and descriptors are all stored as binary files, which are loaded with the accompanying MATLAB toolbox and index data files.

At nearly 400 GB, it is one of the most popular datasets in the field of computer vision.
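To make the storage scheme concrete, here is a minimal Python sketch (rather than the official MATLAB toolbox) of pulling one image out of such a binary file. The file name and the assumption that each image occupies 3,072 consecutive uint8 bytes in column-major order are illustrative only.

```python
import numpy as np

# Hypothetical file name; the real dataset ships a large binary image file
# plus separate metadata and index files loaded by the MATLAB toolbox.
BIN_PATH = "tiny_images.bin"
IMG_BYTES = 32 * 32 * 3  # one 32x32 RGB image stored as 3,072 uint8 bytes

def read_tiny_image(index: int) -> np.ndarray:
    """Read a single 32x32 color image by index from the binary file."""
    with open(BIN_PATH, "rb") as f:
        f.seek(index * IMG_BYTES)  # jump to this image's byte offset
        buf = f.read(IMG_BYTES)
    # Column-major reshape to (32, 32, 3), mirroring MATLAB-style storage;
    # the exact channel ordering is documented by the official toolbox.
    return np.frombuffer(buf, dtype=np.uint8).reshape(32, 32, 3, order="F")
```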

The accompanying paper, 80 Million Tiny Images: A Large Dataset for Non-Parametric Object and Scene Recognition, has been cited 1,718 times.

A paper that triggered self-examination over large datasets

The Tiny Images Dataset came under fire in a recent paper, Large Image Datasets: A Pyrrhic Win for Computer Vision?

The paper strongly questions whether these large datasets comply with ethical and privacy norms.

Paper address: https://arxiv.org/pdf/2006.16923.pdf

The paper has two authors. One is Vinay Prabhu, chief scientist at UnifyID, a Silicon Valley artificial intelligence startup that provides user-authentication solutions to its customers.

The other is Abeba Birhane, a PhD candidate at University College Dublin.

Taking the ImageNet-ILSVRC-2012 dataset as an example, the authors found that it contained a small number of covertly taken images (such as photos of people on the beach, some even exposing private parts), and argued that, because of lax auditing, these images seriously violate the privacy of the people depicted.

A once-classic dataset, now politically incorrect

Unlike ImageNet, which was faulted mainly for privacy violations, the Tiny Images Dataset was criticized because it contained tens of thousands of racist and misogynistic images.

The paper also pointed out that the problems of discrimination and privacy invasion in the Tiny Images Dataset were aggravated by the absence of any auditing.

A selection of images from the Tiny Images Dataset

The Tiny Images Dataset is labeled according to WordNet, dividing its nearly 80 million images into roughly 75,000 categories.

It is precisely some of these WordNet-derived labels that have brought the dataset into question.

WordNet takes the blame, and image datasets share it

Since its creation in 1985, WordNet has been the most standardized and comprehensive lexical database of English in the English-speaking world.

Standardized and comprehensive means objectively collecting the English words that exist in human society and recording their meanings and the relationships between them.

In the Tiny Images Dataset, 53,464 different nouns from WordNet are used as image labels.
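As a rough illustration of where such a label vocabulary comes from, the Python sketch below uses NLTK's WordNet interface to enumerate noun lemmas. It does not reproduce the exact 53,464-noun subset used by Tiny Images; it only shows how WordNet nouns can be collected as candidate labels.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # fetch the WordNet corpus if missing

# Collect every noun lemma in WordNet as a candidate image label; the
# Tiny Images Dataset reportedly drew its 53,464 labels from this pool.
noun_labels = sorted({
    lemma.name().replace("_", " ")
    for synset in wn.all_synsets(pos=wn.NOUN)
    for lemma in synset.lemmas()
})

print(len(noun_labels), "candidate noun labels")
print(noun_labels[:10])
```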

Statistics of race- and gender-related sensitive terms in the dataset

Because of this, directly importing expressions that exist in human society inevitably introduces some racist and sexist words.

For example, explicitly insulting or derogatory terms such as Bi*ch, Wh*re, and Ni*ger all became labels attached to images, along with judgmental terms such as molester and pedophile.

Social impact needs to be weighed before scientific research

The authors argue that large image datasets, many of which are constructed without careful consideration of their social impact, pose a threat to and cause harm to individual rights.

Now that such information is openly available, anyone can use an open API to run a query that identifies or profiles a person in ImageNet or any other dataset, which is a real danger and a real violation for the person involved.

The authors also propose three directions for addressing the problem: first, synthetic imagery and dataset distillation, such as using (or augmenting with) synthetic images instead of real images during model training; second, ethics-based dataset filtering; and third, quantitative dataset auditing. The authors conduct a cross-category quantitative analysis of ImageNet to assess the degree of ethical violations and to gauge the feasibility of model-annotation-based approaches.
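As a rough sketch of the second direction (ethics-based filtering), the hypothetical Python snippet below drops samples whose labels appear on a blocklist. The blocklist contents and the (path, label) sample format are assumptions for illustration, not the authors' actual method.

```python
# Hypothetical blocklist of flagged WordNet-derived labels; a real audit
# would derive this from a curated lexicon, not a hard-coded set.
FLAGGED_LABELS = {"molester", "pedophile"}  # explicit slurs omitted here

def ethically_filter(samples):
    """Drop (image_path, label) pairs whose label is on the blocklist."""
    return [(path, label) for path, label in samples
            if label.lower() not in FLAGGED_LABELS]

# Toy usage example:
dataset = [("img_001.png", "beach"), ("img_002.png", "molester")]
print(ethically_filter(dataset))  # -> [('img_001.png', 'beach')]
```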

Datasets taken down: voluntarily or under external pressure

The Tiny Images Dataset is not the first dataset to be withdrawn because of public pressure or the publisher's own initiative. Back in mid-2019, Microsoft took down the famous MS Celeb 1M dataset and announced that it would no longer be used.

The MS Celeb 1M dataset was built by identifying 1 million celebrities on the Internet, selecting 100,000 of them by popularity, and then using search engines to collect about 100 images of each person.

MS Celeb 1M dataset

MS Celeb 1M is often used to train facial recognition systems. The dataset was originally built for the MSR IRC competition, one of the highest-level image recognition competitions in the world, and was also used by companies including IBM, Panasonic, Alibaba, Nvidia, and Hitachi.

One researcher pointed out that there are questions about the ethics, origins, and privacy of this facial recognition image dataset, because the images come from the Web. Although Microsoft says they were collected under a Creative Commons (CC) license, it is the copyright owners, not necessarily the people in the photos, who granted permission.

Under that license, the photos may be used for academic research, but Microsoft did not effectively monitor how the dataset was used once it was published.

In addition to the MS Celeb 1M dataset, the Duke MTMC surveillance dataset for pedestrian recognition and the Stanford Brainwash dataset for human head detection have also been taken down.

Download the other datasets as soon as possible; they may be taken down tomorrow

Recently, the “Black Lives Matter” movement for racial equality has put Europe and the United States on edge, and the computing and engineering communities have been debating, disputing, and reflecting nonstop.

Initially, companies and projects such as GitHub and Go began to revise their naming conventions, for example avoiding the words “blacklist” and “whitelist” in favor of the neutral “blocklist” and “allowlist,” or changing the default branch name from “master” to “trunk.”

Deep learning pioneer Yann LeCun quit Twitter after being accused of making racist and sexist comments.

Now political correctness may be turning toward large datasets.

Admittedly, many aspects of a large dataset's design are not fully thought through at the outset. Even so, in the current situation, simply taking down the datasets involved is not the best solution to bias.

After all, these images do not exist only in these datasets, and these biases are not confined to the words in WordNet.

Even if a dataset is taken down, the images are still all over the Internet; even if WordNet falls out of use, the words are still in people's minds. If we want to address AI bias, we need to confront long-standing social and cultural biases.

LeCun: Just a few tweets and I'm out. (shrug)

Download address: https://hyper.ai/datasets/5361

Note: This data set is subject to compliance disputes, so use it with caution.

- End -