Report by the Heart of the Machine editorial department.

Datasets are essential if you want to train a content moderation system to filter out inappropriate material, or to try bold new ideas with GANs. For example, to distinguish normal images from restricted ones, we need to train a classifier with a convolutional neural network, but R-rated images are hard to collect and rarely open-sourced. So a developer recently open-sourced an NSFW image dataset on GitHub. Is it what you are looking for?

Content moderation plays an important role in many fields. It not only needs to identify images or other data unsuitable for display with a classifier, but can also process restricted images with a semantic segmentation model (masking out sensitive regions), removing the inappropriate information without much impact on the rest of the content. The project, built by developer alexkimxyz, collects more than 200,000 sensitive images and lists them as URLs on GitHub.

Project address: github.com/alexkimxyz/…

These 200,000+ images are roughly divided into five categories (drawings, hentai, neutral, porn, and sexy), on which different classifiers can be trained with a CNN. Here we keep the original descriptions from GitHub:

Each category is a text file in which every line is a URL, so reading and downloading the images takes only a few lines of code. Below is a brief look at the text file and images under the sexy category:
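Since each category file is just one URL per line, a few lines of Python are indeed enough to parse it and fetch the images. Here is a minimal sketch; the helper names are our own, not from the repo, and real usage would want retries and rate limiting:

```python
import os
import urllib.request

def read_urls(url_file):
    """Read one URL per line, skipping blank lines."""
    with open(url_file) as f:
        return [line.strip() for line in f if line.strip()]

def filename_from_url(url):
    """Derive a local file name from the last path segment of the URL."""
    return url.rsplit("/", 1)[-1]

def download_category(url_file, out_dir):
    """Download every image listed in url_file into out_dir, skipping failures."""
    os.makedirs(out_dir, exist_ok=True)
    for url in read_urls(url_file):
        try:
            urllib.request.urlretrieve(
                url, os.path.join(out_dir, filename_from_url(url)))
        except OSError as err:  # URLError/HTTPError are OSError subclasses
            print(f"skipped {url}: {err}")
```

For instance, `download_category("../raw_data/sexy/urls_sexy.txt", "images/sexy")` would fetch the sexy category.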

It is also worth noting that a small number of the image URLs are no longer valid, so this case needs to be handled during processing. Typically, an invalid URL returns a 161×81 placeholder image.
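Downloaded files can therefore be screened afterwards by their dimensions. A minimal check (the function name is ours; in practice you would read the size with, e.g., Pillow's `Image.open(path).size`):

```python
# Dimensions of the placeholder banner that dead URLs resolve to,
# as noted in the project description.
PLACEHOLDER_SIZE = (161, 81)

def is_dead_link_image(width, height):
    """Return True if an image is exactly the 161x81 placeholder,
    so it can be filtered out of the dataset after downloading."""
    return (width, height) == PLACEHOLDER_SIZE
```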

Of course, the author also provides scripts to fetch the URLs and download the images, so we only need to run them. Currently these scripts have been tested only on the Ubuntu 16.04 Linux distribution.

Here are the important scripts (in the scripts directory) and what they do:

  • 1_get_urls.sh: Traverses the text files under scripts/source_urls and downloads the image URLs for each of the five categories above. The Ripme application does all the heavy lifting. The source URLs are mainly links to various subreddits, but can point to any site Ripme supports. Note: the author has already run this script and its output is in the raw_data directory, so there is no need to rerun it unless you edit the files under scripts/source_urls.

  • 2_download_from_urls.sh: Downloads the actual images from the URLs found in the text files in the raw_data directory.

  • 5_create_train.sh: Creates the data/train directory, copies all *.jpg and *.jpeg files from raw_data into it, and deletes corrupted images.

  • 6_create_test.sh: Creates the data/test directory and moves N = 2000 randomly chosen files per class from data/train to data/test (change this number inside the script if you want a different train/test split). Alternatively, you can run it multiple times; each run moves N images per category from data/train to data/test.

Note that running 1_get_urls.sh overwrites the existing URL text files under raw_data. So after cloning the GitHub project, we can also run 2_download_from_urls.sh directly to download images from the existing raw_data files.
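For readers who prefer Python over bash, the test-split step can be sketched roughly as follows. This is a hypothetical re-implementation of what 6_create_test.sh does, not the actual script:

```python
import os
import random
import shutil

def create_test_split(train_dir, test_dir, n_per_class=2000, seed=None):
    """Move up to n_per_class randomly chosen *.jpg / *.jpeg files
    from train_dir/<class> to test_dir/<class> for every class."""
    rng = random.Random(seed)
    for cls in sorted(os.listdir(train_dir)):
        src = os.path.join(train_dir, cls)
        if not os.path.isdir(src):
            continue
        dst = os.path.join(test_dir, cls)
        os.makedirs(dst, exist_ok=True)
        images = [f for f in os.listdir(src)
                  if f.lower().endswith((".jpg", ".jpeg"))]
        # Sample without replacement, capped at the number of files available.
        for name in rng.sample(images, min(n_per_class, len(images))):
            shutil.move(os.path.join(src, name), os.path.join(dst, name))
```

Calling it again moves another batch per class, mirroring the "run it multiple times" behaviour described above.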

Environment configuration

  • Python3 environment: conda env create -f environment.yml

  • Java runtime environment (needed by Ripme): sudo apt-get install default-jre

  • Linux command-line tools: wget, convert (Imagemagick tool suite), rsync, shuf

How to run

Change the working directory to scripts and execute each script in the order indicated by the number in its file name, for example:

$ bash 1_get_urls.sh # has already been run
$ find ../raw_data -name "urls_*.txt" -exec sh -c "echo Number of urls in {}: ; cat {} | wc -l" \;
Number of urls in ../raw_data/drawings/urls_drawings.txt:
   25732
Number of urls in ../raw_data/hentai/urls_hentai.txt:
   45228
Number of urls in ../raw_data/neutral/urls_neutral.txt:
   20960
Number of urls in ../raw_data/sexy/urls_sexy.txt:
   19554
Number of urls in ../raw_data/porn/urls_porn.txt:
  116521
$ bash 2_download_from_urls.sh
$ bash 3_optional_download_drawings.sh # optional
$ bash 4_optional_download_neutral.sh # optional
$ bash 5_create_train.sh
$ bash 6_create_test.sh
$ cd ../data
$ ls train
drawings hentai neutral porn sexy
$ ls test
drawings hentai neutral porn sexy

The execution of the scripts is shown above; there are 227,995 sensitive images across the five categories in total. The scripts also split them into training and test sets, so they can be used directly for the five-way classification task. Of course, if we need the data for other tasks, the split can simply be skipped.

Training a simple convolutional neural network directly on this classification task achieves 91% accuracy, which is quite high given the ambiguity inherent in manually sorting sensitive data into five categories. The confusion matrix of the five-way classification task on the test set is shown below:

The diagonal entries are the numbers of correctly predicted samples; the off-diagonal entries are misclassifications. The results show that the five categories are, at minimum, discriminable. Whether we build a binary classifier for normal versus sensitive content, or use a GAN to build some novel model, the categories provide very discriminative features.
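To illustrate how accuracy is read off such a confusion matrix, consider the toy example below. The numbers are made up for illustration and are not the project's actual results:

```python
import numpy as np

classes = ["drawings", "hentai", "neutral", "porn", "sexy"]
# Rows are true classes, columns are predicted classes (illustrative values).
cm = np.array([
    [90,  5,  3,  1,  1],
    [ 4, 85,  2,  7,  2],
    [ 2,  1, 95,  1,  1],
    [ 1,  6,  1, 88,  4],
    [ 1,  2,  3, 14, 80],
])

accuracy = np.trace(cm) / cm.sum()     # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)  # per-class recall
print(f"accuracy = {accuracy:.3f}")    # 0.876 for this toy matrix
```

Large off-diagonal entries (here, sexy predicted as porn) point at exactly the kind of labeling ambiguity the article mentions.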

Finally, please use the dataset seriously and with respect, and for research only (and don't report it)…