Next, I plan to study music recommendation systems. They need data to demonstrate both the algorithms and the engineering code, so this post summarizes the open-source music data sets available on the Internet.

Million Song Dataset

When it comes to music data sets, MSD (the Million Song Dataset) is the obvious first choice. It contains information about a million songs and is 280 GB in size. Because of the large volume of data, it is stored in the HDF5 (.h5) file format, and the project provides some code for reading these files.

Each song corresponds to one file, and the fields cover all aspects of the song, such as artist_mbid, artist_name, title, tempo, and so on; the full list is documented on the MSD website. The file path looks strange, and the Q&A explains that it is impractical to put all the files in one directory. Directories are therefore organized as follows: the third, fourth, and fifth characters of a song's Echo Nest track ID name the three directory levels under which its file is stored, for example MillionSong/data/A/D/H/TRADHRX12903CD3866.h5.
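The path rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the MSD project: the function name msd_track_path and the default root directory are assumptions made for the example.

```python
from pathlib import PurePosixPath


def msd_track_path(track_id: str, root: str = "MillionSong/data") -> str:
    """Build the relative .h5 path for an Echo Nest track ID.

    Hypothetical helper: the name and the default root are illustrative.
    """
    if len(track_id) < 5 or not track_id.startswith("TR"):
        raise ValueError(f"unexpected track ID: {track_id!r}")
    # Characters 3, 4 and 5 (counting from 1) name the three directory levels.
    level1, level2, level3 = track_id[2], track_id[3], track_id[4]
    return str(PurePosixPath(root, level1, level2, level3, f"{track_id}.h5"))


print(msd_track_path("TRADHRX12903CD3866"))
# MillionSong/data/A/D/H/TRADHRX12903CD3866.h5
```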

In addition, the community has contributed a number of supplementary data sets built on top of MSD, which make it easier to study MSD from various angles. They are all linked from the MSD home page.

The SecondHandSongs Dataset: Information about the cover versions of some songs and the performance values of each cover on the SecondHandSongs website.
The musiXmatch Dataset: Provides lyrics data for 77% of the songs in MSD in bag-of-words form.
The Last.fm Dataset: See below.
The Echo Nest Taste Profile Subset: A (user, song, play count) data set provided by The Echo Nest that can be joined to MSD, covering 1 million users and 48 million play records.
thisismyjam-to-MSD Mapping: This Is My Jam user data and its association to MSD.
Tagtraum Genre Annotations: Music genre annotations.
Top-MAGD Dataset: Music genre annotations.
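As a rough illustration of how (user, song, play count) records such as those in the Taste Profile Subset can be loaded, here is a minimal sketch. It assumes one tab-separated triplet per line; the function name load_triplets and the sample IDs are made up for the example and are not taken from the real file.

```python
import io
from collections import defaultdict


def load_triplets(lines):
    """Parse tab-separated (user, song, play_count) lines into nested dicts."""
    plays = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        user, song, count = line.split("\t")
        plays[user][song] = int(count)
    return dict(plays)


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    "u1\tSOAAA\t3\n"
    "u1\tSOBBB\t1\n"
    "u2\tSOAAA\t7\n"
)
plays = load_triplets(sample)
total_plays = sum(sum(songs.values()) for songs in plays.values())
print(len(plays), total_plays)  # 2 11
```

A nested dictionary keeps the sketch readable; for the full 48 million records a sparse matrix or a DataFrame would probably be a better fit.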

Last.fm data sets

Last.fm is a UK-based Internet radio service and music community. It offers developers a rich API, which many organizations and individuals have used to build data sets.

1K users (user full listening history)

This data set is described in Section 2.1 of Recommendation Systems in Action as a representative implicit-feedback data set with contextual information. It has two files: the listening records and the user information. The former contains the full play history, with timestamps, of nearly 1,000 listeners up to May 5, 2009, along with track title, artist name, MusicBrainz ID, and other information. The latter records the gender, age, country, and registration date of each listener. The statistics for the listening records are as follows:

Total Lines: 19150868
Unique Users: 992
Artists with MBID: 107528
Artists without MBID: 69420
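Statistics like these can be recomputed from the listening-record file with a short script. The sketch below is illustrative only: it assumes a tab-separated layout of user ID, timestamp, artist MBID, artist name, track MBID, and track name, and it counts artists without an MBID by name. Both the column order and the counting rule are assumptions, and the sample rows are invented.

```python
import csv
import io


def listening_stats(tsv_file):
    """Count lines, unique users, and artists with and without an MBID."""
    users, with_mbid, without_mbid = set(), set(), set()
    total = 0
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for user_id, _timestamp, artist_mbid, artist_name, _track_mbid, _track_name in reader:
        total += 1
        users.add(user_id)
        if artist_mbid:
            with_mbid.add(artist_mbid)
        else:
            # No MBID available, so fall back to the artist name as the key.
            without_mbid.add(artist_name)
    return {
        "total_lines": total,
        "unique_users": len(users),
        "artists_with_mbid": len(with_mbid),
        "artists_without_mbid": len(without_mbid),
    }


# Invented rows in the assumed column order.
sample = io.StringIO(
    "user_000001\t2009-05-04T23:08:57Z\tmbid-a\tArtist A\ttrk-1\tSong 1\n"
    "user_000001\t2009-05-04T13:54:10Z\t\tArtist B\t\tSong 2\n"
    "user_000002\t2009-05-03T10:00:00Z\tmbid-a\tArtist A\ttrk-1\tSong 1\n"
)
stats = listening_stats(sample)
print(stats)
```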

360K users (user top artists)

The 360K Users data set was released together with the 1K data set. It contains a user-artist relation file and a user information file. The user information has the same fields as in the 1K data set but covers about 360K users. The user-artist file records the number of times each user has listened to each artist. The statistics for the user-artist file are as follows:

Total Lines: 17559530
Unique Users: 359347
Artists with MBID: 186642
Artists without MBID: 107373
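Raw play counts differ a lot between heavy and light listeners, so one common preprocessing step (a general technique, not something prescribed by the data set) is to convert each user's counts into shares of that user's total listening. The sketch below assumes rows of user ID, artist MBID, artist name, and play count; the column order, the function name play_shares, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def play_shares(rows):
    """Turn per-user artist play counts into fractions of that user's total."""
    per_user = defaultdict(dict)
    for user_id, _artist_mbid, artist_name, plays in rows:
        per_user[user_id][artist_name] = int(plays)

    shares = {}
    for user_id, artists in per_user.items():
        total = sum(artists.values())
        if total == 0:
            continue  # nothing to normalize for this user
        shares[user_id] = {name: count / total for name, count in artists.items()}
    return shares


# Invented rows in the assumed column order.
sample_rows = [
    ("u1", "mbid-a", "Artist A", "3"),
    ("u1", "mbid-b", "Artist B", "1"),
    ("u2", "mbid-a", "Artist A", "10"),
]
shares = play_shares(sample_rows)
print(shares["u1"])  # {'Artist A': 0.75, 'Artist B': 0.25}
```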

HetRec 2011

This is a Last.fm data set released at the HetRec 2011 workshop. Unlike the previous two, it contains social (friend) relations and tag information. It consists of a relatively large number of files, but each file has only a few columns and the relations between them are straightforward, so they are not described in detail here. The statistics are as follows:

1892 users
17632 artists
12717 friend relations
92834 user-listened artist relations
11946 tags
186479 tag assignments (tas), i.e. tuples [user, tag, artist]
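To show how the [user, tag, artist] tuples can be used, here is a minimal sketch that counts how often each tag was assigned to each artist. The tuple order follows the description above; the column order in the actual files may differ, and the sample values are invented.

```python
from collections import Counter, defaultdict


def tags_per_artist(assignments):
    """Count tag assignments per artist from (user, tag, artist) tuples."""
    counts = defaultdict(Counter)
    for _user, tag, artist in assignments:
        counts[artist][tag] += 1
    return counts


# Invented tag assignments: (user, tag, artist).
sample = [
    (1, "rock", 10),
    (2, "rock", 10),
    (2, "indie", 10),
    (3, "jazz", 20),
]
counts = tags_per_artist(sample)
print(counts[10].most_common(1))  # [('rock', 2)]
```

Per-artist tag counts like these are a typical starting point for content-based similarity between artists.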

MSD’s Last.fm data set

This Last.fm data set is linked from the MSD home page (it is yet another Last.fm data set, and the different ones are easy to confuse). It is a supplement to MSD, so its records can be joined to MSD directly by track ID. The amount of data is large:

943,347 matched tracks MSD <-> Last.fm
505,216 tracks with at least one tag
584,897 tracks with at least one similar track
522,366 unique tags
8,598,630 (track, tag) pairs
56,506,688 (track, similar track) pairs

It uses the same odd directory structure as MSD, with one JSON file per song.

A file is named after the track ID, for example TRAAAAW128F429D538.json, and this ID is what links it to an MSD song. Each file provides basic song and artist information plus tags. What is unique is that Last.fm also provides a list of tracks similar to this one, along with similarity values.
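A per-track JSON file of this kind can be read with the standard json module. The sketch below is illustrative: the key names (track_id, artist, title, tags, similars), the pair layout of the lists, and all the values are assumptions, so check them against a real file before relying on them.

```python
import json

# Invented record in the assumed layout: "tags" and "similars" are lists of pairs.
raw = """
{
  "track_id": "TRXXXXX000000000000",
  "artist": "Example Artist",
  "title": "Example Song",
  "tags": [["rock", "100"], ["indie", "40"]],
  "similars": [["TRYYYYY000000000000", 0.42], ["TRZZZZZ000000000000", 0.91]]
}
"""


def top_similar(record, n=1):
    """Return the n most similar (track_id, score) pairs, highest score first."""
    pairs = sorted(record.get("similars", []), key=lambda p: float(p[1]), reverse=True)
    return [(track_id, float(score)) for track_id, score in pairs[:n]]


record = json.loads(raw)
print(record["title"], top_similar(record))
# Example Song [('TRZZZZZ000000000000', 0.91)]
```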

Other data sets

FMA: A large music audio data set, with 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.
Pitchfork Reviews: Pitchfork is an online music magazine. Someone crawled its 18,000 music reviews published since 1999 and put them on Kaggle for analysis and study. The data is a SQLite file, which mainly provides the article ID, title, artist, article link, rating, author, publication date, and so on.
50 Years of Pop Music Lyrics: Lyrics of the Billboard Year-End Hot 100 songs from 1964 to 2015.
MetroLyrics: 380,000 lyrics crawled from MetroLyrics, in CSV format, with fields such as song title, artist, genre, and lyrics.
KKBOX: The data set used in the WSDM Cup 2018. As an Asian music service, KKBOX provides a lot of information about Asian songs that the other data sets lack.
Spotify Song Attributes: The author called Spotify’s API to retrieve song attribute data in 2017 and trained a model to predict whether or not he liked a song.
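Data sets shipped as SQLite files, such as the Pitchfork reviews, can be queried directly with Python's built-in sqlite3 module. The sketch below builds a tiny in-memory database so that it runs on its own; the table name reviews, its columns, and the rows are hypothetical stand-ins, not the actual schema of the Kaggle file.

```python
import sqlite3

# In-memory stand-in for the downloaded file; the schema here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "reviewid INTEGER PRIMARY KEY, title TEXT, artist TEXT, score REAL, pub_date TEXT)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Album A", "Artist X", 8.0, "2016-01-01"),
        (2, "Album B", "Artist X", 6.0, "2017-01-01"),
        (3, "Album C", "Artist Y", 9.0, "2015-06-01"),
    ],
)

# Average review score per artist, best first.
rows = conn.execute(
    "SELECT artist, AVG(score) AS avg_score "
    "FROM reviews GROUP BY artist ORDER BY avg_score DESC"
).fetchall()
conn.close()
print(rows)  # [('Artist Y', 9.0), ('Artist X', 7.0)]
```

With the real file, sqlite3.connect would take the path to the downloaded database instead of ":memory:", and the query would need to match whatever tables it actually contains.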

API

Using official or unofficial APIs, you can also generate custom data sets to suit your own needs.

Last.fm API
Echo Nest API
Spotify API
The Echo Nest / Spotify APIs work together
MusicBrainz API
Cloud music API
Quora: What is the best, most complete API or database for searching music data?
