Next I plan to study music recommendation systems, which requires data for demonstrating algorithms and engineering code, so this post summarizes the open-source music datasets available on the Internet.
Million Song Dataset
When it comes to music datasets, the Million Song Dataset (MSD) is the first that comes to mind. It contains information about a million songs and is 280 GB in size. Because of the data volume, it is stored in HDF5 (.h5) files, and the project provides some code for reading them.
Each song corresponds to one file, whose fields cover every aspect of the song, such as artist_mbid, artist_name, title, tempo, and so on; the full list is given on the project site. The file paths look strange; the FAQ explains that it is impractical to put all files in a single directory. Instead, a song's location is determined by its Echo Nest track ID: the third, fourth, and fifth characters of the ID name the three directory levels, for example MillionSong/data/A/D/H/TRADHRX12903CD3866.h5.
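The path rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not the project's own reader code; the helper name msd_track_path and the default root directory are assumptions taken from the example path.

```python
from pathlib import PurePosixPath


def msd_track_path(track_id: str, root: str = "MillionSong/data") -> PurePosixPath:
    """Return the expected location of a track's .h5 file.

    Hypothetical helper: the third, fourth, and fifth characters of the
    Echo Nest track ID name the three directory levels under `root`.
    """
    if len(track_id) < 5:
        raise ValueError(f"track ID too short: {track_id!r}")
    level1, level2, level3 = track_id[2], track_id[3], track_id[4]
    return PurePosixPath(root) / level1 / level2 / level3 / f"{track_id}.h5"


print(msd_track_path("TRADHRX12903CD3866"))
# MillionSong/data/A/D/H/TRADHRX12903CD3866.h5
```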
In addition, the community has contributed a number of supplementary datasets built on MSD, which make it easier to study MSD from different angles. They are easy to find on the MSD home page.
- The SecondHandSongs Dataset: information about cover versions of some songs, and each cover's performance value on the SecondHandSongs website.
- The musiXmatch Dataset: lyrics for 77% of the songs in MSD, in bag-of-words form.
- The Last.fm Dataset: see below.
- The Echo Nest Taste Profile Subset: a (user, song, play count) dataset provided by The Echo Nest that can be linked to MSD, covering 1 million users and 48 million play records.
- thisismyjam-to-MSD mapping: This Is My Jam user data and its mapping to MSD.
- tagtraum genre annotations: music genre annotations.
- Top-MAGD dataset: music genre annotations.
Last.fm datasets
Last.fm is a UK-based Internet radio and music community. It offers developers a rich API, which many organizations and individuals have used to build datasets.
1K users (user full listening history)
This dataset is cited in Section 2.1 of Recommendation Systems in Action as a representative implicit-feedback dataset with contextual information. It consists of two files: listening records and user information. The former contains every play event, with its timestamp, for nearly 1,000 listeners up to May 5, 2009, along with the track title, artist name, MusicBrainz ID, and other fields. The latter records each listener's gender, age, country, and registration date. Statistics for the listening records are as follows:
- Total Lines: 19150868
- Unique Users: 992
- Artists with MBID: 107528
- Artists without MBID: 69420
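As a rough illustration of how statistics like these could be recomputed, the sketch below scans tab-separated listening records. The six-column layout (user ID, timestamp, artist MBID, artist name, track MBID, track title) is an assumption about the file, and the sample rows and IDs are invented for the example.

```python
import csv

# Invented sample rows. Assumed column order:
# user_id, timestamp, artist_mbid, artist_name, track_mbid, track_title
SAMPLE_ROWS = [
    "user_1\t2009-05-04T23:08:57Z\tmbid-a1\tArtist A\tmbid-t1\tSong One",
    "user_1\t2009-05-04T13:54:10Z\t\tArtist B\t\tSong Two",
    "user_2\t2009-05-03T09:00:00Z\tmbid-a1\tArtist A\tmbid-t3\tSong Three",
]


def summarize_listens(lines):
    """Count rows, unique users, and artists with / without an MBID."""
    users, with_mbid, without_mbid = set(), set(), set()
    total = 0
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 6:
            continue  # skip malformed lines
        user_id, _timestamp, artist_mbid, artist_name, _track_mbid, _title = row
        total += 1
        users.add(user_id)
        if artist_mbid:
            with_mbid.add(artist_mbid)
        else:
            without_mbid.add(artist_name)  # no MBID: fall back to the name
    return {
        "total_lines": total,
        "unique_users": len(users),
        "artists_with_mbid": len(with_mbid),
        "artists_without_mbid": len(without_mbid),
    }


stats = summarize_listens(SAMPLE_ROWS)
print(stats)
```

For a real file, an open text file handle could be passed in place of SAMPLE_ROWS, since the function only needs an iterable of lines.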
360K users (user top artists)
The 360K users dataset was released alongside the 1K dataset. It contains user-artist relationship data and user information. The user information has the same format as in the 1K dataset but covers about 360K users, while the user-artist file records how many times each user has listened to each artist. Statistics for the user-artist file are as follows:
- Total Lines: 17559530
- Unique Users: 359347
- Artists with MBID: 186642
- Artists without MBID: 107373
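A user-artist play count file like this maps naturally onto a nested dictionary of implicit feedback. The sketch below is one way to load it and rank a user's artists; the four-column layout (user ID, artist MBID, artist name, play count) is an assumption, and the rows are invented.

```python
from collections import defaultdict

# Invented sample rows. Assumed column order:
# user_id, artist_mbid, artist_name, play_count
SAMPLE_ROWS = [
    "u1\tmbid-a1\tArtist A\t120",
    "u1\t\tArtist B\t30",
    "u2\tmbid-a1\tArtist A\t5",
]


def load_play_counts(lines):
    """Build {user_id: {artist_key: play_count}} from tab-separated lines."""
    plays = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip malformed lines
        user_id, artist_mbid, artist_name, count = parts
        # Use the MBID as the key when present, otherwise the artist name.
        plays[user_id][artist_mbid or artist_name] = int(count)
    return plays


def top_artists(plays, user_id, k=2):
    """Return the user's k most-played artists as (artist_key, count) pairs."""
    ranked = sorted(plays.get(user_id, {}).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


plays = load_play_counts(SAMPLE_ROWS)
print(top_artists(plays, "u1"))
# [('mbid-a1', 120), ('Artist B', 30)]
```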
HetRec 2011
This Last.fm dataset was released at the HetRec 2011 workshop. Unlike the previous two, it includes social (friend) relations and tag information. It consists of a fairly large number of files, but each has only a few columns and the relationships between them are simple, so I will not go through them here. The statistics are as follows:
- 1892 users
- 17632 artists
- 12717 friend relations
- 92834 user-listened artist relations
- 11946 tags
- 186479 tag assignments (tas), i.e. tuples [user, tag, artist]
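To show how [user, tag, artist] tuples like these might be used, the sketch below aggregates them into a tag profile per artist. The tuples are invented, and the function name tags_by_artist is hypothetical.

```python
from collections import Counter, defaultdict

# Invented (user, tag, artist) tuples in the shape described above.
TAG_ASSIGNMENTS = [
    (1, "rock", 10),
    (2, "rock", 10),
    (2, "indie", 10),
    (3, "jazz", 11),
]


def tags_by_artist(assignments):
    """Count how often each tag was assigned to each artist."""
    counts = defaultdict(Counter)
    for _user, tag, artist in assignments:
        counts[artist][tag] += 1
    return counts


profile = tags_by_artist(TAG_ASSIGNMENTS)
print(profile[10].most_common(1))
# [('rock', 2)]
```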
MSD’s Last.fm dataset
This Last.fm dataset is listed on MSD’s front page (yet another one; they are really easy to mix up). As a supplement to MSD, it can be joined directly on the MSD track ID. It is large, as the figures below show:
- 943,347 matched tracks MSD <-> Last.fm
- 505,216 tracks with at least one tag
- 584,897 tracks with at least one similar track
- 522,366 unique tags
- 8,598,630 (track-tag) pairs
- 56,506,688 (track-similar track) pairs
It uses the same odd directory structure as MSD, with one JSON file per song; for example, a file might be named TRAAAAW128F429D538.json. The ID in the file name maps to a song in MSD, and the file provides basic song and artist information plus tags. What is unique is that Last.fm also provides a list of songs similar to this one, along with similarity scores.
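A per-song JSON file of this kind can be read with the standard json module. In the sketch below, the key names (track_id, artist, title, similars, tags) and the pair layout of the lists are assumptions about the file format, and every value is invented.

```python
import json

# Invented record; key names and list layout are assumptions.
SAMPLE_JSON = """
{
  "track_id": "TR_EXAMPLE_0",
  "artist": "Artist A",
  "title": "Song One",
  "similars": [["TR_EXAMPLE_1", 0.5], ["TR_EXAMPLE_2", 0.9], ["TR_EXAMPLE_3", 0.1]],
  "tags": [["rock", "100"], ["indie", "40"]]
}
"""


def top_similar(record, k=2):
    """Return the k most similar tracks as (track_id, score) pairs."""
    pairs = sorted(record.get("similars", []), key=lambda p: p[1], reverse=True)
    return [(track_id, score) for track_id, score in pairs[:k]]


record = json.loads(SAMPLE_JSON)
print(top_similar(record))
# [('TR_EXAMPLE_2', 0.9), ('TR_EXAMPLE_1', 0.5)]
```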
Other data sets
- FMA: a large music audio dataset, with 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks by 16,341 artists on 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.
- Pitchfork Reviews: Pitchfork is an online music magazine. Someone scraped 18,000 of its music reviews published since 1999 and posted them on Kaggle for analysis and study. The data comes as an SQLite file, which mainly provides the article ID, title, artist, article link, rating, author, publication date, and so on.
- 50 Years of Pop Music Lyrics: lyrics of the Billboard Year-End Hot 100 songs from 1964 to 2015.
- MetroLyrics: 380,000 lyrics crawled from MetroLyrics, in CSV format, with fields such as song title, artist, genre, and lyrics.
- KKBOX: the dataset used in the WSDM Cup 2018. As an Asian music service, KKBOX provides a lot of information about Asian songs that none of the other datasets have.
- Spotify Song Attributes: the author used Spotify’s API to retrieve data for 2017 songs and trained a model to predict whether or not he would like a song.
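For the datasets distributed as SQLite files, such as the Pitchfork reviews, Python's built-in sqlite3 module is enough to start exploring. The sketch below builds a tiny in-memory database instead of opening the real file; the table name reviews and its columns are hypothetical stand-ins for the fields listed above, and the rows are invented.

```python
import sqlite3

# In-memory stand-in for the downloaded file; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "reviewid INTEGER PRIMARY KEY, title TEXT, artist TEXT, "
    "score REAL, pub_date TEXT)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Album One", "Artist A", 8.0, "2001-01-01"),
        (2, "Album Two", "Artist A", 6.0, "2005-06-15"),
        (3, "Album Three", "Artist B", 7.5, "2010-03-20"),
    ],
)


def average_score_by_artist(connection):
    """Return {artist: mean review score} computed in SQL."""
    rows = connection.execute(
        "SELECT artist, AVG(score) FROM reviews GROUP BY artist ORDER BY artist"
    ).fetchall()
    return dict(rows)


averages = average_score_by_artist(conn)
print(averages)
# {'Artist A': 7.0, 'Artist B': 7.5}
```

With the real file, sqlite3.connect would take the path to the downloaded database, and the table and column names would need to be checked first.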
API
You can also use official or unofficial APIs to generate custom datasets tailored to your needs.
- Last.fm API
- Echo Nest API
- Spotify API
- The Echo Nest / Spotify APIs work together
- MusicBrainz API
- Cloud Music API
- Quora: What is the best, most complete API or database for searching music data?
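As one example of the API route, a request URL for the Last.fm API can be assembled as shown below. The root URL, the artist.gettoptags method, and the parameter names reflect my understanding of the Last.fm API and should be checked against its documentation. YOUR_API_KEY is a placeholder, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed root URL of the Last.fm web service.
API_ROOT = "http://ws.audioscrobbler.com/2.0/"


def build_top_tags_url(artist, api_key):
    """Build a request URL asking for an artist's top tags as JSON."""
    params = {
        "method": "artist.gettoptags",
        "artist": artist,
        "api_key": api_key,  # placeholder; a real key is required
        "format": "json",
    }
    return f"{API_ROOT}?{urlencode(params)}"


url = build_top_tags_url("The Beatles", "YOUR_API_KEY")
query = parse_qs(urlparse(url).query)
print(url)
print(query["artist"])
# ['The Beatles']
```

Collecting responses for a list of artists and saving them would then yield a small custom tag dataset.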