Representing music with Word2vec?
Dorien Herremans

Machine learning algorithms have transformed the fields of vision and natural language processing. But what about music? In recent years, the field of music information retrieval (MIR) has been developing rapidly, and many NLP techniques can be transferred to music. In a paper published in 2018, Chuan, Agres, and Herremans explored a way to represent polyphonic music using the popular NLP technique Word2vec. Let's explore how this is done…

Word2vec

Word embedding models enable us to represent words in a meaningful way, so that machine learning models can process them more easily. These models allow us to represent words as vectors that carry semantic meaning. Word2vec is a popular word embedding model developed by Mikolov et al. (2013), which can build such semantic vector spaces in a very efficient way.

The essence of Word2vec is a simple single-layer neural network, which can be built in one of two ways: 1) using the continuous bag-of-words model (CBOW), or 2) using the skip-gram architecture. Both architectures are highly efficient and can be trained relatively quickly. In this study we use the skip-gram model, since Mikolov et al. (2013) mention that it works better for smaller datasets. The skip-gram architecture takes the current word w_t as input (the input layer) and tries to predict the words adjacent to it, before and after, within a context window (the output layer):

Image from Chuan et al. (2018): illustration of the word w_t and its context window.

There is some confusion about what the skip-gram architecture looks like, due to some misleading images circulating on the Internet. The output layer of the network does not contain multiple words, only a single word from the context window. So how does it represent the entire context window? When training the network, we actually use pairs of samples, each consisting of the input word and one randomly chosen word from its context window.
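As a minimal sketch of how such training pairs might be generated (hypothetical slice tokens, not the authors' code):

```python
import random

def skipgram_pairs(sequence, window=4):
    """Yield (input, context) training pairs for the skip-gram model.

    For each position t, one context word is sampled uniformly from
    the window of `window` words before and after the input word.
    """
    pairs = []
    for t, w_t in enumerate(sequence):
        # indices of context positions inside the window, excluding t itself
        lo, hi = max(0, t - window), min(len(sequence), t + window + 1)
        context = [sequence[i] for i in range(lo, hi) if i != t]
        if context:
            pairs.append((w_t, random.choice(context)))
    return pairs

# Toy example: one "sentence" of musical slices (hypothetical tokens)
print(skipgram_pairs(["C,E,G", "D,F", "E,G,B", "C,E,G"], window=2))
```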

The traditional training objective for this type of network uses a softmax function to calculate p(w_{t+i} | w_t), whose gradient is very expensive to compute. Fortunately, techniques such as noise-contrastive estimation (Gutmann and Hyvärinen, 2012) and negative sampling (Mikolov et al., 2013b) provide a solution. Negative sampling essentially defines a new objective: maximize the probability of real words and minimize the probability of noise samples. A simple binary logistic regression can then be used to separate real words from noise samples.
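Concretely, the negative-sampling objective of Mikolov et al. (2013b) replaces the softmax as follows: for a training pair (w_t, w_{t+i}) and k noise words w_n drawn from a noise distribution P_n, it maximizes

log σ(v′_{w_{t+i}} · v_{w_t}) + Σ_{n=1..k} E_{w_n ∼ P_n} [ log σ(−v′_{w_n} · v_{w_t}) ]

where σ is the logistic sigmoid and v, v′ are the input and output embeddings. Maximizing this pushes real (input, context) pairs toward high scores and noise pairs toward low ones.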

Once the Word2vec model is trained, the weights of the hidden layer essentially constitute the learned, multidimensional embedding.
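To make this concrete, here is a minimal sketch of training such a model with the gensim library on slice "sentences" (hypothetical tokens; the authors used their own implementation, so details differ):

```python
from gensim.models import Word2Vec

# Each "sentence" is one track: a list of slice tokens (hypothetical data).
tracks = [
    ["C,E,G", "D,F", "E,G,B", "C,E,G"],
    ["A,C,E", "G,B,D", "A,C,E"],
]

model = Word2Vec(
    sentences=tracks,
    vector_size=256,   # embedding dimension, as in the paper
    window=4,          # skip window, as in the paper
    sg=1,              # use the skip-gram architecture
    negative=5,        # train with negative sampling
    min_count=1,
)

vec = model.wv["C,E,G"]   # the learned 256-dim embedding of one slice
```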

Using music as a word?

Music and language are intrinsically linked. Both consist of ordered sequences of events that follow grammatical rules, and more importantly, both create expectations. Imagine I say, "I'm going to the pizza place to get a…". The sentence creates a clear expectation: pizza. Now imagine I hum the Happy Birthday tune to you, but stop at the last note… Just like a sentence, a melody generates expectations, which can be measured in brain responses such as the event-related potential N400 (Besson and Schön, 2001).

Given the similarities between language and music, let's see whether a popular language model can also learn a meaningful representation of music. To turn MIDI files into a "language", we define "slices" of music (the equivalent of words). Each track in our database is divided into equal, non-overlapping slices of one beat. The duration of a beat is estimated with the MIDI Toolbox and may differ between tracks. For each slice, we record the list of all pitch classes it contains, i.e., pitches without octave information.

The figure below shows an example of how slices are identified in the first measure of Chopin's Mazurka Op. 67 No. 4. The length of a beat here is a quarter note.

Image from Chuan et al. (2018): creating words from slices of music.
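A sketch of this slicing step, here using the pretty_midi library rather than the MIDI Toolbox the authors used (so beat detection may differ):

```python
import pretty_midi

def slices_from_midi(path):
    """Split a MIDI file into one-beat slices of pitch classes (no octaves)."""
    pm = pretty_midi.PrettyMIDI(path)
    beats = pm.get_beats()                     # beat onset times in seconds
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    slices = []
    for start, end in zip(beats[:-1], beats[1:]):
        pcs = set()
        for inst in pm.instruments:
            for note in inst.notes:
                # include any note that sounds during this beat
                if note.start < end and note.end > start:
                    pcs.add(names[note.pitch % 12])
        slices.append(",".join(sorted(pcs)))   # one "word" per beat
    return slices
```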

Word2vec learns tonality — the distributional semantics hypothesis in music

In language models, the distributional semantics hypothesis is one of the theoretical foundations behind word embeddings. It states that "words that occur in the same contexts tend to have similar meanings" (Harris, 1954). Translated into vector-space terms, this means that such words will lie geometrically close to each other. Let's see whether the Word2vec model learns a similar representation for music.

The data set

The MIDI dataset used by Chuan et al. covers eight different music genres (from classical to metal). From a total of 130,000 pieces, only 23,178 were selected, based on their genre labels. These tracks contain 4,076 unique slices.

Hyperparameters

The model was trained using only the 500 most frequently occurring slices (i.e., words), with all other slices replaced by a dummy word. This improves accuracy, since frequently occurring words carry more information for the model. Other hyperparameters include the learning rate (set to 0.1), the skip window (set to 4), the number of training steps (set to 1,000,000), and the embedding size (set to 256).
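The vocabulary truncation step might look like this (the dummy word is written as "UNK" here; the actual token used in the paper is an assumption):

```python
from collections import Counter

def truncate_vocab(tracks, top_n=500, unk="UNK"):
    """Keep only the top_n most frequent slices; replace the rest with `unk`."""
    counts = Counter(s for track in tracks for s in track)
    keep = {s for s, _ in counts.most_common(top_n)}
    return [[s if s in keep else unk for s in track] for track in tracks]
```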

Chords

To assess whether the semantics of musical slices are captured by the model, let’s look at chords.

In the slice vocabulary, all slices containing triads are identified. These slices are then labelled with Roman numerals (as is common in music theory). For example, in the key of C, the C major chord is I and the G major chord is V. We then use the cosine distance to calculate how far apart chords with different tonal functions lie in the embedding space.

In an n-dimensional space, the cosine distance D_C(A, B) between two non-zero vectors A and B is calculated as:

D_C(A, B) = 1 − cos(θ) = 1 − D_S(A, B)

where θ is the angle between A and B, and D_S is the cosine similarity:

D_S(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

From a musical perspective, the “tonal distance” between chords I and V should be smaller than that between chords I and III. The figure below shows the distance between the C major triad and the other chords.

Cosine distance between triads and the tonic chord (the C major triad).

The distances from triad I to V, IV, and VI are relatively small! This matches what music theory calls "tonal proximity", and suggests that the Word2vec model does learn meaningful relationships between slices.

In word2vec space, the cosine distance between chords seems to reflect the function of chords in music theory!

Keys

By looking at the 24 preludes of Bach's Well-Tempered Clavier (WTC), which cover all 24 keys (major and minor), we can study whether the embedding space captures key information.

To expand the dataset, each piece was transposed into every other major or minor key (depending on its original mode), so that each piece exists in 12 versions. The slices of each piece were mapped into the pre-trained vector space, and k-means clustering was used to obtain a centroid for each key. Because the pieces are transpositions of one another, the cosine distance between these centroids is affected by only one factor: the key.
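A sketch of this step with scikit-learn, assuming a hypothetical mapping key_slices from each key name to the embedding vectors of its slices (the paper's exact clustering setup may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def key_centroid(vectors):
    """Representative vector of one key: centroid of its slice embeddings."""
    km = KMeans(n_clusters=1, n_init=10).fit(np.array(vectors))
    return km.cluster_centers_[0]

# key_slices: {key name -> list of slice embedding vectors} (assumed given)
centroids = {key: key_centroid(vecs) for key, vecs in key_slices.items()}

# pairwise cosine-distance matrix between keys, reusing cosine_distance above
keys = sorted(centroids)
matrix = [[cosine_distance(centroids[a], centroids[b]) for b in keys]
          for a in keys]
```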

The figure below shows the cosine distances between the key centroids. As expected, keys a fifth apart are tonally close, visible as the darker regions next to the diagonal. Distant keys (such as F and F#) appear orange, confirming our hypothesis that the Word2vec space reflects the tonal distances between keys!

Image from Chuan et al. (2018): similarity matrix based on cosine distance between pairs of preludes in different keys.

Analogies

A striking feature of Word2vec is that it captures analogy relationships such as "king → queen" and "man → woman" as translations in the vector space (Mikolov et al., 2013c). This means that meaning can be shifted by vector arithmetic. Does this work for music too?

We first detect chords in the polyphonic slices and look at pairs of chord vectors, such as C major to G major (I-V). The angles between different I-V vector pairs turn out to be very similar (as shown on the right), which even suggests a multidimensional circle of fifths. This again indicates that the concept of analogy may exist in the musical Word2vec space, although more research is needed to find more definitive examples.
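One way to check this is to compare the angles between chord-pair difference vectors, as in this sketch (reusing the hypothetical model and tokens from above):

```python
import numpy as np

def angle_between(u, v):
    """Angle (in degrees) between two vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# I -> V in C major vs. I -> V in G major; similar angles suggest an analogy
iv_in_C = model.wv["G,B,D"] - model.wv["C,E,G"]
iv_in_G = model.wv["D,F#,A"] - model.wv["G,B,D"]
print(angle_between(iv_in_C, iv_in_G))
```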

Image from Chuan et al. (2018): angle between chord-pair vectors.

Other applications — music generation?

Chuan et al. (2018) also briefly explored how this model could be used to generate new music by replacing slices with geometrically similar ones. They describe this as a preliminary experiment, but the embedding could serve as the representation for more complex systems, such as LSTMs. More details can be found in the paper, and the picture below gives a taste of the results.
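A sketch of the replacement idea, using gensim's nearest-neighbour query (a simplification; the paper describes its own procedure):

```python
import random

def replace_slices(track, model, topn=3):
    """Replace each slice by a random one of its nearest embedding neighbours."""
    out = []
    for s in track:
        if s in model.wv:
            neighbours = [w for w, _ in model.wv.most_similar(s, topn=topn)]
            out.append(random.choice(neighbours))
        else:
            out.append(s)  # leave out-of-vocabulary slices unchanged
    return out
```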

Image from Chuan et al. (2018): replacing slices with geometrically close slices.

Conclusion

Chuan, Agres, and Herremans (2018) built a Word2vec model that captures the tonal properties of polyphonic music, without any explicit musical knowledge being fed into the model. The article provides convincing evidence that chord and key information can be found in the resulting embedding. So, to answer the question in the title: yes, we can use Word2vec to represent polyphonic music! The path is now open to embedding this representation in other models that capture the temporal structure of music.

References

  • Besson M, Schön D (2001) Comparison between language and music. Ann N Y Acad Sci 930(1):232–258.
  • Chuan CH, Agres K, Herremans D (2018) From context to concept: exploring semantic relationships in music with word2vec. Neural Computing and Applications, Special Issue on Deep Learning for Music and Audio, pp 1–14. arXiv preprint.
  • Gutmann MU, Hyvärinen A (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J Mach Learn Res 13(Feb):307–361.
  • Harris ZS (1954) Distributional structure. Word 10(2–3):146–162.
  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Proceedings of advances in neural information processing systems (NIPS), pp 3111–3119.
  • Mikolov T, Yih WT, Zweig G (2013c) Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 746–751.
