Video introduction: The machine learning behind Hum to Search
The melody that lingers in your head, often referred to as an “earworm,” is a well-known and sometimes irritating phenomenon, and once it’s there, it can be hard to get rid of. Research has found that engaging with the original song, whether by listening to it or singing it, can drive the earworm away. But what if you can’t quite remember the name of the song and can only hum the melody?
Existing methods of matching a hummed melody to its original polyphonic studio recording face several challenges. With lyrics, backing vocals, and instruments, the audio of a studio recording can be quite different from that of a hummed tune. In addition, whether by mistake or by design, when someone sings their own interpretation of a song, the pitch, key, rhythm, or tempo may differ slightly or even significantly from the original. This is why so many existing query-by-humming methods match a hummed tune against a database of melody-only or hummed versions of songs, rather than identifying the song directly from the original recording. However, this type of approach typically relies on a limited set of databases that must be updated manually.
Hum to Search, launched in October, is a completely new machine learning system in Google Search that allows people to find a song simply by humming it. In contrast to existing methods, this approach produces a melody embedding directly from the song’s spectrogram, without generating an intermediate representation. This lets the model match a hummed melody directly to the original polyphonic recording, with no need for a hummed or MIDI version of each track or other complex, hand-designed logic to extract the melody. This approach greatly simplifies Hum to Search’s database, allowing it to be refreshed constantly by embedding original recordings from around the world, including the latest releases.
Background
Many existing music recognition systems convert an audio sample into a spectrogram before processing it, in order to find a good match. However, one challenge in recognizing a hummed melody is that a hummed tune often contains relatively little information. The differences between a hummed version and the same snippet from the corresponding studio recording can be visualized by comparing their spectrograms, as shown below:
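As a minimal sketch of this preprocessing step, the snippet below computes a log-mel spectrogram for a short clip, assuming the librosa library is available; the file name and all parameter values are illustrative, not the ones used by the production system.

```python
# Sketch: convert an audio clip into a log-mel spectrogram (assumes librosa).
import librosa
import numpy as np

# "hummed_clip.wav" is a hypothetical file name.
audio, sr = librosa.load("hummed_clip.wav", sr=16000, mono=True)

# Mel-scaled spectrogram: the time-frequency image the model works with.
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=1024, hop_length=256, n_mels=64
)

# Log compression makes quieter harmonics, such as a faint hummed melody, visible.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (n_mels, n_frames)
```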
Given the image on the left, the model needs to locate the audio corresponding to the image on the right from a collection of more than 50 million similar-looking images, corresponding to studio recordings of other songs. To achieve this, the model must learn to focus on the dominant melody and ignore backing vocals, instruments, and vocal timbre, as well as differences caused by background noise or room reverberation. To spot by eye the dominant melody that might be used to match these two spectrograms, one could look for similarities in the lines near the bottom of the images above.
Previous work on music recognition, particularly on identifying recorded music playing in environments such as cafes or clubs, has shown how machine learning can be applied to this problem. Now Playing, released on Pixel phones in 2017, uses an on-device deep neural network to recognize songs without having to connect to a server, and Sound Search later built on this technology to provide a server-based recognition service for faster, more accurate search across a catalog of over 100 million songs. The next challenge was to apply what was learned from these systems to recognize hummed or sung music against a similarly large library of songs.
Machine learning setup
The first step in developing Hum to Search was to modify the music recognition models used in Now Playing and Sound Search to work with recordings of humming and singing. In principle, many such retrieval systems (for example, image recognition) work in a similar way: a neural network is trained on paired inputs (in this case, hummed or sung audio paired with recorded audio) to produce an embedding for each input, which is later used to match a hummed melody.
To enable humming recognition, the network should produce embeddings that place pairs of audio containing the same melody close to each other, even if they have different instrumental accompaniment and singing voices, while pairs of audio containing different melodies should be far apart. During training, the network is fed such audio pairs until it learns to produce embeddings with this property.
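The sketch below shows what such an embedding network might look like: a small convolutional model that maps a spectrogram to an L2-normalized vector. It assumes TensorFlow/Keras, and the layer sizes and names are purely illustrative; this is not Google’s actual architecture.

```python
import tensorflow as tf

def build_melody_embedder(n_mels=64, embedding_dim=128):
    """Illustrative spectrogram -> embedding network (not the production model)."""
    # Variable-length time axis: (n_mels, n_frames, 1) with n_frames unspecified.
    inputs = tf.keras.Input(shape=(n_mels, None, 1))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.MaxPool2D(2)(x)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dense(embedding_dim)(x)
    # L2-normalize so that comparing embeddings by dot product equals cosine similarity.
    outputs = tf.keras.layers.Lambda(
        lambda v: tf.math.l2_normalize(v, axis=-1))(x)
    return tf.keras.Model(inputs, outputs)
```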
The trained model can then generate an embedding for a hummed tune that is similar to the embedding of the song’s reference recording. Finding the right song then comes down to searching a database of reference embeddings, computed from the audio of popular music, for similar embeddings.
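To make that retrieval step concrete, here is a toy nearest-neighbor lookup over a precomputed embedding database using NumPy. A real system at this scale would use approximate nearest-neighbor search; the function and variable names here are hypothetical.

```python
import numpy as np

def find_best_matches(query_embedding, reference_embeddings, song_ids, top_k=5):
    """Return the top-k songs whose reference embeddings are closest to the query.

    Assumes all embeddings are L2-normalized, so a dot product equals
    cosine similarity. `song_ids` is a hypothetical list of track identifiers.
    """
    scores = reference_embeddings @ query_embedding  # (num_songs,)
    best = np.argsort(-scores)[:top_k]
    return [(song_ids[i], float(scores[i])) for i in best]

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))
db /= np.linalg.norm(db, axis=1, keepdims=True)
query = db[42] + 0.05 * rng.normal(size=128)
query /= np.linalg.norm(query)
print(find_best_matches(query, db, song_ids=list(range(1000))))
```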
Training data
Because training the model requires pairs of songs (the recorded original and the sung version), the first challenge was obtaining enough training data. Our initial dataset consisted mostly of sung music segments (very few of which included humming). To make the model more robust, we augmented the audio during training, for example by randomly varying the pitch or tempo of the sung input. The resulting model worked well enough for people singing, but not for people humming or whistling.
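A minimal sketch of this kind of augmentation is shown below, assuming librosa; the randomization ranges are illustrative and not the values used in training.

```python
# Sketch: random pitch and tempo augmentation of a sung training clip (assumes librosa).
import random
import librosa

def augment_singing(audio, sr):
    """Randomly shift pitch and stretch tempo of a sung clip."""
    semitones = random.uniform(-2.0, 2.0)  # random key change, in semitones
    rate = random.uniform(0.8, 1.2)        # random tempo change factor
    audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=semitones)
    audio = librosa.effects.time_stretch(audio, rate=rate)
    return audio
```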
To improve the model’s performance on hummed melodies, we generated additional training data of simulated “hummed” melodies from the existing audio dataset using SPICE, a pitch extraction model developed by our wider team as part of the FreddieMeter project. SPICE extracts pitch values from a given audio clip, which we then use to generate a melody made up of discrete audio tones. The first version of the system converted the original sung clip into such a sequence of tones.
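The snippet below sketches how a per-frame pitch track (such as the one a model like SPICE produces) could be rendered as a melody of discrete tones. The frame rate, amplitude, and quantization choices are assumptions for illustration only.

```python
import numpy as np

def tones_from_pitch(pitch_hz, frame_seconds=0.032, sr=16000):
    """Render per-frame pitch estimates as a sequence of discrete sine tones.

    `pitch_hz` stands in for frame-level pitch values from a pitch extractor;
    frames with pitch 0 are treated as silence.
    """
    samples_per_frame = int(frame_seconds * sr)
    t = np.arange(samples_per_frame) / sr
    out = []
    for f0 in pitch_hz:
        if f0 > 0:
            # Quantize to the nearest semitone so the result sounds like discrete notes.
            midi = np.round(69 + 12 * np.log2(f0 / 440.0))
            f_quantized = 440.0 * 2 ** ((midi - 69) / 12)
            out.append(0.5 * np.sin(2 * np.pi * f_quantized * t))
        else:
            out.append(np.zeros(samples_per_frame))
    return np.concatenate(out)
```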
We later improved on this approach by replacing the simple tone generator with a neural network that generates audio resembling an actual hummed or whistled tune. For example, the network can generate a humming example or a whistling example from the sung snippet above.
As a final step, we augmented the training data by mixing and matching audio samples. For example, when we had similar clips from two different singers, we aligned the two clips with our preliminary models, which allowed us to show the model an additional pair of audio clips representing the same melody.
Machine learning improvements
When training the Hum to Search model, we started with a triplet loss function. This loss has been shown to perform well across a variety of classification tasks, including images and recorded music. Given a pair of audio clips corresponding to the same melody (points R and P in the embedding space, shown below), triplet loss ignores certain parts of the training data that come from different melodies: both examples that are too “easy” for the model, because they are already far from R and P (see point E), and examples that are too “hard”, because, given the model’s current state of learning, the audio ends up too close to R even though according to our data it represents a different melody (see point H). Ignoring these cases helps the model learn more effectively.
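For reference, here is a minimal triplet-loss sketch in TensorFlow. The margin value and function names are illustrative; how the “easy” (E) and “hard” (H) examples are filtered before this step follows the description above rather than Google’s exact formulation.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on L2-normalized embeddings.

    Pulls the anchor (R) and positive (P) together and pushes the negative away.
    Triplets whose negative is already farther than the margin (too "easy",
    like point E) contribute zero loss; very hard negatives (like point H)
    can additionally be filtered out before this step, as described above.
    """
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
```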
We found that we could use these additional training examples (points H and E) to improve the model’s accuracy by formulating a notion of model confidence over a batch of examples: how confident is the model that all of the data it sees can be classified correctly, or has it seen examples that do not fit its current understanding? Based on this notion of confidence, we added a loss term that pushes the model toward 100% confidence across all regions of the embedding space, which improved our model’s precision and recall.
The changes described above, particularly the variations, augmentations, and superpositions of the training data, enabled the neural network model deployed in Google Search to recognize sung or hummed melodies. The current system achieves high accuracy on a song database of over half a million songs, which we are constantly updating, and this corpus still has room to grow to include more of the world’s many melodies.
Hum to Search in the Google app
To try out the feature, open the latest version of the Google app, tap the microphone icon and say “What’s this song?”, or tap the “Search a song” button, and then hum, sing, or whistle away! We hope Hum to Search helps you get rid of that stubborn earworm, or perhaps simply helps you find and play a song without having to type its name.