Technology Editor: Xu Jiu. Report from Beijing, SegmentFault


Recently, technology giants including Alibaba and Microsoft have invested significant time and resources in the problem of sound separation, driven by market demand and technical progress in the audio-video sector. Google has now released a new dataset, the Free Universal Sound Separation dataset (FUSS), to support the development of AI models capable of separating distinct sounds from mixed recordings.

According to the report, the potential use cases for such a model are broad; if commercialized, FUSS could, for example, help enterprises extract individual voices from conference calls.

This follows a study by Google and the Swiss Idiap Research Institute, which described how two machine learning models, a speaker recognition network and a spectrogram masking network, together “significantly reduced the word error rate (WER) of speech recognition on multi-speaker signals.”


As Google Research scientists John Hershey, Scott Wisdom, and Hakan Erdogan explain in an article, most sound separation models assume that the number of sounds in a mixture is fixed: they either isolate mixtures of a few sound types (such as speech and non-speech) or separate different instances of the same sound type (such as a first and second speaker). The FUSS dataset shifts the focus to the more general problem of separating an arbitrary number of sounds from one another.

To this end, the FUSS dataset includes a collection of diverse sounds, a realistic room simulator, and code that mixes these elements together to produce realistic multi-source, multi-class audio.
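To illustrate the idea, here is a minimal sketch of how such multi-source mixtures might be assembled: each mono source is convolved with its simulated room impulse response, then the reverberant signals are summed. This is a hypothetical illustration (all names are invented), not Google's actual mixing code.

```python
import numpy as np

def mix_sources(sources, rirs, mix_len):
    """Additively mix mono sources after convolving each one with its
    simulated room impulse response (RIR). Hypothetical sketch only."""
    mixture = np.zeros(mix_len)
    for src, rir in zip(sources, rirs):
        reverberant = np.convolve(src, rir)[:mix_len]  # apply room acoustics
        mixture[:len(reverberant)] += reverberant      # additive mixing
    return mixture

# Toy example: two 1-second tones and trivial single-tap "RIRs"
sr = 16000
t = np.arange(sr) / sr
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sin(2 * np.pi * 220 * t)
mix = mix_sources([s1, s2], [np.array([1.0]), np.array([0.5])], sr)
```

A real simulator would produce much longer, frequency-dependent impulse responses, but the mix step stays the same: convolve, then sum.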

The Google researchers took audio clips from FreeSound.org, filtered out those that humans could not separate when mixed together, and compiled 23 hours of audio spanning 12,377 clips. From these they produced 20,000 mixtures used to train their AI models, plus an additional 1,000 mixtures for validation and 1,000 for evaluation.

The researchers said they built their own room simulator using Google’s TensorFlow machine learning framework; given a sound source and a microphone location, it generates the impulse response of a box-shaped room with “frequency-dependent” reflection properties. FUSS ships with the computed room impulse response for each audio sample, along with the mixing code. In addition, FUSS provides a pre-trained, mask-based separation model that can reconstruct multi-source mixtures with high accuracy.
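The mask-based approach mentioned above typically works in the time-frequency domain: the model predicts one real-valued mask per source, and each source estimate is obtained by multiplying the mixture’s spectrogram by that source’s mask. The sketch below shows only this masking step with invented names; it is not Google’s model.

```python
import numpy as np

def apply_masks(mixture_stft, masks):
    """Estimate per-source STFTs from a complex mixture STFT of shape
    [freq, time] and real masks of shape [n_src, freq, time] by
    element-wise masking. Hypothetical illustration only."""
    return masks * mixture_stft[None, :, :]

# Toy check: two ideal binary masks that partition the spectrogram
mix_stft = np.ones((4, 5), dtype=complex)
masks = np.zeros((2, 4, 5))
masks[0, :2] = 1.0   # source 0 "owns" the low-frequency bins
masks[1, 2:] = 1.0   # source 1 "owns" the high-frequency bins
estimates = apply_masks(mix_stft, masks)
```

Because the toy masks sum to one in every bin, the source estimates add back up to the original mixture, which is the property that lets masking reconstruct mixes with high accuracy when the masks are well predicted.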

The Google team plans to open-source the room simulator’s code and to extend it to handle more computationally expensive acoustic features, materials with different reflective properties, and novel room shapes.

“We hope the FUSS dataset will lower the barrier to new research, and in particular make it possible to quickly iterate and apply novel techniques from other areas of machine learning to the challenges of sound separation,” the researchers wrote.

Project address:


https://github.com/google-res…