Translator's note: there is an alternative to the Android Visualizer.

When listening to music, you sometimes see visually pleasing bars that jump higher the louder the music gets. In general, the bars on the left correspond to the lower frequencies (bass), while the bars on the right correspond to the higher frequencies (treble):

These bars are often called visual equalizers, or visualizers. To display a similar visualizer in your Android app, you can use the native Visualizer class, which is part of the Android framework and can be attached to your AudioTrack.

It works well, but has one important flaw: it requires the microphone (RECORD_AUDIO) permission, which, according to the official documentation, is deliberate:

To protect privacy of certain audio data (e.g voice mail) the use of the visualizer requires the permission.

The problem is that users (understandably) won't grant a music app access to their microphone. And when I scoured the official Android APIs and third-party libraries, I couldn't find an alternative to the Visualizer.

So I decided to build my own. The first problem: I had to figure out how to translate the music being played into the height of each jumping bar.

How visualizers work

First, let's start with the input. When digitising audio, we typically sample the amplitude of the signal very frequently; this is called pulse code modulation (PCM). The amplitude is then quantized, meaning we express it on a fixed numerical scale.

For example, if the encoding is 16-bit PCM (PCM-16), each sample uses 16 bits, so an amplitude can take any of 2 to the 16th power, i.e. 65,536, different values.

If you sample multiple channels (for example stereo, where the left and right channels are recorded separately), the amplitudes are interleaved: first an amplitude for channel 0, then one for channel 1, then channel 0 again, and so on. Once we have these amplitudes as raw data, we can move on to the next step.
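As a small illustration of that layout, here is a sketch of decoding interleaved 16-bit PCM back into per-channel amplitude values in the -1..1 range (the little-endian byte order is an assumption; it depends on the source):

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Reads interleaved 16-bit PCM (channel 0, channel 1, channel 0, ...) into
// one FloatArray per channel, scaled to the range -1..1.
fun decodePcm16(data: ByteArray, channelCount: Int): Array<FloatArray> {
    val buffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN)
    val frameCount = data.size / (2 * channelCount) // 2 bytes per amplitude
    val channels = Array(channelCount) { FloatArray(frameCount) }
    for (frame in 0 until frameCount) {
        for (channel in 0 until channelCount) {
            channels[channel][frame] = buffer.short / 32768f
        }
    }
    return channels
}

To take that next step, we first need to understand what sound actually is: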

The sound we hear is the result of objects vibrating: a person's vocal cords, the metal strings of a guitar, the body of a xylophone. When not disturbed by a specific sound vibration, air molecules simply move around randomly.

From Digital Sound and Music

When an A tuning fork is struck, it vibrates a very specific 440 times per second (440 Hz). This vibration travels through the air to the eardrum, which resonates at the same frequency, and the brain interprets it as the note A.

In PCM, this can be expressed as a sine wave repeating 440 times per second. The height of the wave does not change the note; it represents the amplitude, which in layman's terms is how loud it sounds to your ears.
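To make that concrete, here is a small sketch that produces one second of the note A as PCM-16 samples (the 44,100 Hz sample rate and 50% amplitude are arbitrary example values):

import kotlin.math.PI
import kotlin.math.sin

// One second of the note A (440 Hz) as 16-bit PCM samples.
fun noteA(sampleRate: Int = 44100, amplitude: Double = 0.5): ShortArray =
    ShortArray(sampleRate) { i ->
        val t = i.toDouble() / sampleRate // time in seconds
        (amplitude * Short.MAX_VALUE * sin(2 * PI * 440 * t)).toInt().toShort()
    }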

But when you listen to music, there is usually more than just the note A playing (however much you might wish otherwise); there is a wealth of instruments and sounds, resulting in a PCM pattern that is meaningless to the human eye. It is, in fact, a combination of a huge number of sine waves of different frequencies and amplitudes vibrating together.

Even a very simple PCM signal, such as a square wave, turns out to be quite complex when deconstructed into its component sine waves:

A square wave deconstructed into approximating sines and cosines (source: here).

Fortunately, there are algorithms that do this deconstruction for us; this is the Fourier transform. As the animation above shows, the signal is deconstructed into a combination of sine and cosine waves. A cosine is basically a delayed sine wave, but having both in the algorithm is very useful: otherwise we could not produce a value at point zero, because every sine wave starts at zero, and no matter what you multiply zero by, you still get zero.

One of the algorithms that performs the Fourier transform is the fast Fourier transform (FFT). When we run an FFT on our PCM sound data, we get back a list of magnitudes, one for each sine wave. These waves are the frequencies of the sound: at the beginning of the list we find the low frequencies (bass), and at the end the high frequencies (treble).
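To illustrate what the transform produces, here is a naive discrete Fourier transform of a single bin (the FFT computes all bins at once, far more efficiently); bin k corresponds to the frequency k * sampleRate / N:

import kotlin.math.PI
import kotlin.math.cos
import kotlin.math.sin
import kotlin.math.sqrt

// Magnitude of frequency bin k for N samples: correlate the signal with a
// sine and a cosine of that frequency and combine the two contributions.
// An FFT gives the same magnitudes in O(N log N) instead of O(N^2).
fun dftMagnitude(samples: FloatArray, k: Int): Float {
    val n = samples.size
    var re = 0.0
    var im = 0.0
    for (i in 0 until n) {
        val angle = 2 * PI * k * i / n
        re += samples[i] * cos(angle)
        im += samples[i] * sin(angle)
    }
    return sqrt(re * re + im * im).toFloat()
}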

So, by drawing a bar chart in which each bar's height is determined by the magnitude of its frequency, we get the visualizer we want.

The technical implementation

Now back to Android. First, we need the audio's PCM data. To get it, we can attach an AudioProcessor to our ExoPlayer instance; it receives every audio byte before passing it on. An AudioProcessor can also modify the audio, for example change the volume or filter out channels, but we won't do that here.

private val fftAudioProcessor = FFTAudioProcessor()

val renderersFactory = object : DefaultRenderersFactory(this) {
    override fun buildAudioProcessors(): Array<AudioProcessor> {
        val processors = super.buildAudioProcessors()
        return processors + fftAudioProcessor
    }
}
player = ExoPlayerFactory.newSimpleInstance(this, renderersFactory, DefaultTrackSelector())

In the queueInput(inputBuffer: ByteBuffer) method, we receive the bytes, bundled together into frames.

These bytes may come from more than one channel, so I take the average of all the channels and forward only that mono value for processing.
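A sketch of that averaging step, assuming signed 16-bit samples; processSample is a hypothetical callback that collects values for the FFT, not an ExoPlayer API:

import java.nio.ByteBuffer

// Averages the interleaved channels of each frame into a single mono value
// and forwards it for further processing.
fun averageChannels(inputBuffer: ByteBuffer, channelCount: Int, processSample: (Float) -> Unit) {
    while (inputBuffer.remaining() >= 2 * channelCount) {
        var sum = 0f
        repeat(channelCount) {
            sum += inputBuffer.short / 32768f // 2 bytes per amplitude, scaled to -1..1
        }
        processSample(sum / channelCount)
    }
}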

To compute the Fourier transform, I used the Noise library. The transformation requires a list of floats with a given sample size. The sample size should be a power of 2, and I chose 4096.

Increasing this number yields finer-grained data, but it takes longer to compute and updates less frequently (an update can only happen once for every X bytes of sound data, where X is determined by the sample size). If the data is PCM-16, two bytes make up one amplitude. The absolute scale of the floating-point values is not important, because the output scales with the input: if you submit values between 0 and 1, the results will also be between 0 and 1 (no sine wave needs an amplitude larger than the input's).
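Here is a sketch of that batching logic: mono samples are collected until a full window of 4096 is available, and only then is the FFT run. The fft parameter stands in for whichever implementation is used (for example one backed by the Noise library); its exact signature here is an assumption:

// Collects samples and runs the FFT once per full window.
class FftBatcher(
    private val fft: (FloatArray) -> FloatArray,    // placeholder for the real FFT call
    private val onMagnitudes: (FloatArray) -> Unit, // receives the resulting magnitudes
    private val sampleSize: Int = 4096
) {
    private val window = FloatArray(sampleSize)
    private var filled = 0

    fun add(sample: Float) {
        window[filled] = sample
        filled++
        if (filled == sampleSize) {
            onMagnitudes(fft(window.copyOf()))
            filled = 0
        }
    }
}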

The result is also a list of floats. We could use these frequencies to draw 4096 bars right away, but that would be impractical.

Let’s look at how we can improve the resulting data.

The spectrum

First, we can group these frequencies into a smaller number of bands. Suppose we divide the 0-20 kHz spectrum into 20 bands, each spanning 1 kHz.
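A minimal sketch of that equal-width grouping, assuming the input is an array of per-bin magnitudes spanning 0-20 kHz evenly:

// Groups per-bin magnitudes into `bandCount` equal-width bands by averaging
// the bins that fall into each band.
fun equalBands(magnitudes: FloatArray, bandCount: Int = 20): FloatArray {
    val bandSize = magnitudes.size / bandCount
    return FloatArray(bandCount) { band ->
        var sum = 0f
        for (i in band * bandSize until (band + 1) * bandSize) {
            sum += magnitudes[i]
        }
        sum / bandSize
    }
}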

Twenty bars are much easier to draw than 4096, and we don't really need more. If you draw the values now, though, you can see that only the leftmost bars move significantly.

This is because the range of frequencies used in music is roughly 20-5000 Hz, and listening to a pure 10 kHz tone is quite unpleasant. The higher frequencies still matter (strip them out and the music sounds duller and duller), but their amplitudes are very small compared to the lower frequencies.

If you look at a studio equalizer, you'll see that its frequency bands are also unevenly distributed, with the lower part of the spectrum typically taking up 80-90% of the bands:

With this in mind, I made the bands variable-width, assigning more bands to the lower frequencies.
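One way to sketch such variable-width bands is to let the band edges grow exponentially with the band index, so most bands end up in the lower part of the spectrum (the growth exponent is an arbitrary tuning choice, not a value from the article):

import kotlin.math.pow

// Splits the magnitude bins into `bandCount` bands whose widths grow with
// frequency, so the lower frequencies get most of the bands.
fun exponentialBands(magnitudes: FloatArray, bandCount: Int = 20, exponent: Double = 2.0): FloatArray {
    val result = FloatArray(bandCount)
    var start = 0
    for (band in 0 until bandCount) {
        // Upper bin index of this band, growing as ((band + 1) / bandCount) ^ exponent.
        val end = (((band + 1).toDouble() / bandCount).pow(exponent) * magnitudes.size).toInt()
            .coerceAtLeast(start + 1)
            .coerceAtMost(magnitudes.size)
        var sum = 0f
        for (i in start until end) sum += magnitudes[i]
        result[band] = if (end > start) sum / (end - start) else 0f
        start = end
    }
    return result
}

Here's what this looks like with variable-width bands, and it's already much better: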

That seems good, but there are still two problems:

First, the bars on the right seem to move a little too much. This is because our sampling is not perfect: it introduces an artifact called spectral leakage, where a frequency smears into the adjacent frequencies. To reduce this smearing, we can apply a window function, which emphasizes the part of the sample window we are interested in and damps down the rest. There are different kinds of windows; I used a Hamming window, which emphasizes the middle of the window and suppresses both ends.
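A sketch of applying a Hamming window to the sample window before running the FFT (0.54 and 0.46 are the standard Hamming coefficients):

import kotlin.math.PI
import kotlin.math.cos

// Multiplies each sample by the Hamming window value
// w(i) = 0.54 - 0.46 * cos(2 * PI * i / (N - 1)),
// which keeps the middle of the window and damps its edges,
// reducing spectral leakage.
fun applyHammingWindow(samples: FloatArray): FloatArray =
    FloatArray(samples.size) { i ->
        val w = 0.54 - 0.46 * cos(2 * PI * i / (samples.size - 1))
        (samples[i] * w).toFloat()
    }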

Finally, there is one small problem that doesn't show up in the GIF above but is immediately noticeable when listening to the music: the bars jump too early, before you expect them to.

The unexpected buffer

This out-of-sync behavior happens because an ExoPlayer AudioProcessor receives the data before it is passed to the AudioTrack, and the AudioTrack has its own buffer. As a result, the visualization runs ahead of the audio that is actually playing.

The solution was to copy the buffer-size calculation code from ExoPlayer, so that the buffer in my AudioProcessor is exactly the same size as the AudioTrack's.

I put incoming bytes at the end of this buffer and only process bytes from its beginning (a FIFO queue), which delays the FFT by exactly the amount I wanted.
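A sketch of that FIFO delay, assuming the AudioTrack buffer size in bytes is already known (how it is calculated is copied from ExoPlayer internals and omitted here):

import java.nio.ByteBuffer

// Delays the bytes fed to the FFT by `audioTrackBufferSize` bytes, so the
// visualization stays in sync with what the AudioTrack is actually playing.
// Incoming bytes are appended at the tail; only bytes older than the buffer
// size are taken from the head and processed.
class FftDelayBuffer(private val audioTrackBufferSize: Int) {
    private val queue = ArrayDeque<Byte>()

    fun queueInput(input: ByteBuffer, processByte: (Byte) -> Unit) {
        while (input.hasRemaining()) queue.addLast(input.get())
        while (queue.size > audioTrackBufferSize) {
            processByte(queue.removeFirst())
        }
    }
}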

The final result

I created a code repository that shows off my FFT processor by playing an online radio stream and drawing it with a visualizer view I wrote. It certainly isn't ready to drop into a production app as-is, but if you're looking for a visualizer for your music app, it provides a solid foundation.


About the translator

Hello, I'm the translator of this article. If you found it valuable, feel free to give it a ❤️, or follow me on my blog or GitHub.

  • My Android learning system
  • About article correction
  • About Paying for Knowledge
  • About the Reflections series