Let’s Talk About FFTs and Mel-Spectrograms

ally b
4 min read · Nov 15, 2020


For a long time, music and AI were considered separate, opposites on the spectrum of human creation. How could such a benchmark of human creativity, originality, and humanity be translated into 1s and 0s and analyzed by a computer? How could a computer compose music, something that takes humans years to master? Because of this cognitive dichotomy, music and AI have begun to intersect only in the last few years. A quickly growing topic of interest, the music and AI field has projects that transcribe sheet music, classify music as certain genres, and create new music in the style of certain composers.

Some of the more interesting projects are part of Google’s Magenta project. Its home page features a sleek layout and a user-friendly grid of images, each with an associated mini-experiment. Magenta also offers TensorFlow-based libraries in both Python and JavaScript, which aligns with its mission of being “an open source research project exploring the role of machine learning as a tool in the creative process.”

For reference, here is the link to the Magenta project: https://magenta.tensorflow.org

One of their featured mini-projects uses two neural networks (a CNN and an LSTM, which is a type of RNN) to transcribe piano into sheet music for free, something that costs money with another company called Chordify. Magenta’s Dual-Objective Piano Transcription, published in February 2018, is useful for transcribing improvisation and for composing, and is just interesting in general.

I’ll be using their model as an example of a common theme in music-related AI projects: taking audio input, running an FFT on it, and converting the frequencies to a spectrogram or Mel-Spectrogram. The piano transcription model, named Onsets and Frames, takes the live input of the piano and converts it to frequencies (think hertz) using a Fast Fourier Transform. This algorithm is instrumental in identifying the individual frequencies in a file. Essentially, a regular Fourier Transform recreates a function by adding together scaled copies of simpler functions, typically sines and cosines of different frequencies. Each sine or cosine coefficient tells you how strongly the corresponding frequency is present. This rests on Fourier’s Theorem: a periodic function can be expressed as a series of sine and cosine terms, provided the function is reasonably well-behaved (for example, continuous).
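To make "each coefficient corresponds to a frequency" concrete, here is a minimal sketch in plain Python. It is not Magenta's code: the `dft` helper, the 8,000 Hz sample rate, and the two-tone test signal are all assumptions made up for illustration, and the transform is the slow textbook version rather than a real FFT.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform, O(n^2). For illustration only."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Assumed values: with 400 samples at 8,000 Hz, each bin is 20 Hz wide,
# so 440 Hz and 880 Hz fall exactly on bins 22 and 44.
SAMPLE_RATE = 8000
N = 400

# Hypothetical test signal: a 440 Hz tone plus a quieter 880 Hz tone.
signal = [
    math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 880 * t / SAMPLE_RATE)
    for t in range(N)
]

spectrum = dft(signal)

# Only the first half of the bins is needed for a real-valued signal.
magnitudes = [abs(c) for c in spectrum[: N // 2]]

# Pick the two strongest bins and convert bin index -> frequency in hertz.
strongest = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:2]
peak_frequencies = sorted(k * SAMPLE_RATE / N for k in strongest)
print(peak_frequencies)
```

The two strongest bins land on the two tones that were mixed into the signal, which is the sense in which the transform "finds" the frequencies in a recording.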

A Discrete Fourier Transform is similar, but requires less similarity between the original function and its recreation from sines and cosines: the two only have to agree at a certain, discrete set of points. The Fast Fourier Transform is an efficient algorithm for computing a Discrete Fourier Transform or an Inverse Discrete Fourier Transform. The discrete version is the one that fits audio because, even though a typical audio sample rate is 44,100 samples a second, we still don’t know the data in between each sample. An IDFT just reverses the process: given the coefficients of the sine and cosine terms, it will spit out the original data. Common FFT algorithms include the Cooley-Tukey algorithm, the Winograd FFT algorithm, and Rader’s algorithm.
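As a rough sketch of the DFT/IDFT round trip, here is a small recursive radix-2 Cooley-Tukey FFT in plain Python. The function names and the eight made-up samples are assumptions for illustration. This simple variant only handles input lengths that are powers of two, and production libraries are far more optimized.

```python
import cmath


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT. len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])  # transform of the even-indexed samples
    odd = fft(values[1::2])   # transform of the odd-indexed samples
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def ifft(coefficients):
    """Inverse transform using the identity conj(fft(conj(x))) / n."""
    n = len(coefficients)
    conjugated = [complex(c).conjugate() for c in coefficients]
    return [c.conjugate() / n for c in fft(conjugated)]


# Hypothetical samples: two cycles of a sine wave over eight points.
samples = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]

coefficients = fft(samples)

# Going back through the inverse recovers the samples (up to rounding error).
recovered = [round(c.real, 9) for c in ifft(coefficients)]
print(recovered == samples)
```

The speedup comes from splitting the input into even and odd halves and reusing their transforms, which takes roughly n log n steps instead of the n² a direct DFT needs.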

From there, the frequencies are mapped onto a Mel-Spectrogram, which is common in music-related AI. Using images instead of frequencies allows the use of image-related neural networks like Convolutional Neural Networks, which Magenta’s Dual-Objective Piano Transcription uses.
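To show why a spectrogram can be treated as an image, here is a minimal sketch that slices a signal into frames and transforms each one, producing a 2D grid of magnitudes. Everything here (the helper names, the 1,600 Hz sample rate, the 64-sample frames, the two-tone signal) is assumed for illustration. Real pipelines typically use overlapping, windowed frames and a proper FFT.

```python
import cmath
import math


def frame_magnitudes(frame):
    """Magnitude of each frequency bin in one frame (naive DFT, first half only)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, frame_size):
    """One row per time frame, one column per frequency bin.

    Simplification: frames do not overlap and no window function is applied.
    """
    return [
        frame_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]


SAMPLE_RATE = 1600  # assumed, chosen so each bin is 1600 / 64 = 25 Hz wide
FRAME_SIZE = 64


def tone(frequency_hz, count):
    return [math.sin(2 * math.pi * frequency_hz * t / SAMPLE_RATE) for t in range(count)]


# Hypothetical signal: a 200 Hz tone followed by a 400 Hz tone.
samples = tone(200, 128) + tone(400, 128)

image = spectrogram(samples, FRAME_SIZE)

# The brightest "pixel" in each row moves when the pitch changes.
loudest_hz = [row.index(max(row)) * SAMPLE_RATE / FRAME_SIZE for row in image]
print(loudest_hz)
```

Each row is a moment in time and each column a frequency, so the grid has the same shape as a grayscale image, which is what lets a convolutional network read it.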

An example of a regular spectrogram

A Mel-Spectrogram uses the Mel Scale’s system of classifying frequency. Humans hear a given change in hertz much more clearly at low pitches than at high ones, so the Mel Scale compensates for this. It converts hertz into a unit where changing the value by a constant amount corresponds to a constant change in audible pitch to a listener, regardless of whether the pitch is low or high. This results in a non-linear scale. To create a Mel-Spectrogram, one must first convert the frequencies to the Mel Scale, then map them onto a spectrogram. Spectrograms use time as the x axis, frequency as the y axis, and colors to represent signal strength, that is, the amplitude of that particular frequency at that time.
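The conversion can be sketched with one commonly used formula, mel = 2595 · log10(1 + hz / 700). This is an assumption on my part: several variants of the Mel Scale exist, and different libraries may use different ones. The helper names below are hypothetical.

```python
import math


def hz_to_mel(frequency_hz):
    """Hertz -> mels, using one common variant of the formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Mels -> hertz, the inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz, high_hz, bands):
    """Edges of `bands` frequency bands that are equally wide in mels."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / bands
    return [mel_to_hz(low_mel + i * step) for i in range(bands + 1)]


# Hypothetical example: split 0 to 8,000 Hz into 8 bands of equal mel width.
edges = mel_band_edges(0.0, 8000.0, 8)
widths = [high - low for low, high in zip(edges, edges[1:])]

for low, high in zip(edges, edges[1:]):
    print(f"{low:8.1f} Hz to {high:8.1f} Hz")
```

Every band is the same width in mels, but the bands get wider in hertz as pitch rises. That is the non-linearity described above: the high end of the spectrum is squeezed together because listeners distinguish it less finely.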

The Mel Scale

From looking at other Magenta projects, beginning to research my own, and talking with a physics master’s student about FFTs, I’ve realized that converting audio to some form of image is vital to creating an effective algorithm, which is probably why many projects begin by running these algorithms. This is also why so many open source audio libraries have some version of FFTs, the Magenta API included.

This is my first blog post in an ongoing mini-series about music and AI, so stay tuned for further updates. Cheers, and until next time!

Ally is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.
