Aside from Google’s Magenta project, music and AI is not a highly explored field (especially anything other than music generation, which, in my opinion, is much easier than music analysis). From my own experience in AI and as a classical musician, I have some insights.
First, coding with music is HARD. It’s not something you can teach yourself in one night, as I’ve unfortunately learned over the past couple of weeks. It involves advanced signal processing and neural networks that learn from the resulting spectrograms, both of which require lots of data manipulation before you can even apply your model.
Spectrograms are used in nearly all deep learning algorithms that deal with music, since most models cannot analyze raw sound directly. These graphs plot frequency against time and use color to show the strength of the signal at each frequency. However, creating a spectrogram from an audio file is far more difficult than simply downloading a dataset from Kaggle.
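To make that concrete, here’s a minimal sketch in Python with NumPy of how a magnitude spectrogram can be computed from raw samples. This is an illustration, not the exact pipeline any real project uses; the frame size, hop length, and sample rate here are placeholder values I chose for the example:

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256, sr=16000):
    """Magnitude spectrogram: slice audio into overlapping frames,
    window each frame, and take the FFT of each one."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames *= np.hamming(frame_len)             # taper the frame edges
    spec = np.abs(np.fft.rfft(frames, axis=1))  # magnitude per frequency bin
    return 20 * np.log10(spec + 1e-10)          # convert to decibels

sr = 16000
t = np.arange(sr) / sr                          # one second of audio
tone = np.sin(2 * np.pi * 440 * t)              # a 440 Hz sine (concert A)
S = spectrogram(tone, sr=sr)                    # rows = time, columns = frequency
peak_bin = S.mean(axis=0).argmax()
print(peak_bin * sr / 512)                      # strongest frequency ≈ 440 Hz
```

Each row of the resulting array is one slice of time, and each column is one frequency bin, which is exactly the time-versus-frequency grid a spectrogram image colors in.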
First, you might have to convert the audio to a WAV file for ease of computing. From there, you have to make sure your bit depth aligns with the model that reads your file. I’ve made the mistake of not reading the documentation and trying to fit a 24-bit WAV file into a model that only takes 8-, 16-, and 32-bit files. (Or you could convert the bit depth manually, yet another thing to learn along the way!) After that, you have to prepare the audio data for graphing. This includes, but is definitely not limited to: separating frequencies using an FFT, which I covered in my previous blog; using the mel scale to space frequencies the way the human ear perceives them; emphasizing certain frequencies; and manipulating the resulting data so it can actually be graphed in color. Finally, you have to add all of the spectrograms to an array, labeling each one as you go.
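A few of these preparation steps can be sketched in just a handful of lines. This is a hedged illustration in Python with NumPy, not my actual code: `to_float`, `pre_emphasize`, and `hz_to_mel` are hypothetical helper names, though the pre-emphasis filter and the mel formula themselves are the standard ones from the speech-processing literature:

```python
import numpy as np

def to_float(samples, bit_depth):
    """Normalize integer PCM samples to floats in [-1, 1], so that
    8-, 16-, 24-, or 32-bit files all look the same downstream."""
    return samples.astype(np.float64) / (2 ** (bit_depth - 1))

def pre_emphasize(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def hz_to_mel(f):
    """The mel scale compresses high frequencies, roughly matching
    how listeners perceive pitch differences."""
    return 2595 * np.log10(1 + f / 700)

pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)  # fake 16-bit samples
audio = to_float(pcm, bit_depth=16)
emphasized = pre_emphasize(audio)
print(hz_to_mel(1000))   # 1000 Hz lands near 1000 mels by design
```

The bit-depth headaches above are exactly what the `to_float` step papers over: once every file is normalized to the same float range, the rest of the pipeline no longer cares whether the source was 8-, 16-, or 24-bit.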
Haytham Fayek from the Royal Melbourne Institute of Technology (RMIT) has a good blog post that goes more in-depth about the process required for the inputs: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html.
Preparing your data takes expertise in signal processing and a knowledge of calculus and linear algebra (for example, Fourier transforms and Hamming windows). From there, you end up with spectrograms, which are just images. These are relatively easy to perform machine learning on; however, music itself comes with many pitfalls.
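As a small, self-contained illustration of why those Hamming windows earn their place in the math: a tone whose frequency falls between FFT bins smears its energy across the whole spectrum (“spectral leakage”), and windowing suppresses most of that smearing. Everything here is a made-up example, not code from any real project:

```python
import numpy as np

N = 256
t = np.arange(N)
# A tone at 10.5 cycles per frame: its frequency falls exactly between
# two FFT bins, the worst case for spectral leakage.
x = np.sin(2 * np.pi * 10.5 * t / N)

rect = np.abs(np.fft.rfft(x))                  # no window (rectangular)
hamm = np.abs(np.fft.rfft(x * np.hamming(N)))  # Hamming-windowed

def leakage(spec, peak_width=4):
    """Fraction of spectral energy outside the bins nearest the true peak."""
    p = spec.argmax()
    mask = np.ones_like(spec, dtype=bool)
    mask[max(p - peak_width, 0) : p + peak_width + 1] = False
    return spec[mask].sum() / spec.sum()

print(leakage(rect), leakage(hamm))  # the Hamming window leaks far less
```

Without the window, a single note would bleed into frequency bins it has nothing to do with, which is exactly the kind of mess you don’t want a neural network learning from.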
There are so many subtleties to music, ones not easily captured by a spectrogram. How does a machine learning algorithm account for tempo? Volume? Tone? Expression? Articulation? Something easily picked up by an amateur musician, like having a forced sound vs. a loud sound, is incredibly difficult for a computer to notice. Something immediately noticed by an advanced musician, like a warm sound vs. cold sound, is impossible for a computer to detect (as of right now). Not to mention the video aspect of music: how would a machine learning algorithm deal with visual technique? How could it assess a musical performance without the emotions that are fundamental to the relationship between audience and musician?
Another example of potential issues is orchestral pieces or pieces with multiple parts. While designing a project, I did some research on existing companies that transcribe music. A popular one, Chordify, is limited to piano, guitar, and ukulele. It only transcribes music for a single part, to which you play along. Humans, on the other hand, are capable of hearing an eight-bar section with THREE parts (or an instrument that isn’t one of the three mentioned) and playing each of them back perfectly. Chordify also only takes pre-recorded files (e.g., a YouTube video, an .mp4 from your computer, or a song from SoundCloud), yet another way that humans continue to dominate the field of music.
One more issue is one of equity/diversity. The “music” I’ve discussed in this piece has been solely about Western classical music. What happens when we introduce music from other cultures, like India, that doesn’t use the 12-tone system? Or modern music, like pop or rap? Genres of music are so different that an algorithm trained only on a certain type would fail at analyzing others.
And finally, creating a music-related algorithm wouldn’t be an efficient use of time. For example, a tool like Grammarly for music (one that analyzes a song or piece and offers feedback) would take an incredible amount of resources to build. Developers of this model would need expertise in both computer science and music, each of which takes years of practice to become proficient in. Instead, there’s an easier method: you could simply hire a music teacher for an hour to tell you the same thing.
I believe that, for these reasons, music and AI has largely been limited to music generation. I won’t get into the specifics now, but for a computer, generating music is much easier than analyzing it. Which brings me to my final point: at least for the time being, I believe that music is safe from huge advances in AI, unlike NLP. Maybe in twenty years I’ll have to recant this, but there are so many difficulties inherent in music that current AI is years behind. So, hold on to your instruments and keep practicing, because human musicians will be relevant for years to come!
See you next time, and be sure to stay tuned for my next blog post on music and AI!
Ally is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.