So I decided to recreate Shazam's song recognition algorithm, out of curiosity. Okay, fine, it was mostly out of sheer desperation, because let's be real: landing a junior dev job these days feels harder than breaking into a bank. If you're a developer with less than 3 years of experience, you get what I mean, and if you don't, well, I genuinely envy you. But Shazam has always blown my mind. You hold up your phone, it listens for a few seconds, and then boom, it tells you exactly what song is playing in the background. Magic, right? Well, tech magic. And I thought, if I can reverse engineer this sorcery, maybe hiring managers will finally see that I'm serious about my engineering career.

But how do you even start building something like this? I was as clueless as anyone at the beginning, so I did what any good engineer would do: I researched. Or rather, I Googled. I went down a rabbit hole reading every article, blog post, and research paper I could find about Shazam's technology, and here's the simplified version of what I learned.

The magic of Shazam lies in something called audio fingerprinting. It's like creating a unique DNA profile for every song, and this fingerprint lets Shazam recognize a song based on a small snippet of audio. First, the song is converted into something called a spectrogram, which captures the frequency content of the audio over time, showing which frequencies are present at each moment and how intense they are. From the spectrogram, Shazam pinpoints frequencies with high intensity, the ones representing the notes or beats that stand out in the song. These frequencies, known as peaks, are like the song's unique features that make it recognizable. Shazam encodes the relationships between these peaks into unique hashes and links them to the song's metadata, like its title and artist, and the exact time each peak occurred in the spectrogram. Thousands of these hashes come together to form the song's fingerprint, which is then stored in a database where each hash serves as a quick reference point for searches.

When you use Shazam, the app records a short audio snippet and runs it through the same fingerprinting process. This time, though, the hashes are not saved to the database. Instead, they're used to query the database for matches, and the results are grouped by song ID, isolating potential candidates. At first glance, it might seem logical to assume the song with the most matches is the correct one, but that's not quite the case. Instead, Shazam analyzes the time coherence of the candidates, evaluating how well the snippet's timestamps align with those of each candidate. The song with the highest time coherence is ultimately identified as the correct match. And voilà, that's basically how Shazam works. Pretty straightforward, right?

Once I wrapped my head around the theory, it was time to get coding. I used Golang to code the algorithm from scratch, with no libraries, because why take the easy route when you can suffer for the sake of learning? But for some of the heavy lifting, like handling audio files, I had to call in reinforcements. If you'd like to check out my implementation, I've added the GitHub link to the project in the description below, along with a link to a demo where you can see the algorithm in action.

The first challenge was transforming raw audio into a recognizable fingerprint. To tackle this, I wrote three key functions that the raw audio data has to go through sequentially to generate the fingerprint.

The first function takes the raw audio data as input and converts it to a spectrogram. This meant transforming the audio from the time domain into the frequency domain so I could visualize its frequency components over time. To achieve this, I split the audio into overlapping segments using the sliding window technique, so I could analyze and process smaller parts without missing any key features. But this slicing introduces artificial jumps at the edges of each segment, a problem known as spectral leakage. It's like trying to cut bread with a dull knife: it gets messy at the edges. To smooth out these jumps, I apply the Hamming window function, which tapers off the signal at the edges, creating a smooth transition and reducing distortions. Finally, I apply the fast Fourier transform (FFT) to each segment. The FFT is the essential tool that transforms the audio from the time domain into the frequency domain, unlocking its frequency components. This process produces the spectrogram: a two-dimensional matrix of complex numbers where rows represent frequencies, columns represent time intervals, and the values indicate intensity.

But here's the thing: spectrograms can be massive, especially for long audio samples. To keep things efficient, right before I split the audio and run the FFT, I downsample the audio and limit the frequency range to the part where most musical information lies, roughly 20 Hz to 5 kHz. This reduces processing overhead and improves the signal-to-noise ratio, making it easier to extract meaningful features later.
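To make that first function a bit more concrete, here's a minimal sketch of the windowing-plus-FFT step in Go. The function names, the hand-rolled recursive FFT, and the window and hop sizes are just my shorthand for this explanation (the downsampling and frequency cut are left out), so don't treat it as the exact code from the repo:

```go
package fingerprint

import (
	"math"
	"math/cmplx"
)

// hammingWindow tapers the edges of a segment to reduce spectral leakage.
func hammingWindow(n int) []float64 {
	w := make([]float64, n)
	for i := range w {
		w[i] = 0.54 - 0.46*math.Cos(2*math.Pi*float64(i)/float64(n-1))
	}
	return w
}

// fft is a minimal recursive radix-2 Cooley-Tukey transform.
// len(x) must be a power of two.
func fft(x []complex128) []complex128 {
	n := len(x)
	if n == 1 {
		return x
	}
	even := make([]complex128, n/2)
	odd := make([]complex128, n/2)
	for i := 0; i < n/2; i++ {
		even[i] = x[2*i]
		odd[i] = x[2*i+1]
	}
	even, odd = fft(even), fft(odd)
	out := make([]complex128, n)
	for k := 0; k < n/2; k++ {
		t := cmplx.Exp(complex(0, -2*math.Pi*float64(k)/float64(n))) * odd[k]
		out[k] = even[k] + t
		out[k+n/2] = even[k] - t
	}
	return out
}

// spectrogram slices the samples into overlapping windows, applies the
// Hamming window to each slice, and runs an FFT on it. Each entry of the
// result holds the FFT of one time slice.
func spectrogram(samples []float64, windowSize, hopSize int) [][]complex128 {
	window := hammingWindow(windowSize)
	spec := make([][]complex128, 0)
	for start := 0; start+windowSize <= len(samples); start += hopSize {
		segment := make([]complex128, windowSize)
		for i := 0; i < windowSize; i++ {
			segment[i] = complex(samples[start+i]*window[i], 0)
		}
		spec = append(spec, fft(segment))
	}
	return spec
}
```

In this sketch each entry of the result is the FFT of one time slice, so it's effectively the transpose of the rows-are-frequencies layout I described above; a typical call would be something like spectrogram(samples, 1024, 512).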
The next step was to extract the standout frequencies, or peaks, that make the audio uniquely identifiable from the spectrogram. This is where the second function in the process comes into play. To extract these peaks effectively, I applied a filter of six logarithmic frequency bands. These bands are designed to mimic the human ear's sensitivity to sound: human ears struggle to perceive low frequencies (below 500 Hz) but are naturally more attuned to mid to high frequencies (500 to 2,000 Hz), and as a result, low frequencies are often amplified in music during production. So, for each segment of the spectrogram, I identified the loudest frequency within each band. This approach ensures a balanced representation across the spectrum and prevents low-frequency content from dominating. Next, I calculated the average magnitude of these six strongest frequencies and used it as a dynamic threshold. Frequencies below this threshold were discarded, ensuring that weak frequencies from any band were not retained simply because they were the strongest within their band. Doing this not only preserves essential peaks but also reduces data size, optimizing the process for efficient matching down the line.

With the peaks extracted from the spectrogram, we now have a collection of data points highlighting the audio's most distinctive features. But these peaks on their own aren't enough for song identification. To transform them into something meaningful and searchable, we need to encode the relationships between them in a compact, queryable format. This brings us to the third function: the fingerprinting process. Here, each peak, called the anchor, is treated as a reference point. For every anchor, I identify five nearby peaks, or targets, within a fixed range, forming what's known as the target zone. This relationship forms the basis of the fingerprint. For each anchor-target pair, I create a unique hash by encoding the frequencies of the anchor and target peaks, along with their time difference, into a compact 32-bit integer. This hash is small enough to store efficiently but still packs all the information needed to uniquely identify the audio.
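Here's a rough sketch of what that packing might look like in Go. The 9/9/14 bit split and the quantization helper are assumptions I'm making for illustration, not necessarily the exact layout used in the repo:

```go
package fingerprint

const (
	maxFreqHz = 5000 // upper end of the 20 Hz - 5 kHz band kept earlier
	freqBits  = 9    // 2^9 = 512 frequency bins
)

// Peak is one standout frequency extracted from the spectrogram.
type Peak struct {
	TimeMs uint32  // when the peak occurs, in milliseconds
	Freq   float64 // frequency in Hz
}

// quantizeFreq maps a frequency in [0, maxFreqHz] onto a 9-bit bin so it
// fits into the packed hash below.
func quantizeFreq(freq float64) uint32 {
	bin := uint32(freq / maxFreqHz * float64((1<<freqBits)-1))
	if bin > (1<<freqBits)-1 {
		bin = (1 << freqBits) - 1
	}
	return bin
}

// createHash packs an anchor/target pair into one 32-bit integer:
// 9 bits of anchor frequency, 9 bits of target frequency, and 14 bits of
// time delta in milliseconds.
func createHash(anchor, target Peak) uint32 {
	anchorBin := quantizeFreq(anchor.Freq)
	targetBin := quantizeFreq(target.Freq)
	// Targets sit after the anchor in time, so the delta stays positive.
	deltaMs := (target.TimeMs - anchor.TimeMs) & 0x3FFF // 14 bits

	return anchorBin<<23 | targetBin<<14 | deltaMs
}
```

With a hash like this, matching later is just a matter of computing the same hashes for a snippet and seeing which songs they point to.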
This data is temporarily stored in a hashmap, where the hash serves as the key and the value is an array containing the anchor's timestamp and the song ID. Since different songs may share similar features, the same hash can be generated for multiple songs; when this happens, their details are appended to the array under that hash. Thousands of these hashes are generated to form a complete fingerprint, which is then saved to the database.

Once the fingerprinting process was up and running, the next step was to upload songs and find matches. That's when I shifted gears and started building the front end. Now, I could have used plain old vanilla JS, but after wrestling with the fingerprinting process for so long, I decided I deserved a break, so I went with React instead. For communication between the front end and the server, I used WebSockets. Since the main goal of this project was to implement and explore the algorithm rather than build a commercial app, I kept things simple: instead of automatically scanning the world for songs, I had users upload the songs they wanted to match. To make this process seamless, I integrated Spotify, allowing users to paste a link to a song, album, or playlist. Since Spotify doesn't allow direct downloads, I use their API to fetch relevant data about the song, such as the title, artist, and album name. I then search for the song on YouTube using this data, and if a match is found, I download the audio from the YouTube video. Afterward, I assign a unique ID to the song, and all relevant data about it, including the YouTube video ID, is saved to a song database. Next, the song's audio data goes through the fingerprinting process, and the resulting fingerprints are stored in the fingerprint database.

Once the song uploading process was working okay, I was ready to take on the real challenge: matching songs from short audio snippets. This is where the magic happens. The process kicks off when a user records a short audio clip, either from their microphone or directly from their computer. The snippet is encoded in base64 format and sent to the server for processing. On the server side, the raw audio is decoded and processed to generate its fingerprint. The hashes in the fingerprint are extracted and used to query the fingerprint database for matches. Then I organize the retrieved results, mapping each song ID to its associated anchor times. To find the best match, I evaluate how well the audio snippet aligns with each candidate song. For each candidate, I calculate the absolute time difference between successive anchor pairs in both the snippet and the candidate, then take the absolute difference between these two values. If this difference falls within a tolerance of 100 milliseconds, the pair is marked as consistent, and the candidate's time coherence score is incremented by one.

After calculating the scores, I fetch metadata for the top-scoring candidates from the song database. For each match, I then create an object containing the song's metadata, the earliest anchor time, and the coherence score. Finally, I sort the matches in descending order by their scores and send them to the front end, where I use YouTube's iframe player to display the corresponding video for each match, starting at the exact timestamp where the match was found.
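To make that scoring step a bit more concrete, here's roughly how the time-coherence check could look in Go, assuming each candidate's matches have already been collected as snippet-time/song-time pairs. The struct and function names are mine; the 100-millisecond tolerance is the one mentioned above:

```go
package fingerprint

import "math"

// match pairs the time a hash occurred in the recorded snippet with the
// time the same hash occurs in a candidate song.
type match struct {
	SnippetTimeMs uint32
	SongTimeMs    uint32
}

const toleranceMs = 100.0

// coherenceScore counts how many successive matches move forward in the
// snippet and in the candidate song by (almost) the same amount of time.
func coherenceScore(matches []match) int {
	score := 0
	for i := 1; i < len(matches); i++ {
		snippetDelta := math.Abs(float64(matches[i].SnippetTimeMs) - float64(matches[i-1].SnippetTimeMs))
		songDelta := math.Abs(float64(matches[i].SongTimeMs) - float64(matches[i-1].SongTimeMs))

		// If the two gaps line up within the tolerance, this pair is
		// considered time-coherent.
		if math.Abs(snippetDelta-songDelta) <= toleranceMs {
			score++
		}
	}
	return score
}
```

The candidates with the highest scores are the ones whose metadata gets fetched, sorted, and sent back to the front end.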
This has definitely been the most challenging project I've worked on so far, and whether it helps me land a job or not, I'm glad I took on the challenge. Unfortunately, I can't do a proper demo here because copyright laws exist, but I've put together a demo video that showcases everything in action. You can find the link in the description, along with the GitHub repo, if you'd like to check out the code.