Have you ever been in a situation where you hear a catchy song playing somewhere, and you just have to know what it is? Apps like Shazam have made it incredibly easy to identify music by simply "listening" to a snippet through your device's microphone.
In this blog post, we'll delve into how to build a basic version of a Shazam-like application. We'll explore the underlying concepts of audio recognition without getting bogged down in overly technical jargon. By the end, you'll understand how audio fingerprinting works and how the different components come together to create a music recognition system.
At a high level, the app records a short sample of audio, fingerprints it, and queries a database of known songs to identify the track. The whole process happens in near real time with high accuracy. Here's how it works.
The diagram above illustrates the complete lifecycle of a song recognition request: from audio capture through fingerprint generation to final song identification.
Audio fingerprinting is a technique used to identify audio samples based on unique characteristics extracted from the audio signal. Think of it like a "digital fingerprint" for audio. Just as human fingerprints are unique identifiers, audio fingerprints help recognize and distinguish one piece of audio from another.
When you feed an audio clip into a recognition system, the system generates a fingerprint of that clip and compares it against a database of known fingerprints. If a match is found, the system can identify the song, artist, album, and other metadata associated with the audio.
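To make the matching idea concrete, here is a minimal sketch of how such a lookup might work in Python. The in-memory `fingerprint_db` dictionary, the hash format, and the vote threshold are all hypothetical simplifications for illustration, not the data model used by Shazam or by the project described here.

```python
from collections import Counter

# Hypothetical in-memory "database": fingerprint hash -> list of (song_id, time_offset)
fingerprint_db: dict[int, list[tuple[int, float]]] = {}

def lookup(sample_hashes: list[tuple[int, float]]) -> int | None:
    """Return the song_id whose stored fingerprints best match the sample, or None."""
    votes = Counter()
    for h, sample_time in sample_hashes:
        for song_id, song_time in fingerprint_db.get(h, []):
            # A real match produces many hits that share a consistent time offset.
            offset = round(song_time - sample_time, 1)
            votes[(song_id, offset)] += 1
    if not votes:
        return None
    (song_id, _offset), count = votes.most_common(1)[0]
    return song_id if count >= 5 else None  # arbitrary minimum number of aligned matches
```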
Creating an audio recognition system involves several key steps: preparing and normalizing the audio, transforming it into the frequency domain, detecting significant peaks, generating a fingerprint from those peaks, and matching that fingerprint against a database of known songs.
Let's explore each of these components.
Before we can analyze the audio, we need to ensure it's in a consistent and suitable format.
Audio recordings can vary in volume, quality, and format. Normalization adjusts the audio to a standard level so that the analysis is not affected by these variations.
By normalizing the sample width (bit depth), sample rate, and number of channels, we ensure that the audio data is consistent and ready for analysis.
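As a rough illustration, the normalization step could be done with the pydub library; the target format here (mono, 44.1 kHz, 16-bit) is an assumption for this sketch rather than a requirement of the project.

```python
from pydub import AudioSegment
from pydub.effects import normalize

def normalize_audio(source) -> AudioSegment:
    """Load an audio file (path or file-like object) and convert it to a consistent format."""
    audio = AudioSegment.from_file(source)
    audio = audio.set_channels(1)         # mono
    audio = audio.set_frame_rate(44100)   # 44.1 kHz sample rate
    audio = audio.set_sample_width(2)     # 16-bit samples
    return normalize(audio)               # bring the peak volume to a standard level
```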
Once we have normalized audio, we need to extract features that can be used to create a fingerprint.
The Fast Fourier Transform (FFT) is a mathematical algorithm that converts a time-domain signal (like audio) into its frequency components. Essentially, it tells us which frequencies are present in the audio and in what amounts.
By performing FFT on the audio samples, we can obtain a spectrum that represents the amplitude of various frequencies at different points in time.
Analyzing the audio in the frequency domain allows us to identify patterns and characteristics that are not easily observable in the time domain. By examining how the frequency content changes over time, we can extract meaningful features for fingerprinting.
When performing FFT, we divide the audio signal into small segments or "windows." To reduce artifacts and improve the accuracy of the frequency analysis, we apply a window function. The Hanning window is a commonly used function that tapers the signal at the beginning and end of each segment, minimizing discontinuities.
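In code, this windowed FFT (essentially a short-time Fourier transform) can be sketched with NumPy. The 1024-sample window and 50% overlap are illustrative choices, not necessarily the values used in the actual implementation.

```python
import numpy as np

def spectrogram(samples: np.ndarray, window_size: int = 1024) -> np.ndarray:
    """Split the signal into overlapping windows and take the FFT of each one.

    Returns a 2D array in which each row is the magnitude spectrum of one window.
    """
    hop = window_size // 2               # 50% overlap between consecutive windows
    window = np.hanning(window_size)     # Hanning window tapers each segment's edges
    frames = []
    for start in range(0, len(samples) - window_size, hop):
        segment = samples[start:start + window_size] * window
        frames.append(np.abs(np.fft.rfft(segment)))  # magnitude of each frequency bin
    return np.array(frames)
```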
After obtaining the frequency spectrum, the next step is to detect significant peaks.
Not all frequencies in the spectrum are equally important. We focus on frequencies that have a high amplitude compared to their neighbors, as these represent dominant tones in the audio.
To ensure that we capture the most significant peaks, we perform peak spreading. This involves enhancing the amplitude of a frequency bin if it is close to other high-amplitude bins. By spreading the peaks, we make them more distinguishable and easier to detect.
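Assuming the spectrogram frames from the previous sketch, peak spreading and detection might look something like this, with SciPy's maximum filter standing in for the spreading step and an arbitrary neighbourhood width:

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

def find_peaks(spectrum: np.ndarray, spread: int = 5) -> np.ndarray:
    """Return the indices of frequency bins that remain local maxima after spreading."""
    # Peak spreading: every bin takes on the maximum amplitude of its neighbourhood,
    # so bins sitting next to strong peaks are boosted and dominant tones stand out.
    spread_spectrum = maximum_filter1d(spectrum, size=spread)
    # A bin is a significant peak if it equals the local maximum and carries real energy.
    return np.where((spectrum == spread_spectrum) & (spectrum > 0))[0]
```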
With the significant peaks identified, we can now generate the audio fingerprint.
The fingerprint consists of a collection of the strongest frequency peaks along with their corresponding times. This collection captures the unique characteristics of the audio clip.
For efficiency, we categorize peaks into frequency bands, for example a low band, a couple of mid-range bands, and a high band.
By organizing peaks into bands, we can create a more robust fingerprint that's less sensitive to noise and other variations.
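A sketch of that banded selection could look like the following; the band boundaries are assumed values for illustration (narrower bands at low frequencies), and a real system would tune them carefully:

```python
import numpy as np

# Assumed band edges in FFT bin indices; low frequencies get narrower bands.
BANDS = [(0, 10), (10, 20), (20, 40), (40, 80), (80, 160), (160, 512)]

def fingerprint_frame(spectrum: np.ndarray, time: float) -> list[tuple[int, float]]:
    """Keep the strongest bin in each band, paired with the time it occurred."""
    peaks = []
    for low, high in BANDS:
        band = spectrum[low:high]
        if band.size == 0:
            continue
        strongest = low + int(np.argmax(band))  # loudest bin within this band
        peaks.append((strongest, time))
    return peaks
```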
The backend of our system is responsible for processing the audio data, generating fingerprints, and communicating with a song database to find matches.
When the backend receives an audio clip, it decodes and normalizes the recording so that it matches the format used to fingerprint the songs in the database.
It then follows the steps outlined in the algorithm breakdown to produce the fingerprint: windowed FFT, peak detection and spreading, and banded peak selection.
Once the fingerprint is generated, the backend compares it against the fingerprints stored in the song database.
Finally, the backend interprets the response from the database: if enough peaks line up with a stored song, it returns that song's title, artist, album, and other metadata; if not, it reports that no match was found.
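Tying these backend steps together, and assuming a small Flask server plus the hypothetical helpers from the earlier sketches (`normalize_audio`, `spectrogram`, `fingerprint_frame`, `lookup`), the request handling might look roughly like this:

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/recognize", methods=["POST"])
def recognize():
    """Accept an uploaded clip, fingerprint it, and return the best match (if any)."""
    clip = request.files["audio"]                        # audio blob uploaded by the frontend
    audio = normalize_audio(clip)                        # normalization (earlier sketch)
    samples = np.array(audio.get_array_of_samples(), dtype=float)
    frames = spectrogram(samples)                        # windowed FFT (earlier sketch)
    hashes = [peak
              for t, frame in enumerate(frames)          # t is the window index (a time proxy)
              for peak in fingerprint_frame(frame, float(t))]
    song_id = lookup(hashes)                             # compare against the song database
    if song_id is None:
        return jsonify({"match": False}), 404
    return jsonify({"match": True, "song_id": song_id})
```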
The frontend provides the user interface that allows users to interact with the application.
Users can record audio directly in the browser: the page requests access to the device's microphone and captures a short clip of whatever is playing nearby.
After recording, the clip is sent to the backend, and the frontend displays the identified song (or a "no match" message) once the response comes back.
Let's walk through the entire process: a user taps record in the browser and captures a few seconds of audio; the clip is uploaded to the backend, which normalizes it, computes the windowed FFT, detects and spreads peaks, and assembles a fingerprint; that fingerprint is compared against the database of known songs; and finally the matching song's details (or a "no match" result) are returned to the frontend and shown to the user.
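The one piece not shown above is how the database of known songs gets populated in the first place. Reusing the same hypothetical helpers from the earlier sketches, indexing a song could look like this:

```python
import numpy as np

def index_song(song_id: int, path: str) -> None:
    """Fingerprint a full song and store its peaks in the in-memory database."""
    audio = normalize_audio(path)
    samples = np.array(audio.get_array_of_samples(), dtype=float)
    for t, frame in enumerate(spectrogram(samples)):
        for h, time in fingerprint_frame(frame, float(t)):
            fingerprint_db.setdefault(h, []).append((song_id, time))
```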
Building a Shazam clone is an exciting project that combines signal processing, algorithm design, and user interface development. By understanding audio fingerprinting and the steps involved in processing and matching audio data, we've created a system that can identify songs based on short audio clips.
While our clone is a simplified version, the concepts and techniques used are foundational to many audio recognition systems. By refining the algorithms and expanding the song database, this system could be enhanced to provide more accurate and faster results.
The source code for this Shazam clone is available here. You can explore the implementation details, experiment with the code, and even contribute to its development.
*Note: This blog post provides a high-level overview of how a Shazam-like application works; the code snippets shown are simplified sketches. For the full implementation and further technical explanations, please refer to the GitHub repository.*