We in Vermont are blessed with a beautiful natural environment that supports a rich ecosystem of wildlife, including birds. Below are just a few of the birds that I regularly see in my neighborhood just outside of town.
[Figure 1: Images of 8 birds common in the Middlebury area (Red-tailed Hawk, Tufted Titmouse, American Crow, American Robin, Red-winged Blackbird, Northern Cardinal, Red-bellied Woodpecker, Black-capped Chickadee), all sourced from Wikimedia Commons.]
Apps like Merlin Bird ID from the Cornell Lab of Ornithology are great tools for identifying birds you see in the wild from photos, descriptions, or recordings of their calls. In this miniproject, you will build your own algorithm for predicting the species of a bird from a recording of its call.
This Miniproject is Intentionally Open-Ended and Challenging
Unlike in some of the previous miniprojects in which I’ve given you relatively explicit and scaffolded directions, I am leaving the details of this miniproject largely up to you.
Dataset
The dataset for this miniproject consists of a set of recordings of bird calls for the 8 birds pictured in Figure 1. These recordings were obtained from the xeno-canto database, which is a large and growing collection of bird call recordings contributed by birders around the world. The data set consists of:
The Metadata File
The file metadata.csv contains metadata about the recordings, including the species, common name, length, recordist, recording location (latitude, longitude, and altitude), id, sample rate (sr), and train/test split for each recording. Here’s an example of how to access the metadata:
```python
import pandas as pd

DATA_URL = "https://middcs.github.io/csci-0451-s26/data/birdcall"
metadata = pd.read_csv(f"{DATA_URL}/metadata.csv")
metadata.head()
```
|   | species | common name | length | recorded_by | lat | lon | alt | id | sr | split |
|---|---------|-------------|--------|-------------|-----|-----|-----|----|----|-------|
| 0 | melanerpes carolinus | red-bellied woodpecker | 0:15 | Bruce Lagerquist | 29.2923 | -82.6270 | 20 | 0 | 48000 | train |
| 1 | buteo jamaicensis | red-tailed hawk | 0:15 | Thomas Magarian | 48.2519 | -112.4131 | 1200 | 1 | 48000 | train |
| 2 | turdus migratorius | american robin | 0:06 | AUDEVARD Aurélien | 61.1658 | -150.0613 | 0 | 2 | 48000 | train |
| 3 | turdus migratorius | american robin | 0:22 | Scott Olmstead | 32.3022 | -110.5962 | 1300 | 3 | 44100 | train |
| 4 | corvus brachyrhynchos | american crow | 0:15 | David Vander Pluym | 47.6551 | -122.1120 | 10 | 4 | 22050 | train |
This file has been censored so that 20% of the species names have been replaced with “unknown” – these are the test set:
```python
metadata[metadata["species"] == "unknown"].head()
```
|     | species | common name | length | recorded_by | lat | lon | alt | id | sr | split |
|-----|---------|-------------|--------|-------------|-----|-----|-----|----|----|-------|
| 530 | unknown | unknown | 0:13 | Ed Pandolfino | 38.5321 | -121.0686 | 110 | 530 | 48000 | test |
| 531 | unknown | unknown | 0:16 | Antonio Xeira | 38.0695 | -84.3933 | 300 | 531 | 44100 | test |
| 532 | unknown | unknown | 0:21 | Antonio Xeira | 36.7652 | -88.1301 | 120 | 532 | 44100 | test |
| 533 | unknown | unknown | 0:25 | Greg Irving | 48.0510 | -123.1393 | 130 | 533 | 48000 | test |
| 534 | unknown | unknown | 0:27 | Rory Nefdt | 40.9658 | -73.6739 | 0 | 534 | 44100 | test |
The Audio Files
Each row of the metadata corresponds to an MP3 audio file which I have downloaded and hosted on the course GitHub repository. The following code block shows how to download a single MP3:
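A minimal sketch of such a download helper, using only the standard library. Note that the exact directory and filename scheme for the hosted MP3s (here assumed to be `recordings/<id>.mp3` under `DATA_URL`) is an assumption — check the course repository for the actual layout:

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://middcs.github.io/csci-0451-s26/data/birdcall"

def recording_url(recording_id: int) -> str:
    # ASSUMPTION: hosted files follow a recordings/<id>.mp3 naming scheme.
    # Verify the real path in the course GitHub repository.
    return f"{DATA_URL}/recordings/{recording_id}.mp3"

def download_recording(recording_id: int, out_dir: str = "recordings") -> Path:
    """Download one MP3 by its metadata id and return the local path."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"{recording_id}.mp3"
    if not path.exists():  # skip files we already downloaded
        urlretrieve(recording_url(recording_id), path)
    return path
```

Looping `download_recording` over `metadata["id"]` would then fetch the full dataset.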
Your Code
Please submit your Python script, called submission.py, in which you define and train a model to predict the species of a bird from a recording of its call. You are free to use any libraries or tools you like. This file will not be run by the autograder in any way.
Your Predictions
Please submit on Gradescope a predictions.csv file with two columns: id and predicted_species. The id column should contain the id of each recording in the test set (i.e. those with species “unknown” in the metadata), and the predicted_species column should contain your predicted species for each recording (using its common name). For example:
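A small sketch of the expected file, with hypothetical ids and predictions standing in for your model's real output:

```python
import pandas as pd

# Hypothetical predictions for three test ids; your trained model
# supplies the real (id, predicted_species) pairs for the full test set.
preds = pd.DataFrame({
    "id": [530, 531, 532],
    "predicted_species": ["american robin", "red-tailed hawk", "northern cardinal"],
})
preds.to_csv("predictions.csv", index=False)
```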
You’ll be evaluated on this project based on the quality of your code and the accuracy of your predictions.
Code Quality
Code quality accounts for 50% of your grade for this miniproject.
Because this miniproject is somewhat more complex, code organization is important. In assessing your submission, we will be looking for:
Idiomatic torch code constructs, including Dataset and DataLoader objects, and the use of torch tensors for all data manipulation and model training.
Clear and modular code organization, including the use of functions and classes to break things up into logical units.
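As a reminder of what the idiomatic setup looks like, here is a minimal sketch of wrapping tensors in a `Dataset` and iterating with a `DataLoader` — the shapes and toy data are illustrative only, not part of the assignment:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 16 fake "spectrograms" (64 mel bins x 128 time frames)
# and integer labels for the 8 species.
X = torch.randn(16, 64, 128)
y = torch.randint(0, 8, (16,))

# TensorDataset pairs each spectrogram with its label; DataLoader batches them.
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)
for batch_X, batch_y in loader:
    pass  # one forward/backward pass per batch would go here
```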
We’re also expecting the “standard” organization for Python scripts in which you separate your function and class definitions from the code that actually runs the data loading and training:
```python
import ...

def this():
    pass

class That:
    pass

# etc.

if __name__ == "__main__":
    # instantiate objects
    # load the data
    # train the model
    # save results
    # etc.
```
Accuracy
Accuracy accounts for 50% of your grade for this miniproject.
The accuracy of your predictions will be assessed by comparing them to the true species for the test set, which I have withheld from you. The test set consists of 20% of the recordings in the dataset, and is balanced across the 8 species.
Your accuracy grade (out of 50 points) equals your test-set accuracy expressed as a percentage, capped at 50. For example, a model that achieves 45% accuracy on the test set receives an accuracy grade of 45/50, and any accuracy of 50% or higher earns full credit for the accuracy portion of the grade.
Hints
You’re welcome to follow any approach to this problem that you can reasonably come up with. Here are a few techniques that I found to be useful in my own solution.
Data augmentation: I found that randomly shifting the audio recordings in time was helpful for reducing overfitting.
Checkpointing: rather than training for a fixed number of epochs and using the final model, I found it helpful to save a checkpoint whenever the model achieved a new best loss, and to form my predictions with the checkpoint that achieved the lowest loss overall.
Padding: because the recordings are of different lengths, I needed to pad them to a common length before I could feed them into the model.
Precomputed spectrograms: while it’s possible to have the __getitem__ method of your Dataset class read in the raw audio and compute the spectrogram on the fly, I found it more efficient to precompute the spectrograms for all recordings, save them as an instance variable, and have __getitem__ just return the precomputed spectrograms. I still applied the random time-shift augmentation mentioned above inside the __getitem__ method.
Other metadata: I didn’t use any of the other metadata about the recordings (e.g. the recordist's name or the location), but you are welcome to use this information if you think it might be helpful.
I was able to achieve 60% test accuracy, but I bet you can do better!
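To make the augmentation, padding, and precomputation hints concrete, here is one way they could fit together in a Dataset class. This is a sketch, not my solution: the spectrogram computation itself (e.g. via torchaudio) is assumed to have happened already, and the class name, shapes, and `max_frames` parameter are all illustrative:

```python
import torch
from torch.utils.data import Dataset

class SpectrogramDataset(Dataset):
    """Precomputed, padded spectrograms with a random time shift per access."""

    def __init__(self, specs, labels, max_frames):
        # specs: list of (n_mels, n_frames_i) tensors of varying widths.
        # Pad each spectrogram on the right so all share the same time
        # dimension, then stack into a single tensor (precomputation hint).
        padded = [
            torch.nn.functional.pad(s, (0, max_frames - s.shape[-1]))
            for s in specs
        ]
        self.specs = torch.stack(padded)   # (N, n_mels, max_frames)
        self.labels = torch.as_tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        spec = self.specs[i]
        # Augmentation hint: circularly shift the spectrogram in time by a
        # random offset, so the model can't key on where the call starts.
        shift = int(torch.randint(0, spec.shape[-1], (1,)))
        return torch.roll(spec, shifts=shift, dims=-1), self.labels[i]
```

A `DataLoader` wraps this directly, e.g. `DataLoader(SpectrogramDataset(specs, labels, max_frames), batch_size=32, shuffle=True)`.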