Miniproject: Birdcall Classification

We in Vermont are blessed with a beautiful natural environment that supports a rich ecosystem of wildlife, including birds. Below are just a few of the birds that I regularly see in my neighborhood just outside of town.

Figure 1: Images of 8 birds common in the Middlebury area, all sourced from Wikimedia Commons: red-tailed hawk, tufted titmouse, American crow, American robin, red-winged blackbird, northern cardinal, red-bellied woodpecker, and black-capped chickadee.

Apps like Merlin Bird ID from the Cornell Lab of Ornithology are great tools for identifying birds you see in the wild from photos, descriptions, or recordings of their calls. In this miniproject, you will build your own algorithm for predicting the species of a bird from a recording of its call.

This Miniproject is Intentionally Open-Ended and Challenging

Unlike in some of the previous miniprojects in which I’ve given you relatively explicit and scaffolded directions, I am leaving the details of this miniproject largely up to you.

Dataset

The dataset for this miniproject consists of a set of recordings of bird calls for the 8 birds pictured in Figure 1. These recordings were obtained from the xeno-canto database, which is a large and growing collection of bird call recordings contributed by birders around the world. The dataset consists of:

The Metadata File

The file metadata.csv contains metadata about the recordings, including the species, common name, length, recordist, location, sample rate, and id of each recording. Here’s an example of how to access the metadata:

import pandas as pd
DATA_URL = "https://middcs.github.io/csci-0451-s26/data/birdcall"
metadata = pd.read_csv(f"{DATA_URL}/metadata.csv")
metadata.head()
   species                common name             length  recorded_by         lat      lon        alt   id  sr     split
0  melanerpes carolinus   red-bellied woodpecker  0:15    Bruce Lagerquist    29.2923  -82.6270   20    0   48000  train
1  buteo jamaicensis      red-tailed hawk         0:15    Thomas Magarian     48.2519  -112.4131  1200  1   48000  train
2  turdus migratorius     american robin          0:06    AUDEVARD Aurélien   61.1658  -150.0613  0     2   48000  train
3  turdus migratorius     american robin          0:22    Scott Olmstead      32.3022  -110.5962  1300  3   44100  train
4  corvus brachyrhynchos  american crow           0:15    David Vander Pluym  47.6551  -122.1120  10    4   22050  train

This file has been censored so that 20% of the species names have been replaced with “unknown” – these are the test set:

metadata[metadata["species"] == "unknown"].head()
     species  common name  length  recorded_by    lat      lon        alt  id   sr     split
530  unknown  unknown      0:13    Ed Pandolfino  38.5321  -121.0686  110  530  48000  test
531  unknown  unknown      0:16    Antonio Xeira  38.0695  -84.3933   300  531  44100  test
532  unknown  unknown      0:21    Antonio Xeira  36.7652  -88.1301   120  532  44100  test
533  unknown  unknown      0:25    Greg Irving    48.0510  -123.1393  130  533  48000  test
534  unknown  unknown      0:27    Rory Nefdt     40.9658  -73.6739   0    534  44100  test

The Audio Files

Each row of the metadata corresponds to an MP3 audio file which I have downloaded and hosted on the course GitHub repository. The following code block shows how to download a single MP3:

import requests
import os

LOCAL_DATA_PATH = "data/birdcall"

# make sure the local data directory exists before downloading into it
os.makedirs(LOCAL_DATA_PATH, exist_ok=True)

def download_mp3(rec_id):
    # download the recording with the given id, skipping it if already cached locally
    url = f"{DATA_URL}/mp3/{rec_id}.mp3"
    fname = f"{LOCAL_DATA_PATH}/{rec_id}.mp3"
    if not os.path.exists(fname):
        response = requests.get(url)
        response.raise_for_status()
        with open(fname, "wb") as f:
            f.write(response.content)

The argument of download_mp3 corresponds to the id column of the metadata dataframe:

example_id = metadata.iloc[0]["id"]
download_mp3(example_id)
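
Once a recording has been downloaded, you will need to turn it into a tensor. One possibility (an assumption on my part, not a requirement) is torchaudio, whose MP3 support relies on an audio backend such as FFmpeg being installed:

import torchaudio

# load the example recording as a (channels, samples) float tensor and its sample rate
waveform, sample_rate = torchaudio.load(f"{LOCAL_DATA_PATH}/{example_id}.mp3")

# a mel spectrogram is one common input representation for audio classifiers
to_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)
spectrogram = to_spectrogram(waveform)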

What You Should Submit

Your Python Script

Please submit your Python script, called submission.py, in which you define and train a model to predict the species of a bird from a recording of its call. You are free to use any libraries or tools you like. This file will not be run by the autograder in any way.

Your Predictions

Please submit on Gradescope a predictions.csv file with two columns: id and predicted_species. The id column should contain the id of each recording in the test set (i.e. those with species “unknown” in the metadata), and the predicted_species column should contain your predicted species for each recording (using its common name). For example:

id,predicted_species
512,northern cardinal
513,american crow
514,tufted titmouse
515,american robin
516,red-bellied woodpecker
...
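
One way to produce this file with pandas, assuming you have collected the test-set recording ids and your model's predicted common names in two lists (the variable names here are illustrative):

import pandas as pd

# test_ids and predicted_names are placeholders for your own results:
# the ids of the test-set recordings and the predicted common name for each
predictions = pd.DataFrame({"id": test_ids, "predicted_species": predicted_names})
predictions.to_csv("predictions.csv", index=False)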

Assessment

You’ll be evaluated on this project based on the quality of your code and the accuracy of your predictions.

Code Quality

Code quality accounts for 50% of your grade for this miniproject.

Because this miniproject is more complex than previous ones, code organization is especially important. In assessing your submission, we will be looking for:

  1. Idiomatic torch code constructs, including Dataset and DataLoader objects, and the use of torch tensors for all data manipulation and model training.
  2. Clear and modular code organization, including the use of functions and classes to break things up into logical units.

We’re also expecting the “standard” organization for Python scripts in which you separate your function and class definitions from the code that actually runs the data loading and training:

import ...

def this(): 
    pass

class That:
    pass

# etc. 

if __name__ == "__main__":
    # instantiate objects
    # load the data
    # train the model 
    # save results
    # etc

Accuracy

Accuracy accounts for 50% of your grade for this miniproject.

The accuracy of your predictions will be assessed by comparing them to the true species for the test set, which I have withheld from you. The test set consists of 20% of the recordings in the dataset, and is balanced across the 8 species.

Your accuracy grade (out of 50) equals your test-set accuracy expressed as a percentage, capped at 50. For example, a model that achieves 45% accuracy on the test set receives an accuracy grade of 45/50, and any accuracy of 50% or higher earns full credit for the accuracy portion of the grade.

Hints

You’re welcome to follow any approach to this problem that you can reasonably come up with. Here are a few techniques that I found to be useful in my own solution.

  • Data augmentation: I found that randomly shifting the audio recordings in time helped reduce overfitting (see the Dataset sketch after this list).
  • Checkpointing: rather than training for a fixed number of epochs and using the final model, I saved the model whenever it reached a new best loss, and used the lowest-loss checkpoint to form my predictions (see the checkpointing sketch after this list).
  • Padding: because the recordings have different lengths, I needed to pad them to a common length before feeding them into the model.
  • Precomputing spectrograms: while it’s possible to have the __getitem__ method of your Dataset class read the raw audio and compute the spectrogram on the fly, I found it more efficient to precompute the spectrograms for all recordings, store them as an instance variable, and have __getitem__ simply return the precomputed spectrograms. I still applied the random time shift described above inside __getitem__.
  • Other metadata: I didn’t use any of the other metadata about the recordings (e.g. the recordist’s name or the location), but you are welcome to use this information if you think it might be helpful.
  • I was able to achieve 60% test accuracy, but I bet you can do better!
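
To make the Dataset-related hints above (padding, precomputed spectrograms, and random time shifts) a little more concrete, here is a minimal sketch. It is not my solution: the class name BirdcallDataset, the use of torch.roll for the time shift, and the padding helper are all illustrative choices, and you may structure things differently.

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

def pad_to_length(spec, n_frames):
    # right-pad (or truncate) a spectrogram along its time axis so that
    # every example has exactly n_frames frames
    if spec.shape[-1] < n_frames:
        spec = F.pad(spec, (0, n_frames - spec.shape[-1]))
    return spec[..., :n_frames]

class BirdcallDataset(Dataset):
    # holds precomputed, padded spectrograms and applies a random
    # time shift each time an example is requested
    def __init__(self, spectrograms, labels, max_shift=20):
        self.spectrograms = spectrograms  # tensor of shape (n, 1, n_mels, n_frames)
        self.labels = labels              # tensor of integer class ids
        self.max_shift = max_shift

    def __len__(self):
        return len(self.spectrograms)

    def __getitem__(self, idx):
        spec = self.spectrograms[idx]
        # data augmentation: roll the spectrogram by a random number of frames
        shift = torch.randint(-self.max_shift, self.max_shift + 1, (1,)).item()
        return torch.roll(spec, shifts=shift, dims=-1), self.labels[idx]

A DataLoader can then batch and shuffle these examples in the usual way, e.g. DataLoader(train_dataset, batch_size=32, shuffle=True).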
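
The checkpointing hint can be sketched as follows, assuming that a model, an optimizer, and train/validation DataLoaders already exist; train_one_epoch and evaluate are illustrative helper functions, not code provided with the project.

import torch

best_loss = float("inf")

for epoch in range(n_epochs):
    train_one_epoch(model, train_loader, optimizer)  # one pass over the training data
    val_loss = evaluate(model, val_loader)           # average loss on held-out data

    # save a checkpoint whenever the validation loss reaches a new best value
    if val_loss < best_loss:
        best_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")

# form predictions with the best checkpoint rather than the final-epoch weights
model.load_state_dict(torch.load("best_model.pt"))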