We in Vermont are blessed with a beautiful natural environment that supports a rich ecosystem of wildlife, including birds. Below are just a few of the birds that I regularly see in my neighborhood just outside of town.
[Figure 1: Images of 8 birds common in the Middlebury area (Red-tailed Hawk, Tufted Titmouse, American Crow, American Robin, Red-winged Blackbird, Northern Cardinal, Red-bellied Woodpecker, Black-capped Chickadee), all sourced from Wikimedia Commons.]
Apps like Merlin Bird ID from the Cornell Lab of Ornithology are great tools for identifying birds you see in the wild from photos, descriptions, or recordings of their calls. In this miniproject, you will build your own algorithm for predicting the species of a bird from a recording of its call.
This Miniproject is Intentionally Open-Ended and Challenging
Unlike in some of the previous miniprojects in which I’ve given you relatively explicit and scaffolded directions, I am leaving the details of this miniproject largely up to you.
Dataset
The dataset for this miniproject consists of a set of recordings of bird calls for the 8 birds pictured in Figure 1. These recordings were obtained from the xeno-canto database, which is a large and growing collection of bird call recordings contributed by birders around the world. The data set consists of:
The Metadata File
The file metadata.csv contains metadata about the recordings, including the species, common name, length, recordist, recording location (latitude, longitude, and altitude), id, sample rate (sr), and train/test split for each recording. Here’s an example of how to access the metadata:
```python
import pandas as pd

DATA_URL = "https://middcs.github.io/csci-0451-s26/data/birdcall"
metadata = pd.read_csv(f"{DATA_URL}/metadata.csv")
metadata.head()
```
|   | species | common name | length | recorded_by | lat | lon | alt | id | sr | split |
|---|---------|-------------|--------|-------------|-----|-----|-----|----|----|-------|
| 0 | melanerpes carolinus | red-bellied woodpecker | 0:15 | Bruce Lagerquist | 29.2923 | -82.6270 | 20 | 0 | 48000 | train |
| 1 | buteo jamaicensis | red-tailed hawk | 0:15 | Thomas Magarian | 48.2519 | -112.4131 | 1200 | 1 | 48000 | train |
| 2 | turdus migratorius | american robin | 0:06 | AUDEVARD Aurélien | 61.1658 | -150.0613 | 0 | 2 | 48000 | train |
| 3 | turdus migratorius | american robin | 0:22 | Scott Olmstead | 32.3022 | -110.5962 | 1300 | 3 | 44100 | train |
| 4 | corvus brachyrhynchos | american crow | 0:15 | David Vander Pluym | 47.6551 | -122.1120 | 10 | 4 | 22050 | train |
This file has been censored so that 20% of the species names have been replaced with “unknown” – these are the test set:
```python
metadata[metadata["species"] == "unknown"].head()
```
|     | species | common name | length | recorded_by | lat | lon | alt | id | sr | split |
|-----|---------|-------------|--------|-------------|-----|-----|-----|----|----|-------|
| 530 | unknown | unknown | 0:13 | Ed Pandolfino | 38.5321 | -121.0686 | 110 | 530 | 48000 | test |
| 531 | unknown | unknown | 0:16 | Antonio Xeira | 38.0695 | -84.3933 | 300 | 531 | 44100 | test |
| 532 | unknown | unknown | 0:21 | Antonio Xeira | 36.7652 | -88.1301 | 120 | 532 | 44100 | test |
| 533 | unknown | unknown | 0:25 | Greg Irving | 48.0510 | -123.1393 | 130 | 533 | 48000 | test |
| 534 | unknown | unknown | 0:27 | Rory Nefdt | 40.9658 | -73.6739 | 0 | 534 | 44100 | test |
The Audio Files
Each row of the metadata corresponds to an MP3 audio file which I have downloaded and hosted on the course GitHub repository. The following code block shows how to download a single MP3:
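A minimal sketch of such a download helper, using only the standard library. Note that the exact directory and filename scheme for the hosted MP3s (here assumed to be `recordings/<id>.mp3` under `DATA_URL`) is an assumption — check the course repository for the actual layout:

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://middcs.github.io/csci-0451-s26/data/birdcall"

def recording_url(recording_id: int) -> str:
    # ASSUMPTION: hosted files follow a recordings/<id>.mp3 naming scheme.
    # Verify the real path in the course GitHub repository.
    return f"{DATA_URL}/recordings/{recording_id}.mp3"

def download_recording(recording_id: int, out_dir: str = "recordings") -> Path:
    """Download one MP3 by its metadata id and return the local path."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"{recording_id}.mp3"
    if not path.exists():  # skip files we already downloaded
        urlretrieve(recording_url(recording_id), path)
    return path
```

Looping `download_recording` over `metadata["id"]` would then fetch the full dataset.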
Your Code
Please submit your Python script, called submission.py, in which you define and train a model to predict the species of a bird from a recording of its call. You are free to use any libraries or tools you like. This file will not be run by the autograder in any way.
Your Predictions
Please submit on Gradescope a predictions.csv file with two columns: id and predicted_species. The id column should contain the id of each recording in the test set (i.e. those with species “unknown” in the metadata), and the predicted_species column should contain your predicted species for each recording (using its common name). For example:
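A small sketch of the expected file, with hypothetical ids and predictions standing in for your model's real output:

```python
import pandas as pd

# Hypothetical predictions for three test ids; your trained model
# supplies the real (id, predicted_species) pairs for the full test set.
preds = pd.DataFrame({
    "id": [530, 531, 532],
    "predicted_species": ["american robin", "red-tailed hawk", "northern cardinal"],
})
preds.to_csv("predictions.csv", index=False)
```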
You’ll be evaluated on this project based on the quality of your code and the accuracy of your predictions.
Code Quality
Code quality accounts for 50% of your grade for this miniproject.
Because this miniproject is somewhat more complex, code organization is important. In assessing your submission, we will be looking for:
Idiomatic torch code constructs, including Dataset and DataLoader objects, and the use of torch tensors for all data manipulation and model training.
Clear and modular code organization, including the use of functions and classes to break things up into logical units.
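As a reminder of what the idiomatic setup looks like, here is a minimal sketch of wrapping tensors in a `Dataset` and iterating with a `DataLoader` — the shapes and toy data are illustrative only, not part of the assignment:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 16 fake "spectrograms" (64 mel bins x 128 time frames)
# and integer labels for the 8 species.
X = torch.randn(16, 64, 128)
y = torch.randint(0, 8, (16,))

# TensorDataset pairs each spectrogram with its label; DataLoader batches them.
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)
for batch_X, batch_y in loader:
    pass  # one forward/backward pass per batch would go here
```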
We’re also expecting the “standard” organization for Python scripts in which you separate your function and class definitions from the code that actually runs the data loading and training:
```python
import ...

def this():
    pass

class That:
    pass

# etc.

if __name__ == "__main__":
    # instantiate objects
    # load the data
    # train the model
    # save results
    # etc.
```
Accuracy
Accuracy accounts for 50% of your grade for this miniproject.
The accuracy of your predictions will be assessed by comparing them to the true species for the test set, which I have withheld from you. The test set consists of 20% of the recordings in the dataset, and is balanced across the 8 species.
Your accuracy grade (out of 50 points) equals your test-set accuracy expressed as a percentage, capped at 50. For example, a model that achieves 45% accuracy on the test set receives an accuracy grade of 45/50, and any accuracy of 50% or higher earns full credit for the accuracy portion of the grade.
Hints
You’re welcome to follow any approach to this problem that you can reasonably come up with. Here are a few techniques that I found to be useful in my own solution.
Data augmentation: I found that randomly shifting the audio recordings in time was helpful for reducing overfitting.
Checkpointing: rather than training for a fixed number of epochs and using the final model, I found it helpful to save a checkpoint whenever the model achieved a new best loss, and to form my predictions with the checkpoint that achieved the lowest loss overall.
Padding: because the recordings are of different lengths, I needed to pad them to a common length before I could feed them into the model.
Precomputed spectrograms: while it’s possible to have the __getitem__ method of your Dataset class read in the raw audio and compute the spectrogram on the fly, I found it more efficient to precompute the spectrograms for all recordings, save them as an instance variable, and have __getitem__ just return the precomputed spectrograms. I still applied the random time-shift augmentation mentioned above inside the __getitem__ method.
Other metadata: I didn’t use any of the other metadata about the recordings (e.g. the recordist's name or the location), but you are welcome to use this information if you think it might be helpful.
I was able to achieve 60% test accuracy, but I bet you can do better!
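To make the augmentation, padding, and precomputation hints concrete, here is one way they could fit together in a Dataset class. This is a sketch, not my solution: the spectrogram computation itself (e.g. via torchaudio) is assumed to have happened already, and the class name, shapes, and `max_frames` parameter are all illustrative:

```python
import torch
from torch.utils.data import Dataset

class SpectrogramDataset(Dataset):
    """Precomputed, padded spectrograms with a random time shift per access."""

    def __init__(self, specs, labels, max_frames):
        # specs: list of (n_mels, n_frames_i) tensors of varying widths.
        # Pad each spectrogram on the right so all share the same time
        # dimension, then stack into a single tensor (precomputation hint).
        padded = [
            torch.nn.functional.pad(s, (0, max_frames - s.shape[-1]))
            for s in specs
        ]
        self.specs = torch.stack(padded)   # (N, n_mels, max_frames)
        self.labels = torch.as_tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        spec = self.specs[i]
        # Augmentation hint: circularly shift the spectrogram in time by a
        # random offset, so the model can't key on where the call starts.
        shift = int(torch.randint(0, spec.shape[-1], (1,)))
        return torch.roll(spec, shifts=shift, dims=-1), self.labels[i]
```

A `DataLoader` wraps this directly, e.g. `DataLoader(SpectrogramDataset(specs, labels, max_frames), batch_size=32, shuffle=True)`.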