In this miniproject, you’ll implement a classifier that predicts the genre of a piece of music based on its lyrics and audio features. Our data set comes from the authors of the paper “Temporal Analysis and Visualization of Music” (2020), available here. This dataset was assembled via the Spotify API.
The code below will load half of this data set as training data for you to use in modeling:
```python
import pandas as pd

url = "https://middcs.github.io/csci-0451-s26/data/music-genre/train.csv"
train = pd.read_csv(url)
train.head()
```
|   | track_name | release_date | genre | lyrics | len | dating | violence | world/life | night/time | shake the audience | ... | sadness | feelings | danceability | loudness | acousticness | instrumentalness | valence | energy | topic | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | velvet light | 2018 | rock | mind fade pink cashmere summer nights summer n... | 22 | 0.002506 | 0.002506 | 0.002506 | 0.002506 | 0.002506 | ... | 0.419822 | 0.037206 | 0.658832 | 0.644412 | 0.753012 | 0.136640 | 0.504328 | 0.419401 | sadness | 0.028571 |
| 1 | andy, you're a star | 2004 | rock | field remember incredible shut shut yeah field... | 49 | 0.001284 | 0.434690 | 0.168126 | 0.001284 | 0.112252 | ... | 0.001284 | 0.001284 | 0.469295 | 0.805656 | 0.228915 | 0.004109 | 0.675392 | 0.604592 | violence | 0.228571 |
| 2 | with a little luck | 1978 | pop | little luck help damn thing work little feel e... | 151 | 0.007842 | 0.000384 | 0.000384 | 0.000384 | 0.015161 | ... | 0.000384 | 0.383363 | 0.636088 | 0.518063 | 0.072389 | 0.000176 | 0.258038 | 0.370351 | feelings | 0.600000 |
| 3 | voodoo mon amour | 2012 | jazz | insert needle break stand fall consider bewild... | 58 | 0.095308 | 0.005263 | 0.005263 | 0.005263 | 0.005263 | ... | 0.475272 | 0.005263 | 0.549442 | 0.855909 | 0.000740 | 0.599190 | 0.356966 | 0.983983 | sadness | 0.114286 |
| 4 | gulf coast highway (with willie nelson) | 1990 | country | gulf coast highway work rail work field cold d... | 92 | 0.000741 | 0.000741 | 0.022941 | 0.000741 | 0.000741 | ... | 0.101667 | 0.000741 | 0.528864 | 0.601826 | 0.747992 | 0.000000 | 0.315746 | 0.239215 | music | 0.428571 |

5 rows × 29 columns
The data contains several features of each track:
- The track_name and release_date columns are relatively straightforward.
- The lyrics column is a string containing the lyrics associated with the track.
- The columns dating through energy are features engineered by Spotify to describe different audio characteristics of the track.
- The topic column gives a one-word description of what the track is “about.”
- The age column is a scaled measure of the age of the track: a track from 1950 has age = 1, and a track from 2019 has age ≈ 0.014 (the data was collected in 2020).
- The genre column gives the genre of the track.
Here are the seven genres represented:
```python
train.groupby("genre").size()
```

```
genre
blues      2303
country    2750
hip hop     454
jazz       1911
pop        3513
reggae     1270
rock       1985
dtype: int64
```
Our task in this miniproject is to predict the values of the genre column using the other features.
Submission Structure
Your submission contains three files, with exactly the specified names:
- submission.py
- model.pkl
- pipeline.pkl
submission.py
The file submission.py is where you train your model and save the results. It is also where you configure and store your data preparation pipeline. It should have the following structure:
```python
import torch
import pandas as pd
import pickle

class DataPrepPipeline:
    """may need additional arguments"""
    def __init__(self):
        ...
    def fit(self, X):
        ...
    def transform(self, X):
        ...

class GenreModel:
    ...

class GradientDescentOptimizer:
    ...

"""
Ok to also have other things defined here, e.g. function definitions
for cross-entropy or accuracy etc.
"""

if __name__ == "__main__":
    # Load the data
    url = "https://middcs.github.io/csci-0451-s26/data/music-genre/train.csv"
    train = pd.read_csv(url)

    # separate the features and targets
    X_df = train.drop(columns=["genre"])
    # ...

    # fit and save the data prep pipeline
    pipeline = DataPrepPipeline()
    pipeline.fit(X_df)
    with open("pipeline.pkl", "wb") as f:
        pickle.dump(pipeline, f)

    # now we can apply the pipeline to the data to get it ready for training
    X_train = pipeline.transform(X_df)

    # train the model
    # ...

    # save the model
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)
```
Note that submission.py saves two pickle files: model.pkl and pipeline.pkl. The former should contain the trained model, and the latter should contain the fitted data preparation pipeline.
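If you want to convince yourself that the save/load round trip works, here is a minimal demonstration of the pattern, using a stand-in class and a throwaway filename rather than your actual model:

```python
import pickle

class Dummy:
    """Stand-in for a trained model; any picklable object works the same way."""
    def __init__(self, value):
        self.value = value

# save, exactly as in submission.py
with open("demo.pkl", "wb") as f:
    pickle.dump(Dummy(42), f)

# load it back, as the autograder will
with open("demo.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.value)  # 42
```

One caveat worth knowing: pickle.load needs the object's class to be importable at load time, which is why the autograder imports GenreModel and DataPrepPipeline from submission.py before unpickling.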
Data Preparation Pipeline
For this assignment, you will likely want a more complex data preparation pipeline than in previous miniprojects. In particular, it might be helpful for your pipeline to have a saved state which is fitted to the training data and then applied to the test data. For example, suppose you wanted to use the lyrics column as a feature. You could identify the 100 most common words in the training data and then use the counts of those words as features. To engineer comparable features on the test data, you need to save those 100 words, since the top 100 words in the test lyrics might be different.
The feature_extraction module of scikit-learn has some useful functions if you want to do something like this.
For this reason, your DataPrepPipeline class should implement:
- An __init__ method which initializes any necessary attributes of the class.
- A fit method which takes in the training data and stores any relevant information for preparing the data. For example, if you are using the lyrics column, this method might identify the 100 most common words and save them as an attribute of the class.
- A transform method which actually returns the prepared data.
The .fit method of classes like sklearn.feature_extraction.text.CountVectorizer is a convenient shortcut: it will do this memorizing for you, and you can just save the vectorizer as an instance variable of your pipeline.
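Putting these pieces together, here is a minimal sketch of one possible DataPrepPipeline. The vocabulary size of 100 and the particular numeric columns (danceability, loudness, energy) are illustrative assumptions, not requirements:

```python
import numpy as np
import pandas as pd
import torch
from sklearn.feature_extraction.text import CountVectorizer

class DataPrepPipeline:
    """Sketch: lyric word counts plus a few numeric columns (illustrative choices)."""
    def __init__(self, num_cols=("danceability", "loudness", "energy"), max_words=100):
        self.num_cols = list(num_cols)                         # numeric columns to keep
        self.vectorizer = CountVectorizer(max_features=max_words)

    def fit(self, X):
        # memorize the top `max_words` words from the *training* lyrics
        self.vectorizer.fit(X["lyrics"])
        return self

    def transform(self, X):
        # counts of the memorized vocabulary, plus the selected numeric columns
        counts = self.vectorizer.transform(X["lyrics"]).toarray()
        numeric = X[self.num_cols].to_numpy()
        return torch.tensor(np.hstack([counts, numeric]), dtype=torch.float32)
```

Because fit memorizes the vocabulary, calling transform on the test data produces columns that line up with the training features, even if the test lyrics have a different set of common words.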
Classification Model
Your classification model can be any multiclass classification model you like, implemented in torch. You are welcome to freely copy from the lecture notes (with attribution in your comments), and you are also welcome to use generative AI coding assistance (again, with attribution in the comments).
In case you are wondering: yes, it is fine to use my implementation of logistic regression from the lecture notes.
- Your model must be implemented in torch.
- Your GradientDescentOptimizer must use manual gradients.
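To make the required structure concrete, here is a hedged sketch of one valid choice: multiclass logistic regression trained with a manually computed cross-entropy gradient. The class names follow the required submission structure, but everything else (zero initialization, the learning rate, the plain X @ W score) is an illustrative assumption:

```python
import torch

class GenreModel:
    """Multiclass logistic regression: class probabilities are softmax(X @ W)."""
    def __init__(self, num_features, num_classes):
        self.W = torch.zeros(num_features, num_classes)

    def forward(self, X):
        return torch.softmax(X @ self.W, dim=1)

class GradientDescentOptimizer:
    """Gradient descent using the manual (no-autograd) gradient of the
    mean cross-entropy loss: grad = X^T (p - y) / n."""
    def __init__(self, model, lr=0.1):
        self.model = model
        self.lr = lr

    def step(self, X, y):
        p = self.model.forward(X)            # predicted probabilities, shape (n, k)
        grad = X.T @ (p - y) / X.shape[0]    # manual gradient, shape (features, k)
        self.model.W -= self.lr * grad
```

Here y is a one-hot (n, k) tensor, e.g. built with pd.get_dummies, which matches how the autograder encodes the test labels.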
Save Your Model and Pipeline
Please take note of the two calls to pickle.dump() in the sample submission above. These save your data preparation pipeline and your model. The autograder will use both of them to test your submission, and will not run unless both files are present.
Classifier Performance
The performance of your classifier will be measured via accuracy. The base rate of classification in this problem is the accuracy achieved by always predicting the most common genre (pop): roughly 3513/14186 ≈ 24.8% on the training data. Any accuracy above this, when measured on the test set, reflects some degree of successful learning. However, we’re going to be “ambitious”:
The test accuracy to shoot for is 45%.
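You can check the base rate yourself: it is just the share of the most common genre. A quick computation using the class counts printed earlier:

```python
import pandas as pd

# class counts copied from the train.groupby("genre").size() output above
counts = pd.Series({"blues": 2303, "country": 2750, "hip hop": 454,
                    "jazz": 1911, "pop": 3513, "reggae": 1270, "rock": 1985})

# base rate: accuracy of always predicting the most common genre (pop)
base_rate = counts.max() / counts.sum()
print(f"{base_rate:.3f}")  # 0.248
```

On the loaded DataFrame, the same number is train["genre"].value_counts(normalize=True).max().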
Autograder
Please keep in mind that the autograder is still slow. Rather than repeatedly testing your model against the autograder, it is recommended to withhold a validation set from your training data and test against that. You can then submit to the autograder once you are satisfied with your model’s performance on the validation set.
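One simple way to withhold a validation set is a reproducible random split of the training DataFrame; the 20% fraction and the seed below are arbitrary choices:

```python
import pandas as pd

def train_val_split(df, val_frac=0.2, seed=1234):
    """Split df into disjoint training and validation DataFrames."""
    val = df.sample(frac=val_frac, random_state=seed)   # random 20% for validation
    train = df.drop(val.index)                          # everything else for training
    return train, val
```

Remember to fit your DataPrepPipeline only on the training portion and then transform both portions with it, so that your validation score imitates the autograder's use of unseen data.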
Like last time, I’m including the complete code for the autograder in the folded code block below.
Code
```python
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
import sys
import unittest
import pickle
import pandas as pd
from gradescope_utils.autograder_utils.decorators import tags
from gradescope_utils.autograder_utils.json_test_runner import JSONTestRunner
from gradescope_utils.autograder_utils.decorators import weight
from submission import DataPrepPipeline, GenreModel
import torch

class TestGenreModel(unittest.TestCase):

    @weight(1)
    def test_file_name(self):
        """ Test that submission.py, model.pkl, and pipeline.pkl exist in the current directory """
        self.assertTrue(os.path.exists("submission.py"), "submission.py not found in current directory -- please check that you uploaded exactly this file with exactly this name")
        self.assertTrue(os.path.exists("model.pkl"), "model.pkl not found in current directory -- please check that you uploaded this file with exactly this name")
        self.assertTrue(os.path.exists("pipeline.pkl"), "pipeline.pkl not found in current directory -- please check that you uploaded this file with exactly this name")

    @weight(1)
    def test_submission(self):
        """ Test that submission.py can be imported without errors """
        try:
            import submission
        except ImportError:
            self.fail("submission.py could not be imported. Make sure it is in the current directory and does not have syntax errors.")

    @weight(1)
    def test_classes_and_functions(self):
        """ Test that GenreModel, DataPrepPipeline, and GradientDescentOptimizer are defined in submission.py """
        try:
            from submission import GenreModel
        except ImportError:
            self.fail("GenreModel class not found in submission.py.")
        try:
            from submission import DataPrepPipeline
        except ImportError:
            self.fail("DataPrepPipeline class not found in submission.py.")
        try:
            from submission import GradientDescentOptimizer
        except ImportError:
            self.fail("GradientDescentOptimizer class not found in submission.py.")

    @weight(1)
    def test_model_prediction(self):
        """ Test that the saved model can make predictions on the test set and that the predictions have the correct shape """
        try:
            from submission import GenreModel, DataPrepPipeline
            model = pickle.load(open("model.pkl", "rb"))
            pipeline = pickle.load(open("pipeline.pkl", "rb"))
            df_test = pd.read_csv("test.csv")
            X_test = pipeline.transform(df_test)
            y_test = torch.tensor(pd.get_dummies(df_test["genre"]).values, dtype=torch.float32)
            y_pred = model.forward(X_test)
            self.assertEqual(y_pred.shape, y_test.shape, f"Predicted values have incorrect shape. Expected {y_test.shape}, got {y_pred.shape}.")
        except Exception as e:
            self.fail(f"Error testing model prediction: {e}")

    @weight(1)
    def test_model_performance_1(self):
        """ Test the saved model reaches at least 35% accuracy on the test set. """
        acc = compute_test_accuracy()
        self.assertGreaterEqual(acc, 0.35, f"Model accuracy on test set is not above 35%. Got {100*acc:.2f}.")

    @weight(1)
    def test_model_performance_2(self):
        """ Test the saved model reaches at least 40% accuracy on the test set. """
        acc = compute_test_accuracy()
        self.assertGreaterEqual(acc, 0.40, f"Model accuracy on test set is not above 40%. Got {100*acc:.2f}.")

    @weight(1)
    def test_model_performance_3(self):
        """ Test the saved model reaches at least 45% accuracy on the test set. """
        acc = compute_test_accuracy()
        self.assertGreaterEqual(acc, 0.45, f"Model accuracy on test set is not above 45%. Got {100*acc:.2f}.")

def compute_test_accuracy():
    from submission import GenreModel, DataPrepPipeline
    model = pickle.load(open("model.pkl", "rb"))
    pipeline = pickle.load(open("pipeline.pkl", "rb"))
    df_test = pd.read_csv("test.csv")
    X_test = pipeline.transform(df_test)
    y_test = torch.tensor(pd.get_dummies(df_test["genre"]).values, dtype=torch.float32)
    y_pred = model.forward(X_test)
    accuracy = (y_pred.argmax(dim=1) == y_test.argmax(dim=1)).float().mean().item()
    return accuracy

if __name__ == "__main__":
    with open('/autograder/results/results.json', 'w') as f:
        # with open('results.json', 'w') as f:
        runner = JSONTestRunner(visibility="visible", stream=f)
        runner.json_data["leaderboard"] = [{"name": "Accuracy", "value": round(100*compute_test_accuracy(), 2)}]
        runner.run(unittest.TestLoader().loadTestsFromTestCase(TestGenreModel))
```
Your Grade
Your grade is computed out of 10:
- 4 points for passing the basic autograder tests, which check things like whether you uploaded the right files and whether your model can make predictions without erroring.
- 1 point for achieving a test accuracy higher than 35%.
- 1 point for achieving a test accuracy higher than 40%.
- 1 point for achieving a test accuracy higher than 45%.
- 3 points for writing logical, organized, idiomatic, vectorized code, as assessed by the (human) grader.
Extra Credit
Like last time, we’ll have a leaderboard. The top submissions as measured by testing accuracy will again receive one homework problem’s worth of extra credit.
Some Tips
- I found it useful to work with the lyrics column in order to meet the higher threshold. I achieved reasonable performance by storing a vectorizer from sklearn.feature_extraction, which I fit as part of the DataPrepPipeline.fit() method.
- I found it useful to save the names of the engineered features I wanted to use (like loudness and feelings) in the DataPrepPipeline.__init__() method.
- My complete submission.py file is around 100 lines of code (with few comments and no docstrings).
- You can define other functions in submission.py, especially ones to use in your pipeline. For example, you might define a function for polynomial features (including or not including cross-terms) or other nonlinear feature maps.
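For instance, a degree-d polynomial feature map without cross-terms can be written in a couple of vectorized lines (the function name and default degree here are illustrative):

```python
import torch

def poly_features(X, degree=2):
    """Append elementwise powers X**2, ..., X**degree to the columns of X."""
    return torch.cat([X ** d for d in range(1, degree + 1)], dim=1)
```

Such a function fits naturally inside DataPrepPipeline.transform, so that the same nonlinear map is applied to the training and test data.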