Miniproject: Music Genre Classification

In this miniproject, you’ll implement a classifier that predicts the genre of a piece of music based on its lyrics and audio features. Our dataset comes from the authors of the paper “Temporal Analysis and Visualization of Music” (2020), available here; it was assembled via the Spotify API.

The code below will load half of this data set as training data for you to use in modeling:

import pandas as pd
url = "https://middcs.github.io/csci-0451-s26/data/music-genre/train.csv"
train = pd.read_csv(url)
train.head()
[Output preview: the first five rows of the data frame. Its columns include track_name, release_date, genre, lyrics, and len; per-topic lyric scores such as dating, violence, world/life, night/time, shake the audience, sadness, and feelings; audio features such as danceability, loudness, acousticness, instrumentalness, valence, and energy; plus topic and age.]

5 rows × 29 columns

The data contains several features of each track: metadata like the release date, the raw lyrics, scores measuring the prevalence of various lyrical topics, and audio features computed by Spotify.

Here are the seven genres represented:

train.groupby("genre").size()
genre
blues      2303
country    2750
hip hop     454
jazz       1911
pop        3513
reggae     1270
rock       1985
dtype: int64

Our task in this miniproject is to predict the values of the genre column using the other features.

Submission Structure

Your submission must contain exactly these three files, with exactly these names:

  • submission.py
  • model.pkl
  • pipeline.pkl

submission.py

The file submission.py is where you train your model and save the results. It is also where you configure and store your data preparation pipeline. It should have the following structure:

import torch 
import pandas as pd
import pickle

class DataPrepPipeline: 
    """
    May need additional constructor arguments.
    """
    def __init__(self): 
        ...

    def fit(self, X): 
        ...

    def transform(self, X): 
        ...

class GenreModel:
    ...

class GradientDescentOptimizer: 
    ...

# It's fine to define other things here as well, e.g. functions for
# cross-entropy, accuracy, etc.

if __name__ == "__main__":
    # Load the data
    url = "https://middcs.github.io/csci-0451-s26/data/music-genre/train.csv"
    train = pd.read_csv(url)

    # separate the features and targets
    X_df = train.drop(columns=["genre"])
    
    # ...

    # fit and save the data prep pipeline
    pipeline = DataPrepPipeline()
    pipeline.fit(X_df)
    with open("pipeline.pkl", "wb") as f:
        pickle.dump(pipeline, f)

    # now we can apply the pipeline to the data to get it ready for training the model
    X_train = pipeline.transform(X_df)
    
    # train the model
    # ...
     
    # save the model
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)

Note that submission.py saves two pickle files: model.pkl and pipeline.pkl. The former should contain the trained model, and the latter should contain the fitted data preparation pipeline.
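The pattern here is an ordinary pickle round-trip: whatever state your pipeline memorized during fit survives serialization and is available again after loading. Here is a minimal sketch of that round-trip using a hypothetical DummyPipeline (not part of the assignment skeleton) and an in-memory buffer standing in for the .pkl file:

```python
import io
import pickle

# DummyPipeline is a stand-in to illustrate the save/load pattern;
# your real DataPrepPipeline will memorize richer state in fit().
class DummyPipeline:
    def __init__(self):
        self.vocab = None

    def fit(self, words):
        # "Memorize" a vocabulary from the training data.
        self.vocab = sorted(set(words))

    def transform(self, words):
        # Map each known word to its index in the memorized vocabulary.
        return [self.vocab.index(w) for w in words if w in self.vocab]

pipeline = DummyPipeline()
pipeline.fit(["rock", "jazz", "pop"])

# Serialize and deserialize, as submission.py and the autograder do with files.
buf = io.BytesIO()
pickle.dump(pipeline, buf)
buf.seek(0)
reloaded = pickle.load(buf)

print(reloaded.transform(["pop", "rock"]))  # → [1, 2]: fitted state survived
```

The same pickle.dump/pickle.load calls work identically with file handles, which is what the sample submission and the autograder use.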

Data Preparation Pipeline

For this assignment, you will likely want a more complex data preparation pipeline than in previous miniprojects. In particular, it might be helpful for your pipeline to have a saved state which is fitted to the training data and then applied to the test data. For example, suppose that you wanted to use the lyrics column as a feature. You could identify the 100 most common words in the training data, and then use the counts of those words as features. If you want to engineer comparable features on the test data, you need to save the 100 words, since the top 100 words in the test data lyrics might be different.

The feature_extraction module of scikit-learn has some useful functions if you want to do something like this.

For this reason, your DataPrepPipeline class should implement:

  • An __init__ method which initializes any necessary attributes of the class.
  • A fit method which takes in the training data and stores any relevant information for preparing the data. For example, if you are using the lyrics column, this method might identify the 100 most common words and save them as an attribute of the class.
  • A transform method which actually returns the prepared data.

The .fit method of classes like sklearn.feature_extraction.text.CountVectorizer is a convenient shortcut: it will do this memorizing for you, and you can just save the fitted vectorizer as an instance variable of your pipeline.
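To make the fit/transform pattern concrete, here is a minimal sketch of a pipeline that combines word counts from the lyrics with some numeric columns. The particular numeric columns, the max_features value, and the toy data frame are all illustrative choices, not requirements:

```python
import numpy as np
import pandas as pd
import torch
from sklearn.feature_extraction.text import CountVectorizer

class DataPrepPipeline:
    def __init__(self, numeric_cols=("danceability", "energy"), max_features=100):
        self.numeric_cols = list(numeric_cols)
        self.vectorizer = CountVectorizer(max_features=max_features)

    def fit(self, X):
        # Memorize the vocabulary from the *training* lyrics only.
        self.vectorizer.fit(X["lyrics"])
        return self

    def transform(self, X):
        # Word counts over the memorized vocabulary, plus numeric features.
        counts = self.vectorizer.transform(X["lyrics"]).toarray()
        numeric = X[self.numeric_cols].to_numpy()
        features = np.hstack([numeric, counts])
        return torch.tensor(features, dtype=torch.float32)

# Tiny stand-in for the real training frame:
toy = pd.DataFrame({
    "lyrics": ["love night love", "night train blues"],
    "danceability": [0.7, 0.4],
    "energy": [0.5, 0.6],
})
pipe = DataPrepPipeline(max_features=4)
pipe.fit(toy)
X = pipe.transform(toy)
print(X.shape)  # 2 rows; 2 numeric columns + up to 4 word-count columns
```

Because the vocabulary lives inside the pickled pipeline, calling transform on the test data produces columns in exactly the same order as on the training data.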

Classification Model

Your classification model can be any multiclass classification model you like, implemented in torch. You are welcome to freely copy from the lecture notes (with attribution in your comments), and you are also welcome to use generative AI coding assistance (again, with attribution in the comments).

In case you are wondering: yes, it is fine to use my implementation of logistic regression from the lecture notes.
  • Your model must be implemented in torch.
  • Your GradientDescentOptimizer must use manual gradients.
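As one concrete option, here is a minimal sketch of softmax (multinomial logistic) regression with a hand-derived gradient. The class names match the skeleton above, but the learning rate, the toy data, and the training loop are illustrative only, and the sketch assumes y is one-hot encoded:

```python
import torch

class GenreModel:
    def __init__(self, n_features, n_classes):
        self.W = torch.zeros(n_features, n_classes)

    def forward(self, X):
        # Predicted class probabilities for each row of X.
        return torch.softmax(X @ self.W, dim=1)

class GradientDescentOptimizer:
    def __init__(self, model, lr=0.1):
        self.model = model
        self.lr = lr

    def step(self, X, y):
        # Gradient of mean cross-entropy for softmax regression,
        # derived by hand (no autograd): dL/dW = X^T (p - y) / n.
        p = self.model.forward(X)
        grad = X.T @ (p - y) / X.shape[0]
        self.model.W -= self.lr * grad

# Toy check on a trivially separable two-class problem
# (second feature column is a constant bias term):
X = torch.tensor([[0.0, 1.0], [1.0, 1.0]])
y = torch.tensor([[1.0, 0.0], [0.0, 1.0]])  # one-hot labels
model = GenreModel(n_features=2, n_classes=2)
opt = GradientDescentOptimizer(model, lr=1.0)
for _ in range(200):
    opt.step(X, y)
acc = (model.forward(X).argmax(dim=1) == y.argmax(dim=1)).float().mean().item()
print(acc)  # → 1.0 on this toy problem
```

Note that the gradient here is computed by an explicit formula, which satisfies the manual-gradients requirement.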

Save Your Model and Pipeline

Please take note of the two calls to pickle.dump() in the sample submission above. These save your data preparation pipeline and your model. The autograder uses both files to test your submission and will not run unless both are present.

Classifier Performance

The performance of your classifier will be measured via accuracy. The base rate of classification in this problem is

base_rate = (train.groupby("genre").size() / len(train)).max()
print(f"Base rate: {base_rate:.2f}")
Base rate: 0.25

Any accuracy above this, when measured on the test set, reflects some degree of successful learning. However, we’re going to be “ambitious”:

The test accuracy to shoot for is 45%.

Autograder

Please keep in mind that the autograder is still slow. Rather than repeatedly testing your model against the autograder, it is recommended to withhold a validation set from your training data and test against that. You can then submit to the autograder once you are satisfied with your model’s performance on the validation set.
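One simple way to hold out a validation set is to shuffle the row indices and slice off a fraction. Here is a minimal sketch, with random stand-ins for the prepared feature tensor and labels and an illustrative 20% split:

```python
import torch

# Random stand-ins for prepared training data:
# 100 examples, 5 features, 7 genre labels.
torch.manual_seed(451)
X = torch.rand(100, 5)
y = torch.randint(0, 7, (100,))

# Shuffle the indices, then hold out 20% for validation.
perm = torch.randperm(X.shape[0])
n_val = X.shape[0] // 5
val_idx, train_idx = perm[:n_val], perm[n_val:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(X_train.shape, X_val.shape)  # → torch.Size([80, 5]) torch.Size([20, 5])
```

Train only on the training split and measure accuracy on the validation split; that number is a reasonable preview of what the autograder will report.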

Like last time, I’m including the complete code for the autograder in the folded code block below.

Code
import os 
os.environ['KMP_DUPLICATE_LIB_OK']='True'
import sys
import unittest
import pickle
import pandas as pd
from gradescope_utils.autograder_utils.decorators import tags
from gradescope_utils.autograder_utils.json_test_runner import JSONTestRunner
from gradescope_utils.autograder_utils.decorators import weight 

from submission import DataPrepPipeline, GenreModel

import torch

class TestGenreModel(unittest.TestCase):
    
    @weight(1)
    def test_file_name(self):
        """
        Test that submission.py, model.pkl, and pipeline.pkl exist in the current directory
        """
        self.assertTrue(os.path.exists("submission.py"), "submission.py not found in current directory -- please check that you uploaded exactly this file with exactly this name")
        self.assertTrue(os.path.exists("model.pkl"), "model.pkl not found in current directory -- please check that you uploaded this file with exactly this name")
        self.assertTrue(os.path.exists("pipeline.pkl"), "pipeline.pkl not found in current directory -- please check that you uploaded this file with exactly this name")
    
    @weight(1)
    def test_submission(self):
        """
        Test that submission.py can be imported without errors
        """
        try:            
            import submission
        except Exception:
            self.fail("submission.py could not be imported. Make sure it is in the current directory and does not have syntax errors.")
    
    @weight(1)
    def test_classes_and_functions(self): 
        """
        Test that GenreModel, DataPrepPipeline, and GradientDescentOptimizer are defined in submission.py
        """
        try: 
            from submission import GenreModel
        except ImportError:
            self.fail("GenreModel class not found in submission.py.")
            
        try: 
            from submission import DataPrepPipeline
        except ImportError:
            self.fail("DataPrepPipeline class not found in submission.py.")
            
        try: 
            from submission import GradientDescentOptimizer
        except ImportError:
            self.fail("GradientDescentOptimizer class not found in submission.py.")
        
    
    @weight(1)
    def test_model_prediction(self):
        """
        Test that the saved model can make predictions on the test set and that the predictions have the correct shape
        """
        try: 
            from submission import GenreModel, DataPrepPipeline
            model = pickle.load(open("model.pkl", "rb"))
            pipeline = pickle.load(open("pipeline.pkl", "rb"))
            df_test = pd.read_csv("test.csv")
            X_test = pipeline.transform(df_test)
            y_test = torch.tensor(pd.get_dummies(df_test["genre"]).values, dtype=torch.float32)
            y_pred = model.forward(X_test)
            self.assertEqual(y_pred.shape, y_test.shape, f"Predicted values have incorrect shape. Expected {y_test.shape}, got {y_pred.shape}.")
        except Exception as e:
            self.fail(f"Error testing model prediction: {e}")

    @weight(1)
    def test_model_performance_1(self):
        """
        Test the saved model reaches at least 35% accuracy on the test set. 
        """
        
        acc = compute_test_accuracy()
        self.assertGreaterEqual(acc, 0.35, f"Model accuracy on test set is not above 35%. Got {100*acc:.2f}.")
        
    @weight(1)
    def test_model_performance_2(self):
        """
        Test the saved model reaches at least 40% accuracy on the test set. 
        """
        
        acc = compute_test_accuracy()
        self.assertGreaterEqual(acc, 0.40, f"Model accuracy on test set is not above 40%. Got {100*acc:.2f}.")
    
    @weight(1)
    def test_model_performance_3(self):
        """
        Test the saved model reaches at least 45% accuracy on the test set. 
        """
        
        acc = compute_test_accuracy()
        self.assertGreaterEqual(acc, 0.45, f"Model accuracy on test set is not above 45%. Got {100*acc:.2f}.")
        
def compute_test_accuracy():
    from submission import GenreModel, DataPrepPipeline
    
    model = pickle.load(open("model.pkl", "rb"))
    pipeline = pickle.load(open("pipeline.pkl", "rb"))
    df_test = pd.read_csv("test.csv")
    X_test = pipeline.transform(df_test)
    y_test = torch.tensor(pd.get_dummies(df_test["genre"]).values, dtype=torch.float32)
    y_pred = model.forward(X_test)
    
    accuracy = (y_pred.argmax(dim=1) == y_test.argmax(dim=1)).float().mean().item()
    return accuracy
    
if __name__ == "__main__":
    
    with open('/autograder/results/results.json', 'w') as f:
    # with open('results.json', 'w') as f:
        runner = JSONTestRunner(visibility = "visible", stream=f)
        runner.json_data["leaderboard"] = [{"name":"Accuracy", "value": round(100*compute_test_accuracy(), 2)}]
        runner.run(unittest.TestLoader().loadTestsFromTestCase(TestGenreModel))   

Your Grade

Your grade is computed out of 10:

  • 4 points for passing the basic autograder tests, which check things like whether you uploaded the right files and whether your model makes predictions without errors.
  • 1 point for achieving a test accuracy higher than 35%.
  • 1 point for achieving a test accuracy higher than 40%.
  • 1 point for achieving a test accuracy higher than 45%.
  • 3 points for writing logical, organized, idiomatic, vectorized code as assessed by the (human) grader.

Extra Credit

Like last time, we’ll have a leaderboard. The top submissions as measured by testing accuracy will again receive one homework problem’s worth of extra credit.

Some Tips

  • I found it useful to work with the lyrics column in order to meet the higher threshold. I achieved reasonable performance by storing a vectorizer from sklearn.feature_extraction, which I fit as part of the DataPrepPipeline.fit() method.
  • I found it useful to save the names of the engineered features I wanted to use (like loudness and feelings) in the DataPrepPipeline.__init__() method.
  • My complete submission.py file is around 100 lines of code (with few comments and no docstrings).
  • You can define other functions in submission.py, especially ones to use in your pipeline. For example, you might define a function for polynomial features (including or not including cross-terms) or other nonlinear feature maps.
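As one example of such a helper, here is a hypothetical quadratic feature map without cross-terms (the function name and the choice of squares rather than higher powers are illustrative):

```python
import torch

def quadratic_features(X):
    # Concatenate the original columns with their elementwise squares.
    return torch.cat([X, X**2], dim=1)

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(quadratic_features(X))
# Each row gains squared copies of its entries: [1, 2, 1, 4] and [3, 4, 9, 16]
```

If you apply a map like this inside DataPrepPipeline.transform, the same expansion happens automatically on both the training and the test data.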