In this miniproject, you’ll try your own hand at predicting demand for a bikeshare system in Washington D.C. This data set was introduced by Fanaee-T and Gama (2013) and popularized through the UCI Machine Learning Repository.
Learning Objectives
You will demonstrate your ability to implement and train a linear regression model “from scratch,” using tools discussed in class.
You will train your model to a custom loss function that reflects the priorities of the hypothetical decision-maker we are collaborating with.
You will practice creative, open-ended model design for a real-world problem, including data preprocessing, feature engineering, and model selection.
Data Access
To download the data as a data frame, run the following code:
import pandas as pd

url = "https://middcs.github.io/csci-0451-s26/data/bikeshare/train.csv"
train = pd.read_csv(url)
train.head()
|   | instant | dteday     | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp  | hum  | windspeed | casual | registered | cnt |
|---|---------|------------|--------|----|------|----|---------|---------|------------|------------|------|--------|------|-----------|--------|------------|-----|
| 0 | 6603    | 2011-10-07 | 4      | 0  | 10   | 16 | 0       | 5       | 1          | 1          | 0.64 | 0.6212 | 0.44 | 0.0000    | 73     | 346        | 419 |
| 1 | 1439    | 2011-03-05 | 1      | 0  | 3    | 5  | 0       | 6       | 0          | 2          | 0.30 | 0.3030 | 1.00 | 0.1343    | 0      | 3          | 3   |
| 2 | 10222   | 2012-03-06 | 1      | 1  | 3    | 23 | 0       | 2       | 1          | 1          | 0.28 | 0.2727 | 0.61 | 0.2239    | 3      | 51         | 54  |
| 3 | 7792    | 2011-11-26 | 4      | 0  | 11   | 6  | 0       | 6       | 0          | 2          | 0.30 | 0.3182 | 0.75 | 0.0896    | 2      | 8          | 10  |
| 4 | 213     | 2011-01-10 | 1      | 0  | 1    | 3  | 0       | 1       | 1          | 1          | 0.12 | 0.1212 | 0.50 | 0.2239    | 0      | 1          | 1   |
You’re encouraged to visualize the data to get familiar with its structure, but this isn’t a graded component of the assignment.
Modeling Aim
The bikeshare company wants to use a model to help them decide how many bikes to make available at each hour, and they’ve asked you for help! They will use your model’s predictions to determine how many bikes to stock at each hour, with the goal of minimizing their losses from unmet demand and excess inventory.
The company is proceeding under the assumption that it is worse to underpredict demand than to overpredict it:
If one unit of demand goes unmet, the company loses out on potential revenue from that rental. The company’s analysts assign cost \(a\) to each unit of unmet demand.
If one unit of inventory goes unused, the company only loses out on the cost of managing the excess inventory. The company’s analysts assign cost \(b\) to each unit of excess inventory.
The analysts have estimated that \(a = 2\) and \(b = 1\).
For the purposes of this modeling exercise, we assume that each bike can be rented at most once per hour.

So, if in one hour we predict 20 bikes will be rented (and we supply 20 bikes), but the actual count is 10, then the company’s loss is 10 units of unused inventory, each costing \(b = 1\), for a total loss of \(10 \times 1 = 10\).

If we had instead predicted 7 bikes, but the actual count was 10, then the company’s loss is 3 units of unmet demand, each costing \(a = 2\), for a total loss of \(3 \times 2 = 6\).
Formally, the average loss per prediction associated with a set of predictions \(\hat{\mathbf{y}}\) on true values \(\mathbf{y}\) can be expressed as:
\[
L(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} c_i \lvert \hat{y}_i - y_i \rvert\;,
\]
where \(c_i = a\) if \(\hat{y}_i < y_i\) and \(c_i = b\) if \(\hat{y}_i > y_i\). The company wants a model that will make predictions \(\hat{\mathbf{y}}\) that minimize this loss on future data.
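As a concrete illustration, this loss is straightforward to write in PyTorch. The sketch below is one possible way (the function name is mine, and it assumes `y_true` and `y_pred` are float tensors of the same shape); your own `weighted_loss` may differ in signature and style.

```python
import torch

def weighted_loss_sketch(y_true, y_pred, a, b):
    # cost a for each unit of unmet demand (underprediction),
    # cost b for each unit of excess inventory (overprediction)
    c = torch.where(y_pred < y_true, torch.tensor(a), torch.tensor(b))
    return (c * (y_pred - y_true).abs()).mean()
```

On the worked examples above, predicting 20 when the true count is 10 gives a loss of 10, and predicting 7 when the true count is 10 gives a loss of 6.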
Features and Targets
The target variable in this problem is the cnt column of the data frame, which contains the total count of bike rentals in a given hour. The other columns of the data frame are potential features that you can use to predict the target variable. However, casual and registered are not available at the time of prediction: they are the counts of casual and registered users in that hour, and they sum exactly to cnt. So, you must delete these columns as part of your data preparation pipeline.
Your Task
Create a regression model, trained on the training data provided above, that you will use to predict the total number of bike rentals in a given hour. Your solution should include:
A LinearRegression class that implements a linear regression model using PyTorch. The class should have a forward method that takes in a tensor of input features and returns the predicted count of bike rentals.
Your weight vector must be stored in an instance variable model.w. It should be initialized to zeros, and should have shape (n_features, 1), where n_features is the number of input features (after preprocessing). The extra dimension of size 1 is not strictly necessary for this kind of problem, but is assumed by the autograder (and we’ll need it for future problems).
A weighted_loss function that computes the loss for your model, taking into account the company’s preference for underpredicting demand over overpredicting it. The loss function should assign a higher cost to underpredictions than to overpredictions, based on the values of \(a\) and \(b\) provided by the company’s analysts.
A custom BikeModelOptimizer which computes the gradient of the loss with respect to the model’s parameters and updates the parameters accordingly.
A prepare_data function that takes in a raw, unmodified data frame and returns a tuple of tensors (X, y) that are ready to be fed into your model. This function should perform any necessary preprocessing steps, such as one-hot encoding of categorical variables and application of nonlinear feature maps. Please remember to delete the casual and registered columns as part of this function, since these are not features available at the time of prediction.
A training loop that trains your model on the training data using your custom optimizer and the weighted loss function.
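To make the expected interface concrete, here is a minimal sketch of the model class (only the `w` attribute and the `forward` method are specified above; the constructor signature here is my assumption, and your version will likely grow beyond this):

```python
import torch

class LinearRegression:
    def __init__(self, n_features):
        # weight vector initialized to zeros, with shape (n_features, 1)
        self.w = torch.zeros(n_features, 1)

    def forward(self, X):
        # X has shape (n_samples, n_features); returns shape (n_samples, 1)
        return X @ self.w
```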
Required
Data Preparation Pipeline
Your implementation should include a function called prepare_data that takes in a data frame containing all columns of df_train and returns a tuple of tensors (X, y), where X is a tensor of input features and y is a tensor of target values. This function should perform any necessary preprocessing steps, such as one-hot encoding of categorical variables and application of nonlinear feature maps. You should be able to use your function like this:
X_train, y_train = prepare_data(df_train)
# ready for the training loop now
Note
The testing data contains fewer values in the mnth column than the training data, which means that if you one-hot encode the mnth column in the standard way (with something like pd.get_dummies), you will end up with a different number of features in the training and testing data. This will cause your model to fail when the autograder tries to load it and evaluate it on the test set.
A good way to get around this is to implement a custom one-hot encoding function that guarantees that the same number of features will be produced regardless of the values in the mnth column. For example, you could create a function that takes in a data frame and produces 12 columns corresponding to the 12 months, with a 1 in the column corresponding to the month of each row and 0s in the other columns. This way, even if some months are missing from the test set, you will still have 12 columns corresponding to the months, and your model will be able to make predictions on the test set without crashing.
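For instance, a fixed-width encoder along these lines (the function name and output column names are illustrative) always produces the same 12 columns, no matter which months happen to appear in the data:

```python
import pandas as pd

def one_hot_months(df):
    # always emit 12 indicator columns mnth_1 ... mnth_12,
    # even if some months never appear in df
    out = df.copy()
    for m in range(1, 13):
        out[f"mnth_{m}"] = (out["mnth"] == m).astype(int)
    return out.drop(columns=["mnth"])
```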
Manual Gradients
Your BikeModelOptimizer should be implemented from scratch, without using any of PyTorch’s built-in optimizers. You should compute the gradients of the loss with respect to the model’s parameters manually, and update the parameters using gradient descent. Implement manual gradients in a method of your BikeModelOptimizer called grad_func(X, y, a, b). This method should compute the gradient of the loss with respect to the model’s parameters, given a batch of input features X, target values y, and the cost parameters a and b. You can use this method in your training loop to update the model’s parameters based on the computed gradients.
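Because the weighted loss is piecewise linear in the predictions, its gradient has a closed form: with \(\hat{\mathbf{y}} = X\mathbf{w}\), the gradient is \(\frac{1}{n} X^\top \left(\mathbf{c} \odot \operatorname{sign}(X\mathbf{w} - \mathbf{y})\right)\), where \(\mathbf{c}\) holds the per-row costs. A standalone sketch of this computation (not the required grad_func, which lives on your optimizer and pulls w from the model):

```python
import torch

def manual_grad_sketch(X, y, w, a, b):
    # gradient of mean(c_i * |(Xw)_i - y_i|) with respect to w:
    # (1/n) X^T (c * sign(Xw - y)), with c_i = a when underpredicting, b otherwise
    y_pred = X @ w
    c = torch.where(y_pred < y, torch.tensor(a), torch.tensor(b))
    return X.T @ (c * torch.sign(y_pred - y)) / X.shape[0]
```

A good sanity check, which the autograder also performs, is to compare this against the gradient PyTorch's autograd computes for the same loss.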
Encouraged
An excellent solution is likely to include some model selection, in which you make some decisions:
Choose which data columns to include in your model.
Choose a regularization technique to apply to your model, and tune the strength of the regularization.
A typical way to perform model selection is to set aside a portion of the training data as a validation set (or use cross-validation) to evaluate the performance of your model and select among several alternatives.
Note on Model Assessment
Part of your final result on this assignment will be determined by the loss of your model on the hidden test set. The exact code used in the autograder script is here:
def compute_test_loss():
    # your functions from your submission
    from submission import LinearRegression, prepare_data
    # my implementation of weighted_loss
    from solution import weighted_loss
    # load the model you trained and evaluate it on the test set
    try:
        model = pickle.load(open("model.pkl", "rb"))
        df_test = pd.read_csv("test.csv")
        X_test, y_test = prepare_data(df_test)
        y_pred = model.forward(X_test)
        loss = weighted_loss(y_test, y_pred, a=1.0, b=0.5)
        return loss.item()
    # if there was an error loading the model or computing the loss, you get a very large loss =(
    except Exception as e:
        print(f"Error computing test loss: {e}")
        return 1e6  # return a large loss if there was an error
You can check that your model will be able to run this code by running the following, which replicates this pipeline on the training data:
# after you've trained your model
X_train, y_train = prepare_data(df_train)
y_pred = model.forward(X_train)
loss = weighted_loss(y_train, y_pred, a=1.0, b=0.5)
print(f"Train loss: {loss.item()}")
Submitting Your Work
The Autograder Is Slow
The autograder is pretty slow to run on your scripts. For this reason, it’s recommended that you use the autograder only to check that your script runs and that your model can be loaded and used to compute a loss, and that you perform your model selection and training (to try to reduce your model’s loss) in an interactive notebook (e.g. via Google Colab), using a held-out validation set.
Below I’ve included the complete code for the autograder’s test script, which you can use to check that your model can be loaded and used to compute a loss. Feel free to run parts of this code in your notebook or in a script to check that your model is working as expected. Note that you won’t be able to run this code block as-is, since it relies on the autograder’s file structure, the test data, and some custom gradescope_utils. I’m providing it here just so that you can see what kinds of tests to expect and so that you can take some of the code to simulate the autograder’s evaluation pipeline in your own notebook or script.
Code
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
import sys
import unittest
import pickle
import pandas as pd
from gradescope_utils.autograder_utils.decorators import tags
from gradescope_utils.autograder_utils.json_test_runner import JSONTestRunner
from gradescope_utils.autograder_utils.decorators import weight
from submission import LinearRegression
import torch


class TestBikeModel(unittest.TestCase):

    @weight(1)
    def test_file_name(self):
        """
        Test that submission.py and model.pkl exist in the current directory
        """
        self.assertTrue(os.path.exists("submission.py"), "submission.py not found in current directory")
        self.assertTrue(os.path.exists("model.pkl"), "model.pkl not found in current directory")

    @weight(1)
    def test_submission(self):
        """
        Test that submission.py can be imported without errors
        """
        try:
            import submission
        except ImportError:
            self.fail("submission.py could not be imported. Make sure it is in the current directory and does not have syntax errors.")

    @weight(1)
    def test_classes_and_functions(self):
        """
        Test that LinearRegression, BikeModelOptimizer, and prepare_data are defined in submission.py
        """
        try:
            from submission import LinearRegression
        except ImportError:
            self.fail("LinearRegression class not found in submission.py.")
        try:
            from submission import BikeModelOptimizer
        except ImportError:
            self.fail("BikeModelOptimizer class not found in submission.py.")
        try:
            from submission import prepare_data
        except ImportError:
            self.fail("prepare_data function not found in submission.py.")

    @weight(1)
    def test_gradients(self):
        """
        Test that the grad_func method of BikeModelOptimizer returns a tensor of the correct shape and value
        """
        try:
            from submission import LinearRegression, BikeModelOptimizer
            from solution import weighted_loss
            n_features = 3
            X = torch.randn(5, n_features)
            X = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)  # add bias term
            y = torch.randn(5, 1)
            model = LinearRegression(n_features + 1)
            model.w = torch.zeros(n_features + 1, 1)  # set weights to zero for easier testing
            optimizer = BikeModelOptimizer(model, lr=0.01)
            a, b = 1.0, 0.5
            grad = optimizer.grad_func(X, y, a, b)

            # compare to autograd
            model.w.requires_grad_(True)
            y_pred = model.forward(X)
            loss = weighted_loss(y, y_pred, a, b)
            loss.backward()
            autograd_grad = model.w.grad

            self.assertEqual(grad.shape, (n_features + 1, 1), f"Gradient shape is incorrect. Expected ({n_features + 1}, 1).")
            self.assertTrue(torch.allclose(grad, autograd_grad, atol=1e-4), f"Gradient values are incorrect. Please review your implementation of grad_func in BikeModelOptimizer. Received: {grad}, Expected: {autograd_grad}")
        except Exception as e:
            self.fail(f"Error testing gradients: {e}")


def compute_test_loss():
    from submission import LinearRegression, prepare_data
    from solution import weighted_loss
    try:
        model = pickle.load(open("model.pkl", "rb"))
        df_test = pd.read_csv("test.csv")
        X_test, y_test = prepare_data(df_test)
        y_pred = model.forward(X_test)
        loss = weighted_loss(y_test, y_pred, a=1.0, b=0.5)
        return loss.item()
    except Exception as e:
        print(f"Error computing test loss: {e}")
        return 1e7  # return a large loss if there was an error


if __name__ == "__main__":
    with open('/autograder/results/results.json', 'w') as f:
        runner = JSONTestRunner(visibility="visible", stream=f)
        runner.json_data["leaderboard"] = [{"Test loss": compute_test_loss()}]
        runner.run(unittest.TestLoader().loadTestsFromTestCase(TestBikeModel))
Submission Script Structure
You should submit a script called submission.py structured in the following way:
import torch
import pandas as pd
import pickle


class LinearRegression:
    ...  # your implementation here


class BikeModelOptimizer:
    ...  # your implementation here


def weighted_loss(y_true, y_pred, a, b):
    ...  # your implementation here


def prepare_data(df):
    ...  # your implementation here


def main():
    url = "https://middcs.github.io/csci-0451-s26/data/bikeshare/train.csv"
    df_train = pd.read_csv(url)
    X_train, y_train = prepare_data(df_train)
    model = LinearRegression()
    optimizer = BikeModelOptimizer(model)
    # your code for training and assessing the model here
    pickle.dump(model, open("model.pkl", "wb"))


if __name__ == "__main__":
    main()
You’re encouraged to use an interactive notebook (e.g. via Google Colab) for prototyping your model and performing model selection; however, your submission needs to be a Python script organized as above.
Please do not perform any model training outside of the main() function, as this will make the autograder run slowly and possibly crash.
Saved Model
The script above saves your model as model.pkl using Python’s pickle module. This is the file that the autograder will load to evaluate your model’s performance on the test set. Please submit this file alongside your submission.py script.
Optional: Blog About It
These instructions can be followed at any time.
You may want to share the process by which you explored the data, designed your model, performed model selection, etc. If you want, you can share your process and results in a blog post.
Make an attractive Google Colab notebook that includes your code, visualizations, and explanations of your process.
You can also use a Jupyter notebook on your local machine if you prefer.
Download the notebook as an .ipynb file after executing all cells.
Follow the instructions here to set up your blog and use this notebook as your first blog post.
These instructions are quite involved, but you only need to do them once.
Your Grade
Your grade on this assignment has the following components:
50% for passing all the autograder unit tests.
50% for efficient, idiomatic code that minimizes loops, reduces redundant computations, and uses PyTorch’s built-in functions where appropriate (as assessed by the grader).
Leaderboard
After the final assignment due date, the autograder will run on your final submission and compute your model’s loss on the hidden test set.
The top 5 students on the leaderboard will receive extra credit equal to 1 homework problem, which can be used to skip a homework problem or compensate for low scores on prior problems.
References
Fanaee-T, Hadi, and Joao Gama. 2013. “Event Labeling Combining Ensemble Detectors and Background Knowledge.” Progress in Artificial Intelligence, 1–15. https://doi.org/10.1007/s13748-013-0040-3.