Miniproject: Predicting Bikeshare Demand

In this miniproject, you’ll try your own hand at predicting demand for a bikeshare system in Washington D.C. This data set was introduced by Fanaee-T and Gama (2013) and popularized through the UCI Machine Learning Repository.

Learning Objectives

  • You will demonstrate your ability to implement and train a linear regression model “from scratch,” using tools discussed in class.
  • You will train your model to minimize a custom loss function that reflects the priorities of the hypothetical decision-maker we are collaborating with.
  • You will practice creative, open-ended model design for a real-world problem, including data preprocessing, feature engineering, and model selection.

Data Access

To download the data as a data frame, run the following code:

import pandas as pd
url = "https://middcs.github.io/csci-0451-s26/data/bikeshare/train.csv"
df_train = pd.read_csv(url)
df_train.head()
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 6603 2011-10-07 4 0 10 16 0 5 1 1 0.64 0.6212 0.44 0.0000 73 346 419
1 1439 2011-03-05 1 0 3 5 0 6 0 2 0.30 0.3030 1.00 0.1343 0 3 3
2 10222 2012-03-06 1 1 3 23 0 2 1 1 0.28 0.2727 0.61 0.2239 3 51 54
3 7792 2011-11-26 4 0 11 6 0 6 0 2 0.30 0.3182 0.75 0.0896 2 8 10
4 213 2011-01-10 1 0 1 3 0 1 1 1 0.12 0.1212 0.50 0.2239 0 1 1

You’re encouraged to visualize the data to get familiar with its structure, but this isn’t a graded component of the assignment.

Modeling Aim

The bikeshare company wants to use a model to help them decide how many bikes to make available each hour, and they’ve asked you for help! They will stock exactly as many bikes as your model predicts, with the goal of minimizing their losses from unmet demand and excess inventory.

The company is proceeding under the assumption that it is worse to underpredict demand than to overpredict it:

  • If one unit of demand goes unmet, the company loses out on potential revenue from that rental. The company’s analysts assign cost \(a\) to each unit of unmet demand.
  • If one unit of inventory goes unused, the company only loses out on the cost of managing the excess inventory. The company’s analysts assign cost \(b\) to each unit of excess inventory.

The analysts have estimated that \(a = 2\) and \(b = 1\).

For the purposes of this modeling exercise, we assume that each bike can be rented at most once per hour.

So, if in one hour we predict that 20 bikes will be rented (and we supply 20 bikes), but the actual count is 10, then the company’s loss is 10 units of unused inventory, each costing \(b = 1\), for a total loss of \(10 \times 1 = 10\).

If instead we had predicted 7 bikes, but the actual count was 10, then the company’s loss is 3 units of unmet demand, each costing \(a = 2\), for a total loss of \(3 \times 2 = 6\).

Formally, the average loss per prediction associated with a set of predictions \(\hat{\mathbf{y}}\) on true values \(\mathbf{y}\) can be expressed as:

\[ \begin{aligned} \mathrm{Loss} = \frac{1}{n} \sum_{i=1}^n c_i |\hat{y}_i - y_i|\;, \end{aligned} \]

where \(c_i = a\) if \(\hat{y}_i < y_i\) and \(c_i = b\) if \(\hat{y}_i > y_i\). The company wants a model that will make predictions \(\hat{\mathbf{y}}\) that minimize this loss on future data.
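As a concrete sketch of how this loss might be implemented in PyTorch (using the `weighted_loss(y_true, y_pred, a, b)` signature required later in this assignment; the elementwise masking shown here is one approach among several):

```python
import torch

def weighted_loss(y_true, y_pred, a, b):
    # per-example cost: a where we underpredicted (y_pred < y_true), b otherwise
    under = (y_pred < y_true).float()
    c = a * under + b * (1.0 - under)
    # mean of c_i * |yhat_i - y_i| over all predictions
    return (c * torch.abs(y_pred - y_true)).mean()
```

On the worked examples above, predicting 20 against a true count of 10 gives a loss of 10, and predicting 7 against a true count of 10 gives a loss of 6.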

Features and Targets

The target variable in this problem is the cnt column of the data frame, which contains the total count of bike rentals in a given hour. The other columns of the data frame are potential features that you can use to predict the target variable. However, casual and registered are not available at the time of prediction: they count the casual and registered rentals in that hour, and they sum exactly to cnt, so keeping them would leak the target into your features. You must delete these columns as part of your data preparation pipeline.
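For instance, a small helper (the name `drop_leaky_columns` is illustrative, not required) could handle this step early in your preprocessing:

```python
import pandas as pd

# casual and registered sum to cnt, so keeping them would leak the target;
# drop them before building the feature matrix
def drop_leaky_columns(df):
    return df.drop(columns=["casual", "registered"])
```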

Your Task

Create a regression model, trained on the training data provided above, that you will use to predict the total number of bike rentals in a given hour. Your solution should include:

  1. A LinearRegression class that implements a linear regression model using PyTorch. The class should have a forward method that takes in a tensor of input features and returns the predicted count of bike rentals.
    • Your weight vector must be stored in an instance variable w (i.e., accessible as model.w). It should be initialized to zeros and have shape (n_features, 1), where n_features is the number of input features (after preprocessing). The extra dimension of size 1 is not strictly necessary for this kind of problem, but it is assumed by the autograder (and we’ll need it for future problems).
  2. A weighted_loss function that computes the loss for your model, taking into account the company’s preference for underpredicting demand over overpredicting it. The loss function should assign a higher cost to underpredictions than to overpredictions, based on the values of \(a\) and \(b\) provided by the company’s analysts.
  3. A custom BikeModelOptimizer which computes the gradient of the loss with respect to the model’s parameters and updates the parameters accordingly.
  4. A prepare_data function that takes in a raw, unmodified data frame and returns a tuple of tensors (X, y) that are ready to be fed into your model. This function should perform any necessary preprocessing steps, such as one-hot encoding of categorical variables and application of nonlinear feature maps. Please remember to delete the casual and registered columns as part of this function, since these are not features available at the time of prediction.
  5. A training loop that trains your model on the training data using your custom optimizer and the weighted loss function.
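As one possible starting point, a minimal LinearRegression could look like the following sketch (this assumes prepare_data appends any bias column itself, so n_features already counts it):

```python
import torch

class LinearRegression:
    def __init__(self, n_features):
        # weight vector of shape (n_features, 1), initialized to zeros,
        # stored as an instance variable so the autograder can find model.w
        self.w = torch.zeros(n_features, 1)

    def forward(self, X):
        # X: (n_samples, n_features) -> predictions: (n_samples, 1)
        return X @ self.w
```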

Required

Data Preparation Pipeline

Your implementation should include a function called prepare_data that takes in a data frame containing all columns of df_train and returns a tuple of tensors (X, y), where X is a tensor of input features and y is a tensor of target values. This function should perform any necessary preprocessing steps, such as one-hot encoding of categorical variables and application of nonlinear feature maps. You should be able to use your function like this:

X_train, y_train = prepare_data(df_train)
# ready for the training loop now
Note

The testing data contains fewer distinct values in the mnth column than the training data, which means that if you one-hot encode the mnth column in the standard way (with something like pd.get_dummies), you will end up with different numbers of features in the training and testing data. This will cause your model to fail when the autograder tries to load it and evaluate it on the test set.

A good way to get around this is to implement a custom one-hot encoding function that guarantees that the same number of features will be produced regardless of the values in the mnth column. For example, you could create a function that takes in a data frame and produces 12 columns corresponding to the 12 months, with a 1 in the column corresponding to the month of each row and 0s in the other columns. This way, even if some months are missing from the test set, you will still have 12 columns corresponding to the months, and your model will be able to make predictions on the test set without crashing.
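One possible sketch of such an encoder (the mnth_1 … mnth_12 column names are illustrative):

```python
import pandas as pd

def one_hot_months(df):
    # emit all 12 month indicator columns, even for months absent from df,
    # so train and test always produce the same feature count
    out = df.copy()
    for m in range(1, 13):
        out[f"mnth_{m}"] = (out["mnth"] == m).astype(float)
    return out.drop(columns=["mnth"])
```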

Manual Gradients

Your BikeModelOptimizer should be implemented from scratch, without using any of PyTorch’s built-in optimizers. Compute the gradient of the loss with respect to the model’s parameters manually, in a method of your BikeModelOptimizer called grad_func(X, y, a, b). This method takes a batch of input features X, target values y, and the cost parameters a and b, and returns the gradient of the loss with respect to the model’s parameters. Use this method in your training loop to update the parameters via gradient descent.
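Since the model is linear, a subgradient of the loss is \(\nabla_{\mathbf{w}} \mathrm{Loss} = \frac{1}{n} X^\top \left(\mathbf{c} \odot \mathrm{sign}(X\mathbf{w} - \mathbf{y})\right)\), where \(c_i = a\) for negative residuals (underprediction) and \(c_i = b\) otherwise. A sketch of an optimizer built on this formula (the constructor signature matches the autograder’s gradient test; the step method is illustrative, only grad_func is required):

```python
import torch

class BikeModelOptimizer:
    def __init__(self, model, lr=0.01):
        self.model = model
        self.lr = lr

    def grad_func(self, X, y, a, b):
        # residual r = y_hat - y; r < 0 means we underpredicted (cost a)
        r = X @ self.model.w - y
        under = (r < 0).float()
        c = a * under + b * (1.0 - under)
        # subgradient of (1/n) * sum_i c_i |r_i| with respect to w
        return X.T @ (c * torch.sign(r)) / X.shape[0]

    def step(self, X, y, a, b):
        # one plain gradient-descent update (method name is illustrative)
        self.model.w = self.model.w - self.lr * self.grad_func(X, y, a, b)
```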

Encouraged

An excellent solution is likely to include some model selection, in which you make some decisions:

  1. Choose which data columns to include in your model.
  2. Tune the complexity of a nonlinear feature map which you apply to the data columns.
  3. Choose a regularization technique to apply to your model, and tune the strength of the regularization.

A typical way to perform model selection is to set aside a portion of the training data as a validation set (or use cross-validation) to evaluate the performance of your model and select among several alternatives.
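For example, a simple random hold-out split might look like this sketch (the function name and the 20% split are illustrative choices):

```python
import pandas as pd

def train_val_split(df, frac_val=0.2, seed=0):
    # hold out a random fraction of rows as a validation set;
    # the seed makes the split reproducible across runs
    val = df.sample(frac=frac_val, random_state=seed)
    train = df.drop(index=val.index)
    return train, val
```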

Note on Model Assessment

Part of your final result on this assignment will be determined by the loss of your model on the hidden test set. The exact code used in the autograder script is here:

def compute_test_loss():
    # your functions from your submission
    from submission import LinearRegression, prepare_data

    # my implementation of weighted_loss
    from solution import weighted_loss

    # load the model you trained and evaluate it on the test set
    try: 
        model = pickle.load(open("model.pkl", "rb"))
        df_test = pd.read_csv("test.csv")
        X_test, y_test = prepare_data(df_test)
        y_pred = model.forward(X_test)
        loss = weighted_loss(y_test, y_pred, a=1.0, b=0.5)
        return loss.item()

    # if there was an error loading the model or computing the loss, you get a very large loss =(
    except Exception as e:
        print(f"Error computing test loss: {e}")
        return 1e7  # return a large loss if there was an error

You can check that your model will be able to run this code by running the following, which replicates this pipeline on the training data:

# after you've trained your model
X_train, y_train = prepare_data(df_train)
y_pred = model.forward(X_train)
loss = weighted_loss(y_train, y_pred, a=1.0, b=0.5)
print(f"Train loss: {loss.item()}")

Submitting Your Work

The Autograder Is Slow

The autograder is pretty slow to run on your scripts. For this reason, it’s recommended that you use the autograder only to check that your script runs and that your model can be loaded and used to compute a loss, and that you perform your model selection and training (to drive down your model’s loss) in an interactive notebook (e.g. via Google Colab), using a held-out validation set.

Below I’ve included the complete code for the autograder’s test script, which you can use to check that your model can be loaded and used to compute a loss. Feel free to run parts of this code in your notebook or in a script to check that your model is working as expected. Note that you won’t be able to run this code block as-is, since it relies on the autograder’s file structure, the test data, and some custom gradescope_utils. I’m providing it here just so that you can see what kinds of tests to expect and so that you can take some of the code to simulate the autograder’s evaluation pipeline in your own notebook or script.

Code
import os 
os.environ['KMP_DUPLICATE_LIB_OK']='True'
import sys
import unittest
import pickle
import pandas as pd
from gradescope_utils.autograder_utils.decorators import tags
from gradescope_utils.autograder_utils.json_test_runner import JSONTestRunner
from gradescope_utils.autograder_utils.decorators import weight 

from submission import LinearRegression

import torch

class TestBikeModel(unittest.TestCase):
    
    @weight(1)
    def test_file_name(self):
        """
        Test that submission.py and model.pkl exist in the current directory
        """
        self.assertTrue(os.path.exists("submission.py"), "submission.py not found in current directory")
        self.assertTrue(os.path.exists("model.pkl"), "model.pkl not found in current directory")
    
    @weight(1)
    def test_submission(self):
        """
        Test that submission.py can be imported without errors
        """
        try:            
            import submission
        except ImportError:
            self.fail("submission.py could not be imported. Make sure it is in the current directory and does not have syntax errors.")
    
    @weight(1)
    def test_classes_nd_functions(self): 
        """
        Test that LinearRegression, BikeModelOptimizer, and prepare_data are defined in submission.py
        """
        try: 
            from submission import LinearRegression
        except ImportError:
            self.fail("LinearRegression class not found in submission.py.")
            
        try: 
            from submission import BikeModelOptimizer
        except ImportError:
            self.fail("BikeModelOptimizer class not found in submission.py.")
        
        try: 
            from submission import prepare_data
        except ImportError:
            self.fail("prepare_data function not found in submission.py.")
    
    @weight(1)  
    def test_gradients(self): 
        """
        Test that the grad_func method of BikeModelOptimizer returns a tensor of the correct shape and value
        """
        try: 
            from submission import LinearRegression, BikeModelOptimizer
            from solution import weighted_loss
            
            n_features = 3
            X = torch.randn(5, n_features)
            X = torch.cat([X, torch.ones(X.shape[0], 1)], dim=1)  # add bias term
            y = torch.randn(5, 1)
            
            model = LinearRegression(n_features+1)
            model.w = torch.zeros(n_features+1, 1)  # set weights to zero for easier testing
            optimizer = BikeModelOptimizer(model, lr=0.01)
            a, b = 1.0, 0.5
            grad = optimizer.grad_func(X, y, a, b)
            
            # compare to autograd
            model.w.requires_grad_(True)
            y_pred = model.forward(X)
            loss = weighted_loss(y, y_pred, a, b)
            loss.backward()
            autograd_grad = model.w.grad
            self.assertEqual(grad.shape, (n_features + 1, 1), f"Gradient shape is incorrect. Expected ({n_features + 1}, 1).")
            self.assertTrue(torch.allclose(grad, autograd_grad, atol=1e-4), f"Gradient values are incorrect. Please review your implementation of grad_func in BikeModelOptimizer. Received: {grad}, Expected: {autograd_grad}")
            
        except Exception as e:
            self.fail(f"Error testing gradients: {e}")

def compute_test_loss():
    from submission import LinearRegression, prepare_data
    from solution import weighted_loss
    try: 
        model = pickle.load(open("model.pkl", "rb"))
        df_test = pd.read_csv("test.csv")
        X_test, y_test = prepare_data(df_test)
        y_pred = model.forward(X_test)
        loss = weighted_loss(y_test, y_pred, a=1.0, b=0.5)
        return loss.item()
    except Exception as e:
        print(f"Error computing test loss: {e}")
        return 1e7  # return a large loss if there was an error
    

if __name__ == "__main__":
    
    with open('/autograder/results/results.json', 'w') as f:
        runner = JSONTestRunner(visibility = "visible", stream=f)
        runner.json_data["leaderboard"] = [{"Test loss": compute_test_loss()}]
        runner.run(unittest.TestLoader().loadTestsFromTestCase(TestBikeModel))   

Submission Script Structure

You should submit a script called submission.py structured in the following way:

import torch 
import pandas as pd
import pickle

class LinearRegression:
    # your implementation here
    ...

class BikeModelOptimizer:
    # your implementation here
    ...

def weighted_loss(y_true, y_pred, a, b):
    # your implementation here
    ...

def prepare_data(df):
    # your implementation here
    ...

def main(): 
    url = "https://middcs.github.io/csci-0451-s26/data/bikeshare/train.csv"
    df_train = pd.read_csv(url)
    X_train, y_train = prepare_data(df_train)
    model = LinearRegression(X_train.shape[1])
    optimizer = BikeModelOptimizer(model)

    # your code for training and assessing the model here

    pickle.dump(model, open("model.pkl", "wb"))

if __name__ == "__main__":
    main()

You’re encouraged to use an interactive notebook (e.g. via Google Colab) for prototyping your model and performing model selection; however, your submission needs to be a Python script organized as above.

Please do not perform any model training outside of the main() function, as this will make the autograder run slowly and possibly crash.

Saved Model

The script above saves your model as model.pkl using Python’s pickle module. This is the file that the autograder will load to evaluate your model’s performance on the test set. Please submit this file alongside your submission.py script.

Optional: Blog About It

These instructions can be followed at any time.

You may want to share the process by which you explored the data, designed your model, performed model selection, etc. If you want, you can share your process and results in a blog post.

  1. Make an attractive Google Colab notebook that includes your code, visualizations, and explanations of your process.
    • You can also use a Jupyter notebook on your local machine if you prefer.
  2. Download the notebook as an .ipynb file after executing all cells.
  3. Follow the instructions here to set up your blog and use this notebook as your first blog post.
    • These instructions are quite involved, but you only need to do them once.

Your Grade

Your grade on this assignment has the following components:

  • 50% for passing all the autograder unit tests.
  • 50% for efficient, idiomatic code that minimizes loops, reduces redundant computations, and uses PyTorch’s built-in functions where appropriate (as assessed by the grader).

Leaderboard

After the final assignment due date, the autograder will run on your final submission and compute your model’s loss on the hidden test set.

The top 5 students on the leaderboard will receive extra credit equal to 1 homework problem, which can be used to skip a homework problem or compensate for low scores on prior problems.

References

Fanaee-T, Hadi, and Joao Gama. 2013. “Event Labeling Combining Ensemble Detectors and Background Knowledge.” Progress in Artificial Intelligence, 1–15. https://doi.org/10.1007/s13748-013-0040-3.