Auditing Bias

2025-02-26

folktables was introduced by Ding et al. (2021).

The folktables package allows you to download and neatly organize data from the American Community Survey’s Public Use Microdata Sample (PUMS). You can install it in your ml-0451 environment by running the following two commands in your terminal:

conda activate ml-0451
pip install folktables

You can learn more about the folktables package, including documentation and examples, on the package’s GitHub page.

In this blog post, you’ll fit a classifier using data from folktables and perform a bias audit for the algorithm.

1 Using folktables

The first thing to do is to download some data! Here’s an illustration of downloading a complete set of PUMS data for the state of Michigan.

from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MI"

data_source = ACSDataSource(survey_year='2018', 
                            horizon='1-Year', 
                            survey='person')

acs_data = data_source.get_data(states=[STATE], download=True)

acs_data.head()
RT SERIALNO DIVISION SPORDER PUMA REGION ST ADJINC PWGTP AGEP ... PWGTP71 PWGTP72 PWGTP73 PWGTP74 PWGTP75 PWGTP76 PWGTP77 PWGTP78 PWGTP79 PWGTP80
0 P 2018GQ0000064 3 1 2907 2 26 1013097 8 60 ... 9 0 12 9 11 9 0 9 10 12
1 P 2018GQ0000154 3 1 1200 2 26 1013097 92 20 ... 92 91 93 95 93 173 91 15 172 172
2 P 2018GQ0000158 3 1 2903 2 26 1013097 26 54 ... 26 52 3 25 25 28 28 50 51 25
3 P 2018GQ0000174 3 1 1801 2 26 1013097 86 20 ... 85 12 87 12 87 85 157 86 86 86
4 P 2018GQ0000212 3 1 2600 2 26 1013097 99 33 ... 98 96 98 95 174 175 96 95 179 97

5 rows × 286 columns

There are approximately 99,000 rows of PUMS data in this data frame. Each one corresponds to an individual resident of the given STATE who responded to the 2018 American Community Survey. You’ll notice that there are a lot of columns. In the modeling tasks here, we’re only going to focus on a relatively small number of features. Here are all the possible features I suggest you use:

possible_features=['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG', 'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX', 'RAC1P', 'ESR']
acs_data[possible_features].head()
AGEP SCHL MAR RELP DIS ESP CIT MIG MIL ANC NATIVITY DEAR DEYE DREM SEX RAC1P ESR
0 60 15.0 5 17 1 NaN 1 1.0 4.0 1 1 2 2 1.0 1 2 6.0
1 20 19.0 5 17 2 NaN 1 1.0 4.0 2 1 2 2 2.0 2 1 6.0
2 54 18.0 3 16 1 NaN 1 1.0 4.0 4 1 2 2 1.0 1 1 6.0
3 20 18.0 5 17 2 NaN 1 1.0 4.0 4 1 2 2 2.0 1 1 6.0
4 33 18.0 5 16 2 NaN 1 3.0 4.0 2 1 2 2 2.0 1 1 6.0

For documentation on what these features mean, you can consult the appendix of the paper that introduced the package.

For a few examples:

  • ESR is employment status (coded 1–6, where 1 means employed as a civilian and at work; the target_transform below turns this into a binary label)
  • RAC1P is race (1 for White Alone, 2 for Black/African American alone, 3 and above for other self-identified racial groups)
  • SEX is binary sex (1 for male, 2 for female)
  • DEAR, DEYE, and DREM indicate hearing, vision, and cognitive difficulties, respectively.

Let’s consider the following task: we are going to

  1. Train a machine learning algorithm to predict whether someone is currently employed, based on their other attributes not including race, and
  2. Perform a bias audit of our algorithm to determine whether it displays racial bias.

First, let’s subset the features we want to use:

features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]

Now we can construct a BasicProblem that expresses our wish to use these features to predict employment status ESR, using race RAC1P as the group label. I recommend that you mostly don’t touch the target_transform, preprocess, and postprocess arguments.

You can find examples of constructing problems in the folktables source code if you really want to carefully customize your problem.

EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,   # binary label: employed civilian at work
    group='RAC1P',
    preprocess=lambda x: x,              # leave the data unchanged
    postprocess=lambda x: np.nan_to_num(x, nan=-1),  # replace NaNs with -1
)

features, label, group = EmploymentProblem.df_to_numpy(acs_data)

The result is a feature matrix features, a label vector label, and a group label vector group, all in a convenient format to work with.

for obj in [features, label, group]:
  print(obj.shape)
(99419, 15)
(99419,)
(99419,)

Before we touch the data any more, we should perform a train-test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

Now we are ready to create a model and train it on the training data:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

We can then extract predictions on the test set like this:

y_hat = model.predict(X_test)

The overall accuracy in predicting whether someone is employed is:

(y_hat == y_test).mean()
np.float64(0.7863608931804466)

The accuracy for white individuals is

(y_hat == y_test)[group_test == 1].mean()
np.float64(0.7875706214689265)

The accuracy for Black individuals is

(y_hat == y_test)[group_test == 2].mean()
np.float64(0.7777164920022063)

We can also calculate confusion matrices, false positive rates, false negative rates, positive predictive values, prevalences, and lots of other information using tools we’ve already seen.
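
For instance, here is one way to compute the FPR, FNR, and PPV from a confusion matrix, overall and by group, using the confusion_matrix function imported above. The rates helper is just an illustrative name of mine, not part of any library:

def rates(y_true, y_pred):
    # confusion_matrix returns counts in the order [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"FPR": fp / (fp + tn),   # false positive rate
            "FNR": fn / (fn + tp),   # false negative rate
            "PPV": tp / (tp + fp)}   # positive predictive value

print(rates(y_test, y_hat))          # overall
for g in [1, 2]:                     # by group: White alone, Black alone
    print(g, rates(y_test[group_test == g], y_hat[group_test == g]))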

2 What You Should Do

Choose Your Problem

Choose a prediction problem (target variable), a list of features, and a choice of group with respect to which to evaluate bias. I would suggest one of the following two possibilities:

  1. (What we just illustrated): predict employment status on the basis of demographics excluding race, and audit for racial bias.
  2. Predict whether income is over $50K on the basis of demographics excluding sex, and audit for gender bias.

You can also pick the state from which you would like to pull your data.

Do not audit for racial bias in VT: too few Black individuals responded to the PUMS survey there. 😬

Finally, you should choose a machine learning model. While you can use a model like logistic regression that you’ve previously implemented, my suggestion is to experiment with an out-of-the-box model from scikit-learn. Some simple classifiers with good performance are listed below (a sketch of swapping one into the pipeline follows the list):

  • sklearn.linear_model.LogisticRegression
  • sklearn.svm.SVC (support vector machine)
  • sklearn.tree.DecisionTreeClassifier (decision tree)
  • sklearn.ensemble.RandomForestClassifier (random forest)
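
For instance, swapping one of these into the pipeline from Section 1 is a one-line change. The max_depth value here is just a placeholder:

from sklearn.ensemble import RandomForestClassifier

# same pipeline as before, with the classifier swapped out;
# scaling is unnecessary for tree-based models but does no harm
model = make_pipeline(StandardScaler(), RandomForestClassifier(max_depth=10))
model.fit(X_train, y_train)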

Basic Descriptives

Use simple descriptive analysis to address the following questions. You’ll likely find it easiest to address these problems when working with a data frame. Here’s some code to turn your training data back into a data frame for easy analysis:

import pandas as pd
df = pd.DataFrame(X_train, columns=features_to_use)
df["group"] = group_train
df["label"] = y_train

Using this data frame, answer the following questions:

  1. How many individuals are in the data?
  2. Of these individuals, what proportion have target label equal to 1? In employment prediction, these would correspond to employed individuals.
  3. Of these individuals, how many are in each of the groups?
  4. In each group, what proportion of individuals have target label equal to 1?
  5. Check for intersectional trends by studying the proportion of positive target labels broken out by your chosen group label and one additional group label, as in the sketch following this list. For example, if you chose race (RAC1P) as your group, then you could also choose sex (SEX) and compute the proportion of positive labels by both race and sex. This might be a good opportunity to use a visualization such as a bar chart, e.g. via the seaborn package.
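
Here is a minimal pandas sketch for these questions, assuming the data frame df constructed above (if you excluded SEX from your features, substitute another column in the last line):

print(len(df))                                 # 1. number of individuals
print(df["label"].mean())                      # 2. proportion with positive label
print(df.groupby("group").size())              # 3. individuals per group
print(df.groupby("group")["label"].mean())     # 4. positive proportion per group

# 5. intersectional view: positive proportion by group and sex
print(df.groupby(["group", "SEX"])["label"].mean())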

Train Your Model

Train your model on the training data. Please incorporate a tunable model complexity and use cross-validation in order to select a good choice for the model complexity. Some possibilities:

  • Use polynomial features with LogisticRegression.
  • Tune the regularization parameter C in SVC.
  • Tune the max_depth in DecisionTreeClassifier or RandomForestClassifier (a cross-validation sketch follows this list).
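
For example, here is a minimal sketch using scikit-learn’s GridSearchCV to select max_depth for a decision tree by cross-validation; the grid of values is illustrative only:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validation over a small grid of complexity values
search = GridSearchCV(DecisionTreeClassifier(),
                      param_grid={"max_depth": list(range(2, 12))},
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
model = search.best_estimator_   # already refit on the full training set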

Audit Your Model

Then, perform an audit in which you address the following questions (all on test data):

Overall Measures
  1. What is the overall accuracy of your model?
  2. What is the positive predictive value (PPV) of your model?
  3. What are the overall false negative and false positive rates (FNR and FPR) for your model?
By-Group Measures
  1. What is the accuracy of your model on each subgroup?
  2. What is the PPV of your model on each subgroup?
  3. What are the FNR and FPR on each subgroup?
Bias Measures

See Chouldechova (2017) for definitions of these terms. For calibration, you can think of the score as having only two values, 0 and 1. A sketch of one such check follows the list below.
  • Is your model approximately calibrated?
  • Does your model satisfy approximate error rate balance?
  • Does your model satisfy statistical parity?
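
As one example of these checks: statistical parity requires the rate of positive predictions to be (approximately) equal across groups, which is a one-line computation per group:

# statistical parity check: proportion of positive predictions in each group
for g in [1, 2]:
    print(g, y_hat[group_test == g].mean())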
Feasible FNR and FPR Rates

How fair could your model be, as measured by the FNR and FPR for each group? Please reproduce Figure 5 in Chouldechova (2017) (link) for your chosen data set and task. This figure uses Eq. (2.6) in the paper, fixing the prevalence (proportion of true positive labels) \(p\) for each group. In this visualization, the PPV for all groups is set equal to the lowest PPV across groups; this corresponds to “calibrating” the model. For example, if the PPV for group 1 is 0.8 and the PPV for group 2 is 0.6, then set the PPV for both groups equal to 0.6 for the purposes of drawing the lines. Eq. (2.6) then defines a line of feasible FNR and FPR combinations for each group. It is encouraged but not necessary to reproduce the shaded regions.
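
For reference, the relationship in Eq. (2.6) can be rearranged so that, with the prevalence \(p\) and the common PPV held fixed, each group’s feasible error rates lie on the line

\[\mathrm{FPR} = \frac{p}{1-p} \cdot \frac{1-\mathrm{PPV}}{\mathrm{PPV}} \cdot (1-\mathrm{FNR}).\]

Here is a minimal matplotlib sketch of how these lines could be drawn. The prevalence and PPV values below are placeholders only; replace them with the values you compute from your test data.

import matplotlib.pyplot as plt

p = {1: 0.45, 2: 0.40}   # placeholder prevalences: compute these per group
ppv = 0.75               # placeholder: the lowest PPV across your groups

fnr = np.linspace(0, 1, 101)
for g, p_g in p.items():
    # Eq. (2.6): feasible FPR as a function of FNR for fixed p and PPV
    fpr = (p_g / (1 - p_g)) * ((1 - ppv) / ppv) * (1 - fnr)
    plt.plot(fnr, fpr, label=f"group {g}")

plt.xlabel("FNR")
plt.ylabel("FPR")
plt.legend()
plt.show()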

Using your plot, please address the following question: if we desired to tune our classifier threshold so that the false positive rates were equal between groups, how much would we need to change the false negative rate? You may wish to consult Chouldechova’s discussion of Fig. 5 to help you interpret your figure.

Concluding Discussion

In a few paragraphs, discuss the following questions:

  1. What groups of people could stand to benefit from a system that is able to predict the label you predicted, such as income or employment status? For example, what kinds of companies might want to buy your model for commercial use?
  2. Based on your bias audit, what could be the impact of deploying your model for large-scale prediction in commercial or governmental settings?
  3. Based on your bias audit, do you feel that your model displays problematic bias? What kind (calibration, error rate, etc)?
  4. Beyond bias, are there other potential problems associated with deploying your model that make you uncomfortable? How would you propose addressing some of these problems?

Add An Abstract

Add a brief summary paragraph to the very beginning of your blog post that summarizes your method and findings before submitting your blog post.



© Phil Chodrow, 2025

References

Chouldechova, Alexandra. 2017. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5 (2): 153–63. https://doi.org/10.1089/big.2016.0047.
Ding, Frances, Moritz Hardt, John Miller, and Ludwig Schmidt. 2021. “Retiring Adult: New Datasets for Fair Machine Learning.” In Advances in Neural Information Processing Systems, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, 34:6478–90. Curran Associates, Inc.