The folktables package allows you to download and neatly organize data from the American Community Survey’s Public Use Microdata Sample (PUMS). You can install it in your ml-0451 environment by running the following two commands in your terminal:
```
conda activate ml-0451
pip install folktables
```
You can learn more about the folktables package, including documentation and examples, on the package’s GitHub page.
In this blog post, you’ll fit a classifier using data from folktables and perform a bias audit for the algorithm.
1 Using folktables
The first thing to do is to download some data! Here’s an illustration of downloading a complete set of PUMS data for the state of Michigan.
```python
from folktables import ACSDataSource, ACSEmployment, BasicProblem, adult_filter
import numpy as np

STATE = "MI"

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=[STATE], download=True)
acs_data.head()
```
|   | RT | SERIALNO | DIVISION | SPORDER | PUMA | REGION | ST | ADJINC | PWGTP | AGEP | ... | PWGTP71 | PWGTP72 | PWGTP73 | PWGTP74 | PWGTP75 | PWGTP76 | PWGTP77 | PWGTP78 | PWGTP79 | PWGTP80 |
|---|----|----------|----------|---------|------|--------|----|--------|-------|------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | P | 2018GQ0000064 | 3 | 1 | 2907 | 2 | 26 | 1013097 | 8 | 60 | ... | 9 | 0 | 12 | 9 | 11 | 9 | 0 | 9 | 10 | 12 |
| 1 | P | 2018GQ0000154 | 3 | 1 | 1200 | 2 | 26 | 1013097 | 92 | 20 | ... | 92 | 91 | 93 | 95 | 93 | 173 | 91 | 15 | 172 | 172 |
| 2 | P | 2018GQ0000158 | 3 | 1 | 2903 | 2 | 26 | 1013097 | 26 | 54 | ... | 26 | 52 | 3 | 25 | 25 | 28 | 28 | 50 | 51 | 25 |
| 3 | P | 2018GQ0000174 | 3 | 1 | 1801 | 2 | 26 | 1013097 | 86 | 20 | ... | 85 | 12 | 87 | 12 | 87 | 85 | 157 | 86 | 86 | 86 |
| 4 | P | 2018GQ0000212 | 3 | 1 | 2600 | 2 | 26 | 1013097 | 99 | 33 | ... | 98 | 96 | 98 | 95 | 174 | 175 | 96 | 95 | 179 | 97 |

5 rows × 286 columns
There are approximately 99,000 rows of PUMS data in this data frame. Each one corresponds to an individual citizen of the given STATE who filled out the 2018 edition of the PUMS survey. You’ll notice that there are a lot of columns. In the modeling tasks we’ll use here, we’re only going to focus on a relatively small number of features. Here are all the possible features I suggest you use:
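(For concreteness, the list below mirrors the feature set of folktables' built-in ACSEmployment task, plus the ESR target itself; treat it as a sensible default rather than a requirement.)

```python
possible_features = ['AGEP', 'SCHL', 'MAR', 'RELP', 'DIS', 'ESP', 'CIT', 'MIG',
                     'MIL', 'ANC', 'NATIVITY', 'DEAR', 'DEYE', 'DREM', 'SEX',
                     'RAC1P', 'ESR']
```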
For documentation on what these features mean, you can consult the appendix of the paper that introduced the package (Ding et al. 2021).
For a few examples:
ESR is employment status (recoded in the problem setup below so that the label is 1 if the individual is employed and 0 otherwise)
RAC1P is race (1 for White Alone, 2 for Black/African American alone, 3 and above for other self-identified racial groups)
SEX is binary sex (1 for male, 2 for female)
DEAR, DEYE, and DREM relate to certain disability statuses.
Let’s consider the following task: we are going to
Train a machine learning algorithm to predict whether someone is currently employed, based on their other attributes not including race, and
Perform a bias audit of our algorithm to determine whether it displays racial bias.
First, let’s subset the features we want to use:
features_to_use = [f for f in possible_features if f not in ["ESR", "RAC1P"]]
Now we can construct a BasicProblem that expresses our wish to use these features to predict employment status ESR, using race (RAC1P) as the group label. I recommend you mostly don't touch the target_transform, preprocess, and postprocess arguments.
You can find examples of constructing problems in the folktables source code if you really want to carefully customize your problem.
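A minimal sketch of such a problem, assuming the features_to_use list above and the standard folktables defaults for the transform arguments:

```python
EmploymentProblem = BasicProblem(
    features=features_to_use,
    target='ESR',
    target_transform=lambda x: x == 1,   # label is True when the person is employed
    group='RAC1P',
    preprocess=lambda x: x,
    postprocess=lambda x: np.nan_to_num(x, -1),
)

# convert the data frame into feature, label, and group arrays
features, label, group = EmploymentProblem.df_to_numpy(acs_data)
```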
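The results below were produced by fitting a scikit-learn classifier on a train/test split of these arrays. Here is a sketch of one way to create the model, X_test, y_test, and group_test objects used in the rest of this section (the specific classifier and settings are an assumption, not necessarily the ones that produced the numbers below):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    features, label, group, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(max_depth=8)
model.fit(X_train, y_train)
```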
We can then extract predictions on the test set like this:
y_hat = model.predict(X_test)
The overall accuracy in predicting whether someone is employed is:
(y_hat == y_test).mean()
np.float64(0.7863608931804466)
The accuracy for white individuals is
(y_hat == y_test)[group_test == 1].mean()
np.float64(0.7875706214689265)
The accuracy for Black individuals is
(y_hat == y_test)[group_test == 2].mean()
np.float64(0.7777164920022063)
We can also calculate confusion matrices, false positive rates, false negative rates, positive predictive values, prevalences, and lots of other information using tools we’ve already seen.
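For example, a sketch of the overall confusion matrix and the rates derived from it, using the y_test and y_hat arrays above:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()

fpr = fp / (fp + tn)                     # false positive rate
fnr = fn / (fn + tp)                     # false negative rate
ppv = tp / (tp + fp)                     # positive predictive value
prevalence = (tp + fn) / (tn + fp + fn + tp)
```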
2 What You Should Do
Choose Your Problem
Choose a prediction problem (target variable), a list of features, and a choice of group with respect to which to evaluate bias. I would suggest one of the following two possibilities:
(What we just illustrated): predict employment status on the basis of demographics excluding race, and audit for racial bias.
Predict whether income is over $50K on the basis of demographics excluding sex, and audit for gender bias.
You can also pick the state from which you would like to pull your data.
Do not audit for racial bias in VT, as not enough Black individuals filled out the PUMS survey there. 😬
Finally, you should choose a machine learning model. While you can use a model like logistic regression that you've previously implemented, my suggestion is to experiment with an off-the-shelf model from scikit-learn. Some simple classifiers with good performance are LogisticRegression, SVC, DecisionTreeClassifier, and RandomForestClassifier.
Use simple descriptive analysis to address the following questions. You’ll likely find it easiest to address these problems when working with a data frame. Here’s some code to turn your training data back into a data frame for easy analysis:
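(A sketch, assuming the X_train, y_train, and group_train arrays and the features_to_use list from earlier; adjust the names to match your own code.)

```python
import pandas as pd

# reassemble the training split into a data frame for descriptive analysis
df = pd.DataFrame(X_train, columns=features_to_use)
df["group"] = group_train
df["label"] = y_train
```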
Using this data frame, answer the following questions:
How many individuals are in the data?
Of these individuals, what proportion have target label equal to 1? In employment prediction, these would correspond to employed individuals.
Of these individuals, how many are in each of the groups?
In each group, what proportion of individuals have target label equal to 1?
Check for intersectional trends by studying the proportion of positive target labels broken out by your chosen group labels and an additional group label. For example, if you chose race (RAC1P) as your group, then you could also choose sex (SEX) and compute the proportion of positive labels by both race and sex. This might be a good opportunity to use a visualization such as a bar chart, e.g. via the seaborn package (see the sketch after this list).
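A minimal sketch of the intersectional computation and plot, assuming the df constructed above, race as the group column, and SEX as the second label:

```python
import seaborn as sns

# proportion of positive labels by (group, sex)
summary = df.groupby(["group", "SEX"])["label"].mean().reset_index()
sns.barplot(data=summary, x="group", y="label", hue="SEX")
```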
Train Your Model
Train your model on the training data. Please incorporate a tunable measure of model complexity and use cross-validation to select a good value for it. Some possibilities:
Use polynomial features with LogisticRegression.
Tune the regularization parameter C in SVC.
Tune the max_depth in DecisionTreeClassifier and RandomForestClassifier (see the cross-validation sketch after this list).
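For instance, a sketch of selecting max_depth for a decision tree by 5-fold cross-validation, assuming the X_train and y_train arrays from earlier:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

best_depth, best_score = None, -np.inf
for depth in range(2, 15):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth),
                             X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_depth, best_score = depth, scores.mean()

# refit on the full training set with the selected depth
model = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
```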
Audit Your Model
Then, perform an audit in which you address the following questions (all on test data):
Overall Measures
What is the overall accuracy of your model?
What is the positive predictive value (PPV) of your model?
What are the overall false negative and false positive rates (FNR and FPR) for your model?
By-Group Measures
What is the accuracy of your model on each subgroup?
What is the PPV of your model on each subgroup?
What are the FNR and FPR on each subgroup?
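A sketch of the by-group computations, assuming the y_test, y_hat, and group_test arrays from the illustration above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

for g in np.unique(group_test):
    mask = group_test == g
    tn, fp, fn, tp = confusion_matrix(y_test[mask], y_hat[mask]).ravel()
    acc = (tp + tn) / (tn + fp + fn + tp)
    ppv = tp / (tp + fp)
    fnr = fn / (fn + tp)
    fpr = fp / (fp + tn)
    print(f"group {g}: accuracy={acc:.3f}, PPV={ppv:.3f}, FNR={fnr:.3f}, FPR={fpr:.3f}")
```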
Bias Measures
See Chouldechova (2017) for definitions of these terms. For calibration, you can think of the score as having only two values, 0 and 1.
Is your model approximately calibrated?
Does your model satisfy approximate error rate balance?
Does your model satisfy statistical parity?
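A sketch of simple checks for calibration and statistical parity, again assuming the test-set arrays above and treating the hard prediction as a two-valued score (error rate balance can be read off the per-group FNR and FPR computed in the previous sketch):

```python
import numpy as np

for g in np.unique(group_test):
    in_g = group_test == g
    # calibration: P(y = 1 | score = s, group = g) for s in {0, 1}
    for s in [0, 1]:
        mask = in_g & (y_hat == s)
        print(f"group {g}, score {s}: P(y=1) = {y_test[mask].mean():.3f}")
    # statistical parity: proportion of positive predictions in each group
    print(f"group {g}: P(y_hat=1) = {y_hat[in_g].mean():.3f}")
```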
Feasible FNR and FPR Rates
How fair could your model be, as measured by the FNR and FPR for each group? Please reproduce Figure 5 in Chouldechova (2017) (link) for your chosen data set and task. This figure uses Eq. (2.6) in the paper, fixing the prevalence (proportion of true positive labels) \(p\) for each group. In this visualization, the PPV for all groups is set equal to the lowest PPV across groups; this corresponds to “calibrating” the model. For example, if the PPV for group 1 is 0.8 and the PPV for group 2 is 0.6, then set the PPV for group 1 equal to 0.6 for the purposes of drawing the line. Eq. (2.6) then defines a line of feasible FNR and FPR combinations for each group. It is encouraged but not necessary to reproduce the shaded regions.
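The relationship in Eq. (2.6) can be written as \(\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})\). A minimal sketch of the figure, assuming you have already stored per-group prevalences in a hypothetical dict prevalences = {group: p} and the common (lowest) PPV in a variable ppv:

```python
import numpy as np
import matplotlib.pyplot as plt

fnr_grid = np.linspace(0, 1, 101)
for g, p in prevalences.items():   # prevalences is a hypothetical {group: prevalence} dict
    fpr_line = (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr_grid)  # Eq. (2.6)
    plt.plot(fnr_grid, fpr_line, label=f"group {g}")

plt.xlabel("False negative rate")
plt.ylabel("False positive rate")
plt.legend()
plt.show()
```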
Using your plot, please address the following question: if we desired to tune our classifier threshold so that the false positive rates were equal between groups, how much would we need to change the false negative rate? You may wish to consult Chouldechova’s discussion of Fig. 5 to help you interpret your figure.
Concluding Discussion
In a few paragraphs, discuss the following questions:
What groups of people could stand to benefit from a system that is able to predict the label you predicted, such as income or employment status? For example, what kinds of companies might want to buy your model for commercial use?
Based on your bias audit, what could be the impact of deploying your model for large-scale prediction in commercial or governmental settings?
Based on your bias audit, do you feel that your model displays problematic bias? What kind (e.g. calibration, error rate balance)?
Beyond bias, are there other potential problems associated with deploying your model that make you uncomfortable? How would you propose addressing some of these problems?
Add An Abstract
Before submitting, add a brief abstract paragraph to the very beginning of your blog post that summarizes your method and findings.
Chouldechova, Alexandra. 2017. “Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments.” Big Data 5 (2): 153–63. https://doi.org/10.1089/big.2016.0047.
Ding, Frances, Moritz Hardt, John Miller, and Ludwig Schmidt. 2021. “Retiring Adult: New Datasets for Fair Machine Learning.” In Advances in Neural Information Processing Systems, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, 34:6478–90. Curran Associates, Inc.