2  Data Frames in Python

Getting started with Pandas Data Frames

Open the live notebook in Google Colab or download the live notebook.

Data wrangling is a broad term for transforming data from one form into another. It is sometimes synonymous with “data cleaning”, i.e., preparing “messy” raw data for further analysis, but data wrangling can also refer to the analysis itself (including data transformation, visualization, and aggregation). Today we will focus on the data structures and techniques used to store, transform and aggregate tabular data. These tools build on the vectorized approaches, and specifically the NumPy library, that we just learned about.

Pandas

The pandas Python package offers a rich set of powerful tools for working with tabular data. The name “pandas” is originally derived from the common term “panel data.” While different users pronounce the package name in different ways, for cuteness reasons we prefer to pronounce it like the plural of “panda [bears]”.

Let’s get started with our standard imports. Note the new addition of import pandas as pd.

import numpy as np
# pd is the prefix convention for pandas
import pandas as pd

print(pd.__version__)
2.2.3

The two key data structures offered by Pandas are Series, a 1-D array, and DataFrame, a 2-D table with rows and columns.
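As a minimal illustration with made-up toy values (not the penguins data we load below):

```python
import pandas as pd

# A Series is a 1-D labeled array
s = pd.Series([181.0, 195.0, 210.0], name="Flipper Length (mm)")

# A DataFrame is a 2-D table of rows and columns; each column is itself a Series
df = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "flipper_mm": [181.0, 195.0, 210.0],
})

print(s.ndim, df.ndim)         # 1 2
print(type(df["flipper_mm"]))  # selecting a single column yields a Series
```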

What is a DataFrame?

If you have used Microsoft Excel, Google Sheets, or other similar spreadsheet software, you have effectively used a Pandas DataFrame (which is similar to R’s data frame).

What are the key properties of a spreadsheet table?

  • 2-D table (i.e. rows and columns) with rows typically containing different observations/entities and columns representing different variables
  • Columns (and sometimes rows as well) have names (i.e. “labels”)
  • Columns can have different types, e.g. one column may be a string, another a float, but any given column is typically of uniform type
  • Most computations are performed on a subset of columns, e.g., summing all quiz scores for all students, applied to all the data in those columns (i.e., “for each” row)

The pandas DataFrame type is designed for this kind of data.

There are a lot of different ways to conceptually think about DataFrames, including (but not limited to) as a spreadsheet, a database table, or a list/dictionary of 1-D Series (the columns) with the column label as the key. The last is closest to the underlying implementation and a common way to create DataFrames.
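For example, building a small DataFrame from a dictionary of columns (toy values, illustrative only):

```python
import pandas as pd

# A dictionary mapping column labels to 1-D columns (lists or Series)
data = {
    "Species": ["Adelie", "Gentoo", "Chinstrap"],
    "Body Mass (g)": [3750.0, 5200.0, 3800.0],
}
df = pd.DataFrame(data)

print(df.shape)          # (3, 2)
print(list(df.columns))  # ['Species', 'Body Mass (g)']
```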

Let’s get started by loading some example data with physiological measurements and species labels for several populations of Adelie, Chinstrap, and Gentoo penguins (the “Palmer Penguins” data).

The Palmer Penguins data was originally collected by Gorman, Williams, and Fraser (2014) and was nicely packaged and released for use in the data science community by Horst, Hill, and Gorman (2020).
url = "https://raw.githubusercontent.com/middcs/data-science-notes/main/data/palmer-penguins/palmer-penguins.csv"

# df is a common variable name for data frames
df = pd.read_csv(url)
type(df)
pandas.core.frame.DataFrame

Checking out the resulting data, we see a lot happened in that one function call! Pandas downloaded the data from the provided URL, parsed the CSV file and constructed a DataFrame of size (344, 17) (rows × columns).

We can access the size with df.shape.
df
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 PAL0910 120 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N38A2 No 12/1/09 NaN NaN NaN NaN NaN NaN NaN NaN
340 PAL0910 121 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N39A1 Yes 11/22/09 46.8 14.3 215.0 4850.0 FEMALE 8.41151 -26.13832 NaN
341 PAL0910 122 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N39A2 Yes 11/22/09 50.4 15.7 222.0 5750.0 MALE 8.30166 -26.04117 NaN
342 PAL0910 123 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N43A1 Yes 11/22/09 45.2 14.8 212.0 5200.0 FEMALE 8.24246 -26.11969 NaN
343 PAL0910 124 Gentoo penguin (Pygoscelis papua) Anvers Biscoe Adult, 1 Egg Stage N43A2 Yes 11/22/09 49.9 16.1 213.0 5400.0 MALE 8.36390 -26.15531 NaN

344 rows × 17 columns

We noted above that DataFrame columns can have different types. We can query the types for each column with the dtypes attribute. The object type is the default when Pandas can’t infer a numeric type. This is very common when the columns contain strings. You can give string columns a dedicated data type, but we won’t focus on that here.

df.dtypes
studyName               object
Sample Number            int64
Species                 object
Region                  object
Island                  object
Stage                   object
Individual ID           object
Clutch Completion       object
Date Egg                object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)          float64
Sex                     object
Delta 15 N (o/oo)      float64
Delta 13 C (o/oo)      float64
Comments                object
dtype: object

Some other commonly used methods for data inspection:

A method is a function called on a specific instance of an object. In Python we call methods and access instance attributes with the dot operator, .. The object instance to the left of the dot is the receiver, the object on which we want to call the method. You can think of calling a method as calling a function that takes the receiver as the first argument, i.e., df.head() is equivalent to pd.DataFrame.head(df).
  • df.head(): Display first 5 rows
  • df.info(): Display column names and types, along with non-null counts
  • df.describe(): Generate statistical summary of data
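The receiver/first-argument equivalence noted in the aside above can be checked directly on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

via_method = df.head()                # method call on the receiver df
via_function = pd.DataFrame.head(df)  # equivalent function call on the class

# Both produce the same first 5 rows
assert via_method.equals(via_function)
print(len(via_method))  # 5
```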

The key parts of the DataFrame:

  • The index. The index is used to refer to rows, i.e., serves as “row labels”. It is the 0, 1, 2, … at the far left above.
  • The column names. These name or label the specific columns, i.e., are “column labels”, and are typically used to select data during computation.
  • The data itself. You can think of the data as a set of different 1-D arrays or Series, one for each column. Each array has the same length. Many of the operations on these arrays will be familiar from NumPy.
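These three parts are exposed as attributes, shown here on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.index)            # row labels: RangeIndex(start=0, stop=3, step=1)
print(df.columns)          # column labels
print(df.shape)            # (rows, columns): (3, 2)
print(df["b"].to_numpy())  # a column's data as the underlying NumPy array
```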

Selecting data

We could anticipate wanting to select data using both sets of labels, i.e., by column and by row. Recall we described a DataFrame as a dictionary of columns, and we can indeed use the indexing operator [] to select columns by name.

# Select a single column

df["Region"].head()
0    Anvers
1    Anvers
2    Anvers
3    Anvers
4    Anvers
Name: Region, dtype: object
# Select and optionally reorder columns with a list

df[["Species", "Region", "Island", "Culmen Length (mm)"]].head()
Species Region Island Culmen Length (mm)
0 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen 39.1
1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen 39.5
2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen 40.3
3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen NaN
4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen 36.7

To access rows (and columns) via their labels we use the loc attribute. It exposes flexible indexing capabilities, e.g.

# Select rows by label (using the index)

df.loc[0:3].head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.

Here we select the rows with labels 0 through 3 inclusive (using Python’s slicing operator). While it may look like we are selecting rows at positions [0, 3], that is not the case. We are selecting based on the label, not the position. Consider the following example:


df.loc[3:6].loc[0:3].head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.

Notice we only get a single row, with label 3. The first operation retrieves rows with labels 3, 4, 5, 6, and the latter selects rows with labels 0, 1, 2, 3; only the row labeled 3 is in both selections.

To select rows by position, e.g., as you might select values in a list, we use the iloc attribute. We can again use slices, but the ranges are positional and exclusive, i.e., the end index is excluded. Notice here we get 3 rows, with labels 3, 4, and 5. The first operation selects the rows at positions 3, 4, and 5 (but not 6), which then sit at positions 0, 1, and 2 in the resulting DataFrame. The second selection returns those rows. Note each row keeps its label throughout the various selection operations.


df.iloc[3:6].iloc[0:3].head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
5 PAL0708 6 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A2 Yes 11/16/07 39.3 20.6 190.0 3650.0 MALE 8.66496 -25.29805 NaN
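The label-vs-position distinction is easiest to see with a toy DataFrame whose index is deliberately reversed (illustrative only):

```python
import pandas as pd

# Labels 3, 2, 1, 0 sit at positions 0, 1, 2, 3
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[3, 2, 1, 0])

print(df.loc[0, "x"])   # 40: the row *labeled* 0 (the last row)
print(df.iloc[0]["x"])  # 10: the row at *position* 0 (the first row)

# loc slices by label in index order (inclusive); iloc by position (end exclusive)
print(df.loc[3:2]["x"].tolist())   # [10, 20]
print(df.iloc[0:2]["x"].tolist())  # [10, 20]
```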

Alongside label and position, we can also select data via boolean masks (i.e., data-driven filtering). The following filters the data to just those rows where the condition is True. To do so, we compute an array (vector) of booleans using a vectorized comparison and then select the rows where it is True. Any boolean vector can be used; that is, the computation used to generate the boolean index can be arbitrarily complex, incorporating different element-wise comparisons combined with the bit-wise & (and), | (or), and ~ (not) operators.

# Since slicing and filtering rows is common, slices or boolean arrays inside [] apply to rows, i.e., this
# is shorthand for df.loc[df["Culmen Length (mm)"] < 40].head()

df[df["Culmen Length (mm)"] < 40].head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
5 PAL0708 6 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A2 Yes 11/16/07 39.3 20.6 190.0 3650.0 MALE 8.66496 -25.29805 NaN
6 PAL0708 7 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N4A1 No 11/15/07 38.9 17.8 181.0 3625.0 FEMALE 9.18718 -25.21799 Nest never observed with full clutch.
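A sketch of combining conditions on toy data; note the parentheses, which are required because & and | bind more tightly than the comparison operators:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "mass_g": [3750.0, 3250.0, 5200.0, 3800.0],
})

# Combine element-wise comparisons with & (and), | (or), ~ (not)
heavy_adelie = df[(df["species"] == "Adelie") & (df["mass_g"] > 3500)]
not_gentoo = df[~(df["species"] == "Gentoo")]

print(len(heavy_adelie), len(not_gentoo))  # 1 3
```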

Mutation

The process of creating and modifying columns is often described as “mutation”. Just as we create new or update existing key-value pairs in a Python dictionary with assignment, we can do the same with columns in a Pandas DataFrame. For example, the following adds a column with the culmen length in centimeters (you may need to scroll to the right to see it). The division is an element-wise “vectorized” operation like those we saw with NumPy (and in many cases Pandas is actually storing the column as a NumPy array).

df["Culmen Length (cm)"] = df["Culmen Length (mm)"] / 10
df.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments Culmen Length (cm)
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes. 3.91
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN 3.95
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN 4.03
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled. NaN
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN 3.67
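Mutation can also combine multiple existing columns, again element-wise. A toy sketch (the aspect_ratio column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "length_mm": [39.1, 40.3, 36.7],
    "depth_mm": [18.7, 18.0, 19.3],
})

# Assigning to a new label adds a column; vectorized division combines columns
df["aspect_ratio"] = df["length_mm"] / df["depth_mm"]
# Assigning to an existing label overwrites that column
df["depth_mm"] = df["depth_mm"] / 10

print(df.columns.tolist())  # ['length_mm', 'depth_mm', 'aspect_ratio']
```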

Vectorized operations on strings

We can perform similar vectorized operations on the string columns, but need to expose them via the str attribute, e.g.,

df["Species"].str.upper()
0      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
1      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
2      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
3      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
4      ADELIE PENGUIN (PYGOSCELIS ADELIAE)
                      ...                 
339      GENTOO PENGUIN (PYGOSCELIS PAPUA)
340      GENTOO PENGUIN (PYGOSCELIS PAPUA)
341      GENTOO PENGUIN (PYGOSCELIS PAPUA)
342      GENTOO PENGUIN (PYGOSCELIS PAPUA)
343      GENTOO PENGUIN (PYGOSCELIS PAPUA)
Name: Species, Length: 344, dtype: object

For example, to improve readability we might want just the first part of the species name, i.e., “Adelie” instead of “Adelie Penguin (Pygoscelis adeliae)”. This is the first part of the name up to the space. We can get it by splitting the string by whitespace and selecting the value at index 0 in the resulting list. Note that we can chain those operations, but need a str before each one to apply the resulting operation element-wise.

df["Species"] = df["Species"].str.split().str.get(0)
df.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments Culmen Length (cm)
0 PAL0708 1 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes. 3.91
1 PAL0708 2 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN 3.95
2 PAL0708 3 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN 4.03
3 PAL0708 4 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled. NaN
4 PAL0708 5 Adelie Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN 3.67
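Other element-wise string operations follow the same .str pattern (toy Series for illustration):

```python
import pandas as pd

s = pd.Series(["Adelie Penguin", "Gentoo penguin", "Chinstrap penguin"])

print(s.str.len())                # per-element character counts
print(s.str.contains("Gentoo"))   # boolean mask, usable for row filtering
print(s.str.lower().str.title())  # chained: each step needs its own .str
```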

Split-Apply-Combine

One of the fundamental patterns in tabular data analysis is to perform computations by/across groups. In our penguins data, for example, a very natural thing to do is to compute summary statistics by species, or perhaps by habitat (or both!). We can contextualize this task in three stages:

  • Split the data frame into pieces, one for each group (e.g., species).
  • Apply an aggregation function to each piece, yielding a single number.
  • Combine the results into a new data frame.

This pattern is so common that the phrase “split-apply-combine” now appears in many texts on data analysis. This phrase was originally coined by Hadley Wickham, who is famous for developing many of the modern tools for data analysis in the R programming language.

Schematic of split-apply-combine. Image credit: Vanderplas (2016)

Pandas implements this pattern with the groupby method. For example, we can compute the mean of the numeric columns as a function of species using

df.groupby("Species").mean(numeric_only=True)
Sample Number Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Delta 15 N (o/oo) Delta 13 C (o/oo) Culmen Length (cm)
Species
Adelie 76.5 38.791391 18.346358 189.953642 3700.662252 8.859733 -25.804194 3.879139
Chinstrap 34.5 48.833824 18.420588 195.823529 3733.088235 9.356155 -24.546542 4.883382
Gentoo 62.5 47.504878 14.982114 217.186992 5076.016260 8.245338 -26.185298 4.750488

Compare these values to the overall means. We now readily observe that Adelie penguins have much shorter bills (“Culmen Length (mm)”) than Chinstrap and Gentoo penguins.

df.mean(numeric_only=True)
Sample Number            63.151163
Culmen Length (mm)       43.921930
Culmen Depth (mm)        17.151170
Flipper Length (mm)     200.915205
Body Mass (g)          4201.754386
Delta 15 N (o/oo)         8.733382
Delta 13 C (o/oo)       -25.686292
Culmen Length (cm)        4.392193
dtype: float64

Conceptually, Pandas collected or “grouped” together all the data into subsets of rows with the same value for the specified variable(s), then applied the aggregation function to each subset. We can think of the groupby method as returning a special “view” of the DataFrame, such that any aggregation functions used will be applied to each of the individual “groups” (i.e., species).
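The “split” stage is observable directly: iterating over the GroupBy yields (key, sub-DataFrame) pairs (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"],
    "mass_g": [3700.0, 5000.0, 3800.0, 5200.0],
})

# Split: each group is its own sub-DataFrame
for species, group in df.groupby("species"):
    print(species, len(group))

# Apply + combine: aggregate each piece, then assemble the results
print(df.groupby("species")["mass_g"].mean())
```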

Split-Apply-Combine enables sophisticated aggregations with very little code. Consider computing multiple statistics for all combinations of Species and Island. Note that we use a different approach to efficiently perform different aggregation operations on the same subsets (instead of repeating the grouping operation). The aggregate method is a general approach to applying multiple functions to each group. It does not have a numeric_only argument, so we subset the groups to just the column(s) of interest.


summary = df.groupby(["Species", "Island"])[["Culmen Length (mm)","Culmen Depth (mm)"]].aggregate(["mean", "std"])
summary
Culmen Length (mm) Culmen Depth (mm)
mean std mean std
Species Island
Adelie Biscoe 38.975000 2.480916 18.370455 1.188820
Dream 38.501786 2.465359 18.251786 1.133617
Torgersen 38.950980 3.025318 18.429412 1.339447
Chinstrap Dream 48.833824 3.339256 18.420588 1.135395
Gentoo Biscoe 47.504878 3.081857 14.982114 0.981220

If there isn’t a built-in function for the particular aggregation you want to perform, you can provide your own function to aggregate, or, even more flexibly, use the apply method. .apply works for aggregating, filtering, or transforming groups. It can be a helpful tool for performing multiple operations on the same group. For example, here we supply an anonymous function (a lambda function) to both compute the mean culmen length and convert it to centimeters.

# The Python lambda keyword creates an anonymous function with the specified arguments and body (here a single argument `g`). `apply`
# calls that function for each group, passing the group `DataFrame` as the argument.
df.groupby(["Species", "Island"]).apply(lambda g: g["Culmen Length (mm)"].mean() / 10, include_groups=False)
Species    Island   
Adelie     Biscoe       3.897500
           Dream        3.850179
           Torgersen    3.895098
Chinstrap  Dream        4.883382
Gentoo     Biscoe       4.750488
dtype: float64

Hierarchical indexing

Complex data summary tables like the one above are useful and powerful, but it is also non-obvious how we can extract specific data of interest. For example, how can I get the mean culmen (bill) length for Chinstrap penguins on Dream island? To extract this kind of data, we leverage Pandas’ hierarchical indices, in which we pass multiple keys to the .loc attribute as a tuple.

# Provide tuples for row and column index

summary.loc[("Chinstrap", "Dream"),("Culmen Length (mm)", "mean")]
np.float64(48.83382352941176)

More generally, we can think of each value provided for an index as dropping us one level down the hierarchy. For example, the following returns a DataFrame with row indices for just island and column indices for just the aggregates of the “Culmen Length (mm)” column.

# Tuple literals are created with parentheses, but parentheses with a single element (no comma) are interpreted
# as a grouping (in the PEMDAS sense), not a tuple. To create a single element tuple we add a trailing comma.

summary.loc[("Chinstrap",),("Culmen Length (mm)",)]
mean std
Island
Dream 48.833824 3.339256
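The same pattern on a self-contained toy summary (made-up values), showing both a full tuple and a partial tuple:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Biscoe", "Dream", "Biscoe", "Biscoe"],
    "mass_g": [3700.0, 3800.0, 5000.0, 5200.0],
})
summary = df.groupby(["species", "island"])[["mass_g"]].aggregate(["mean", "std"])

# A full (row, column) pair of tuples addresses a single cell
print(summary.loc[("Gentoo", "Biscoe"), ("mass_g", "mean")])  # 5100.0

# A partial (one-element) tuple drops one level: rows indexed by island only
print(summary.loc[("Adelie",)])
```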

Apply split-apply-combine

The following is birth weight data (in ounces) for a cohort of babies, along with other data about the baby and mother, including whether the mother smoked during pregnancy (Adhikari, DeNero, and Wagner 2022). Our hypothesis is that birth weight is negatively associated with whether the mother smoked during pregnancy (i.e., that babies born to mothers who smoked will be smaller on average).

baby = pd.read_csv("https://www.inferentialthinking.com/data/baby.csv")
baby
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
0 120 284 27 62 100 False
1 113 282 33 64 135 False
2 128 279 28 64 115 True
3 108 282 23 67 125 True
4 136 286 25 62 93 False
... ... ... ... ... ... ...
1169 113 275 27 60 100 False
1170 128 265 24 67 120 False
1171 130 291 30 65 150 True
1172 125 281 21 65 110 False
1173 117 297 38 65 129 False

1174 rows × 6 columns

Here we see a sample of the data (1174 observations total). We are most interested in the “Birth Weight” and “Maternal Smoker” columns. We can use the split-apply-combine technique, via the groupby method, to quickly figure out how many smokers and non-smokers are in the data. That helps us answer the question “Do we have enough data to do a meaningful analysis of that variable?” Fortunately it appears we do!

baby.groupby("Maternal Smoker").size()
Maternal Smoker
False    715
True     459
dtype: int64
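A minimal sketch of per-group counting on made-up data; value_counts is an equivalent shortcut for a single column:

```python
import pandas as pd

df = pd.DataFrame({"smoker": [False, True, False, False, True]})

print(df.groupby("smoker").size())  # rows per group, indexed by group key
print(df["smoker"].value_counts())  # same counts, sorted by frequency
```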

Let’s check out the means for smokers and non-smokers. Again we can use groupby.


means = baby.groupby("Maternal Smoker")["Birth Weight"].mean()
means
Maternal Smoker
False    123.085315
True     113.819172
Name: Birth Weight, dtype: float64

Notice the difference of over 9 ounces in average birth weight! That is suggestive of a significant difference. Specifically:

# Compute the difference from the previous value. We expect the first to be NaN (not a number)
# since there is no previous value.
means.diff()
Maternal Smoker
False         NaN
True    -9.266143
Name: Birth Weight, dtype: float64

In contrast, the other attributes seem similar between the two groups.


baby.groupby("Maternal Smoker").mean()
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight
Maternal Smoker
False 123.085315 279.874126 27.544056 64.013986 129.47972
True 113.819172 277.897603 26.736383 64.104575 126.91939

However, is that difference in average birth weight actually meaningful or did it just arise by chance? That is, what if in reality there is no difference between the two groups and what we are observing is just an artifact of how these particular mothers/babies were selected (it just happened that this set of non-smoking mothers had larger babies)? We will describe that latter expectation, that there is no difference, as the null hypothesis.

We would like to determine how likely it is that the null hypothesis is true. To do so we can simulate the null hypothesis by randomly shuffling the “Maternal Smoker” column, i.e., randomly assigning each row to be a smoker or non-smoker to simulate a different random sample of patients, and then recomputing the difference in mean birth weight. If the null hypothesis is true, the difference in means of the shuffled data will be similar to what we observe in the actual data.

Spoiler warning: We are computing a p-value!

Pandas and the other libraries we will use are very complex, and the interfaces are not always intuitive (understatement!). As a result, working with these libraries can be a good application for LLMs. We know what we want, and having learned about split-apply-combine, we generally know what the solution should include, but it’s not always clear which of the many hundreds or thousands of available functions would be relevant. As an experiment, let’s try asking an LLM to implement this code for us.

Before doing so let’s make sure we know what to expect. We need to:

  • Randomly shuffle the smoking label
  • Compute the difference in mean birth weight with groupby (or some type of grouping)
  • Do so many times to generate a distribution of birth weight differences

Let’s turn GenAI loose. After a little iteration with Claude Sonnet 4.5 and Copilot (GPT-5 mini) I settled on the following prompt.

Show me a permutation test in Python using Pandas groupby() and shuffling the “Maternal Smoker” column to test the difference in mean “Birth Weight”.

This is more prescriptive than might be desired. Many prompts produced lower-level implementations (i.e., using boolean indexing into column Series) than we wanted. Those approaches may be faster, but they are not very clear (and our ultimate goal is learning, and specifically here, learning to use Pandas).

A possible approach to computing a distribution of birth weight differences for permuted labels:

n_permutations = 5000
permuted_diffs = []

for i in range(n_permutations):
    shuffled_smoking = baby["Maternal Smoker"].sample(frac=1).reset_index(drop=True)  # (1)
    weight_diff = baby.groupby(shuffled_smoking)["Birth Weight"].mean().diff().iloc[-1]  # (2)
    permuted_diffs.append(weight_diff)

permuted_diffs = np.array(permuted_diffs)

(1) baby["Maternal Smoker"].sample(frac=1) produces a random permutation of the label column. We reset the index to the default integer index, i.e., to match the position.

(2) Group the original data by the permuted column, then compute the difference in the mean of the birth weight. Pandas uses the indices from shuffled_smoking to select the relevant rows from baby. If we hadn’t reset the indices, this would just re-compute the difference in the actual data (try it to convince yourself). By resetting the index we “break” the original relationship between smoking status and birth weight.
from matplotlib import pyplot as plt
import seaborn as sns
sns.histplot(permuted_diffs, bins=30)
plt.axvline(x=means.diff().iloc[-1], color="r", label="Actual difference")
plt.xlabel("Difference in mean birth weight between non-smoking and smoking mothers")
plt.show()

Distribution of birth weight differences for permuted data with actual difference shown with vertical line

That sure looks significant, and indeed we don’t observe any permuted examples with a larger effect. That would imply an empirical p-value, the probability of observing a more extreme test statistic assuming the null model, of \(p < 0.0002\).


(np.abs(permuted_diffs) >= abs(means.diff().iloc[-1])).sum()
np.int64(0)

A permutation test, which is what we did above, is a statistical hypothesis test! It has the nice property that we don’t need to make any assumptions about the distribution of the data. And we can implement it and interpret the results without a lot of prior statistics experience. Our goal today is to learn how to work with Pandas, not necessarily to learn about different forms of hypothesis testing. But this is not a “toy” example. The computational tools in our toolbox are already sufficient to make rigorous inferences about data!

Adhikari, An, John DeNero, and David Wagner. 2022. Computational and Inferential Thinking: The Foundations of Data Science. Second. https://inferentialthinking.com.
Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” Edited by André Chiaradia. PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.
Horst, Allison M, Alison Presmanes Hill, and Kristen B Gorman. 2020. “Allisonhorst/Palmerpenguins: V0.1.0.” Zenodo. https://doi.org/10.5281/ZENODO.3960218.
Vanderplas, Jacob T. 2016. Python Data Science Handbook: Essential Tools for Working with Data. First edition. Sebastopol, CA: O’Reilly Media, Inc.