import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Tidy data with pandas
I have been doing some recap on basic data processing and cleaning in Python with pandas. This post is mostly a self-reminder on how to deal with messy data in Python, reproducing examples presented in Hadley Wickham’s Tidy Data paper (Wickham 2014).
Introduction
The most significant contribution of this well-known work is that it gave a clear definition of what “tidy” means for a dataset. There are 3 main requirements, as illustrated on tidyr’s website (evolving from what Hadley originally proposed):
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Messy data are, by extension, datasets in violation of these 3 rules. The author then described the five most common problems with messy datasets:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
In this post I will focus on the first 3 symptoms, since the other two violations often occur when working with databases. All datasets come from Hadley’s repo containing materials for the paper and Daniel Chen’s 2019 SciPy tutorial on data processing with pandas.
Column headers are values, not variable names
pew = pd.read_csv("https://github.com/qiushiyan/blog-data/blob/main/pew.csv?raw=true")
pew.head()
religion <$10k $10-20k ... $100-150k >150k Don't know/refused
0 Agnostic 27 34 ... 109 84 96
1 Atheist 12 27 ... 59 74 76
2 Buddhist 27 21 ... 39 53 54
3 Catholic 418 617 ... 792 633 1489
4 Don’t know/refused 15 14 ... 17 18 116
[5 rows x 11 columns]
The pew dataset explores the relationship between income and religion in the US, produced by the Pew Research Center. To tidy it, we think about its “right” form for answering data analysis questions. Say we want to know whether a person’s income is influenced by their religion, or the other way around. It should be obvious that we need to derive from pew a column indicating the level of a person’s income, and another column counting each combination of income and religion. pew is messy in the sense that all column names besides religion, from <$10k to Don't know/refused, should be different levels (values) of a column storing information about income.
The code to do this is fairly easy in pandas: the .melt method is very similar to tidyr::pivot_longer and its namesake in the retired reshape2 package.
tidy_pew = pew.melt(id_vars = "religion", var_name = "income", value_name = "count")
tidy_pew.head(20)
religion income count
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don’t know/refused <$10k 15
5 Evangelical Prot <$10k 575
6 Hindu <$10k 1
7 Historically Black Prot <$10k 228
8 Jehovah's Witness <$10k 20
9 Jewish <$10k 19
10 Mainline Prot <$10k 289
11 Mormon <$10k 29
12 Muslim <$10k 6
13 Orthodox <$10k 13
14 Other Christian <$10k 9
15 Other Faiths <$10k 20
16 Other World Religions <$10k 5
17 Unaffiliated <$10k 217
18 Agnostic $10-20k 34
19 Atheist $10-20k 27
Now let’s calculate the \(\chi^2\) statistic and the corresponding p-value in R:
library(infer)
chisq_stat <- py$tidy_pew |>
tidyr::uncount(count) |>
specify(religion ~ income) |>
hypothesize(null = "independence") |>
calculate(stat = "Chisq")
py$tidy_pew |>
tidyr::uncount(count) |>
specify(religion ~ income) |>
hypothesize(null = "independence") |>
visualize(method = "theoretical") +
shade_p_value(obs_stat = chisq_stat, direction = "right")
This shows a strong relationship between income and religion.
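For completeness, the same statistic can also be computed without leaving Python. Here is a minimal sketch, assuming scipy is installed, that rebuilds the religion-by-income contingency table from the tidy counts:
from scipy import stats

# rebuild the religion x income contingency table from the tidy counts
observed = tidy_pew.pivot_table(index = "religion", columns = "income",
                                values = "count", aggfunc = "sum")
chi2, p, dof, expected = stats.chi2_contingency(observed)
chi2, p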
Another common use of this wide data format is to record regularly spaced observations over time, as illustrated by the billboard dataset. The rank of a specific track in each week after it enters the Billboard top 100 is recorded in 76 columns, wk1 to wk76.
= pd.read_csv("https://github.com/qiushiyan/blog-data/blob/main/billboard.csv?raw=true")
billboard billboard
year artist track ... wk74 wk75 wk76
0 2000 2 Pac Baby Don't Cry (Keep... ... NaN NaN NaN
1 2000 2Ge+her The Hardest Part Of ... ... NaN NaN NaN
2 2000 3 Doors Down Kryptonite ... NaN NaN NaN
3 2000 3 Doors Down Loser ... NaN NaN NaN
4 2000 504 Boyz Wobble Wobble ... NaN NaN NaN
.. ... ... ... ... ... ... ...
312 2000 Yankee Grey Another Nine Minutes ... NaN NaN NaN
313 2000 Yearwood, Trisha Real Live Woman ... NaN NaN NaN
314 2000 Ying Yang Twins Whistle While You Tw... ... NaN NaN NaN
315 2000 Zombie Nation Kernkraft 400 ... NaN NaN NaN
316 2000 matchbox twenty Bent ... NaN NaN NaN
[317 rows x 81 columns]
If we are to answer questions like “what is the average ranking of each artist across all weeks?”, the wk* columns need to be transformed into values.1
1 Although both pandas and dplyr (>= 1.0) support row-wise operations, we will suffer some performance penalty for not utilizing vectorization.
tidy_billboard = billboard.melt(id_vars = ["year", "artist", "track", "time", "date.entered"],
                                var_name = "week",
                                value_name = "rank")

tidy_billboard
year artist track ... date.entered week rank
0 2000 2 Pac Baby Don't Cry (Keep... ... 2000-02-26 wk1 87.0
1 2000 2Ge+her The Hardest Part Of ... ... 2000-09-02 wk1 91.0
2 2000 3 Doors Down Kryptonite ... 2000-04-08 wk1 81.0
3 2000 3 Doors Down Loser ... 2000-10-21 wk1 76.0
4 2000 504 Boyz Wobble Wobble ... 2000-04-15 wk1 57.0
... ... ... ... ... ... ... ...
24087 2000 Yankee Grey Another Nine Minutes ... 2000-04-29 wk76 NaN
24088 2000 Yearwood, Trisha Real Live Woman ... 2000-04-01 wk76 NaN
24089 2000 Ying Yang Twins Whistle While You Tw... ... 2000-03-18 wk76 NaN
24090 2000 Zombie Nation Kernkraft 400 ... 2000-09-02 wk76 NaN
24091 2000 matchbox twenty Bent ... 2000-04-29 wk76 NaN
[24092 rows x 7 columns]
Now we can compute the average ranking:
(tidy_billboard.
    groupby("artist")[["rank"]].
    mean().
    sort_values(by = "rank")
)
rank
artist
Santana 10.500000
Elliott, Missy "Misdemeanor" 14.333333
matchbox twenty 18.641026
N'Sync 18.648649
Janet 19.416667
... ...
Lil' Mo 98.142857
LL Cool J 98.500000
Zombie Nation 99.000000
Fragma 99.000000
Smith, Will 99.000000
[228 rows x 1 columns]
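As footnote 1 hints, this kind of summary could also be attempted row-wise on the wide format, but at a performance cost. A small illustrative sketch of my own (not from the paper), computing the average rank per track both ways:
wk_cols = [col for col in billboard.columns if col.startswith("wk")]

# row-wise: one Python-level function call per row, slow on large frames
billboard.apply(lambda row: row[wk_cols].mean(), axis = "columns")

# vectorized: a single call over the whole block of week columns
billboard[wk_cols].mean(axis = "columns")
The tidy groupby above stays fully vectorized while also making the grouping by artist explicit.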
Multiple variables stored in one column
The tb dataset comes from the World Health Organisation and records the counts of confirmed tuberculosis cases by country, year, and demographic group. The demographic groups are broken down by sex (m, f) and age (0–14, 15–24, 25–34, 35–44, 45–54, 55–64, 65+, unknown).
tb = pd.read_csv("https://github.com/qiushiyan/blog-data/blob/main/tb.csv?raw=true")
tb
country year m014 m1524 m2534 ... f3544 f4554 f5564 f65 fu
0 AD 2000 0.0 0.0 1.0 ... NaN NaN NaN NaN NaN
1 AE 2000 2.0 4.0 4.0 ... 3.0 0.0 0.0 4.0 NaN
2 AF 2000 52.0 228.0 183.0 ... 339.0 205.0 99.0 36.0 NaN
3 AG 2000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 NaN
4 AL 2000 2.0 19.0 21.0 ... 8.0 8.0 5.0 11.0 NaN
.. ... ... ... ... ... ... ... ... ... ... ..
196 YE 2000 110.0 789.0 689.0 ... 517.0 345.0 247.0 92.0 NaN
197 YU 2000 NaN NaN NaN ... NaN NaN NaN NaN NaN
198 ZA 2000 116.0 723.0 1999.0 ... 933.0 423.0 167.0 80.0 NaN
199 ZM 2000 349.0 2175.0 2610.0 ... 1305.0 186.0 112.0 75.0 NaN
200 ZW 2000 NaN NaN NaN ... NaN NaN NaN NaN NaN
[201 rows x 18 columns]
To clean this data, we first melt all columns except for country and year to get a longer version of tb, and then separate the variable column into two pieces of information, sex and age.
tb_long = tb.melt(id_vars = ["country", "year"])
sex = tb_long["variable"].str.split(pat = "(m|f)(.+)").str.get(1)
age = tb_long["variable"].str.split(pat = "(m|f)(.+)").str.get(2)

sex.head(), age.head()
(0 m
1 m
2 m
3 m
4 m
Name: variable, dtype: object, 0 014
1 014
2 014
3 014
4 014
Name: variable, dtype: object)
Add these two columns and drop the redundant variable column:
tidy_tb = (tb_long
    .assign(sex = sex, age = age)
    .drop("variable", axis = "columns"))

tidy_tb
country year value sex age
0 AD 2000 0.0 m 014
1 AE 2000 2.0 m 014
2 AF 2000 52.0 m 014
3 AG 2000 0.0 m 014
4 AL 2000 2.0 m 014
... ... ... ... .. ...
3211 YE 2000 NaN f u
3212 YU 2000 NaN f u
3213 ZA 2000 NaN f u
3214 ZM 2000 NaN f u
3215 ZW 2000 NaN f u
[3216 rows x 5 columns]
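As an aside, the two regex splits can be collapsed into a single str.extract call, which pulls both capture groups out at once. A minimal sketch (the sex_age and tidy_tb_alt names are mine):
# str.extract with named groups returns a two-column frame (sex, age)
sex_age = tb_long["variable"].str.extract(r"(?P<sex>m|f)(?P<age>.+)")
tidy_tb_alt = pd.concat([tb_long.drop(columns = "variable"), sex_age], axis = 1)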
Variables are stored in both rows and columns
The weather dataset contains daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico, covering five months in 2010.
weather = pd.read_csv("https://github.com/qiushiyan/blog-data/blob/main/weather.csv?raw=true")
weather
id year month element d1 ... d27 d28 d29 d30 d31
0 MX17004 2010 1 tmax NaN ... NaN NaN NaN 27.8 NaN
1 MX17004 2010 1 tmin NaN ... NaN NaN NaN 14.5 NaN
2 MX17004 2010 2 tmax NaN ... NaN NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN ... NaN NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN ... NaN NaN NaN NaN NaN
5 MX17004 2010 3 tmin NaN ... NaN NaN NaN NaN NaN
6 MX17004 2010 4 tmax NaN ... 36.3 NaN NaN NaN NaN
7 MX17004 2010 4 tmin NaN ... 16.7 NaN NaN NaN NaN
8 MX17004 2010 5 tmax NaN ... 33.2 NaN NaN NaN NaN
9 MX17004 2010 5 tmin NaN ... 18.2 NaN NaN NaN NaN
10 MX17004 2010 6 tmax NaN ... NaN NaN 30.1 NaN NaN
11 MX17004 2010 6 tmin NaN ... NaN NaN 18.0 NaN NaN
12 MX17004 2010 7 tmax NaN ... NaN NaN NaN NaN NaN
13 MX17004 2010 7 tmin NaN ... NaN NaN NaN NaN NaN
14 MX17004 2010 8 tmax NaN ... NaN NaN 28.0 NaN 25.4
15 MX17004 2010 8 tmin NaN ... NaN NaN 15.3 NaN 15.4
16 MX17004 2010 10 tmax NaN ... NaN 31.2 NaN NaN NaN
17 MX17004 2010 10 tmin NaN ... NaN 15.0 NaN NaN NaN
18 MX17004 2010 11 tmax NaN ... 27.7 NaN NaN NaN NaN
19 MX17004 2010 11 tmin NaN ... 14.2 NaN NaN NaN NaN
20 MX17004 2010 12 tmax 29.9 ... NaN NaN NaN NaN NaN
21 MX17004 2010 12 tmin 13.8 ... NaN NaN NaN NaN NaN
[22 rows x 35 columns]
There are two major problems with weather:
- d1, d2, …, d31 should be values instead of column names (solved by .melt).
- On the other hand, the values in the element column should be variable names; it should be spread into two columns named tmax and tmin (solved by .pivot_table).
(weather.
    melt(id_vars = ["id", "year", "month", "element"], var_name = "day", value_name = "temp").
    pivot_table(index = ["id", "year", "month", "day"],
                columns = "element",
                values = "temp").
    reset_index().
    head()
)
element id year month day tmax tmin
0 MX17004 2010 1 d30 27.8 14.5
1 MX17004 2010 2 d11 29.7 13.4
2 MX17004 2010 2 d2 27.3 14.4
3 MX17004 2010 2 d23 29.9 10.7
4 MX17004 2010 2 d3 24.1 14.4
Case study: mortality data from Mexico
After stating these common problems and their remedies, Hadley presented a case study on how a tidy dataset can facilitate data analysis. The case study uses individual-level mortality data from Mexico. The goal is to find causes of death with unusual temporal patterns at the hourly level. It’s time to move back from Python to R!
library(ggplot2)
library(dplyr)
library(tidyr)
deaths <- readr::read_csv("https://github.com/qiushiyan/blog-data/blob/main/mexico-deaths.csv?raw=true") |> na.omit()
deaths
# A tibble: 513,273 × 5
yod mod dod hod cod
<dbl> <dbl> <dbl> <dbl> <chr>
1 1920 11 17 3 W78
2 1923 2 4 16 J44
3 1923 6 23 19 E12
4 1926 2 5 16 C67
5 1926 4 1 16 J44
6 1928 10 30 19 I27
7 1929 4 23 15 I25
8 1930 9 11 19 E14
9 1930 12 22 19 E11
10 1931 5 26 11 K65
# … with 513,263 more rows
The columns are year, month, day, hour, and the specific cause of death, respectively. Another table, codes, explains what the acronyms in cod mean.
codes <- readr::read_csv("https://github.com/qiushiyan/blog-data/blob/main/codes.csv?raw=true")
codes
# A tibble: 1,851 × 2
cod disease
<chr> <chr>
1 A00 "Cholera"
2 A01 "Typhoid and paratyphoid\nfevers"
3 A02 "Other salmonella infections"
4 A03 "Shigellosis"
5 A04 "Other bacterial intestinal\ninfections"
6 A05 "Other bacterial foodborne\nintoxications, not elsewhere\nclassified"
7 A06 "Amebiasis"
8 A07 "Other protozoal intestinal\ndiseases"
9 A08 "Viral and other specified\nintestinal infections"
10 A09 "Diarrhea and gastroenteritis\nof infectious origin"
# … with 1,841 more rows
Thanks to the reticulate package, we can mix R and Python code seamlessly. Here is a line plot made with seaborn showing total deaths per hour:
deaths = r.deaths
deaths_per_hour = deaths["hod"].value_counts()
sns.lineplot(x = deaths_per_hour.index, y = deaths_per_hour.values)
plt.title("Temporal pattern of all causes of death")
plt.xlabel("Hour of the day")
plt.ylabel("Number of deaths")
plt.show()
To provide informative labels for causes, we next join the dataset to the codes table on the cod variable.
# A tibble: 513,273 × 6
yod mod dod hod cod cause
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1920 11 17 3 W78 "Inhalation of gastric\ncontents"
2 1923 2 4 16 J44 "Other chronic obstructive\npulmonary disease"
3 1923 6 23 19 E12 "Malnutrition-related diabetes\nmellitus"
4 1926 2 5 16 C67 "Malignant neoplasm of bladder"
5 1926 4 1 16 J44 "Other chronic obstructive\npulmonary disease"
6 1928 10 30 19 I27 "Other pulmonary heart\ndiseases"
7 1929 4 23 15 I25 "Chronic ischemic heart\ndisease"
8 1930 9 11 19 E14 "Unspecified diabetes mellitus"
9 1930 12 22 19 E11 "Non-insulin-dependent\ndiabetes mellitus"
10 1931 5 26 11 K65 "Peritonitis"
# … with 513,263 more rows
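The R code for this join is not shown above; a rough pandas equivalent, assuming reticulate’s automatic conversion of both tibbles and using a hypothetical deaths_labeled name, would be a left merge on cod followed by a rename:
# sketch of the equivalent join in pandas (not the post's actual R code)
deaths_labeled = (r.deaths
    .merge(r.codes, on = "cod", how = "left")
    .rename(columns = {"disease": "cause"}))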
The total deaths for each cause vary over several orders of magnitude: there are 46,794 deaths from heart attack but only 1 from tularemia.
"cod"]).
(deaths.groupby([
size().= "per_cause").
reset_index(name = "per_cause", ascending = False)
sort_values(by )
cod per_cause
417 I21 46794
260 E11 42421
262 E14 27330
495 J44 16043
566 K70 12860
... ... ...
1079 X24 1
521 K02 1
939 V30 1
940 V33 1
182 D04 1
[1194 rows x 2 columns]
This means that rather than total numbers, it makes more sense to think in proportions. If a cause of death departs from the overall temporal pattern, then the proportion of its deaths occurring in a given hour should differ significantly from the proportion of all deaths occurring in that hour. I denote these two proportions as prop1 and prop2 respectively. To ensure that the causes we consider are sufficiently representative, we’ll only work with causes with more than 50 total deaths.
prop1 <- deaths |>
count(hod, cause, name = "per_hour_per_cause") |>
add_count(cause, wt = per_hour_per_cause, name = "per_cause") |>
mutate(prop1 = per_hour_per_cause / per_cause)
prop2 <- deaths |>
count(hod, name = "per_hour") |>
add_count(wt = per_hour, name = "total") |>
mutate(prop2 = per_hour / total)
Hadley used the mean squared error between the two proportions as a kind of distance, indicating the average degree of anomaly of a cause, and I follow suit:
dist <- prop1 |>
filter(per_cause > 50) |>
left_join(prop2, by = "hod") |>
select(
hour = hod,
cause,
n = per_cause,
prop1,
prop2
) |>
group_by(cause, n) |>
summarize(dist = mean((prop1 - prop2)^2)) |>
ungroup()
dist |>
arrange(desc(dist))
# A tibble: 447 × 3
cause n dist
<chr> <int> <dbl>
1 "Accident to powered aircraft\ncausing injury to occupant" 57 0.00573
2 "Victim of lightning" 97 0.00513
3 "Bus occupant injured in other\nand unspecified transport\nacc… 52 0.00419
4 "Assault (homicide) by smoke,\nfire, and flames" 51 0.00229
5 "Exposure to electric\ntransmission lines" 77 0.00161
6 "Sudden infant death syndrome" 323 0.00156
7 "Drowning and submersion while\nin natural water" 469 0.00133
8 "Motorcycle rider injured in\ncollision with car, pickup\ntruc… 66 0.00126
9 "Contact with hornets, wasps,\nand bees" 86 0.00118
10 "Exposure to smoke, fire, and\nflames, undetermined intent" 51 0.00110
# … with 437 more rows
Here we see that the causes of death with the highest dist are mainly accidents and rare diseases. However, there is a negative correlation between the frequency of a cause and its deviation, as shown in the following plot, so a result based solely on the dist column would be biased in favour of rare causes.
dist |>
ggplot(aes(n, dist)) +
geom_jitter() +
ggrepel::geom_text_repel(
aes(label = cause),
top_n(dist, 10)
) +
scale_x_log10() +
scale_y_log10() +
geom_smooth(method = "lm") +
labs(
title = "Temporal deviation of causes of deaths in Mexico",
y = NULL,
x = "total death"
)
Thus, our final solution is to build a model with n as the predictor and dist as the response. The causes with the highest residuals are assumed to deviate the most. Since the linear trend fits the data quite well, I opt for linear regression (Hadley used a robust linear model in the paper).
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3.74 0.110 -34.0 8.34e-126
2 log(n) -0.869 0.0186 -46.8 7.55e-174
Let’s plot these residuals against the predictor log(n):
augment(lm_fit) |>
ggplot(aes(`log(n)`, .std.resid)) +
geom_hline(yintercept = 0, color = "red") +
geom_point()
The plot shows an empty region around a standardized residual of 1.5. So, somewhat arbitrarily, we’ll select those diseases with a residual greater than 1.5:
rows <- augment(lm_fit) |>
mutate(row = row_number()) |>
filter(.std.resid > 1.5) |>
select(row) |>
pull(row)
unusual <- dist |>
mutate(row = row_number()) |>
filter(row %in% rows) |>
select(cause, n)
unusual
# A tibble: 26 × 2
cause n
<chr> <int>
1 "Accident to powered aircraft\ncausing injury to occupant" 57
2 "Acute bronchitis" 684
3 "Acute myocardial infarction" 46794
4 "Assault (homicide) by other\nand unspecified firearm\ndischarge" 7228
5 "Assault (homicide) by sharp\nobject" 1575
6 "Assault (homicide) by smoke,\nfire, and flames" 51
7 "Bus occupant injured in other\nand unspecified transport\naccidents" 52
8 "Car occupant injured in\nnoncollision transport\naccident" 616
9 "Car occupant injured in other\nand unspecified transport\naccidents" 1938
10 "Contact with hornets, wasps,\nand bees" 86
# … with 16 more rows
Finally, we plot the temporal course for each unusual cause of death.
prop1 |>
filter(cause %in% unusual$cause) |>
left_join(prop2, by = "hod") |>
pivot_longer(c(prop1, prop2)) |>
ggplot(aes(hod, value, color = name)) +
geom_line() +
scale_color_manual(
name = NULL,
labels = c("cause-specific", "overall"),
values = c("#FFBF0F", "#0382E5")
) +
facet_wrap(~cause, scales = "free_y", labeller = label_wrap_gen(20)) +
labs(
x = "hour", y = NULL, title = "Most deviated causes of death",
subtitle = "comparing cause-specific temporal pattern to overall trend"
) +
theme_minimal(base_size = 16) +
theme(legend.position = "top")