From R for Data Science
Exercises 6.2.1
Happy Halloween
1-For each of the sample tables, describe what each observation and each column represents.
table1
table2
table3
2-Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:
-Extract the number of TB cases per country per year. -Extract the matching population per country per year. -Divide cases by population, and multiply by 10000. -Store back in the appropriate place.
For table2, thinking through this, I would create a new dataframe with just country, year, and count for TB cases. Then I’d do the same for population. Then I’d join those two tables based on year and country. Then I’d take do the calculation for the rate.
But just doing this with a case_when:
#create the table with some rows
> table2 <- data.frame(country=c("Af", "Af", "Af", "Af","Br", "Br"),
year = c("1999", "1999", "2000", "2000", "1999", "1999"),
type=c("cases", "population", "cases", "population", "cases", "population"),
count=c(745,19987071,2677,20595360,37737,172006362)
)
> table2
country year type count
1 Af 1999 cases 745
2 Af 1999 population 19987071
3 Af 2000 cases 2677
4 Af 2000 population 20595360
5 Br 1999 cases 37737
6 Br 1999 population 172006362
#create new rows for just tbcases and population using case_when()
> table2_rate <- table2 |>
mutate(tbcases = case_when(type == "cases" ~ count, .default=0)) |>
mutate(population = case_when(type == "population" ~ count,.default=0))
> table2_rate
country year type count tbcases population
1 Af 1999 cases 745 745 0
2 Af 1999 population 19987071 0 19987071
3 Af 2000 cases 2677 2677 0
4 Af 2000 population 20595360 0 20595360
5 Br 1999 cases 37737 37737 0
6 Br 1999 population 172006362 0 172006362
#Collapse the rows using group_by() and summarize() with sum(), then divide the new total columns and multiple by 10000 to get the rate
> table2_rate1 <- table2_rate |>
group_by(country, year) |>
summarize(total_tb = sum(tbcases),
total_pop=sum(population)) |>
mutate(rate = (total_tb / total_pop) * 10000)
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
> table2_rate1
# A tibble: 3 × 5
# Groups: country [2]
country year total_tb total_pop rate
<chr> <chr> <dbl> <dbl> <dbl>
1 Af 1999 745 19987071 0.373
2 Af 2000 2677 20595360 1.30
3 Br 1999 37737 172006362 2.19
For table 3, thinking through it, split the rate column on the “/”, then change to numeric, then divide the resulting columns
#create the table
> table3_rate <- data.frame(country=c("Af","Af","Br","Br","Ch","Ch"),
year=c("1999","2000","1999","2000","1999","2000"),
rate=c("745/19987071","2677/2059360","37737/172006362","80488/174504898","212258/1272915272","213766/1280428583")
)
> table3_rate
country year rate
1 Af 1999 745/19987071
2 Af 2000 2677/2059360
3 Br 1999 37737/172006362
4 Br 2000 80488/174504898
5 Ch 1999 212258/1272915272
6 Ch 2000 213766/1280428583
> table3_rate <- table3 |>
separate(rate,c("tbcases","population")) |>
mutate(rate = (as.numeric(tbcases) / as.numeric(population) * 1000)
> table3_rate <- table3 |>
separate(rate,c("tbcases","population")) |>
mutate(rate = (as.numeric(tbcases) / as.numeric(population)) * 1000)
> table3_rate
country year tbcases population rate
1 Af 1999 745 19987071 0.0372741
2 Af 2000 2677 2059360 1.2999184
3 Br 1999 37737 172006362 0.2193930
4 Br 2000 80488 174504898 0.4612363
5 Ch 1999 212258 1272915272 0.1667495
6 Ch 2000 213766 1280428583 0.1669488