Blog Home

R for Data Science 6.2.1 Exercises

From R for Data Science

Exercises 6.2.1

Happy Halloween

1-For each of the sample tables, describe what each observation and each column represents.

table1

  • country column is the country for the TB cases and population
  • year column is the year for the TB cases and population
  • cases column is the number of TB cases for the year and country
  • population column is the population for the year and country

table2

  • country column is the country for eiher TB cases or population (type column)
  • year column is the year for either the TB cases or population (type column)
  • type is either the TB cases or population in the count
  • count is either the count of TB cases or population

table3

  • country column is the country for rate of TB cases
  • year column is the year for the rate of TB cases

2-Sketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:

-Extract the number of TB cases per country per year. -Extract the matching population per country per year. -Divide cases by population, and multiply by 10000. -Store back in the appropriate place.

For table2, thinking through this, I would create a new dataframe with just country, year, and count for TB cases. Then I’d do the same for population. Then I’d join those two tables based on year and country. Then I’d take do the calculation for the rate.

But just doing this with a case_when:

#create the table with some rows
> table2 <- data.frame(country=c("Af", "Af", "Af", "Af","Br", "Br"),
                      year = c("1999", "1999", "2000", "2000", "1999", "1999"),
                      type=c("cases", "population", "cases", "population", "cases", "population"),
                     count=c(745,19987071,2677,20595360,37737,172006362)
                      )
> table2
  country year       type     count
1      Af 1999      cases       745
2      Af 1999 population  19987071
3      Af 2000      cases      2677
4      Af 2000 population  20595360
5      Br 1999      cases     37737
6      Br 1999 population 172006362

#create new rows for just tbcases and population using case_when()
> table2_rate <- table2 |>
                       mutate(tbcases = case_when(type == "cases" ~ count, .default=0)) |>
                       mutate(population = case_when(type == "population" ~ count,.default=0))

> table2_rate
  country year       type     count tbcases population
1      Af 1999      cases       745     745          0
2      Af 1999 population  19987071       0   19987071
3      Af 2000      cases      2677    2677          0
4      Af 2000 population  20595360       0   20595360
5      Br 1999      cases     37737   37737          0
6      Br 1999 population 172006362       0  172006362

#Collapse the rows using group_by() and summarize() with sum(), then divide the new total columns and multiple by 10000 to get the rate

> table2_rate1 <- table2_rate |>
                    group_by(country, year) |>
                    summarize(total_tb = sum(tbcases), 
                              total_pop=sum(population)) |>
                    mutate(rate = (total_tb / total_pop) * 10000)
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
> table2_rate1
# A tibble: 3 × 5
# Groups:   country [2]
  country year  total_tb total_pop  rate
  <chr>   <chr>    <dbl>     <dbl> <dbl>
1 Af      1999       745  19987071 0.373
2 Af      2000      2677  20595360 1.30 
3 Br      1999     37737 172006362 2.19 

For table 3, thinking through it, split the rate column on the “/”, then change to numeric, then divide the resulting columns

#create the table

> table3_rate <- data.frame(country=c("Af","Af","Br","Br","Ch","Ch"),
                           year=c("1999","2000","1999","2000","1999","2000"),
                           rate=c("745/19987071","2677/2059360","37737/172006362","80488/174504898","212258/1272915272","213766/1280428583")
                           )
> table3_rate
  country year              rate
1      Af 1999      745/19987071
2      Af 2000      2677/2059360
3      Br 1999   37737/172006362
4      Br 2000   80488/174504898
5      Ch 1999 212258/1272915272
6      Ch 2000 213766/1280428583

> table3_rate <- table3 |>
                   separate(rate,c("tbcases","population")) |>
                   mutate(rate = (as.numeric(tbcases) / as.numeric(population) * 1000)
 
> table3_rate <- table3 |>
                   separate(rate,c("tbcases","population")) |>
                   mutate(rate = (as.numeric(tbcases) / as.numeric(population)) * 1000)
> table3_rate
  country year tbcases population      rate
1      Af 1999     745   19987071 0.0372741
2      Af 2000    2677    2059360 1.2999184
3      Br 1999   37737  172006362 0.2193930
4      Br 2000   80488  174504898 0.4612363
5      Ch 1999  212258 1272915272 0.1667495
6      Ch 2000  213766 1280428583 0.1669488