R for Data Science 4.5.7 Exercises

From R for Data Science

Exercises 4.5.7

1-Which carrier has the worst average delays? Challenge-can you disentangle the effects of bad airports vs bad carriers? Why or why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

flights |> 
     group_by(carrier) |>
     summarize(avg_delay = mean(arr_delay, na.rm=TRUE)) |>
     arrange(desc(avg_delay))

# A tibble: 16 × 2
   carrier avg_delay
   <chr>       <dbl>
 1 F9         21.9  
 2 FL         20.1  
 3 EV         15.8  
 4 YV         15.6  
 5 OO         11.9  
 6 MQ         10.8  
 7 WN          9.65 
 8 B6          9.46 
 9 9E          7.38 
10 UA          3.56 
11 US          2.13 
12 VX          1.76 
13 DL          1.64 
14 AA          0.364
15 HA         -6.92 
16 AS         -9.93 

It is difficult to disentangle the effects of bad airports and bad carriers because some destinations are served by only a few carriers. For example, CAE, the destination with the highest average delay, is served by only two carriers, EV and 9E, and EV flew 113 of the 116 flights there.

> flights |>
         filter(dest=='CAE') |>
         group_by(carrier) |>
         summarize(avg_delay = mean(arr_delay, na.rm=TRUE), n=n()) |>
         arrange(desc(avg_delay))
# A tibble: 2 × 3
  carrier avg_delay     n
  <chr>       <dbl> <int>
1 EV           42.8   113
2 9E            6       3

Carrier EV flew to 60 other destinations (61 in total) and had the third-highest average delay of any carrier.

> flights |>
         filter(carrier=='EV') |>
         group_by(dest) |>
         summarize(avg_delay = mean(arr_delay, na.rm=TRUE), n=n()) |>
         arrange(desc(avg_delay))
# A tibble: 61 × 3
   dest  avg_delay     n
   <chr>     <dbl> <int>
 1 CAE        42.8   113
 2 TYS        41.2   323
 3 PBI        40.7     6
 4 TUL        33.7   315
 5 OKC        30.6   346
 6 MKE        23.2  1118
 7 PWM        22.1   813
 8 DCA        21.3  1717
 9 DSM        21.2   478
10 RIC        21.2  2114
# ℹ 51 more rows
# ℹ Use `print(n = ...)` to see more rows
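
One way to push the comparison a little further, sketched here under the assumption that dplyr and nycflights13 are installed: count how many carriers serve each destination, then compare every flight with the average delay at its own destination. The names carriers_per_dest, dest_avg, and avg_vs_dest are made up for this illustration, and this is only a rough adjustment, not a proper model.

```r
library(dplyr)
library(nycflights13)

# How many different carriers serve each destination?
carriers_per_dest <- flights |>
  group_by(dest) |>
  summarize(n_carriers = n_distinct(carrier), n_flights = n()) |>
  arrange(n_carriers)

# Compare each flight with the average arrival delay at its destination,
# then average that difference per carrier.
rel_delay <- flights |>
  filter(!is.na(arr_delay)) |>
  group_by(dest) |>
  mutate(dest_avg = mean(arr_delay)) |>
  group_by(carrier) |>
  summarize(avg_vs_dest = mean(arr_delay - dest_avg), n = n()) |>
  arrange(desc(avg_vs_dest))
```

Destinations with n_carriers equal to 1 cannot tell us anything about carrier versus airport, since the two are perfectly confounded there.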

2-Find the flights that are most delayed upon departure from each destination.

> flights |>
+     group_by(dest) |>
+     slice_max(dep_delay, n=1) |>
+     relocate(dest)
# A tibble: 105 × 19
# Groups:   dest [105]
   dest   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1 ABQ    2013    12    14     2223           2001       142      133           2304       149 B6     
 2 ACK    2013     7    23     1139            800       219     1250            909       221 B6     
 3 ALB    2013     1    25      123           2000       323      229           2101       328 EV     
 4 ANC    2013     8    17     1740           1625        75     2042           2003        39 UA     
 5 ATL    2013     7    22     2257            759       898      121           1026       895 DL     
 6 AUS    2013     7    10     2056           1505       351     2347           1758       349 UA     
 7 AVL    2013     6    14     1158            816       222     1335           1007       208 EV     
 8 BDL    2013     2    21     1728           1316       252     1839           1413       266 EV     
 9 BGR    2013    12     1     1504           1056       248     1628           1230       238 EV     
10 BHM    2013     4    10       25           1900       325      136           2045       291 EV     
# ℹ 95 more rows
# ℹ 8 more variables: flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# ℹ Use `print(n = ...)` to see more rows
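
A small sketch of how slice_max() treats ties, using a made-up tibble (toy and its values are not from the flights data). By default, with_ties = TRUE keeps every row that shares the maximum, so a group can return more than n rows; with_ties = FALSE returns exactly n rows per group.

```r
library(dplyr)

# Hypothetical data: destination A has two flights tied for the largest delay
toy <- tibble(
  dest = c("A", "A", "A", "B", "B"),
  dep_delay = c(10, 30, 30, 5, 7)
)

# Default: ties are kept, so A contributes two rows
top_with_ties <- toy |>
  group_by(dest) |>
  slice_max(dep_delay, n = 1)

# Exactly one row per destination
top_no_ties <- toy |>
  group_by(dest) |>
  slice_max(dep_delay, n = 1, with_ties = FALSE)
```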

3-How do delays vary over the course of the day? Illustrate your answer with a plot.

Average arrival delay climbs through the day and starts to fall after about hour 20. Here hour is the scheduled departure hour.

> del_arr_times <- flights |>
     group_by(hour) |>
     summarize(avg_delay = mean(arr_delay, na.rm=TRUE))

#plot
> ggplot(del_arr_times, aes(x=hour, y=avg_delay)) +
     geom_point() +
     geom_smooth()
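
A possible variation on the plot, assuming dplyr, ggplot2, and nycflights13 are installed. It also keeps the number of flights per hour, so that hours with very few flights are easy to spot, and drops any hour where every arrival delay is missing. The names delay_by_hour and p are just for this sketch.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

delay_by_hour <- flights |>
  group_by(hour) |>
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) |>
  # mean() of an all-missing group is NaN, which is.na() also catches
  filter(!is.na(avg_arr_delay))

# Line for the trend, point size for the number of flights in that hour
p <- ggplot(delay_by_hour, aes(x = hour, y = avg_arr_delay)) +
  geom_line() +
  geom_point(aes(size = n))
```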

4-What happens if you supply a negative n to slice_min() and friends?

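A negative n is subtracted from the group size, so n = -1 asks for all but one row per group rather than switching the slice off. Here almost every row comes back: 336,775 of the 336,776 rows in flights. Ties are kept by default (with_ties = TRUE), which is probably why so few rows are dropped.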

> flights |>
     group_by(carrier) |>
     slice_min(arr_delay, n=-1) |>
     relocate(dest)
# A tibble: 336,775 × 19
# Groups:   carrier [16]
   dest   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1 MCI    2013     5     6     1846           1859       -13     2026           2134
 2 MSP    2013     8    20     1555           1559        -4     1720           1828
 3 MCI    2013     8    30     1703           1715       -12     1846           1952
 4 MKE    2013     5     1     1550           1610       -20     1704           1807
 5 DFW    2013     5     7     1521           1530        -9     1735           1837
 6 DCA    2013     5     7     2041           2047        -6     2133           2235
 7 DCA    2013     5    17     2037           2047       -10     2133           2235
 8 DFW    2013    11    13     1813           1815        -2     2027           2127
 9 MCI    2013     2    16     1818           1830       -12     2004           2104
10 MSY    2013     1    27     1839           1845        -6     2038           2137
# ℹ 336,765 more rows
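
The effect is easier to see on a tiny made-up tibble with no ties (toy is hypothetical, and the behaviour sketched here assumes a recent dplyr, 1.1.0 or later as far as I can tell). With n = -1, each group keeps its size minus one rows, so slice_min() drops only the largest value in each group.

```r
library(dplyr)

# Hypothetical data: group a has 3 rows, group b has 2
toy <- tibble(
  g = c("a", "a", "a", "b", "b"),
  x = c(3, 1, 2, 10, 20)
)

# n = -1 means "group size minus 1" rows per group
all_but_largest <- toy |>
  group_by(g) |>
  slice_min(x, n = -1)
```

Group a keeps the values 1 and 2, and group b keeps 10, so three of the five rows are returned.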

5-Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

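count() counts the number of rows in each group: flights |> count(carrier) is shorthand for group_by(carrier) |> summarize(n = n()). Setting sort = TRUE arranges the result in descending order of n, so the largest groups come first.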

> flights |>
     group_by(carrier) |>
     count()
# A tibble: 16 × 2
# Groups:   carrier [16]
   carrier     n
   <chr>   <int>
 1 9E      18460
 2 AA      32729
 3 AS        714
 4 B6      54635
 5 DL      48110
 6 EV      54173
 7 F9        685
 8 FL       3260
 9 HA        342
10 MQ      26397
11 OO         32
12 UA      58665
13 US      20536
14 VX       5162
15 WN      12275
16 YV        601
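
A short sketch of the equivalence and of the sort argument, using a made-up tibble (toy and its carrier codes are just for illustration):

```r
library(dplyr)

# Hypothetical data: UA appears 3 times, AA twice, DL once
toy <- tibble(carrier = c("UA", "AA", "UA", "DL", "UA", "AA"))

# count() with sort = TRUE puts the largest groups first
by_count <- toy |>
  count(carrier, sort = TRUE)

# The same result written out with the verbs from this chapter
by_hand <- toy |>
  group_by(carrier) |>
  summarize(n = n()) |>
  arrange(desc(n))
```

Both give UA, AA, DL with counts 3, 2, 1. Without sort = TRUE, count() returns the rows ordered by carrier instead.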

6-Suppose we have the following tiny data frame:

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

a-Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

It will show columns x, y, z:

x y z
1 a K
2 b K
etc.

Output

> df
# A tibble: 5 × 3
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     2 b     K    
3     3 a     L    
4     4 a     L    
5     5 b     K   

df |>
    group_by(y)

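group_by(y) groups the data frame by column y. The rows and their order are unchanged; only the grouping is recorded (the output gains a "# Groups: y [2]" line), and later verbs such as summarize() then operate within each group.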

b-Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?

df |>
  arrange(y)

It will look the same but will note that it’s grouped by y

arrange() sorts the rows by the values of the specified column.

It’s different from group_by() because it changes the order of the rows but adds no grouping, while group_by() leaves the row order alone and only records the groups.

> df |>
+     group_by(y)
# A tibble: 5 × 3
# Groups:   y [2]
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     2 b     K    
3     3 a     L    
4     4 a     L    
5     5 b     K    
> df |>
+     arrange(y)
# A tibble: 5 × 3
      x y     z    
  <int> <chr> <chr>
1     1 a     K    
2     3 a     L    
3     4 a     L    
4     2 b     K    
5     5 b     K  
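
The difference can also be checked directly, without reading the printed output. This is a small sketch using dplyr's group_vars(); the names grouped and sorted are just for illustration.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

grouped <- df |> group_by(y)
sorted <- df |> arrange(y)

group_vars(grouped)  # "y": grouping recorded
group_vars(sorted)   # character(0): no grouping

grouped$x  # 1 2 3 4 5: row order unchanged
sorted$x   # 1 3 4 2 5: rows reordered by y
```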

c-Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

df |>
  group_by(y) |>
  summarize(mean_x = mean(x))

What I think it will look like:

y mean_x
a n
b n

What it looks like

> df |>
     group_by(y) |>
     summarize(mean_x=mean(x))
# A tibble: 2 × 2
  y     mean_x
  <chr>  <dbl>
1 a       2.67
2 b       3.5 

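The pipeline groups df by y and then computes the mean of x within each group, so the result has one row per value of y: a is (1 + 3 + 4) / 3 = 2.67 and b is (2 + 5) / 2 = 3.5. Chaining the steps with the pipe means that no intermediate variables or nested calls are needed.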

d-Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))

What I think it will look like

y z mean_x
a K n
a L n
b K n
b L n

What the output looks like

> df |>
     group_by(y,z) |>
     summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can
override using the `.groups` argument.
# A tibble: 3 × 3
# Groups:   y [2]
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

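The pipeline computes the mean of x for each combination of y and z that occurs in the data (there is no b/L combination, so the result has three rows, not four). The message appears when the grouping has more than one variable: by default, summarize() removes the last grouping variable (z here), so the output is still grouped by y.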

e-Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d).

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

I think it will look the same as output from (d) question.

> df |>
     group_by(y,z) |>
     summarize(mean_x = mean(x), .groups="drop")
# A tibble: 3 × 3
  y     z     mean_x
  <chr> <chr>  <dbl>
1 a     K        1  
2 a     L        3.5
3 b     K        3.5

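The values are the same, but the result is no longer grouped: the "# Groups: y [2]" line is gone because .groups = "drop" removes all grouping, and the message disappears because .groups was set explicitly.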
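
To see what is left of the grouping after summarize(), group_vars() can be applied to each result. The sketch below also tries .groups = "keep", which is not part of the exercise; the object names are just for illustration.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

by_yz <- df |> group_by(y, z)

# Default: the last grouping variable (z) is dropped, with a message
default_groups <- by_yz |> summarize(mean_x = mean(x))
# "drop": no grouping left
dropped <- by_yz |> summarize(mean_x = mean(x), .groups = "drop")
# "keep": same grouping as the input
kept <- by_yz |> summarize(mean_x = mean(x), .groups = "keep")

group_vars(default_groups)  # "y"
group_vars(dropped)         # character(0)
group_vars(kept)            # "y" "z"
```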

f-Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))

df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x))

I think these will look the same as (d) and (e) do

Output

> df |>
     group_by(y,z) |>
     mutate(mean_x = mean(x))
# A tibble: 5 × 4
# Groups:   y, z [3]
      x y     z     mean_x
  <int> <chr> <chr>  <dbl>
1     1 a     K        1  
2     2 b     K        3.5
3     3 a     L        3.5
4     4 a     L        3.5
5     5 b     K        3.5

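I was wrong about the mutate() output. summarize() collapses each group to a single row (three rows, as in part (d)), while mutate() keeps all five rows and the x column, adds a mean_x column holding each row's group mean, and leaves the result grouped by y and z.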
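
A quick side-by-side check of the two shapes, as a sketch (the names summarized and mutated are just for illustration, and the grouping is dropped at the end of each pipeline to keep the comparison simple):

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# One row per (y, z) combination
summarized <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# One row per original row, with the group mean attached
mutated <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup()

dim(summarized)  # 3 rows, 3 columns: y, z, mean_x
dim(mutated)     # 5 rows, 4 columns: x, y, z, mean_x
```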