From R for Data Science
Exercises 4.5.7
1-Which carrier has the worst average delays? Challenge-can you disentangle the effects of bad airports vs bad carriers? Why or why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
flights |>
group_by(carrier) |>
summarize(avg_delay = mean(arr_delay, na.rm=TRUE)) |>
arrange(desc(avg_delay))
# A tibble: 16 × 2
carrier avg_delay
<chr> <dbl>
1 F9 21.9
2 FL 20.1
3 EV 15.8
4 YV 15.6
5 OO 11.9
6 MQ 10.8
7 WN 9.65
8 B6 9.46
9 9E 7.38
10 UA 3.56
11 US 2.13
12 VX 1.76
13 DL 1.64
14 AA 0.364
15 HA -6.92
16 AS -9.93
It is difficult to disentangle the effects of bad airports vs bad carriers because some destinations are served by only a few carriers, so a carrier's average delay is confounded with the delays typical of the destinations it serves. For example, the destination with the highest average delay, CAE, has only two carriers, EV and 9E, and EV flew 113 of those 116 flights.
> flights |>
filter(dest=='CAE') |>
group_by(carrier) |>
summarize(avg_delay = mean(arr_delay, na.rm=TRUE), n=n()) |>
arrange(desc(avg_delay))
# A tibble: 2 × 3
carrier avg_delay n
<chr> <dbl> <int>
1 EV 42.8 113
2 9E 6 3
Carrier EV flew to 60 other destinations (61 in total) and had the third-highest average delay of all carriers (15.8 minutes).
> flights |>
filter(carrier=='EV') |>
group_by(dest) |>
summarize(avg_delay = mean(arr_delay, na.rm=TRUE), n=n()) |>
arrange(desc(avg_delay))
# A tibble: 61 × 3
dest avg_delay n
<chr> <dbl> <int>
1 CAE 42.8 113
2 TYS 41.2 323
3 PBI 40.7 6
4 TUL 33.7 315
5 OKC 30.6 346
6 MKE 23.2 1118
7 PWM 22.1 813
8 DCA 21.3 1717
9 DSM 21.2 478
10 RIC 21.2 2114
# ℹ 51 more rows
# ℹ Use `print(n = ...)` to see more rows
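The hint in the question can be sketched on a small made-up table: count flights per carrier-destination pair, then look at each carrier's share of a destination's flights. The `toy` tibble and its values are hypothetical and are not taken from flights; this is only an illustration of the idea, assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: CAE is served almost entirely by one carrier
toy <- tibble(
  carrier   = c("EV", "EV", "EV", "9E", "UA", "DL", "UA", "DL"),
  dest      = c("CAE", "CAE", "CAE", "CAE", "ORD", "ORD", "ORD", "ORD"),
  arr_delay = c(40, 50, 45, 6, 5, 10, 15, 0)
)

# Number of flights and mean delay for each carrier-destination pair
by_pair <- toy |>
  group_by(carrier, dest) |>
  summarize(n = n(), avg_delay = mean(arr_delay), .groups = "drop")

# Share of each destination's flights flown by each carrier
shares <- by_pair |>
  group_by(dest) |>
  mutate(share = n / sum(n)) |>
  ungroup()

shares
```

When one carrier has most of a destination's flights, as EV does for CAE in the toy data, the destination's average delay and that carrier's delay there are nearly the same number, so the two effects cannot be separated from that destination alone.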
2-Find the flights that are most delayed upon departure from each destination.
> flights |>
+ group_by(dest) |>
+ slice_max(dep_delay, n=1) |>
+ relocate(dest)
# A tibble: 105 × 19
# Groups: dest [105]
dest year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<chr> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 ABQ 2013 12 14 2223 2001 142 133 2304 149 B6
2 ACK 2013 7 23 1139 800 219 1250 909 221 B6
3 ALB 2013 1 25 123 2000 323 229 2101 328 EV
4 ANC 2013 8 17 1740 1625 75 2042 2003 39 UA
5 ATL 2013 7 22 2257 759 898 121 1026 895 DL
6 AUS 2013 7 10 2056 1505 351 2347 1758 349 UA
7 AVL 2013 6 14 1158 816 222 1335 1007 208 EV
8 BDL 2013 2 21 1728 1316 252 1839 1413 266 EV
9 BGR 2013 12 1 1504 1056 248 1628 1230 238 EV
10 BHM 2013 4 10 25 1900 325 136 2045 291 EV
# ℹ 95 more rows
# ℹ 8 more variables: flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# ℹ Use `print(n = ...)` to see more rows
3-How do delays vary over the course of the day? Illustrate your answer with a plot.
Average arrival delay increases over the course of the day and begins to decrease after hour 20. Here hour is the scheduled departure hour.
> del_arr_times <- flights |>
group_by(hour) |>
summarize(avg_delay = mean(arr_delay, na.rm=TRUE))
#plot
> ggplot(del_arr_times, aes(x=hour, y=avg_delay)) +
geom_point() +
geom_smooth()
4-What happens if you supply a negative n to slice_min() and friends?
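It looks as if no slicing happens: 336,775 of the 336,776 rows are returned, ordered by arr_delay within each carrier. According to the dplyr documentation, a negative n is subtracted from the group size, so n = -1 asks for all but one row per group. With the default with_ties = TRUE, tied rows are kept together, and rows with a missing arr_delay are likely treated as ties at the end, which would explain why almost nothing is dropped.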
> flights |>
group_by(carrier) |>
slice_min(arr_delay, n=-1) |>
relocate(dest)
# A tibble: 336,775 × 19
# Groups: carrier [16]
dest year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<chr> <int> <int> <int> <int> <int> <dbl> <int> <int>
1 MCI 2013 5 6 1846 1859 -13 2026 2134
2 MSP 2013 8 20 1555 1559 -4 1720 1828
3 MCI 2013 8 30 1703 1715 -12 1846 1952
4 MKE 2013 5 1 1550 1610 -20 1704 1807
5 DFW 2013 5 7 1521 1530 -9 1735 1837
6 DCA 2013 5 7 2041 2047 -6 2133 2235
7 DCA 2013 5 17 2037 2047 -10 2133 2235
8 DFW 2013 11 13 1813 1815 -2 2027 2127
9 MCI 2013 2 16 1818 1830 -12 2004 2104
10 MSY 2013 1 27 1839 1845 -6 2038 2137
# ℹ 336,765 more rows
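The effect of a negative n is easier to see on a tiny table with no ties and no missing values. This is a minimal sketch, assuming dplyr 1.1.0 or later (older versions may behave differently); the `small` tibble is hypothetical.

```r
library(dplyr)

# Hypothetical data: five distinct values, no ties, no NAs
small <- tibble(x = c(5, 3, 1, 4, 2))

# n = -1 is treated as "group size minus 1", so 4 of the 5 rows are kept
# and the row with the largest x is the one left out
res <- small |> slice_min(x, n = -1)

nrow(res)
res$x
```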
5-Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
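count() counts the number of rows in each group. flights |> count(carrier) is equivalent to flights |> group_by(carrier) |> summarize(n = n()). Setting sort = TRUE arranges the result in descending order of n, so the largest groups come first.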
> flights |>
group_by(carrier) |>
count()
# A tibble: 16 × 2
# Groups: carrier [16]
carrier n
<chr> <int>
1 9E 18460
2 AA 32729
3 AS 714
4 B6 54635
5 DL 48110
6 EV 54173
7 F9 685
8 FL 3260
9 HA 342
10 MQ 26397
11 OO 32
12 UA 58665
13 US 20536
14 VX 5162
15 WN 12275
16 YV 601
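The equivalence between count() and group_by() plus summarize(), and the effect of sort, can be checked on a small example. The `toy` tibble below is hypothetical and only for illustration.

```r
library(dplyr)

# Hypothetical data: three carriers with different numbers of rows
toy <- tibble(carrier = c("UA", "DL", "UA", "AA", "UA", "DL"))

# count() and group_by() + summarize(n = n()) give the same table
via_count   <- toy |> count(carrier)
via_summary <- toy |> group_by(carrier) |> summarize(n = n())

# sort = TRUE orders the result by n, largest first
sorted <- toy |> count(carrier, sort = TRUE)

via_count
sorted
```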
6-Suppose we have the following tiny data frame:
df <- tibble(
x = 1:5,
y = c("a", "b", "a", "a", "b"),
z = c("K", "K", "L", "L", "K")
)
a-Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.
It will show columns x, y, z:
x y z
1 a K
2 b K
etc.
Output
> df
# A tibble: 5 × 3
x y z
<int> <chr> <chr>
1 1 a K
2 2 b K
3 3 a L
4 4 a L
5 5 b K
df |>
group_by(y)
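group_by(y) will group the data frame by column y. The rows and their order are unchanged; only the grouping is recorded, for use by later operations.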
b-Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a)?
df |>
arrange(y)
The group_by(y) output from part (a) looks the same as df, with an added note that it is grouped by y.
arrange(y) sorts the rows by the values in column y.
It’s different from group_by() because it changes the row order, whereas group_by() leaves the order unchanged and only records the grouping.
> df |>
+ group_by(y)
# A tibble: 5 × 3
# Groups: y [2]
x y z
<int> <chr> <chr>
1 1 a K
2 2 b K
3 3 a L
4 4 a L
5 5 b K
> df |>
+ arrange(y)
# A tibble: 5 × 3
x y z
<int> <chr> <chr>
1 1 a K
2 3 a L
3 4 a L
4 2 b K
5 5 b K
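The difference between the two verbs can also be checked programmatically with group_vars(), which returns the names of the grouping variables. A minimal sketch using the same df:

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

grouped <- df |> group_by(y)
sorted  <- df |> arrange(y)

group_vars(grouped)  # grouping is recorded
grouped$x            # row order is unchanged
group_vars(sorted)   # no grouping
sorted$x             # rows are reordered by y
```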
c-Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
df |>
group_by(y) |>
summarize(mean_x = mean(x))
What I think it will look like:
y mean_x
a n
b n
What it looks like
> df |>
group_by(y) |>
summarize(mean_x=mean(x))
# A tibble: 2 × 2
y mean_x
<chr> <dbl>
1 a 2.67
2 b 3.5
The pipeline groups the rows by y and then computes the mean of x within each group, returning one row per value of y. Chaining the steps with the pipe means no intermediate variables or nested calls are needed.
d-Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.
df |>
group_by(y, z) |>
summarize(mean_x = mean(x))
What I think it will look like:
y z mean_x
a K n
a L n
b K n
b L n
What the output looks like
> df |>
group_by(y,z) |>
summarize(mean_x = mean(x))
`summarise()` has grouped output by 'y'. You can
override using the `.groups` argument.
# A tibble: 3 × 3
# Groups: y [2]
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
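The pipeline computes the mean of x for each combination of y and z that occurs in the data; there are no rows with y = b and z = L, so only three rows are returned. The message appears when the grouping has more than one variable, and indicates that summarize() has dropped the last grouping variable (z), so the result is still grouped by y.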
e-Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d).
df |>
group_by(y, z) |>
summarize(mean_x = mean(x), .groups = "drop")
I think it will look the same as the output from part (d).
> df |>
group_by(y,z) |>
summarize(mean_x = mean(x), .groups="drop")
# A tibble: 3 × 3
y z mean_x
<chr> <chr> <dbl>
1 a K 1
2 a L 3.5
3 b K 3.5
It’s different because the output is no longer grouped (the "# Groups: y [2]" line is gone) and the message is not shown, since .groups = "drop" removes all grouping.
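The grouping left behind by each choice of .groups can be inspected with group_vars(). In this sketch, .groups = "drop_last" is written out explicitly; it matches the behaviour reported by the message in part (d) and also silences that message.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Drops only the last grouping variable (z); result stays grouped by y
kept <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop_last")

# Drops all grouping; result is an ordinary ungrouped tibble
dropped <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

group_vars(kept)
group_vars(dropped)
```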
f-Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?
df |>
group_by(y, z) |>
summarize(mean_x = mean(x))
df |>
group_by(y, z) |>
mutate(mean_x = mean(x))
I think these will look the same as (d) and (e) do
Output
> df |>
group_by(y,z) |>
mutate(mean_x = mean(x))
# A tibble: 5 × 4
# Groups: y, z [3]
x y z mean_x
<int> <chr> <chr> <dbl>
1 1 a K 1
2 2 b K 3.5
3 3 a L 3.5
4 4 a L 3.5
5 5 b K 3.5
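I was wrong about the mutate() output. summarize() collapses each group to one row, whereas mutate() keeps all five rows and the x column, adds a mean_x column holding each row’s group mean, and leaves the result grouped by y and z.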
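A quick way to confirm the difference is to compare row counts. This is a minimal sketch with the same df; the ungrouping steps are added only so the results are plain tibbles.

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# summarize(): one row per (y, z) group
collapsed <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# mutate(): every original row kept, group mean repeated within each group
expanded <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup()

nrow(collapsed)
nrow(expanded)
```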