5 Data transformation

5.2 Filter rows with filter()

Exercise 5.2.4.1

Find all flights that:

  1. Had an arrival delay of two or more hours
  2. Flew to Houston (IAH or HOU)
  3. Were operated by United, American, or Delta
  4. Departed in summer (July, August, and September)
  5. Arrived more than two hours late, but didn’t leave late
  6. Were delayed by at least an hour, but made up over 30 minutes in flight
  7. Departed between midnight and 6am (inclusive)

The answer to each part follows.

  1. Using a couple of functions to knowledge, including arrange() and %>% (pipe). More on pipe later. Viewed the data ? and glimpse() before constructing filter() functions.

    data('flights')
    glimpse(flights)
    #> Observations: 336,776
    #> Variables: 19
    #> $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
    #> $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
    #> $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
    #> $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
    #> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
    #> $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
    #> $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
    #> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
    #> $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
    #> $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
    #> $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
    #> $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
    #> $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
    #> $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
    #> $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
    #> $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
    #> $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
    #> $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
    #> $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
    ?flights
    
    filter(flights, arr_delay >= 120) %>% arrange(arr_delay)
    #> # A tibble: 10,200 x 19
    #>    year month   day dep_time sched_dep_time dep_delay arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
    #> 1  2013     1     2     1806           1629        97     2008
    #> 2  2013     1    10     1801           1619       102     1923
    #> 3  2013     1    13     1958           1836        82     2231
    #> 4  2013     1    13     2145           2005       100        4
    #> 5  2013     1    14     1652           1445       127     1806
    #> 6  2013     1    15     1603           1446        77     1957
    #> # ... with 1.019e+04 more rows, and 12 more variables:
    #> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
    #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
    #> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  2. Flew to Houston (IAH or HOU)

  3. Were operated by United, American, or Delta

  4. Departed in summer (July, August, and September)

  5. Arrived more than two hours late, but didn’t leave late

  6. Were delayed by at least an hour, but made up over 30 minutes in flight

  7. Departed between midnight and 6 a.m. (inclusive)