Filtering, Transforming, and Summarizing Tables

1 Introduction

Note

This material was prepared for students in Introduction to Data Analytics and may be weird outside that context.

Notations like df[1, 2] and df$column are powerful ways to interact with two-dimensional objects like tables. But they’re also limiting and confusing. The first of these is great for looking at a single value, but doing so drops the context of the rest of the table. The second is great to consider one column of data, but doing so also ignores the other columns.

Both options manage two-dimensional objects by rejecting their two-dimensionality.

Since structured data is one of the fundamental forms of information in data analysis, we need a different way to work with tables.

2 Entering the tidyverse

When we use the library() function, we’re loading up packages to add something to “base R” to make it easier or more capable. It’s common to load up a few packages before getting started in an analysis because they provide helpful methods or approaches, or they provide built-in shortcuts that we can use. If someone else has already written a function we might need, then there’s no need to reinvent the wheel.

The tidyverse is a collection of many useful packages designed for interoperability. We’ve used it before when we visualized data using one of the best packages for that task, ggplot2. It is included in the tidyverse.

For working with two-dimensional data—tables or “data frames”—we’re going to be using dplyr, another package that is part of the tidyverse. Pronounced something like “d-plier” (think of the mechanic’s tool called pliers), the dplyr package lets us work with tables as tables, instead of dealing with them piecemeal.

When using the functionality made possible by one or two of these packages, you might want to load just that package, for example using library(dplyr) and library(ggplot2). But if you’ll be using more of them, of if you don’t remember which package provides the functionality, it’s usually fine to use library(tidyverse) to load them all at once.

Tangent

Hadley Wickham is one of the creators of all of these packages listed here: ggplot2, dplyr, and even the tidyverse. Because Wickham is from New Zealand, you might notice that British spellings are allowed (but not required) in many parts of the tidyverse landscape. Wickham lives in Houston and is the Chief Scientist at the company that makes RStudio.

3 Getting started

We always start any analysis by loading packages and our data. Here, we’ll use the tidyverse because it’s not a bad habit to start. The swac package we’re loading can be installed from GitHub using remotes::install_github("jmclawson/swac"), and its football data set shows games played by teams in the Southwestern Athletic Conference from 2015 through 2022.

library(tidyverse)
library(swac)

Once we have these things set up, it’s always a good idea to familiarize ourselves with what’s in the data. Typing the name of an object lets us see its dimensions, column names, and the first ten or so lines. For this data, each row represents a football game played by a SWAC team; when two SWAC teams play against each other, the game is represented in two rows.

football

# A tibble: 903 × 10
   season date       team  opponent stadium team_score opponent_score team_venue
    <dbl> <date>     <chr> <chr>    <chr>        <dbl>          <dbl> <fct>     
 1   2022 2022-12-17 Jack… NC Cent… Merced…         34             41 Neutral   
 2   2022 2022-12-03 Jack… Souther… MS Vet…         43             24 Home      
 3   2022 2022-12-03 Sout… Jackson… MS Vet…         24             43 Away      
 4   2022 2022-11-26 Gram… Souther… Caesar…         17             34 Neutral   
 5   2022 2022-11-26 Sout… Grambli… Caesar…         34             17 Neutral   
 6   2022 2022-11-24 Alab… UAPB Go… ASU St…         14             19 Home      
 7   2022 2022-11-24 Arka… Alabama… ASU St…         19             14 Away      
 8   2022 2022-11-19 Alab… Texas S… Louis …         24             20 Home      
 9   2022 2022-11-19 Alco… Jackson… Jack S…         13             24 Home      
10   2022 2022-11-19 Beth… Florida… Campin…         20             41 Neutral   
# ℹ 893 more rows
# ℹ 2 more variables: attendance <dbl>, homecoming <lgl>

Here, we see the football data set has 903 rows and 10 columns. These columns each provide a specific kind of data, shown under the column name with the angled brackets <…>. For instance, the “date” column provides us with data that is all of the <date> format, while the “team” column provides us with data that is of the <chr> or character / string format. Some other types include <fct> for factor / categorical, <lgl> for logical, and <dbl> meaning double (or numeric data that might have numbers both before and after the decimal point). For some reason, we don’t see it here, but we could also expect <int> for integer (or numeric data that doesn’t show anything after the decimal point).

Tangent

We might also see here that this data is a tibble. Just think of a tibble as a table or data frame with a couple added conveniences to make things nicer. Because we loaded the entire tidyverse, we didn’t need also to load the tibble package individually.

4 Selecting columns

Sometimes a data set shows us more information than we need. To select only the columns we care about, we can use the select() function.

For instance, what if we just want to see the schedule of games, including the date, the team, and their opponent? The select() function allows us to limit the output to show just these columns. The first argument in the parentheses should be the name of the object, and the remaining arguments should be the columns we want to keep.

select(football, date, team, opponent)

# A tibble: 903 × 3
   date       team                opponent              
   <date>     <chr>               <chr>                 
 1 2022-12-17 Jackson State       NC Central Eagles     
 2 2022-12-03 Jackson State       Southern Jaguars      
 3 2022-12-03 Southern            Jackson State Tigers  
 4 2022-11-26 Grambling State     Southern Jaguars      
 5 2022-11-26 Southern            Grambling State Tigers
 6 2022-11-24 Alabama State       UAPB Golden Lions     
 7 2022-11-24 Arkansas-Pine Bluff Alabama State Hornets 
 8 2022-11-19 Alabama A&M         Texas Southern Tigers 
 9 2022-11-19 Alcorn State        Jackson State Tigers  
10 2022-11-19 Bethune-Cookman     Florida A&M Rattlers  
# ℹ 893 more rows

As you can see, the data has slimmed down from 10 columns to showing just three. Keep in mind that we haven’t changed the initial football object, which still exists. If we want to keep the changes we make, we need to assign them to a name using the assignment arrow <- .

schedule <- select(football, date, team, opponent)

Now, that we’ve saved the step, we can look at the result by typing the name of the object:

schedule

# A tibble: 903 × 3
   date       team                opponent              
   <date>     <chr>               <chr>                 
 1 2022-12-17 Jackson State       NC Central Eagles     
 2 2022-12-03 Jackson State       Southern Jaguars      
 3 2022-12-03 Southern            Jackson State Tigers  
 4 2022-11-26 Grambling State     Southern Jaguars      
 5 2022-11-26 Southern            Grambling State Tigers
 6 2022-11-24 Alabama State       UAPB Golden Lions     
 7 2022-11-24 Arkansas-Pine Bluff Alabama State Hornets 
 8 2022-11-19 Alabama A&M         Texas Southern Tigers 
 9 2022-11-19 Alcorn State        Jackson State Tigers  
10 2022-11-19 Bethune-Cookman     Florida A&M Rattlers  
# ℹ 893 more rows

5 Filtering by values

Suppose we are diehard Gramblinites, and we really only care about the games that list Grambling in the team column. To limit our data by certain values, we can use the filter() function along with a logical test.

Most of these logical tests are very simple. For example, we can imagine that attendance > 3000 is useful to find the games where there are more than 3,000 fans listed in the attendance column as having been at the game. Similarly, if we want to show only those games where the team listed in the team column lost to their opponent, then we can find the rows with team_score < opponent_score. And if we want to show only those columns that are played as away games, we can look for team_venue == "Away" to match that string.

Notice that we use double equals signs == for a logical test. The single equals sign is already used to indicate assignment within a function, like aes(x = team_score). It’s easy to mistakenly use the single where you need the double—but it’s also easy to fix it when you realize your mistake.

Tip

Here are some common logical tests:

x == y tests whether x is equal to y
x >= y tests whether x is greater than or equal to y
x <= y tests whether x is less than or equal to y
x != y tests whether x is not equal to y
x %in% y tests whether x is listed among y
str_detect(x, y) tests whether the string y partially matches the string x.
is.na(x) tests whether x is missing a value and is thus coded as NA.

The first argument in the filter() function should be the name of the data set we’re looking into. Since we’re building from the schedule we made in the previous section, we’re going to look at the schedule object. The rest of the space in the parentheses is devoted to the logical tests. In our case, we’re looking in the team column to try to find Grambling. We might first test for it by building our filter function in this way.

filter(schedule, team == "Grambling")

# A tibble: 0 × 3
# ℹ 3 variables: date <date>, team <chr>, opponent <chr>

The resulting tibble doesn’t have any rows! This code seems like it should work, but it doesn’t because “Grambling” is actually coded as something else in our data set. The str_detect() function is very useful for building a partial match in any logical search involving character data, letting us look for a needle in a haystack. It expects two arguments: the first is the column in which we’re looking (the haystack), and the second is string that we’re trying to match (the needle). So we ask filter to look in the team column for any rows that include the text “Grambling”:

filter(schedule, str_detect(team, "Grambling"))

# A tibble: 89 × 3
   date       team            opponent                 
   <date>     <chr>           <chr>                    
 1 2022-11-26 Grambling State Southern Jaguars         
 2 2022-11-12 Grambling State Texas Southern Tigers    
 3 2022-11-05 Grambling State UAPB Golden Lions        
 4 2022-10-29 Grambling State Alcorn State Braves      
 5 2022-10-15 Grambling State Florida A&M Rattlers     
 6 2022-10-08 Grambling State Alabama A&M Bulldogs     
 7 2022-10-01 Grambling State Prairie View A&M Panthers
 8 2022-09-24 Grambling State Bethune-Cookman Wildcats 
 9 2022-09-17 Grambling State Jackson State Tigers     
10 2022-09-10 Grambling State Northwestern State Demons
# ℹ 79 more rows

The results show that “Grambling” has actually been coded as “Grambling State” in the data set. Good to know! Now our schedule has been shortened down to the 89 rows we care about.

We can actually group multiple filters into a filter() command by separating them with commas. If we just wanted Grambling’s 2019 season, we might try to call it like this:

filter(schedule, str_detect(team, "Grambling"), season == "2019")

It won’t work at this stage because we already got rid of the season column. We’ll come back to this problem in the next section.

Don’t forget that we haven’t actually changed anything. If we’d like to save our work into an object, we have to use the assignment arrow and give it a name:

schedule_gram2019 <- 
  filter(schedule, 
         str_detect(team, "Grambling"))

Here I’ve also introduced some white space by adding a line break after each comma. This makes it easier to follow along.

6 Detour: The Pipe `|>`

Over time, we’re going to realize that chaining together steps like this can get tedious. If all we ever wanted was Grambling’s 2019 schedule, then it seems annoying that we had to save things in an intermediate step called schedule, when that step even contained games we don’t need. Additionally, because of the order in which we did things, we ran into the problem of being unable to show only the 2019 season, since that column no longer existed in the schedule object we were using.

The pipe |> is a helpful way to chain steps together so that we don’t need to save intermediate steps to new names. We use pipes by describing all the steps to get from the start to the end. For instance, here are the steps need to get to the object named schedule_gram2019:

Start from the football object, and then
Filter to show only games where “Grambling” is in the team column and where the season is “2019”, and then
Select the date, team, and opponent columns.

Pipes make it easy to show this process:

schedule_gram2019 <- 
  football |> 
  filter(str_detect(team, "Grambling"), season == "2019") |> 
  select(date, team, opponent)

We think of |> like the phrase “and then” when we describe our workflow, but we call it a “pipe” when we talk about our code. Just like a pipe, it provides indoor plumbing. Pipes allow the first argument of each step—filter(football, …), select(filter(football, …), …)—to be filled in from the step that came before, without needing to be called explicitly. We’re going to use the pipe from here on out.

Tip

Because the pipe is a command that’s used a lot, it’s not a bad idea to use a keyboard shortcut. One is built-in for RStudio, but you’ll need to enable it first:

Go to Global Options
Go to the “Code” section
Select the “Use native pipe operator, |>”

On Mac, the keyboard shortcut is Command + Shift + M. On Windows, it’s Control + Shift + M.

7 Arranging rows

Our schedule is looking pretty good, but we might have noticed that the dates go in reverse chronological order. The arrange() function lets us change the order by specific columns:

arrange(schedule_gram2019, date)

# A tibble: 11 × 3
   date       team            opponent                 
   <date>     <chr>           <chr>                    
 1 2019-08-31 Grambling State ULM Warhawks             
 2 2019-09-07 Grambling State Louisiana Tech Bulldogs  
 3 2019-09-21 Grambling State Alabama State Hornets    
 4 2019-09-28 Grambling State Prairie View A&M Panthers
 5 2019-10-05 Grambling State Jackson State Tigers     
 6 2019-10-12 Grambling State Alabama A&M Bulldogs     
 7 2019-10-26 Grambling State UAPB Golden Lions        
 8 2019-11-02 Grambling State Texas Southern Tigers    
 9 2019-11-09 Grambling State Alcorn State Braves      
10 2019-11-16 Grambling State MVSU Delta Devils        
11 2019-11-30 Grambling State Southern Jaguars

By default, arrange() will put things in ascending order. If we actually prefer things in descending order, we can add the desc() function:

arrange(schedule_gram2019, desc(date))

# A tibble: 11 × 3
   date       team            opponent                 
   <date>     <chr>           <chr>                    
 1 2019-11-30 Grambling State Southern Jaguars         
 2 2019-11-16 Grambling State MVSU Delta Devils        
 3 2019-11-09 Grambling State Alcorn State Braves      
 4 2019-11-02 Grambling State Texas Southern Tigers    
 5 2019-10-26 Grambling State UAPB Golden Lions        
 6 2019-10-12 Grambling State Alabama A&M Bulldogs     
 7 2019-10-05 Grambling State Jackson State Tigers     
 8 2019-09-28 Grambling State Prairie View A&M Panthers
 9 2019-09-21 Grambling State Alabama State Hornets    
10 2019-09-07 Grambling State Louisiana Tech Bulldogs  
11 2019-08-31 Grambling State ULM Warhawks

But this isn’t any different from what we saw before, so we’re going to skip the desc() function and save our newly arranged tibble to the named object:

schedule_gram2019 <- 
  football |> 
  filter(str_detect(team, "Grambling"),
         season == "2019") |> 
  select(date, team, opponent) |> 
  arrange(date)

Notice here that I’ve started back at the beginning and documented each step. Doing so isn’t strictly necessary. You could, after all, write schedule_gram2019 <- arrange(schedule_gram2019, date). But saving intermediately isn’t needed when we can just go back to our old code to add a step.

Tip

Everybody builds up their workflows one step at a time. It’s okay to go back and edit code to add more to it. To figure out how to get where you’d like to go, first break things down into pieces. Once you get one step working the way it should, your next task is to figure out the next step. In this way, you earn yourself lots of little victories along the way—and a giant victory at the end!

8 Changing the data set with mutations

Our schedule from 2019 is looking pretty good, but we might also want to see whether Grambling won each game or lost it. Our original data set doesn’t include a column indicating whether a game was won or lost, but we can decipher this data by looking at other columns: If the team_score is higher than the opponent_score, then the team won.

We can use the mutate() function to add a column to the data set. Check out line 5 below—and also notice that we had to add our new column into the select() function on line 6:

schedule_gram2019 <- 
  football |> 
  filter(str_detect(team, "Grambling"),
         season == "2019") |> 
  mutate(win = team_score > opponent_score) |> 
  select(date, team, opponent, win) |> 
  arrange(date)

And then we can see the results here:

schedule_gram2019

# A tibble: 11 × 4
   date       team            opponent                  win  
   <date>     <chr>           <chr>                     <lgl>
 1 2019-08-31 Grambling State ULM Warhawks              FALSE
 2 2019-09-07 Grambling State Louisiana Tech Bulldogs   FALSE
 3 2019-09-21 Grambling State Alabama State Hornets     FALSE
 4 2019-09-28 Grambling State Prairie View A&M Panthers FALSE
 5 2019-10-05 Grambling State Jackson State Tigers      TRUE 
 6 2019-10-12 Grambling State Alabama A&M Bulldogs      TRUE 
 7 2019-10-26 Grambling State UAPB Golden Lions         TRUE 
 8 2019-11-02 Grambling State Texas Southern Tigers     TRUE 
 9 2019-11-09 Grambling State Alcorn State Braves       TRUE 
10 2019-11-16 Grambling State MVSU Delta Devils         TRUE 
11 2019-11-30 Grambling State Southern Jaguars          FALSE

We used a logical test for our “win” column, comparing two columns to see if one was bigger than the other, but we might as easily have used math to find the difference between them. Here, we create a “pointrange” column to characterize each win or loss, with 0 indicating a tie, positive numbers indicating a Grambling win, and negative numbers indicating a Grambling loss:

football |> 
  filter(str_detect(team, "Grambling"),
         season == "2019") |> 
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  select(date, team, opponent, win, pointrange) |> 
  arrange(date)

# A tibble: 11 × 5
   date       team            opponent                  win   pointrange
   <date>     <chr>           <chr>                     <lgl>      <dbl>
 1 2019-08-31 Grambling State ULM Warhawks              FALSE        -22
 2 2019-09-07 Grambling State Louisiana Tech Bulldogs   FALSE         -6
 3 2019-09-21 Grambling State Alabama State Hornets     FALSE         -3
 4 2019-09-28 Grambling State Prairie View A&M Panthers FALSE         -6
 5 2019-10-05 Grambling State Jackson State Tigers      TRUE          23
 6 2019-10-12 Grambling State Alabama A&M Bulldogs      TRUE          13
 7 2019-10-26 Grambling State UAPB Golden Lions         TRUE           6
 8 2019-11-02 Grambling State Texas Southern Tigers     TRUE          35
 9 2019-11-09 Grambling State Alcorn State Braves       TRUE           3
10 2019-11-16 Grambling State MVSU Delta Devils         TRUE          40
11 2019-11-30 Grambling State Southern Jaguars          FALSE         -2

9 Summarizing data

After all this, we’ve done a lot to make our data useful. By filtering the data set to Grambling games in 2019, and by selecting only a small number of columns we’re interested in, we’ve made the data set easier to think about. Additionally, by changing the table to add a couple columns, we’ve simultaneously been able to add value, making it easier to see which games Grambling won or lost, and by how much.

Even with this smaller data set, showing 11 rows instead of the 547 rows in the original, we might still imagine making our lives easier. For instance, we could easily count the number of games Grambling won—but we’re humans who make mistakes and who can sometimes be lazy. Why can’t we make the computer do the counting?

If all we’re interested in is a summary of this data set, the summarize() function will get us there. It simplifies the set of data down to one row. I’ll add it to the end of our workflow, here:

football |> 
  filter(str_detect(team, "Grambling"),
         season == "2019") |> 
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  select(date, team, opponent, win, pointrange) |> 
  arrange(date) |> 
  summarize(wins = sum(win))

# A tibble: 1 × 1
   wins
  <int>
1     6

There are three things to notice here:

We can take the sum() of a logical vector to add up all the TRUE values. Basically, R treates TRUE as a value of 1 and FALSE as a value of 0. So yes, this means we could also use mean() to find a win ratio.
The summarize() command strips out all of the other columns. This means that the select() step isn’t really necessary. I guess I’m keeping it for now out of a sense of nostalgia.
The summarize() step also leaves just one row. If all we’re interested in is a summary of data, this is very useful. Narrowing things down allows us to clearly communicate the answer to a question we might be asking.

When we summarize the data, we might reasonably be interested in more than one column. After all, we added two columns with the mutate() command, so we probably are curious about two bits of information. Let’s add a few more summary statistics in at this stage, so that we can query the wins, the win ratios, and the minimum, average, and maximum point ranges:

football |> 
  filter(str_detect(team, "Grambling"),
         season == "2019") |> 
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  select(date, team, opponent, win, pointrange) |> 
  arrange(date) |> 
  summarize(wins = sum(win),
            win_ratio = mean(win),
            ptrange_min = min(pointrange),
            ptrange_avg = mean(pointrange),
            ptrange_max = max(pointrange))

# A tibble: 1 × 5
   wins win_ratio ptrange_min ptrange_avg ptrange_max
  <int>     <dbl>       <dbl>       <dbl>       <dbl>
1     6     0.545         -22        7.36          40

We’ve just found a wealth of information about Grambling’s 2019 season! We see that they won slightly more than half of their games and that they did so with about a 7-point lead on average. We also see that their greatest loss of the season was by 22 points, and their greatest victory was a 40-point show of dominance.

10 Grouping by categories to make comparisons

Sometimes we want to find these summary statistics for more than just one team in one season. Sometimes we want to see how Grambling’s stats changed from one season to the next.

The group_by() function allows us to group the information by one or more categories. The football data set inside the swac package shows games from 2015 through 2022, so using group_by(season) will divide it into eight virtual groups. Each of these groups corresponds to a single season. Using this function before summarize() will let us find the summary statistics for each season. Before it will work, we need to remember to remove season == "2019" from our filter() step. Additionally, we have to remember that select() has limited our columns to remove season. I’m going to simplify things by removing both select() and arrange().

football |> 
  filter(str_detect(team, "Grambling")) |> 
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  group_by(season) |> 
  summarize(wins = sum(win),
            win_ratio = mean(win),
            ptrange_min = min(pointrange),
            ptrange_avg = mean(pointrange),
            ptrange_max = max(pointrange))

# A tibble: 8 × 6
  season  wins win_ratio ptrange_min ptrange_avg ptrange_max
   <dbl> <int>     <dbl>       <dbl>       <dbl>       <dbl>
1   2015     9     0.75          -59        8             35
2   2016    13     0.929         -10       24.6           70
3   2017    11     0.846         -29       10.8           32
4   2018     6     0.545         -32        4.09          55
5   2019     6     0.545         -22        7.36          40
6   2020    NA    NA              NA       NA             NA
7   2021     4     0.364         -45      -10.5           14
8   2022     3     0.273         -55      -10.4           29

These results show that we’ve simultaneously struck paydirt and hit a pothole. The table shows mostly what we want, but it seems that some information is missing in the 2020 season. And since some information is missing, R is telling us that it doesn’t want to assume that it can use the rest of the information to do its calculations. Here, we see that NA is what R uses to signify that a value is missing.

Since these details are missing from 2020 games because they weren’t played—they were cancelled because of the pandemic—we know that the missing details aren’t significant to the questions we’re asking. We can tell R to ignore the NA values and figure things out with the rest of the data. In many functions, adding the argument na.rm = TRUE will tell R to ignore the missing values altogether.

football |> 
  filter(str_detect(team, "Grambling")) |> 
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  group_by(season) |> 
  summarize(wins = sum(win, 
                       na.rm = TRUE),
            win_ratio = mean(win, 
                             na.rm = TRUE),
            ptrange_min = min(pointrange, 
                              na.rm = TRUE),
            ptrange_avg = mean(pointrange, 
                              na.rm = TRUE),
            ptrange_max = max(pointrange, 
                              na.rm = TRUE))

# A tibble: 8 × 6
  season  wins win_ratio ptrange_min ptrange_avg ptrange_max
   <dbl> <int>     <dbl>       <dbl>       <dbl>       <dbl>
1   2015     9     0.75          -59        8             35
2   2016    13     0.929         -10       24.6           70
3   2017    11     0.846         -29       10.8           32
4   2018     6     0.545         -32        4.09          55
5   2019     6     0.545         -22        7.36          40
6   2020     0     0             -42      -20.2           -5
7   2021     4     0.364         -45      -10.5           14
8   2022     3     0.273         -55      -10.4           29

These results let us see clearly how Grambling’s football team has fared over the years. The 2016 season was clearly their best, in which they won more than 92% of the games they played. With a win ratio of 54.5%, the 2018 and 2019 seasons are a close match with each other. And after a dismal year in 2020, the team improved in 2021, not only winning more games but also raising the distribution of point ranges.

The group_by() function is obviously powerful, but it isn’t limited to one group. By adding more columns into the function, it can create ever-complex groupings for us to summarize. Here, for instance, we’ve changed the filter function to match both “Grambling State” and “Southern” (using the %in% logical test), and we’re grouping by both season and team. This allows us to compare the records from both universities over time.

football |> 
  filter(team %in% c("Grambling State", "Southern")) |>
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  group_by(season, team) |> 
  summarize(wins = sum(win, 
                       na.rm = TRUE),
            win_ratio = mean(win, 
                             na.rm = TRUE),
            ptrange_min = min(pointrange, 
                              na.rm = TRUE),
            ptrange_avg = mean(pointrange, 
                              na.rm = TRUE),
            ptrange_max = max(pointrange, 
                              na.rm = TRUE))

# A tibble: 16 × 7
# Groups:   season [8]
   season team             wins win_ratio ptrange_min ptrange_avg ptrange_max
    <dbl> <chr>           <int>     <dbl>       <dbl>       <dbl>       <dbl>
 1   2015 Grambling State     9     0.75          -59       8              35
 2   2015 Southern            6     0.545         -47       1.09           39
 3   2016 Grambling State    13     0.929         -10      24.6            70
 4   2016 Southern            8     0.727         -45      11.5            58
 5   2017 Grambling State    11     0.846         -29      10.8            32
 6   2017 Southern            7     0.636         -45      -0.364          26
 7   2018 Grambling State     6     0.545         -32       4.09           55
 8   2018 Southern            7     0.636         -48       3.09           38
 9   2019 Grambling State     6     0.545         -22       7.36           40
10   2019 Southern            8     0.615         -31       6.85           61
11   2020 Grambling State     0     0             -42     -20.2            -5
12   2020 Southern            5     0.833          -3      15.3            42
13   2021 Grambling State     4     0.364         -45     -10.5            14
14   2021 Southern            4     0.364         -52      -4.45           27
15   2022 Grambling State     3     0.273         -55     -10.4            29
16   2022 Southern            7     0.583         -48       9.92           86

Chaining group_by() together with summarize() is a powerful way to drill down into a complex data set, making comparisons across different categories. Whereas using the summarize() function alone strips out all categories except those by which you’re summarizing, using it with group_by() ensures that the grouping column is retained at the start of the table.

Note

Sometimes when you add a group with group_by(), the grouping persists longer than you might expect. For instance, if you group by two or more variables, a single summarize() step will remove the last variable grouping but keep the rest. Adding an empty ungroup() step at the end would get things back to normal, without any groups.

11 Going the extra mile

Although not strictly necessary nor in the scope of this lesson, you can make the final table look nicer if you’re willing to do a little more work.

11.1 Using `knitr::kable()`

The easiest step you might take is to add one line at the end. The knitr package offers a command called kable() that will make your table look more like it belongs in your document. Since all we want from the knitr package is this one function, we can call the function directly without using a library() command by using the double colon notation, knitr::kable().

football |> 
  filter(team %in% c("Grambling State", "Southern")) |>
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  group_by(season, team) |> 
  summarize(wins = sum(win, 
                       na.rm = TRUE),
            win_ratio = mean(win, 
                             na.rm = TRUE),
            ptrange_min = min(pointrange, 
                              na.rm = TRUE),
            ptrange_avg = mean(pointrange, 
                              na.rm = TRUE),
            ptrange_max = max(pointrange, 
                              na.rm = TRUE)) |> 
  knitr::kable()

season	team	wins	win_ratio	ptrange_min	ptrange_avg	ptrange_max
2015	Grambling State	9	0.7500000	-59	8.0000000	35
2015	Southern	6	0.5454545	-47	1.0909091	39
2016	Grambling State	13	0.9285714	-10	24.6428571	70
2016	Southern	8	0.7272727	-45	11.5454545	58
2017	Grambling State	11	0.8461538	-29	10.7692308	32
2017	Southern	7	0.6363636	-45	-0.3636364	26
2018	Grambling State	6	0.5454545	-32	4.0909091	55
2018	Southern	7	0.6363636	-48	3.0909091	38
2019	Grambling State	6	0.5454545	-22	7.3636364	40
2019	Southern	8	0.6153846	-31	6.8461538	61
2020	Grambling State	0	0.0000000	-42	-20.2500000	-5
2020	Southern	5	0.8333333	-3	15.3333333	42
2021	Grambling State	4	0.3636364	-45	-10.4545455	14
2021	Southern	4	0.3636364	-52	-4.4545455	27
2022	Grambling State	3	0.2727273	-55	-10.3636364	29
2022	Southern	7	0.5833333	-48	9.9166667	86

This is a much nicer table already! You can actually do more customizing with kable() than I show here, but there’s no need to get too far along with it.

11.2 Using `gt`

The gt package offers some nice features beyond what we can get with kable(). Notice here that I’ve changed the final line and added one argument for rowname_col to make the table a little easier to read. Because the “season” column grouping is still in effect (see the “Note” at the end of the previous section), gt adopts these groups automatically.

football |> 
  filter(team %in% c("Grambling State", "Southern")) |>
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  group_by(season, team) |> 
  summarize(wins = sum(win, 
                       na.rm = TRUE),
            win_ratio = mean(win, 
                             na.rm = TRUE),
            ptrange_min = min(pointrange, 
                              na.rm = TRUE),
            ptrange_avg = mean(pointrange, 
                              na.rm = TRUE),
            ptrange_max = max(pointrange, 
                              na.rm = TRUE)) |> 
  gt::gt(rowname_col = "team")

	wins	win_ratio	ptrange_min	ptrange_avg	ptrange_max
2015
Grambling State	9	0.7500000	-59	8.0000000	35
Southern	6	0.5454545	-47	1.0909091	39
2016
Grambling State	13	0.9285714	-10	24.6428571	70
Southern	8	0.7272727	-45	11.5454545	58
2017
Grambling State	11	0.8461538	-29	10.7692308	32
Southern	7	0.6363636	-45	-0.3636364	26
2018
Grambling State	6	0.5454545	-32	4.0909091	55
Southern	7	0.6363636	-48	3.0909091	38
2019
Grambling State	6	0.5454545	-22	7.3636364	40
Southern	8	0.6153846	-31	6.8461538	61
2020
Grambling State	0	0.0000000	-42	-20.2500000	-5
Southern	5	0.8333333	-3	15.3333333	42
2021
Grambling State	4	0.3636364	-45	-10.4545455	14
Southern	4	0.3636364	-52	-4.4545455	27
2022
Grambling State	3	0.2727273	-55	-10.3636364	29
Southern	7	0.5833333	-48	9.9166667	86

Even just this one step looks pretty good. But if you’re really driven to stretch the limits, gt offers boundless options to make your tables look more polished.

Below, I’ve added a few more steps, adding a mutate(), using rename() to change some column names, and drawing on gt’s options for advanced customization to show an example of how one might really make things look nice. At each step, I’ve added a comment in the code using the # symbol to explain what’s happening. Because all this extra code might seem a little overwhelming, I’ve hidden it for now so that the first impression you see is the final product, but you can show things by toggling the disclosure arrow labeled “Show or hide the code.”

I really don’t expect you to learn how to do so much with gt. I’m only showing it here so that you can get a sense of where you might start with things if you’re interested in learning more.

Show or hide the code

# This time we're loading the gt package because we're going
# to be using many different functions from it.
library(gt)

football |> 
  
  # We're starting off in the next line by renaming "Grambling 
  # State" to just "Grambling". This looks better when compared
  # against "Southern". This mutate function starts with the
  # "team" column, then uses str_remove() as a kind of
  # powerful find-and-replace, finding all text that matches
  # " State" and removing it.
  
  mutate(team = team |> 
           str_remove(" State")) |> 
  
  # From here, things mostly continue as before.
  
  filter(team %in% c("Grambling", "Southern")) |>
  mutate(win = team_score > opponent_score,
         pointrange = team_score - opponent_score) |> 
  group_by(season, team) |> 
  summarize(wins = sum(win, 
                       na.rm = TRUE),
            win_ratio = mean(win, 
                             na.rm = TRUE),
            ptrange_min = min(pointrange, 
                              na.rm = TRUE),
            ptrange_avg = mean(pointrange, 
                              na.rm = TRUE),
            ptrange_max = max(pointrange, 
                              na.rm = TRUE)) |>
  
  # We're renaming the columns here so that they look nicer in
  # the final table. Surrounding a column name with
  # backticks allows us to use special characters in it.
  
  rename(`#` = wins,
         ratio = win_ratio,
         min = ptrange_min,
         avg = ptrange_avg,
         max = ptrange_max) |> 
  
  # The gt() function prints tables nicely. Here, we're also
  # defining the "groupname_col" and the "rowname_col"
  # arguments so that the season and team columns stand out.
  # The first of these isn't strictly necessary, since the
  # season column is still grouping the data, but it's
  # helpful to be explicit.
  
  gt(groupname_col = "season",
     rowname_col = "team") |> 
  
  # When using gt, the "fmt_*" functions allow us to format
  # certain columns and set the number of decimals we want
  # shown. The fmt_percent() function is especially nice to
  # say that our "ratio" column should be formatted like a
  # percentage.
  
  fmt_percent(columns = ratio, 
              decimals = 1) |> 
  fmt_number(columns = avg,
             decimals = 1) |> 
  
  # In gt, the tab_spanner() function lets us group columns.
  # The "columns" argument defines the range of columns we
  # want in each group. (Don't forget to count the "season"
  # and "team" columns first.)
  
  tab_spanner(label = "Wins per season", 
              columns = 3:4) |> 
  tab_spanner(label = "Point range", 
              columns = 5:7) |> 
  
  # This last step is where a person might go wild, looking
  # up options to make each little element look better. I
  # can't document it all here, but you should be able to
  # click on "tab_options" and "tab_style" to see the
  # official help documents. To avoid the perfectionist's
  # temptation, save this step for last!
  
  tab_options(
    table.background.color = "#FFFFFF00",
    row_group.border.top.width = px(3),
    row_group.border.top.color = "black",
    row_group.border.bottom.color = "black",
    table_body.hlines.color = "white",
    table.border.top.color = "white",
    table.border.top.width = px(3),
    table.border.bottom.color = "white",
    table.border.bottom.width = px(3),
    column_labels.border.bottom.color = "black",
    column_labels.border.bottom.width = px(2)
    ) |> 
  tab_style(
    style = cell_text(
      color = "black", 
      weight = "bold"),
    locations = list(
      cells_row_groups(),
      cells_column_spanners())
    )

	Wins per season		Point range
	#	ratio	min	avg	max
2015
Grambling	9	75.0%	-59	8.0	35
Southern	6	54.5%	-47	1.1	39
2016
Grambling	13	92.9%	-10	24.6	70
Southern	8	72.7%	-45	11.5	58
2017
Grambling	11	84.6%	-29	10.8	32
Southern	7	63.6%	-45	−0.4	26
2018
Grambling	6	54.5%	-32	4.1	55
Southern	7	63.6%	-48	3.1	38
2019
Grambling	6	54.5%	-22	7.4	40
Southern	8	61.5%	-31	6.8	61
2020
Grambling	0	0.0%	-42	−20.2	-5
Southern	5	83.3%	-3	15.3	42
2021
Grambling	4	36.4%	-45	−10.5	14
Southern	4	36.4%	-52	−4.5	27
2022
Grambling	3	27.3%	-55	−10.4	29
Southern	7	58.3%	-48	9.9	86

This last version of the table offers the clearest direct comparison of football records for these universities and in these years. It took a little time to build, but the final result is polished enough to publish.

Citation

BibTeX citation:

@misc{clawson2022,
  author = {Clawson, James},
  title = {Filtering, {Transforming,} and {Summarizing} {Tables}},
  date = {2022-10-30},
  url = {https://jmclawson.net/posts/filter-transform-summarize/},
  langid = {en}
}

For attribution, please cite this work as:

Clawson, James. “Filtering, Transforming, and Summarizing Tables.” jmclawson.net, 30 Oct. 2022, https://jmclawson.net/posts/filter-transform-summarize/.

1 Introduction

2 Entering the tidyverse

3 Getting started

4 Selecting columns

5 Filtering by values

6 Detour: The Pipe |>

7 Arranging rows

8 Changing the data set with mutations

9 Summarizing data

10 Grouping by categories to make comparisons

11 Going the extra mile

11.1 Using knitr::kable()

11.2 Using gt

Citation

6 Detour: The Pipe `|>`

11.1 Using `knitr::kable()`

11.2 Using `gt`