Understanding and using linear models in R

A 10-slide guide to the dark side

Case study: finding mass from height

The starwars data set from dplyr makes a handy example. We’re filtering out Jabba Desilijic Tiure, who is surprisingly massive for their height, along with anyone whose mass is missing.

library(tidyverse)
starwars |> 
  filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |> 
  select(name, height, mass) |> 
  head()
# A tibble: 6 × 3
  name           height  mass
  <chr>           <int> <dbl>
1 Luke Skywalker    172    77
2 C-3PO             167    75
3 R2-D2              96    32
4 Darth Vader       202   136
5 Leia Organa       150    49
6 Owen Lars         178   120

lm() fits a linear model

Inside the function, the formula y ~ x says that we want to predict y from x:

the_model <- starwars |> 
  filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |> 
  lm(mass ~ height, data = _) 

the_model

Call:
lm(formula = mass ~ height, data = filter(starwars, name != "Jabba Desilijic Tiure", 
    !is.na(mass)))

Coefficients:
(Intercept)       height  
   -32.5408       0.6214  

Use broom to tidy()

library(broom)
tidy(the_model)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -32.5     12.6        -2.59 1.22e- 2
2 height         0.621    0.0707      8.79 4.02e-12

Coefficients reveal the trend line’s equation, y = mx + a

  • m = height = 0.621; each extra cm adds this many kg
  • a = Intercept = -32.5; the predicted mass in kg at a height of 0cm
  • At 100cm, the model predicts a mass of 29.6kg.
  • At 200cm, the model predicts a mass of 91.7kg.
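
The arithmetic behind these predictions can be checked in a few lines. This is a minimal sketch, assuming only dplyr is loaded (it supplies starwars); the names sw, heights, by_hand and by_predict are made up for illustration and are not part of the original slides.

```r
library(dplyr)  # provides the starwars data set and filter()

# Refit the model from the earlier slide (sw is an illustrative name)
sw <- starwars |>
  filter(name != "Jabba Desilijic Tiure", !is.na(mass))
the_model <- lm(mass ~ height, data = sw)

# Pull out a (intercept) and m (slope) by name
a <- coef(the_model)[["(Intercept)"]]
m <- coef(the_model)[["height"]]

# y = mx + a, worked by hand for two heights in cm
heights <- c(100, 200)
by_hand <- m * heights + a

# predict() applies the same equation to new data
by_predict <- predict(the_model, newdata = data.frame(height = heights))

by_hand
unname(by_predict)
```

Both vectors should agree, since predict() evaluates the same straight line.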

augment() adds two columns

starwars |> 
  select(name, height, mass) |> 
  head() |> 
  augment(x = the_model, 
          newdata = _)
# A tibble: 6 × 5
  name           height  mass .fitted .resid
  <chr>           <int> <dbl>   <dbl>  <dbl>
1 Luke Skywalker    172    77    74.3   2.67
2 C-3PO             167    75    71.2   3.77
3 R2-D2              96    32    27.1   4.89
4 Darth Vader       202   136    93.0  43.0 
5 Leia Organa       150    49    60.7 -11.7 
6 Owen Lars         178   120    78.1  41.9 

  • .fitted is the mass the model predicts from height
  • .resid is the actual mass minus this prediction
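
To see where the two columns come from, they can be rebuilt by hand with predict() and a subtraction. This is a rough sketch rather than how broom works internally; the column names fitted_by_hand and resid_by_hand, and the object names sw and by_hand, are invented for this example.

```r
library(dplyr)  # provides starwars, filter(), select(), mutate()

# Refit the model from the earlier slide (sw is an illustrative name)
sw <- starwars |>
  filter(name != "Jabba Desilijic Tiure", !is.na(mass))
the_model <- lm(mass ~ height, data = sw)

by_hand <- starwars |>
  select(name, height, mass) |>
  head() |>
  mutate(
    # predicted mass for each character's height
    fitted_by_hand = unname(
      predict(the_model, newdata = data.frame(height = height))
    ),
    # actual mass minus the prediction
    resid_by_hand = mass - fitted_by_hand
  )

by_hand
```

A positive residual means the character is heavier than the model expects for their height; a negative one means lighter.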

RMSE measures a model’s accuracy

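  • RMSE stands for Root Mean Squared Error
  • a lower RMSE means predictions sit closer to the actual values
  • calculate with sqrt(mean(the_model$residuals^2))
  • sqrt(var(the_model$residuals)) comes close, but var() divides by n - 1 rather than n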
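
RMSE can also be computed straight from its definition: square the residuals, take their mean, then take the root. The sketch below does that and compares the result with the standard deviation of the residuals, which divides by n - 1 instead of n. The names sw, res, n, rmse and sd_resid are illustrative only.

```r
library(dplyr)  # provides the starwars data set and filter()

# Refit the model from the earlier slide (sw is an illustrative name)
sw <- starwars |>
  filter(name != "Jabba Desilijic Tiure", !is.na(mass))
the_model <- lm(mass ~ height, data = sw)

res <- residuals(the_model)  # actual mass minus predicted mass
n <- length(res)

# Root Mean Squared Error: root of the mean of the squared residuals
rmse <- sqrt(mean(res^2))

# Standard deviation of the residuals: same idea, but divides by n - 1
sd_resid <- sqrt(var(res))

c(rmse = rmse, sd_resid = sd_resid)
```

Because the residuals of a model with an intercept average to zero, the two values differ only by a factor of sqrt((n - 1) / n), so the gap shrinks as the data set grows.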