Understanding and using linear models in R

A 10-slide guide to the dark side

Case study: finding mass from height

The starwars data set, which ships with dplyr, makes a handy example. We filter out Jabba Desilijic Tiure, who is surprisingly massive for their height, and anyone whose mass is missing.

library(tidyverse)
starwars |> 
  filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |> 
  select(name, height, mass) |> 
  head()

# A tibble: 6 × 3
  name           height  mass
  <chr>           <int> <dbl>
1 Luke Skywalker    172    77
2 C-3PO             167    75
3 R2-D2              96    32
4 Darth Vader       202   136
5 Leia Organa       150    49
6 Owen Lars         178   120

`lm()` fits a linear model

Inside the function, a formula y ~ x says that we want to predict y from x:

the_model <- starwars |> 
  filter(name != "Jabba Desilijic Tiure", !is.na(mass)) |> 
  lm(mass ~ height, data = _) 

the_model


Call:
lm(formula = mass ~ height, data = filter(starwars, name != "Jabba Desilijic Tiure", 
    !is.na(mass)))

Coefficients:
(Intercept)       height  
   -32.5408       0.6214
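To see what lm() is estimating, here is a minimal sketch of the closed-form least-squares calculation for a single predictor. It assumes only the six rows printed in the case study, so its numbers will differ from the full-data coefficients above; the names six and toy_model are made up for illustration.

```r
# Six rows copied from the head() output; a toy subset, not the full data
six <- data.frame(
  height = c(172, 167, 96, 202, 150, 178),
  mass   = c(77, 75, 32, 136, 49, 120)
)

# Closed-form least squares for one predictor:
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
slope     <- cov(six$height, six$mass) / var(six$height)
intercept <- mean(six$mass) - slope * mean(six$height)

# lm() on the same six rows, for comparison
toy_model <- lm(mass ~ height, data = six)

c(intercept = intercept, slope = slope)
coef(toy_model)
```

Both printouts should agree, which suggests that for one predictor lm() returns the same line as this hand calculation.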

Use broom to `tidy()`

library(broom)
tidy(the_model)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -32.5     12.6        -2.59 1.22e- 2
2 height         0.621    0.0707      8.79 4.02e-12

Coefficients reveal the trend line’s equation, y = mx + a

m = height = 0.621; each additional cm adds this many kg
a = (Intercept) = -32.5; the predicted mass in kg at a height of 0cm

At 100cm, a character’s predicted mass is 29.6kg.
At 200cm, a character’s predicted mass is 91.7kg.
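The arithmetic behind those two predictions can be sketched in base R. The coefficients are copied by hand from the lm() printout, and predict_mass is a hypothetical helper, not part of any package.

```r
# Coefficients copied from the lm() printout (rounded to four decimals)
intercept <- -32.5408
slope     <- 0.6214

# Hypothetical helper: predicted mass in kg for a height in cm (y = mx + a)
predict_mass <- function(height_cm) intercept + slope * height_cm

round(predict_mass(c(100, 200)), 1)
# 29.6 91.7
```

With the fitted model in hand, predict(the_model, newdata = data.frame(height = c(100, 200))) should give the same values up to rounding.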

`augment()` adds two columns

starwars |> 
  select(name, height, mass) |> 
  head() |> 
  augment(x = the_model, 
          newdata = _)

# A tibble: 6 × 5
  name           height  mass .fitted .resid
  <chr>           <int> <dbl>   <dbl>  <dbl>
1 Luke Skywalker    172    77    74.3   2.67
2 C-3PO             167    75    71.2   3.77
3 R2-D2              96    32    27.1   4.89
4 Darth Vader       202   136    93.0  43.0 
5 Leia Organa       150    49    60.7 -11.7 
6 Owen Lars         178   120    78.1  41.9

.fitted is the mass the model predicts from height
.resid is the actual mass minus this prediction

RMSE measures a model’s accuracy

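RMSE stands for Root Mean Squared Error
a lower RMSE means predictions sit closer to the actual values
calculate with sqrt(mean(the_model$residuals^2)); sqrt(var(the_model$residuals)) divides by n - 1 rather than n, so it comes out slightly larger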
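As a rough illustration of the definition, the sketch below computes RMSE by hand for the six rows shown in the augment() output. The rmse helper is hypothetical, and the result describes only those six rows, not the whole model.

```r
# Actual and fitted masses copied from the six-row augment() output
actual <- c(77, 75, 32, 136, 49, 120)
fitted <- c(74.3, 71.2, 27.1, 93.0, 60.7, 78.1)

# Hypothetical helper: square the errors, average them, take the root
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

six_row_rmse <- rmse(actual, fitted)
six_row_rmse

# sd() divides by n - 1 and subtracts the mean, so it is not the same quantity
sd(actual - fitted)
```

The two large residuals (Darth Vader and Owen Lars) dominate the result, because squaring weights big misses more heavily than small ones.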
