stylo2gg: Visualizing Reproducible Stylometry

James M. Clawson

Reproducible notebook

Plan

  1. Stylometry?
  2. Viz. problems & solutions
  3. ggplot2 opportunities

1. Stylometry

  • measure distributions of features (words, characters, grammar, …)
  • “fingerprint” of a work
  • study style and authorship
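As a minimal sketch of what measuring feature distributions can look like, the base R code below counts word frequencies in two invented toy texts and converts them to relative frequencies. The texts and variable names are made up for illustration; stylo's own tokenization and counting may differ.

```r
# Toy texts standing in for a corpus (invented for illustration).
texts <- c(
  a = "the people of the state and the union",
  b = "upon the whole the people and the public"
)

# Split each text into lowercase word tokens.
tokens <- lapply(texts, function(x) strsplit(tolower(x), "[^a-z]+")[[1]])

# Shared vocabulary, ordered from most to least frequent overall.
vocab <- names(sort(table(unlist(tokens)), decreasing = TRUE))

# Relative frequency of every vocabulary word within each text.
rel_freq <- t(vapply(tokens, function(tk) {
  as.numeric(table(factor(tk, levels = vocab))) / length(tk)
}, numeric(length(vocab))))
colnames(rel_freq) <- vocab

# Each row is one text's "fingerprint"; show the three most frequent words.
round(rel_freq[, 1:3], 3)
```

Each row sums to 1, so texts of different lengths can be compared on the same scale.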

The Federalist Papers (1787–1788)

  • 85 papers by “Publius” (Alexander Hamilton, James Madison, and John Jay)

  • history of authorship research

    • Douglass Adair (1944)
    • Frederick Mosteller and David Wallace (1963)
    • count / compare features

Stylometry in R

  • stylo by Maciej Eder et al.
  • interactive GUI or code-based
  • frequency tables and visualizations
  • principal component analysis, hierarchical clustering, etc.
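The two analyses named above can be sketched in base R on a frequency table. This is only an illustration with random toy values and Euclidean distances; stylo's own distance measures and defaults may differ.

```r
set.seed(1)

# Toy table: 6 texts (rows) by 5 features (columns), values invented.
freqs <- matrix(runif(30), nrow = 6,
                dimnames = list(paste0("text", 1:6), paste0("word", 1:5)))

# Principal component analysis on centered, scaled features.
pca <- prcomp(freqs, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]   # positions of the texts on PC1 and PC2

# Hierarchical clustering on Euclidean distances between standardized rows.
hc <- hclust(dist(scale(freqs)), method = "ward.D2")

# Base graphics versions of the two usual plots.
plot(scores, type = "n")
text(scores, labels = rownames(freqs))
plot(hc)
```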

stylo Problems — and stylo2gg

  • logging: with replication
  • modifications: with successive exploration
  • graphics: using ggplot2

Lost logs with stylo

library(stylo)

my_data1 <- stylo(
  gui = FALSE,
  corpus.dir = "data/federalist/",
  display.on.screen = FALSE,
  culling.max = 75,
  culling.min = 75,
  mfw.min = 900,
  mfw.max = 900)

Log and replicate with stylo2gg

library(stylo2gg)
stylo_log(my_data1)
my_data2 <- stylo_replicate("2023-01-01 00:01:00")
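The general idea of logging a call so it can be replayed later can be sketched in base R. The helpers `log_call()` and `replay_call()` are hypothetical names invented here; they are not part of stylo or stylo2gg and do not show how `stylo_log()` or `stylo_replicate()` work internally.

```r
# Hypothetical helpers, not part of stylo or stylo2gg.
log_env <- new.env()

# Run a function with the given arguments and record both under a timestamp key.
log_call <- function(fun, args,
                     key = format(Sys.time(), "%Y-%m-%d %H:%M:%S")) {
  result <- do.call(fun, args)
  assign(key, list(fun = fun, args = args), envir = log_env)
  invisible(list(key = key, result = result))
}

# Look up a logged call by its key and run it again.
replay_call <- function(key) {
  entry <- get(key, envir = log_env)
  do.call(entry$fun, entry$args)
}

# Log one call under a fixed key, then replay it from the log.
first <- log_call(quantile, list(x = c(1, 5, 9, 20), probs = 0.5),
                  key = "2023-01-01 00:01:00")
second <- replay_call("2023-01-01 00:01:00")
identical(first$result, second)
```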

2. Visualizing problems and solutions

Principal Component Analysis

library(stylo)

federalist_mfw <- 
  stylo(gui = FALSE,
        corpus.dir = "data/federalist/",
        analysis.type = "PCR",
        pca.visual.flavour = "symbols",
        analyzed.features = "w",
        ngram.size = 1,
        display.on.screen = TRUE,
        sampling = "no.sampling",
        culling.max = 75,
        culling.min = 75,
        mfw.min = 900,
        mfw.max = 900)

library(stylo2gg)

stylo2gg(federalist_mfw)

Hierarchical Clustering

library(stylo)

federalist_hc <- 
  stylo(gui = FALSE,
        corpus.dir = "data/federalist/",
        analysis.type = "CA",
        pca.visual.flavour = "symbols",
        analyzed.features = "w",
        ngram.size = 1,
        display.on.screen = TRUE,
        sampling = "no.sampling",
        culling.max = 75,
        culling.min = 75,
        mfw.min = 900,
        mfw.max = 900)

library(stylo2gg)

stylo2gg(federalist_hc)

Labeling

stylo2gg(federalist_mfw,
         shapes = FALSE, 
         labeling = 2)

stylo2gg(federalist_mfw,
         shapes = FALSE, 
         labeling = 0)

Renaming, Highlighting, Emphasizing

federalist_mfw |> 
  rename_category("NA", "unknown") |> 
  stylo2gg(black = 4,
           highlight = c(3, 4))

federalist_mfw |> 
  rename_category("NA", "unknown") |> 
  stylo2gg(viz = "CA",
           shapes = FALSE,
           black = 4,
           highlight = 4)

Overlaying features

stylo2gg(federalist_mfw,
  top.loadings = 6,
  loadings.line.color = "blue",
  loadings.word.color = "navy",
  loadings.upper = TRUE)

stylo2gg(federalist_mfw,
  loadings.line.color = "magenta",
  loadings.word.color = "deeppink3",
  select.loadings = list(c(-1, 2), 
    "Jay", call("word", c("people","public","men","republic","state","woman","women"))))

Other visualization options

  • Flip horizontally, vertically
  • Visualize other principal components
  • Choose feature subsets
  • Withhold a category from PCA space
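What several of these options amount to can be sketched in base R: fit the PCA without one category, project the withheld rows into that space, pick other components, and flip an axis. The data and class labels below are random toy values, and the code does not use stylo2gg's own arguments.

```r
set.seed(2)

# Toy data: 8 texts by 5 features, two texts per (invented) class label.
freqs <- matrix(rnorm(40), nrow = 8,
                dimnames = list(NULL, paste0("word", 1:5)))
class <- rep(c("Hamilton", "Madison", "Jay", "unknown"), each = 2)

# Fit the PCA without the withheld category.
keep <- class != "unknown"
pca <- prcomp(freqs[keep, ], center = TRUE, scale. = TRUE)

# Project the withheld rows into the space defined by the other texts.
withheld <- predict(pca, newdata = freqs[!keep, ])

# Choose other components: PC2 against PC3 instead of PC1 against PC2.
# Rows are ordered with the kept texts first, then the withheld ones.
scores <- rbind(pca$x, withheld)[, c(2, 3)]

# Flip horizontally by negating the component on the x-axis.
scores_flipped <- scores
scores_flipped[, 1] <- -scores_flipped[, 1]
```

Withholding keeps the excluded category from shaping the components, so its texts are placed by the other authors' variation alone.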

3. ggplot2 opportunities

Adding layers

library(ggplot2)  # attach ggplot2 so the layer functions below are available

stylo2gg(federalist_mfw, "pca", legend = FALSE) + 
  scale_alpha_manual(values = rep(0, 4)) +
  geom_density_2d(aes(color = class),
                  show.legend = FALSE) + 
  theme_minimal()
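Layers can be added this way because a ggplot object accepts further layers, scales, and themes with `+`. A self-contained sketch with an invented toy data frame (not real stylo output) shows the same pattern:

```r
library(ggplot2)

set.seed(3)

# Toy scores for two invented classes; column names are assumptions.
toy <- data.frame(
  PC1 = c(rnorm(20, -1), rnorm(20, 1)),
  PC2 = rnorm(40),
  class = rep(c("A", "B"), each = 20)
)

# A base plot, comparable in spirit to a plot object returned by a function.
p <- ggplot(toy, aes(PC1, PC2, color = class)) +
  geom_point()

# Add a density layer and swap the theme without rebuilding the plot.
p2 <- p +
  geom_density_2d(show.legend = FALSE) +
  theme_minimal()

p2
```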

Increased customization

federalist_mfw |> 
  rename_category("NA", "unknown") |> 
  stylo2gg(viz = "pca") + 
  theme_minimal() +
  theme(legend.position = c(0.9,0.15),
        panel.grid = element_blank(),
        plot.title.position = "plot") + 
  scale_shape_manual(values = 15:18) +
  scale_size_manual(values = c(8.5, 9, 7, 10)) +
  scale_alpha_manual(values = rep(.5, 4)) + 
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73", "#CC79A7")) +
  labs(title = "Larger, solid points can make relationships easier to understand.",
       subtitle = "Setting alpha values is a good idea when solid points overlap.",
       x = NULL, y = NULL,
       color = "author", shape = "author", alpha = "author", size = "author")

Other packages

library(ggforce)
library(wesanderson)

federalist_mfw |> 
  rename_category("NA", "disputed") |> 
  stylo2gg("pca")  +
  geom_mark_hull(aes(fill = class, 
                     color = class)) + 
  geom_mark_hull(aes(group = class, 
                     label = class, 
                     filter = class %in% c("Madison","disputed")),
                 con.cap = 0,
                 show.legend = FALSE) +
  scale_fill_manual(values = wes_palettes$Darjeeling1[c(1:3,5)]) +
  scale_color_manual(values = wes_palettes$Darjeeling1[c(1:3,5)])

(4.) Use and usefulness

Reproducible notebooks

Publication ready

GitHub and pkgdown site

github.com/jmclawson/stylo2gg

jmclawson.net

clawson@gmail.com

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2023. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Clawson, James M. 2023. stylo2gg: Visualize and Explore stylo Data with ggplot2 (version 1.0.1). https://github.com/jmclawson/stylo2gg.
Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. “Stylometry with R: A Package for Computational Text Analysis.” R Journal 8 (1): 107–21. https://journal.r-project.org/archive/2016/RJ-2016-007/index.html.
Ginnerskov, Josef. 2022. “Change Color of Loadings (Arrows and Features).” stylo2gg Issues Tracker. https://github.com/jmclawson/stylo2gg/issues/2.
———. 2023. “Loadings Stuck on PC1 and PC2 When Trying to Plot PC3.” stylo2gg Issues Tracker. https://github.com/jmclawson/stylo2gg/issues/4.
Mosteller, Frederick, and David L. Wallace. 1963. “Inference in an Authorship Problem.” Journal of the American Statistical Association 58 (302): 275–309. http://www.jstor.org/stable/2283270.
Pedersen, Thomas Lin. 2022. ggforce: Accelerating ggplot2. https://CRAN.R-project.org/package=ggforce.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schöch, Christof. 2022. “Proposal for a Replication or Documentation Mode.” stylo Issues Tracker. https://github.com/computationalstylistics/stylo/issues/53.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Xie, Yihui. 2014. “knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC.
———. 2015. Dynamic Documents with R and knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.org/knitr/.
———. 2023. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.