stylo2gg: Visualizing Reproducible Stylometry

James M. Clawson

Reproducible notebook

Plan

  1. Stylometry?
  2. Visualizing problems and solutions
  3. ggplot2 opportunities

1. Stylometry

  • measure distributions of features (words, characters, grammar, …)
  • “fingerprint” of a work
  • study style and authorship
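
As a minimal sketch of what "measuring distributions of features" can mean in practice, the base R snippet below counts word frequencies in two toy strings and converts them to percentages. The strings and the helper name rel_freq are invented for illustration; a real study would work from full texts and a richer tokenizer.

```r
# Two toy "texts" (invented for illustration; not drawn from any corpus)
texts <- c(
  a = "the people of the state of new york",
  b = "to the people and to the union"
)

# Split each text into lowercase word tokens
tokens <- lapply(texts, function(x) strsplit(tolower(x), "[^a-z]+")[[1]])

# Hypothetical helper: relative frequency (percent) of each word in a text
rel_freq <- function(words) {
  counts <- table(words)
  100 * counts / sum(counts)
}

freqs <- lapply(tokens, rel_freq)

# The most frequent words are the starting point for a text's "fingerprint"
head(sort(freqs$a, decreasing = TRUE), 3)
```

Relative rather than raw counts are used here so that texts of different lengths can be compared on the same scale.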

The Federalist Papers (1787–1788)

  • 85 papers by “Publius” (Alexander Hamilton, James Madison, and John Jay)

  • history of authorship research

    • Douglass Adair (1944)
    • Frederick Mosteller and David Wallace (1963)
    • count / compare features

Stylometry in R

  • stylo by Maciej Eder et al.
  • interactive GUI or code-based
  • frequency tables and visualizations
  • principal component analysis, hierarchical clustering, etc.
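
The two analyses named above can be sketched with base R alone, which may help show what happens to a frequency table before it is drawn. Everything here is an assumption made for illustration: the table values are invented, the text labels are hypothetical, and Euclidean distance on z-scores stands in for whichever distance measure a real stylo run would be configured to use.

```r
# Toy table of relative frequencies (percent): rows are texts, columns are words.
# All values and labels are invented for illustration.
freq_table <- rbind(
  author_a_1 = c(the = 9.1, of = 6.2, to = 4.0, upon = 0.30, whilst = 0.00),
  author_a_2 = c(8.8, 6.0, 4.2, 0.28, 0.01),
  author_b_1 = c(9.5, 6.6, 3.5, 0.02, 0.05),
  author_b_2 = c(9.4, 6.5, 3.6, 0.01, 0.06)
)

# Standardize each word column (z-scores) so rare words are not swamped
z <- scale(freq_table)

# Principal component analysis on the standardized frequencies
pca <- prcomp(z)
scores <- pca$x[, 1:2]

# Hierarchical clustering on the distances between texts
tree <- hclust(dist(z), method = "ward.D2")
groups <- cutree(tree, k = 2)

scores
groups
```

In this toy table the two texts of each hypothetical author fall on the same side of the first component and into the same cluster; plot(tree) would draw the corresponding dendrogram.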

stylo problems and stylo2gg solutions

  • logging: addressed with replication
  • modifications: addressed with successive exploration
  • graphics: addressed using ggplot2

Lost logs with stylo

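library(stylo)

# Run stylo without the GUI and keep the result as an object
my_data1 <- stylo(
  gui = FALSE,
  corpus.dir = "data/federalist/",
  display.on.screen = FALSE,  # do not draw the plot
  culling.max = 75,           # keep words found in at least 75% of texts
  culling.min = 75,
  mfw.min = 900,              # use the 900 most frequent words
  mfw.max = 900)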
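
To make the culling and most-frequent-word settings in the call above more concrete, here is a rough base R sketch of the two filters applied to a toy table. It is not stylo's own code: the helpers cull and keep_mfw are hypothetical, the values are invented, ranking words by mean relative frequency is a simplification, and the order of the two steps is an assumption.

```r
# Toy frequency table: rows are texts, columns are words (values invented)
freq_table <- rbind(
  text_1 = c(the = 9.0, of = 6.0, upon = 0.3, whilst = 0.0),
  text_2 = c(8.5, 6.2, 0.2, 0.0),
  text_3 = c(9.2, 5.9, 0.0, 0.1),
  text_4 = c(9.1, 6.1, 0.1, 0.0)
)

# Hypothetical helper: keep words present in at least `percent` of the texts
cull <- function(tab, percent) {
  share <- 100 * colMeans(tab > 0)
  tab[, share >= percent, drop = FALSE]
}

# Hypothetical helper: keep the n most frequent words (by mean frequency)
keep_mfw <- function(tab, n) {
  ranked <- order(colMeans(tab), decreasing = TRUE)
  tab[, ranked[seq_len(min(n, ncol(tab)))], drop = FALSE]
}

# Mirror the settings on the slide: 75% culling, up to 900 words
selected <- keep_mfw(cull(freq_table, 75), 900)
colnames(selected)
```

Here "whilst" appears in only one of four texts (25%) and is dropped, while "upon" appears in three of four (75%) and survives. Setting the minimum and maximum to the same value, as on the slide, requests a single analysis rather than a series.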

Log and replicate with stylo2gg

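library(stylo2gg)

# Log the parameters of the run stored in my_data1
stylo_log(my_data1)

# Replicate the run that was logged at the given date and time
my_data2 <- stylo_replicate("2023-01-01 00:01:00")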

2. Visualizing problems and solutions

Principal Component Analysis

library(stylo)

federalist_mfw <- 
  stylo(gui = FALSE,
        corpus.dir = "data/federalist/",
        analysis.type = "PCR",  # PCA on the correlation matrix
        pca.visual.flavour = "symbols",
        analyzed.features = "w",
        ngram.size = 1,
        display.on.screen = TRUE,
        sampling = "no.sampling",
        culling.max = 75,
        culling.min = 75,
        mfw.min = 900,
        mfw.max = 900)

library(stylo2gg)

# Redraw the stylo result with ggplot2
stylo2gg(federalist_mfw)
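
For readers without the corpus at hand, the following sketch shows the general idea of drawing PCA scores with ggplot2 directly. It does not reproduce stylo2gg's internals or its output: the frequency values, text names, and author labels are all invented, and the styling choices are arbitrary.

```r
library(ggplot2)

# Toy relative frequencies (percent); values and labels are invented
freq_table <- rbind(
  text_1 = c(the = 9.1, of = 6.2, to = 4.0, upon = 0.30),
  text_2 = c(8.8, 6.0, 4.2, 0.28),
  text_3 = c(9.5, 6.6, 3.5, 0.02),
  text_4 = c(9.4, 6.5, 3.6, 0.01)
)
authors <- c("A", "A", "B", "B")  # hypothetical author labels

# PCA on standardized frequencies, then collect the first two components
pca <- prcomp(scale(freq_table))
scores <- data.frame(pca$x[, 1:2],
                     author = authors,
                     text = rownames(freq_table))
explained <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# Map author to both colour and shape so groups stay legible in grayscale
p <- ggplot(scores, aes(x = PC1, y = PC2, colour = author, shape = author)) +
  geom_point(size = 3) +
  labs(x = paste0("PC1 (", explained[1], "%)"),
       y = paste0("PC2 (", explained[2], "%)"),
       title = "Toy PCA of word frequencies") +
  theme_minimal()

# Because p is an ordinary ggplot object, further layers can be added later
p_labeled <- p + geom_text(aes(label = text), vjust = -1, show.legend = FALSE)
```

Calling print(p_labeled) draws the plot. Building on a ggplot object in this way is what allows a chart to be adjusted step by step instead of being regenerated from scratch.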