Working on a project using Bluesky’s API has reminded me how easily social media can provide quick corpora of text data. Since the API client I’ve been using is in Python but my tmtyro package is in R, this post will be bilingual.¹ Appropriately, its subject matter is bilingual, too.

¹ The bskyr package in R offers bs_search_posts(), which probably works better than my quick search_bsky() function in Python, but then I wouldn’t have had a chance to stretch these ligaments.

Gathering data

I’ll start by loading the reticulate package in R, to help make the data available across languages:

```r
library(reticulate)
```
With that in place, I’ll write a Python function to query Bluesky and return JSON. By its very nature, this data changes quickly. To pin things down to one moment in time, my search_bsky() function archives representative searches, cached here as pickles.²

² The next few code blocks are in Python; they switch back to R in the subsequent section.
```python
import _keys
from atproto import Client, models
import json, os, pickle

def search_bsky(query):
    the_file = f"{query}.pkl"
    if os.path.exists(the_file):
        # load cache if it exists
        with open(the_file, 'rb') as file:
            result_json = pickle.load(file)
    else:
        # if no cache, create one
        params = {
            'q' : query,
            'limit' : 100}
        result = client.app.bsky.feed.search_posts(params)
        if len(result.posts) == 0:
            return None
        result_json = json.loads(models.utils.get_model_as_json(result))['posts']
        with open(the_file, 'wb') as file:
            # rerendering the report will use cached values
            pickle.dump(result_json, file)
    return result_json
```
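The cache-then-fetch pattern above is worth generalizing when running several searches. Here is a sketch of the same idea as a reusable decorator; the names pickle_cached and fetch_posts are mine, not part of the post’s code, and the stand-in fetcher just returns a dummy payload:

```python
import os
import pickle
from functools import wraps

def pickle_cached(make_path):
    """Cache a function's return value as a pickle on disk.

    `make_path` builds the cache filename from the call's arguments.
    """
    def decorator(fetch):
        @wraps(fetch)
        def wrapper(*args, **kwargs):
            path = make_path(*args, **kwargs)
            if os.path.exists(path):
                # load cache if it exists
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fetch(*args, **kwargs)
            if result is not None:
                # rerendering the report will reuse this file
                with open(path, "wb") as f:
                    pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@pickle_cached(lambda query: f"{query}.pkl")
def fetch_posts(query):
    # stand-in for the real Bluesky call
    return {"q": query, "posts": ["..."]}
```

With this shape, swapping the query function or the cache-naming scheme touches only one line each.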
After logging in with my credentials (saved in “_keys.py”), this new function can be used to search Bluesky for anything. Here, I’m using it for two queries: “python” and “rstats”:
```python
# Log in with my credentials
client = Client()
client.login(_keys.BSKY_ID, _keys.BSKY_PW)

# Now search!
json_rstats = search_bsky("rstats")
json_python = search_bsky("python")
```
Preparing texts
Now that the data is collected, I can swap over to R to do more. Python’s json_rstats object can be referenced as py$json_rstats in R, via reticulate. Since the JSON data converts into an R list that needs a little futzing, the following code flattens it into a data frame and shows the dimensions:
```r
library(tidyverse)
library(tmtyro)
library(jsonlite)

py$json_rstats |>
  # turn a list into a dataframe
  enframe() |>
  # expand one list column
  unnest_wider(value) |>
  # expand all the resulting list columns
  {\(x) unnest_wider(x, x |>
                       select(where(is.list)) |>
                       names(),
                     names_sep = ".")
  }() |>
  # now show the dimensions
  dim()
```
[1] 100 40
The 100 rows correspond to the 100 posts found in our search, while the 40 columns are named things like “author.handle”, “likeCount”, and “record.text”.
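Those dotted column names come from flattening nested records, one level of nesting per dot. The same idea can be sketched in Python; the flatten() helper below is hypothetical, just to illustrate what unnest_wider() with names_sep = "." is doing:

```python
def flatten(record, parent="", sep="."):
    """Flatten one nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # recurse, carrying the prefix along
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

post = {"author": {"handle": "someone.bsky.social"},
        "likeCount": 3,
        "record": {"text": "hello #rstats"}}
flat = flatten(post)
# keys now look like the data frame's column names:
# 'author.handle', 'likeCount', 'record.text'
```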
We’ll want to do this conversion more than once, while also trimming the columns down to a manageable number, so we’ll wrap it in a function, simplify_bsky(), and use it to convert both data sets:
```r
simplify_bsky <- function(json_list, label = NULL) {
  result <- json_list |>
    # turn a list into a dataframe
    enframe() |>
    # expand one list column
    unnest_wider(value) |>
    # expand all the resulting list columns
    {\(x) unnest_wider(x, x |>
                         select(where(is.list)) |>
                         names(), names_sep = ".")
    }() |>
    # limit and rename columns
    select(date = record.createdAt,
           account = author.handle,
           text = record.text) |>
    # convert data type
    mutate(date = ymd_hms(date))
  # optionally, add an ID column
  if (!is.null(label)) {
    result <- result |>
      mutate(corpus_id = label, .before = date)
  }
  result
}

df_rstats <- simplify_bsky(py$json_rstats, "rstats")
df_python <- simplify_bsky(py$json_python, "python")
```
To avoid any concerns with identifying people in this data set, I won’t be sharing the full thing, but I will share the underlying text in a scrambled form. tmtyro’s load_texts()
prepares a tidytext representation for study and comparison, and arrange()
alphabetizes the rows to scramble things up.
```r
corpus_python <- df_python |>
  select(-account, -date) |>
  load_texts()

corpus_python |>
  arrange(word)

corpus_rstats <- df_rstats |>
  select(-account, -date) |>
  load_texts()

corpus_rstats |>
  arrange(word)

combined <- rbind(corpus_python, corpus_rstats)

combined |>
  arrange(word)
```
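The logic of this scrambling is easy to verify: alphabetizing tokens preserves every word count while destroying the order that would let anyone reconstruct a post. A quick Python illustration, with invented tokens rather than data from the search:

```python
from collections import Counter

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
scrambled = sorted(tokens)

# word frequencies survive the scrambling...
assert Counter(scrambled) == Counter(tokens)
# ...but the original sequence is gone
print(scrambled)
```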
Vocabularies
With everything loaded, it’s easy to compare things:
```r
combined |>
  add_vocabulary() |>
  visualize(label = "inline")
```
Both data sets contain 100 posts, but it looks like the posts referencing “python” tend to be a little longer than those including “rstats,” as shown by the X-axis. The Python data also has a slightly bigger vocabulary, as shown by the angle between the two lines.
```r
combined |>
  add_vocabulary() |>
  tabulize()
```
|        | length | vocabulary (total) | vocabulary (ratio) | hapax (total) | hapax (ratio) |
|--------|--------|--------------------|--------------------|---------------|---------------|
| python | 2,809  | 1,302              | 0.464              | 923           | 0.329         |
| rstats | 2,633  | 1,105              | 0.420              | 755           | 0.287         |
The table confirms what the figure shows. Type-token ratios are comparable for both groups of posts, and posts using “python” use more unique words, driven in large part by one-off hapaxes.
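The measures in this table are simple to compute by hand: vocabulary is the number of distinct words, its ratio divides that by the total length, and hapaxes are words occurring exactly once. A sketch in Python, with an invented token list:

```python
from collections import Counter

def vocabulary_stats(tokens):
    """Return length, vocabulary size, and hapax measures for a token list."""
    counts = Counter(tokens)
    length = len(tokens)
    vocab = len(counts)
    # hapaxes: words that appear exactly once
    hapax = sum(1 for n in counts.values() if n == 1)
    return {"length": length,
            "vocabulary": vocab,
            "ttr": vocab / length,
            "hapax": hapax,
            "hapax_ratio": hapax / length}

stats = vocabulary_stats(["to", "be", "or", "not", "to", "be"])
# 6 tokens, 4 distinct words, 2 hapaxes ("or" and "not")
```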
Word counts
More granularly, word counts let us see which words were used most:
```r
combined |>
  add_frequency() |>
  visualize()
```
It’s unsurprising to see our search terms at the top, since every post will have included one or the other. Stopwords like “the” and “to” are also overrepresented, making it hard to see anything more interesting. We can skip over these to get to the juicier stuff:
```r
combined |>
  filter(!word %in% c("python", "rstats")) |>
  drop_stopwords() |>
  add_frequency() |>
  visualize()
```
Some of this could have been predicted. On the other hand, it does seem like “https” is higher up on the “rstats” side than we might expect. Do posts using “rstats” contain more links than posts using “python”?
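Filtering out search terms and stopwords before counting is straightforward to sketch in Python. The short stopword list and tokens below are invented stand-ins; tmtyro’s drop_stopwords() draws on fuller lexicons:

```python
from collections import Counter

# a tiny invented stand-in for a real stopword list
STOPWORDS = {"the", "to", "a", "and", "of", "is", "in"}

def top_words(tokens, search_terms=(), n=5):
    """Count tokens after dropping stopwords and the search terms themselves."""
    skip = STOPWORDS | set(search_terms)
    kept = [t for t in tokens if t not in skip]
    return Counter(kept).most_common(n)

tokens = ["the", "rstats", "package", "is", "on", "cran",
          "the", "package", "to", "install"]
print(top_words(tokens, search_terms=["rstats"]))
```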
Weighing Tf-idf
Raw word counts aren’t ideal for identifying what makes each group distinct. Instead, weighing by tf-idf shows which words are most representative of one data set in comparison with the other:
```r
combined |>
  add_tf_idf() |>
  visualize()
```
Many words on the “rstats” side of the chart are to be expected: posts about R are more likely to mention CRAN than are posts about Python. The Python side is more interesting, since “de,” “aprender,” and characters like “が” and “の” point to a greater share of posts written in languages other than English. Indeed, the underlying JSON data points in the same direction.
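Tf-idf multiplies a word’s term frequency (its share of the document) by its inverse document frequency (the log of the number of documents divided by the number of documents containing the word), so a word appearing in every document scores zero. A minimal two-document sketch with invented tokens:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf for each (doc, word) pair across a dict of token lists."""
    n_docs = len(docs)
    # document frequency: in how many docs does each word appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        for word, n in counts.items():
            tf = n / total
            idf = math.log(n_docs / df[word])
            scores[(name, word)] = tf * idf
    return scores

docs = {"rstats": ["cran", "package", "plot", "package"],
        "python": ["aprender", "python", "plot", "python"]}
scores = tf_idf(docs)
# "plot" appears in both docs, so its idf (and tf-idf) is zero
```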
Sentiment
Word counts can only take us so far. Social media is about engaging, connecting, and sharing both joys and frustrations. We might learn more by limiting consideration to words evoking some kind of emotion. Most simply, how do posts from these two communities compare in their positivity?
```r
combined |>
  add_sentiment() |>
  visualize() |>
  change_colors("okabe-ito")
```
Posts in both communities seem more positive than negative, but the imbalance is stronger among “rstats” posts. Still, “positive” and “negative” make for a simplistic model of emotion. Will another sentiment dictionary offer more?
```r
combined |>
  add_sentiment(lexicon = "nrc") |>
  filter(!sentiment %in% c("positive", "negative")) |>
  visualize()
```
Posts in these communities show markedly different sentiments: among “python” posts, fear is predominant; in “rstats” posts, it’s trust. Since this data is just from a small sample of 100 posts collected one time, it’s unwise to draw too many conclusions. Still, we can peek at the underlying words that inform these bars:
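Under the hood, dictionary-based sentiment tagging is a join between tokens and a lexicon mapping words to emotions. A sketch with a tiny invented stand-in for the NRC lexicon (the real lexicon assigns thousands of words to eight emotions plus positive and negative):

```python
from collections import Counter

# a tiny invented stand-in for an emotion lexicon
LEXICON = {"good": ["joy", "trust"],
           "hell": ["anger", "fear"],
           "error": ["fear"]}

def tally_sentiment(tokens):
    """Count how often each emotion is evoked across the tokens."""
    tally = Counter()
    for token in tokens:
        # words absent from the lexicon contribute nothing
        tally.update(LEXICON.get(token, []))
    return tally

print(tally_sentiment(["good", "package", "good", "error"]))
```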
```r
combined |>
  add_sentiment("nrc") |>
  drop_na() |>
  filter(!sentiment %in% c("positive", "negative")) |>
  count(word, doc_id) |>
  pivot_wider(
    names_from = doc_id,
    values_from = n,
    values_fill = 0) |>
  arrange(word) |>
  column_to_rownames("word") |>
  wordcloud::comparison.cloud(
    colors = c("#7570b3", "#1b9e77"),
    # colors = c("#00BFC4", "#F8766D"),
    max.words = 150)
```
Again, it’s important to remember that the data represents a single small sample. But it’s hard not to notice the largest words in each set, with R’s going to “good” and Python’s going to “hell.”
January Update
The data shown above was gathered in December; a month later, it’s worth comparing a fresh sample. I’ll start by revising my Python function to add a label to the cached data and then collect the updated posts.
```python
def search_bsky_label(query, label):
    the_file = f"{query}_{label}.pkl"
    if os.path.exists(the_file):
        # load cache if it exists
        with open(the_file, 'rb') as file:
            result_json = pickle.load(file)
    else:
        # if no cache, create one
        params = {
            'q' : query,
            'limit' : 100}
        result = client.app.bsky.feed.search_posts(params)
        if len(result.posts) == 0:
            return None
        result_json = json.loads(models.utils.get_model_as_json(result))['posts']
        with open(the_file, 'wb') as file:
            # rerendering the report will use cached values
            pickle.dump(result_json, file)
    return result_json

# Now search!
json_rstats25 = search_bsky_label("rstats", "jan-2025")
json_python25 = search_bsky_label("python", "jan-2025")
```
With this data prepared, some quick visualizations show what’s changed:
```r
corpus_rstats25 <- py$json_rstats25 |>
  simplify_bsky("rstats-january") |>
  select(-account, -date) |>
  load_texts()

corpus_python25 <- py$json_python25 |>
  simplify_bsky("python-january") |>
  select(-account, -date) |>
  load_texts()

combined25 <- rbind(corpus_python25, corpus_rstats25)

combined25 |>
  add_vocabulary() |>
  visualize(label = "inline")
```
```r
combined25 |>
  add_vocabulary() |>
  tabulize()
```
|                | length | vocabulary (total) | vocabulary (ratio) | hapax (total) | hapax (ratio) |
|----------------|--------|--------------------|--------------------|---------------|---------------|
| python-january | 2,811  | 1,377              | 0.490              | 1,022         | 0.364         |
| rstats-january | 2,677  | 1,158              | 0.433              | 794           | 0.297         |
```r
combined25 |>
  add_tf_idf() |>
  visualize()
```
```r
combined25 |>
  add_sentiment(lexicon = "nrc") |>
  filter(!sentiment %in% c("positive", "negative")) |>
  visualize()
```
```r
combined25 |>
  add_sentiment("nrc") |>
  drop_na() |>
  filter(!sentiment %in% c("positive", "negative")) |>
  count(word, doc_id) |>
  mutate(doc_id = str_remove_all(doc_id, "-january$")) |>
  pivot_wider(
    names_from = doc_id,
    values_from = n,
    values_fill = 0) |>
  arrange(word) |>
  column_to_rownames("word") |>
  wordcloud::comparison.cloud(
    colors = c("#7570b3", "#1b9e77"),
    # colors = c("#00BFC4", "#F8766D"),
    max.words = 150)
```
One month’s data is no more representative than another’s. On the contrary, considering them in tandem can show how much variation is normal. Most important to me, though, is the proof of concept: it’s not the same as the old Twitter firehose, but Bluesky’s API makes it easy to collect these kinds of posts, providing a convenient corpus for consideration.
Citation
@misc{clawson2025,
author = {Clawson, James},
title = {Bilingual {Bluesky} {Sentiment}},
date = {2025-01-18},
url = {https://jmclawson.net/posts/bluesky-sentiment/},
langid = {en}
}