Working on a project using Bluesky’s API has reminded me how easily social media can provide quick corpora of text data. Since the API client I’ve been using is in Python but my tmtyro package is in R, this post will be bilingual.¹ Appropriately, its subject matter is bilingual, too.

¹ The bskyr package in R offers bs_search_posts(), which probably works better than my quick search_bsky() function in Python, but then I wouldn’t have had a chance to stretch these ligaments.

Gathering data

I’ll start by loading the reticulate package in R, to help make the data available across languages:

```r
library(reticulate)
```
With that in place, I’ll write a Python function to query Bluesky and return JSON. By its very nature, this data changes quickly. To pin things down to one moment in time, my search_bsky() function archives representative searches, cached here as pickles.²

² The next few code blocks are in Python; they switch back to R in the subsequent section.
```python
import _keys
from atproto import Client, models
import json, os, pickle

def search_bsky(query):
    the_file = f"{query}.pkl"
    if os.path.exists(the_file):
        # load cache if it exists
        with open(the_file, 'rb') as file:
            result_json = pickle.load(file)
    else:
        # if no cache, create one
        params = {
            'q' : query,
            'limit' : 100}
        result = client.app.bsky.feed.search_posts(params)
        if len(result.posts) == 0:
            return None
        result_json = json.loads(models.utils.get_model_as_json(result))['posts']
        with open(the_file, 'wb') as file:
            # rerendering the report will use cached values
            pickle.dump(result_json, file)
    return result_json
```
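The cache-then-fetch pattern above is worth generalizing when running several searches. Here is a sketch of the same idea as a reusable decorator; the names pickle_cached and fetch_posts are mine, not part of the post’s code, and the stand-in fetcher just returns a dummy payload:

```python
import os
import pickle
from functools import wraps

def pickle_cached(make_path):
    """Cache a function's return value as a pickle on disk.

    `make_path` builds the cache filename from the call's arguments.
    """
    def decorator(fetch):
        @wraps(fetch)
        def wrapper(*args, **kwargs):
            path = make_path(*args, **kwargs)
            if os.path.exists(path):
                # load cache if it exists
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fetch(*args, **kwargs)
            if result is not None:
                # rerendering the report will reuse this file
                with open(path, "wb") as f:
                    pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@pickle_cached(lambda query: f"{query}.pkl")
def fetch_posts(query):
    # stand-in for the real Bluesky call
    return {"q": query, "posts": ["..."]}
```

With this shape, swapping the query function or the cache-naming scheme touches only one line each.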
After logging in with my credentials (saved in “_keys.py”), this new function can be used to search Bluesky for anything. Here, I’m using it for two queries: “python” and “rstats”:
```python
# Log in with my credentials
client = Client()
client.login(_keys.BSKY_ID, _keys.BSKY_PW)

# Now search!
json_rstats = search_bsky("rstats")
json_python = search_bsky("python")
```
Preparing texts
Now that the data is collected, I can swap over to R to do more. Python’s json_rstats object can be referenced as py$json_rstats in R, via reticulate. Since the JSON data converts into an R list that needs a little futzing, the following code flattens it into a data frame and shows the dimensions:
```r
library(tidyverse)
library(tmtyro)
library(jsonlite)

py$json_rstats |>
  # turn a list into a dataframe
  enframe() |>
  # expand one list column
  unnest_wider(value) |>
  # expand all the resulting list columns
  {\(x) unnest_wider(x, x |>
                       select(where(is.list)) |>
                       names(),
                     names_sep = ".")
  }() |>
  # now show the dimensions
  dim()
```
[1] 100 40
The 100 rows correspond to the 100 posts found in our search, while the 40 columns are named things like “author.handle”, “likeCount”, and “record.text”.
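Those dotted column names come from flattening nested records, one level of nesting per dot. The same idea can be sketched in Python; the flatten() helper below is hypothetical, just to illustrate what unnest_wider() with names_sep = "." is doing:

```python
def flatten(record, parent="", sep="."):
    """Flatten one nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # recurse, carrying the prefix along
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

post = {"author": {"handle": "someone.bsky.social"},
        "likeCount": 3,
        "record": {"text": "hello #rstats"}}
flat = flatten(post)
# keys now look like the data frame's column names:
# 'author.handle', 'likeCount', 'record.text'
```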
We’ll want to do this conversion more than once, while also trimming the columns down to a manageable number, so we’ll wrap it in a function, simplify_bsky(), and use it to convert both data sets:
```r
simplify_bsky <- function(json_list, label = NULL) {
  result <- json_list |>
    # turn a list into a dataframe
    enframe() |>
    # expand one list column
    unnest_wider(value) |>
    # expand all the resulting list columns
    {\(x) unnest_wider(x, x |>
                         select(where(is.list)) |>
                         names(), names_sep = ".")
    }() |>
    # limit and rename columns
    select(date = record.createdAt,
           account = author.handle,
           text = record.text) |>
    # convert data type
    mutate(date = ymd_hms(date))
  # optionally, add an ID column
  if (!is.null(label)) {
    result <- result |>
      mutate(corpus_id = label, .before = date)
  }
  result
}

df_rstats <- simplify_bsky(py$json_rstats, "rstats")
df_python <- simplify_bsky(py$json_python, "python")
```
To avoid any concerns with identifying people in this data set, I won’t be sharing the full thing, but I will share the underlying text in a scrambled form. tmtyro’s load_texts()
prepares a tidytext representation for study and comparison, and arrange()
alphabetizes the rows to scramble things up.
```r
corpus_python <- df_python |>
  select(-account, -date) |>
  load_texts()

corpus_python |>
  arrange(word)

corpus_rstats <- df_rstats |>
  select(-account, -date) |>
  load_texts()

corpus_rstats |>
  arrange(word)

combined <- rbind(corpus_python, corpus_rstats)

combined |>
  arrange(word)
```
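The logic of this scrambling is easy to verify: alphabetizing tokens preserves every word count while destroying the order that would let anyone reconstruct a post. A quick Python illustration, with invented tokens rather than data from the search:

```python
from collections import Counter

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
scrambled = sorted(tokens)

# word frequencies survive the scrambling...
assert Counter(scrambled) == Counter(tokens)
# ...but the original sequence is gone
print(scrambled)
```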
Vocabularies
With everything loaded, it’s easy to compare things:
```r
combined |>
  add_vocabulary() |>
  visualize(label = "inline")
```
Both data sets contain 100 posts, but it looks like the posts referencing “python” tend to be a little longer than those including “rstats,” as shown by the X-axis. The Python data also has a slightly bigger vocabulary, as shown by the angle between the two lines.
```r
combined |>
  add_vocabulary() |>
  tabulize()
```
|        | length | vocabulary (total) | vocabulary (ratio) | hapax (total) | hapax (ratio) |
|--------|--------|--------------------|--------------------|---------------|---------------|
| python | 2,809  | 1,302              | 0.464              | 923           | 0.329         |
| rstats | 2,633  | 1,105              | 0.420              | 755           | 0.287         |
The table confirms what the figure shows. Type-token ratios are comparable for both groups of posts, and posts using “python” use more unique words, driven in large part by one-off hapaxes.
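The measures in this table are simple to compute by hand: vocabulary is the number of distinct words, its ratio divides that by the total length, and hapaxes are words occurring exactly once. A sketch in Python, with an invented token list:

```python
from collections import Counter

def vocabulary_stats(tokens):
    """Return length, vocabulary size, and hapax measures for a token list."""
    counts = Counter(tokens)
    length = len(tokens)
    vocab = len(counts)
    # hapaxes: words that appear exactly once
    hapax = sum(1 for n in counts.values() if n == 1)
    return {"length": length,
            "vocabulary": vocab,
            "ttr": vocab / length,
            "hapax": hapax,
            "hapax_ratio": hapax / length}

stats = vocabulary_stats(["to", "be", "or", "not", "to", "be"])
# 6 tokens, 4 distinct words, 2 hapaxes ("or" and "not")
```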
Word counts
More granularly, word counts let us see which words were used most:
```r
combined |>
  add_frequency() |>
  visualize()
```
It’s unsurprising to see our search terms at the top, since every post will have included one or the other. Stopwords like “the” and “to” are also overrepresented, making it hard to see anything more interesting. We can skip over these to get to the juicier stuff:
```r
combined |>
  filter(!word %in% c("python", "rstats")) |>
  drop_stopwords() |>
  add_frequency() |>
  visualize()
```
Some of this could have been predicted. On the other hand, it does seem like “https” is higher up on the “rstats” side than we might expect. Do posts using “rstats” contain more links than posts using “python”?
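Filtering out search terms and stopwords before counting is straightforward to sketch in Python. The short stopword list and tokens below are invented stand-ins; tmtyro’s drop_stopwords() draws on fuller lexicons:

```python
from collections import Counter

# a tiny invented stand-in for a real stopword list
STOPWORDS = {"the", "to", "a", "and", "of", "is", "in"}

def top_words(tokens, search_terms=(), n=5):
    """Count tokens after dropping stopwords and the search terms themselves."""
    skip = STOPWORDS | set(search_terms)
    kept = [t for t in tokens if t not in skip]
    return Counter(kept).most_common(n)

tokens = ["the", "rstats", "package", "is", "on", "cran",
          "the", "package", "to", "install"]
print(top_words(tokens, search_terms=["rstats"]))
```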
Weighing Tf-idf
Raw word counts aren’t ideal for identifying what makes each group distinct. Instead, weighing by tf-idf shows which words are most representative of one data set in comparison with the other:
```r
combined |>
  add_tf_idf() |>
  visualize()
```
Many words on the “rstats” side of the chart are to be expected: posts about R are more likely to mention CRAN than are posts about Python. The Python side is more interesting, since “de,” “aprender,” and characters like “が” and “の” point to a greater share of posts written in languages other than English. Indeed, the underlying JSON data points in the same direction.
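Tf-idf multiplies a word’s term frequency (its share of the document) by its inverse document frequency (the log of the number of documents divided by the number of documents containing the word), so a word appearing in every document scores zero. A minimal two-document sketch with invented tokens:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf for each (doc, word) pair across a dict of token lists."""
    n_docs = len(docs)
    # document frequency: in how many docs does each word appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        for word, n in counts.items():
            tf = n / total
            idf = math.log(n_docs / df[word])
            scores[(name, word)] = tf * idf
    return scores

docs = {"rstats": ["cran", "package", "plot", "package"],
        "python": ["aprender", "python", "plot", "python"]}
scores = tf_idf(docs)
# "plot" appears in both docs, so its idf (and tf-idf) is zero
```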
Sentiment
Word counts can only take us so far. Social media is about engaging, connecting, and sharing both joys and frustrations. We might learn more by limiting consideration to words evoking some kind of emotion. Most simply, how do posts from these two communities compare in their positivity?
```r
combined |>
  add_sentiment() |>
  visualize() |>
  change_colors("okabe-ito")
```
Posts in both communities seem more positive than negative, but the imbalance is stronger among “rstats” posts. Still, “positive” and “negative” make for a simplistic model of emotion. Will another sentiment dictionary offer more?
```r
combined |>
  add_sentiment(lexicon = "nrc") |>
  filter(!sentiment %in% c("positive", "negative")) |>
  visualize()
```
Posts in these communities show markedly different sentiments: among “python” posts, fear is predominant; in “rstats” posts, it’s trust. Since this data is just from a small sample of 100 posts collected one time, it’s unwise to draw too many conclusions. Still, we can peek at the underlying words that inform these bars:
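Under the hood, dictionary-based sentiment tagging is a join between tokens and a lexicon mapping words to emotions. A sketch with a tiny invented stand-in for the NRC lexicon (the real lexicon assigns thousands of words to eight emotions plus positive and negative):

```python
from collections import Counter

# a tiny invented stand-in for an emotion lexicon
LEXICON = {"good": ["joy", "trust"],
           "hell": ["anger", "fear"],
           "error": ["fear"]}

def tally_sentiment(tokens):
    """Count how often each emotion is evoked across the tokens."""
    tally = Counter()
    for token in tokens:
        # words absent from the lexicon contribute nothing
        tally.update(LEXICON.get(token, []))
    return tally

print(tally_sentiment(["good", "package", "good", "error"]))
```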
```r
combined |>
  add_sentiment("nrc") |>
  drop_na() |>
  filter(!sentiment %in% c("positive", "negative")) |>
  count(word, doc_id) |>
  pivot_wider(
    names_from = doc_id,
    values_from = n,
    values_fill = 0) |>
  arrange(word) |>
  column_to_rownames("word") |>
  wordcloud::comparison.cloud(
    colors = c("#7570b3", "#1b9e77"),
    # colors = c("#00BFC4", "#F8766D"),
    max.words = 150)
```
Again, it’s important to remember that the data represents a single small sample. But it’s hard not to notice the largest words in each set, with R’s going to “good” and Python’s going to “hell.”
January Update
The data shown above was gathered in December; a month later, it’s worth comparing a fresh sample. I’ll start by revising my Python function to add a label to the cached data and then collect the updated posts.
```python
def search_bsky_label(query, label):
    the_file = f"{query}_{label}.pkl"
    if os.path.exists(the_file):
        # load cache if it exists
        with open(the_file, 'rb') as file:
            result_json = pickle.load(file)
    else:
        # if no cache, create one
        params = {
            'q' : query,
            'limit' : 100}
        result = client.app.bsky.feed.search_posts(params)
        if len(result.posts) == 0:
            return None
        result_json = json.loads(models.utils.get_model_as_json(result))['posts']
        with open(the_file, 'wb') as file:
            # rerendering the report will use cached values
            pickle.dump(result_json, file)
    return result_json

# Now search!
json_rstats25 = search_bsky_label("rstats", "jan-2025")
json_python25 = search_bsky_label("python", "jan-2025")
```
With this data prepared, some quick visualizations show what’s changed:
```r
corpus_rstats25 <- py$json_rstats25 |>
  simplify_bsky("rstats-january") |>
  select(-account, -date) |>
  load_texts()

corpus_python25 <- py$json_python25 |>
  simplify_bsky("python-january") |>
  select(-account, -date) |>
  load_texts()

combined25 <- rbind(corpus_python25, corpus_rstats25)

combined25 |>
  add_vocabulary() |>
  visualize(label = "inline")
```
```r
combined25 |>
  add_vocabulary() |>
  tabulize()
```
|                | length | vocabulary (total) | vocabulary (ratio) | hapax (total) | hapax (ratio) |
|----------------|--------|--------------------|--------------------|---------------|---------------|
| python-january | 2,811  | 1,377              | 0.490              | 1,022         | 0.364         |
| rstats-january | 2,677  | 1,158              | 0.433              | 794           | 0.297         |
```r
combined25 |>
  add_tf_idf() |>
  visualize()
```
```r
combined25 |>
  add_sentiment(lexicon = "nrc") |>
  filter(!sentiment %in% c("positive", "negative")) |>
  visualize()
```
```r
combined25 |>
  add_sentiment("nrc") |>
  drop_na() |>
  filter(!sentiment %in% c("positive", "negative")) |>
  count(word, doc_id) |>
  mutate(doc_id = str_remove_all(doc_id, "-january$")) |>
  pivot_wider(
    names_from = doc_id,
    values_from = n,
    values_fill = 0) |>
  arrange(word) |>
  column_to_rownames("word") |>
  wordcloud::comparison.cloud(
    colors = c("#7570b3", "#1b9e77"),
    # colors = c("#00BFC4", "#F8766D"),
    max.words = 150)
```
One month’s data is no more representative than another’s. On the contrary, considering them in tandem can show how much variation is normal. Most important to me, though, is the proof of concept: it’s not the same as the old Twitter firehose, but Bluesky’s API makes it easy to collect these kinds of posts, providing a convenient corpus for consideration.
Citation
@misc{clawson2025,
author = {Clawson, James},
title = {Bilingual {Bluesky} {Sentiment}},
date = {2025-01-18},
url = {https://jmclawson.net/posts/bluesky-sentiment/},
langid = {en}
}