Announcing My New Book: Cultural Analytics in R - A Tidy Approach

Exciting news: my book Cultural Analytics in R: A Tidy Approach has just been published by Springer! This is the first book to bring together the many ways tidy data principles can be used across cultural analytics workflows. I’m grateful to everyone who has expressed interest in this project and supported its development.

So what exactly is Cultural Analytics in R? Well, it’s my attempt to fill a gap I’ve been thinking about for years—how do we get humanities folks comfortable with computational methods without scaring them away with too much technical jargon? As historian Roy Rosenzweig pointed out way back in 2003, we’ve moved from “a culture of scarcity to a culture of abundance” in our digital world. Suddenly, we have access to massive archives of cultural data, but most humanities scholars weren’t really trained to handle that scale of information.
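To give a flavor of what a tidy approach looks like in practice, here is a minimal sketch of the kind of workflow the book builds on, using the tidytext package. The corpus below is a toy example invented for illustration, not material from the book:

```r
library(tidyverse)
library(tidytext)

# A toy corpus standing in for a cultural dataset (titles and text are invented)
corpus <- tibble(
  title = c("Novel A", "Novel B"),
  text  = c("The sea was calm and the night was long.",
            "The city was loud, and the night felt endless.")
)

# Reshape to one token per row, drop stop words, and count terms:
# the tidy structure that downstream analysis and visualization build on
word_counts <- corpus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)
```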

Two New Publications: Topps Ripped and Programming Historian!

Two quick publication announcements!

First, I’m thrilled to announce that I have a new article published over on the official Topps Ripped blog!

Anyone who knows me knows I’ve been a lifelong professional wrestling fan. It’s a passion that goes back to when I was a kid. So, getting the chance to combine that interest with writing about cultural history and trading cards was a fun opportunity. The article, “1987 Topps WWE WrestleMania Cards: History & Culture”, takes a deep dive into the iconic card set released around the time of the legendary WrestleMania III. I explore the design choices, the larger-than-life wrestlers featured, and the cultural significance of these cards during a pivotal boom period for WWE (then WWF).

The GLM as a Means to Ensure Statistical Validity

This was published a little while ago, but I haven’t had a chance to share it until now: my article in the International Journal of Digital Humanities! It tackles a challenge I’ve seen many of us in the Digital Humanities (DH) wrestle with and one I’ve been thinking about for a while: how do we ensure our quantitative research stands up to scrutiny? In 2019, Nan Z. Da published an article in Critical Inquiry entitled “The Computational Case against Computational Literary Studies,” which set off a fierce debate about the validity of quantitative methods in the humanities. In response, Critical Inquiry hosted an online forum to discuss the issues it raised. The responses often focused on errors in Da’s critique. Still, I think the real issue is that many of us in DH, and the humanities more broadly, lack the statistical training to conduct quantitative research effectively.
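To make that concrete, here is a minimal sketch of the kind of model the article’s title points to: a generalized linear model fit with base R’s glm(). The data below are invented for illustration; the idea is a Poisson model of how often a term appears across novels, with document length as an exposure offset:

```r
# Hypothetical data: term counts across six novels
novels <- data.frame(
  term_count = c(3, 0, 7, 2, 12, 5),
  word_count = c(60000, 45000, 80000, 52000, 91000, 70000),
  year       = c(1890, 1895, 1901, 1905, 1910, 1915),
  genre      = c("gothic", "realist", "gothic", "realist", "gothic", "realist")
)

# Poisson GLM with an offset so counts are modeled relative to document length
fit <- glm(term_count ~ year + genre + offset(log(word_count)),
           family = poisson(link = "log"), data = novels)

summary(fit)    # coefficients, standard errors, and significance tests
exp(coef(fit))  # rate ratios, easier to interpret than raw coefficients
```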

Using Facebook's Prophet for Time Series Analysis of Firearm Permits

The debate over gun ownership and regulation in the United States remains contentious. Arguments often center around interpretations of the Second Amendment, public safety concerns, and the effectiveness of existing policies. To inform these discussions, I thought it would be interesting to dive into the FBI’s National Instant Criminal Background Check System (NICS) data, as compiled by the Data Liberation Project. NICS was mandated by the Brady Handgun Violence Prevention Act of 1993 and launched by the FBI in 1998. It’s used by Federal Firearms Licensees (FFLs) to determine whether a prospective buyer is eligible to purchase firearms or explosives.
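As a rough sketch of the workflow (not the exact code from the post), here is how one might fit Prophet to a monthly series of background checks in R. The nics_data object and its date and permit columns are placeholders for however the Data Liberation Project table is structured:

```r
library(prophet)
library(dplyr)
library(lubridate)

# Aggregate to a monthly series; Prophet expects columns named `ds` and `y`
permits <- nics_data %>%                       # nics_data: assumed tidy NICS table
  group_by(ds = floor_date(date, "month")) %>%
  summarize(y = sum(permit, na.rm = TRUE), .groups = "drop")

m <- prophet(permits, yearly.seasonality = TRUE)

future   <- make_future_dataframe(m, periods = 24, freq = "month")
forecast <- predict(m, future)

plot(m, forecast)                      # fitted trend plus forecast intervals
prophet_plot_components(m, forecast)   # long-run trend and yearly seasonality
```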

A Geospatial Analysis of Unaccompanied Migrant Children using Anselin Local Moran's I

In the last blog post, I explored the data set that the New York Times released about unaccompanied migrant children in the United States. These are children who have crossed the border without a parent or legal guardian.

A particularly useful element of the data set is its spatial and temporal information. We know when individuals entered the United States, their country of origin, and where they were placed, which allows for some interesting analysis and visualizations. For instance, we can see that while there are children from around the globe, there is a large concentration of individuals from Central America, particularly Guatemala, El Salvador, and Honduras. I figured creating an arc map would allow us to see some of these trends better.
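Because the headline method here is Anselin’s Local Moran’s I, a minimal sketch with the spdep package may help show the shape of the analysis. The placements_by_county object (an sf data frame of county polygons with a children count column) is a placeholder, not the actual dataset:

```r
library(sf)
library(spdep)

# placements_by_county: hypothetical sf polygons with a `children` count column
nb <- poly2nb(placements_by_county)                   # queen-contiguity neighbors
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)   # row-standardized weights

# Local Moran's I for each county: high-high and low-low clusters, plus outliers
lisa <- localmoran(placements_by_county$children, lw, zero.policy = TRUE)

placements_by_county$local_i <- lisa[, "Ii"]  # local statistic
placements_by_county$p_value <- lisa[, 5]     # p-values (column name varies by spdep version)
```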

Convolutional Neural Networks and HDBSCAN for Tagging Handwritten Archival Material

Handwritten archival materials present a unique challenge for automated tagging and categorization, as traditional optical character recognition (OCR) techniques often struggle with their variability and complexity. Consequently, documents typically require manual tagging and categorization by human archivists—a time-consuming and labor-intensive process. This issue is compounded by the fact that historical documents often use scripts or writing styles unfamiliar to contemporary students. Since cursive is rarely taught now, younger generations may need specific training to read certain historical documents. While transformer models show promise in transcribing handwritten text, they require large amounts of labeled training data, which is often unavailable for historical documents. Convolutional neural networks (CNNs) offer a promising alternative. By learning visual features directly from handwritten images, CNNs can potentially categorize and tag documents without requiring full transcription.
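As a rough sketch of that idea, assuming a folder of scanned page images, one could use a pretrained CNN as a fixed feature extractor and then cluster the resulting embeddings with HDBSCAN. The paths and tuning values below are illustrative, not a finished pipeline:

```r
library(keras)    # pretrained CNN for feature extraction
library(dbscan)   # hdbscan() for density-based clustering

# Hypothetical folder of scanned, handwritten page images
paths <- list.files("scans/", pattern = "\\.jpg$", full.names = TRUE)

# VGG16 without its classification head; average pooling yields a 512-d embedding
cnn <- application_vgg16(weights = "imagenet", include_top = FALSE, pooling = "avg")

extract_features <- function(path) {
  img <- image_load(path, target_size = c(224, 224))
  x   <- image_to_array(img)
  x   <- array_reshape(x, c(1, dim(x)))
  x   <- imagenet_preprocess_input(x)
  as.numeric(predict(cnn, x))
}

features <- t(vapply(paths, extract_features, numeric(512)))

# Cluster the visual embeddings; minPts is a tuning choice, not a magic number
clusters <- hdbscan(features, minPts = 5)
table(clusters$cluster)   # cluster 0 is noise/unassigned
```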

Mustaches, Unibrows, and Shalwar Khameezes

In April 2024, The Verge posted a story about trying to get AI image generators to create a picture of an Asian man with a white woman. The article found that many image generators gave the woman “Asian features.” Because the majority of these generators are trained on datasets that predominantly feature white individuals, the AI struggled to accurately represent an Asian man alongside a white woman without relying on stereotypes and tropes.

At one point, Meta banned keywords related to Asians. Likewise, Google paused Gemini’s ability to generate images of people after concerns about diversity. There have also been some attempts at fixing this bias through the use of “Diversity Fine-tuning.”

Prompt Exploration of AI Image Diffusion Models for Students

For several years now, I have had my students explore AI image generation using text prompts. It’s been fascinating to see how the technology has progressed and what students can create with it. Before the technology became so prominent, I used to have the students do a small competition using Runway.

At first, it was difficult for students to get good results, but with technological improvements, they have been able to create increasingly impressive images. Perhaps more importantly, the initial “wow” factor has diminished. Around 2021-2022, almost no student had tried an image generator. Now, although they tend to use large language models like ChatGPT more, they are generally less amazed by the output.

Exploratory Analysis of Unaccompanied Migrant Children Data

In 2023, the New York Times released data about the number of unaccompanied migrant children who have crossed into the United States. The U.S. Department of Health and Human Services keeps this data, and the NYT gained access to it through the Freedom of Information Act. Conditions for these children are often dire, with many facing violence, abuse, and poverty. As the newspaper notes, American employers have used these children to roof houses and to work overnight shifts at dangerous jobs. Federal agencies frequently ignored numerous warnings about their exploitation.

An Undue Burden: A Look at Digital Humanities Conference Travel

In 2022, I updated my GIS time-lapse map looking at the disproportionate travel that digital humanities scholars in the Global South undertook to attend the annual Digital Humanities conference. The map uses data from the Index of DH Conferences by Matthew D. Lincoln, Scott B. Weingart, and Nickoal Eichmann-Kalwara. Over time, there have been updates to the Index, so the map needed an upgrade.

A little about the visualization for those who are unfamiliar. To create it, I geocoded the institutional affiliations of the authors in the dataset, along with the conference host institutions, using the ArcGIS API through tidygeocoder. When a university affiliation was missing, I used the presenter’s city or country instead. The new map includes data up to July 2024 and covers 60 years of conference history, featuring over 500 events, 8,800 presentations, 10,400 authors, 2,600 institutions, and multiple countries.
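For anyone curious about the geocoding step, here is a minimal sketch of a tidygeocoder call against the ArcGIS service; the affiliations tibble below is invented for illustration:

```r
library(dplyr)
library(tidygeocoder)

# Hypothetical tibble of author affiliations from the Index of DH Conferences
affiliations <- tibble(
  author      = c("Author A", "Author B"),
  affiliation = c("University of Toronto", "Universidade de São Paulo")
)

# Turn institution names into coordinates via the ArcGIS geocoding service
located <- affiliations %>%
  geocode(address = affiliation, method = "arcgis",
          lat = latitude, long = longitude)
```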

Hello World (Minimal Computing Edition)

While this site may currently look like spam, I am in the process of moving it from WordPress to Jekyll and hosting it on GitHub. I find WordPress significantly easier to use than Jekyll, but I think the latter provides a better experience for the data science and cultural analytics work I do.

With WordPress, when you have a lot of image embeddings or interactive visualizations, the site can become slow to load. While I have been critical of minimal computing for its emphasis on “speed” and “longevity,” I have to admit that it does open the door for those who may not have the resources to engage with more dominant, resource-heavy computing environments.