Reverse Engineering the Viral Texts Project

Introduction

This week in Hacking the Humanities, I took a closer look at a published digital humanities project called the Viral Texts Project. This project provides data, visualizations, and interactive exhibits showing how nineteenth-century newspaper articles were copied and reprinted, in order to assess which qualities of an article contributed to its virality. I have included a screenshot below of one of the main features of the project: a connected graph with nodes representing newspapers and edges representing reprints between different newspapers. While mainly geared toward an academic audience, the site presents its data in a way that remains approachable to anybody interested in the virality of nineteenth-century newspapers.

Graph view of article reprints cataloged by the Viral Texts Project

Data & Processing

Interestingly, rather than using data from an external source, the Viral Texts Project grew out of an earlier initiative funded by the National Endowment for the Humanities’ Office of Digital Humanities, called Infectious Texts. This precursor project sought to develop the main text reuse discovery algorithm, and the data it generated serves as the data behind the Viral Texts Project. Over time, new research groups have collaborated with the team behind Infectious Texts and the Viral Texts Project, continuing to improve the algorithm and provide new lenses for data analysis.

While the site linked above mainly serves to visually represent the researchers’ findings, it also has a page linking to the many research publications in which the researchers document the details of their algorithms, such as the original paper published in 2013. In that paper, the researchers describe the algorithms they used to determine whether articles showed similarity. Specifically, they used the natural language processing technique of N-grams (subsequences of N words) to find snippets of text repeated across multiple articles, along with other features that can be extracted from these computed N-grams, such as an N-gram’s position in the document.
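To make the idea concrete, here is a minimal sketch of how overlapping word N-grams can signal text reuse between two documents. This is an illustration of the general technique, not the project's actual pipeline (which works at a much larger scale and uses additional features); the sample article texts are made up.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams (as tuples) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(a, b, n=5):
    """N-grams appearing in both documents -- a rough reuse signal."""
    return ngrams(a, n) & ngrams(b, n)

# Two hypothetical article snippets, one partially reprinting the other
article1 = "the quick brown fox jumps over the lazy dog near the river"
article2 = "reports say the quick brown fox jumps over the lazy dog again"

overlap = shared_ngrams(article1, article2)
print(len(overlap))  # prints 5: the five 5-grams inside the shared 9-word run
```

A long run of shared N-grams like this is much stronger evidence of reprinting than a few shared words, which is why N-gram matching works well even on noisy OCR text.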

Once passages from the documents are determined to be sufficiently similar and aligned passage pairs are formed, the researchers cluster the similar passages into equivalence classes, which can then be used to determine links between specific newspapers.
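Grouping pairwise matches into equivalence classes is a classic use of a union-find (disjoint-set) structure: if passage A matches B and B matches C, all three land in one cluster. The sketch below illustrates this step with hypothetical passage IDs; it is not the project's own code.

```python
from collections import defaultdict

class DisjointSet:
    """Minimal union-find with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical aligned passage pairs: "paper:passage" identifiers
pairs = [("A:p1", "B:p7"), ("B:p7", "C:p3"), ("D:p2", "E:p9")]

ds = DisjointSet()
for a, b in pairs:
    ds.union(a, b)

# Collect equivalence classes: each cluster is one reprinted text
clusters = defaultdict(set)
for passage in ds.parent:
    clusters[ds.find(passage)].add(passage)

print(sorted(sorted(c) for c in clusters.values()))
# [['A:p1', 'B:p7', 'C:p3'], ['D:p2', 'E:p9']]
```

Each resulting cluster represents one text and all the newspapers that printed it, which is exactly the information needed to draw edges between newspapers in the network view.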


Presentation

The researchers use two main methods of displaying their data on the Viral Texts Project website. The first is a cluster search tool, which displays all clusters of text passages containing the entered search term. However, when I attempted to use this tool, I received an internal server error, suggesting that it is no longer maintained or at least needs some attention. The other tool the Viral Texts Project provides is the graph viewer, which lets users browse newspapers and the individual reprint connections found between them. The screenshot above is from this graph viewer.
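The graph viewer's underlying structure can be sketched simply: newspapers are nodes, and each reprint cluster shared by two papers adds weight to the edge between them. The snippet below builds such a weighted edge list from hypothetical reprint records (the newspaper names and counts are illustrative, not the project's data).

```python
from collections import Counter

# Hypothetical reprint records: each tuple is a pair of newspapers
# that printed the same passage (one record per shared cluster)
reprints = [
    ("New-York Tribune", "Boston Daily Atlas"),
    ("New-York Tribune", "Daily Picayune"),
    ("Boston Daily Atlas", "Daily Picayune"),
    ("New-York Tribune", "Boston Daily Atlas"),
]

# Sort each pair so the edge is undirected, then count shared reprints
edges = Counter(tuple(sorted(pair)) for pair in reprints)

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}  (shared reprints: {weight})")
```

Feeding an edge list like this into a graph library (e.g. NetworkX or a browser-side visualization toolkit) produces exactly the kind of network diagram shown in the screenshot, with edge thickness reflecting how often two papers reprinted the same material.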


Conclusion

The Viral Texts Project represents a successful digital humanities project and a collaboration between specialists in many different fields. Through its data collection and visualization techniques, this project contributes to the body of knowledge around the virality of newspapers in the nineteenth century, and serves as a high-quality example for future digital humanities projects.

1 thought on “Reverse Engineering the Viral Texts Project”

  1. Wow, that is such a cool website! It’s interesting how vast the digitized collection of old newspapers must be for them to run their analysis. I had never heard of N-grams before this post; that’s a clever algorithm for looking at cross-citing of articles. It’s a shame the cluster search tool didn’t work.
