Introduction
This week in Hacking the Humanities, I took a closer look at a published digital humanities project called the Viral Texts Project. The project provides data, visualizations, and interactive exhibits showing how nineteenth-century newspaper articles were copied and reprinted, in order to assess what qualities of an article contributed to its virality. I have included a screenshot below to show one of the main features of the project: a connected graph with nodes representing newspapers and edges representing reprints between different newspapers. While mainly geared towards an academic audience, the site represents its data in a way that is still approachable to anybody interested in the virality of nineteenth-century newspapers.

Data & Processing
Interestingly, rather than using data from an external source, the Viral Texts Project grew out of an earlier initiative funded by the National Endowment for the Humanities’ Office of Digital Humanities, called Infectious Texts. This precursor project sought to develop the main text reuse discovery algorithm, and the data it generated became the data behind the Viral Texts Project. Over time, new research groups have collaborated with the team behind Infectious Texts and the Viral Texts Project, continuing to improve the algorithm and provide new lenses for data analysis.
While the site linked above mainly serves to visually represent the researchers’ findings, it also has a page linking to the many research publications in which the researchers document the details of their algorithms, such as the original paper published in 2013. In that paper, the researchers describe the algorithms they used to determine whether articles showed similarity. Specifically, they used the natural language processing technique of N-grams (subsequences of N words) to find snippets of text repeated across multiple articles, along with other features that can be extracted from these computed N-grams, such as an N-gram’s position in the document.
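To make the N-gram idea concrete, here is a minimal sketch of how shared word N-grams can surface repeated snippets between two texts. This is an illustration of the general technique, not the project’s actual implementation, and the sample sentences are invented for the example:

```python
def ngrams(text, n=5):
    """Return the set of overlapping word N-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Two hypothetical article snippets that partially reprint each other.
article_a = "the quick brown fox jumps over the lazy dog near the river"
article_b = "a quick brown fox jumps over the lazy dog today"

# N-grams appearing in both texts are evidence of possible reuse.
shared = ngrams(article_a) & ngrams(article_b)
```

With N = 5, the overlap here is the run of shared 5-grams covering “quick brown fox jumps over the lazy dog”; in practice a detector would also weigh where those N-grams fall in each document.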
Once passages from the documents were determined to be sufficiently similar and aligned passage pairs were formed, the researchers clustered the similar passages into equivalence classes, which could then be used to determine links between specific newspapers.
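One common way to merge pairwise matches into equivalence classes is union-find: if passage A aligns with B, and B with C, all three end up in one cluster. The sketch below assumes this approach for illustration; the paper’s own clustering procedure may differ in its details:

```python
def cluster_pairs(pairs):
    """Merge aligned passage pairs into equivalence classes via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    # Group every passage under its cluster's representative.
    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

For example, the aligned pairs (A, B), (B, C), and (D, E) would yield the two equivalence classes {A, B, C} and {D, E}, and each class could then be mapped to the set of newspapers that printed its passages.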
Presentation
The researchers use two main methods of displaying their data on the Viral Texts Project website. The first is a cluster search tool, which displays all clusters of text passages containing an entered search term. However, when attempting to use this tool, I received an internal server error, suggesting that it is no longer maintained or at least needs some attention. The other tool the Viral Texts Project provides is the graph viewer, which lets users browse newspapers and the individual reprint connections found between them. The screenshot above is from this graph viewer.
Conclusion
The Viral Texts Project represents a successful digital humanities project and a collaboration between specialists in many different fields. Through its data collection and visualization techniques, the project contributes to the body of knowledge around the virality of nineteenth-century newspapers, and it serves as a high-quality example for future digital humanities projects.