Researchers at the University of Adelaide in South Australia recently published an article in PLOS ONE (an awesome open access, peer-reviewed science journal) entitled “Automated Authorship Attribution Using Advanced Signal Classification Techniques.” The research team worked for over 10 years to develop an automatic authorship detection system that determines authorship from the frequencies of commonly used words.
Here’s the abstract from the fascinating article:
In this paper, we develop two automated authorship attribution schemes, one based on Multiple Discriminant Analysis (MDA) and the other based on a Support Vector Machine (SVM). The classification features we exploit are based on word frequencies in the text. We adopt an approach of preprocessing each text by stripping it of all characters except a-z and space. This is in order to increase the portability of the software to different types of texts. We test the methodology on a corpus of undisputed English texts, and use leave-one-out cross validation to demonstrate classification accuracies in excess of 90%. We further test our methods on the Federalist Papers, which have a partly disputed authorship and a fair degree of scholarly consensus. And finally, we apply our methodology to the question of the authorship of the Letter to the Hebrews by comparing it against a number of original Greek texts of known authorship. These tests identify where some of the limitations lie, motivating a number of open questions for future work. An open source implementation of our methodology is freely available for use at https://github.com/matthewberryman/author-detection.
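The preprocessing and feature step described in the abstract can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `preprocess` and `word_frequencies` are made up for this post, and lowercasing the text and turning line breaks into spaces before stripping are assumptions the abstract does not spell out.

```python
import re
from collections import Counter


def preprocess(text):
    """Reduce a text to the characters a-z and space.

    Assumptions (not stated in the abstract): the text is lowercased first,
    and runs of whitespace such as newlines become a single space so that
    words on adjacent lines do not fuse together.
    """
    lowered = text.lower()
    spaced = re.sub(r"\s+", " ", lowered)
    # Delete everything that is not a-z or a space (punctuation, digits, ...).
    stripped = re.sub(r"[^a-z ]", "", spaced)
    # Deleting characters can leave double spaces behind; collapse them.
    return re.sub(r" +", " ", stripped).strip()


def word_frequencies(text, vocabulary):
    """Return the relative frequency of each vocabulary word in the text."""
    words = preprocess(text).split()
    if not words:
        return [0.0] * len(vocabulary)
    counts = Counter(words)
    return [counts[word] / len(words) for word in vocabulary]


# Hypothetical three-word vocabulary; the paper uses its own function word list.
sample = "The cat, the DOG!\nThe end."
features = word_frequencies(sample, ["the", "of", "end"])
print(preprocess(sample))
print(features)
```

Each text ends up as a fixed-length list of numbers, one per vocabulary word, which is the kind of input a classifier such as MDA or an SVM can work with.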
To test their method, the team first applied it to English texts with known authors and found an accuracy rate of over 90%. Then they applied it to the disputed essays in the Federalist Papers:
The Federalist Papers are a series of 85 political essays published under the name ‘Publius’ in 1788. At first, the real author(s) were a guarded secret, but scholars now accept that Alexander Hamilton, James Madison, and John Jay are the authors. After a while Hamilton and then Madison provided their own lists declaring the authorship [31], [32]. The difference between these two lists is that there are 12 essays that both Madison and Hamilton claimed individually for themselves. So 73 texts might be considered to have known author(s) while 12 are of disputed authorship. These 12 disputed authorship texts are essay numbers 49–58, 62 and 63. An early study carried out by Mosteller and Wallace (1964) concluded that all of the disputed essays were written by Madison, with the possible exception that essay number 55 might be written by Hamilton [10], [33]. Not all researchers agree with this conclusion. Some scholars also suggest that essay number 64, which is normally attributed to Jay, is written by Madison [31], so we also consider essay number 64 as a disputed text. In total, this gives us 13 disputed essays and 72 undisputed essays. Amongst the undisputed texts, 51 essays are written by Hamilton, 14 essays are written by Madison, and 4 essays are written by Jay. Three essays (numbers 18, 19, and 20) are products of collaboration between Hamilton and Madison [34], [35].
The texts are obtained from the Project Gutenberg Archives [28]. We put aside the three essays with collaborative authorship and take the remaining 69 essays as the training dataset. The same function word list (see Table 4) is used for our MDA and SVM classifiers. Because there are three authors, MDA produces two discriminant functions, that are shown in Figure 4. For the Federalist Papers of undisputed authorship, the LOO-CV accuracy is 97.1%, close to the LOO-CV accuracy for the SVM, 95.6%. In both methods the number of function words required to achieve the highest accuracy is 75 words.
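For readers who have not met leave-one-out cross validation (LOO-CV) before, here is a rough Python sketch of the idea: hold out one text, train on the rest, predict the held-out author, and repeat for every text. To keep it dependency-free it uses a simple nearest-centroid classifier as a stand-in; this is not the paper's MDA or SVM, and the function names and the toy feature vectors are invented for illustration.

```python
import math


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose mean feature vector is closest to x.

    A stand-in classifier for this sketch only, not MDA or an SVM.
    """
    sums, counts = {}, {}
    for vec, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    best_label, best_dist = None, math.inf
    for label, acc in sums.items():
        centroid = [s / counts[label] for s in acc]
        dist = math.dist(centroid, x)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and score the guess."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Toy word-frequency vectors for two hypothetical authors "A" and "B".
X = [[0.10, 0.02], [0.11, 0.03], [0.09, 0.02],
     [0.02, 0.10], [0.03, 0.11], [0.02, 0.09]]
y = ["A", "A", "A", "B", "B", "B"]
print(leave_one_out_accuracy(X, y))
```

With 69 training essays, the same loop would run 69 times, and the reported accuracy is the fraction of held-out essays assigned to the right author.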
The program found that “there is a relatively high likelihood that Essay 62 was written by Madison.” However, it was not certain about the authorship of several of the other disputed essays. A possible reason for this is that these essays were “the products of a greater degree of collaboration between the authors, and this remains an open question for future investigation.”