Five years ago, Oxford controversially republished the first three plays of the minor Henriad with Christopher Marlowe as a coauthor. The scholarship may befuddle fans of the Bard, unless they know about natural language processing.
Shakespeare’s Henry VI plays have always occupied a curious place in his canon. Although they are not among the most read of his works, their chief interest for many readers lies in what they reveal about how the upstart crow with “a tiger’s heart wrapped in a player’s hide” developed his talent to become a pillar of world literature. The influence of the playwright Christopher Marlowe has long been seen in these first plays, often with the kicker that they were only poor imitations of his star contemporary. However, recent scholarship has shown that these plays are more than just an imitation of Marlowe: they are, in fact, a collaboration. Understanding how these scholars proved the involvement of Marlowe, whether in the flesh or as a quarto at the Bard’s hand, involves a quick voyage through Bayesian statistics and natural language processing.
The founder of information theory, Claude E. Shannon, once picked up a copy of a Raymond Chandler novel and asked his assistant to guess each letter in a random sentence. Shannon would confirm when the assistant was correct and give the correct letter when she was incorrect. As would be expected, although the first and second letters of a word were often guessed incorrectly, the remainder of the word was easily guessed. From this experiment Shannon concluded that English is about 75% predictable. For example, the average English speaker given the first letter t might guess that h is a more likely letter to follow than x, and given the appearance of q, the u after it is virtually guaranteed.
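Shannon’s intuition about letters can be sketched by simply counting which letters follow which. The snippet below is a minimal illustration, not Shannon’s actual procedure: the sample sentence and the function names are invented for this example, and a real estimate would need a far larger body of text.

```python
from collections import Counter, defaultdict


def next_letter_counts(text):
    """Count, for every letter, which letters follow it inside a word."""
    counts = defaultdict(Counter)
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for first, second in zip(letters, letters[1:]):
            counts[first][second] += 1
    return counts


def next_letter_probability(counts, first, second):
    """Estimate the chance that `second` follows `first`."""
    total = sum(counts[first].values())
    return counts[first][second] / total if total else 0.0


# Invented sample text, used only to make the example runnable.
SAMPLE = (
    "the quick queen thought that the quiet thief had taken "
    "the thing through the quarter without question"
)

counts = next_letter_counts(SAMPLE)
p_th = next_letter_probability(counts, "t", "h")
p_tx = next_letter_probability(counts, "t", "x")
p_qu = next_letter_probability(counts, "q", "u")

print(f"P(h after t) = {p_th:.2f}")
print(f"P(x after t) = {p_tx:.2f}")
print(f"P(u after q) = {p_qu:.2f}")
```

Even on a single sentence, h follows t far more often than x does, and every q is followed by u.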
Using a similar strategy, the scholar Gabriel Egan and a team of computer scientists developed a model to parse the style of an author. In their paper, “Attributing the Authorship of the Henry VI Plays by Word Adjacency”, the authors explain their model’s technique using the following two extracts:
To find the likelihood of “with,” “and,” “one,” and “in” following one another within a window of five words, the following normalized Word Adjacency Networks (WANs) were produced, with each pairing weighted by proximity (the further apart two words are, the less their collocation counts):
Just as in the Shannon example, given the appearance of “with” in the Hamlet extract, “in” was found to be more likely to appear next than “one” or “and.” The focus is on these function words rather than on nouns or adjectives because they give something of a stylistic fingerprint of the author. In an interview about this study for the podcast Shakespeare Unlimited, digital humanist and Folger Director Michael Witmore explains that while we as fluent English speakers may drop these words when parsing a sentence’s meaning, the choice and frequency of these function words (including words like ‘of’, ‘that’, and ‘which’), as well as their position relative to each other, can tell us a lot about the authorship of a text, as they appear to be an unconscious set of preferences specific to an author. Like Marey’s camera rifle, the algorithm can identify linguistic patterns that escape human perception.
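The idea of a Word Adjacency Network can be sketched in a few lines. This is a simplified illustration rather than the study’s own code: the window of five words follows the description above, while the discount factor of 0.75, the punctuation handling, the function name build_wan, and the sample sentence are all assumptions made for this example.

```python
from collections import defaultdict

# The four function words used in the example above.
FUNCTION_WORDS = ("with", "and", "one", "in")


def build_wan(text, targets=FUNCTION_WORDS, window=5, discount=0.75):
    """Build a normalized word adjacency network as a dict of dicts.

    Each time a target word appears, every target word within the next
    `window` words adds weight to the edge between them; the weight
    shrinks by `discount` for each extra word of distance.
    """
    words = [w.strip(".,;:!?'\"").lower() for w in text.split()]
    weights = {source: defaultdict(float) for source in targets}
    for i, word in enumerate(words):
        if word not in weights:
            continue
        for distance in range(1, window + 1):
            if i + distance >= len(words):
                break
            follower = words[i + distance]
            if follower in weights:
                weights[word][follower] += discount ** (distance - 1)

    # Normalize each row so that a word's outgoing edges sum to 1
    # (a word with no outgoing edges keeps a row of zeros).
    wan = {}
    for source, row in weights.items():
        total = sum(row.values())
        wan[source] = {t: (row[t] / total if total else 0.0) for t in targets}
    return wan


# Invented sentence, used only to make the example runnable.
sample = "With one hand in the fire and one in the water, he stood with sword in hand."
for source, row in build_wan(sample).items():
    print(source, {target: round(p, 2) for target, p in row.items()})
```

Each printed row reads as “given this word, how likely is each of the others to turn up shortly afterwards,” which is the fingerprint the model compares between authors.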
To compare two Word Adjacency Networks to each other, the authors of the study used Shannon’s mathematics for relative entropy, which in the authors’ case measured the predictability of a word given an initial word. The mathematics used to weigh these relationships appropriately was derived from the same mathematics as Google’s PageRank algorithm. In the following graph they compare the relative entropy scores from building WANs of “the”, “to”, and “and” for two plays attributed to Shakespeare and two plays attributed to Ben Jonson:
You’ll notice that the entropy scores are lower when the two plays are by the same author and higher when the plays are by different authors. In the authors’ full analysis, they used WANs ranging from fifty to a hundred words around an entire oeuvre, and a WAN for each text in each canon around the most common words. If a model guessed, say, only 50% of the plays’ attributions correctly, they would add the next most common function word to see if it would improve the model, until all 211 target words were used. The set of target words that produced the greatest accuracy between two authors would then be used on a play of unknown authorship. Those familiar with machine learning will quickly recognize the training and testing portions of this experiment. The model was validated using the corpus of six Elizabethan playwrights, attributing each play to the author profile that achieved the lowest relative entropy. When used on a corpus where authorship is not in dispute, the model achieved an accuracy of 93.6%. When attributing whole acts, the model had an accuracy of 93.4%.
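The comparison and attribution steps can be sketched as follows. This is a simplified illustration, not the study’s implementation: it weights every source word equally instead of using the PageRank-style weighting the authors describe, it skips transitions missing from either network, and the author names and probabilities are invented for the example.

```python
import math


def relative_entropy(wan_p, wan_q):
    """Relative entropy (in nats) from network wan_p to network wan_q.

    Simplification: every source word gets equal weight, and
    transitions absent from either network are ignored.
    """
    sources = [s for s in wan_p if s in wan_q]
    total = 0.0
    for source in sources:
        for target, p in wan_p[source].items():
            q = wan_q[source].get(target, 0.0)
            if p > 0 and q > 0:
                total += p * math.log(p / q)
    return total / len(sources) if sources else 0.0


def attribute(text_wan, profiles):
    """Return the author whose profile is closest to the text."""
    return min(profiles, key=lambda author: relative_entropy(text_wan, profiles[author]))


# Hypothetical author profiles and an unknown text (invented numbers).
profiles = {
    "Author A": {
        "with": {"in": 0.6, "and": 0.3, "one": 0.1},
        "and": {"with": 0.5, "in": 0.5},
    },
    "Author B": {
        "with": {"in": 0.2, "and": 0.3, "one": 0.5},
        "and": {"with": 0.8, "in": 0.2},
    },
}
unknown = {
    "with": {"in": 0.55, "and": 0.35, "one": 0.10},
    "and": {"with": 0.45, "in": 0.55},
}

for author, profile in profiles.items():
    print(author, round(relative_entropy(unknown, profile), 4))
print("Attributed to:", attribute(unknown, profiles))
```

The unknown text sits much closer to the first profile than to the second, so it is attributed to that author; the study applies the same lowest-entropy rule to whole plays, acts, and scenes.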
Using such a scene-by-scene approach, the model found that certain scenes in the Henry VI plays had a much lower entropy score relative to the works of Christopher Marlowe than to the works of Shakespeare. The graph below shows the difference between the two authors’ profiles for each scene in Henry VI Part 1; for example, Shakespeare’s profile is 7 cn (centinats, hundredths of a nat) closer to Act 4 Scene 6 than Marlowe’s is.
The study found that almost half of each play in the Henry VI trilogy was more attributable to Marlowe than to Shakespeare, hence Oxford University Press’s decision to credit Christopher Marlowe as a coauthor of these three plays.
The study made me address and correct a lot of the assumptions I had about authorship; although to be completely honest, I was happy to find that my favorite scenes in the plays, particularly the death of Talbot and the Jack Cade Rebellion, were Shakespeare’s. For some humanists, such studies damage the totem of the sole genius, but to my mind, by damaging our presuppositions they can give us a more human picture of Shakespeare.