Title Shakespeare's N-grams at 400: Comparing Content Themes With Four Contemporaries

Encoded by   Vanessa Hannesschläger

Encoded by   Daniel Schopper


The Creative Commons Attribution 4.0 International (CC BY 4.0) License applies to this text.


TEI-encoded texts serve readily as units of content for text analysis, and can easily be up-coded with part-of-speech (PoS) data. For the present work, plays by Shakespeare and four of his contemporaries (James Shirley, Thomas Middleton, Christopher Marlowe, and Ben Jonson) were processed with Northwestern University’s MorphAdorner tool, developed by Philip Burns. XSLT, which is sometimes misunderstood as merely a display or rendering tool but is also an excellent text analysis language, was used to query all PoS-tagged data. Moreover, XSLT readily supports custom statistical functions such as pooled standard deviation and Cohen’s d, which are helpful for comparing standardized differences between the data sets of the various authors.
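As an illustration of the two statistics named above, here is a minimal sketch in Python rather than the author's XSLT. It assumes the textbook definitions (sample variances with n - 1 in the denominator, pooled across both groups); the function names and the input numbers are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled standard deviation of two samples, using sample variances (n - 1)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def cohens_d(a, b):
    """Cohen's d: the difference in means divided by the pooled standard deviation."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)


# Made-up per-play counts for two hypothetical authors (illustration only).
author_x = [2, 4, 6]
author_y = [1, 3, 5]
print(pooled_sd(author_x, author_y))  # 2.0
print(cohens_d(author_x, author_y))   # 0.5
```

Because d is expressed in units of the pooled standard deviation, values computed for different author pairs can be compared directly.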

Experimental routines were then developed to examine speeches in each play for the intersection of two sorts of phenomena:

  • parts of speech at various positions within a word n-gram of a given length, and
  • content themes as identified by the Linguistic Inquiry and Word Count (LIWC) text analysis framework. LIWC includes many content themes, but for this work, the author examined the following: love, motion, body, anger, death, murder, and happiness.
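The intersection described in the two items above can be sketched roughly as follows, in Python rather than the author's XSLT. Everything here is an assumption made for illustration: a speech is represented as a list of (word, tag) pairs, the tags are simplified placeholders rather than MorphAdorner's actual tag set, and the theme word lists are tiny stand-ins rather than the LIWC dictionaries.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def count_matches(tokens, n, pos_tag, theme_words, position=None):
    """Count n-grams that carry `pos_tag` and contain at least one theme word.

    If `position` is None, the tag may occur anywhere in the n-gram;
    otherwise it must occur at that zero-based index.
    """
    count = 0
    for gram in ngrams(tokens, n):
        if position is None:
            has_pos = any(tag == pos_tag for _, tag in gram)
        else:
            has_pos = gram[position][1] == pos_tag
        has_theme = any(word.lower() in theme_words for word, _ in gram)
        if has_pos and has_theme:
            count += 1
    return count


# Hypothetical tagged speech and stand-in theme lexicons (not LIWC data).
speech = [("my", "PRON"), ("love", "NOUN"), ("is", "VERB"),
          ("dead", "ADJ"), ("now", "ADV")]
love_words = {"love", "beloved"}
death_words = {"dead", "death"}

print(count_matches(speech, 3, "NOUN", love_words, position=1))  # 1
print(count_matches(speech, 3, "NOUN", love_words))              # 2
print(count_matches(speech, 3, "NOUN", death_words))             # 1
```

Counts like these, collected per speech or per play, would be the raw material for the comparisons between authors.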

Once the XSLT query routines were developed, it became a fairly simple matter to ask questions such as these:

Which author in the corpus wrote the most 3-grams on a given theme (happiness, religion, etc.) in which one of the words was a noun?

How much does the occurrence of the religion theme differ between Shakespeare's 3-grams and Jonson's, or between Marlowe's and Jonson's?

Which comparisons show the largest and smallest differences?
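One way to approach the last question is to compute a standardized difference for every pair of authors and sort the pairs by its absolute value. The sketch below does this in Python, not the author's XSLT. The author labels "A", "B", and "C" and their per-play values are invented for illustration and say nothing about the actual corpus; `rank_pairs` is a hypothetical helper name.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled = sqrt(((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2))
    return (mean(a) - mean(b)) / pooled


def rank_pairs(values_by_author):
    """Return (author1, author2, d) tuples, largest absolute d first."""
    pairs = [(x, y, cohens_d(values_by_author[x], values_by_author[y]))
             for x, y in combinations(sorted(values_by_author), 2)]
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


# Invented per-play theme counts for three placeholder authors.
values = {"A": [2, 4, 6], "B": [1, 3, 5], "C": [6, 8, 10]}
ranked = rank_pairs(values)
for x, y, d in ranked:
    print(x, y, d)
# B C -2.5
# A C -2.0
# A B 0.5
```

The first entry is the pair with the largest standardized difference and the last entry is the pair with the smallest.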

Most of the texts used here originated in the work of the Text Creation Partnership. Martin Mueller provided additional WordHoard texts for this work.