Gender distribution among the contributors to TEI 2016

We took an interest in the gender distribution among the contributors to the TEI 2016 Conference and Members' Meeting. However, as we had not collected information on the gender the conference contributors identified with, we were confronted with a lack of data.

Luckily, we had tagged the authors' forenames in the TEI XML version of the abstracts. We therefore decided to deduce the contributors' genders from their forenames or, to be more precise, to deduce the gender that each forename is most commonly associated with. To be very clear: we did not make assumptions about the gender of the contributors themselves, but rather analyzed the gender commonly associated with their forenames.

How we genderized the forenames

Finding out the gender commonly associated with a forename is possible when empirical evidence is available, such as lists of names of persons who declared their own gender. We started with such a list provided by Mark Kantrowitz, which is used, for example, in the NLTK package. Unfortunately, there is hardly any documentation on how names were matched to genders in this list. For this reason, we looked for another resource and found genderize.io, a web service to "determine the gender of a first name". The genderize.io website states that its data was assembled by scraping social network profiles, where people can declare their gender themselves. Since genderize.io also provides an API to its data, we chose this service.

Finding an adequate dataset was hard; using it was easy. We did so by writing a simple XQuery script which

  • iterates over all forename-elements,
  • takes the text node of each element,
  • builds a URL out of it,
  • sends a GET-request to the genderize.io-API,
  • parses the response,
  • and adds the parsed response in the form of a type attribute to the forename element.

The only aspect complicating this process was the limit of 1000 free requests per day set by genderize.io, which forced us to split the script into parts.
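The steps above can be sketched in a few lines. The following is a minimal Python sketch, not the XQuery script itself. The function names, the `limit` parameter, and the fallback value "unknown" are assumptions made for illustration; the endpoint, the `name` query parameter, and the `gender` field of the JSON response follow the public genderize.io API as we understand it.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
API_URL = "https://api.genderize.io/"  # assumed endpoint of the public API


def build_url(name):
    """Build the request URL for a single forename."""
    return API_URL + "?" + urllib.parse.urlencode({"name": name})


def http_get(url):
    """Send a GET request and return the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_gender(body):
    """Read the gender from the JSON response.

    The fallback "unknown" is an assumption for names the service
    cannot resolve (the gender field is then null).
    """
    return json.loads(body).get("gender") or "unknown"


def genderize(root, fetch=http_get, limit=1000):
    """Add a type attribute to forename elements that do not have one yet.

    Stops after `limit` requests, so the function can be run again on
    another day to continue where it left off. Returns the number of
    requests that were sent.
    """
    sent = 0
    for forename in root.iter(f"{{{TEI_NS}}}forename"):
        name = (forename.text or "").strip()
        if not name or forename.get("type") is not None:
            continue  # nothing to look up, or already annotated
        if sent >= limit:
            break  # daily quota reached
        forename.set("type", parse_gender(fetch(build_url(name))))
        sent += 1
    return sent
```

Because already annotated elements are skipped, repeated runs resume where the previous one stopped, which is one way to stay within a daily request limit without splitting the script by hand.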

Because the comparison of our contributor name list with genderize.io still left a number of names without an assigned gender, we manually checked the remaining names against genderchecker.com. This website sources its data from the "2001 and 2011 UK Census Data, together with multiple online sources and contributions from our 2m website visitors" and assigns gender accordingly. What was particularly appealing about this database was that it also considers the possibility of unisex names: "If we see just one instance of a name appearing as both male and female, we categorise it as unisex."

This manual comparison of the last unidentified forenames left us with statistically relevant results and three ungendered names: Kiyonori, Sünna, and Tetsuei.

After enriching our data, the fun part started. First, we wanted to know the gender distribution amongst the authors' forenames. This question was answered by gender-distribution.xql, a script that also feeds the matching visualization.

In addition, we wanted to know which gender the authors' forenames are associated with for each individual abstract. This question was answered by the script text-gender.xql, which also feeds its matching visualization.
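Both questions come down to counting the values of the type attribute, once across all abstracts and once per abstract. The following minimal Python sketch illustrates the idea; it is not a port of gender-distribution.xql or text-gender.xql. It assumes that each abstract is available as a separately parsed TEI document, and the function names and the "unknown" label for forenames without a type attribute are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def gender_distribution(root):
    """Count the forename elements of one document by their type attribute."""
    return Counter(
        forename.get("type", "unknown")  # "unknown" is an assumed label
        for forename in root.iter(f"{{{TEI_NS}}}forename")
    )


def distribution_per_abstract(abstracts):
    """Map each abstract identifier to the distribution of its forenames.

    `abstracts` is assumed to be a dict of identifier -> parsed root element.
    """
    return {key: gender_distribution(root) for key, root in abstracts.items()}


def overall_distribution(abstracts):
    """Sum the per-abstract counts into one distribution for the conference."""
    total = Counter()
    for root in abstracts.values():
        total += gender_distribution(root)
    return total
```

Note that a person who contributed to several abstracts is counted once per abstract here; counting each contributor only once would require deduplicating names first.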

What happens now

We invite you to draw your own conclusions about the state of gender equality within the TEI community in 2016 from the graphs we provide.

If you are a contributor to the TEI 2016 Conference and Members' Meeting and disagree with the gender you were assigned, we invite you to edit the data on GitHub.