The gender gap in computer science authorship will close in

Let's unpack that.

Using linear regression, we estimated when the gender gap in computer science authorship would close if current trends continue. The blue bars represent historical data, and the purple regression line represents our projection, which is in line with existing estimates. As the chart shows, parity is expected in the year 2097.
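As a rough illustration of how such a projection can be computed, the sketch below fits an ordinary least-squares line to yearly female-to-male author ratios and solves for the year at which the fitted line reaches 1.0 (parity). The yearly values are invented for the example and are not our DBLP figures, and the names `fit_line` and `parity_year` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def parity_year(years, ratios, target=1.0):
    """Year at which the fitted line reaches `target`, or None if it never does."""
    slope, intercept = fit_line(years, ratios)
    if slope <= 0:
        return None  # a flat or falling trend never reaches parity
    return (target - intercept) / slope


# Illustrative values only (assumption: not the real data)
years = [2000, 2005, 2010, 2015, 2020]
ratios = [0.10, 0.125, 0.15, 0.175, 0.20]
print(parity_year(years, ratios))
```

A straight-line extrapolation like this is sensitive to the years included in the fit, so the resulting date is best read as an estimate under the assumption that the trend stays linear.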


Let's look at gender demographics

Here we compare growth rates by taking the year-over-year difference in publication counts for each gender. In this graph, the female growth rate is higher than the male growth rate in certain years, but not consistently enough to conclude that the problem is improving.
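One way to express this comparison is sketched below: for each gender, the change in publication count from one year to the next is divided by the earlier year's count. Dividing by the earlier count is an assumption about how the rate is normalized, and the counts and the name `yearly_growth` are made up for illustration.

```python
def yearly_growth(counts):
    """Map each year to its relative change versus the previous recorded year.

    counts: dict of year -> number of publications.
    """
    years = sorted(counts)
    growth = {}
    for prev, cur in zip(years, years[1:]):
        if counts[prev] > 0:  # skip years with no baseline to compare against
            growth[cur] = (counts[cur] - counts[prev]) / counts[prev]
    return growth


# Illustrative counts only (assumption: not the real data)
female_counts = {2018: 100, 2019: 130, 2020: 143}
male_counts = {2018: 1000, 2019: 1100, 2020: 1320}

female_growth = yearly_growth(female_counts)
male_growth = yearly_growth(male_counts)
for year in female_growth:
    print(year, female_growth[year], male_growth[year])
```

In this toy data the female rate is higher in 2019 and lower in 2020, which is the kind of inconsistency described above.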

How have gender ratios in authorship changed?

This visual shows the average ratio of female to male collaborators per publication for each year. The ratio has increased steadily over the years but remains extremely low, at around 0.2.
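A minimal sketch of this metric follows, assuming each publication is available as a year plus a list of author gender labels. Publications with no male authors are skipped because the ratio is undefined there; that handling is an assumption, as are the sample records and the name `mean_ratio_by_year`.

```python
from collections import defaultdict


def mean_ratio_by_year(publications):
    """Average female-to-male author ratio per publication, grouped by year.

    publications: iterable of (year, list of gender labels).
    """
    ratios = defaultdict(list)
    for year, genders in publications:
        males = genders.count("male")
        females = genders.count("female")
        if males == 0:
            continue  # ratio undefined; skipping is an assumption
        ratios[year].append(females / males)
    return {year: sum(vals) / len(vals) for year, vals in sorted(ratios.items())}


# Illustrative records only (assumption: not the real data)
publications = [
    (2019, ["male", "male", "female"]),
    (2019, ["male"]),
    (2020, ["female", "male"]),
    (2020, ["male", "male", "male", "male"]),
    (2020, ["female"]),
]
print(mean_ratio_by_year(publications))
```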

How do career lengths compare between genders?

We computed the average authorship career length, in days, for men and women in this field. The results confirmed that female authors have shorter careers on average: the average career of a female author is 27% shorter than that of a male author.
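The sketch below shows one way such a comparison could be computed, assuming career length is defined as the number of days between an author's first and last recorded publication. That definition, the sample records, and the name `mean_career_days` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def mean_career_days(records):
    """Average days between first and last publication, grouped by gender.

    records: iterable of (author, gender, publication_date).
    """
    first, last, gender_of = {}, {}, {}
    for author, gender, pub_date in records:
        gender_of[author] = gender
        first[author] = min(first.get(author, pub_date), pub_date)
        last[author] = max(last.get(author, pub_date), pub_date)
    spans = defaultdict(list)
    for author, gender in gender_of.items():
        spans[gender].append((last[author] - first[author]).days)
    return {gender: sum(days) / len(days) for gender, days in spans.items()}


# Illustrative records only (assumption: not the real data)
records = [
    ("A", "female", date(2010, 1, 1)),
    ("A", "female", date(2010, 1, 31)),
    ("B", "male", date(2010, 1, 1)),
    ("B", "male", date(2010, 2, 10)),
]
means = mean_career_days(records)
shortfall = 1 - means["female"] / means["male"]
print(means, shortfall)
```

Under this definition, authors with a single publication have a career length of zero days, so how they are treated can noticeably shift the averages.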

Where does our data come from?

We downloaded our data from the DBLP computer science bibliographic database. It contains records for over six and a half million publications by more than three million authors. The raw data came as an XML file, which we converted to a CSV file using this Python package.
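To give a feel for the conversion step, here is a small sketch that flattens DBLP-style XML records into CSV rows using only the Python standard library. It is not the package we used. The sample records, the chosen columns, the `|` separator for authors, and the name `records_to_csv` are all assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the style of DBLP records (not real entries)
SAMPLE = """<dblp>
  <article key="journals/x/Doe20" mdate="2020-05-01">
    <author>Jane Doe</author>
    <author>John Roe</author>
    <title>An Example Title.</title>
    <year>2020</year>
  </article>
  <inproceedings key="conf/y/Roe19" mdate="2019-03-02">
    <author>John Roe</author>
    <title>Another Example.</title>
    <year>2019</year>
  </inproceedings>
</dblp>"""


def records_to_csv(xml_text, out):
    """Write one CSV row per publication record found under the root element."""
    writer = csv.writer(out)
    writer.writerow(["key", "type", "year", "title", "authors"])
    for record in ET.fromstring(xml_text):
        authors = [a.text or "" for a in record.findall("author")]
        writer.writerow([
            record.get("key", ""),
            record.tag,  # e.g. "article" or "inproceedings"
            record.findtext("year", default=""),
            record.findtext("title", default=""),
            "|".join(authors),  # several authors packed into one cell
        ])


buffer = io.StringIO()
records_to_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

The full DBLP dump is far too large to parse in one call like this and relies on a DTD for character entities, so a real conversion would likely need a streaming parser and entity handling.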


Because our data did not include any information about the authors' gender, we used the Gender Guesser Python package. It takes a person's first name and returns one of male, mostly male, female, mostly female, androgynous, or unknown. To reduce the potential harm of incorrectly guessing an author's gender, we used only the data points labeled male or female. To check the package's accuracy on our data, we hand-checked the gender identity of one hundred authors in our dataset, and the gender returned by the package was correct in all one hundred cases. We therefore felt comfortable proceeding with our analysis using these predicted genders.
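The filtering rule described above can be sketched as follows. To keep the example runnable without extra installs, it uses a toy lookup table in place of the real package; `keep_confident`, `toy_guess`, and the sample names are hypothetical. With the gender-guesser package installed, `gender_guesser.detector.Detector().get_gender` could be passed as `guess` instead (it reports labels such as `mostly_male` and uses `andy` for androgynous names).

```python
def keep_confident(names, guess):
    """Keep only authors whose first name is labeled exactly 'male' or 'female'.

    names: iterable of full author names.
    guess: callable mapping a first name to a gender label.
    """
    kept = {}
    for full_name in names:
        parts = full_name.split()
        if not parts:
            continue  # ignore empty names
        label = guess(parts[0])
        if label in ("male", "female"):  # drop mostly_*, androgynous, unknown
            kept[full_name] = label
    return kept


def toy_guess(first_name):
    """Stand-in for a real name-based classifier (assumption, demo only)."""
    table = {
        "Maria": "female",
        "John": "male",
        "Robin": "mostly_male",
        "Kim": "andy",
    }
    return table.get(first_name, "unknown")


names = ["Maria Lopez", "John Smith", "Robin Lee", "Kim Park", "Zyx Abc"]
print(keep_confident(names, toy_guess))
```

Keeping only the two confident labels trades coverage for accuracy: every author with an ambiguous or unrecognized first name drops out of the analysis.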


Once we had the data in an accessible format, we dropped any columns in which the majority of values were missing. We also converted dates to a consistent format and used one-hot encoding where necessary. We then performed various straightforward calculations on different columns to create the visualizations displayed on this site. To create our linear regression model, we computed the ratio of female authors to male authors for each year in our data set and used those yearly ratios to predict how the ratio would change over time. According to our model, the community of tech authors will reach gender parity in 2098. This date is consistent with what the World Economic Forum has forecast for this industry.
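The cleaning steps described above might look roughly like the sketch below, written with the standard library only. The column names (`mdate`, `type`, `note`), the date format, the sample rows, and the function names are assumptions for illustration and do not reflect our actual schema or code.

```python
from datetime import date, datetime


def drop_sparse_columns(rows, threshold=0.5):
    """Drop columns whose share of missing values (None or '') exceeds threshold."""
    keep = []
    for column in rows[0]:
        missing = sum(1 for row in rows if row.get(column) in (None, ""))
        if missing / len(rows) <= threshold:
            keep.append(column)
    return [{column: row[column] for column in keep} for row in rows]


def clean(rows):
    """Drop sparse columns, parse dates, and one-hot encode the record type."""
    rows = drop_sparse_columns(rows)
    categories = sorted({row["type"] for row in rows})
    cleaned = []
    for row in rows:
        out = {"mdate": datetime.strptime(row["mdate"], "%Y-%m-%d").date()}
        for category in categories:
            # One indicator column per category: 1 if it matches, else 0
            out[f"type_{category}"] = int(row["type"] == category)
        cleaned.append(out)
    return cleaned


# Illustrative rows only (assumption: not the real data)
rows = [
    {"mdate": "2020-05-01", "type": "article", "note": ""},
    {"mdate": "2019-03-02", "type": "inproceedings", "note": ""},
    {"mdate": "2018-11-20", "type": "article", "note": "x"},
]
cleaned = clean(rows)
print(cleaned)
```

Here the `note` column is removed because two of its three values are missing, which is more than half.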

About the project

Meet the team

Mustafa Abdulkadir

Project Manager & Data Analyst

LinkedIn Github
Molly Stark

Data Scientist & Back End Developer

LinkedIn Github
Marie O'Connell

Full Stack Developer & UI/UX Designer

LinkedIn Github
Luka Marceta

Data Scientist & Back End Developer

LinkedIn Github
Jameson Pastor

Data Scientist & Back End Developer

LinkedIn Github

We are a team of Informatics students at the Information School at the University of Washington. Our project is called "Analyzing The Gender Gap in Computer Science Authorship." Through this project, we seek to analyze gender demographics in this field, find patterns, and understand why they exist. We hope you enjoy it.


This project is sponsored by Marlina Hales.
