Transaction: pv_aaVsy-XYZv-seBlgvMZw2S_b8_yNgFEvQIsAxkiQ

Data:

## Abstract

Biases against women in the workplace have been documented in a variety of studies. This paper presents a large-scale study on gender bias, where we compare acceptance rates of contributions from men versus women in an open source software community. Surprisingly, our results show that women’s contributions tend to be accepted more often than men’s. However, for contributors who are outsiders to a project and whose gender is identifiable, men’s acceptance rates are higher. Our results suggest that although women on GitHub may be more competent overall, bias against them exists nonetheless.

## Acknowledgements

Special thanks to Denae Ford for her help throughout this research project. Thanks to the Developer Liberation Front for their reviews of this paper. For their helpful discussions, thanks to Tiffany Barnes, Margaret Burnett, Tim Chevalier, Aaron Clauset, Julien Couvreur, Prem Devanbu, Ciera Jaspan, Saul Jaspan, David Jones, Jeff Leiter, Ben Livshits, Titus von der Malsburg, Peter Rigby, David Strauss, Bogdan Vasilescu, and Mikael Vejdemo-Johansson. For their helpful critiques during the peer review process, thanks to Lynn Conway, Caroline Simard, and the anonymous reviewers.

## Materials and Methods

### GitHub scraping

An initial analysis of GHTorrent pull requests showed that our pull request merge rate was significantly lower than that presented in prior work on pull requests^[[Gousios, Pinzger & Deursen, 2014](https://scholar.google.com/scholar_lookup?title=An%20exploratory%20study%20of%20the%20pull-based%20software%20development%20model&author=Gousios&publication_year=2014)]. We found a solution to the problem that calculated pull request status using a different technique, which yielded a pull request merge rate comparable to prior work. However, in a manual inspection of pull requests, we noticed that several calculated pull request statuses were different from the statuses indicated on the https://github.com website. As a consequence, we wrote a web scraping tool that automatically downloaded the pull request HTML pages, parsed them, and extracted data on status, pull request message, and comments on the pull request. We performed this process for all pull requests submitted by GitHub users that we had labeled as either a man or woman. In the end, the pull request acceptance rate was 74.8% for all processed pull requests.

We determined whether a pull requestor was an insider or an outsider during our scraping process because the data was not available in the GHTorrent dataset. We classified a user as an insider when the pull request explicitly listed the person as a collaborator or owner (https://help.github.com/articles/what-are-the-different-access-permissions/#user-accounts), and classified them as an outsider otherwise. This analysis has inaccuracies because GitHub users can change roles from outsider to insider and vice versa. As an example, about 5.9% of merged pull requests from both outsider female and male users were merged by the outsider pull-requestor themselves, which is not possible, since outsiders by definition do not have the authority to self-merge. We emailed such an outsider, who indicated that, indeed, she was an insider when she made that pull request. We attempted to mitigate this problem by using a technique similar to that used in prior work^[[Yu et al., 2015](https://scholar.google.com/scholar_lookup?title=Wait%20for%20it:%20determinants%20of%20pull%20request%20evaluation%20latency%20on%20GitHub&author=Yu&publication_year=2015)]. Among contributors that we initially marked as outsiders, for a given pull request on a project, we instead classified them as insiders when they met any of three conditions. The first condition was that they had closed an issue on the project within 90 days prior to opening the given pull request. The second condition was that they had merged the given pull request or any other pull request on the project in the prior 90 days. The third condition was that they had closed any pull request that someone else had opened in the prior 90 days. Meeting any of these conditions implies that, even if the contributor was an outsider at the time of our scraping, they were probably an insider at the time of the pull request.
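The three reclassification conditions can be illustrated with a short sketch. This is not the authors' code: the `Event` record, its field names, the event kinds, and the function `is_probable_insider` are hypothetical names chosen for illustration, and the exact boundary handling of the 90-day window is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

# 90-day look-back window described in the text.
WINDOW = timedelta(days=90)


@dataclass(frozen=True)
class Event:
    """Hypothetical record of one action on a project."""
    kind: str                 # "issue_closed", "pr_merged", or "pr_closed"
    actor: str                # login of the user who performed the action
    pr_author: Optional[str]  # login of the PR opener; None for issues
    when: datetime


def is_probable_insider(user: str,
                        pr_opened_at: datetime,
                        pr_merged_by: Optional[str],
                        project_events: Iterable[Event]) -> bool:
    """Return True if an apparent outsider meets any of the three conditions."""
    # Condition 2 (first half): the user merged the given pull request.
    if pr_merged_by == user:
        return True
    window_start = pr_opened_at - WINDOW
    for event in project_events:
        if event.actor != user:
            continue
        # Assumed window: [opened - 90 days, opened).
        if not (window_start <= event.when < pr_opened_at):
            continue
        # Condition 1: closed an issue on the project.
        if event.kind == "issue_closed":
            return True
        # Condition 2 (second half): merged any other pull request.
        if event.kind == "pr_merged":
            return True
        # Condition 3: closed a pull request opened by someone else.
        if event.kind == "pr_closed" and event.pr_author != user:
            return True
    return False
```

A contributor who only closed their own pull request, or whose activity falls outside the window, stays classified as an outsider under this sketch.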

### Gender linking

To evaluate gender bias on GitHub, we first needed to determine the genders of GitHub users.

Our technique uses several steps to determine the genders of GitHub users. First, from the GHTorrent data set, we extract the email addresses of GitHub users. Second, for each email address, we use the search engine in the Google+ social network to search for users with that email address. The search works for both Google users’ email addresses (_@gmail.com_), as well as other email addresses (such as _@ncsu.edu_). Third, we parse the returned users’ ‘About’ page to scrape their gender. Finally, we include only the genders ‘Male’ and ‘Female’ (334,578 users who make pull requests) because there were relatively few other options chosen (159 users). We also automated and parallelized this process. This technique capitalizes on several properties of the Google+ social network. First, if a Google+ user signed up for the social network using an email address, the search results for that email address will return just that user, regardless of whether that email address is publicly listed or not. Second, signing up for a Google account currently requires you to specify a gender (though ‘Other’ is an option) (https://accounts.google.com/SignUp), and, in our discussion, we interpret their use of ‘Male’ and ‘Female’ in gender identification (rather than sex) as corresponding to our use of the terms ‘man’ and ‘woman’. Third, when Google+ was originally launched, gender was publicly visible by default (http://latimesblogs.latimes.com/technology/2011/07/google-plus-users-will-soon-be-able-to-opt-out-of-sharing-gender.html).

### Merged pull requests

Throughout this study, we measure pull requests that are accepted by calculating developers’ merge rates, that is, the number of pull requests merged divided by the sum of the number of pull requests merged, closed, and still open. We include pull requests still open in the denominator in this calculation because pull requests that are still open could be indicative of a pull requestor being ignored, which has the same practical impact as rejection.
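As a minimal sketch of this definition, the merge rate can be computed as below. The function name `merge_rate` and its parameter names are illustrative, and `closed` is assumed to mean pull requests closed without being merged.

```python
def merge_rate(merged: int, closed: int, still_open: int) -> float:
    """Merged pull requests divided by merged + closed + still open."""
    total = merged + closed + still_open
    if total == 0:
        raise ValueError("developer has no pull requests")
    return merged / total


# A developer with 3 merged, 1 closed, and 0 open pull requests:
print(merge_rate(3, 1, 0))  # 0.75
```

Because open pull requests sit in the denominator, a developer with 3 merged and 1 still-open pull request gets the same rate as one with 3 merged and 1 closed.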

### Project licensing

To determine whether a project uses an open source license, we used an experimental GitHub API that uses heuristics to determine a project’s license (https://developer.github.com/v3/licenses/). We classified a project (and thus the pull request on that project) as open source if the API reported a license that the Open Source Initiative considers in compliance with the Open Source Definition (https://opensource.org/licenses), which were afl-3.0, agpl-3.0, apache-2.0, artistic-2.0, bsd-2-clause, bsd-3-clause, epl-1.0, eupl-1.1, gpl-2.0, gpl-3.0, isc, lgpl-2.1, lgpl-3.0, mit, mpl-2.0, ms-pl, ms-rl, ofl-1.1, and osl-3.0. Projects were not considered open source if the API did not return a license for a project, or the license was bsd-3-clause-clear, cc-by-4.0, cc-by-sa-4.0, cc0-1.0, other, unlicense, or wtfpl.
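The classification rule can be sketched as a simple set lookup. The sketch assumes the license key has already been retrieved from the API as a lowercase string, or `None` when no license was returned; the function name `is_open_source` is hypothetical, and no API call is made here.

```python
from typing import Optional

# License keys the text lists as compliant with the Open Source Definition.
OSI_COMPLIANT = frozenset({
    "afl-3.0", "agpl-3.0", "apache-2.0", "artistic-2.0", "bsd-2-clause",
    "bsd-3-clause", "epl-1.0", "eupl-1.1", "gpl-2.0", "gpl-3.0", "isc",
    "lgpl-2.1", "lgpl-3.0", "mit", "mpl-2.0", "ms-pl", "ms-rl", "ofl-1.1",
    "osl-3.0",
})


def is_open_source(license_key: Optional[str]) -> bool:
    """Classify a project from its reported license key.

    A missing license (None) and any key outside the compliant set,
    such as "unlicense" or "cc0-1.0", are treated as not open source.
    """
    return license_key in OSI_COMPLIANT
```

An allow-list keeps the rule conservative: any key the text does not list as compliant, including ones not mentioned at all, falls into the not-open-source group.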

### Determining gender neutral and gendered profiles

To determine gendered profiles, we first parsed GitHub profile pages to determine whether each user was using a profile image or an identicon. Of the users who performed at least one pull request, 213,882 used a profile image and 104,648 used an identicon. We then ran display names and login names through a gender inference program, which maps a name to a gender.

***

_The article was trimmed for testing purposes._

***

### Authors

- Josh Terrell[^1]

- Andrew Kofink[^2]

- Justin Middleton[^2]

- Clarissa Rainear[^2]

- Emerson Murphy-Hill[^2]

- Chris Parnin[^2]

- Jon Stallings[^3]

### Academic Editor

[Arie van Deursen](https://peerj.com/articles/cs-111/editor-1)

### Keywords

`Gender`, `Bias`, `Open source`, `Software development`, `Software engineering`

### Copyright

© 2017 Terrell et al.

### Licence

This is an open access article distributed under the terms of the [Creative Commons Attribution License](http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

[^1]: Department of Computer Science, California Polytechnic State University—San Luis Obispo, San Luis Obispo, CA, United States

[^2]: Department of Computer Science, North Carolina State University, Raleigh, NC, United States

[^3]: Department of Statistics, North Carolina State University, Raleigh, NC, United States

Tags:
App-Name:Academic
Article-Title:Gender differences and bias in open source: pull request acceptance of women versus men
Article-Timestamp:1573247492