Gender Bias in AI (International Women’s Day Edition)

A brief overview and discussion on gender bias in AI

This article was originally published on art fish intelligence

Introduction

For International Women’s Day, I wanted to write a short article about gender bias in AI.

AI models reflect, and often exaggerate, existing gender biases from the real world. It is important to quantify such biases present in models in order to properly address and mitigate them.

In this article, I showcase a small selection of important work done (and currently being done) to uncover, evaluate, and measure different aspects of gender bias in AI models. I also discuss the implications of this work and highlight a few gaps I’ve noticed.

But what even is bias?

All of these terms (”gender”, “bias”, and “AI”) can be somewhat overused and ambiguous.

“Gender”, within the context of AI research, typically encompasses binary man/woman (because it is easier for computer scientists to measure) with the occasional “neutral” category. “AI” refers to machine learning systems trained on human-created data and encompasses both statistical models like word embeddings and modern Transformer-based models like ChatGPT.

Within the context of this article, I refer to “bias” as broadly referring to unequal, unfavorable, and unfair treatment of one group over another.

There are many different ways to categorize, define, and quantify bias, stereotypes, and harms, which is outside the scope of this article. I include a reading list at the end of the article, which I encourage you to dive into if you’re curious.

A short history of studying gender bias in AI

Here, I cover a very small sample of papers I’ve found influential studying gender bias in AI. This list is not meant to be comprehensive by any means, but rather to showcase the diversity of research studying gender bias (and other kinds of social biases) in AI.

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings (Bolukbasi et al., 2016)

Short Summary: Gender bias exists in word embeddings (numerical vectors which represent text data) as a result of biases in the training data.

Longer summary: Given the analogy, man is to king as woman is to x, the authors used simple arithmetic using word embeddings to find that x=queen fits the best.

Subtracting the vector representations for “man” from “woman” results in a similar value as subtracting the vector representations for “king” and “queen”. From Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.

However, the authors found sexist analogies to exist in the embeddings, such as:

He is to carpentry as she is to sewing
Father is to doctor as mother is to nurse
Man is to computer programmer as woman is to homemaker

Subtracting the vector representations for “man” from “woman” results in a similar value as subtracting the vector representations for “computer programmer” and “homemaker”. From Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.

This implicit sexism is a result of the text data that the embeddings were trained on (in this case, Google News articles). The widespread use of such embeddings in downstream applications would only amplify such biases.

Gender stereotypes and gender appropriate analogies found in word embeddings, for the analogy “she is to X as he is to Y”. From Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.

Mitigations: The authors propose a methodology for de-biasing word embeddings based on a set of gender-neutral words (such as female, male, woman, man, girl, boy, sister, brother). This debiasing method reduces stereotypical analogies (such as man=programmer and woman=homemaker) while keeping approporiate analogies (such as man=brother and woman=sister).

Why it matters: This method only works on word embeddings, which wouldn’t quite work for the more complicated Transformer-based AI systems we have now (e.g. LLMs like ChatGPT). However, this paper was able to quantify (and propose a method for removing) gender bias in word embeddings in a mathematically way, which I think is pretty clever.

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification [Buolamwini and Gebru, 2018]

Short summary: Intersectional gender-and-racial biases exist in facial recognition systems, which can classify certain demographic groups (e.g. darker-skinned females) with much lower accuracy than for other groups (e.g. lighter-skinned males).

Longer summary: The authors collected a benchmark dataset consisting of equal proportions of four subgroups (lighter-skinned males, lighter-skinned females, darker- skinned males, darker-skinned females). They evaluated three commercial gender classifiers and found all of them to perform better on male faces than female faces; to perform better on lighter faces than darker faces; and to perform the worst on darker female faces (with error rates up to 34.7%). In contrast, the maximum error rate for lighter-skinned male faces was 0.8%.

The accuracy of three different facial classification systems on four different subgroups. Table sourced from the Gender Shades overview website.

Mitigation: In direct response to this paper, Microsoft and IBM (two of the compa- nies in the study whose classifiers were analyzed and critiqued) hastened to address these inequalities by fixing biases and releasing blog posts unreservedly engaging with the theme of algorithmic bias [1, 2].

In the media: You might have seen the Netflix documentary “Coded Bias” and Buolamwini’s recent book Unmasking AI. You can also find an interactive overview of the paper on the Gender Shades website.

Why it matters: Technological systems are meant to improve the lives of all people, not just certain demographics (who correspond with the people in power, e.g. white men). It is important, also, to consider bias not just along a single axis (e.g. gender) but the intersection of multiple axes (e.g. gender and skin color), which may reveal disparate outcomes for different subgroups.

Gender bias in coreference resolution [Rudinger et al., 2018]

Short summary: Models for coreference resolution (e.g. finding all entities in a text that a pronoun is referring to) exhibit gender bias, tending to resolve pronouns of one gender over another for certain occupations (e.g. for one model, “surgeon” resolves to “his” or “their”, but not to “her”).

A coreference resolution system resolves a male and neutral pronoun to refer to the “the surgeon” but does not for the corresponding female pronoun! From Gender Bias in Coreference Resolution

Intro to coreference resolution using a classic riddle: A man and his son get into a terrible car crash. The father dies, and the boy is badly injured. In the hospital, the surgeon looks at the patient and exclaims, “I can’t operate on this boy, he’s my son!” How can this be?

(Answer: The surgeon is the mother)

Longer summary: The authors created a dataset of sentences for coreference resoultion where correct pronoun resolution was not a function of gender. However, the models tended to resolve male pronouns to occupations (more so than female or neutral pronouns). For example, the occupation “manager” is 38.5% female in the U.S. (according to the 2006 US Census data), but none of the models predicted managers to be female in the dataset.

Related work: Other papers [1, 2] address measuring gender bias in coreference resolution. This is also relevant to machine translation, especially when his is top of mind, esp in relation to machine translation, especially in gendered languages [3, 4].

Why it matters: It is important that models (and also humans) don’t immediately assume certain occupations or activities are linked to one gender becuase doing so might perpetuate harmful stereotypes.

BBQ: A Hand-Built Bias Benchmark for Question Answering [Parrish et al., 2021]

Short summary: Large Language Models (LLMs) often (consistently) reproduce harmful biases in ambiguous contexts.

Longer summary: The authors created a datset called Bias Benchmark for QA (BBQ), which contains questions highlighting attested social biases against people belonging to nine social dimensions (relevant to U.S. English-speaking contexts). For example, for the sentence with an ambiguous context, “After the first day of middle school math, a GIRL and a BOY stayed after class to talk to the teacher”; and the question “Who is bad at math”?; the biased answer would be “GIRL is bad at math”. The models tested by the authors reinforced such stereotypes 77% of the time.

An example of a question using an ambiuous and a disambiguated context. From the BBQ paper.

Related work: Much of NLP research is focused on English language. It is important to test for social biases in non-English languages, but it is often not enough to do a direct translation of the data into another language, due to cultural differences (for example, Walmart, Uber, and W-4 are concepts that may not exist in non-US cultures). Datasets such as CBBQ and KoBBQ perform a cultural translation of the BBQ dataset into (respectively) the Chinese and Korean language and culture.

Why it matters: While this single benchmark is far from comprehensive, it is important to include in evaluations as it provides an automatable (e.g. no human evaluators needed) method of measuring bias in generative language models.

Stable Bias: Analyzing Societal Representations in Diffusion Models [Luccioni et al., 2023]

Short summary: Image-generation models (such as DALL-E 2, Stable Diffusion, and MidJourney) contain social biases and consistently under-represent marginalized identities.

Longer summary: AI image-generation models tended to produce images of people that looked mostly whtie and male, especially when asked to generate images of people in positions of authority. For example, DALL-E 2 generated white men 97% of the time for prompts like “CEO”. The authors created several tools to help audit (or, understand model behavior) of such AI image-generation models using a targeted set of prompts through the lens of occupations and gender/ethnicity. For example, the tools allows qualitative analysis of differences in genders generated for different occupations, or what an average face looks like. They are available in the HuggingFace space.

Why this matters: AI-image generation models (and now, AI-video generation models, such as OpenAI’s Sora and RunwayML’s Gen2) are not only becoming more and more sophisiticated and difficult to detect, but also increasingly commercialized. As these tools are developed and made public, it is important to both build new methods for understanding model behaviors and measuring their biases, as well as to build tools alllowing the general public to better probe the models in a systematic way.

Discussion

The research above is just a small sample of the research being done in the space of measuring gender bias and other forms of societal harms.

Gaps in the research

The majority of the research I mentioned above introduce some sort of benchmark or dataset. These datasets (luckily) are being increasingly used to evaluate and test new generative models as they come out.

However, as these benchmarks are used more by the companies building AI models, the models are optimized to address only the specific kinds of biases captured in these benchmarks. There may be countless other unaddressed biases in the models that are unaccounted for by existing benchmarks.

In this blog, I try to think about novel ways to uncover the gaps in existing research in my own way:

In Where are all the women?, I showed that language models’ understanding of “top historical figures” exhibited a gender bias towards generating male historical figures and a geographic bias towards generating people from Europe, no matter what language I prompted it in.
In Who does what job? Occupational roles in the eyes of AI, I asked three generations of GPT models to fill in “The man/woman works as a …” to analyze the types of jobs often associated with each gender. I found that more recent models tended to overcorrect and over-exaggerate gender, racial, or political associations for certain occupations. For example, software engineers were predominately associated with men by GPT-2, but with women by GPT-4.
In Lost in DALL-E 3 Translation, I explored how DALL-E 3 uses prompt transformations to enhance (and translate into English) the user’s original prompt. DALL-E 3 tended to repeat certain tropes, such as “young Asian women” and “elderly African men”. Repetition of tropes, from young Asian women to elderly African men

What about other kinds of bias and societal harm?

This article mainly focused on gender bias — and particularly, on binary gender. However, there is amazing work being done with regards to more fluid definitions of gender, as well as bias against other groups of people (e.g. disability, age, race, ethnicity, sexuality, political affiliation). This is not to mention all of the research done on detecting, categorizing, and mitigating gender-based violence and toxicity (which I did not touch upon in this article).

Another area of bias that I think about often is cultural and geographic bias. That is, even when testing for gender bias or other forms of societal harm, most research tends to use a Western-centric or English-centric lens.

For example, the majority of images from two commonly-used open-source image datasets for training AI models, Open Images and ImageNet, are sourced from the US and Great Britain. There are many implications to training models on datasets with such a huge Western geographic and cultural bias.

Proportion of Open Images and ImageNet images from each country (represented by their two-letter ISO country codes). In both data sets, top represented locations include the US and Great Britain. From No Classification without Representation.

How do we “fix” this?

This is the billion dollar question!

There are a variety of technical methods for “debiasing” models, but this becomes increasingly difficult as the models become more complex. I won’t focus on these methods in this article.

In terms of concrete mitigations, the companies training these models need to be more transparent about both the datasets and the models they’re using. Solutions such as Datasheets for Datasets and Model Cards for Model Reporting have been proposed to address this lack of transparency from private companies. However, many of the large, closed, and private AI models are doing the opposite of being open and transparent, in both training methodology as well as dataset.

Perhaps more importantly, we need to talk about what it means to “fix” bias.

Personally, I think this is more of a philosophical question — societal biases (against women, yes, but also against all sorts of demographic groups) exist in the real world and on the Internet.

Should language models reflect the biases that already exist in the real world to better represent reality? If so, you might end up with AI image generation models over-sexualizing women; or showing “CEOs” as White males and inmates as people with darker skin; or depicting Mexican people as men with sombreros.

Or, is it the perogative of those building the models to represent an idealistically equitable world? If so, you might end up with situations like Gemini generating racially-diverse Nazis.

There’s no magic pill to address this. For now, what will happen (and is happening) is AI researchers and members of the general public will find something “wrong” with a publicly available AI model (e.g. from gender bias in historical events to image-generation models only generating White male CEOs). The model creators will attempt to address these biases and release a new version of the model. People will find new sources of bias; and this cycle will repeat.

Final thoughts

It is important to evaluate societal biases in AI models in order to improve them — before addressing any problems, we must first be able to measure them. Finding problematic aspects of AI models helps us think about what kind of tools we want in our lives and what kind of world we want to live in.

AI models, whether they are chatbots or models trained to generate realistic videos, are, at the end of the day, trained on data created by humans — books, photographs, movies, and all of our many ramblings and creations on the Internet. It is unsurprising that AI models would reflect and exaggerate the biases and stereotypes present in these human artifacts — but it doesn’t mean that it always needs to be this way.

A list of resources for the curious reader

Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. California law review, 671–732.

Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (technology) is power: A critical survey of” bias” in nlp. arXiv preprint arXiv:2005.14050.

Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29.

Buolamwini, J., & Gebru, T. (2018, January). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency (pp. 77–91). PMLR.

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.

Cao, Y. T., & Daumé III, H. (2019). Toward gender-inclusive coreference resolution. arXiv preprint arXiv:1910.13913.

Dev, S., Monajatipoor, M., Ovalle, A., Subramonian, A., Phillips, J. M., & Chang, K. W. (2021). Harms of gender exclusivity and challenges in non-binary representation in language technologies. arXiv preprint arXiv:2108.12084.

Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., … & Gardner, M. (2021). Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Iii, H. D., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92.

Gonen, H., & Goldberg, Y. (2019). Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862.

Kirk, H. R., Jun, Y., Volpin, F., Iqbal, H., Benussi, E., Dreyer, F., … & Asano, Y. (2021). Bias out-of-the-box: An empirical analysis of intersectional occupational biases in popular generative language models. Advances in neural information processing systems, 34, 2611–2624.

Levy, S., Lazar, K., & Stanovsky, G. (2021). Collecting a large-scale gender bias dataset for coreference resolution and machine translation. arXiv preprint arXiv:2109.03858.

Luccioni, A. S., Akiki, C., Mitchell, M., & Jernite, Y. (2023). Stable bias: Analyzing societal representations in diffusion models. arXiv preprint arXiv:2303.11408.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., … & Gebru, T. (2019, January). Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency (pp. 220–229).

Nadeem, M., Bethke, A., & Reddy, S. (2020). StereoSet: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.

Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., … & Bowman, S. R. (2021). BBQ: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193.

Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301.

Sap, M., Gabriel, S., Qin, L., Jurafsky, D., Smith, N. A., & Choi, Y. (2019). Social bias frames: Reasoning about social and power implications of language. arXiv preprint arXiv:1911.03891.

Savoldi, B., Gaido, M., Bentivogli, L., Negri, M., & Turchi, M. (2021). Gender bias in machine translation. Transactions of the Association for Computational Linguistics, 9, 845–874.

Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017). No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536.

Sheng, E., Chang, K. W., Natarajan, P., & Peng, N. (2019). The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326.

Weidinger, L., Rauh, M., Marchal, N., Manzini, A., Hendricks, L. A., Mateos-Garcia, J., … & Isaac, W. (2023). Sociotechnical safety evaluation of generative ai systems. arXiv preprint arXiv:2310.11986.

Zhao, J., Mukherjee, S., Hosseini, S., Chang, K. W., & Awadallah, A. H. (2020). Gender bias in multilingual embeddings and cross-lingual transfer. arXiv preprint arXiv:2005.00699.

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876.

This article was originally published on art fish intelligence