## Abstract

Multiple measures, such as WEAT or MAC, attempt to quantify the magnitude of bias
present in word embeddings in terms of a single-number metric. However, such
metrics and the related statistical significance calculations rely on treating
pre-averaged data as individual data points and utilizing bootstrapping
techniques with low sample sizes. We show that similar results can be easily
obtained using such methods even if the data are generated by a null model
lacking the intended bias. Consequently, we argue that this approach generates
false confidence. To address this issue, we propose a Bayesian alternative:
hierarchical Bayesian modeling, which enables a more uncertainty-sensitive
inspection of bias in word embeddings at different levels of granularity. To
showcase our method, we apply it to Religion, Gender, and Race word lists from
the original research, together with our control neutral word lists. We deploy
the method using Google, GloVe, and Reddit embeddings. Further, we utilize our
approach to evaluate a debiasing technique applied to the Reddit word embedding.
Our findings reveal a more complex landscape than suggested by the proponents of
single-number metrics. The datasets and source code for the paper are publicly
available.^{1}

## 1 Introduction

It has been suggested that language models can learn implicit biases that reflect
harmful stereotypical thinking (see, for instance, Bolukbasi et al. 2016; Caliskan, Bryson, and Narayanan 2017; Gonen and Goldberg 2019; Lauscher and Glavaš 2019; Garg et al. 2018; Manzini
et al. 2019). For example, the (vector
corresponding to the) word *she* might be much closer in the vector
space to the word *cooking* than the word *he*. Such
phenomena are undesirable at least in some downstream tasks, such as Web search,
recommendations, and so on. To investigate such issues, several measures of bias in
word embeddings have been formulated and applied. Our goal is to use two prominent
examples of such measures to argue that this approach oversimplifies the situation
and to develop a Bayesian alternative.

A common approach in natural language processing is to represent words by vectors of
real numbers—such representations are called **embeddings**. One way
to construct an embedding—we will focus our attention on non-contextual
language models^{2}—is to use a
large corpus to train a neural network to assign vectors to words in a way that
optimizes for co-occurrence prediction accuracy. Such vectors can then be compared
in terms of their similarity—the usual measure is cosine
similarity—and the results of such comparisons can be used in downstream
tasks. Roughly speaking, cosine similarity is an imperfect mathematical proxy for
semantic similarity (Mikolov et al. 2013).
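To make the notions of cosine similarity and cosine distance concrete, here is a minimal sketch. The three toy vectors are invented for illustration and are not taken from any real embedding; cosine distance is taken to be 1 minus cosine similarity, which is the convention assumed throughout this sketch.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (range [-1, 1])."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance, taken here as 1 - cosine similarity (range [0, 2])."""
    return 1.0 - cosine_similarity(u, v)

# Toy 3-dimensional vectors, made up for illustration only.
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
w = np.array([1.0, 1.0, 0.0])

sim_uv = cosine_similarity(u, v)   # orthogonal vectors: similarity 0, distance 1
sim_uw = cosine_similarity(u, w)   # 45 degrees apart: similarity 1/sqrt(2)
```

On this convention, a distance of 1 corresponds to orthogonal (unrelated) vectors, which is why 1 serves as the abstract "no association" reference value.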

Recent research, however, has criticized the existing methods for assessing bias in word embedding models. Zhang, Sneyd, and Stevenson (2020) highlight the limitations of word pairs in detecting bias, arguing that analogies may not necessarily capture societal bias, but instead reflect co-occurrence frequency. Ethayarajh, Duvenaud, and Hirst (2019) draw attention to potential problems with the Word Embedding Association Test (WEAT) in degenerate cases, emphasizing its sensitivity to word frequency and list selection. Similarly, Schröder et al. (2021) critique existing bias metrics such as MAC and WEAT in light of their failure to satisfy plausible general formal principles. Goldfarb-Tarrant et al. (2021) find only limited correlation between intrinsic bias as measured by WEAT and extrinsic metrics such as equality of opportunity and predictive parity. Du, Fang, and Nguyen (2021) discover inconsistencies among existing bias measures, indicating sensitivity to embedding algorithms and corpora. Finally, Spliethöver and Wachsmuth (2021) introduce Bias Silhouette Analysis (BSA) as a method for assessing the quality of metrics that measure bias in word embedding models based on word lists, and conclude that none of the metrics they evaluated can reliably discriminate between biased and non-biased models in all cases. Ethayarajh (2020) argues against expressing bias as a single number without considering the inherent uncertainty in sample-based estimates. Lum, Zhang, and Bower (2022) propose a double-corrected variance estimator for unbiased estimates and uncertainty quantification of group-wise performance metrics. Both proposals enrich the existing measures with classical statistical uncertainty quantification. Our proposal is to get rid of the single-number measures and use Bayesian methods.

In what follows, we focus on two popular measures of bias applicable to many existing
word embeddings, such as *GoogleNews*,^{3} *GloVe* (Pennington, Socher, and Manning 2014),^{4} and *Reddit Corpus* (Rabinovich,
Tsvetkov, and Wintner 2018):^{5} *Word Embedding Association Test* (WEAT)
(Caliskan, Bryson, and Narayanan 2017), and *Mean Average Cosine Distance*
(MAC) (Manzini et al. 2019). We first explain how these measures
are supposed to work. Then we argue that they are problematic for various
reasons—the key one being that by pre-averaging data they manufacture false
confidence, which we illustrate in terms of simulations showing that the measures
often suggest the existence of bias even if by design it is non-existent in a
simulated dataset.

We propose to replace them with Bayesian data analysis, which not only provides a more modest and realistic assessment of the uncertainty involved, but in which hierarchical models also allow for inspection at various levels of granularity. Once we introduce the method, we apply it to multiple word embeddings and to the results of supposed debiasing, putting forward some general observations that are not exactly in line with the usual picture painted in terms of WEAT or MAC.

Most of the problems that we point out generalize to any existing approach that
focuses on chasing a single numeric metric of bias: (1) They treat the results of
pre-averaging as raw data in statistical significance tests, which in this context
is bound to overestimate significance. We show similar results can easily be
obtained when sampling from null models with no bias. (2) The word list sizes and
sample sizes used in the studies are usually small.^{6} (3) Many studies do not use any control predicates,
such as random neutral words or neutral human predicates for comparison.

On the constructive side, we develop and deploy our method, and the results are, roughly, as follows. (A) Posterior density intervals are fairly wide and the average differences in cosine distances between stereotypically associated, stereotypically different, random neutral, and regular human-related predicates are not very large. (B) A preliminary inspection suggests that the desirability of changes obtained by the usual debiasing methods is debatable.

In Section 2 we describe the two key measures
discussed in this paper, WEAT and MAC, explaining how they are calculated and how they
are supposed to work. In Section 3 we first
argue, in Subsection 3.1, that it is far from
clear how results given in terms of WEAT or MAC are to be interpreted. Second, in Subsection 3.2 we explain the statistical problems
that arise when one uses pre-averaged data in such contexts, as these measures do.
In Section 4 we explain the alternative
Bayesian approach that we propose. In Section
5 we elaborate on the results that it leads to, including low efficiency
of debiasing methods, discussed in Subsection
5.2. Finally, in Section 6 we
spend some time placing our results in the ongoing discussions.^{7,}^{8}

## 2 Two Measures of Bias: WEAT and MAC

The underlying intuition is that if a particular harmful stereotype is learned in a
given embedding, then certain groups of words will be systematically closer to (or
further from) each other. This gives rise to the idea of protected groups—for
example, in guiding online search completion or recommendation, female words might
require protection in that they should not be systematically closer to
stereotypically female job names, such as “nurse,”
“librarian,” and “waitress,” and male words require
protection in that they should not be systematically closer to toxic masculinity
stereotypes, such as “tough,” “never complaining,” or
“macho.”^{9}

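One early measure, the direct bias of Bolukbasi et al. (2016), relies on a gender direction (GD) identified in the embedding space. The gender bias of a word *W* is then understood as *W*’s projection on the gender direction, $\overrightarrow{W} \cdot \mathrm{GD}$ (which, after normalizing by dividing by $\lVert W \rVert \, \lVert \mathrm{GD} \rVert$, is the same as cosine similarity). Given a list **N** of supposedly gender neutral words,^{13} and the gender direction GD, the direct gender bias is defined as the average cosine similarity of the words in **N** from GD (*c* is a parameter determining how strict we want to be):

$$\mathrm{directBias}_c(\mathbf{N}, \mathrm{GD}) = \frac{1}{|\mathbf{N}|} \sum_{W \in \mathbf{N}} \left| \cos(\overrightarrow{W}, \mathrm{GD}) \right|^{c}$$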

The use of projections in bias estimation has been criticized, for instance, in Gonen
and Goldberg (2019), where it is pointed out
that while a higher average similarity to the gender direction might be an indicator
of bias with respect to a given class of words, it is only one possible
manifestation of it, and reducing the cosine similarity to such a projection may not
be sufficient to eliminate bias. For instance, *math* and *delicate* might be equally similar to a pair of opposed
explicitly gendered words (*she*, *he*), while being
closer to quite different stereotypical attribute words (such as *scientific* or *caring*). Further, it is observed
in Gonen and Goldberg (2019) that most word
pairs retain similarity under debiasing meant to minimize projection-based
bias.^{14}

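The idea behind WEAT (Caliskan, Bryson, and Narayanan 2017) is that the bias between two sets of words, *X* and *Y* (we call them **protected words**), should be quantified in terms of the cosine similarity between the protected words and attribute words coming from two sets of stereotype attribute words, *A* and *B* (we will call them **attributes**). For instance, *X* might be a set of male names, *Y* a set of female names, *A* might contain stereotypically male-related career words, and *B* stereotypically female-related career words. The association difference for a particular word *t* (belonging to either *X* or *Y*) is:

$$s(t, A, B) = \frac{1}{|A|}\sum_{a \in A} \cos(t, a) - \frac{1}{|B|}\sum_{b \in B} \cos(t, b)$$

Then, the association difference between *A* and *B* is:

$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$$

Large values of *s* scores suggest systematic differences between how *X* and *Y* are related to *A* and *B*, and therefore are indicative of the presence of bias. The authors use it as a test statistic in some tests,^{15} and the final measure of effect size, WEAT, is constructed by taking means of these values and standardizing:

$$\mathrm{WEAT}(X, Y, A, B) = \frac{\mathrm{mean}_{x \in X}\, s(x, A, B) - \mathrm{mean}_{y \in Y}\, s(y, A, B)}{\mathrm{sd}_{w \in X \cup Y}\, s(w, A, B)} \tag{3}$$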
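As a rough illustration of how the per-word association difference, the test statistic, and the standardized effect size fit together, the following sketch computes all three on random vectors. The function names, the random toy data, and the choice of the sample standard deviation (`ddof=1`) are assumptions made for this sketch, not details taken from the original implementation.

```python
import numpy as np

def cos_sim_matrix(P, Q):
    """Pairwise cosine similarities between the rows of P and the rows of Q."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return P @ Q.T

def s_word(T, A, B):
    """s(t, A, B) for every row t of T: mean similarity to A minus mean similarity to B."""
    return cos_sim_matrix(T, A).mean(axis=1) - cos_sim_matrix(T, B).mean(axis=1)

def s_statistic(X, Y, A, B):
    """Test statistic s(X, Y, A, B): sum of per-word values over X minus the sum over Y."""
    return float(s_word(X, A, B).sum() - s_word(Y, A, B).sum())

def weat_effect_size(X, Y, A, B):
    """Difference in mean per-word values, divided by the sd over all words in X and Y."""
    sx, sy = s_word(X, A, B), s_word(Y, A, B)
    pooled_sd = np.concatenate([sx, sy]).std(ddof=1)  # ddof=1 is an assumption
    return float((sx.mean() - sy.mean()) / pooled_sd)

# Random stand-ins for embedding vectors (8 words per list, 50 dimensions).
rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(8, 50)) for _ in range(4))

stat = s_statistic(X, Y, A, B)
effect = weat_effect_size(X, Y, A, B)
```

Note that both quantities are built from per-word means, so the individual pairwise similarities never enter the final numbers directly.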

WEAT is inspired by the Implicit Association Test (IAT)
(Nosek, Banaji, and Greenwald 2002) used in
psychology, and in some applications it uses almost the same word sets, allowing for
a *prima facie* sensible comparison with bias in humans. In Caliskan,
Bryson, and Narayanan (2017) the authors
argue that significant biases—thus measured—similar to the ones
discovered by IAT can be found in word embeddings. In Lauscher and Glavaš
(2019) the methodology is extended to a
multilingual and cross-lingual setting, arguing that using Euclidean distance
instead of cosine similarity does not make much difference, while the bias effects
vary greatly across embedding models.^{16} A similar methodology is used in Garg et al. (2018). The authors use word embeddings trained
on corpora from different decades to study the shifts in various biases through the
century.^{17}

For MAC (Manzini et al. 2019), let $T = \{t_1, \ldots, t_k\}$ be a set of protected words, and let each $A_j \in A$ be a set of attributes stereotypically associated with a protected word. For instance, when biases related to religion are to be investigated, the authors use a dataset of the format illustrated in Table 1. The measure is defined as follows:

$$S(t_i, A_j) = \frac{1}{|A_j|} \sum_{a \in A_j} \mathrm{cosineDistance}(t_i, a)$$

$$\mathrm{MAC}(T, A) = \frac{1}{|T|\,|A|} \sum_{t_i \in T} \sum_{A_j \in A} S(t_i, A_j)$$

That is, for each protected word $t \in T$ and each attribute set $A_j$, they first take the mean of distances for this protected word and all attributes in a given attribute class, and then take the mean of thus obtained means for all the protected words and all the attribute classes.^{18}

**Table 1**

| protected words (T) | attributes | attribute set ($A_j$) | cosine distance |
|---|---|---|---|
| rabbi | greedy | jewStereotype | 1.03 |
| church | familial | christianStereotype | 0.70 |
| synagogue | liberal | jewStereotype | 0.79 |
| jew | familial | christianStereotype | 0.98 |
| quran | dirty | muslimStereotype | 1.12 |
| muslim | uneducated | muslimStereotype | 0.52 |
| torah | terrorist | muslimStereotype | 0.93 |
| quran | hairy | jewStereotype | 1.18 |
| synagogue | violent | muslimStereotype | 0.95 |
| bible | cheap | jewStereotype | 1.22 |
| christianity | greedy | jewStereotype | 0.97 |
| muslim | hairy | jewStereotype | 0.88 |
| islam | critical | christianStereotype | 0.79 |
| muslim | conservative | christianStereotype | 0.45 |
| mosque | greedy | jewStereotype | 1.15 |


Notably, the intuitive distinction between different attribute sets plays no real
role in the MAC calculations. Equally well one could
calculate the mean distance of *muslim* to all the predicates, mean
distance of *christian* to all the predicates, mean distance of *jew* to all the predicates, and then take the mean of these
three means.
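The observation that the division into attribute sets does no work in MAC can be checked numerically. The sketch below is an illustration under stated assumptions: the vectors are random stand-ins, the function names are hypothetical, and the attribute sets are of equal size (with unequal sizes the two quantities can differ slightly, since smaller sets get more weight per attribute).

```python
import numpy as np

def cos_dist_matrix(P, Q):
    """Pairwise cosine distances (1 - cosine similarity) between rows of P and Q."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    return 1.0 - P @ Q.T

def mac(T, attribute_sets):
    """Mean over protected words and attribute sets of the mean distance S(t, A_j)."""
    S = np.column_stack([cos_dist_matrix(T, A_j).mean(axis=1) for A_j in attribute_sets])
    return float(S.mean())

rng = np.random.default_rng(1)
T = rng.normal(size=(5, 20))                                     # 5 protected words
attribute_sets = [rng.normal(size=(6, 20)) for _ in range(3)]    # 3 equal-sized sets

mac_by_sets = mac(T, attribute_sets)
# Ignore the set structure entirely: pool all attributes and average all distances.
mac_pooled = float(cos_dist_matrix(T, np.vstack(attribute_sets)).mean())
```

With equal-sized attribute sets, `mac_by_sets` and `mac_pooled` agree up to floating-point error, so the score carries no information about which stereotype an attribute belongs to.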

Having introduced the measures, we will first discuss a selection of general problems with this approach, and then move on to more specific but important problems related to the fact that the measures take averages and averages of averages. Once this is done, we move to the development of our Bayesian alternative and the presentation of its deployment.

## 3 Challenges to Cosine-based Bias Metrics

### 3.1 Interpretability Issues

Table 2 contains an example of MAC scores (and *p*-values; we
explain how these are obtained in Subsection
3.2) before and after deploying two debiasing methods to the Reddit
embedding, where the score is calculated using the Religion word lists from
Manzini et al. (2019). The main element
of the strategy is projection onto (hard debiasing) or shifting vectors towards
(soft debiasing) the bias subspace so that the biased information is no longer
linearly accessible, and normalization of the vectors.^{19} For our purpose, details of the debiasing
methods are not important. What matters is that the authors use MAC in the evaluation of these methods.

**Table 2**

| Religion Debiasing | MAC | *p*-value |
|---|---|---|
| Biased | 0.859 | N/A |
| Hard Debiased | 0.934 | 3.006e−07 |
| Soft Debiased (λ = 0.2) | 0.894 | 0.007 |


The first question we should ask is whether the initial MAC values lower than 1 indeed are indicative of the presence of bias. Thinking abstractly, 1 is the ideal distance for unrelated words. But in fact, there is some variation in distances, which might lead to non-biased lists also having MAC scores smaller than 1. What may attract attention is the fact that the value of cosine distance in the “Biased” category is already quite high (i.e., close to 1) even before debiasing. High cosine distance indicates low cosine similarity between values. One could think that the average cosine similarity equal to approximately 0.141 is not large enough to claim the presence of a bias to start with. The authors, though, still aim to mitigate it by making the distances involved in the MAC calculations even larger. The question is, on what basis is this small similarity still considered as proof of the presence of bias, and whether these small changes are meaningful.

The problem is that the original paper did not use any control group of neutral
attributes for comparison to obtain a more realistic gauge on how to understand MAC values. Later on, in our approach, we introduce
such control word lists. One of them is a list of words we intuitively
considered neutral. Moreover, it might be the case that words that have to do
with human activities in general, even if unbiased, are systematically closer to
the protected words than merely neutral words. This, again, casts doubt on
whether comparing MAC to the abstractly ideal value of
1 is a methodologically sound idea. For this reason, we also use a second list
with intuitively non-stereotypical human attributes.^{20}

Another important observation is that MAC calculations
do not distinguish whether a given attribute is associated with a given
protected word, simply averaging across all such groups. Let us use the case of
religion-related stereotypes to illustrate. The full lists from Manzini et al.
(2019) can be found in Appendix A.4.1. In the original paper,
words from all three religions were compared against all of the stereotypes. No
distinction between cases in which the stereotype is associated with a given
religion, as opposed to the situation in which it is associated with another
one, is made. For example, the protected word *jew* is supposed
to be stereotypically connected with the attribute *greedy*,
while from the protected word *quran* the attribute *greedy* comes from a different stereotype, and yet the
distances between these pairs contribute equally to the final MAC score. This is problematic, as not all of the
stereotypical words have to be considered harmful for all religions. To avoid
the masking effect, one should pay attention to how protected words and
attributes are paired with stereotypes.

In Figures 3–5 we look at the empirical distributions, while paying attention to such divisions. The horizontal lines represent the values of 1 − MAC (cosine similarity) that the authors considered indicative of bias for stereotypes corresponding to given word lists. For instance, in religion, MAC was .859, which was considered a sign of bias, so we plot 0 ± (1 − .859) ≈ .14 lines around similarity = 0 (that is, distance = 1). Notice that most distributions are quite wide, and the proportions of even neutral or human neutral words with similarities higher than the value of 1 − MAC deserving debiasing according to the authors are quite high.

Another issue to consider is the selection of attributes for bias measurement. The word lists used in the literature are often fairly small (5-50). The papers in the field do utilize statistical tests to measure the uncertainty involved and do make claims of statistical significance. Yet, we will later on argue that these methods are not proper for the goal at hand. By applying Bayesian methods we will show that a more appropriate use of statistical methods leads to estimates of uncertainty, which suggest that larger word lists would be advisable.

To avoid the problem brought up in this subsection, we use control groups and, in line with Bayesian methodology, use posterior distributions and highest posterior density intervals instead of chasing single-point metrics based on pre-averaged data. Before we do so, we first explain why pre-averaging and chasing single-number metrics is a suboptimal strategy.

### 3.2 Problems with Pre-averaging

The approaches we have been describing use means of mean average cosine
similarities to measure similarity between protected words and attributes coming
from harmful stereotypes. But once we take a look at the individual values, it
turns out that the raw data variance is rather high, and there are quite a few
outliers and surprisingly dissimilar words. This problem becomes transparent
when we examine the visualizations of the individual cosine distances, following
the idea that one of the first steps in understanding data is to look at it.
Let’s start with inspecting two examples of such visualizations in Figures 6 and 7 (we also include neutral and human predicates to make our point
more transparent). Again, we emphasize that *we do not condone the
associations which we are about to illustrate*.

As is transparent in Figures 6 and 7, for the protected word *muslim*, the most similar attributes tend to be the ones associated with it stereotypically, but then words associated with other stereotypes come closer than neutral or human predicates. For the protected word *priest*, the situation is even further from what one would expect: The nearest attributes are human attributes, and there seems to be no clear pattern when it comes to the distances between attributes.

The general phenomenon that makes us skeptical about running statistical tests on pre-averaged data is that raw datasets of different variance can result in the same pre-averaged data and consequently the same single-number metric. In other words, a method that proceeds this way is not very sensitive to the real sample variance.

Let us illustrate how this problem arises in the context of WEAT. Once a particular *s*(*X*, *Y*, *A*, *B*) is calculated, the question arises
whether a value that high could have arisen by chance. To address the question,
each *s*(*X*, *Y*, *A*, *B*) is used in the original paper to
generate a *p*-value by bootstrapping. The *p*-value is the frequency of how often it is the case that *s*(*X*_{i}, *Y*_{i}, *A*, *B*) > *s*(*X*, *Y*, *A*, *B*) for sampled
equally sized partitions *X*_{i}, *Y*_{i} of *X* ∪ *Y*. The WEAT score is then computed by standardizing
the difference in means of means by dividing by the standard deviation of means;
see Equation (3).
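The resampling step can be sketched as follows. Since *s*(*t*, *A*, *B*) for a single word does not depend on which group the word is assigned to, it suffices to work with the per-word values and re-partition them. The function name and the toy numbers are assumptions made for this sketch; it enumerates every equal-sized partition rather than sampling, which is feasible for small lists (12,870 partitions for two lists of 8 words).

```python
from itertools import combinations
import numpy as np

def permutation_p_value(s_x, s_y):
    """Share of equal-sized partitions (X_i, Y_i) of X ∪ Y for which
    s(X_i, Y_i, A, B) > s(X, Y, A, B). Inputs are per-word s(t, A, B) values."""
    s_x, s_y = np.asarray(s_x, dtype=float), np.asarray(s_y, dtype=float)
    pooled = np.concatenate([s_x, s_y])
    observed = s_x.sum() - s_y.sum()
    n_greater = 0
    n_partitions = 0
    for idx in combinations(range(len(pooled)), len(s_x)):
        in_x = np.zeros(len(pooled), dtype=bool)
        in_x[list(idx)] = True
        stat = pooled[in_x].sum() - pooled[~in_x].sum()
        n_greater += int(stat > observed)
        n_partitions += 1
    return n_greater / n_partitions

# Toy per-word values (invented): X words all score higher than Y words.
s_x = np.array([0.9, 0.8, 0.7, 0.6])
s_y = np.array([0.1, 0.2, 0.3, 0.4])
p = permutation_p_value(s_x, s_y)
```

Notice that the only inputs are the pre-averaged per-word values: the spread of the individual cosine similarities behind each value plays no role in the resulting *p*-value.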

The WEAT scores reported by Caliskan, Bryson, and
Narayanan (2017) for lists of words for
which the embeddings are supposedly biased range from 1.81 to 2.06, and the
reported *p*-values are in the range of 10^{−7} to 10^{−2}, with one exception for
*Math vs. Arts*, where it is .018.

The question is, are those results meaningful? One way to answer this question is
to think in terms of null generative models. If the words actually are samples
from two populations with equal means, how often would we see WEAT scores in this range? How often would we reach
the *p*-values that the authors reported?

Imagine that there are two groups of protected words, each of size 8, and two
groups of stereotypical attributes, of the same size.^{21} Each such collection of samples, as far as
our question is concerned, is equivalent to a sample of 16^{2} cosine
distances. Further, imagine that there really is no difference between these
groups of words and the model is in fact null. That is, we draw the cosine
distances from the Normal(0,.08) distribution.^{22}

In Figure 8 we illustrate one iteration of
the procedure. We draw one such sample of size 16^{2}. Then we actually
list all possible ways to split the 16 words into two equal sets (each such
split is one bootstrapped sample) and for each of them we calculate the s values and WEAT. What
are the resulting distributions of s scores; what *p*-values and effect sizes do they lead to?

In the bootstrapped samples we would expect low s values and low WEAT: After all, these are just random permutations of random distances, all of which are drawn from the same null distribution. Let’s take a look at one such bootstrapped sample. For the sake of illustration, we picked a rather unusual one: The observed test statistic in this sample is 0.39 and the effect size is 1.27.

The bootstrapped distributions of the test statistics and effect sizes across
multiple samples are illustrated in Figure 8 (we marked the location of our particular example). Notably, both
(two-sided) *p*-values for our example are rather low. This might
suggest that we ended up with a situation where “bias” is present
(albeit due to random noise). After all, the observed statistic is unusual
enough for the *p*-values to pass the traditional significance
threshold.

The reason why we picked the example that we did is that while it leads to a
relatively low *p*-value, and a relatively unusual effect size, a
visualization of the sample reveals no interesting patterns (see Figure 9), which strongly suggests that the
way we calculated effect sizes and *p*-values overestimates the
impact of random noise, in line with our more theoretical comment, which will
follow.

In fact, while there might be some outliers here and there, the claim that there is a clear bias, with one group of protected words systematically closer to one attribute group than to the other, lacks support. Crucially, in the calculations of WEAT means are taken twice: The s-values themselves are means, and then means of s-values are compared between groups. Statistical troubles start when we run statistical tests on sets of means, for at least two reasons.

First, by pre-averaging data we throw away information about sample sizes. Think about proportions: 10 out of 20 and 2 out of 4 give the same mean, but you would obtain more information by making the former observation than by making the latter. And especially in this context, in which the word lists are not huge, sample sizes should matter.

Second, when we pre-average, we disregard variation, and therefore pre-averaging tends to manufacture false confidence. Group means display less variation than the raw data points, and the standard deviation of a set of means is bound to be lower than the original standard deviation in the raw data. Now, if you calculate your effect size by dividing by the pre-averaged standard deviation, you are quite likely to get something that looks like a strong effect size, but the results of your calculations might not track anything interesting.
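The shrinkage of variation under pre-averaging is easy to see numerically. The sketch below draws a 16 × 16 table of values from a single distribution with standard deviation .08 (the value used for the null model in this section) and compares the spread of the raw values with the spread of the per-word means. Variable names and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_words, n_attributes = 16, 16

# Null model: every value is drawn from the same distribution, so there is no bias.
raw = rng.normal(loc=0.0, scale=0.08, size=(n_words, n_attributes))

raw_sd = raw.std(ddof=1)                  # spread of the individual data points
word_means = raw.mean(axis=1)             # pre-averaging: one number per word
sd_of_means = word_means.std(ddof=1)      # spread of the pre-averaged values

# For independent draws, the sd of a mean of n values is sigma / sqrt(n).
theoretical_sd_of_means = 0.08 / np.sqrt(n_attributes)   # = 0.02
```

Dividing a difference in means by `sd_of_means` rather than by `raw_sd` inflates a standardized effect size by roughly a factor of √16 = 4 in this setup, even though nothing about the underlying data has changed.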

Let us think again about the question that we are ultimately interested in. Are
the *X* terms systematically closer to or further from the *A* or *B* attributes than the *Y* words? But this time, instead of pre-averaging,
let’s use the raw data points to answer the question. To start with, let
us run two quick *t*-tests to gauge what the raw data illustrated
in Figure 9 tell us. First, distances to *A* attributes for *X* words and *Y* words. The result is statistically significant. The *p*-value is 0.02 (more than ten times higher than the *p*-value obtained by the bootstrapping procedure). So the
sample is in some sense unusual. But the 95% confidence interval for the
difference in means is [.0052,.061], which suggests values that are much smaller
than what a reader would expect having read that the calculated effect size is
quite large. Let us inspect the distances to the *B* attributes.
Here the *p*-value is .22 and the 95% confidence interval
is [−0.03,.009], even less of a reason to think a bias is present.

Another difficulty is that these statistical tests are based on bootstrapping
from relatively small data sets, which is quite likely to underestimate the
population variance. To make our point clear, let us avoid bootstrapping and
work with the null generative model with Normal(0,.08)
for both word groups. We keep the sizes the same: we have eight protected words
in each group, sixteen in total, and for each we randomly draw 8 distances from
hypothetical *A* attributes, and 8 distances from hypothetical *B* attributes. Calculate the test statistic and effect size
the way Caliskan, Bryson, and Narayanan (2017) did. Do this 10,000 times, each time calculating WEAT and s values, and
look at what the distributions of these values are on the assumption of the null
model with realistic empirically motivated raw data point standard deviation
(Figure 10).
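A simulation along these lines could be sketched as follows. This is an illustration under assumptions, not the code behind Figure 10: the function name, the seed, and the use of the sample standard deviation are our choices for this sketch, and the exact proportions it yields depend on such implementation details.

```python
import numpy as np

def simulate_null_weat(n_sims=10_000, n_per_group=8, n_attr=8, sd=0.08, seed=0):
    """Draw all values from one Normal(0, sd) distribution and compute, for each
    simulated dataset, the s test statistic and the WEAT-style effect size."""
    rng = np.random.default_rng(seed)
    # values[sim, word, attribute]; the first n_attr columns play the role of A, the rest of B
    values = rng.normal(0.0, sd, size=(n_sims, 2 * n_per_group, 2 * n_attr))
    # Per-word association difference: mean over A minus mean over B.
    s_word = values[:, :, :n_attr].mean(axis=2) - values[:, :, n_attr:].mean(axis=2)
    s_x, s_y = s_word[:, :n_per_group], s_word[:, n_per_group:]
    s_stat = s_x.sum(axis=1) - s_y.sum(axis=1)
    effect = (s_x.mean(axis=1) - s_y.mean(axis=1)) / s_word.std(axis=1, ddof=1)
    return s_stat, effect

s_stat, effect = simulate_null_weat()
```

The arrays `s_stat` and `effect` can then be plotted or summarized by quantiles, and an observed value can be located within them to see how unusual it is when no bias is present by construction.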

The first observation is that the supposedly large effect size we obtained is not
that unusual even assuming a null model. Around 38% of samples result in
a WEAT score at least as extreme. This illustrates the
point that it does not constitute strong evidence of bias. Second, the
distribution of s values is much narrower, which
means that if we use it to calculate *p*-values, it is not too
difficult to obtain a supposedly significant test statistic, which nevertheless
does not correspond to anything interesting happening in the data set.

We have seen that seemingly high effect sizes might arise even if the underlying
processes actually have the same mean. The uncertainty resulting from including
the raw data point variance in considerations is more extensive than the one
suggested by the low *p*-values obtained from taking means or
means of means as data points. In this section we discussed the performance of
the WEAT measure, but since the Manzini et al. (2019) one is a generalization thereof,
including the method of running statistical tests on pre-averaged data, our
remarks, *mutatis mutandis*, apply.

As an alternative, we will propose focusing on what the real underlying question is and trying to answer it using a statistical analysis of the raw data using meaningful control groups, to ensure interpretability. Moreover, since the data sets are not too large and since multiple evaluations are to be made, we will pursue this method from the Bayesian perspective.

## 4 A Bayesian Approach to Cosine-based Bias

### 4.1 Model Construction

Bayesian data analysis takes prior probability distributions, a mathematical
model structure and the data, and returns the posterior probability
distributions over the parameters of interest, thus capturing our uncertainty
about their actual values. One important difference between such a result and
the result of classical statistical analysis is that classical confidence
intervals (CIs) have a rather complicated and somewhat confusing interpretation
not directly related to the posterior probability distribution.^{23}

In fact, Bayesian highest posterior density intervals (HPDIs)^{24} and CIs end up being numerically the same
only if the prior distributions are completely uninformative. This illustrates
that classical analysis (1) is unable to incorporate non-trivial priors, and (2)
is therefore more susceptible to over-fitting, unless regularization (equivalent
to a more straightforward Bayesian approach) is used. In contrast with CIs, the
posterior distributions are easily interpretable and have direct relevance to
the question at hand. Moreover, Bayesian data analysis is better at handling
hierarchical models and small datasets, which is exactly what we will be dealing
with.
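For readers unfamiliar with HPDIs, the following sketch shows one common way of computing such an interval from posterior draws: take the narrowest window that contains the required share of the sorted samples. The function name, the 95% mass, and the normally distributed stand-in draws are assumptions made for illustration; the approach is appropriate for unimodal posteriors.

```python
import numpy as np

def hpdi(samples, mass=0.95):
    """Narrowest interval containing at least `mass` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    k = int(np.ceil(mass * n))                 # number of samples inside the interval
    widths = x[k - 1:] - x[: n - k + 1]        # width of every window of k consecutive samples
    i = int(np.argmin(widths))                 # start of the narrowest window
    return float(x[i]), float(x[i + k - 1])

# Stand-in for posterior draws of a mean cosine distance (invented numbers).
rng = np.random.default_rng(0)
posterior_draws = rng.normal(loc=0.9, scale=0.05, size=20_000)
low, high = hpdi(posterior_draws, mass=0.95)
```

The resulting pair `(low, high)` can be read directly as a range within which the parameter lies with the stated posterior probability, given the model and the priors.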

In standard Bayesian analysis, the first step is to understand the data, think hard about the underlying process, and select potential predictors and the outcome variable. The next step is to formulate a mathematical description of the generative model of the relationships between the predictors and the outcome variable. Prior distributions must then be chosen for the parameters used in the model. Next, Bayesian inference must be applied to find posterior distributions over the possible parameter values. Finally, we need to check how well the posterior predictions reflect the data with a posterior predictive check. This will be our evaluation of the method’s performance, discussed in Subsection 4.2, and a more complete collection of such results is available in the Appendix.

In our analysis, the outcome variable is the cosine distance between a protected word and an attribute word. The predictor is a factor related to the attribute word, and has four levels:

- a given attribute word is stereotypically associated with the protected word,
- it comes from a different stereotype connected with another protected word,
- it is a neutral word,
- it is a human-related predicate.

The idea is that if bias is present in the embedding, distances to associated attribute words should be systematically lower than to other attribute words.
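The four-level predictor can be derived mechanically from the word lists. The sketch below is hypothetical: the mapping from protected words to stereotypes, the labels of the control lists, and the function name are assumptions chosen to match the format of Table 1, not the actual data-preparation code.

```python
# Hypothetical mapping from protected words to their own stereotype list.
PROTECTED_TO_STEREOTYPE = {
    "rabbi": "jewStereotype",
    "quran": "muslimStereotype",
    "church": "christianStereotype",
}
# Assumed labels for the two control word lists.
CONTROL_SETS = {"neutral", "human"}

def predictor_level(protected_word, attribute_set):
    """Return one of: 'associated', 'different', 'neutral', 'human'."""
    if attribute_set in CONTROL_SETS:
        return attribute_set
    if attribute_set == PROTECTED_TO_STEREOTYPE[protected_word]:
        return "associated"
    return "different"

levels = [
    predictor_level("rabbi", "jewStereotype"),    # own stereotype
    predictor_level("quran", "jewStereotype"),    # another group's stereotype
    predictor_level("church", "neutral"),
    predictor_level("church", "human"),
]
```

Each (protected word, attribute) pair thus contributes one raw cosine distance together with one of the four labels, and it is these raw pairs, rather than their averages, that enter the model.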

Furthermore, conceptually there are two levels of analysis in our approach (see Figure 11). On the one hand, we are interested in the general question of whether related attributes are systematically closer across the dataset. On the other hand, we are interested in a more fine-grained picture of the role of the predictor for particular protected words. Learning in hierarchical Bayesian models involves using Bayesian inference to update the parameters of the model. This update is based on the observed data, and estimates are made at different levels of the data hierarchy. We use hierarchical Bayesian models in which we simultaneously estimate parameters at the protected word level and at the global level, assuming that all lower-level parameters are drawn from global distributions. Such models can be thought of as incorporating adaptive regularization, which avoids over-fitting and leads to improved estimates for unbalanced datasets.

For each protected word $pw$, we estimate its mean distance to stereotypically associated attributes, $a[pw]$, its mean distance to attributes coming from different stereotypes, $d[pw]$, its mean distance to human attributes, $h[pw]$, and its mean distance to neutral attributes, $n[pw]$. All $a$ parameters come from one distribution, which is normal around a higher-level parameter $\bar{a}$, and so on for the other three groups of parameters. That is, $a_{pw[i]}$ is the average distance of a given particular protected word to attributes stereotypically associated with it, while $\bar{a}$ is the overall average distance of protected words to attributes associated with them.^{25}
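The verbal description above can be summarized schematically. The following is a sketch rather than the exact specification: the normal likelihood, the group indicator variables $A_i, D_i, H_i, N_i$, the scale parameters $\sigma, \sigma_a, \sigma_d, \sigma_h, \sigma_n$, and the prior standard deviation $s$ are notational assumptions introduced here for illustration.

```latex
\begin{align*}
\text{cosineDistance}_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= a_{pw[i]} A_i + d_{pw[i]} D_i + h_{pw[i]} H_i + n_{pw[i]} N_i \\
a_{pw} &\sim \mathrm{Normal}(\bar{a}, \sigma_a), \qquad
d_{pw} \sim \mathrm{Normal}(\bar{d}, \sigma_d) \\
h_{pw} &\sim \mathrm{Normal}(\bar{h}, \sigma_h), \qquad
n_{pw} \sim \mathrm{Normal}(\bar{n}, \sigma_n) \\
\bar{a}, \bar{d}, \bar{h}, \bar{n} &\sim \mathrm{Normal}(1, s)
\end{align*}
```

Exactly one indicator equals 1 for each observation $i$, so $\mu_i$ picks out the mean for the group to which the attribute word belongs, for the protected word $pw[i]$ involved in that observation.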

The governing principles guiding our prior selection are:

We come with the same level of agnosticism about how biased a given embedding is with respect to a particular word list, and so we want to use the same priors across all word lists. Prior to the analysis there seem to be no particular reasons to have strong beliefs about the outcomes being different for different word lists.

We want the priors to provide some level of regularization to avoid over-fitting. Taking coefficient priors centered around zero is a fairly standard technique equivalent to other machine-learning regularization techniques (see Chapter 6 of James et al. [2013], especially the section on Bayesian interpretation of ridge regression and the lasso). Hence, we use non-uniform priors centered around the no-effect value (in our case, cosine distance equal to 1, or in other words, cosine similarity equal to 0).

We also want to build in conceptual information that we already have, such as the mathematical restrictions on the range of cosine similarities, and therefore on the expected values. If we choose $sd = .4$, then prior to seeing the data we expect, with 99.7% prior confidence, the mean distances in groups to be within $1 \pm 3\,sd$, that is, $(-.2, 2.2)$, which is more than plenty and does not really exclude any mathematically possible value.

If the reader is concerned about the choice of priors, the publicly available code can be used to re-run it with their own priors.

### 4.2 Posterior Predictive Check

A posterior predictive check is a technique used to evaluate the fit of a Bayesian model by comparing its predictions with observed data. The underlying principle is to generate simulated data from the posterior distribution of the model parameters and compare them with the observed data. If the model is a good fit to the data, the simulated data should resemble the observed data. In Figure 12 we illustrate a posterior predictive check for one corpus (Reddit) and one word list. The remaining posterior predictive checks are in Appendix A.3. The general phenomenon is that the frequencies with which observed values fall within the 89% and 55% highest posterior density predictive intervals are close to these percentages, which illustrates that our model is not systematically incorrect. Another observation is that the posterior density intervals are relatively wide, which is not unexpected: Information about what treatment/control group a word belongs to is not very useful in predicting its distance to other words, in line with our general methodological comments.
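The coverage comparison described above can be sketched as follows. This is a minimal illustration on synthetic draws, not the code used for Figure 12; the helper names `hpdi` and `coverage` and the simulated numbers are assumptions made for the example.

```python
import numpy as np


def hpdi(samples, prob):
    """Narrowest interval that contains a fraction `prob` of the samples."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = s.size
    k = int(np.ceil(prob * n))  # number of samples inside the interval
    if not 1 <= k <= n:
        raise ValueError("prob must lie in (0, 1]")
    widths = s[k - 1:] - s[: n - k + 1]  # width of every window of k samples
    start = int(np.argmin(widths))
    return s[start], s[start + k - 1]


def coverage(observed, predictive_draws, prob):
    """Share of observations that fall inside their own predictive HPDI.

    `predictive_draws` has shape (n_draws, n_observations).
    """
    draws = np.asarray(predictive_draws, dtype=float)
    hits = 0
    for j, y in enumerate(observed):
        low, high = hpdi(draws[:, j], prob)
        hits += int(low <= y <= high)
    return hits / len(observed)


# Synthetic stand-in for posterior predictive draws (not the paper's data).
rng = np.random.default_rng(0)
observed = rng.normal(1.0, 0.1, size=300)
draws = rng.normal(1.0, 0.1, size=(2000, 300))
for prob in (0.89, 0.55):
    print(prob, coverage(observed, draws, prob))
```

When the predictive distribution matches the process that generated the observations, as it does by construction in this toy setup, the printed shares should land near the nominal 89% and 55%.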

## 5 Results and Discussion

### 5.1 Observations

Despite one-number metrics suggesting otherwise, our Bayesian analysis reveals that insofar as the short word lists typically used in related research projects are involved, there usually are no strong reasons to claim the presence of systematic bias. Moreover, comparison between the groups (including control word lists) leads to the conclusion that the effect sizes (that is, the absolute differences between cosine distances between groups) tend to be rather small, with few exceptions. The choice of protected words is crucial—as there is a lot of variance when it comes to the protected word-level analysis.

One example of a visualization of the results can be found in Figure 13. First, it illustrates posterior marginal distributions at the top of the model hierarchy, relating estimated mean cosine distances by groups to each other. Then, it turns to particular protected words and estimates mean cosine distances of attributes to it, also by groups. Visualizations for other embeddings and word lists can be found in Appendix A.2. Overall, these show that the situation is more complicated than merely looking at one-number summaries might suggest. Note that the axes are sometimes in different scales to increase visibility.

To start with, let us look at the association-type level coefficients (illustrated in the top parts of the plots). Depending on the corpus and word class used, the posterior densities vary considerably. Aware that this is a crude approximation, let us compare the HPDIs and check whether they overlap for different attribute groups.

In WEAT 7 (Reddit) there is no reason to think there are systematic differences between cosine distances (recall that words from WEAT 7 were mostly not available in other embeddings).

In WEAT 1 (Google, GloVe and Reddit) associated words are somewhat closer, but the cosine distance differences from neutral words are very low, and surprisingly it is human attributes, not neutral predicates, that are systematically the furthest.

In Religion (Google, GloVe, Reddit) and Race (Google, GloVe), the associated attributes are not systematically closer than attributes belonging to different stereotypes, and the difference between neutral and human predicates is rather low, if noticeable. The situation is interestingly different in Race (Reddit) where both human and neutral predicates are systematically further than associated and different attributes—but even then, there is no clear difference between associated and different attributes.

For Gender (Google, GloVe), despite the superficially binary nature, associated and opposite attributes tend to be at more or less the same distance, much closer than neutral words (but not closer than human predicates in GloVe). Reddit is an extreme example: Both associated and opposite attributes are much closer than neutral and human ones (around .6 vs. .9), but even then, there seems to be no reason to think that cosine distances to associated predicates are much different from distances to opposite predicates.
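The crude overlap comparison used in the observations above amounts to a simple interval check. The sketch below assumes the intervals have already been computed; the function names and the interval values are made up for illustration and are not results from our analysis.

```python
from itertools import combinations


def intervals_overlap(first, second):
    """True if two closed intervals (low, high) share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


def non_overlapping_pairs(intervals_by_group):
    """Return the pairs of groups whose intervals are disjoint."""
    return [
        (g1, g2)
        for g1, g2 in combinations(intervals_by_group, 2)
        if not intervals_overlap(intervals_by_group[g1], intervals_by_group[g2])
    ]


# Made-up intervals for illustration only; they are not results from the paper.
example = {
    "associated": (0.55, 0.65),
    "different": (0.57, 0.68),
    "neutral": (0.85, 0.95),
    "human": (0.84, 0.93),
}
print(non_overlapping_pairs(example))
```

Disjoint intervals suggest a systematic difference between two groups, whereas overlapping ones leave the question open, which is why this check is only a first approximation and not a substitute for inspecting posterior contrasts.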

Moreover, when we look at particular protected words, the situation is even less straightforward. We will just go over a few notable examples, leaving the visual inspection of particular results for other protected words to the reader. One general phenomenon is that—as we already pointed out—the word lists are quite short, which contributes to large uncertainty involved in some cases.

For some protected words the different attributes are somewhat closer than the associated attributes.

For some protected words, associated and different attributes are closer than neutral attributes, but so are human attributes.

In some cases, associated attributes are closer, but so are neutral and human predicates, which illustrates that just looking at the average cosine distance as compared to the theoretically expected value of 1, instead of running a comparison to neutral and human attributes, is misleading.

The only group of protected words where differences are noticeable at the protected word level is Gender-related words—as in Gender (Google) and in Gender (Reddit)—note however that in the latter, for some words, the opposite attributes seem to be a little bit closer than the associated ones.

### 5.2 Rethinking Debiasing

Bayesian analyses and visualizations thereof can also be handy when it comes to
the investigation of the effect that debiasing has on the embedding space. We
used the embeddings that were debiased using *hard* mode
debiasing from Manzini et al. (2019).
In Figures 13 and 14 we see an example of two visualizations depicting the
difference in means with 89% HPDIs before and after applying debiasing
(the remaining visualizations are in the Appendix).

In *Gender (Reddit)*, minor differences between different and associated predicates end up being smaller. However, this is not achieved by any major change in the relative positions of associated and different predicates with respect to protected words, but rather by shifting them jointly together. The only protected word for which a major difference is noticeable is *hers*.

In *Religion (Reddit)*, debiasing aligns general coefficients for all groups together, all of them getting closer to where neutral words were prior to debiasing (this is true also for human predicates in general, which intuitively did not require debiasing). For some protected words such as *christian* and *jew*, the proximity ordering between associated and different predicates has been reversed, and most of the distances shifted a bit towards 1 (sometimes even beyond, such as predicates associated with the word *quran*), but for most protected words, the relative differences between the coefficients did not change much (for instance, there is no change in the way the protected word *muslim* is mistreated).

For *Race (Reddit)*, general coefficients for different and associated predicates became aligned. However, most of the changes roughly preserve the structure of bias for particular protected words with minor exceptions, such as making the proximities of different predicates for protected words *asian* and *asia* much lower than associated predicates, which is the main factor responsible for the alignment of the general level coefficients.

In general, debiasing might end up leading to lower differences between general level coefficients for associated and different attributes. But that usually happens without any major change to the structure of the coefficients for protected words, with sporadic extreme and undesirable changes for some protected words, and usually with the side effect of changing what happens with neutral and human predicates.

We would not even be able to notice these phenomena had we restricted our attention to MAC or WEAT scores only. To be able to diagnose and remove biases at the right level of granularity, we need to go beyond single metric chasing.

In Figures 15–17 we inspect the empirical distributions for the debiased embeddings. Comparing the results to the original embedding, one may notice that for the Religion group, the neutral and human distributions have changed slightly. Before, within the “correct” cosine similarity boundaries, there were 56% of neutral and 55% of human word lists. After the debiasing, the values changed to 59% (for neutral) and 59% (for human). The different and associated word lists were more influenced. The general shape of both distributions is less stretched. Before debiasing, 43% of the different word lists and 35% of the associated word lists were within the accepted boundaries. After the embedding manipulation, the percentage has increased for both lists to 63%. The visualization for the Gender group illustrates almost no change for the neutral and human word lists before and after debiasing. The values for different and associated word lists are also barely impacted by the embedding modification. In the Race group, the percentage within the boundaries for neutral and associated word lists has increased. The opposite happened for human and different word lists, where the percentage of “correct” cosine similarity dropped from 67% to 55% (human) and from 39% to 36% (different).

### 5.3 Potential Objections

A worry that one might have is that the original challenge was to develop a single-number bias metric, and that our approach does not reach this goal. Essentially, this is correct: We argue that chasing a single metric is the wrong game to play to start with. In line with general Bayesian methodology, we grant epistemic priority to the inspection of posterior distributions, of which particular single-number summaries are just that: imperfect summaries that can easily be misinterpreted if no attention is paid to the uncertainties surrounding them. Having said this, if one really needs a single-number summary, then mean posterior contrasts between associated and human attributes are closer to adequacy than the single-number metrics proposed so far (but they still should come accompanied by some description of the uncertainty involved).
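A mean posterior contrast with an accompanying uncertainty description could be computed along the following lines. This is a sketch on synthetic draws, not output from our models. For brevity it reports an equal-tailed quantile interval rather than an HPDI, and the function name `contrast_summary` is hypothetical.

```python
import numpy as np


def contrast_summary(first_draws, second_draws, prob=0.89):
    """Mean and equal-tailed interval of the draw-by-draw difference."""
    diff = np.asarray(first_draws, dtype=float) - np.asarray(second_draws, dtype=float)
    tail = (1.0 - prob) / 2.0
    low, high = np.quantile(diff, [tail, 1.0 - tail])
    return float(diff.mean()), float(low), float(high)


# Synthetic draws standing in for posterior samples of two group-level means.
rng = np.random.default_rng(1)
associated = rng.normal(0.80, 0.05, size=4000)
human = rng.normal(0.90, 0.05, size=4000)
mean_diff, low, high = contrast_summary(associated, human)
print(round(mean_diff, 3), round(low, 3), round(high, 3))
```

With real posterior samples, the two arrays should hold draws from the same posterior iterations, so that the difference is taken draw by draw and any correlation between the two parameters is preserved.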

One of our recommendations is that the word lists should be larger. A possible
concern here is that one might ask: *But wouldn’t larger word
lists solve the problem with the existing measures of bias?* This is
in some sense true: The overconfidence resulting from using aggregated data will
be less damaging if more data is used. However, as Ethayarajh (2020) proves, the list sizes needed for useful
uncertainty estimation from this perspective obtained using theoretical bounds
tend to be prohibitive. Bayesian uncertainty estimates, however, scale with size
and can be improved and evaluated as the list sizes increase, even if the size
improvement is far from the one required by classical theoretical
considerations.

Another concern may be that this research focuses primarily on traditional uncontextualized embeddings, which may be considered outdated given recent advances in the field. Our focus on WEAT and similar methods is motivated by the fact that WEAT is the main bias measure that is claimed to be supported by data and findings from the IAT in psychology, and by the fact that its analogues have seen the most research contributions.

Having said that, there is a generalization of WEAT for contextual models: the Sentence Encoder Association Test (SEAT) (May et al. 2019). It works by injecting words into the context of template sentences, which are then embedded; the bias is computed from the sentence embeddings. A number of similar generalizations have been reviewed and studied by Husse and Spitz (2022). The authors’ main observation is that the reported results are heterogeneous, inconsistent, and ultimately inconclusive, as minor design and implementation choices or errors have a substantial and often significant impact on the derived bias scores. This is consistent with our critique of the use of such scores.

However, just as WEAT can be generalized to contextual models, so can our method. The fact that it has not been applied in this way is due to a practical rather than a conceptual or technical limitation. The results of such generalization remain to be produced and studied.

## 6 Related Work and Conclusions

There are a few related papers, whose discussion goes beyond the scope of this article:

Xiao and Wang (2019) use Bayesian methods to estimate uncertainty in NLP tasks, but they apply their Bayesian Neural Networks-based method to sentiment analysis or named entity recognition, not to bias.

Schröder et al. (2021) criticize some existing bias metrics such as MAC or WEAT on the grounds of them not satisfying some general formal principles, such as magnitude-comparability, and they propose a modification, called SAME.

May et al. (2019) develop a generalization of WEAT meant to apply to sets of sentences, SEAT, which basically applies WEAT to vector representations of sentences. The authors, however, still pre-average and focus on a single-number metric, so our remarks apply.

Guo and Caliskan (2021) introduce the Contextualized Embedding Association Test Intersectional, meant to apply to dynamic word embeddings and, importantly, develop methods for intersectional bias detection. The measure is a generalization of the WEAT method. The authors do inspect a distribution of effect sizes that arise from the consideration of various possible contexts, but they continue to standardize the difference in averaged means and use a single-number summary: the weighted mean of the effect sizes thus understood. The method, admittedly, deserves further evaluation, which goes beyond the scope of this article.

Lum, Zhang, and Bower (2022) observe that many group-wise performance meta-metrics used in algorithmic fairness consideration are biased estimators of disparities and propose a double-corrected variance estimator, which provides unbiased estimates and uncertainty quantification of the variance of model performance. This is certainly valuable. Our approach differs in the following dimensions: It is not clear how a bias variance estimator for model performance should be used in the context of group-wise cosine similarity evaluation, where we are not dealing with performance, but with cosine similarities, and where we are not dealing with a binomial distribution. Moreover, while having an unbiased estimate is valuable, if we have one this only means that *in the long run* our estimates would tend to the true value. In the situation we are looking at, the datasets are relatively small, and we want to focus on what can be said before the long run has passed. Moreover, our Bayesian approach allows for regularization, inspection at various levels of granularity, and more straightforward interpretability, as it results in posterior distributions.

Ethayarajh, Duvenaud, and Hirst (2019) point out that WEAT will be overblown in degenerate cases: If the word groups are singletons, the effect size is always maximal in one direction. They also bring up sensitivity to the word frequency in a given corpus and to the word list choice. The authors propose a measure they call Relational Inner Product Association (RIPA) to measure the association between a word vector and a relation vector in a word embedding. Roughly, this is the scalar projection of a word vector onto a one-dimensional bias subspace (which is found by using the approach from Bolukbasi et al. [2016]). They do not attempt any statistical analysis or generalization to a bias measure that would apply to an embedding space as a whole (as opposed to assigning RIPA to one word vector with respect to a relation vector), or a particular protected word list. In contrast, our analysis allows for a more general evaluation, with uncertainty estimates. Moreover, RIPA can be used only for those embedding models that implicitly utilize matrix factorization and contain a co-occurrence statistic. In contrast, such technicalities do not prevent the Bayesian approach from being deployed for contextual models.

Ethayarajh (2020) argues that a bias estimate should not be expressed as a single number without taking into account that the estimate is made using a sample of data and therefore has intrinsic uncertainty. The author suggests using Bernstein bounds to gauge the uncertainty in terms of confidence intervals. We do not discuss this approach extensively, as we think that confidence intervals are quite problematic for several reasons, among others the confusing interpretability. We do not think that Bernstein bounds provide the best solution to the problem. Applying this method to a popular WinoBias dataset leads to the conclusion that more than 11,903 data points of protected words are needed to claim a 95% confidence interval for a bias estimate. This amount vastly exceeds the existing word lists for bias estimation. We propose a more realistic Bayesian method. Our conclusion is still that the word lists are sometimes too small, but at least they allow for gauging uncertainty as we go on to improve our methodology and extend the lists gradually.

Zhang, Sneyd, and Stevenson (2020) show the limitations of methods that use gender word pairs to detect bias in an embedding space. They claim that using analogies to detect bias may not necessarily reflect societal bias, but rather simply co-occurrence frequency between words. They conduct experiments where they evaluate four popular bias measures: Direct Bias (DB), Word Association (WA), Neighbourhood Bias Metric (NBM), and Relational Inner Product Association (RIPA). They show that these measures are not robust to changing either the base pair or the form of a word used. This is a valid point, to some extent related to the limitations on non-contextual models, and to some extent suggesting the inclusion of various word forms in the word list, and not constructing base direction using small word lists. As one of our observations is that the uncertainty resulting from the use of currently existing protected word lists is too large to justify sweeping statements, this is in line with our criticism.

Du, Fang, and Nguyen (2021) investigate the reliability of bias measures. They find that key existing candidates for bias measures often fail to agree with one another on particular pairs and protected words, and are sensitive to the word embedding algorithm and the corpus used. While some of these sensitivities are not necessarily signs of failure, we agree that the more abstract a measure is, the more degrees of freedom there are in how it is constructed, and the more ways such measures can disagree without a clear and principled reason. This is partially why we propose a less abstract approach that estimates expected cosine distances directly using raw data points.

Goldfarb-Tarrant et al. (2021) further our understanding of the relationship between intrinsic bias measured as a property of an embedding space, and extrinsic bias, measured in terms of downstream task performance. They compare WEAT as an intrinsic measure with Equality of Opportunity and Predictive Parity as external metrics. The conclusion is that such a correlation is very limited. We consider this to be a point that matches our criticism of WEAT. Running a similar analysis for the Bayesian approach we discuss herein would be a useful task, which remains beyond the scope of this article. One interesting general difference is that our method often claims that there are no sufficient reasons to claim that an embedding is biased, so one would be on the lookout for such cases in which extrinsic bias measures nevertheless suggest the presence of bias. Such a presence, however, would not necessarily have to mean that the Bayesian approach to estimating potential systematic differences in cosine similarity is wrong, but rather suggest that external bias in downstream performance is not a function thereof.

Spliethöver and Wachsmuth (2021) propose Bias Silhouette Analysis (BSA), a method for assessing the quality of metrics that measure bias in word embedding models based on word lists. The core idea here is to quantify how much the bias values of a metric vary depending on what words from the lists are actually observed, where the computations result in values for each model obtained using word list subsets of increasing length. This allows for an inspection of bias metric convergence and sensitivity to word list choice, with a biased (GloVe) and an explicitly debiased model whose lower bias has been confirmed empirically (NBatch). They examine the Embedding Coherence Test (ECT), the Relative Negative Sentiment Bias (RNSB), and WEAT, concluding that none of these metrics can reliably discriminate between biased and non-biased models in all cases. This is in line with our results. An interesting question is what would happen if a similar test was applied to our method. However, our point is that the existing word lists are too short to provide a reliable estimate of bias.

To summarize, a Bayesian data analysis with hierarchical models of cosine distances between protected words, control group words, and stereotypical attributes provides a more modest and realistic assessment of the uncertainty involved. It reveals that much complexity is hidden when one instead chases single bias metrics present in the literature. After introducing the method, we applied it to multiple word embeddings and results of supposed debiasing, putting forward some general observations that are not exactly in line with the usual picture painted in terms of WEAT or MAC (and the problem generalizes to any approach that focuses on chasing a single numeric metric): The word list sizes and sample sizes used in the studies are usually small. Posterior density intervals are fairly wide. Often the differences between associated, different, neutral, and human predicates are not very impressive. Also, a preliminary inspection suggests that the desirability of changes obtained by the usual debiasing methods is debatable. The tools that we propose, however, allow for a more fine-grained and multi-level evaluation of bias and debiasing in language models without losing modesty about the uncertainties involved. The short, general, and somewhat disappointing lesson here is this: Things are complicated. Instead of chasing single-number metrics, we should rather devote attention to more nuanced analysis.

## A Appendix

### A.1 A Philosophical Commentary

One response to the raising of the issue of bias in natural language models might be to say that there is not much point in reflecting on such biases, as they are unavoidable. This unavoidability might seem in line with the arguments to the effect that learning algorithms are always value-laden (Johnson 2023): They use inductive methods that require design-, data-, or risk-related decisions that have to be guided by extra-algorithmic considerations. Such choices necessarily involve value judgments and have to do, for instance, with what simplifications or risks one finds acceptable. Admittedly, algorithmic decision-making cannot fulfill the value-free ideal, but this only means that even more attention needs to be paid to the values underlying different techniques and decisions, and to the values being pursued in a particular use of an algorithm.

Another response might be to insist that there is no bias introduced by the use of machine learning methods here since the algorithm is simply learning to correctly predict co-occurrences based on what “reality” looks like. However, this objection overlooks the fact that we, humans, are the ones who construct this linguistic reality, which is shaped in part by the natural language processing tools we use on a massive scale. Sure, if there is unfairness and our goal is to diagnose it, we should do complete justice to learning it in the model used to study it. One example of this approach is Garg et al. (2018), where the authors use language models to study the shape of certain biases across a century.

However, if our goal is to develop downstream tools that perform tasks that we care about without further perpetuating or exacerbating harmful stereotypes, we still have good reasons to try to minimize the negative impact. Moreover, it is often not the case that the corpora mirror reality—to give a trivial example, heads are spoken of more often than kidneys, but this does not mean that kidneys occur much less often in reality than heads. To give a more relevant example, the disproportionate association of female words with female occupations in a corpus actually greatly exaggerates the actual lower disproportion in the real distribution of occupations (Gordon and Durme 2013).

### A.2 Visualizations

### A.3 Posterior Predictive Checks

### A.4 Word Lists

#### A.4.1 Lists Used in Previous Research

The lists from Manzini et al. (2019) are available here:

The lists are:

**Gender:**

Pairs of protected attributes: [“he”, “she”], [“his”, “hers”], [“son”, “daughter”], [“father”, “mother”], [“male”, “female”], [“boy”, “girl”], [“uncle”, “aunt”]

Lists of stereotypes per gender: man: [“manager”, “executive”, “doctor”, “lawyer”, “programmer”, “scientist”, “soldier”, “supervisor”, “rancher”, “janitor”, “firefighter”, “officer”], woman: [“secretary”, “nurse”, “clerk”, “artist”, “homemaker”, “dancer”, “singer”, “librarian”, “maid”, “hairdresser”, “stylist”, “receptionist”, “counselor”]

**Race:**

Sets of protected attributes: [“black”, “caucasian”, “asian”], [“african”, “caucasian”, “asian”], [“black”, “white”, “asian”], [“africa”, “america”, “asia”], [“africa”, “america”, “china”], [“africa”, “europe”, “asia”]

Lists of stereotypes per race: “caucasian”: [“manager”, “executive”, “redneck”, “hillbilly”, “leader”, “farmer”], “asian” : [“doctor”, “engineer”, “laborer”, “teacher”], “black” : [“slave”, “musician”, “runner”, “criminal”, “homeless”]

**Religion:**

Sets of protected attributes: [“judaism”, “christianity”, “islam”], [“jew”, “christian”, “muslim”], [“synagogue”, “church”, “mosque”], [“torah”, “bible”, “quran”], [“rabbi”, “priest”, “imam”]

Lists of stereotypes per religion: “jew” : [“greedy”, “cheap”, “hairy”, “liberal”], “christian” : [“judgemental”, “conservative”, “familial”], “muslim” : [“violent”, “terrorist”, “dirty”, “uneducated”]

#### A.4.2 Custom Lists Used in This Article

**Neutral:**

[‘ballpark’, ‘glitchy’, ‘billy’, ‘dallas’, ‘rip’, ‘called’, ‘outlooks’, ‘floater’, ‘rattlesnake’, ‘exports’, ‘recursion’, ‘shortfall’, ‘corrected’, ‘solutions’, ‘diagnostic’, ‘patently’, ‘flops’, ‘approx’, ‘percents’, ‘lox’, ‘hamburger’, ‘engulfed’, ‘households’, ‘north’, ‘playtest’, ‘replayability’, ‘glottal’, ‘parable’, ‘gingers’, ‘anachronism’, ‘organizing’, ‘reach’, ‘shtick’, ‘eleventh’, ‘cpu’, ‘ranked’, ‘irreversibly’, ‘ponce’, ‘velociraptor’, ‘defects’, ‘puzzle’, ‘smasher’, ‘northside’, ‘heft’, ‘observation’, ‘rectum’, ‘mystical’, ‘telltale’, ‘remnants’, ‘inquiry’, ‘indisputable’, ‘boatload’, ‘lessening’, ‘uselessness’, ‘observes’, ‘fictitious’, ‘repatriation’, ‘duh’, ‘attic’, ‘schilling’, ‘charges’, ‘chatter’, ‘pad’, ‘smurfing’, ‘worthiness’, ‘definitive’, ‘neat’, ‘homogenized’, ‘lexicon’, ‘nationalized’, ‘earpiece’, ‘specializations’, ‘lapse’, ‘concludes’, ‘weaving’, ‘apprentices’, ‘fri’, ‘militias’, ‘inscriptions’, ‘gouda’, ‘lift’, ‘laboring’, ‘adaptive’, ‘lecture’, ‘hogging’, ‘thorne’, ‘fud’, ‘skews’, ‘epistles’, ‘tagging’, ‘crud’, ‘two’, ‘rebalanced’, ‘payroll’, ‘damned’, ‘approve’, ‘reason’, ‘formally’, ‘releasing’, ‘muddled’, ‘mineral’, ‘shied’, ‘capital’, ‘nodded’, ‘escrow’, ‘disconnecting’, ‘marshals’, ‘winamp’, ‘forceful’, ‘lowes’, ‘sip’, ‘pencils’, ‘stomachs’, ‘goff’, ‘cg’, ‘backyard’, ‘uprooting’, ‘merging’, ‘helpful’, ‘eid’, ‘trenchcoat’, ‘airlift’, ‘frothing’, ‘pulls’, ‘volta’, ‘guinness’, ‘viewership’, ‘eruption’, ‘peeves’, ‘goat’, ‘goofy’, ‘disbanding’, ‘relented’, ‘ratings’, ‘disputed’, ‘vitamins’, ‘singled’, ‘hydroxide’, ‘telegraphed’, ‘mercantile’, ‘headache’, ‘muppets’, ‘petal’, ‘arrange’, ‘donovan’, ‘scrutinized’, ‘spoil’, ‘examiner’, ‘ironed’, ‘maia’, ‘condensation’, ‘receipt’, ‘solider’, ‘tattooing’, ‘encoded’, ‘compartmentalize’, ‘lain’, ‘gov’, ‘printers’, ‘hiked’, ‘resentment’, ‘revisionism’, ‘tavern’, ‘backpacking’, ‘pestering’, ‘acknowledges’, ‘testimonies’, ‘parlance’, ‘hallucinate’, ‘speeches’, ‘engaging’, ‘solder’, ‘perceptive’, 
‘microbiology’, ‘reconnaissance’, ‘garlic’, ‘neutrals’, ‘width’, ‘literaly’, ‘guild’, ‘despicable’, ‘dion’, ‘option’, ‘transistors’, ‘chiropractic’, ‘tattered’, ‘consolidating’, ‘olds’, ‘garmin’, ‘shift’, ‘granted’, ‘intramural’, ‘allie’, ‘cylinders’, ‘wishlist’, ‘crank’, ‘wrongly’, ‘workshop’, ‘yesterday’, ‘wooden’, ‘without’, ‘wheel’, ‘weather’, ‘watch’, ‘version’, ‘usually’, ‘twice’, ‘tomato’, ‘ticket’, ‘text’, ‘switch’, ‘studio’, ‘stick’, ‘soup’, ‘sometimes’, ‘signal’, ‘prior’, ‘plant’, ‘photo’, ‘path’, ‘park’, ‘near’, ‘menu’, ‘latter’, ‘grass’, ‘clock’]

**Human-related:**

[‘wear’, ‘walk’, ‘visitor’, ‘toy’, ‘tissue’, ‘throw’, ‘talk’, ‘sleep’, ‘eye’, ‘enjoy’, ‘blogger’, ‘character’, ‘candidate’, ‘breakfast’, ‘supper’, ‘dinner’, ‘eat’, ‘drink’, “carry”, “run”, “cast”, “ask”, “awake”, “ear”, “nose”, “lunch”, “coalition”, “policies”, “restaurant”, “stood”, “assumed”, “attend”, “swimming”, “trip”, “door”, “determine”, “gets”, “leg”, “arrival”, “translated”, “eyes”, “step”, “whilst”, “translation”, “practices”, “measure”, “storage”, “window”, “journey”, “interested”, “tries”, “suggests”, “allied”, “cinema”, “finding”, “restoration”, “expression”,“visitors”, “tell”, “visiting”, “appointment”, “adults”, “bringing”, “camera”, “deaths”, “filmed”, “annually”, “plane”, “speak”, “meetings”, “arm”, “speaking”, “touring”, “weekend”, “accept”, “describe”, “everyone”, “ready”, “recovered”, “birthday”, “seeing”, “steps”, “indicate”, “anyone”, “youtube”]

## Notes

The datasets and source code for the paper are publicly available at https://github.com/efemeryds/Bayesian-analysis-for-NLP-bias.

One example of a contextualized representation is BERT. Another is GPT.

GoogleNews-vectors-negative300, available at https://github.com/mmihaltz/word2vec-GoogleNews-vectors.

Available at https://nlp.stanford.edu/projects/glove/.

Reddit-L2 corpus, available at http://cl.haifa.ac.il/projects/L2/.

Depending on which list from Caliskan, Bryson, and Narayanan (2017) we inspect, the number of protected words ranges from 13 to 18, and the number of attributes from 11 to 25.

**Disclaimer:** Throughout the paper we will be mentioning and using
word lists and stereotypes we did not formulate, which does not mean we
condone any judgment made therein or underlying a given word selection. For
instance, the Gender dataset does not recognize non-binary categories, and
yet we use it without claiming that such categories should be ignored.

A few more philosophical comments on the enterprise of reflecting on bias in language models can be found in Appendix A.1.

However, for some research-related purposes, such as the study of stereotypes across history (Garg et al. 2018), embeddings that do not protect certain classes may also be useful.

Here, “−” stands for point-wise difference, “·” stands for the dot product operation, and $\lVert a \rVert = \sqrt{a \cdot a}$.

Note that this terminology is slightly misleading: mathematically, cosine
distance is not a distance measure, because it does not satisfy the triangle
inequality. That is, cosineDistance(*A*, *C*) ≤ cosineDistance(*A*, *B*) + cosineDistance(*B*, *C*) does not hold in general. We will nevertheless keep using this mainstream terminology.
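A small counterexample may make the failure of the triangle inequality concrete. The sketch below assumes that cosine distance is defined as one minus cosine similarity; the function name `cosine_distance` and the three two-dimensional vectors are illustrative choices of ours, not taken from the paper's code.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 minus cosine similarity."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Illustrative vectors: B lies halfway (in angle) between A and C.
A = np.array([1.0, 0.0])
B = np.array([1.0, 1.0])
C = np.array([0.0, 1.0])

d_ac = cosine_distance(A, C)  # orthogonal vectors
d_ab = cosine_distance(A, B)  # 45 degrees apart
d_bc = cosine_distance(B, C)  # 45 degrees apart

# The triangle inequality would require d_ac <= d_ab + d_bc.
triangle_violated = d_ac > d_ab + d_bc
print(f"d(A,C)={d_ac:.4f}, d(A,B)+d(B,C)={d_ab + d_bc:.4f}, "
      f"violated={triangle_violated}")
```

Going from *A* to *C* through the intermediate vector *B* is "shorter" than going directly, which is impossible for a genuine metric.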

Roughly, the first principal component is the direction (a unit-length linear combination of the original coordinates) for which the projections of the data points onto it have maximal variance.
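As a rough illustration of this description, the first principal component can be obtained from a singular value decomposition of the centered data. This is a generic sketch on synthetic data; the helper `first_principal_component` and the toy data are our own assumptions and do not reproduce any particular implementation from the literature discussed here.

```python
import numpy as np


def first_principal_component(X):
    """Unit vector along which the centered rows of X vary the most."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# Synthetic 2D data stretched along the (1, 1) diagonal (illustrative only).
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t + 0.1 * rng.normal(size=500),
                     t + 0.1 * rng.normal(size=500)])

pc1 = first_principal_component(X)
centered = X - X.mean(axis=0)
var_pc1 = np.var(centered @ pc1)                  # variance along pc1
var_x_axis = np.var(centered @ np.array([1.0, 0.0]))  # variance along x axis
print(pc1, var_pc1, var_x_axis)
```

The variance of the projections onto `pc1` is at least as large as the variance along any other unit direction, such as the coordinate axis used for comparison.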

We follow the methodology used in the debate in assuming that there is a
class of words identified as more or less neutral, such as *ballpark,
eat, walk, sleep, table*, whose average similarity to the gender
direction (or other protected words) is around 0. See our list in Appendix A.4.2 and a brief discussion
in Subsection 3.1.

Note their method assumes *X* and *Y* are of
the same size.

Interestingly, embeddings trained on social-media text were found to be less biased than those based on Wikipedia.

Strictly speaking, these authors use Euclidean distances and their differences, but the way they take averages and averages thereof is analogous, and so what we will have to say about pre-averaging leading to false confidence applies to this methodology as well.

The authors’ code is available through their GitHub repository at https://github.com/TManzini/DebiasMulticlassWordEmbedding.

See the authors’ GitHub code at https://github.com/TManzini/DebiasMulticlassWordEmbedding/blob/master/Debiasing/biasOps.py for details.

See Appendix A.4.2 for the word lists.

16 is the sample size used in the WEAT7 word list, which is not much different from the other sample sizes in word lists used by Caliskan, Bryson, and Narayanan (2017).

.08 is approximately the empirical standard deviation observed in fairly large samples of neutral words.

Here are a few common problems. CIs are often mistakenly interpreted as providing the probability that a particular computed confidence interval contains the true value of a parameter. CIs also cause confusion with regard to precision: it is a common mistake to interpret narrow intervals as corresponding to more precise knowledge. Another fallacy is to associate CIs with likelihood and to state that values within a given interval are more probable than those outside it. The theory of confidence intervals does not support any of these interpretations. A CI should simply be interpreted as the result of a certain procedure (there are many ways to obtain CIs from a given set of data) that, in the long run, yields intervals containing the true value in a fixed proportion of repetitions. For a nice survey and explanation of these misinterpretations, see Morey et al. (2015). For a psychological study of the occurrence of such misinterpretations, see Hoekstra et al. (2014). In this study, 120 researchers and 442 students were asked to assess the truth value of six false statements involving different interpretations of a CI. Both researchers and students endorsed, on average, more than three of these statements.
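The long-run reading of a CI can be illustrated with a small simulation. The sketch below assumes normally distributed data with a known standard deviation and uses the textbook 95% interval for the mean; all numerical settings (sample size, number of repetitions, seed) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative settings (assumptions, not values from any analysis).
true_mean, sigma = 0.0, 1.0
n, n_repeats = 16, 10_000
z = 1.96  # normal quantile for a two-sided 95% interval

# Repeat the same procedure many times on fresh samples.
samples = rng.normal(true_mean, sigma, size=(n_repeats, n))
means = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)

# For each repetition, check whether the interval contains the true mean.
covered = (means - half_width <= true_mean) & (true_mean <= means + half_width)
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The 95% figure describes the procedure across repetitions. Any single interval either contains the true mean or does not, and the procedure assigns no probability to that particular interval.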

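Highest posterior density intervals (HPDIs) are the narrowest intervals containing a specified proportion of the posterior probability mass, that is, of the area under the posterior density curve.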
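For a sample-based approximation, an HPDI can be found by sorting the posterior draws and scanning for the narrowest window that contains the required share of them. The sketch below is a minimal version of this idea; the function name `hpdi`, the default of 0.89, and the gamma-distributed stand-in for posterior draws are all our own assumptions.

```python
import numpy as np


def hpdi(samples, prob=0.89):
    """Narrowest interval containing at least `prob` of the sampled values."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(prob * n))  # number of draws the interval must contain
    if k < 1 or k > n:
        raise ValueError("prob must be in (0, 1] and samples non-empty")
    # Width of every window of k consecutive sorted draws.
    widths = x[k - 1:] - x[:n - k + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


# Skewed stand-in for posterior draws (illustrative only).
rng = np.random.default_rng(1)
draws = rng.gamma(shape=2.0, scale=1.0, size=20_000)

lo, hi = hpdi(draws, prob=0.89)
# Equal-tailed interval with the same nominal mass, for comparison.
eti_lo, eti_hi = np.percentile(draws, [5.5, 94.5])
print(f"HPDI width: {hi - lo:.3f}, equal-tailed width: {eti_hi - eti_lo:.3f}")
```

For a skewed distribution such as this one, the HPDI is shifted toward the mode and is narrower than the equal-tailed interval with the same mass. For a symmetric unimodal distribution the two roughly coincide.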

## References

*Reporting bias and knowledge acquisition*

## Author notes

Action Editor: Rico Sennrich