Abstract
Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to capture fine-grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas, which can provide higher resolution analyses of language variation. We embed voting precincts, which are small, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse, with many areas having scant social media data. We propose a novel embedding approach that alternates training with smoothing, which mitigates these sparsity issues. We focus on linguistic variation across Texas as it is relatively understudied. We develop two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preferences among lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code, where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon, thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.
1 Introduction
Similar to embeddings that capture word usage, recent work in NLP has developed methods that generate embeddings for geographic areas, representing language use in those areas. For example, Huang et al. (2016) developed an embedding method for capturing language use in counties and Hovy and Purschke (2018) developed an embedding method for capturing language use in cities. These embeddings can be used for a wide variety of sociolinguistic analyses as well as downstream tasks.
Given the sheer volume available, social media data is often used to provide the text data needed to train the embeddings. However, one inherent problem that arises is the imbalance of population distribution across a region of interest, which leads to an imbalance of social media data across that region. For example, rural areas use Twitter less than urban areas (Duggan 2015). This could make it more difficult to capture language use in rural areas.
One solution to this issue is to use larger areas. For example, one could focus on cities and not explore the countryside, as done in Hovy and Purschke (2018). Or one could divide a region of interest into large squares, as done in Hovy et al. (2020). Or one could divide a region of interest into counties, as done in Huang et al. (2016). While these solutions produce areas with more data, the areas themselves could be less useful for analysis as (1) there could be important areas that are not covered (e.g., only studying cities and missing the rest of the region), (2) the areas could have awkward boundaries (e.g., dividing regions into squares that ignore geopolitical boundaries), or (3) the resolution would be too low to be useful for certain analyses (e.g., using cities as areas prevents analyses of intracity language use).
We propose a novel solution to the data problem. We use smaller areas, voting precincts, that provide finer resolution analyses, and we propose a novel embedding approach to mitigate the specific data issues related to using smaller areas. Voting precincts are small, equally sized areas used in the administration of elections (in Texas, each voting precinct has about 1,100 voters). They are well regulated (voting precincts are required to fit within county and congressional district boundaries), monitored (voting precincts are a fundamental unit in censuses), compact (voting precincts need to be compact to make elections, polling, and governance more efficient), and cover an entire region, so they form a perfect mesh to represent language use across a region. Unlike cities, voting precincts also cover rural areas. Unlike grid squares, voting precincts follow geopolitical boundaries. Unlike counties, voting precincts can better capture intracity differences in language use. Thus, by developing embedding representations of these precincts, we can find fine-grained differences in language use across a large region of interest.
While voting precincts are a great mesh to model language use across a region, the smaller sizes lead to significant data issues. For example, less populated areas use social media less, which can lead to voting precincts that have extremely limited data or no data at all. To counteract this, we propose a novel embedding technique where training and smoothing alternate to mitigate the weaknesses of both. Training has limited potential in voting precincts with little data, so smoothing will provide extra information to create a more accurate embedding. Smoothing can spread noise, so training afterwards can refine the embeddings.
We propose novel evaluations that explore how well embeddings can be used to predict information useful to sociolinguists. The first evaluation explores how well embeddings can be used to predict where a dialect is spoken using some specific features of the dialect. We use the Dictionary of American Regional English dataset (DAREDS) (Rahimi, Cohn, and Baldwin 2017), which provides key terms for various American dialects. We evaluate how well embeddings can be used to predict dialect areas from those key terms.
The second evaluation explores how well embeddings can be used to predict lexical variation. Lexical variation is the choice between two semantically similar lexical items, for example, fam versus family, and is a good determiner of linguistic variation (Cassidy, Hall, and Von Schneidemesser 1985; Carver 1987). We evaluate how well embeddings can be used to predict choices among lexical variants across a region of interest.
As part of these evaluations, we perform a hyperparameter analysis that demonstrates that post-training retrofitting can have numerical issues when applied to smaller areas, so alternating is a necessary step with smaller areas. As mentioned, many smaller areas lack sufficient data, so retrofitting with these areas can cause the spreading of noise, which in turn can result in unreliable embeddings.
We then provide a novel methodology for extracting sociolinguistic insights from social media data. Area embeddings capture language use in an area, and language use is connected to a wide swath of sociological factors. If we treat embeddings as the “genetic code” of an area, we can identify sections of the embeddings that act as genes for sociological phenomena. For example, we can find the “gene” that encodes how race and the urban–rural divide affect language use. By exploring the predictions of these “genes,” we can then connect the sociological phenomenon with a linguistic one, for example, identifying novel African American slang by analyzing the expressions of the “gene” corresponding to Black Percentage.
Finally, we use our embeddings to predict geographic boundaries of linguistic variation, or “isoglosses.” Prior work has used principal component analysis to infer isoglosses, but with smaller areas, we find that PCA will focus on the urban–rural divide and ignore regional divides. Instead, we find that t-distributed stochastic neighbor embedding (Van der Maaten and Hinton 2008) is better able to identify larger geographic distinctions.
2 Prior Work
While there has been a wealth of work that has used Twitter data to explore lexical variation (e.g., Eisenstein et al. 2012, 2014; Cook, Han, and Baldwin 2014; Doyle 2014; Jones 2015; Huang et al. 2016; Kulkarni, Perozzi, and Skiena 2016; Grieve, Nini, and Guo 2018), the incorporation of distributional methods is a more recent trend.
Huang et al. (2016) apply a count-based method to Twitter data to represent language use in counties across the United States. They use a manually created list of sociolinguistically relevant variant pairs, such as couch and sofa, from Grieve, Asnaghi, and Ruette (2013) and embed a county based on the proportion of each variant. They then apply adaptive kernel smoothing to smooth the counts and PCA for dimensionality reduction. They do not perform a quantitative evaluation, instead analyzing the principal components of the embeddings. One limitation of their approach is that it requires a list of sociolinguistically relevant variant pairs. Producing such pairs is labor-intensive, such pairs are specific to certain language varieties (variant pairs that make sense for American English may not make sense for British English), and they may lose relevance as language use changes over time.
Hovy and Purschke (2018) use document embedding techniques to represent language use in cities in Germany, Austria, and Switzerland. In this work, they collected social media data from Jodel,1 a social media platform, and used Doc2Vec (Le and Mikolov 2014) to produce an embedding for each city. As their goal was to explore regional variation, they used retrofitting (Faruqui et al. 2015; Hovy and Fornaciari 2018) to make the embeddings better match the NUTS2 regional breakdown of those countries. We discuss these methods further in Section 4. For quantitative evaluation, they compare clusterings of their embeddings to a German dialect map (Lameli 2013). While this is an excellent evaluation when such a map is available, the constantly evolving nature of language and the sheer difficulty of hand-creating such a dialect map make this approach difficult to generalize to analyses of new regions, especially a region as evolving and large as the state of Texas, which is our focus. The authors also evaluated their embeddings by measuring how well they could predict the geolocation of a tweet. While geolocation is a laudable goal in and of itself, our focus is on linguistic variation specifically, and geolocation is not necessarily a measure of how well the embeddings capture linguistic variation. For example, a list of business names in each area would be fantastic for geolocation, but of less use for analyzing variation.
Hovy et al. (2020) followed up this work by extending their method to cover entire continents/countries and not just the cities. They did this by dividing their region of interest into a coordinate grid of 11 km (6.8 mi.) by 11 km squares and training embeddings for each square. They then retrofitted the square embeddings. They did not perform a quantitative evaluation of their work.
An alternative approach to generating regional embeddings is to use linguistic features as the embedding coordinates. For example, Bohmann (2020) embeds Twitter linguistic registers into a space based on 236 linguistic features and then uses factor analysis on these embeddings to generate 10 dimensions of linguistic variation. While these kinds of embeddings are more interpretable, they require more a priori knowledge about relevant linguistic features and the capability to calculate them. While we do not explore linguistic feature–based embeddings in our work, we do perform a similar task in extracting lower-dimensional representations when analyzing theoretical linguistic hypotheses.
Clustering is a well-explored topic in computational dialectology (e.g., Grieve, Speelman, and Geeraerts 2011; Pröll 2013; Lameli 2013; Huang et al. 2016). To this effect, we largely follow the clustering approach in Hovy and Purschke (2018). We also explore this topic while incorporating newer clustering techniques, such as t-SNE (Van der Maaten and Hinton 2008). Like Hovy et al. (2020), we do not do hard clustering (like k-means) and only do soft clustering.
There has been work that has analyzed non-conventional spellings (Liu et al. 2011 and Han and Baldwin 2011, for example), but recent work has explored the use of word embeddings to study lexical variation through non-conventional spelling (Nguyen and Grieve 2020). In that work, the authors explored the connection between conventional and non-conventional forms and found that word embeddings do capture spelling variation (despite being ignorant of orthography in general) and discovered a link between the intent of the different spelling and the distance between the embeddings. While we do not directly interact with this work, their exploration of the connection between non-conventional spelling and lexical variation may be useful for future work.
There is a wealth of work that uses computational linguistic methods to connect sociological factors with word use (see Nguyen et al. [2016] for a review of work in this area as well as computational sociolinguistics in general). One such approach is that from Eisenstein, Smith, and Xing (2011), which uses a regression model to connect word use with demographic features. By using a regularization method to focus on key words, they show which words are connected to specific sociological factors. While we don’t connect word A with demographic B, we use a similar technique to extract sections of embeddings that are related to specific demographic differences.
3 Texas Twitter and Precinct Data Collection
Our focus is on language use across the state of Texas. It is large, populous, and has been researched only lightly in sociolinguistics and dialect geography, compared with other large American states. Both Thomas and Bailey have contributed quantitative studies of variation in Mainstream (not ethnically specific) Texas English: Thomas (1997) describes a rural/urban split in Texas dialects, driven by the much-accelerated migration of non-southerners into Texas and other southern U.S. states since the latter decades of the twentieth century, a trend that effectively creates “dialect islands in Texas where the large metropolitan centers lie” (Thomas 1997, page 309) and relegates canonical features of southern U.S. speech (Thomas’s focus is on the monophthongization of PRICE and the lowering of the nucleus in FACE vowels) to rural areas and small towns. Bailey et al. (1991), by tracking nine different features of phonetic innovation/conservativeness in Texas English and resolving findings at the level of the county, identify the most linguistically innovative areas driving change in Texas English as a cluster of five counties in the Dallas/Fort Worth area (Figure 1).
In addition to these geographic approaches to variation in Texas, there have been a number of studies focusing on selected features (Bailey and Dyer 1992; Atwood 1962; Bailey et al. 1991; Bernstein 1993; Di Paolo 1989; Hinrichs, Bohmann, and Gorman 2013; Koops 2010; Koops, Gentry, and Pantos 2008; Walsh and Mote 1974; Tarpley 1970; Wheatley and Stanley 1959) and/or variation and change in minority varieties (Bailey and Maynor 1989, 1987, 1985; Bayley 1994; Galindo 1988; Garcia 1976; Bailey and Thomas 2021; McDowell and McRae 1972).
Outside of computational sociolinguistics, attempts to geographically model linguistic variation in Texas English have been made as part of the established, large initiatives in American dialect mapping. These include:
Kurath’s linguistic atlas project (LAP; see Petyt [1980] for an overview) that produced the Linguistic Atlas of the Gulf States (Pederson 1986), based on survey data;
Carver’s (1987) “word geography” atlas of American English dialects, which visualizes data from the Dictionary of American Regional English (Cassidy, Hall, and Von Schneidemesser 1985) on the geographic distribution of lexical items; and
the Atlas of North American English (Labov et al. 2006), which maps phonetic variation in phone interview data from speakers of American English.
3.1 Data Collection
In this section, we will describe how we collected Texas Twitter data for our analysis. Twitter data has allowed sociolinguists new ways to explore how society affects language (Mencarini 2018). This data is composed of a large selection of natural uses of language that cut across many social boundaries. Additionally, tweets are often geotagged, which allows researchers to connect examples of language use with location.
We draw our Twitter data from two sources. The first is from archive.org’s collection of billions of tweets (Archive Team 2017) that were retrieved between 2011 and 2017. This collection represents tweets from all over the world and not Texas specifically. The second source is a collection of 13.6 million tweets that were retrieved using the Twitter API between February 16, 2017, and May 3, 2017. We only retrieved tweets that originate in a rectangular bounding box that contains Texas.
Our preprocessing steps are as follows. First, we remove all tweets that have neither coordinate information nor a city name in their metadata. For any tweet that has a city name but no coordinates, we assign coordinates from the simplemaps.org United States city database2 based on the city’s coordinates. We then remove tweets that were not sent from Texas. We then remove all tweets that contain a hashtag (#) to help remove automatically generated tweets, such as highway accident reports. We then use the ekphrasis Python module to normalize the tweets (Baziotis, Pelekis, and Doulkeridis 2017). We do not remove mentions or replace them with a named entity label. Together, this results in 2.3 million tweets (1.7 million from archive.org and 563,000 from the Twitter API).
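To make the pipeline concrete, the sketch below applies the same filters to a single toy tweet record; the field names, the bounding-box coordinates, and the ekphrasis options are illustrative assumptions rather than our exact configuration.

```python
from ekphrasis.classes.preprocessor import TextPreProcessor

# Approximate Texas extent in degrees (illustrative; the real filter tests the state boundary).
TX_BBOX = {"lon_min": -106.7, "lon_max": -93.5, "lat_min": 25.8, "lat_max": 36.6}

# ekphrasis normalizer (Baziotis, Pelekis, and Doulkeridis 2017); options shown are illustrative.
normalizer = TextPreProcessor(normalize=["url", "email", "user", "number"])

def keep_and_normalize(tweet, city_coords):
    """tweet: dict with hypothetical keys 'text', 'coords', 'city'.
    Returns (normalized_text, (lon, lat)) or None if the tweet is filtered out."""
    coords = tweet.get("coords")
    if coords is None:
        coords = city_coords.get(tweet.get("city"))   # simplemaps.org city lookup
    if coords is None:
        return None                                   # no usable location information
    lon, lat = coords
    if not (TX_BBOX["lon_min"] <= lon <= TX_BBOX["lon_max"]
            and TX_BBOX["lat_min"] <= lat <= TX_BBOX["lat_max"]):
        return None                                   # not sent from Texas
    if "#" in tweet["text"]:
        return None                                   # drop hashtag tweets (often automated)
    return normalizer.pre_process_doc(tweet["text"]), (lon, lat)
```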
In Figure 2, we visualize the number of tweets in each voting precinct (left) and the voting precincts that have 10 or fewer tweets (right). We see that quite a few voting precincts have 10 or fewer tweets, especially in rural areas and West Texas. This indicates that many precincts do not have enough tweets to generate accurate representations on their own and thus require some form of smoothing. In Figure 3, we show how the tweets are distributed across voting precincts, ranked by number of tweets. We see that a few precincts have a vast number of tweets, but most voting precincts have tweet counts in the hundreds.
3.2 Voting Precincts
Our goal is to represent language use across the entirety of Texas (including rural Texas) as well as capture fine-grained differences in language use (including within a city). In prior work, researchers either only used cities (e.g., Hovy and Purschke 2018), or used a coordinate grid (e.g., Hovy et al. 2020). The former does not explore rural areas at all and does not explore within-city divisions. The latter uses boundaries that do not reflect the geography of the area and are difficult to use for fine-grained analyses.
To achieve our goals, we operate at the voting precinct level. Voting precincts are relatively tiny political divisions that are used for the efficient administration of elections. Each voting precinct usually has one polling place and, in the 2016 election, each voting precinct contained on average 1,547 registered voters nationwide (U.S. Election Assistance Commission 2017). These voting precincts are generally relatively small (on average containing 3,083 people), cohesive (each voting precinct must reside entirely within an electoral district/county), and balanced (generally, voting precincts are designed to contain similar population sizes). Additionally, states record meticulous detail on the demographics of each voting precinct (see Table 1 for descriptive statistics). Thus, these voting precincts act as perfect building blocks.3
Variable | Pop/Area Per VP | Demo % of VP |
---|---|---|
Land Area | 76.08 km² (± 18.55 km²) | |
Population | 3,083.0 (± 2601.2) | 100.0% (± 0.0%) |
Asian | 116.2 (± 309.1) | 2.60% (± 5.48%) |
Black | 354.1 (± 681.6) | 10.6% (± 16.8%) |
Hispanic | 1,160.5 (± 1677.5) | 33.7% (± 27.6%) |
Multiple | 39.1 (± 50.9) | 1.15% (± 0.90%) |
Native American | 9.8 (± 12.9) | 0.36% (± 1.09%) |
Other | 4.1 (± 7.6) | 0.11% (± 0.22%) |
Pacific Islander | 2.1 (± 10.7) | 0.06% (± 0.66%) |
White | 1,396.8 (± 1384.4) | 51.3% (± 29.4%) |
We note that gerrymandering has very little influence on voting precinct boundaries. It is true that congressional districts (and similar) can be heavily gerrymandered and that voting precincts are bound by congressional district boundaries. However, the practical pressures of administration and the relatively small size of voting precincts minimize these effects. Voting precincts are used to administer elections, which means that significant effort is needed to coordinate people to run polling stations and identify locations where people can vote. Additionally, voting precincts are often used to organize polling and signature collection. Due to these factors, there is a strong need for all parties involved to make voting precincts as compact and efficient as possible. Furthermore, voting precinct boundaries only decide where you vote and not who you vote for (unlike congressional district boundaries), so there is little pressure to gerrymander them in the first place. Voting precincts are also generally small enough to fit into the nooks and crannies of congressional districts. Congressional districts contain dozens of voting precincts, so voting precincts are small enough to be compact despite any boundary issues of the larger congressional district. It is for these reasons that voting precincts are often used as atomic units in redistricting efforts (e.g., Baas n.d.).
The voting precinct information comes from the United States Census and is compiled by the Auto-Redistrict project (Baas n.d.). Each precinct in this data comes with the coordinate bounds of the precinct along with the census demographic data. Further processing of the demographic data was done by Murray and Tengelsen (2018).
In order to map tweets to voting precincts, we first extract a representative point for each voting precinct using the Shapely Python module (Gillies et al. 2007). Representative points are computationally efficient approximations to the center of a voting precinct. We then associate each tweet with the voting precinct whose representative point is closest to the tweet’s coordinates.
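A minimal sketch of this assignment step with Shapely, assuming each precinct is available as a polygon keyed by a precinct id; the brute-force nearest-point search stands in for whatever spatial index one would use at scale.

```python
from shapely.geometry import Point, Polygon

def precinct_centers(precinct_polygons):
    """Map precinct id -> representative point (cheap to compute, guaranteed inside the polygon)."""
    return {pid: poly.representative_point() for pid, poly in precinct_polygons.items()}

def assign_precinct(lon, lat, centers):
    """Return the id of the precinct whose representative point is closest to the tweet."""
    tweet_point = Point(lon, lat)
    return min(centers, key=lambda pid: tweet_point.distance(centers[pid]))

# toy example with two rectangular "precincts"
centers = precinct_centers({
    "vp_001": Polygon([(-97.8, 30.2), (-97.7, 30.2), (-97.7, 30.3), (-97.8, 30.3)]),
    "vp_002": Polygon([(-97.7, 30.2), (-97.6, 30.2), (-97.6, 30.3), (-97.7, 30.3)]),
})
print(assign_precinct(-97.75, 30.25, centers))   # -> "vp_001"
```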
4 Voting Precinct Embedding Methods
In this section, we describe the area embedding methods we will analyze. Area embedding methods generally have two parts: a training part and a smoothing part. The training part takes text and uses a machine learning or count-based model to produce embeddings. The smoothing part averages area embeddings with their neighbors to add extra information.
4.1 Count-Based Methods
The first approach we explore is a count-based approach from Huang et al. (2016). The training part counts the relative frequencies of a manually curated list of sociolinguistically relevant lexical variations. The smoothing part takes a weighted average of the area embedding and enough nearest neighbors to meet some data threshold.
4.1.1 Training: Mean-Variant-Preference
Grieve, Asnaghi, and Ruette (2013) and Grieve and Asnaghi (2013) have manually collected sets of lexical variants where the choice of variant is indicative of local language use. For example, soda, pop, and Coke are a set of lexical variants for “soft drink” and regions have a variant preference. Huang et al. (2016) count the relative frequency of variants and use these counts as the embedding.
More specifically, they begin with a manually curated list of sociolinguistically relevant sets of lexical variants. They designate the most frequent variant as the “main” variant. In the soft drink example, soda would be the main variant, as it is the most frequent variant among all variants.
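As a rough illustration of this count-based representation (our reading of Huang et al. 2016), the sketch below builds one dimension per variant set by recording how often an area’s tweets use the designated main variant relative to all variants in the set; the toy variant list and the exact normalization are assumptions.

```python
from collections import Counter

# toy Grieve-style variant sets; the first entry of each set is the designated main variant
VARIANT_SETS = [("soda", "pop", "coke"), ("couch", "sofa")]

def mvp_embedding(area_tokens, variant_sets=VARIANT_SETS):
    """One value per variant set: relative frequency of the main variant within the set."""
    counts = Counter(area_tokens)
    embedding = []
    for variants in variant_sets:
        total = sum(counts[v] for v in variants)
        embedding.append(counts[variants[0]] / total if total else 0.0)  # 0.0 if the set never occurs
    return embedding

print(mvp_embedding(["i", "want", "a", "soda", "no", "a", "pop", "on", "the", "couch"]))
# -> [0.5, 1.0]
```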
4.1.2 Smoothing: Adaptive Kernel Smoothing
One issue with working with area embeddings is that there is an uneven distribution of tweets, and many areas can lack tweet data. Huang et al. (2016) perform smoothing by creating neighborhoods that have enough data and then taking a weighted average of the embeddings in the neighborhood.
For an area A, a neighborhood is the smallest set of geographically closest areas to A that have data above a certain threshold. For a set of lexical variants, this is some multiple B times the average frequency of those variants across all areas. For soda, pop, and Coke, this would be B times the average number of times someone used any of those variants. Huang et al. (2016) explore B values of 1, 10, and 100.
As we will also explore more traditional embedding models, such as Doc2Vec, we adapt this smoothing approach for unsupervised machine learning models. Instead of average counts of variants, we use average number of tweets. In that way, each neighborhood will have a sufficient number of tweets to mitigate the data sparsity issue.
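The sketch below illustrates this neighborhood construction in the tweet-count formulation: starting from the area itself, geographically closest areas are added until the neighborhood reaches B times the average per-area tweet count, and the member embeddings are then averaged. Weighting each member by its tweet count is a simplifying assumption standing in for the kernel weighting of Huang et al. (2016).

```python
import numpy as np

def smooth_area(area_id, embeddings, tweet_counts, distances, B=10):
    """embeddings: dict id -> np.ndarray; tweet_counts: dict id -> int;
    distances: dict id -> dict id -> geographic distance (0 for the area itself)."""
    threshold = B * np.mean(list(tweet_counts.values()))
    order = sorted(embeddings, key=lambda other: distances[area_id][other])  # closest first, self included
    neighborhood, total = [], 0
    for other in order:
        neighborhood.append(other)
        total += tweet_counts[other]
        if total >= threshold:
            break
    weights = np.array([tweet_counts[a] for a in neighborhood], dtype=float)
    weights += 1e-12                       # avoid division by zero when the neighborhood has no tweets
    vectors = np.array([embeddings[a] for a in neighborhood])
    return weights @ vectors / weights.sum()
```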
4.2 Post-training Retrofitting
The approach Hovy and Purschke (2018) and Hovy et al. (2020) took in their analysis is one where embeddings are first trained on social media data and then altered so that adjacent areas have more similar embeddings. The first step uses Doc2Vec (Le and Mikolov 2014), while the second step uses retrofitting (Faruqui et al. 2015).
4.2.1 Training: Doc2Vec
The first part in their approach is to train a Doc2Vec model (Le and Mikolov 2014) for 10 epochs to obtain an embedding for each German-speaking city (Hovy and Purschke 2018) or coordinate square (Hovy et al. 2020). Doc2Vec is an extension of word2vec (Mikolov et al. 2013) that also trains embeddings for document labels (or in this case, the city/square/voting precinct where the post was written).
The result of this process is that we have an embedding for each voting precinct (in our case) or coordinate square/German-speaking city (in Hovy and Purschke’s case).
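A minimal gensim sketch of this training step, tagging each tweet with its voting precinct id; apart from the 300 dimensions and 10 epochs used here, the hyperparameters and toy data are assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: precinct id -> list of tokenized tweets
tweets_by_precinct = {
    "vp_001": [["fixin", "to", "head", "out"], ["yall", "want", "a", "coke"]],
    "vp_002": [["anyone", "want", "a", "soda"]],
}

corpus = [TaggedDocument(words=tokens, tags=[pid])
          for pid, tweets in tweets_by_precinct.items()
          for tokens in tweets]

model = Doc2Vec(vector_size=300, min_count=1, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# one 300-dimensional vector per precinct (model.dv in gensim 4.x; model.docvecs in older versions)
precinct_embeddings = {pid: model.dv[pid] for pid in tweets_by_precinct}
```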
4.2.2 Smoothing: Retrofitting
One key insight from Hovy and Purschke (2018) is that Doc2Vec alone can produce embeddings that capture language use in an area, but not in a way that captures regional variation as opposed to city-specific artifacts. For example, an embedding for the city of Austin, Texas, might capture all of the language use surrounding specific bus lines in the Austin Public Transportation system, but that information is less useful for understanding differences in language use across Texas. Retrofitting (Faruqui et al. 2015) addresses this by repeatedly averaging each area’s embedding with those of its geographic neighbors, pulling adjacent areas toward one another so that shared regional signal is emphasized over such area-specific artifacts.
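A sketch of retrofitting in this spirit: each precinct’s vector is repeatedly pulled toward the average of its geographically adjacent precincts while staying anchored to its originally trained vector. The equal weighting of the two terms (alpha = 0.5) is an assumption; Faruqui et al. (2015) allow per-edge weights.

```python
import numpy as np

def retrofit(embeddings, neighbors, iterations=50, alpha=0.5):
    """embeddings: dict id -> np.ndarray; neighbors: dict id -> list of adjacent ids.
    alpha balances the original (trained) vector against the current neighborhood average."""
    original = {k: v.copy() for k, v in embeddings.items()}
    current = {k: v.copy() for k, v in embeddings.items()}
    for _ in range(iterations):
        for pid, adjacent in neighbors.items():
            if not adjacent:
                continue                       # isolated areas keep their trained vector
            neighbor_mean = np.mean([current[a] for a in adjacent], axis=0)
            current[pid] = alpha * original[pid] + (1 - alpha) * neighbor_mean
    return current
```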
4.3 Proposed Models
Given that our divisions are much smaller than those in previous work, we propose several area embedding methods that may perform better under our circumstances.
4.3.1 Geography Only Embedding
In this section, we describe a novel baseline that reflects embeddings that effectively contain only geographic information and no Twitter data, which we call Geography Only Embedding. In this approach, embeddings are randomly generated (we use a Doc2Vec model that is initialized, but not trained) and then retrofitted using the same process described above.
Despite its simple description, this approach can be seen as one where embeddings capture solely geographic information. To see this, note that the randomization process provides each precinct its own completely random embedding. In effect, the embedding acts as a kind of unique identifier for the precinct, as it is incredibly unlikely for two 300-dimensional random vectors to be similar. By retrofitting (i.e., averaging these unique identifiers across neighboring precincts), we form unique identifiers for larger subregions. Thus, each precinct and each area has an embedding that directly reflects where it is located on the map. In this way, these embeddings capture geographic properties while containing no Twitter information.
4.4 Smoothing: Alternating
One issue with the Post-training Retrofitting approach in our setting is that it relies on a large body of tweets per area. In our case, the voting precincts are too small. Despite having 2.3 million tweets, each voting precinct only contains about 400 tweets on average, and hundreds of precincts have fewer than 10 tweets. Thus, the initial Doc2Vec step would lack sufficient data to create quality embeddings, and the retrofitting step would then just propagate noise.
In order to alleviate this issue, we propose to alternate the Doc2Vec and retrofitting steps to mitigate the weaknesses of both. In our setting, training injects tweet information into the embeddings, but voting precincts often lack enough data to be trained on their own. In contrast, retrofitting can send information from adjacent neighbors to improve an embedding, but it can also overwhelm the embedding with noise or irrelevant information; for example, the Austin embedding (a major metropolis) could overwhelm the Round Rock embedding (a suburb of Austin) even though language use differs between those areas. If we train after retrofitting, we can correct any wrong information from the adjacent neighbors. If we retrofit after training, we can provide information where it is lacking. Thus, alternating these steps mitigates each step’s weakness.
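A schematic of this alternating schedule, reusing the Doc2Vec setup and the retrofit function sketched above; how the 10 training epochs and 50 smoothing iterations are split into rounds is our own illustrative choice.

```python
def alternate(model, corpus, neighbors, rounds=5, epochs_per_round=2, smoothing_per_round=10):
    """Alternate Doc2Vec training with retrofitting-style smoothing (5 x 2 = 10 epochs, 5 x 10 = 50 iterations)."""
    for _ in range(rounds):
        # training: inject tweet information into the precinct vectors
        model.train(corpus, total_examples=model.corpus_count, epochs=epochs_per_round)
        # smoothing: share information between geographically adjacent precincts
        vectors = {pid: model.dv[pid].copy() for pid in model.dv.index_to_key}
        smoothed = retrofit(vectors, neighbors, iterations=smoothing_per_round)
        for pid, vec in smoothed.items():
            model.dv.vectors[model.dv.key_to_index[pid]] = vec  # write smoothed vectors back
    return model
```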
4.5 Training: BERT with Label Embedding Fusion
Since this prior work, there have been advances in document embedding approaches, such as those that use contextual embeddings. We explore BERT with Label Embedding Fusion (BERTLEF) (Xiong et al. 2021), a recent approach in this area. BERTLEF combines the label and the document as a sentence pair and trains BERT for up to 5 epochs to predict the label and the document. This is similar to the Paragraph Vectors flavor of Doc2Vec in that it uses the label and document to predict the context. A diagram showing how this approach works is shown in Figure 4.
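As a rough sketch of the input construction described above (not the full BERTLEF training loop), each example pairs the precinct label with a tweet as a BERT sentence pair; the label string format is a toy assumption, and the actual training objective follows Xiong et al. (2021).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_label_document_pair(precinct_label, tweet_text):
    """Encode [CLS] precinct label [SEP] tweet [SEP] as a single BERT input."""
    return tokenizer(precinct_label, tweet_text, truncation=True,
                     padding="max_length", max_length=64, return_tensors="pt")

batch = encode_label_document_pair("precinct 1042", "fixin to head to the game yall")
print(batch["input_ids"].shape)   # (1, 64)
```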
4.6 Approach Summary
We summarize the different approaches we will explore in Table 2. “Model” is the training part and “Smoothing” is the smoothing part. “Data” indicates if the underlying data is a manually crafted set of features (“Grieve List”), raw text, or some other data. “Train epochs” is the number of epochs the models were trained in total. “Smooth Iter” is the number of smoothing iterations in total. “Dim” is the final dimension size of the embeddings.
Model | Smoothing | Data | Train Epochs | Smooth Iter | Dim |
---|---|---|---|---|---|
Static | None | Ones | None | None | 1 |
Coordinates | None | Lat–Long | None | None | 2 |
MVP | AKS B = 1 | Grieve list | None | 1 | 45 |
MVP + PCA | AKS B = 1 | Grieve list | None | 1 | 15 |
MVP | AKS B = 10 | Grieve list | None | 1 | 45 |
MVP + PCA | AKS B = 10 | Grieve list | None | 1 | 15 |
MVP | AKS B = 100 | Grieve list | None | 1 | 45 |
MVP + PCA | AKS B = 100 | Grieve list | None | 1 | 15 |
Random 300 | None | None | None | None | 300 |
Random 300 | Retrofitting | None | None | 50 | 300 |
Doc2Vec | None | Raw text | 10 | None | 300 |
Doc2Vec | AKS B = 1 | Raw text | 10 | 1 | 300 |
Doc2Vec + PCA | AKS B = 1 | Raw text | 10 | 1 | 15 |
Doc2Vec | AKS B = 10 | Raw text | 10 | 1 | 300 |
Doc2Vec + PCA | AKS B = 10 | Raw text | 10 | 1 | 15 |
Doc2Vec | AKS B = 100 | Raw text | 10 | 1 | 300 |
Doc2Vec + PCA | AKS B = 100 | Raw text | 10 | 1 | 15 |
Doc2Vec | Retrofitting | Raw text | 10 | 50 | 300 |
Doc2Vec | Alternating | Raw text | 10 | 50 | 300 |
Random 768 | None | None | None | None | 768 |
Random 768 | Retrofitting | None | None | 50 | 768 |
BERTLEF | None | Raw text | 5 | None | 768 |
BERTLEF | AKS B = 1 | Raw text | 5 | 1 | 768 |
BERTLEF + PCA | AKS B = 1 | Raw text | 5 | 1 | 15 |
BERTLEF | AKS B = 10 | Raw text | 5 | 1 | 768 |
BERTLEF + PCA | AKS B = 10 | Raw text | 5 | 1 | 15 |
BERTLEF | AKS B = 100 | Raw text | 5 | 1 | 768 |
BERTLEF + PCA | AKS B = 100 | Raw text | 5 | 1 | 15 |
BERTLEF | Retrofitting | Raw text | 5 | 50 | 768 |
BERTLEF | Alternating | Raw text | 5 | 50 | 768 |
We have six baselines. The first is “Static,” which is just a single constant value and emulates the use of static embeddings. The second is “Coordinates,” which uses a representative point4 of the voting precinct as the embedding; “Lat–Long” refers to latitude and longitude. “Random 300 None” and “Random 768 None” are random embeddings with no smoothing. “Random 300 Retrofitting” and “Random 768 Retrofitting” are random vectors to which retrofitting is applied. As discussed in Section 4.3.1, these correspond to embeddings that capture geographic information and do not contain any linguistic information.
We then have the count-based approach by Huang et al. (2016). “MVP” is Mean-Variant-Preference (Section 4.1.1). “AKS” is adaptive kernel smoothing, “B” is the multiplier, and “PCA” means applying PCA after AKS (Section 4.1.2). “Grieve list” is the list of sets of sociolinguistically relevant lexical variants described in Section 4.1.1.
Finally, we have the machine learning and iterated smoothing methods. “Doc2Vec” is Doc2Vec (Section 4.2.1). “BERTLEF” is BERT with Label Embedding Fusion (Section 4.5). “Retrofitting” applies smoothing after training (Section 4.2.2) and “Alternating” alternates smoothing with training (Section 4.4). “Raw text” means that the model is trained on text instead of manually crafted features.
5 Quantitative Evaluation
5.1 Prediction of Dialect Area from Dialect-specific Terms
Our first evaluation measures how well embeddings can be used to map a dialect when provided some words specific to that dialect. We use the dialect divisions in DAREDS (Rahimi, Cohn, and Baldwin 2017), which divides the United States into 99 dialect regions, each with its own set of unique terms. These regions and terms were compiled from the Dictionary of American Regional English (Cassidy, Hall, and Von Schneidemesser 1985). As our focus is on the state of Texas, we only use the “Gulf States,” “Southwest,” “Texas,” and “West” dialects, each of which includes cities in Texas. The list of terms specific to those regions can be found in Appendix B.
We model the number of times the dialect-specific terms are used in each voting precinct as a Poisson-distributed count. If an embedding method captures variational language use, then a Poisson regression fit on those embeddings should accurately predict this Poisson-distributed count. Poisson regression is like ordinary linear regression except that the response is assumed to follow a Poisson distribution around the predicted mean instead of a Normal distribution.
One particular issue with performing Poisson regression on large embeddings is that models may not converge due to data separation (Mansournia et al. 2018). To correct this, we use bias-reduction methods (Firth 1993; Kosmidis and Firth 2009), which are proven to always produce finite parameter estimates (Heinze and Schemper 2002). We use R’s brglm2 package (Kosmidis 2020) to do this.
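As an illustration of this evaluation setup (swapping in a plain maximum-likelihood fit in Python rather than the Firth-type bias reduction from brglm2 that we actually use), the sketch below regresses toy per-precinct counts of dialect terms on toy embeddings and reports the AIC.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 5)))   # 200 precincts, 5-dim toy embeddings + intercept
counts = rng.poisson(lam=2.0, size=200)          # toy counts of dialect-specific terms per precinct

poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(poisson_fit.aic)                           # lower AIC indicates a better fit
```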
We show the AIC scores (lower is better) for the various precinct embedding approaches in Table 3. See Section 4.6 for a reference for the method names. In the Gulf States region, we see that methods that use manually crafted lists of lexical variants (MVP models) are competitive with machine learning–based models applied to raw text, with the largest neighborhood size (B = 100) outperforming those models. However, in the other regions, the Doc2Vec approaches that use Retrofitting and Alternating smoothing greatly outperform the MVP approaches. What this indicates is that if we have a priori knowledge of sociolinguistically relevant lexical variants, then we can accurately predict dialect areas. However, machine learning methods can achieve similar or greater results with just raw text. Thus, even when lexical variant information is unavailable, we can still make accurate predictions.
Method | Smoothing | Gulf States AIC | Southwest AIC | Texas AIC | West AIC |
---|---|---|---|---|---|
Static | None | 4,890.32 | 8,793.00 | 7,885.50 | 6,236.38 |
Coordinates | None | 4,859.89 | 8,159.15 | 7,681.31 | 6,090.05 |
MVP | AKS B = 1 | 4,713.70 | 8,251.73 | 7,214.86 | 6,078.22 |
MVP + PCA | AKS B = 1 | 4,713.31 | 8,492.32 | 7,523.04 | 6,110.55 |
MVP | AKS B = 10 | 4,696.95 | 7,697.70 | 7,011.86 | 5,933.71 |
MVP + PCA | AKS B = 10 | 4,725.05 | 8,324.49 | 7,483.78 | 6,060.23 |
MVP | AKS B = 100 | 4,581.97 | 7,421.84 | 7,123.18 | 5,861.19 |
MVP + PCA | AKS B = 100 | 4,584.86 | 7,710.95 | 7,382.14 | 5,950.82 |
Random 300 | None | 4,878.53 | 7,441.02 | 6,780.70 | 6,065.14 |
Random 300 | Retrofitting | 4,778.34 | 7,196.95 | 6,372.70 | 5,797.75 |
Doc2Vec | None | 4,599.22 | 6,746.71 | 6,145.31 | 5,511.69 |
Doc2Vec | AKS B = 1 | 4,945.14 | 7,940.38 | 7,498.78 | 6,088.75 |
Doc2Vec + PCA | AKS B = 1 | 4,859.17 | 8,706.27 | 7,819.10 | 6,187.54 |
Doc2Vec | AKS B = 10 | 4,907.23 | 7,589.73 | 7,211.45 | 6,058.02 |
Doc2Vec + PCA | AKS B = 10 | 4,874.47 | 8,662.70 | 7,827.59 | 6,153.67 |
Doc2Vec | AKS B = 100 | 5,017.93 | 7,916.88 | 7,038.32 | 6,093.19 |
Doc2Vec + PCA | AKS B = 100 | 4,880.77 | 8,689.66 | 7,869.85 | 6,182.27 |
Doc2Vec | Retrofitting | 4,814.15 | 7,164.03 | 6,433.94 | 5,802.43 |
Doc2Vec | Alternating | 4,689.96 | 6,919.24 | 6,192.12 | 5,659.31 |
Random 768 | None | 5,345.06 | 7,211.48 | 6,609.13 | 6,029.10 |
Random 768 | Retrofitting | 5,366.13 | 7,349.66 | 6,534.66 | 6,221.10 |
BERTLEF | None | 5,299.95 | 7,211.09 | 6,521.57 | 6,260.76 |
BERTLEF | AKS B = 1 | 5,292.91 | 7,217.49 | 6,828.36 | 6,212.75 |
BERTLEF + PCA | AKS B = 1 | 4,870.77 | 8,601.52 | 7,860.10 | 6,208.87 |
BERTLEF | AKS B = 10 | 5,286.53 | 7,390.63 | 6,793.89 | 6,172.18 |
BERTLEF + PCA | AKS B = 10 | 4,870.26 | 8,647.27 | 7,847.80 | 6,215.73 |
BERTLEF | AKS B = 100 | 5,382.80 | 7,538.72 | 6,630.50 | 6,176.40 |
BERTLEF + PCA | AKS B = 100 | 4,894.13 | 8,639.23 | 7,858.67 | 6,230.27 |
BERTLEF | Retrofitting | 5,450.53 | 7,619.40 | 6,875.99 | 6,355.34 |
BERTLEF | Alternating | 5,308.68 | 7,377.52 | 6,511.52 | 6,124.20 |
Among the Doc2Vec approaches, we see that Alternating smoothing does better than all other forms of smoothing. More than that, Alternating smoothing is the only one that consistently beats the geography-only baseline (Random 300 Retrofitting). In other words, the other smoothing approaches may not be leveraging as much linguistic information as they could and may be overpowered by the geography signal. In contrast, alternating smoothing and training produces embeddings that provide more than what can be provided by geography alone.
In the table, we see that Doc2Vec without smoothing outperforms Doc2Vec with smoothing. We see a similar phenomenon with the BERTLEF models. The nature of the task may benefit Doc2Vec without smoothing as counts in an area are going to be higher in places with more data. However, we see that Doc2Vec Alternating smoothing does better than every other smoothing variant across the board. In particular, Alternating smoothing outperforms the AKS approaches. What that indicates is that the effectiveness of MVP models is due to the manually crafted list of lexical variants and less due to the smoothing approach.
In Figures 5–8, we visualize the predictions of a select set of methods for the relevant DAREDS regions.5 In each one, we see that Doc2Vec None produces a noisy, largely indiscernible pattern, indicating that its high score may be related to the model learning artifacts of the dataset. In contrast, Doc2Vec Alternating (panel e) and MVP AKS B = 100 (panel b) produce patterns that make sense; for example, the prediction of the “Gulf States” region is near the Gulf of Mexico (southeast of Texas), for which the region is named. Similarly, these models place the “Southwest” and “West” regions to the southwest and west, respectively. Of particular note, these predictions match the locations where the words were actually used, as shown in panel a. In contrast, Doc2Vec Retrofitting (panel d) and BERTLEF Alternating (panel f) show some appropriate regional patterns, but are much messier than Doc2Vec Alternating, which is consistent with their scores.
BERT-based models generally do worse than their Doc2Vec counterparts. One possibility is that the added value of using a BERT model does not outweigh the increase in parameters (768 dimensions in BERT versus 300 in Doc2Vec). What this indicates is that the added pretraining done with BERT may not provide the boost for analyzing lexical variation that is seen in other kinds of tasks. Additionally, while we see that Alternating smoothing does better than Retrofitting, both are worse than the AKS smoothing methods, and Retrofitting smoothing is worse than the random vector baseline. In Figure 9, we show a possible explanation and explore this phenomenon in more detail in the next evaluation. The figure shows the tradeoff between the number of smoothing iterations and AIC. Generally, AIC increases with more Retrofitting iterations, which indicates a worse fit. Thus, for our data, retrofitting may actually be detrimental, and fewer iterations would therefore be less harmful. In contrast, with Alternating smoothing we do not see an increase in AIC, which indicates that alternating training and smoothing may mitigate any harm that could be brought by smoothing the data.
The other metric we explore is McFadden’s pseudo-R² (McFadden et al. 1973). McFadden’s pseudo-R² is a generalization of the coefficient of determination (R²) that is more appropriate for generalized linear models, such as Poisson regression. Whereas the coefficient of determination is 1 minus the residual sum of squares divided by the total sum of squares, McFadden’s pseudo-R² is 1 minus the residual deviance over the null deviance. The deviance of a model is twice the difference between the log-likelihood of a saturated model (one that fits the observed values exactly) and the log-likelihood of the model in question. The residual deviance is the deviance of the model in question, and the null deviance is the deviance of a model that assigns the same rate to every voting precinct (i.e., a model with only an intercept and no embedding information).
McFadden’s pseudo-R² = 1 − (residual deviance / null deviance)
We chose this metric as well because it produces values that are easier to interpret (1 is the best, 0 means the model is just as good as a constant model, and negative numbers indicate that the model is worse than a constant model). However, it does not have many of the nice properties that AIC has.
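Given a fitted GLM such as the Poisson sketch above, McFadden’s pseudo-R² can be read off directly from the residual and null deviances; for instance, with statsmodels:

```python
def mcfadden_pseudo_r2(fit):
    """1 - residual deviance / null deviance for a fitted statsmodels GLM."""
    return 1.0 - fit.deviance / fit.null_deviance
```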
We provide the corresponding evaluation scores in Table 4 and hyperparameter analysis graphs in Figure 10. Pseudo-R² values are largely connected to the number of parameters (MVP scores are lower than Doc2Vec scores, which are lower than BERTLEF scores), so comparing models with different parameter sizes is of limited help. What the pseudo-R² does tell us is that the embeddings are useful for capturing dialect areas, as the values are positive (i.e., more useful than a constant model). More than this, as values between 0.2 and 0.4 are seen as indicators of excellent fit (McFadden 1977), we see that the Doc2Vec and BERTLEF approaches with Retrofitting and Alternating smoothing provide excellent fits for the data.
Method | Smoothing | Gulf States R² | Southwest R² | Texas R² | West R² |
---|---|---|---|---|---|
Static | None | 0.00 | 0.00 | 0.00 | 0.00 |
Coordinates | None | 0.01 | 0.09 | 0.03 | 0.03 |
MVP | AKS B = 1 | 0.07 | 0.09 | 0.12 | 0.05 |
MVP + PCA | AKS B = 1 | 0.06 | 0.05 | 0.06 | 0.03 |
MVP | AKS B = 10 | 0.08 | 0.17 | 0.16 | 0.09 |
MVP + PCA | AKS B = 10 | 0.05 | 0.07 | 0.07 | 0.05 |
MVP | AKS B = 100 | 0.11 | 0.21 | 0.14 | 0.10 |
MVP + PCA | AKS B = 100 | 0.09 | 0.16 | 0.09 | 0.07 |
Random 300 | None | 0.17 | 0.29 | 0.28 | 0.17 |
Random 300 | Retrofitting | 0.20 | 0.32 | 0.34 | 0.23 |
Doc2Vec | None | 0.25 | 0.39 | 0.38 | 0.29 |
Doc2Vec | AKS B = 1 | 0.15 | 0.21 | 0.16 | 0.16 |
Doc2Vec + PCA | AKS B = 1 | 0.02 | 0.02 | 0.02 | 0.02 |
Doc2Vec | AKS B = 10 | 0.16 | 0.26 | 0.21 | 0.17 |
Doc2Vec + PCA | AKS B = 10 | 0.01 | 0.02 | 0.01 | 0.02 |
Doc2Vec | AKS B = 100 | 0.13 | 0.22 | 0.23 | 0.16 |
Doc2Vec + PCA | AKS B = 100 | 0.01 | 0.02 | 0.01 | 0.02 |
Doc2Vec | Retrofitting | 0.19 | 0.33 | 0.33 | 0.23 |
Doc2Vec | Alternating | 0.22 | 0.36 | 0.37 | 0.26 |
Random 768 | None | 0.30 | 0.46 | 0.46 | 0.38 |
Random 768 | Retrofitting | 0.30 | 0.44 | 0.47 | 0.34 |
BERTLEF | None | 0.32 | 0.46 | 0.47 | 0.33 |
BERTLEF | AKS B = 1 | 0.32 | 0.46 | 0.42 | 0.34 |
BERTLEF + PCA | AKS B = 1 | 0.01 | 0.03 | 0.01 | 0.01 |
BERTLEF | AKS B = 10 | 0.32 | 0.43 | 0.43 | 0.35 |
BERTLEF + PCA | AKS B = 10 | 0.01 | 0.03 | 0.01 | 0.01 |
BERTLEF | AKS B = 100 | 0.29 | 0.41 | 0.45 | 0.35 |
BERTLEF + PCA | AKS B = 100 | 0.01 | 0.03 | 0.01 | 0.01 |
BERTLEF | Retrofitting | 0.27 | 0.40 | 0.41 | 0.31 |
BERTLEF | Alternating | 0.31 | 0.43 | 0.47 | 0.36 |
5.2 Prediction of Lexical Variant Preference
In this section, we evaluate embeddings based on their ability to predict lexical variant preference. Lexical variation is the choice between two semantically similar lexical items, such as pop versus soda. Lexical variation is a good determiner of linguistic variation (Cassidy, Hall, and Von Schneidemesser 1985; Carver 1987). Thus, if a voting precinct embedding approach can be used to predict lexical variation, the embeddings should be reflective of linguistic variation.
We model lexical variation as a binomial distribution. We suppose a population can choose between two variants lex1 and lex2, for example, pop and soda. Each voting precinct acts like a weighted coin where heads is one variant and tails is the other. Given n mentions of soft drinks, this corresponds to n flips of the weighted coin. Thus, the number of times a voting precinct uses one form over the other follows a binomial distribution.
If the voting precinct embedding approach captures linguistic variation, then it should be able to predict the probability of a voting precinct choosing lex1 over lex2. In other words, we use binomial regression to predict the probability of a lexical choice from the embeddings. The benefit of this approach is that it naturally handles differences in data size (less data in a precinct just means smaller n) and reliability of the probability (a probability of 50% is more reliable when n = 500 than when n = 2).
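A sketch of this binomial regression for a single variant pair, again using a plain statsmodels fit on toy data; passing a two-column (successes, failures) response is how the GLM incorporates the per-precinct trial counts n.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(150, 4)))   # 150 precincts, 4-dim toy embeddings + intercept
n = rng.integers(1, 50, size=150)                # e.g., soft-drink mentions per precinct
lex1 = rng.binomial(n, 0.4)                      # times the precinct used lex1 (e.g., "pop")
lex2 = n - lex1                                  # times the precinct used lex2 (e.g., "soda")

endog = np.column_stack([lex1, lex2])            # (successes, failures) per precinct
binom_fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
print(binom_fit.aic, 1.0 - binom_fit.deviance / binom_fit.null_deviance)
```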
We derive our lexical variation pairs from two Twitter lexical normalization datasets, from Han and Baldwin (2011) and Liu et al. (2011). The Han and Baldwin (2011) dataset was formed from three annotators normalizing 1,184 out-of-vocabulary tokens from 549 English tweets. The Liu et al. (2011) dataset was formed from Amazon Mechanical Turk workers normalizing 3,802 nonstandard tokens (tokens that are rare and diverge from a standard form) from 6,150 tweets. In both cases, humans manually annotated what appear to be “non-standard” uses of tokens with their “standard” variants. These pairs therefore reflect lexical variation.6 We filter out pairs that have data in fewer than 500 voting precincts. This leaves a list of 66 pairs from Han and Baldwin (2011) and 110 pairs from Liu et al. (2011). See Appendices C and D for the list of pairs and statistics. For each voting precinct, we derive the frequency of each variant in a pair directly from our Twitter data.
With the frequency data, we fit binomial regression models for each pair of words with each voting precinct as a datapoint. Models that have a stronger fit indicate that the corresponding embeddings better capture the choice of variant in the voting precincts.
We present the results of this evaluation in Table 5. See Section 4.6 for a reference for the method names. We see many of the same insights as in the dialect area prediction analysis. MVP approaches are competitive with Doc2Vec Alternating on the Han and Baldwin (2011) dataset and underperform Doc2Vec Alternating on the Liu et al. (2011) dataset. Doc2Vec does better with Alternating smoothing than with other smoothing approaches, and BERTLEF approaches can do worse than baseline.
Method | Smoothing | Han and Baldwin AIC | Han and Baldwin R² | Han and Baldwin Pairs | Liu et al. AIC | Liu et al. R² | Liu et al. Pairs |
---|---|---|---|---|---|---|---|
Static | None | 5,037.90 | −0.00 | 66 | 7,332.17 | −0.00 | 109 |
Coordinates | None | 4,820.86 | 0.02 | 66 | 7,242.46 | 0.01 | 110 |
MVP | AKS B = 1 | 3,968.56 | 0.37 | 66 | 5,855.48 | 0.38 | 110 |
MVP + PCA | AKS B = 1 | 4,100.76 | 0.34 | 66 | 6,248.76 | 0.34 | 110 |
MVP | AKS B = 10 | 3,946.91 | 0.34 | 66 | 5,810.90 | 0.35 | 110 |
MVP + PCA | AKS B = 10 | 4,108.08 | 0.30 | 66 | 6,199.99 | 0.32 | 110 |
MVP | AKS B = 100 | 4,160.22 | 0.25 | 66 | 5,948.60 | 0.28 | 110 |
MVP + PCA | AKS B = 100 | 4,263.89 | 0.21 | 66 | 6,495.72 | 0.22 | 110 |
Random 300 | None | 4,469.52 | 0.34 | 66 | 5,614.97 | 0.26 | 110 |
Random 300 | Retrofitting | 4,173.60 | 0.42 | 66 | 6,033.76 | 0.40 | 110 |
Doc2Vec | None | 3,720.66 | 0.57 | 66 | 4,274.39 | 0.53 | 110 |
Doc2Vec | AKS B = 1 | 4,601.33 | 0.33 | 66 | 5,785.18 | 0.35 | 110 |
Doc2Vec + PCA | AKS B = 1 | 4,953.07 | 0.03 | 66 | 7,038.40 | 0.05 | 110 |
Doc2Vec | AKS B = 10 | 4,460.91 | 0.34 | 66 | 5,905.68 | −0.35 | 110 |
Doc2Vec + PCA | AKS B = 10 | 4,914.14 | 0.04 | 66 | 7,102.57 | −0.10 | 110 |
Doc2Vec | AKS B = 100 | 6,322.71 | −0.86 | 66 | 13,100.68 | −1.34 | 110 |
Doc2Vec + PCA | AKS B = 100 | 5,247.45 | −1.00 | 66 | 7,139.56 | 0.05 | 110 |
Doc2Vec | Retrofitting | 10,318.41 | −3.26 | 66 | 12,927.14 | −2.94 | 110 |
Doc2Vec | Alternating | 3,991.38 | 0.48 | 66 | 5,064.28 | 0.46 | 110 |
Random 768 | None | 4,652.19 | 0.56 | 66 | 5,570.99 | 0.45 | 110 |
Random 768 | Retrofitting | 4,501.30 | 0.59 | 66 | 8,982.39 | 0.00 | 110 |
BERTLEF | None | 4,446.72 | 0.63 | 66 | 5,360.23 | 0.51 | 110 |
BERTLEF | AKS B = 1 | 4,675.30 | 0.56 | 62 | 5,576.14 | 0.46 | 103 |
BERTLEF + PCA | AKS B = 1 | 4,896.52 | 0.05 | 66 | 6,860.40 | 0.07 | 110 |
BERTLEF | AKS B = 10 | 4,639.71 | 0.56 | 64 | 5,579.60 | 0.46 | 107 |
BERTLEF + PCA | AKS B = 10 | 4,922.05 | 0.04 | 66 | 7,055.13 | 0.06 | 110 |
BERTLEF | AKS B = 100 | 4,698.94 | 0.56 | 64 | 5,679.19 | 0.46 | 103 |
BERTLEF + PCA | AKS B = 100 | 4,942.70 | 0.03 | 66 | 7,269.16 | −0.13 | 110 |
BERTLEF | Retrofitting | N/A | N/A | 22 | N/A | N/A | 35 |
BERTLEF | Alternating | 4,488.41 | 0.59 | 66 | 5,880.80 | 0.49 | 110 |
Shared number of pairs: 60 (Han and Baldwin) and 96 (Liu et al.).
In Figure 11, we present the per-pair differences in AIC and McFadden’s pseudo-R2 relative to Doc2Vec Alternating. Because different pairs may naturally be easier or harder to predict, comparing against Doc2Vec Alternating pair by pair provides a more neutral comparison of methods. The MVP approaches tend to have AIC boxes shifted to the right; together with the averages being close, this indicates that MVP approaches beat Doc2Vec Alternating on more pairs, but lose by a larger margin on the pairs where they do lose. For the approaches applied to raw text (and that use smoothing), the boxes lie to the left of the blue line, indicating that they do worse than Doc2Vec Alternating. In other words, among approaches that do not require manually crafted features, Doc2Vec Alternating performs best.
Table 5 also highlights some conclusions that differ sharply from the previous evaluation. In the previous evaluation, all methods had a positive McFadden’s pseudo-R2, whereas here many approaches have a negative pseudo-R2, a sign that their predictions are far off the mark. Some models, especially Doc2Vec Retrofitting, have AICs that are nearly double those of the others, which is another sign of poor prediction. Additionally, we see issues in fitting the binomial regression models in the first place. The “Pairs” column indicates how many of the 66 Han and Baldwin (2011) pairs and 110 Liu et al. (2011) pairs were fit successfully without throwing collinearity errors. For example, BERTLEF AKS B = 1 completed fitting on only 62 pairs, which means 4 pairs failed to fit. The BERTLEF Retrofitting model succeeded on only about a third of the pairs, so it was excluded. In other words, several models have severe issues in this evaluation.
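As a concrete illustration of this fitting procedure, the sketch below shows one way a per-pair binomial regression on precinct embeddings could be fit and its AIC and McFadden’s pseudo-R2 reported with statsmodels. It is a minimal sketch, not the exact implementation used here: the names `emb` (precinct id to embedding) and `counts` (precinct id to counts of the two variants) are assumed, and catching a generic exception stands in for whatever collinearity errors a given fit may raise.

```python
import numpy as np
import statsmodels.api as sm

def fit_pair(emb, counts):
    """Fit variant counts ~ precinct embedding; return (AIC, McFadden's pseudo-R2) or None."""
    ids = sorted(counts)
    X = sm.add_constant(np.array([emb[i] for i in ids]))
    y = np.array([counts[i] for i in ids])  # two columns: (variant A count, variant B count)
    try:
        full = sm.GLM(y, X, family=sm.families.Binomial()).fit()
        null = sm.GLM(y, np.ones((len(ids), 1)), family=sm.families.Binomial()).fit()
    except Exception:  # stands in for collinearity / separation failures on some pairs
        return None
    pseudo_r2 = 1.0 - full.llf / null.llf  # McFadden's pseudo-R2
    return full.aic, pseudo_r2
```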
In Figure 12, we compare the number of smoothing iterations to the average AIC (top graphs), the average McFadden’s pseudo-R2 (middle graphs), and the number of pairs that were successfully fit (bottom graphs). Retrofitting approaches get substantially worse with more iterations, and BERTLEF approaches are particularly susceptible to this issue.7 In contrast, the Alternating smoothing approaches do not have these issues: the Doc2Vec Alternating approach is stable from start to finish and the BERTLEF Alternating approach shows only minor deviations.
We believe the cause of these problems is that retrofitting on voting precinct-level data causes the embeddings to become collinear and thus susceptible to modeling issues. In Figure 13, we compare the number of smoothing iterations to the column rank of the embedding matrix (as calculated by NumPy’s matrix_rank method). The gray lines mark the desired rank: Doc2Vec approaches have 300 dimensions and so should have a column rank of 300, and BERTLEF has 768 dimensions and so should have a column rank of 768. In the figure, we see that the rank declines sharply for Retrofitting approaches, which indicates that smoothing after training causes the embedding dimensions to rapidly become collinear and thus lose predictive value. In contrast, the Doc2Vec Alternating approach does not suffer any decrease in column rank and the BERTLEF Alternating approach suffers only a minor loss.
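The rank check itself is straightforward; the following sketch shows the diagnostic we are describing, assuming `E` is the (number of precincts × embedding dimension) matrix after a given number of smoothing iterations.

```python
import numpy as np

def column_rank(E):
    """Column rank of the embedding matrix; a value far below E.shape[1] signals collinearity."""
    return np.linalg.matrix_rank(E)

# Example: for Doc2Vec (300 dimensions) we want column_rank(E) == 300;
# for BERTLEF (768 dimensions) we want column_rank(E) == 768.
```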
The lesson to draw from this is that, when working with fine-grained areas like voting precincts, alternating training and smoothing is not merely a modeling improvement but a necessary safeguard against severe numerical issues. With large areas like cities, retrofitting has enough data per area to avoid the problems seen here; to gain insight at a much finer resolution, however, alternating is a necessity.
5.3 Finer Resolution Analyses Through Variant Maps
As with dialect area prediction, we can generate maps that predict where one variant of a word is chosen over another. This may allow sociolinguists to better explore sociolinguistic phenomena. We show an example of this with bro vs. brother in Figure 14.
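Given the per-pair binomial models fit earlier, producing such a map amounts to predicting the probability of one variant in every precinct and coloring the precinct polygons accordingly. The sketch below shows one way this could be done; `res` (a fitted statsmodels binomial GLM for, e.g., bro vs. brother) and `emb` are assumed from the earlier sketch, and this is illustrative rather than the exact plotting pipeline used here.

```python
import numpy as np
import statsmodels.api as sm

def predict_variant_share(res, emb):
    """Return a mapping from precinct id to the predicted probability of the first variant."""
    ids = sorted(emb)
    X = sm.add_constant(np.array([emb[i] for i in ids]), has_constant="add")
    probs = res.predict(X)  # binomial GLM predictions are probabilities
    return dict(zip(ids, probs))
```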
In panel (a), we have the percentage of times bro was used. In panel (b), we have the Black Percentage throughout Texas; we include this because bro has been recognized as African American slang (Widawski 2015). The bottom four panels are the predicted percentages from various models. Both the gold values and Black Percentage show an East–West divide, and the models predict a similar divide, with the Retrofitting and Alternating models showing the clearest distinction.
A more interesting facet appears when we focus on the bro vs. brother divide around Houston, Texas (Figure 15). In panel (a), we show the Black Percentage demographics around Houston: the Black population is not uniformly distributed throughout the city, and there are sections where it is more concentrated (one such section is highlighted with a red ellipse). In panel (b), we show the Doc2Vec Alternating model’s predictions for bro vs. brother: the predictions are likewise not uniformly distributed throughout the city and are instead concentrated in the same areas as the Black population (also highlighted with an ellipse). This indicates that, by using voting precincts as our subregions, we are able to narrow our analyses down to specific, relatively tiny areas.
In contrast, larger areas, such as cities and counties, cannot capture these insights. If we use counties instead of voting precincts, as in Huang et al. (2016), we see in panel (c)8 that the bro–brother distinction we identified would be enveloped by a single area. If we use cities instead of voting precincts, as in Hovy and Purschke (2018), we see in panel (d) that the area would likewise be enveloped, making any finer-grained analysis impossible. Thus, we have shown that finer-grained subregions can produce finer-grained insights. However, as discussed in previous sections, one needs a different modeling approach in order to gain these insights without running into data issues.
5.4 Embeddings as Linguistic Gene to Connect Language Use with Sociology
The previous sections describe various embedding methods for representing language use in a voting precinct. Language use in any area is connected to race, socioeconomic status, population density, and many other factors, and these factors are all represented within the embedding. In this section, we explore how to extract the portions of these embeddings that correlate with sociological factors and use those extractions for sociolinguistic analyses.
Our proposed methodology is similar to how genes are used as a nexus to connect two different biological phenomena. For example, consider the HOX genes. HOX genes are common throughout animal genetic sequences and are responsible for limb formation (such as determining whether a human should grow an arm or a leg out of their shoulder) (Grier et al. 2005). By looking at expressions of HOX genes, researchers have found a connection between HOX genes and genetic disorders related to finger development—for example, synpolydactyly and brachydactyly. From this, researchers identified a possible connection between limb formation and finger development via the HOX gene link.
We use a similar strategy to link sociological phenomena with linguistic phenomena. We have embeddings for each voting precinct (genetic sequences for each species). We can identify what portion of these embeddings correspond to a sociological variable of interest (find the genes for limb formation). We can use these portions to predict a linguistic phenomenon (use gene expressions to predict a separate physiological phenomenon). Then, if successful, we can then link the sociological phenomenon with the linguistic phenomenon (connect limb formation and finger disorders through the HOX genes).
To extract the section of the embedding that corresponds to a sociological variable, we use Orthogonal Matching Pursuit (OMP), a linear regression that zeros out all but a fixed number of weights. We train an OMP model to predict the sociological variable from the voting precinct embeddings. The coordinates with non-zero weights form the section of the embedding that corresponds to how the sociological phenomenon interacts with language use in an area. For example, if we use the embeddings to predict Black Percentage in a voting precinct, the extracted section should correlate with how race intersects with language use.
More formally, OMP is a linear regression model in which all but a fixed number of weights are zero. Given an input matrix $X$ (e.g., each row is a voting precinct embedding), an output vector $y$ (e.g., the corresponding sociological variable), and a budget of $n > 0$ non-zero weights, OMP minimizes

$$\min_{w} \; \|y - Xw\|_2^2 \quad \text{subject to} \quad \|w\|_0 \leq n,$$

where $w$ is the vector of regression weights and $\|w\|_0$ counts its non-zero entries.
We use OMP to extract the 10 coordinates of the precinct embeddings that best correspond to a sociological variable of interest. For example, if our sociological variable were Black Percentage, OMP would give us the 10 coordinates that correlate most with Black Percentage. We can then connect Black Percentage to a linguistic phenomenon of interest by measuring how well those 10 coordinates predict it, and we can also identify new linguistic phenomena that could be related to the sociological variable.
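scikit-learn provides an implementation of OMP that can be used for this extraction. The sketch below is a minimal illustration: the random `X` and `y` are stand-ins for the precinct embedding matrix and the sociological variable (e.g., Black Percentage), and the configuration shown is not necessarily the exact one used here.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))   # stand-in for 500 precinct embeddings of dimension 300
y = rng.uniform(size=500)         # stand-in for the sociological variable per precinct

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10).fit(X, y)
gene = np.flatnonzero(omp.coef_)  # indices of the 10 selected coordinates (the "gene")
X_gene = X[:, gene]               # extracted embedding section used in later regressions
```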
First, we explore what insights we can derive from the Black Percentage “gene” in the voting precincts’ linguistic “genetic code.” We use OMP to identify the 10 coordinates that correlate most highly with Black Percentage. We connect this “gene” to linguistic phenomena by using it to predict lexical variation, and we then look at how much accuracy changes when using the gene instead of the entire genetic code. If a lexical variant pair is modeled better with the gene than with the entire embedding, that is an indication that the pair is connected to the sociological variable, here Black Percentage.
We measure the increase in accuracy by the percent decrease in AIC or the percent increase in McFadden’s pseudo-R2. We use percentage changes to account for some pairs being naturally easier to model than others. If a pair shows a large percentage improvement, it is likely to be connected to the underlying sociological variable. As a point of comparison, we also compute the percentage improvement from using the sociological variable directly.
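The ranking criterion can be stated compactly in code. The sketch below assumes two hypothetical dictionaries, `aic_full` and `aic_gene`, mapping each lexical variant pair to its AIC when modeled with the full embedding and with the extracted 10-coordinate gene, respectively; the analogous ranking for McFadden's pseudo-R2 uses the percent increase instead.

```python
def rank_by_aic_decrease(aic_full, aic_gene):
    """Rank variant pairs by percent decrease in AIC when using the gene instead of the full embedding."""
    pct = {
        pair: (aic_full[pair] - aic_gene[pair]) / aic_full[pair]
        for pair in aic_full
        if pair in aic_gene
    }
    return sorted(pct, key=pct.get, reverse=True)  # largest improvement first
```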
In Tables 6 and 7, we show the top 30 lexical variant pairs from Han and Baldwin (2011) and Liu et al. (2011). The Gene columns give the rankings derived from the extracted embedding section and the SV columns give the rankings from using the sociological variable alone. From these rankings, a sociolinguist can possibly identify insights that were previously missed.
Dataset: Han and Baldwin (2011); Sociological Variable: Black Percentage
Rank | Gene AIC | SV AIC | Gene R2 | SV R2
---|---|---|---|---
1 | umm-um | umm-um | til-until | lil-little |
2 | convo-conversation | convo-conversation | lil-little | bro-brother |
3 | freakin-freaking | freakin-freaking | bro-brother | umm-um |
4 | gf-girlfriend | gf-girlfriend | convo-conversation | tha-the |
5 | sayin-saying | sayin-saying | tha-the | gon-gonna |
6 | chillin-chilling | chillin-chilling | fb-facebook | da-the |
7 | yess-yes | bf-boyfriend | hrs-hours | yu-you |
8 | playin-playing | txt-text | comin-coming | fb-facebook |
9 | lawd-lord | yess-yes | playin-playing | cuz-because |
10 | bf-boyfriend | lawd-lord | fam-family | bs-bullshit |
11 | txt-text | bs-bullshit | btw-between | ppl-people |
12 | cus-because | ohh-oh | lookin-looking | dat-that |
13 | ahh-ah | cus-because | de-the | dawg-dog |
14 | prolly-probably | pics-pictures | dawg-dog | kno-know |
15 | ohh-oh | ahh-ah | yu-you | chillin-chilling |
16 | bs-bullshit | prolly-probably | thx-thanks | til-until |
17 | nothin-nothing | hahah-haha | cuz-because | jus-just |
18 | hahah-haha | hahahaha-haha | def-definitely | bday-birthday |
19 | naw-no | talkin-talking | da-the | wat-what |
20 | tht-that | til-till | jus-just | goin-going |
21 | pics-pictures | naw-no | bday-birthday | de-the |
22 | talkin-talking | nothin-nothing | ahh-ah | prolly-probably |
23 | hahahaha-haha | playin-playing | mis-miss | gettin-getting |
24 | doin-doing | hahaha-haha | mins-minutes | nd-and |
25 | bb-baby | tht-that | gettin-getting | fuckin-fucking |
26 | til-till | gon-gonna | kno-know | lookin-looking |
27 | fb-facebook | doin-doing | doin-doing | naw-no |
28 | comin-coming | fuckin-fucking | gon-gonna | fam-family |
29 | thx-thanks | bb-baby | soo-so | cus-because |
30 | kno-know | goin-going | yr-year | mis-miss |
AP | 0.055 | 0.057 | 0.252 | 0.237 |
Dataset: Liu et al. (2011); Sociological Variable: Black Percentage
Rank | Gene AIC | SV AIC | Gene R2 | SV R2
---|---|---|---|---
1 | wheres-whereas | wheres-whereas | homies-homes | trippin-tripping |
2 | quiero-query | quiero-query | cali-california | lil-little |
3 | max-maximum | max-maximum | re-regarding | bro-brother |
4 | tv-television | tv-television | mo-more | tha-the |
5 | homies-homes | bbq-barbeque | trippin-tripping | wit-with |
6 | re-regarding | homies-homes | lil-little | yo-you |
7 | bbq-barbeque | cali-california | bro-brother | bout-about |
8 | cali-california | trippin-tripping | convo-conversation | tho-though |
9 | convo-conversation | convo-conversation | fa-for | da-the |
10 | trippin-tripping | freakin-freaking | wit-with | yea-yeah |
11 | freakin-freaking | gf-girlfriend | tha-the | cause-because |
12 | mines-mine | mines-mine | th-the | yu-you |
13 | gf-girlfriend | sayin-saying | fb-facebook | fb-facebook |
14 | sayin-saying | chillin-chilling | bout-about | dis-this |
15 | chillin-chilling | txt-text | hrs-hours | gon-going |
16 | yess-yes | cutie-cute | tho-though | cuz-because |
17 | playin-playing | yess-yes | comin-coming | bs-bullshit |
18 | lawd-lord | nun-nothing | fr-for | ppl-people |
19 | txt-text | lawd-lord | playin-playing | dat-that |
20 | cus-because | bs-bullshit | dis-this | sum-some |
21 | cutie-cute | ohh-oh | fam-family | fr-for |
22 | nun-nothing | cus-because | fml-family | kno-know |
23 | wen-when | wen-when | fav-favorite | quiero-query |
24 | wut-what | pics-pictures | yo-you | chillin-chilling |
25 | prolly-probably | wut-what | hwy-highway | tv-television |
26 | ohh-oh | prolly-probably | app-application | jus-just |
27 | thot-thought | sis-sister | thru-through | thang-thing |
28 | nada-nothing | thot-thought | sum-some | mo-more |
29 | turnt-turn | feelin-feeling | lookin-looking | bday-birthday |
30 | sis-sister | talkin-talking | yu-you | wat-what |
AP | 0.080 | 0.077 | 0.264 | 0.110 |
To produce an estimate of the accuracy of these lists, we use the African American slang dictionary in Widawski (2015) as our gold labels and calculate the average precision (AP). We see that McFadden’s pseudo-R2 provides the best results, with the “gene” performing slightly better than the sociological variable on its own. We also see that the “gene” approach provides different predictions from solely using the sociological variable, such as the prediction that the til versus until distinction is possibly connected to Black Percentage.
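For reference, AP values of this kind can be computed with scikit-learn. The sketch below assumes a hypothetical ranked list `ranked_pairs` (e.g., from the ranking sketch above) and a gold set `gold` of pairs whose nonstandard form appears in the Widawski (2015) dictionary; it illustrates the metric rather than the exact scoring script used here.

```python
from sklearn.metrics import average_precision_score

def average_precision(ranked_pairs, gold):
    """Average precision of a ranked list of variant pairs against a gold set."""
    labels = [1 if pair in gold else 0 for pair in ranked_pairs]
    scores = [len(ranked_pairs) - i for i in range(len(ranked_pairs))]  # higher rank -> higher score
    return average_precision_score(labels, scores)
```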
This indicates that our approach can surface lexical variants that are connected to sociological variables and thus can be used by sociologists to find new variants that could be useful in research. Our approach is completely unsupervised, so novel changes and their spread through different communities can be monitored and continually updated with new data, which is not feasible with traditional methods.
We perform a similar experiment with the Population Density variable and show the top-ranked pairs in Tables 8 and 9. As g-dropping is a well-explored phenomenon in the rural vs. urban divide (Campbell-Kibler 2005), we use it as our gold data. Here, we see that AIC performs best overall, with the “gene” approach slightly outperforming the sociological variable. From these lists, it appears that there is a connection between word shortening and population density, for example, convo vs. conversation, gf vs. girlfriend, bf vs. boyfriend, txt vs. text, and prolly vs. probably. By using genes, we might be able to identify new connections that we may not have found otherwise.
Dataset: Han and Baldwin (2011); Sociological Variable: Population Density (log scaled)
Rank | Gene AIC | SV AIC | Gene R2 | SV R2
---|---|---|---|---
1 | umm-um | umm-um | de-the | til-until |
2 | convo-conversation | convo-conversation | til-until | fuckin-fucking |
3 | freakin-freaking | freakin-freaking | convo-conversation | hahaha-haha |
4 | gf-girlfriend | gf-girlfriend | dawg-dog | lookin-looking |
5 | sayin-saying | sayin-saying | mis-miss | hahah-haha |
6 | yess-yes | txt-text | hrs-hours | btw-between |
7 | chillin-chilling | chillin-chilling | mins-minutes | hahahaha-haha |
8 | bf-boyfriend | bf-boyfriend | yu-you | yess-yes |
9 | txt-text | yess-yes | fb-facebook | talkin-talking |
10 | cus-because | lawd-lord | comin-coming | naw-no |
11 | lawd-lord | cus-because | tha-the | cus-because |
12 | ahh-ah | ohh-oh | playin-playing | de-the |
13 | playin-playing | bs-bullshit | lookin-looking | prolly-probably |
14 | ohh-oh | hahah-haha | bro-brother | mis-miss |
15 | prolly-probably | ahh-ah | ahh-ah | fam-family |
16 | bs-bullshit | prolly-probably | cus-because | freakin-freaking |
17 | hahah-haha | pics-pictures | gon-gonna | til-till |
18 | pics-pictures | hahahaha-haha | fam-family | goin-going |
19 | nothin-nothing | talkin-talking | congrats-congratulations | lil-little |
20 | naw-no | naw-no | pic-picture | hrs-hours |
21 | hahahaha-haha | til-till | nd-and | bs-bullshit |
22 | talkin-talking | nothin-nothing | thx-thanks | pls-please |
23 | tht-that | hahaha-haha | lil-little | nah-no |
24 | mis-miss | playin-playing | cuz-because | congrats-congratulations |
25 | til-till | tht-that | prolly-probably | def-definitely |
26 | doin-doing | fuckin-fucking | fuckin-fucking | da-the |
27 | hahaha-haha | bb-baby | yess-yes | sayin-saying |
28 | bb-baby | doin-doing | da-the | tht-that |
29 | fuckin-fucking | goin-going | yr-year | dawg-dog |
30 | gon-gonna | pic-picture | wat-what | txt-text |
AP | 0.293 | 0.278 | 0.164 | 0.264 |
Dataset: Liu et al. (2011); Sociological Variable: Population Density (log scaled)
Rank | Gene AIC | SV AIC | Gene R2 | SV R2
---|---|---|---|---
1 | wheres-whereas | wheres-whereas | homies-homes | mo-more |
2 | quiero-query | quiero-query | cali-california | th-the |
3 | max-maximum | max-maximum | mo-more | hr-hour |
4 | tv-television | tv-television | re-regarding | ft-feet |
5 | homies-homes | bbq-barbeque | fa-for | wut-what |
6 | bbq-barbeque | homies-homes | dis-this | fuckin-fucking |
7 | re-regarding | cali-california | trippin-tripping | lookin-looking |
8 | cali-california | trippin-tripping | th-the | bby-baby |
9 | convo-conversation | convo-conversation | convo-conversation | dis-this |
10 | trippin-tripping | freakin-freaking | mi-my | fa-for |
11 | freakin-freaking | gf-girlfriend | ft-feet | yess-yes |
12 | mines-mine | mines-mine | hrs-hours | mi-my |
13 | gf-girlfriend | sayin-saying | hr-hour | nun-nothing |
14 | sayin-saying | txt-text | mins-minutes | em-them |
15 | yess-yes | chillin-chilling | yu-you | talkin-talking |
16 | chillin-chilling | yess-yes | fav-favorite | naw-no |
17 | txt-text | cutie-cute | hwy-highway | bout-about |
18 | cutie-cute | nun-nothing | fb-facebook | cus-because |
19 | cus-because | lawd-lord | comin-coming | prolly-probably |
20 | nun-nothing | wut-what | fml-family | yo-you |
21 | lawd-lord | cus-because | tha-the | fml-family |
22 | playin-playing | ohh-oh | tho-though | fam-family |
23 | ohh-oh | bs-bullshit | wit-with | freakin-freaking |
24 | wut-what | prolly-probably | playin-playing | fr-for |
25 | prolly-probably | pics-pictures | fr-for | quiero-query |
26 | bs-bullshit | talkin-talking | lookin-looking | til-till |
27 | nada-nothing | sis-sister | nada-nothing | goin-going |
28 | wen-when | bby-baby | bro-brother | lil-little |
29 | feelin-feeling | wen-when | cus-because | hrs-hours |
30 | sis-sister | feelin-feeling | yea-yeah | bs-bullshit |
AP | 0.197 | 0.196 | 0.119 | 0.151 |
6 Dialect Map Prediction via Visualization
In this section, we apply dimensionality reduction techniques to the precinct embeddings to visualize geographic boundaries of linguistic variation, or “isoglosses.” The precinct embeddings are reduced to RGB color values, and hard transitions in color indicate a boundary. To project embeddings into RGB color coordinates, we explore two approaches. The first is principal component analysis (PCA), which has been used for this purpose in prior work (Hovy et al. 2020). The second is t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton 2008), a probabilistic approach often used for visualizing word embedding clusters.
6.1 Principal Component Analysis
PCA is widely used in the humanities for descriptive analyses of data. If we have a collection of continuous variables, PCA essentially creates a new set of axes that captures the greatest variance in the original variables. In particular, the first axis captures the greatest variance in the data, the second axis captures the second greatest variance, and so on. By quantifying the connection between the original variables and the axes, researchers can explore what variables have the most impact in the data. For example, Huang et al. (2016) use this approach to explore the geographic information contained inside area embeddings.
Hovy et al. (2020) use PCA to produce variation maps by reducing area embeddings to three dimensions and then scaling each dimension to [0, 1] for use as RGB values. We perform a similar analysis for a select set of methods in the left images of Figures 16 and 17. The geography-only approach (Random 300 Retrofitting) produces a mostly random pattern of areas, while the Doc2Vec None approach produces some regionalization but is rather noisy.
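A minimal sketch of this coloring procedure is shown below, assuming `E` is the (number of precincts × embedding dimension) matrix; each precinct receives an RGB triple that can be used to fill its polygon on the map.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(E):
    """Project embeddings onto their top three principal components and rescale to [0, 1] RGB."""
    comps = PCA(n_components=3).fit_transform(E)
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    return (comps - lo) / (hi - lo)  # one (R, G, B) triple per precinct
```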
The smoothing approaches generally highlight the cities (possibly coloring different cities differently) and leave the countryside a uniform color. In other words, when using PCA to produce an isogloss map, we only see the urban–rural divide and not larger regional divides. The reason is that the urban–rural divide appears to be the biggest source of variation in the data, and PCA is designed to extract the biggest sources of variation. However, by attaching itself to the strongest signal, PCA is unable to find key regional differences in language use. Thus, while PCA is useful for analyzing the information contained in embeddings, it has limited ability to produce isogloss boundaries.
6.2 t-Distributed Stochastic Neighbor Embedding
To fix the above issue, we explore a different dimensionality reduction approach, t-SNE (Van der Maaten and Hinton 2008). Unlike PCA, which tries to find the strongest signals overall, t-SNE instead tries to ensure that points that are similar in the original space remain similar in the reduced space. As retrofitting encourages geographically close places to have similar embeddings, t-SNE may be much more capable of capturing regions.
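The t-SNE coloring mirrors the PCA sketch above, swapping in scikit-learn's TSNE; the initialization and other settings shown here are illustrative defaults rather than the exact configuration used for the figures.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_rgb(E, seed=0):
    """Reduce embeddings to three t-SNE dimensions and rescale to [0, 1] RGB."""
    comps = TSNE(n_components=3, init="random", random_state=seed).fit_transform(E)
    lo, hi = comps.min(axis=0), comps.max(axis=0)
    return (comps - lo) / (hi - lo)  # neighboring precincts in embedding space get similar colors
```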
The right images in Figures 16 and 17 use t-SNE to visualize embeddings. We see that there are largely three blocks: one block to the East, one block to the Southwest, and one block to the Northwest. This indicates that t-SNE may be better at identifying isoglosses than PCA.
By comparing to the dialect areas in our DAREDS analysis (Section 5.1), we see that the block to the East overlaps nicely with the predicted “Gulf States” dialect region, and the Southwest block overlaps nicely with the predicted West and Southwest regions. The Northwest block, however, seems distinct from the other regions, which indicates that we may have a region that is not accounted for by the Dictionary of American Regional English (Cassidy, Hall, and Von Schneidemesser 1985). This may be because, in the nearly 40 years since its publication, Texas has experienced a substantial linguistic shift; alternatively, the region may be understudied and thus reflect a dialect we know little about. In either case, the t-SNE maps may have revealed a particular region of Texas that warrants further investigation.
7 Summary
We demonstrated that it is possible to embed areas as small as voting precincts and that doing so can lead to higher resolution analyses of sociolinguistic phenomena. To make this feasible, we proposed a novel embedding approach that alternates training with smoothing. We showed that training and smoothing each have drawbacks on their own when it comes to embedding voting precincts, and that smoothing applied after training in particular can cause numerical issues. In contrast, we found that alternating training and smoothing mitigates these issues.
We also proposed new evaluations that reflect how voting precinct embeddings can be used directly by sociolinguists. The first explores how well different models can predict the location of a dialect given terms specific to that dialect. The second explores how well different models can capture preferences between lexical variants, such as the preference between pop and soda. We then proposed a methodology in which we identify portions of the embeddings that correspond to sociological variables and use these portions to find novel linguistic insights, thereby connecting sociological variables with linguistic expression. Finally, we explored approaches for using the embeddings to identify isoglosses and showed that PCA overly focuses on the urban–rural divide while t-SNE produces distinct regions.
7.1 Future Work
Finally, we present some directions for future work:
Although we can produce embeddings that reflect language use in an area, further research is needed to produce more interpretable representations (while retaining accuracy and ease of construction) and more informative uses of regional embeddings. We do propose a method of connecting sociological phenomena to lexical variation using regional embeddings, but much more work is needed to devise methods that directly address linguists’ needs.
Currently, there is a divide between traditional linguistic approaches and computational linguistic approaches to analyzing variation. Given access to a wide variety of social media data, one goal may be to close the gap between these approaches and develop definitions of variation that represent linguistic insights while remaining rigorous and scalable. There is work that uses linguistic features to define regional embeddings (Bohmann 2020), but it still operates with traditional linguistic metrics and region-insensitive methodology (embeddings). Future work could build on our results to produce a flexible definition of variation that directly leverages Twitter data.
Finally, a future direction could be to connect the regional embedding work with temporal embedding work (e.g., Hamilton, Leskovec, and Jurafsky 2016; Rosenfeld and Erk 2018) to enable a unified spatio-temporal exploration of Twitter data. There is quite a bit of spatio-temporal work with Twitter data (e.g., Goel et al. 2016; Eisenstein et al. 2014), but it makes limited use of embedding models. Future work could better explain the movement of language patterns with greater accuracy and resolution.
Appendix A. Grieve and Asnaghi (2013) Lexical Variation Pairs
In Table A1, we provide the list of alternates used in our count-based models.
Main | Alternates | Num VP | Main Total | Alt Total | P-Value
---|---|---|---|---|---
before | afore | 4,416 | 16,267 | 33 | 0.000 |
lane | alley | 2,684 | 14,615 | 2,939 | 0.000 |
car | automobile | 6,425 | 309,589 | 162 | 0.000 |
baby | infant | 5,117 | 21,176 | 187 | 0.000 |
bag | sack | 2,026 | 4,217 | 381 | 0.000 |
ban | prohibit, forbid | 4,297 | 29,532 | 235 | 0.000 |
beg | plead | 2,261 | 5,268 | 138 | 0.000 |
best | greatest | 5,750 | 32,971 | 1,408 | 0.000 |
bet | wager | 5,750 | 36,660 | 29 | 0.000 |
big | large | 4,979 | 24,258 | 1,326 | 0.000 |
bought | purchased | 1,630 | 2,289 | 147 | 0.000 |
butte | mesa | 1,342 | 2,250 | 872 | 0.000 |
cab | taxi | 1,664 | 3,736 | 288 | 0.000 |
center | middle | 3,314 | 24,299 | 3,878 | 0.000 |
clothes | clothing | 1,733 | 2,342 | 1,254 | 0.000 |
understand | comprehend | 2,761 | 4,937 | 50 | 0.000 |
creek | stream | 1,332 | 5,075 | 1,179 | 0.000 |
dad | father | 4,705 | 16,457 | 2,344 | 0.000 |
dinner | supper | 2,490 | 7,873 | 275 | 0.000 |
sleepy | drowsy | 1,894 | 2,898 | 37 | 0.000 |
each other | one another | 1,552 | 2,164 | 170 | 0.000 |
hug | embrace | 2,947 | 8,201 | 326 | 0.000 |
loyal | faithful | 1,336 | 1,410 | 644 | 0.000 |
real | genuine | 6,559 | 67,748 | 307 | 0.000 |
sneakers | gym shoes, running shoes, tennis shoes | 216 | 256 | 85 | 0.000 |
honest | truthful | 2,675 | 4,724 | 51 | 0.000 |
rush | hurry | 2,874 | 4,753 | 1,867 | 0.000 |
ill | sick | 7,266 | 223,879 | 5,173 | 0.000 |
wrong | incorrect | 3,364 | 7,136 | 62 | 0.000 |
little | small | 5,227 | 24,025 | 3,846 | 0.000 |
maybe | perhaps | 3,296 | 6,423 | 178 | 0.000 |
mom | mother | 5,727 | 27,826 | 5,489 | 0.000 |
needed | required | 2,007 | 4,526 | 445 | 0.000 |
prairie | plains | 540 | 3,896 | 476 | 0.000 |
student | pupil | 1,383 | 5,573 | 34 | 0.000 |
fast | quick, rapid | 4,325 | 11,958 | 7,274 | 0.000 |
sad | unhappy | 5,000 | 23,613 | 192 | 0.000 |
stomach | belly, tummy | 1,778 | 2,110 | 1,419 | 0.000 |
trash | garbage, rubbish | 1,248 | 1,726 | 248 | 0.000 |
while | whilst | 3,950 | 12,434 | 48 | 0.000 |
smart | intelligent | 1,521 | 2,453 | 225 | 0.000 |
holiday | vacation | 1,542 | 1,850 | 1,339 | 0.000 |
island | isle | 881 | 2,261 | 1,091 | 0.000 |
slim | slender | 492 | 916 | 11 | 0.000 |
especially | particularly | 1,269 | 1,816 | 38 | 0.000 |
obviously | clearly | 1,357 | 1,141 | 777 | 0.000 |
rude | impolite | 1,262 | 1,860 | 2 | 0.000 |
grandma | grandmother, granny, nana | 2,259 | 1,739 | 2,339 | 0.000 |
bathroom | restroom, washroom | 1,005 | 1,151 | 443 | 0.000 |
garage sale | rummage sale, tag sale, yard sale | 182 | 218 | 94 | 0.000 |
icing | frosting | 579 | 899 | 62 | 0.000 |
grandpa | grandfather | 860 | 1,024 | 140 | 0.000 |
rare | scarce | 691 | 1,063 | 12 | 0.000 |
anywhere | anyplace | 737 | 979 | 8 | 0.000 |
ping pong | table tennis | 101 | 184 | 2 | 0.000 |
pharmacy | drug store | 392 | 3,243 | 5 | 0.000 |
sunset | sundown | 941 | 7,725 | 115 | 0.000 |
dawn | daybreak | 340 | 523 | 92 | 0.000 |
bucket | pail | 666 | 974 | 32 | 0.000 |
brag | boast | 370 | 403 | 43 | 0.000 |
madness | insanity | 612 | 780 | 185 | 0.000 |
false | untrue | 336 | 512 | 12 | 0.000 |
expensive | costly | 459 | 520 | 22 | 0.000 |
global | worldwide | 460 | 1,007 | 329 | 0.000 |
couch | sofa | 810 | 891 | 400 | 0.000 |
spine | backbone | 186 | 191 | 93 | 0.000 |
fridge | refrigerator | 333 | 324 | 73 | 0.000 |
porch | veranda | 340 | 526 | 36 | 0.000 |
hot tub | jacuzzi | 159 | 154 | 40 | 0.000 |
sudden | abrupt | 525 | 590 | 14 | 0.000 |
wallet | billfold | 337 | 465 | 1 | 0.000 |
instantly | instantaneously | 157 | 170 | 2 | 0.000 |
hallway | corridor | 313 | 313 | 161 | 0.000 |
disappear | vanish | 324 | 340 | 44 | 0.000 |
explode | blow up | 358 | 218 | 181 | 0.000 |
bleach | clorox | 209 | 241 | 6 | 0.000 |
bookstore | bookshop | 90 | 153 | 14 | 0.000 |
polite | courteous | 97 | 101 | 10 | 0.000 |
fatal | deadly, lethal | 286 | 431 | 348 | 0.000 |
on accident | by accident | 160 | 107 | 71 | 0.000 |
accomplishment | achievement | 249 | 186 | 185 | 0.000 |
brave | courageous | 356 | 480 | 68 | 0.000 |
except for | aside from | 299 | 285 | 52 | 0.000 |
eggplant | aubergine | 46 | 56 | 2 | 0.000 |
cut the grass | mow the grass, mow the lawn | 28 | 18 | 10 | 0.000 |
out loud | aloud | 278 | 284 | 55 | 0.000 |
cellar | basement | 147 | 259 | 148 | 0.000 |
cinema | movie theater | 397 | 1,221 | 174 | 0.000 |
similar to | akin to | 70 | 68 | 12 | 0.001 |
shant | shall not | 120 | 82 | 60 | 0.001 |
quilt | comforter | 94 | 181 | 33 | 0.001 |
inappropriate | improper | 133 | 130 | 40 | 0.001 |
sunrise | sun up | 485 | 3,486 | 14 | 0.003 |
cemetery | graveyard | 191 | 318 | 120 | 0.004 |
sufficient | adequate | 81 | 56 | 33 | 0.008 |
inquire | enquire | 28 | 49 | 2 | 0.028 |
jeep | suv | 524 | 873 | 199 | 0.050 |
casket | coffin | 92 | 70 | 60 | 0.058 |
thrive | flourish | 131 | 224 | 57 | 0.067 |
fierce | ferocious | 181 | 250 | 19 | 0.067 |
unbearable | insufferable | 45 | 42 | 4 | 0.079 |
unexplainable | inexplicable | 24 | 18 | 8 | 0.105 |
endurance | stamina | 80 | 90 | 28 | 0.114 |
defy | disobey | 50 | 48 | 9 | 0.166 |
dampen | moisten | 8 | 8 | 1 | 0.183 |
passionate | impassioned | 159 | 205 | 1 | 0.208 |
saggy | droopy | 49 | 38 | 14 | 0.263 |
furthest | farthest | 62 | 40 | 25 | 0.294 |
agree to | consent to | 90 | 93 | 3 | 0.361 |
food processor | cuisinart | 3 | 3 | 2 | 0.439 |
somewhere else | elsewhere | 197 | 147 | 62 | 0.443 |
skillet | frying pan | 65 | 93 | 6 | 0.493 |
mailman | postman | 23 | 22 | 6 | 0.566 |
afire | ablaze, aflame | 31 | 29 | 19 | 0.575 |
inadequate | insufficient | 22 | 11 | 11 | 0.612 |
enclose | inclose | 9 | 10 | 1 | 0.656 |
husk | shuck | 253 | 330 | 129 | 0.662 |
ski doo | snowmobile | 2 | 1 | 1 | 0.671 |
slow cooker | crock pot | 19 | 16 | 8 | 0.745 |
flammable | inflammable | 5 | 8 | 4 | 0.754 |
murderous | homicidal | 11 | 6 | 5 | 0.760 |
entrust | intrust | 19 | 14 | 9 | 0.799 |
unarm | disarm | 33 | 47 | 3 | 0.857 |
shoelace | shoestring | 21 | 16 | 8 | 0.884 |
water fountain | drinking fountain | 22 | 23 | 4 | 0.890 |
incarcerate | imprison | 17 | 9 | 8 | 0.908 |
leaned in | leaned forward | 4 | 4 | 1 | 0.909 |
Appendix B. DAREDS Dialect-Specific Terms
In Table A2, we provide the list of dialect-specific terms used in our dialect prediction evaluation.
DAREDS Dialect | Term | Num VP | Total Freq
---|---|---|---
Gulf States | aguardiente | 1 | 1 |
Gulf States | bogue | 1 | 1 |
Gulf States | cavalla | 1 | 1 |
Gulf States | chinaberry | 1 | 3 |
Gulf States | cooter | 12 | 23 |
Gulf States | curd | 17 | 18 |
Gulf States | doodlebug | 1 | 1 |
Gulf States | jambalaya | 27 | 27 |
Gulf States | loggerhead | 1 | 3 |
Gulf States | maguey | 4 | 5 |
Gulf States | nibbling | 3 | 3 |
Gulf States | nig | 72 | 76 |
Gulf States | pollywog | 1 | 1 |
Gulf States | redfish | 14 | 20 |
Gulf States | sardine | 4 | 4 |
Gulf States | scratcher | 8 | 8 |
Gulf States | shinny | 3 | 4 |
Gulf States | squinch | 1 | 1 |
Gulf States | whoop | 488 | 588 |
Southwest | acequia | 2 | 5 |
Southwest | agarita | 1 | 1 |
Southwest | agave | 38 | 72 |
Southwest | aguardiente | 1 | 1 |
Southwest | alacran | 1 | 1 |
Southwest | alberca | 12 | 12 |
Southwest | albondigas | 3 | 3 |
Southwest | alcalde | 5 | 6 |
Southwest | alegria | 20 | 21 |
Southwest | armas | 8 | 16 |
Southwest | arriero | 1 | 1 |
Southwest | arroba | 1 | 1 |
Southwest | arrowwood | 2 | 5 |
Southwest | atajo | 1 | 1 |
Southwest | atole | 7 | 7 |
Southwest | ayuntamiento | 1 | 3 |
Southwest | azote | 1 | 1 |
Southwest | baile | 41 | 54 |
Southwest | bajada | 1 | 30 |
Southwest | baldhead | 2 | 2 |
Southwest | barranca | 3 | 3 |
Southwest | basto | 5 | 5 |
Southwest | beaner | 31 | 32 |
Southwest | blinky | 3 | 4 |
Southwest | booger | 47 | 49 |
Southwest | burro | 17 | 44 |
Southwest | caballo | 12 | 13 |
Southwest | caliche | 1 | 1 |
Southwest | camisa | 16 | 16 |
Southwest | carcel | 2 | 2 |
Southwest | carga | 7 | 39 |
Southwest | cargador | 8 | 9 |
Southwest | carreta | 5 | 6 |
Southwest | cenizo | 2 | 2 |
Southwest | chalupa | 17 | 17 |
Southwest | chaparreras | 1 | 1 |
Southwest | chapo | 47 | 67 |
Southwest | chaqueta | 2 | 2 |
Southwest | charco | 7 | 8 |
Southwest | charro | 27 | 39 |
Southwest | chicalote | 1 | 1 |
Southwest | chicharron | 4 | 4 |
Southwest | chiquito | 20 | 25 |
Southwest | cholo | 39 | 40 |
Southwest | cienaga | 1 | 1 |
Southwest | cocinero | 1 | 1 |
Southwest | colear | 1 | 1 |
Southwest | comadre | 11 | 12 |
Southwest | comal | 31 | 124 |
Southwest | compadre | 37 | 97 |
Southwest | concha | 15 | 18 |
Southwest | conducta | 4 | 4 |
Southwest | cowhand | 2 | 2 |
Southwest | cuidado | 25 | 29 |
Southwest | cuna | 4 | 5 |
Southwest | dinero | 75 | 84 |
Southwest | dueno | 2 | 2 |
Southwest | enchilada | 39 | 47 |
Southwest | encinal | 4 | 9 |
Southwest | estufa | 1 | 1 |
Southwest | fierro | 16 | 77 |
Southwest | freno | 5 | 5 |
Southwest | frijole | 2 | 2 |
Southwest | garbanzo | 5 | 9 |
Southwest | goober | 26 | 29 |
Southwest | gotch | 6 | 6 |
Southwest | greaser | 3 | 3 |
Southwest | grulla | 5 | 8 |
Southwest | jacal | 2 | 3 |
Southwest | junco | 2 | 3 |
Southwest | kiva | 9 | 25 |
Southwest | lechuguilla | 1 | 1 |
Southwest | loafer | 4 | 4 |
Southwest | maguey | 4 | 5 |
Southwest | malpais | 1 | 2 |
Southwest | menudo | 94 | 107 |
Southwest | mescal | 1 | 1 |
Southwest | mestizo | 3 | 8 |
Southwest | milpa | 2 | 3 |
Southwest | nogal | 4 | 5 |
Southwest | nopal | 8 | 9 |
Southwest | olla | 6 | 9 |
Southwest | paisano | 14 | 73 |
Southwest | pasear | 7 | 8 |
Southwest | pelado | 1 | 1 |
Southwest | peon | 17 | 17 |
Southwest | picacho | 2 | 11 |
Southwest | pinole | 2 | 2 |
Southwest | plait | 2 | 2 |
Southwest | potrero | 4 | 4 |
Southwest | potro | 6 | 12 |
Southwest | pozo | 3 | 4 |
Southwest | pulque | 2 | 2 |
Southwest | quelite | 1 | 1 |
Southwest | ranchero | 14 | 19 |
Southwest | reata | 6 | 28 |
Southwest | runaround | 3 | 3 |
Southwest | seesaw | 3 | 3 |
Southwest | serape | 6 | 12 |
Southwest | shorthorn | 1 | 1 |
Southwest | slouch | 2 | 2 |
Southwest | tamale | 47 | 64 |
Southwest | tinaja | 2 | 2 |
Southwest | tomatillo | 5 | 21 |
Southwest | tostada | 16 | 23 |
Southwest | tule | 3 | 6 |
Southwest | vaquero | 19 | 37 |
Southwest | vara | 2 | 2 |
Southwest | wetback | 18 | 18 |
Southwest | zaguan | 1 | 3 |
Texas | agarita | 1 | 1 |
Texas | banquette | 3 | 3 |
Texas | blackland | 3 | 4 |
Texas | bluebell | 14 | 15 |
Texas | borrego | 10 | 17 |
Texas | cabrito | 5 | 27 |
Texas | caliche | 1 | 1 |
Texas | camote | 1 | 1 |
Texas | cenizo | 2 | 2 |
Texas | cerillo | 1 | 1 |
Texas | chicharra | 1 | 1 |
Texas | coonass | 3 | 3 |
Texas | ducking | 66 | 68 |
Texas | firewheel | 19 | 114 |
Texas | foxglove | 3 | 3 |
Texas | goatsbeard | 1 | 2 |
Texas | granjeno | 1 | 3 |
Texas | grulla | 5 | 8 |
Texas | guayacan | 2 | 3 |
Texas | hardhead | 1 | 1 |
Texas | huisache | 4 | 7 |
Texas | icehouse | 46 | 132 |
Texas | juneteenth | 12 | 16 |
Texas | kinfolk | 88 | 96 |
Texas | lechuguilla | 1 | 1 |
Texas | mayapple | 1 | 1 |
Texas | mayberry | 8 | 8 |
Texas | norther | 3 | 3 |
Texas | piloncillo | 1 | 1 |
Texas | pinchers | 1 | 1 |
Texas | piojo | 18 | 20 |
Texas | praline | 14 | 17 |
Texas | priss | 5 | 5 |
Texas | redhorse | 1 | 1 |
Texas | resaca | 5 | 5 |
Texas | retama | 11 | 31 |
Texas | sabino | 2 | 2 |
Texas | scissortail | 1 | 3 |
Texas | sendero | 9 | 26 |
Texas | shallot | 1 | 1 |
Texas | sharpshooter | 3 | 3 |
Texas | sook | 1 | 1 |
Texas | sotol | 6 | 28 |
Texas | spaniard | 2 | 2 |
Texas | squinch | 1 | 1 |
Texas | tecolote | 2 | 6 |
Texas | trembles | 1 | 1 |
Texas | tush | 4 | 4 |
Texas | vamos | 392 | 580 |
Texas | vaquero | 19 | 37 |
Texas | vara | 2 | 2 |
Texas | washateria | 16 | 24 |
Texas | wetback | 18 | 18 |
West | arbuckle | 8 | 25 |
West | barefooted | 2 | 2 |
West | barf | 44 | 47 |
West | bawl | 10 | 10 |
West | biddy | 3 | 6 |
West | blab | 3 | 3 |
West | blat | 3 | 3 |
West | boudin | 29 | 36 |
West | breezeway | 6 | 10 |
West | buckaroo | 9 | 10 |
West | bucking | 19 | 21 |
West | bunkhouse | 4 | 5 |
West | caballo | 12 | 13 |
West | cabeza | 70 | 74 |
West | cack | 4 | 4 |
West | calaboose | 1 | 2 |
West | capper | 2 | 2 |
West | chapping | 1 | 1 |
West | chileno | 1 | 1 |
West | chippy | 7 | 12 |
West | clabber | 1 | 1 |
West | clunk | 1 | 1 |
West | cribbage | 1 | 1 |
West | cutback | 1 | 1 |
West | dally | 3 | 3 |
West | dogger | 2 | 3 |
West | entryway | 7 | 8 |
West | freighter | 1 | 1 |
West | frenchy | 4 | 5 |
West | gaff | 2 | 7 |
West | gesundheit | 1 | 1 |
West | glowworm | 1 | 1 |
West | goop | 5 | 5 |
West | grayback | 1 | 2 |
West | groomsman | 1 | 2 |
West | hackamore | 1 | 2 |
West | hardhead | 1 | 1 |
West | hardtail | 2 | 5 |
West | headcheese | 1 | 1 |
West | heave | 3 | 3 |
West | heinie | 1 | 1 |
West | highline | 4 | 8 |
West | hoodoo | 1 | 2 |
West | husk | 1 | 1 |
West | irrigate | 1 | 1 |
West | jibe | 4 | 5 |
West | jimmies | 4 | 8 |
West | kaput | 1 | 1 |
West | kike | 15 | 16 |
West | latigo | 3 | 4 |
West | lockup | 3 | 4 |
West | longear | 1 | 1 |
West | lunger | 1 | 1 |
West | maguey | 4 | 5 |
West | makings | 7 | 30 |
West | manzanita | 5 | 6 |
West | mayapple | 1 | 1 |
West | mochila | 4 | 4 |
West | nester | 1 | 1 |
West | nighthawk | 6 | 10 |
West | paintbrush | 19 | 29 |
West | partida | 5 | 5 |
West | peddle | 3 | 3 |
West | peeler | 1 | 1 |
West | pincushion | 3 | 6 |
West | pith | 1 | 1 |
West | plastered | 9 | 9 |
West | podunk | 2 | 2 |
West | pollywog | 1 | 1 |
West | prat | 1 | 1 |
West | puncher | 5 | 5 |
West | riffle | 1 | 1 |
West | ringy | 1 | 1 |
West | rustle | 1 | 1 |
West | rustler | 3 | 4 |
West | seep | 4 | 4 |
West | serape | 6 | 12 |
West | sinker | 11 | 15 |
West | sizzler | 5 | 5 |
West | snoozer | 1 | 1 |
West | snuffy | 2 | 2 |
West | sprangletop | 1 | 1 |
West | sunfish | 1 | 1 |
West | superhighway | 1 | 1 |
West | swamper | 2 | 4 |
West | tallboy | 2 | 2 |
West | tamarack | 2 | 3 |
West | tenderfoot | 2 | 4 |
West | tennie | 1 | 1 |
West | tumbleweed | 11 | 37 |
West | vamos | 392 | 580 |
West | waddy | 2 | 2 |
West | waken | 9 | 9 |
West | washateria | 16 | 24 |
West | weedy | 1 | 1 |
West | wienie | 4 | 4 |
West | wrangle | 4 | 5 |
West | zori | 1 | 1 |
Appendix C. Han and Baldwin (2011) Lexical Variants
Variant | Canonical | Var VP | Var Freq | Can VP | Can Freq | Shared VP
---|---|---|---|---|---|---
ahh | ah | 1,009 | 1,319 | 1,162 | 1,800 | 1,839 |
bb | baby | 665 | 861 | 4,828 | 17,472 | 4,908 |
bc | because | 2,808 | 6,220 | 4,802 | 17,280 | 5,276 |
bday | birthday | 1,281 | 2,033 | 4,650 | 19,210 | 4,814 |
bf | boyfriend | 974 | 1,194 | 2,172 | 3,398 | 2,653 |
bro | brother | 3,735 | 12,036 | 2,747 | 5,263 | 4,535 |
bs | bullshit | 953 | 1,308 | 1,395 | 1,952 | 2,016 |
btw | between | 686 | 862 | 1,890 | 6,710 | 2,288 |
chillin | chilling | 1,174 | 1,653 | 888 | 1,185 | 1,773 |
comin | coming | 563 | 681 | 3,612 | 10,765 | 3,737 |
congrats | congratulations | 1,542 | 2,945 | 881 | 1,765 | 2,002 |
convo | conversation | 521 | 586 | 960 | 1,259 | 1,336 |
cus | because | 541 | 675 | 4,802 | 17,280 | 4,876 |
cuz | because | 2,288 | 3,959 | 4,802 | 17,280 | 5,162 |
da | the | 2,326 | 5,497 | 7,669 | 598,549 | 7,670 |
dat | that | 1,648 | 2,900 | 7,134 | 142,061 | 7,145 |
dawg | dog | 806 | 1,240 | 2,356 | 5,337 | 2,750 |
de | the | 3,267 | 21,053 | 7,669 | 598,549 | 7,692 |
def | definitely | 617 | 2,575 | 1,832 | 3,224 | 2,141 |
doin | doing | 941 | 1,272 | 4,153 | 11,681 | 4,334 |
fam | family | 2,040 | 3,921 | 3,862 | 12,856 | 4,376 |
fb | | 1,127 | 1,637 | 1,246 | 1,962 | 2,037
freakin | freaking | 554 | 654 | 1,555 | 2,157 | 1,884 |
fuckin | fucking | 1,891 | 3,064 | 4,209 | 12,868 | 4,547 |
gettin | getting | 1,380 | 1,992 | 5,066 | 21,187 | 5,226 |
gf | girlfriend | 772 | 942 | 1,474 | 2,087 | 1,959 |
goin | going | 1,446 | 2,089 | 5,881 | 33,556 | 5,949 |
gon | gonna | 1,227 | 1,914 | 5,327 | 22,704 | 5,449 |
hahah | haha | 901 | 1,104 | 4,667 | 15,314 | 4,793 |
hahaha | haha | 2,597 | 4,730 | 4,667 | 15,314 | 5,097 |
hahahaha | haha | 1,201 | 1,595 | 4,667 | 15,314 | 4,821 |
hrs | hours | 739 | 1,393 | 3,043 | 8,568 | 3,284 |
jus | just | 1,011 | 1,537 | 7,074 | 131,656 | 7,082 |
kno | know | 929 | 1,377 | 6,425 | 55,510 | 6,453 |
lawd | lord | 510 | 634 | 1,938 | 3,244 | 2,185 |
lil | little | 2,990 | 7,405 | 4,913 | 21,558 | 5,435 |
lookin | looking | 1,134 | 1,534 | 4,499 | 55,830 | 4,690 |
mins | minutes | 1,583 | 14,602 | 2,352 | 5,244 | 3,164 |
mis | miss | 561 | 948 | 5,103 | 19,099 | 5,171 |
nah | no | 2,882 | 5,869 | 6,526 | 66,786 | 6,604
naw | no | 882 | 1,234 | 6,526 | 66,786 | 6,539 |
nd | and | 1,972 | 4,823 | 7,449 | 349,628 | 7,455 |
nothin | nothing | 692 | 839 | 4,074 | 10,591 | 4,213 |
ohh | oh | 736 | 869 | 5,264 | 20,804 | 5,343 |
pic | picture | 2,675 | 6,195 | 2,981 | 6,474 | 4,066 |
pics | pictures | 1,521 | 2,483 | 2,123 | 3,707 | 2,881 |
playin | playing | 585 | 679 | 3,163 | 7,102 | 3,350 |
pls | please | 1,107 | 1,635 | 4,164 | 12,972 | 4,388 |
plz | please | 840 | 1,313 | 4,164 | 12,972 | 4,340 |
ppl | people | 2,164 | 3,896 | 5,882 | 34,714 | 6,020 |
prolly | probably | 709 | 847 | 2,968 | 5,624 | 3,242 |
sayin | saying | 626 | 744 | 2,831 | 5,194 | 3,055 |
soo | so | 1,467 | 2,019 | 7,105 | 123,174 | 7,117 |
talkin | talking | 1,029 | 1,385 | 3,790 | 9,014 | 4,027 |
tha | the | 1,394 | 2,630 | 7,669 | 598,549 | 7,672 |
tht | that | 531 | 738 | 7,134 | 142,061 | 7,135 |
thx | thanks | 713 | 1,031 | 4,707 | 19,000 | 4,791 |
til | till | 1,401 | 2,279 | 2,887 | 5,588 | 3,435 |
til | until | 1,401 | 2,279 | 3,842 | 11,761 | 4,301 |
txt | text | 713 | 886 | 4,102 | 10,789 | 4,229 |
umm | um | 555 | 625 | 826 | 1,090 | 1,265 |
ur | your | 2,810 | 5,917 | 6,729 | 83,776 | 6,794 |
wat | what | 983 | 1,318 | 6,617 | 67,576 | 6,634 |
yess | yes | 576 | 665 | 4,924 | 18,365 | 4,997 |
yr | year | 566 | 809 | 4,530 | 16,848 | 4,614 |
yu | you | 1,082 | 2,144 | 7,550 | 476,752 | 7,551 |
Appendix D. Liu et al. (2011) Lexical Variants
Variant | Canonical | Var VP | Var Freq | Can VP | Can Freq | Shared VP
---|---|---|---|---|---|---
aye | yes | 1,055 | 1,409 | 4,924 | 18,365 | 5,037 |
b | be | 2,915 | 8,312 | 7,081 | 212,570 | 7,108 |
bae | baby | 3,001 | 6,203 | 4,828 | 17,472 | 5,312 |
bb | baby | 665 | 861 | 4,828 | 17,472 | 4,908 |
bby | baby | 814 | 958 | 4,828 | 17,472 | 4,949 |
bc | because | 2,808 | 6,220 | 4,802 | 17,280 | 5,276 |
bday | birthday | 1,281 | 2,033 | 4,650 | 19,210 | 4,814 |
bout | about | 3,295 | 8,238 | 6,463 | 94,613 | 6,594 |
bro | brother | 3,735 | 12,036 | 2,747 | 5,263 | 4,535 |
bros | brothers | 635 | 1,066 | 1,145 | 1,899 | 1,561 |
bs | bullshit | 953 | 1,308 | 1,395 | 1,952 | 2,016 |
butt | but | 1,312 | 1,846 | 6,808 | 86,579 | 6,825 |
c | see | 2,332 | 7,926 | 6,259 | 132,803 | 6,358 |
cause | because | 4,439 | 13,497 | 4,802 | 17,280 | 5,735 |
chillin | chilling | 1,174 | 1,653 | 888 | 1,185 | 1,773 |
comin | coming | 563 | 681 | 3,612 | 10,765 | 3,737 |
convo | conversation | 521 | 586 | 960 | 1,259 | 1,336
cus | because | 541 | 675 | 4,802 | 17,280 | 4,876 |
cutie | cute | 692 | 880 | 3,951 | 10,397 | 4,073 |
cuz | because | 2,288 | 3,959 | 4,802 | 17,280 | 5,162 |
da | the | 2,326 | 5,497 | 7,669 | 598,549 | 7,670 |
dat | that | 1,648 | 2,900 | 7,134 | 142,061 | 7,145 |
def | definitely | 617 | 2,575 | 1,832 | 3,224 | 2,141 |
dem | them | 556 | 767 | 5,320 | 23,430 | 5,361 |
dis | this | 891 | 1,269 | 7,247 | 392,504 | 7,249 |
doin | doing | 941 | 1,272 | 4,153 | 11,681 | 4,334 |
em | them | 2,585 | 5,577 | 5,320 | 23,430 | 5,578 |
fa | for | 607 | 942 | 7,429 | 438,864 | 7,431 |
fam | family | 2,040 | 3,921 | 3,862 | 12,856 | 4,376 |
fav | favorite | 1,422 | 2,199 | 3,531 | 10,655 | 3,920
fb | | 1,127 | 1,637 | 1,246 | 1,962 | 2,037
feelin | feeling | 753 | 950 | 3,300 | 7,215 | 3,511 |
fml | family | 750 | 898 | 3,862 | 12,856 | 4,053 |
fr | for | 1,059 | 1,672 | 7,429 | 438,864 | 7,436 |
freakin | freaking | 554 | 654 | 1,555 | 2,157 | 1,884 |
ft | feet | 1,273 | 11,113 | 1,303 | 1,916 | 2,173 |
fuckin | fucking | 1,891 | 3,064 | 4,209 | 12,868 | 4,547 |
gettin | getting | 1,380 | 1,992 | 5,066 | 21,187 | 5,226 |
gf | girlfriend | 772 | 942 | 1,474 | 2,087 | 1,959 |
goin | going | 1,446 | 2,089 | 5,881 | 33,556 | 5,949 |
gon | going | 1,227 | 1,914 | 5,881 | 33,556 | 5,936 |
homie | home | 1,343 | 2,249 | 5,314 | 27,569 | 5,442 |
hr | hour | 852 | 2,624 | 2,404 | 5,606 | 2,838 |
hrs | hours | 739 | 1,393 | 3,043 | 8,568 | 3,284 |
ii | i | 770 | 9,871 | 7,699 | 621,319 | 7,699 |
jus | just | 1,011 | 1,537 | 7,074 | 131,656 | 7,082 |
k | ok | 3,145 | 7,414 | 3,940 | 71,563 | 4,824 |
kno | know | 929 | 1,377 | 6,425 | 55,510 | 6,453 |
lawd | lord | 510 | 634 | 1,938 | 3,244 | 2,185 |
lil | little | 2,990 | 7,405 | 4,913 | 21,558 | 5,435 |
lookin | looking | 1,134 | 1,534 | 4,499 | 55,830 | 4,690 |
luv | love | 1,030 | 1,390 | 6,698 | 76,733 | 6,714 |
m | am | 2,507 | 7,994 | 5,176 | 25,099 | 5,507 |
ma | my | 783 | 1,231 | 7,512 | 309,237 | 7,512 |
mi | my | 2,204 | 6,510 | 7,512 | 309,237 | 7,551 |
min | minutes | 1,203 | 2,314 | 2,352 | 5,244 | 2,941 |
mines | mine | 510 | 589 | 2,755 | 5,078 | 2,968 |
mins | minutes | 1,583 | 14,602 | 2,352 | 5,244 | 3,164 |
mo | more | 585 | 20,581 | 5,669 | 31,459 | 5,706 |
n | and | 3,408 | 17,544 | 7,449 | 349,628 | 7,478 |
nada | nothing | 508 | 712 | 4,074 | 10,591 | 4,187 |
nah | no | 2,882 | 5,869 | 6,526 | 66,786 | 6,604 |
naw | no | 882 | 1,234 | 6,526 | 66,786 | 6,539 |
nd | and | 1,972 | 4,823 | 7,449 | 349,628 | 7,455 |
nothin | nothing | 692 | 839 | 4,074 | 10,591 | 4,213 |
nun | nothing | 622 | 788 | 4,074 | 10,591 | 4,195 |
ohh | oh | 736 | 869 | 5,264 | 20,804 | 5,343 |
pic | picture | 2,675 | 6,195 | 2,981 | 6,474 | 4,066 |
pics | pictures | 1,521 | 2,483 | 2,123 | 3,707 | 2,881 |
playin | playing | 585 | 679 | 3,163 | 7,102 | 3,350 |
pls | please | 1,107 | 1,635 | 4,164 | 12,972 | 4,388 |
plz | please | 840 | 1,313 | 4,164 | 12,972 | 4,340 |
ppl | people | 2,164 | 3,896 | 5,882 | 34,714 | 6,020 |
prolly | probably | 709 | 847 | 2,968 | 5,624 | 3,242 |
pt | part | 570 | 2,138 | 2,647 | 11,220 | 2,823 |
r | are | 2,280 | 5,466 | 6,657 | 76,873 | 6,712 |
rd | road | 2,123 | 15,149 | 2,022 | 5,075 | 3,220 |
sayin | saying | 626 | 744 | 2,831 | 5,194 | 3,055 |
sis | sister | 857 | 1,219 | 2,714 | 5,257 | 3,022 |
soo | so | 1,467 | 2,019 | 7,105 | 123,174 | 7,117 |
sum | some | 990 | 1,541 | 6,017 | 42,637 | 6,052 |
talkin | talking | 1,029 | 1,385 | 3,790 | 9,014 | 4,027 |
th | the | 3,238 | 17,089 | 7,669 | 598,549 | 7,672 |
tha | the | 1,394 | 2,630 | 7,669 | 598,549 | 7,672 |
thang | thing | 691 | 876 | 4,434 | 12,995 | 4,550 |
tho | though | 3,959 | 11,480 | 3,879 | 9,628 | 5,092 |
thot | thought | 607 | 791 | 3,690 | 8,510 | 3,844 |
thru | through | 1,406 | 2,281 | 3,400 | 8,800 | 3,818 |
tht | that | 531 | 738 | 7,134 | 142,061 | 7,135 |
thx | thanks | 713 | 1,031 | 4,707 | 19,000 | 4,791 |
til | till | 1,401 | 2,279 | 2,887 | 5,588 | 3,435 |
trippin | tripping | 790 | 975 | 558 | 669 | 1,204 |
turnt | turn | 684 | 836 | 2,918 | 5,943 | 3,161 |
tx | texas | 6,275 | 456,640 | 4,983 | 96,986 | 6,869 |
txt | text | 713 | 886 | 4,102 | 10,789 | 4,229 |
u | you | 5,375 | 34,958 | 7,550 | 476,752 | 7,578 |
ur | your | 2,810 | 5,917 | 6,729 | 83,776 | 6,794 |
w | with | 4,195 | 28,363 | 7,043 | 146,575 | 7,124 |
wat | what | 983 | 1,318 | 6,617 | 67,576 | 6,634 |
wen | when | 524 | 653 | 6,637 | 67,470 | 6,650 |
wit | with | 1,769 | 3,389 | 7,043 | 146,575 | 7,054 |
wut | what | 582 | 724 | 6,617 | 67,576 | 6,627 |
y | why | 3,107 | 11,552 | 5,974 | 36,088 | 6,182 |
ya | you | 4,484 | 15,215 | 7,550 | 476,752 | 7,563 |
yea | yeah | 2,418 | 4,617 | 4,499 | 13,843 | 4,938 |
yess | yes | 576 | 665 | 4,924 | 18,365 | 4,997 |
yo | you | 3,677 | 10,918 | 7,550 | 476,752 | 7,559 |
yr | year | 566 | 809 | 4,530 | 16,848 | 4,614 |
yu | you | 1,082 | 2,144 | 7,550 | 476,752 | 7,551 |
yup | yes | 1,056 | 1,499 | 4,924 | 18,365 | 5,040 |
Acknowledgments
The authors thank Axel Bohmann, Katrin Erk, John Beavers, Danny Law, Ray Mooney, and Jessy Li for their helpful discussions. The authors also thank the Texas Advanced Computing Center for the computing resources provided.
Notes
While voting precincts were a better fit for our needs, similar analyses could be performed with Census tracts, Census block groups, or any other fine-grained partitioning of a region.
The representative point is produced by the representative_point method of Shapely (Gillies et al. 2007).
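As a concrete illustration, the following is a minimal sketch of obtaining such a point with Shapely; the precinct polygon coordinates are hypothetical and serve only to show the call.

```python
# Minimal sketch of computing a representative point for a precinct polygon
# with Shapely; the polygon coordinates below are hypothetical.
from shapely.geometry import Polygon

precinct = Polygon([(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)])

# representative_point() returns a point guaranteed to lie inside the
# polygon, which is safer than the centroid for concave precinct shapes.
point = precinct.representative_point()
print(point.x, point.y)
```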
Because Poisson regression predictions can grow arbitrarily large, we cap the values at one standard deviation above the mean so that particularly large predictions do not hide the others.
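A minimal sketch of this capping, assuming the per-precinct predictions are held in a NumPy array (the array contents are illustrative):

```python
# Minimal sketch of the capping described above; `preds` holds hypothetical
# per-precinct Poisson-regression predictions.
import numpy as np

preds = np.array([0.2, 1.5, 0.8, 42.0, 1.1])

# Cap at one standard deviation above the mean so that a single very large
# prediction does not dominate the visualization and hide the others.
cap = preds.mean() + preds.std()
capped = np.minimum(preds, cap)
```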
We note that these lists include pairs that do not necessarily reflect lexical variation, such as typos. However, drawing the line between typo and variant is a difficult question in its own right and beyond the scope of our analysis.
While the BERTLEF Retrofitting results do appear to climb back up, the number of pairs being averaged over is decreasing, so the apparent rise may indicate survivorship bias rather than genuine improvement.
Images come from US News & World Report and Wikipedia.
References
Author notes
Research performed while attending The University of Texas at Austin.
Action Editor: Ekaterina Shutova