Abstract
Computational typology has gained traction in the field of Natural Language Processing (NLP) in recent years, as evidenced by the increasing number of papers on the topic and the establishment of a dedicated Special Interest Group (SIGTYP), which has organized successful workshops and shared tasks. A considerable amount of work in this sub-field is concerned with prediction of typological features, for example, for databases such as the World Atlas of Language Structures (WALS) or Grambank. Prediction is argued to be useful because (1) it allows for obtaining feature values for relatively undocumented languages, alleviating the sparseness in WALS, which is in turn argued to be useful for both NLP and linguistics; and (2) it allows us to probe models to see whether or not these typological features are encapsulated in, for example, language representations. In this article, we present a critical stance concerning prediction of typological features, investigating to what extent this line of research is aligned with purported needs, both from the perspective of NLP practitioners and, perhaps more importantly, from the perspective of linguists specialized in typology and language documentation. We provide evidence that this line of research in its current state suffers from a lack of interdisciplinary alignment. Based on an extensive survey of the linguistic typology community, we present concrete recommendations for future research in order to improve this alignment between linguists and NLP researchers, beyond the scope of typological feature prediction.
1 Introduction
Over the course of the past two centuries, linguistic typologists have studied languages with respect to their structural and functional properties, thereby implicitly classifying languages as being more or less similar to one another by virtue of such properties (Comrie 1988; Haspelmath et al. 2001; Velupillai 2012). Typology has a long history (Herder 1772; Gabelentz 1891; Greenberg 1960, 1974; Dahl 1985; Comrie 1989; Croft 2003), and recently computational approaches have gained substantial popularity (Wichmann and Saunders 2007; Dunn et al. 2011; Wälchli 2014; Östling 2015; Cotterell and Eisner 2017; Asgari and Schütze 2017; Malaviya, Neubig, and Littell 2017; Bjerva and Augenstein 2018b; Levshina 2019; Bjerva et al. 2020; Oncevay, Haddow, and Birch 2020; Östling and Kurfalı 2023; Baylor, Ploeger, and Bjerva 2023). One part of traditional typological research deals with manually extracting features of languages from existing descriptions, for instance, ending up in databases such as the World Atlas of Language Structures (WALS, Dryer and Haspelmath 2013), URIEL (Littell et al. 2017), AUTOTYP (Bickel et al. 2023), PHOIBLE (Moran and McCloy 2019), and most recently Grambank (Skirgård et al. 2023). A recent development that can be seen as complementary to this is the process of learning distributed language representations in the form of dense real-valued vectors, often referred to as language embeddings (Tsvetkov et al. 2016; Östling and Tiedemann 2017; Malaviya, Neubig, and Littell 2017; Jin and Xiong 2022; Bjerva et al. 2019c; Harvill, Girju, and Hasegawa-Johnson 2022; Chen, Biswas, and Bjerva 2023).
In this article, we focus on the task of typological feature prediction, as introduced by work such as Teh, Daumé III, and Roy (2009) and Daumé III and Campbell (2007), and featured in the SIGTYP 2020 Shared Task (Bjerva et al. 2020). Once a relatively niche topic in the NLP community, studying typological features has recently risen in popularity and importance for a number of reasons. The field has seen considerable advances in cross-lingual transfer learning, whereby stable cross-lingual representations can be learned on massive amounts of data in an unsupervised way, be it for words (Ammar et al. 2016; Wada, Iwata, and Matsumoto 2019) or sentences (Artetxe and Schwenk 2019; Devlin et al. 2019; Conneau and Lample 2019; Conneau et al. 2020; Tiyajamorn et al. 2021; Ouyang et al. 2021). This naturally raises the question of what these representations encode, and some have turned to typology for potential answers (Choenni and Shutova 2020; Zhao et al. 2021; Stanczak et al. 2022). In a similar vein, research has shown that these learned representations can be fine-tuned for supervised tasks, then applied to new languages in a few- or even zero-shot fashion with surprisingly high performance. This has raised the question of what causes this performance, and to what degree typological similarities are exploited by such models (Bjerva and Augenstein 2018a; Nooralahzadeh et al. 2020; Zhao et al. 2021; Östling and Kurfalı 2023). In addition to using typology for diagnostic purposes, prior work has also found that typology can, to some extent, guide cross-lingual sharing (de Lhoneux et al. 2018). Finally, the relationship between typological resources such as WALS (Dryer and Haspelmath 2013) and language representations has been studied, which has shown that knowledge base population methods can be used to complete typological resources (Malaviya, Neubig, and Littell 2017; Murawaki 2017; Bjerva and Augenstein 2018a; Bjerva et al. 2019c), and that typological implications can be discovered automatically (Daumé III and Campbell 2007; Bjerva et al. 2019b). Experiments in using typological features for NLP typically find sporadic and limited benefits (O’Horan et al. 2016; Ponti et al. 2019; Oncevay, Haddow, and Birch 2020).
While many such applications are well-motivated, the precise purpose of predicting typological features remains unclear. In this article, we investigate this question, provide an overview of arguments used in the NLP literature, and assess these arguments critically. In order to address this question, we first provide an overview of past work and current usage areas of typology and typological feature prediction in NLP. We next turn to linguistics, and present results of a survey and in-depth interviews of experts in typology, experts in language documentation, and other linguists, in order to map out the usefulness of our current work. Finally, we give recommendations on future research directions based on our findings, in an attempt to improve alignment between work in computational linguistics focused on typological feature prediction, and what may actually be of use to field linguists and typologists.
2 Related Work
We present a brief overview of typological feature prediction and its uses in NLP here, and refer the reader to Ponti et al. (2019) for a more thorough overview focusing on empirical usefulness of typological feature prediction. In NLP, typological feature prediction is commonly carried out against existing databases such as WALS (Dryer and Haspelmath 2013) or, more recently, Grambank (Skirgård et al. 2023). Methodologically speaking, features are typically either used or predicted in the context of other features and other languages (Teh, Daumé III, and Roy 2009; Daumé III and Campbell 2007; Naseem, Barzilay, and Globerson 2012; Täckström, McDonald, and Nivre 2013; Berzak, Reichart, and Katz 2014; Malaviya, Gormley, and Neubig 2018; Bjerva et al. 2019c, 2019a, 2020, 2019b; Vastl, Zeman, and Rosa 2020; Jäger 2020; Choudhary 2020; Gutkin and Sproat 2020; Kumar et al. 2020). That is to say, given a language l ∈ L, where L is the set of all languages contained in a specific database, and the features of that language F_l, the setup is typically to attempt to predict some subset of features f ⊂ F_l, based on the remaining features F_l ∖ f. This language may be (partially) held out during training, such that a typological feature prediction model is fine-tuned on L ∖ {l}, before being evaluated on language l. Variations of this setup exist, with attempts to control for language relatedness in training/test sets, using genealogical, areal, or structural similarities (Bjerva et al. 2020; Östling and Kurfalı 2023). In general, the degree to which areal and genealogical factors are controlled for in typological feature prediction is quite limited. Typically, previous work attempts to hold out languages during training in a given radius of, for example, 1,000 km (Jaeger et al. 2011; Cysouw 2013; Bjerva et al. 2020), or attempts to use family and branch information to avoid overestimation of prediction power. 
Related work either follows this type of approach (Östling and Kurfalı 2023), or omits controls altogether. Although this is not the core of this article, we generally recommend that future work take this type of factor into account, for example, by using linguistically motivated filtering approaches based on macroareas (Dryer 1989, 1992; Hammarström and Donohue 2014; Miestamo, Bakker, and Arppe 2016), the somewhat more fine-grained AUTOTYP areas (Nichols and Bickel 2009), which include historical, genetic, archaeological, and anthropological factors, sociolinguistic environments (Sinnemäki and Di Garbo 2018), or information regarding shared borders between languages (Cysouw, Dediu, and Moran 2012; Dryer 2018).
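To make the controlled hold-out setup concrete, the following is a minimal sketch and not a reimplementation of any cited system. It predicts one feature of a held-out language by majority vote over the remaining languages, after excluding languages from the same family and languages within a given radius. The language names, families, coordinates, and feature values in `TOY` are hypothetical placeholders, and the majority-vote predictor is a deliberately simple stand-in for a trained model.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def predict_feature(target, feature, languages, radius_km=1000.0):
    """Majority-vote prediction of `feature` for the held-out language `target`.

    Only languages from a different family and farther away than `radius_km`
    are allowed to vote (a crude genealogical and areal control).
    Returns None if no language is eligible.
    """
    tgt = languages[target]
    votes = Counter(
        lang["features"][feature]
        for name, lang in languages.items()
        if name != target
        and feature in lang["features"]
        and lang["family"] != tgt["family"]
        and haversine_km(lang["coords"], tgt["coords"]) > radius_km
    )
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy database: every value below is a placeholder, not real data.
TOY = {
    "lang_a": {"family": "fam1", "coords": (0.0, 0.0), "features": {"order": "SVO"}},
    "lang_b": {"family": "fam1", "coords": (0.0, 30.0), "features": {"order": "SVO"}},
    "lang_c": {"family": "fam2", "coords": (0.0, 1.0), "features": {"order": "SOV"}},
    "lang_d": {"family": "fam3", "coords": (40.0, 60.0), "features": {"order": "SOV"}},
    "lang_e": {"family": "fam4", "coords": (-30.0, -50.0), "features": {"order": "VSO"}},
    "lang_f": {"family": "fam5", "coords": (50.0, 100.0), "features": {"order": "SOV"}},
}

if __name__ == "__main__":
    # lang_b (same family) and lang_c (about 111 km away) are excluded from voting.
    print(predict_feature("lang_a", "order", TOY))
```

In this toy case the gold value for `lang_a` is only recoverable from its excluded relative `lang_b`, so the controlled prediction differs from it. This is the kind of overestimation that such controls are meant to expose.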
3 Why Do NLP Practitioners Predict Typological Features?
The adoption of the task of typological feature prediction in NLP stems from three core arguments in the literature: (1) sparsity, (2) continuity, and (3) utility for NLP. Although these arguments are frequently made, we here argue that they are largely unsubstantiated.
3.1 Sparsity: “Typological Databases Are Sparse and Incomplete”
Many typological databases indeed contain gaps for feature-language combinations. This certainly is the case with, for example, WALS and URIEL, where many combinations are absent. Many gaps exist for good reasons: the WALS feature Nasal Vowels in West Africa is, for obvious reasons, absent for languages outside of West Africa. Some databases, such as PHOIBLE and Grambank, generally do not suffer from this particular issue (Skirgård et al. 2023). There is a general argument echoed by, for example, Daumé III and Campbell (2007), Berzak, Reichart, and Katz (2014), Buis and Hulden (2019), and Bjerva et al. (2019a), stating that completing these databases would be useful for typologists, highlighting that it is a difficult task to solve. Furthermore, it is argued that inaccurate information in databases can be detected and fixed automatically.
While imputing missing data can be useful for downstream tasks in, for example, computational historical linguistics, we here consider whether such predictions constitute a contribution to typological knowledge. Generally speaking, the predictions made by such systems for undocumented features in WALS are rather well-known. This stems from the core issue with such methods, namely, that they are first and foremost based on correlations, be it between typological features (e.g., affixation correlates with basic word order), or between similar languages (e.g., most Germanic languages are SVO). As models are typically good at picking up on such correlations, one of the findings in the SIGTYP 2020 shared task was that practically all system submissions are able to correctly predict easy features. In the case of more difficult features (e.g., rare or atypical combinations), the best models only attained an accuracy of roughly 65% (Bjerva et al. 2020). Hence, in the cases where a language is typologically interesting (e.g., where an uncommon combination of typological features occurs), current state-of-the-art models do not fare well.
3.2 Continuity: “NLP Can Facilitate a Continuous Scale View on Typology”
An argument with support in the linguistic literature deals with the fact that, for example, word-order typology arguably lies on a continuum, rather than discrete categorization (Levshina et al. 2023). For instance, French allows for both Noun-Adjective and Adjective-Noun ordering, depending on various constraints (Laenzlinger 2005). An empirical investigation of word-order typology in the Universal Dependencies dataset provides a detailed cross-lingual perspective on the matter. Following Baylor, Ploeger, and Bjerva (2023), we use dependency links to calculate the proportion of, for example, Noun-Adjective vs. Adjective-Noun ordering examples across 100 languages. Contrasting this with categorical features, as represented in WALS, highlights the fact that this type of representation is a poor match with the feature distributions seen across corpora (Figure 1). Clearly, basic approaches to computational linguistics can help paint a descriptive picture of language data in this manner. However, would an output from a black-box NLP model, saying that a language is “40% Noun-Adjective,” be useful, or is a more descriptive and transparent method, as described, required?
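The descriptive, transparent calculation referred to above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Baylor, Ploeger, and Bjerva (2023): it assumes CoNLL-U input in which adjectival modifiers carry the Universal Dependencies relation `amod` and the UPOS tag `ADJ`, and it counts an adjective as postnominal when its token index is greater than that of its head. The function name `noun_adj_proportion` and the two hand-written French phrases in `SAMPLE` are illustrative and are not drawn from any treebank.

```python
def noun_adj_proportion(conllu_text):
    """Share of adjectival modifiers that follow their head (Noun-Adjective order).

    Returns a float in [0, 1], or None if no adjectival modifier is found.
    """
    noun_adj = adj_noun = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword tokens (e.g. 1-2) and empty nodes (e.g. 1.1)
        tok_id, upos, head, deprel = int(cols[0]), cols[3], cols[6], cols[7]
        # Relation subtypes such as "amod:xyz" are treated as plain "amod".
        if upos == "ADJ" and deprel.split(":")[0] == "amod" and head.isdigit():
            if tok_id > int(head):
                noun_adj += 1  # adjective after its head
            else:
                adj_noun += 1  # adjective before its head
    total = noun_adj + adj_noun
    return noun_adj / total if total else None


# Two hand-written phrases: "une maison blanche" and "une petite maison".
SAMPLE = "\n".join([
    "# text = une maison blanche",
    "1\tune\tun\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tmaison\tmaison\tNOUN\t_\t_\t0\troot\t_\t_",
    "3\tblanche\tblanc\tADJ\t_\t_\t2\tamod\t_\t_",
    "",
    "# text = une petite maison",
    "1\tune\tun\tDET\t_\t_\t3\tdet\t_\t_",
    "2\tpetite\tpetit\tADJ\t_\t_\t3\tamod\t_\t_",
    "3\tmaison\tmaison\tNOUN\t_\t_\t0\troot\t_\t_",
])

if __name__ == "__main__":
    print(noun_adj_proportion(SAMPLE))
```

Because every count traces back to specific tokens in specific sentences, a proportion obtained this way can be inspected example by example, which is what distinguishes it from an opaque model output.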
3.3 Utility: “Prediction of Typological Features Can Be Useful for NLP”
Finally, it is commonly argued that typological features can aid performance in multilingual NLP models, for example, serving as a guide in cross-lingual transfer (Lent et al. 2023). Indeed, limited benefits can be found in various experimental setups across common NLP tasks and languages with annotated features (Naseem, Barzilay, and Globerson 2012; Täckström, McDonald, and Nivre 2013; de Lhoneux et al. 2018), and previous work has shown that typological information is learned as a by-product of training (Bjerva and Augenstein 2021). As Figure 1 hints, it may also be that the culprit is the inherent mismatch between typological database information and data-driven gradient typology (Baylor, Ploeger, and Bjerva 2023, 2024). Considering predicted typological features, Üstün et al. (2022) find benefits in zero-shot settings for parsing. However, work considering typological similarities when finding appropriate language pairings in cross-lingual transfer often finds combinations which are not easily explained by typology, likely due to artifacts in training or evaluation setups (Dolicki and Spanakis 2021; de Vries, Wieling, and Nissim 2022). In this vein, Srinivasan et al. (2021) contend that low performance for Yoruba may be due to its vigesimal number system, whereas m-BERT is primarily trained on languages using the decimal system—it is difficult to substantiate that this is much more than a spurious correlation. In sum, although there is debate on the subject of utility, we argue that this argument is likely the only valid reason for predicting typological features, as it stands today.
4 Do Linguists Want Typological Feature Prediction?
Having established the common arguments for prediction of typological features used by the NLP community, we now turn to the linguistic community to investigate these claims. Do linguists agree that typological feature prediction constitutes a contribution to the field, solving an inherently difficult problem? In terms of difficulty, it appears at first glance that this might be the case. For instance, Dryer (2007) points out that “it may be difficult to distinguish pronouns from nouns except on a semantic basis”, and Curnow (2000) argues that it is difficult to distinguish between inflectional and zero copulas in languages without verbal morphology. However, Haspelmath (2021) outlines an important distinction in this area. It is not that drawing distinctions between typological categorization is difficult, but rather that there is an underlying data issue making it difficult to draw sound conclusions based on a sufficient sample.
The literature does not have much to say about the usefulness of the task, however. Based on this, we have developed a questionnaire to investigate whether this line of research is useful to linguists, and if not, what needs to be changed so as to provide utility. The questionnaire was designed with the core research question of this article in mind, aiming to tease apart whether what NLP is currently doing is useful and, if not, what might be useful for future work. Specifically, the findings in this section are based on a survey and in-depth interviews with experts in linguistic typology, language documentation, and general linguistics. The survey was disseminated among experts on the Lingtyp mailing list, and directly to several linguistics departments worldwide, including follow-up interviews with linguists at various career stages. The respondents were initially informed of the survey’s scope:
In recent years, the field(s) of Natural Language Processing (NLP) and Computational Linguistics (CL) have started paying increased amounts of attention to linguistic typology. Among other things, NLP/CL researchers have developed systems for automatic inference of typological features. Typically, NLP/CL researchers working with typological features claim that this research direction has potential relevance to linguistics. However, it is not established that this line of research has any relevance to linguists at all. In this survey, we aim to bridge the gap between NLP/CL and linguistics researchers. Initially we want to create an overview of how linguists perceive such NLP/CL efforts, to what extent they are useful, or may be useful to the field in the future. The end goal is to improve alignment of research efforts of NLP/CL researchers with an interest in, e.g., typology, with the actual needs of the linguistic community.
4.1 Survey Respondents
The survey attracted a total of 34 responses, across career stages, with representation from 20 countries, on 3 continents. Out of the surveyed population, 80% identify as being linguistic typology researchers, 70% as working with language documentation, and 60% as working with general linguistics. Eighty percent of respondents are at a postdoctoral stage or later, with the remaining 20% being graduate students, bachelor’s students, or other.
4.2 Quantitative Responses
Following the prompt above, an initial survey was carried out in which respondents provided answers on a 5-point Likert scale, with descriptors at each end point (Not at all useful – Highly useful). The following questions were provided, with summaries of responses in Table 1:
Is automated prediction of features based on other known features useful?
Is prediction based on descriptions of language, e.g., grammars, useful?
Is prediction from textual input in a language, e.g., collected and transcribed samples, useful?
Is prediction from sound input in a language, e.g., recorded speech, useful?
How important is explainability in the utility of the models?
The general trend in the survey responses, summarized in Table 1, is symptomatic of a lack of alignment between NLP practitioners and linguists. All approaches to typological feature prediction (TFP) are found to be not useful or moderately useful. Explainability is highlighted as a key feature for the success of any TFP tool.
4.3 Qualitative Responses
In addition to these questions, qualitative responses were gathered in part from a free-text input in the questionnaire, in addition to a series of semi-structured interviews with experts in the community. Generally speaking, the qualitative responses gathered tell a story of skepticism. NLP practitioners are viewed as neglecting the efforts of language documentation, without much understanding of basic documentation workflow, highlighting a need for us as a community to get a grasp of this before commenting on it. While many responses indicate that NLP/CL researchers offer valuable feedback to the linguistic community, for example, in facilitating access to automatic speech recognition for corpora creation, the specific aspect of typological feature prediction generally does not seem to be particularly valued. Indeed, the surveyed population also point out the well-established aspect of the problem of categorical values in typological databases, for example, stating that language descriptions are better formulated as “language X has category Y, but …”, or “morphosyntactic pattern X is attested in language Y, but …”
5 The Future of Typological Feature Prediction
Based on the survey, we here propose three concrete directions for future work.
5.1 Make Predictions Explainable
Explainability is a crucial factor in typological feature prediction, particularly if the goal is for predicted features to be useful for typologists. Both quantitative and qualitative survey responses indicate that specific attributions of feature predictions are needed, e.g., via indication of specific examples in grammars or transcriptions that verify any claims made. This echoes findings in other work on acceptance of artificial intelligence (AI), specifically in that explainability is the key to AI acceptance (Shin 2021). Concretely, methodologies based on saliency metrics or contrastive learning between typologically distinct languages may be useful avenues to explore in future NLP research incorporating TFP.
5.2 Communicate with Domain Experts
The issue outlined in this article is one of misalignment between communities, essentially an instantiation of a long-standing issue in NLP, commonly referred to as a pendulum oscillating between a linguistic focus and an engineering focus (Church and Liberman 2021). Typological feature prediction has, perhaps, seemed like a task with clear utility to a specific community, due to its inherent “difficulty.” However, as outlined in this article, and as argued by, for example, Haspelmath (2021), the difficulty is not in categorization of languages into specific feature buckets, but rather one of data scarcity. Concretely, we suggest that future work that aims to have relevance to the linguistic community be spurred by communication with domain experts. Linguistics offers rigorous frameworks for understanding the intricate properties and structures inherent in human language. This theoretical foundation has found its way into many areas of NLP, and is lacking in others. Conversely, empirical findings from NLP can highlight potential research avenues within linguistics. A structured communication channel between the two domains can facilitate the introduction of theoretical findings from linguistics into computational models, and allow empirical results from NLP to be contextualized within linguistic theories. With improved alignment, a deeper and more comprehensive understanding of both the structure and function of language can be achieved, fostering novel scientific insights.
5.3 Base Predictions on Real Data
Basing predictions on structured data, such as from existing features, is not deemed particularly insightful by the community. As highlighted, correlative predictions based on other features are typically either well-known and carry little novel value for a linguist, or are based on spurious correlations and entirely nonsensical. Concretely, development of a typological feature prediction system that uses text or meta-text as input might have significant value to the community, if paired with explainability. For instance, correctly analyzing a language as being suffixing in its inflectional morphology, while pointing to concrete examples of such suffixing, is an example with potential value.
6 Conclusions
For years, the NLP and CL communities have touted the task of typological feature prediction as one fulfilling a specific need in the linguistic community. The results outlined in this article largely refute this claim. While any further claims to this effect should be revisited, we further recommend that other claims in the NLP community be sanity-checked with regard to the group they supposedly help. This is clearly the case in interactions with linguists, but also echoes the sentiment of Bird (2021) in that, for example, some communities simply do not have an expressed need for specific language technologies. In short, future work in NLP making claims of community interaction and benefiting marginalized groups ought to invest the effort needed to verify these claims, before they become widespread and accepted without any interdisciplinary grounding.
Acknowledgments
This work was immensely improved thanks to input from discussions with Esther Ploeger, Emi Baylor, Marcell Fekete, Yiyi Chen, Heather Lent, Carl Börstell, Johan Sjons, and Bruno Olsson. We further thank the LINGTYP community for participating in the survey in this work, as well as the broad range of anonymous reviewers for their feedback on the initial version of the article. This work was supported by a Semper Ardens: Accelerate research grant (CF21-0454) from the Carlsberg Foundation.