Abstract
This article describes a compositional distributional method to generate contextualized senses of words and identify their appropriate translations in the target language using monolingual corpora. Word translation is modeled in the same way as contextualization of word meaning, but in a bilingual vector space. The contextualization of meaning is carried out by means of distributional composition within a structured vector space with syntactic dependencies, and the bilingual space is created by means of transfer rules and a bilingual dictionary. A phrase in the source language, consisting of a head and a dependent, is translated into the target language by selecting both the nearest neighbor of the head given the dependent, and the nearest neighbor of the dependent given the head. This process is expanded to larger phrases by means of incremental composition. Experiments were performed on English and Spanish monolingual corpora in order to translate phrasal verbs in context. A new bilingual data set to evaluate strategies aimed at translating phrasal verbs in restricted syntactic domains has been created and released.
1. Introduction
Compositional models in distributional semantics combine word vectors to yield new compositional vectors that represent the meaning of composite expressions. Some compositional approaches use syntactically enriched vector models, assuming a structured vector space in which word contexts are defined on the basis of dependency relations (Erk and Padó 2008; Thater, Fürstenau, and Pinkal 2010). In those approaches, the compositional vectors correspond to the meaning of words in context and the syntax-based combination of vectors enables words to be disambiguated as a process of contextualization (Weir et al. 2016). Similarly, the compositional process applied to a bilingual vector space should also enable translating polysemous words in context in an appropriate way.
Table 1 shows an example. Given the expression catch a ball, the sense of catch combined with ball is similar to grab, and can be translated into Spanish as coger. By contrast, this verb has a meaning similar to contract when combined with disease in the expression catch a disease, and its most appropriate translation into Spanish is then contraer. On the other hand, the sense of ball when combined with catch designates a spherical object and its translation into Spanish is pelota. However, the meaning of ball refers to a dancing event when it is combined with attend in attend a ball, and it is then translated into Spanish as baile. Both sense disambiguation and language translation are sensitive to the compositional construction of new meanings (Brown et al. 1991). In a bilingual distributional framework, we call “contextualized translation” the construction of compositional vectors for expressions in the target language that are similar to the compositional vectors of the corresponding expressions in the source language. The target expression whose compositional vector is most similar to the vector of the source expression is considered its most likely (contextualized) translation. This task is known in machine translation as target word selection, namely, the task of deciding which target language word is the most appropriate equivalent of a source language word in context (Dagan 1991).
Expression | Similar sense | Spanish translation
---|---|---
catch a ball | grab | coger (spa)
catch a disease | contract | contraer (spa)
catch a ball | spherical object | pelota (spa)
attend a ball | dancing event | baile (spa)
In a monolingual vector space, we propose a compositional model based on that described in Erk and Padó (2008) and Erk et al. (2010). When two words, catch and ball, are related by a syntactic dependency, for instance dobj (direct object), we actually perform two different combinations: on the one hand, we combine the vector of the head word, noted catch, with the selectional preferences, noted balld↓, imposed by the dependent word ball on the head catch, in order to obtain a new compositional vector: catchdobj↑. This is the contextualized sense of the head catch given ball in relation dobj, which would be close to the meaning of grab and not to that of contract. On the other hand, the vector of the dependent word, ball, is combined with the selectional preferences, noted catchh↑, imposed by the head catch on ball, so as to obtain a new compositional vector: balldobj↓. This is the contextualized sense of ball given catch in the same relation dobj, which should denote a spherical object and not a dancing event. So, when two words are syntactically dependent, the compositional process builds two new vectors: one for the head expression and another one for the dependent one. In this approach, the vector space is structured with syntactic dependencies, and word senses are contextualized as words are combined with each other through their dependencies. This compositional strategy is useful to identify paraphrases, that is, similar composite expressions in the same language. The key element of such a syntax-sensitive vector space, which is high dimensional and sparse, is the concept of selectional preferences (defined later in Section 3).
Our main contribution is to adapt this syntax-based compositional process to a bilingual model. The contextualized translation of a given composite expression in the source language is performed by searching its nearest-neighbor vectors, among a set of candidates, in the target language, after having been contextualized as described here. This could be seen as a first step toward the definition of a compositional strategy for machine translation. Another important contribution of our work is the creation of an evaluation data set consisting of 1,119 Spanish translations of English sentences containing phrasal verbs.
The objective of the article is to define a bilingual vector space to perform contextualized translations of phrasal verbs, on the basis of a distributional compositional model enriched with syntactic information. What we call contextualized translation is actually a sort of unsupervised compositional-based machine translation. However, we prefer keeping the term contextualization because the compositional strategy we use is the same as that required for generating contextualized senses in the same language. Preliminary ideas underlying this method have been reported in Gamallo (2017c).
This article is organized as follows. Some related work is addressed in the next section (2). Then, Section 3 describes the compositional distributional model we follow. Next, Section 4 introduces the bilingual word space, and Section 5 defines our contextualized translation strategy. Experiments on translation of phrasal verbs are performed in Section 6 and, finally, relevant conclusions are addressed in Section 7.
2. Related Work
Our approach relies on three strategies: a compositional method to build vectors representing the contextualized sense of composite expressions (Subsection 2.1); a way of building a bilingual vector space using monolingual corpora (Subsection 2.2); and a strategy to propose contextualized translations with bilingual and compositional vectors (Subsection 2.3).
2.1 Compositional Vectors
In the last decade, different compositional models have been proposed and most of them use bag-of-words as basic representations of word contexts in the vector space. The most intuitive approach, reported in Mitchell and Lapata (2008, 2009, 2010), consists of combining vectors of two related words with arithmetic operations: component-wise addition and multiplication. Mitchell and Lapata (2009, 2010) describe weighted additive models giving more weight to some constituents—for instance, to the head word in a verb-noun expression, as the whole construction is closer to the verb than to the noun. Other weighted additive models are described in Guevara (2010) and Zanzotto et al. (2010). These models only define composition operations for two syntactically related words. Their main drawback is that they do not propose a more systematic model covering all types of semantic composition. More precisely, they do not focus on the function–argument relationship underlying compositionality in categorial grammar (CG) approaches—that is, they do not provide a linguistic combination with the elegant mechanism expressed by the principle of compositionality, where words interact with each other according to their syntactic categories (Montague 1970).
Other approaches develop robust models of meaning composition inspired by CG approaches. They learn the meaning of functional words (e.g., verbs and adjectives) from corpus-based occurrences by making use of regression predictive modeling (Baroni and Zamparelli 2010; Baroni 2013; Krishnamurthy and Mitchell 2013; Baroni, Bernardi, and Zamparelli 2014). In our proposal, by contrast, compositional functions are associated not with functional words, but with syntactic dependencies. Besides, they are not learned using regression techniques, but are just modeled as basic arithmetic operations on vectors as in Mitchell and Lapata (2008) and Weir et al. (2016). Arithmetic operations are easy to implement and produce high-quality compositional vectors, which makes them suited to practical applications (Baroni, Bernardi, and Zamparelli 2014).
There are other compositional methods still relying on CG that use tensor products (Coecke, Sadrzadeh, and Clark 2010; Grefenstette et al. 2011). Two problems can arise with tensor products. First, they lead to a problem of information scalability, because tensor representations grow exponentially as the phrases lengthen (Turney 2013). Second, tensor products do not seem to perform as well as basic arithmetic operations (e.g., multiplication) as was reported in Mitchell and Lapata (2010).
There are also studies making use of neural-based approaches (or deep learning strategies) to deal with word contextualization. In Peters et al. (2018), unlike traditional word type embeddings, each token is assigned a representation that is a function of the entire input sentence. In particular, the authors use vectors derived from a bidirectional long short-term memory network (LSTM) that is trained with a coupled language model in order to build contextualized vectors. Melamud, Goldberger, and Dagan (2016) also make use of bidirectional LSTM for efficiently learning a generic context embedding function. Devlin et al. (2018) make use of masked language models to enable pre-trained deep bidirectional representations. In a similar way, McCann et al. (2017) use a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors. However, in these four studies, word contextualization is not defined by means of syntax-based compositional functions, as they do not consider the syntactic functions of the constituent words.
Other pieces of work make use of deep learning strategies to build compositional vectors, such as recursive neural network models (Socher et al. 2012; Hashimoto and Tsuruoka 2015), which share with our model the idea that in the composition of two words both words modify each other’s meaning. Similarly, the deep recursive neural network reported in Irsoy and Cardie (2014) considers the structural representation of a phrase (e.g., a parse tree) so as to recursively generate parent representations in a bottom–up fashion, by combining tokens to produce representations for phrases. However, the opaque embeddings built by means of neural-based strategies cannot be easily adapted to our compositional method since it requires a transparent and syntax-sensitive vector space made of lexico-syntactic contexts.
So far, all the cited works represent vector contexts by means of window-based techniques. However, there are a few studies using vector spaces structured with syntactic information as in our approach. Thater, Fürstenau, and Pinkal (2010) distinguish between first-order and second-order vectors in order to allow two syntactically incompatible vectors to be combined. This work is inspired by that described in Erk and Padó (2008) and Erk et al. (2010), in which second-order (or indirect) vectors represent selectional preferences and each word combination gives rise to two contextualized word senses. More recently, Weir et al. (2016) describe a similar approach where the meaning of a sentence is represented by the contextualized senses of its constituent words. The main difference is the type of context they use to build word vectors. Each word occurrence is modeled by what they call an anchored packed dependency tree, which is a dependency-based graph that captures the full sentential context of the word. The main drawback of this context approach is its critical tendency to build very sparse word representations. In the deep learning paradigm, special attention should be given to a syntax-sensitive compositional version of the CBOW algorithm, called C-PHRASE (Pham et al. 2015).
Our proposal is an attempt to merge the main ideas of these syntax-sensitive models (i.e., to consider two word senses per combination and to use the concept of selectional preferences) in order to apply them to contextualized translation.
2.2 Cross-Lingual Word Similarity from Monolingual Corpora
The method proposed in this article also relies on techniques to build bilingual vectors from monolingual corpora. Most approaches to extract translation equivalents from monolingual corpora define the contextual distribution of a word by considering bilingual pairs of seed words. In most cases, seed words are provided by external bilingual dictionaries (Fung and McKeown 1997; Fung and Yee 1998; Rapp 1999; Chiao and Zweigenbaum 2002; Shao and Ng 2004; Saralegi, Vicente, and Gurrutxaga 2008; Gamallo 2007; Gamallo and Pichel 2008; Yu and Tsujii 2009a; Ismail and Manandhar 2010; Rubino and Linarés 2011; Tamura, Watanabe, and Sumita 2012; Aker, Paramita, and Gaizauskas 2013; Ansari et al. 2014). So, a word in the target language is a translation candidate of a word in the source language if the two words tend to co-occur with seed words that are translations of each other. A slightly different strategy is reported in Wijaya et al. (2017), where the learning task is modeled as a matrix completion problem with source words in the columns and target words in the rows. More precisely, starting from some observed translations (e.g., from existing bilingual dictionaries), the method infers missing translations in the matrix using matrix factorization with Bayesian Personalized Ranking.
A related but distinct task is cross-lingual hypernymy detection, which determines whether a word in one language (e.g., vehicle) is a hypernym of a word in another language (e.g., coche [car] in Spanish). Upadhyay et al. (2018) describe an unsupervised approach for cross-lingual hypernymy detection, which learns sparse, bilingual word embeddings based on dependency contexts. Neural-based strategies have also been used to learn translation equivalents from word embeddings (Mikolov, Le, and Sutskever 2013; Artetxe, Labaka, and Agirre 2016, 2018). They learn a linear mapping between embeddings in two languages, which minimizes the distances between equivalences listed in a bilingual dictionary. Artetxe, Labaka, and Agirre (2017) provide very good results using small lists of seed words. Mapped embeddings are used to train unsupervised machine translation systems, which leverage automatic generation of parallel data by back-translating with a backward model operating in the other direction, and the denoising effect of a language model trained on the target side (Artetxe et al. 2017; Lample et al. 2018).
Unlike most approaches to extract word translations from monolingual corpora, which are based on windowing techniques without syntactic information, we will use a method that relies on dependency-based contexts. A significant number of papers report that contexts based on syntactic dependencies outperform window-based strategies in bilingual extraction (Gamallo and Pichel 2008; Yu and Tsujii 2009b; Andrade, Matsuzaki, and Tsujii 2011; Hazem and Morin 2014).
2.3 Compositional Translation of Composite Expressions
Most approaches to unsupervised compositional translation of phrases, multiwords, and composite terms consist of decomposing the source term into atomic components, translating these components into the target language, and recomposing the translated components into target terms (Tanaka and Baldwin 2003; Grefenstette 1999; Delpech et al. 2012; Morin and Daille 2012). The simplest strategy assumes that the translation of a compound may be obtained by translating each component individually using a general dictionary, generating all the combinations of word positions, and then filtering the translated expressions using either the target corpus or the Web, as in Grefenstette (1999) and as sketched below.
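As an illustration, here is a minimal sketch of this generate-and-filter scheme; the dictionary entries, corpus counts, target pattern, and function names are our own illustrative assumptions, not the cited systems' actual data:

```python
from itertools import product

# Hypothetical word-for-word dictionary and target-corpus frequency counts.
dictionary = {"coach": ["entrenador", "autobús"], "station": ["estación", "emisora"]}
corpus_freq = {"estación de autobús": 120}

def translate_compound(words):
    """Translate each component, generate all combinations, and keep
    only those attested in the target corpus."""
    candidates = []
    for combo in product(*(dictionary[w] for w in words)):
        # One possible target pattern: "N2 de N1" for an English "N1 N2" compound.
        phrase = f"{combo[1]} de {combo[0]}"
        freq = corpus_freq.get(phrase, 0)
        if freq > 0:
            candidates.append((phrase, freq))
    return sorted(candidates, key=lambda c: -c[1])

print(translate_compound(["coach", "station"]))  # [('estación de autobús', 120)]
```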
This strategy is limited to the subset of compound expressions that share the same compositional property in both the source and target languages, and it is also limited by the coverage of the translation dictionary (Morin and Daille 2012). Several problems can arise, namely, fertile translations in which the target expression has more words than the source term (e.g., the English word estrogen-sensitive is translated into Spanish as sensible a los estrógenos), and collocations that can be translated by just one word (for instance, the English expression go for a walk is translated into Spanish as the single word pasear). Our translation approach also follows the decomposing strategy but, unlike the works cited earlier, the source expression will be compared against a very large list of candidates, including single words and composite expressions with different morphological and syntactic properties. The use of syntax-based transfer rules helps us enlarge the list of candidates and makes the model fully compositional.
Finally, concerning neural machine translation (NMT), it is worth noting that it does not decompose the source sentence in a compositional way as in our approach. Instead, NMT encodes the source sentence using recurrent neural networks (RNN) and then decodes it to generate the target sentence word by word. At each generation step, the decoder has access not just to a single word from the source sentence, but to the contextual representations of every word in the source sentence (e.g., Cho et al. 2014). The RNN encoder–decoder architecture captures both semantic and syntactic structures of phrases and permits sequence-to-sequence prediction where the length of the input and output sequences may vary. This makes it possible to deal elegantly with cases of fertile and non-compositional translations. However, unlike our unsupervised approach, standard NMT is a supervised strategy, as it relies on parallel corpora.
3. Compositional Distributional Semantics
In this section, we describe first how word senses are contextualized by making use of selectional preferences (Section 3.1). Then, we describe how compositional vectors are created by combining head-dependent words (Section 3.2). Finally, this process is generalized and extended to a dependency tree (Section 3.3).
3.1 Contextualized Senses and Selectional Preferences
Consider now that we want to build the contextualized senses of catch and ball in the composite expression catch a ball. Let us start with the distributional vectors of the two related words. In a structured space, the vector of a word represents all the (lexico-)syntactic contexts with regard to which the target word is either a head or a dependent. Given that the vector of a noun is defined in a different syntactic space from the vector of a verb, it is not possible to find common contexts shared by the two vectors. In fact, in a structured (or typed) vector space they are incompatible vectors and cannot be combined (Kober et al. 2017). Our structured vector space is a distributional semantic model where words are represented as lemma–part of speech (PoS) pairs and dimensions are lexico-syntactic positions. For instance, the word ball, represented in our structured space as the lemma–PoS pair <ball, NOUN>, is assigned a corpus-based frequency in the context <dobj↑, catch> (direct object of the verb catch).2 This lexico-syntactic context is only used to define noun vectors, as neither verbs nor adjectives can be direct objects of the verb catch.
To make the composition of two dependent words compatible, we propose to combine word vectors with their selectional preferences as shown in Figure 1. Selectional preferences (or indirect vectors) are formally defined in the following paragraphs.
- the selectional preferences imposed by the dependent noun on the head verb in that syntactic position: balld↓,
- the selectional preferences imposed by the head verb on the dependent noun in the same syntactic position: catchh↑.
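Formally, each indirect vector is the sum of a set of direct vectors. A hedged LaTeX sketch, plausibly corresponding to the Equations (1) and (2) referenced below (the set $B$ on the dependent side is our assumption; the set $C$ is described in the next paragraph):

$$\vec{ball}^{\,d\downarrow} \;=\; \sum_{\vec{v} \in B} \vec{v} \quad (1) \qquad\qquad \vec{catch}^{\,h\uparrow} \;=\; \sum_{\vec{n} \in C} \vec{n} \quad (2)$$

where $B$ would be the set of vectors of the verbs taking ball as a direct object, and $C$ is the set of vectors of the nouns occurring as the direct object of catch.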
On the other hand, C in Equation (2) represents the vector set of those nouns occurring as direct object of catch in the corpus: {train, bus, disease, …} (see Figure 1). More precisely, given the lexico-syntactic context <dobj↑, catch>, the vector catchh↑ is obtained by adding the vectors {n|n ∈ C} of those nouns that occur in the direct object position of the verb catch. The indirect vector catchh↑ stands for the selectional preferences imposed by the verb on any noun in the dobj relation. This new vector consists only of nominal contexts and, therefore, is compatible with and can be combined with the word vector of ball.
3.2 Dependencies and Composition
Each multiplicative operation results in a compositional vector of a contextualized word. Component-wise multiplication has an intersective effect. The indirect vector restricts the direct vector by assigning frequency 0 to those contextual features that are not shared by both vectors.
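A minimal sketch of this operation on sparse count vectors follows; the toy contexts, the counts, and the arrow conventions in the keys are illustrative assumptions, not data from the article:

```python
# Direct vector of the verb "catch": lexico-syntactic context -> count.
catch = {("dobj↓", "ball"): 8, ("dobj↓", "disease"): 5, ("nsubj↓", "girl"): 3}

# Indirect vector ball_d↓: sum of the vectors of verbs taking "ball" as
# direct object (catch, throw, kick, ...), so it lives in verb space.
ball_prefs = {("dobj↓", "ball"): 20, ("dobj↓", "net"): 4, ("nsubj↓", "player"): 2}

def compose(direct, prefs):
    """Component-wise multiplication: any context absent from either
    vector gets frequency 0, so only shared contexts survive."""
    return {c: direct[c] * prefs[c] for c in direct.keys() & prefs.keys()}

catch_dobj = compose(catch, ball_prefs)  # contextualized sense catch_dobj↑
print(catch_dobj)  # {('dobj↓', 'ball'): 160}
```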
3.3 Incremental Composition
Following dependency grammar (Kahane 2003; Hudson 2003), in our approach semantic composition is driven by syntactic dependencies. They contextualize word senses in an incremental way. The consecutive application of the syntactic dependencies found in an expression is, in fact, the process of building the contextualized sense of all the lexical words constituting the expression. So, the meaning of a complex expression is represented by a contextualized vector for each constituent word rather than by a single vector standing for the entire expression. Figure 2 illustrates the incremental process of building the sense of words by the consecutive application of two dependencies. Given the expression a girl catches the ball and its dependency analysis shown at the top of the figure, two compositional processes are carried out by the two dependencies involved in the analysis: nsubj and dobj. Each dependency is decomposed into two functions: head h↑ and dependent d↓. As a final result, no single meaning has been constructed for the entire expression a girl catches the ball; instead, we have obtained one contextualized sense per lexical word: girlnsubj↓, catchnsubj↑+dobj↑, and balldobj↓. This strategy may be considered an incremental extension of that reported in Erk et al. (2010). The main difference with their approach is that we use contextualized selectional preferences at different levels of analysis. By contrast, the work by Erk et al. (2010) is not incremental because selectional preferences are not contextualized. In the dobj application of Figure 2, the contextualized selectional preferences imposed by the verb were created by selecting the contexts of the nouns appearing as direct object of catch, which are also part of the vector of girl after it has been contextualized by the verb in the subject position. In other words, the application of the second dependency requires building the selectional preferences imposed by the verbal expression the girl catches on the nouns appearing in the direct object position.3
It is worth noting that it would also be possible to incrementally apply dependencies following a different order—for example, right-to-left direction. However, in the experiments of Section 6, we will use only the incremental left-to-right order. A more informal and linguistic-based description of the current method is reported in Gamallo (2017c).
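A sketch of this left-to-right loop over dependency triples is given below. It uses static toy preference vectors for brevity, whereas in the actual method the preferences for later dependencies are themselves contextualized by the earlier ones; all names, contexts, and numbers are illustrative:

```python
def compose(direct, prefs):
    # Component-wise multiplication over shared contexts (intersective).
    return {c: direct[c] * prefs[c] for c in direct.keys() & prefs.keys()}

# Toy direct vectors over abstract contexts A..E.
vectors = {
    "girl":  {"A": 2, "B": 1},
    "catch": {"C": 4, "D": 3},
    "ball":  {"A": 5, "E": 2},
}
# Toy indirect vectors, keyed by (relation, selecting word, side updated).
prefs = {
    ("nsubj", "girl", "head"): {"C": 2, "D": 1},  # verbs girls are subjects of
    ("nsubj", "catch", "dep"): {"A": 3},          # nouns catch takes as subject
    ("dobj", "ball", "head"):  {"C": 1},          # verbs taking ball as object
    ("dobj", "catch", "dep"):  {"A": 2, "E": 4},  # nouns catch takes as object
}

deps = [("nsubj", "catch", "girl"), ("dobj", "catch", "ball")]
ctx = dict(vectors)
for rel, head, dep in deps:  # incremental left-to-right application
    ctx[head] = compose(ctx[head], prefs[(rel, dep, "head")])
    ctx[dep]  = compose(ctx[dep],  prefs[(rel, head, "dep")])
# ctx now holds girl_nsubj↓, catch_nsubj↑+dobj↑, and ball_dobj↓
```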
4. Bilingual Vectors Extracted from Monolingual Corpora
Notice that the other candidate translation, estación autobuses, instantiating the construction (nmod, N1, N2) and derived from contexts (9) and (10), is not found in the corpus because it is grammatically odd in Spanish.
The vector space is built with the bilingual contexts described above and their word–context co-occurrences. This is thus a count-based approach, characterized by being high dimensional and sparse. In order to reduce sparseness, we apply a technique that filters out uninformative contexts by relevance, as described in Gamallo (2017b). The reduction technique consists of two tasks: First, an association measure (e.g., log-likelihood) is computed between each word and its bilingual contexts; second, for each word, only the N contexts with the highest log-likelihood scores are selected. In this bilingual vector space, given a word in the source language, its nearest neighbors in the target language (in terms of distributional similarity) are, in fact, its most likely translation candidates. A more detailed description of our count-based bilingual model can be found in Gamallo and Pichel (2008) and Gamallo and Bordag (2011).
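A sketch of this relevance filter; the scores and the value of N below are illustrative, while the association measure itself is log-likelihood, as stated above:

```python
def top_n_contexts(word_vector, assoc_score, n=100):
    """Keep only the n contexts with the highest association score
    for a given word; all other dimensions are dropped."""
    best = sorted(word_vector, key=lambda c: assoc_score[c], reverse=True)[:n]
    return {c: word_vector[c] for c in best}

# Toy vector and per-context log-likelihood scores for one word.
vec = {"ctx_a": 5, "ctx_b": 1, "ctx_c": 7}
ll = {"ctx_a": 12.4, "ctx_b": 0.3, "ctx_c": 8.9}
print(top_n_contexts(vec, ll, n=2))  # {'ctx_a': 5, 'ctx_c': 7}
```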
5. Contextualized Translation in a Bilingual Vector Space
Contextualized translation is the result of combining the method to extract bilingual vectors defined in Section 4, with the compositional distributional approach introduced in Section 3.
Figure 3 depicts the general architecture of the strategy, consisting of two main tasks: extraction from monolingual corpora and contextualized translation in a bilingual distributional space. In the figure, the source language is English (en) and the target is Spanish (es). The extraction module, to the left of the figure, requires monolingual corpora in the source and target languages (English and Spanish). All texts of the corpora are linguistically processed and syntactically analyzed. A bilingual dictionary and transfer rules are also required to define a bilingual distributional model, by making use of the technique described in Section 4. The resulting bilingual model provides all English and Spanish words with a distributional meaning representation out of context. This distributional model is the input of the compositional algorithm used by the translation strategy.
The translation module is illustrated on the right side of Figure 3. It consists of three sub-tasks: (1) generation of the Spanish candidates, (2) building compositional models for the English sentence and the Spanish candidates, and (3) selection of the most similar candidate.
(1) Generation of candidates: The input of the system is a sequence in English (en) that is syntactically analyzed. The generation sub-module takes the analyzed sentence and expands it into a set of candidate translations in Spanish (es1, es2, …, esn), by making use of the bilingual dictionary and bilingual transfer rules.
(2) Compositional meaning: Once the candidates have been generated, the next step is to build the distributional meaning representation of the input sentence (meaning en) and the translation candidates: meaninges1, …, meaningesn. For this purpose, the compositional algorithm described in Section 3 makes use of composition functions operating on the bilingual vector space. The distributional meaning of each sentence stands for the contextualized senses of its constituent words.
(3) Selection by similarity: Finally, the distributional meanings of the generated candidates are compared pairwise by means of cosine similarity with the English sentence. The generated Spanish sentence associated with the most similar meaning (in bold in the figure) is selected as the best Spanish translation of the English sentence.
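Condensing the three sub-tasks, here is a sketch of the final selection step. In the article, the meaning of a sentence is a set of contextualized word vectors compared pairwise; this sketch collapses them into a single sparse vector for brevity, and meaning_of is a stub name of ours standing in for candidate generation plus compositional meaning construction:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_translation(source_meaning, candidates, meaning_of):
    """Return the candidate whose compositional meaning is most
    similar to that of the source expression."""
    return max(candidates, key=lambda c: cosine(source_meaning, meaning_of(c)))
```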
It is worth noting that the figure shows a simplified architecture of the translation module, since incrementality across dependencies is not represented. In order to better understand the translation process, the following subsections explain the three stages of the process through a concrete example: the English expression coach station.
5.1 Generation of Candidates
The English expression is syntactically analyzed and then a set of Spanish candidates is generated by using an English–Spanish dictionary and the transfer rules defined above in Equations (5) and (6). Considering the different translations of these two ambiguous words in the dictionary and the two transfer rules, Table 2 shows all 56 possible combinations.
(nmod/de, estación, bus), (nmod/de, estación, autobús),
(nmod/de, estación, autocar), (nmod/de, estación, entrenador),
(nmod/de, estación, preparador), (nmod/de, estación, instructor),
(nmod/de, estación, monitor), (nmod/de, canal, bus),
(nmod/de, canal, autobús), (nmod/de, canal, autocar),
(nmod/de, canal, entrenador), (nmod/de, canal, preparador),
(nmod/de, canal, instructor), (nmod/de, canal, monitor),
(nmod/de, emisora, bus), (nmod/de, emisora, autobús),
(nmod/de, emisora, autocar), (nmod/de, emisora, entrenador),
(nmod/de, emisora, preparador), (nmod/de, emisora, instructor),
(nmod/de, emisora, monitor), (nmod/de, puesto, bus),
(nmod/de, puesto, autobús), (nmod/de, puesto, autocar),
(nmod/de, puesto, entrenador), (nmod/de, puesto, preparador),
(nmod/de, puesto, instructor), (nmod/de, puesto, monitor),
(nmod, estación, bus), (nmod, estación, autobús),
(nmod, estación, autocar), (nmod, estación, entrenador),
(nmod, estación, preparador), (nmod, estación, instructor),
(nmod, estación, monitor), (nmod, canal, bus), (nmod, canal, autobús),
(nmod, canal, autocar), (nmod, canal, entrenador),
(nmod, canal, preparador), (nmod, canal, instructor),
(nmod, canal, monitor), (nmod, emisora, bus), (nmod, emisora, autobús),
(nmod, emisora, autocar), (nmod, emisora, entrenador),
(nmod, emisora, preparador), (nmod, emisora, instructor),
(nmod, emisora, monitor), (nmod, puesto, bus), (nmod, puesto, autobús),
(nmod, puesto, autocar), (nmod, puesto, entrenador),
(nmod, puesto, preparador), (nmod, puesto, instructor),
(nmod, puesto, monitor)
5.2 Compositional Meaning
The compositional meaning of the English expression coach station corresponds to two compositional vectors, stationnmod↑ and coachnmod↓, resulting from the two functions (head and dependent, respectively) derived from the nmod relation. Then, the meanings of the 56 Spanish candidates are built following the same procedure, giving rise to 112 compositional vectors. For instance, the two contextualized vectors corresponding to estación de autobuses, both derived from the prepositional dependency nmod/de, are estacionnmod/de↑ and autobusnmod/de↓.
5.3 Selection by Similarity
Following our example, the 56 Spanish candidates in Table 2 represent the ϕ set of translation candidates. Only 3 out of 56 are acceptable translations, the rest are unsuitable candidates. Using Equation (15) and the ranked sample (16), the target candidate (nmod/de, estación, autobús) (estación de autobuses) is selected as it reaches the highest similarity score to the source expression (coach station).
Finally, a very simple decoder takes the CT results, dependency by dependency, and builds the lemmatized expression in the target language by taking into account word order information provided by the transfer rules: estación de autobús cerrar. In the current version, we only deal with lemmas. In the case of incompatibility between two target words, the decoder adds the CTi scores obtained by all dependencies in which the incompatible words are involved, and selects the word with the highest global CT score. For instance, consider Equation (17) and function CT2 returning (nsubj, cerrar, emisora) instead of (nsubj, cerrar, estación). The new Spanish noun emisora (radio station) is different from estación (bus station), which has been selected by the first dependency. The two nouns are incompatible as they are assigned to the same syntactic position in the syntactic graph, namely, head of nmod and dependent of nsubj. In this case, the word with the highest global CT value will be estación because it reaches high scores in both dependencies and not just in one of them.
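A sketch of this conflict-resolution rule, with the per-dependency score tables invented for the example:

```python
from collections import defaultdict

# CT similarity scores obtained by each conflicting word in the
# dependencies where the disputed position is involved (toy numbers).
scores_by_dep = {
    "nmod/de": {"estación": 0.81, "emisora": 0.22},
    "nsubj":   {"estación": 0.58, "emisora": 0.64},
}

totals = defaultdict(float)
for dep_scores in scores_by_dep.values():
    for word, score in dep_scores.items():
        totals[word] += score

# estación wins (0.81 + 0.58 = 1.39 > 0.86): it scores high in both
# dependencies, not just in one of them.
print(max(totals, key=totals.get))  # estación
```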
6. Experiments
The proposed method for contextualized translation relies on two strategies: compositional distributional semantics and bilingual extraction from monolingual corpora. The syntax-based compositional distributional algorithm described in Section 3 was tested against several monolingual data sets (with intransitive and transitive constructions) and the results of these experiments were reported in Gamallo (2017c). The method to extract bilingual lexicons described in Section 4 participated in the SemEval 2017 Task 10, being the best system using monolingual corpora in the English–Spanish cross-lingual sub-task (Gamallo 2017a).
Concerning contextualized translation, which is the objective of the current work, the most similar task that has been evaluated is cross-lingual semantic textual similarity, which was defined as a shared task at SemEval-2016 Task 1 (Agirre et al. 2016). However, the objective of textual similarity is not to generate a candidate translation but just to provide a degree of similarity between the source and target sentences. The best system in the cross-lingual subtask at SemEval-2016 Task 1 (Brychcín and Svoboda 2016) is very different from the syntax-based strategy we propose in the present work. They translated Spanish sentences to English via Google Translate and, next, made use of the same semantic textual similarity strategy as for the monolingual task. The monolingual task for semantic textual similarity represents the meaning of a sentence using simple linear combination of word vectors, as in the compositional distributional strategy reported in Mikolov, Yih, and Zweig (2013), which is not syntax-based.
Moreover, Task 1 at SemEval-2016 consists of data sets with quite complex and heterogeneous sentences belonging to a large variety of syntactic constructions, which makes it non-trivial to treat them through syntax-based compositional approaches. In order to evaluate a syntax-based, contextualized translation system, we require bilingual data sets with simple syntactic constructions, for example, adjective–noun or intransitive and transitive constructions, such as those defined and used for monolingual tasks (Mitchell and Lapata 2008; Grefenstette and Sadrzadeh 2011).
As there is no such bilingual data set with the required characteristics, we created a new resource to evaluate systems aimed at generating contextualized translations in restricted syntactic domains.
6.1 The Data Set
The focus was to create a large number of examples with short and simple constructions, but containing very ambiguous expressions that must be contextualized in order to be disambiguated. For this purpose, we focused on English sequences containing phrasal verbs, which give rise to very ambiguous expressions. Whereas the linguistic ambiguity can be dealt with by means of contextualization, the domain of application is syntactically restricted and, therefore, experiments can be evaluated in an enclosed and controlled setting.
First, an English native translator built a bilingual verbal lexicon with 2,411 different phrasal verbs and 5,761 English–Spanish translations, making use of a great variety of lexicographic resources. Then, she built a list of English (transitive and intransitive) expressions using the most polysemous phrasal verbs of the lexicon. The final data set, called PhrasalVerbsToSpanish,4 consists of 1,119 English sentences with 665 different phrasal verbs, and 1,837 Spanish translations with 1,241 different Spanish verbs (including single and multiword verbs). The 665 English phrasal verbs are highly ambiguous and thus have multiple Spanish translations: the average number of Spanish translations per verb in the bilingual lexicon is 5.25.
Table 3 is a sample of the data set showing three English sentences with the phrasal verb act out. These sentences are in the first column. The Spanish translations for each English sentence are in the second column, and the third column provides the lemmatized Spanish verbs corresponding to the correct translations of the English phrasal verb in context. All examples contain simple constructions: intransitive or transitive constructions merely including noun phrases, verb phrases, adjectives, and prepositional phrases. By contrast, coordination or embedding structures such as relative clauses or completives are not allowed. As distributional-based translation is focused on the meaning of lexical units, grammatical and encyclopedic units such as pronouns, conjunctions, and proper nouns are also not allowed.
English sentence | Spanish translations | Spanish verbal phrases
---|---|---
the actors acted out the characters | los actores representaron a los personajes | representar a
the actors act out their performances | los actores interpretan sus representaciones | interpretar
the tired child acted out | el niño cansado se comportó mal, el niño cansado se portó mal | comportar se mal, portar se mal
The PhrasalVerbsToSpanish data set is actually focused on the task of translating the phrasal verb of an English sentence by disambiguating its sense using the meaning of the context words. Thus, contextualization is a key concept in this task. It is worth noting that the bilingual dictionary is used in two different tasks: to construct this data set and to generate the translation candidates before the contextualized model selects the best one. For constructing the data set, the human translator composed sentences containing the English phrasal verbs included in the dictionary. Concerning the contextualized translation model, both the dictionary and the transfer rules are used to generate candidates that may include phrasal verbs.
6.2 Monolingual Corpora and Linguistic Resources
The extraction module built the bilingual vector space from English and Spanish monolingual corpora. The English corpus consists of the 2007–2009 posts of the Reddit Comment Corpus, containing about 875 M words.5 The Spanish corpus corresponds to a 2014 dump file of the Spanish Wikipedia,6 along with a sample of posts extracted from Menéame.7 The whole Spanish corpus contains about 480 M word tokens. We decided to use Reddit instead of Wikipedia for English because phrasal verbs are more frequent in informal language such as that used in social forum comments. Notice that the English and Spanish corpora are not comparable.
All texts were linguistically analyzed with LinguaKit (Gamallo et al. 2018), a multilingual suite that also includes the dependency-based parser, DepPattern (Gamallo and Garcia 2018), used to syntactically analyze the two corpora and the input phrases of the translation module. Vectors were built for lexical units occurring more than 100 times in each monolingual corpus.
Concerning the lexical resources, the English–Spanish Collins dictionary,8 containing 52,463 entries, was merged with our lexicon of phrasal verbs so as to create a new bilingual resource with 57,975 entries. This bilingual dictionary is used for several tasks: to identify English and Spanish phrasal verbs (not only single words) in the monolingual corpora before extraction, to define bilingual distributional contexts in the extraction module, and to generate Spanish candidates in the translation module.
Finally, a set of bilingual transfer rules were manually defined by a linguist. The type of rules chosen to be implemented was determined by the examples of sentences found in the PhrasalVerbsToSpanish data set. As they are just transitive and intransitive clauses with no recursive structures and basic nominal modification, most transfer rules required are just duplicated dependencies, as shown in Table 4. The α symbol stands for any English preposition and β represents a Spanish preposition. For the current experiments, 13 English prepositions were identified and each one was paired with its three most similar Spanish prepositions, according to distributional similarity. So, each nmod/α → nmod/β transfer rule was expanded with 13 × 3 specific rules. In total, 74 specific transfer rules were defined with just verbs, nouns, adjectives, and prepositions. Adverbs and other syntactic categories were not considered for the current experiment.
(nsubj, V, N) → ((nsubj, V, N), LR)
(dobj, V, N) → ((dobj, V, N), RL)
(iobj/α, V, N) → ((iobj/β, V, N), RL)
(cop, V, A) → ((cop, V, A), RL)
(amod, N, A) → ((amod, N, A), RL)
(nmod, N1, N2) → ((nmod, N1, N2), R)
(nmod, N1, N2) → ((nmod/de, N1, N2), R)
(nmod/α, N1, N2) → ((nmod/β, N1, N2), R)
Transfer rules are provided with four types of word order restrictions: R (the dependent word is on the right), L (the dependent word is on the left), RL (the canonical position of the dependent word is on the right), and LR (the canonical position is on the left). In the last two cases, both positions are allowed, but the non-canonical one requires more restrictions to be activated.
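One possible in-memory representation of these rules and their order restrictions is sketched below; the naming and structure are ours, the lexical translation through the dictionary is omitted, and only a handful of the 74 rules are shown:

```python
# source relation -> list of (target relation, order restriction).
# R/L: dependent strictly right/left of the head; RL/LR: both orders
# allowed, with the first letter marking the canonical position.
TRANSFER_RULES = {
    "nsubj":   [("nsubj", "LR")],
    "dobj":    [("dobj", "RL")],
    "amod":    [("amod", "RL")],
    "nmod":    [("nmod", "R"), ("nmod/de", "R")],  # bare and prepositional variants
    "nmod/of": [("nmod/de", "R")],  # one instance of the nmod/α -> nmod/β expansion
}

def transfer(relation, head, dependent):
    """Expand a source dependency into its target-side variants."""
    return [((t_rel, head, dependent), order)
            for t_rel, order in TRANSFER_RULES.get(relation, [])]

print(transfer("nmod", "estación", "autobús"))
# [(('nmod', 'estación', 'autobús'), 'R'), (('nmod/de', 'estación', 'autobús'), 'R')]
```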
6.3 Evaluation
Our Contextualized Translation (CT) system was evaluated using the PhrasalVerbsToSpanish data set as the gold standard. The system selected the most likely translation for each English sentence and we then computed its accuracy, that is, the number of positive cases divided by the total size of the data set (1,119 examples). A case is positive if the Spanish verb or phrasal verb in the third column of the data set is also returned by the system; otherwise, it is considered a negative case.
We also evaluated four state-of-the-art commercial machine translators, namely DeepL,9 Google Translate,10 Bing,11 and Yandex (all consulted in December 2017).12 The final evaluation of these systems was done manually because they return inflected verbs that might not match the verbal lemmas in the third column of the gold standard, so a manual revision comparing forms with lemmas was required to find positive cases. Additionally, we also implemented some baseline methods. Table 5 shows the accuracy of all evaluated systems as well as a statistical significance test (last column). The symbols “≫” and “≪” respectively indicate a strong rise and drop with regard to the accuracy of the previous system in the table, significant for a p-value ≤ 0.005 (paired sample t-test). The symbols “>” and “<” indicate a lighter rise and drop with respect to the previous system, significant for 0.005 < p-value ≤ 0.05. Finally, “∼” indicates that the difference is not statistically significant (p-value > 0.05). Baseline strategies are ordered starting with the lowest accuracy, and commercial translators are ordered from highest to lowest accuracy. The CT system is situated after the best baseline and before the best commercial translator.
systems | positive | negative | accuracy | s-test
---|---|---|---|---
Dict-first | 349 | 770 | 0.312 |
Dict-Corpus-Based | 375 | 744 | 0.335 | >
Dict-Nocomp | 430 | 689 | 0.383 | ≫
Dict-Nocomp-VecMap | 437 | 682 | 0.390 | ∼
CT | 571 | 548 | 0.510 | ≫
DeepL | 501 | 618 | 0.447 | ≪
Google Trans. | 410 | 709 | 0.366 | ≪
Bing | 326 | 793 | 0.291 | ≪
Yandex | 281 | 838 | 0.251 | <
UNdreaMT | 12 | 1,107 | 0.010 | ≪
Four baseline strategies are at the top of the table. Dict-first is based on looking up our bilingual lexicon of phrasal verbs: it identifies the phrasal verb within the English sentence, looks up the lexicon, and selects the first Spanish translation. The result of this baseline, 0.312 accuracy, is in accordance with the fact that the phrasal verbs occurring in the English sentences of the data set have 5.5 translations/meanings on average, and there are some examples where more than one translation is allowed.
We also tested two more baselines based on non-compositional similarity. Dict-Nocomp compares each phrasal verb with just its translation candidates generated with the bilingual lexicon, and the most similar one is selected. The similarity is computed on the same transparent bilingual vector space as the one used by our CT system. Dict-Nocomp-VecMap computes the same non-compositional similarity by using word embeddings for each language and a linear mapping between the two vector spaces (Mikolov, Le, and Sutskever 2013). The mapping between embeddings was learned with VecMap (Artetxe, Labaka, and Agirre 2018). These two non-compositional methods returned scores (0.383 and 0.390) that significantly improve on the accuracy of random dictionary consultation (Dict-first) for a p-value ≤ 0.005. However, the improvement is not very pronounced (less than 8 points out of 100). The reason is that non-compositional similarity tends to select the most popular sense/translation, whereas many examples of the phrasal verbs in the gold standard were created with infrequent meanings. Rare senses are precisely those that a compositional strategy should try to select in context. It is worth noting that there is no significant difference between the two non-compositional strategies, even though the use of VecMap slightly improves our way of computing vector similarity.
The Dict-Corpus-Based baseline selects, among the translations listed in the lexicon, the candidate occurring most frequently in the target corpus. In our example, the phrasal verb with the highest frequency is selected: hacer de, with 5 (4 + 1) occurrences in total. Notice that this is the correct answer, because in the gold reference the human translator also selected hacer de as the best choice for the input sentence. The accuracy of this strategy (0.335) is higher than that obtained by the Dict-first baseline, even though the improvement is slight (p-value = 0.03). On the other hand, the Dict-Corpus-Based strategy is outperformed by the non-compositional baselines in a significant way (p-value = 0.006).
As Table 5 shows, CT outperforms both the best baseline (Dict-Nocomp-VecMap) and the best commercial translator (DeepL) in a significant way (p-value ≤ 0.005). Concerning the differences between the four commercial systems, all are strongly significant (p-value ≤ 0.005), except the one separating Bing from Yandex, which is significant only for a p-value ≤ 0.05. It is worth noting that all accuracy scores are low because the task at stake has a high degree of complexity. All sentences contain very ambiguous phrasal verbs, some of them with very infrequent senses, even though all of them can be disambiguated by considering the meaning of the context words: nominal subjects and/or objects. The improvement of our system with regard to the baseline (from 349 to 571 positives) is perhaps not conclusive, but it shows that the compositional vectors built by the CT system help contextualize in a substantial number of cases.
The difficulty of the task is demonstrated by the low values obtained by the unsupervised machine translation system UNdreaMT (Artetxe et al. 2017) (last row in Table 5). This is a state-of-the-art unsupervised translation strategy, based on denoising and back-translation, whose embeddings are learned from monolingual corpora. We trained UNdreaMT using the embeddings mapped with VecMap; the embeddings were built by applying word2vec (CBOW algorithm, window 5, and 300 dimensions) (Mikolov, Yih, and Zweig 2013) to the same English and Spanish corpora as those used to train our CT method. However, it should be noted that UNdreaMT is at a clear disadvantage with respect to all the systems it is compared with: On the one hand, it is a completely unsupervised system that has been trained with very small corpora in relation to the commercial translators, which use huge amounts of parallel corpora. On the other hand, its generation of word sequences in the target language is not controlled by a bilingual dictionary and syntax-based transfer rules, as in the case of CT. We must point out that the commercial translators do not have access to the bilingual dictionary and the generated candidates either, which places them at a disadvantage with respect to CT.
6.4 Error Analysis
In order to analyze the type of errors produced by the evaluated systems, 50 negative examples were randomly selected and manually classified into error categories. Table 6 shows the distribution of the four error types found in the sample. The only negative cases that can be considered clearly wrong choices directly derived from the translation system are called “semantically odd,” and they account for 34% of the total sample. For instance, the CT system wrongly chose the Spanish verb matar (to kill) to translate blow away in the singer blew away the audience, instead of deslumbrar a. In these cases, the translation module did not select a semantically acceptable translation. By contrast, 32% (called “similar sense”) are acceptable cases in CT, even if the most appropriate translation, which better fits the collocational requirements, was not chosen. For example, in the state acted on the evidence, the system returned responder a, but the human translator preferred another, more appropriate option (reaccionar ante), which is semantically similar but seems to be more common in that specific context. It is therefore a stylistic error, less serious than the previous one. We also found a significant number of errors (22%) in CT inherited from the linguistic analyzers, either the PoS tagger or the dependency parser, which intervene before the construction of the compositional meanings in the semantic step. Finally, the fourth type of error (preposition) stands for those cases where the Spanish verb is correct but a preposition is missing, as when the preposition a introduces direct objects. In such cases, the presence of the preposition is recommended but not mandatory; for instance, the worker decided to ask around his colleagues is translated as trabajador tantear colega instead of trabajador tantear a colega. So, it is a stylistic error, like the second one. For CT, this analysis shows that serious errors (“semantically odd”) are only one-third of the total, and that there is room for improvement by solving morpho-syntactic problems.
systems | semantically odd | similar sense | wrong analysis | preposition
---|---|---|---|---
CT | 34% | 32% | 22% | 12%
Dict-Corpus-Based | 52% | 40% | – | 8%
Dict-Nocomp | 66% | 28% | – | 6%
DeepL | 32% | 68% | – | –
Google Trans. | 58% | 42% | – | –
Bing | 78% | 22% | – | –
Yandex | 50% | 50% | – | –
With respect to the other systems evaluated, the number of semantically odd cases exceeds 50%, except for DeepL, which always tries to find an interpretable solution and, in many cases, semantically approaches the most appropriate target expression. It is worth noting that the error type called wrong analysis only applies to CT because it is the only strategy based on PoS tagging and syntactic parsing. Besides, it is interesting to note that commercial translators do not make mistakes with the preposition a. Dict-Nocomp-VecMap was not analyzed, as it is not significantly different from Dict-Nocomp. The two systems with the lowest accuracy values, namely Dict-first and UNdreaMT, were not analyzed either.
7. Conclusions
One of the main benefits of distributional compositionality is that the systems based on it (try to) solve word-sense disambiguation by modeling the mutual contextualization of words in a compositional way, guided by the syntactic structure. In this article, we claim that it is possible to apply the same procedure on a bilingual vector space to propose contextualized translations.
We have worked with count-based vector spaces because their dimensions are more transparent, more interpretable, and easier to combine in a compositional way than those of neural network-based models (word embeddings). However, as deep-learning compositional models have been emerging in recent years (Cheng, Kartsaklis, and Grefenstette 2014; Cheng and Kartsaklis 2015), they should be studied in order to discover how they might be used for modeling compositional distributional translation.
It is important to point out three drawbacks of the proposed contextualization method that need to be addressed in the future. First, in the case of collocations such as save time, go mad, and heavy rain, the borderline between compositional and non-compositional interpretation is blurred. For these cases, it is not clear whether it is more appropriate to apply a compositional method of contextualization or simply to identify them beforehand as non-compositional expressions along with their frequency in a corpus. Second, in the case of complex expressions giving rise to deep dependency trees, we may have frequency scarcity problems due to the iterative application of several contextualizations to the same word vector. And third, as transfer rules are manually defined, extending the model to more language pairs is complicated. These are challenges that we will have to take into account in the future when we extend the approach to all types of linguistic expressions and other language pairs.
In future work, we will develop in detail the idea of incremental translation, guided dependency by dependency. With the help of incremental translation, we think that unsupervised machine translation based on monolingual corpora can be improved. For this purpose, it will be necessary to better define how to generate translation candidates (our ϕ set) at whatever level of composition. Translating dependency by dependency with a narrow set of translation candidates and few transfer rules would yield overly literal, poor-quality translations. To expand the set of candidates, we should consider pseudo-compositional compounds that may be better translated by a single word, as well as fertile translations, that is, translations in which the target term has more words than the source one. Moreover, in order to avoid the limitations of generating candidates through a bilingual dictionary, we will also generate candidates from context-free bilingual word embeddings learned from monolingual corpora, with VecMap or similar cross-lingual techniques.
However, if the set of candidates is expanded too much, other problems may arise concerning both precision and computational efficiency. In order to expand the candidates in a controlled manner, it would be necessary to define transfer rules that take into account complex syntactic alternations at the level of sentence construction: passive/active, transitive/unaccusative, and so forth. In fact, the translation system should be provided with a rich set of cross-lingual constructions (Boas 2010) to define deep syntactic transfer rules and thereby expand the set of candidates in a much more accurate way. By doing this, the translation system would actually be based on a hybrid strategy, relying on deep linguistic knowledge and corpus-based data collected by distributional methods.
Acknowledgments
This work has received financial support from the FBBVA Leonardo program, the DOMINO project (PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE), the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08) and the European Regional Development Fund (ERDF). Mikel Artetxe has a doctoral grant from the Spanish MECD.
Notes
1. Selectional preferences or constraints are the tendency for a word to semantically select or constrain which other words may appear in a direct syntactic relation with it (Resnik 1996).
2. For the sake of simplicity, we will continue to represent words not as lemma–PoS pairs, but as simple lemmas.
3. We do not consider the meaning of determiners, auxiliary verbs, or tense affixes. Quantificational issues associated with them are beyond the scope of this work.
References
Author notes
Centro de Investigación en Tecnoloxías Intelixentes.