Skip Nav Destination
Close Modal
Update search
NARROW
Format
Journal
TocHeadingTitle
Date
Availability
1-6 of 6
Mark Steedman
Close
Follow your search
Access your saved searches in your account
Would you like to receive an alert when new items match your search?
Sort by
Journal Articles
Publisher: Journals Gateway
Computational Linguistics 1–65.
Published: 25 November 2024
Abstract
View article
PDF
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgment. However, these results are often obtained by averaging predictions across large test sets without any insights into the strengths and weaknesses of these metrics across different error types. Challenge sets are used to probe specific dimensions of metric behavior but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce ACES , a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from basic alterations at the word/character level to more intricate errors based on discourse and real-world knowledge. We conducted a large-scale study by benchmarking ACES on 47 metrics submitted to the WMT 2022 and WMT 2023 metrics shared tasks. We also measure their sensitivity to a range of linguistic phenomena. We further investigate claims that Large Language Models (LLMs) are effective as MT evaluators, addressing the limitations of previous studies by using a dataset that covers a range of linguistic phenomena and language pairs and includes both low- and medium-resource languages. Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods are unreliable. We expose a number of major flaws with existing methods: Most metrics ignore the source sentence; metrics tend to prefer surface level overlap; and over-reliance on language-agnostic representations leads to confusion when the target language is similar to the source language. To further encourage detailed evaluation beyond singular scores, we expand ACES to include error span annotations, denoted as SPAN-ACES , and we use this dataset to evaluate span-based error metrics, showing that these metrics also need considerable improvement. Based on our observations, we provide a set of recommendations for building better MT metrics, including focusing on error labels instead of scores, ensembling, designing metrics to explicitly focus on the source sentence, focusing on semantic content rather than relying on the lexical overlap, and choosing the right pre-trained model for obtaining representations.
Journal Articles
Publisher: Journals Gateway
Computational Linguistics (2022) 48 (1): 237.
Published: 04 April 2022
Journal Articles
Publisher: Journals Gateway
Computational Linguistics (2021) 47 (1): 9–42.
Published: 21 April 2021
FIGURES
| View All (7)
Abstract
View article
PDF
Steedman ( 2020 ) proposes as a formal universal of natural language grammar that grammatical permutations of the kind that have given rise to transformational rules are limited to a class known to mathematicians and computer scientists as the “separable” permutations. This class of permutations is exactly the class that can be expressed in combinatory categorial grammars (CCGs). The excluded non-separable permutations do in fact seem to be absent in a number of studies of crosslinguistic variation in word order in nominal and verbal constructions. The number of permutations that are separable grows in the number n of lexical elements in the construction as the Large Schröder Number S n −1 . Because that number grows much more slowly than the n ! number of all permutations, this generalization is also of considerable practical interest for computational applications such as parsing and machine translation. The present article examines the mathematical and computational origins of this restriction, and the reason it is exactly captured in CCG without the imposition of any further constraints.
Journal Articles
Publisher: Journals Gateway
Computational Linguistics (2018) 44 (4): 613–629.
Published: 01 December 2018
FIGURES
Journal Articles
Publisher: Journals Gateway
Computational Linguistics (2008) 34 (1): 137–144.
Published: 01 March 2008
Journal Articles
Publisher: Journals Gateway
Computational Linguistics (2007) 33 (3): 355–396.
Published: 01 September 2007
Abstract
View article
PDF
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word-word dependencies. The resulting corpus, CCGbank, includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium, and has been used to train wide-coverage statistical parsers that obtain state-of-the-art rates of dependency recovery. In order to obtain linguistically adequate CCG analyses, and to eliminate noise and inconsistencies in the original annotation, an extensive analysis of the constructions and annotations in the Penn Treebank was called for, and a substantial number of changes to the Treebank were necessary. We discuss the implications of our findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.