Ehud Reiter
Journal Articles
Computational Linguistics (2024) 50 (2): 795–805.
Published: 01 June 2024
Abstract
While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, this would have worrying implications for the rigor of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.
Journal Articles
Computational Linguistics (2018) 44 (3): 393–401.
Published: 01 September 2018
Abstract
The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique—in other words, whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.
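To make the validity question concrete, the following is a minimal sketch of the kind of correlation analysis such validation studies report: compute a corpus-level BLEU score per system and correlate it with mean human ratings. The data, system names, and the use of the sacrebleu and scipy packages are illustrative assumptions for this sketch, not material from the reviewed papers.

# Illustrative sketch only: toy data, not from the reviewed correlation studies.
import sacrebleu
from scipy.stats import pearsonr

# One human reference text per source item.
references = [
    "heavy rain spreading from the west by evening",
    "winds easing overnight with clear spells",
    "temperatures near normal for the time of year",
]

# Outputs of three hypothetical systems on the same three items.
system_outputs = {
    "system_a": [
        "heavy rain moving in from the west by evening",
        "winds easing overnight with some clear spells",
        "temperatures close to normal for the season",
    ],
    "system_b": [
        "rain later",
        "less wind tonight",
        "average temperatures",
    ],
    "system_c": [
        "heavy rain spreading from the west by evening",
        "winds easing overnight with clear spells",
        "temperatures near normal for the time of year",
    ],
}

# Mean human quality rating per system (placeholder numbers).
human_ratings = {"system_a": 4.1, "system_b": 2.3, "system_c": 4.8}

bleu_scores, ratings = [], []
for name, outputs in system_outputs.items():
    # corpus_bleu takes the system outputs and a list of reference streams.
    bleu_scores.append(sacrebleu.corpus_bleu(outputs, [references]).score)
    ratings.append(human_ratings[name])

r, p = pearsonr(bleu_scores, ratings)
print(f"System-level Pearson correlation between BLEU and human ratings: r={r:.2f}")

System-level analyses like this are where the review finds BLEU fares best; the same analysis applied to individual texts is where the abstract says the evidence does not support its use.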
Journal Articles
Computational Linguistics (2009) 35 (4): 529–558.
Published: 01 December 2009
Abstract
There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.
Journal Articles
Computational Linguistics (2007) 33 (2): 283–287.
Published: 01 June 2007
Journal Articles
Computational Linguistics (2002) 28 (4): 545–553.
Published: 01 December 2002
Abstract
Much natural language processing research implicitly assumes that word meanings are fixed in a language community, but in fact there is good evidence that different people probably associate slightly different meanings with words. We summarize some evidence for this claim from the literature and from an ongoing research project, and discuss its implications for natural language generation, especially for lexical choice, that is, choosing appropriate words for a generated text.
Journal Articles
Computational Linguistics (2000) 26 (2): 251–259.
Published: 01 June 2000
Abstract
Some types of documents need to meet size constraints, such as fitting into a limited number of pages. This can be a difficult constraint to enforce in a pipelined natural language generation (NLG) system, because size is mostly determined by content decisions, which usually are made at the beginning of the pipeline, but size cannot be accurately measured until the document has been completely processed by the NLG system. I present experimental data on the performance of single-solution pipeline, multiple-solution pipeline, and revision-based variants of the STOP system (which produces personalized smoking-cessation leaflets) in meeting a size constraint. This shows that a multiple-solution pipeline does much better than a single-solution pipeline, and that a revision-based system does best of all.
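As a rough sketch of the architectural difference at issue (illustrative only: the functions and the size model below are invented for this example and are not the STOP implementation), a single-solution pipeline must commit to one content plan on the basis of an early size estimate, whereas a multiple-solution pipeline can realise several candidate plans and keep the largest one that actually fits:

# Illustrative sketch only; not the STOP system. "Realising" a plan stands in
# for running microplanning and surface realisation to get the final document.

def estimated_size(plan):
    # Crude estimate available at content-determination time,
    # before the rest of the pipeline has run.
    return 12 * len(plan)

def realised_size(plan):
    # Actual document size, only measurable after full realisation.
    return sum(len(msg) for msg in plan) + 6 * len(plan)

def single_solution_pipeline(messages, max_size):
    # Commit to the largest plan whose *estimated* size fits; if the
    # realised document overshoots, the pipeline has no way to recover.
    for i in range(len(messages), 0, -1):
        plan = messages[:i]
        if estimated_size(plan) <= max_size:
            return plan
    return []

def multiple_solution_pipeline(messages, max_size):
    # Carry several candidate plans through realisation and keep the
    # largest one whose *actual* size meets the constraint.
    for i in range(len(messages), 0, -1):
        plan = messages[:i]
        if realised_size(plan) <= max_size:
            return plan
    return []

# Messages in decreasing order of importance (placeholder content).
messages = [
    "smoking history summary",
    "personal health risks",
    "benefits of quitting",
    "sources of support",
]
print(single_solution_pipeline(messages, max_size=60))    # overshoots the limit
print(multiple_solution_pipeline(messages, max_size=60))  # fits within the limit

A revision-based variant, the best performer in the article, would roughly realise one plan, measure the resulting document, and then add or remove content and re-realise until the size constraint is met.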