Abstract
Dialogue-level dependency parsing, despite growing academic interest, often suffers from underperformance due to resource shortages. A potential solution to this challenge is data augmentation. In recent years, large language models (LLMs) have demonstrated strong generative capabilities, which can greatly facilitate data augmentation. In this study, we focus on Chinese dialogue-level dependency parsing, presenting three simple and effective LLM-based strategies to augment the original training instances, namely word-level, syntax-level, and discourse-level augmentations. These strategies enable LLMs to either preserve or modify dependency structures, thereby ensuring accuracy while increasing the diversity of instances at different levels. We conduct experiments on the benchmark dataset released by Jiang et al. (2023) to validate our approach. Results show that our method can greatly boost parsing performance in various settings, particularly for dependencies among elementary discourse units. Lastly, we provide an in-depth analysis to highlight the key points of our data augmentation strategies.
1 Introduction
Dialogue-level dependency parsing, which extends vanilla dependency parsing (Marcus et al. 1994; Xue et al. 2005; Nivre 2005; McDonald et al. 2013) to dialogue texts, has attracted considerable attention in recent years (Afantenos et al. 2015; Asher et al. 2016; Davidson, Yu, and Yu 2019; Jiang et al. 2023). Given a piece of dialogue text, the task is to use well-designed machine learning models to build a structural dependency tree covering not only inner-sentence words but also words across utterances. For the Chinese language, Jiang et al. (2023) present the initial work on dialogue-level dependency parsing. Figure 1 shows an example. A dialogue text is split into elementary discourse units (EDUs), where the inner-EDU dependencies reflect sentence-level syntax and the inter-EDU dependencies reflect discourse structure.
A major problem with building a high-performance dialogue-level dependency parsing model is the relatively small amount of available training corpora. Such dependency treebank annotation is remarkably difficult and can be extremely expensive: it requires a strong background in linguistics, and long-distance, global observations are needed to determine inter-utterance dependencies. Using expert annotation, Jiang et al. (2023) build, with great effort, a benchmark corpus containing only 850 dialogues. The small scale of the training corpus is insufficient for standard supervised learning. Jiang et al. (2023) used 50 instances for training and the remaining instances for evaluation, reporting inner-EDU and inter-EDU labeled attachment scores (LASs) of 88.20 and 55.73, respectively, which indicates that accurate dialogue understanding is still a long way off.
Data augmentation is one prospective method to address this problem. Given extremely limited (or even no) annotated instances, data augmentation aims to produce a number of pseudo training instances automatically (Scudder 1965; Tanner and Wong 1987). This line of work has been applied successfully to a number of NLP tasks (Liu et al. 2020; Feng et al. 2021; Shorten, Khoshgoftaar, and Furht 2021). The key to success is to ensure the diversity as well as the quality of the automatically generated training instances, enriching the training corpus effectively. Recently, large language models (LLMs) have shown great potential for data augmentation in NLP (Whitehouse, Choudhury, and Aji 2023; Dai et al. 2023) owing to their strong text-generation capabilities. With appropriate prompting, we can produce several transformed texts with controllable variations.
In this work, we make an initial attempt at data augmentation for Chinese dialogue-level dependency parsing, aiming to construct a number of pseudo instances automatically to supplement the training data. Our key idea is to leverage the generation ability of LLMs to obtain high-quality transformations of a gold-standard dependency tree. Based on the characteristics of dialogue-level dependency parsing, we transform the original dependency tree into new well-formed dependency trees gradually at three different levels: the word, syntax, and discourse levels, which correspond to alterations of surface word information, inner-EDU syntactic information, and inter-EDU discourse information, respectively.
We conduct experiments on the benchmark dataset of Jiang et al. (2023), taking their work as the start-up baseline. Following their work, we evaluate two settings, namely, zero-shot and few-shot. The zero-shot setting uses only silver training instances constructed by rules, and the few-shot setting includes an extra 50 gold-standard training instances. We mainly use the LLM GPT-3.5-Turbo to drive our data augmentation. The results show that our data augmentation methods are able to boost the performance in both settings, especially on inter-EDU dependencies. In the zero-shot setting, our method achieves an improvement of 3.04 in inter-EDU LAS; in the few-shot setting, the increase reaches 3.85. We also conduct experiments based on Llama2-7B and Qwen-7B, and the results are consistent with those of GPT-3.5-Turbo. All our datasets as well as the source code are available for research purposes.
2 Background
Given a text $x = w_1 w_2 \cdots w_n$, dependency parsing aims to establish a directed, labeled dependency tree over the words in the text. Each word has exactly one head word except the root word, which has none; that is, each dependency can be written as $(w_h, w_d, l)$, where $w_h$ is the head, $w_d$ the dependent, and $l$ the dependency label. There is only one root word in the given text. Traditionally, dependency parsing handles mostly sentences. Recently, there has been growing interest in extending it to paragraphs and dialogues, uniting inner-sentence syntactic/semantic structures as well as discourse structures (Afantenos et al. 2015; Asher et al. 2016; Davidson, Yu, and Yu 2019; Jiang et al. 2023).
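For concreteness, the following minimal sketch shows one way such a tree can be represented as per-word head positions and labels. The toy fragment, its head choices, and its label names are illustrative assumptions of ours and do not follow the exact annotation guideline or file format of the released corpus.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DependencyTree:
    """A dependency tree stored as one head position and one label per word.
    Positions are 1-based and 0 denotes the virtual root (an assumed encoding)."""
    words: List[str]
    heads: List[int]    # heads[i] is the position of word (i+1)'s head; 0 marks the root word
    labels: List[str]   # syntactic label for inner-EDU arcs, discourse label for inter-EDU arcs

# Toy two-EDU fragment: "质量 太 差 ， 我 要 退货" (the quality is bad, I want a refund)
toy = DependencyTree(
    words=["质量", "太", "差", "，", "我", "要", "退货"],
    heads=[3, 3, 7, 3, 7, 7, 0],                 # "差" (EDU-1 root) attaches to "退货" (EDU-2 root)
    labels=["subj", "adv", "cause", "punc", "subj", "adv", "root"],
)
```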
Here, we focus on dialogue-level dependency parsing in Chinese, as shown in Figure 1. The dependency trees also maintain the projective property, i.e., no dependencies cross when they are all depicted above the text. Jiang et al. (2023) present seminal work on this task. Given an input dialogue text, they divide it into a sequence of EDUs. For the inner-EDU dependencies, they use syntactic dependencies following the guideline of Jiang et al. (2018), while for the inter-EDU dependencies, they define a set of discourse labels according to the characteristics of Chinese dialogue.
3 Baseline Parser
It is feasible to solve Chinese dialogue-level dependency parsing directly with traditional sentence-level dependency parsing models. However, this straightforward method would be inefficient in both speed and performance because of the increased number of input words as well as dependency labels. Thus, a hierarchical decoding of inner-EDU and inter-EDU dependencies is more suitable. In this work, we extend the state-of-the-art biaffine parser (Dozat and Manning 2016), supported by a pretrained language model (PLM), into our Chinese dialogue-level dependency parser; the result is only a slightly modified version of Dozat and Manning (2016).
In more detail, given an input dialogue text $d$ and its EDU-level sequence $e_1, e_2, \ldots, e_m$, where $m$ is the number of EDUs and $e_i$ denotes the words covered by one EDU, we illustrate the baseline parser below from an encoding-decoding view.
Encoding
Decoding
By using the above two-step hierarchical decoding, efficiency is largely improved because the head and label candidates are narrowed down at each step.
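To make the two-step decoding concrete, the following sketch outlines a biaffine arc scorer and the hierarchical decoding flow. It is a simplified illustration under our own assumptions (e.g., the 768-dimensional PLM output, the placeholder EDU-root selection, and the omission of labeled scoring), not the released implementation.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Minimal biaffine arc scorer in the style of Dozat and Manning (2016)."""
    def __init__(self, plm_dim=768, hidden_size=200):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(plm_dim, hidden_size), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(plm_dim, hidden_size), nn.ReLU())
        # The extra column lets the dependent representation carry a bias term.
        self.W = nn.Parameter(torch.zeros(hidden_size + 1, hidden_size))

    def forward(self, enc):                          # enc: [seq_len, plm_dim] PLM outputs
        h = self.head_mlp(enc)                       # candidate-head features
        d = self.dep_mlp(enc)                        # dependent features
        d = torch.cat([d, torch.ones(d.size(0), 1)], dim=-1)
        return d @ self.W @ h.t()                    # [seq_len, seq_len] arc scores

def hierarchical_decode(enc, edu_spans, inner_scorer, inter_scorer):
    """Step 1: decode inner-EDU heads within each EDU span.
    Step 2: score inter-EDU arcs only among EDU root candidates,
    which is the candidate filtering that makes decoding efficient."""
    inner_heads, root_ids = [], []
    for start, end in edu_spans:
        scores = inner_scorer(enc[start:end])
        inner_heads.append(scores.argmax(dim=-1) + start)
        root_ids.append(start)                       # placeholder; the real parser uses the predicted EDU root
    inter_scores = inter_scorer(enc[root_ids])       # [m, m] scores among EDU roots
    return inner_heads, inter_scores.argmax(dim=-1)
```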
Training
We exploit the standard cross-entropy loss as the training objective, where the losses of dependency arc recognition and label classification are computed separately. Given the output scores (either inner-EDU or inter-EDU), we apply softmax over the candidate dependency heads and over the syntactic/discourse labels at the gold-standard heads to calculate the corresponding probabilities, respectively. Our training strategy is essentially equivalent to that of Dozat and Manning (2016).
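A minimal sketch of this objective is given below, assuming the arc scores are arranged as a dependent-by-candidate-head matrix and the label scores have already been gathered at the gold head positions; the exact tensor layout in the released code may differ.

```python
import torch.nn.functional as F

def parser_loss(arc_scores, label_scores, gold_heads, gold_labels):
    """Separate cross-entropy terms for dependency-arc recognition and label
    classification, summed into a single training objective (a simplified sketch)."""
    arc_loss = F.cross_entropy(arc_scores, gold_heads)        # arc_scores: [n, n_candidates]
    label_loss = F.cross_entropy(label_scores, gold_labels)   # label_scores: [n, n_labels]
    return arc_loss + label_loss
```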
In particular, the training of our baseline parser can be divided into two parts, that is, inner-EDU and inter-EDU dependency parsing. The inner-EDU parsing receives supervised signals from the existing syntactic treebank, as well as from the inner-EDU dependencies in the 50 gold-standard training instances provided by Jiang et al. (2023), so this part can be trained adequately. For the inter-EDU dependency parsing, in contrast, only very few training instances are available. We follow Jiang et al. (2023), using their rule-based silver training corpus along with the same 50 gold-standard instances. We refer readers to their paper for details on the construction of this benchmark corpus.
4 LLM-Assisted Data Augmentation
As mentioned in Jiang et al. (2023), they annotated only a total of 850 gold-standard dialogue-level dependency trees for training and evaluation, at great cost. There are two main reasons for this. First, dependency-style treebank annotation requires a high degree of linguistic background, which limits the pool of annotators, and training an annotator is also costly. Second, discourse-level dependencies often involve long-range, deep understanding of dialogue texts, making the annotation process extremely challenging. As a result, dialogue-level dependency parsing in a low-resource setting is more practical and desirable.
Data augmentation is one popular strategy for low-resource settings across a variety of NLP tasks (Liu et al. 2020; Feng et al. 2021; Shorten, Khoshgoftaar, and Furht 2021). The main idea is to produce a number of high-quality and high-diversity training instances by transforming the existing training instances. LLMs have shown great potential for transforming natural-language training instances because of their strong capability in sentence rewriting, and their use for data augmentation has been suggested for sentence classification (Dai et al. 2023) and commonsense reasoning (Whitehouse, Choudhury, and Aji 2023). In this work, we take the training dependency trees as the base and reform them gradually by LLM prompting.
There are three types of information in a dialogue-level dependency tree: (1) words, (2) syntactic dependencies, and (3) discourse dependencies. Here, we generate new dependency trees from an existing one by perturbing these three types of information, and we refer to the three levels of alteration as word-level, syntax-level, and discourse-level, respectively. The word-level alteration is the most basic, and a higher-level alteration may also include lower-level variations. Figure 2 shows the overall architecture of our method, accompanied by three examples illustrating the three augmentation strategies.
As shown in Figure 2a, all three-level data augmentation mechanisms adopt a universal style of LLM prompting to rewrite the source input text, that is, “{Characterization}{Chain of Thought (CoT)}{Constraint}{One-Shot Example} {Input} → {Output}”:
Characterization: We initially provide a characterization to the LLM, aiming to stimulate the language understanding ability of the LLM. Concretely, we prompt the LLM: “你具有一定的语言学背景,精通中文文本理解,尤其是依存分析。 (You have a background in linguistics and are proficient in understanding Chinese text, especially dependency parsing.)”. This part is the same over all augmentations.
CoT: A CoT component is exploited to achieve a more reasonable rewriting goal, which is believed to significantly enhance the generative capability of the LLM (Wei et al. 2022). This part starts with “我们一步一步的进行思考 (Let’s think step by step),” followed by more detailed instructions which are different across the three strategies.
Constraint: More importantly, we impose certain constraints to control the LLM output within a fixed language, style, and format, ensuring that the generated text does not undergo language or style shifts and meanwhile making the follow-up information extraction more convenient. This part starts with “请严格遵守如下规定。 (Please strictly observe the constraints.).” There are three common constraints: “1. 使用中文并遵循原始文本的风格. 2. 上下文逻辑应当合理. 3. 不要对给定文本进行回复或续写 (1. Use Chinese and adhere to the style of the original text. 2. The contextual logic should make sense. 3. Do not respond to or continue the given text).”
One-Shot Example: Although the aforementioned components can effectively enhance and standardize the generation of the LLM, it is still difficult to ensure the stability of the output because of the LLM’s tendency to use random sampling during decoding. To address this issue, we manually design an example to guide the LLM generation in accordance with the above requirements. This part starts with “请仿照如下的例子。 (Please follow the example below.)”. The example selection is entirely empirical and random; the concrete examples are provided in Section 5.1.4, and in this section we only describe the example format for each augmentation strategy.
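Putting the four components together, the following sketch shows how the universal prompt layout described above could be assembled; the function and argument names are ours, and only the fixed Chinese lead-in strings are taken from the descriptions above.

```python
def build_prompt(characterization, cot_steps, constraints, one_shot_example, input_text):
    """Assemble the shared prompt layout:
    {Characterization}{CoT}{Constraint}{One-Shot Example}{Input} -> {Output}."""
    parts = [
        characterization,                                # role description (same for all strategies)
        "我们一步一步的进行思考。" + "".join(cot_steps),    # chain-of-thought steps (strategy-specific)
        "请严格遵守如下规定。" + "".join(constraints),      # output constraints (common + strategy-specific)
        "请仿照如下的例子。" + one_shot_example,            # one-shot demonstration
        "输入:" + input_text + " 输出:",                  # instance to rewrite
    ]
    return "\n".join(parts)
```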
We note that the rewriting results inevitably differ when prompts vary even slightly. This phenomenon is acceptable, since data augmentation always faces such randomness due to the uncertainty of raw input selection (Feng et al. 2021); the key to its success is to ensure the high quality of the generated outputs. Furthermore, all three strategies can be applied to individually segmented EDUs, to entire utterances, or even to the whole dialogue; the only difference lies in the input and the one-shot example of the prompt. However, our preliminary experiments indicate that the results are unsatisfactory when the strategies are applied to the entire dialogue, which might be because the complexity of executing accurate augmentations escalates as the sample length increases. In the following, we describe the three-level data augmentation mechanisms by LLM prompting in detail.
4.1 Word-Level Transformation
By substituting a word with an alternative while keeping the same syntactic and discourse structure, we can obtain a new, transformed dependency tree. For word-level substitution, high-quality replacement words are the key to success. Previous studies have often exploited semantically close words to replace the original words, largely ignoring the surrounding context (Liu et al. 2020). With the help of LLMs, this issue can be greatly reduced. Figure 2b shows an example illustrating our method.
Concretely, given a dialogue-level dependency tree, we sample a proportion of words in the text, which are expected to be replaced by the LLM with a well-defined prompt. The updated text might be ill-formed. To alleviate this, we propose a straightforward and effective check: we automatically verify that punctuation marks such as commas, periods, and question marks in the rewritten text appear in the same positions as in the original text, and that each punctuation-separated word sequence keeps the same word count (see the sketch following the prompt definition below). The rewriting process is repeated until these conditions are met. In this manner, the entire dependency structure remains unchanged, allowing a direct mapping of dependencies. The specific prompt definition is as follows:
CoT: “1. 找出句子中的谓词。 2. 以谓词为中心,对句子进行逐词改写。 (1. Identify the predicates of the given text. 2. Centered on the predicates, rewrite the given text word by word.)” In this case, we focus more on predicate-centered words, as they are usually the core part of a text.
Constraint: “4. 不允许改变词的顺序。 5. 输出格式:谓词‘{1}’, 文本‘{2}’。 (4. Changing the word order is not allowed. 5. The output format: Predicates ‘{1}’, Text ‘{2}’.)” The former ensures that the LLM does not disrupt the original word order and therefore maintains the syntactic structure; the latter prompts the LLM to generate formatted output. These two constraints allow the dependencies of the original samples to be transferred directly to the rewritten samples.
One-Shot Example: “输入:文本‘{1}’ /n 输出:谓词‘{1}’,文本‘{2}’。 (Input: Text ‘{1}’ /n Output: Predicates ‘{1}’, Text ‘{2}’.)” The example takes the source text as input and outputs the predicate words as well as the rewritten target text.
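The validity check described above can be sketched as follows; it assumes both the original and rewritten texts are available as word-segmented token lists, and it is an illustrative reconstruction rather than the authors' exact implementation.

```python
def rewrite_is_valid(orig_words, new_words, punct="，。！？；：,.!?;:"):
    """Accept a word-level rewrite only if punctuation marks occupy the same positions
    and every punctuation-delimited segment keeps the same number of words, so the
    gold dependency tree can be copied onto the new words unchanged."""
    def segments(words):
        segs, current = [], []
        for w in words:
            if w in punct:
                segs.append((tuple(current), w))   # (segment words, trailing punctuation)
                current = []
            else:
                current.append(w)
        segs.append((tuple(current), None))        # final segment without punctuation
        return segs

    orig_segs, new_segs = segments(orig_words), segments(new_words)
    if len(orig_segs) != len(new_segs):
        return False
    for (o_seg, o_punct), (n_seg, n_punct) in zip(orig_segs, new_segs):
        if o_punct != n_punct or len(o_seg) != len(n_seg):
            return False                           # punctuation or word-count mismatch
    return True

# The LLM is re-prompted until rewrite_is_valid(original, rewritten) returns True.
```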
4.2 Syntax-Level Transformation
The word-level transformation keeps the syntactic and discourse structures unchanged, resulting in limited alterations. Although this alteration through simple word replacement can achieve satisfactory label quality, it cannot yield substantial diversity, and the low diversity limits the extra supervised signals. To increase diversity and, in turn, enhance performance more effectively, we implement alterations at a higher level, involving changes to the syntactic dependencies. Figure 2c illustrates an overview of this method.
The detailed prompt specific to this strategy is defined as follows:
CoT: “1. 对于给定的文本及其篇章关系,请解释当前篇章关系。 2. 对于指定的核心,根据当前篇章关系,列举出有意义的依附示例。 3. 对于指定的依附,根据当前篇章关系,列举出有意义的核心示例。 4. 根据第二步和第三步中的依存组合示例,对输入文本进行重写,并按照指定格式输出。 (1. For a given text and its discourse relationship, please explain the current discourse relationship. 2. For the specified nucleus, provide meaningful attachment examples based on the current discourse relationship. 3. For the specified attachment, provide meaningful nucleus examples based on the current discourse relationship. 4. Rewrite the input text based on the dependency combination examples from steps 2 and 3 and output it according to the specified format.)”
Constraint: “4. 尽可能改变句法结构,但严格保留语篇结构。 5. 输出格式:文本‘{1}’,核心‘{2}’,依附‘{3}’,推理步骤‘{4}’。 (4. Alter the syntactic structure as much as possible, but strictly preserve the discourse structure. 5. Output format: Text ‘{1}’, Nucleus ‘{2}’, Attachment ‘{3}’, Reasoning Steps ‘{4}’.)” The former instruction gives the LLM greater freedom in generating samples while constraining it not to alter the discourse structure; the latter requires the LLM to output in a fixed format for easy extraction.
One-Shot Example: “输入:文本‘{1}’,核心‘{2}’,依附‘{3}’,当前篇章关系‘{4}’。 /n 输出:文本‘{1}’,核心‘{2}’,依附‘{3}’,推理步骤‘{4}’。 (Input: Text ‘{1}’, Nucleus ‘{2}’, Attachment ‘{3}’, Current Discourse Relationship ‘{4}’. /n Output: Text ‘{1}’, Nucleus ‘{2}’, Attachment ‘{3}’, Reasoning Steps ‘{4}’.)”
After generation, the baseline parser is used to assign syntactic dependencies to the altered text, ignoring discourse dependency predictions. Note that our data augmentation mainly aims at discourse-level dependencies, because the inner-EDU dependency parsing is already acceptable due to various available syntactic/semantic dependency treebanks. Thus, this transformation is executed from the perspective of keeping discourse dependencies unchanged, enriching the same discourse structure with abundant syntax contexts.
4.3 Discourse-Level Transformation
The syntax-level transformation changes the syntactic structure, resulting in a broader scope of sample variations. Nevertheless, the rigidity of discourse semantics can limit the model’s ability to generalize outside the boundaries of the original discourse structure. Hence, we further propose a discourse-level transformation mechanism, as shown in Figure 2d.
The prompt of discourse-level augmentation is defined as follows:
CoT: “1. 对于给定的文本及其篇章关系,请解释当前篇章关系和目标篇章关系。 2. 对于指定的核心,根据新的篇章关系,列举出有意义的依附示例。 3. 对于指定的依附,根据新的篇章关系,列举出有意义的核心示例。 4. 根据第二步和第三步中的依存组合示例,对输入文本进行重写,并按照指定格式输出。 (1. For a given text and its discourse relationship, please explain the current discourse relationship and the target discourse relationship. 2. For the specified nucleus, provide meaningful attachment examples based on the new discourse relationship. 3. For the specified attachment, provide meaningful nucleus examples based on the new discourse relationship. 4. Rewrite the input text based on the dependency combination examples from steps two and three and output it according to the specified format.).” We randomly choose a discourse relation for the LLM to explain and generate text based on it. Through the above chained reasoning, the LLM can generate and combine EDUs that fit the target discourse relationship.
Constraint: “4. 输出格式:文本‘{1}’ ,核心‘{2}’,依附‘{3}’,推理步骤 ‘{4}’。 (4. The output format: Text ‘{1}’, Nucleus ‘{2}’, Attachment ‘{3}’, Reasoning Steps ‘{4}’. )” This format constraint allows the LLM output to position EDUs and their relationships by fixed-form natural language, facilitating the alignment and completion of dependencies.
One-Shot Example: “输入:文本‘{1}’,核心‘{2}’,依附‘{3}’,当前篇章关系‘{4}’,目标篇章关系‘{5}’。 /n 输出:文本‘{1}’,核心‘{2}’,依附‘{3}’,推理步骤‘{4}’。 (Input: Text ‘{1}’, Nucleus ‘{2}’, Attachment ‘{3}’, Current Discourse Relationship ‘{4}’, Target Discourse Relationship ‘{5}’. /n Output: Text ‘{1}’, Nucleus ‘{2}’, Attachment ‘{3}’, Reasoning Steps ‘{4}’.)”
To assist the LLM in interpreting each relationship, we establish a well-defined description table of discourse relations and embed it into the prompt before the constraint part. To auto-annotate the generated text, we use the same method as in Section 4.2: the baseline parser performs inner-EDU dependency parsing, while the inter-EDU dependencies are built upon the root words of the EDUs with the labels specified by our prompting texts.
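The shared auto-annotation procedure for the syntax- and discourse-level augmentations can be sketched as below; the parse_inner interface and the index handling are our own simplifying assumptions (EDU-local indices, a single nucleus-attachment pair), not the released code.

```python
def annotate_generated(edus, nucleus_idx, attachment_idx, relation, parser):
    """Auto-annotate a generated sample: the baseline parser supplies inner-EDU
    dependencies, and one inter-EDU arc links the attachment's root word to the
    nucleus's root word with the kept (Section 4.2) or newly specified (Section 4.3)
    discourse relation. Indices are EDU-local; global offsets are omitted for brevity."""
    trees, roots = [], []
    for edu in edus:
        heads, labels = parser.parse_inner(edu)   # hypothetical per-EDU parsing interface
        trees.append((heads, labels))
        roots.append(heads.index(0))              # the EDU root word has head position 0
    # Dependent = attachment root, head = nucleus root, label = discourse relation.
    inter_edu_arc = (roots[attachment_idx], roots[nucleus_idx], relation)
    return trees, inter_edu_arc
```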
4.4 A Viewpoint from Self-Training
In this work, we exploit an LLM to generate new dialogues by heuristic prompting strategies, where the inner-EDU and inter-EDU dependencies of these dialogues can be easily inferred from their source training instances. An interesting question is how this method relates to previous representative data augmentation approaches. Essentially, our method is a form of self-training with carefully chosen unlabeled examples. Many previous data augmentation studies assume that the newly added unlabeled instances can be easily labeled by heuristic strategies, e.g., rules (Sahin and Steedman 2018) and interpolation (Zhang et al. 2018). In contrast, our approach can be categorized as model-based, as described in Feng et al. (2021), and self-training is responsible for most of the annotations.
Compared with previous work on self-training for dependency parsing (Yu, El-karef, and Bohnet 2015; Rotman and Reichart 2019; Guo et al. 2022), our approach is unique in two ways. First, part of the augmented dependencies (i.e., the inter-EDU dependencies) are not annotated by the basic parser. As our heuristic annotations produce better inter-EDU dependencies than the basic parser, it is reasonable to expect our approach to be more effective, and our preliminary findings support this expectation. Second, we use LLMs to generate unlabeled instances instead of heuristically selecting them from a large-scale pool, as was widely done before. Alternatively, other text generators could be used, but at present LLMs are the best fit for this situation. Furthermore, with LLMs, we can produce high-quality dialogues with specifications that can be parsed most accurately, which is difficult with other tools.
Another question arises: Why not use an LLM to perform dialogue-level parsing directly, following a standard self-training strategy? We find that LLMs are currently not well suited to dependency parsing. Even with direct supervised fine-tuning of an open-source Llama2-7B, plain syntactic parsing (inner-EDU parsing) performs much worse than our baseline (the gap is greater than 4% in unlabeled attachment score [UAS]). As a result, we believe a long-term investigation is needed to explore decoder-style LLMs for dependency parsing. Because of the indirect exploitation of syntax and discourse properties, our method can be considered a special case of distilling soft knowledge from LLMs.
5 Experiments
5.1 Settings
5.1.1 Dataset
We use the publicly available corpus released by Jiang et al. (2023), which is the only benchmark dataset for Chinese dialogue-level dependency parsing. The dataset unites syntactic dependencies and discourse dependencies as a whole over dialogue texts. The syntactic dependency structure is sourced from Jiang et al. (2018), and the dependency-based discourse structure is reorganized according to the characteristics of dialogue and previous RST-based discourse parsing (Li et al. 2014; Carlson, Marcu, and Okurovsky 2001). Table 1 shows the statistics of this dataset.
5.1.2 Evaluation
We assess performance using UAS and LAS, ignoring punctuation words, following the standard evaluation of dependency parsing. To provide a detailed analysis, we report the scores of inner-EDU and inter-EDU dependencies separately. The inter-EDU performance is calculated based on the concrete dependency arcs over words, not EDUs, which means that the correctness of EDU heads is a prerequisite for the correctness of inter-EDU dependencies. Because no development set is available due to limited resources, we choose the last training checkpoint for evaluation. All experiments are conducted on a single RTX 2080 Ti GPU.
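A compact sketch of the metric computation is shown below, assuming each tree is stored as parallel head and label lists and that punctuation words can be identified by their labels; the exact punctuation criterion in the official evaluation script may differ.

```python
def attachment_scores(gold_trees, pred_trees, punct_labels=("punc",)):
    """Compute UAS/LAS over non-punctuation words: a correct head counts toward UAS,
    and a correct head plus a correct label counts toward LAS (an illustrative sketch)."""
    total = uas_hit = las_hit = 0
    for gold, pred in zip(gold_trees, pred_trees):
        for i, (g_head, g_label) in enumerate(zip(gold["heads"], gold["labels"])):
            if g_label in punct_labels:
                continue                              # punctuation words are ignored
            total += 1
            if pred["heads"][i] == g_head:
                uas_hit += 1
                if pred["labels"][i] == g_label:
                    las_hit += 1
    return 100.0 * uas_hit / total, 100.0 * las_hit / total
```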
5.1.3 Hyperparameters
For the baseline parser, we utilize the base-scale discriminator of the Chinese version of ELECTRA (Clark et al. 2019; Cui et al. 2020) as the PLM. The hidden sizes of the subsequent neural modules are all 200, and the dropout ratio is set to 0.2. The AdamW optimizer (Loshchilov and Hutter 2019) is used for objective optimization, with a weight decay of 0.01. We use a linear warmup over the first 10% of training steps, setting the initial learning rate to 2e-5 for the PLM and 1e-4 for the subsequent modules. To alleviate gradient explosion, we apply gradient clipping with a maximum value of 2.0. The training batch size is set to 32 for both the syntactic treebank and the pseudo-labeled dialogue data, whereas it is set to 1 for the LLM-altered data. The number of training epochs is set to 10.
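For reference, the optimizer and scheduler setup implied by these hyperparameters might look as follows; the parameter grouping and helper name are a simplified sketch of ours rather than the exact training script.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, plm, num_training_steps):
    """AdamW with separate learning rates for the PLM (2e-5) and the added modules
    (1e-4), weight decay 0.01, and a 10% linear warmup, as reported above."""
    plm_param_ids = {id(p) for p in plm.parameters()}
    groups = [
        {"params": [p for p in model.parameters() if id(p) in plm_param_ids], "lr": 2e-5},
        {"params": [p for p in model.parameters() if id(p) not in plm_param_ids], "lr": 1e-4},
    ]
    optimizer = torch.optim.AdamW(groups, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=int(0.1 * num_training_steps),
        num_training_steps=num_training_steps)
    return optimizer, scheduler

# Before each optimizer step, gradients are clipped:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
```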
For data augmentation, we utilize GPT-3.5-Turbo-0613 as the main LLM for its impressive performance on various NLP tasks. We also examine two open-source LLMs, namely, Llama2-7B (Touvron et al. 2023b) and Qwen-7B (Bai et al. 2023); in particular, Qwen-7B has been heavily optimized for the Chinese language. We use the default temperature setting and empirically set top_p to 0.5, ensuring the stability of the generated output while retaining diversity. For each type of transformation, we use the LLM to generate three different augmentations for each instance in the training dataset.
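For illustration, a call to the closed-source LLM with these decoding settings could be sketched as follows, using the legacy openai-python (pre-1.0) chat-completion interface; the helper name and the absence of retry logic are our own simplifications.

```python
import openai  # legacy openai-python (<1.0) interface

def augment_instance(prompt, n_variants=3):
    """Request several rewrites of one training instance in a single call,
    with the default temperature and top_p = 0.5 as described above."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": prompt}],
        top_p=0.5,
        n=n_variants,                # three augmentations per transformation type
    )
    return [choice["message"]["content"] for choice in response["choices"]]
```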
5.1.4 One-Shot Example
The one-shot examples of word-level, syntax-level, and discourse-level transformations are designed according to the prompting descriptions in CoT and Constraint. Table 2 presents specific details. The one-shot example of word-level transformation consists of identifying the predicates first, and then rewriting the sentence word-by-word. In syntax-level transformation, the one-shot example first outputs the rewritten text along with the discourse nucleus and attachment in the text, followed by a CoT that details the reasoning steps. The CoT is achieved by first describing the discourse relationship, then generating several rewritten nuclei and attachments in line with the relationship, and finally combining one specific pair of the produced candidates. The one-shot example of discourse-level transformation is similar to that at the syntax level. The key difference lies in the goal of discourse-level transformation, which aims to change discourse relationships rather than preserve them.
wrd
输入: 文本“这款笔记本我要退了,明天买其它样式。” /n 输出: 谓词“退、买”, 文本“那件衣服我要买了,今天下订单。”
(Input: Text “I would like to refund this laptop, and I will buy other things tomorrow.” /n Output: Predicates “refund, buy,” Text “I am going to buy that dress, and I will order this item today.”)
syn
输入: 文本“这款笔记本我要退了,质量实在太差了。”,核心“这款笔记本我要退了”,依附“质量实在太差了”,篇章关系“原因” /n 输出: 文本“我准备购买其它笔记本,因为它无法正常使用。”,核心“我准备购买其它笔记本,”,依附“因为它无法正常使用。”,推理步骤“1. 篇章关系“原因”是指依附为核心的原因。2. “这款笔记本我要退了”,对于篇章关系“原因”,依附的例子可以为“笔记本不符合我购买的需求”,“这件商品质量不好”,“它无法正常使用”。3. “质量实在太差了。”,对于篇章关系“原因”,核心的例子可以为“我无法正常使用该笔记本”,“我准备购买其它笔记本”,“我要退款”。4. 综上依存组合,根据核心“我准备购买其它笔记本”和依附“因为它无法正常使用”,生成了新的文本。”
(Input: Text “I want to refund this laptop, because the quality is really bad.”, Nucleus “I want to refund this laptop,”, Attachment “because the quality is really bad.”, Discourse Relationship “Cause” /n Output: Text “Because the laptop cannot function properly, I plan to purchase other products.”, Nucleus “I plan to purchase other products.”, Attachment “Because the laptop cannot function properly,”, Reasoning Steps “1. The discourse relation “Cause” indicates that the attachment gives the reason for the nucleus. 2. For the nucleus “I want to refund this laptop,”, under the relation “Cause”, examples of attachments could be “the laptop doesn’t meet my purchasing needs,” “the quality of this product is bad,” “the laptop cannot function properly”. 3. For the attachment “because the quality is really bad.”, examples of nuclei under the relation “Cause” could be “I cannot use the laptop properly”, “I plan to purchase another laptop”, “I want a refund”. 4. Based on the dependency combinations above, a new text has been generated from the nucleus “I plan to purchase other products” and the attachment “Because the laptop cannot function properly,”.”)
dis
输入: 文本“如果这款笔记本明天还没有到货,我就要退了”,核心“我就要退了”,依附“如果这款笔记本明天还没有到货”,当前篇章关系“条件”,目标篇章关系“因果” /n 输出: 文本“因为这款笔记本没有到货,我要退款”,核心“我要退了”,依附“因为这件商品没到货”,推理步骤“1. 篇章关系“条件”是指依附为核心的前提,而篇章关系“原因”是指依附为核心的原因。2. “我就要退了”,对于篇章关系“原因”,依附的例子可以为“笔记本不符合我购买的需求”,“因为这款笔记本没有到货”,“笔记本无法正常使用”。3. “如果这款笔记本明天还没有到货”,对于篇章关系“原因”,核心的例子可以为“我明天无法使用该商品”,“我准备购买其它商品”,“我要退款”。4. 综上依存组合,根据“因为这款笔记本没有到货”和“我要退款”,生成了新的文本。”
(Input: Text “If this laptop doesn’t arrive tomorrow, I am going to refund this merchandise.”, Nucleus “I am going to refund this merchandise”, Attachment “If this laptop doesn’t arrive tomorrow”, Current Discourse Relationship “Condition”, Target Discourse Relationship “Cause” /n Output: Text “Because this laptop hasn’t arrived, I want a refund.”, Nucleus “I want a refund.”, Attachment “Because this laptop didn’t arrive,”, Reasoning Steps “1. The discourse relation “Condition” indicates that the attachment is a premise of the nucleus, while “Cause” indicates that the attachment gives the reason for the nucleus. 2. For the nucleus “I am going to refund this merchandise”, under the relation “Cause”, examples of attachments could be “the laptop doesn’t meet my purchasing needs,” “Because this laptop hasn’t arrived,” “Because the laptop can’t be used properly.” 3. For the attachment “If this laptop doesn’t arrive tomorrow,” examples of nuclei under the relation “Cause” could be “I won’t be able to use the laptop tomorrow”, “I’m considering buying another laptop”, “I want a refund”. 4. Based on the dependency combinations above, a new text has been generated from “Because this laptop hasn’t arrived” and “I want a refund”.”)
5.2 Results
We consider two different settings during evaluation: (1) the zero-shot setting, consistent with Jiang et al. (2023), where a set of rule-based silver instances is used as the initial training dataset; and (2) the few-shot setting, where the 50 human-annotated instances are used for training together with the silver corpus. For each setting, we evaluate the baseline method, each augmentation strategy alone, their pairwise combinations, and the full combination of all three data augmentations, so as to examine the potential of our data augmentations comprehensively. Table 3 shows the main results. We conduct significance tests between the baseline and our methods using paired t-tests.
Training Data | Zero-shot Inner-EDU UAS | Zero-shot Inner-EDU LAS | Zero-shot Inter-EDU UAS | Zero-shot Inter-EDU LAS | Few-shot Inner-EDU UAS | Few-shot Inner-EDU LAS | Few-shot Inter-EDU UAS | Few-shot Inter-EDU LAS |
---|---|---|---|---|---|---|---|---|
Jiang et al. (2023) | 88.22 | 84.34 | 66.48 | 50.78 | 91.74 | 88.20 | 71.09 | 55.73 |
baseline | 88.20 | 84.40 | 66.41 | 50.85 | 91.66 | 89.12 | 71.59 | 56.32 |
GPT-3.5-Turbo-0613 | ||||||||
+ wrd | 88.12 | 84.23 | 67.73 | 52.14 | 92.37 | 90.01 | 73.06 | 58.50 |
+ syn | 88.09 | 84.36 | 68.19 | 52.81 | 92.13 | 89.94 | 73.22 | 59.33 |
+ dis | 88.27 | 84.31 | 68.57 | 53.41 | 92.35 | 90.11 | 73.57 | 59.68 |
+ wrd & syn | 88.32 | 84.43 | 68.44 | 53.39 | 92.38 | 90.16 | 73.52 | 59.47 |
+ wrd & dis | 88.24 | 84.12 | 68.64 | 53.52 | 92.19 | 90.04 | 73.84 | 59.81 |
+ syn & dis | 88.08 | 84.17 | 68.77 | 53.62 | 92.23 | 90.18 | 73.88 | 59.94 |
+ wrd & syn & dis | 88.33 | 84.51 | 68.82 | 53.89 | 92.46 | 90.35 | 73.81 | 60.17 |
Llama2-7B | ||||||||
+ wrd | 87.79 | 83.99 | 66.83 | 51.28 | 91.91 | 89.73 | 72.33 | 57.63 |
+ syn | 87.75 | 84.00 | 67.17 | 51.67 | 91.65 | 89.51 | 72.31 | 58.28 |
+ dis | 88.01 | 84.02 | 67.75 | 52.27 | 91.90 | 89.85 | 72.76 | 58.45 |
+ wrd & syn | 88.02 | 84.13 | 67.52 | 52.22 | 91.87 | 89.81 | 72.56 | 58.38 |
+ wrd & dis | 88.05 | 83.89 | 67.78 | 52.46 | 91.82 | 89.63 | 73.13 | 58.75 |
+ syn & dis | 87.85 | 83.90 | 67.80 | 52.59 | 91.76 | 89.91 | 72.92 | 58.79 |
+ wrd & syn & dis | 88.01 | 84.32 | 68.15 | 52.69 | 91.97 | 89.89 | 72.95 | 59.01 |
Qwen-7B | ||||||||
+ wrd | 87.87 | 84.12 | 67.22 | 51.43 | 92.03 | 89.88 | 72.68 | 57.94 |
+ syn | 87.88 | 84.14 | 67.63 | 52.03 | 91.94 | 89.69 | 72.80 | 58.46 |
+ dis | 88.16 | 84.09 | 68.11 | 52.51 | 92.01 | 89.97 | 73.19 | 58.85 |
+ wrd & syn | 88.15 | 84.21 | 67.92 | 52.64 | 91.84 | 89.97 | 73.05 | 58.74 |
+ wrd & dis | 88.11 | 84.00 | 68.23 | 52.86 | 91.87 | 89.76 | 73.47 | 59.05 |
+ syn & dis | 87.98 | 84.02 | 68.18 | 53.02 | 92.07 | 89.99 | 73.42 | 59.14 |
+ wrd & syn & dis | 88.18 | 84.31 | 68.41 | 53.12 | 91.96 | 89.85 | 73.52 | 59.31 |
First, we examine the results of the zero-shot setting as a whole. The baseline method achieves 88.20 UAS and 84.40 LAS on inner-EDU dependencies, but only 66.41 UAS and 50.85 LAS on inter-EDU dependencies, indicating that inter-EDU dependency parsing still lags behind. With our two-step parsing of inner-EDU and inter-EDU dependencies, which respects the hierarchical structure of dialogue-level dependency parsing, the baseline achieves better performance than Jiang et al. (2023). Through our word-level, syntax-level, and discourse-level data augmentations, both the inner- and inter-EDU performance can be improved, with the inter-EDU performance improving even more. As shown, the final model gains 84.51 − 84.40 = 0.11 on inner-EDU dependencies and 53.89 − 50.85 = 3.04 (p < 0.001) on inter-EDU dependencies. The marginal gain on inner-EDU dependencies can be attributed to the sufficiently large scale of such dependencies already provided by the syntactic treebank during training.
Furthermore, we examine the performance of word-level, syntax-level, and discourse-level data augmentation separately, as well as their pairwise combinations. The overall tendency is discourse-level > syntax-level > word-level in terms of performance. Among the single augmentation strategies, the discourse-level method performs best and the word-level approach worst. Among the pairwise combinations, the combination of syntax- and discourse-level strategies yields the highest LAS, whereas the word- and syntax-level combination performs the poorest. A possible reason is that high-level substitutions can largely cover the low-level alterations. Our results also show that the three methods are complementary to one another, because adding another augmentation strategy always brings improved performance. One reason might be that low-level data augmentation yields higher-quality dependency trees owing to its relatively smaller variations.
Third, we shift our focus to the few-shot results with an extra 50 human-annotated dialogue-level dependency trees. As shown, the baseline performance of both inner-EDU and inter-EDU dependencies is greatly boosted. Two aspects contribute to the significant improvements: in-domain dialogue data for inner-EDU syntactic parsing, and supervised data for inter-EDU discourse parsing. In addition, we observe results that are fully consistent with the zero-shot setting: the discourse-level data augmentation brings the highest gains, whereas the word-level one brings the lowest, and a pairwise combination is always better than a single augmentation alone. The final method, which combines all three strategies, obtains the best performance, leading to improvements of 90.35 − 89.12 = 1.23 (p < 0.001) in inner-EDU LAS and 60.17 − 56.32 = 3.85 (p < 0.001) in inter-EDU LAS. Interestingly, we find that our data augmentation achieves larger improvements in this setting despite a stronger baseline. The reason might be that with higher-quality source instances, the pseudo instances produced by data augmentation are less noisy.
Finally, we compare the zero-shot and few-shot performance across different LLMs. The above results are based on the closed-source GPT-3.5-Turbo, and here we further verify our method with two open-source LLMs: (1) Llama2-7B and (2) Qwen-7B. As shown, GPT-3.5-Turbo exhibits the highest performance on our task among the three LLMs. Compared with Llama2-7B, the difference is most pronounced in the inter-EDU results, where the performance gaps are 53.89 − 52.69 = 1.20 and 60.17 − 59.01 = 1.16 for the zero-shot and few-shot settings, respectively. In addition, Qwen-7B performs better than Llama2-7B, possibly because Qwen-7B involves Chinese-oriented optimization during pretraining.
6 Analysis
6.1 Prompt Design
6.1.1 Characterization
Assigning a specialist role enables the LLM to comprehend and adapt to a new task more accurately. Given that generating coherent text and rational dependency structures requires linguistic expertise, we position the LLM as a natural language specialist, as illustrated in Figure 2.
To probe the influence of prompt design on the LLM’s generation performance, we select a sample as a case study and compare the generation results with and without characterization. As depicted in Figure 3, characterization empowers the LLM to produce samples that better match our requirements. In the syntax-level strategy, we aim for the LLM to maintain the original syntactic and discourse structures. Without characterization, the LLM might lack the necessary prior knowledge, causing difficulties in understanding syntactic and discourse structures. By meticulously defining a role for the LLM, we enable it to draw on NLP-specific prior knowledge effectively, thereby circumventing this issue.
6.1.2 CoT
A CoT encompasses a series of intermediate logical steps, notably enhancing the capability of LLMs to execute intricate reasoning tasks (Wei et al. 2022). Following this work, we guide the LLM to produce logical inference results progressively, ultimately obtaining outcomes that fulfill the generation specifications. To assess the influence of CoT on LLM-based data augmentation, we select the same sample as previously discussed as a case study. We then observe the effects on the discourse-level data augmentation with and without the utilization of CoT. As illustrated in Figure 4, CoT effectively guides the generation of samples that meet our criteria. Our discourse-level transformation aims to alter the discourse structure of the original sample, whereas this objective is not fulfilled in the absence of CoT. One plausible explanation for this can be that without step-by-step inference for complex tasks, the LLM might struggle to accurately comprehend the task and generate logical outcomes.
6.1.3 Constraint
Despite the impressive language comprehension capability of LLMs, it is still a major challenge to generate stable and reliable texts consistently. Fortunately, leveraging natural language to instill constraints into the LLM’s generation process has been shown to be effective. This method capitalizes on the LLM’s language understanding by conveying constraint information as natural language directives, enabling the LLM to comprehend and adhere to these constraints. Following this line of work, we delineate a set of constraints to steer the LLM in its generation process. Taking the syntax-level approach as an example, the specific constraints are shown in Figure 2c.
When constraints are not provided, the LLM may produce unpredictable results. We illustrate the effects of these constraints on the LLM-generated results with several key constraints. First, it is essential to specify that the LLM should output text in Chinese; without this instruction, it may default to English responses. Second, word segmentation, a crucial step for Chinese dependency analysis, must be explicitly required; otherwise, the LLM will output unsegmented results, leading to misalignment in dependency relations. Third, the LLM must be prompted to stop generating after providing a response; otherwise, it may continue generating unrelated content. Fourth, the LLM must be explicitly directed not to reply to or extend the given content; without this directive, it may respond to interrogative sentences, thereby invalidating the generated results. Finally, the LLM needs to identify EDUs and clearly delineate their boundaries.
6.1.4 One-Shot Instruction
Owing to their vast scale, LLMs are difficult to fine-tune, which makes it hard to supply supervision signals for adapting them to downstream tasks. Fortunately, introducing supervision signals into the LLM via in-context learning has been demonstrated to be straightforward and effective (Brown et al. 2020). With this method, the LLM can produce reliable and desired text by mimicking the given examples. We manually select a sample at random from the training set and meticulously craft a generation example in accordance with the generation strategy, subsequently appending it to the prompt. In addition, we provide a reason for the generation to support CoT. Figure 2b provides a demonstration of the prompt utilized in the word-level transformation: an initial text sample is supplied, followed by the generated sample outcome, accompanied by an elucidation of the reasoning behind it.
As depicted in Figure 5, we observe noticeable disparities in generation with and without the provision of an example. In the absence of example guidance, an erroneous “elaboration” direction is generated, which can potentially be attributed to the LLM’s inability to comprehend the dependency structure and necessary structural modifications. When provided with an example, the LLM can mimic the existing sample, subsequently generating stable and reliable structures.
6.2 Different Input Granularity
In Section 4, we mention that our data augmentation methods can take either EDUs or complete utterances as inputs. Here, we compare the performance of the two to demonstrate the differences, as shown in Table 4. We observe that using utterances as inputs generally achieves higher performance. One possible reason is that this approach provides a larger receptive field for the LLM, allowing it to balance fluency and diversity. The highest performance is still achieved by the discourse-level transformation, consistent with previous experiments. Both input granularities lead to significant performance improvements, underscoring the effectiveness of our proposed methods. Furthermore, in our preliminary experiments, we observed that when LLM rewriting with dialogue-level input fails to follow the provided instructions, the rewritten samples cannot be assigned or filled with labels; one possible reason is that accurate rewriting becomes more difficult as the input text grows longer.
Augmented Data | Zero-shot Inner-EDU UAS | Zero-shot Inner-EDU LAS | Zero-shot Inter-EDU UAS | Zero-shot Inter-EDU LAS | Few-shot Inner-EDU UAS | Few-shot Inner-EDU LAS | Few-shot Inter-EDU UAS | Few-shot Inter-EDU LAS |
---|---|---|---|---|---|---|---|---|
wrd w/i EDUs | 88.08 | 84.10 | 67.34 | 51.86 | 92.18 | 89.95 | 72.87 | 58.22 |
wrd w/i utterances | 88.12 | 84.23 | 67.73 | 52.14 | 92.37 | 90.01 | 73.06 | 58.50 |
syn w/i EDUs | 88.16 | 84.21 | 67.91 | 52.42 | 92.24 | 90.01 | 73.33 | 58.89 |
syn w/i utterances | 88.09 | 84.36 | 68.19 | 52.81 | 92.13 | 89.94 | 73.22 | 59.33 |
dis w/i EDUs | 88.21 | 84.33 | 68.18 | 53.07 | 92.21 | 89.97 | 73.43 | 59.27 |
dis w/i utterances | 88.27 | 84.31 | 68.57 | 53.41 | 92.35 | 90.11 | 73.57 | 59.68 |
6.3 Instance Diversity
We calculate the overlap between the original dataset and the augmented dataset by averaging the Rouge-1 score over each pair of original and generated instances. Intuitively, samples exhibiting lower overlap with the original ones are indicative of greater diversity. As shown in Figure 6, the diversity of data samples augmented through the word-level, syntax-level, and discourse-level methods exhibits an increasing trend, corroborating our intuitive expectations. Meanwhile, the overlap of rewritten or generated samples at the utterance level is frequently lower than that at the EDU level, suggesting that reconstructing the entire utterance introduces a higher degree of diversity. Correlating these observations with the experimental results in the few-shot and zero-shot settings, we observe a consistency between the increase in diversity and the improvement in performance. Thus, the diversity of the augmented samples contributes positively to the efficacy of the parser.
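The overlap measure can be computed with a simple unigram-based Rouge-1 F1, as sketched below for word-segmented instances; this is a self-contained approximation rather than the exact scorer used in the paper.

```python
from collections import Counter

def rouge1_f(original_tokens, generated_tokens):
    """Unigram-overlap F1 between one original and one generated instance."""
    overlap = sum((Counter(original_tokens) & Counter(generated_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated_tokens)
    recall = overlap / len(original_tokens)
    return 2 * precision * recall / (precision + recall)

def average_overlap(instance_pairs):
    """Corpus-level overlap: the mean Rouge-1 F1 over all (original, generated) pairs."""
    return sum(rouge1_f(o, g) for o, g in instance_pairs) / len(instance_pairs)
```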
6.4 Influence of Data Mixture
In Section 5.2, our best results are obtained by combining data generated by the word-level, syntax-level, and discourse-level transformations, with the mixture ratio set to an equal 1:1:1, which might not be optimal. Here, we conduct experiments to study the influence of the mixing ratio of augmented data on parsing performance. For simplicity, we consider three values for each part, 0%, 50%, and 100%, and report the inter-EDU LAS. The results are presented in Figure 7 as a three-dimensional graph, with larger and darker bubbles representing better-performing mixtures. It can be observed that mixing all three methods yields a stronger improvement than mixing only two methods or using a single method. Additionally, we find that a 1:1:1 ratio is indeed not optimal: better performance can be achieved when one part is set to 0.5, and the ratio of 1:0.5:1 for the word-level, syntax-level, and discourse-level augmentations performs best among the investigated cases. This is possibly because the word-level transformation adds the most basic and accurate data, the discourse-level one is the most diverse, and the syntax-level approach lies in between. Mixing all three equally may introduce significant overlap, which can be alleviated by simply reducing the data in the middle part, thereby balancing accuracy and diversity.
6.5 Influence of Augmentation Ratio
According to the main experimental setup, each training instance is augmented into nine new ones by our three types of transformation using LLMs, with each type of transformation contributing three different augmentations; we refer to this number per transformation as the augmentation ratio. Here, we examine how this ratio affects the final performance.
Figure 8 shows the experimental results under the few-shot setting, where the inner-EDU and inter-EDU LASs are depicted separately. The three single-augmentation strategies and their various combinations are all examined. When the ratio is 3, almost all data augmentation methods reach their peak, and only minimal gains are obtained as the ratio increases further; the inner-EDU LAS shows very marginal improvements once the ratio exceeds 1. This observation indicates that there is an upper bound on the benefit of our data augmentation, and a three-fold augmentation makes the most of our approach. Furthermore, performance does not degrade significantly as the augmentation ratio increases beyond the peak, indicating the robustness of our method.
7 Related Work
7.1 Dependency Parsing
Dependency parsing has received widespread academic attention (Kübler, McDonald, and Nivre 2009). To date, various Chinese dependency paradigms and their associated treebanks have been established (Xue et al. 2005; Che, Li, and Liu 2012; McDonald et al. 2013; Qiu et al. 2014). The majority of these studies are devoted to sentence-level dependency parsing, whereas document-level parsing is noticeably underrepresented in the literature. A dependency parsing paradigm has been proposed for discourse parsing (Li et al. 2014); this paradigm, characterized by an EDU-centric pattern, overlooks parsing operations within the EDUs themselves. Jiang et al. (2023) have undertaken preliminary studies on dialogue-level dependency parsing in Chinese, taking into account both inner- and inter-EDU dependencies. However, their methodology is not fully integrated (i.e., not end-to-end) and lacks a comprehensive investigation of data exploitation.
7.2 Data Augmentation
Data augmentation has been a topic of focus from an early stage (Scudder 1965; Tanner and Wong 1987; Van Dyk and Meng 2001; Feng et al. 2021). Utilizing existing treebanks to train models, and assigning pseudo-labels to unlabeled data, is a common approach during the annotation stage of dependency parsing (Jiang et al. 2018; Li et al. 2019). Additionally, pseudo-labeled data within the target domain can provide weak supervision signals, effectively enhancing the generalization ability of models (Scudder 1965; Lee et al. 2013; Guo et al. 2022; Li et al. 2023). On the basis of these studies, our approach leverages both a syntactic treebank (Jiang et al. 2018) and a pseudo-labeled dialogue dataset, aiming to improve model performance within a few-shot learning environment. In the present era, LLMs have demonstrated profound comprehension and generative abilities (OpenAI 2023). Given this context, it is natural to utilize LLMs for the generation of pseudo-samples that are both natural and logically consistent, thereby facilitating the training of more proficient models (Whitehouse, Choudhury, and Aji 2023; Dai et al. 2023).
7.3 Large Language Models
The field of NLP has recently experienced the emergence and growing prominence of LLMs, such as PaLM (Chowdhery et al. 2023), ChatGPT (OpenAI 2023), and GPT-4 (Achiam et al. 2023). After instruction tuning, LLMs can accurately comprehend user instructions and generate text in accordance with user preferences (Ouyang et al. 2022; Wang et al. 2022; Peng et al. 2023). The recent breakthroughs achieved by GPTs (OpenAI 2023; Achiam et al. 2023) present significant opportunities to enhance the capabilities of open-source LLMs such as LLaMA (Touvron et al. 2023a), Stanford Alpaca (Taori et al. 2023), and Vicuna (Vicuna 2023) via instruction-tuning methodologies. Building on the powerful instruction comprehension and text generation abilities of LLMs, several studies have attempted to use them for data augmentation (Whitehouse, Choudhury, and Aji 2023; Dai et al. 2023). Nonetheless, current LLM-based data augmentation methods are primarily applied to text classification tasks and face the issue that labels cannot be directly transferred in structural analyses such as dependency parsing. Our approach can accurately map the dependency structure from the original text to the altered text, while also ensuring the diversity of the data.
8 Conclusion
In this study, we focused on dialogue-level dependency parsing in Chinese. To address the low-resource challenges posed by this task, we proposed LLM-assisted data augmentation to provide more supervised signals. Considering the hierarchical structure of dialogue dependencies, we implemented data augmentation at different levels: from the lowest word level, to the intermediate syntax level, and then to the discourse level. To meet the requirements of these strategies, we integrated multiple prompt design methods, including characterization, CoT, constraints, and a one-shot example, and meticulously designed the prompts accordingly. Experimental results demonstrated that our approach effectively improves the performance of dialogue-level dependency parsing. We also provided in-depth analyses covering the impact of prompt design, the mixture of augmented data from different levels of transformation, the augmentation ratio, and so forth.
The limitations of this study are primarily manifested in two ways. Although carefully designed prompt engineering is exploited for different levels of instance transformation, our approach still relies on manual prompt design, which could introduce subjectivity and potentially limit the scalability of our method. Moreover, the effectiveness of our method has only been demonstrated in the context of dialogue-level dependency parsing. It remains unclear whether it can be generalized across different levels and languages of dependency parsing, and further to broader NLP tasks. In the future, given the flexibility of our method, we intend to explore its application to a broader range of NLP tasks in diverse languages.
Acknowledgments
We sincerely thank the reviewers for their invaluable feedback, which significantly improved the quality of this work. This work is supported by the National Natural Science Foundation of China (NSFC) grant nos. 62336008 and 62176180.