Abstract
Current dialogue systems can be sensitive to the emotions in a user's words and generate empathetic responses to help calm the user. In some cases, however, empathetic responses alone cannot adequately mitigate the adverse effects that the current conversation topic has on the user. The dialogue system will continue the conversation under this uncomfortable topic, leading to a worse chat situation or even an impasse. To solve this problem, this paper proposes a dialogue system that can change the topic autonomously according to the user's emotions. Specifically, the dialogue system first collects the emotional and semantic information in the user's input and analyzes it with an emotion classification module. Once the detection results show that the user is in a bad mood, the topic change module selects a new topic from the context and generates a response that shifts to it. This not only helps to calm the user's mood but also moves the conversation to a new area, steering the user away from the uncomfortable topic. The experimental results show that the proposed method incurs only a small cost in content quality while improving emotional perception. Additionally, it endows the dialogue system with the ability to change topics and improves the user's dialogue experience.
Keywords
0 Introduction
Incorporating human emotions into conversational AI systems is a challenge of long-term significance. The ability to recognize and express emotions is a key component of human intelligence and the basis for establishing natural human-computer interaction[1]. Lately, there has been growing interest in empathetic communication, particularly within natural language processing and question-answering systems[2]. Incorporating emotional intelligence into conversational AI systems will enhance their ability to engage in more human-like interactions[3]. Such a system can identify and track the speaker's emotional state during a conversation and generate emotionally reasonable and natural responses[4]. This technology is applicable across various domains, including commercial client support, jurisprudence, healthcare, and other spheres of dialogue. The development of conversational artificial intelligence with emotional cognition and expression capabilities is a long-term challenge and goal for AI research aiming to reproduce human intelligence[5].
In this work, the unit of analysis is the dialogue. A dialogue typically contains contextual cues[6], such as reasons or circumstances, which trigger the emotion of the current utterance. The methods proposed by Lin et al.[7] and Majumder et al.[8] gauged the speaker's emotions by deciphering the overall meaning and feelings conveyed within the discourse. These methods are prone to incorrect responses because they ignore the nuances of human emotion in the conversation. Recent work on empathetic response has been devoted to sensing the emotional vocabulary in the speaker's context through deep learning models[9-10]. However, current empathetic dialogue technology still faces challenges. Existing methods fall short in grasping the intricate emotional nuances within the scenario and the underlying semantic meanings in individual experiences, and the same utterance may express different emotional connotations depending on what different people think[11]. To further improve emotion recognition and understanding in conversation, models need to analyze contextual cues better and capture the semantics of personal experiences. In addition, since current dialogue systems only perform sentiment analysis on the user's input to generate responses, when a bad topic arises they simply continue to generate responses under that topic, allowing the dialogue context to deteriorate further. Developing a dialogue system that can strategically decide to shift the topic in line with the present context is a challenging area of study.
Most current emotion-aware dialogue systems perceive subtle emotions by treating the overall semantics of the contextual language and the emotional words in the communicator's language as static vectors[12-13], while ignoring the impact of the current chat topic on the user's emotions. To tackle the aforementioned challenges, we propose a novel framework designed to facilitate topic shifts in conversational systems, enabling them to transition seamlessly between chat topics. The framework performs emotion classification on the user input and guides the model toward more accurate classification by introducing a dynamic relational graph convolutional network, thereby facilitating better topic shifts and generating more empathetic responses. The emotion classification module classifies the emotions captured by the dynamic relational graph convolutional network, and the dialogue system works normally when positive or neutral emotions are detected. When extremely negative emotions are detected, the conversation system chooses to shift the topic, and the topic shift module selects another topic from the context to generate a response. Although shifting the topic may diminish the coherence of the conversation, it can eliminate the unfavorable subject matter, enhance the scalability of dialogue, and comfort the user by offering a different viewpoint.
We evaluated the validity of the proposed approach on the empathetic dialogue benchmark with the EmpatheticDialogues dataset and compared it with current state-of-the-art empathetic dialogue models. The results show that the proposed method can effectively change the topic to escape an unpleasant one, accurately understand the conversation, and produce grammatically fluent empathetic responses.
The contributions of this study can be summarized as follows: a new framework is proposed for dialogue systems that realizes topic shifting by classifying the emotion of the dialogue; and, although little research has addressed topic shifting in dialogue systems, the feasibility of this approach is demonstrated by experiments.
1 Related Work
1.1 Empathetic Dialogue Generation
The objective of creating empathetic dialogues necessitates that systems acquire the ability to convey emotions in an empathetically fitting manner. Rashkin et al.[14] introduced affective representations generated by pre-trained affective classifiers to learn and express specific affective types in dialogue. Lin et al.[7] determined the nuanced emotional patterns in replies by leveraging a consortium of specialists, thereby enhancing the capacity for empathic reactions within a dialogue model. Majumder et al.[8] simulated the speaker's emotions to a certain extent to produce an empathetic response. These methods concentrate on the general ambiance of the dialogue, neglecting the nuanced feelings. To detect nuanced sentiments, Gao et al.[12] and Kim et al.[15] developed sentiment analysis modules aimed at identifying emotionally charged terms and discerning delicate emotional states. Wang et al.[16] used a fine-grained encoding strategy to make the model more sensitive to the flow of emotions in a conversation. Yang et al.[17] employed a contextually adaptive convolutional neural network to direct a system in discerning the interplay of emotional content and semantic meaning. In addition, some methods guide dialogue models toward empathic response abilities by introducing common-sense knowledge. Sabour et al.[10] used common sense to infer subtle emotions in conversations. Li et al.[18] incorporated general knowledge and sentiment-related lexical information to enable models to comprehend and articulate emotions clearly. Jiang et al.[19] extracted knowledge with emotional representations from common-sense knowledge, guiding the interactive reasoning between conversation history and common-sense knowledge and imitating human common-sense reasoning to generate empathic responses. Zhou et al.[20] introduced common-sense cognitive maps and emotion concept maps to learn subtle emotions in a dialogue.
1.2 Topic Shifting
A majority of current advisory conversational systems primarily reply to users' statements in an effort to more accurately discern their indicated preferences or inquiries, subsequently offering tailored suggestions. This responsive suggestion dialogue system faces practical constraints, given that individuals often lack defined inclinations toward novel topics or unfamiliar objects. Wang et al.[21] proposed a goal-driven recommendation dialogue system in which, given a specified target topic (e.g., movie, music, food), the dialogue system actively shifts the conversation toward its set topic. In Ref.[22], Wang et al. proposed a coherent dialogue planning method that employed a stochastic mechanism to capture the sequential evolution of the conversational trajectory, generate more coherent statements, and realize the transfer of the target topic with a higher success rate. Wang et al.[23] proposed a new goal-constrained bi-directional planning method that plans appropriate dialogue paths through foresight and retrospection, thus enabling topic transfer.
In addition, Liu et al.[24] proposed a spoken conversation system that combines the two modes of small talk and task orientation, enabling switching from one mode to the other. Bang et al.[25] put forward an example-based dialogue system that retrieves samples from a database and selects a matching one as the reply; if no suitable sample exists, a topic the user is interested in is selected and inserted into a template to generate the reply. Sakai et al.[26] proposed a Twitter-based system for topic selection within dialogue platforms, designed to spark discussions on topics that pique users' curiosity while being new to them.
With the progress of a dialogue, discourse-level topic shifts naturally occur over continuous multi-turn conversation. Nevertheless, existing retrieval-based systems rely solely on local topical vocabulary to represent the discourse context and fail to capture critical global topic cues at the discourse level. Xu et al.[27] proposed a novel topic-aware solution for modeling multiple rounds of dialogue, which identifies and isolates sections of topic-aware text without supervision, enabling the resulting model to detect the prominent topic shifts necessary for discourse management and to monitor the interaction efficiently.
Overall, in this research field, some dialogue systems can capture the subject words in discourse across multiple rounds of dialogue, so that the topic passively follows the user's movement. Some task-led conversation systems do not perceive the topic in the user's speech, but instead steer the conversation to the system's preset topic. Other dialogue systems actively change the topic, but only because no matching reply exists in the system's database. Unlike these approaches, our dialogue system actively shifts topics based on the user's emotions, and the chosen topics are ones that appear in the user's context rather than being predefined.
2 Method
We provide an overview of the Topic Change Emotion Semantic (TCES) dialogue system in Fig.1, which consists of six components: a) Context encoder: understands context semantics and generates context word embeddings and context semantic representations; b) Emotion semantic integrator: combines emotion embeddings with contextual semantic embeddings to generate emotion-semantic representations; c) Syntactic dependency module: introduces a dependency tree to feed the correlations between words back into the emotion-semantic representations, generating correlation representations; d) Emotion classification module: predicts emotion based on the context semantic representation and the correlation representation; e) Topic change module: judges whether to shift the topic based on the predicted emotion and, in the event of a shift, generates a reply from concepts in the conversation combined with templates; f) Dialogue generation: generates empathetic responses based on the context semantic representations and correlation representations. The specific methods are described below.
2.1 Task Formulation
Consider a conversation X = [S1, S2, ..., SN] between two speakers. The model is tasked with precisely discerning the emotional and semantic nuances within the conversation and producing an empathetic response Y = [y1, y2, ..., yN]. Here, Si represents the i-th utterance containing m words, and Y is a response of N words.
Fig.1Overview of the proposed model
2.2 Context Encoder
As in previous approaches[10, 18], we concatenate the utterances in the context of a conversation and use [CLS] as a prefix for the whole sequence to form the contextual input C = [CLS] S1 S2 ... SN. Here, represents the connector. This study maps the context C into word vector representations E_c to feed into the model. Subsequently, word embeddings, position embeddings, and situation embeddings are combined to create composite semantic embeddings E_CM. The situation embeddings help identify the roles of speaker and respondent and are initialized stochastically. To grasp the meaning behind the dialogue, we feed the semantic embeddings E_CM into a context encoder Enc_ct to derive the representation H_ct of the dialogue context:
H_ct = Enc_ct(E_CM) (1)
where H_ct ∈ R^(L×d), L is the length of the context sequence, and d denotes the hidden size of the encoder.
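As a concrete illustration, the composition of E_CM can be sketched as an element-wise sum of the three embedding streams. This is a minimal sketch under the assumption that the embeddings are combined additively (as in standard Transformer inputs, which the paper does not state explicitly); the toy dimensions and values are hypothetical.

```python
# Hedged sketch: composing E_CM from word, position, and situation
# (speaker-role) embeddings by element-wise summation.

def compose_embeddings(word_emb, pos_emb, situ_emb):
    """Sum the three embedding streams token by token."""
    assert len(word_emb) == len(pos_emb) == len(situ_emb)
    return [
        [w + p + s for w, p, s in zip(wv, pv, sv)]
        for wv, pv, sv in zip(word_emb, pos_emb, situ_emb)
    ]

# Toy example: a 3-token sequence with hidden size d = 2.
word_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pos_emb  = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1]]
situ_emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # speaker vs. listener roles

e_cm = compose_embeddings(word_emb, pos_emb, situ_emb)
```

The resulting sequence e_cm plays the role of E_CM and would then be fed to the context encoder Enc_ct.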
2.3 Emotion Semantic Integrator
Linguistic research[28] indicates that context significantly shapes the meaning of words, and interpreting words in isolation can result in misinterpretation. Take, for instance, the term 'well': devoid of context, it typically conveys a favorable emotion. However, in the phrase 'well, we should go home', it functions as an exclamation that amplifies the statement's intensity and conveys no particular emotion, let alone an affirmative one. Consequently, both emotional tone and meaning require contextual adaptation.
In terms of meaning, we apply a weighted semantic calibration to the context word embeddings E_c to derive adaptive semantic vectors E_cs, where E_cs ∈ R^(L×cs) and cs is the hidden size of the adaptive semantic vectors. In terms of sentiment, this paper converts sentiment-denoting words into emotion embeddings E_e and combines them with the context word embeddings to obtain fused sentiment vectors E_ce ∈ R^(L×ce), where ce denotes the number of sentiment categories.
Subsequently, we feed the fused sentiment vectors E_ce and the adaptive semantic vectors E_cs into an encoder to acquire representations that capture emotional semantics. By doing so, the model accounts for emotional content and semantic meaning simultaneously during training, leading to a more comprehensive grasp of contextual word meanings.
H_ces = Enc_ces(E_cs, E_ce) (2)
where Enc_ces is the encoder and H_ces is the emotion-semantic representation.
2.4 Syntactic Dependency Module
To improve the accuracy of emotion classification and the quality of the generated responses, dependency trees are introduced to reflect the grammatical dependencies between related words. In this paper, the emotion-semantic representation, part-of-speech embeddings, and dependency-type embeddings are concatenated to form a guidance vector V_k, where dr is the size of the part-of-speech and dependency embeddings. Then, we calculate the correlation probability between vertex m and vertex n to obtain the correlation representation H_ck between nodes m and n.
(3)
(4)
Here, w_(m,n) = 1 when node m and node n are directly related in the dependency tree, and w_(m,n) = 0 otherwise; W_v and b_v are trainable parameters, and ReLU is the rectified linear unit activation function.
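The dependency-masked correlation described above can be sketched as a row-wise softmax restricted to tree neighbors. This is an illustrative reading of Eqs. (3) and (4), not the authors' implementation: the raw pairwise scores below are placeholders standing in for the trained ReLU(W_v V_k + b_v) scoring, and the toy tree is hypothetical.

```python
import math

def masked_correlation(scores, adjacency):
    """Row-wise softmax over `scores`, restricted to dependency-tree neighbors."""
    probs = []
    for m, row in enumerate(scores):
        masked = [math.exp(s) if adjacency[m][n] else 0.0
                  for n, s in enumerate(row)]
        z = sum(masked)
        probs.append([x / z if z else 0.0 for x in masked])
    return probs

# Toy 3-node tree: node 0 (the head) is linked to nodes 1 and 2.
adjacency = [[0, 1, 1],
             [1, 0, 0],
             [1, 0, 0]]
scores = [[0.0, 1.0, 1.0],   # placeholder pairwise scores
          [2.0, 0.0, 0.0],
          [0.5, 0.0, 0.0]]
probs = masked_correlation(scores, adjacency)
```

Zeroing the non-neighbor entries before normalization is what lets the module feed only grammatically related words back into the emotion-semantic representation.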
2.5 Emotion Classification
To cope with poor accuracy in emotion classification, recent research[10] proposed identifying and analyzing sentiment words during communication and transforming them into static metrics, so as to capture subtle sentiments precisely and improve the overall accuracy of emotion classification. In this study, we employ a pair of networks that share an identical structure but have distinct parameters to integrate the context semantics and the correlation features for emotion classification. Below, we take the processing of the context semantic representation as a concrete example.
Given the emotion label e* provided for each conversation, we use the hidden representation of the [CLS] tag in the context representation to classify emotions.
(5)
(6)
In this paper, the hidden representation of the [CLS] tag, which lies in R^d, first passes through a linear layer with weight matrix W_e and then a Softmax operation to generate an emotion category distribution over the ce categories. Similarly, we obtain an emotion category distribution from the correlation representation. These two probability distributions are added together to obtain the final emotion prediction P_emo.
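A minimal sketch of the two-head classification just described, assuming each head is a linear layer followed by Softmax and that the predicted label is the argmax of the summed distributions. The function names, weights, and dimensions (d = 2, three categories) are toy values, not the paper's.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def linear(x, weights, bias):
    """weights: one row of coefficients per emotion category."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]

def predict_emotion(h_ct, h_ck, head_ct, head_ck):
    """Sum the two heads' softmax outputs and return (argmax label, P_emo)."""
    p_ct = softmax(linear(h_ct, *head_ct))
    p_ck = softmax(linear(h_ck, *head_ck))
    p_emo = [a + b for a, b in zip(p_ct, p_ck)]
    return p_emo.index(max(p_emo)), p_emo

# Toy setup: d = 2, three emotion categories, shared toy weights.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
label, p_emo = predict_emotion([2.0, 0.0], [2.0, 0.0], (W, b), (W, b))
```

Note that P_emo sums to 2 rather than 1, since it is the unnormalized sum of two probability distributions; the argmax is unaffected.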
During training, we optimize these weights by minimizing the cross-entropy (CE) loss between the predicted emotion category distribution P and the ground-truth label.
(7)
(8)
2.6 Topic Change Module
Most current research on topic shifting aims to handle cases where the dialogue system's database cannot find a matching response or the user is tired of the current topic, and pays no attention to the user's emotions. To address situations in which the chat topic puts the user in a bad mood, a topic change module is designed on top of the emotion classification. In this paper, the emotion categories are first assigned scores: strongly positive emotions are set to 1, strongly negative emotions to -1, and all other emotions to 0 and defined as neutral. Then, the three emotion polarities predicted by the emotion classification module are weighted and combined to obtain the emotion score. Finally, whether to change the topic is judged according to this score: if the emotion score is less than 0, a topic change is carried out. Formally, the topic change module extracts a flow of concepts from the context of the conversation. F = [f1, f2, ..., fN] represents the concepts observed in the dialogue, where fi corresponds to the concept set of the i-th utterance, collecting all the entities in Si. After that, the extracted entities and the topic entity of the current user utterance are converted into word vectors, the cosine similarity between each entity in the concept sets and the current topic entity is calculated, and each similarity is multiplied by the weight of its concept set. Finally, the entity with the best result is selected and inserted into a pre-defined template to generate a reply.
(9)
where the weights αi of the concept sets are given above. We assign higher weights to more distant utterances because entities farther from the topic of the current chat are better able to move the conversation away from the topic that makes the user uncomfortable. If the emotion score is not less than 0, no topic transfer is performed and the dialogue system generates the response through the decoder.
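The topic-selection procedure of this module can be sketched as follows. The linear distance weighting alpha used here is an assumption standing in for the actual weights of Eq. (9), the entity vectors are toy values, and pick_new_topic is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_new_topic(concept_sets, topic_vec, emotion_score):
    """concept_sets: oldest-first list of {entity: vector} dicts.
    Returns the entity chosen as the new topic, or None if no shift is made."""
    if emotion_score >= 0:        # only a negative mood triggers a topic shift
        return None
    n = len(concept_sets)
    best_entity, best_score = None, float("-inf")
    for i, entities in enumerate(concept_sets):
        alpha = (n - i) / n       # assumed: entities from older utterances weigh more
        for name, vec in entities.items():
            score = alpha * cosine(vec, topic_vec)
            if score > best_score:
                best_entity, best_score = name, score
    return best_entity

# Hypothetical toy example: two concept sets and a 2-D current-topic vector.
concept_sets = [{"beach": [1.0, 0.0]}, {"exam": [0.0, 1.0]}]
current_topic = [1.0, 0.0]
```

The selected entity would then be inserted into a reply template, as described above.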
2.7 Response Generation
Finally, the decoder generates responses based on contextual semantic representations and correlation representations, which incorporate emotion semantics, parts of speech, and dependencies.
(10)
where denotes the generated token embedding. We use the standard negative log-likelihood loss for the target response y:
(11)
All parameters of the proposed model are trained and optimized based on the weighting of the above three losses.
(12)
where r1, r2, and r3 are hyperparameters used to balance the three losses. To highlight the effect of grammatical correlation on the dialogue system, we set r1 = 1, r2 = 1, and r3 = 1.5.
3 Experiments
In this section, we present the details of the dataset, the comparison methods, and the implementation of our model. Given a typed sentence, the dialogue system generates a response. We first automatically evaluate the accuracy with which the dialogue system judges the emotion expressed in the user's sentences, as well as the fluency and diversity of the generated responses. We then manually evaluate the dialogue system's responses in four areas.
3.1 Datasets
No existing dataset focuses on evaluating topic change; available benchmarks evaluate empathetic response generation. We therefore conducted experiments on the EmpatheticDialogues (ED) dataset[14]. ED is a large-scale multi-turn conversation dataset that contains 25000 empathetic conversations between speakers and listeners, annotated with 32 uniformly distributed emotion labels. In these dialogues, individuals share personal experiences, allowing others to infer their circumstances and feelings and respond compassionately. Following Rashkin et al.[14], we partitioned the training/validation/test sets in the ratio of 8∶1∶1.
3.2 Baselines
We chose the following baseline models to compare with the TCES model presented above:
1) MoEL (Mixture of Empathetic Listeners)[7]: An extension of the transformer model that pairs a decoder with each predicted emotion and then generates a response with a meta-decoder that combines the outputs of the decoders.
2) MIME[8]: An empathetic dialogue system that considers polarity-based emotion categorization and emotion mimicry, reflecting user sentiments in its replies.
3) KEMP (Knowledge-aware Empathetic Dialogue Generation Method)[18]: A knowledge-grounded empathetic dialogue system that incorporates ConceptNet to augment the understanding of implicit emotions, enriching the representation with supplementary knowledge.
4) CEM (Commonsense-aware Empathetic Response Generation) [10]: An empathetic conversational framework that considers the emotional and cognitive dimensions of empathy, enriched with general knowledge.
5) SEEK (Serial Encoding and Emotion-Knowledge Interaction)[16]: An empathetic dialogue model that uses fine-grained encoding strategies to capture emotional dynamics in conversation and designs a new framework to simulate the interaction of knowledge and emotion.
6) CASE (Cognition and Affection for Responding Empathetically)[20]: A model for empathetic conversation built on common-sense cognitive graphs and emotion concept graphs, which integrates the user's cognitive processes and emotional states at both macro and micro levels of detail.
3.3 Implementation Details
This paper implements the model using PyTorch[29] and initializes word embeddings with pre-trained GloVe vectors[30] of dimension 300. The model is optimized with the Adam optimizer[31], with β1 = 0.90 and β2 = 0.98. In the training phase, the learning rate is initialized to 0.0001 and varied during training following Vaswani et al.[32]. The model was trained on an NVIDIA GeForce RTX 4090 GPU with a batch size of 16 and an early stopping strategy, and converged after 18400 training iterations.
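The schedule of Vaswani et al.[32] warms the learning rate up linearly and then decays it with the inverse square root of the step count. A sketch, assuming d_model = 300 (the GloVe dimension) and a hypothetical warmup of 8000 steps; the paper states only the initial learning rate of 0.0001.

```python
def transformer_lr(step, d_model=300, warmup=8000):
    """Linear warmup, then inverse-square-root decay (Vaswani et al. schedule)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate rises until step == warmup (where the two terms in min() are equal) and decays thereafter.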
3.4 Automatic Evaluation
In the automatic evaluation, we use emotion accuracy to assess the consistency between the ground-truth emotion labels and the predicted ones. To assess generation quality, we choose the widely used Perplexity (PPL) and Distinct-1/2 (Dist-1/2). PPL reflects how confident the model is in its candidate replies: higher confidence yields lower PPL. Dist-1/2 denotes the proportion of distinct unigrams/bigrams among all generated results, indicating the diversity of the generated responses.
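Dist-n as described can be computed as the ratio of unique n-grams to total n-grams over the generated responses. A sketch, assuming whitespace tokenization (the paper does not specify the tokenizer); the example responses are invented.

```python
def distinct_n(responses, n):
    """Dist-n: ratio of distinct n-grams to all n-grams in the generated responses."""
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

responses = ["i am sorry to hear that", "i am glad to hear that"]
```

Here distinct_n(responses, 1) is 7/12 (seven distinct words among twelve tokens) and distinct_n(responses, 2) is 0.7; higher values indicate more diverse generation.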
3.5 Automatic Result and Analysis
Table 1 shows the results of the automatic evaluation of our model and the comparison models. As can be seen, TCES outperforms the previous best by 1.11% in emotion classification accuracy, demonstrating that concentrating on the link between emotional content and semantics enables the model to classify emotions more accurately. Table 2 shows the results of the ablation experiments. Analyzing Tables 1 and 2 together, the template response strategy used in module C increases Dist-1 by 1.34 and Dist-2 by 3.22; after removing this module, the Dist-1/2 values of the model are 0.69 and 3.17 respectively, which are at a normal level. In terms of PPL, the introduction of module B leaves TCES with a perplexity of 37.27, which trails the other models; however, this module effectively improves the accuracy of emotion classification. In Table 2, w/o A denotes the model without the emotion semantic integrator, w/o B the model without the syntactic dependency module, and w/o C the model without the topic change module.
Table 1 Results of automatic evaluation
Table 2 Ablation studies
To verify the validity of the adaptive semantic vectors and fused sentiment vectors, module A is removed. The results show a marked decrease in both emotion detection accuracy and linguistic diversity, indicating the important role of the adaptive semantic vectors in accurately understanding and expressing information and of the fused sentiment vectors in capturing emotion. To verify the validity of the syntactic dependencies, module B is removed. Again, emotion accuracy and linguistic diversity drop markedly, implying that grammatical dependencies greatly affect the recognition of emotion and meaning, both crucial for conveying informative responses. After removing module C, the diversity of the dialogue system decreases significantly, showing that this module also noticeably improves the quality of reply generation. In addition, we find that removing modules A and B improves fluency but reduces the diversity of response generation, indicating that the sentences generated by the ablated models are relatively simple.
3.6 Human Evaluation
To evaluate from the user's point of view, this study designed a human evaluation experiment and recruited 15 evaluators to test the systems. The 15 evaluators were evenly distributed across the age ranges 20 to 30, 30 to 40, and 40 to 50. Each evaluator used each dialogue system for 10 min and then rated it on a scale of 1 to 5, with the final score being the average. We asked the evaluators to talk mainly about recent bad experiences and bad moods, but the format of the conversation was free. We selected the three strongest baseline models and compared TCES's performance on the following questions: a) Are you satisfied with the chat content? b) Are you satisfied with the topic? c) Does the system detect your mood? d) Has your mood improved? The experimental results are shown in Table 3.
Table 3 Results of human evaluation
3.7 Human Result and Analysis
Evaluators rated the proposed model better on questions b and d and worse on questions a and c. Question a assesses the fluency and variety of the responses generated by each dialogue system, and question c assesses each system's ability to perceive emotion. Because CASE is built on common-sense cognitive graphs and emotion concept graphs and aligns users' cognitive and emotional states at both coarse and fine granularity, it performs best on these two questions. Question b assesses which method users prefer, and question d assesses which method is more soothing. According to the human evaluation, our system can properly detect the negative emotions caused by the dialogue topic and change the topic to improve the user's impression of the conversation.
4 Conclusions and Future Work
In this study, we propose a topic-changing empathetic dialogue system based on emotion classification, which performs topic shifting by recognizing the user's conversational mood so as to escape topics that put the user in a bad mood. The evaluation results demonstrate the efficacy of our methodology in generating empathetic responses while adeptly steering conversational topics. Our work is not yet complete; in the future, we would like to enable the dialogue system to generate free-form replies about a topic selected from the context.