Knowledge Extraction From Email Conversations and Its Application to Question Answering






Email communication is the exchange of messages between two or more people over the internet using electronic devices. With billions of emails exchanged every day, extracting knowledge from emails can benefit a variety of email-based user applications. In any form of communication, it is important to identify the participating users or entities and their interactions throughout the conversation. Previous research on email processing has focused primarily on classification, search, and intent detection, and has overlooked the interactions between the entities participating in an email conversation. One task that captures these interactions is entity coreference resolution: end-to-end entity coreference resolution extracts entities and their references throughout the conversation. After extraction, it is important to use a knowledge representation format that preserves and enriches the extracted knowledge. Knowledge graphs can represent the extracted intra-email interactions compactly. They can also enrich the knowledge by capturing inter-email interactions using a robust technique such as matching entities across email conversations. One of the main applications of the extracted knowledge is question answering that focuses on the entities in a conversation. Put together, these tasks paint a simplistic yet holistic picture of a knowledge extraction pipeline for email conversations. A deep joint learning framework is proposed for the novel task of entity coreference resolution for email conversations. Two datasets were created during the framework’s development and were used to evaluate the difficulty of the task and to identify the limitations of available solutions. The framework uses text classification as a jointly learned task to improve the scoring of text spans; the same task is also used to incorporate singletons into the result.
The joint learning framework and singleton addition achieved improvements of 4.87 and 5.26 F1 points on the two datasets, respectively. A combination of automatic and manual methods was used to carry out relation extraction in parallel with entity coreference resolution. The knowledge extracted by these tasks was used to create two knowledge graphs: one containing the knowledge from the relation extraction task, and the other containing the knowledge extracted from both tasks. The knowledge graph creation process used the NEPOMUK framework, which was created to simplify data sharing across different user applications. Changes to the NEPOMUK framework are proposed for adding coreference knowledge to the graphs. Lastly, while previous work has investigated question answering using digital voice assistants, this dissertation explores the novel task setting of question answering over email conversations using digital voice assistants. The sub-task of entity resolution is identified as essential to the proposed formulation, and a dataset for evaluating it was created using templates. One deep learning-based system and two SPARQL template-based systems were used in the evaluation. Empirical results showed an increase of 3.91% in the template-based system’s accuracy when coreference information was incorporated into the knowledge graphs. By laying out a framework for the knowledge extraction pipeline and creating open-source datasets and benchmarks for comparison, this dissertation hopes to advance research in email processing.



Computer Science