Iconic: Discerning Insightful Contexts in Image Captioning By Leveraging Commonsense Knowledge
One of the most remarkable human abilities is describing an image with numerous invaluable insights at will. These insights result from one's background knowledge and from the ability to reason with it. Most people around us share this knowledge. For example, it is commonly believed that “Buses don’t fly. Birds do.” We call this common sense. Of course, commonsense reasoning is defeasible and context-sensitive (Liu and Singh, 2004b): birds with broken wings do not fly, and buses do not fly, at least for now. Building machines that emulate this human ability has been a long-standing goal of artificial intelligence. While current state-of-the-art techniques achieve very good results on benchmark datasets, very few of them use explicit knowledge sources to acquire commonsense knowledge and incorporate commonsense reasoning into existing systems. Because these systems learn only from their training data, the commonsense knowledge available to them is limited to what that data contains. In image captioning, the task of describing an image in natural language text, training only on such datasets limits the expressiveness of the generated captions because of inherent restrictions on the reference captions. Moreover, these captions are often too generic: they describe only the salient objects and relationships in the image and miss the commonsense contextual information about it. Our goal in this thesis is to generate insightful contexts about the contents of an image by incorporating commonsense reasoning. We develop a technique for ranking the regions of an image by their saliency and use the ranked region descriptions to discern insightful contexts using commonsense knowledge. We first show the importance of identifying salient and non-salient regions in the image, and the need for ranking them by saliency.
We use the region descriptions available in the Visual Genome dataset (Krishna et al., 2017) to obtain descriptions of the different regions in an image. Because no existing dataset ranks the information in an image by its saliency, we propose a technique to determine the saliency of a region heuristically. We train a model to predict the saliency of each region in an image and rank the regions accordingly. To evaluate this ranking of region descriptions, we propose an evaluation metric based on mean average precision (MAP), Dense Score-based Discounting MAP. Next, we propose three techniques for acquiring commonsense knowledge and applying it to the region descriptions to generate insightful contexts. First, we explore how to use commonsense-infused word embeddings to draw contextual conclusions about the image. Second, we extract knowledge from Wikipedia, an existing source of common knowledge that does not explicitly express commonsense. Third, we use the commonsense knowledge graph ConceptNet and devise a technique for generating the contexts of an image from it. Finally, we perform a human evaluation of the commonsense results generated by our three systems. Overall, this thesis focuses on acquiring insightful contexts from an image, and its results can be used in future work to generate detailed and insightful captions.
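As a point of reference for the ranking evaluation mentioned above, the following is a minimal sketch of standard mean average precision (MAP) over ranked region descriptions. It is not the Dense Score-based Discounting MAP proposed in the thesis, whose definition is not given in this abstract. The binary salient/non-salient labels and the function names are assumptions made only for illustration.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Standard average precision for one ranked list.

    ranked_relevance[i] is True when the region description at rank i + 1
    is labelled salient (a hypothetical binary label, assumed here).
    The list is assumed to contain every region of the image, so the
    number of hits equals the total number of salient regions.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-image average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


if __name__ == "__main__":
    # Two hypothetical images, each with its regions in predicted order.
    example = [[True, False, True], [False, True]]
    print(round(mean_average_precision(example), 4))
```

Plain MAP treats saliency as binary and rewards placing salient regions early in the ranking. A metric built on graded saliency scores, as the name of the proposed metric suggests, would need the thesis's own definition.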