Automatic Description Generation from Images

Over the past two decades, the fields of natural language processing (NLP) and computer vision (CV) have seen great advances in their respective goals of analyzing and generating text, and of understanding images and videos. While both fields share a similar set of methods rooted in artificial intelligence and machine learning, they have historically developed separately, and their scientific communities have typically interacted very little. Recent years, however, have seen an upsurge of interest in problems that require a combination of linguistic and visual information. A lot of tasks are of this nature, e.g., interpreting a photo in the context of a newspaper article, following instructions in conjunction with a diagram or a map, understanding slides while listening to a lecture.

In addition, the web provides a vast amount of data that combines linguistic and visual information: tagged photographs, illustrations in newspaper articles, videos with subtitles, and multimodal feeds on social media. To tackle combined language and vision tasks and to exploit the large amounts of multimodal data, the CV and NLP communities have moved closer together, for example by organizing workshops on language and vision that have been held regularly at both CV and NLP conferences over the past few years. In this new language-vision community, automatic image description has emerged as a key task. This task involves taking an image, analyzing its visual content, and generating a textual description (typically a sentence) that verbalizes the most salient aspects of the image. This is challenging from a CV point of view, as the description could in principle talk about any visual aspect of the image: it can mention objects and their attributes, it can talk about features of the scene (e.g., indoor/outdoor), or verbalize how the people and objects in the scene interact. 
More challenging still, the description could even refer to objects that are not depicted (e.g., it can talk about people waiting for a train, even when the train is not visible because it has not arrived yet) and provide background knowledge that cannot be derived directly from the image. In short, a good image description requires full image understanding, and therefore the description task is an excellent test bed for computer vision systems, one that is much more comprehensive than standard CV evaluations that typically test, for instance, the accuracy of object detectors or scene classifiers over a limited set of classes.

Image understanding is necessary but not sufficient for producing a good description. Imagine we apply an array of state-of-the-art detectors to the image to localize objects, determine attributes, compute scene properties, and recognize human-object interactions. The result would be a long, unstructured list of labels, which would be unusable as an image description. A good image description, in contrast, has to be comprehensive but concise (talk about all and only the important things in the image), and has to be formally correct, i.e., consist of grammatically well-formed sentences. From an NLP point of view, generating such a description is a natural language generation (NLG) problem. The task of NLG is to turn a non-linguistic representation into human-readable text. Classically, the non-linguistic representation is a logical form, a database query, or a set of numbers. In image description, the input is an image representation, which the NLG model has to turn into sentences. Generating text involves a series of steps, traditionally referred to as the NLG pipeline: we need to decide which aspects of the input to talk about, then we need to organize the content and verbalize it (surface realization). 
Surface realization in turn requires choosing the right words (lexicalization), using pronouns if appropriate (referential expression generation), and grouping related information together (aggregation).

In other words, automatic image description requires not only full image understanding, but also sophisticated natural language generation. This is what makes it such an interesting task that has been embraced by both the CV and the NLP communities. Note that the description task can become even more challenging when we take into account that good descriptions are often user-specific.
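
To make the contrast between an unstructured list of detector labels and a well-formed sentence concrete, the pipeline steps named above (content selection, lexicalization, aggregation, surface realization) can be sketched as a toy program. This is only an illustrative sketch under strong assumptions, not a method taken from the survey: the `Detection` record, the `LEXICON` and `SCENE_PHRASES` tables, the confidence threshold, and the fixed sentence template are all hypothetical, and referring expression generation is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """One hypothetical detector output (all fields are illustrative)."""
    label: str
    kind: str                        # "object", "action", or "scene"
    score: float                     # detector confidence in [0, 1]
    attribute: Optional[str] = None  # e.g. a colour attached to an object


# Lexicalization tables: detector label -> words used in the sentence.
LEXICON = {"person": "person", "bicycle": "bicycle", "ride": "riding"}
SCENE_PHRASES = {"street": "on a street", "kitchen": "in a kitchen"}


def select_content(detections, threshold=0.5):
    """Content selection: keep confident detections, most confident first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


def noun_phrase(d):
    """Aggregation: fold an object's attribute into one noun phrase."""
    noun = LEXICON.get(d.label, d.label)
    phrase = f"{d.attribute} {noun}" if d.attribute else noun
    article = "an" if phrase[0] in "aeiou" else "a"
    return f"{article} {phrase}"


def describe(detections, threshold=0.5):
    """Surface realization with a single fixed template."""
    selected = select_content(detections, threshold)
    objects = [d for d in selected if d.kind == "object"]
    actions = [d for d in selected if d.kind == "action"]
    scenes = [d for d in selected if d.kind == "scene"]
    if not objects:
        return "Nothing salient was detected."
    sentence = noun_phrase(objects[0])
    if actions and len(objects) > 1:
        verb = LEXICON.get(actions[0].label, actions[0].label)
        sentence += f" is {verb} {noun_phrase(objects[1])}"
    if scenes:
        label = scenes[0].label
        sentence += " " + SCENE_PHRASES.get(label, f"in a {label}")
    return sentence[0].upper() + sentence[1:] + "."


# Made-up detector output for a single image.
detections = [
    Detection("tree", "object", 0.30),
    Detection("street", "scene", 0.70),
    Detection("bicycle", "object", 0.90, attribute="red"),
    Detection("ride", "action", 0.80),
    Detection("person", "object", 0.95),
]
print(describe(detections))
```

With the made-up detections above, the low-confidence `tree` label is dropped during content selection and the remaining labels are combined into one sentence. A fixed template like this is the simplest possible realizer; it says nothing about how salience, word choice, or sentence structure would be decided in a real description system.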