Explain and answer: Intelligent systems which can communicate about what they see

By Marcus Rohrbach

Today’s talk was part of the LTI Colloquium Series hosted by the Language Technologies Institute at Carnegie Mellon University. Its focus was similar to that of a previous LTI talk I attended: From Naming Concrete Objects to Sharing Abstract Thought: Vision-to-Language Begins to Grow Up. The speaker, a postdoctoral researcher at the University of California, Berkeley, discussed how visual recognition and natural language understanding can be achieved with machine learning.


First, he elaborated on the objectives of their studies in this domain. Their main aim is to build intelligent systems that can communicate about what they see in a way that people can understand. Such systems would be especially valuable for describing the surroundings to blind people. He then continued with the high-level requirements of those intelligent systems, namely being compositional, explainable, and scalable. Being compositional means providing a semantic mapping between vision and language. Being explainable means that the system should help users make decisions in a way that lets them trust it. Being scalable means that the system can operate with a limited amount of supervision.

Then, he introduced two example application areas: image description and video description. He elaborated on the machine learning models and techniques they use to make their systems produce image and video descriptions. They train the systems with deep learning models built on convolutional neural networks. For grounding, they use different methods, such as encoding a phrase provided by humans and applying unsupervised techniques. With those techniques and models, the systems they build are designed to answer visual questions. They evaluate the performance of the system using two measures: discriminative loss and ….. with reinforcement learning.

He presented an example of how to combine an image representation and a question representation to answer visual questions. Given an image of a table full of food and drinks, the question to be answered is: “Is this going to be a feast?” They constructed two neural networks, one for the image and another for the question; each network uses the keywords in its own input (the image or the question, respectively) as features for training.
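To make the idea of combining the two representations more concrete, here is a minimal sketch in plain Python. It is not the speaker's model: the fusion by element-wise product, the bag-of-words question encoding, the three-number image feature vector, and all weights are assumptions I picked by hand for illustration, and nothing here is trained.

```python
import math

def bag_of_words(question, vocabulary):
    """Count how often each vocabulary word occurs in the question."""
    tokens = question.lower().replace("?", " ").split()
    return [float(tokens.count(word)) for word in vocabulary]

def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def answer_question(image_features, question, vocabulary, params, answers):
    """Embed image and question separately, fuse them, and score each answer."""
    question_features = bag_of_words(question, vocabulary)
    image_emb = [math.tanh(v) for v in linear(params["image"], image_features)]
    question_emb = [math.tanh(v) for v in linear(params["question"], question_features)]
    # Assumed fusion step: element-wise product of the two embeddings.
    fused = [a * b for a, b in zip(image_emb, question_emb)]
    probs = softmax(linear(params["classifier"], fused))
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs

# Hypothetical setup: image features stand for (food, drinks, empty space).
VOCAB = ["is", "this", "feast"]
ANSWERS = ["yes", "no"]
PARAMS = {
    "image": [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "question": [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
    "classifier": [[2.0, 0.0], [-2.0, 0.0]],
}

answer, probs = answer_question(
    [0.9, 0.8, 0.1], "Is this going to be a feast?", VOCAB, PARAMS, ANSWERS
)
print(answer, probs)
```

In a real system the two encoders would be learned networks and the weights would come from training on question-answer pairs; the sketch only shows where the two representations meet.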

In the last part of the talk, he elaborated on the visual explanations provided by the intelligent system. For example, the system is given images of several different birds and is asked the question: “What bird is this?” The system then answers with the type of bird and an explanation of why it selected this answer. A sample explanation is: “This is a Western Grebe because this bird has a long white neck, pointy yellow beak and red eye.” He then briefly mentioned the experiments they conducted on bird images. At the end, he listed the future work they plan for the high-level requirements noted above (being compositional, explainable, and scalable). For example, they aim at visual-language dialogue with virtual or real agents.
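The shape of such an explanation (a predicted class followed by the visual attributes that support it) can be illustrated with a small sketch. My notes do not cover how the system actually produces the explanation text, so the fixed template below is only a stand-in, and the function name `explain` and its inputs are hypothetical.

```python
def explain(label, attributes):
    """Format a class label and its supporting attributes as one sentence."""
    if not attributes:
        return f"This is a {label}."
    if len(attributes) == 1:
        joined = attributes[0]
    else:
        # Join all but the last attribute with commas, then add "and".
        joined = ", ".join(attributes[:-1]) + " and " + attributes[-1]
    return f"This is a {label} because this bird has a {joined}."

sentence = explain(
    "Western Grebe", ["long white neck", "pointy yellow beak", "red eye"]
)
print(sentence)
```

The point of the sketch is the pairing of answer and evidence: the explanation names attributes that are visible in the image and relevant to the predicted class.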


As with the earlier vision-to-language talk I attended at LTI, I do not have much feedback because the topic is outside my area of research. I can say that the topic was interesting and shows a lot of promise, especially because of its potential benefits for visually impaired people.





