From Naming Concrete Objects to Sharing Abstract Thought: Vision-to-Language Begins to Grow Up

By Margaret Mitchell

The talk I attended today was held at the Language Technologies Institute of Carnegie Mellon University. I did not expect to learn much related to my own research domain, but the topic, and the fact that the speaker is from Google Research, attracted my attention. As a result, I chose it over the other talks occurring today that were posted on Comet.


At the beginning of the talk, she described the ultimate goal of vision-to-language research. The long-held dream is visual understanding by computers. For example, looking at a picture (the featured image of this blog post), a computer system would say that there is a boy riding a horse. Beyond this, the system would communicate further, for instance by expressing its thoughts (I hope he does not fall off) or by making jokes (Oh, or he does fall, and we own that horse! hahaha!).

She pointed out that the talk was designed for 3 hours and that she had to squeeze it into 1 hour, so she went very quickly over a considerable amount of theory and the results of her research. Since I am not a linguist, psychologist, or cognitive scientist, I had almost no idea about the theories and terms she mentioned at that pace. Without a background section, I could not follow much of the talk, but I will try to summarize it briefly. The talk had 3 major parts: 1) the cognitive & psycholinguistic part, 2) the math & computer science part, and 3) the application part.

In the first part (the cognitive & psycholinguistic part), she mainly talked about how humans perceive concrete objects, in terms of the features they use to describe the objects and the order of importance of those features. For example, people first perceive the colors of objects, then they differentiate objects based on their shapes. She mentioned a series of experiments that she and her colleagues conducted to understand the cognitive mechanisms behind object perception and their relation to language. For example, in one of the experiments, they showed participants several images of objects to understand how people perceive the “size” of objects. As the overall result of those experiments, they derived the attributes people use to differentiate objects and ordered them by importance as follows: color, size, shape, type/material, sheer, texture, orientation, pattern, location.

In the second part of the talk, she intended to present the mathematical and computational models behind vision-to-language studies. Unfortunately, there was not enough discussion of the models used in this domain. Rather, she just presented the findings of a few of her prior studies. The overall lesson they drew from those results was that sophisticated models do not outperform n-gram language modelling (they compared n-gram models to others, such as class-based models, which I know nothing about). As she put it, n-gram language models are satisfying from an ML perspective but dissatisfying from a linguistic perspective.
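To make the n-gram idea concrete for readers who, like me, come from outside this field, here is a minimal sketch of a bigram (n = 2) language model in Python. It is my own illustration and not the speaker's model: the toy captions, the function names `train_bigram` and `bigram_prob`, and the choice of add-one smoothing are all assumptions.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences.

    Hypothetical helper for illustration only.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Pad with start/end markers so sentence boundaries are modelled too.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one (Laplace) smoothing."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Toy captions (made up), standing in for image-description training data.
captions = [
    "a boy riding a horse",
    "a dog riding a skateboard",
    "a boy holding a dog",
]

unigrams, bigrams = train_bigram(captions)
p_seen = bigram_prob("boy", "riding", unigrams, bigrams)        # pair seen in training
p_unseen = bigram_prob("boy", "skateboard", unigrams, bigrams)  # pair never seen
print(p_seen > p_unseen)
```

The model only knows which word tends to follow which, so a word pair it has seen gets a higher probability than one it has not. That simplicity may be part of why such models feel unsatisfying from a linguistic perspective even when they perform well.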

For the third part, the application part, she did not have enough time for discussion. However, she showed some example outputs from the application they developed to demonstrate its performance. They gave the application several images and checked how it described them. For most of the images, its performance was quite acceptable: it described the main scene or event, yet it sometimes produced weird modifiers (like a dog having a cool look).


I do not think I have enough experience or judgement to criticize a talk in this domain. However, without wanting to be destructive, I would say that the talk could have been better organized and prepared, assuming the speaker knew it would last 1 hour. When I compared what I heard with the promises in the abstract of the talk, I saw a gap between them. The abstract promised a talk that would be interesting to anybody regardless of background (from the abstract of the talk: “This will be a multi-modal, multi-disciplinary talk (with pictures!), aimed to be interesting no matter what your background is.”), but it was actually addressed to researchers in psychology, cognitive science, and linguistics…


