One of the reasons why there are not more robots in homes, ... is that they are not capable of understanding the world.
The true list of problems to solve before machines can do certain things by themselves is still undefined. In fact, it doesn’t even exist.
Developing vision systems that can see but not learn does not make sense.
—Antonio Torralba
Antonio Torralba is a Research Scientist and Professor of Electrical Engineering and Computer Science at MIT and a member of the MIT Computer Science and Artificial Intelligence Lab (CSAIL). He holds a master’s degree in telecommunications engineering from the Universitat Politècnica de Catalunya, or BarcelonaTech (UPC), Spain, and a doctorate in signal-image-speech from the Institut National Polytechnique, Grenoble, France. His research interests span computer and human vision, human visual perception, computer graphics, and machine learning.
Adolfo Plasencia:
Antonio, some people say that MIT is like the technological center of the universe; if someone wants to do cutting-edge technology and science, then the best place for it is MIT. Your research here focuses on machine vision, machine learning, and visual perception. You work in the area of vision and graphics, right?
Antonio Torralba:
Yes, I am part of the MIT Computer Vision Group. The Computer Vision Group is different from the MIT Computer Graphics Group, but lately these two areas have been closely connected; we are sort of neighbors. Let’s say that if we were to divide it into districts, then I would live in the Vision and Graphics neighborhood.
A.P.:
We could say that you are like a digital-visual alchemist, yes? Scene layout and the recognition of the objects in a scene are usually treated as separate elements, but here you study them as a whole. Your initial goal is to build recognition systems that are much more efficient and robust. But how did you get to this point?
A.T.:
The study of vision allows you to focus on a more specific level than if you studied artificial intelligence (AI) in a more general way. But in fact the human visual system is a brain in itself: it poses all the problems that a broader intelligent system would need to solve. You can work on a focused problem and at the same time address very general questions.
I became interested in the question of vision when I was a student at the Universitat Politècnica de Catalunya, or BarcelonaTech (UPC). Some people there were working on image processing. Image processing and vision are closely related but nevertheless very different things. Basically, in image processing, the input is one image and the output is another image that may have been treated in some way so as to highlight a certain type of information. In vision, the input is an image, but the output might simply be the interpretation of that image; in that case, the objects present in it are given semantic labels. Moreover, there may not even be an output image. It is a more closed system that processes the image in a certain way and produces, for example, information to be used to control a robot.
A.P.:
First things first: you were already thinking about AI as a youngster in Madrid, your hometown. Nobody seemed to understand you. Would you explain AI now in a very different way from those days, when you dreamed of it and those around you scarcely understood a word?
A.T.:
It is quite an interesting question. I think my explanation would be very similar, but there is something that has changed. People now listen to you, they are less skeptical. As regards AI, twenty or thirty years ago, it was common to think that a computer would never be intelligent. In fact, as obstacles were overcome, for example with Deep Blue winning in chess, people started to believe that it was possible to automate certain things.
A.P.:
Yes, particularly lately, with Siri on everybody’s iPhone.
A.T.:
Yes, it is quite curious. People used to think some things were not possible and now they see them as absolutely normal. And they are not aware of the enormous complexity behind the systems they use.
A.P.:
And they carry it in their pockets.
A.T.:
Yes. For example, now, any digital camera detects faces and adjusts the focus automatically. This technology is only ten years old. Formerly, researchers themselves thought that to reliably detect faces, it would be necessary to overcome the problem of object recognition in general.
A.P.:
Rodney Brooks told me that he is skeptical. Do you think your scientific generation will be able to answer the big questions on AI?
A.T.:
I think some of those questions will be answered, but we will have to wait for answers to some of them, maybe for several generations. I do not think this is a field that will be clarified in the next ten or twenty years. It will take much longer.
A.P.:
I referred to alchemy earlier, which is at the origin of chemistry, when everything was based more on beliefs about components than on genuine scientific knowledge. And today, in AI. …
A.T.:
Perhaps AI is at that point. We are playing, trying things out, learning what works and what does not. It will take a long time, and a lot of data, before we know the real fundamentals and can tell apart the things that work from those that do not; we are not there yet. In the field of vision in particular, we are still quite far from that. It is true, though, that in AI and computer vision, progress has been made in building systems that work well enough for people to find them useful as engineering components: in digital cameras, phones, simple robots, manufacturing, and many other applications that already include AI and vision algorithms to solve problems in ways that were previously unthinkable. In fact, there are already commercial products, in what is known as machine vision, that use computer vision. But complex vision, the kind that has to deal with the natural environment in which humans move, remains unresolved.
A.P.:
And process their context, right?
A.T.:
Yes, and their context too. That environment is very complex. Applications are now starting to operate in that environment.
A.P.:
But the expectations raised by AI have not been fulfilled. Are people disappointed?
A.T.:
I understand that some people may not be satisfied with the speed at which certain areas have progressed. But there is a reason for that: in AI, the problems turned out to be more complex than expected. There is a very famous story about the beginnings of computer vision. Vision seems a very simple problem. After all, we all see without much effort. You open your eyes, you look at things, and you see them! You do not even have to think about it; it is a very automatic process. Because the problem seemed so simple, an MIT professor suggested that some of his students build a vision system as a summer project. It was not supposed to be an entire visual system, but it was to include many of the elements that make up a complete one; that was the students’ summer assignment. This was in the 1960s. When they got down to it, of course, they realized it was not so easy. For a start, they could not even capture and record images, because in the 1960s there were no digital cameras like the ones we have today. Cameras were large, expensive systems: a digital camera cost $30,000 or $40,000. And then where were you supposed to store the pictures? A single picture would take up all of your computer’s memory. Maybe you could work with a few images, but not many. All in all, the prospects were not very good.
Some of the systems in today’s digital cameras, like the face recognition system, were developed in 2000, and object recognition systems, which are now the state of the art in their field, are based on articles from the 1970s. But at that time some thought it wasn’t an interesting line of work. In fact, there was heated controversy about it.
A.P.:
Controversy?
A.T.:
Yes, controversy. There were two major schools in computer vision. One, based on geometric methods, sought to model the world with mathematics; the other focused on learning techniques. At the time, learning techniques were not as fashionable as they are today, and these two communities were in conflict. It was very difficult to tell which one was going to be right and bear fruit. Many of the fundamentals laid out then are today leading to more powerful systems that draw on both traditions.
A.P.:
Antonio, you work in one area of computer science and AI that allows machines to interpret images in order to understand scenes and the meaning of objects based on their context. You want to create systems that can understand how objects relate to each other, just as you are trying to achieve in the project “Places-CNN.”1
A.T.:
What we are studying is object recognition and understanding, that is, understanding scenes. And by scene I mean an environment, for example, a living room, which has lots of objects. Our goal is to develop a system that, by seeing images in that environment, is capable of understanding the objects in the scene, and then, perhaps, how they are interrelated.
A.P.:
So, for example, you are part of a scene, you are in a room, and you ask the robot, “Bring me the ball.” It will do so as long as it knows what a ball is. It finds it in the scene and brings it to you. Is that what the dream is all about?
A.T.:
Yes. In fact, perhaps a good way to understand what we are trying to do is through applications, in particular robotics. One of the reasons why today there are no robots in homes, or driverless cars on the streets (although some experiments have been made), or more autonomous systems out in the world in general, is that, mechanical aspects aside (motion has been almost resolved), they are not capable of understanding the world. If you do not understand how the world is laid out, it is very difficult to move around without bumping into objects, or simply to know where things are and how to interact with them. A robot would be helpful if you could say “Bring me the scissors” and it knew what scissors are; how to move through space until it finds them; where to go to look for them, because it remembers where it left them or, if it does not remember exactly, the places where they are most likely to be; and then, once it got there, it could recognize the scissors and their three-dimensional structure and be able to grab them and bring them back.
A.P.:
And it must also know that if something is made of glass it can break, or if something is made of wood. … And even know the implications of the material the object is made of.
A.T.:
It must understand the material it is made of.
A.P.:
Because if the robot picks up a glass, if too much pressure is applied, it can break, right?
A.T.:
Yes. The right amount of pressure must be applied. For example, if it brings you a glass of water, the robot needs to know that a glass must be picked up in a particular way.
A.P.:
It cannot be held upside down.
A.T.:
And it has to move carefully, because the object has its own dynamics too. So as the robot moves, the water also moves inside the glass, and even if it has been positioned correctly, water may spill. The ability to understand all these different features of the world on the basis of visual information is what we are trying to tackle. Many groups are working on this issue.
A.P.:
Yes, and there are many things yet to be resolved.
A.T.:
Yes, there are many issues. It really is very difficult to deal with all of this in a summer project, as they intended to do back in the 1960s. We focus in particular on object recognition, not on face detection; the goal is to recognize objects and to understand the scene: to identify that a particular place is, for example, an office or a hallway, to know which objects you expect to find in that space, what spatial arrangement is typical of such an environment, and which items can be found in that type of place. Using this information, we can better identify objects that are difficult to recognize without information about their context. For example, a chair can always be identified as a chair; there is nothing special about it. It can be anywhere; it will always be a chair. But other objects are defined by their context: a part of something, for example, or a blank sheet of paper, which is seen simply as a white rectangle. What that white rectangle really is depends very much on where it is.
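To make the role of context concrete, here is a minimal sketch in Python of how a detector’s raw scores might be reweighted by a context-dependent prior. The labels and probabilities are invented purely for illustration; they are not taken from Torralba’s systems.

```python
# Minimal sketch: reweighting ambiguous detector scores with a context prior.
# All labels and numbers are invented for illustration only.

# Hypothetical prior probability of each object, given where it appears.
context_prior = {
    "on_table": {"sheet_of_paper": 0.60, "whiteboard": 0.01},
    "on_wall":  {"sheet_of_paper": 0.10, "whiteboard": 0.60},
}

def rescore(detector_scores, context):
    """Multiply raw detector scores by the context prior and renormalize."""
    prior = context_prior[context]
    weighted = {obj: score * prior.get(obj, 0.01)
                for obj, score in detector_scores.items()}
    total = sum(weighted.values())
    return {obj: round(w / total, 3) for obj, w in weighted.items()}

# The detector alone cannot tell what a white rectangle is...
white_rectangle = {"sheet_of_paper": 0.5, "whiteboard": 0.5}

print(rescore(white_rectangle, "on_table"))  # leans toward "sheet_of_paper"
print(rescore(white_rectangle, "on_wall"))   # leans toward "whiteboard"
```

The same ambiguous appearance gets a different interpretation depending on where it sits in the scene, which is the point of the paper-versus-whiteboard example above.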
A.P.:
And on its contents, and whether it is valuable or not.
A.T.:
Sure. Then you must also understand the functions it has. But that rectangle … if it is on the table then it is a sheet of paper, or if it is large and on the wall, maybe it is a whiteboard. The visual information received is ultimately very similar, but it is the rest of the scene that defines what it is, what the rectangle actually is.
A.P.:
There are many different rectangular objects.
A.T.:
Of course, and not of all them are sheets of paper. So in that sense, recognition, understanding the whole scene, is essential to giving meaning to objects. Objects are defined not only by their internal structure but also by the context in which they are found.
A.P.:
And there are metaprocesses, too. There may be a display—all over the place nowadays—with a representation of the object, which is not the actual object.
A.T.:
Exactly.
A.P.:
So it gets tricky.
A.T.:
Yes, it is rather complicated. In fact, when you work with many objects, the boundaries between them become blurred. There are objects that, as they are slowly changed, become a different object, and this happens with objects in real life, too. And that border, where you decide that an object starts to be another object, is not well defined yet. For example, a table has a very specific function, and a chair too. But there are many objects that are “in between”: you can sit on them, but you could also use them as tables, and so it is not so clear, visually speaking, what the difference between them is.
A.P.:
Let me pick up on the point you made earlier: “For machines to do certain things, some issues need to be resolved: this issue, this one, and this one.” So the list of issues to solve is huge. An interesting thing would be to have such a list of big issues, but who sets the priorities? Because that’s yet another challenge.
A.T.:
Yes. Yes, that is a complex question, because the list of problems in AI is not well defined. The greatest difficulty is not even knowing what the questions are. It is really a problem in research; you go forward blindly, not knowing what will end up working. You follow your own intuition and the things you think are going to pay off. But that list of questions, one that everyone agrees with, simply does not exist. There are questions people more or less agree with, but they are possibly just one piece in the jigsaw puzzle.
A.P.:
And one of them is how to get machines to make decisions. In vision, technologies are needed for a machine to be autonomous and make decisions, am I right?
A.T.:
Yes, you are. In fact, many of the challenges in general object recognition were already identified in the 1980s, the 1990s, and around 2000. One of the major problems has to do with big data, a very trendy subject today.
A.P.:
Yes, you were a pioneer in the use of big data for machine vision with a very attractive project, the “Visual Dictionary: Teaching Computers to Recognize Objects,” built on “80 Million Tiny Images.”2 You used 7,527,697 images to create a huge mosaic of tiles whose colors matched 53,464 English nouns, arranged by their meaning. It is a project intended to teach computers to recognize objects. I am interested in the big data side of your project because, for this book, Ricardo Baeza-Yates, at Yahoo! Labs, told me in a very simple way what big data was to him. He said (and I’ll paraphrase): “It’s what we do: we work with data generated by seven hundred million people. We collect and manage data, and we draw conclusions from them.”3 That is the scale of big data, and some aspects of your project work at a similar scale, right?
A.T.:
Yes.
A.P.:
The order of magnitude of the information that you handle, in relation to objects or people, is huge. Computing enables this nowadays.
A.T.:
Yes. The concept is simple. But it is not just a matter of having a lot of data but rather of having enough data for certain patterns to emerge. Imagine you want to see a movie, and instead of the whole movie I just give you three frames (and there are twenty-four frames per second). You would miss the whole story. That’s what I mean.
A.P.:
The story, the plot, the excitement. …
A.T.:
Many things. Maybe you get to grasp what the film is about, but that’s it, really. You miss the plot entirely. In AI, those three frames correspond to the type and amount of data that was used for many years to train systems. The volume of data was so small that the system faced a situation in which it was impossible for it to understand the world. No chance. Only three frames! To stay with the film example, big data means having enough frames to really understand the movie. That’s when you start getting the whole picture.
A.P.:
And even be moved by the film.
A.T.:
Indeed, to understand enough so that you do not miss anything. Then, having data about seven hundred million people is almost like watching the whole movie.
A.P.:
In society, or in a part of society. …
A.T.:
Of course, it is already a very large proportion.
A.P.:
Coming back to your work in the vision “neighborhood.” Can you already train computer systems? What is the state of the art in your field of AI?
A.T.:
Today’s computer systems must learn to see. The visual world is, in a sense, arbitrary. The fact that tables and chairs are the way they are is the result of a process in which human beings have designed objects under a series of functional constraints. But beyond those constraints there are freer parameters that, perhaps, allow you to be more creative, right? The style of the chair, for example. That variety of visual appearances that objects have is arbitrary; in fact, it changes with fashion. Therefore, developing vision systems that can see but do not learn makes no sense.
Progress has been made over the past twenty years in machine learning. The goal is, given a set of data, to draw out the essential properties that then allow you to generalize: you have to learn to identify the features that define the object. That is a very important area. In machine learning, having to deal with a huge quantity of data gives rise to a series of challenges. The first is technological, because you have to handle lots of information, store it, and gain quick access to it. To do that, new techniques must be developed that allow us to search very efficiently through a large amount of visual data. This is already an area with many developments, and it still needs people working on it and publishing new advances. And then another major problem to be solved applies particularly to computer vision: the creation of large databases in which images come with semantic information. One can take a picture, but that picture is not something we can directly feed into the computer for it to learn from; the computer must be told what the different objects that make up the picture are: here is a table, here is a bottle. Obtaining that information requires people to contribute to tagging such pictures on a large scale.
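As a rough, hypothetical illustration of what it means to learn from annotated examples in order to generalize, here is a toy sketch in Python. The feature values and labels are invented stand-ins for the kind of human annotations (“here is a table, here is a bottle”) described above; they are not real data or any system of Torralba’s.

```python
# Toy sketch: learning to label objects from annotated examples.
# Each "image region" is reduced to two invented features:
# (relative_size, elongation). The labels stand in for human annotations.
labeled_examples = [
    ((0.80, 1.2), "table"),
    ((0.70, 1.0), "table"),
    ((0.05, 3.5), "bottle"),
    ((0.04, 3.0), "bottle"),
]

def classify(features, examples):
    """Nearest neighbor: copy the label of the most similar annotated example."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(examples, key=lambda ex: distance(ex[0], features))
    return best[1]

# A new, unlabeled region: small and elongated, so it generalizes to "bottle".
print(classify((0.06, 2.8), labeled_examples))
```

The sketch also shows why the volume of annotations matters: with only four labeled examples the system generalizes poorly, and it improves only as more tagged data is added.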
A.P.:
That is a social aspect of research, right?
A.T.:
Indeed. In the construction of these annotated databases, it is really difficult for a single laboratory to build up a sufficient volume of data on its own. So social media are being used intensively so that people can also contribute to research in this field. There are already tools that allow people to annotate images, and that information is then made available to researchers.
A.P.:
From what you said earlier, I gather that Google’s driverless car, for instance, would not be possible if there were not a gigantic visual database behind it, to be able to tell the car that is a street lamp, that is a dog, that is a pedestrian, and those lines on the ground represent a street crossing. You see a driverless car and you ask yourself how it does it, but we don’t really think about the big data behind it, as you said. Would this be a good example for today?
A.T.:
Yes. We must have not only visual knowledge about objects but also access to maps (for example, Google Street View) and to the three-dimensional structure of the scene. Just from the GPS coordinates of the vehicle, a lot of semantic information about the car’s environment can be obtained.
A.P.:
And the system must also say: a dog is crossing, a person is waiting. …
A.T.:
Of course, and then it has to be able to discriminate things that are not fixed.
A.P.:
You mean they are not fixed in the territory.
A.T.:
Exactly.
A.P.:
Moving things, random things that may suddenly appear. …
A.T.:
Yes, and there are things that change over time.
A.P.:
When a moving object appears, the system must know whether it is a dog, but it has to compare patterns of dog pictures to be able to do this.
A.T.:
Exactly.
A.P.:
And that requires a huge amount of information that must be handled by a car that looks so normal, doesn’t it?
A.T.:
Yes. It has to deal with an enormous amount of data because the list of objects it may come across is not a short one. In fact, there are many possible objects.
A.P.:
Well, your job is huge, too.
A.T.:
Yes, it is.
A.P.:
Antonio, it has been a pleasure. Thank you for your time and your ideas.
A.T.:
You are welcome. Thank you for the conversation.