It’s the summer of 1966.
Seymour Papert, a South African mathematician and computer scientist, has just joined the Artificial Intelligence group at MIT. Next year, he will co-invent the programming language LOGO. But for now, he and his colleague Marvin Minsky are doing something he thinks is much simpler: assigning the Summer Vision Project to undergrads.
The aim of the project is to build a system that can analyze a scene and identify objects in it — ‘a real landmark in the development of pattern recognition’.
So the vast, puzzling field of computer vision, which research scientists and tech giants are still trying to decode, was first thought simple enough for an undergraduate summer project by the very people who pioneered artificial intelligence.
The challenge of building a broad visual system turned out to be rather trickier than Papert expected. And it’s still proving difficult, despite all our advances in technology.
Visual AI is a tough nut to crack
Computer vision is not just a way to convert pictures into pixels, and a machine can’t make sense of a picture from its pixels alone. It’s the ability of a machine to step back and interpret the big picture that those pixels represent. And that’s much harder than we think it is.
For instance, when we see a picture of a model wearing a dress, we automatically identify which part of the body we’re looking at and from which angle. We can figure out lighting conditions. We may even be able to judge the color and texture of the clothes based on shadows, highlights and color temperature.
(Every image in a catalog is different — backgrounds, pose, lighting, positions, and even image quality can all change. A nightmare for a computer. Images from Voonik.com.)
Show the same picture to a computer that’s been trained to identify only people and clothing, and it can easily get sidetracked by shadows. It can even fail outright if a reference it’s been trained to identify is missing or hard to separate from other elements: say, a decorative gold pattern on the wall directly behind the model that resembles the gold embroidery on her dress.
And it will, of course, fall apart spectacularly if you swap the photo for one of a T. rex battling King Kong, because it has no clue what a dinosaur or a giant gorilla looks like.
Computer vision is extremely good at narrowly defined tasks. But human vision is holistic: it can interpret a wide range of things it detects, something we’ve yet to crack with AI. Tasks that seem easy to us aren’t nearly as simple to model for a computer system.
But developing even a fraction of that ability can drastically change the way we interact with technology. Terry Winograd, a computer science professor at Stanford, wrote in his book ‘Understanding Computers and Cognition: A New Foundation for Design’:
“In working with people we establish domains of conversation in which our common pre-understanding lets us communicate with a minimum of words and conscious effort. If machines could understand in the same way people do, interactions with computers would be equally transparent.”
That kind of transparency translates into a cornucopia of use cases in business and tech. The upshot of these applications is that people don’t have to overcome the barrier of understanding how a system works before they can use it.
Robot store assistants could behave as naturally as their human counterparts. Conversational bots may seem more like a person you talk to over the phone every day. And web applications may be so specifically tailored to you that you don’t need to navigate anywhere when you log in: what you need will be on the first page.
But we aren’t there yet. For years, the technology has been employed for a very particular set of highly complex tasks. Recent impressive successes in computer vision on benchmarks such as the ImageNet competition show that AI is moving forward quickly. But it takes a different set of skills to make computer vision easily accessible to millions of regular people doing regular things like shopping, booking tickets, ordering meals and making appointments.
So how can we bridge the gap between what seems like a larger-than-life technology that isn’t perfect yet, and its implementation to solve problems grounded in everyday lives?
Bringing in the human factor
One of the practical ways to make visual AI usable now is to augment it with human intelligence. Rich and Knight defined AI as:
“…The study of how to make computers do things at which, at the moment, people are better.’’
And people are simply better than machines at understanding the nuances of vision; there’s no splitting hairs about it.
While perfectly human-like computer vision remains the ultimate dream, we can achieve results that make AI a viable commercial option and greatly reduce errors with ‘beautiful assemblages of humans and computers’, as Shreeharsh Kelkar puts it in the CASTAC blog:
"The reality of computer vision and AI today is that it needs human help for optimal performance."
Take the instance of how DARPA took on the Twitter bot menace. It held a competition to identify the most effective methods for spotting influence bots on Twitter, bots that manipulate large-scale social decision making. The winning team used a pre-trained algorithm to find bot-like behavior, then used the output of that algorithm to train a machine learning model to find the rest.
But the pre-trained algorithm was built primarily on human input about how to find a bot, such as looking for unusual grammar or speech patterns that resembled already known bots. This helped the team find all the bots six days ahead of the deadline. And that’s exactly why human input helps: it’s the key to building useful products now.
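The two-stage idea can be sketched roughly like this. Note that this is a minimal, hypothetical illustration and not DARPA’s or the winning team’s actual code: the features (posting rate, duplicate-text ratio), the rule thresholds and the nearest-centroid classifier are all invented for the example. Human-written rules label only the obvious cases; a learned model then generalizes to the undecided accounts.

```python
# Hypothetical sketch of the two-stage approach: hand-written rules seed
# labels, then a simple learned model generalizes to the rest.
# All features, thresholds and the classifier are invented for illustration.

def rule_based_seed(account):
    """Human-designed heuristics: flag only the most obvious cases."""
    if account["posts_per_hour"] > 20 and account["duplicate_text_ratio"] > 0.8:
        return 1      # confidently a bot
    if account["posts_per_hour"] < 1 and account["duplicate_text_ratio"] < 0.1:
        return 0      # confidently a human
    return None       # undecided: left for the learned model

def centroid(rows, keys):
    """Mean feature vector of a group of accounts."""
    return {k: sum(r[k] for r in rows) / len(rows) for k in keys}

def train(accounts, keys=("posts_per_hour", "duplicate_text_ratio")):
    """Fit a nearest-centroid classifier from the rule-labeled seed accounts."""
    seeds = [(a, rule_based_seed(a)) for a in accounts]
    bots = [a for a, y in seeds if y == 1]
    humans = [a for a, y in seeds if y == 0]
    return centroid(bots, keys), centroid(humans, keys), keys

def classify(account, model):
    """Assign an account to the nearer centroid (1 = bot, 0 = human)."""
    bot_c, human_c, keys = model
    dist = lambda c: sum((account[k] - c[k]) ** 2 for k in keys)
    return 1 if dist(bot_c) < dist(human_c) else 0

accounts = [
    {"posts_per_hour": 30, "duplicate_text_ratio": 0.9},    # obvious bot
    {"posts_per_hour": 25, "duplicate_text_ratio": 0.85},   # obvious bot
    {"posts_per_hour": 0.5, "duplicate_text_ratio": 0.05},  # obvious human
    {"posts_per_hour": 0.8, "duplicate_text_ratio": 0.02},  # obvious human
]
model = train(accounts)
# A borderline account that neither hand-written rule fires on:
print(classify({"posts_per_hour": 15, "duplicate_text_ratio": 0.7}, model))  # → 1
```

The human contribution sits entirely in `rule_based_seed`: the machine only extends judgments people made first, which is the pattern the competition rewarded.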
Also, when you think about building or using a computer vision product, the purpose of the vision is always to prompt an action. For example,
- Savioke’s hospitality robot can bring things to your hotel room. After it arrives at your door, it first sees that the door is closed and calls your phone. When you open the door, it sees that it’s open and opens its lid so you can take out your delivery. Every instance of ‘sight’ is followed by an appropriate response.
- An augmented reality app like Layar uses computer vision to recognize content in the real world, then overlays it with digital content that extends the experience, whether that’s an informative video, a news article or a discount page online.
- Visual intelligence products like our own Vue.ai identify what a customer is looking for, say a striped sweater, and then show the person visually similar items: a slow, tedious job if shoppers have to do it manually.
(Vue.ai’s visual recommendations on the Yepme store)
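Under the hood, ‘visually similar’ typically means nearby in some feature space: each image is reduced to a feature vector, and similarity becomes a distance between vectors. Here is a minimal sketch of just the ranking step; the product names and the tiny three-number vectors are invented stand-ins for the embeddings a real image model would produce, not Vue.ai’s actual pipeline.

```python
import math

# Hypothetical toy catalog: in practice these vectors would come from a
# trained image model, not be written by hand.
CATALOG = {
    "striped_sweater_a": [0.9, 0.1, 0.8],
    "plain_tee":         [0.1, 0.9, 0.2],
    "striped_sweater_b": [0.85, 0.15, 0.75],
}

def cosine_similarity(a, b):
    """Similarity of two feature vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(query_vec, catalog, top_k=2):
    """Rank catalog items by similarity to the query image's vector."""
    ranked = sorted(catalog,
                    key=lambda name: cosine_similarity(query_vec, catalog[name]),
                    reverse=True)
    return ranked[:top_k]

# A shopper's query image, already embedded as a vector:
print(most_similar([0.9, 0.1, 0.8], CATALOG))  # → ['striped_sweater_a', 'striped_sweater_b']
```

The ranking itself is mechanical; deciding what counts as ‘similar’ for shoppers, and which features should feed the embedding, is exactly the kind of human judgment the previous section argues for.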
Taking computer vision and giving it a purpose that is useful, engaging and immersive lies firmly in the domain of human expertise as well.
It’s encouraging that so many companies in different verticals, like mobile, entertainment, health and automotive, are attempting to integrate computer vision into their businesses with a human touch, and succeeding.
Research and Markets reports that the computer vision market will grow from $5.7 billion in 2014 to $33.3 billion by 2019, and that the highest growth rate will be in the consumer segment, followed by robotics and machine vision.
Google’s DeepMind defeating the Go world champion is an absolutely amazing breakthrough, but for now, we also need to pay attention to the smaller, simpler problems people face every day that need elegant solutions. And that takes more than a vast knowledge of computer science and neuroscience, vital and hard to find as that is. It also takes an understanding of human nature.
Solving esoteric Gordian knots is not the only focus of AI anymore. Computer vision is breaking out from the bastions of research into commercial and social spaces. And it’s here to stay.