Free the Data!

Development of AI depends heavily on access to deep reservoirs of clean, structured data from which to learn. The giant firms that have it aren’t sharing.

By Chad Steelberg | edited by Dan Bova | Jan 04, 2018

Opinions expressed by Entrepreneur contributors are their own.

“What good are wings without the courage to fly?” These words of wisdom come to mind as I consider the open-source craze among leading artificial-intelligence technology providers.

Top firms, including IBM, Google and Facebook, have opened the source code of their artificial intelligence software tools, making them available for developers to use in their own devices and applications. This is most certainly a good thing, for the companies themselves and for the AI business generally.

However, open source is only part of the equation. Unlike previous generations of software, AI algorithms are worthless without a dataset to work on. And in contrast to their open-source code policies, these companies maintain a closed-data stance, hoarding their vast information repositories as a competitive advantage for developing better AI technology.

Essentially these companies have given us wings — but have denied us the sky. What the top tech firms need is the courage to stop hoarding information and embrace open data, giving the rest of the world access to the information required for AI cognitive engines to attain their full potential.

The data-rich get richer.

In the age of AI, a new 1 percent is arising. This upper, upper crust consists of companies blessed both with machine-learning technology and with large quantities of information.

Google, Facebook, Amazon and Microsoft have been dubbed “the Superrich” of the AI business. Although very few such companies exist, they reportedly hold a massive advantage over everyone else in machine learning because they have access to vast amounts of clean, structured data.

Such data is needed to train machine-learning algorithms, giving them the basic information they need to function on their own in the real world. For example, an object-recognition algorithm designed to recognize cats in photos will be trained by reviewing massive numbers of images depicting felines. These images need to have some structure, i.e., they must be tagged with keywords that properly indicate they are depicting cats.

As a rule, the more training data an algorithm sees, the better it performs, because more examples give it more patterns to learn from. Conversely, too little training data can produce algorithms that deliver substandard results, sometimes to the extreme embarrassment of their creators.
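
To make the link between data volume and performance concrete, here is a minimal sketch in Python. It is a toy illustration and not any company's actual method. The two classes are synthetic numbers drawn around 0 and 2, the "algorithm" simply learns a cutoff halfway between the two class averages, and it is assumed to know in advance that the second class has the larger values. All function names and parameters are made up for this example.

```python
import random

def make_samples(n_per_class, rng):
    """Return (value, label) pairs: label 0 is centered at 0.0, label 1 at 2.0."""
    samples = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    samples += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return samples

def train_threshold(samples):
    """Learn a decision threshold: the midpoint between the two class means."""
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0

def accuracy(threshold, samples):
    """Fraction of samples whose label matches the thresholded prediction."""
    correct = sum(1 for x, y in samples if (x > threshold) == (y == 1))
    return correct / len(samples)

def learning_curve(train_sizes, seed=0, test_per_class=2000, trials=50):
    """Average test accuracy for each training-set size (examples per class)."""
    rng = random.Random(seed)
    test_set = make_samples(test_per_class, rng)  # held out, never trained on
    curve = {}
    for n in train_sizes:
        scores = [accuracy(train_threshold(make_samples(n, rng)), test_set)
                  for _ in range(trials)]
        curve[n] = sum(scores) / trials
    return curve

if __name__ == "__main__":
    for n, acc in learning_curve([1, 5, 50, 500]).items():
        print(f"{n:>4} examples per class -> mean accuracy {acc:.3f}")
```

Running the script prints the average accuracy for each training-set size. In a setup like this, tiny training sets tend to give an unstable cutoff, while larger ones settle near the best cutoff the data allows. Real systems are far more complex, but the dependence on labeled examples is the same.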

Because of this, the usefulness of an AI algorithm is intrinsically tied to the availability of high-quality data. In this regard, AI algorithms are fundamentally different from other types of software, whose code is valuable on its own without any additional data.

Thus, when a company open-sources an AI cognitive engine such as a translation tool, it’s not the same as open-sourcing a piece of traditional software, like a spreadsheet. Without also providing access to the data, open isn’t really open.

Close-minded.

Such data denial is no accident. Rather, it’s part of a deliberate strategy to maintain a competitive advantage. With AI models well known and widely distributed, the dataset is the one asset that can be locked away and kept from rivals.

That’s why top technology players are hoarding data. For example, IBM didn’t buy The Weather Channel’s data operations because it wanted to know if it’s going to rain in Tallahassee tomorrow.

Weather is arguably the number-one external factor driving global GDP. By combining The Weather Channel’s vast repository of climate-related information with its Watson AI, IBM can take the lead in weather forecasting for private businesses, allowing it to do everything from predicting winter energy demand to forecasting crop yields.

This gives IBM a huge market impact and a built-in advantage that will be hard for other companies to match.

Google, Facebook and others hold similar advantages in their respective areas, possessing vast quantities of consumer and social-media data that can be used to train AI for highly valuable tasks, from sentiment analysis for marketing to object recognition for photos to natural language processing for user interfaces.

Open season.

Examples of open AI software tools offered by technology powerhouses include:

Google’s TensorFlow, which is designed for building and training neural networks.
Microsoft’s Computational Network Toolkit (CNTK, since renamed the Microsoft Cognitive Toolkit), which can be used for applications including machine translation, image recognition, image captioning, text processing, language understanding and language modeling.
IBM’s SystemML, which can be used to create customized machine-learning software.
Facebook’s deep-learning technologies, which the company has donated to an open-source software project known as Torch.

With such initiatives, these companies are essentially giving away software that’s the product of enormous investments in manpower and intellectual property. However, these efforts are far from altruistic; by proliferating their technologies, companies aim to build large communities of developers accustomed to using their tools, establishing them as standards in the AI market.

Furthermore, with the real value of AI locked up in their proprietary data, these companies have little to lose by giving away their software tools.

Data dump.

So how can companies be convinced to give up their prized data for the greater good of the AI business? One example can be found in an Uber initiative called Movement, which opens up data collected by the company’s fleet of cars regarding urban traffic patterns. Via the Uber Movement website, city planners can gather information to help improve traffic conditions.

What’s in it for Uber? The company doesn’t conduct road planning and construction itself, so providing this information to planners allows the government to make changes that improve driving conditions. That, in turn, means a better experience for Uber’s drivers and riders.

For AI tech companies with large treasure troves of data, there may be other opportunities to open up access to information in order to stimulate broad societal benefits. These benefits could indirectly boost demand for their technologies.

The AI market is ready to take wing — now all the big players need to do is give clearance for takeoff by having the courage to open up their data.
