By Sam Ramji, who leads strategy at DataStax
Is giving people the right to change and extend your software a good thing? What about doing the same for your data? Companies used to think that publishing their source code meant giving away their secret sauce.
But they’re beginning to recognize the impact open source has had on the technology all around them, from mobile devices to TVs, and how powerful a vehicle for change it is.
What if your secret sauce was the data you owned, and not the source code? Would you be as comfortable making it public? Is it possible to have a General Public License (GPL) for data?
I recently sat down with Larry Augustin to delve into this topic. Augustin is an open-source titan: he was part of the group that coined the term “open source.” He led the first open source IPO at VA Linux, led SugarCRM for a decade, and most recently, he served as Vice President of Applications at Amazon Web Services (AWS), responsible for services including Connect, Pinpoint, SES, Workspaces, Chime, Alexa for Business, and many others.
From open source to open source data
Augustin was in the open source world at its origins. He has watched open source move like the tide, receding into the distance at times, then rushing back in a gigantic wave.
Back in the 1990s and early 2000s, open source was the new kid on the block. Some people were excited about it, while the majority asked questions like “why does it matter?” and “what is the strategy around it?” until the hype wore off. Now, in the 2020s, businesses are being built on an open source model by default.
Augustin describes open source’s transition from data centers, like the ones he was building in the Linux days, into consumer devices. The consumer often has no idea how much open source benefits them. As he points out, you wouldn’t have a functioning TV without it: a look into your TV’s settings will likely show you the open source licenses of the software used to build it.
The future of software, however, is not about the source code. It’s about the data. In an AI-centric world, the machine learning code itself is not the powerful part. Its purpose is to enable training, that is, to build a system of neural weights from vast streams of data. Given the data, you can reproduce the AI; with just the code, you cannot.
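To make that concrete, here’s a toy sketch (not from any real system) showing that the same training code yields a completely different model depending on the data it sees:

```python
import random

# Toy illustration: the "AI" is the learned weights, and the weights come
# from the data. The training code alone cannot reproduce the model.
def train(data, steps=1000, lr=0.01):
    """Fit y = w * x with stochastic gradient descent; return the learned weight."""
    w = 0.0
    for _ in range(steps):
        x, y = random.choice(data)
        w -= lr * 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)**2
    return w

# Same code, different data, different model: the data is the secret sauce.
print(round(train([(x, 2 * x) for x in range(1, 10)]), 2))  # learns w ≈ 2
print(round(train([(x, 5 * x) for x in range(1, 10)]), 2))  # learns w ≈ 5
```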
So as we move into the future, Augustin sees an “AI-native” era of apps and businesses built on the premise that AI-powered software elevates human work.
“Why should a salesperson have to enter data that the system already knows? Smart systems should import that data automatically. That’s a design principle I call ‘zero data entry.’ Instead, software should be helping the salesperson do their job. For example, help the salesperson know what information the customer likely wants next. I call that creating a ‘system of action,’ one that helps the person do something (take action) in their job,” Augustin said.
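As a rough sketch of what “zero data entry” could look like in practice, here’s a minimal example in which a record is assembled from systems that already know the answer; every source and field name below is a hypothetical stand-in, not a reference to any real CRM:

```python
# A hypothetical sketch of "zero data entry": the system assembles a contact
# record from data it already has, instead of asking the salesperson to type it.
# All sources and field names are illustrative assumptions.

class CalendarSource:
    def lookup(self, email: str) -> dict:
        # e.g., infer the contact's name from past meeting invitations
        return {"name": "Jordan Example"}

class CompanyRegistrySource:
    def lookup(self, email: str) -> dict:
        # e.g., infer the employer from the email domain
        return {"company": email.split("@")[1]}

def enrich_contact(email: str, sources: list) -> dict:
    """Build a contact record by merging what existing systems already know."""
    contact = {"email": email}
    for source in sources:
        contact.update(source.lookup(email))  # the user types nothing
    return contact

record = enrich_contact("jordan@example.com", [CalendarSource(), CompanyRegistrySource()])
print(record)  # {'email': 'jordan@example.com', 'name': 'Jordan Example', 'company': 'example.com'}
```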
To reach the AI-native future, we’re going to have to figure out how to apply the heuristics of open source to the world of open source data.
Open source: two core themes
Augustin points to two core themes that have had a great impact on open source software, and he believes both should apply to open source data as well. The first is the ability to extend, enhance, and reuse software. The second is the ability to fix a bug or repair a problem.
Extend, enhance, and reuse
You’ve likely run into the urge to extend or enhance code: while using a piece of software, you notice one small thing that, if changed, would make your life easier. Open source grants you the freedom to make that change and share it with other people in the same situation.
Extending, enhancing, and reusing also apply to open source data, but it isn’t as simple as just sharing the data. As Augustin puts it: “You have to have the correct licensing. There are access mechanisms. Does that mean you get the data in a structured format? Do you need to change the schema? People who think about data all the time don’t always think about the metadata that goes with it.”
Augustin has seen many companies provide data without its metadata. Metadata is a key component: it records the history and causality of how the data was generated. Without it, the value of the data collapses, because we have crippled our ability to trust and analyze it.
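To make the metadata point concrete, here is a minimal sketch of what shipping a dataset alongside its metadata might look like; the manifest fields below are illustrative assumptions, not an established standard:

```python
from dataclasses import dataclass, field, asdict
import json

# A minimal, illustrative manifest for sharing a dataset alongside its
# metadata: license, provenance (how the data was generated), and schema.
@dataclass
class DatasetManifest:
    name: str
    license: str               # e.g. an SPDX identifier such as "CC-BY-4.0"
    source: str                # where the data came from
    collected_by: str          # who generated it
    collection_method: str     # how it was generated (instrument, survey, ...)
    schema: dict = field(default_factory=dict)  # column name -> type

manifest = DatasetManifest(
    name="cell-imaging-2024",
    license="CC-BY-4.0",
    source="https://example.org/datasets/cell-imaging-2024",
    collected_by="Example University imaging lab",
    collection_method="fluorescence microscopy, batch-corrected",
    schema={"image_id": "string", "cell_count": "int", "stain": "string"},
)

# Ship the manifest next to the data files so consumers can judge
# licensing, provenance, and structure before they reuse the dataset.
with open("manifest.json", "w") as f:
    json.dump(asdict(manifest), f, indent=2)
```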
Fixing bugs and repairing problems
The second core theme is the ability to fix a bug or repair a problem. It’s frustrating when one little thing prevents you from using software the way you want, all because of a small oversight in the code or a murky understanding of its internal workings.
As an example, Augustin brought up an issue he ran into using QuickBooks at a startup many years ago: “I was using QuickBooks to do the accounting. And there was this field. If I put in 12 characters, it crashed. But if I put in 11, everything worked. And it was very clear when you put the 12 characters in, it went off the end, and boom, everything blew up. I could see the person writing this code thinking, ‘Oh, yeah, these things will never be longer than 11 characters.’”
Augustin contacted QuickBooks support, but they weren’t interested in fixing the problem. It’s an example of why open source is so attractive: when you run into a software issue, you don’t have to “live with it” or wrangle with workarounds. You can change the code and share the fix with others who run into the same problem. It’s “permissionless innovation,” as Vint Cerf put it so well.
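QuickBooks’ internals were never public, but the behavior Augustin describes (fine at 11 characters, crash at 12) is the classic signature of a fixed-size field with no length check. Here is a hypothetical Python sketch of the kind of one-line guard an open source user could have added and shared; the field size and function name are assumptions:

```python
MAX_FIELD_LEN = 11  # hypothetical: the field in the story worked at 11 characters, crashed at 12

def set_account_name(record: dict, name: str) -> None:
    """Store a name destined for a fixed-width field, rejecting oversized input.

    In the closed-source version of this story, input one character too long
    overran the field and crashed the program. With access to the source, the
    fix is a guard like this one, which anyone can add and share.
    """
    if len(name) > MAX_FIELD_LEN:
        raise ValueError(
            f"account name must be at most {MAX_FIELD_LEN} characters, got {len(name)}"
        )
    record["account_name"] = name

ledger_entry: dict = {}
set_account_name(ledger_entry, "Acme Co.")        # 8 characters: fine
# set_account_name(ledger_entry, "Acme Holdings") # 13 characters: a clear error now, not a crash
```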
Data also needs to be “fixed” at times. It can be hard to think of data as “broken,” but Augustin said that he rarely sees a clean dataset. And the larger the dataset, the greater the amount of “noise” in the data. The ability to improve the signal-to-noise ratio is an important part of opening up data.
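As a toy illustration of improving the signal-to-noise ratio, here is a short sketch that drops duplicate rows, missing values, and an obvious sentinel value; the column names and data are made up for the example:

```python
import pandas as pd

# Toy example: "repairing" a noisy dataset by removing duplicates and
# rows with missing or impossible values. Column names are illustrative.
raw = pd.DataFrame({
    "sensor_id": ["a", "a", "b", "c", "c"],
    "reading":   [21.5, 21.5, -999.0, 19.8, None],
})

clean = (
    raw.drop_duplicates()            # exact duplicate rows add no signal
       .dropna(subset=["reading"])   # missing readings can't be analyzed
       .query("reading > -100")      # -999 is a common 'missing' sentinel
)

print(f"kept {len(clean)} of {len(raw)} rows")
```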
What is the GPL for data?
Just as contributors in the software world give up some control through a contribution agreement, contributors of open source data have to give up some rights to their data. The question we’re facing now is: what would that agreement, a General Public License (GPL) for data, look like?
“On the data side, what are the set of rights that a contributor of data needs to give up to still feel comfortable that they can use their data the way they want to, the way they intended, that they haven’t sort of lowered their own rights?” Augustin says.
Contributors who understand this trade-off enable the open source community to enhance their data and create new things from it.
Such an agreement also opens up the possibility of accelerating human progress. For instance, academic researchers in the biological sciences are producing brand-new data; sharing it would give others the opportunity to train new models on it.
The data-in-to-data-out ratio
If we take the GPL for data one step further, we arrive at the value equation of data, or “the data-in-to-data-out ratio,” as Augustin calls it. It explains why people are so willing to give up parts of their data and privacy to websites: the small amount of data they hand over returns far greater value to them.
Augustin sees the data-in-to-data-out ratio as a tipping point for open source data. As one of his application design principles, he suggests that data engineers focus on providing users with more value while taking less and less information from them.
He also wants to figure out how to never ask users for anything at all, only ever providing them an advantage. New users of an app, for example, are always asked for information up front. How can we skip that step and collect data in the course of delivering value instead?
“Most people are willing to [give up data] because they get a lot of utility back. Think about the ratio of how much you put in versus how much you get back. You get back an awful lot. People are willing to give up so much of their personal information because they get a lot back,” he says.
The future landscape of AI-native applications will generate billions of dollars by improving the efficiency of enterprises as systems. Perhaps more importantly, we have a chance to make work more meaningful and joyful for people freed from data administration to create value. AI has taught us that computers can learn things and know things. What’s special about humans is that we are creative beings who love to spend our time connecting with other humans. Let’s design a future where AI sets us free.
Learn more about DataStax here, and subscribe to the Open||Source||Data podcast.
About Sam Ramji:
Sam leads strategy at DataStax. A 25-year veteran of the Silicon Valley and Seattle technology scenes, Sam led Kubernetes and DevOps product management for Google Cloud, founded the Cloud Foundry Foundation, helped build two multi-billion-dollar markets (API Management at Apigee and Enterprise Service Bus at BEA Systems), and redefined Microsoft’s open source and Linux strategy from “extinguish” to “embrace.”
He is nerdy about open source, platform economics, middleware, and cloud computing with emphasis on developer experience and enterprise software. He is an advisor to multiple companies including Dell Technologies, Accenture, Observable, Insight Engines, and the Linux Foundation.