The Data Conundrum: Open Source AI Is Not As Open As You Think

Anyone following the AI world these days might think that there are essentially two types of AI models: those that are “open source” and those that are not. Most large-company players have taken a stance in this debate and are actively marketing their position. ChatGPT is probably the most famous “closed source” model; the Llama family, on the other hand, is emblematic of open source models.

What Is Open Source AI?

“Open source AI” usually entails the model file (typically a file containing a large blob of parameters) being made publicly available so that users can freely plug the model into their application, and even fine-tune it to suit their purpose. Usually, tools to do inference (prediction) on new data are also released alongside the model file. This is the case for the Llama models, for example, and also the recently released Gemma models.

However, there is more here than meets the eye. The term “open source” comes from the software world, where a user who has access to the source code can not only use it, but also understand and modify it to best suit their objective. AI models, on the other hand, are not reducible to software. While software is used in the process of training and serving an AI model, the key missing piece is the training data. With the notable exception of the newly released OLMo model, open source AI models release only the model weights and the inference code - not the data or the exact methodology used to build the model, both of which are essential if your goal is to build a model that you actually own.

In a world where the AI industry is converging towards publicly available foundation models (Large Language Models, but also models for image and video), specialized, proprietary datasets will be a key competitive advantage. If everyone has access to great models, the main differentiator is quality data.

While text or image models heavily rely on data harvested on the web, predictive ML models need specialized, domain-specific data - which is expensive to collect, and expensive to buy.

The Data Ownership Spectrum

On the broad spectrum of data ownership, there are two extremes. One extreme includes data that is publicly available to everyone: Wikipedia, ImageNet, or even the good old UCI repository. These datasets are commonly used to train AI models used for research, and they can serve as a starting point for a data scientist. They don’t necessarily confer a distinct competitive advantage, but they allow models to be built and experimented with. This is the “visible matter” of the data universe.

On the other side of that data spectrum lie proprietary datasets that are undisclosed and unavailable. Think of these datasets as holding information that is very sensitive and highly valuable or strategic to their custodian. Examples include a company's own proprietary data, or classified legal documents. The space between those two ends of the spectrum is home to a broad range of proprietary datasets that vary in their value and difficulty to access. These datasets may range from climate and weather data to something as specific as driving habits in a certain geography. Although it may be difficult to locate, identify, and acquire these datasets, there is a ton of value that data scientists could extract from them. The challenge? Getting their hands on them in the first place. The process of acquiring and using these third-party datasets (the “dark matter” of the data universe) is time-consuming, expensive, and doesn’t guarantee results.

How do Data Science Teams Evaluate Data?

Proprietary datasets are expensive, and often involve deals in the six or seven figures range between a buyer and a data supplier.

The current solutions available to data science teams for evaluating the impact of these datasets on their models are not ideal. For example, a team can purchase a sample dataset and test it out, but because a sample isn’t necessarily representative of the full dataset, this rarely creates the confidence required to justify such a large purchase.

Another approach is to use a data clean room. Think of this as a secure, neutral environment to which multiple parties can upload their datasets. Here, the data science team would upload their dataset, and so would the data vendor, allowing the team to evaluate the vendor’s data by training a model. While this is technically a good solution, neither company wants their data to be compromised, especially if the clean room runs on a third-party cloud. They will therefore enter the clean room environment with a reduced or anonymized version of their data, hindering the ability to accurately evaluate it. All of the approaches for data evaluation have trade-offs, and one has to sacrifice some privacy, time, or accuracy in order to find a workable setup.

A Novel Approach for Data Evaluation

This is where Federated Learning (FL) comes into play. FL is a set of techniques that allow multiple parties to train a machine learning model together, without sharing any data. Using FL, each data custodian holds on to their data and trains a small piece of an overall AI model, exchanging only encrypted information about the parameters.
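To make the idea concrete, here is a minimal sketch of the best-known FL technique, federated averaging (FedAvg), on a toy linear model. It is an illustration only: the encryption of the exchanged parameters mentioned above is omitted, and all names and numbers are hypothetical.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's contribution: a few steps of gradient descent
    on a linear model (squared loss), using only its local data.
    Only the updated weights leave the client; (X, y) never do."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """Server step: average the clients' locally trained weights,
    weighted by local dataset size (FedAvg)."""
    updates = [(local_update(global_w, X, y), len(y)) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(w * n / total for w, n in updates)

# Toy setup: two data custodians hold disjoint slices of the data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(2):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
# w now approximates true_w without any party sharing raw data
```

In a production system the weight updates would be encrypted or aggregated securely before reaching the server; the averaging logic itself is unchanged.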

FL opens the door to another interesting and valuable concept: federated data evaluation. This is part of what we’re building at integrate.ai - a platform enabling data consumers to make high-confidence decisions about data purchases, without a protracted process of data exchange or time-consuming POCs. We refer to the combination of FL and federated data evaluation as ‘Federated Data Science’.

Suppose that a data scientist at a major insurance company works on assessing weather-related risks in order to improve the company’s pricing model. They are looking to buy data from one of several providers offering property-level intelligence that, they hope, will enhance their model and increase its accuracy.

Using a federated platform, the data scientist can train a model that combines data from all of those providers with their own. Not only that, but they can also assess the influence that each dataset has on the accuracy of their model. They can then decide which dataset to buy - either outright, or by subscribing to continued federated access from the data provider, depending on their business agreement.
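One simple way to carry out such an assessment - a hypothetical sketch, not a description of integrate.ai's actual platform - is to score each provider by the held-out error reduction its features bring over the buyer's baseline model. For brevity this sketch pools the columns in one place; in a real federated setting the same fits would happen without any provider's data leaving its custodian.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# The insurer's own features, plus three candidate providers'
# features for the same properties (toy data: only provider B
# actually carries signal about the target).
base = rng.normal(size=(n, 3))
providers = {"A": rng.normal(size=(n, 2)),
             "B": rng.normal(size=(n, 2)),
             "C": rng.normal(size=(n, 2))}
y = base @ [1.0, -2.0, 0.5] + providers["B"] @ [3.0, 1.5] \
    + rng.normal(scale=0.1, size=n)

train, test = np.arange(300), np.arange(300, n)

def holdout_mse(X):
    """Ridge regression fit on the training rows, scored on held-out rows."""
    w = np.linalg.solve(X[train].T @ X[train] + 1e-3 * np.eye(X.shape[1]),
                        X[train].T @ y[train])
    return np.mean((X[test] @ w - y[test]) ** 2)

baseline = holdout_mse(base)
for name, P in providers.items():
    lift = baseline - holdout_mse(np.hstack([base, P]))
    print(f"provider {name}: held-out error reduction {lift:.3f}")
```

Ranking providers by this lift tells the data scientist which dataset actually moves their model - here, provider B - before any money changes hands.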

Open source AI has a very important role to play in the overall ecosystem by leveling the playing field, but finding ways of unlocking the value of proprietary data is key to getting real value out of AI. Federated Data Science can be a pivotal approach that enables data science practitioners to access the “dark matter” of data.

If you're interested in learning more about how we're leveraging federated learning technology to improve the data evaluation process, you can reach out to our team here.
