Data-mature organizations are starting to see diminishing returns on their data investments because of a new limiting factor: the data itself. This has played out most visibly in AI and machine learning, where there is a new focus on data-centric AI (improving the data to improve model performance) rather than the traditional model-centric AI (improving the algorithm to improve model performance). Organizations have access only to the data they can collect or the data they can buy. For many, this means that they do not have enough data, or do not have the right data, to execute on strategic priorities. Where data was once a tool for growth, there is now a risk of data stagnation.
This has led to the growth of data collaboration
Many industries can benefit from external data sources. For example, healthcare applications could better predict diagnoses, with less bias, if they had access to a wider set of data. Robotic Process Automation (RPA) systems that use ML models to predict fraud or perform AI-based document analysis can likewise perform better when those models are trained on wider datasets.
In the quest for more data sources to augment their own data, leaders are looking at data collaboration. Organizations are turning to new or existing partners, and exploring what data already exists in their ecosystem that could support their data strategy. In 2018, a group of 23 research organizations, drug companies, and partners developed the European Health Data and Evidence Network to collaborate on real-world health data in a federated network. Recently, a group of insurers formed the CyberAcuView consortium to collaborate on data to “enhance cyber risk mitigation.” In some cases, trusted third parties like software providers who already work with a number of customers may see data synergy opportunities and bring them together as a consortium.
But there are major barriers to data collaboration
While a well-structured data collaboration can bring value to all parties, the challenges inherent in collaborating are often prohibitive. Some examples include:
- Regulation - Privacy and data residency rules in GDPR, CCPA, or PIPEDA limit what data can be used for and how data can (or cannot) be shared across jurisdictions.
- Legal barriers - Existing contracts can limit a company’s ability to use their partners’ data for specific purposes.
- Consent Limitations - SaaS companies may have access to their clients’ data, but only to execute a very narrow set of tasks. Enabling data usage outside of those use cases would require a new contract - and buy-in from the data owners.
- Patient Privacy - Healthcare organizations looking to collaborate across countries on something like drug discovery may not be able to use valuable patient data, despite its clear societal value.
- Resistance from data custodians - For data owners, there are significant competitive, security, and privacy implications for data collaboration. Data owners risk exposing their IP, or leaking their customers’ data.
Data sharing agreements are the first step
Organizations typically manage these challenges through data sharing agreements. These agreements can grow in complexity for one main reason: they rest on the assumption that data, once released, may be used for purposes far beyond the intended, allowable scope. Traditionally this risk is mitigated through de-identification, where identifying information is masked before the data is released. However, this only limits a data user’s ability to directly identify individuals in that data; it is still possible to re-identify individuals by correlating the de-identified data with external data sources. Furthermore, the access granted by a released dataset is broad and permanent: a sophisticated data user can still extract insights beyond the scope of the contractual agreement, and the only thing preventing it is enforcement of the contract, assuming the misuse is even caught.
Contracts are powerful tools, but they have limitations: negotiating terms and demonstrating and monitoring compliance are cumbersome tasks. Perhaps more importantly, the accountability they provide is point-in-time and reactive: Facebook’s Cambridge Analytica woes started with data they shared for academic research back in 2013.
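To make the re-identification risk concrete, here is a minimal sketch of a linkage attack. Every record, name, and field in it is invented for illustration, and real attacks use far larger external sources. It shows how a release with names removed can still be joined to a public list on the attributes both share.

```python
# Hypothetical "de-identified" release: names removed, other attributes kept.
released = [
    {"zip": "02138", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "hypertension"},
]

# Hypothetical external source that still carries names.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
]

# Attributes present in both datasets that can act as a join key.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return {name: diagnosis} for people matching exactly one released row."""
    matches = {}
    for person in public_rows:
        candidates = [
            row for row in released_rows
            if all(row[k] == person[k] for k in keys)
        ]
        # A unique match re-identifies the person; several matches leave doubt.
        if len(candidates) == 1:
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


reidentified = link(released, public)
print(reidentified)
```

In this toy data, one person is the only match for their combination of attributes and is re-identified, while the other is shielded only because a second released row happens to share the same combination.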
How to simplify data sharing agreements
There are several ways to simplify the contractual and data sharing agreement process, and so get value from the data sooner.
Tight Scoping in Use
Focusing contracts on the use of the data rather than its contents allows for simpler agreements between parties. The agreement tightly scopes what the data may be used for, while defining the types or ranges of data it covers only broadly. This pairs well with the next technique, minimizing data access, and reduces the need to update the contract when the data being shared changes.
Minimal Data Access
At the point of data access, share only the minimum amount of data other parties need to achieve their goals. Data exposure can then be expanded later if deemed fit.
Extending access this way is fast, since releasing additional fields of data under an existing agreement is quicker than having contracts redrawn. However, it requires working with partners who understand the need for minimal data use, rather than partners who simply request the maximum amount of data the contract allows.
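One way to picture minimal data access is a per-partner allow-list of fields applied at the point of access. The sketch below is an assumption-laden illustration: the function `minimal_view`, the field names, and the records are all hypothetical, and a real deployment would enforce this in the data platform rather than in application code.

```python
def minimal_view(records, allowed_fields):
    """Return copies of the records containing only allow-listed fields."""
    allowed = set(allowed_fields)
    return [
        {key: value for key, value in record.items() if key in allowed}
        for record in records
    ]


# Hypothetical records held by the data custodian.
records = [
    {"customer_id": 1, "postal_code": "M5V", "claim_amount": 1200.0,
     "email": "a@example.com"},
    {"customer_id": 2, "postal_code": "H2X", "claim_amount": 310.5,
     "email": "b@example.com"},
]

# Start with the least the partner needs for the agreed use.
partner_fields = ["claim_amount"]
initial_share = minimal_view(records, partner_fields)

# Later, if both parties agree, exposure grows by one field
# without changing how the data is produced or delivered.
partner_fields.append("postal_code")
extended_share = minimal_view(records, partner_fields)
```

Because the contract covers the use of the data and a broad range of data types, extending the allow-list in this way is an operational change and not a renegotiation.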
Employ existing standards
While drafting contracts, it’s worth leaning on existing frameworks that provide guidelines for the data within the industry. For example, GO FAIR Data Principles are commonly used as the industry standard for health data consortiums in Europe. In genomics, a possible framework is GA4GH’s Framework for Responsible Sharing of Genomic and Health-Related Data, which provides clear principles to consider when sharing sensitive health data. It’s worth examining these and other similar frameworks when drafting contracts, since many existing issues with data sharing in those specific industries have already been thought through.
New developments in Privacy Enhancing Technologies (PETs) challenge the assumption that data must be released at all. They focus on making it possible to perform useful data-related tasks while the data remains within the custodian’s control. This provides guarantees that are stronger than traditional de-identification, and in some cases can yield better results.
Differential privacy is one technique within this realm. It adds a configurable amount of noise to the data, or to results computed from it, before release. The method provides a quantifiable guarantee about the amount of privacy applied to the data, allowing data custodians to choose a level of noise as a trade-off against utility.
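As a rough illustration of that noise-versus-utility dial, the sketch below applies the Laplace mechanism, one standard way to achieve differential privacy, to a simple count. The data, the function names, and the choice of a counting query are assumptions made for the example. It is not hardened for production use; a real system should rely on a vetted differential privacy library and an unseeded, secure source of randomness.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential samples with the same
    rate is Laplace-distributed with scale 1 / rate.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values matching the predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages held by a data custodian.
ages = [23, 35, 41, 47, 52, 58, 64, 70]
rng = random.Random(42)  # seeded only to make the sketch repeatable

# Smaller epsilon means more noise: stronger privacy, lower utility.
strong_privacy = private_count(ages, lambda a: a >= 40, epsilon=0.1, rng=rng)
weaker_privacy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
```

The parameter `epsilon` is the configurable part: the custodian picks it, and the expected error of the released count follows directly from that choice.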
Federated Learning is a method of building machine learning models by aggregating model parameters instead of the underlying data. Even the most basic methods of aggregation can provide more protection than releasing the underlying data set that was used to train the model.
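The aggregation step can be sketched in a few lines. The example below assumes a weighted average of parameters in the style of federated averaging, with each site reporting only its sample count and its locally trained parameters. The site data and the function name are hypothetical, and local training, communication, and secure aggregation are all left out.

```python
def federated_average(updates):
    """Combine local model parameters into one global parameter vector.

    `updates` is a list of (num_samples, parameters) pairs, one per site.
    Each site's parameters are weighted by how much data it trained on.
    """
    total_samples = sum(num_samples for num_samples, _ in updates)
    dimension = len(updates[0][1])
    return [
        sum(num_samples * params[i] for num_samples, params in updates)
        / total_samples
        for i in range(dimension)
    ]


# Hypothetical updates from two data custodians. Only these numbers
# leave each site; the underlying training records never do.
site_updates = [
    (100, [1.0, 2.0]),   # site A: 100 training records
    (300, [3.0, 6.0]),   # site B: 300 training records
]

global_parameters = federated_average(site_updates)
```

What crosses the organizational boundary here is a short list of numbers per site, which is the basis for the claim that even basic aggregation exposes less than releasing the dataset itself.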
Contracts between parties can now be simplified, because the need to move identifiable or de-identified data is removed. It’s also possible to combine technologies to bolster the privacy protection of the data. For example, the model built using federated learning can be differentially private, which further protects individual privacy and the data custodian.
Other technologies within the realm of PETs include secure enclaves, homomorphic encryption, and secure multiparty computation. Many of these technologies are used in combination, but they all serve to provide a better way of sharing data, one that lowers the risk to data owners through technology rather than relying solely on a legal contract.
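To give a flavor of secure multiparty computation, here is a minimal sketch of additive secret sharing, a common building block for computing a joint sum. It assumes honest participants, non-negative integer inputs, and a total smaller than the modulus. The function names and values are hypothetical, and the whole exchange is simulated in one process, whereas real parties would send shares over a network.

```python
import secrets

# All arithmetic is done modulo a large number so that a single share
# looks like a uniformly random value and reveals nothing on its own.
MODULUS = 2**61 - 1


def share(secret, n_parties):
    """Split a secret into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate parties jointly computing a sum without revealing inputs."""
    n = len(private_values)
    # Row i holds the shares party i sends out, one to each party.
    distributed = [share(value, n) for value in private_values]
    # Party j adds up the shares it received (column j) and publishes
    # only that partial sum.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS
        for j in range(n)
    ]
    return sum(partial_sums) % MODULUS


# Hypothetical private inputs, e.g. incident counts at three insurers.
total = secure_sum([120, 45, 300])
print(total)
```

Each party learns the total but sees only random-looking shares of the others' inputs, so the protection comes from the arithmetic, not from a promise in a contract.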
Technological approach to data protection
Leveraging these technologies ultimately leads to simpler contractual agreements, since the methods themselves protect the data. In the same way that we trust HTTPS and other encryption methods with our day-to-day online banking tasks, we can expect these technologies to pave the way for a world where restrictive data sharing agreements become radically simplified.
In conclusion, data sharing is needed to achieve outcomes beyond what can be done with a single organization’s data. The current data-sharing-by-contract mechanism can be simplified in many ways, but the norm of data sharing will likely move towards technology-based protections that fundamentally protect the data. Contractual agreements are at best incentives for all parties to behave well, which carries much more risk than relying on fundamental computation methods to protect data.
integrate.ai makes simpler data sharing agreements possible by enabling machine learning and analytics on distributed data, without requiring access to individually identifiable data.