Open the floodgates!
I’m talking about data pipelines, of course.
In today’s data-hungry world, business intelligence, analytics, and machine learning run on data coming from a (large) number of different sources. Volumes of data are being generated by sensors, business systems, CRMs, and mobile devices, to name just a few. Organizations from healthcare to retail are increasingly building secure data sharing capabilities to leverage each other’s data. And the trend towards more distributed data systems is only accelerating. According to Gartner, by 2025, 75% of enterprise-generated data will be created and processed outside a traditional centralized data center or cloud.
As more data is being piped across distributed systems, the ETL pipelines moving data from point A to point B and powering data integrations are growing in volume, scale, complexity… and importance.
ETL is an acronym for “extract, transform, and load.” (A common variant, ELT, loads the raw data first and transforms it inside the warehouse.) In the ETL process, data engineers extract a copy of data from distributed sources, transform the raw data into a format that can be used by downstream intelligence, analytics, and machine learning applications, and then load the converted data into a data warehouse or data lake where those applications can access it. ETL pipelines are the tools and automated workflows data engineers build to run this process end to end.
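To make the three stages concrete, here’s a minimal sketch of an ETL job in Python with pandas. The SQLite source, the orders table and its columns, and the warehouse file are all hypothetical placeholders for illustration; a production pipeline would typically run under an orchestrator and write to a real warehouse.

```python
import sqlite3

import pandas as pd


def extract(source_path: str) -> pd.DataFrame:
    # Extract: pull a copy of the raw data out of a source system.
    # "orders.db" and the "orders" table are hypothetical names.
    with sqlite3.connect(source_path) as conn:
        return pd.read_sql_query("SELECT * FROM orders", conn)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and reshape the raw data for downstream use.
    clean = raw.dropna(subset=["order_id"]).copy()
    clean["order_date"] = pd.to_datetime(clean["order_date"])
    clean["revenue"] = clean["quantity"] * clean["unit_price"]
    return clean


def load(clean: pd.DataFrame, warehouse_path: str) -> None:
    # Load: write the converted data where analytics tools can reach it.
    with sqlite3.connect(warehouse_path) as conn:
        clean.to_sql("orders_clean", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract("orders.db")), "warehouse.db")
```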
The simplest of pipelines will automatically pipe data from one data source to one data warehouse. But in reality, most data pipelines are much more complex, extracting and transforming data from a number of different sources, which exist in different formats and levels of cleanliness. To add to the complexity, many organizations are managing tens, if not hundreds, of pipelines and data integrations. Building, running, and managing these pipelines becomes costly. According to McKinsey, a typical mid-size financial institution spends $60-$90M per year on data access, which includes pipelines.
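To see where that complexity comes from, here’s an illustrative sketch that merges two hypothetical feeds in different formats, a CRM CSV export and a newline-delimited JSON event log, into one schema. The file names and column mappings are assumptions for the example; every additional source adds another normalization branch like these.

```python
import pandas as pd


def from_crm_csv(path: str) -> pd.DataFrame:
    # Hypothetical CRM export: clean-ish CSV with business-friendly headers.
    df = pd.read_csv(path)
    return df.rename(columns={"Customer ID": "customer_id",
                              "Signup Date": "event_date"})


def from_events_json(path: str) -> pd.DataFrame:
    # Hypothetical mobile event log: newline-delimited JSON, epoch timestamps.
    df = pd.read_json(path, lines=True)
    df["event_date"] = pd.to_datetime(df["ts"], unit="s")
    return df.rename(columns={"uid": "customer_id"})


# Each source needs its own extraction and normalization logic before the
# records can be combined; this is where pipeline complexity compounds.
unified = pd.concat(
    [from_crm_csv("crm_export.csv"), from_events_json("app_events.jsonl")],
    ignore_index=True,
).drop_duplicates(subset=["customer_id", "event_date"])
```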
ETL process costs can be broken out into five basic categories:
- Data transfer: moving copies of data out of source systems and across networks
- Storage: holding raw and transformed copies in the warehouse or lake
- Compute: the infrastructure that runs the extractions and transformations
- Engineering time to build new pipelines and integrations
- Engineering time to maintain and troubleshoot existing ones
For all five categories, you can see how scale and complexity can lead to runaway costs. As the number of data sources and the volume of data grow, you’re paying to move and store more data, you’re provisioning more infrastructure for the compute, and your engineering team is spending more time building and maintaining complex pipelines.
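A back-of-envelope model makes that scaling concrete. All of the unit prices below are hypothetical placeholders rather than real provider rates; the point is how the total moves as pipeline count and volume grow.

```python
# Hypothetical unit prices, for illustration only.
EGRESS_PER_GB = 0.09        # data transfer, $/GB moved
STORAGE_PER_GB = 0.023      # warehouse/lake storage, $/GB-month
COMPUTE_PER_HOUR = 0.50     # transformation compute, $/hour
ENGINEER_PER_HOUR = 100.0   # fully loaded engineering rate, $/hour


def monthly_cost(gb_moved, gb_stored, compute_hours, eng_hours):
    # Sum of the five cost categories (build and maintenance time
    # are folded together here as engineering hours).
    return (gb_moved * EGRESS_PER_GB
            + gb_stored * STORAGE_PER_GB
            + compute_hours * COMPUTE_PER_HOUR
            + eng_hours * ENGINEER_PER_HOUR)


# 10 pipelines vs. 100, each moving 50 GB, storing 200 GB, using
# 30 compute hours, and consuming 8 engineering hours per month.
for n in (10, 100):
    print(n, monthly_cost(n * 50, n * 200, n * 30, n * 8))
```

With these made-up rates, ten pipelines cost roughly $8,200 a month and a hundred cost roughly $82,000; notably, engineering hours dominate both totals.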
Whether you’re building your first pipeline or managing hundreds, the best time to start controlling your data pipeline costs is now. Pipelines only get more complex as your data grows, and more complex means more costly. Getting into the habit of tracking costs alongside pipeline health, latency, and throughput will pay big dividends in the long run.
Here are six tips for managing your ETL pipeline costs:
The first tip might sound heretical in an article about pipelines: for machine learning and analytics tasks, you often don’t need to build complex and costly pipelines at all. Federated learning and federated analytics enable model training and data analytics across distributed data where it lives, so you can generate insights while avoiding all of the costs of moving large datasets.
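To show the idea, here’s a minimal sketch of one round of federated averaging in plain Python. This is a generic illustration rather than integrate.ai’s actual API, and the two in-memory “silos” and the least-squares model are assumptions for the example; the key property is that raw rows never leave their silo, only model coefficients do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical data silos; in practice these would be separate
# organizations or environments whose raw data cannot be pooled.
silo_a = (rng.normal(size=(100, 3)), rng.normal(size=100))
silo_b = (rng.normal(size=(200, 3)), rng.normal(size=200))


def local_fit(X, y):
    # Each silo trains on its own data and shares only the model.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, len(y)


updates = [local_fit(X, y) for X, y in (silo_a, silo_b)]

# The coordinator sees coefficients and sample counts, never raw rows:
# the global model is the sample-weighted average of the local models.
total = sum(n for _, n in updates)
global_coef = sum(coef * (n / total) for coef, n in updates)
print(global_coef)
```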
It’s quick and easy to start experimenting with federated learning and analytics using integrate.ai’s developer tools, which manage the federated infrastructure, security, and data science tooling for you. You may find that you can replace some or all of your costly data pipelines with federation.