Data pipelines for the rest of us

Apache Airflow is a great tool for building data pipelines as code, but the fact that most of its contributors work for Astronomer is another example of open source's sustainability problem.


Depending on your politics, trickle-down economics may not have worked all that well in the United States under President Ronald Reagan. In open source software, however, it seems to be doing just fine.

I’m not really talking about economic policies, of course, but rather about elite software engineering teams releasing code that ends up powering the not-so-elite mainstream. Take Lyft, for example, which released the popular Envoy project. Or Google, which gave the world Kubernetes (though, as I’ve argued, the goal wasn’t charitable niceties, but rather corporate strategy to outflank the dominant AWS). Airbnb figured out a way to move beyond batch-oriented cron scheduling, gifting us Apache Airflow and data pipelines-as-code.

Today a wide array of mainstream enterprises depend on Airflow, from Walmart to Adobe to Marriott. Though its community includes developers from Snowflake, Cloudera, and more, a majority of the heavy lifting is done by engineers at Astronomer, which employs 16 of the top 25 committers. Astronomer puts this stewardship and expertise to good use, operating a fully managed Airflow service called Astro, but it's not the only company selling Airflow as a service. Unsurprisingly, the cloud providers have been quick to launch their own offerings without contributing code back in commensurate measure, which raises concerns about sustainability.

That code isn’t going to write itself if it can’t pay for itself.

What’s a data pipeline, anyway?

Today everyone is talking about large language models (LLMs), retrieval-augmented generation (RAG), and other generative AI (genAI) acronyms, just as 10 years ago we couldn’t get enough of Apache Hadoop, MySQL, etc. The names change, but data remains, with the ever-present concern for how best to move that data between systems.

This is where Airflow comes in.

In some ways, Airflow is like a seriously upgraded cron job scheduler. Companies start with isolated systems, which eventually need to be stitched together. Or, rather, the data needs to flow between them. As an industry, we’ve invented all sorts of ways to manage these data pipelines, but as data increases, the systems to manage that data proliferate, not to mention the ever-increasing sophistication of the interactions between these components. It’s a nightmare, as the Airbnb team wrote when open sourcing Airflow: “If you consider a fast-paced, medium-sized data team for a few years on an evolving data infrastructure and you have a massively complex network of computation jobs on your hands, this complexity can become a significant burden for the data teams to manage, or even comprehend.”

Written in Python, Airflow naturally speaks the language of data. Think of it as connective tissue that gives developers a consistent way to plan, orchestrate, and understand how data flows between every system. A significant and growing swath of the Fortune 500 depends on Airflow for data pipeline orchestration, and the more they use it, the more valuable it becomes. Airflow is increasingly critical to enterprise data supply chains.
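To make "pipelines as code" concrete, here is a minimal sketch of what a daily extract-transform-load pipeline looks like as an Airflow DAG, written against the TaskFlow API in a recent Airflow 2.x release. The pipeline name, task logic, and data are hypothetical placeholders; a real pipeline would read from and write to actual source and target systems.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    """A hypothetical daily pipeline, defined entirely in Python."""

    @task
    def extract():
        # Placeholder: a real task would pull rows from a source system.
        return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": -1.0}]

    @task
    def transform(rows):
        # Placeholder: filter out invalid records.
        return [r for r in rows if r["amount"] > 0]

    @task
    def load(rows):
        # Placeholder: a real task would write to a warehouse table.
        print(f"Loaded {len(rows)} rows")

    # Chaining the calls declares the dependency graph: extract -> transform -> load.
    load(transform(extract()))


orders_pipeline()
```

Because the pipeline is ordinary Python, it can be versioned, reviewed, and tested like any other code, which is precisely the point.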

So let’s go back to the question of money.

Code isn’t going to write itself

There’s a solid community around Airflow, but perhaps 55% or more of the code is contributed by people who work for Astronomer. This puts the company in a great position to support Airflow in production for its customers (through its managed Astro service), but it also puts the project at risk. No, not from Astronomer exercising undue influence on the project. Apache Software Foundation projects are, by definition, never single-company projects. Rather, the risk comes from Astronomer potentially deciding that it can’t financially justify its level of investment.

This is where the allegations of “open source rug pulling” lose their potency. As I’ve recently argued, we have a trillion-dollar free-rider problem in open source. We’ve always had some semblance of this issue. No company contributes out of charity; it’s always about self-interest. One problem is that it can take a long time for companies to understand that their self-interest should compel them to contribute (as happened when Elastic changed its license and AWS discovered that it had to protect billions of dollars in revenue by forking Elasticsearch). This delayed recognition is exacerbated when someone else foots the bill for development.

It’s just too easy to let someone else do the work while you are skimming the profit.

Consider Kubernetes. It’s rightly considered a poster child for community, but look at how concentrated the community contributions are. Since inception, Google has contributed 28% of the code. The next largest contributor is Red Hat, with 11%, followed by VMware with 8%, then Microsoft at 5%. Everyone else is a relative rounding error, including AWS (1%), which dwarfs everyone else in revenue earned from Kubernetes. This is completely fair, as the license allows it. But what happens if Google decides it’s not in the company’s self-interest to keep doing so much development for others’ gain?

One possibility (and the contributor data may support this conclusion) is that companies will recalibrate their investments. For example, over the past two years, Google’s share of contributions fell to 20%, and Red Hat’s dropped to 8%. Microsoft, for its part, increased its relative share of contributions to 8%, and AWS, while still relatively tiny, jumped to 2%. Maybe good communities are self-correcting?

Which brings us back to the question of data.

It’s Python’s world

Because Airflow is built in Python, and Python seems to be every developer’s second language (if not their first), it’s easy for developers to get started. More importantly, perhaps, it’s also easy for them to stop thinking about data pipelines at all. Data engineers don’t really want to maintain data pipelines. They want that plumbing to fade into the background, as it were.

How to make that happen isn’t immediately obvious, particularly given the absolute chaos of today’s data/AI landscape, as captured by FirstMark Capital. Airflow, particularly with a managed service like Astronomer’s Astro, makes it straightforward to preserve optionality (lots of choices in that FirstMark chart) while streamlining the maintenance of pipelines between systems.

This is a big deal that will keep getting bigger as data sources proliferate. That “big deal” should show up more in the contributor table. Today Astronomer developers are the driving force behind Airflow releases. It would be great to see other companies up their contributions, too, commensurate with the revenue they’ll no doubt derive from Airflow.
