
ETL: The Most Important Acronym You’ve Never Heard Of


“Data-Driven Thinking” is written by members of the media community and contains fresh ideas on the digital revolution in media.

Today’s column is written by Mike Driscoll, CEO and founder of Metamarkets.

Data is the fuel and the exhaust of programmatic advertising. It informs every transaction, and every transaction generates more of it. As impression volumes rise into the trillions across all manner of devices, the focus of many ad tech engineering teams isn’t on ethereal machine learning algorithms but on something far less glamorous.

The process is called ETL — the critical, painstaking work of cleansing and consolidating disparate datasets. As the worlds of marketing and enterprise software collide, ETL could be the most important acronym you’ve never heard of.

ETL stands for extract, transform and load — and it’s a truism among data scientists that it takes up about 80% of our time, leaving just 20% for analysis. Having built big data platforms in pharma, banking and now in digital media, I believe this ratio is near universal.
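To make the acronym concrete, here is a minimal sketch of the three steps in Python. The CSV file, field names and SQLite destination are illustrative assumptions, not a description of any particular platform.

```python
import csv
import sqlite3
from datetime import datetime, timezone

def extract(path):
    """Extract: read raw impression rows from a delivery log (illustrative CSV)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    """Transform: normalize identifiers, timestamps and units into one canonical shape."""
    return (
        row["campaign_id"].strip().lower(),
        datetime.fromtimestamp(int(row["ts"]), tz=timezone.utc).isoformat(),
        round(float(row["spend_micros"]) / 1_000_000, 6),  # micros -> whole currency units
    )

def load(rows, db_path="warehouse.db"):
    """Load: write the cleaned rows into the analytics store."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS impressions (campaign TEXT, ts TEXT, spend REAL)")
    con.executemany("INSERT INTO impressions VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(r) for r in extract("delivery_log.csv"))
```

Each stage looks trivial in isolation; the 80% figure comes from repeating this for dozens of sources whose formats never quite agree.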

Underinvestment in and misunderstanding of ETL are responsible for an enormous amount of organizational pain and inefficiency. They are why data is so often delayed, why so many executives are unhappy with the quality of reporting and why more than 50% of corporate business intelligence initiatives fail.

ETL is hard because data is messy. There is no such thing as clean data, and even the most common attributes have a dizzying array of acceptable formats: “Sat Jan 22 10:37:13 PST,” “2014-01-22T18:37:13.0+0000” and “1323599850” all denote the same time. Add to this a growing variety of data types: geocoordinates, buyer names, seller URLs, device IDs, campaign strings, country codes and currencies. Each new source adds a layer of bricks to our collective tower of Babel.
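As a small illustration of the transform step, the sketch below coerces those three formats into one canonical UTC string. It is only a sketch: the year attached to the abbreviated date and the mapping of “PST” to UTC-8 are assumptions, and real pipelines lean on battle-tested date libraries rather than hand-rolled parsing.

```python
from datetime import datetime, timezone, timedelta

def normalize_timestamp(raw):
    """Coerce a few common timestamp formats into a canonical UTC ISO-8601 string."""
    # Unix epoch seconds, e.g. "1323599850"
    if raw.isdigit():
        return datetime.fromtimestamp(int(raw), tz=timezone.utc).isoformat()
    # ISO-8601 with a numeric offset, e.g. "2014-01-22T18:37:13.0+0000"
    try:
        return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z").astimezone(timezone.utc).isoformat()
    except ValueError:
        pass
    # Abbreviated ctime-style string, e.g. "Sat Jan 22 10:37:13 PST".
    # The year and the PST -> UTC-8 mapping are assumptions; zone abbreviations are ambiguous.
    naive = datetime.strptime(raw.replace(" PST", ""), "%a %b %d %H:%M:%S")
    localized = naive.replace(year=2014, tzinfo=timezone(timedelta(hours=-8)))
    return localized.astimezone(timezone.utc).isoformat()

for raw in ("Sat Jan 22 10:37:13 PST", "2014-01-22T18:37:13.0+0000", "1323599850"):
    print(raw, "->", normalize_timestamp(raw))
```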

It’s no wonder that an agency CIO recently confessed to me that he was spending tens of millions of dollars a year on the reliable, repeatable transformation of data. Having spent much of my career wrestling with ETL’s demons, I offer five ways to keep them at bay:

1. Stay Close To The Source

Journalists know that when it comes to getting the facts, it’s best to go directly to the primary source, and best to get there first. The same is true for ETL. The closer you are to the data source, the fewer transformations and steps are involved, and the lower the likelihood that something will break. The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands. Also, the closer you are to the source, the faster you can optimize your approach, which in this space can pay huge dividends.

2. Avoid Processed Data


Just like food, data is best when it’s minimally processed. To handle huge quantities of data, one common approach in ETL pipelines is to downsample it indiscriminately. Many programmatic buyers will examine, for example, only a 1% feed of the bid requests coming off a particular marketplace.

In an era when bandwidth is cheap and computing resources are vast, sampling data is a throwback to the punchcard era — and worse, it waters down insights. Audience metrics like frequency and reach can become impossible to recover once a data stream has been put through the shredder. Sampling is why audience segments can resemble sausage — no one knows what’s inside.
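A quick, synthetic simulation makes the point about what sampling discards. The user counts and impression distribution below are invented solely for illustration; the takeaway is that naively scaling a 1% event sample recovers neither reach nor per-user frequency.

```python
import random

random.seed(7)

# Synthetic full log: 10,000 users, each receiving between 1 and 10 impressions.
full_log = [user for user in range(10_000) for _ in range(random.randint(1, 10))]

# An indiscriminate 1% event sample, as a downsampled bid or impression feed might provide.
sample = [event for event in full_log if random.random() < 0.01]

def reach_and_frequency(log):
    """Reach = unique users; average frequency = impressions per reached user."""
    users = set(log)
    return len(users), len(log) / len(users)

true_reach, true_freq = reach_and_frequency(full_log)
samp_reach, samp_freq = reach_and_frequency(sample)

print(f"full log:  reach={true_reach:,}  avg frequency={true_freq:.2f}")
# Scaling sampled uniques by 100x badly overstates reach (one surviving event is enough
# to count a user), while sampled frequency collapses toward 1 impression per user.
print(f"1% sample: reach x100={samp_reach * 100:,}  avg frequency={samp_freq:.2f}")
```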

3. Embrace (And Enforce) Standards

In the early days of the railroads, as many as a dozen distinct track gauges, ranging in width between the inside rails from 2 feet to nearly 10 feet, proliferated across North America, Europe, Africa and Asia. Because non-interoperable trains and carriages made continuous transport across regions difficult, a standard gauge was eventually adopted at the suggestion of the British civil engineer George Stephenson. Today, approximately 60% of the world’s lines use this gauge.

Our programmatic vertical has its own George Stephensons: CTOs and chief scientists like Jim Butler and Neal Richter, whom you can find late at night debating OpenRTB protocol specifications on developer lists. Just as with the railroads two centuries before, embracing and enforcing standards will catalyze faster growth in programmatic advertising through increased interoperability.
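For readers who have never seen the standard in question, the sketch below shows roughly what a minimal OpenRTB 2.x bid request looks like, expressed as a Python dict. The field names follow the spec’s object model (BidRequest, Imp, Banner, Site, Device, User), but every value is invented for illustration.

```python
import json

# A minimal OpenRTB 2.x-style bid request. All values are placeholders.
bid_request = {
    "id": "req-123",
    "at": 2,          # auction type: 2 = second price
    "tmax": 120,      # auction timeout in milliseconds
    "cur": ["USD"],
    "imp": [{
        "id": "1",
        "bidfloor": 0.50,
        "banner": {"w": 300, "h": 250},
    }],
    "site": {"domain": "example.com", "page": "https://example.com/article"},
    "device": {"ua": "Mozilla/5.0", "ip": "203.0.113.42", "geo": {"country": "USA"}},
    "user": {"id": "hashed-user-id"},
}

print(json.dumps(bid_request, indent=2))
```

The value of the shared vocabulary is that every exchange, DSP and SSP can parse the same request without a bespoke translation layer, which is exactly the ETL burden that standards remove.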

4. Put Business Questions First (Don’t Let Data Wag The Dog)

Too many organizations, upon recognizing that they’ve got data challenges, decide to undertake a grand data-unification project. Noble in their intentions, cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform. The implicit assumption is that “once we have all the data, we can answer any question we’d like.” This approach is doomed to fail because there is always more data than one realizes, and the choices around what data to collect and how to structure it can only be made by putting business questions first.

ETL is hard, and building pipelines laborious, so avoid building bridges to places that no business inquiry will ever visit.

5. Avoid ETL Where You Can

While for some organizational processes there’s no avoiding working with the nuts and bolts of data, for others it may be possible to get out of the data handling business entirely. Take, for example, the handling of email or digital documents: For years, IT departments suffered through the management and occasional migration of these assets. Today, however, cloud offerings, such as those from Google and Box, make this someone else’s problem, freeing up our businesses to specialize in what we do best.

Follow Mike Driscoll (@medriscoll), Metamarkets (@metamarkets) and AdExchanger (@adexchanger) on Twitter.
