ETL: The Most Important Acronym You’ve Never Heard Of

“Data-Driven Thinking” is written by members of the media community and contains fresh ideas on the digital revolution in media.

Today’s column is written by Mike Driscoll, CEO and founder of Metamarkets.

Data is the fuel and the exhaust of programmatic advertising. It informs every transaction, and every transaction generates more of it. As impression volumes rise into the trillions across all manner of devices, the focus of many ad tech engineering teams isn’t on ethereal machine learning algorithms, but on something far less glamorous.

The process is called ETL — the critical, painstaking work of cleansing and consolidating disparate datasets. As the worlds of marketing and enterprise software collide, ETL could be the most important acronym you’ve never heard of.

ETL stands for extract, transform and load — and it’s a truism among data scientists that it takes up about 80% of our time, leaving just 20% for analysis. Having built big data platforms in pharma, banking and now in digital media, I believe this ratio is near universal.

Underinvestment in ETL, and misunderstanding of it, are responsible for a huge amount of organizational pain and inefficiency. Together they explain why data is so often delayed, why so many executives are unhappy with the quality of reporting and why more than 50% of corporate business intelligence initiatives fail.

ETL is hard because data is messy. There is no such thing as clean data, and even the most common attributes have a dizzying array of acceptable formats: “Wed Jan 22 10:37:13 PST 2014,” “2014-01-22T18:37:13.0+0000” and “1390415833” all denote the same time. Add to this a growing variety of data types, such as geocoordinates, buyer names, seller URLs, device IDs, campaign strings, country codes and currencies. Each new source adds a layer of bricks to our collective tower of Babel.
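To make the timestamp problem concrete, here is a minimal Python sketch of the "transform" step for three such formats. The sample values, the `to_utc` function and the `ZONE_OFFSETS` table are my own illustrative assumptions, not part of any particular pipeline, and a real system would have to cope with many more formats and with ambiguous zone abbreviations.

```python
from datetime import datetime, timedelta, timezone

# Assumed lookup from zone abbreviation to a fixed UTC offset. Abbreviations
# are ambiguous in general, so a production table would need careful vetting.
ZONE_OFFSETS = {"PST": timedelta(hours=-8), "UTC": timedelta(0)}


def to_utc(raw: str) -> datetime:
    """Normalize one of three illustrative timestamp formats to aware UTC."""
    raw = raw.strip()

    # Format 1: seconds since the Unix epoch, e.g. "1390415833".
    if raw.isdigit():
        return datetime.fromtimestamp(int(raw), tz=timezone.utc)

    # Format 2: ISO-style with fractional seconds and a numeric offset.
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")
        return parsed.astimezone(timezone.utc)
    except ValueError:
        pass

    # Format 3: Unix `date` style, e.g. "Wed Jan 22 10:37:13 PST 2014".
    # %a and %b are locale-dependent; this assumes English names.
    parts = raw.split()
    if len(parts) != 6 or parts[4] not in ZONE_OFFSETS:
        raise ValueError(f"unrecognized timestamp: {raw!r}")
    naive = datetime.strptime(" ".join(parts[:4] + parts[5:]),
                              "%a %b %d %H:%M:%S %Y")
    local = naive.replace(tzinfo=timezone(ZONE_OFFSETS[parts[4]]))
    return local.astimezone(timezone.utc)


# Hypothetical inputs chosen so that all three encode the same instant.
SAMPLES = [
    "Wed Jan 22 10:37:13 PST 2014",
    "2014-01-22T18:37:13.0+0000",
    "1390415833",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, "->", to_utc(sample).isoformat())
```

Even this toy version needs three code paths and a hand-maintained zone table for a single attribute, which hints at where the 80% goes.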

It’s no wonder that an agency CIO recently confessed to me that he’d spent tens of millions of dollars a year on the reliable, repeatable transformation of data. Having spent much of my career wrestling ETL’s demons, I offer five ways to keep them at bay:

1. Stay Close To The Source

Journalists know that when it comes to getting the facts, it’s best to go directly to the primary source and it’s best to break news first. The same is true for ETL. The closer you are to the data source, the fewer transformations and steps are required, and the lower the likelihood that something will break. The best ETL pipelines resemble tributaries feeding rivers, not bridges connecting islands. Also, the closer you are to the source, the faster you can optimize your approach, which in this space can pay huge dividends.

2. Avoid Processed Data

Just like food, data is best when it’s minimally processed. To handle huge quantities of data, one common approach for ETL pipelines is to downsample it indiscriminately. Many programmatic buyers will examine, for example, a 1% feed of bid requests coming off a particular marketplace.

In an era when bandwidth is cheap and computing resources are vast, sampling data is a throwback to the punchcard era — and worse, it waters down insights. Audience metrics like frequency and reach can become impossible to recover once a data stream has been put through the shredder. Sampling is why audience segments can resemble sausage — no one knows what’s inside.
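A rough simulation shows how this happens. The sketch below uses a synthetic log in which every user sees exactly five impressions, then keeps a random 1% of impressions. The data, the helper names and the fixed seed are all assumptions for illustration. It is not a model of any real marketplace feed.

```python
import random
from collections import Counter


def reach_and_frequency(impressions):
    """Return (reach, average frequency) for a list of user IDs,
    where each list entry represents one impression."""
    per_user = Counter(impressions)
    reach = len(per_user)
    frequency = len(impressions) / reach if reach else 0.0
    return reach, frequency


def sample_impressions(impressions, rate, seed=0):
    """Keep each impression independently with probability `rate`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [user for user in impressions if rng.random() < rate]


# Synthetic log (an assumption): 10,000 users with 5 impressions each.
full_log = [user_id for user_id in range(10_000) for _ in range(5)]
true_reach, true_frequency = reach_and_frequency(full_log)

# A 1% impression-level sample, as in the bid-request example above.
sampled_log = sample_impressions(full_log, rate=0.01)
sampled_reach, sampled_frequency = reach_and_frequency(sampled_log)

# Naively scaling sampled reach back up by 1/rate.
naive_reach_estimate = sampled_reach * 100

if __name__ == "__main__":
    print("true reach:", true_reach, "true frequency:", true_frequency)
    print("sampled reach:", sampled_reach,
          "sampled frequency:", round(sampled_frequency, 3))
    print("naive scaled-up reach:", naive_reach_estimate)
```

In the sample, almost every surviving user appears only once, so measured frequency collapses toward 1 and the scaled-up reach lands far above the true figure. Without the unsampled log there is no simple correction, because the sample no longer records how impressions were distributed across users.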

3. Embrace (And Enforce) Standards

In the early days of the railroads, as many as a dozen distinct track gauges, ranging in width from 2 feet to nearly 10 feet between the inside edges of the rails, had proliferated across North America, Europe, Africa and Asia. Because incompatible gauges kept trains and carriages from running continuously across regions, a standard width was eventually adopted at the suggestion of a British civil engineer named George Stephenson. Today, approximately 60% of the world’s lines use this gauge.

Our programmatic vertical has its own George Stephensons, CTOs and chief scientists like Jim Butler and Neal Richter, whom you can find late at night debating specifications for OpenRTB protocols on developer lists. Just as with the railroads two centuries before, embracing and enforcing standards will catalyze faster growth in programmatic advertising through increased interoperability.

4. Put Business Questions First (Don’t Let Data Wag The Dog)

Too many organizations, upon recognizing that they’ve got data challenges, decide to undertake a grand data-unification project. Noble in their intentions and cheered by vendors and engineers alike, these efforts seek to funnel every source of data in the organization into a massive central platform. The implicit assumption is that “once we have all the data, we can answer any question we’d like.” This approach is doomed to fail because there is always more data than one realizes, and the choices around what data to collect and how to structure it can only be made by putting business questions first.

ETL is hard, and building pipelines laborious, so avoid building bridges to places that no business inquiry will ever visit.

5. Avoid ETL Where You Can

While for some organizational processes there’s no avoiding working with the nuts and bolts of data, for others it may be possible to get out of the data handling business entirely. Take, for example, the handling of email or digital documents: For years, IT departments suffered through the management and occasional migration of these assets. Today, however, cloud offerings, such as those from Google and Box, make this someone else’s problem, freeing up our businesses to specialize in what we do best.

Follow Mike Driscoll (@medriscoll), Metamarkets (@metamarkets) and AdExchanger (@adexchanger) on Twitter.
