
The Truth About Cross-Device Data Quality


“Data-Driven Thinking” is written by members of the media community and contains fresh ideas on the digital revolution in media.

Today’s column is written by John Dempsey, senior director of mobile at Oracle Data Cloud.

“Deterministic” isn’t the same as “true.”

The industry has reduced cross-device accuracy to a single question: “How many of your IDs are deterministic and how many are probabilistic?”

Unfortunately, this attempt to quantify accuracy is based on the flawed assumption that any link that is directly observed is 100% correct. So-called “deterministic” links come from users who have logged in to a device – a straightforward concept that’s become the industry shorthand for high-quality matching. But a login doesn’t always tell the truth.

What about when your kid logs in to Facebook using your phone? It’s a login, but it’s not you.

What about when users don’t want to provide their real information and fill in bogus emails, such as [email protected]? That particular address matches to hundreds of thousands of US households in some databases, and several other common throwaway addresses behave the same way. Before long, you can build a “deterministic” set in which millions of devices are tied to the wrong people.

Unfortunately, these “bad data issues” are common. Before applying quality controls, as many as 15% of the emails in industry databases link to two or more households. That means the “deterministic” data in many databases is, at best, only 85% accurate.

Companies may claim to have a lot of deterministic matches, but if they have no way of cleaning up the bad data, the solution starts to fall apart.
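The column doesn’t spell out what those quality controls look like, so here is a minimal sketch of one obvious control, assuming a flat list of (email, household) link observations: drop any email that links to an implausible number of households. The threshold is illustrative, not a recommendation.

```python
from collections import defaultdict

# Hypothetical input: observed (hashed_email, household_id) links.
links = [
    ("email_hash_1", "hh_001"),
    ("email_hash_1", "hh_002"),
    ("email_hash_2", "hh_003"),
]

# Count the distinct households each email links to.
households_per_email = defaultdict(set)
for email, household in links:
    households_per_email[email].add(household)

# Quality control: an email seen across many households is more
# likely a shared or bogus address than a real identity signal.
MAX_HOUSEHOLDS = 2  # illustrative threshold; tune per dataset
clean_links = [
    (email, household)
    for email, household in links
    if len(households_per_email[email]) <= MAX_HOUSEHOLDS
]
```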

A better way to talk about data quality is to acknowledge that everything is probabilistic. Any single observed link can be thought of as a fact – sure, that email was seen logging in to that device at that time – but not all facts are true.

All links, including deterministic links, have a range of probability, from zero to 100%. Login data tends to be on the higher end of the scale, but rarely is it 100% right.
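One way to make that framing concrete is to attach a confidence score to every link instead of a binary deterministic/probabilistic flag. This is an illustrative data model with made-up scores, not any vendor’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class DeviceLink:
    person_id: str
    device_id: str
    source: str        # e.g., "login" or "ip_cooccurrence"
    confidence: float  # in [0.0, 1.0], never assumed to be exactly 1.0

links = [
    # Login-derived links start with a high prior, but not 1.0:
    # shared devices and bogus emails keep them short of certainty.
    DeviceLink("p1", "d1", source="login", confidence=0.97),
    DeviceLink("p1", "d2", source="ip_cooccurrence", confidence=0.72),
]
```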


It’s All Probabilistic … Until You Have A Truth Set

There simply isn’t enough high-quality deterministic data available to build a cross-device graph at meaningful scale, so even large players like Facebook and Google rely on so-called “probabilistic” techniques.

A probabilistic match means an association is inferred without being directly observed. In other words, you’re making a prediction about the likelihood that two bits of information are connected in real life. With data, technology and math, you can tease out these connections very well – even if they weren’t directly observed.

How does it work?

A probabilistic model has three components:

First, there is observation data – what signals are you using to predict a match?

For a cross-device solution, it’s best to observe each device as many times as possible, with associated metadata such as IP address, time stamp, application and device type. More data tends to be better.
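To make “observation data” concrete, here is a hedged sketch of turning raw device sightings into pairwise features. The field names and the particular features (shared IPs, hour-of-day overlap) are illustrative assumptions, not a description of any production system:

```python
from collections import Counter

# Hypothetical sighting logs, one row per observation of a device.
obs_d1 = [
    {"ip": "203.0.113.7", "hour": 21, "app": "news", "type": "phone"},
    {"ip": "198.51.100.4", "hour": 9, "app": "mail", "type": "phone"},
]
obs_d2 = [
    {"ip": "203.0.113.7", "hour": 22, "app": "video", "type": "laptop"},
]

def pair_features(obs_a, obs_b):
    """Summarize how strongly two devices co-occur on observed signals."""
    ips_a = Counter(o["ip"] for o in obs_a)
    ips_b = Counter(o["ip"] for o in obs_b)
    shared_ips = set(ips_a) & set(ips_b)
    hours_a = {o["hour"] for o in obs_a}
    hours_b = {o["hour"] for o in obs_b}
    return {
        "num_shared_ips": len(shared_ips),
        "shared_ip_sightings": sum(min(ips_a[ip], ips_b[ip]) for ip in shared_ips),
        "hour_overlap": len(hours_a & hours_b) / max(len(hours_a | hours_b), 1),
    }

print(pair_features(obs_d1, obs_d2))
```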

Second, there is the algorithm – how are you making your prediction?

This is the math used to identify the correct connections. The methodology can vary, but what matters most is how well the algorithm can identify correct matches. Statistician George Box famously said, “All models are wrong, but some are useful.”
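The column leaves the algorithm open, so as one possible instance, here is a sketch using logistic regression over pair features like those above. Any model that emits a probability per candidate pair would fit the argument; the feature values and labels below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows are candidate device pairs; columns are features such as
# shared IP count, shared sightings and hour overlap. Labels come
# from the truth set discussed next: 1 = same person/household.
X_train = np.array([[3, 40, 0.8], [0, 0, 0.1], [1, 2, 0.3], [4, 55, 0.9]])
y_train = np.array([1, 0, 0, 1])

model = LogisticRegression()
model.fit(X_train, y_train)

# The output is a probability, not a yes/no answer: "all models
# are wrong, but some are useful."
p_match = model.predict_proba(np.array([[2, 12, 0.6]]))[0, 1]
```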

How are you going to judge the usefulness of your model?

Finally, there is the truth set – what will you use to train and validate your algorithm?

A truth set – a set of links assumed to be 100% true – is required for any evaluation of a probabilistic solution. Careful consideration must be given when obtaining a truth set: If this is the data an algorithm will be judged against, you first have to make sure your truth set is, in fact, TRUE.
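Once a truth set exists, the evaluation itself is simple set arithmetic. A minimal sketch, assuming predictions and truth are both represented as unordered device pairs; note that the precision it reports is only as trustworthy as the truth set it is scored against:

```python
def evaluate(predicted_pairs, truth_pairs):
    """Score predicted device links against a truth set of links
    assumed to be 100% true. Both arguments are sets of frozensets."""
    true_positives = predicted_pairs & truth_pairs
    precision = len(true_positives) / max(len(predicted_pairs), 1)
    recall = len(true_positives) / max(len(truth_pairs), 1)
    return precision, recall

predicted = {frozenset({"d1", "d2"}), frozenset({"d3", "d4"})}
truth = {frozenset({"d1", "d2"}), frozenset({"d5", "d6"})}
print(evaluate(predicted, truth))  # (0.5, 0.5)
```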

How True Is Your Truth Set?

It’s not good enough to say, “All my data is scored against a truth set,” without first understanding what true means.

Ask the right questions and you’ll quickly find that most companies have no way to validate their so-called truth sets.

The typical approach looks like this: A company licenses anonymized login emails from a third party without knowledge of how the data was collected or whether it is accurate. These “truth sets” are often small – a million or fewer cross-device pairs – and not representative of the general population.

Deterministic data cannot be blindly trusted to be accurate. If you train your model against a “dirty” truth set, you’re bound to get bad results. It’s garbage in, garbage out.

The source of the truth set matters: You can have far more confidence in truth set data sourced directly from companies that charge customers’ credit cards on a monthly basis and regularly ship items to the same address associated with their devices.
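The column doesn’t prescribe a validation method, but one sanity check consistent with its argument is to compare independently sourced truth sets on their overlap: low agreement proves that at least one source is dirty. A hypothetical sketch, assuming each source maps an email to a single household ID:

```python
def agreement_rate(source_a, source_b):
    """Fraction of overlapping emails on which two independent
    'truth' sources assign the same household. Returns None when
    the sources share no emails."""
    shared = set(source_a) & set(source_b)
    if not shared:
        return None
    agree = sum(source_a[e] == source_b[e] for e in shared)
    return agree / len(shared)

a = {"email_hash_1": "hh_001", "email_hash_2": "hh_003"}
b = {"email_hash_1": "hh_001", "email_hash_2": "hh_009"}
print(agreement_rate(a, b))  # 0.5
```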

Truth Be Told

For companies like Facebook, Google and Twitter, an email address alone can be good enough because it’s more likely to be “you” when using their services. But when it comes to the open web, which relies on demand-side platforms and exchanges, a much higher standard is required.

Accuracy matters. Cross-device mapping impacts everything from retargeting and frequency capping to attribution. It pays to get it right.

So how do you assess the quality of a cross-device solution?

Don’t bother asking how many deterministic IDs there are. Instead, ask about the trade-off between scale and accuracy. And make sure the truth set is actually “true.”

Follow John Dempsey (@johnfdempsey), Oracle Data Cloud (@OracleDataCloud) and AdExchanger (@adexchanger) on Twitter.
