Which is better – deterministic data or probabilistic data?
It’s a trick question because the answer is “both.”
Even Google – whose first-party, cross-device, logged-in user base likely trumps anything anyone other than Facebook could muster deterministically – uses a combination of logged-in user data and statistical matching.
With privacy in mind, Google’s recently announced cross-device measurement solution will only use deterministic logged-in data as the truth set to jump-start a probabilistic approach. As Neal Mohan, Google’s VP of video and display advertising, described it at launch in June, “We use people who have signed in to Google accounts on various devices as seed data, and we extrapolate from there.”
Oracle Data Cloud, a behemoth in its own right, is another prime example. The data offering, which Oracle refers to as “DaaS” or “data as a service,” announced an expansion of its partnership with Tapad on Thursday to focus on the nexus between mobile and offline.
The two had been working together for roughly the last year and a half, connecting online to mobile for Oracle’s DMP customers.
Tapad’s ID Graph uses a core of deterministic truth data to train its algorithm, thereby producing more and more accurate cross-device matches over time based on behavioral data. Oracle will integrate Tapad’s data into its own ID Graph as part of what Oracle Data Cloud SVP and GM Omar Tawakol termed a “vast effort to help marketers defragment their databases.”
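The general idea of training a matcher on deterministic truth data can be sketched in a few lines of Python. This is a minimal illustration, not Tapad's or Oracle's actual method: the two features (`shared_ip`, `time_overlap`), the seed values and the choice of a tiny logistic regression are all assumptions made for the example.

```python
import math

# Hypothetical seed set: device pairs whose labels come from logged-in data.
# Each row: (shared_ip, time_overlap, same_user), with invented values.
#   shared_ip    = fraction of sessions seen on the same IP address
#   time_overlap = similarity of active hours, from 0 to 1
SEED_PAIRS = [
    (0.9, 0.8, 1),
    (0.8, 0.6, 1),
    (0.7, 0.9, 1),
    (0.2, 0.3, 0),
    (0.1, 0.5, 0),
    (0.3, 0.1, 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, epochs=2000, lr=0.5):
    """Fit a two-feature logistic regression with plain gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x1, x2, y in pairs:
            p = sigmoid(w[0] * x1 + w[1] * x2 + b)
            err = p - y  # gradient of the log loss with respect to z
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
            b -= lr * err
    return w, b


def match_probability(model, shared_ip, time_overlap):
    """Score an unlabeled device pair using the model learned from the seed."""
    w, b = model
    return sigmoid(w[0] * shared_ip + w[1] * time_overlap + b)


model = train(SEED_PAIRS)
likely = match_probability(model, 0.85, 0.7)
unlikely = match_probability(model, 0.15, 0.4)
print(f"likely pair: {likely:.2f}, unlikely pair: {unlikely:.2f}")
```

The deterministic pairs serve only as labels; the model then extends those labels to pairs with no login in common, which is where the extra reach comes from.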
“The techniques the data cloud focused on initially were deterministic – and we continue to focus on deterministic – but, if you look at the volume of interactions across mobile and across devices, deterministic doesn’t give enough volume yet,” said Tawakol, who was also the co-founder and former CEO of audience data company BlueKai before it was acquired by Oracle in February 2014. “It takes years to get to a high match rate.”
Which is why Oracle takes a three-pronged approach to the thorny challenge of establishing identity.
The first piece of the puzzle is BlueKai, which sources first- and third-party data from websites and directly from brands to build up online profiles of consumers. Datalogix, which Oracle acquired in December 2014, brings the offline sales component through direct relationships with stores. Tapad, which had preexisting partnerships with both BlueKai and Datalogix before the acquisitions, helps provide the linkages between devices and data points – of which there are many to make.
“We use both probabilistic data and deterministic data, and we see pluses and minuses with both approaches,” said Tapad CEO and founder Are Traasdahl. “Think about it: You might have three or four different active email addresses, different phone numbers and a trail of multiple physical addresses you’ve lived at tied to loyalty cards you forgot to update, as well as shared devices between you and other family members.”
In other words, deterministic data isn’t accurate by default.
“Garbage in, garbage out,” Tawakol said. “Our data scientists have been hitting me over the head for a while to stop using the terms ‘probabilistic’ and ‘deterministic’ because, from their perspective, it’s all about the confidence factor. Just because people provide deterministic data doesn’t mean that it’s the 100% truth.”
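One way to picture the “confidence factor” framing is to give every linking signal, deterministic or not, a score below 1.0 and combine them. The sketch below uses a noisy-OR combination, which assumes the signals are independent; the signal names and confidence values are invented for illustration and do not come from Oracle or Tapad.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    source: str        # hypothetical label, e.g. "login_email"
    confidence: float  # estimated chance this signal alone links the right person


def combined_confidence(signals):
    """Noisy-OR: 1 minus the chance that every signal is wrong,
    under the simplifying assumption that signals are independent."""
    miss = 1.0
    for s in signals:
        miss *= 1.0 - s.confidence
    return 1.0 - miss


# A "deterministic" link built on an email address that may be stale or shared.
stale_login = [Signal("login_email", 0.70)]

# Several weaker behavioral signals pointing at the same pairing.
several_weak = [
    Signal("ip_cooccurrence", 0.50),
    Signal("time_of_day", 0.40),
    Signal("location", 0.40),
]

print(f"stale login:  {combined_confidence(stale_login):.2f}")
print(f"weak signals: {combined_confidence(several_weak):.2f}")
```

Under these made-up numbers, three modest behavioral signals together outscore a single out-of-date login, which is the data scientists' point: the label on the data matters less than how much it can be trusted.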
And then there are the environments where logged-in user data isn’t available to make a connection.
“You don’t need to use your email address to watch TV or go into a store, so if you want a 360-degree view of your customer, there needs to be a combination of approaches,” Traasdahl said.
Nonetheless, deterministic is Tawakol’s preferred tactic, although he acknowledged the ever-present issue around scale.
Which is where probabilistic matching can come in to boost the reach. It’s a tradeoff, said Luca Paderni, VP and research director at Forrester – “scale at the detriment of accuracy.”
Because “cross-device data matching is [not] powerful enough to do just deterministic alone,” said David McIninch, CRO of performance marketing company Acquisio.
“Despite the fact that deterministic would be the ideal state, the market’s nowhere near achieving this goal,” he said. “Any data-mining being done by players who don’t own the data has an element of probabilistic data, for sure.”
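The scale-versus-accuracy tradeoff Paderni describes can be shown with a simple threshold sweep. In this sketch, each candidate match has a model score and a flag saying whether it was actually correct; the scores and flags are fabricated solely to illustrate the shape of the tradeoff.

```python
# Hypothetical candidate matches: (model score, whether the match is truly correct).
CANDIDATES = [
    (0.99, True), (0.95, True), (0.90, True), (0.80, True), (0.75, False),
    (0.60, True), (0.55, False), (0.40, False), (0.30, True), (0.10, False),
]


def reach_and_precision(candidates, threshold):
    """Return (number of matches accepted, share of accepted matches that are correct)."""
    accepted = [ok for score, ok in candidates if score >= threshold]
    if not accepted:
        return 0, None
    return len(accepted), sum(accepted) / len(accepted)


for threshold in (0.9, 0.7, 0.5, 0.2):
    reach, precision = reach_and_precision(CANDIDATES, threshold)
    print(f"threshold {threshold:.1f}: reach={reach}, precision={precision:.2f}")
```

Lowering the threshold accepts more matches, so reach grows, but a larger share of them are wrong. A deterministic-only approach sits at the strict end of that dial.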
Logic dictates that even Facebook, with its Atlas ad server and its soon-to-be-launched DSP, has to engage in a bit of probabilistic gymnastics to make sure that the deterministic matches it’s making are as accurate as they can be.
“The [Facebook] data is usually extremely accurate but, much like email, there isn’t going to be an exact match for all data,” McIninch said. “[Say] someone’s name is not their actual name on Facebook, but they’ve signed in with their email and they’ve correlated data through photos, tags, statuses, etc. That rounds out a probabilistic data set that has a lot of legitimacy.”
So, in light of Facebook’s and Google’s scale and what Forrester’s Paderni called their ability to “deliver pretty high levels of reach against most audiences,” how can anyone else compete?
For one, probabilistic data doesn’t reside in a walled garden as Facebook’s stash does, said Drawbridge CEO Kamakshi Sivaramakrishnan in a previous chat with AdExchanger, observing that with deterministic techniques, “the consumer identity is not owned by the marketers,” which leads to a lack of openness and data portability.
User behavior is another limitation.
“If we compare Facebook’s mobile users against mobile-only users for Q1 and Q2, their cross-device reach is shrinking,” Sivaramakrishnan said. “This increasing single-device behavior will be a problem for even the ‘800-pound deterministic gorillas’ as they try to solve for cross-device applications, such as attribution.”