Don’t Drown In Your Data Lake

“The Sell Sider” is a column written by the sell side of the digital media community.

Today’s column is written by James Curran, founder and chief product officer at Staq.

Publishers are starting to ramp up their investment in data, especially as they gain more prowess in programmatic advertising. Many are throwing around the idea of creating their own data warehouse or have already started the process.

Data is currency. Amazon uses data to give users a more relevant retail experience. Google uses data to organize all information on the internet. Facebook and Twitter use data to map social behavior and interests to give users a more engaging social feed.

Publishers need to be careful, though. Everyone with a data lake today needs to ensure they have a license to fish from it and to invite others to fish, too. Beyond that, building one can be overwhelmingly costly for publishers that aren’t ready for the process.

While publishers might be able to handle the rolled-up reports they get from their advertising and data partners, collecting and storing log-level data is far more complicated. Before they get in too deep, publishers must determine what they want to use data for, whether they have the resources to collect, review and manage that data, and whether they can analyze it well enough for an investment in a warehouse to be worthwhile.

Otherwise, their data lake ends up with quicksand at the bottom.

Focus on the goal

Data can be currency, but not every data point is valuable enough to keep. Simply collecting and parking every data point that comes into an organization is a bad idea. Storing it all will be expensive. Publishers will have trouble squeezing insights from a huge data set. And they run a bigger security risk.

Instead, publishers should start with a clear goal. Even better, they shouldn’t move forward on collecting log-level data until they’re sure they can answer at least their three most burning questions.

They might have a goal to normalize pricing across inventory based on advertiser bidding patterns. Or they may be interested in understanding the market value of various pieces of advertising inventory on their website, cross-analyzed with different audience groups. In that case, they’d need a solution that merges data from their data management platform and their ad servers. Every business problem is different, and so every data warehouse should look different.

Publishers will need a plan that allows them to identify and roll up the data they’ll be looking at every day. For example, do they want to see programmatic bids by page or content? Clearing price by advertiser or exchange? Fluctuations over a specific period?
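To make the idea of a daily roll-up concrete, here is a minimal sketch in Python. It assumes log-level rows arrive as simple records; the field names (exchange, advertiser, clearing_price) and the sample values are invented for illustration and are not taken from any particular partner’s feed.

```python
from collections import defaultdict

# Hypothetical log-level records. Field names and values are assumptions
# made up for this illustration.
LOG_ROWS = [
    {"exchange": "exchange_a", "advertiser": "acme", "clearing_price": 2.0},
    {"exchange": "exchange_a", "advertiser": "acme", "clearing_price": 4.0},
    {"exchange": "exchange_b", "advertiser": "acme", "clearing_price": 3.0},
    {"exchange": "exchange_b", "advertiser": "globex", "clearing_price": 1.0},
]


def rollup_clearing_price(rows, key):
    """Group rows by `key` and return {value: (row count, average price)}."""
    totals = defaultdict(lambda: [0, 0.0])  # value -> [count, price sum]
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += 1
        bucket[1] += row["clearing_price"]
    return {value: (count, total / count) for value, (count, total) in totals.items()}


# The same log rows answer two different daily questions.
by_exchange = rollup_clearing_price(LOG_ROWS, "exchange")
by_advertiser = rollup_clearing_price(LOG_ROWS, "advertiser")
```

Deciding up front which keys to roll up by is the point of the plan: the same raw rows can be summarized many ways, and each extra dimension adds storage and upkeep.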

Keep your head out of the clouds, even if the data is in the cloud

With a plan in place, publishers need a location. Think of a data warehouse like a real brick-and-mortar building that will store stuff. Publishers need a clean, safe and secure storage facility. They need to be able to grant access to certain people and restrict it for others. Trucks need to be able to pull up and drop data off at regular times, and publishers need to find a place to put incoming data that’s organized and works with what’s already there.

For publishers with limited resources, these responsibilities might stretch beyond the reasonable limits of their organization. Don’t let developer hubris get in the way of a prudent decision. Most publishers probably do not need their own servers or room in a custom data center.

Amazon Web Services, Microsoft Azure or Google Cloud will likely end up being the best partner because each offers a relatively full service, and that’s OK. The most important points to cover are that the data is secure, organized and accessible, and that the warehouse can accommodate the influx of new data without becoming unmanageable.

The opposite of ‘set it and forget it’

Speaking of unmanageable, log-level data has a habit of spinning out of control and requires a lot more than an organized warehouse to keep it in shape. Publishers also need personnel. Taking the programmatic example, a typical publisher pulls data in from 10, 20 or 30 different data sources per day. And every day, there are errors in that data.

As for internal data, a publisher often manages a collection of many websites. Some publishers are organized into a collection of many publishers. Depending on the organization, each site and publisher may have totally different content and ad placement architecture. Key-value pairs, inventory organization and other important elements will need to be constantly checked and reconciled as people update and add to the different parts of the business.

Publishers will need to have the resources to address errors within millions or billions of data points from partners before they simply back up the truck and dump the info into the warehouse. On top of that, APIs stop working, field names change and partners change their policies. Publishers need to be on top of every minute change or they will fall victim to the “garbage in, garbage out” problem.

At that point, the entire warehouse is compromised before they get a single insight from it. These labor costs are well understood in the world of physical storage, but are often dangerously neglected in the world of data, even as analysts are actively trying to look for answers.
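One way to picture the checking that has to happen before the truck unloads is a small validation pass over each partner file. The sketch below is only an illustration: the required fields, their types and the sample rows are assumptions, and a real feed would need rules written against that partner’s actual report format.

```python
# Hypothetical schema for one partner's daily report. These field names and
# types are assumptions for illustration, not any real partner's format.
REQUIRED_FIELDS = {"date": str, "exchange": str, "impressions": int, "revenue": float}


def validate_rows(rows, required=REQUIRED_FIELDS):
    """Split rows into (clean, rejected); rejected rows carry their problems."""
    clean, rejected = [], []
    for row in rows:
        problems = []
        for name, expected_type in required.items():
            if name not in row:
                problems.append(f"missing field: {name}")
            elif not isinstance(row[name], expected_type):
                problems.append(f"bad type for {name}")
        if problems:
            rejected.append((row, problems))
        else:
            clean.append(row)
    return clean, rejected


# The second row simulates a partner renaming "impressions" to "imps".
sample = [
    {"date": "2024-01-01", "exchange": "exchange_a", "impressions": 100, "revenue": 1.5},
    {"date": "2024-01-01", "exchange": "exchange_b", "imps": 80, "revenue": 0.9},
]
clean, rejected = validate_rows(sample)
```

Only the clean rows would go on to the warehouse. The rejected rows, with their reasons attached, are what a person has to review, which is where the labor cost comes from.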

I know of one publisher that put all warehouse management responsibility on a single person. When that person left the company, their data warehouse sat unattended as tons of data piled up. Their storage costs and risk piled up, too. It was several months before the finance department noticed the mounting costs and figured out where they were coming from.

The moral of the story is that before publishers get to the analysis phase, collecting and storing data is complicated. It requires a plan and goals, organization and oversight. Otherwise, all valuable insights will be sucked into the quicksand at the bottom of the lake.

Follow James Curran (@james_curran), Staq (@STAQ) and AdExchanger (@adexchanger) on Twitter.
