Dogfood

February 17, 2014

Building your own web analytics system using Big Data tools

It’s been a busy couple of years here at Microsoft. For the dwindling few of you who are keeping track, at the beginning of 2012 I took a new job, running our “Big Data” platform for Microsoft’s Online Services Division (OSD) – the division that owns the Bing search engine and MSN, as well as our global advertising business.

As you might expect, Bing and MSN throw off quite a lot of data – around 70 terabytes a day (that’s over 25 petabytes a year, to save you the trouble of calculating it yourself). To process, store and analyze this data, we rely on a distributed data infrastructure spread across tens of thousands of servers. It’s a pretty serious undertaking; but at its heart, the work we do is just a very large-scale version of what I’ve been doing for the past thirteen years: web analytics.

One of the things that makes my job so interesting, however, is that although many of the data problems we have to solve are familiar – defining events, providing a stable ID, sessionization, enabling analysis of non-additive measures, for example – the scale of our data (and the demands of our internal users) has meant that we have had to come up with some creative solutions, and essentially reinvent several parts of the web analytics stack.

What do you mean, the “web analytics stack”?

To users of a commercial web analytics solution, the individual technology components of those solutions are not very explicitly defined, and with good reason – most people simply don’t need to know this information. It’s a bit like demanding to know how the engine, transmission, brakes and suspension work if you’re buying a car – the information is available, but the majority of people are more interested in how fast the car can accelerate, and whether it can stop safely.

However, as data volumes increase, and web analytics needs to be ever more tightly woven into the other data that organizations generate and manage, more people are looking to customize their solutions, and so it’s becoming more important to understand the components involved.

The diagram below provides a very crude illustration of the major components of a typical web analytics “stack”:

[Diagram: the major components of a typical web analytics “stack”]

In most commercial solutions, these components are tightly woven together and often not visible (except indirectly through management tools), for a good reason: ease of implementation. At least for a “default” implementation, part of the value proposition of a commercial web analytics solution is “put our tag on your pages, and a few minutes/hours later, you’ll see numbers on the screen”.

A cunning schema

In order to achieve this promise, these tools have to make (and enforce) certain assumptions about the data, and these assumptions are embodied in the schema that they implement. Some examples of these default schema assumptions are:

  • The basic unit of interaction (transaction event) is the page view
  • Page views come with certain metadata such as User Agent, Referrer, and IP address
  • Page views are aggregated into sessions, and sessions into user profiles, based on some kind of identifier (usually a cookie)
  • Sessions contain certain attributes such as session length, page view count and so on.

Now, none of these schema assumptions is universal, and many tools have the capability to modify and extend the schema (and associated processing rules) quite dramatically. Google Universal Analytics is a big step in this direction, for example. But the reason I’m banging on about the schema is that going significantly “off schema” (that is to say, building your own data model, where some or all of the assumptions above may not apply) is one of the key reasons why people are looking to augment their web analytics solution.
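
To make the idea of going “off schema” a little more concrete, here is a minimal sketch of what a custom event model might look like once you drop the assumption that the page view is the basic unit of interaction. This is purely illustrative – the field names and structure are my own assumptions, not any particular product’s schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical custom event model: the basic unit is a generic "event",
# not a page view, and the identity/sessionization rules are our own.
@dataclass
class Event:
    event_type: str              # e.g. "page_view", "search", "ad_click", "app_launch"
    timestamp: datetime
    user_id: Optional[str]       # might come from a login rather than a cookie
    device_id: Optional[str]     # secondary identifier for stitching profiles together
    properties: dict = field(default_factory=dict)  # free-form payload per event type

# Example: a search event with no page-view semantics at all
evt = Event(
    event_type="search",
    timestamp=datetime.utcnow(),
    user_id="u-123",
    device_id="d-456",
    properties={"query": "cheap flights", "results_shown": 10},
)
print(evt.event_type, evt.properties["query"])
```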

Web Analytics Jenga

The other major reason to build a custom web analytics solution is to swap out one (or more) of the components of the “stack” that I described above to achieve improved performance, flexibility, or integration with another system. Some scenarios in which this might be done are as follows:

  • You want to use your own instrumentation/data collection technologies, and then load the data into a web analytics tool for processing & analysis
  • You want to expose data from your web analytics system in another analysis tool
  • You want to include significant amounts of other data in the processing tier (most web analytics tools allow you to join in external data, but only in relatively simple scenarios)

Like a game of Jenga, you can usually pull out one or two of the blocks from the stack of a commercial web analytics tool without too much difficulty. But if you want to pull out more – and especially if you want to create a significantly customized schema – the tower starts to wobble. And that’s when you might find yourself asking the question, “should we think about building our own web analytics tool?”

“Build your own Web Analytics tool? Are you crazy?”

Back in the dim and distant past (over ten years ago), when I was pitching companies in the UK on the benefits of WebAbacus, occasionally a potential customer would say, “Well, we have been looking at building our own web analytics tool”. At the time, this usually meant that they had someone on staff who could write Perl scripts to process log data. I would politely point out that this was a stupid idea, for all the reasons that you would expect: If you build something yourself, you have to maintain and enhance it yourself, and you don’t get any of the benefits of a commercial product that is funded by licenses to lots of customers, and which therefore will continue to evolve and add features.

But nowadays the technology landscape for managing, processing and analyzing web behavioral data (and other transactional data) has changed out of all recognition. There is a huge ecosystem, mostly based around Hadoop and related technologies, that organizations can leverage to build their own  big data infrastructures, or extend commercial web analytics products.

At the lower end of the Web Analytics stack, tools like Apache Flume can be deployed to handle log data collection and management, with other tools such as Sqoop and Oozie managing data flows; Pig can be used for ETL and enrichment in the data processing layer; or Storm can be used for streaming (realtime) data processing. Further up the stack, Hive and HBase can be used to provide data warehousing and querying capabilities, while there is an increasing range of options (Cloudera’s Impala, Apache Drill, Facebook’s Presto, and Hortonworks’ Stinger) to provide the kind of “interactive analysis” capabilities (dynamic filtering across related datasets) which commercial Web Analytics tools are so good at. And finally, at the top of the stack, Tableau is an increasingly popular choice for reporting & data visualization, and of course there is the Microsoft Power BI toolset.
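
To give a flavour of the kind of work that lands in the processing layer when you go down this road, here’s a deliberately tiny Python sketch of sessionization – grouping a user’s events into sessions using an inactivity timeout. In practice this would be a Pig, Hive or Storm job running over billions of events; the 30-minute timeout and the event layout here are just common, illustrative assumptions:

```python
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT = 30 * 60  # 30 minutes of inactivity ends a session (a common but arbitrary choice)

# Toy event log: (user_id, unix_timestamp) pairs, as might come out of the collection tier
events = [
    ("u1", 1000), ("u1", 1300), ("u1", 4000),   # u1: the 2700s gap splits two sessions
    ("u2", 2000), ("u2", 2100),
]

def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group each user's events into sessions based on an inactivity timeout."""
    sessions = []
    for user, user_events in groupby(sorted(events), key=itemgetter(0)):
        current = []
        last_ts = None
        for _, ts in user_events:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))   # close the previous session
                current = []
            current.append(ts)
            last_ts = ts
        sessions.append((user, current))
    return sessions

for user, ts_list in sessionize(events):
    print(user, "session with", len(ts_list), "events")
```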

In fact, with the richness of the ecosystem, the biggest challenge for anyone looking to roll their own Web Analytics system is a surfeit of choice. In subsequent blog posts (assuming I am able to increase my rate of posting to more than once every 18 months) I will write more about some of the choices available at various points in the stack, and how we’ve made some of these choices at Microsoft. But after finally bestirring myself to write the above, I think I need a little lie down now.


May 01, 2012

Google launches cloud-based BigQuery service

Some interesting news today: Google has fully launched the cloud-based BigQuery service that it first previewed last November. From the website:

Google BigQuery is a web service that lets you do interactive analysis of massive datasets—up to billions of rows. Scalable and easy to use, BigQuery lets developers and businesses tap into powerful data analytics on demand.

The BigQuery service is built on the back of Google’s enormous investments in data infrastructure and exposes some of the clever tools the company has built for internal use to an external audience. It’s designed to help with ad hoc queries against unstructured data – kind of Hadoop in the cloud with a front-end querying service attached. In this regard it shares some similarities with the Hadoop on Azure service from my illustrious employers.

The interesting question with all these cloud-based Big Data services (a list of some of which you can find here, and here) is the acceptability to customers of loading significant amounts of data to the cloud, and dealing with the privacy and security questions that arise as a result. But it is interesting to contrast the significant complexity that attends any conversation about in-house or on-premise big data with the simplicity offered by a cloud-based approach.

The most intriguing aspect of Google’s foray into this area is the prospect of the company being able to leverage its “secret sauce” in terms of data analysis tools and technologies – few other companies may be able to match the kind of investment that Google can make here.


March 08, 2012

Returning to the fold

Five years ago, my worldly possessions gathered together in a knotted handkerchief on the end of a stick, I set off from the shire of Web Analytics to seek my fortune among the bright lights of online advertising. I didn’t exactly become Lord Mayor of London, but the move has been a good one for me, especially in the last three years, when I’ve been learning all sorts of interesting things about how to measure and analyze the monetization of Microsoft’s online properties like MSN and Bing through advertising.

Now, however, the great wheel of fate turns again, and I find myself returning to the web analytics fold, with a new role within Microsoft’s Online Services Division focusing on consumer behavior analytics for Bing and MSN (we tend to call this work “Business and Customer Intelligence”, or BICI for short). Coincidentally I was able to mark this move this week with my first visit to an eMetrics conference in almost three years.

I was at eMetrics to present a kind of potted summary of some of what I’ve learned in the last three years about the challenges of providing data and analysis around display ad monetization. To my regular blog readers, that should come as no surprise, because that’s also the subject of my “Building the Perfect Display Ad Performance Dashboard” series on this blog, and indeed, the presentation lifted some of the concepts and material from the posts I’ve written so far. It also forced me to continue with the material, so I shall be posting more installments on the topic in the near future (I promise). In the meantime, however, you can view the presentation here via the magic of SlideShare:

The most interesting thing I discovered at eMetrics was that the industry has changed hugely while I’ve been away (well, duh). Not so much in terms of the technology, but more in terms of the dialog and how people within the field think of themselves. This was exemplified by the Web Analytics Association’s decision to change its name to the Digital Analytics Association (we shall draw a veil over my pooh-poohing of the idea of a name change in 2010, though it turns out I was on the money with my suggestion that the association look at the word “Digital”). But it was also highlighted by the fact that there was very little representation at the conference by the major technology vendors (with the exception of WebTrends), and that the topic of vendor selection, for so long a staple of eMetrics summits, was largely absent from the discussion. It seems the industry has moved from its technology phase to its practitioner phase – a sign of maturity.

Overall I was left with the impression that the Web Analytics industry, such as it is, increasingly sees itself as a part of a broader church of analysis and “big data” which spans the web, mobile, apps, marketing, operations, e-commerce and advertising. Which is fine by me, since that’s how I see myself. So it feels like a good time to be reacquainting myself with Jim and his merry band of data-heads.


February 07, 2012

Big (Hairy) Data


My eye was caught the other day by a question posed to the “Big Data, Low Latency” group on LinkedIn. The question was as follows:

“I've customer looking for low latency data injection to hadoop . Customer wants to inject 1million records per/sec. Can someone guide me which tools or technology can be used for this kind of data injection to hadoop.”

The question itself is interesting, given its assumption that Hadoop is part of the answer – Hadoop really is the new black in data storage & management these days – but the answers were even more interesting. Among the eleven or so people who responded to the question, there was almost no consensus. No single product (or even shortlist of products) emerged, but more importantly, the actual interpretation of the question (or what the question was getting at) differed widely, spinning off a moderately impassioned debate about the true meaning of “latency”, the merits of solid-state storage vs HD storage, and whether to clean/dedupe the data at load-time, or once the data is in Hadoop.

I wouldn’t class myself as a Hadoop expert (I’m more of a Cosmos guy), much less a data storage architect, so I may be unfairly mischaracterizing the discussion, but the message that jumped out of the thread at me was this: This Big Data stuff really is not mature yet.

I was very much put in mind of the early days of the Web Analytics industry, where so many aspects of the industry and the way customers interacted with it had yet to mature. Not only was there still a plethora of widely differing solutions available, with heated debates about tags vs logs, hosted vs on-premise, and flexible-vs-affordable, but customers themselves didn’t even know how to articulate their needs. Much of the time I spent with customers at WebAbacus in those days was taken up by translating the customer’s requirements (which often had been ghost-written by another vendor who took a radically different approach to web analytics) into terms that we could respond to.

This question thread felt a lot like that – there didn’t seem to be a very mature common language or frame of reference which united the asker of the question and the various folk that answered it. As I read the answers, I found myself feeling mightily sorry for the question-poser, because she now has a list as long as her arm of vendors and technologies to investigate, each of which approaches the problem in a different way, so it’ll be hard going to choose a winner.

If this sounds like a grumble, it’s really not – the opposite, in fact. It’s very exciting to be involved in another industry that is forming before my very eyes. Buy most seasoned Web Analytics professionals enough drinks and they’ll admit to you that the industry was actually a bit more interesting before it was carved up between Omniture and Google (yes, I know there are other players still – as Craig Ferguson would say, I look forward to your letters). So I’m going to enjoy the childhood and adolescence of Big Data while I can.


December 19, 2011

Building the Perfect Display Ad Performance Dashboard, Part II – metrics

Welcome to the second installment in my Building the Perfect Display Ad Performance Dashboard series (Note to self: pick a shorter title for the next series). In the first installment, we looked at an overarching framework for thinking about ad monetization performance, comprised of a set of key measures and dimensions. In this post, we’ll drill into the first of these – the measures that you need to be looking at to understand your business.

 

How much, for how much?

As we discussed in the previous post, analysis of an online ad business needs to focus on the following:

  • How much inventory was available to sell (the Supply)
  • How much inventory was actually sold (the Volume Sold)
  • How much the inventory was actually sold for (the Rate)

Of these, it’s the last two – the volume sold and the rate at which that volume was sold – where the buck (literally) really stops, since these two combine to deliver that magic substance, Revenue. So in this post we’ll focus on volume sold, rate and revenue as the core building-blocks of your dashboard’s metrics.

Volume, rate and revenue are inextricably linked via a fairly basic mathematical relationship:

Revenue = Rate x Volume

Another way of thinking about this is that these three measures form the vertices of a triangle:

[Diagram: the Revenue–Rate–Volume triangle]

Some business and economics textbooks call Rate and Volume “Price” and “Quantity” (or P and Q), but the terms we’re using here are more common in advertising.

Different parts of an ad business can be driven by different corners of the triangle, depending on the dynamics of how each part is transacted. Here are some examples:

  • Ads sold on a time-based/”sponsorship” basis are best thought of as driving revenue performance, because deals are done on a revenue basis regardless of volume/rate (though the advertiser will have a volume & rate expectation, which they’ll want to be met).
  • For premium ads sold on a CPM basis, deals revolve around Rate; the name of the game is to add value to inventory so that, impression-for-impression, it achieves more revenue.
  • For remnant ads and networks, volume is king (assuming you can maintain a reasonable rate) – you’re looking to maximize the amount of inventory sold, and minimize the amount that has to be given away or sent to “house” advertising.

Because of these different dynamics, measurement of ad monetization can easily fragment into various sub-types of measure; for example, as well as cost-per-thousand (CPM) rate, some ads are purchased on a CPC or CPA basis. So a more complete version of the diagram above looks like this:

[Diagram: an expanded version of the triangle, showing CPM, Impressions and Delivery Revenue alongside related measures such as CPC and CPA]

However, it’s essential to remember the key relationship and dynamic between rate, volume and revenue, which is manifested in the CPM, Impressions and Delivery Revenue measures in the diagram above. So let’s look at these measures.

 

Volume

In the online ad business, Volume is measured in Ad Impressions. I have talked about ad impressions before on this blog, in this installment of Online Advertising 101 (you may want to take a moment to read the section entitled “What’s the product?” in that post). From a measurement point of view, whenever your ad server serves an ad (or more accurately, fields a request for an ad), its measurement system should log an ad impression. How much data is logged with this impression will vary depending on the ad server you’re using, but will likely include most of the following:

  • Date & time of the impression
  • Advertiser
  • Campaign and/or creative
  • Location/placement (i.e. where the ad was served)
  • Attributes of the individual who requested the ad (e.g. targeting attributes)

We’ll come back to those attributes (and how you can use them to segment your impressions for better analysis) in another post.
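
For illustration, a single impression record might look something like the sketch below once it lands in your logs. The field names and values here are hypothetical – every ad server has its own log format:

```python
from datetime import datetime

# Hypothetical impression log record; real ad servers use their own field names and formats
impression = {
    "timestamp": datetime(2011, 12, 19, 14, 32, 5),
    "advertiser": "Acme Motors",
    "campaign": "Holiday Sale",
    "creative_id": "cr-789",
    "placement": "homepage/top-banner",      # where the ad was served
    "ad_size": "300x250",
    "targeting": {"geo": "US-WA", "segment": "auto-intenders"},  # attributes of the requester
}

# Volume is then simply a count of these records over the period of interest
print("1 impression logged for", impression["advertiser"])
```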

Capturing a true view of the ad impressions on your site can be a little more challenging if you are using multiple ad servers or networks to sell your inventory, particularly if you are using a combination of your own first-party ad server (for example, DFP) and redirecting some impressions to a third-party such as an ad network. When you have delivery systems chained together in this way, you may need to combine the impression counts (and other data) from those systems to get a true picture of impression volume, and you will need to be careful to avoid double-counting.

For reasons that will become clearer when we get on to talking about rate, it’s essential that you capture impression counts for your ad sales where you possibly can, even for parts of your site or network where the supply is not sold on an impression basis.

Other volume measures such as Clicks and Conversions become very useful when you’re looking to assess how valuable your inventory is from an advertiser perspective, since both are a proxy for true Advertiser ROI. They’re also useful for deriving effective rate, as we’ll see below.

 

Rate

At the highest level, rate is a simple function of volume and revenue – simply divide your total revenue by your total volume (and usually multiply by 1,000 to get a more usable number) and you have your overall rate – in fact, you have the most commonly used kind of rate that people talk about, known as “Effective Cost-per-Mille (Thousand)”, or eCPM (don’t ask me why the e has to be small – ask e.e. cummings). Just to be clear, eCPM is calculated as:

eCPM = (Revenue) * 1000 / (Volume)

Sometimes eCPM is known as eRPM (where the R stands for “Revenue”).

The reason we’re talking about eCPM before revenue in this post is that many advertising deals are struck on a CPM basis – i.e. the advertiser agrees to buy a certain number of impressions at a certain pre-agreed rate. However, even for inventory that is not being sold on a CPM basis, it’s essential to be able to convert the rate to eCPM. Here’s why.

The beauty of eCPM is that it is the lowest common denominator – regardless of how a particular portion of your impression supply was sold (e.g. on a cost-per-click basis, or on a “share of voice” or time-based basis), if you can convert the rate back into effective CPM you can compare the performance of different subsets of your inventory on a like-for-like basis. Consider the following example of delivery info for the parts of a fictional autos site:

Site area   | Sold as…       | Deal
Home page   | Share-of-voice | $10,000 up-front
Car reviews | Reserved CPM   | $2.50 CPM
Community   | AdSense        | $1.20 CPC

With just the information above, it’s impossible to understand whether the Home Page, Reviews or Community site areas are doing better, because they’re all sold on a different basis. But if you add impression counts (and, in the case of the Community area, click counts), it’s possible to derive an overall rate for the site, as well as to see which parts are doing best:

Site area   | Sold as…       | Deal             | Impressions | Clicks | CPC   | Revenue    | eCPM
Home page   | Sponsorship    | $10,000 up-front | 5,347,592   | n/a    | n/a   | $10,000.00 | $1.87
Car reviews | Reserved CPM   | $2.50 CPM        | 3,472,183   | n/a    | n/a   | $8,680.45  | $2.50
Community   | AdSense        | $1.20 CPC        | 1,306,368   | 5,832  | $1.20 | $6,998.40  | $5.36
Total       |                |                  | 10,126,143  |        |       | $25,678.85 | $2.54

See? Who knew that the Community area was throwing off so much money per impression compared to the other areas?
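
If you prefer to see the arithmetic spelled out, here’s a small Python sketch that reproduces the eCPM figures in the table above from the fictional revenue and impression numbers:

```python
def ecpm(revenue, impressions):
    """Effective CPM: revenue per thousand impressions."""
    return revenue * 1000.0 / impressions

areas = [
    # (site area, revenue, impressions) from the fictional autos-site example
    ("Home page",   10000.00, 5347592),   # $10,000 up-front sponsorship
    ("Car reviews",  8680.45, 3472183),   # 3,472,183 impressions at $2.50 CPM
    ("Community",    6998.40, 1306368),   # 5,832 clicks at $1.20 CPC
]

for name, revenue, impressions in areas:
    print(f"{name}: eCPM ${ecpm(revenue, impressions):.2f}")

total_revenue = sum(r for _, r, _ in areas)
total_impressions = sum(i for _, _, i in areas)
print(f"Total: eCPM ${ecpm(total_revenue, total_impressions):.2f}")
```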

eCPM isn’t the only rate currency you can use, though its connection to both volume and revenue puts it at a distinct advantage, and it means most to publishers because it speaks to the one thing that a publisher can exert (some) control over – the volume of impressions that are available to sell.

 

Revenue

If you sell your inventory on a fairly straightforward CPM or CPC basis, then your site’s revenue will pop neatly out of the equation:

(Revenue) = (eCPM) * (Volume) / 1000

However, if you’re running a larger site and engaging in sponsorship-type deals with advertisers, your revenue picture may look a little more complex. This is because “sponsorships” (a term which covers a multitude of sins) can contain multiple revenue components, some of which can be linked to ad delivery (and which therefore lend themselves to rate calculations), and some of which cannot.

For example, the sponsorship deal on our fictitious autos site referenced above could in fact contain the following components on the invoice sent to the advertiser or agency:

Item                                                      | Cost   | Impression Target
100% Share-of-voice rotation, 300x250, Home Page (1 day)  | $6,000 | 3,000,000
100% Share-of-voice rotation, 120x600, Home Page (1 day)  | $4,000 | 3,000,000
Sponsor branding – Home Page background (1 day)           | $8,500 | n/a
Sponsored article linked from Home Page (1 day)           | $3,500 | n/a
Sponsor watermark on Home Page featured video (1 day)     | $1,500 | n/a

In the above table, only the first two items are expected to be delivered through the ad server; the other three are likely to be “hard-coded” into the site’s CMS and actually deliver with the page impressions (or video stream, in the case of the last one).

There are a couple of different options for dealing with this second kind of revenue (which we’ll call “non-delivery” revenue), which can’t be directly linked to ad impressions. One is to attribute the revenue to the ad delivery anyway, on the assumption that the ads “drag along” the other revenue. So in the above example, with the full $23,500 of sponsorship revenue spread across the 5,347,592 impressions delivered by the two ad units, the “overloaded” eCPM for those ad units would be $23,500 × 1,000 ÷ 5,347,592 ≈ $4.39.

The challenge with this approach is that the extra revenue is not associated with delivery of any particular ad. So in the above example, if you wanted to calculate the eCPM for just the 120x600 unit on the home page (perhaps across an entire month), would you include the non-delivery revenue? If yes, then how much of it? 50%? 40%? The lack of ability to truly associate the revenue with ad delivery makes these kinds of calls incredibly hard, and open to dispute (which is the last thing you want if you are presenting your numbers to the CEO).

The other approach is to treat the “non-delivery” revenue as a separate bucket of revenue that can’t be used in rate calculations. This keeps the data picture simpler and more consistent on the “delivery” side of the house, but you do end up with an awkward block of revenue that people are constantly poking and saying things like “I sure wish we could break that non-delivered revenue out a bit more”.

 

A complicated relationship

Once you have your arms around these three core measures, you can start to see how they interact, and therein lies the magic and intricacy of the dynamics of selling display advertising. The implacable logic of the simple mathematical relationship between the three measures means that if one changes, then at least one of the others must also change. Only by looking at all three can you truly understand what is going on. We’ll dig into these relationships more in subsequent posts, but here’s a simple example of the rate achieved for ads sold on a fictional news site home page:

[Chart: monthly eCPM for ads on the fictional news site home page, showing a sharp dip in June 2009]

Someone looking at this chart may well ask “OMG! What happened to our rate in June 2009?” Well, a quick search on Wikipedia will reveal that a certain “King of Pop” died in that month, sending the traffic (and hence the ad impression volume) of most news sites sky-rocketing. In our fictional home-page example, almost all revenue is driven by “share of voice” (time-based) deals, so all that extra volume does is depress the effective rate, because the site earns the same amount per day regardless of traffic levels. So here’s volume and revenue from the same data set, to round out the picture:

[Chart: monthly impression volume and revenue for the same home page over the same period]

We can now see that in fact, June wasn’t a bad month for Revenue; it was the huge spike in traffic that did the rate in.
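
Here’s the same dynamic reduced to a toy calculation (the numbers are invented): with time-based “share of voice” deals the revenue for the period is fixed, so a spike in impressions mechanically drives the effective rate down.

```python
MONTHLY_SOV_REVENUE = 300_000.0   # hypothetical fixed "share of voice" revenue for the month

# Invented impression counts for an ordinary month and a traffic-spike month
for month, impressions in [("May 2009", 150_000_000), ("June 2009", 400_000_000)]:
    ecpm = MONTHLY_SOV_REVENUE * 1000 / impressions
    print(f"{month}: revenue ${MONTHLY_SOV_REVENUE:,.0f}, "
          f"impressions {impressions:,}, eCPM ${ecpm:.2f}")
```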

The above example takes something very important for granted – namely, that we have enough segmentation (or “dimensional”) data associated with our measures to be able to break down site performance into more useful chunks (in this case, just the performance of the home page). In the next blog post, we’ll look at some of the most important of these dimensions. Stay tuned!


November 21, 2011

Should Wikipedia accept advertising?

It’s that time of year again. The nights are drawing in, snow is starting to fall in the mountains, our minds turn to thoughts of turkey and Christmas pudding, and familiar faces appear: Santa, Len and Bruno, and of course, Jimmy Wales.

If you are a user of Wikipedia (which, if you’re a user of the Internet, you almost certainly are), you’ll likely be familiar with Jimmy Wales, the founder of Wikipedia and head of the Wikimedia Foundation, the non-profit which runs the site. Each year Jimmy personally fronts a campaign to raise funds to cover the cost of running Wikipedia, which this year will amount to around $29m.

The most visible part of this campaign is the giant banner featuring Jimmy Wales’s face which appears at the top of every Wikipedia article at this time of year. This year the banner has caused some hilarity as the position of the picture of Jimmy just above the article title has provided endless comic potential (as above), but every year it becomes increasingly wearisome to have Jimmy’s mug staring out at you for around three months. Would it not be easier for all concerned if Wikipedia just carried some advertising?

Jimmy has gone on record as saying that he doesn’t believe that Wikipedia should be funded by advertising, and I understand his position. To parse/interpret his concerns, I believe he’s worried about the following:

  • Accepting advertising would compromise Wikipedia’s editorial independence from commercial interests
  • Ads would interfere with the user experience of Wikipedia and be intrusive
  • Wikipedia contributors would not want to contribute for free to Wikipedia if they knew it was accepting advertising

I’m biased, of course, since I work for Microsoft Advertising, but I believe that each of these concerns is manageable. Let’s take them one by one:

Concern 1: Ads would compromise Wikipedia’s independence

There are plenty of historical examples where a publication has been put in a difficult position when deciding what to publish because of relationships with large advertisers. Wikipedia certainly doesn’t want, for example, Nike complaining about the content of its Wikipedia entry. And the idea of Wikipedia starting to employ sales reps to hawk its inventory is a decidedly unedifying one.

But Wikipedia does not have to engage in direct sales, or even non-blind selling, to reach its financial goals with advertising. The site could make its inventory available on a blind ad network (or ideally multiple networks) so that it would be impossible for an advertiser to specifically buy ad space on Wikipedia. If an advertiser didn’t like their ads appearing on Wikipedia, most networks offer a site-specific opt-out, but the overall impact of this on Wikipedia would be minimal – Wikipedia carries such a vast range of content that it has the most highly diversified content portfolio in the world; no single advertiser could exert any real leverage over it.

Concern 2: Ads would make Wikipedia suck

As has been noted elsewhere, there are plenty of horrible ads at large on the Internet – intrusive pop-ups, or horrible creative. It would certainly be a valid concern that Wikipedia would suddenly become loaded with distracting commercial messages. But according to the back-of-an-envelope calculations I’ve done, there is no need for Wikipedia to saturate itself with ads in order to pay the bills.

According to the excellent stats.wikimedia.org site, Wikipedia served almost exactly 15bn page views world-wide in October 2011 (around half of which were in English). Assuming no growth in that figure over 12 months, that’s around 180bn PVs per year. So to meet its funding requirements, Wikipedia would need to generate a $0.16 eCPM on those page views ($29m ÷ 180bn page views × 1,000 ≈ $0.16, assuming just one ad unit per page). That’s a pretty modest rate, especially on a site with as much rich content as Wikipedia. It would give the site a number of options in terms of ad placement strategy, such as:

  • Place a very low-impact, small text ad on every page
  • Place a somewhat larger/more impactful ad on a percentage of pages on a rotation, and leave other pages ad free
  • Place ads on certain types of pages, leaving others always ad free (such as pages about people or companies, or pages in a particular language/geo)
  • Deploy a mix of units across different types of page, or in rotation

This also assumes that Wikimedia needs to raise all its funds every year from advertising, which it may not need to – though once the site accepted advertising, it would definitely become more difficult (though perhaps not impossible) to raise donations.

To preserve the user experience, I would definitely recommend just running text ads, which could be placed relatively unobtrusively. Sites running text-based contextual ads (such as those from Google AdSense or Microsoft adCenter) can usually expect to get at least around $0.30 eCPM, so there would be some headroom.
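
Here’s the back-of-an-envelope arithmetic in runnable form, using the traffic and cost figures above; the $0.30 text-ad eCPM is just the assumed going rate mentioned above, not a measured figure:

```python
annual_cost = 29_000_000                         # Wikimedia's ~$29m annual funding target
monthly_page_views = 15_000_000_000              # ~15bn PVs in October 2011
annual_page_views = monthly_page_views * 12      # ~180bn PVs/year, assuming flat traffic

# Required eCPM if every page carried exactly one ad unit
required_ecpm = annual_cost * 1000 / annual_page_views
print(f"Required eCPM with ads on every page: ${required_ecpm:.2f}")          # ~$0.16

# If text ads earn an assumed $0.30 eCPM, what share of pages would need to carry an ad?
assumed_text_ad_ecpm = 0.30
share_of_pages = required_ecpm / assumed_text_ad_ecpm
print(f"Share of page views needing an ad at $0.30 eCPM: {share_of_pages:.0%}")  # ~54%
```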

I would also recommend that Wikipedia not run targeted ads – or at least, only work with networks that do not sell user data to third parties. It could cause a significant backlash if users came to feel that Wikipedia was effectively selling data about their browsing habits to advertisers for a fast buck.

Concern 3: Ads would make contributors flee

I can speak to this concern less authoritatively, since I am not that familiar with the world of Wikipedia contribution, but so long as Wikimedia made it clear that it was remaining a non-profit organization, and continued to operate in a thrifty fashion to cover its costs, the initial outrage of Wikipedia contributors could be managed. After all, plenty of other open-source projects that rely on unpaid contributors do provide the foundations for commercial activities, Linux being the best example.

In any case, in its deliberations about balancing the needs of its contributors with its need to pay the bills, Wikimedia will need to face some hard questions: Will it always be able to cover its costs through donations? Does the current level of investment in infrastructure represent an acceptable level of risk for a site that serves so many users? Is it acceptable to rely on unpaid contributors indefinitely? If Wikipedia ran out of cash or went down altogether, the righteous indignation of its contributors may not count for very much.

Apart from advertising and donations, the only other way that Wikipedia could pay the bills would be by creating paid-for services – for example, a research service. But would the unpaid Wikipedia contributors really be happier with this outcome than with advertising? It would effectively amount to selling the content that they’d authored for free. At least with advertising, it’s the user that is the product, not the content. So long as Wikipedia can maintain editorial independence and retain a good user experience, advertising feels like the better option to me.


November 09, 2011

Building the Perfect Display Ad Performance Dashboard, Part I – creating a measurement framework

There is no shortage of pontification available about how to measure your online marketing campaigns: how to integrate social media measurement, landing page optimization, ensuring your site has the right feng shui to deliver optimal conversions, etc. But there is very little writing about the other side of the coin: if you’re the one selling the advertising, on your site, or blog, or whatever, how do you understand and then maximize the revenue that your site earns?

As I’ve covered previously in my Online Advertising 101 series, publishers have a number of tools and techniques available to manage the price that their online ad inventory is sold for. But the use of those tools is guided by data and metrics. And it’s the generation and analysis of this data that is the focus of this series of posts.

In this series, I’ll unpack the key data components that you will need to pull together to create a dashboard that will give you meaningful, actionable information about how your site is generating money – or monetizing, to use the jargon.

We’ll start by taking a high-level look at a framework for analyzing a site’s (or network’s) monetization performance. In subsequent posts, we’ll drill into the topics that we touch on briefly here.

 

Getting the measure of the business

Ultimately, for any business, revenue (or strictly speaking, income or profit) is king. If you’re not generating revenue, you can’t pay the bills (despite what trendy start-ups will tell you). But anyone running a business needs a bit more detail to make decisions that will drive increased revenue.

In the ad-supported publishing business, these decisions fall into a couple of broad buckets:

  • How to create more (or more appealing) supply of sellable advertising inventory
  • How to monetize the supply more effectively – either by selling more of it, or selling it for a better price, or both

Another way of thinking about these decisions is in a supply/demand framework that is common to almost all businesses: If your product is selling like hot cakes and you can’t mint enough to meet demand, you have a supply problem, and you need to focus on creating more supply. If, on the other hand, you have a lot of unsold stock sitting around in warehouses (real or virtual), you have a demand problem, and you need to think about how to make your products more compelling, or your sales force more effective, or both.

Online publishers usually suffer from both problems at the same time: Part of their inventory supply will be in high demand, and the business will be supply-constrained (it is not easy to mint new ad impressions the way a widget manufacturer can stamp out new widgets). Other parts of the inventory, on the other hand, will be hard to shift, and the business will be demand-constrained – and unlike widgets, unsold ad inventory goes poof! when the clock strikes midnight.

So analysis of an online ad business needs to be based on the following key measures:

  • How much inventory was available to sell (the Supply)
  • How much inventory was actually sold (the Volume Sold)
  • How much the inventory was actually sold for (the Rate)

It’s ultimately these measures (and a few others that can be derived from them) that will tell you whether you’re succeeding or failing in your efforts to monetize your site. But like any reasonably complex business (and online advertising is, at the very least, unreasonably complex), it’s really how you segment the analysis that counts in terms of making decisions.

 

What did we sell, and how did we sell it?

Most businesses would be doing a pretty poor job of analysis if they couldn’t look at business performance broken out by the products they sell. A grocery chain that didn’t know if it was selling more grapes or grape-nuts would not last very long. Online advertising is no exception – in fact, quite the opposite. Because online ad inventory can be packaged so flexibly, it’s essential to answer the question “What did we sell?” in a variety of ways, such as:

  • What site areas (or sub-areas) were sold
  • What audience/targeting segments were sold
  • What day-parts were sold
  • What ad unit sizes were sold
  • What rich media types were sold

The online ad sales business also has the unusual property that the same supply can (and is) sold through multiple channels at different price points. So it is very important to segment the business based on how the supply was sold, such as:

  • Direct vs indirect (e.g. via a network or exchange)
  • Reserved vs remnant/discretionary

Depending on the kind of site or network you’re analyzing, different aspects of these what and how dimensions will be more important. For example, if you’re running a site with lots of high-quality editorial content, analyzing sales by content area/topic will be very important; on the other hand, if the site is a community site with lots of undifferentiated content but a loyal user base, audience segments will be more relevant.

 

Bringing it together – the framework

I don’t know about you, but since I am a visual person to start with, and have spent most of the last ten years looking at spreadsheets or data tables of one sort or another, when I think of combining the components that I’ve described above, I think of a table that looks a bit like the following:

[Table: the core measures (supply, volume sold, rate, revenue) broken out by the “what” and “how” dimensions]

This table is really just a visual way of remembering the differences between the measures that we’re interested in (volume, rate etc) and the dimensions that we want to break things out by (the “what” and “how” detail). If you don’t spend as much of your time talking to people about data cubes as I do, these terms may be a little unfamiliar to you, which is why I’m formally introducing them here. (As an aside, I have found that if you authoritatively bandy about terms like “dimensionality” when talking about data, you come across as very wise-sounding.)

In the next posts in this series, I shall dig into these measures and dimensions (and others) in more detail, to allow us to populate the framework above with real numbers. We’ll also be looking at how you can tune the scope of your analysis to ensure that the numbers you look at are genuinely meaningful and actionable.

For now, here’s an example of the kinds of questions that you would be able to answer if you looked at premium vs non-premium ad units as the “what” dimension, and direct vs indirect as the “how” dimension:

[Table: example questions answerable with premium vs non-premium ad units as the “what” dimension and direct vs indirect as the “how” dimension]
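
To make that a little more concrete, here’s a toy sketch of the same cut of the data in code. The numbers are invented, and the pandas library is just a convenient stand-in for whatever reporting tool you actually use:

```python
import pandas as pd

# Toy delivery data: each row is a slice of sold inventory
df = pd.DataFrame([
    # what (ad unit type), how (sales channel), impressions, revenue
    ("Premium",     "Direct",   2_000_000, 9000.0),
    ("Premium",     "Indirect",   500_000, 1200.0),
    ("Non-premium", "Direct",   1_000_000, 1500.0),
    ("Non-premium", "Indirect", 4_000_000, 3200.0),
], columns=["ad_unit", "channel", "impressions", "revenue"])

# Volume and revenue roll up additively; rate (eCPM) is derived afterwards
pivot = df.pivot_table(index="ad_unit", columns="channel",
                       values=["impressions", "revenue"], aggfunc="sum")
ecpm = pivot["revenue"] * 1000 / pivot["impressions"]

print(pivot)
print("\neCPM by ad unit and channel:\n", ecpm.round(2))
```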

 

As this series progresses, I’d love to know what you think of it, as well as topics that you would like me to focus on. So please make use of the comments box below.


October 24, 2011

Wading into the Google Secure Search fray

There’s been quite the hullabaloo since Google announced last week that it was going to send signed-in users to Google Secure Search by default. Back when Google first announced Secure Search in May, there was some commentary about how it would reduce the amount of data available to web analytics tools. This is because browsers do not make page referrer information available in the HTTP header or in the page Document Object Model (accessible via JavaScript) when a user clicks a link from an SSL-secured page through to a non-secure page. This in turn means that a web analytics tool pointed at the destination site is unable to see the referring URLs for any SSL-secured pages that visitors arrived from.

This is all desired behavior, of course, because if you’ve been doing something super-secret on a secure website, you don’t want to suddenly pass info about what you’ve been doing to any old non-secure site when you click an off-site link (though shame on the web developer who places sensitive information in the URL of a site, even if the URL is encrypted).

At the time, the web analytics industry’s concerns were mitigated by the expectation that relatively few users would proactively choose to search on Google’s secure site, and that consequently the data impact would be minimal. But the impact will jump significantly once the choice becomes a default.

One curious quirk of Google’s announcement is this sentence (my highlighting):

When you search from https://www.google.com, websites you visit from our organic search listings will still know that you came from Google, but won't receive information about each individual query.

This sentence caused me to waste my morning running tests of exactly what referrer information is made available by different browsers in a secure-to-insecure referral situation. The answer (as I expected) is absolutely nothing – no domain data, and certainly no URL parameter (keyword) data is available. So I am left wondering whether the sentence above is just an inaccuracy on Google’s part – when you click through from Google Secure Search, sites will not know that you came from Google. Am I missing something here? [Update: Seems I am. See bottom of the post for more details]

I should say that I generally applaud Google’s commitment to protecting privacy online in this way – despite the fact that it has been demonstrated many times that an individual’s keyword history is a valuable asset for online identity thieves, most users would not bother to secure their searches when left to their own devices. On the other hand, this move does come with a fair amount of collateral damage for anyone engaged in SEO work. Google’s hope seems to be that over time more and more sites will adopt SSL as the default, which would enable sites to capture the referring information again – but that seems like a long way off.

It seems that Google Analytics is as affected by this change as any other web analytics tool. Interestingly, though, if Google chose to, it could make the click-through information available to GA, since it captures this information via the redirect it uses on the outbound links from the Search Results page. But if it were to do this, I think there would be something of an outcry, unless Google provided a way of making that same data available to other tools, perhaps via an API.

So for the time being the industry is going to have to adjust to incomplete referrer information from Google, and indeed from other search engines (such as Bing) that follow suit. It always seems to be two steps forward, one step back for the web analytics industry. Ah well, plus ça change…

Update, 10/25: Thanks to commenter Anthony below for pointing me to this post on Google+ (of course). In the comments, Eric Wu nails what is actually happening that enables Google to say that it will still be passing its domain over when users click to non-secure sites. It seems that Google will be using a non-secure redirect that has the query parameter value removed from the redirect URL. Because the redirect is non-secure, its URL will appear in the referrer logs of the destination site, but without the actual keyword. As Eric points out, this has the further unfortunate side-effect of ensuring that destination sites will not receive query information, even if they themselves set SSL as their default (though it’s not clear to me how one can force Google to link to the SSL version of a site by default). The plot thickens…


October 17, 2011

Nicely executed retargeting opt-out (for a change)

Retargeting (sometimes called remessaging or remarketing) has taken off in a big way recently – Google introduced the feature into AdWords earlier this year, and a host of other players are in the game. Consequently, the interwebs now abound with commentary on the rather spooky nature of the technology, with people being “followed around” the Internet by ads for things they were either searching for, or were looking at on e-commerce websites.

It is true that most retargeting implementations are a bit clunky, and I have been on the receiving end of plenty of them myself. Their most irritating aspect is that the time window for perceived relevance of the retargeted ads seems to be ridiculously long. It’s somehow almost more irritating to be deluged by ads for that miscellaneous widget site that you once visited a few weeks ago (even though you have since satisfied your need for widgets elsewhere) than it is to be served non-targeted (or more broadly targeted) ads.

Such ads are made more bearable by a robust opt-out capability; many ad networks have adopted the IAB’s self-regulatory program, which calls for the advertiser to make it possible to opt out of these kinds of ads, which is to say, stop receiving them; stopping the data collection is a more difficult matter.

So today I want to give a little love to TellApart, not because their retargeting implementation is especially subtle or innovative, but simply because they provide a nice opt-out implementation. Last week I spent a little time looking for a desk for my daughter (who currently occupies our dining table with her homework). So since then I have been served retargeted ads on behalf of the site I visited (www.childrensdesks.com) on various sites. Here’s one from Business Insider:

[Image: retargeted ad for www.childrensdesks.com shown on Business Insider]

The nice thing about the ad is it has a little “x” icon in the top right (which actually makes a little more sense than the IAB’s suggested “Advertising Option Icon”, which is a bit cryptic). Clicking it gives me this:

[Image: the opt-out confirmation displayed in the ad unit]

The ability to opt out right in the ad unit is nice, and makes me feel more well-disposed to the advertiser and the site that the ad is running on. Clicking through the “Learn More About These Ads” link at the bottom takes me to TellApart’s website with a little more information and the same option to opt out – though no option to opt out of certain categories of ads, or groups of advertisers. If more retargeting networks provided simpler opt-out capabilities like these, it might help to make these ads seem like less of a scary proposition.


June 07, 2010

Online Advertising Business 101, Part VII: Demand-side Platforms (DSPs)

[Image: “This is not the DSP you’re looking for”]

It turns out, alarmingly, that it’s been over a year since my last Online Advertising Business 101 post. And a year is an awfully long time in the world of online advertising. Long enough, in fact, for an entirely new kind of company to emerge and become the next big thing. I’m talking, of course, about Demand-side Platforms. You’ve heard of them, I trust? No? Then read on.

 

DSPs, RTBs, oh my

As it happens, my last post in this series was on the subject of Ad Exchanges – a new kind of participant in the advertising value chain that acts as an intermediary between ad networks, allowing them to exchange inventory to make up for shortfalls in supply and demand among their own publishers and advertisers, and also allowing other scale players (e.g. big advertisers like eBay, or a big agency) to buy inventory dynamically across a number of different networks. All those folks had to do was implement some technology to interact with the exchange, and place bids in real-time (one of the key characteristics of an ad exchange being that it can auction inventory in real-time).

It turns out that that last glib sentence conceals an awful lot of complexity. Real-time Bidding (RTB), as it’s known, is a pretty complex technical challenge to pull off. It involves building a system that can listen for inbound impression opportunities, and parse them more or less in real-time (milliseconds) to transmit back a bid for that impression. If you just wanted to send the same bid back for every impression, or vary the bid on the basis of some very simple variables (e.g. the size of the ad unit), then you could probably hack something together reasonably easily, but of course it’s not that simple.

Once you get into the business of real-time bidding, you want to take into consideration a wealth of data (the more the better, in fact) about the impression, including (but not limited to):

  • Ad unit size/format (e.g. Rich Media)
  • Site
  • Page content category
  • Geo
  • Time of day
  • Frequency
  • User profile

Of course, if you’re an agency or network doing this, you want to be able to manage different bids and budgets for your different advertisers, and also to incorporate things like advertiser/publisher block lists (e.g. Wall Street Journal doesn’t want to advertise on the NYT site). And you probably want to enable advertisers to manage (or at least track) their campaigns through the third-party ad servers (Atlas Media Console, DFA, OAS) that they’re used to.
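
To give a rough sense of the decision an RTB system has to make for every inbound impression (in a handful of milliseconds), here is a deliberately over-simplified Python sketch. Everything in it – the field names, the block list, the bid adjustments – is an illustrative assumption, not any real exchange’s protocol or any real bidder’s logic:

```python
# Illustrative only: a real bidder is a distributed, low-latency service,
# not a single Python function, and real bid requests follow an exchange's own protocol.

CAMPAIGN_ADVERTISER = {"camp-1": "wsj.com", "camp-2": "acme.com"}
BLOCK_LISTS = {"wsj.com": {"nytimes.com"}}    # advertiser -> sites it won't appear on
BASE_BIDS = {"camp-1": 1.50, "camp-2": 0.80}  # campaign -> base CPM bid in dollars
BUDGET_LEFT = {"camp-1": 250.0, "camp-2": 40.0}

def choose_bid(bid_request):
    """Return (campaign, cpm_bid) for an impression opportunity, or None to pass."""
    best = None
    for campaign, base_cpm in BASE_BIDS.items():
        advertiser = CAMPAIGN_ADVERTISER[campaign]
        if bid_request["site"] in BLOCK_LISTS.get(advertiser, set()):
            continue  # respect advertiser/publisher block lists
        if BUDGET_LEFT[campaign] <= 0:
            continue  # don't bid past the campaign's remaining budget
        bid = base_cpm
        # Crude examples of using impression data (geo, frequency, ...) to adjust the bid
        if bid_request.get("geo") == "US":
            bid *= 1.2
        if bid_request.get("frequency", 0) > 5:   # user has already seen this campaign a lot
            bid *= 0.5
        if best is None or bid > best[1]:
            best = (campaign, bid)
    return best

request = {"site": "nytimes.com", "geo": "US", "frequency": 2, "ad_size": "300x250"}
campaign, cpm = choose_bid(request)
print(f"Bid ${cpm:.2f} CPM on behalf of {campaign}")  # camp-1 is blocked on nytimes.com, so camp-2 wins
```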

 

What’s a DSP?

The complexity of building something like this has spawned the new technology category of the Demand-side Platform (or DSP, which always makes me think of something else, showing my age). The name comes from the fact that these systems “aggregate demand” – i.e. they provide a single interface to the supply-side of the value chain for a number of advertisers.

Since many of the established participants in the advertising value chain already represent (or serve) advertisers, almost everyone – agencies, networks, exchanges – is calling themselves a DSP these days, much to the annoyance of Mike Nolet at AppNexus. Mike argues that the glib name, together with its appropriation by every man and his dog, make it all but useless to claim (or describe) something as a “DSP”.

It is true that the “Platform” in “Demand-side Platform” is a bit of a stretch, but I’m a little bit more forgiving than Mike, and don’t think that it’s an entirely useless term. As to what defines a DSP, I would offer the following definition, based upon the list of capabilities below. If a company offers a well-integrated set of, say, three-quarters of these capabilities, and offers a real-time bidding capability, then you’re probably looking at a “proper” DSP:

Typical DSP capabilities:

  • Advertiser bid/campaign management UI or API
  • Ability to execute real-time bids with exchanges and other “RTB-enabled” counterparties
  • Multidimensional bid optimization
  • Ability for advertisers to provide their own data (e.g. targeting cookie data) to refine bidding
  • Universal user management (e.g. frequency capping) across multiple inventory sources
  • Integration with third-party ad servers
  • Consolidated reporting/billing/payments across multiple third parties

Most of these “DSP capabilities” are nothing new. The items above can be found in other systems, such as those provided by some ad networks and third-party ad servers – which is why companies in these businesses are busily launching DSPs. What is new is the consolidation of these features into discrete offerings – and the new boundaries that are being drawn as a result.

 

Where to find a DSP

As well as stand-alone DSP providers (see the list below), DSP functionality is appearing in two other main places: ad networks and media agencies.

Ad networks are incorporating DSPs into their advertiser-facing offerings so that they can broaden the range of media they can offer to their advertisers beyond that which they directly represent. The latest network to do this is your friend and mine, Google, which this week announced the acquisition of Invite Media, which brokers bidding on Google’s own DoubleClick Exchange, but also a range of other exchanges including our own AdECN.

Networks will also increasingly use DSPs themselves to source their inventory, gradually replacing their “in-bulk” supply-side relationships with DSP-mediated deals where the network gets to pick and choose which impressions to add to its pool.

The other place that is getting all DSP’d up is the media agency. Several large agency holding companies are establishing ‘trading desks’ which are consolidating the buying (aka demand) from across the company’s advertiser base and leveraging that demand to drive better-value deals for the agency. These trading desks are becoming increasingly sophisticated and have recently started engaging in real-time bidding. Some of the better-established trading desks (with the providing agency in parentheses alongside them) are in the list below:

  • Vivaki Nerve Center (Publicis)
  • Omnicom Trading Desk
  • B3 (WPP)
  • AdNetic (Havas)
  • Cadreon (IPG)
  • ATOM (Razorfish)

 

Who are the DSPs, then?

Well, that depends who you ask, of course. But now that you’re this far down the post, I feel emboldened to provide you with a list of the best-known independent DSPs. If I’ve missed someone, please let me know, and I’ll add them in.

  • AdChemy – Provider of an audience-based display and search advertising platform (the “AdChemy Experience Platform”). Focuses on generating a “relevant” experience for consumers, bundling in dynamic creative generation with other DSP-style services such as segment targeting & optimization. Typically partners with agencies – its largest partnership is with Accenture Interactive.
  • x+1 – Offers campaign/media optimization (“Media+1”) as part of a suite that includes landing page & site optimization (straying into MVT territory). Media+1 enables real-time bidding for exchange-based inventory – x+1 has created a “Virtual Audience Network” across multiple ad exchange & network inventory sources, to compete with traditional ad network offerings.
  • MediaMath – “The new marketing OS”. Provides a DSP offering (“TerminalOne”) to agencies, through a combination of technology and services. One of the first companies to market with such an offering.
  • Turn – Offers both agency trading desk and full-fledged network management software – both include the full range of DSP “table stakes” features such as exchange integration, RTB, targeting & optimization. The network offering adds cross-network inventory/campaign optimization. Because of both offerings, Turn works with both agencies and ad networks.
  • DataXu – Fairly straightforward independent DSP – offers real-time bidding across a handful of exchanges, with optimization and tracking built in.
  • AppNexus – Styles itself as a cloud-based platform on which RTB/DSP-style systems can be built, rather than a DSP in itself. Partners with agencies and ad networks to enable them to build out their own DSP or trading desk capabilities & offerings.
  • Triggit – Another fairly vanilla DSP, with a fairly standard range of capabilities & partnerships. Fairly strong portfolio of data partners to aid with campaign targeting.
  • LucidMedia – Offers a self-serve DSP (LucidMediaDSP). Claims to be able to reach 95% of the US online audience via partnerships with the usual suspects (Google, Yahoo/RightMedia, Rubicon, PubMatic, etc).
  • Invite Media – No longer independent after its acquisition by Google. Its “Bid Manager” tool provides cross-exchange buying; supports the major exchanges (Google, Yahoo! RM, AdECN) plus other supply aggregators such as PubMatic.

 

Where next for DSPs?

That’s an excellent question, and I’m glad you asked it. The short answer is, ah, not sure. But one thing I can aver: if you come back to this post in a year’s time, you’ll chuckle at the (by then) wildly out-of-date list above. Some of the companies in the list will no longer exist, either because they’ve gone out of business or been swallowed up by a larger fish, and there’ll be some new wunderkind on the block who I’ve completely missed out.

Which is all to say, there’s a lot of change going on in this area. The balance of power in the industry is shifting, and right now it seems like the DSPs are calling the shots, but don’t expect the big ad networks to sit around idle and let DSPs drain away the demand-side of their business. We may well find (as is happening in the web analytics industry) that very soon, there are no independent DSPs left, and every agency is a DSP also.

On the other hand, it is becoming much harder for small networks to survive in this new world; first their advertiser base will be wooed away by the DSPs (and by DSP-enabled network competitors); then their supply may fragment as DSP-based agency buying takes hold. If you thought the online ad industry was starting to settle down and make a little more sense, then, well, too bad. But at least it’s not boring.

Online Advertising Business 101 – Index of all posts

