draft/use-cases/serving-ads.txt

==========================================
Serving and Tracking Online Advertisements
==========================================

.. default-domain:: mongodb

Overview
--------

This document outlines basic patterns and principles for using
MongoDB as a persistent storage engine for an online advertising
network. In particular, this document focuses on the task of deciding
*which* ad to serve when a user visits a particular site.

Problem
~~~~~~~

You want to create an advertising network to serve ads to online media
sites. As part of this service, you also want to track which ads are
available, and decide on a particular to serve to a particular zone.

Solution
~~~~~~~~

This case study describes a basic advertising service and then refines
this application to support more advanced ad targeting. The key
performance requirements for this solution is the latency between
receiving a request and returning the (targeted) ad for display.

The examples that follow use the Python programming language and the
:api:`PyMongo <python/current>` :term:`driver` for MongoDB, but you
can implement this system using any language you choose.

Serving Basic Ads
-----------------

A basic ad serving algorithm consists of the following steps:

#. The network receives a request for an ad, specifying at a minimum
   the ``site_id`` and ``zone_id``.

#. The network consults its inventory of ads available to display and
   chooses an ad based on various business rules.

#. The network returns the actual ad for display, possibly recording
   the decision.

This design uses the ``site_id`` and ``zone_id`` with the ad request,
as well as information stored in the ad inventory collection, to make
the ad targeting decisions. This design also provides flexibility for
additional functionality later.

Schema
~~~~~~

The schema for storing available ads consists of a single collection,
``ad.zone``:

.. code-block:: javascript

   {
     _id: ObjectId(...),
     site_id: 'cnn',
     zone_id: 'banner',
     ads: [
       { campaign_id: 'mercedes:c201204_sclass_4',
         ad_unit_id: 'banner23a',
         ecpm: 250 },
       { campaign_id: 'mercedes:c201204_sclass_4',
         ad_unit_id: 'banner23b',
         ecpm: 250 },
       { campaign_id: 'bmw:c201204_eclass_1',
         ad_unit_id: 'banner12',
         ecpm: 200 },
       ... ]
   }

.. kind of confusing to call this collection "ad.zone", esp since there is only 1 and zone is just property

For each (``site``, ``zone``) combination you'll store a list of ads,
sorted by their ``ecpm`` values.

Choosing an Ad to Serve
~~~~~~~~~~~~~~~~~~~~~~~

Querying
````````

The query that the application uses to choose an add to serve selects
a compatible ad and sorts by the advertiser's ``ecpm`` bid in order to
maximize the ad network's profits. This query resembles the following:

.. code-block:: python

   from itertools import groupby
   from random import choice

   def choose_ad(site_id, zone_id):
       site = db.ad.zone.find_one({
           'site_id': site_id, 'zone_id': zone_id})
       if site is None: return None
       if len(site['ads']) == 0: return None
       ecpm_groups = groupby(site['ads'], key=lambda ad:ad['ecpm'])
       ecpm, ad_group = ecpm_groups.next()
       return choice(list(ad_group))

.. should be noted that here the list of ads get sorted client side for every call.
   also full list gets pulled every time.
   In case that the list is long, would be better to normalize, then use index on {site, zone, ecpm}

Indexing
````````

To execute the ad choice with the lowest latency possible, create a
compound index on (``site_id``, ``zone_id``):

.. code-block:: pycon

   >>> db.ad.zone.ensure_index([
   ...     ('site_id', 1),
   ...     ('zone_id', 1) ])

Making an Ad Campaign Inactive
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Updating
````````

Ad systems must adjust automatically to the requirements of an ad
campaign: if the campaign has reached its end date or exhausted its
budget the system must automatically cease serving those ads. The
following operation implements this feature:

.. code-block:: python

   def deactivate_campaign(campaign_id):
       db.ad.zone.update(
         { 'ads.campaign_id': campaign_id },
         {' $pull': { 'ads', { 'campaign_id': campaign_id } } },
         multi=True)

The update statement selects only those ad zones that have available
ads from the given ``campaign_id`` and then uses the :operator:`$pull`
operator to remove them from rotation.

Indexing
````````

To support the multi-update, you should maintain an index on the
``ads.campaign_id`` field:

.. code-block:: pycon

   >>> db.ad.zone.ensure_index('ads.campaign_id')

Sharding
~~~~~~~~

To scale beyond the capacity of a single replica set, you will need to
shard the ``ad.zone`` collection. To maintain the lowest possible
latency in the ad selection operation, choose a the :term:`shard key`
that allows MongoDB to route the ``ad.zone`` query to a single shard
or limited number of shards. Using the (``site_id``, ``zone_id``)
field combination as a shard key fulfills this requirement:

.. code-block:: pycon

   >>> db.command('shardcollection', 'ad.zone', {
   ...     'key': {'site_id': 1, 'zone_id': 1} })
   { "collectionsharded": "ad.zone", "ok": 1 }

Adding Frequency Capping
------------------------

One problem with the logic described in the basic application design
is that that it will tend to display the same ad repeatedly until it
exhausts the campaign's budget. To mitigate this, the system may want
to limit the frequency that it presents a single user with a specific
ad. This "frequency capping" is an example of user profile targeting
in advertising.

To provide frequency capping, or any type of user targeting, the ad
system must maintain a profile for each visitor, typically implemented
as a cookie in the user's browser. The system uses this cookie,
effectively a ``user_id``, when logging impressions, clicks,
conversions, etc., as well as the ad serving decision. This section
focuses on how that profile data impacts the ad serving decision.

Schema
~~~~~~

In order to use the user profile data, you need to store it. In this case, it's
stored in a collection ``ad.user``:

.. code-block:: javascript

   {
     _id: 'cookie_value',
     advertisers: {
       mercedes: {
         impressions: [
           { date: ISODateTime(...),
             campaign: 'c201204_sclass_4',
             ad_unit_id: 'banner23a',
             site_id: 'cnn',
             zone_id: 'banner' } },
           ... ],
         clicks: [
           { date: ISODateTime(...),
             campaign: 'c201204_sclass_4',
             ad_unit_id: 'banner23a',
             site_id: 'cnn',
             zone_id: 'banner' } },
         ... ],
       bmw: [ ... ],
       ...
     }
   }

This schema:

- Segments profile information by advertiser. Typically advertising
  data is sensitive information that the service can't share between
  advertisers.

- Embeds all data in a single profile document. When you need to query
  this data, the application won't know which advertiser's ads it's
  showing, so it makes sense to store all data in a single document.

- Groups event information by type within an advertiser, and sorts by
  timestamp. This facilitates rapid lookups of a stream of a
  particular type of event.

.. probably should warn that storing impressions and clicks within user document can backfire.
   Could grow very large, e.g. if it's a bot, and past 1000 will start slowing down db.
   A more robust solution would keep each impression with its possible click in a separate document.
   Could be in a capped collection if impressions do not need to be kept for long.

Choosing an Ad to Serve
~~~~~~~~~~~~~~~~~~~~~~~

Querying
````````

Use the following query to choose an ad to serve. This operation must
iterate through ads in order of "desirability" and then select the
"best" ad that also satisfies the advertiser's targeting rules. In
this example the frequency cap is the targeting rule.

.. code-block:: python

   from itertools import groupby
   from random import shuffle
   from datetime import datetime, timedelta

   def choose_ad(site_id, zone_id, user_id):
      site = db.ad.zone.find_one({
           'site_id': site_id, 'zone_id': zone_id})
      if site is None or len(site['ads']) == 0: return None
      ads = ad_iterator(site['ads'])
      user = db.ad.user.find_one({'user_id': user_id})
      if user is None:
          # any ad is acceptable for an unknown user
          return ads.next()
      for ad in ads:
          advertiser_id = ad['campaign_id'].split(':', 1)[0]
          if ad_is_acceptable(ad, user[advertiser_id]):
              return ad
      return None

   def ad_iterator(ads):
       '''Find available ads, sorted by ecpm, with random sort for ties'''
       ecpm_groups = groupby(ads, key=lambda ad:ad['ecpm'])
       for ecpm, ad_group in ecpm_groups:
           ad_group = list(ad_group)
           shuffle(ad_group)
           for ad in ad_group: yield ad

   def ad_is_acceptable(ad, profile):
       '''Returns False if the user has seen the ad today'''
       threshold = datetime.utcnow() - timedelta(days=1)
       for event in reversed(profile['impressions']):
           if event['timestamp'] < threshold: break
           if event['detail']['ad_unit_id'] == ad['ad_unit_id']:
               return False
       return True

The ``chose_ad()`` function provides the framework for your ad
selection process. The query fetches ``site`` first, and then passes
it to the ``ad_iterator()`` function. This yields ads in order of
desirability. Then, ``ad_is_acceptable()`` checks each add  to
determine if it meets the advertiser's rules.

The ``ad_is_acceptable()`` function then iterates over all
``impressions`` in the user's' profile, from most recent to oldest,
within a certain ``threshold`` time period, which is 1 day in the
sample above. If the same ``ad_unit_id`` appears in the impression
stream, the function rejects the ad. Otherwise the service shows the
first acceptable ad to the user.

Indexing
````````

To retrieve the user profile with the lowest latency possible, this
operation requires an index on the ``_id`` field, which MongoDB
supplies by default.

Sharding
~~~~~~~~

When sharding the ``ad.user`` collection, choosing the ``_id`` field
as a :term:`shard key` allows MongoDB to route queries and updates to
the profile:

.. code-block:: pycon

   >>> db.command('shardcollection', 'ad.user', {
   ...     'key': {'_id': 1 } })
   { "collectionsharded": "ad.user", "ok": 1 }

Keyword Targeting
-----------------

Where frequency is an example of user profile targeting, you may want
the system to target ads so that the user receives ads relevant to the
page they're viewing. For example, you might want to target ads based
on a search query. In this case, the system sends a list ``keywords``
to the ``choose_ad()`` function with the ``site_id``, ``zone_id``, and
``user_id``.

Schema
~~~~~~

To choose relevant ads, you will expand the ``ad.zone`` collection to
store keywords for each ad:

.. code-block:: javascript

   {
     _id: ObjectId(...),
     site_id: 'cnn',
     zone_id: 'search',
     ads: [
       { campaign_id: 'mercedes:c201204_sclass_4',
         ad_unit_id: 'search1',
         keywords: [ 'car', 'luxury', 'style' ],
         ecpm: 250 },
       { campaign_id: 'mercedes:c201204_sclass_4',
         ad_unit_id: 'search2',
         keywords: [ 'car', 'luxury', 'style' ],
         ecpm: 250 },
       { campaign_id: 'bmw:c201204_eclass_1',
         ad_unit_id: 'search1',
         keywords: [ 'car', 'performance' ],
         ecpm: 200 },
       ... ]
   }

Choosing a Group of Ads to Serve
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The system will then choose a number of ads that match the keywords
used in the search. The following ``choose_ads()`` and
``ad_iterator()`` implementations will return and iterate over ads in
descending order of preference:

.. code-block:: python

   def choose_ads(site_id, zone_id, user_id, keywords):
      site = db.ad.zone.find_one({
           'site_id': site_id, 'zone_id': zone_id})
      if site is None: return []
      ads = ad_iterator(site['ads'], keywords)
      user = db.ad.user.find_one({'user_id': user_id})
      if user is None: return ads
      advertiser_ids = (
          ad['campaign_id'].split(':', 1)[0]
          for ad in ads )
      return (
          ad for ad, advertiser_id in izip(
              ads, advertiser_ids)
          if ad_is_acceptable(ad, user[advertiser_id]) )

  def ad_iterator(ads, keywords):
       '''Find available ads, sorted by score, with random sort for
       ties'''

       keywords = set(keywords)
       scored_ads = [
           (ad_score(ad, keywords), ad)
           for ad in ads ]
       score_groups = groupby(
           sorted(scored_ads), key=lambda score, ad: score)
       for score, ad_group in score_groups:
           ad_group = list(ad_group)
           shuffle(ad_group)
           for ad in ad_group: yield ad

  def ad_score(ad, keywords):
      '''Compute a desirability score based on the ad ecpm and
      keywords'''

      matching = set(ad['keywords']).intersection(keywords)
      return ad['ecpm'] * math.log(
          1.1 + len(matching))

  # ad_is_acceptable
  def ad_is_acceptable(ad, profile):
       '''Returns False if the user has seen the ad today'''
       threshold = datetime.utcnow() - timedelta(days=1)
       for event in reversed(profile['impressions']):
           if event['timestamp'] < threshold: break
           if event['detail']['ad_unit_id'] == ad['ad_unit_id']:
               return False
       return True

With these implementations, ads you must sort ads according to some
score. Here, ``ad_score()`` computes this value based on a combination
of the ``ecpm`` value for the ad and the number of keywords
matched. In more sophisticated implementations you may choose to
customize the value of some keywords, but this is beyond the scope of
this document. Because the ad service must sort all ads at display
time, you may find performance issues if you if there are a large
number of ads competing for the same display slot.

.. here the processing will be done client side for every request.
   This can become very expensive if a page gets hit a lot and has many possible ads.
   Again here if normalize each ad, can use index on { site, zone, keywords }