forked from mongodb/docs
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathserving-ads.txt
428 lines (338 loc) · 14.1 KB
/
serving-ads.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
==========================================
Serving and Tracking Online Advertisements
==========================================
.. default-domain:: mongodb
Overview
--------
This document outlines basic patterns and principles for using
MongoDB as a persistent storage engine for an online advertising
network. In particular, this document focuses on the task of deciding
*which* ad to serve when a user visits a particular site.
Problem
~~~~~~~
You want to create an advertising network to serve ads to online media
sites. As part of this service, you also want to track which ads are
available, and decide on a particular to serve to a particular zone.
Solution
~~~~~~~~
This case study describes a basic advertising service and then refines
this application to support more advanced ad targeting. The key
performance requirements for this solution is the latency between
receiving a request and returning the (targeted) ad for display.
The examples that follow use the Python programming language and the
:api:`PyMongo <python/current>` :term:`driver` for MongoDB, but you
can implement this system using any language you choose.
Serving Basic Ads
-----------------
A basic ad serving algorithm consists of the following steps:
#. The network receives a request for an ad, specifying at a minimum
the ``site_id`` and ``zone_id``.
#. The network consults its inventory of ads available to display and
chooses an ad based on various business rules.
#. The network returns the actual ad for display, possibly recording
the decision.
This design uses the ``site_id`` and ``zone_id`` with the ad request,
as well as information stored in the ad inventory collection, to make
the ad targeting decisions. This design also provides flexibility for
additional functionality later.
Schema
~~~~~~
The schema for storing available ads consists of a single collection,
``ad.zone``:
.. code-block:: javascript
{
_id: ObjectId(...),
site_id: 'cnn',
zone_id: 'banner',
ads: [
{ campaign_id: 'mercedes:c201204_sclass_4',
ad_unit_id: 'banner23a',
ecpm: 250 },
{ campaign_id: 'mercedes:c201204_sclass_4',
ad_unit_id: 'banner23b',
ecpm: 250 },
{ campaign_id: 'bmw:c201204_eclass_1',
ad_unit_id: 'banner12',
ecpm: 200 },
... ]
}
.. kind of confusing to call this collection "ad.zone", esp since there is only 1 and zone is just property
For each (``site``, ``zone``) combination you'll store a list of ads,
sorted by their ``ecpm`` values.
Choosing an Ad to Serve
~~~~~~~~~~~~~~~~~~~~~~~
Querying
````````
The query that the application uses to choose an add to serve selects
a compatible ad and sorts by the advertiser's ``ecpm`` bid in order to
maximize the ad network's profits. This query resembles the following:
.. code-block:: python
from itertools import groupby
from random import choice
def choose_ad(site_id, zone_id):
site = db.ad.zone.find_one({
'site_id': site_id, 'zone_id': zone_id})
if site is None: return None
if len(site['ads']) == 0: return None
ecpm_groups = groupby(site['ads'], key=lambda ad:ad['ecpm'])
ecpm, ad_group = ecpm_groups.next()
return choice(list(ad_group))
.. should be noted that here the list of ads get sorted client side for every call.
also full list gets pulled every time.
In case that the list is long, would be better to normalize, then use index on {site, zone, ecpm}
Indexing
````````
To execute the ad choice with the lowest latency possible, create a
compound index on (``site_id``, ``zone_id``):
.. code-block:: pycon
>>> db.ad.zone.ensure_index([
... ('site_id', 1),
... ('zone_id', 1) ])
Making an Ad Campaign Inactive
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Updating
````````
Ad systems must adjust automatically to the requirements of an ad
campaign: if the campaign has reached its end date or exhausted its
budget the system must automatically cease serving those ads. The
following operation implements this feature:
.. code-block:: python
def deactivate_campaign(campaign_id):
db.ad.zone.update(
{ 'ads.campaign_id': campaign_id },
{' $pull': { 'ads', { 'campaign_id': campaign_id } } },
multi=True)
The update statement selects only those ad zones that have available
ads from the given ``campaign_id`` and then uses the :operator:`$pull`
operator to remove them from rotation.
Indexing
````````
To support the multi-update, you should maintain an index on the
``ads.campaign_id`` field:
.. code-block:: pycon
>>> db.ad.zone.ensure_index('ads.campaign_id')
Sharding
~~~~~~~~
To scale beyond the capacity of a single replica set, you will need to
shard the ``ad.zone`` collection. To maintain the lowest possible
latency in the ad selection operation, choose a the :term:`shard key`
that allows MongoDB to route the ``ad.zone`` query to a single shard
or limited number of shards. Using the (``site_id``, ``zone_id``)
field combination as a shard key fulfills this requirement:
.. code-block:: pycon
>>> db.command('shardcollection', 'ad.zone', {
... 'key': {'site_id': 1, 'zone_id': 1} })
{ "collectionsharded": "ad.zone", "ok": 1 }
Adding Frequency Capping
------------------------
One problem with the logic described in the basic application design
is that that it will tend to display the same ad repeatedly until it
exhausts the campaign's budget. To mitigate this, the system may want
to limit the frequency that it presents a single user with a specific
ad. This "frequency capping" is an example of user profile targeting
in advertising.
To provide frequency capping, or any type of user targeting, the ad
system must maintain a profile for each visitor, typically implemented
as a cookie in the user's browser. The system uses this cookie,
effectively a ``user_id``, when logging impressions, clicks,
conversions, etc., as well as the ad serving decision. This section
focuses on how that profile data impacts the ad serving decision.
Schema
~~~~~~
In order to use the user profile data, you need to store it. In this case, it's
stored in a collection ``ad.user``:
.. code-block:: javascript
{
_id: 'cookie_value',
advertisers: {
mercedes: {
impressions: [
{ date: ISODateTime(...),
campaign: 'c201204_sclass_4',
ad_unit_id: 'banner23a',
site_id: 'cnn',
zone_id: 'banner' } },
... ],
clicks: [
{ date: ISODateTime(...),
campaign: 'c201204_sclass_4',
ad_unit_id: 'banner23a',
site_id: 'cnn',
zone_id: 'banner' } },
... ],
bmw: [ ... ],
...
}
}
This schema:
- Segments profile information by advertiser. Typically advertising
data is sensitive information that the service can't share between
advertisers.
- Embeds all data in a single profile document. When you need to query
this data, the application won't know which advertiser's ads it's
showing, so it makes sense to store all data in a single document.
- Groups event information by type within an advertiser, and sorts by
timestamp. This facilitates rapid lookups of a stream of a
particular type of event.
.. probably should warn that storing impressions and clicks within user document can backfire.
Could grow very large, e.g. if it's a bot, and past 1000 will start slowing down db.
A more robust solution would keep each impression with its possible click in a separate document.
Could be in a capped collection if impressions do not need to be kept for long.
Choosing an Ad to Serve
~~~~~~~~~~~~~~~~~~~~~~~
Querying
````````
Use the following query to choose an ad to serve. This operation must
iterate through ads in order of "desirability" and then select the
"best" ad that also satisfies the advertiser's targeting rules. In
this example the frequency cap is the targeting rule.
.. code-block:: python
from itertools import groupby
from random import shuffle
from datetime import datetime, timedelta
def choose_ad(site_id, zone_id, user_id):
site = db.ad.zone.find_one({
'site_id': site_id, 'zone_id': zone_id})
if site is None or len(site['ads']) == 0: return None
ads = ad_iterator(site['ads'])
user = db.ad.user.find_one({'user_id': user_id})
if user is None:
# any ad is acceptable for an unknown user
return ads.next()
for ad in ads:
advertiser_id = ad['campaign_id'].split(':', 1)[0]
if ad_is_acceptable(ad, user[advertiser_id]):
return ad
return None
def ad_iterator(ads):
'''Find available ads, sorted by ecpm, with random sort for ties'''
ecpm_groups = groupby(ads, key=lambda ad:ad['ecpm'])
for ecpm, ad_group in ecpm_groups:
ad_group = list(ad_group)
shuffle(ad_group)
for ad in ad_group: yield ad
def ad_is_acceptable(ad, profile):
'''Returns False if the user has seen the ad today'''
threshold = datetime.utcnow() - timedelta(days=1)
for event in reversed(profile['impressions']):
if event['timestamp'] < threshold: break
if event['detail']['ad_unit_id'] == ad['ad_unit_id']:
return False
return True
The ``chose_ad()`` function provides the framework for your ad
selection process. The query fetches ``site`` first, and then passes
it to the ``ad_iterator()`` function. This yields ads in order of
desirability. Then, ``ad_is_acceptable()`` checks each add to
determine if it meets the advertiser's rules.
The ``ad_is_acceptable()`` function then iterates over all
``impressions`` in the user's' profile, from most recent to oldest,
within a certain ``threshold`` time period, which is 1 day in the
sample above. If the same ``ad_unit_id`` appears in the impression
stream, the function rejects the ad. Otherwise the service shows the
first acceptable ad to the user.
Indexing
````````
To retrieve the user profile with the lowest latency possible, this
operation requires an index on the ``_id`` field, which MongoDB
supplies by default.
Sharding
~~~~~~~~
When sharding the ``ad.user`` collection, choosing the ``_id`` field
as a :term:`shard key` allows MongoDB to route queries and updates to
the profile:
.. code-block:: pycon
>>> db.command('shardcollection', 'ad.user', {
... 'key': {'_id': 1 } })
{ "collectionsharded": "ad.user", "ok": 1 }
Keyword Targeting
-----------------
Where frequency is an example of user profile targeting, you may want
the system to target ads so that the user receives ads relevant to the
page they're viewing. For example, you might want to target ads based
on a search query. In this case, the system sends a list ``keywords``
to the ``choose_ad()`` function with the ``site_id``, ``zone_id``, and
``user_id``.
Schema
~~~~~~
To choose relevant ads, you will expand the ``ad.zone`` collection to
store keywords for each ad:
.. code-block:: javascript
{
_id: ObjectId(...),
site_id: 'cnn',
zone_id: 'search',
ads: [
{ campaign_id: 'mercedes:c201204_sclass_4',
ad_unit_id: 'search1',
keywords: [ 'car', 'luxury', 'style' ],
ecpm: 250 },
{ campaign_id: 'mercedes:c201204_sclass_4',
ad_unit_id: 'search2',
keywords: [ 'car', 'luxury', 'style' ],
ecpm: 250 },
{ campaign_id: 'bmw:c201204_eclass_1',
ad_unit_id: 'search1',
keywords: [ 'car', 'performance' ],
ecpm: 200 },
... ]
}
Choosing a Group of Ads to Serve
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The system will then choose a number of ads that match the keywords
used in the search. The following ``choose_ads()`` and
``ad_iterator()`` implementations will return and iterate over ads in
descending order of preference:
.. code-block:: python
def choose_ads(site_id, zone_id, user_id, keywords):
site = db.ad.zone.find_one({
'site_id': site_id, 'zone_id': zone_id})
if site is None: return []
ads = ad_iterator(site['ads'], keywords)
user = db.ad.user.find_one({'user_id': user_id})
if user is None: return ads
advertiser_ids = (
ad['campaign_id'].split(':', 1)[0]
for ad in ads )
return (
ad for ad, advertiser_id in izip(
ads, advertiser_ids)
if ad_is_acceptable(ad, user[advertiser_id]) )
def ad_iterator(ads, keywords):
'''Find available ads, sorted by score, with random sort for
ties'''
keywords = set(keywords)
scored_ads = [
(ad_score(ad, keywords), ad)
for ad in ads ]
score_groups = groupby(
sorted(scored_ads), key=lambda score, ad: score)
for score, ad_group in score_groups:
ad_group = list(ad_group)
shuffle(ad_group)
for ad in ad_group: yield ad
def ad_score(ad, keywords):
'''Compute a desirability score based on the ad ecpm and
keywords'''
matching = set(ad['keywords']).intersection(keywords)
return ad['ecpm'] * math.log(
1.1 + len(matching))
# ad_is_acceptable
def ad_is_acceptable(ad, profile):
'''Returns False if the user has seen the ad today'''
threshold = datetime.utcnow() - timedelta(days=1)
for event in reversed(profile['impressions']):
if event['timestamp'] < threshold: break
if event['detail']['ad_unit_id'] == ad['ad_unit_id']:
return False
return True
With these implementations, ads you must sort ads according to some
score. Here, ``ad_score()`` computes this value based on a combination
of the ``ecpm`` value for the ad and the number of keywords
matched. In more sophisticated implementations you may choose to
customize the value of some keywords, but this is beyond the scope of
this document. Because the ad service must sort all ads at display
time, you may find performance issues if you if there are a large
number of ads competing for the same display slot.
.. here the processing will be done client side for every request.
This can become very expensive if a page gets hit a lot and has many possible ads.
Again here if normalize each ad, can use index on { site, zone, keywords }