Marc Burgess: What the profiler does is it first cleans the data. It's looking at two sets of information: the information in the request that's sent to the website and then information in the page that comes back.
From the request it pulls out the URL, and if that URL is a well known search engine such as Google or Yahoo! it'll also look for the search terms that are in the request.
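As a rough illustration of the request-side step just described, the sketch below pulls search terms out of a URL when the host is a known search engine. The function name `extract_search_terms` and the `SEARCH_PARAMS` table are hypothetical, and this is not Phorm's actual code; the only assumption about the engines is that Google carries the query in the `q` parameter and Yahoo! in `p`.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical lookup table: search-engine domain -> query parameter with the terms.
SEARCH_PARAMS = {"google.com": "q", "yahoo.com": "p"}


def extract_search_terms(url):
    """Return the search terms in `url`, or [] if it is not a known search engine."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    for domain, param in SEARCH_PARAMS.items():
        # Match the bare domain or any subdomain of it (e.g. www.google.com).
        if host == domain or host.endswith("." + domain):
            values = parse_qs(parsed.query).get(param, [])
            return values[0].split() if values else []
    return []
```

Calling `extract_search_terms` on a URL from any other site returns an empty list, so only the URL itself would be available for that request.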
And then from the information returned by the website, the profiler looks at the content. The first thing it does is ignore several classes of information that could potentially be sensitive. So there are no form fields, no numbers, no email addresses (that is, anything containing an "@") and nothing containing a title like Mr or Mrs.
Aren't you collecting the first three characters?
MB: Because of a peculiarity of the tokenisation, numbers three digits or shorter aren't collected anyway; they're too short, so no numbers are collected at all. A mixture of letters and numbers - a compound - could potentially be collected.
Say, for example, the start of a postcode?
MB: Yes...
KE: But as you'll see it's irrelevant anyway.
MB: So we do this basic cleaning process, and then we look at the key words that have come from the page and eliminate "noise words" that have low intrinsic meaning. What we're left with is a clean version of the key words in the page, from which we basically chart the ten most commonly occurring words.
This process has the effect of largely eliminating personally identifiable information [PII] from the web page, because any PII that got through would have to match none of our exclusion criteria and also appear repeatedly in the page.
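The cleaning and counting steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the profiler's real implementation: the names `data_digest`, `NOISE_WORDS` and `TITLES` are hypothetical, the noise-word list is a tiny stand-in, the rule that tokens of three characters or fewer are dropped is inferred from the remark about short numbers, and form fields are not modelled because the input here is plain text.

```python
from collections import Counter

# Hypothetical stand-ins; the real noise-word and title lists are not given in the interview.
NOISE_WORDS = {"the", "and", "for", "with", "that", "this", "from", "are", "was"}
TITLES = {"mr", "mrs", "ms", "dr"}


def data_digest(text, top_n=10):
    """Return up to `top_n` of the most frequent cleaned key words in `text`."""
    kept = []
    for token in text.split():
        if "@" in token:                      # email addresses
            continue
        word = token.strip(".,;:!?()[]\"'").lower()
        if not word or word.isdigit():        # pure numbers
            continue
        if word in TITLES or word in NOISE_WORDS:
            continue
        if len(word) <= 3:                    # assumed minimum token length
            continue
        kept.append(word)
    return [word for word, _ in Counter(kept).most_common(top_n)]
```

In this sketch a surname that appears once can still survive the filters, but it only reaches the digest if it is among the ten most frequent words on the page, which is the point made about repetition.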
The profiler takes this "data digest" and it passes it through the box we call the anonymiser and into the box we call the channel server. The channel server has got a database of advertising categories that we call channels - things like sport, health and beauty, travel, luxury cars, etc. The channels are global to the whole system [across ISP networks]. Via the Open Internet Exchange advertisers are able to specify the channels they want to target.
The channels are controlled in the content they can have. We have no adult advertising, no medical channel, no tobacco, no gambling. The channels are also designed so they always match a minimum number of unique users - 5,000. A channel has to be sufficiently broad so that it doesn't just reduce to one or two users.
As soon as that match has been made the data digest, which has only ever been in memory, is immediately deleted. It never goes to disk.
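To make the channel-matching step concrete, here is a small sketch of matching a digest against channels and then discarding it. Everything named here is hypothetical (`CHANNELS`, `eligible_channels`, `match_channels`, the keyword sets and the audience figures); the interview gives only the channel examples, the 5,000-user minimum and the fact that the digest is deleted from memory after the match. Matching by keyword overlap is an assumption, and `list.clear()` merely empties the list rather than guaranteeing that memory is wiped.

```python
# Hypothetical channel definitions: channel name -> keywords that trigger it.
CHANNELS = {
    "travel": {"flights", "hotels", "holiday", "tours"},
    "sport": {"football", "cricket", "tennis"},
    "luxury cars": {"bentley", "ferrari", "coupe"},
}
MIN_UNIQUE_USERS = 5000  # minimum audience stated in the interview


def eligible_channels(audience_sizes, minimum=MIN_UNIQUE_USERS):
    """Keep only channels broad enough to match at least `minimum` unique users."""
    return {name for name, size in audience_sizes.items() if size >= minimum}


def match_channels(digest, allowed):
    """Return the allowed channels whose keywords overlap the digest, then empty it."""
    keywords = set(digest)
    matched = sorted(name for name in allowed if CHANNELS.get(name, set()) & keywords)
    digest.clear()  # the digest is held in memory only and is never written to disk
    return matched
```

For example, with audience sizes of 120,000 for travel, 80,000 for sport and 1,200 for luxury cars, only the first two channels are eligible, so a digest containing "flights" and "coupe" would match travel alone.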