Friday, January 18, 2013

Digital attribution

I have changed jobs and work at Omnicom Media Group now as a group consultant in the data science team. My project is to work on digital media attribution and the path to conversion. Basically we are trying to establish which events on the path make a user/cookie more likely to convert.

This is an interesting area where lots of new companies are trying to compete. Some big players include Google and Adobe.

Our Facebook presence: here

Friday, October 05, 2012

Tags on top 100 UK sites

I took the top 100 UK web sites from Alexa and searched the homepage for web analytics tags. For nine websites I could not find any JS files or named tags, or the site returned an error.

I found that 23% use Doubleclick, 44% use Google Analytics, 9% use Nielsen, 5% use Omniture and 9% use Comscore. Tagman is used by only one site. Five sites have 3 tags altogether.
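A minimal sketch of how such a tag search could work, assuming the homepage HTML has already been downloaded into a string. The signature fragments in TAG_SIGNATURES and the function name find_tags are my own illustration, not the list used for the figures above, and they are far from exhaustive.

```python
# Assumed host/file fragments per vendor; illustrative only.
TAG_SIGNATURES = {
    "Google Analytics": ["google-analytics.com"],
    "Doubleclick": ["doubleclick.net"],
    "Nielsen": ["imrworldwide.com"],
    "Omniture": ["omniture", "s_code.js"],
    "Comscore": ["scorecardresearch.com"],
}


def find_tags(html):
    """Return the sorted vendor names whose signature appears in the page source."""
    text = html.lower()
    return sorted(
        vendor
        for vendor, fragments in TAG_SIGNATURES.items()
        if any(fragment in text for fragment in fragments)
    )


# Hypothetical page source standing in for a downloaded homepage.
sample_page = """
<html><head>
<script src="http://www.google-analytics.com/ga.js"></script>
<script src="http://ad.doubleclick.net/adj/site;sz=1x1"></script>
</head><body>Hello</body></html>
"""
print(find_tags(sample_page))
```

A plain substring match like this misses tags that are loaded by a tag manager at run time, so the counts it produces are a lower bound.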

Thursday, October 04, 2012

scrape random movies from IMDB

I have created a scraper for IMDB. It creates a graph as it downloads all referred actors/directors, keywords, languages etc - basically features which could be put into a recommender or similar system.


Output looks like:
genre/Family name/nm0784124/ name/nm1293791/ name/nm0265620/ name/nm0754781/ keyword/bear country/jp language/ja
genre/Romance genre/Mystery name/nm0130215/ name/nm0280541/ name/nm0302384/ name/nm0309129/ name/nm0130191/ name/nm0560478/ name/nm0001607/ name/nm0908001/ name/nm0912604/ name/nm0133597/ name/nm0489010/ name/nm0005166/ name/nm0593411/ name/nm0929869/ keyword/soap keyword/tragedy keyword/betrayal keyword/shipper country/us language/en
genre/Drama name/nm0430267/ name/nm3136900/ name/nm0018495/ name/nm0068168/ name/nm0231191/ name/nm0263099/ name/nm0341647/ name/nm0367731/ name/nm0792129/ name/nm0909848/ country/gb language/en company/co0248652/

Tuesday, September 25, 2012

Python: combine lists or dictionaries

I have written some tests to see which approach is quickest to combine/reduce data where keys overlap. I tested whether it is better to store the data in lists or dictionaries. It turns out stacking lists and then using defaultdict (method 5) is the fastest. Using itertools.groupby() was the slowest.

The code is here

387299 1.15799999237 2.4681637353
387299 0.80999994278 2.4681637353
387299 0.805000066757 2.4681637353
387299 0.775000095367 2.4681637353
387299 0.677999973297 2.4681637353
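For illustration, here is a small sketch of the two ends of that comparison: stacking the lists and summing with a defaultdict, and sorting then using itertools.groupby(). The sample data and function names are made up for this example and are not the code that was timed above.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Two hypothetical (key, value) lists with overlapping keys.
list_a = [("a", 1), ("b", 2), ("c", 3)]
list_b = [("b", 10), ("c", 20), ("d", 30)]


def combine_defaultdict(first, second):
    """Stack the lists, then sum the values per key in a single pass."""
    totals = defaultdict(int)
    for key, value in first + second:
        totals[key] += value
    return dict(totals)


def combine_groupby(first, second):
    """Same result via groupby, which needs the stacked list sorted by key first."""
    stacked = sorted(first + second, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(stacked, key=itemgetter(0))
    }


print(combine_defaultdict(list_a, list_b))
print(combine_groupby(list_a, list_b))
```

The groupby version has to sort the stacked list before grouping, which is one plausible reason it came out slowest.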

Thursday, September 13, 2012

Is MapReduce on AWS fast?

I have been playing around with Elastic MapReduce on AWS. I ran the wordsplitter example from the AWS tutorial. The job took 3 minutes to complete the word count on 12 files.

I then wrote the whole thing in native Python using a dictionary (without MapReduce); this took 4 seconds to run on the EC2 server. So I am actually not that impressed with MapReduce. The difference might be due to file access or job creation, but it is still hard to see what the fuss is about.
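A word count with a plain dictionary can be sketched as below. This is not the script I ran; the in-memory strings stand in for the 12 input files, and the tokenising rule (lower-case runs of letters and apostrophes) is an assumption.

```python
import re


def count_words(texts):
    """Count word occurrences across an iterable of strings using a plain dict."""
    counts = {}
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical documents standing in for the input files.
documents = ["the quick brown fox", "the lazy dog", "The fox"]
counts = count_words(documents)
print(sorted(counts.items(), key=lambda item: -item[1])[:3])
```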

Tuesday, September 11, 2012

Predictive problems in customer FMCG analysis

Here are six areas in customer FMCG analysis which can be solved by predictive modelling:
o How can we predict the S shape of new product sales at the earliest possible stage
Customer targeting
o Which offers are most relevant for which customers (at what time) - to increase participation
Propensity to buy (cross sell)
o Who are the best targets for a product/category – to increase sales
Similarity (customers or products)
o Both can be used to recommend additional products to customers
Promotions (optimisation)
o Which mechanic and discount on which product delivers most ROI
Life stage and events
o Family types
o New family member, pregnancy
o New job/wealth
Some of these are getting addressed by retailers today; others will grow in the future. If you want to share ideas on any of those, of course let me know.

Post crisis agency relationship

The relationship between clients and analytical agencies has shifted. This is due to three reasons:
Abundance of data storage and computing power
Need to tighten budgets (especially post 2008)
Time to extract knowledge from agencies
Clients have learned that analysis is not rocket science anymore. They have brought a lot of capacity in-house. New technical solutions have made it possible to get around drawn out, bureaucratic IT projects. The focus has shifted from service to product.
Of the three reasons, the second one is most often neglected, but it means agencies need to prove their viability again and again.
The question is how agencies will respond. One answer can be syndication. By aggregating cross-client data they can create synergetic solutions which all parties can gain from.
Another answer is protection of intellectual property (IP). Previously it was assumed that algorithms cannot be copied because it takes too much time to reverse engineer. With more and more open source solutions, approaches can be borrowed quite easily. The output might not be the same but similar.
A third answer can be analytical talent. If agencies can secure the best talent (ie become more expensive – which goes against the tight budget issue), they could persuade clients to stick with them.

Wednesday, August 22, 2012

Telecoms will push analytics

Telecom firms are starting to ponder what to do with the massive data they collect on an everyday basis. How can they gather anonymised insights from their data which they can package up and sell?

They know the following things about (especially paid smartphone) users:
Who they are and where they live
Where they went at what time
How much they use their phone and for what (that includes websites browsed)
Who they are connected to

The challenge for many telecoms is to bring this information into a Single Customer View (SCV) as it exists in silos (and different frequencies).

Using these four tiers of information, the telco can produce unique insights of the sort:
Who shops where and what time?
Where do those shoppers go before and after the shop?
What is the cross shopping (switching) of users?
What websites correlate with what shopping behaviour?
Who are the influencers in a network?
o So if I wanted to target only influencers, who would it be?
How do online (mobile) and offline shopping behaviour interact?

Change index for time series in Excel

I have created an Excel tool where you can compare several time series and change the index/base date by selecting it from the drop-down or moving the slider. It also tells you which date delivers the highest growth.
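The rebasing logic behind the tool can be sketched in a few lines of Python. The series, the function names and the reading of "highest growth" as growth from the base date to the latest observation are assumptions for this example.

```python
def rebase(series, base_date):
    """Index a series so that its value at base_date equals 100."""
    base = series[base_date]
    return {date: 100.0 * value / base for date, value in series.items()}


def best_base_date(series):
    """Base date that gives the highest growth up to the last observation."""
    dates = list(series)
    last_value = series[dates[-1]]
    return max(dates, key=lambda date: last_value / series[date])


# Hypothetical monthly series, in date order.
sales = {"2012-01": 80.0, "2012-02": 50.0, "2012-03": 100.0, "2012-04": 120.0}
indexed = rebase(sales, "2012-02")
print(indexed)
print(best_base_date(sales))
```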


Wednesday, August 15, 2012

simple cross purchase code

If you have an item-level data set and want to explore cross purchase in SAS, you usually need to sum and transpose your data first in several steps. Here is a simple SQL example within SAS that uses the max and case functions to write this succinctly. It won't necessarily run faster, though.

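/* y and z stand for the two product ids of interest */
proc sql;
create table x as
select custid,
       max(case when prodid=y then 1 else 0 end) as prod1, /* 1 if the customer ever bought y */
       max(case when prodid=z then 1 else 0 end) as prod2  /* 1 if the customer ever bought z */
from items
group by custid;
quit;

/* 2x2 table of customers by whether they bought each product */
proc freq data=x;
tables prod1*prod2;
run;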
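For comparison, the same idea (one 0/1 flag per customer and product, then a 2x2 count) can be sketched in Python. The item list and the function name cross_purchase are made up for this illustration.

```python
from collections import Counter, defaultdict


def cross_purchase(items, prod_a, prod_b):
    """Count customers by (bought prod_a, bought prod_b) flags."""
    baskets = defaultdict(set)
    for custid, prodid in items:
        baskets[custid].add(prodid)
    return Counter(
        (int(prod_a in bought), int(prod_b in bought))
        for bought in baskets.values()
    )


# Hypothetical item-level data: (custid, prodid) rows.
items = [(1, "y"), (1, "z"), (2, "y"), (3, "z"), (3, "z"), (4, "w")]
table = cross_purchase(items, "y", "z")
print(table)
```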

Thursday, July 12, 2012

Lloyds TSB's targeted outdoor ads

Lloyds TSB is running an outdoor campaign saying things like 'Platform 1, a good place to check your banking' (at a train station). However, their media buyers put the ads in the wrong places. I was on platform 18 when I read the one about platform 1. I was on Upper Richmond Road when the ad said Tooting High Street. Silly, misplaced targeting.

Friday, February 17, 2012

Some notable string quartets (links)

Big Data?

There is a lot of talk about big data and data science. With the increasing abundance of structured and unstructured data, many praise the possibilities of analysing terabytes of data. But as nice as some of the visualisations might look, the question is how businesses can use the new tools and data to make money. Is big data always better than old/small data?

I work with many businesses that have a lot of data but don’t have the time to look at all the information they have and could have. Before you extend your database, have a think about what more you can do with existing data. Integration is the key: there is no benefit in having half a dozen unlinked databases coming up with different answers.

Data collection needs to be improved. Before you commission a new database, think how you can track more behaviour about customers in your existing database.

Many new start ups focus on collecting huge databases hoping that clients will be amazed by the richness of the data. But this is only for clients who know how to use data. Not many businesses know how.
Facebook has probably the biggest ‘customer’ database in the world but almost all their money is made from display banners, which have been around since 1993.

Census is better than sample but true data makes a change; sometimes the tracking of a simple statistic (everyone understands) can make all the difference.

Monday, September 26, 2011

Wine region analysis

I have done some analysis on wine regions I am interested in, using data. I used all vintages of the top 500 wines (sometimes fewer) on their site. Apart from price, I also recorded Decanter awards. My analysis shows that Argentina has the best returns per rating point (ratings are an average of wine critics and the public; I have used points above 70). Piemont and Mosel have the worst stats: expensive or few awards. South Australia dominated the awards.

Region        Wines  Avg price  Median price  Awards  Commended  Net   Avg point per pound  Awards per wine
Alsace        2964   30         18            60      3          57    0.935                0.019
Argentina     2458   15         10            273     78         195   1.597                0.079
Greece        733    14         11            104     17         87    1.460                0.119
Mosel         3099   58         18            36      8          28    1.066                0.009
Piemont       5684   66         39            123     28         95    0.595                0.017
S Australia   4475   43         23            724     172        552   0.948                0.123
Stellenbosch  2425   17         11            353     100        253   1.460                0.104
Tuscany       5547   52         30            324     92         232   0.772                0.042
Total                                         1997    498        1499
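The last column can be reproduced from the others. As far as the numbers show, Net is Awards minus Commended and Awards per wine is Net divided by Wines; the short check below uses three rows from the table, and the variable names are only for illustration.

```python
# (wines, awards, commended) copied from the table above.
regions = {
    "Alsace": (2964, 60, 3),
    "Argentina": (2458, 273, 78),
    "S Australia": (4475, 724, 172),
}

awards_per_wine = {}
for region, (wines, awards, commended) in regions.items():
    net = awards - commended  # awards excluding commendations
    awards_per_wine[region] = round(net / wines, 3)

print(awards_per_wine)
```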

Thursday, August 25, 2011

Classical music documentaries

Here are some documentaries or some famous performances

Oistrakh plays Shostakovich

Oistrakh - Artist of People



Glenn Gould A Portrait

Glenn Gould - Life and Times



Mahler 3

Bernstein plays Mahler 2

Karajan plays Bruckner 8

Tchaikovsky - Who killed

Tchaikovsky - Discovering







Jansons plays Mahler 2

Rite of Spring

Shostakovich 5

Rostropovich plays Dvorak


Gould plays Bach

Shostakovich - Close Up

Shostakovich (private)

Shostakovich - Against Stalin

Friday, June 17, 2011

Bar/line chart improvement

Bar/line charts are quite useful when we want to show the development of 2 variables on different scales or units of measurement. By default Excel gives you the left chart, which is kind of ok, but I have 'developed' a better version on the right, which limits the axes at the min and max of each series and draws a line at the midpoint of the range (median or average won't work here because they might not be on one line).

Also note how the colour font on the axis labels gets rid of any legend box.

Sunday, April 17, 2011

How to find R help online

In R you can find the help page of a function by typing help(func). If you want to look something up quickly online, place the library and function in the following URL and off you go: [lib]/html/[func].html

Friday, April 15, 2011

Does Amazon filter Kindle items well?

As you might suspect, my answer is No. When searching for new Kindle books, I hardly find good results or recommendations. The problem is that there are a lot of virtually zero priced items which are top of the list but they are hardly worth the megabytes they carry. I am tired of looking at lists of cheap self help books.

Amazon seems to use the same recommendation idea for Kindle books but actually needs to adjust it to make it relevant. How does it help me if every second recommendation is Dracula just because it's free?

Amazon needs to put quality at the top of the list.

Better Choices Better Deals

BIS has published a paper which outlines how customers can benefit from using their data to optimise their shopping. They quote the many loyalty cards and tools which are already out there. They are creating the mydata initiative where customers can access their own data and find the best deals based on their usage.

I am quite sceptical about this scheme.

  1. Data is collected for a reason by specialised companies which exploit the data (not the customer); it is their asset.
  2. Data formats from different providers/retailers are vastly different and will never be brought under one roof; if they are, the data will lose its richness. (A good example is the number of households quoted by Boots, Tesco and Nectar - they all use different definitions of what an active household is.)
  3. There is a cost involved and it is not clear who will carry that.

Nevertheless I like the idea that customers have more rights to accessing their own data.

Wednesday, April 06, 2011

R and Python

Here is an R and Python syntax table. I have also included Numpy commands to make it more comparable. Where a cell is empty I could not find an equivalent.

Task | Python | Python Numpy | R
sequence | x=[i for i in range(1,11)] | | x <- 1:10
scalar | x=1 | x=array(1) | x <- 1
vector/list | x=[1,2] | x=array((1,2)) | x <- c(1,2)
constant vector | x=100*[1] | x=ones(100) | x <- rep(1,100)
append | x.append(1) | | x <- c(x,1)
matrix | x=[[1,2],[3,4]] | mat([[1,2],[3,4]]) | x <- matrix(c(1,2,3,4),ncol=2,byrow=TRUE)
column stack | | hstack((x,y)) | cbind(x,y)
row stack | | vstack((x,y)) | rbind(x,y)
for | for i in range(1,11): | | for (i in c(1:10)) {
 |  print(i) | |  print(i)
 | | | }
while | i=1 | | i <- 1
 | while (i<10): | | while (i<10) {
 |  i=i+1 | |  i <- i+1
 | | | }
if | if i==10: | | if (i==10) {
 |  print('Yes') | |  print('Yes')
 | else: | | } else {
 |  print('No') | |  print('No')
 | | | }
length | len(x) | len(x) | length(x), nrow(x)
columns | len(x[0]) | x.shape[1] | ncol(x)
dimension | | x.shape | dim(x)
summary | | | summary(x)
read csv | import csv | | mydata <- read.csv("file", header=TRUE)
 | reader=csv.reader(open('file','r')) | |
 | for line in reader: | |
 |  print(line) | |
write csv | import csv | | write.csv(data, file="file", row.names = FALSE)
 | writer=csv.writer(open('file','w')) | |
 | for d in data: | |
 |  writer.writerow(d) | |
sum | sum(x) | sum(x) | sum(x)
select element | x[1][1] | x[1,1] | x[2,2]
last element | x[-1] | x[-1] | tail(x,1)
select column | | x[:,1] | x[,2], x$Name
correlation | | corrcoef(x,y) | cor(x,y)
mean | | mean(x) | mean(x)
function | def func(x): | | func <- function(x) {
 |  print(x) | |  print(x)
 |  return x | |  return(x)
 | | | }
dot product | | dot(x,b) | x %*% b
transpose | | transpose(x) | t(x)
matrix product | | b*x | t(b) %*% x
random | random.random() | random.rand(1) | runif(1,0,1)
sort | x.sort() | | sort(x)
help | help(command) | help(command) | help(command), ??command

Wednesday, March 23, 2011

Who we reward

Why do we reward people like Silvio Berlusconi or Charlie Sheen with our attention? They are bad at what they are supposed to do. They are addicts. You don't have to be moral about it; they are simply ugly.

Why don't we celebrate (more) people who create something beautiful, practical or useful to society?

Tuesday, March 22, 2011

Maths of self-publishing

Joe Konrath has decided to self-publish rather than accept a 500k advance. He explains that he would get 70% rather than 14.9%. He also explains that pricing books cheaper will get him higher e-sales. In the table below I have calculated that with these figures he only needs to sell 43% of what the publisher expects to sell to make the same money.

              Price   Royalty %   Sales     Royalty
w/ publisher  $9.99   14.9%       335,906   $500,000.00
self pub      $4.99   70.0%       143,143   $500,000.00
%             50%     470%        43%
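The arithmetic behind the table is simply the target royalty divided by the royalty earned per copy. A small sketch, using the figures from the table (the function name is only for illustration):

```python
def copies_needed(target_royalty, price, royalty_rate):
    """Copies that must be sold to earn target_royalty at a given price and rate."""
    return target_royalty / (price * royalty_rate)


publisher_copies = copies_needed(500000, 9.99, 0.149)
self_pub_copies = copies_needed(500000, 4.99, 0.70)
share = 100 * self_pub_copies / publisher_copies

print(round(publisher_copies), round(self_pub_copies), round(share))
```

The 43% follows directly from the two ratios in the last row: half the price times 4.7 times the royalty rate gives about 2.35 times the royalty per copy, and 1 / 2.35 is roughly 0.43.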

Thursday, January 13, 2011


Everyone is talking about Groupon, so I wanted to check how fast they are growing. The chart shows the global visitors: Groupon has overtaken Yelp but is closely followed by Living Social. It looks as though Living Social could actually overtake Groupon. I am not sure if the dip is due to the incomplete January.

How to hack Google charts

I ran a comparison on Google Trends and this is the location of the chart png. As you can see it has lots of colours in there and labels.,B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6,nBnB,RRRR,________________________________AYBDBUBMBNBbByCJCzCcDlDrETFOF1GyGjHSHoJXKmLrLnMAMmMqOZQiUkTcSzUKVZbKaqe3nek4hK,AYAYAYAYAYAYAYAYAYAYAYAYAYAYAYAY______________________________________________________________________________,AVATARAKAcATASBeDzBIBSBBAkAdAiAyA2A2A.BTA6A4A1BGAwA8A2BHB7BkCiCnCcDDDiE5FZITIpHqGHHkIQHnJwJ1I5PDMWL8O3PNSyYecW,,MUMgN7OSO2QFPjQYQwQeQHRXRrSYTRTtUxVvVkV8T8UmU0UsVGSxRWSITyXFWwXgXMXmV8XKYvY7XsYCYnYQbSbSa6axa1Z3Z4dXYiUKWsXSYS,&chds=0.0,2100000.0&chs=580x188&chco=ffffff00,ffffff00,ffffff00,ffffff00,4684eeff,4684eeff,dc3912ff,4684eeff,ff9900ff,4684eeff&chls=1.0,1.0,0.0%7C1.0,1.0,0.0%7C1.0,1.0,0.0%7C1.0,1.0,0.0%7C1.75,1.0,0.0%7C1.5,3.0,3.0%7C1.75,1.0,0.0%7C1.5,3.0,3.0%7C1.75,1.0,0.0%7C1.5,3.0,3.0&chxt=x&chxr=0,0.0,100.0&chxl=0:%7C%7CJan+2009%7C%7C%7CApr+2009%7C%7C%7CJul+2009%7C%7C%7COct+2009%7C%7C%7CJan+2010%7C%7C%7CApr+2010%7C%7C%7CJul+2010%7C%7C%7COct+2010%7C%7C%7C&chxs=0,443322ff,9.0,0.0&chm=v,443322ff,1,-1,1%7Ct+Daily+Unique+Visitors,676767ff,0,0,10,1%7Ct+Google+Trends,676767ff,0,6,10,1%7Ct+1.4+M,676767ff,2,0,10,1%7Ct+700+K,676767ff,3,0,10,1&chg=12.0,33.33,1.0,1.0,4.0

For instance I could change the 'Google Trends' in the top right to my name. I tried changing the tick at 700 to 600 but that changes just the label not the data. You could try to increase the dimensions from 580x188 to something bigger.,B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6,nBnB,RRRR,________________________________AYBDBUBMBNBbByCJCzCcDlDrETFOF1GyGjHSHoJXKmLrLnMAMmMqOZQiUkTcSzUKVZbKaqe3nek4hK,AYAYAYAYAYAYAYAYAYAYAYAYAYAYAYAY______________________________________________________________________________,AVATARAKAcATASBeDzBIBSBBAkAdAiAyA2A2A.BTA6A4A1BGAwA8A2BHB7BkCiCnCcDDDiE5FZITIpHqGHHkIQHnJwJ1I5PDMWL8O3PNSyYecW,,MUMgN7OSO2QFPjQYQwQeQHRXRrSYTRTtUxVvVkV8T8UmU0UsVGSxRWSITyXFWwXgXMXmV8XKYvY7XsYCYnYQbSbSa6axa1Z3Z4dXYiUKWsXSYS,&chds=0.0,2100000.0&chs=580x188&chco=ffffff00,ffffff00,ffffff00,ffffff00,4684eeff,4684eeff,dc3912ff,4684eeff,ff9900ff,4684eeff&chls=1.0,1.0,0.0|1.0,1.0,0.0|1.0,1.0,0.0|1.0,1.0,0.0|1.75,1.0,0.0|1.5,3.0,3.0|1.75,1.0,0.0|1.5,3.0,3.0|1.75,1.0,0.0|1.5,3.0,3.0&chxt=x&chxr=0,0.0,100.0&chxl=0:||Jan+2009|||Apr+2009|||Jul+2009|||Oct+2009|||Jan+2010|||Apr+2010|||Jul+2010|||Oct+2010|||&chxs=0,443322ff,9.0,0.0&chm=v,443322ff,1,-1,1|t+Daily+Unique+Visitors,676767ff,0,0,10,1|t+Dirk+nachbar,676767ff,0,6,10,1|t+1.4+M,676767ff,2,0,10,1|t+700+K,676767ff,3,0,10,1&chg=12.0,33.33,1.0,1.0,4.0
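Editing the parameters by hand gets fiddly, so here is a hedged sketch of doing it in Python. The host and path in demo_url are placeholders (http://example.com/chart is not the real chart location) and the function name is my own; only the chs and chm parameter names come from the URL above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def edit_chart_url(url, size, old_label, new_label):
    """Return the chart URL with a new chs size and a renamed label inside chm."""
    parts = urlsplit(url)
    params = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "chs":
            value = size  # chart dimensions, e.g. 580x188
        elif key == "chm":
            value = value.replace(old_label, new_label)  # text markers
        params.append((key, value))
    # Keep commas, pipes and colons readable instead of percent-encoding them.
    return urlunsplit(parts._replace(query=urlencode(params, safe=",|:")))


# Placeholder URL with two of the parameters seen above.
demo_url = "http://example.com/chart?chs=580x188&chm=t+Google+Trends,676767ff,0,6,10,1"
new_url = edit_chart_url(demo_url, "800x300", "Google Trends", "My Name")
print(new_url)
```

As noted above, this only changes labels and layout; the encoded data series stay as they are.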

Wednesday, January 12, 2011

Kaggle social network challenge - test/train code

For those who took part in the Kaggle social network challenge, here is the Python code to split the full downloaded graph into test and training sets.

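# create a randomly ordered train set and a test set with equal amounts of true and false edges
# lines lost from the original listing are restored below and marked "assumed"

import random

# import complete file (all file names are assumed placeholders)
f1 = open('edges.csv')          # assumed: downloaded graph, one "a,b" edge per line
f2 = open('test.csv', 'w')      # assumed
f3 = open('validate.csv', 'w')  # assumed
f4 = open('train.csv', 'w')     # assumed

prim = []
prim_connections = {}  # outbound edge count per primary node
sec_connections = {}   # inbound edge count per secondary node
for line in f1:
    a, b = line.strip().split(',')  # assumed line format
    prim.append([a, b, random.random()])  # need rand for later
    prim_connections[a] = prim_connections.get(a, 0) + 1
    sec_connections[b] = sec_connections.get(b, 0) + 1
print(len(prim), len(prim_connections), len(sec_connections))

# universe of those with 2+ connections
prim_universe = [p for p in prim_connections if prim_connections[p] > 1]
sec_universe = [s for s in sec_connections if sec_connections[s] > 1]  # assumed definition
print(len(prim_universe), len(sec_universe))

# choose 2 sets of 5000
sample = random.sample(prim_universe, min(10000, len(prim_universe)))  # assumed size
sample1 = set(sample[:len(sample) // 2])
sample2 = set([i for i in sample if i not in sample1])
print(len(sample), len(sample1), len(sample2))

# sort by random
prim2 = sorted(prim, key=lambda rand: rand[2])
del prim

sample1_done = set()
prim3 = []
for i in prim2:
    # true test edge: node in sample1, not done yet, and inbound node has another edge
    if (i[0] in sample1 and i[0] not in sample1_done
            and (sec_connections[i[1]] > 1 or i[1] in prim_connections)):
        f2.write(i[0] + ',' + i[1] + '\n')    # test
        f3.write(i[0] + ',' + i[1] + ',1\n')  # validate
        sample1_done.add(i[0])  # is done
    else:
        f4.write(i[0] + ',' + i[1] + '\n')  # train
        if i[0] in sample2:  # create a subset of prim to speed up non pairs check
            prim3.append(i)
del prim2
print(len(sample1_done), len(prim3))

# for sample2 choose non connections
count = 0
for i in sample2:
    if count < 5000:  # assumed limit, the condition is cut off in the original
        prim4 = [j[1] for j in prim3 if i == j[0]]  # a subset
        done = 0
        while done == 0:
            rand = random.choice(sec_universe)
            if rand not in prim4 and rand != i:
                done = 1
        count += 1
        f2.write(i + ',' + rand + '\n')    # test
        f3.write(i + ',' + rand + ',0\n')  # validate
print(count)

for f in (f1, f2, f3, f4):
    f.close()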