Friday, October 05, 2012

Tags on top 100 UK sites

I took the top 100 UK web sites from Alexa and searched the homepage for web analytics tags. For nine websites I could not find any JS files or named tags, or the site returned an error.
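
A minimal sketch of this kind of check, assuming a tag can be recognised by a domain substring in the homepage source. The signature strings and the `detect_tags` helper are illustrative assumptions, not the list behind the numbers in this post, and the sample HTML is made up.

```python
# Illustrative signatures only (assumed, not taken from the original analysis).
TAG_SIGNATURES = {
    "Google Analytics": ["google-analytics.com"],
    "Doubleclick": ["doubleclick.net"],
    "Comscore": ["scorecardresearch.com"],
    "Nielsen": ["imrworldwide.com"],
    "Omniture": ["2o7.net", "omtrdc.net"],
}


def detect_tags(html):
    """Return the sorted names of tags whose signature appears in the HTML."""
    page = html.lower()
    return sorted(
        name
        for name, needles in TAG_SIGNATURES.items()
        if any(needle in page for needle in needles)
    )


# Made-up homepage fragment for demonstration.
sample = (
    '<script src="http://www.google-analytics.com/ga.js"></script>'
    '<img src="http://b.scorecardresearch.com/p?c1=2">'
)
found = detect_tags(sample)
print(found)
```

Substring matching is crude: it misses tags loaded indirectly by other scripts, which may be one reason some sites show no tags at all.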

I found that 23% use Doubleclick, 44% use Google Analytics, 9% use Nielsen, 5% use Omniture and 9% use Comscore. Tagman is used by only one site. Five sites each carry three tags.

Thursday, October 04, 2012

scrape random movies from IMDB

I have created a scraper for IMDB. It builds a graph as it downloads all referenced actors/directors, keywords, languages etc., basically features which could be fed into a recommender or similar system.


Output looks like:

genre/Family name/nm0784124/ name/nm1293791/ name/nm0265620/ name/nm0754781/ keyword/bear country/jp language/ja
genre/Romance genre/Mystery name/nm0130215/ name/nm0280541/ name/nm0302384/ name/nm0309129/ name/nm0130191/ name/nm0560478/ name/nm0001607/ name/nm0908001/ name/nm0912604/ name/nm0133597/ name/nm0489010/ name/nm0005166/ name/nm0593411/ name/nm0929869/ keyword/soap keyword/tragedy keyword/betrayal keyword/shipper country/us language/en
genre/Drama name/nm0430267/ name/nm3136900/ name/nm0018495/ name/nm0068168/ name/nm0231191/ name/nm0263099/ name/nm0341647/ name/nm0367731/ name/nm0792129/ name/nm0909848/ country/gb language/en company/co0248652/
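
To feed this output into a recommender, the tokens first need to be grouped by type. Here is a rough sketch, assuming the tokens are whitespace-separated `type/value` pairs and that values contain no spaces. The `parse_features` helper is hypothetical and not part of the scraper.

```python
from collections import defaultdict


def parse_features(text):
    """Group 'type/value' tokens by their type (genre, name, keyword, ...)."""
    features = defaultdict(list)
    for token in text.split():
        # Drop the trailing slash on ids such as 'name/nm0784124/'.
        kind, _, value = token.strip("/").partition("/")
        features[kind].append(value)
    return dict(features)


record = "genre/Family name/nm0784124/ keyword/bear country/jp language/ja"
features = parse_features(record)
print(features)
```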

Tuesday, September 25, 2012

Python: combine lists or dictionaries

I have written some tests to see which approach is quickest to combine/reduce data where keys overlap. I tested whether it is better to store the data in lists or dictionaries. It turns out stacking lists and then using defaultdict (method 5) is the fastest. Using itertools.groupby() was the slowest.

The code is here

387299 1.15799999237 2.4681637353
387299 0.80999994278 2.4681637353
387299 0.805000066757 2.4681637353
387299 0.775000095367 2.4681637353
387299 0.677999973297 2.4681637353
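
For illustration, the stack-then-defaultdict idea might look like the sketch below. This is my reading of the approach, not the tested code itself: it assumes the data are lists of (key, value) pairs and that overlapping keys are reduced by summing.

```python
from collections import defaultdict


def combine(list_a, list_b):
    """Stack two lists of (key, value) pairs and sum the values per key."""
    stacked = list_a + list_b  # "stacking" taken to mean concatenation
    totals = defaultdict(int)
    for key, value in stacked:
        totals[key] += value
    return dict(totals)


combined = combine([("a", 1), ("b", 2)], [("a", 3), ("c", 4)])
print(combined)
```

The defaultdict makes a single pass over the stacked data and needs no sorting, whereas itertools.groupby() only groups adjacent items and so requires a sort first.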

Thursday, September 13, 2012

Is MapReduce on AWS fast?

I have been playing around with Elastic MapReduce on AWS. I ran the wordsplitter example from the AWS tutorial. The job took 3 minutes to complete the word count on 12 files.

I then wrote the whole thing in native Python using a dictionary (without MapReduce); this took 4 seconds to run on the EC2 server. So I am not that impressed with MapReduce. The gap might be due to file access or job creation, but it is still hard to see what the fuss is about.
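
A dictionary-based word count of this kind can be sketched in a few lines. The tokenisation rule (lowercase runs of letters, digits and apostrophes) is my assumption and may differ from what the AWS example does.

```python
import re


def count_words(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = {}
    for line in lines:
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words(["the cat", "The dog the"])
print(counts)
```

To run it over files, pass an open file object (or several chained together) in place of the list of strings.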

Tuesday, September 11, 2012

Predictive problems in customer FMCG analysis

Here are six areas in customer FMCG analysis which can be solved by predictive modelling:
o How can we predict the S shape of new product sales at the earliest possible stage
Customer targeting
o Which offers are most relevant for which customers (at what time) - to increase participation
Propensity to buy (cross sell)
o Who are the best targets for a product/category – to increase sales
Similarity (customers or products)
o Both can be used to recommend additional products to customers
Promotions (optimisation)
o Which mechanic and discount on which product delivers most ROI
Life stage and events
o Family types
o New family member, pregnancy
o New job/wealth
Some of these are already being addressed by retailers; others will grow in importance. If you want to share ideas on any of them, let me know.

Post crisis agency relationship

The relationship between clients and analytical agencies has shifted. This is due to three reasons:
Abundance of data storage and computing power
Need to tighten budgets (especially post 2008)
Time to extract knowledge from agencies
Clients have learned that analysis is not rocket science anymore. They have brought a lot of capacity in-house. New technical solutions have made it possible to get around drawn-out, bureaucratic IT projects. The focus has shifted from service to product.
Of the three reasons, the second one is most often neglected, but it means agencies need to prove their viability again and again.
The question is how agencies will respond. One answer can be syndication. By aggregating cross-client data they can create synergetic solutions which all parties can gain from.
Another answer is protection of intellectual property (IP). Previously it was assumed that algorithms could not be copied because reverse engineering them takes too much time. With more and more open source solutions, approaches can be borrowed quite easily. The output might not be identical, but it will be similar.
A third answer can be analytical talent. If agencies can secure the best talent (i.e. become more expensive, which goes against the tight budget issue), they could persuade clients to stick with them.

Wednesday, August 22, 2012

Telecoms will push analytics

Telecom firms are starting to ponder what to do with the massive amounts of data they collect every day. How can they derive anonymised insights from their data which they can package up and sell?

They know the following things about (especially paid smartphone) users:
Who they are and where they live
Where they went at what time
How much they use their phone and for what (that includes websites browsed)
Who they are connected to

The challenge for many telecoms is to bring this information into a Single Customer View (SCV), as it currently sits in silos and is collected at different frequencies.

Using these four tiers of information, the telco can produce unique insights such as:
Who shops where and at what time?
Where do those shoppers go before and after the shop?
What is the cross shopping (switching) of users?
What websites correlate with what shopping behaviour?
Who are the influencers in a network?
o So if I wanted to target only influencers, who would it be?
How do online (mobile) and offline shopping behaviour interact?

Change index for time series in Excel

I have created an Excel tool that compares several time series and lets you change the index/base date by selecting it from the drop-down or moving the slider. It also tells you which date delivers the highest growth.
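
The calculation behind such a tool is simple: every value is divided by the value on the base date and multiplied by 100. Here is a sketch in Python rather than Excel, with made-up numbers. I read "highest growth" as growth from the base date to the latest observation, which is an assumption, and the function names are hypothetical.

```python
def rebase(series, base_date):
    """Re-index a {date: value} series so that base_date equals 100."""
    base = series[base_date]
    return {date: 100 * value / base for date, value in series.items()}


def best_base_date(series):
    """Base date giving the highest growth to the last observation.

    Assumes the dict was filled in chronological order.
    """
    last = list(series.values())[-1]
    return max(series, key=lambda date: last / series[date] - 1)


sales = {"2012-01": 80, "2012-02": 50, "2012-03": 100}  # made-up numbers
indexed = rebase(sales, "2012-02")
best = best_base_date(sales)
print(indexed, best)
```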


Wednesday, August 15, 2012

simple cross purchase code

If you have an item-level data set and want to explore cross purchase in SAS, you usually need to sum and transpose your data first, in several steps. Here is a simple SQL-within-SAS example that does this succinctly using the max and case functions. It won't necessarily run faster, though.

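/* y and z stand for the two product ids to compare */
proc sql;
create table x as
select custid,
       max(case when prodid=y then 1 else 0 end) as prod1, /* 1 if the customer ever bought y */
       max(case when prodid=z then 1 else 0 end) as prod2  /* 1 if the customer ever bought z */
from items
group by custid;
quit;

/* cross-tabulate the two purchase flags */
proc freq data=x;
tables prod1*prod2;
run;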
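
For readers without SAS, the same flag-and-crosstab idea can be sketched in plain Python. The item rows and the product ids "y" and "z" are made-up placeholders, and `cross_purchase` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def cross_purchase(items, prod_a, prod_b):
    """Count customers by (bought prod_a, bought prod_b) flags.

    items is an iterable of (custid, prodid) rows.
    """
    bought = defaultdict(set)
    for custid, prodid in items:
        bought[custid].add(prodid)
    return Counter(
        (int(prod_a in prods), int(prod_b in prods))
        for prods in bought.values()
    )


items = [(1, "y"), (1, "z"), (2, "y"), (3, "w"), (3, "z"), (4, "y"), (4, "z")]
table = cross_purchase(items, "y", "z")
print(table)
```

The key (1, 1) holds the number of customers who bought both products, which corresponds to the bottom-right cell of the proc freq table.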

Thursday, July 12, 2012

Lloyds TSB's targeted outdoor ads

Lloyds TSB is running an outdoor campaign saying things like 'Platform 1, a good place to check your banking' (when at a train station). However, their media buyers put the ads in the wrong places. I was on platform 18 when I read the one about platform 1. I was on Upper Richmond Road when the ad said Tooting High Street. Silly, misplaced targeting.

Friday, February 17, 2012

Some notable string quartets (links)

Big Data?

There is a lot of talk about big data and data science. With the increasing abundance of structured and unstructured data, many praise the possibilities of analysing terabytes of data. But as nice as some of the visualisations might look, the question is how businesses can use the new tools and data to make money. Is big data always better than old/small data?

I work with many businesses that have a lot of data but don’t have the time to look at all the information they have and could have. Before you extend your database, have a think about what more you can do with existing data. Integration is the key: there is no benefit in having half a dozen unlinked databases that come up with different answers.

Data collection needs to be improved. Before you commission a new database, think how you can track more behaviour about customers in your existing database.

Many new start-ups focus on collecting huge databases, hoping that clients will be amazed by the richness of the data. But this only works for clients who know how to use data, and not many businesses do.
Facebook has probably the biggest ‘customer’ database in the world but almost all their money is made from display banners, which have been around since 1993.

A census is better than a sample, but true data makes a change; sometimes tracking a simple statistic that everyone understands can make all the difference.