Tuesday, September 25, 2012

Python: combine lists or dictionaries

I wrote some tests to see which approach is quickest for combining (reducing) data where keys overlap, and whether it is better to store the data in lists or in dictionaries. It turns out that stacking the lists and then using defaultdict (method 5) is the fastest. Using itertools.groupby() was the slowest.

The code is here

387299 1.15799999237 2.4681637353
387299 0.80999994278 2.4681637353
387299 0.805000066757 2.4681637353
387299 0.775000095367 2.4681637353
387299 0.677999973297 2.4681637353
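
As a rough illustration of the two approaches named above (this is not the original benchmark code), here is a minimal sketch. It assumes the data are (key, value) pairs and that "combining" means summing the values for each key. The function names and sample data are made up for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def combine_with_defaultdict(*lists):
    """Stack the lists, then sum values per key with a defaultdict."""
    stacked = [pair for pairs in lists for pair in pairs]
    totals = defaultdict(float)
    for key, value in stacked:
        totals[key] += value
    return dict(totals)


def combine_with_groupby(*lists):
    """Same result via itertools.groupby, which needs sorted input."""
    stacked = sorted(
        (pair for pairs in lists for pair in pairs), key=itemgetter(0)
    )
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(stacked, key=itemgetter(0))
    }


# Hypothetical sample data with one overlapping key ("apple").
a = [("apple", 1.0), ("pear", 2.0)]
b = [("apple", 3.0), ("plum", 4.0)]

combined = combine_with_defaultdict(a, b)
```

The groupby version has to sort the stacked data first, which is one plausible reason it would be slower, though the sketch does not measure that.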

Thursday, September 13, 2012

Is MapReduce on AWS fast?

I have been playing around with Elastic MapReduce on AWS. I ran the wordsplitter example from the AWS tutorial. The job took 3 minutes to complete the word count on 12 files.

I then wrote the whole thing in native Python using a dictionary (without MapReduce). This took 4 seconds to run on the EC2 server. So I am not that impressed with MapReduce. The overhead might be due to file access or job creation, but it is still hard to see what the fuss is about.
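
For reference, a dictionary-based word count can be sketched in a few lines. This is not the script from the test above. It assumes words are separated by whitespace, and the function names are hypothetical.

```python
def count_words(lines):
    """Count whitespace-separated words across an iterable of text lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def count_words_in_files(paths):
    """Apply count_words to several text files and merge the results."""
    totals = {}
    for path in paths:
        with open(path) as handle:
            for word, n in count_words(handle).items():
                totals[word] = totals.get(word, 0) + n
    return totals


# Small in-memory example instead of real files.
sample_counts = count_words(["the quick fox", "the lazy dog"])
```

Everything runs in a single process on one machine, so there is no job setup or data shuffling cost.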

Tuesday, September 11, 2012

Predictive problems in customer FMCG analysis

Here are six areas in customer FMCG analysis that can be addressed with predictive modelling:
New product sales
o How can we predict the S-shaped curve of new product sales at the earliest possible stage
Customer targeting
o Which offers are most relevant for which customers (at what time) - to increase participation
Propensity to buy (cross sell)
o Who are the best targets for a product/category – to increase sales
Similarity (customers or products)
o Both can be used to recommend additional products to customers
Promotions (optimisation)
o Which mechanic and discount on which product delivers the highest ROI
Life stage and events
o Family types
o New family member, pregnancy
o New job/wealth
Some of these are already being addressed by retailers today; others will grow in importance. If you want to share ideas on any of them, let me know.

Post crisis agency relationship

The relationship between clients and analytical agencies has shifted. This is due to three reasons:
Abundance of data storage and computing power
Need to tighten budgets (especially post 2008)
Clients have had time to extract knowledge from agencies
Clients have learned that analysis is not rocket science anymore. They have brought a lot of capacity in-house. New technical solutions have made it possible to get around drawn-out, bureaucratic IT projects. The focus has shifted from service to product.
Of the three reasons, the second one is most often neglected, but it means agencies need to prove their viability again and again.
The question is how agencies will respond. One answer can be syndication. By aggregating cross-client data they can create synergetic solutions which all parties can gain from.
Another answer is protection of intellectual property (IP). Previously it was assumed that algorithms could not be copied because reverse engineering them takes too much time. With more and more open source solutions, approaches can be borrowed quite easily. The output might not be identical, but it will be similar.
A third answer can be analytical talent. If agencies can secure the best talent (i.e. become more expensive, which runs counter to the tight budgets), they could persuade clients to stick with them.