Tuesday, April 28, 2009

Government internet database

Maybe the U-turn on the internet database also has to do with the realisation of what such a database would cost. Below I estimate how big one year's data set would be: about 503 terabytes. This would need an army of IT specialists to maintain, and a query on the database would probably take a long time. And this is a conservative estimate: it assumes that only the URL (and datetime) is kept, with no search keywords or the like, and that emails are converted to plain text and attachments are dropped.

Population                      60,000,000
Internet users (%)              73.00%
Internet users                  43,800,000

Websites
  Sites per day                 15
  Days per week                 6
  Datetime field                8 bytes
  URN field                     8 bytes
  Site field                    100 bytes

Emails
  Emails per day                15
  Text size of email            1,000 bytes
  Email accounts per person     2

Size of annual database         503,388,144,000,000 bytes
                                503,388,144,000 KB
                                503,388,144 MB
                                503,388 GB

Size of monthly slice           41,949.01 GB
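
The arithmetic behind these totals can be sketched as below. The input figures come from the table; how they combine is my reconstruction, not something stated in the post. I am assuming web visits are counted on 6 days a week for 52 weeks, emails on all 365 days, 15 emails per day for each of the 2 accounts, and only the 1,000-byte text stored per email. Under those assumptions the stated annual total comes out exactly, using decimal units (1 GB = 1,000,000,000 bytes).

```python
# Inputs taken from the table in the post.
population = 60_000_000
internet_percent = 73
sites_per_day = 15
days_per_week = 6
datetime_bytes = 8
urn_bytes = 8
site_bytes = 100
emails_per_day = 15
email_text_bytes = 1000
accounts_per_person = 2

# Assumptions (not stated in the post; chosen because they reproduce its total).
WEEKS_PER_YEAR = 52          # web visits: 6 days/week * 52 weeks
EMAIL_DAYS_PER_YEAR = 365    # emails: every day of the year

users = population * internet_percent // 100

# One web record = datetime + URN + site name.
web_record_bytes = datetime_bytes + urn_bytes + site_bytes
web_bytes = (users * sites_per_day * days_per_week * WEEKS_PER_YEAR
             * web_record_bytes)

# Assumed: emails_per_day applies to each account, text only is stored.
email_bytes = (users * emails_per_day * accounts_per_person
               * EMAIL_DAYS_PER_YEAR * email_text_bytes)

annual_bytes = web_bytes + email_bytes
annual_gb = annual_bytes / 1e9       # decimal gigabytes
monthly_gb = annual_gb / 12

print(f"Internet users:  {users:,}")
print(f"Annual database: {annual_bytes:,} bytes ({annual_gb:,.0f} GB)")
print(f"Monthly slice:   {monthly_gb:,.2f} GB")
```

On these assumptions email text accounts for roughly 95% of the volume, so the estimate is far more sensitive to the email figures than to the browsing ones.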

Wednesday, April 15, 2009

Small-town stereotypes

People from small towns and villages are usually either very shy,
because they are overwhelmed, or overconfident, because they think of
themselves as masters of their little universe. Big city people are
more likely to be neurotic.

Sent from my iPhone

Strangling furious customers

Nowadays everyone has to be afraid of starting an argument about the
niggly rules at airports. They will probably arrest and question you,
all in the name of terror prevention. But has anyone ever seen a
terrorist complaining about bad customer service? It's just a means to
perpetuate bad service and strangle the customer's rights.

Thursday, April 09, 2009

Significance, Importance, Relevance

You can think of these three concepts in a particular order. Significance means that something is possibly true. However, this only counts if it is actually important, that is, if it has magnitude. Even if it is important, it is not guaranteed to be relevant: to be relevant it needs appeal and has to be practical and implementable.

Wednesday, April 08, 2009

I have uploaded a brief paper on testing for multi-modality using clustering - here.

In general, modality is looked at from a distributional perspective. Silverman’s test uses a kernel density estimate (KDE) to test for modality, but the test statistic is biased and it can only test one hypothesised number of modes against another. We propose a different approach: looking at a distribution from the top.
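
To make the KDE idea concrete, here is a minimal sketch of the building block behind Silverman's test: counting the modes of a Gaussian KDE at a given bandwidth. It is not the full test, which also searches for a critical bandwidth and calibrates it with a bootstrap, and it is not the clustering approach from the paper. The function names, the grid size and the sample data are all made up for illustration.

```python
import math


def kde(x, data, h):
    """Gaussian kernel density estimate at point x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) / norm


def count_modes(data, h, grid_size=400):
    """Count local maxima of the KDE on an evenly spaced grid.

    The grid extends 3 bandwidths beyond the data on each side.
    """
    lo = min(data) - 3 * h
    hi = max(data) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    dens = [kde(lo + i * step, data, h) for i in range(grid_size)]
    return sum(
        1
        for i in range(1, grid_size - 1)
        if dens[i - 1] < dens[i] >= dens[i + 1]
    )


# Hypothetical sample with two well-separated clusters (around 1 and 5).
sample = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]

modes_small_h = count_modes(sample, h=0.3)  # narrow kernel
modes_large_h = count_modes(sample, h=3.0)  # wide kernel

print("modes at h=0.3:", modes_small_h)
print("modes at h=3.0:", modes_large_h)
```

With the narrow bandwidth the two clusters show up as separate modes, while the wide bandwidth smooths them into one. That dependence on the bandwidth is why a KDE-based test has to be framed as one number of modes against another.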