Sunday, May 05, 2013

Three microblogs: The Ascetic Programmer, Science in Crisis and Data Science Matters.

I've started three thematic microblogs you may be interested in.

Thursday, October 18, 2012

R anti-tips

Not all R tips are equally good. Let's set the record straight.

Thursday, May 10, 2012

The essential R packages

Much has been said about the richness of the system of packages for R, but where is one supposed to start?

Saturday, December 03, 2011

Mapreduce everywhere

Mapreduce could extend its reach beyond — or inside — the data center. Coming soon to a computer near you?

Thursday, September 15, 2011

The connected components example, rewritten using RHadoop/rmr

My new implementation of random mate for mapreduce, using the rmr package from RHadoop, Revolution Analytics' open source project.

Wednesday, April 27, 2011

A map reduce algorithm for connected components: implementation

At long last, a complete implementation of the algorithm I described some time ago.

Friday, April 15, 2011

Bringing relational joins to Rhipe

Relational operations are a very common way to express map-reduce computations at a higher level, but Rhipe, an R package for mapreduce, doesn't have any. Let's start to fix this with a basic join function.

Monday, April 11, 2011

Let a million Twitters bloom

Why are some people uncomfortable with cloud computing? What are the limitations and is there a way forward?

Thursday, April 07, 2011

Looking for a map reduce language

On a quest for an elegant and effective map reduce language, I went through a number of options and put together some considerations. And the winner is …

Monday, November 29, 2010

Find the odd bag

From a job interview challenge, an interesting probability exercise in two parts. The first part is pretty standard fare. You are given a clearly defined random procedure whose outcome is a mixture of two distributions. The problem is, given a certain set of outcomes, to find which of the two distributions they come from. For instance, imagine you have to assign one of two classes to an item based on repeated noisy measurements, and you know the relative size of the two classes (the a priori probability of belonging to each).

The second part of the challenge is a bit more interesting but also eccentric. It asks for the best case outcome, the one that makes it easiest (smallest sample) to detect the class of the item with a given error probability. I am not aware of any practical statistical question where such a best case problem arises, even if we consider the converse, the worst case outcome. But being used to worst case analysis from my CS training, I came up with an optimality proof based on induction and manipulation of binomial coefficients, which confirms the intuition that a very unlikely, extreme outcome is the best one. The main idea is that when lower bounding an expression that includes binomial coefficients, it is somehow easier to prove a tight lower bound, because the binomial coefficients on the two sides of the inequality are very similar in that case, so one can simplify a lot and then use simple algebra. It won't set the world of Mathematics abuzz, but it seemed interesting enough to share.
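To make the two-class example concrete, here is a minimal Python sketch. It is not the original challenge: it assumes each noisy measurement is a yes/no outcome that comes up "yes" with probability p0 for class 0 and p1 for class 1, and all names (posterior_class1, smallest_extreme_sample, p0, p1, prior1, max_error) are hypothetical. The first function applies Bayes' rule to k "yes" outcomes out of n. The second looks for the smallest sample in which the most extreme outcome (all "yes") already identifies class 1 within the given error probability, which is the best case discussed above under these assumptions.

```python
from math import exp, log


def posterior_class1(k, n, p0, p1, prior1):
    """Posterior probability of class 1 after k 'yes' outcomes in n measurements.

    Assumed model (not from the original post): measurements are independent,
    with P(yes) = p0 under class 0 and P(yes) = p1 under class 1.
    """
    # The binomial coefficient C(n, k) is the same for both classes,
    # so it cancels in the likelihood ratio and is left out.
    loglik0 = k * log(p0) + (n - k) * log(1 - p0)
    loglik1 = k * log(p1) + (n - k) * log(1 - p1)
    log_odds = log(prior1) - log(1 - prior1) + loglik1 - loglik0
    return 1 / (1 + exp(-log_odds))


def smallest_extreme_sample(p0, p1, prior1, max_error, n_max=1000):
    """Smallest n for which an all-'yes' outcome gives posterior >= 1 - max_error.

    Assumes p1 > p0, so all-'yes' is the outcome most favorable to class 1.
    Returns None if no n up to n_max is enough.
    """
    for n in range(1, n_max + 1):
        if posterior_class1(n, n, p0, p1, prior1) >= 1 - max_error:
            return n
    return None
```

For example, with p0 = 0.25, p1 = 0.75, equal priors and a 5% error probability, the posterior after n all-"yes" measurements is 3^n / (3^n + 1), so three measurements are enough (27/28) while two are not (9/10).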