Sunday, October 05, 2008

Six Products, Six Carbon Footprints - WSJ.com

Interesting article in the Wall Street Journal about attempts to study the carbon footprint of different products

The surprising lesson, at least for me, is that transportation does not account for much of the carbon footprint of several products. Sometimes is one of the raw materials (leather, polyester), sometimes is the presentation (refrigerated beer), sometimes the natural production processes (milk -- cows produce lots of methane). It's also a very practical read that you can turn into action.

Read More......

Sunday, August 17, 2008

Google returns more AdSense rich results than Live: conflict of interest?

Several search pundits have commented on conflicts of interest Google is potentially exposed to. Google itself admitted to not wanting to hold onto Performics, the search engine marketing division of their recent purchase Doubleclick, and in fact they quickly sold it. But that's only one of many conflicts of interest Google has to deal with. John Battelle observes in his searchblog that, in a comparison between Google Ad Planner and Comscore data, sites that use AdSense, Google's advertising platform, get better traffic numbers. If you use the free Ad Planner to plan an ad campaign your money is more likely to flow to Google's coffers than if you were using Comscore. Google provided a statement denying any deliberate bias, but it's hard to find alternative explanations. A similar conflict of interest arises for Quantcast and their Media Planner and rating services w.r.t. being fair to publishers that don't use their tag (they are not quantified in their jargon). They provide ratings and demographic analyses for both, but directly get data only from quantified publishers. Since I am a former employee I will abstain from a more detailed analysis in this case.

Another conflict of interest is between Google the search company and Google the media company: will Google search give prominence to Google properties Picasa, Knol, Blogger and YouTube over the rest of the net? Timo Paloheimo is afraid that could be the case and so he created a custom search engine that excludes such properties, a draconian solution in my opinion but a clear expression of the malaise generated by Google's dominant position in search and the multiple conflicts of interest with other Google products.

But this is only the tip of the iceberg IMHO: the main conflict of interest is between search and advertisment. When companies like Answers.com report that their traffic is down 28% because of a change in their Google ranking, Om Malik rightfully comments that this very fact "raises questions about Google’s power".
Google has an interest in returning the most relevant and useful results in each search; and also has an interest in directing traffic to sites that use adSense because they provide a big part of Google's revenue. How do they balance the two? Google's answer is that the two businesses are run as independently as possible (I forgot the source here, but I am almost sure it was a Google spokesperson). But is that a sufficient answer? Or should we trust Microsoft not to tweak their live search? I think I have strong evidence that Google search results are more likely to carry adSense ads than Microsoft's live.com.

The methodology is as follows:

  1. Sample random search keywords
  2. Use each of them as a query on google.com and live.com
  3. Check if any of the results in the first page uses adSense
I considered the ratio between the number of search matches using adSense and the total number of accessible search results (normally 10, but sometimes some could not be accessed within a reasonable time) for each keyword. Out of 632 searches, Google-provided matches were more adSense rich than live.com ones 316 times, the opposite was true 126 times and the ties were 190. Pretty remarkable, isn't it? If you are not convinced by your intuition or are statistically inclined, applying a Wilcoxon signed rank test rejects the hypothesis of the two search engines results being equally adSense rich with a p-value smaller than 2.2e-16.

Correlation is not causation. It could be that either company is tweaking their engine to favor or penalize adSense; or it could be some other mechanism I haven't figured out yet. But consider this hypothetical scenario: Google doesn't feel it needs to compete so hard in search anymore, wants to boost revenue, decides to give a small but competitively significant rank boost to adSense-running sites, those in turn outcompete their non-adSense-running competitors. We really don't want this to happen. We just recovered from another stranglehold on the computing world.

Read More......

Monday, August 11, 2008

Our energy future up close and personal

Saul Griffith, CEO of a wind power startup, tries to map out a possible energy future in terms of global and personal choices. It's the first time I see somebody mapping sustainable energy choices down to how many trips one can take or how many energy drinks one can have. Extremely instructive, a "must watch".

Read More......

Wednesday, July 30, 2008

reCAPTCHA, a bogus motivation

The authors claim they use spare cycles for a good cause, but it isn't true.

CAPTCHAs are little reading tests that allow to distinguish humans from machines when accessing a computer interface. A word is displayed in a slightly noisy form and the user is required to type it in, easy for humans but not so for machines. The authors of reCAPTCHA wanted to associate a useful task to this test. They say

... in aggregate these little puzzles consume more than 150,000 hours of work each day. What if we could make positive use of this human effort?
So instead of picking random words for the test, they pick word-sized fragment from large public digitization efforts (currently the one led by the Internet Archive) and submit them to the users. That is, reCAPTCHA words are words that are not in digital format, appear in a paper document of interest and machines could not understand. So by solving a reCAPTCHA you are helping a useful project. So where is the catch? Well it took me a moment of thought but it's kind of obvious: if these are fragments of real digitization projects, by definition we don't know the solution. If we don't know it, we can't use them as CAPTCHAs! So I went back to the reCAPTHCA web site, and what they do is to propose two CAPTCHAs, one with a known but useless solution and the other with an unknown but useful one, in a random order. If you want to use a website, you have to complete both. Therefore there is no use of spare cycles whatsoever: the effort for the user has been doubled. If reCAPTCHA were the standard, there would be another 150,000 hours a day spent solving the second word. Fine by me in exchange for a free service, but not as efficient as advertised, not even close.

Read More......

Wednesday, July 23, 2008

Live search cashback not so cheap

It's been said that the cashback program with which Microsoft is trying to increase it search market share is a bold if not desperate move, because MS is giving away to the user its search revenue. Not so fast.
I did an unscientific test simulating the purchase of a North Face Dyad 22 (a backpacking tent) and a pair of Wilson Trance all court (tennis shoes). I searched with live and google products. In both cases, I found a better deal with google products ($188 and $79 vs $201 and $82), live cashback included. One could speculate that google indexes more merchants or that there is a fee to be included in the cashback program, but I have no way to tell which is closer to the truth. Did I hit two outliers or is this something other people noticed?

Read More......

Sunday, June 08, 2008

Map of Science

Fascinating map of the relation between scientific disciplines. Not all of the methodology is clear, but it is fun to look at. Notice the lack of ties between CSE and Biology.

Map of Science



Read More......

Sunday, November 04, 2007

Facebook Illegal Wiretaps


The formulation of this problem is quite creative, but overall it is just describing a matrix where the rows are workers and the columns are tasks. Workers have numbers and tasks have names and the job completion time depend on whether the worker is odd or even, the number of vowels and consonants in the task name and even common factors between the task name length and the worker number. Contrived but trivial. On top of that one is called to solve an assignment problem or weighted bipartite matching in graph speak. One algorithm to do just that is the Hungarian Algorithm (I wish I could speak of the Italian Algorithm, but there are none I know of). To create the input matrix one can use the program below, but it doesn't really achieve the full generality called for by the Facebook people. I wrote it down just to convince myself there wasn't some obvious structure in the matrix. It is written in my latest loop-free style (what do you do if your company doesn't switch to a functional language? Simple, write functionally in any language to the extent that is possible).



use strict;

sub nconsonants {
my $a = shift;
$a=~s/[aeiouAEIOU]+//g;
return length($a);
}

sub factors {
my $a = shift;
my @factors = ([],[],[2],[3],[2,2], [5], [2,3], [7], [2,2,2], [3,3], [2,5]);
return $factors[length($a)];
}


sub name2info {
my $a = shift;
return {ncon=>nconsonants($a),
nvow=>length($a) - nconsonants ($a),
fact=>factors($a)
};
}

sub progtime {
my ($prog, $info) = @_;
my $sum = 0;
map({$sum+=(($prog % $_)?0:4)} @{$info->{fact}});
return ($prog % 2 ? 1.5 * $info->{nvow} : $info->{ncon}) + $sum + 1;
}

my @names = qw/ANDROMEDA BARBARA CAMERON DAGMAR EKATERINA FLANNERY GREGORY HAMILTON ISABELLA
JEBEDIAH KIMBERLEY LARISSA MEREDITH NORMAN OSWALD PENELOPE QUENTIN RANDALL SAVANNAH TABITHA
URSULA VIVIENNE WINONA XAVIER YVONNE ZENOBIA/;

my @res=
map ({my $info = $_;
[map({my $prog = $_;
progtime($prog, $info);
}
1..26)]
}
map({name2info $_
}
@names));

print join "\n", map({join " \t", @$_} @res);


Read More......

Friday, November 02, 2007

Facebook Prime Bits

This is one of Facebook job candidate puzzles. Given a range [a,b] of positive integer numbers, test for the primality of the number of 1 bits in the binary representation of each number, and do so in O(n) where n is b - a. Unfortunately the puzzle goes on to assume that n is a 64 bit integer, and that makes asymptotic analysis quite meaningless, so I will discard this assumption for now and comment on it later. The naive solution loops through the given range converting each number to its binary representation, counting 1 bits and testing primality. The complexity of this algorithm is O((b-a) log(b)), using the polynomial time deterministic primality test of Agrawal, Saxena and Kayal. One might think that the primality test is the hard part of the algorithm, but the numbers to be tested are pretty small. The representation of b takes log(b) bits. The number of 1 bits in it is at most, of course, log(b) and that can be represented with log(log(b)) bits, and the ASK algorithm is polynomial w.r.t to the size of its input, or poly(log(log(b))). So is it possible to get rid of the log(b) term depending on counting the bits in the binary representation of a number? The answer is in the following functions and is positive. The main idea is to avoid recomputing the same partial sum of bits over and over again. Imagine a trie or prefix tree of the binary representations of all the numbers in the range. Imagine a traversal in preorder that is performed while keeping a count of 1 bits seen so far between the root (most significant bit) and the current position. Every time a leaf (complete number) is reached, the results are extended with one item.


use strict;

sub countbits {
return f(@_, 0, []);
}

sub f {
my ($a, $b, $count, $retv) = @_;
if ($a != $b) {
my $k = int(log2($b));
if (topbits($a, $k) == topbits($b, $k)) {
return f(bottombits($a, $k),
bottombits($b, $k),
$count + topbits($a, $k),
$retv );
}
else {
f(bottombits($a, $k),
2**$k - 1,
$count,
$retv);
return f(0,
bottombits($b, $k),
$count + 1,
$retv);
}
}
else {
push @$retv, $count + g($a);
return $retv;
}
}

sub g {
my $a = shift;
return ($a == 0 ? 0 :$a % 2 + g ($a >> 1));
}

sub topbits{
my ($a, $k) = @_;
return $a >> $k;
}

sub bottombits{
my ($a, $k) = @_;
return $a % 2**$k;
}

sub log2 {
my $n = shift;
return log($n)/log(2);
}


The shift and mod operations are not strictly constant time, but they are used just to select the first bit and the remaining bits, so that can be done in constant time but I thought the function was clearer this way. The worst case analysis assumes that the double recursive call is always performed and relies on the fact that the number of significant digits is decreasing by one with each level of recursion. G is of course O(log(a)). This establishes a O((b-a)+log(a)) bound. So there is a log(a) term in there, and we still need the primality test, either precomputed (O((b-a)+log(a)+log(b)poly(log(log(b)))) total time) or performed on each number in O((b-a) poly(log(log(b)))). Neither is superior depending on a. So one can not strictly achieve O(b-a) as claimed by the Facebook gurus. This is not a limitation of my approach. If it were possible, it would imply for a=b that primality can be tested in constant time. Two friends of mine and programming superstars both used the additional assumption of 64 bit numbers to consider log(b) <=64 and therefore constant as we were discussing this problem. Personally, I think mixing finite and asymptotic analysis doesn't really help making things more clear and ridding oneself of a log(b) factor when it can be as large as 64 can have practical consequences. Actually, I think this is a great example that asymptotic analysis is insightful. Still it would be nice to make sure that my algorithm doesn't have horrible constants in its time complexity function. This is left as an exercise for the reader (suggestion: use generating functions).

Read More......

Thursday, May 24, 2007

ProjectDescription - Lucene-hadoop Wiki

Implementation of simple parallel computing, based on Google's map-reduce, runs over Amazon's EC2. Supercomputing for the rest of us


ProjectDescription - Lucene-hadoop Wiki

Read More......

Thursday, March 22, 2007

PLoS Computational Biology - “Antedisciplinary” Science

PLoS Computational Biology - “Antedisciplinary” Science

Sean Eddy, a computational biologist, critiques the big science team mentality promoted by the NIH. New disciplines are always founded amid resistance from the existing powers. When the rise of a new discipline is unavoidable, existing powers try to split the pie by forming teams -- and requiring everyone else to do so if they want funding. Well, the original spin is not so political, read the paper yourself. It voices the opinions of many computational biologists I know, who are struggling to fit into some department and get funding.

Read More......