07 January, 2008

Correlation IS NOT Causation

A classic example of mixing up correlation and causation. PolySigh regresses Huckabee's vote percentages on the percentage of Catholics by county and finds that Catholic counties tended to go for Romney. PolySigh concludes: "Huckabee did best among evangelicals in rural areas with lots of religious adherents. On the other hand, he did poorly among Catholics."

Unfortunately, there is a confounding variable. Catholic populations are higher in urban areas than in rural areas. People in urban areas tend to support Romney, while people in rural areas tend to support Huckabee. So the negative correlation between Catholicism and votes for Huckabee can be explained by a third factor, urbanization, even if Catholics themselves vote no differently from their neighbors. This alternative explanation is also more convincing because Catholics mostly vote Democratic anyway, so only a few would be voting in the Republican primary.
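To make the confounding argument concrete, here is a rough simulation sketch. Everything in it is made up for illustration: the counties, the shares, and the noise levels are assumptions, not PolySigh's data. Urbanization sets both the Catholic share and the Huckabee share, and the Catholic share never enters the vote calculation at all.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_counties(n=2000, seed=0):
    """Hypothetical counties where urbanization drives both variables."""
    rng = random.Random(seed)
    counties = []
    for _ in range(n):
        urban = rng.random() < 0.5
        # Catholic share depends only on urban/rural status plus noise.
        catholic = (0.30 if urban else 0.10) + rng.gauss(0, 0.03)
        # Huckabee share also depends only on urban/rural status plus noise;
        # the Catholic share is never used here.
        huckabee = (0.25 if urban else 0.45) + rng.gauss(0, 0.05)
        counties.append((urban, catholic, huckabee))
    return counties

def correlation(counties):
    """Correlation between Catholic share and Huckabee share."""
    return pearson([c for _, c, _ in counties], [h for _, _, h in counties])

counties = simulate_counties()
urban_only = [row for row in counties if row[0]]
rural_only = [row for row in counties if not row[0]]

print("all counties:", round(correlation(counties), 2))
print("urban only:  ", round(correlation(urban_only), 2))
print("rural only:  ", round(correlation(rural_only), 2))
```

Across all counties the correlation comes out strongly negative, but within the urban counties alone or the rural counties alone it sits near zero. Holding the third factor fixed makes the apparent Catholic effect disappear, which is what you would expect if urbanization is doing the work.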

This also illustrates the difference between types of data. If PolySigh had access to voter-level data instead of just county-level data, he could run the same regression and get a better estimate, because he could see exactly how many Catholics voted for Huckabee and how many voted for Romney. Aggregating to the county level throws that information away: it keeps each county's totals but not which voters cast which votes.
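A toy example shows what gets lost in aggregation. The two counties and all the voter counts below are invented for illustration, not real returns. In each county, Catholics and non-Catholics support Huckabee at exactly the same rate, yet the county-level numbers alone suggest a steep negative relationship.

```python
# Hypothetical voter counts, invented for illustration only.
# Each county: (Catholic voters, Catholics for Huckabee,
#               non-Catholic voters, non-Catholics for Huckabee)
COUNTIES = {
    "urban": (300, 75, 700, 175),
    "rural": (100, 45, 900, 405),
}

def county_level(cath, cath_huck, other, other_huck):
    """What aggregated data shows: (Catholic share, Huckabee share)."""
    total = cath + other
    return cath / total, (cath_huck + other_huck) / total

def voter_level(cath, cath_huck, other, other_huck):
    """What voter-level data shows: Huckabee rate for each group."""
    return cath_huck / cath, other_huck / other

def aggregate_slope(counties):
    """Slope of Huckabee share on Catholic share across the two counties."""
    (x1, y1), (x2, y2) = (county_level(*c) for c in counties.values())
    return (y2 - y1) / (x2 - x1)

print("county-level slope:", round(aggregate_slope(COUNTIES), 2))
for name, counts in COUNTIES.items():
    cath_rate, other_rate = voter_level(*counts)
    print(name, "Catholic:", cath_rate, "non-Catholic:", other_rate)
```

With only the county totals, all you can compute is the slope across counties. With the voter-level counts you can compare Catholics and non-Catholics inside the same county, and in this made-up case the difference is zero.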

I'm not sure why Henry Farrell picked this up since it seems rather obvious.
