Google Correlate for fun and profit

...well, if you find the "profit" application, let me know.

Google Labs has a new product called Google Correlate.  You can read the product's introductory description here (in comic book form).  The service takes any time series you give it, finds the Google search queries that correlate most highly with your series, and then does you the additional favor of plotting the results on a map.  Very cool.

The process was used effectively to predict flu spikes.  People start Googling for flu symptoms just slightly before flu outbreaks.  So, reporting spikes in these correlated queries gives public health officials a signal that something is coming.  

With a little imagination, you can see how this tool might have econometric or financial applications.  Take your favorite stock, for example, and see what search queries correlate with its behavior.  Perhaps people Googling "new car" provide an indication that the auto market will improve.  Then you can use that query series as input into a predictive model for the sector.  

Another example might be finding what correlates with the CPI or PCE, which would give you another input for real-time estimates of major economic numbers before the official data releases.  (Economic data usually arrives slowly, with a month or sometimes a quarter between releases.  Getting a better estimate before the release can make a big difference.)

Overall, I think it's a spectacular idea, but it hasn't worked for any of the few practical applications I tried.

Here's what happened when I input a few financial series, hoping to find the "new car" variable that would help predict auto stocks:

First, Ford (NYSE: F):

1) Get Data from R
> library(quantmod)  # provides getSymbols()
> getSymbols("F", from="2000-01-01", to=Sys.Date())
2) Export to CSV
3) Copy and paste into
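
For anyone following along, here is a rough sketch of step 2 in base R.  The helper name export_series and the two-column date/value layout are my own assumptions for illustration, not anything the service documents, and the prices in the demo are made-up placeholders rather than real Ford quotes.

```r
# Hypothetical helper: write a (date, value) series as a two-column CSV.
# The two-column layout is an assumption about what the upload expects.
export_series <- function(dates, values, path) {
  stopifnot(length(dates) == length(values))
  out <- data.frame(date  = format(as.Date(dates), "%Y-%m-%d"),
                    value = as.numeric(values))
  write.csv(out, file = path, row.names = FALSE, quote = FALSE)
  invisible(out)
}

# Made-up placeholder prices, standing in for the output of getSymbols().
demo_dates  <- as.Date("2011-01-02") + 7 * (0:3)
demo_prices <- c(16.8, 17.1, 18.0, 17.6)
demo_path   <- tempfile(fileext = ".csv")
export_series(demo_dates, demo_prices, demo_path)
```

With the real data loaded by getSymbols(), the call would presumably look something like export_series(index(F), Cl(F), "ford.csv"), where Cl() pulls the closing-price column.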

Top search-query correlations for the share price of Ford Motor Company over the last decade:

Correlation Query
0.8514 desktop windows
0.8029 remote desktop windows
0.7988 administrator windows
0.7984 blackcrow
0.7819 shortcuts windows
0.7565 occurred
0.751 remote desktop connection windows
0.7505 converse star player
0.7465 tmg
0.7403 unlink

I'll take that as a "nothin'."  All look spurious, but interesting.  "Windows" inexplicably comes up several times.

Let's try again, this time for something more internetty.  Perhaps people are less inclined to Google for cars.  But surely they will Google for things related to Google itself (NASDAQ: GOOG) when thinking about buying or selling Google stock.  Here are our correlates:

Correlation Query
0.8296 solitaire network
0.8267 st louis backpage
0.8233 10020
0.8199 94105
0.8118 94111
0.8109 podomatic
0.8107 orchard bank credit
0.8087 medical spa
0.8076 crazyshit
0.8076 10017

I really don't know how to interpret that.  Of all the search queries in the world, Google's share price seems to have a correlation with medical spas.  I'm quite surprised that the spurious results win over anything real.  Maybe the trader-bots have won after all, or every tech trader is using his Bloomberg terminal exclusively.

Here's the time-series graph for the top correlate:

Weird.  (And for the record, I also repeated the exercise on Google using first- and second-differenced series.  Both differenced series returned zero results.)
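
One possible reading of the levels-versus-differences contrast is the classic spurious-correlation problem with trending series.  The simulation below is only a sketch under the assumption that share prices and query volumes wander roughly like random walks (my assumption, not something I checked against the Correlate data).  It draws many pairs of completely independent random walks and counts how often they look "correlated" in levels versus in first differences.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
n_obs  <- 200    # length of each simulated series
n_sims <- 200    # number of independent pairs to draw

level_cor <- numeric(n_sims)
diff_cor  <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  a <- cumsum(rnorm(n_obs))               # random walk 1
  b <- cumsum(rnorm(n_obs))               # random walk 2, independent of a
  level_cor[i] <- cor(a, b)               # correlation of the levels
  diff_cor[i]  <- cor(diff(a), diff(b))   # correlation of first differences
}

# Share of pairs that look strongly "related" despite being independent.
share_levels <- mean(abs(level_cor) > 0.5)
share_diffs  <- mean(abs(diff_cor) > 0.5)
print(c(levels = share_levels, differences = share_diffs))
```

In a setup like this, the levels throw up a sizable share of impressive-looking correlations while the differenced series throw up essentially none.  Second differences would be diff(a, differences = 2).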

To add to the confusion, here are a few more correlated queries I found, this time using just other search queries, rather than time series:

1. "Frankenmuth Michigan" is highly correlated with "kitty litter" and "ear mites."
2. "The Onion" is correlated with "colon use" especially in Vermont.
3. "Vernors", a regional pop from Detroit, is correlated with "BB guns" at a two-week lag, and also almost every item of quirky Michigan-kitsch-icana.
4. "Pepsi" is correlated with "Bedroom expressions" and "Wolf Robotics."

These are entertaining bloopers, but I am disappointed that my quick explorations didn't point the way toward a more useful tool.  Looking at the results above, I wouldn't be surprised if even the "flu" and "mittens" correlations are not directly tied to the search queries so much as they are picking up latent variables, such as weather.  The seasonal effect dominates any relevance of the increase in queries.

That is, it's not that searching for flu symptoms is a genuine indicator of flu outbreaks (you're not searching because you feel like you're catching the flu) so much as that during winter, people get the flu, and people search for flu symptoms.  And mittens.  And sometimes kitty litter, apparently.  On top of that, there's enough noise in all the other queries (everything else is so unrelated in the great booming confusion that is internet search) that seasonality alone is enough to make a query look like a related indicator.  Don't get me wrong, it'll still work, because the correlation is founded in a shared relationship, but it won't necessarily be more informative than "flu outbreaks tend to happen when it's cold out," something you might proxy more easily (with a thermometer) than by scanning through Google search queries.
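
To make the latent-variable argument concrete, here is a small simulated sketch.  Everything in it is invented for illustration: two series called flu and mittens share nothing but a winter-peaking seasonal cycle, and subtracting each calendar month's average is just one simple way to strip that cycle out.

```r
set.seed(2)      # arbitrary seed, for reproducibility only
years  <- 30
month  <- rep(1:12, times = years)
season <- cos(2 * pi * (month - 1) / 12)   # shared cycle, peaking in January

# Two simulated series whose only link is the shared seasonal cycle.
flu     <- season + rnorm(length(month), sd = 0.3)
mittens <- season + rnorm(length(month), sd = 0.3)

raw_cor <- cor(flu, mittens)

# Remove each calendar month's average, leaving only the non-seasonal part.
flu_resid     <- flu     - ave(flu, month)
mittens_resid <- mittens - ave(mittens, month)
resid_cor     <- cor(flu_resid, mittens_resid)

print(c(raw = raw_cor, deseasonalized = resid_cor))
```

The raw correlation comes out high, and it mostly disappears once the monthly averages are removed, which is the pattern I'd expect if weather were doing all the work.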

Still, even if that's true, I'm surprised.  There are enough independent queries that the spurious correlations, like that between Pepsi and Wolf Robotics, seem like they should overwhelm almost any valid correlation out there.  Identification, as always, is the problem.

Anyway, it's an interesting product, and I'm looking forward to seeing how it develops.  Hopefully they'll post more tools in the future for sorting through the junk results.  This might be a job for the distance correlation measure...
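
For the curious, here is a bare-bones sketch of the sample distance correlation in base R, using the usual recipe of double-centered pairwise distance matrices.  The function name dcor_sketch is mine, and this is a toy version under that standard definition rather than a tuned implementation; I haven't tried it on Correlate output.

```r
# Toy sample distance correlation for two numeric vectors of equal length.
dcor_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  # Pairwise distance matrix with row means and column means removed
  # and the grand mean added back (double centering).
  center <- function(v) {
    d <- as.matrix(dist(v))
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  A <- center(x)
  B <- center(y)
  dcov2 <- max(mean(A * B), 0)   # guard against tiny negative rounding error
  sqrt(dcov2 / sqrt(mean(A * A) * mean(B * B)))
}

x <- seq(-1, 1, length.out = 201)
y <- x^2                         # fully dependent on x, but not linearly

pearson_xy <- cor(x, y)          # essentially zero on this symmetric grid
dcor_xy    <- dcor_sketch(x, y)  # clearly above zero
dcor_line  <- dcor_sketch(x, 2 * x + 1)
print(c(pearson = pearson_xy, dcor = dcor_xy, dcor_linear = dcor_line))
```

The appeal is that ordinary correlation misses the parabola entirely while distance correlation does not.  Whether that helps separate real relationships from the junk is a separate question.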

Have you had any more luck than I have?  Post a reply in the comments!

Standard disclaimer: Though mentioning stocks, the above is not to be taken as investment advice.  I do not currently (June 12, 2011) have any positions in Ford or Google.  (Or Wolf Robotics.)