Google Correlate: Take Two

published by adam on Fri, 09/16/2011 - 10:46

I've posted before about Google Correlate in "Google Correlate for fun and profit". It's a fantastic platform, but I have not yet discovered any practical use cases for it. That is still the case, but after experimenting with it more, I now have a better idea of what Google would need to do to make it more useful, and why they can't do it. First off, you might want to take a look at their white paper, available here, or attached at the end of this article. The relevant point is this passage on their methodology:

"In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of precision and speed by using a two-pass hash-based system. In the first pass, we compute an approximate distance from the target series to a hash of each series in our database. In the second pass, we compute the exact distance function on the top results returned from the first pass."

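To make the two-pass idea concrete, here is a minimal sketch in Python. It is not Google's implementation: the white paper's actual hash is not reproduced here, so a crude block-average "sketch" of each series stands in for it, and every name (`build_index`, `two_pass_search`, `BLOCKS`, the synthetic data) is made up for illustration. What it does show is the shape of the approach: the cheap representation is computed once per stored series, and the exact correlation is only computed for a short list of candidates.

```python
import math
import random

BLOCKS = 8  # sketch length; an arbitrary choice for this illustration


def standardize(series):
    """Centre and scale to unit norm, so a dot product is a Pearson correlation."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def sketch(series, blocks=BLOCKS):
    """Stand-in for a hash: the average of each of `blocks` equal chunks."""
    size = len(series) // blocks
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(blocks)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_index(raw_series):
    """Precompute standardized series and their sketches, once, ahead of queries."""
    stored = {name: standardize(s) for name, s in raw_series.items()}
    sketches = {name: sketch(s) for name, s in stored.items()}
    return stored, sketches


def two_pass_search(target, stored, sketches, shortlist=10, top=3):
    t = standardize(target)
    t_sketch = sketch(t)

    # Pass 1: approximate distance against every precomputed sketch.
    def approx(name):
        return sum((x - y) ** 2 for x, y in zip(t_sketch, sketches[name]))

    candidates = sorted(stored, key=approx)[:shortlist]

    # Pass 2: exact correlation, but only on the shortlist.
    exact = [(dot(t, stored[name]), name) for name in candidates]
    return sorted(exact, reverse=True)[:top]


# Synthetic database: 200 pure-noise series plus one noisy copy of the target.
rng = random.Random(0)
target = [math.sin(2 * math.pi * i / 64) for i in range(64)]
raw = {f"noise_{k}": [rng.gauss(0, 1) for _ in range(64)] for k in range(200)}
raw["match"] = [x + rng.gauss(0, 0.05) for x in target]

stored, sketches = build_index(raw)
for corr, name in two_pass_search(target, stored, sketches):
    print(f"{name}: r = {corr:.3f}")
```

Note that `build_index` runs before any query arrives. Whatever transformation you want applied to the stored series would have to be applied there, before the sketches are built.
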
Now, it's a wonder that Google Correlate can search such a large space of data quickly enough to return nearly instantaneous results. However, the two-pass hashing that makes this possible also reveals the key weakness of the algorithm: it can't be used to correlate against a function of Google's data. Any such transformation would have to be applied to every series in the database before the first-pass hashes are computed.

That matters because, when you're working with time series, it's important to be able to transform the series you're exploring. Is it seasonal? Seasonally adjust it. Do you want to explore differences rather than just levels? Do your models need a stationary time series (YES)? If you can't apply a first-difference transformation to Google's series, there's little point in comparing them against a differenced series of your own. The case where this is obviously important is stock returns: you wouldn't use levels, you'd use log returns, or a first- or second-differenced series.

I believe that, so far, providing this kind of transformation is out of reach, especially if you need interestingly customizable transforms. Google would have a hard time hashing every conceivable function applied to their time series database, and would receive little benefit from doing so. (I believe this is impossible to do in a generalized way: the two-pass hashing is necessary precisely because the search space is already too big.) However, it might not be too hard, and indeed could be of great value, to provide a few of the more common transformations. Log, log10, and first and second differences would probably go a long way toward making this useful in a variety of fields.
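
For reference, the common transformations mentioned above are only a few lines each. The sketch below is plain Python with hypothetical helper names (`diff`, `log_returns`, `pearson`) and two synthetic series invented for the example. The series share an upward trend but wiggle in opposite directions, so they look strongly positively correlated in levels and almost perfectly negatively correlated once differenced. That is why matching on levels tells you little about a differenced series.

```python
import math


def diff(series, order=1):
    """Difference a series `order` times (order=1 gives x[t] - x[t-1])."""
    out = list(series)
    for _ in range(order):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def log_returns(prices):
    """log(p[t] / p[t-1]); requires strictly positive values."""
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic data, made up for illustration: a shared upward trend plus
# wiggles that move in opposite directions.
a = [100 + t + 2 * (-1) ** t for t in range(100)]
b = [100 + t - 2 * (-1) ** t for t in range(100)]

log10_a = [math.log10(x) for x in a]
log10_b = [math.log10(x) for x in b]

print("levels:             ", round(pearson(a, b), 3))
print("log10 of levels:    ", round(pearson(log10_a, log10_b), 3))
print("first differences:  ", round(pearson(diff(a), diff(b)), 3))
print("second differences: ", round(pearson(diff(a, 2), diff(b, 2)), 3))
print("log returns:        ", round(pearson(log_returns(a), log_returns(b)), 3))
```

Applying these to your own series is easy. The limitation described above is that the same transformation cannot be applied to the series on Google's side before the search runs.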