There's no such thing as Unstructured Data and MongoNYC 2013 wrap-up

I made this point with Venkatesh Rao in our Future of Data Project for the CSC, but found myself at a table at the MongoDB NYC conference talking through it again---I think it's a useful enough idea, and maybe even an original enough one, to devote a post to.  (Thank you to my Data Science employer Noodle Education for sending me there.)

The claim is that viewing SQL vs NoSQL databases as Structured versus Unstructured is misleading.  There's no such thing as unstructured data---the word for that is noise.  Information requires context, and even the popular NoSQL forms, like document databases, time series dbs, or graph databases, can absolutely maintain context and can indeed be very structured.  A JSON (or BSON) document of the type you'd find in a document database can be composed of multiple objects, nested, linked, arrayed, and hierarchical.  These databases are not deficient in structure; for example, hierarchical data can be difficult to represent in the set language of SQL, but is much easier in JSON-like objects.
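To make the hierarchy point concrete, here is a minimal Python sketch of a nested, JSON-like document and a recursive walk over it.  The category tree and its field names ("name", "children") are made up for illustration; nothing here is specific to any one database.

```python
# A hypothetical category tree stored as one nested, JSON-like document.
# The field names "name" and "children" are illustrative assumptions.
tree = {
    "name": "root",
    "children": [
        {"name": "a", "children": [{"name": "a1", "children": []}]},
        {"name": "b", "children": []},
    ],
}


def walk(node, path=()):
    """Yield the slash-joined path of every node, parents before children."""
    here = path + (node["name"],)
    yield "/".join(here)
    for child in node.get("children", []):
        yield from walk(child, here)


paths = list(walk(tree))
print(paths)
```

The parent-child relationship lives in the nesting itself, so the traversal needs no join keys or self-referencing table.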

But consider even the case where you don't use these features and keep your documents perfectly flat: the very moment you seek to do any analysis, perform any link, or derive any aggregation on your data, you impose structure quite explicitly.  An analysis is the imposition of a structure or context onto a set of elements.  An analysis is a simplification, and is always an analysis with respect to something.

This means it is not so much that SQL is structured and NoSQL is unstructured---though you're perfectly capable of using poor ontologies in any database or organization of any kind---but rather that the more relevant division is:

SQL is statically structured where NoSQL is dynamically structured

The schema laid upon SQL is strong and permanent---reworking it is a bear.  The "schema" you get out of a NoSQL collection is often ad hoc, created and imposed at the time of query, to mix, match, and fulfil a particular and immediate need.  It may disappear after that analysis, it may become an input for another, or it may be a common enough query that it remains useful in the future.  But importantly the cost of restructuring and schema imposition in NoSQL databases is quite low---allowing analysts the freedom to re-see their data in new ways as conditions change or problems change, without having to consult a DB admin.
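A rough sketch of what "schema at the time of query" can look like, using plain Python dicts in place of a real collection.  The documents, their fields, and the helper total_by are all hypothetical; the point is only that the same loosely shaped records can be re-seen two different ways without any migration.

```python
from collections import defaultdict

# Hypothetical flat documents; note that the fields are not uniform.
events = [
    {"user": "ana", "city": "NYC", "amount": 12.5},
    {"user": "ben", "city": "NYC"},
    {"user": "ana", "city": "Boston", "amount": 7.5, "coupon": "SPRING"},
]


def total_by(docs, key, value="amount"):
    """Impose a (key -> summed value) structure at query time.

    Documents that lack the value field contribute 0 to the total.
    """
    totals = defaultdict(float)
    for doc in docs:
        totals[doc.get(key)] += doc.get(value, 0.0)
    return dict(totals)


# Two different "schemas" laid over the same data, each on demand.
by_city = total_by(events, "city")
by_user = total_by(events, "user")
print(by_city, by_user)
```

Each call picks its own grouping key and its own rule for missing fields, which is the kind of decision a static schema would have forced up front.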

The distinction of dynamic structure vs static structure is quite parallel to dynamic vs static typing in programming languages themselves---and it's no wonder that languages like JavaScript and Python fit very naturally with databases like MongoDB and Neo4j.

Now, onto reviewing the MongoDB conference itself.  Some things that are exciting (features in 2.4 improved in 2.6):

  • Full text search
  • GeoSearch

I sat in on both of these presentations, one by Eliot Horowitz, the other by Greg Studer.  Both were quite excellent.  I think the GeoSearch alone, particularly that they chose GeoJSON as their standard and made the code open for analysis, is a very big deal.  This, combined with recent improvements in date-math and date-query, is going to make mashups of geo+time+people+stuff data a lot easier.  Full text search also looked promising, with the ability to set field weights for matches.  It doesn't quite look ready to replace Solr yet, pending the need for synonym lists, but it is a very powerful tool as well.
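For a sense of what these features look like from Python, here is a small sketch that only builds the documents involved and does not talk to a server.  The field names ("loc", "title", "body"), the coordinates, and the weights are assumptions for illustration.  The $near filter with a $geometry point assumes a 2dsphere index on the field, and the text index keys and weights would be handed to whatever driver you use.

```python
import json


def geojson_point(lng, lat):
    """Build a GeoJSON Point; GeoJSON orders coordinates as [longitude, latitude]."""
    return {"type": "Point", "coordinates": [lng, lat]}


def near_query(field, lng, lat, max_meters):
    """Build a filter document using the $near operator with a GeoJSON point."""
    return {
        field: {
            "$near": {
                "$geometry": geojson_point(lng, lat),
                "$maxDistance": max_meters,  # meters when $geometry is GeoJSON
            }
        }
    }


# Hypothetical stored document with a GeoJSON location.
venue = {"name": "Conference venue", "loc": geojson_point(-73.99, 40.75)}

# Filter for documents within roughly 2 km of a nearby point.
query = near_query("loc", -73.98, 40.76, 2000)

# Text index specification: both fields are indexed as text, and a match
# in "title" counts ten times as much as a match in "body".
text_index_keys = {"title": "text", "body": "text"}
text_index_options = {"weights": {"title": 10, "body": 1}}

print(json.dumps(query, indent=2))
```

Because the location is ordinary GeoJSON, the same document can be handed to other GeoJSON-aware tools unchanged.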

I'm looking forward to the next iteration.