Big Data: Lakes, Swamps or Reservoirs?
August 8, 2019
This blog post was written by Xenomorph COO Robert Ottley
One topic that has drawn a lot of interest in recent years is that of data lakes. A data lake, for those unfamiliar with the term, is an analogy for a very large store of data captured in its natural form. In other words, it represents data that hasn’t necessarily been cleansed or structured, but has simply flowed in from various sources and been captured into one very large repository.
One of the problems with such a repository is that it can be difficult to sift through, given its size and lack of structure. Traditionally, financial data stores have been significantly more structured in terms of their data model. This structured approach not only helps to describe and normalise the content, but also maps relationships between interdependent data items, which in turn can help significantly when it comes to manipulating the data in a business context.
Another drawback of storing data in its natural form is that it may be of dubious quality. When we think of lakes, we tend to conjure up an image of pristine waters fed by mountain streams. However, that image rarely applies to data.
When data flows into a big lake without any traditional enterprise data management (EDM) processes to validate and cleanse it, the more appropriate analogy may be that of a swamp – where the underlying data is of such poor quality that you struggle to see through it or derive any value from it at all. Perhaps a better approach to storing financial data would be analogous to reservoirs – man-made constructs that provide more structure than lakes and have been built to store clean, potable contents (water or data).
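To make the reservoir idea concrete, here is a minimal sketch of a validation gate that sits between incoming data and the store. It is purely illustrative: the `PriceRecord` type, the individual rules and the 50% jump threshold are assumptions made for this example, not a description of any particular EDM product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PriceRecord:
    """Hypothetical shape of one incoming price observation."""
    instrument_id: str
    as_of: date
    price: Optional[float]


def validate(record, previous_price=None, max_jump=0.5):
    """Return a list of rule violations; an empty list means the record is clean."""
    problems = []
    if not record.instrument_id:
        problems.append("missing instrument id")
    if record.price is None:
        problems.append("missing price")
    elif record.price <= 0:
        problems.append("non-positive price")
    elif previous_price and abs(record.price / previous_price - 1) > max_jump:
        # Assumed rule: flag moves larger than max_jump (50% by default).
        problems.append("suspicious jump versus previous price")
    return problems


def admit(records):
    """Split records into those allowed into the store and those held back."""
    clean, quarantined = [], []
    last_price = {}  # last accepted price per instrument
    for record in records:
        problems = validate(record, last_price.get(record.instrument_id))
        if problems:
            quarantined.append((record, problems))
        else:
            clean.append(record)
            last_price[record.instrument_id] = record.price
    return clean, quarantined
```

Records that fail a rule are quarantined together with the reasons, rather than silently dropped, so that someone can review them before they reach the clean store.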
Now, when it comes to finding the right architecture to build your data reservoir, there will undoubtedly be lots of debate over what technologies to use. While we have traditionally been proponents of SQL, we have seen increasing adoption and interest in using NoSQL data stores over the years.
We also see increasing convergence between the two. Being able to combine the structure of SQL with the scalability and performance of NoSQL seems like the ideal outcome. In that respect, a few initiatives are already underway to bridge the two approaches. One is being developed within Microsoft’s Azure Data Lake stack, whose unified query language, U-SQL, offers the potential to bring the structured and unstructured worlds together.
Within the capital markets industry, the potential benefits of such an approach are certainly worthy of investigation. Although most reference and market data tends to be highly structured in nature (and probably best suited to SQL), there are some content sets that could be better suited to NoSQL. That might be because of their sheer scale, such as very large tick data stores covering millions of instruments and spanning decades of history. Or it could be because of their unstructured nature, such as natural language content (news articles, prospectuses, blogs and tweets), which can be analysed for sentiment and used as an input into trading strategies.
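As a toy illustration of working across both kinds of store, the sketch below keeps reference data in a relational table (using SQLite) and news items as schemaless documents (a plain list of dictionaries standing in for a NoSQL store), then answers one question that spans both. All of the table names, fields and sentiment scores are invented for the example, and the approach is not U-SQL or any vendor's actual implementation.

```python
import sqlite3

# Structured side: reference data in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instrument (ticker TEXT PRIMARY KEY, name TEXT, sector TEXT)"
)
conn.executemany(
    "INSERT INTO instrument VALUES (?, ?, ?)",
    [("ABC", "ABC Corp", "Energy"), ("XYZ", "XYZ plc", "Banking")],
)

# Unstructured side: documents with no fixed schema (hypothetical content).
news_store = [
    {"tickers": ["ABC"], "headline": "ABC Corp beats forecasts", "sentiment": 0.6},
    {"tickers": ["ABC", "XYZ"], "headline": "Regulator reviews sector", "sentiment": -0.2},
    {"tickers": ["XYZ"], "text": "Item without a headline field", "sentiment": -0.4},
]


def average_sentiment_by_sector(conn, documents):
    """Join document sentiment to relational reference data, grouped by sector."""
    sector_of = dict(conn.execute("SELECT ticker, sector FROM instrument"))
    totals = {}  # sector -> (sum of scores, number of scores)
    for doc in documents:
        score = doc.get("sentiment")
        if score is None:
            continue  # tolerate documents that carry no sentiment score
        for ticker in doc.get("tickers", []):
            sector = sector_of.get(ticker)
            if sector is None:
                continue  # ticker not present in the reference data
            total, count = totals.get(sector, (0.0, 0))
            totals[sector] = (total + score, count + 1)
    return {sector: total / count for sector, (total, count) in totals.items()}


result = average_sentiment_by_sector(conn, news_store)
```

Here the join logic lives in application code. The appeal of a unified query language is that this kind of cross-store question could be expressed declaratively instead of being hand-written each time.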
Being able to work with data stored in both SQL and NoSQL databases using a unified query language is something that we think holds significant promise, and it is an area that we at Xenomorph have been actively developing over the past 18 months.