NoSQL Document Database – Manhattan MarkLogic
July 14, 2014
A bit late in posting this, but given that I wrote something about RainStor I thought I should write up my attendance at a MarkLogic event day in downtown Manhattan from several weeks back. If you want some context, their NoSQL database is used to serve up content on the BBC website. They are unusual for the NoSQL “movement” in that they are a proprietary vendor in a space that is dominated by open source databases and the companies that offer support for them. The database they seem to compete with most in the NoSQL space is MongoDB, as both have origins as “document databases” (managing millions of documents is one of the most popular uses for big data technology at the moment, though it is less publicized than more fashionable things such as swallowing a Twitter feed for sentiment analysis).
In order to cope with the workloads that need to be applied to data, MarkLogic argue that data has escaped from the data center, in the sense that each silo of the business needs its own separate data warehouses and ETL processes. They put forward the marketing message that MarkLogic allows the data to come back into the data center, given that it can be a single platform where all data lives and where all workloads are applied to it. As such, it is easier to apply proper data governance if the data is in one place rather than distributed across different databases, systems and tools.
Apparently MarkLogic started out with the aims of offering enterprise search of corporate data content but has evolved much beyond just document management. Gary Bloom, their CEO, described the MarkLogic platform as the combination of:
• Search Engine
• Application Services
He said that the platform is not just the database but search and database together, with the aim of not just storing data and documents but of getting insights out of the data. Gary also mentioned the increasing importance of elastic compute, and said that MarkLogic has been designed to offer this capability to spin up and down with usage, integrating with and using the latest in cloud, Hadoop and Intel processors.
Apparently one of the large European investment banks is trying to integrate all of their systems for post-trade analysis and regulatory reporting. The bank apparently tried doing this by adopting a standard relational data model but faced two problems: 1) the relational databases were not standard and 2) it was difficult to arrive at and manage an overarching relational schema. On the schema side of things, the main problem they were alluding to seemed to be that a change in one schema had to be propagated through the whole architecture. The bank seems to be having more success now that they have switched to MarkLogic for this post-trade analysis. From a later presentation, it seems that things like trades are taken directly from the Enterprise Service Bus, so the data in the message is saved as is (schema-less).
One thing that came up time and time again was their pitch that MarkLogic is “the only Enterprise NoSQL database” with high availability, transactional support (ACID) and security built in. Gary criticized other NoSQL databases for offering “eventual consistency” and said that they aspire to something better than that (to put it mildly). I thought it was interesting over a lunch chat that one of the MarkLogic guys said that "MongoDB does a lot of great pre-sales for MarkLogic", meaning I guess that MongoDB is the marketing "poster child" of NoSQL document databases so they get the early leads, but as the client widens the search they find that only MarkLogic is "enterprise" capable. You can bet that the MongoDB team disagree (and indeed they do…).
On the consistency side, Gary talked about “ObamaCare”, aka HealthCare.gov, which MarkLogic were involved in. First came some performance figures on how they were handling 50,000 transactions/sec with 4-5ms response time for 150,000 concurrent users. This project suffered from a lot of technical problems, which really came down to running the system on a fragile infrastructure with weaknesses in network, servers and storage. Gary said that the government technologists were expecting data consistency problems when things like the network went down, but the MarkLogic database is ACID and all that was needed was to restart the servers once the infrastructure was ready. Gary also mentioned that he spent 14 years working at Oracle (as a lot of the MarkLogic folks seem to have) and that it was not really until Oracle 7 that they could say they offered data consistency.
On security, again there was more criticism of other NoSQL databases for offering access to either all of the data or none of it. The analogy used was one of going to an ATM and being offered access to everyone’s money and having to trust each client to only take their own. Continuing the NoSQL criticism, Gary said that he did not like the premise put around that “NoSQL is defined by Open Source”, his argument being that MarkLogic generates more revenue than all the other NoSQL databases on the market. Gary also mentioned a client who hosted a “lake of data” in Hadoop and who said that Hadoop was a great distributed file system but still needs a database to go with it.
Gary then talked about some of the features of MarkLogic 7, their current release. In particular, MarkLogic 7 offers scale-out elasticity with full ACID support (apparently achieving one is supposed to make the other impossible), high performance and a flexible schema-less architecture. Gary implied that the marketing emphasis had changed recently from the “big data” pitch of a few years back to covering both unstructured and structured data within one platform, so dealing with heterogeneous data is a core capability of MarkLogic. Other features mentioned were support for XML and JSON, access through a REST API, usage of MarkLogic as a semantic database (a triple store) and support for the semantic query language SPARQL. Gary mentioned that semantic technology was a big area of growth for them. He also mentioned support for tiered storage on HDFS.
The conversation then moved on to what’s next with version 8 of MarkLogic. The main theme for the next release is “Ease of Use”, with the following features:
• MarkLogic Developer – freely downloadable version
• MarkLogic Essential Enterprise – try it for 99c/hour on AWS
• MarkLogic Global Enterprise – 33% less (decided to spend less time on the sales cycle)
• Training for free – all classes sold out – instructor led online
Joe Pasqua, SVP of Product Strategy, then took over from Gary for a more technical introduction to the MarkLogic platform. He started by saying that MarkLogic is a schema-less database with a hierarchical data model that is very document-centric, and can be used for both structured and unstructured data. Data is stored in compressed trees within the system. Joe then explained how the system is indexed, starting with the “Universal Index”, which, as in most good search engines, lists where to find the following kinds of data:
• Stemmed words and phrasing
• Structure (this is indexed too as new documents come in)
• Words and phrases in the context of structure
• Security Permissions
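To get my own head around what such an index might look like, here is a toy sketch in Python. It is purely my illustration and not how MarkLogic implements it: the names `ToyUniversalIndex` and `crude_stem` are made up, the "stemmer" is a crude suffix-stripper, documents are flat dictionaries of field to text, and permissions are reduced to a list of role names per document.

```python
from collections import defaultdict


def crude_stem(word):
    """Very rough stand-in for a real stemmer (an assumption for illustration)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class ToyUniversalIndex:
    """One postings map covering words, structure, words-in-structure and roles."""

    def __init__(self):
        self.postings = defaultdict(set)  # index key -> set of document ids

    def add(self, doc_id, doc, roles):
        for field, text in doc.items():
            # Structure is indexed as documents arrive.
            self.postings[("element", field)].add(doc_id)
            for word in text.lower().split():
                stem = crude_stem(word)
                self.postings[("word", stem)].add(doc_id)
                # The same word, in the context of the field it appeared in.
                self.postings[("word-in-element", field, stem)].add(doc_id)
        for role in roles:
            self.postings[("role", role)].add(doc_id)

    def search(self, word, role, field=None):
        """Ids of documents containing the word that the given role may read."""
        stem = crude_stem(word.lower())
        key = ("word", stem) if field is None else ("word-in-element", field, stem)
        allowed = self.postings.get(("role", role), set())
        return sorted(self.postings.get(key, set()) & allowed)


index = ToyUniversalIndex()
index.add("d1", {"title": "Trading systems", "body": "Trades settled late"}, ["analyst"])
index.add("d2", {"title": "Settlement report", "body": "No trades today"}, ["analyst", "auditor"])
```

With this in place, `index.search("trading", "analyst")` resolves the word, the structure and the permission check as set intersections over one postings map, which is roughly the point of indexing security alongside content.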
Joe also mentioned that a “range index” is used to speed up comparisons, apparently in a similar way to a column store. Geospatial indices are like 2D range indices for how near things are to a point. The system also supports semantic indices, indexing on triples of subject-predicate-object.
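As I understand the range index idea, values for a field are kept in sorted order so that a comparison becomes a binary search rather than a scan over documents. A minimal sketch in Python, assuming a single numeric field and a hypothetical class name `ToyRangeIndex` (none of this is MarkLogic's actual API):

```python
import bisect


class ToyRangeIndex:
    """Values of one field kept in sorted order, with doc ids held in parallel."""

    def __init__(self):
        self._values = []   # sorted field values
        self._doc_ids = []  # doc id stored at the same position as its value

    def add(self, value, doc_id):
        position = bisect.bisect_right(self._values, value)
        self._values.insert(position, value)
        self._doc_ids.insert(position, doc_id)

    def between(self, low, high):
        """Doc ids whose value satisfies low <= value <= high."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return self._doc_ids[start:end]


prices = ToyRangeIndex()
prices.add(101.5, "t1")
prices.add(99.0, "t2")
prices.add(250.0, "t3")
prices.add(101.5, "t4")
```

A call such as `prices.between(100, 200)` then touches only the slice of the sorted list that falls inside the range, which is why it resembles reading a column in a column store.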
He showed how the system has failover replication within a database cluster for high availability but also full replication for disaster recovery purposes. There were continual side references to Oracle as a “legacy database”.
On database consistency and the ACID capability, Joe talked about MVCC (Multi Version Concurrency Control). Each “document” record in MarkLogic seems to have a start and end time for how current it is, and these values are used when updating data to avoid any reduction in read availability. When a document is updated, a copy of it is taken but kept hidden until ready. The existing document remains available until the update is ready, and then the “end time” is marked on the old record and the “start time” is marked on the new record. So effectively the system is always appending serially rather than seeking on disk, and the start and end time for the record enables bitemporal functionality to be implemented. Whilst the new record is being created it is already being indexed, so there is zero latency searching once the new document is live.
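The update sequence described above can be sketched in a few lines of Python. This is only my reading of the talk, with several assumptions: a hypothetical `ToyMVCCStore` class, a simple counter standing in for the system clock, an in-memory list standing in for the append-only storage, and no indexing or concurrency handling at all.

```python
import itertools


class ToyMVCCStore:
    """Append-only store of document versions with start/end timestamps."""

    def __init__(self):
        self._clock = itertools.count(1)  # hypothetical logical clock
        self._versions = []               # versions are only ever appended
        self.latest = 0                   # timestamp of the last commit

    def stage(self, uri, content):
        """Append a new version that stays hidden until it is committed."""
        version = {"uri": uri, "content": content, "start": None, "end": None}
        self._versions.append(version)
        return version

    def commit(self, version):
        """Mark the end time on the old version and the start time on the new one."""
        ts = next(self._clock)
        for old in self._versions:
            if (old is not version and old["uri"] == version["uri"]
                    and old["start"] is not None and old["end"] is None):
                old["end"] = ts
        version["start"] = ts
        self.latest = ts
        return ts

    def read(self, uri, as_of=None):
        """Return the content visible at a timestamp (default: the latest commit)."""
        ts = self.latest if as_of is None else as_of
        for v in self._versions:
            visible = (v["start"] is not None and v["start"] <= ts
                       and (v["end"] is None or ts < v["end"]))
            if v["uri"] == uri and visible:
                return v["content"]
        return None


store = ToyMVCCStore()
t1 = store.commit(store.stage("/trade/1", "qty=100"))
pending = store.stage("/trade/1", "qty=150")
seen_during_update = store.read("/trade/1")  # readers still see the old version
t2 = store.commit(pending)
```

Because old versions are never overwritten, `store.read("/trade/1", as_of=t1)` can still return the earlier content after the update, which is the hook for the "as of" style of query that the start and end times make possible.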
One of the index types mentioned by Joe was a “Reverse Index” where queries are indexed and as a new document comes in it is passed over these queries (sounds like the same story from the complex event processing folks) and can trigger alerts based on what documents fit each query.
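A toy version of this in Python, to show the inversion of the usual roles: the queries are what get indexed, and each incoming document is what gets looked up. Everything here is an assumption made for illustration, including the class name `ToyReverseIndex`, the restriction to queries that are simple "all of these words" matches, and the rule that each query name is registered only once.

```python
from collections import defaultdict


class ToyReverseIndex:
    """Stored queries indexed by term; documents are matched as they arrive."""

    def __init__(self):
        self._required = {}               # query name -> number of distinct terms
        self._by_term = defaultdict(set)  # term -> names of queries using it

    def register(self, name, terms):
        # Assumes each query name is registered only once.
        unique = {term.lower() for term in terms}
        self._required[name] = len(unique)
        for term in unique:
            self._by_term[term].add(name)

    def match(self, text):
        """Names of stored queries whose terms all appear in the document text."""
        hits = defaultdict(int)
        for word in set(text.lower().split()):
            for name in self._by_term.get(word, ()):
                hits[name] += 1
        return sorted(name for name, count in hits.items()
                      if count == self._required[name])


alerts = ToyReverseIndex()
alerts.register("large-fx", ["fx", "large"])
alerts.register("any-swap", ["swap"])
```

Calling `alerts.match(...)` on each new document only visits the queries that share at least one word with it, so the cost depends on the document rather than on the total number of stored queries.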
In summary, the event was a good one and MarkLogic seems interesting technology. There seems to be a variety of folks using it in financial markets, some for the post-trade analysis example (a bit like RainStor I think though, as an archive) and others using it more in the reference data space. Not sure how much MarkLogic is real-time capable, as there seems to be a lot of emphasis on post-trade. It also brought home to me the importance of search and database together, which seems to be a big strength of their technology.