Financial Markets Industry
Posts categorized "Database Technology"
Quick plug for an interview I did recently with Paul Rowady on the Tabb Forum; you can get access to the video here and a brief summary of what we talked about is below. As ever, Paul has his humorous angle on things and this time my green socks got the "Oompa Loompa" treatment (unfortunately you have to watch to the end to catch that one!). Last time it was my likeness to the lead singer of an Australian band. And for the record, we did also have a good conversation on data management and BI/visualization.
"As firms increasingly apply analytics to massive volumes of raw data, the amount of derived data is growing exponentially, and the need to apply strict governance to this derived data is more important than ever. To satisfy regulatory demands, the full data trail – including models and calculations – needs to be auditable, remarks Brian Sentance, CEO, Xenomorph. Unfortunately, there often is a disconnect between the validation of the raw data and the governance of the middle tier of derived data or analytics, he notes. Sentance and TABB Group’s Paul Rowady, principal and director of data and analytics research, examine the breakdown of data governance best practices, the risks involved, and the role of visualization tools in identifying data quality and data management shortfalls."
Posted by Brian Sentance | 10 March 2015 | 3:16 pm
The A-Team put on another good event at DMS New York yesterday. Lots of good stuff was talked about, and here are a few takeaways that I remember, after a photo of Ludwig D'Angelo of JPMorgan:
- Data Utilities - One of the presenters said that "Data Utility" was a really overused term second only to "Big Data". My comment would be that a lot of the managed services folks seem to want to talk about "Data Utilities" - seeming to prefer that term rather than what they are? Maybe because they perceive it as better marketing and/or maybe because they hope to be anointed/appointed (how I don't know) as an industry "Data Utility". Anyway for me they fail to address the issue of client-specific data and its management very well, much to the detriment of their argument imho - although SmartStream did say that client data can be mixed up into the data services they offer.
- Andrew Gets Literaturally Physical - Andrew Delaney of the A-Team expressed a preference for "physical" books when talking about why the A-Team also prints the Regulatory Data Handbook2 as well as making it available online. I have to agree that holding a book still beats my Kindle experience but maybe I am just getting old. Andrew should check out this YouTube video on how the book was first introduced...
- FIBO - The Financial Instrument Business Ontology (FIBO) was discussed in the context of trying to establish industry standards for data. As ever the usage of words like "Ontology" I suspect leaves a lot of business folks looking for the nearest double shot of espresso but that aside, it seems like the EDM Council are making some progress on developing this standard. Main point from the event was that industry adoption is key. I found some of the comments during the day a bit schizophrenic, in that some said that the regulators should not mandate standards (i.e. leave it to industry adoption and principles) but then in the next breath discussed the benefits (or otherwise) of the LEI (ok, not mandated but specific and coming from the regulators). Certainly the industry needs "help" (is that a strong enough word?) to get standards in place.
- Data Quality - Lots on data quality with assessing the business value of data quality initiatives being a key point. On the same subject, Predrag of element-22 announced that the EDM Council will soon be announcing adoption of the Data Quality Index, which could be used to correlate data quality with operational KPIs for the business.
- Regulation (doh!) - It wouldn't be a data management event without lots of discussion on regulation - a key point being that even those regulations that are not directly/explicitly about data still imply that data management is key (take CVA calcs for example) - and on a related note it was suggested that BCBS239 should be considered as a more general data management template for any business objective.
- Entity Hierarchies/LEI - Ludwig D'Angelo of JPMorgan gave a great talk and said that vendors were missing a massive opportunity in delivering good hierarchy datasets to clients, and that the effort expended on this at firms was enormous. Ludwig said that the lack of hierarchies in the Legal Entity Identifier (LEI) is a gap that the private sector could and should fill. Ludwig also seemed initially to be thrown when one of the audience suggested that there were multiple "golden copies" of hierarchies needed, since definitions of ownership can differ depending on which department you are in (old battle of risk and finance departments again). Good discussion later of how regulation was driving all systems to be much more entity-centric rather than portfolio-centric, emphasising the importance of getting entity hierarchies right.
- DCAM - John Bottega did a great presentation on the Data Management Capability Assessment Model (DCAM). John asked Predrag of element-22 to speak about DCAM and he said that unlike previous models (DMM) this framework would not only assess where you are in data management but would also show you where you need to go. DCAM covers data management strategy / operations / quality / business case / data architecture / tech architecture / governance / program. From what I could see it looked like a great framework - it appeared like common sense and obvious but that is in itself difficult to achieve so good effort I think. Element-22 will offer an online service around DCAM that will also allow anonymous benchmarking of data management capabilities as more institutions get involved (update: the service is called pellustro).
- BCBS239 - Big thanks to John M. Fleming of BNY Mellon and Srikant Ganesan of Risk Focus for taking part in the panel with me. Less focus on spreadsheet use and abuse on this panel unlike the London Panel from last month. John had some very practical ideas such as the use of Wikis to publish/gather data dictionary information, and the suggestion that with a large legacy infrastructure you are better off documenting differences in definitions across systems rather than trying to change the world from day one. Echoing some of the points from DMS London, it was thought that making the use of internal data standards part of project sign-off was very pragmatic data governance, but also that some systems should be marked/assessed as obsolete/declining and hence blocked from any additional usage in new project work. Bit of a plug for some of our recent work on data validation and exception management, but the panel said that BCBS239 needs to encompass audit/lineage on calculations/derived data/rules in addition to just the raw data.
Posted by Brian Sentance | 5 November 2014 | 11:42 pm
A great afternoon event put on by TabbFORUM in New York yesterday with a number of panels and one on one interviews (see agenda). You can see some of what went on at the event via the hashtag #TabbTech or via the @XenomorphNews feed.
"Death of Legacy" Panel Discussion
Posted by Brian Sentance | 16 October 2014 | 9:52 pm
A-Team’s DMS Data Management Awards close on the 26th of September so if you haven't already, please vote for Xenomorph!
Xenomorph on the Cloud - First of a few lookbacks at what we have been doing over the past year - firstly with a short animation about one of our major initiatives this year, cloud provision of data management and a new venture into cloud-based data publishing with the TimeScape MarketPlace.
So it would be fantastic if you could support Xenomorph by voting here.
Posted by Brian Sentance | 11 September 2014 | 7:21 pm
Bit late in posting this up, but given I did something about RainStor I thought I should write up my attendance at a MarkLogic event day in downtown Manhattan from several weeks back - their NoSQL database is used to serve up content on the BBC web site if you wanted some context. They are unusual for the NoSQL “movement” in that they are a proprietary vendor in a space that is dominated by open source databases and the companies that offer support for them. The database they most seem to compete with in the NoSQL space seems to be MongoDB, where both have origins as “document databases” (managing millions of documents is one of the most popular uses for big data technology at the moment, though not so much publicized as more fashionable things like swallowing a twitter feed for sentiment analysis for example).
In order to cope with the workloads needing to be applied to data, MarkLogic argue that data has escaped from the data centre in terms of needing separate data warehouses and ETL processes aligned with each silo of the business. They put forward the marketing message that MarkLogic allows the data to come back into the data center, given it can be a single platform where all data lives and where all workloads are applied to it. As such it is easier to apply proper data governance if the data is in one place rather than distributed across different databases, systems and tools.
Apparently MarkLogic started out with the aims of offering enterprise search of corporate data content but has evolved much beyond just document management. Gary Bloom, their CEO, described the MarkLogic platform as the combination of:
• Search Engine
• Application Services
He said that the platform is not just the database but particularly search and database together, aligned with the aim of not just storing data and documents but with the aim of getting insights out of the data. Gary also mentioned the increasing importance of elastic compute and MarkLogic has been designed to offer this capability to spin up and down with usage, integrating with and using the latest in cloud, Hadoop and Intel processors.
Apparently one of the large European investment banks is trying to integrate all of their systems for post-trade analysis and regulatory reporting. The bank apparently tried doing this by adopting a standard relational data model but faced two problems in that 1) the relational databases were not standard and 2) that it was difficult to get to and manage an overarching relational schema. On the schema side of things, the main problem they were alluding to seemed to be one schema changing and having to propagate that through the whole architecture. The bank seems now to be having more success now that they have switched to MarkLogic for doing this post-trade analysis – from a later presentation seems like things like trades are taken directly from the Enterprise Service Bus so saving the data in the message as is (schema-less).
One thing that came up time and time again was their pitch that MarkLogic is “the only Enterprise NoSQL database” with high availability, transactional support (ACID) and security built in. Gary criticized other NoSQL databases for offering “eventual consistency” and said that they aspire to something better than that (to put it mildly). I thought it was interesting over a lunch chat that one of the MarkLogic guys said that "MongoDB does a lot of great pre-sales for MarkLogic" meaning I guess that MongoDB is the marketing "poster child" of NoSQL document databases so they get the early leads, but as the client widens the search they find that only MarkLogic is "enterprise" capable. You can bet that the MongoDB team disagree (and indeed they do...).
On the consistency side, Gary talked about “ObamaCare” aka HealthCare.gov that MarkLogic were involved in. First came some performance figures of how they were handling 50,000 transactions/sec with 4-5ms response time for 150,000 concurrent users. This project suffered from a lot of technical problems which really came down to problems of running the system based on a fragile infrastructure with weaknesses in network, servers and storage. Gary said that the government technologists were expecting data consistency problems when things like the network went down, but the MarkLogic database is ACID and all that was needed was to restart the servers once the infrastructure was ready. Gary also mentioned that he spent 14 years working at Oracle (as a lot of the MarkLogic folks seem to have) but it was not really until Oracle 7 that they could really say they offered data consistency.
On security, again there was more criticism of other NoSQL databases for offering access to either all of the data or none of it. The analogy used was one of going to an ATM and being offered access to everyone’s money and having to trust each client to only take their own. Continuing the NoSQL criticism, Gary said that he did not like the premise put around that “NoSQL is defined by Open Source” – his argument was that MarkLogic generates more revenue than all the other NoSQL databases on the market. Gary said that one client hosted a “lake of data” in Hadoop but said that, whilst Hadoop was a great distributed file system, it still needs a database to go with it.
Gary then talked about some of the features of MarkLogic 7, their current release. In particular that MarkLogic 7 offered scale out elasticity but with full ACID support (apparently achieving one is supposed to make the other impossible), high performance and a flexible schema-less architecture. Gary implied that the marketing emphasis had changed recently from the “big data” pitch of a few years back to include both unstructured and structured data within one platform, so dealing with heterogeneous data, which is a core capability of MarkLogic. Other features mentioned were support for XML, JSON and access through a REST API. Also mentioned were usage of MarkLogic as a semantic database (a triple store) and support for the semantic query language SPARQL. Gary mentioned that semantic technology was a big area of growth for them. He also mentioned support for tiered storage on HDFS.
The conversation then moved on to what’s next with version 8 of MarkLogic. The main thing is “Ease of Use” for the next release with the following features:
• MarkLogic Developer – freely downloadable version
• MarkLogic Essential Enterprise – try it for 99c/hour on AWS
• MarkLogic Global Enterprise – 33% less (decided to spend less time on the sales cycle)
• Training for free – all classes sold out – instructor led online
Joe Pasqua, SVP of Product Strategy, then took over from Gary for a more technical introduction to the MarkLogic platform. He started by saying that MarkLogic is a schema-less database with a hierarchical data model that is very document-centric, and can be used for both structured and unstructured data. Data is stored in compressed trees within the system. Joe then explained how the system is indexed, describing the “Universal Index” which lists where to find the following kinds of data as in most good search engines:
• Stemmed words and phrasing
• Structure (this is indexed too as new documents come in)
• Words and phrases in the context of structure
• Security Permissions
Joe also mentioned that a “range index” is used to speed up comparisons, apparently in a similar way to a column store. Geospatial indices are like 2D range indices for how near things are to a point. The system also supports semantic indices, indexing on triples of subject-predicate-object.
He showed how the system has failover replication within a database cluster for high availability but also full replication for disaster recovery purposes. There were continual side references to Oracle as a “legacy database”.
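To make the range index idea a little more concrete, here is a minimal sketch in Python. It is purely illustrative and not MarkLogic's implementation: the class name `RangeIndex`, its methods and the trade identifiers are all hypothetical. The point is simply that keeping values in sorted order lets a range comparison be answered with two binary searches rather than a scan of every document.

```python
import bisect


class RangeIndex:
    """Hypothetical sorted index over one numeric field of many documents."""

    def __init__(self):
        self._values = []   # field values, kept in ascending order
        self._doc_ids = []  # document ids, parallel to self._values

    def add(self, doc_id, value):
        # Find the insertion point that keeps the values sorted.
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._doc_ids.insert(pos, doc_id)

    def between(self, low, high):
        """Return ids of documents with low <= value <= high, in value order."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return self._doc_ids[start:end]


# Usage: index a (made-up) price field and ask a range question.
index = RangeIndex()
index.add("trade-1", 101.5)
index.add("trade-2", 99.0)
index.add("trade-3", 105.25)
in_range = index.between(100, 106)
print(in_range)
```

A geospatial index could be thought of as the same trick applied across two dimensions at once, although a real implementation would need a proper spatial structure rather than two sorted lists.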
On database consistency and the ACID capability Joe talked about MVCC (Multi Version Concurrency Control). Each “document” record in MarkLogic seems to have a start and end time for how current it is, and these values are used when updating data to avoid any reduction in read availability. When a document is updated a copy of it is taken but made hidden until ready – the existing document remains available until the update is ready, and then the document “end time” in the old record is marked and the “start time” marked on the new record. So effectively always doing append in serial form not seeking on disk, and the start and end time for the record enables bitemporal functionality to be implemented. Whilst the new record is being created it is already being indexed so there is zero latency searching once the new document is live.
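The start/end time mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not MarkLogic's code: `VersionedStore`, the logical clock and the `/trades/42` URI are hypothetical, and it models only one time axis (when the system recorded each version), so a genuinely bitemporal store would need a second axis for when the data was valid in the business sense.

```python
import itertools


class VersionedStore:
    """Hypothetical append-only store: updates add versions, never overwrite."""

    def __init__(self):
        self._clock = itertools.count(1)  # simple logical timestamps: 1, 2, 3...
        self._versions = {}               # uri -> list of [start, end, content]

    def write(self, uri, content):
        ts = next(self._clock)
        history = self._versions.setdefault(uri, [])
        if history:
            history[-1][1] = ts           # mark the "end time" on the old version
        history.append([ts, None, content])  # new version is open-ended
        return ts

    def read(self, uri, as_of=None):
        """Return the version live at `as_of`, or the current one if omitted."""
        for start, end, content in self._versions.get(uri, []):
            started = as_of is None or start <= as_of
            not_ended = end is None or (as_of is not None and as_of < end)
            if started and not_ended:
                return content
        return None


# Usage: the old version stays readable "as of" its own timestamp.
store = VersionedStore()
t1 = store.write("/trades/42", {"qty": 100})
t2 = store.write("/trades/42", {"qty": 150})
current = store.read("/trades/42")
original = store.read("/trades/42", as_of=t1)
print(current, original)
```

Because old versions are only ever closed off and never modified, readers are not blocked by writers, which is the read availability point Joe was making.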
One of the index types mentioned by Joe was a “Reverse Index” where queries are indexed and as a new document comes in it is passed over these queries (sounds like the same story from the complex event processing folks) and can trigger alerts based on what documents fit each query.
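As a rough sketch of the reverse index idea (again hypothetical and not MarkLogic's API), the Python below stores queries as sets of required terms, indexes those queries by term, and then passes each incoming document over them to see which alerts should fire. The class name, query names and the sample document text are all made up for illustration.

```python
from collections import defaultdict


class ReverseIndex:
    """Hypothetical index of stored queries, matched against incoming documents."""

    def __init__(self):
        self._required = {}               # query name -> set of required terms
        self._by_term = defaultdict(set)  # term -> names of queries needing it

    def register(self, name, terms):
        required = {term.lower() for term in terms}
        self._required[name] = required
        for term in required:
            self._by_term[term].add(name)

    def match(self, text):
        """Return names of stored queries whose terms all appear in the text."""
        words = set(text.lower().split())
        # Only queries sharing at least one term with the document are checked.
        candidates = set()
        for word in words:
            candidates |= self._by_term.get(word, set())
        return sorted(n for n in candidates if self._required[n] <= words)


# Usage: register alerts once, then run each new document past them.
alerts = ReverseIndex()
alerts.register("fx-breaks", ["fx", "break"])
alerts.register("lei-updates", ["lei"])
fired = alerts.match("Settlement break on FX forward")
print(fired)
```

The design choice worth noting is that the queries are the thing being indexed, so the cost of handling a new document depends on the queries that share terms with it rather than on the total number of stored queries.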
In summary, the event was a good one and MarkLogic seems interesting technology and there seems to be a variety of folks using it in financial markets with the post trade analysis example (bit like RainStor I think though, as an archive) and others using it more in the reference data space. Not sure how much MarkLogic is real-time capable – seems to be a lot of emphasis on post trade. Also brought home to me the importance of search and database together which seems to be a big strength of their technology.
Posted by Brian Sentance | 14 July 2014 | 11:40 am
We had over 60 folks along to our event at the Merchant Taylors' Hall last week in London. Thanks to all who attended, all who helped with the organization of the event and sorry to miss those of you that couldn't come along this time.
Some photos from the event are below starting with Brad Sevenko of Microsoft (Director, Capital Markets Technology Strategy) in the foreground with a few of the speakers doing some last minute adjustments at the front of the room before the guests arrived:
Rupesh Khendry of Microsoft (Head of World-Wide Capital Markets Solutions) started off the presentations at the event, introducing Microsoft's capital markets technology strategy to a packed audience:
After a presentation by Virginie O'Shea of Aite Group on Cloud adoption in capital markets, Antonio Zurlo (below) of Microsoft (Senior Program Manager) gave a quick introduction to the services available through the Microsoft Azure cloud and then moved on to more detail around Microsoft Power BI:
After Antonio, then yours truly (Brian Sentance, CEO, Xenomorph) gave a presentation on what we have been building with Microsoft over the past 18 months, the TimeScape MarketPlace. At this point in the presentation I was giving some introductory background on the challenges of regulatory compliance and the pros and cons between point solutions and having a more general data framework in place:
The event ended with some networking and further discussions. Big thanks to those who came forward to speak with me afterwards, great to get some early feedback.
Posted by Brian Sentance | 1 July 2014 | 11:54 am
One day to go until our TimeScape MarketPlace breakfast briefing "Financial Markets Data and Analytics. Everywhere You Need Them" at Merchant Taylors' Hall tomorrow, Wednesday June 25th. With over ninety people registered so far it should be a great event, but if you can make it please register and come along, it would be great to see you there.
Posted by Brian Sentance | 24 June 2014 | 11:25 am
Pleased to announce that our TimeScape MarketPlace event "Financial Markets Data and Analytics. Everywhere You Need Them" is coming to London, at Merchant Taylors' Hall on Wednesday June 25th.
Come and join Xenomorph, Aite Group and Microsoft for breakfast and hear Virginie O'Shea of the analyst firm Aite Group offering some great insights from financial institutions into their adoption of cloud technology, applying it to address risk management, data management and regulatory reporting challenges.
Microsoft will be showing how their new Power BI can radically change and accelerate the integration of data for business and IT staff alike, regardless of what kind of data it is, what format it is stored in or where it is located.
And Xenomorph will be demonstrating the TimeScape MarketPlace, our new cloud-based data mashup service for publishing and consuming financial markets data and analytics.
In the meantime, please take a look at the event and register if you can come along, it would be great to see you there.
Posted by Brian Sentance | 11 June 2014 | 8:50 pm
Quick thank you to the clients and partners who took some time out of their working day to attend our breakfast briefing, "Financial Markets Data and Analytics. Everywhere You Need Them." at Microsoft's Times Square offices last Friday morning. Not particularly great weather here in Manhattan so it was great to see around 60 folks turn up...
Posted by Brian Sentance | 14 May 2014 | 9:49 pm
The New York Chapter of PRMIA hosted "Regulatory, Compliance, and Risk Data Technology Challenges" at Credit Suisse's offices in New York, last Thursday 10th April. Abraham Thomas introduced the panelists, and Don Wesnofske started off by setting the scene for the evening's event.
Don outlined how in reaction to the 2008 Crisis the regulators now require data retention for up to 10 years or more. Don cited one particular example where data must be reconstructed within 24 to 48 hours for any date up to 7 years back, and said that this kind of "forensic" investigation capability was an important consideration for many financial institutions. He took us through a good presentation slide of his view on data management/risk architecture, and outlined how operational risk is comprised of people, process, technology and events. Don ended his presentation by taking us through Wikipedia's definition of "Big Data", and in particular talked about how data has a life cycle going through:
Don then handed over to Luigi Mercone, a Director of Engineering Strategy & Architecture at Credit Suisse. Luigi started by saying that to the business at CS, he is technical support which involves asking "What is on fire today? And what's going to be on fire tomorrow?" Luigi described how some time back CS had a regulatory enquiry around their equities business which required them to reconstruct data from 2 years back.
The project to do this took around 4-5 months of database administrators' time to reconstruct the world as at that point in time (I guess because tape storage was being used, and this needed restoring to disk/database). This was for an equity order management system that had doubled in size every year for the past 17 years, and at that point CS was only retaining data going back 2 years. Luigi said that it was then thought that new regulations requiring the ability to produce forensic evidence at any point in time would potentially swamp CS's resources unless this was addressed head on and strategically.
Luigi described the original architecture that they were using being based on an in-memory database for intraday workloads, then standard Sybase (probably ASE I guess) and then Sybase IQ for longer term archiving, taking advantage of the column-store capabilities of Sybase IQ and the resulting data compression possible. He added that the data storage requirements of the system had grown from 150TB to 1.2PB in 4 years.
Luigi then offered a comparison of this original architecture with what he found by implementing RainStor. In the original architecture the Sybase IQ database compressed data down into 160TB, whereas this was improved by a further factor of around 10 down to 14TB using RainStor. He said that RainStor was self-service providing a standard SQL interface, eliminated the need for tape storage, reduced the system "footprint" by 90% at CS, was 1/5 of the cost and the performance was good. (I guess here I would like to caveat that I know nothing of the original architecture other than the summary Luigi provided, and as such it is hard to judge whether the original architecture was optimal for the data growth experienced, and hence whether this was overall an objective comparison of Sybase IQ's capabilities with RainStor.) Luigi closed by saying that whilst RainStor was a great archive database, its origins were in in-memory databases and he would encourage RainStor to re-enter that market too, given his experience so far.
John Bantleman, CEO of RainStor, took over and described how RainStor had been designed specifically for the needs of data archiving (I guess talking more about what it does now rather than its origins outlined by Luigi above). He said that RainStor offers a 20-40x storage footprint reduction over traditional database technology and operates efficiently even at the petabyte (PB) scale, based around RainStor proprietary database technology making use of columnar storage and being capable of storing data in both relational-style tabular format and also in more "document" style using XML and JSON formats using Key-Value access. John mentioned that not only could RainStor retrieve data as at a point in time, but it could also retrieve the schema being used at that point in time for a more complete view of the state of the world at that point. This echoes a couple of past articles that I have penned, one for IRD and one for Wilmott Magazine on bitemporal regulatory requirements.
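As a quick sanity check on the figures Luigi quoted, the arithmetic can be run in a few lines of Python. The only inputs are the numbers reported at the event; the one assumption of mine is treating 1.2PB as 1,200TB, and the variable names are of course my own.

```python
import math

# Compression figures quoted for the archive tier.
sybase_iq_tb = 160
rainstor_tb = 14
compression_gain = sybase_iq_tb / rainstor_tb        # about 11.4x
footprint_reduction = 1 - rainstor_tb / sybase_iq_tb  # about 91%

# Storage growth quoted for the system: 150TB to 1.2PB over 4 years.
start_tb, end_tb, years = 150, 1200, 4                # assumes 1PB = 1,000TB
annual_growth = (end_tb / start_tb) ** (1 / years)    # about 1.68x per year
doubling_years = math.log(2) / math.log(annual_growth)  # about 1.33 years

print(f"compression gain:    {compression_gain:.1f}x")
print(f"footprint reduction: {footprint_reduction:.0%}")
print(f"annual growth:       {annual_growth:.2f}x")
print(f"doubling time:       {doubling_years:.2f} years")
```

So the "factor of 10" and "90%" figures are consistent with each other, and storage growing eightfold in four years is somewhat slower than the doubling every year quoted for the order management system itself.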
John said that regulation was driving the need for data archiving capabilities, with 1400 regulations added since 2008 (not sure of source, but believable) and the comment from a Chief Data Officer (CDO) at one financial markets client that if a project wasn't driven by regulatory compliance then the project isn't going to get done (certainly sounds like regulatory overload). John's opening remarks were really around how regulatory cost, complexity and compliance were driving forces behind the growth of RainStor in financial services technology, and whilst regulation is the driver, firms should look at archiving of data as an opportunity too, in order to create value from corporate memory, and to be proactive in addressing future reporting and analysis needs.
John illustrated the regulatory need for data archiving through the Consolidated Audit Trail (CAT) regulation, where data retention over 7 years will generate 100PB of data. He also mentioned SEC Rule 17a-4 for broker dealers as another example of "data retention" regulation, with particular reference to storage of records in non-rewriteable, non-erasable format. John termed this WORM storage, meaning Write Once, Read Many. John seemed to imply that both the software (RainStor) and the hardware it runs on (e.g. EMC or Teradata etc) need to be WORM compliant. One of the audience members asked John about BCBS 239, to which John said that he didn't know that particular regulation (fair enough that John didn't know in my opinion, RainStor's tech is general about "data" and is applicable across many industries, whereas BCBS 239 is obviously about banks specifically and is more about data aggregation and reporting than data retention/archiving to my understanding, and this seems to be confirmed with a quick doc scan for "archive" or "retention".)
To finish off the main part of the event (before the drinks and food began) there was a panel discussion. Luigi said that it was best to "prepare for all time, not just specifics" with respect to data retention and that there were dangers in rolling up data (effectively aggregating and losing granularity to reduce storage needs). John added that his definition of "Big Data" was "All information, for ever". Luigi added that implementing RainStor had allowed CS to spend more time on interesting questions rather than on database restoration. John proposed that version 1 of Big Data involved the retention of web data, and as such losing a data point here and there didn't matter. Version 2 of Big Data is concerned more with enterprise data where all data has value and needs to be retained i.e. lots of high value data. He added that this was an opportunity for risk and compliance to become an asset.
Abraham (second from left), Don (center) and John (second from right)
Overall it was a good event which I found very interesting (but I have to admit to a certain geeky interest in this kind of tech). The event would have benefitted from say another competitive or complementary technology vendor involved maybe, plus maybe an academic to give a different slant on data retention and on what the regulators hope to gain from this kind of mandated data retention. Not that the regulators have been that good at managing data themselves recently.
Networking afterwards courtesy of Credit Suisse and RainStor
Posted by Brian Sentance | 17 April 2014 | 3:05 pm
You can find A-Team's view on "Building a Flexible Enterprise Architecture" here. Some additional notes/thoughts:
- I thought Neil van Lint of GoldenSource's comment about "putting lipstick on a pig" with reference to legacy architectures was pretty funny and apt.
- The old Irish joke about asking for directions and receiving the response "Well I wouldn't start from here" is also amusing but too true with our industry and most large organisations.
- "Schema on read, not on write" is getting my award for phrase of the month from NoSQL proponents (quote Amir from Mark Logic).
- Agree that ETL is problematic/a big resource drain but unless starting from a greenfield site it is currently unavoidable.
- I like the idea of FIBO (and decoupling data meaning from data structure) but still left unsure what it actually (practically) covers so far and how much it is used, despite the references to it by Peter of Nordea. I guess it is all a matter of semantics.
- I knew little of TOGAF mentioned by Rupert but maybe that is because I am a techie no more (if I ever was).
- Rupert came back to his "where are we?" and data map questions and asked the audience how many of them had a good handle on where data was used in what systems - unsurprisingly not many, with a Morgan Stanley guy saying that their monitoring systems were linked to the operational systems for a full inventory of data.
- I agree that the regulators need to push standards directly on the industry - Amir ended the panel suggesting the regulators need to say things like "Thou shalt use FIBO".
Posted by Brian Sentance | 20 March 2014 | 7:12 pm
Rupert Brown of UBS did the keynote at this Spring's A-Team Data Management Summit (DMS). Rupert's talk was about understanding what data there is within a financial institution and understanding where it comes from and where it goes to. Rupert started by asking the question "Where are we?" illustrating it with a map of systems and data flows for an institution - to my recollection I think he said it stretched to 7 metres in length and did not look that accessible or easy to understand. He asked what dimensions it should have as a "map" of data, wondering what dimensions are analogous to latitude, longitude, altitude and orientation? Maybe things like function, product, process, accounting or legal entity as potential candidates.
Briefly Rupert took a bit of a detour into his love of trains with a little history on the London Underground Map. He started by mentioning the role of George Dow who illustrated maps for train routes in a single line, showing just dependency and lineage (what stations are next etc) and ignoring geography and distance. This was built upon by another gentleman, Harry Beck, who took these ideas a stage further with the early ancestors of the current Underground map, showing both routes but interweaving all the lines together into a map that additionally was topologically sufficient (indicating broad direction - NESW).
Continuing on with this analogy of Underground to maps of data and data management, Rupert then mentioned Frank Pick who created the Underground brand. Through creating such an identifiable brand, effectively Frank got people to believe in and refer to the map, and Rupert suggested that people in data governance could benefit from taking a similar approach to data management. I guess it is easy to take maps we see every day for granted and particularly some of the thought that went into them, maybe ideas that initially were not intuitive (or at least not directly representative of physical reality) but that greatly improved understanding and comprehension. Put another way, representing reality one for one does not necessarily get you to something that is easy to understand (sounds like a "model" to me).
Rupert then described some of his efforts using OpenStreetMap to map data, making use of the concepts of nodes, ways and areas. Apparently he had implemented this using a NoSQL database (MarkLogic) for performance reasons (doesn't sound like a really "big data" sized problem with several hundred apps and several thousand data transports but nevertheless he said it was needed, maybe as a result of its graph-like nature?). He said that the data was crowdsourced to refine the data, with a wiki for annotations. He said he was interested in the bitemporality of data, i.e. how the map changes over time. He advised that every application should also be thought of as its own "databus" in addition to any de facto databuses that might be present in the architecture.
In summary the talk was interesting, but it was demonstrable from what Rupert showed that we have a long way to go in representing clearly and easily where data came from, where it goes to and how it is used. I think Rupert acknowledges this and has some academic partnerships trying to develop better ways of representing and visualizing data. Certainly data lineage and audit trail on everything is a hot topic for many of our clients currently, and something that deserves more attention. You can download Rupert's presentation here and the A-Team's take on his talk can be found here.
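To make the nodes/ways/areas idea a little more concrete, here is a minimal Python sketch of how such a data map might be modelled. It is purely my own illustration and not Rupert's implementation: the class names, the example applications and the dates are all hypothetical, and it tracks only a single "valid from/to" time dimension rather than full bitemporality.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Node:
    """An application or data store (the analogue of a map 'node')."""
    name: str


@dataclass(frozen=True)
class Way:
    """A data transport between two applications (the analogue of a 'way')."""
    source: str
    target: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the transport is still live


@dataclass
class DataMap:
    nodes: Dict[str, Node] = field(default_factory=dict)
    ways: List[Way] = field(default_factory=list)
    # An 'area' groups nodes, e.g. by function or legal entity (hypothetical).
    areas: Dict[str, Set[str]] = field(default_factory=dict)

    def add_way(self, way: Way) -> None:
        # Register both endpoints as nodes if they are not yet known.
        for name in (way.source, way.target):
            self.nodes.setdefault(name, Node(name))
        self.ways.append(way)

    def downstream(self, start: str, as_of: date) -> Set[str]:
        """Applications reachable from `start` via transports live on `as_of`."""
        live = [
            w for w in self.ways
            if w.valid_from <= as_of and (w.valid_to is None or as_of < w.valid_to)
        ]
        seen: Set[str] = set()
        stack = [start]
        while stack:
            current = stack.pop()
            for w in live:
                if w.source == current and w.target not in seen:
                    seen.add(w.target)
                    stack.append(w.target)
        return seen


# Hypothetical example data: three transports, one of them since retired.
data_map = DataMap()
data_map.add_way(Way("trading", "risk", date(2012, 1, 1)))
data_map.add_way(Way("risk", "reporting", date(2013, 1, 1)))
data_map.add_way(Way("trading", "legacy_ledger", date(2010, 1, 1), date(2013, 6, 30)))
data_map.areas["front_office"] = {"trading"}

flows_2012 = data_map.downstream("trading", date(2012, 6, 1))
flows_2014 = data_map.downstream("trading", date(2014, 1, 1))
```

Asking the same "where does this data go?" question as of two different dates gives two different answers, which is the "how the map changes over time" point in miniature.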
Posted by Brian Sentance | 18 March 2014 | 11:12 am
Quick thank you to Don Syme of Microsoft Research for including a demonstration of F# connecting to TimeScape running on the Windows Azure cloud in the F# in Finance event this week in London. F# is a functional language that is developing a large following in finance due to its applicability to mathematical problems, the ease of development with F# and its performance. You can find some testimonials on the language here.
Don has implemented a proof-of-concept F# type provider for TimeScape. If that doesn't mean much to you, then a practical example below will help, showing how the financial instrument data in TimeScape is exposed at runtime into the F# programming environment. I guess the key point is just how easy it looks to code with data, since effectively you get guided through what is (and is not!) available as you are coding (sorry if I sound impressed, I spent a reasonable amount of time writing mathematical C code using vi in the mid 90's - so any young uber-geeks reading this, please make allowances as I am getting old(er)...). Example steps are shown below:
Referencing the Xenomorph TimeScape type provider and creating a data context:
Connecting to a TimeScape database:
Looking at categories (classes) of financial instrument available:
Choosing an item (instrument) in a category by name:
Looking at the properties associated with an item:
The intellisense-like behaviour above is similar to what TimeScape's Query Explorer offers and it is great to see this implemented in an external run-time programming language such as F#. Don additionally made the point that each instrument only displays the data it individually has available, making it easy to understand what data you have to work with. This functionality is based on F#'s ability to make each item uniquely nameable, and optionally to assign each item (instrument) a unique type, where all the category properties (defined at the category schema level) that are not available for the item are hidden.
The next event for F# in Finance will take place in New York on Wednesday 11th of December 2013, so hope to see you there. We are currently working on a beta program for this functionality to be available early in the New Year so please get in touch if this is of interest via email@example.com.
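For anyone who does not use F#, the "one type per item, with unavailable properties hidden" idea can be roughly approximated in Python. To be clear, this is only a sketch of the concept: it is not the TimeScape type provider or its API, it works at run time rather than at edit time as a type provider does, and the function, category and property names are all made up for illustration.

```python
def make_item_type(category, item_name, schema, data):
    """Build a class exposing only the category properties this item has data for.

    `schema` is the list of properties defined at category level and `data`
    is the subset actually populated for this item (both hypothetical).
    """
    available = {prop: data[prop] for prop in schema if prop in data}

    def make_getter(value):
        # Bind each value in its own closure so the properties stay independent.
        return property(lambda self: value)

    attrs = {prop: make_getter(value) for prop, value in available.items()}
    attrs["__slots__"] = ()  # no ad hoc attributes can be added to instances
    return type(f"{category}_{item_name}", (), attrs)


# Hypothetical category schema and one item with only partial data.
EQUITY_SCHEMA = ["close_price", "volume", "dividend_yield"]
item_data = {"close_price": 101.5, "volume": 12000}

ExampleEquity = make_item_type("Equity", "EXAMPLE", EQUITY_SCHEMA, item_data)
item = ExampleEquity()

# Only the populated properties are visible on the item.
visible = [name for name in dir(item) if not name.startswith("_")]
```

Here `visible` contains `close_price` and `volume` but not `dividend_yield`, since this particular item has no data for it, which is the behaviour Don was describing.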
Posted by Brian Sentance | 27 November 2013 | 6:00 am
Another good event from PRMIA at the Harmonie Club here in NYC last week, entitled Risk Data Aggregation and Risk Reporting - Progress and Challenges for Risk Management. Abraham Thomas of Citi and PRMIA introduced the evening, setting the scene by referring to the BCBS document Principles for effective risk data aggregation and risk reporting, with its 14 principles to be implemented by January 2016 for G-SIBs (Globally Systemically Important Banks) and December 2016 for D-SIBs (Domestically Systemically Important Banks).
The event was sponsored by SAP and they were represented by Dr Michael Adam on the panel, who gave a presentation around risk data management and the problems of having data siloed across many different systems. Maybe unsurprisingly Michael's presentation had a distinct "in-memory" focus to it, with Michael emphasizing the data analysis speed that is now possible using technologies such as SAP's in-memory database offering "Hana".
Following the presentation, the panel discussion started with a debate involving Dilip Krishna of Deloitte and Stephanie Losi of the Federal Reserve Bank of New York. They discussed whether the BCBS document and compliance with it should become a project in itself or part of existing initiatives to comply with data intensive regulations such as CCAR and CVA etc. Stephanie is on the board of the BCBS committee for risk data aggregation and she said that the document should be a guide and not a check list. There seemed to be general agreement on the panel that data architectures should be put together not with a view to compliance with one specific regulation but more as a framework to deal with all regulation to come, a more generalized approach.
Dilip said that whilst technology and data integration are issues, people are the biggest issue in getting a solid data architecture in place. There was an audience question about how different departments need different views of risk and how were these to be reconciled/facilitated. Stephanie said that data security and control of who can see what is an issue, and Dilip agreed and added that enterprise risk views need to be seen by many which was a security issue to be resolved.
Don Wesnofske of PRMIA and Dell said that data quality was another key issue in risk. Dilip agreed and added that the front office need to be involved in this (data management projects are not just for the back office in isolation) and that data quality was one of a number of needs that compete for resources/budget at many banks at the moment. Coming back to his people theme, Dilip also said that data quality needed intuition to be carried out successfully.
An audience question from Dan Rodriguez (of PRMIA and Credit Suisse) asked whether regulation was granting an advantage to "Too Big To Fail" organisations in that only they have the resources to be able to cope with the ever-increasing demands of the regulators, to the detriment of the smaller financial institutions. The panel did not completely agree with Dan's premise, arguing that smaller organizations were more agile and did not have the legacy and complexity of the larger institutions, so there was probably a sweet spot between large and small from a regulatory compliance perspective (I guess it was interesting that the panel did not deny that regulation was at least affecting the size of financial institutions in some way...)
Again focussing on where resources should be deployed, the panel debated trade-offs such as those between accuracy and consistency. The Legal Entity Identifier (LEI) initiative was thought of as a great start in establishing standards for data aggregation, and the panel encouraged regulators to look at doing more. One audience question was around the different and inconsistent treatment of gross notional and trade accounts. Dilip said that yes this was an issue, but came back to Stephanie's point that what is needed is a single risk data platform that is flexible enough to be used across multiple business and compliance projects. Don said that he suggests four "views" on risk:
- Risk Taking
- Risk Management
- Risk Measurement
- Risk Regulation
Stephanie added that organisations should focus on the measures that are most appropriate to their business activity.
The next audience question asked whether the panel thought that the projects driven by regulation had a negative return. Dilip said that his experience was yes, they do have negative returns but this was simply a cost of being in business. Unsurprisingly maybe, Stephanie took a different view advocating the benefits side coming out of some of the regulatory projects that drove improvements in data management.
The final audience question was whether the panel thought that it was possible to reconcile all of the regulatory initiatives like Dodd-Frank, Basel III, EMIR etc with operational risk. Don took a data angle to this question, talking about the benefits of big data technologies applied across all relevant data sets, and that any data was now potentially valuable and could be retained. Dilip thought that the costs of data retention were continually going down as data volumes go up, but that there were costs in capturing the data needed for operational risk and other applications. Dilip said that when compared globally across many industries, financial markets were way behind the data capabilities of many sectors, and that finance was more "Tiny Data" than "Big Data" and again he came back to the fact that people were getting in the way of better data management. Michael said that many banks and market data vendors are dealing with data in the 10's of TeraBytes range, whereas the amount of data in the world was around 8-900 PetaBytes (I thought we were already just over into ZettaBytes but what are a few hundred PetaBytes between friends...).
Abraham closed off the evening, firstly by asking the audience if they thought the 2016 deadline would be achieved by their organisation. Only 3 people out of around 50+ said yes. Not sure if this was simply people's reticence to put their hand up, but when Abraham asked, one key concern for many was that the target would change by then - my guess is that we are probably back into the territory of the banks not implementing a regulation because it is too vague, and the regulators not being too prescriptive because they want feedback too. So a big game of chicken results, with the banks weighing up the costs/fines of non-compliance against the costs of implementing something big that they can't be sure will be acceptable to the regulators. Abraham then asked the panel for closing remarks: Don said that data architecture was key; Stephanie suggested getting the strategic aims in place but implementing iteratively towards these aims; Dilip said that deciding your goal first was vital; and Michael advised building a roadmap for data in risk.
Posted by Brian Sentance | 4 November 2013 | 11:47 am
...Xenomorph!!! Thanks to all who voted for us in the recent A-Team Data Management Awards, it was great to win the award for Best Risk Data Management and Analytics Platform. Great that our strength in the Data Management for Risk field is being recognised, and big thanks again to clients, partners and staff who make it all possible!
Please also find below some posts for the various panel debates at the event:
- Data Architecture: Sticks or Carrots?
- What Will Drive Data Management?
- Big Data, Cloud, In-Memory
- The Chief Data Officer Challenge
- Managed Services and the Utility Model
Some photos, slides and videos from the event are now available on the A-Team site.
Posted by Brian Sentance | 9 October 2013 | 12:07 pm
Andrew Delaney introduced the final panel of the day, involving Steve Cheng of Rimes, Jonathan Clark of Tech Mahindra, Tom Dalglish of UBS and Martijn Groot of Euroclear. Main points:
- Andrew started by asking the panel for their definitions of managed data services and data utilities
- Martijn said that a managed data service was usually the lifting out of a data process from within a company to be run by somebody else whereas a data utility had many users.
- Tom put it another way saying that a managed service was run for you whereas a utility was run for them. Tom suggested that there were some concerns around data utilities for the industry in terms of knowing/being transparent about data vendor affinity and any data monopoly aspects.
- When asked why past attempts at data utilities had failed, Tom said that it must be frustrating to be right but at the wrong time, but that in addition to the timing being right just now (costs/regulations being drivers), the tech stack available is better and the appreciation of the importance of data usage is clearer.
- Steve added a great point on the tech stack, in that it now made mass customisation much easier.
- Jonathan made the point that past attempts at data utilities were built on product platforms used at clients, whereas the latest utilities were built on platforms specifically designed for use by a data utility.
- Looking at the cost savings of using a data utility, Martijn said that the industry spends around $16-20B on data, and that with his Euroclear data utility they can serve 2000 clients with a staff level that is less than any one client employs directly.
- Tom said that the savings from collapsing the data silos were primarily from more efficient/reduced usage of people and hardware to perform a specific function, and not data.
- Steve suggested that some utilities take an incremental approach to data services rather than taking all data as in the old utility model, again coming back to his earlier point of mass customisation.
- Tom mentioned it was a bit like cable TV, where you can subscribe to a set of services of your choice but where certain services cost more than others.
- Martijn said that there were too many vested interests to turn data costs around quickly. He said that data utilities could go a long way however.
- Tom concluded by saying that it was about content not feeds, licensing was important as was how to segregate data.
Good panel - additionally one final audience question/discussion was around data utilities providing LEI data, and it was argued that LEI without the hierarchy is just another set of data to map and manage.
Posted by Brian Sentance | 7 October 2013 | 12:28 pm
Andrew Delaney introduced the second panel of the day, with the long title of "The Industry Response: High Performance Technologies for Data Management - Big Data, Cloud, In-Memory, Meta Data & Big Meta Data". The panel included Rupert Brown of UBS, John Glendenning of Datastax, Stuart Grant of SAP and Pavlo Paska of Falconsoft. Andrew started the panel by asking what technology challenges the industry faced:
- Stuart said that risk data on-demand was a key challenge, that there was the related need to collapse the legacy silos of data.
- Pavlo backed up Stuart by suggesting that accuracy and consistency were needed for all live data.
- Rupert suggested that there has been a big focus on low latency and fast data, but raised a smile from the audience when he said that he was a bit frustrated by the "format fetishes" in the industry. He then brought the conversation back to some fundamentals from his viewpoint, talking about wholeness of data and namespaces/data dictionaries - Rupert said that naming data had been too stuck in the functional area and not considered more in isolation from the technology.
- John said that he thought there were too many technologies around at the moment, particularly in the area of Not Only SQL (NoSQL) databases. John seemed keen to push NoSQL, and in particular Apache Cassandra, as post relational databases. He put forward that these technologies, developed originally by the likes of Google and Yahoo, were the way forward and that in-memory databases from traditional database vendors were "papering over the cracks" of relational database weaknesses.
- Stuart countered John by saying that properly designed in-memory databases had their place but that some in-memory databases had indeed been designed to paper over the cracks and this was the wrong approach, sometimes exacerbating the problem.
- Responding to Andrew's questions around whether cloud usage was more accepted by the industry than it had been, Rupert said he thought it was although concerns remain over privacy and regulatory blockers to cloud usage, plus there was a real need for effective cloud data management. Rupert also asked the audience if we knew of any good release management tools for databases (controlling/managing schema versioning etc) because he and his group were yet to find one.
- Rupert expressed that Hadoop 2 was of more interest to him at UBS than Hadoop, and as a side note mentioned that map reduce was becoming more prevalent across NoSQL not just within the Hadoop domain. Maybe controversially, he said that UBS was using less data than it used to and as such it was not the "big data" organisation people might think it to be.
- As one example of the difficulties of dealing with silos, Stuart said that at one client it required the integration of data from 18 different systems to get an overall view of the risk exposure to one counterparty. Stuart advocated bringing the analytics closer to the data, enabling more than one job to be done on one system.
- Rupert thought that Goldman Sachs and Morgan Stanley seem to do what is the right thing for their firm, laying out a long-term vision for data management. He said that a rethink was needed at many organisations since fundamentally a bank is a data flow.
- Stuart picked up on this and said that there will be those organisations that view data as an asset and those that view data as an annoyance.
- Rupert mentioned that in his view accountants and lawyers are getting in the way of better data usage in the industry.
- Rupert added that data in Excel needed to be passed by reference and not passed by value. This "copy confluence" was wasting disk space and a source of operational problems for many organisations (a few past posts here and here on this topic).
- Moving on to describe some of the benefits of semantic data and triple stores, Rupert proposed that the statistical world needed to be added to the semantic world to produce "Analytical Semantics" (see past post relating to the idea of "analytics management").
Great panel, lots of great insight with particularly good contributions from Rupert Brown.
Posted by Brian Sentance | 7 October 2013 | 12:23 pm
Great day on Thursday at the A-Team Data Management Summit in London (personally not least because Xenomorph won the Best Risk Data Management/Analytics Platform Award but more of that later!). The event kicked off with a brief intro from Andrew Delaney of the A-Team talking through some of the drivers behind the current activity in data management, with Andrew saying that risk and regulation were to the fore. Andrew then introduced Colin Gibson, Head of Data Architecture, Markets Division at Royal Bank of Scotland.
Data Architecture - Sticks or Carrots? Colin began by looking at the definition of "data architecture" showing how the definition on Wikipedia (now obviously the definitive source of all knowledge...) was not particularly clear in his view. He suggested himself that data architecture is composed of two related frameworks:
- Orderly Arrangement of Parts
- Discipline
He said that the orderly arrangement of parts is focussed on business needs and aims, covering how data is sourced, stored, referenced, accessed, moved and managed. On the discipline side, he said that this covered topics such as rules, governance, guides, best practice, modelling and tools.
Colin then put some numbers around the benefits of data management, saying that every dollar spent on centralising data saves 20 dollars, and mentioning a resulting 80% reduction in operational costs. Related to this he said that every dollar spent on not replicating data saved a dollar on reconciliation tools and a further dollar on the use of reconciliation tools (not sure how the two overlap but these are obviously some of the "carrots" from the title of the talk).
Despite these incentives, Colin added that getting people to actually use centralised reference data remains a big problem in most organisations. He said he thought that people find it too difficult to understand and consume what is there, and faced with a choice they do their own thing as an easier alternative. Colin then talked about a program within RBS called "GoldRush" whereby there is a standard data management library available to all new projects in RBS which contains:
- messaging standards
- standard schema
- update mechanisms
The benefit being that if the project conforms with the above standards then they have little work to do for managing reference data since all the work is done once and centrally. Colin mentioned that also there needs to be feedback from the projects back to the central data management team around what is missing/needing to be improved in the library (personally I would take it one step further so that end-users and not just IT projects have easy discovery and access to centralised reference data). The lessons he took from this were that we all need to "learn to love" enterprise messaging if we are to get to the top down publish once/consume often nirvana, where consuming systems can pick up new data and functionality without significant (if any) changes (might be worth a view of this post on this topic). He also mentioned the role of metadata in automating reconciliation where that needed to occur.
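As a rough illustration of the publish once/consume often idea, the Python sketch below shows consumers that each declare the fields they care about, so that the publisher can add a new field without the existing consumers needing any change. This is my own simplified sketch and has nothing to do with how "GoldRush" is actually built; the bus class, topic name and field names are all hypothetical.

```python
from collections import defaultdict


class ReferenceDataBus:
    """A toy in-process message bus: publish once, deliver to every subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(dict(message))  # each consumer receives its own copy


def make_consumer(wanted_fields, sink):
    """Return a handler that keeps only the fields this consumer understands."""
    def handle(message):
        sink.append({k: message[k] for k in wanted_fields if k in message})
    return handle


bus = ReferenceDataBus()
settlement_view, risk_view = [], []

# An existing consumer, written before the publisher knew about LEIs.
bus.subscribe("counterparty", make_consumer(["id", "name"], settlement_view))
# A newer consumer that also wants the LEI when it is present.
bus.subscribe("counterparty", make_consumer(["id", "name", "lei"], risk_view))

# Version 1 of the message, then version 2 with an extra (hypothetical) field.
bus.publish("counterparty", {"id": 1, "name": "Example Bank"})
bus.publish("counterparty", {"id": 1, "name": "Example Bank", "lei": "EXAMPLE-LEI"})
```

The settlement consumer carries on untouched when the new field appears, while the risk consumer picks it up as soon as it is published.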
Colin then mentioned that allocation of costs of reference data to consumers is still a hot topic, one where reference data lags behind the market data permissioning/metering insisted upon by exchanges. Related to this Colin thought that the role of the Chief Data Officer to enforce policies was important, and the need for the role was being driven by regulation. He said that the true costs of a tactical, non-standard approach need to be identifiable (quantifying the size of the stick I guess) but that he had found it difficult to eliminate the tactical use of pricing data sourced for the front office. He ended by mentioning that there needs to be a coming together of market data and reference data since operations staff are not doing quantitative valuations (e.g. does the theoretical price of this new bond look ok?) and this needs to be done to ensure better data quality and increased efficiency (couldn't agree more, have a look at this article and this post for a few of my thoughts on the matter). Overall very good speaker with interesting, practical examples to back up the key points he was trying to get across.
Posted by Brian Sentance | 7 October 2013 | 12:12 pm
Numerix ran a great event on Thursday morning over at Microsoft's offices here in New York. "The Road to Achieving a Unified View of Risk" was introduced by Paul Rowady of the TABB Group. As at our holiday event last December, Paul is a great speaker and trying to get him to stop talking is the main (positive) problem of working with him (his typical ebullience was also heightened by his appearance in the Wall Street Journal on Thursday, apparently involving nothing illegal he assured me and even about which his mother phoned him during his presentation...). Paul started by saying that in their end of year review with his colleagues Larry Tabb and Adam Sussman, he suggested that Tabb Group needed to put more into developing the risk management thought leadership, which had led to today's introduction and the work Tabb Group have been doing with Numerix.
Having been involved in financial markets in Chicago, Paul is very bullish about the risk management capabilities of the funds and prop trading shops of the exchange traded options markets from days of old, and said that these risk management capabilities are now needed and indeed coming to the mainstream financial markets. Put another way, post crisis the need for a holistic view on risk has never been stronger. Considering bilateral OTC derivatives and the move towards central clearing, Paul said that he had been thinking that calculations such as CVA would eventually become as extinct as a dodo. However on using some data from the DTCC trade repository, he found that there are still some $65trillion notional of uncleared bilateral trades in the market, and that these will take a further 30 years to expire. Looking at swaptions alone the notional uncleared was $6trillion, and so his point was that bilateral OTC and their associated risks will be around for some time yet.
Paul put forward some slides showing back, middle and front-offices along different siloed business lines, and explained that back in the day when margins were fat and times were good, each unit could be run independently, with no overall view of risk possible given the range of siloed systems and data. In passing Paul also mentioned that one bank he had spoken to had 6,000 separate systems to support on just the banking side, let alone capital markets. Obviously post crisis this has changed, with pressures to reduce operational costs being a key driver at many institutions, and currently only valuation/reference data (+2.4%) and risk management (+1.2%) having increased budget spend across the market in 2013. Given operational costs and regulation such as CVA, risk management is having to move from being an end of day, post-trade process to being pre- and post-trade at intraday frequency. Paul said that not only must consistent approaches to data and analytics be taken across back, middle and front office in each business unit but now an integrated view of risk across business units must be taken (echoes of an earlier event with Numerix and PRMIA). Considering consistent analytics, Paul mentioned his paper "The Risk Analytics Library" but suggested that "libraries" of everything were needed, so not just analytics, but libraries of data (data management anyone?), metadata, risk models etc.
Paul asked Ricardo Martinez of Deloitte for an update on the regulatory landscape at the moment, and Ricardo responded by focusing down on the derivatives aspects of Dodd-Frank. He first pointed out that even after a number of years the regulation was not yet finalized around collateral and clearing. A good point he made was that whilst the focus in the market at the moment is on compliance, he feels that the consequences of the regulation will ripple on over the next 5 years in terms of margining and analytics.
Some panel members disagreed with Paul over the premise that bilateral exotic trades will eventually disappear. Their point was that the needs of pension funds and other clients are very specific and there will always be a need for structured products, despite the capital cost incentives to move everything onto exchanges/clearing. Paul countered by saying that he didn't disagree with this, but the reason for suggesting that the exotics industry may die is trying to find institutions that can warehouse the risk of the trade.
Satyam Kancharla of Numerix spoke next. Satyam said that two main changes struck him in the market at the moment. One was the adjustment to a mandated market structure with clearing, liquidity and capital changes coming through from the regulators. The other was increased operating efficiency for investment banks. Whilst it is probable that no investment bank would ever get to the operational efficiency of a retail business like Walmart, this was however the direction of travel with banks looking at how to optimize collateral, optimize trading venues etc.
Satyam put forward that computing power is still adhering to Moore's law, and that as a result some things are possible now that were not before, and that a centralized architecture built on this compute power is needed, but just because it is centralized does not mean that it is too inflexible to deal with each business unit's needs. Coming back to earlier comments made by the panel, he put forward that a lot of quants are involved in simply re-inventing the wheel, to which Paul added that quants were very experienced in using words like "orthogonal" to confuse mere mortals like him and justify the repetition of business functionality available already (from Numerix obviously, but more of that later). Satyam said that some areas of model development were more mature than others, and that quants should not engage in innovation for innovation's sake. Satyam also made a passing reference to the continuing use of Excel and VBA as the main tool of choice in the front office, suggesting that we still have some way to go in terms of IT maturity (hobby-horse topic of mine, for example see post).
Prompted by an audience question around data and analytics, Ricardo said that the major challenge towards sharing data was not technical but cultural. Against a background where maybe 50% of investment in technology was regulation-related, he said that there was no shortage of business ideas for P&L in the emerging "mandated" markets of the future, but many of these ideas required wholesale shifts in attitudes at the banks in terms of co-operation across departments and from front to back office.
Satyam said that he thought of data and analytics as two sides of the same coin (could not agree more, but then again I would say that) in that analytics generate derived data which needs just as much management as the raw data. He said that it should be possible to have systems and architectures that manage the duality of data and analytics well, and these architectures did not have to imply rigidity and inflexibility in meeting individual business needs.
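To illustrate what managing derived data alongside raw data might look like in practice, here is a small Python sketch in which every derived value carries the model name, model version and raw inputs that produced it. This is only my own sketch of the idea, not a description of how Numerix or TimeScape store derived data, and all of the names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DerivedValue:
    """A derived data point stored together with its audit trail."""
    name: str
    value: float
    model: str
    model_version: str
    inputs: Tuple[Tuple[str, float], ...]  # the raw data used, kept for audit


def derive(name, model_name, model_version, func, raw):
    """Run `func` on the raw inputs and record what was used to get the result."""
    value = func(**raw)
    return DerivedValue(name, value, model_name, model_version,
                        tuple(sorted(raw.items())))


def simple_return(start_price, end_price):
    """A deliberately trivial 'model': the simple return over a period."""
    return end_price / start_price - 1.0


# Hypothetical raw prices feeding one derived value.
raw_prices = {"start_price": 100.0, "end_price": 110.0}
period_return = derive("period_return", "simple_return", "1.0",
                       simple_return, raw_prices)
```

Because the inputs and model version travel with the result, a derived value can be questioned, reproduced or re-run under a new model version in the same way that a raw data point can be traced back to its source.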
There was then some debate of trade repositories for derivatives, where the panel discussed the potential conflict between the US regulators wanting competition in this area, but as Paul suggested having competition between DTCC, ICE, Bloomberg, LCH Clearnet etc also led to fragmentation. As such Paul put it that the regulators would need to "boil the ocean" to understand the exposures in the market. Ricardo also mentioned some of the current controversy over who owns the data in the trade repository. One of the panelists suggested that we should also keep an eye open to China and not necessarily get totally tied up in what is happening in "our" markets. The main point was that a huge economy such as China's could not survive without a sophisticated capital market to support it, and that China was not asleep in this regard.
A good audience question came from Don Wesnofske who asked how best to cope with the situation where an institution is selling derivatives based on one set of models, and the client is using another set of models to value the same trade. So the selling institution decides to buy/build a similar model to the client too, and Don wondered how the single analytic library practically helped this situation where I could price on one model and report my P&L using another. One panelist responded that it was mostly the assumptions behind each model that determined differences in price, and that heterogeneous models and hence prices were needed for a market to function correctly. Another concurred on this and suggested there needed to be an "officially blessed" model within an institution against which valuations are compared. Amusingly for the audience, Steve O'Hanlon (CEO of Numerix) piped up that the problem was easy to resolve in that everyone should use Numerix's models.
Mike Opal of Microsoft closed the event with his presentation on data, analytics and cloud computing. Mike started by illustrating that the number of internet-enabled devices passed the human population of the world in 2008 and by 2020 the number of devices would be 50 billion. He showed that the amount of data in the world was 0.8ZB (zettabytes) in 2009, and is projected to reach 8ZB by 2015 and 35ZB by 2020, driven primarily by the growth in internet-enabled devices. Mike also said that the Prism project, so much in the news of late, involved the construction of a server farm near Salt Lake City of 5ZB in size, so what the industry (in this case the NSA) is trying to do is unimaginable if we were to go back only a few years. He said that Microsoft itself was utterly committed to cloud computing, with 8 datacenters globally but 20 more in construction, at a cost of $500million per center (I recently saw a datacentre in Redmond, totally unlike what I expected with racks pre-housed in lorry containers, and the containers just unloaded within a gigantic hangar and plugged in - the person showing me around asked me who the busiest person was at a Microsoft data center and the answer was the truck drivers...)
Talking of "Big Data", he first gave the now-standard disclaimer (as I have, I acknowledge) that he disliked the phrase. I thought he made a good point in that Big Data is really about "Small Data", in that a lot of it is about having the capacity to analyze at a tiny granular level within huge datasets (maybe journalists will rename it? No, don't think so). He gave a couple of good client case studies, one for Westpac and one for Phoenix on uses of HPC and cloud computing in financial services. He also mentioned the Target retailing story about Big Data, which if you haven't caught it is worth a read. One audience question asked him again how committed Microsoft was to cloud computing given competition from Amazon, Apple and Google. Mike responded that he had only joined Microsoft a year or two back, and in part this was because he believed Microsoft had to succeed and "win" the cloud computing market given that cloud was not the only way to go for these competitors, whereas Microsoft (being a software company) had to succeed at cloud (so far Microsoft have been very helpful to us in relation to Azure, but I guess Amazon and others have other plans.)
In summary a great event from Numerix with good discussions and audience interaction - helped for me by the fact that much of what was said (centralization with flexibility, duality of data and analytics, libraries of everything etc) fits with what Xenomorph and partners like Numerix are delivering for clients.
Posted by Brian Sentance | 17 June 2013 | 8:23 pm
There are (occasionally!) some good questions and conversations going on within some of the LinkedIn groups. One recently was around what use cases there are for unstructured data within banking and finance, and I found this comment from Tom Deutsch of IBM to be quite insightful and elegant (at least better than I could have written it...) on what the main types of unstructured data analysis there are:
- Listening for the first time
- Listening better
- Adding context
Listening for the first time is really just making use of what you already probably capture to hear what is being said (or navigated)
Listening better is making sure you are actually both hearing and understanding what is being said. This is sometimes non-trivial as it involves accuracy issues and true (not marketing hype) NLP technologies and integrating multiple sources of information
Adding context is when you either add structured data to the above or add the above to structured data, usually to round out or more fully inform models (or sometimes just build new ones).
Posted by Brian Sentance | 10 May 2013 | 2:17 pm
I went over to NYU Poly in Brooklyn on Friday of last week for their Big Data Finance Conference. To get a slightly negative point out of the way early, I guess I would have to pose the question "When is a big data conference, not a big data Conference?". Answer: "When it is a time series analysis conference" (sorry if you were expecting a funny answer...but as you can see, then what I occupy my time with professionally doesn't naturally lend itself to too much comedy). As I like time series analysis, then this was ok, but certainly wasn't fully "as advertised" in my view, but I guess other people are experiencing this problem too.
Maybe this slightly skewed agenda was due to the relative newness of the topic, the newness of the event and the temptation for time series database vendors to jump on the "Big Data" marketing bandwagon (what? I hear you say, we vendors jumping on a buzzword marketing bandwagon, never!...). Many of the talks were about statistical time series analysis of market behaviour and less about what I was hoping for, which was new ways in which empirical or data-based approaches to financial problems might be addressed through big data technologies (as an aside, here is a post on a previous PRMIA event on big data in risk management as some additional background). There were some good attempts at getting a cross-discipline fertilization of ideas going at the conference, but given the topic then representatives from the mobile and social media industries were very obviously missing in my view.
So as a complete counterexample to the two paragraphs above, the first speaker at the event (Kevin Atteson of Morgan Stanley) was very much on theme with the application of big data technologies to the mortgage market. Apparently Morgan Stanley had started their "big data" analysis of the mortgage market in 2008 as part of a project to assess and understand more about the potential losses that Fannie Mae and Freddie Mac faced due to the financial crisis.
Echoing some earlier background I had heard on mortgages, one of the biggest problems in trying to understand the market according to Kevin was data, or rather the lack of it. He compared mortgage data analysis to "peeling an onion" and said that going back to the time of the crisis, mortgage data at an individual loan level was either not available or of such poor quality as to be virtually useless (e.g. hard to get accurate ZIP code data for each loan). Kevin described the mortgage data set as "wide" (lots of loans with lots of fields for each loan) rather than "deep" (lots of history), with one of the main data problems being trying to match nearest-neighbour loans. He mentioned that only post crisis have Fannie and Freddie been ordered to make individual loan data available, and that there is still no readily available linkage data between individual loans and mortgage pools (some presentations from a recent PRMIA event on mortgage analytics are at the bottom of the page here for interested readers).
Kevin said that Morgan Stanley had rejected the use of Hadoop, primarily due to its write throughput capabilities, which Kevin indicated was a limiting factor in many big data technologies. He indicated that for his problem type he still believed their infrastructure to be superior to even the latest incarnations of Hadoop. He also mentioned the technique of having 2x redundancy or more on the data/jobs being processed, aimed not just at failover but also at using whichever instance of a job finished first. Interestingly, he also added that Morgan Stanley's infrastructure engineers have a policy of rebooting servers in the grid even during the day/use, so fault tolerance was needed for both unexpected and entirely deliberate hardware node unavailability.
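As a rough illustration of the redundancy idea Kevin described (run the same job on two or more nodes, take whichever finishes first, and tolerate a node disappearing), here is a minimal Python sketch. It is not Morgan Stanley's implementation, which was not described in any detail; the function names, the toy "job" and the `down` parameter standing in for a rebooted node are all my own assumptions.

```python
import concurrent.futures
import random
import time
from typing import FrozenSet, Tuple


def run_job(replica_id: int, payload: int, down: FrozenSet[int]) -> Tuple[int, int]:
    """Hypothetical job: square the payload on one replica.

    Replicas listed in `down` simulate a node that was rebooted mid-run.
    """
    if replica_id in down:
        raise RuntimeError(f"replica {replica_id} unavailable")
    time.sleep(random.uniform(0.01, 0.05))  # simulate uneven node speed
    return replica_id, payload * payload


def first_result(payload: int, replicas: int = 2,
                 down: FrozenSet[int] = frozenset()) -> Tuple[int, int]:
    """Submit the same job to every replica and keep the first success."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=replicas)
    futures = [pool.submit(run_job, i, payload, down) for i in range(replicas)]
    try:
        for future in concurrent.futures.as_completed(futures):
            try:
                return future.result()  # first replica to finish wins
            except RuntimeError:
                continue  # that replica failed; wait for another one
        raise RuntimeError("all replicas failed")
    finally:
        pool.shutdown(wait=False)  # do not block on the slower replicas


if __name__ == "__main__":
    print(first_result(7))                       # either replica may win
    print(first_result(7, down=frozenset({0})))  # replica 0 "rebooted"
```

The trade-off is simply spending at least twice the compute in exchange for lower latency and for not caring which node went away or why.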
Other highlights from the day:
- Dennis Shasha had some interesting ideas on using matrix algebra to reduce the data analysis workload needed in some problems - basically he was all for "cleverness" over simply throwing compute power at some data problems. On a humorous note (if you are not a trader?), he also suggested that some traders had "the memory of a fruit-fly".
- Robert Almgren of QuantitativeBrokers was an interesting speaker, talking about how his firm had done a lot of analytical work in trying to characterise possible market responses to information announcements (such as Friday's non-farm payroll announcement). I think Robert was not so much trying to predict the information itself, but rather trying to predict likely market behaviour once the information is announced.
- Scott O'Malia of the CFTC was an interesting speaker during the morning panel. He again acknowledged some of the recent problems the CFTC had experienced in terms of aggregating/analysing the data they are now receiving from the market. I thought his comment on the Twitter crash was both funny and brutally pragmatic, with him saying "if you want to rely solely upon a single twitter feed to trade then go ahead, knock yourself out."
- Eric Vanden Eijnden gave an interesting talk on "detecting Black Swans in Big Data". Most of the examples were from current detection/movement in oceanography, but seemed quite analogous to "regime shifts" in the statistical behaviour of markets. Main point seemed to be that these seemingly unpredictable and infrequent events were predictable to some degree if you looked deep enough in the data, and in particular that you could detect when the system was on a likely "path" to a Black Swan event.
One of the most interesting talks was by Johan Walden of the Haas Business School, on the subject of "Investor Networks in the Stock Market". Johan explained how they had used big data to construct a network model of all of the participants in the Turkish stock exchange (both institutional and retail) and in particular how "interconnected" each participant was with other members. His findings seemed to support the hypothesis that the more "interconnected" the investor (at the centre of many information flows rather than at the edges), the more likely that investor would demonstrate superior return levels to the average. I guess this is a kind of classic transferral of some of the research done in social networking, but very interesting to see it applied pragmatically to financial markets, and I would guess an area where a much greater understanding of investor behaviour could be gleaned. Maybe Johan could do with a little geographic location data to add to his analysis of how information flows.
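To make "interconnected" a little more concrete, here is a minimal Python sketch of one simple network measure, degree centrality, which scores each participant by the share of other participants they are directly linked to. I do not know which measure Johan actually used, so treat this purely as an assumption for illustration; the investor labels and links are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of the other participants each participant is linked to."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-links
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical links between five investors (e.g. inferred from trading patterns).
links = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
scores = degree_centrality(links)

if __name__ == "__main__":
    for investor, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(investor, round(score, 2))
```

In this toy network investor A sits at the centre and E at the edge; the hypothesis from the talk would be that investors like A tend to show better returns than investors like E.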
So overall a good day with some interesting talks - the statistical presentations were challenging to listen to at 4pm on a Friday afternoon but the wine afterwards compensated. I would also recommend taking a read through a paper by Charles S. Tapiero on "The Future of Financial Engineering" for one of the best discussions I have so far read about how big data has the potential to change and improve upon some of the assumptions and models that underpin modern financial theory. Coming back to my starting point in this post on the content of the talks, I liked the description that Charles gives of traditional "statistical" versus "data analytics" approaches, and some of the points he makes about data immediately inferring relationships without the traditional "hypothesize, measure, test and confirm-or-not" were interesting, both in favour of data analytics and in cautioning against unquestioning belief in the findings from data (feels like this post from October 2008 is a timely reminder here). With all of the hype and the hope around the benefits of big data, maybe we would all be wise to remember this quote by a certain well-known physicist: "No amount of experimentation can ever prove me right; a single experiment can prove me wrong."
Posted by Brian Sentance | 7 May 2013 | 1:46 pm
Good post from Jim Jockle over at Numerix - main theme is around having an "analytics" strategy in place in addition to (and probably as part of) a "Big Data" strategy. Fits strongly around Xenomorph's ideas on having both data management and analytics management in place (a few posts on this in the past, try this one from a few years back) - analytics generate the most valuable data of all, yet the data generated by analytics and the input data that supports analytics is largely ignored as being too business focussed for many data management vendors to deal with, and too low level for many of the risk management system vendors to deal with. Into this gap in functionality falls the risk manager (supported by many spreadsheets!), who has to spend too much time organizing and validating data, and too little time on risk management itself.
Within risk management, I think it comes down to having the appropriate technical layers in place of data management, analytics/pricing management and risk model management. Ok it is a greatly simplified representation of the architecture needed (apologies to any techies reading this), but the majority of financial institutions do not have these distinct layers in place, with each of these layers providing easy "business user" access to allow risk managers to get to the "detail" of the data when regulators, auditors and clients demand it. Regulators are finally waking up to the data issue (see Basel on data aggregation for instance) but more work is needed to pull analytics into the technical architecture/strategy conversation, and not just confine regulatory discussions of pricing analytics to model risk.
Posted by Brian Sentance | 14 February 2013 | 2:50 pm
A little late on these notes from this PRMIA Event on Big Data in Risk Management that I helped to organize last month at the Harmonie Club in New York. Big thank you to my PRMIA colleagues for taking the notes and for helping me pull this write-up together, plus thanks to Microsoft and all who helped out on the night.
Introduction: Navin Sharma (of Western Asset Management and Co-Regional Director of PRMIA NYC) introduced the event and began by thanking Microsoft for its support in sponsoring the evening. Navin outlined how he thought the advent of “Big Data” technologies was very exciting for risk management, opening up opportunities to address risk and regulatory problems that previously might have been considered out of reach.
Navin defined Big Data as structured or unstructured data received at high volumes and requiring very large data storage. Its characteristics include a high velocity of record creation, extreme volumes, a wide variety of data formats, variable latencies, and complexity of data types. Additionally, he noted that relative to other industries, in the past financial services has created perhaps the largest historical sets of data and continually creates enormous amounts of data on a daily or moment-by-moment basis. Examples include options data, high frequency trading, and unstructured data such as via social media. Its usage provides potential competitive advantages in trading and investment management. Also, by using Big Data it is possible to have faster and more accurate recognition of potential risks via seemingly disparate data - leading to timelier and more complete risk management of investments and firms’ assets. Finally, the use of Big Data technologies is in part being driven by regulatory pressures from Dodd-Frank, Basel III, Solvency II, the Markets in Financial Instruments Directives (1 & 2) as well as the Markets in Financial Instruments Regulation.
Navin also noted that we will seek to answer questions such as:
- What is the impact of big data on asset management?
- How can Big Data’s impact enhance risk management?
- How is big data used to enhance operational risk?
Presentation 1: Big Data: What Is It and Where Did It Come From?: The first presentation was given by Michael Di Stefano (of Blinksis Technologies), and was titled “Big Data. What is it and where did it come from?”. You can find a copy of Michael’s presentation here. In summary Michael started by saying that there are many definitions of Big Data, mainly defined as technology that deals with data problems that are either too large, too fast or too complex for conventional database technology. Michael briefly touched upon the many different technologies within Big Data such as Hadoop, MapReduce and databases such as Cassandra and MongoDB etc. He described some of the origins of Big Data technology in internet search, social networks and other fields. Michael described the “4 V’s” of Big Data: Volume, Velocity and Variety, with the fourth and key point from Michael being “time to Value” in terms of what you are using Big Data for. Michael concluded his talk with some business examples around use of sentiment analysis in financial markets and the application of Big Data to real-time trading surveillance.
Presentation 2: Big Data Strategies for Risk Management: The second presentation “Big Data Strategies for Risk Management” was introduced by Colleen Healy of Microsoft (presentation here). Colleen started by saying expectations of risk management are rising, and that prior to 2008 not many institutions had a good handle on the risks they were taking. Risk analysis needs to be done across multiple asset types, more frequently and at ever greater granularity. Pressure is coming from everywhere including company boards, regulators, shareholders, customers, counterparties and society in general. Colleen used to head investor relations at Microsoft and put forward a number of points:
- A long line of sight of one risk factor does not mean that we have a line of sight on other risks around.
- Good risk management should be based on simple questions.
- Reliance on 3rd parties for understanding risk should be minimized.
- Understand not just the asset, but also at the correlated asset level.
- The world is full of fast markets driving even more need for risk control
- Intraday and real-time risk now becoming necessary for line of sight and dealing with the regulators
- Now need to look at risk management at a most granular level.
Colleen explained some of the reasons why good risk management remains a work in progress, and that data is a key foundation for better risk management. However data has been hard to access, analyze, visualize and understand, and she used this to link to the next part of the presentation by Denny Yu of Numerix.
Denny explained that new regulations involving measures such as Potential Future Exposure (PFE) and Credit Value Adjustment (CVA) were moving the number of calculations needed in risk management to a level well above that required by methodologies such as Value at Risk (VaR). Denny illustrated how a typical VaR calculation on a reasonable sized portfolio might need 2,500,000 instrument valuations and how PFE might require as many as 2,000,000,000. He then explained more of the architecture he would see as optimal for such a process and illustrated some of the analysis he had done using Excel spreadsheets linked to Microsoft’s high performance computing technology.
Presentation 3: Big Data in Practice: Unintentional Portfolio Risk: Kevin Chen of Opera Solutions gave the third presentation, titled “Unintentional Risk via Large-Scale Risk Clustering”. You can find a copy of the presentation here. In summary, the presentation was quite visual and illustrated how large-scale empirical analysis of portfolio data could produce some interesting insights into portfolio risk and how risks become “clustered”. In many ways the analysis was reminiscent of an empirical form of principal component analysis i.e. where you can see and understand more about your portfolio’s risk without actually being able to relate the main factors directly to any traditional factor analysis.
Panel Discussion: Brian Sentance of Xenomorph and the PRMIA NYC Steering Committee then moderated a panel discussion. The first question was directed at Michael “Is the relational database dead?” – Michael replied that in his view relational databases were not dead and indeed for dealing with problems well-suited to relational representation were still and would continue to be very good. Michael said that NoSQL/Big Data technologies were complementary to relational databases, dealing with new types of data and new sizes of problem that relational databases are not well designed for. Brian asked Michael whether the advent of these new database technologies would drive the relational database vendors to extend the capabilities and performance of their offerings? Michael replied that he thought this was highly likely but only time would tell whether this approach will be successful given the innovation in the market at the moment. Colleen Healy added that the advent of Big Data did not mean the throwing out of established technology, but rather an integration of established technology with the new such as with Microsoft SQL Server working with the Hadoop framework.
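To give a feel for where numbers of this size come from, here is a small back-of-the-envelope sketch in Python. Only the two totals (2,500,000 and 2,000,000,000) come from Denny's talk; the split into instruments, scenarios and future time steps is my own hypothetical breakdown, chosen simply so that the arithmetic reproduces those totals.

```python
def valuation_count(instruments: int, scenarios: int, time_steps: int = 1) -> int:
    """Number of instrument valuations for a scenario-based risk measure."""
    return instruments * scenarios * time_steps


# Hypothetical VaR run: every instrument revalued once under each scenario.
var_count = valuation_count(instruments=2_500, scenarios=1_000)

# Hypothetical PFE run: every instrument revalued on every simulated path
# at every future date, which is what multiplies the workload.
pfe_count = valuation_count(instruments=2_500, scenarios=5_000, time_steps=160)

if __name__ == "__main__":
    print(f"VaR valuations: {var_count:,}")
    print(f"PFE valuations: {pfe_count:,}")
    print(f"PFE / VaR ratio: {pfe_count // var_count}x")
```

Whatever the real breakdown was, the point stands that adding a time dimension to the simulation multiplies the valuation count by several hundred.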
Brian asked the panel whether they thought visualization would make a big impact within Big Data? Ken Akoundi said that the front end applications used to make the data/analysis more useful will evolve very quickly. Brian asked whether this would be reminiscent of the days when VaR first appeared, when a single number arguably became a false proxy for risk measurement and management? Ken replied that the size of the data problem had increased massively from when VaR was first used in 1994, and that visualization and other automated techniques were very much needed if the headache of capturing, cleansing and understanding data was to be addressed.
Brian asked whether Big Data would address the data integration issue of siloed trading systems? Colleen replied that Big Data needs to work across all the silos found in many financial organizations, or it isn’t “Big Data”. There was general consensus from the panel that legacy systems and people politics were also behind some of the issues found in addressing the data silo issue.
Brian asked if the panel thought the skills needed in risk management would change due to Big Data? Colleen replied that effective Big Data solutions require all kinds of people, with skills across a broad range of specific disciplines such as visualization. Generally the panel thought that data and data analysis would play an increasingly important part in risk management. Ken put forward his view that all Big Data problems should start with a business problem, not just a technology focus. For example, are there any better ways to predict stock market movements based on the consumption of larger and more diverse sources of information? In terms of risk management skills, Denny said that the risk management of 15 years ago was based on relatively simple econometrics. Fast forward to today, and risk calculations such as CVA are statistically and computationally very heavy, and trading is increasingly automated across all asset classes. As a result, Denny suggested that even the PRMIA PRM syllabus should change to focus more on data and data technology given the importance of data to risk management.
Asked how Big Data should best be applied, Denny echoed Ken in saying that understanding the business problem first was vital, but that obviously Big Data opened up the capability to aggregate and work with larger datasets than ever before. Brian then asked what advice would the panel give to risk managers faced with an IT department about to embark upon using Big Data technologies? Assuming that the business problem is well understood, Michael said that the business needed some familiarity with the broad concepts of Big Data, what it can and cannot do and how it fits with more mainstream technologies. Colleen said that there are some problems that only Big Data can solve, so understanding the technical need is a first checkpoint. Obviously IT people like working with new technologies and this needs to be monitored, but so long as the business problem is defined and valid for Big Data, people should be encouraged to learn new technologies and new skills. Kevin also took a very positive view that IT departments should be encouraged to experiment with these new technologies and understand what is possible, but that projects should have well-defined assessment/cut-off points, as with any good project management, to decide if the project is progressing well. Ken put forward that many IT staff were new to the scale of the problems being addressed with Big Data, and that his own company Opera Solutions had an advantage in its deep expertise of large-scale data integration to deliver quicker on project timelines.
Audience Questions: There then followed a number of audience questions. The first few related to other ideas/kinds of problems that could be analyzed using the kind of modeling that Opera had demonstrated. Ken said that there were obvious extensions that Opera had not got around to doing just yet. One audience member asked how well could all the Big Data analysis be aggregated/presented to make it understandable and usable to humans? Denny suggested that it was vital that such analysis was made accessible to the user, and there was general consensus across the panel that man vs. machine was an interesting issue to develop in considering what is possible with Big Data. The next audience question was around whether all of this data analysis was affordable from a practical point of view. Brian pointed out that there was a lot of waste in current practices in the industry, with wasteful duplication of ticker plants and other data types across many financial institutions, large and small. This duplication is driven primarily by the perceived need to implement each institution’s proprietary analysis techniques, and this kind of customization is not yet available from the major data vendors, but will become more possible as cloud technology such as Microsoft’s Azure develops further. There was a lot of audience interest in whether Big Data could lead to better understanding of causal relationships in markets rather than simply correlations. The panel responded that causal relationships were harder to understand, particularly in a dynamic market with dynamic relationships, but that insight into correlation was at the very least useful and could lead to better understanding of the drivers as more datasets are analyzed.
Posted by Brian Sentance | 8 February 2013 | 3:14 pm
Posted by Brian Sentance | 22 January 2013 | 3:14 pm
In relation to the Microsoft/PRMIA event that Brian moderated at last night in New York, I spotted this article recently that tries to map out all the different databases that are now commercially available in some form, from SQL to No SQL and all the various incarnations and flavours in between:
As Brian suggested in his recent post, it's amazing to see how much the landscape has evolved from the domination (mantra?) that there was the relational way, or no way. Obviously times have moved on (er, I guess the Internet happened for one thing...) and people are now far more accepting of the need for different approaches to different types and sizes of business problems. That said, I agree with the article and comments that suggest there do seem to be far too many options available now - there has to be some consolidation coming, otherwise it will become increasingly difficult to know where to start. Choice is a wonderful thing, but only in moderation!
Posted by Chris Budgen | 16 January 2013 | 9:30 pm
Good breakfast event from SAP and A-Team last Thursday morning. SAP have been getting (and I guess paying for) a lot of good air-time for their SAP Hana in-memory database technology of late. Domenic Iannaccone of SAP started the briefing with an introduction to big data in finance and how their SAP/Sybase offerings knitted together. He started his presentation with a few quotes, one being "Intellectual property is the oil of the 21st century" by Mark Getty (he of Getty images, but also of the Getty oil family) and "Data is the new oil" by both Clive Humby and Gerd Leonhard (not sure why two people quoted saying the same thing but anyway).
For those of you with some familiarity with the Sybase IQ architecture of a year or two back, in this architecture SAP Hana seems to have replaced the in-memory ASE database that worked in tandem with Sybase IQ for historical storage (I am yet to confirm this, but hope to find out more in the new year). When challenged on how Hana differs from other in-memory database products, Domenic seemed keen to emphasise its analytical capabilities and not just the database aspects. I guess the big data angle of bringing the "data closer to the calculations" was his main differentiator on this, but with more time I think a little bit more explanation would have been good.
Pete Harris of the A-Team walked us through some of the key findings of what I think is the best survey I have read so far on the usage of big data in financial markets (free sign-up needed I think, but you can get a copy of the report here). Some key findings from a survey of staff at ten major financial institutions included:
- Searching for meaning in unstructured data was a leading use-case thought of when thinking of big data (Twitter trading etc)
- Risk management was seen as a key beneficiary of what the technologies can offer
- Aggregation of data for risk was seen as a key application area concerning structured data.
- Both news feeds and (surprisingly?) text documents were key unstructured data sources being processed using big data.
- In trading, news sentiment and time series analysis were key areas for big data.
- Creation of a system wide trade database for surveillance and compliance was seen as a key area for enhancement by big data.
- Data security remains a big concern with technologists over the use of big data.
There were a few audience questions - Pete clarified that there was a more varied application of big data amongst sell-side firms, and that on the buy-side it was being applied more to KYC and related areas. One of the audience made the point that he thought a real challenge beyond the insight gained from big data analysis was how to translate it into value from an operational point of view. There seemed to be a fair amount of recognition that regulators and auditors are wanting a full audit trail of what has gone on across the whole firm, so audit was seen as a key area for big data. Another audience member suggested that the lack of a rigid data model in some big data technologies enabled greater flexibility in the scope of questions/analysis that could be undertaken.
Coming back to the key findings of the survey, one question I asked Pete was whether or not big data is a silver bullet for data integration. My motivation was that the survey and much of the press you read talks about how big data can pull all the systems, data and calculations together for better risk management, but while I can understand how massively scalable data and calculation capabilities are extremely useful, I wondered how exactly all the data was pulled together from the current range of siloed systems and databases where it currently resides. Pete suggested that this was still a problematic area where Enterprise Application Integration (EAI) tools were needed. Another audience member added that politics within different departments was not making data integration any easier, regardless of the technologies used.
Overall a good event, with audience interaction unsurprisingly being the most interesting and useful part.
Posted by Brian Sentance | 3 December 2012 | 2:12 pm
Bankenes Sikringsfond Selects Xenomorph's TimeScape for Faster Data Analysis and High-Quality Decision Support
Just a quick note to say that we have signed a new client, Bankenes Sikringsfond, the Norwegian Banks’ Guarantee Fund. They will be using TimeScape to fulfill requirements for a centralised analytics and data management platform. The press release is available here for those of you who are interested.
Posted by Sara Verri | 11 October 2012 | 10:50 am
New article with some of my thoughts on data models, interfaces and software upgrades has just gone up on the Waters Inside Reference Data site.
Posted by Brian Sentance | 11 September 2012 | 4:50 pm
We have a great new software release out today for TimeScape, Xenomorph's analytics and data management solution, more details of which you can find here. For some additional background to this release then please take a read below.
For many users of Xenomorph's TimeScape, our Excel interface to TimeScape has been a great way of extending and expanding the data analysis capabilities of Excel through moving the burden of both the data and the calculation out of each spreadsheet and into TimeScape. As I have mentioned before, spreadsheets are fantastic end-user tools for ad-hoc reporting and analysis, but problems arise when their very usefulness and ease of use cause people to use them as standalone desktop-based databases. The four-hundred or so functions available in TimeScape for Excel, plus Excel access to our TimeScape QL+ Query Language have enabled much simpler and more powerful spreadsheets to be built, simply because Excel is used as a presentation layer with the hard work being done centrally in TimeScape.
Many people like using spreadsheets, however many users equally do not and prefer more application based functionality. Taking this feedback on board has previously driven us to look at innovative ways of extending data management, such as embedding spreadsheet-like calculations inside TimeScape and taking them out of spreadsheets with our SpreadSheet Inside technology. With this latest release of TimeScape, we are providing much of the ease of use, analysis and reporting power of spreadsheets but doing so in a more consistent and centralised manner. Charts can now be set up as default views on data so that you can quickly eyeball different properties and data sources for issues. New Heatmaps allow users to view large colour-coded datasets and zoom in quickly on areas of interest for more analysis. Plus our enhanced Reporting functionality allows greater ease of use and customisation when wanting to share data analysis with other users and departments.
Additionally, the new Query Explorer front end really shows off what is possible with TimeScape QL+, in allowing users to build and test queries in the context of easily configurable data rules for things such as data source preferences, missing data and proxy instruments. The new auto-complete feature is also very useful when building queries, and automatically displays all properties and methods available at each point in the query, even including user-defined analytics and calculations. It also displays complex and folded data in an easy manner, enabling faster understanding and analysis of more complex data sets such as historical volatility surfaces.
Posted by Brian Sentance | 17 July 2012 | 3:11 pm