Data Validation and Cleansing: How Do You Prioritise What To Fix First?
Data validation and cleansing is a methodical discipline. Start with a set of rules (or tests) to identify anomalies: for example, a data item that hasn't changed (and is possibly stale); one that has changed beyond typical norms (and is possibly an error); or two sources that differ in their value for the same item (suggesting one may be wrong). Those rules throw up a regular list of exceptions that require further investigation. That investigative process will either unearth an inaccurate data item that needs to be fixed, or a valid reason for the rule (or combination of rules) having been broken. A classic false positive: a price moves intra-day but coincidentally closes at the same level as the prior day, tripping the stale-data rule.
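To make those three rules concrete, here is a minimal sketch in Python. Everything in it (the `PriceRecord` shape, the field names, the thresholds) is an illustrative assumption rather than a real rules engine:

```python
from dataclasses import dataclass

@dataclass
class PriceRecord:
    security_id: str
    source: str
    price: float
    prior_price: float

# Hypothetical thresholds; a real system would tune these per asset class.
SPIKE_THRESHOLD = 0.10    # flag day-on-day moves larger than 10%
SOURCE_TOLERANCE = 0.001  # flag sources disagreeing by more than 0.1%

def is_stale(rec: PriceRecord) -> bool:
    """Rule 1: the item hasn't changed, so it is possibly stale."""
    return rec.price == rec.prior_price

def is_spike(rec: PriceRecord) -> bool:
    """Rule 2: the item changed beyond typical norms, so it is possibly an error."""
    if rec.prior_price == 0:
        return False
    return abs(rec.price - rec.prior_price) / rec.prior_price > SPIKE_THRESHOLD

def sources_disagree(a: PriceRecord, b: PriceRecord) -> bool:
    """Rule 3: two sources differ on the same item, so one may be wrong."""
    if a.price == 0:
        return False
    return abs(a.price - b.price) / abs(a.price) > SOURCE_TOLERANCE
```

Note that the stale check is exactly where the false positive above bites: an intra-day move that closes flat still trips the rule, which is why exceptions need human investigation rather than automatic correction.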
Those core steps – i) run data through a set of validation rules, ii) identify exceptions, and iii) investigate and fix – are bread and butter to data managers. Yet from the perspective of the business, there is an interim step that is often missed: how do you prioritise which exceptions to investigate and fix first?
Surely not all exceptions are created equal. Some have much bigger implications for the firm's underlying business, whatever that may be. Recently, we've had several clients ask us to incorporate a severity rank into their exception handling reports to help them prioritise their efforts. Yet not all firms prioritise in the same way.
One of our customers linked their trade position databases into our architecture to rank the potential dollar impact associated with each exception. In other words, if the firm was sitting on a big position in a particular security, and the price of that security spiked downwards, that spike would be the first to be investigated. A second customer asked for a different approach: they wanted their exception handling prioritised according to the clients that held a particular security. In other words, exceptions that potentially impacted their large, important clients would be the first to be investigated. Although it was by no means a trivial challenge, our open architecture and rules engine meant we were able to accommodate both of these requests relatively easily.
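As a rough illustration only (not the actual implementation), a severity ranking that blends both ideas might look like the sketch below; the position and client-tier lookups, the tie-breaking order, and the field names are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class DataException:
    security_id: str
    rule: str            # e.g. "stale", "spike", "sources_disagree"
    price_change: float  # fractional move that triggered the rule

# Hypothetical lookups; in practice these would come from the position
# database and the client (CRM) system the customers linked in.
POSITION_VALUE = {"XYZ": 5_000_000.0, "ABC": 20_000.0}  # dollars held per security
CLIENT_TIER = {"XYZ": 1, "ABC": 3}                      # 1 = most important clients

def dollar_impact(exc: DataException) -> float:
    """First approach: position size times the size of the move."""
    return POSITION_VALUE.get(exc.security_id, 0.0) * abs(exc.price_change)

def client_priority(exc: DataException) -> int:
    """Second approach: lower tier number = more important clients hold it."""
    return CLIENT_TIER.get(exc.security_id, 99)

def prioritise(exceptions: list[DataException]) -> list[DataException]:
    # Biggest potential dollar impact first; ties broken by client importance.
    return sorted(exceptions, key=lambda e: (-dollar_impact(e), client_priority(e)))
```

One appeal of a composite sort key like this is that flipping the emphasis is trivial: a firm that cares first about client impact would simply swap the order of the two key components.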
Yet there is still plenty more of interest to investigate. So here are some open-ended questions for anyone charged with assuring data quality: how do you prioritise your exception handling processes? Would you incorporate different prioritisation factors depending on the test in question (in other words, would you tend to prioritise a data spike over a stale data item)? Would you set out SLAs clearly outlining expectations for resolving high, medium and low priority exceptions, and is your system capable of tracking the right metrics to ensure compliance with those SLAs?
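On that last question, the mechanics need not be complicated. A minimal sketch, assuming placeholder severity bands and deadlines (the real values belong in the SLA itself):

```python
from datetime import datetime, timedelta

# Placeholder SLA deadlines per severity band.
SLA_HOURS = {"high": 2, "medium": 8, "low": 24}

def sla_breached(severity: str, raised_at: datetime,
                 resolved_at: datetime | None = None) -> bool:
    """True if an exception was resolved after its deadline, or is still open past it."""
    deadline = raised_at + timedelta(hours=SLA_HOURS[severity])
    return (resolved_at or datetime.now()) > deadline
```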
This blog was first published in Information Management.