Managing Data Quality. Tim King
decreasing data quality.
What you should be aiming for is a ‘virtuous circle’ (see Figure 2.1) whereby improvements in data quality deliver improvements in information quality. These in turn should improve the quality of the decision making of the organisation, which should in turn lead to better results or outcomes. With better decisions and results, it is likely that the data arising from these activities will also be better quality, particularly in more ‘data aware’ organisations.

Figure 2.1 The virtuous circle of data quality

Challenges when exploiting and managing data
Unclear data ownership
Some people talk about ‘data owners’ and organisations often worry about assigning ownership to particular data sets. Since many business processes can create, use and amend similar data (e.g. customer data being updated as part of many processes), an assigned owner of data will struggle to retain influence over the activities that are changing the data and contributing to poor quality.
By assigning empowered process owners and maintaining explicit specifications for the data being created by a process, organisations can establish a more solid foundation for the control of data quality. If such data specifications have been defined, agreed and published to reflect the decisions being supported by the data, it becomes possible to assess compliance objectively. Many individual sources (e.g. different suppliers or departments) can then contribute to the data, but always in accordance with the single, definitive data specification.
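One way to picture how a single, definitive data specification makes compliance objectively assessable is as a set of checkable rules applied to every record, whatever its source. The sketch below is purely illustrative: the field names and rules are invented for this example, not taken from the book.

```python
# Hypothetical sketch: records from several sources assessed against one
# published data specification. Field names and rules are invented.
CUSTOMER_SPEC = {
    "customer_id":  lambda v: isinstance(v, str) and len(v) == 8,
    "postcode":     lambda v: isinstance(v, str) and v.strip() != "",
    "credit_limit": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def assess_compliance(record: dict) -> list[str]:
    """Return the names of fields that fail the specification."""
    failures = []
    for field, rule in CUSTOMER_SPEC.items():
        if field not in record or not rule(record[field]):
            failures.append(field)
    return failures

# Records contributed by two different sources, checked against the same spec.
from_sales   = {"customer_id": "AB123456", "postcode": "GU14 0LX", "credit_limit": 5000}
from_finance = {"customer_id": "XY99", "postcode": "", "credit_limit": -100}

print(assess_compliance(from_sales))    # []
print(assess_compliance(from_finance))  # ['customer_id', 'postcode', 'credit_limit']
```

Because every contributing source is measured against the same published rules, disagreements about quality become a question of fact rather than opinion.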
Managing data storage is not ownership; effective owners are those who can determine the extent to which data are appropriate for the needs of the business.
Backups and data quality
All well-run information technology (IT) systems will have an agreed backup regime in place, ensuring the organisation is able to restore a full or partial system in the event of major hardware or software issues. This will typically include a range of daily, weekly and monthly backups, plus off-site storage and perhaps standby ‘failover’ systems, to ensure that if, for example, there were to be a fire on site, software systems and services could still be restored from scratch without undue delay or data loss. Backups can vary from a ‘full’ backup (copying all data and information to backup media) through to an ‘incremental’ backup that only copies data and information that have changed since the last backup was run.
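The distinction between full and incremental backups can be sketched as a simple file-selection rule. This is a minimal illustration assuming files are tracked only by last-modified timestamp; real backup tools track far more state (catalogues, archive flags, block-level changes and so on).

```python
# Minimal sketch of full versus incremental backup selection, assuming a
# mapping of file paths to last-modified timestamps (illustrative only).
def select_for_backup(files: dict, last_backup=None) -> list:
    """Full backup when no previous backup exists; otherwise copy only
    files modified since the last backup run (an incremental backup)."""
    if last_backup is None:
        return sorted(files)  # full backup: copy everything
    return sorted(p for p, mtime in files.items() if mtime > last_backup)

files = {"orders.db": 1000.0, "customers.db": 2500.0, "config.ini": 500.0}

print(select_for_backup(files, None))    # all three files (full)
print(select_for_backup(files, 2000.0))  # only customers.db (incremental)
```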
Depending on the backup service level agreement, backups can usually give a view of the data as they stood up to one or two months ago. However, if you were to restore your data from a backup taken some time ago, you would almost certainly ‘lose’ all the data updates that have been made since that backup was created. Such a scenario requires both the application of incremental restores and analysis of which updates should be ignored and which overwritten.
If staff have been entering poor quality data over a significant period of time (perhaps over many years), then it will be extremely difficult, if not impossible, to go back to a point where the data were ‘good’ and find the correct data. This is particularly challenging when there are some teams who are very diligent (whose data you do not want to correct or change) and other teams who are more careless (whose data you will need to correct).
The reality is that system and data backups provide little or no value when trying to resolve data quality problems.
Data quality and lack of transparency in business cases
In an organisational context, it is rare for a business case to be expressed along the lines of ‘based on the quality of input data, we believe project costs are likely to be between £240k and £320k, with benefits in the range of £80k to £160k per annum’. In this example, the worst-case forecast would give a payback period of four years, whereas the best case would suggest that the project covered its costs in only 18 months. Clearly, this represents a large range of outputs, and, depending on how the costs, benefits and payback are presented, there will be very different perceptions of the level of risk presented by the project. Based on this example, consider the impact if the results were presented as follows:
1. ‘The project will cost £280k, deliver benefits of £120k per annum and therefore achieve a payback of 2.3 years’ (these are based on the mid-points of the above scenario).
2. ‘Based on the assessed quality of input data, we forecast the project costs to be between £260k and £300k with benefits in the range of £100k to £140k per annum’ (these are based on the mid-points of the above scenario, but with lower variance in both costs and benefits).
3. ‘Based on the assessed quality of input data, we forecast the project costs to be between £240k and £320k with benefits in the range of £80k to £160k per annum.’
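The payback arithmetic behind these presentations can be verified with a short calculation using the figures quoted in the text:

```python
# Worked check of the payback figures quoted in the text.
def payback_years(cost_k: float, benefit_k_per_year: float) -> float:
    """Simple payback period in years: cost recovered by annual benefit."""
    return cost_k / benefit_k_per_year

# Case 1: single mid-point estimate.
print(round(payback_years(280, 120), 1))   # 2.3 years

# Case 3: full range. Worst case pairs the highest cost with the lowest
# benefit; best case pairs the lowest cost with the highest benefit.
print(payback_years(320, 80))    # 4.0 years (worst case)
print(payback_years(240, 160))   # 1.5 years, i.e. 18 months (best case)
```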
In the first case, the business case sounds attractive, so you will probably be able to gain project approval for £280k. If, however, the actual costs and benefits end up at the pessimistic end of the scale, then you are likely to face difficult questions from the project sponsors.
In the second case, the variance due to data quality issues is smaller (and has been declared), so you could both achieve approval for the project and avoid any unpleasant surprises later in the project.
In the final case, the variance due to data quality issues indicates a large range for the potential payback period. The outcome could be that you get approval for a limited initial investment, say £25k, to correct or gather data and perhaps to run a proof-of-concept study. This is likely to reduce the level of risk and uncertainty associated with the project and potentially avoid the organisation committing funds to a non-viable project. The outcome of this limited project phase would then inform the decision on whether the whole project should proceed, having helped to reduce the risk of cost over-runs and implementation issues.
The data triangle
To support effective data exploitation, it is useful to consider the ‘data triangle’, shown in Figure 2.2. This illustrates three factors that need to be present to ensure effective data exploitation; similar to a tripod, if any one of the elements is not present, then the overall approach will not stand up. The three factors are as follows:
Analytic and processing tools: You need to ensure that you have both an effective tool set and the individuals with the skills to use it efficiently and effectively.
Subject matter expertise: The world’s best experts in your chosen tool set will only be able to provide minimal benefits unless they are supported by suitable subject matter expertise. This will help to ensure that they are able to understand the context of the particular exploitation challenge, to interpret the meaning of the data, to spot obvious data issues and to provide validation that outputs are sensible.
Data of known quality: It is important not only that you have sufficient data to be able to undertake data exploitation tasks, but also that the quality of these data is known. Clearly, it would be great to have ‘perfect’ data. In real business situations, however, there will be data quality issues that cannot be corrected in a timely or cost-effective manner. The previous discussion on complex decisions showed the importance of knowing and stating the quality of your data and the impact they have on extracting appropriate insight and foresight.
Figure 2.2 The data triangle