
Data Consistency and Data Quality: A Search For Reality


Article written by David Steele, Head of Data Science


At The Floow, we see ourselves as the data equivalent of an oil refinery.

Where an oil refinery takes crude oil and refines it to make fuel with the impurities removed, The Floow takes raw sensor information from driver journeys, such as GPS and accelerometer values, and refines it into a clean data set that can be ingested by our scoring and analytical processes.

At a fundamental level, this process works very well. Our scoring implementations and analytical processes are designed to be highly robust, which means that any impurities that pass through the cleansing routines and into our system do not break it or cause it to crash.

However, as with most software and models, our systems are designed using an iterative process. The idea is that each new iteration brings enhancements and improvements that perform better than the baseline of the current iteration.

‘Data Consistency’ and ‘Data Quality’ are two such enhancements. They are specifically designed to measure and better handle the impurities of the data that our systems may be exposed to during the scoring and modelling processes.

Why do we need these enhancements, and why now?

Over the last 12 months, The Floow’s Data Science team, including myself, have been developing and working on a number of enhancements around Data Consistency and Data Quality to introduce to our current systems.

There are many reasons why we decided to make these enhancements to our systems, but here are a few of the key motivators which prompted our decision to undertake this work:

• As the technology that collects journey data becomes more sophisticated, it can deliver a richer data set which must be cleansed properly before being ingested. A rich data set may have increased sensitivity to subtle outliers and oddities, so we need processes that can spot such things with a high degree of accuracy.

• We are constantly improving our visual representations of data, so we need to be sure that the data behind any output we show can be easily understood.

• Being aware of data issues is not enough; we need to cleanse the data according to the correct contextual purpose. With Data Consistency and Data Quality, we can apply our cleansing processes to obtain the best possible representation of a journey as it took place in real life, so that users are not unfairly scored due to bad data.

• Poor data can cause problems with our algorithms. For example, our tagging algorithm predicts the journey type (car, train, plane, etc.), and if the data is poor, these predictions are more likely to be wrong.

• More recently, we have seen increased demand for Pay As You Drive (PAYD) style propositions and interventions, such as speeding interventions. In order to judge driver risk at a journey level, we need to ensure that the data we use to make these judgements is as robust as possible, and we also need to be able to confidently say if the quality of the data is too poor to use.

• We want to continually improve our data processes and procedures, so that our algorithms and our scores are the best and truest representation of a driver’s behaviour.

What is Data Consistency?

Simply put, Data Consistency is a best-estimation replacement algorithm. It aims to replace GPS speed dropouts (or spikes) with sensible speeds using distance and time calculations. The end result helps to smooth out issues and boost our Anticipation score.

Figure 1: Showing the difference between actual speeds and Data Consistency speeds


The graph in Figure 1 (above) shows two speed traces taken from a real-life journey. The blue speed trace is the collection of best speeds that could be derived from the given journey data.

As you can see, there are three dropouts during the journey (the blue drops) as well as an initial issue at the start of the journey.

Data Consistency looks at the speed trace, locates the dropouts (or spikes), and then estimates a more likely speed, based on the surrounding information.

This can be seen by the red dotted line. The result is that the red speed trace more closely reflects what would have happened in real life.
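The article does not publish The Floow's actual algorithm, but the dropout-and-spike replacement it describes can be sketched roughly as below. All names and the plausibility thresholds are illustrative assumptions, not the production implementation:

```python
from dataclasses import dataclass

@dataclass
class Fix:
    t: float          # timestamp, seconds
    speed_mps: float  # GPS-reported speed, metres/second
    dist_m: float     # cumulative distance along the route, metres

def repair_speeds(fixes, min_plausible=0.5, max_plausible=70.0):
    """Replace dropouts (near-zero speeds) and spikes (implausibly high
    speeds) with the average speed implied by the distance covered and
    time elapsed between the surrounding good points."""
    repaired = [f.speed_mps for f in fixes]
    for i in range(1, len(fixes) - 1):
        s = fixes[i].speed_mps
        if min_plausible <= s <= max_plausible:
            continue  # speed looks sensible, keep the measured value
        prev, nxt = fixes[i - 1], fixes[i + 1]
        dt = nxt.t - prev.t
        if dt > 0:
            # distance-over-time estimate bridging the bad point
            repaired[i] = (nxt.dist_m - prev.dist_m) / dt
    return repaired
```

A real implementation would also have to handle runs of consecutive bad points and validate the distance signal itself, which is exactly where the surrounding-data quality question of the next section comes in.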

The importance of Data Quality

The key part of Data Consistency is that it uses the surrounding good data to make best estimates where erroneous speeds occur. However, this process (as well as other processes) can be adversely affected when the surrounding data is also not good, and this is where Data Quality comes in.

Data Quality uses a range of metrics to ascertain the overall quality of the data for a given journey. The idea being that if the quality of the journey data falls below a given threshold, then judgements should not be made on the insights derived from the data.
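The specific metrics and threshold are not disclosed in the article, but the gating idea can be sketched as a weighted combination of per-journey quality metrics compared against a cut-off. The metric names, weights, and threshold below are all hypothetical:

```python
def quality_score(metrics, weights=None):
    """Combine per-journey quality metrics (each scaled so 0 is worst
    and 1 is best) into a single weighted score in [0, 1]."""
    weights = weights or {name: 1.0 for name in metrics}
    total = sum(weights.values())
    return sum(metrics[name] * weights[name] for name in metrics) / total

def should_score(metrics, threshold=0.6):
    """Gate the pipeline: journeys whose quality falls below the
    threshold are flagged as 'too poor to score' rather than being
    passed to the scoring and analytical processes."""
    return quality_score(metrics) >= threshold
```

For example, a journey with good GPS coverage but erratic bearings and implausible speeds would land below the threshold and be held back from scoring.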

Also, we should not expose our scoring and analytical processes to this low-quality data, as it could produce incorrect results, which could have a detrimental impact on the end-user and on the reputation of an insurer.

For an example of this, we can look back at Figure 1. Without the cleaning steps being applied to this data, the blue speed trace would likely suggest that the driver is behaving very erratically, by speeding up and slowing down in a dramatic, and unrealistic, fashion.

Therefore the work we have done around Data Quality is very important as it enables us to:

• Assess whether a journey should be scored / processed, or not

• Identify areas which may be prone to poor GPS

• Identify the possibility of faulty GPS devices

• Discard or handle journeys with GPS issues

• Identify ghost / phantom / ping journeys

• Ultimately be fairer to the end-user by only using accurate data

It will also pave the way for future enhancements to how we process and score data, as well as helping to advance our Digital Education portfolio such as driver coaching and in-app feedback provided to drivers after each journey.

To further showcase the importance of our Data Quality work, below are a couple of examples of driver journeys.

Figure 2: A poor scoring journey


The image in Figure 2 (above) shows a journey that has scored very poorly for Data Quality. The individual data points shown in the image are not aligned to the road, with half of them being over water. Also, the bearings (the green arrows) and the speeds shown are erratic, with extreme changes being observed.

To judge a user’s driving behaviour on this data is both inadvisable and unfair, and therefore we would say that this journey is ‘too poor to score’. On the other hand, the journey shown in Figure 3 (below) is in stark contrast to that in Figure 2.

Figure 3: A snippet of a good scoring journey


Figure 3 shows a section from a journey that has scored much better on Data Quality. The individual points map well to the road, the bearing follows the road nicely, and the speeds follow a smooth pattern. Therefore we would say that this data is a good representation of the real-life journey, and it can be used for processing.

Ensuring the best possible outputs for everyone

GPS sensors are now ubiquitous across many devices, including smartphones, cars, and OBD dongles, but the output from these devices is not always good quality: a device may be faulty, or a driver may pass through an area with weak signal.

Although the vast majority of the data we receive is very good and works well with our processes, we need tools that can automatically handle erroneous data. Data Consistency and Data Quality are tools that allow us to do just that. That is why the work we have undertaken over the last 12 months is so important: it ensures that the insights into driver behaviour which we deliver to our clients and end-users are fair and reliable.
