What is data linkage?
Last time I described administrative/routine data, and how different organisations hold data about you as part of their everyday activities. These organisations might be your GP practice, your local hospital, your local council, your child’s school, the Office for National Statistics or some other organisation.
Sometimes, to answer difficult questions, we need to join data from different organisations. We call this data linkage. Last time I described how your GP data is kept separately from the data held by hospitals. What if there was an increase in the number of hospital admissions due to asthma attacks and we wanted to find out why? We could look at the GP practice data for the region to see whether more people have asthma than before, whether fewer people are having their regular asthma reviews, or whether there is some other plausible reason for the rise. But GP practice data in isolation cannot tell us whether the people who are missing their asthma reviews are the same people who end up in hospital with an asthma attack. To find out whether the two are really connected, we need to link the GP practice data to the hospital data. Then we can see whether the people having asthma attacks are the same people who have not had their asthma reviewed recently by their GP.
This kind of data linkage is relatively easy to do as both your GP and the hospital will have the NHS numbers of all of their patients. The NHS number is unique for every patient, so we can be very confident that the patient in the GP data is the same one as the patient in the hospital data with the same NHS number. Now we can do our analysis and get to the bottom of what is going on.
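For readers who like to see things in code, here is a minimal sketch of that kind of linkage in Python. The records, the field names and the function `link_on_nhs_number` are all made up for illustration; real GP and hospital systems will look quite different.

```python
# Made-up example records. The field names are assumptions, not a real schema.
gp_records = [
    {"nhs_number": "9000000001", "last_asthma_review": "2019-03-01"},
    {"nhs_number": "9000000002", "last_asthma_review": "2023-11-15"},
    {"nhs_number": "9000000003", "last_asthma_review": "2023-09-30"},
]
hospital_admissions = [
    {"nhs_number": "9000000001", "reason": "asthma attack"},
    {"nhs_number": "9000000004", "reason": "asthma attack"},
]


def link_on_nhs_number(gp, hospital):
    """Join each hospital admission to the GP record with the same NHS number."""
    # Index the GP records by NHS number so each lookup is a single step.
    gp_by_number = {record["nhs_number"]: record for record in gp}
    linked = []
    for admission in hospital:
        match = gp_by_number.get(admission["nhs_number"])
        if match is not None:
            # Combine the fields from both sources into one linked record.
            linked.append({**match, **admission})
    return linked


linked = link_on_nhs_number(gp_records, hospital_admissions)
for row in linked:
    print(row)
```

Only the first patient appears in both data sets, so only that patient ends up in the linked data. The admission with no matching GP record is simply left out in this sketch; a real analysis would need to decide what to do with unmatched records.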
What about data sets that don’t have the NHS number? For instance, what if a school had noticed that more children were missing school due to respiratory issues? How could we find out what is happening? Perhaps we would want to take the school attendance records and match the pupils to their GP records to see if we could get a clearer picture of what is going on. This is more difficult, since the school data will not have the NHS number to link on, so we have to use other identifiers instead. We could start with name and address: if they match in both data sets they are probably the same person, right? But what if a child has the same name as one of their parents? They might get linked to their parent’s GP record. So let’s use name, address and date of birth together, which should give us a combination that is very likely to be unique to one person. In this way we can say with a very high probability that we are linking the same people together.
We don’t have to stop there. Perhaps we want to link in the local air quality measurements, or how good the pupils’ houses are at keeping the heat in. There are mechanisms to link together all sorts of data to build up a more complete picture of complex trends. Different data sets have different identifiers we can use for the linkage, so whether and how linkage is done depends very much on the organisations involved and the data available.
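Here is a rough sketch of the name, address and date of birth idea in Python. Everything in it is an assumption for illustration: the records are invented, and the hypothetical helper `link_key` only tidies up capital letters, commas and spacing, whereas real linkage has to cope with misspellings, house moves and much more.

```python
def link_key(record):
    """Build a matching key from name, address and date of birth."""
    # Lower-case and squeeze spaces so trivial formatting differences don't matter.
    name = " ".join(record["name"].lower().split())
    address = " ".join(record["address"].lower().replace(",", " ").split())
    return (name, address, record["date_of_birth"])


# Made-up records: a pupil and a parent who share a name and an address.
school_records = [
    {"name": "Sam  Jones", "address": "1 High Street, Anytown",
     "date_of_birth": "2012-05-04", "days_absent": 9},
]
gp_records = [
    {"name": "Sam Jones", "address": "1 high street anytown",
     "date_of_birth": "1980-01-20", "has_asthma": False},
    {"name": "sam jones", "address": "1 High Street, Anytown",
     "date_of_birth": "2012-05-04", "has_asthma": True},
]

# Name and address alone cannot tell the parent and the child apart.
same_name_and_address = link_key(gp_records[0])[:2] == link_key(gp_records[1])[:2]

# Adding date of birth to the key separates them.
gp_by_key = {link_key(record): record for record in gp_records}
matches = [
    (pupil, gp_by_key[link_key(pupil)])
    for pupil in school_records
    if link_key(pupil) in gp_by_key
]
```

With date of birth included, the pupil is matched to the child’s GP record rather than the parent’s.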
I should emphasise that just because it is technically possible to link data sets together it does not mean that it is always legally or ethically possible. Often data is not allowed to be linked to other data sets as that was not agreed at the time the data was collected. It doesn’t matter how compelling the reason to do the linkage is, without the relevant informed consent it usually cannot happen. This is the kind of information covered in those privacy policies that organisations point you towards when you first start to interact with them.
Another important point to emphasise is what happens to the identifiable information used to do the linkage. If I am the person trying to use linked data, I don’t have any real need to see the identifiable information. What usually happens is that the organisations which already hold the data (or a trusted third party) use it to do the linkage and then delete it, so that I can see the linked data but have no idea who the individuals in it are. This is one of the many steps we take to make sure data is used in a responsible way.
We’ve covered a lot about data in this blog series so far. Next time we will start to cover what we actually do with it.
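As a very simplified sketch of that step, the Python below imagines a trusted third party linking on NHS number and then swapping it for a meaningless study ID before handing the data over. The function `link_and_deidentify`, the `study_id` field and the records are all hypothetical, and real services involve far more safeguards than this.

```python
import itertools


def link_and_deidentify(gp, hospital):
    """Link on NHS number, then replace it with a meaningless study ID."""
    gp_by_number = {record["nhs_number"]: record for record in gp}
    next_id = itertools.count(1)
    # This lookup table lives only inside the function, so the researcher never sees it.
    study_ids = {}
    output = []
    for admission in hospital:
        nhs_number = admission["nhs_number"]
        match = gp_by_number.get(nhs_number)
        if match is None:
            continue
        # The same person always gets the same study ID.
        if nhs_number not in study_ids:
            study_ids[nhs_number] = next(next_id)
        combined = {**match, **admission}
        del combined["nhs_number"]  # remove the identifier before release
        combined["study_id"] = study_ids[nhs_number]
        output.append(combined)
    return output


# Made-up example records. The field names are assumptions for illustration.
gp_records = [
    {"nhs_number": "9000000001", "last_asthma_review": "2019-03-01"},
    {"nhs_number": "9000000002", "last_asthma_review": "2023-11-15"},
]
hospital_admissions = [
    {"nhs_number": "9000000001", "reason": "asthma attack"},
    {"nhs_number": "9000000001", "reason": "asthma attack"},
]

research_data = link_and_deidentify(gp_records, hospital_admissions)
```

The researcher can still tell that both admissions belong to the same person, because they share a study ID, but the NHS number itself never reaches them.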
Dr Olly Butters, Care and Health Informatics theme