Frequently asked questions

Background

CO-CONNECT is a research project to streamline the process for scientists across the UK to find and request access to the data they need to understand COVID-19 and develop potential therapies and treatments, all whilst keeping patient data confidential. It is funded by the Medical Research Council (part of UKRI) and the Department of Health and Social Care (part of NIHR) in direct response to the pandemic.

The project is supported by the CO-CONNECT team based at the Universities of Dundee, Edinburgh, and Nottingham (lead institution). The team provides supporting capabilities in data handling, data curation to OMOP, data governance and data infrastructure. The team never have access to identifiable or pseudonymous patient data. 

Across the UK, each time someone takes part in research, visits their GP or hospital, or has a blood test, data from these events is securely held by the organisation that collected it. However, these data (Data Sources) are held by different organisations (Data Partners), in different formats and in different secure environments. These datasets can also be found and accessed in different ways. The lack of standardised approaches and the fragmented landscape have made it challenging for public health agencies and researchers to find the data they need, navigate the permissions to access it, and interrogate it to inform the pandemic response at pace. In collaboration with Health Data Research UK, CO-CONNECT is streamlining processes across the three stages of a research project.

  • The first stage of a research project is to find the data available to answer a specific research question.
  • In the second stage, if researchers find datasets they wish to analyse, they must obtain formal approvals from every data partner to carry out their research on the data. 
  • The third stage is for researchers to analyse data. CO-CONNECT is supporting two methods for researchers to analyse data.

University of Nottingham is the contractual lead with the Universities of Dundee and Edinburgh and UK HSA as co-leads of the programme. 

CO-CONNECT are collaborating with a range of organisations who host and manage data. They are the legal data controllers of the data. We term them CO-CONNECT “Data Partners”. A Data Source is a specific data set hosted and managed by the Data Partner.  

Data Partners: organisations that have been funded and contracted through CO-CONNECT and that hold data of relevance to the CO-CONNECT project:

  • Imperial College London
  • King's College London
  • NHS Digital
  • NHS GOSH
  • Northern Ireland Department of Health & Social Care
  • Office for National Statistics
  • Public Health Scotland
  • Queen Mary University of London
  • UK Health Security Agency (formerly Public Health England)
  • University College London
  • University of Bristol
  • University of Cambridge
  • University of Dundee
  • University of Edinburgh
  • University of Liverpool
  • University of Leicester
  • University of Nottingham
  • University of Oxford
  • University of Swansea
  • Health Data Research UK: a national institute whose mission is to unite the UK’s health and care data to enable discoveries that improve people’s lives. CO-CONNECT are working in partnership with HDR UK to integrate the developed tools into the HDR Innovation Gateway.
  • BC Platforms: a global company with software and teams to support the discovery and use of biomedical data in clinical trials and research. CO-CONNECT partnership with BC Platforms is as a software supplier only. The software supplied by BC Platforms is run and controlled by HDR UK.

Engagement: Disseminate the project to the UK research community with the aim of promoting the use of the platform.

  • University College London – Communication Lead
  • University of Edinburgh – PPIE Lead

Data Discovery can have multiple definitions. For CO-CONNECT, it is a process that helps users understand whether a dataset is suitable for their use by asking questions about whether the dataset contains certain data.

Cohort discovery is a form of data discovery. It allows a user to specify a cohort they are interested in finding, such as ‘Females with Asthma’. The system can then inform the user which datasets contain individuals that meet their cohort definition. The approach does not reveal individual data but tells the user the scale of the cohort available (e.g. there are 100 people across five datasets that meet your definition). Another term for it is study feasibility.
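As a rough illustration of the idea, the sketch below counts how many people match a cohort definition in each of several small, made-up datasets and reports only the totals. The dataset names, field names and records are hypothetical and are not taken from any CO-CONNECT Data Partner or tool.

```python
# Hypothetical toy datasets: each record describes one person (no real data).
DATASETS = {
    "dataset_a": [
        {"sex": "F", "conditions": {"asthma"}},
        {"sex": "M", "conditions": {"asthma"}},
        {"sex": "F", "conditions": set()},
    ],
    "dataset_b": [
        {"sex": "F", "conditions": {"asthma", "diabetes"}},
        {"sex": "F", "conditions": {"asthma"}},
    ],
}


def matches_cohort(person, sex, condition):
    """Return True if a person meets the cohort definition."""
    return person["sex"] == sex and condition in person["conditions"]


def cohort_counts(datasets, sex, condition):
    """Count matching people per dataset; only counts are returned, never rows."""
    return {
        name: sum(matches_cohort(p, sex, condition) for p in people)
        for name, people in datasets.items()
    }


counts = cohort_counts(DATASETS, sex="F", condition="asthma")
print(counts)                # per-dataset counts
print(sum(counts.values()))  # scale of the cohort across all datasets
```

The user learns which datasets contain matching individuals and roughly how many, without seeing any individual record.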

The Cohort Discovery Search Tool is a service run by Health Data Research UK that has implemented the tools created in CO-CONNECT and offers researchers a web portal they can log into and run cohort discovery questions. 

A trusted research environment (TRE) is a secure environment in which researchers who have permission can analyse the data. TREs work on the basis that data should not be sent to researchers; instead, researchers securely log in to an environment where they can analyse the data. The data cannot leave the environment. All that can be exported is aggregate level anonymous data, such as a table or graph summarising the results. There are no accepted definitions for a TRE and many are run differently. The Goldacre review has sought to set out a vision for TREs going forward.

No. All the tools that CO-CONNECT are building are open source, including the tools that the Data Partners use to convert the data and the tool used by the data team to generate mapping rules from metadata. CO-CONNECT have also created our own open source version of BC|Link, called HUTCH, which implements the query protocol.

The software from BC Platforms is not open source. BC Platforms have made available their query protocol (what defines the queries that can be run). 

CO-CONNECT

CO-CONNECT is researching standard methods to respond to these challenges. Standards ensure consistency and interoperability across systems and processes. They also ensure data is handled consistently across multiple organisations. CO-CONNECT is researching standardisation across different processes and, in doing so, generating enhanced assurance that the privacy of personal data can be maintained. CO-CONNECT have sought to develop software tools that standardise the following:

  1. Representation of Covid-19 serology results 
  2. The process to convert data to a single standard 
  3. The process to answer cohort availability questions
  4. The process of generating a pseudo-anonymised ID
  5. The process and development of analytics across Data Sources, without the data moving (federated analytics)
  6. The process of analysing data in a TRE when data is required to be in a single location
  7. The methods to include Patient and Public Involvement in technical infrastructure projects

The standardisation processes researched in CO-CONNECT are underpinned by the core concept that the individual data should never leave the control of the Data Partner. All processes have been developed so the Data Partner retains full control and ownership of data with the Data Partner enabled to standardise their processes by software provided by CO-CONNECT.

There is no single supplier of tests that measure an antibody response to COVID-19. Therefore, whilst there are reliable results for whether or not someone has had an immune response, it is much harder to compare the quantitative result between studies. This is because different suppliers measure the response in different ways. CO-CONNECT has been working on standards that help describe how a result was obtained in order to make it easier to compare results between suppliers.

When converting data to a common standard, most techniques and tools assume that the team working out how to convert the data will also have access to the data. We wanted to ensure there was a consistent way to convert the data by having a central team, but we did not want this team to have access to the data, in order to maintain the confidentiality of the data hosted by each Data Partner. Therefore, we needed to research a new way of undertaking data conversion where the team would never have access to the data.

Data Partners run an open source tool called WhiteRabbit on a pseudonymised version of their data. This generates a report containing metadata about the tables, fields and values. The Data Partner always retains control of what data WhiteRabbit can access and how its parameters are configured. Once the Data Partner has checked that the report contains only metadata (no identifiable, row level data), they share it with the CO-CONNECT technical team. The technical team use the metadata to develop a set of rules to apply to the data and share the rules back with the Data Partner. The Data Partner then runs our software, which applies the rules and outputs data in a standard format, and checks that the data has been transformed appropriately.
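The workflow can be pictured with a small sketch: rules are written from metadata alone (field names and the distinct values seen in them), and the Data Partner applies them locally. The rule format, field names and target codes below are invented for illustration; they are not the actual CO-CONNECT rule format or real OMOP concepts.

```python
# Hypothetical mapping rules, written by a central team that has only seen
# metadata (field names and distinct values), never the underlying rows.
RULES = {
    "gender": {
        "target_field": "gender_code",
        "value_map": {"Female": "F", "Male": "M"},  # illustrative codes only
    },
    "pcr_result": {
        "target_field": "test_outcome",
        "value_map": {"POS": "positive", "NEG": "negative"},
    },
}


def apply_rules(row, rules):
    """Transform one source row into the standard format using the rules.

    Fields without a rule are dropped. Values without a mapping become None
    so the Data Partner can spot them when checking the output.
    """
    out = {}
    for source_field, rule in rules.items():
        if source_field in row:
            out[rule["target_field"]] = rule["value_map"].get(row[source_field])
    return out


# Run by the Data Partner, inside their own environment, on their own data.
source_row = {"gender": "Female", "pcr_result": "POS", "free_text": "n/a"}
print(apply_rules(source_row, RULES))
```

Because the rules refer only to field names and value labels, the team writing them never needs to see a row of data.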

CO-CONNECT is researching methods to enable researchers to determine how many people meet their research criteria within the various datasets across the UK using the Cohort Discovery Search Tool embedded within one website, the Health Data Research Innovation Gateway.

Within each Data Partner, a secure computer is set up which is separate from where identifiable data is stored, but still within the Data Partner’s secure environment. Staff within each data partner create a copy of relevant data – with anything they deem to be sensitive or identifiable removed. This is known as pseudonymous data. For example, information like names, addresses, and specific dates of birth, dates of testing or care are removed, and identifiers are converted into new pseudonymous codes. The Data Partner then transfers this pseudonymous data on to the secure computer. 

In our example, all that the Data Partner transfers about Jane Doe is the pseudonymous code, the fact that this person had a positive PCR test in the second week of August 2020, and their year of birth.
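A minimal sketch of this kind of pseudonymisation step follows. It assumes a keyed hash (HMAC-SHA256) with a secret key held only by the Data Partner; the source does not say how Data Partners actually generate their pseudonymous codes, and the field names and the record are made up.

```python
import hashlib
import hmac
from datetime import date

# Assumption: the Data Partner keeps this secret key and never shares it.
SECRET_KEY = b"example-key-held-only-by-the-data-partner"


def pseudonymous_code(identifier: str) -> str:
    """Derive a stable code from an identifier using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pseudonymise(record: dict) -> dict:
    """Drop direct identifiers and coarsen dates, keeping only what is needed."""
    test_date = record["test_date"]
    return {
        "person_code": pseudonymous_code(record["nhs_number"]),
        "year_of_birth": record["date_of_birth"].year,
        "test_result": record["test_result"],
        "test_year": test_date.year,
        "test_month": test_date.month,
        # Week of the month (1 to 5) rather than the exact day.
        "test_week_of_month": (test_date.day - 1) // 7 + 1,
    }


# Entirely made-up record for illustration.
identifiable = {
    "name": "Jane Doe",
    "nhs_number": "0000000000",
    "date_of_birth": date(1985, 3, 14),
    "test_date": date(2020, 8, 11),
    "test_result": "positive",
}
print(pseudonymise(identifiable))
```

The name and exact dates do not appear in the output; only the code, the year of birth and the coarsened test date are carried forward.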

Software within the secure environment of the Data Partner will send a message out to the Gateway, which will return any questions that need to be run on the data. An example question could be “How many people in the dataset have had a positive PCR test and were under the age of 40?”

The software will return a summary answer to each question which shows the number of people who meet the criteria, helping researchers discover the most useful data.  
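To make the example question concrete, here is a small sketch of how such a count could be computed over pseudonymised rows. The rows and field names are hypothetical, and because only the year of birth is held, age is approximated from a reference year, which is an assumption of this sketch.

```python
# Hypothetical pseudonymised rows held on the Data Partner's secure computer.
ROWS = [
    {"person_code": "a1", "year_of_birth": 1990, "pcr_result": "positive"},
    {"person_code": "b2", "year_of_birth": 1975, "pcr_result": "positive"},
    {"person_code": "c3", "year_of_birth": 1995, "pcr_result": "negative"},
    {"person_code": "d4", "year_of_birth": 2001, "pcr_result": "positive"},
]


def count_positive_under_age(rows, max_age, reference_year):
    """Count people with a positive PCR test who are younger than max_age.

    Only the year of birth is available, so age is approximated as
    reference_year minus year_of_birth.
    """
    return sum(
        1
        for row in rows
        if row["pcr_result"] == "positive"
        and reference_year - row["year_of_birth"] < max_age
    )


# Only this single number leaves the Data Partner's environment.
print(count_positive_under_age(ROWS, max_age=40, reference_year=2020))
```

Each Data Partner can additionally apply low number suppression and rounding before a count like this is returned.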

CO-CONNECT is researching a new capability in collaboration with data partners to provide data in a way that enables researchers to understand which data is from the same individual across different data partners – without compromising patient identity. 

For example, if successful, COVID-19 results from a test centre could then be linked to hospital records and pharmacies’ prescriptions. By developing methods to link these data, researchers can assess if patients with different existing health conditions are more or less susceptible to COVID-19, impacting healthcare treatments.     

This work continues to be researched and is not yet implemented or agreed between Data Partners. 

In collaboration with Data Partners, CO-CONNECT is researching federated analysis. This is where the data remains within the Data Partner's secure environment, but questions about the data are sent through the Health Data Research Innovation Gateway website and summary results are returned. Building upon the method for finding data with the Cohort Discovery Search Tool, as described in the CO-CONNECT: Finding Data video, this method would provide more complex trend analysis rather than simple counts. Data Partners must approve all federated analysis research projects before any analysis can take place.

For example, the question could be “How does the number of people with a positive PCR test change with age?” The trend returned would be the number of people with a positive PCR test, grouped by age range.
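The difference from a simple count can be sketched as follows: instead of one number, the Data Partner returns a small table of counts per age band. The rows, the ten-year band width and the reference year are assumptions made for this illustration only.

```python
from collections import Counter

# Hypothetical pseudonymised rows that stay inside the Data Partner's environment.
ROWS = [
    {"year_of_birth": 1990, "pcr_result": "positive"},
    {"year_of_birth": 1985, "pcr_result": "positive"},
    {"year_of_birth": 1975, "pcr_result": "positive"},
    {"year_of_birth": 1995, "pcr_result": "negative"},
    {"year_of_birth": 2001, "pcr_result": "positive"},
]


def positives_by_age_band(rows, reference_year, width=10):
    """Count positive PCR tests per age band, e.g. '30-39'.

    Age is approximated as reference_year minus year_of_birth.
    """
    bands = Counter()
    for row in rows:
        if row["pcr_result"] == "positive":
            age = reference_year - row["year_of_birth"]
            bands[(age // width) * width] += 1
    # Sort by the numeric lower bound, then format the band labels.
    return {f"{lo}-{lo + width - 1}": n for lo, n in sorted(bands.items())}


# Only this aggregate table would be returned, never the rows themselves.
print(positives_by_age_band(ROWS, reference_year=2020))
```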

In collaboration with Data Partners, CO-CONNECT are researching a new capability to speed up the process by which Data Partners generate a subset of data for research analysis, using a semi-automated method. This method helps Data Partners to create the subset of their data based on the cohort definition used with the Cohort Discovery Search Tool. For example, “People who had a positive PCR test and were under the age of 40”.

The generated subsets of data from across different Data Partners that have given permission would be placed into what is called a Trusted Research Environment. Trusted Research Environments, also known as TREs, are secure IT systems that researchers can remotely connect to and ask research questions on pseudonymised data. The data cannot be copied or removed from the Trusted Research Environment and researchers can only export answers such as a graph.  

The copying of a pseudonymous subset of data relevant to a specific research project into a single TRE only takes place once researchers have been granted permission by each Data Partner and all Data Partners agree which single TRE to utilise.

CO-CONNECT Data Governance

Pseudonymisation is a process of removing directly identifiable data from a dataset (such as a name or NHS number) and replacing it with a code. A dataset that has been pseudonymised presents less risk as the ability to identify an individual has been reduced, although not removed. This is common practice in healthcare research as it reduces the risk but can still facilitate updating of data over time and linking data between datasets. Pseudonymised data is still considered identifiable if the link between the original data and the code created is held by the organisation.

No. The UK has existing TREs. The CO-CONNECT team believe that TREs should be used at any time the individual data is being brought together for analysis. CO-CONNECT support the principles behind the recent Goldacre review.

No. All pseudo-anonymisation and data linkage is undertaken by the Data Partner. CO-CONNECT is working with the Data Partners to bring in standard mechanisms for how Data Partners create a pseudo-anonymised identifier to make it possible to link data together in a Trusted Research Environment. This work is an active part of research to understand if it would be possible. Data is only moved to a TRE and/or linked under the direct authority and action of the Data Partner. CO-CONNECT is not linking data.

No. Data always remains with the Data Partner and always within their existing infrastructure. This means if the data currently resides within a TRE, it will always stay within the TRE. 

Yes, however, it is the Data Partners, and only the Data Partners, who undertake any processing of pseudo-anonymised data under their existing legal basis for holding the data. All processing is undertaken on the infrastructure under the control of the Data Partner and is not sent to a third party for any processing.

The datasets across CO-CONNECT have data that can be linked to other healthcare records. Retaining this capability is important as researchers understand more about the disease. As has been shown, pre-existing health conditions, ethnicity, gender and vaccination status all can impact how researchers should interpret the immune response to COVID-19.  Pseudonymisation allows the Data Partners to reduce the risk of identification whilst still maintaining the possibility to link to other data. Data is also continuously being collected for some Data Partners, therefore, there is a clear desire to keep updating the datasets as new information becomes available. The decision to link or add data and the permission to link or add data to other datasets is maintained by each Data Partner. 

This will vary between each Data Partner. Some studies are consented research studies, which include the collection of samples. Some others are from data collected as part of healthcare delivery and are available via national Trusted Research Environments. What CO-CONNECT have sought to standardise is the mechanism by which anonymised data is produced from the pseudo-anonymised data to assist researchers in discovering if a dataset is of use.

Even in datasets that have been pseudonymised, there may be a combination of data points that only a small number of individuals share, which could allow them to be identified. To prevent this, low number suppression can be used. A threshold is set such that if any query would return a result below the threshold, a result of zero is returned instead of the real result.
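A minimal sketch of low number suppression, following the rule described above (a result below the threshold is reported as zero). The function name and the example threshold are illustrative; each Data Partner chooses its own threshold.

```python
def suppress_low_count(count: int, threshold: int) -> int:
    """Return 0 instead of the real count when it falls below the threshold."""
    return 0 if count < threshold else count


# Example with a hypothetical threshold of 10.
for real_count in (3, 10, 25):
    print(real_count, "->", suppress_low_count(real_count, threshold=10))
```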

Under GDPR, the data released from each Data Partner is considered to be anonymised data. This can include metadata, including technical descriptive information about the dataset such as the fields, columns, tables and structure of the database. It also includes the response to a cohort discovery question, which is a count. Each Data Partner can also apply rounding and low number suppression to each result.

To protect against an individual being identified by a unique set of characteristics, each Data Partner must set a minimum limit below which no results are returned. For most, this means only results with more than 10 people are returned. Also, to protect against someone asking multiple questions and subtracting counts to identify someone, each Data Partner can enable rounding on results, which ensures results are never exact but are still sufficiently precise to allow the researcher to know the scale of the dataset.
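The effect of rounding on the "subtracting counts" problem can be shown with a short sketch. It assumes rounding to the nearest multiple of 10 (so that 18 becomes 20); the actual rounding settings are chosen by each Data Partner, and the counts below are made up.

```python
def round_count(count: int, base: int = 10) -> int:
    """Round a non-negative count to the nearest multiple of base (halves round up)."""
    return ((count + base // 2) // base) * base


# Two made-up queries whose true answers differ by exactly one person.
all_matches = 108
all_but_one = 107

print(all_matches - all_but_one)                            # exact counts reveal a difference of 1
print(round_count(all_matches) - round_count(all_but_one))  # rounded counts do not
```

With rounding enabled, any difference between two reported counts is a multiple of the rounding base, so it cannot single out one person, while the researcher still learns the approximate scale.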

No. CO-CONNECT have made no arrangements to sell or cost recover data. All decisions on commercial access to data are retained by each Data Partner, outside of and separate to CO-CONNECT.

Every Data Partner has assessed the data governance risk of taking part in CO-CONNECT. The process used varies based on the data held. As a minimum, every organisation has undertaken a data protection risk assessment. Many have undertaken a full data protection impact assessment given the scale of data they hold. Security risk assessments have been undertaken on the software, especially by the Trusted Research Environments, to ensure the software does not undermine their security and is consistent with their normal processes for assessing software.

For each dataset, the Data Partners have evaluated the appropriate low number suppression threshold to set and whether rounding on results should be applied (where a result of 18 would be reported as 20). Data Partners have also been asked not to supply exact dates, but to convert them to the first of the month for events (such as a diagnosis) and to a year for dates of birth.
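The date conversions described above are simple to express in code. The function names below are hypothetical, and the sketch only shows the coarsening itself, not how any particular Data Partner implements it.

```python
from datetime import date


def event_date_to_month(event_date: date) -> date:
    """Replace an exact event date with the first day of the same month."""
    return event_date.replace(day=1)


def birth_date_to_year(date_of_birth: date) -> int:
    """Keep only the year from a date of birth."""
    return date_of_birth.year


print(event_date_to_month(date(2020, 8, 11)))  # a diagnosis on 11 August 2020
print(birth_date_to_year(date(1985, 3, 14)))
```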

HDR UK Cohort Discovery tool

Health Data Research (HDR) UK are collaborators and delivery partners on the CO-CONNECT research project. CO-CONNECT has collaborated with HDR UK to scale and provide the capabilities to the wide userbase of HDR. HDR UK does not have any access to identifiable or pseudonymised patient data. 

Servers hosted in the United Kingdom. 

Researchers who have been approved by HDR UK to log in. All users must agree to terms and conditions of use. There are two ways to log in. The first is via OpenAthens, a UK-wide identity federation that allows anyone at a recognised institution to log in to any service under the federation. These users log in with their institutional account and, on sign up, are required to agree to the HDR UK Gateway T&Cs. The second is via a LinkedIn or Google login. These users can request access, but the request redirects them to a ticketing system that asks for extra information about them and performs an email verification check. This information is then manually verified to confirm they are a bona fide researcher before they are given access. HDR UK plan to review these types of accounts every six months to ensure they still have a valid reason to access the query portal.

No. Each Data Partner decides the data and fields to include as part of their own governance and risk assessments. 

No. Each Data Partner creates the data access process for researchers to use. This is to ensure that all access requests go through the normal data governance process at each Data Partner.  

No. Researchers can establish if data exists that is suitable for their research. If the researcher wants access to the data to undertake research, they must go through the access procedures for each Data Partner. 

Software from BC Platforms

No. All access to the data is controlled by the Data Partner. BC Platforms do not have access to the computers or infrastructure in which individual data is stored. Technical precautions have also been made so that only appropriate staff from each Data Partner can connect to the computers that host the data. 

BC|RQuest software is available to all authorised researchers and allows them to create their query and view the responses to it. This software has been provided to CO-CONNECT ‘white labelled’, meaning there is no BC Platforms branding and the software has been rebranded as the HDR Cohort Discovery tool. BC Platforms help maintain the software (such as upgrades) but the server that runs the software is under the control of Health Data Research UK and hosted in the United Kingdom.

The most crucial part of the process is the software that responds to a query, under the authority of the Data Partner, as it runs against a pseudo-anonymised dataset and produces the anonymised aggregated count. Instead of designing that process, security and accreditation from scratch, CO-CONNECT chose a reputable company with significant experience, a track record and accredited processes.

BC|RQuest software, hosted by HDR UK, holds data on the users, their rights to run queries across the different Data Partners and the queries they requested to be run. For each query the response is also recorded, which is a count, and a gender distribution and age range if the Data Partner has allowed that data to be returned.

No. The communication protocol is that BC|Link connects to BC|RQuest to download any queries and respond. The Data Partner configures which BC|RQuest instances BC|Link connects to and downloads queries from. No inbound firewall rules are required, meaning no connection from outside the Data Partner's network to BC|Link is possible.
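The outbound-only pattern can be sketched with two small in-memory classes. These are stand-ins written for this illustration: the class names, methods and query format are hypothetical and do not represent the BC|Link or BC|RQuest interfaces or the published query protocol.

```python
class QueryPortal:
    """In-memory stand-in for the central query service (hypothetical API)."""

    def __init__(self):
        self.pending = []  # queries waiting to be collected
        self.results = {}  # query id -> aggregate count

    def submit(self, query_id, field, value):
        self.pending.append({"id": query_id, "field": field, "value": value})

    def fetch_pending(self):
        """Called BY the agent; the portal never calls into the Data Partner."""
        queries, self.pending = self.pending, []
        return queries

    def post_result(self, query_id, count):
        self.results[query_id] = count


class LocalAgent:
    """Stand-in for software running behind the Data Partner's firewall."""

    def __init__(self, portal, rows):
        self.portal = portal
        self.rows = rows  # pseudonymised rows; these never leave the agent

    def poll_once(self):
        """Connect outwards, answer each query locally, send back only counts."""
        for query in self.portal.fetch_pending():
            count = sum(
                1 for row in self.rows if row.get(query["field"]) == query["value"]
            )
            self.portal.post_result(query["id"], count)


portal = QueryPortal()
agent = LocalAgent(
    portal,
    rows=[{"pcr": "positive"}, {"pcr": "negative"}, {"pcr": "positive"}],
)
portal.submit("q1", field="pcr", value="positive")
agent.poll_once()
print(portal.results)
```

Every interaction is started by the agent, which is why no inbound connection to the Data Partner's network is needed in this design.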

No. The software is installed on a secure machine within the infrastructure of the Data Partner. No connections are allowed to the BC|Link software from any outside provider or third party. The only parties to have access to the BC|Link software are the Data Partners, who control and manage all access.

It is a core data protection concept to minimise data. Each Data Partner is free to choose what data should be made discoverable via the HDR Cohort Discovery tool. It also ensures this software can be isolated from the wider network, so it only stores what the Data Partner decides is relevant. BC|Link can also only connect to outside servers or other internal systems that the Data Partner deems to be necessary. The software is always stored in the environment of the Data Partner and is always under their supervision. 

BC|Link software, which holds the pseudo-anonymised version of the data the Data Partner wishes to make discoverable, is installed behind the firewall of each Data Partner. BC|Link enables the Data Partner to answer the queries stored on the HDR Cohort Discovery platform by systematically applying a standardised mechanism to anonymise the data: it runs queries that return a simple count and/or distribution in answer to a question. In the case of TREs, this software is installed within the secure environment of the TRE.

Yes. All components undergo rigorous testing within BC Platforms. In addition, their testing programme has been presented to and evaluated by the CO-CONNECT team. Their software platforms have undergone penetration testing at multiple locations, as well as security assessments.

BC Platforms have robust ISO accredited Quality Assurance and Information Security practices.  External security testing is documented and reviewed. 

Yes. All queries requested by each user and the response are recorded both in BC|RQuest software and in every BC|Link instance.