Meta Data Standardisation

Overview

The CO-CONNECT project requires all data partners to convert their data to the Observational Medical Outcomes Partnership (OMOP) format.

For security reasons, the CO-CONNECT team will not have access to any of the Data Partner’s raw data; instead, each Data Partner will share a standardised set of metadata files with the CO-CONNECT team who will use these files to define bespoke transformation and mapping rules.

"The tools will never have access to the primary dataset or any identifiable data"

Data Preparation

The CO-CONNECT team will advise each Data Partner about which datasets are most relevant to include, but the final decision lies with the Data Partner. It is important that the data is prepared in the right way and the CO-CONNECT team are ready to provide support with this. The set of CSV files should have:

- One CSV file that represents the demographic data of individuals. This should contain the anonymised ID for the individual (following the CO-CONNECT specification), Sex and Date of Birth (obfuscated as required) as a minimum, and Ethnicity where possible. In cases of multiple demographic records per individual, please include only the most recent record in the demographics file.

- One or more CSV files that contain event data, such as questionnaire responses or clinical events/measurements. As a minimum each CSV file must have a column that contains the anonymised ID for the individual (matching an entry in the CSV file above), the date of a particular event (one per file) and one or more columns that capture the data relevant to that date. You can have as many CSV files as needed, but each CSV file must only contain information effective on the specified event date.

- All measurements in the metric system.

- All dates and datetimes in the ISO-8601 format: YYYY-MM-DD HH:MM:SS.ffffff

- All numbers without any digit grouping symbols (e.g. 1000, not 1,000).

- All decimal numbers rounded to two decimal places after the decimal symbol “.”.

- All CSV files encoded with UTF-8 and Unix/Linux line ending.

All CSV file names limited to 30 characters.

Data Partners will then upload these CSV files into the virtual machine that hosts the BC|Link software that will have the mapping rules applied and the OMOP data created. These files must not contain any identifiable information and should have all data privacy steps applied to them before proceeding.

Generating Metadata

Metadata allows the CO-CONNECT team to understand the structural format of the CSV files and the range of values present. Ideally, each Data Partner should use a data profiling tool called WhiteRabbit to generate the metadata on the extracted files. This is so that the metadata is extracted in a standard format. This tool can be connected directly to most databases but if this causes security concerns then it can also be pointed towards CSV files. It can be ran on a completely locked-down machine with no internet access – this should alleviate security concerns. For Data Partners who are not comfortable using this tool even with such precautions, the CO-CONNECT team can provide a template to manually complete.

The WhiteRabbit Scan Report contains information about the fields within each CSV file and frequency distributions of the values within these fields, but it will not be able to complete the field description. Therefore, for each field (on the Field Overview tab) please complete the description.

WhiteRabbit contains an option to automatically remove values with frequency values under a certain threshold, for instance if a dataset contains a field called “condition”, and there are only 2 patients with a very rare condition, this can cause security concerns. Data Partners should set the “Minimum Cell Count” threshold in WhiteRabbit to 5 to conform to CO-CONNECT standards.

Removing sensitive data

Please remove any data values which could be deemed as confidential or sensitive from the scan report, such as the anonymised ID of the individual, date of birth, and their frequencies. When removing these values and frequencies from the Scan Report, please refrain from removing the entire column from the spreadsheet. For instance, please maintain the headers and remove the values, as in the example below:

Data Dictionary

Whilst the WhiteRabbit Scan Report provides a good understanding of a Data Partner’s data, a description of the codes used in each dataset is still required. Examples of common codes are M and F, or 0 and 1 for a sex or gender question. Please provide this information in one CSV file, with the following columns: csv_file_name, field_name, code, and value.

Below is an example of the expected content:

csv_file_name	field_name	code	value
questionnaire	gender	0	Male
questionnaire	gender	1	Female
tab10	q1	0	Fever
tab10	q1	1	Cough
tab11	q2	0	Yes
tab11	q2	1	No
tab11	q3	SNOMED

If a recognised standard code system is used in the dataset, please enter the name of the coding system in the code column. In the example above, the responses to question q3 are coded using SNOMED. The CO-CONNECT team can assist Data Partners who are unsure which coding systems present in their datasets.

“For further documentation including how to run the ETL tool on your datasets once mapping is completed, see our docs on github