
Core Bioinformatics group


Data acquisition, traceability and curation

We operate a strict policy of experimental record-keeping: each experiment has a traceable identifier that is linked to the unique anonymised identifiers of the primary human samples we store and use. Experiments are recorded both in lab books and in electronic format.

We keep databases for i) the reception, storage, experimental use and destruction (following consent removal) of all primary samples; ii) all experiments performed; iii) primary data storage locations. This system is compliant with Human Tissue Act regulations.

To ensure reproducibility, all datasets will be clearly annotated with standardised metadata (isolation procedures, processing procedures, anonymised clinical data where relevant).
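As an illustration only, a completeness check on such annotations could look like the sketch below. The field names in `REQUIRED_FIELDS` and the example record are hypothetical; the policy names only the categories of metadata, not a schema.

```python
# Hypothetical required fields; the policy lists categories, not exact names.
REQUIRED_FIELDS = ("sample_id", "isolation_procedure", "processing_procedure")


def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Made-up example record: one required field has been left empty.
example = {
    "sample_id": "P0001_control",
    "isolation_procedure": "FACS",
    "processing_procedure": "",
    "clinical_data": None,  # optional: only filled in where relevant
}
print(missing_metadata(example))
```

A check of this kind would typically run before a dataset is archived or deposited, so that incomplete records are caught early.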

Patient data is anonymised; the conversion tables are stored on encrypted, password-protected hard disks. Samples from each patient are labelled with a unique patient identifier and a human-readable id indicating the specific treatment.
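A minimal sketch of one possible labelling scheme follows. The identifier format (`P` plus eight hexadecimal characters) and the helper names `new_patient_id` and `sample_label` are assumptions made for illustration, not the group's actual convention.

```python
import secrets


def new_patient_id(existing):
    """Draw a random identifier that carries no patient information.

    `existing` is the set of identifiers already issued; the new one is
    added to it so that identifiers stay unique.
    """
    while True:
        candidate = "P" + secrets.token_hex(4).upper()
        if candidate not in existing:
            existing.add(candidate)
            return candidate


def sample_label(patient_id, treatment):
    """Join the anonymised patient id with a human-readable treatment tag."""
    tag = "-".join(treatment.strip().lower().split())
    return f"{patient_id}_{tag}"


issued = set()
pid = new_patient_id(issued)
print(sample_label(pid, "Drug A 10uM"))
```

Because the identifier is random rather than derived from patient details, the link back to a patient exists only in the conversion table.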

Data storage

All data generated is securely curated by the CSCI Core Bioinformatics group and the Clinical School Computing Services.

All acquired data is processed on the CSCI cluster and high-memory servers. The hardware is under warranty, and end-user support is provided by both the CSCI Core Bioinformatics group and the CSCI IT facility. Best-practice policies are published on an internal wiki accessible to all cluster users. The IT facility provides enterprise-level offsite backup for all CSCI storage servers. A backup and disaster policy approved by the Wellcome Trust/MRC is in place and is reviewed yearly by the head of IT and members of the CSCI IT Committee.

Raw unprocessed data and the metadata used for the analysis are backed up on a cold storage server at CSCI. Hot raw and processed data is stored in RAIDed and backed-up folders on the CSCI cluster. The metadata for each component of the project will comprise the experimental design and an overview of the discriminative features of each sample, which will help other researchers in the field to explore the datasets. The format of the metadata will be compliant with the policies of the standard repositories used in our discipline [such as the Gene Expression Omnibus (GEO) and ArrayExpress].

Projects adhere to the CSCI IT facility guidelines, ensuring the backup of original raw files and results to an offsite location at the Sainsbury Laboratory. The offsite location is an extension of the CSCI network, with the data transferred over a secure private 10GBps fibre link. Data is stored on an enterprise-level Spectra Logic archive storage system. Offsite backups consist of a single copy of the onsite data, plus snapshots of changed data going back 7 days and up to one month. As a secondary backup, raw sequencing data is deposited in FASTQ format into the European Nucleotide Archive (ENA) databases.
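One common way to confirm that a backed-up or deposited copy matches the original is to compare file checksums. The sketch below is illustrative and not a description of the CSCI backup system: the function names, the `*.fastq.gz` file pattern and the choice of MD5 are all assumptions.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory, pattern="*.fastq.gz"):
    """Map each matching file name in a directory to its checksum."""
    return {p.name: md5sum(p) for p in sorted(Path(directory).glob(pattern))}


def changed_files(manifest, directory):
    """List manifest entries that are missing or differ in a directory."""
    changed = []
    for name, expected in sorted(manifest.items()):
        path = Path(directory) / name
        if not path.is_file() or md5sum(path) != expected:
            changed.append(name)
    return changed
```

In use, `build_manifest` would be run on the original raw files, and `changed_files` on the copy; an empty list means every file in the manifest is present and identical.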

Data sharing

All datasets (raw data, processed data and associated metadata and documentation) are deposited in publicly available databases such as the Human Cell Atlas (when relevant) or GEO/EBI, either at the time of preprint or publication or within 12 months of the end of the funding.

Any bioinformatics tools developed in this proposal are made available to the scientific community, including all scripts and documentation. We submit packages comprising the developed pipelines to the relevant repositories (GitLab), enabling users to access them through a command line interface (CLI), and/or we develop web interfaces that are freely available to the scientific community.

All manuscripts are uploaded as preprints to suitable repositories (bioRxiv), and all peer-reviewed publications are submitted as open access. All original data and analytical results are archived securely and made freely available in tandem with primary publications, and at the latest within 12 months of project end, according to CSCI policy.
