POLICIES FOR DATA SHARING & DATA ACCESS

©IRD Maxime Jacquet

Infravec2 is committed to developing a public data commons around vector species. To this end, recipients of Infravec2 products are expected to place any data they generate in the public domain within 18 months. This gives recipients the first chance to publish using these data, while ensuring the data remain available to subsequent researchers. Please see our Policies for Data Sharing and Access below, including a guide to appropriate repositories for various data types. Please contact your access provider if you need further advice. These policies are derived from the Infravec2 Open Research Data Management Plan.

Policies for Data Sharing and Data Access

Infravec2 supports the principle of open research data. It intends that data generated by the project be widely available for use by the research community, and it is working to establish community-wide standards for data representation and publication. Users of Infravec2 are expected to publish any data with which they are provided in accordance with these policies, within 18 months of the completion of the use. This policy provides a reasonable trade-off: users get the first opportunity to interpret the outputs of their work, while the public funding of the facilities used results in data that remain available to contribute to the public good thereafter. Specifically, we expect the following:

  • Publication of all appropriate data to a persistent, relevant repository. Appropriate data includes any data used to support a publication and, in addition, certain types of large-scale molecular data (“omics” data) whose re-use value is well proven (e.g. DNA sequence, RNA sequence, protein sequence).
  • Where well-established repositories exist for a given data type (e.g. European Nucleotide Archive, GenBank and the DNA Database of Japan for nucleic acid sequence data), data should be submitted to such a repository. For data types for which no well-established dedicated repository exists, other types of repositories may be appropriate (for example, institutional repositories, publishers’ repositories, or generic data storage infrastructures such as EUDAT or figshare). Repositories should assign unique identifiers by which data records can be identified (such as Digital Object Identifiers (DOIs)), and should be selected according to criteria including longevity (of repository), the absence of barriers to downstream data access (e.g. through charging or licensing restrictions), and familiarity (will the intended user community expect to find this type of data in this repository?).
  • Where there is a well-established standard for the experimental meta-data (i.e. information about the experiment, the data producer/owner/publisher, or the results set itself), published data is expected to conform to it. Examples include the MIAME (Minimum Information About a Microarray Experiment) standard for microarray data, and other standards conforming to the Minimal Information Standards for Biological and Biomedical Investigations.
  • Where there are no established domain-specific standards for minimal meta-data, and no repository-specific requirements, users should describe their data with a minimal set of meta-data in accordance with the Dublin Core Metadata Element Set, Version 1.1.
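
As an illustration of the last point, the following is a minimal sketch of a Dublin Core-style record and a simple check on it. The fifteen element names are those of the Dublin Core Metadata Element Set, Version 1.1. Everything else is an assumption made for illustration: the policy does not say which elements are required, so `ASSUMED_MINIMUM`, the `check_record` helper and all values in `example_record` are hypothetical placeholders.

```python
# The fifteen elements of the Dublin Core Metadata Element Set, Version 1.1.
DCMES_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})

# Hypothetical minimal subset chosen for illustration only; the policy
# does not state which elements are mandatory.
ASSUMED_MINIMUM = ("title", "creator", "date", "identifier", "description")


def check_record(record):
    """Return a list of problems found in a Dublin Core-style record."""
    problems = []
    # Flag keys that are not Dublin Core element names.
    for key in record:
        if key not in DCMES_ELEMENTS:
            problems.append(f"unknown element: {key}")
    # Flag assumed-minimum elements that are absent or blank.
    for key in ASSUMED_MINIMUM:
        if not str(record.get(key, "")).strip():
            problems.append(f"missing element: {key}")
    return problems


# Placeholder values; none of these refer to a real data set.
example_record = {
    "title": "Experimental infection outcomes (placeholder title)",
    "creator": "Example Lab (placeholder)",
    "date": "2019-01-31",
    "identifier": "doi:10.0000/placeholder",
    "description": "Spreadsheet of infection results (placeholder text).",
    "format": "text/csv",
}

print(check_record(example_record))  # prints []
```

A repository may impose its own required fields, in which case those take precedence over any subset assumed here.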

Infravec2 will advise and assist its users in the formatting, annotation and publication of their data according to these standards; nonetheless, responsibility for these tasks is accepted by the user as a condition of access to the infrastructure.

Data Types, Formats and Ontologies

Data types expected to be generated in the project include genome sequence and assembly; structural annotation (gene models, repeats, other functional regions) and functional annotation (protein function assignment); variation data; transcriptome data; proteomic, metabolomic and sampling data; arbovirus and malaria experimental infection data (linked to archived samples); and microbiome data (Operational Taxonomic Units), including natural virome composition.

| Data type | Appropriate format(s) | Appropriate repository | Comments |
|---|---|---|---|
| Nucleotide sequence data (short reads) | FASTQ, BAM, CRAM, and other machine-specific formats accepted by the ENA (see website for more detail). Meta-data should be MIxS-compliant. | European Nucleotide Archive (ENA); ArrayExpress (specifically, for RNA-seq data generated for the purposes of quantification). | Submission to a partner database of ENA (GenBank or DDBJ) is also acceptable. |
| Nucleotide sequence (long reads, annotated assembled sequences) | EMBL format (for sequence/annotation), AGP (for assembly description). Meta-data should be MIxS-compliant. | European Nucleotide Archive. | Annotation can also be submitted as tracks to the Ensembl Track Hub Registry, provided the underlying sequence is submitted to ENA. |
| Annotation | BED, BedGraph, GFF2/GTF, GFF3, Pairwise interactions (WashU), PSL, WIG, BAM/CRAM, BigBed, BigWig, VCF. See Ensembl website for more details. Tracks should be packaged as track hubs. | Track Hub Registry, for subsequent incorporation in Ensembl Metazoa, VectorBase and other genome browsers. | Much annotation can be visualized as positions or spans on a genomic reference sequence (tracks). Track hubs are collections of tracks with common meta-data. |
| Structural annotation | Sequence features and their attributes should be described using the Sequence Ontology within GFF2 or GFF3 files. | Ensembl Metazoa can accept GFF-based structural annotation for submissions that have a public ENA entry for the assembly. | |
| Functional annotation | Depends entirely on the data under annotation; use of structured controlled vocabularies (such as the Gene Ontology) appropriate to the domain is recommended. GAF format may be appropriate for annotating other biological objects. | None that directly take submissions. | Contact management team for advice. |
| Variation data | Variant Call Format (VCF) | | |
| Microarray data | Meta-data should be MIAME-compliant. | ArrayExpress. | MIAME compliance is enforced by the ArrayExpress submission interface. |
| Proteomic data | Meta-data should be MIAPE-compliant. | PRIDE. | MIAPE compliance is enforced by the PRIDE submission interface. |
| Metabolomic data | MetaboLights ISA format. | MetaboLights. | MetaboLights provides software support for generating compliant data. |
| Infection data | No standard exists. Representation in a spreadsheet is currently normal. | Data should be deposited in a general-purpose repository and be identifiable through DOIs or similar identifiers. | VectorBase is currently working to develop standards for the representation of infection data, and we will collaborate on this work with the goal of developing a common standard. |
| Microbiome data (from colonization experiments) | Meta-data should be MIxS-compliant. | Nucleotide sequence should be deposited in the European Nucleotide Archive (ENA). | Expected data types: 16S hypervariable region amplicon sequences, and taxonomic calls. |
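
For the structural annotation entry, which asks for Sequence Ontology terms within GFF files, the sketch below shows one way a GFF3 feature line might be assembled. It follows the general GFF3 layout of nine tab-separated columns, with a Sequence Ontology term name in the type column. The `gff3_line` helper, the contig name, the source label and the feature identifiers are all hypothetical; this is not a description of how Ensembl Metazoa or ENA validate submissions.

```python
# Characters that must be percent-encoded inside GFF3 attribute values.
_ESCAPES = {
    "%": "%25", ";": "%3B", "=": "%3D", "&": "%26", ",": "%2C",
    "\t": "%09", "\n": "%0A", "\r": "%0D",
}


def escape_value(value):
    """Percent-encode only the characters reserved in GFF3 column 9."""
    return "".join(_ESCAPES.get(ch, ch) for ch in str(value))


def gff3_line(seqid, source, so_type, start, end, strand, attributes,
              score=".", phase="."):
    """Build one GFF3 feature line (nine tab-separated columns).

    so_type is expected to be a Sequence Ontology term name such as
    "gene", "mRNA", "exon" or "CDS"; this sketch does not verify it
    against the ontology.
    """
    if start < 1 or end < start:
        raise ValueError("GFF3 coordinates are 1-based with start <= end")
    if strand not in {"+", "-", ".", "?"}:
        raise ValueError(f"invalid strand: {strand!r}")
    attr_text = ";".join(
        f"{key}={escape_value(value)}" for key, value in attributes.items()
    )
    fields = [seqid, source, so_type, str(start), str(end),
              score, strand, phase, attr_text]
    return "\t".join(fields)


# Placeholder features on a hypothetical contig.
lines = [
    "##gff-version 3",
    gff3_line("contig_1", "example_pipeline", "gene", 1000, 2500, "+",
              {"ID": "gene0001", "Name": "placeholder gene"}),
    gff3_line("contig_1", "example_pipeline", "mRNA", 1000, 2500, "+",
              {"ID": "mrna0001", "Parent": "gene0001"}),
]
print("\n".join(lines))
```

The `Parent` attribute links the mRNA to its gene, which is how GFF3 expresses gene models as a hierarchy of features.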

PDF version of the policies

Data Management Plan
Version 4.0 (current)
Version 3.0
Version 2.0
Version 1.0