Datasets can be made publicly available in three ways:
- In the article itself: for small datasets that can be presented in full in a table.
- In the supporting information: for medium-sized datasets that can be presented in large tables or compressed files, which can be downloaded online from the journal website.
- In a data repository: for large datasets (e.g., DNA sequences) that need large database infrastructures to store them.
Although the third option (depositing data in a repository) is most suitable for large datasets, it is strongly recommended that datasets of all sizes be uploaded to some form of repository [1].
Journals with mandatory data sharing policies require authors to provide a data availability statement on the first page of the article, which states the location of the dataset.
Examples of data availability statements:
Data Availability: All relevant data are within the paper.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Data Availability: The TaqMan Human MicroRNA Array experiments are MIAME compliant and have been deposited at the NCBI Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo) under accession GSE6459 [2].
Data Availability: All .bam sequencing files are available at the European Nucleotide Archive (http://www.ebi.ac.uk/ena) (accession numbers ERS700862, ERS700863, ERS700864, ERS700858, ERS700859, ERS700860, ERS700861) [3].
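Repository-based statements like the last two examples share a common pattern: what the data are, where they are deposited, and under which accession numbers. As a minimal sketch of that pattern in Python (the function name and its parameters are hypothetical, chosen only for illustration, and journals may require different wording), such a statement could be assembled like this:

```python
def data_availability_statement(description, repository, url, accessions):
    """Assemble a data availability statement in the style of the examples above.

    All parameter names are illustrative assumptions; check the target
    journal's guidelines for the exact wording it expects.
    """
    if not accessions:
        raise ValueError("at least one accession number is required")
    # Use the singular or plural label depending on how many accessions there are.
    label = "accession number" if len(accessions) == 1 else "accession numbers"
    return (
        f"Data Availability: {description} are available at the {repository} "
        f"({url}) ({label} {', '.join(accessions)})."
    )


if __name__ == "__main__":
    print(
        data_availability_statement(
            "All .bam sequencing files",
            "European Nucleotide Archive",
            "http://www.ebi.ac.uk/ena",
            ["ERS700862", "ERS700863"],
        )
    )
```

Keeping the accession numbers in a list makes it easy to check that every deposited file is mentioned before the statement goes into the manuscript.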
In some cases, you may want to upload data before you are ready to release them publicly or publish your work. In this case, you can upload the data to a repository with tiered access, i.e., the data become publicly available only once the work has been published in a journal [1].
Are there exceptions to these mandatory policies?
In certain cases, datasets are too large to deposit, or they contain human patient data that cannot be made publicly available for ethical reasons. In such cases, it is recommended that you contact the target journal to discuss solutions [1].
Which repository?
Many journals provide lists of recommended subject-specific repositories. A good example can be found here. Alternatively, you can search for appropriate repositories using the registry re3data.org.
Are there costs?
The cost of depositing data in a repository varies. Dryad charges $120 per dataset (<20 GB) but offers a waiver for countries with low-income economies [4]. Nature and some Royal Society journals (Biology Letters, Proceedings B and Royal Society Open Science) cover the cost of depositing data (<20 GB) in Dryad and Figshare, two large generalist repositories [5].
When it comes to data sharing, it is better to provide as much information as possible. Open, transparent sharing of data not only benefits the scientific community but also wins the favour of taxpayers.
References
1. PLOS ONE. Data Availability. Available from: http://journals.plos.org/plosone/s/data-availability [Accessed 15th November 2016].
2. Wozniak MB, Scelo G, Muller DC, Mukeria A, Zaridze D, Brennan P. Circulating microRNAs as non-invasive biomarkers for early detection of non-small-cell lung cancer. PLOS ONE. 2015 May 12;10(5):e0125026.
3. Butler TM, Johnson-Camacho K, Peto M, Wang NJ, Macey TA, Korkola JE, Koppie TM, Corless CL, Gray JW, Spellman PT. Exome sequencing of cell-free DNA from metastatic cancer patients identifies clinically actionable mutations distinct from primary disease. PLOS ONE. 2015 Aug 28;10(8):e0136407.
4. Dryad. Data publishing charges. Available from: http://datadryad.org/pages/payment [Accessed 15th November 2016].
5. The Royal Society. Data sharing and mining. Available from: https://royalsociety.org/journals/ethics-policies/data-sharing-mining/ [Accessed 15th November 2016].