A symposium titled "The Rhyme and the Reason of Data Sharing" was held on November 5, 2007, in San Diego, California, during the annual meeting of the Society for Neuroscience (SfN). The symposium was co-sponsored by the National Institute of Neurological Disorders and Stroke (NINDS), National Institute of Mental Health (NIMH), National Institute on Drug Abuse (NIDA), National Institute of Biomedical Imaging and Bioengineering, and National Center for Research Resources.
The symposium organizers were Yuan Liu, Ph.D., and Giorgio Ascoli, Ph.D. Dr. Liu is NINDS program director for Technology Development and Chief of the Office of International Activities. She is also co-organizer of the National Institutes of Health (NIH) Neuroinformatics Interest Group. Dr. Ascoli is a professor in the Molecular Neuroscience Department at George Mason University and Director of the Center for Neural Informatics, Structure, and Plasticity at the University's Krasnow Institute for Advanced Study.
Brain research yields vast quantities of complex data. This wealth of information could be used to answer scientific questions that are independent of the projects in which the data originated. The ever-increasing power and cost-effectiveness of informatics tools and open-source databases have facilitated data sharing and reanalysis for scientific integration and new discovery. However, the research community faces technical and sociological challenges to data sharing.
The goal of the symposium was to promote efficient data sharing and data reuse within the neuroscience community by demonstrating how good data-sharing practices can add value to original data.
Efforts by neuroscientists to share data emerged in the 1980s, but community awareness of the possibilities for sharing has lagged behind the development of databases and platforms for doing so. To help address this gap, in 2004 SfN launched its Neuroscience Database Gateway (NDG, http://ndg.sfn.org). NDG, a searchable and curated resource of online neuroscience databases and informatics tools, was created under the auspices of SfN's then newly formed Neuroinformatics Committee.
Shortly afterwards, in November 2005, NIH awarded a contract to scientists in the field of neuroinformatics to design the Neuroscience Information Framework (NIF, http://nif.nih.gov). The project built on the NDG and used cutting-edge informatics technologies. NIF will ultimately provide a resource registry, ontology, and a concept-based query system to enable discovery and access to neuroscience data and informatics tools. NIF was among the first major initiatives of the NIH Blueprint for Neuroscience Research (http://neuroscienceblueprint.nih.gov), a cooperative effort of 16 NIH Institutes, Centers, and Offices that support neuroscience research for developing tools and coordinating neuroscience resources and training.
Policies on data sharing have been adopted by several Federal funding agencies, including NIH (http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm) and the National Science Foundation (NSF) (http://www.nsf.gov/pubs/2001/gc101/gc101rev1.pdf). In addition, some scientific journals have established data deposition requirements.
In June 2007, SfN, the NIH Blueprint for Neuroscience Research, International Neuroinformatics Coordinating Facility (INCF, http://www.incf.org), and Science Commons (http://sciencecommons.org) supported a conference titled "PubMed Plus: New Directions in Publishing and Data Mining Leadership." The purpose of the conference was to discuss ways in which neuroscience journals and databases can collaborate to enhance the mining of published data. A pilot program, the Neuroscience Peer Review Consortium (http://nprc.incf.org), was established as a result of the conference.
The questions put to presenters at the symposium "The Rhyme and the Reason of Data Sharing" were: (1) why, what, when, how, and with whom to share data, and (2) how to overcome technical and sociological obstacles in data sharing. The symposium followed on the heels of the Neuroscience 2007 Presidential Special Lecture given by Mark H. Ellisman, Ph.D., of the University of California, San Diego. In his lecture, Dr. Ellisman surveyed the current needs and challenges of data management in the neuroscience community and explored the opportunities that powerful computational approaches and resources can provide. He discussed the complexity of neuroscience research today, which involves sophisticated instruments, large interdisciplinary teams in multiple locations, and a huge increase in the quantity and complexity of collected data. Dr. Ellisman suggested that if data are carefully collected, curated, and archived, the neuroscience community will have a valuable resource for use both today and for generations to come.
Dr. Liu opened the symposium. David Van Essen, Ph.D., the 2006-2007 SfN President, and Thomas R. Insel, M.D., Director of NIMH, offered opening remarks.
Dr. Liu pointed out that while the amount of data being accumulated is increasing, so is the amount of data lost because it is never shared.
Dr. Van Essen indicated that advances in informatics technology have fundamentally changed the way in which data are acquired, navigated, mined, and shared. He stated his belief that, as informatics tools and resources increase in power and capacity over the next few years, the neuroscience community will recognize the win-win nature of data sharing. This recognition in turn will trigger an exponential increase in the amount of shared data.
Dr. Insel pointed out that biology is increasingly becoming an information science. As a result of this cultural change, there is an increasing need to make sure that investments made with public funds are used as efficiently and comprehensively as possible. Speaking on behalf of the directors of all of the NIH Institutes, Dr. Insel stated that one important goal of NIH is to facilitate scientists' ability to post, access, and share data. Achieving this goal requires the development of technologies and policies that are based on an understanding of the complicated sociology of data sharing. Dr. Insel concluded his remarks by introducing a new NIH policy for data sharing, the Policy for Sharing of Data Obtained in NIH-Supported or -Conducted Genome-Wide Association Studies (GWAS) (http://grants.nih.gov/grants/guide/notice-files/not-od-07-088.html). He offered this policy as a model for consideration by the neuroscience community.
The symposium was divided into two sessions. In the first session of scientific presentations, five speakers offered their own success stories of new hypotheses, results, publications, and collaborations arising from effective data sharing. The second session was a panel discussion of issues and questions related to data sharing.
Modern electrophysiological techniques allow one to record from hundreds of nerve cells simultaneously, providing a paradise of data for exploration. However, the terabytes of data that can now be generated in a few days may require a year or more to analyze in sufficient detail to generate a single publication. This state of affairs has fundamentally changed the practice of electrophysiology: while decades ago physiologists typically spent most of their time at the bench, now they spend most of it analyzing data. Moreover, many physiologists do not have the computational expertise necessary to fully and efficiently mine their own datasets. Hence, the fastest and most potentially fruitful analysis of a given dataset requires a team of individuals with different skills, including classically trained biologists, computer scientists, mathematicians, and physicists.
Moreover, in many cases a given dataset can answer many more questions than those that motivated its initial collection. For example, the laboratory of György Buzsáki (Rutgers University) undertook the technically very challenging endeavor of trying to detect dendritic action potentials from hippocampal pyramidal cells while simultaneously recording from them intracellularly. The data, which took years to collect, did not answer the question posed by the experimenters. However, Dr. Harris and collaborators subsequently reanalyzed the data and from it generated two research papers: one analyzing mechanisms of extracellular spike-sorting, and another defining the temporal organization of cell assemblies. Both of these papers were published in Nature (Harris, Henze et al. 2002; Harris, Csicsvari et al. 2003). Thus, data originally collected for one purpose was ultimately used to answer completely different questions.
Gene expression data also have huge potential for reuse. The number of neuroscience publications that include expression profiling has grown exponentially since 1998, and is now in the thousands. GEO, the gene expression database hosted by the National Center for Biotechnology Information, now contains almost 10,000 datasets, and as of September 2007 about 350 of these were brain-related. Most journals encourage sharing of gene expression data, thus ensuring a continued positive trend. There are multiple public gene expression databases that offer both datasets and online tools for their analysis (e.g., GEO, Array Express, Gemma).
Microarray datasets are much smaller than multi-channel electrophysiological datasets, but there are far more of them available. Thus, one approach that is particularly useful for reuse of microarray data is meta-analysis. For example, one could analyze all the data from experiments that examined the effects of a particular experimental variable (e.g., stress, learning, aging) on different tissues or brain regions, and ask which genes are similarly or differently regulated among them. Or, one can ask whether two or more genes are functionally related by determining if their expression is tightly co-regulated across a large number of different experimental conditions. Dr. Pavlidis and colleagues have developed a data-sharing platform, Gemma (http://www.bioinformatics.ubc.ca/Gemma), that includes online tools for analyzing gene co-expression across multiple datasets. Such meta-analyses provide far more statistically powerful conclusions about functional relationships between genes than do analyses of individual datasets (Lee, Hsu et al. 2004).
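The co-expression question described above can be illustrated with a minimal sketch. It is not the method used by Gemma or by Lee, Hsu et al.; it simply assumes that each dataset is a mapping from gene name to a list of expression values, computes a Pearson correlation for one gene pair within each dataset, and counts how many datasets show a strong positive correlation. The gene names, the expression values, the 0.8 threshold, and the function names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of expression values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_support(datasets, gene_a, gene_b, threshold=0.8):
    """Return (supporting, tested): the number of datasets in which the two
    genes correlate at or above the threshold, and the number of datasets
    in which both genes were measured at all."""
    tested = 0
    supporting = 0
    for profiles in datasets:
        if gene_a not in profiles or gene_b not in profiles:
            continue  # this dataset cannot speak to this gene pair
        tested += 1
        if pearson(profiles[gene_a], profiles[gene_b]) >= threshold:
            supporting += 1
    return supporting, tested


# Hypothetical toy data: three small "experiments", one dict per dataset.
datasets = [
    {"geneA": [1.0, 2.0, 3.0, 4.0],
     "geneB": [2.1, 3.9, 6.2, 8.0],
     "geneC": [4.0, 1.0, 3.0, 2.0]},
    {"geneA": [5.0, 3.0, 4.0, 1.0],
     "geneB": [9.8, 6.1, 8.2, 2.0]},
    {"geneA": [1.0, 2.0, 3.0],
     "geneC": [3.0, 1.0, 2.0]},
]

print(coexpression_support(datasets, "geneA", "geneB"))
print(coexpression_support(datasets, "geneA", "geneC"))
```

Counting supporting datasets rather than pooling all samples into one correlation keeps each experiment's conditions separate, which is one simple way to ask whether a relationship recurs across independent studies.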
Functional magnetic resonance imaging (fMRI) not only generates very large (gigabyte) datasets, but also requires highly sophisticated computational platforms for their collection and analysis. As is true for multi-channel electrophysiological data and gene expression microarray data, fMRI data can be used to test many hypotheses in addition to those motivating their collection, and also to test novel data analysis methods. Reuse of fMRI data is currently supported by a number of online data archives and registries (reviewed in Kennedy and Haselgrove 2006). One such database is the fMRI Data Center at the University of California, Santa Barbara, which was first established in 2000 and now contains over 120 datasets. These data in turn have been disseminated to laboratories throughout the world.
The first dataset contributed to this collection came from Dr. Ishai in a study of cortical representation of different categories of objects (faces, houses, etc.). Her analysis showed that although each category of object elicited maximal activation of a specific cortical region, each was also associated with distributed patterns of activation across wide expanses of the cortex. But that was just the first in a series of seminal findings that came from this dataset. For example, Dan Lloyd (Trinity College) went on to reuse Dr. Ishai's data to test hypotheses about the characteristics of human consciousness (Lloyd 2002), and Thomas Carlson (University of Utrecht) reused it to test new analytic tools for decoding human mental states based on fMRI data (Carlson, Schrater et al. 2003). To date, seven articles have been published based on reuse of Dr. Ishai's dataset. These articles have appeared in prominent journals including Nature, Science, and Nature Neuroscience. The reuse of Dr. Ishai's data is just one example of how sharing of fMRI data can enable new findings, collaborations, and publications.
Use and reuse of complex neuroscience datasets increasingly requires collaboration between experimentalists and theorists. However, such collaborations are challenged by the fact that these two groups come from distinct disciplines and view biological problems from different perspectives. For example, many theorists are trained as physicists, and believe that complex and seemingly disparate processes are based on a small set of basic principles. Thus, the theorist may not appreciate the myriad of details that the experimentalist believes are essential to understand the results of a particular experiment. Over time, however, experimentalists and theorists may become increasingly appreciative of each other's perspectives, and new insights may emerge as a result.
Dr. Hirsch, who was trained as an experimentalist, has collaborated with theorists at the Redwood Center for Theoretical Neuroscience (University of California, Berkeley) to study visual cortical receptive field formation. Their common goal was to understand visual coding: how the early visual pathway collects information about the visual world in a form that the cortex can later make sense of. The experimentalists were initially interested in defining the neuronal circuitry involved in receptive field formation, and the theorists in analyzing the information content of spike trains. Together, they were able to understand how to recover the shape of a neuron's receptive field based on its responses to a particular stimulus, and also to develop models to predict that neuron's responses to other stimuli (Wang, Wei et al. 2007). This example shows how experimental data can be used to generate theoretical models, and how these models can in turn be used to generate predictions for testing in the lab.
Three-dimensional digital reconstruction of neuronal morphology is seeing increasingly widespread use in neuroscience research. Digital reconstructions of neurons have a large number of potential applications, including analyses of structure-function relationships, and of the effects of experimental manipulations or other variables (e.g., age, genotype) on neuronal structure. As is true for other data types discussed here, morphological data can often be used to answer questions other than those for which they are gathered, or to test novel computational or statistical analytical methods. Reuse of morphological data is particularly desirable because of the time and effort required to gather this kind of data. Computer-assisted neuronal reconstruction systems are becoming increasingly automated, but the process is still fairly labor-intensive.
Neuromorpho.org (http://neuromorpho.org) is the largest of several publicly available collections of 3-D single neuron reconstructions and associated metadata. The Neuromorpho site now contains over 4000 neurons from dozens of different laboratories, and offers a variety of online tools for their analysis. The database is currently used at a rate of several thousand downloads per month. One example of reuse of data from this site is a recent study by Armen Stepanyants (Northeastern University) and colleagues (Stepanyants, Hirsch et al. 2007). They analyzed connectivity patterns among neurons in slices of cat primary visual cortex using data that had originally been collected and contributed to the database by Judith Hirsch's lab. They showed that the fractions of excitatory and inhibitory connections that survive in typical cortical slices are surprisingly low, suggesting that electrophysiological studies conducted in slices significantly underestimate the extent of neuronal connectivity. In addition, they found that the percentage of inhibitory connections that survive is five-fold higher than the percentage of excitatory connections. Hence, cortical slices will appear less excitable in vitro than in vivo. Many other examples exist of reuse of data from the Neuromorpho site (Ascoli 2007).
Moderator: David Shurtleff, Ph.D., Division Director of Basic and Behavioral Neuroscience, NIDA, NIH (on behalf of NIDA Director, Nora Volkow, who was not able to attend the symposium)
As exemplified in the "success stories" of data sharing and reuse presented above, there are multiple compelling reasons to share data. These include the following:
(1) Many datasets contain far more information than a single laboratory has the time and/or expertise to extract from them.
(2) A single dataset can often answer many more questions than those that motivated its initial collection.
(3) Reuse is cost-effective, particularly for datasets that are expensive and labor-intensive to collect.
(4) Sharing promotes collaboration among scientists who might not otherwise interact, and thus has the potential to generate particularly novel hypotheses.
In theory, of course, any kind of data that can be put into digital format can be shared. In practice, some kinds of experimental data lend themselves more readily than others to sharing. Gene expression profiling data are relatively simple to share. Anatomical data are typically easier to share than are complex electrophysiological datasets because less computational and theoretical expertise is necessary for their analysis. Genetic and anatomical data from human subjects and post-mortem samples tend to be less readily shared than animal data. These data require an especially high level of investigator effort to collect, and there are special issues (e.g., of maintaining patient confidentiality) associated with their use.
It is critical that datasets be accompanied by enough "metadata" (i.e., information about the data) that they can be analyzed effectively and interpreted correctly. For example, the fMRI Data Center at UC Santa Barbara required submission of all the metadata that would be necessary to fully reproduce the submitter's published findings from the data, including subject demographic data, experimental design information, fMRI scanner protocol, and statistical results. Ideally, metadata would include sufficient information for other investigators to familiarize themselves with the ins and outs of the experiments that generated the data - that is, the metadata would provide a user's manual for the dataset. This information could include Web-based teaching materials, such as tutorials made by experimentalists to help explain how the data were collected and how to interpret potential artifacts. It could also include bibliographies of review articles and links to relevant web sites.
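As a rough illustration of the metadata requirement described above, the following sketch checks a submission record for the four categories of metadata named in the fMRI Data Center example. It is only a sketch under stated assumptions: the field names, the record layout, and the function name are hypothetical and do not reflect any archive's actual submission format.

```python
# Hypothetical field names for the four metadata categories listed above.
REQUIRED_FIELDS = (
    "subject_demographics",
    "experimental_design",
    "scanner_protocol",
    "statistical_results",
)


def missing_metadata(record):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical submission with one blank field and one absent field.
submission = {
    "subject_demographics": "12 adults, 6 female, ages 22-35",
    "experimental_design": "block design, 4 object categories",
    "scanner_protocol": "",
}

print(missing_metadata(submission))
```

A check of this kind could be run by a sharing platform at submission time, so that gaps are caught while the original investigator still remembers the experimental details.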
Data sharing requires the establishment of sharing platforms and common data formats for each kind of data to be shared. In developing these, it is important to seek recommendations and buy-in not only from prospective user communities, but also from the developers of software packages to be used for analyzing specific kinds of data. In addition, user interfaces should be made as intuitive as possible.
Many neuroscience studies involve multiple types of data (anatomical, electrophysiological, genetic, etc.). The most powerful hypotheses about nervous system function are likely to involve making links between these different data types. Thus, the ultimate goal for neuroinformatics is to build federated databases in which multiple data types can be archived together and analyzed in relation to one another. Development of such databases will require that collectors of different data types start talking to one another to develop common ontologies.
With modern communications systems like email, video conferencing, and Skype, it is possible to share data and collaborate on data analysis with scientists around the world, including ones with whom the data collector has no direct contact. However, studies on data sharing in the social sciences have shown that the best predictor of whether a collaboration will be successful is the physical distance between the collaborators. For example, it is usually much more convenient and time-effective to share data with colleagues within one's own laboratory or institution than with ones in other time zones. Also, the original collector of the data can be more assured of outcomes regarding co-authorship when interacting with local versus distant colleagues.
Another way to maximize the efficiency and effectiveness of data sharing is to share with colleagues whose skill sets are complementary to the skill sets of the principal researcher rather than redundant with them. Thus, experimentalists should seek the collaboration of theorists, experts in computational analysis, or experimentalists trained in disciplines or techniques other than their own. Grant programs should be established to support collaborations between scientists of different backgrounds and expertise, including between experimentalists and theorists. In addition, platforms should be developed for investigators to identify scientists from other disciplines who share their interest in a particular biological topic. Finally, increased efforts should be made to recruit Ph.D.s in math and physics into postdoctoral programs in neurobiology, and to promote interdisciplinary activities within pre- and postdoctoral training programs.
No universal standards currently exist for how soon after its collection data should be shared. Many journals now require that a given dataset be shared immediately after publication of a paper in which the data is used. The NIH Microarray Consortium requires that the data for a given project be shared within six months after its collection is completed. It may be important for individual labs to set standards as to how soon after a dataset is collected by one lab member it will become available for use by other lab members.
Authorship. The primary barrier to data sharing appears to be the fear of "getting scooped" by one's own data in someone else's hands. Thus, a scientist who spent a large amount of time gathering a dataset may fail to reach publication first with results he or she derived from it. Several of the symposium participants said that they personally did not know of a single case of a primary investigator being scooped through data sharing. Even if it is largely groundless, this fear could be alleviated by establishing specific time windows during which the primary investigator has sole use of a dataset. (See "When to Share," above.)
A related concern for a scientist who has spent years collecting a dataset is that other investigators will then analyze the dataset in different ways, publish their results (and derive career benefits from doing so), and not acknowledge the contribution of the primary investigator in collecting the data. Hence, policies should be established by which the role of the primary collector is acknowledged when their data is reused. However, this could not be done uniformly. For example, if collectors were listed as co-authors on all papers resulting from use of their datasets, this could result in thousands of papers for the collector of a highly used dataset. Moreover, the collector of that dataset might not want to have his or her name on every paper that used it.
One possible model for handling issues of credit is offered by neurodatabase.org. They have a policy in which investigators who reuse data agree to acknowledge the primary data collector in the same manner they would acknowledge other kinds of scientific contributors. At the very least the data collector should be cited, just as one would cite a publication that formed part of the basis for the new work. If the use of a dataset is extensive, it is suggested that the secondary user contact the primary collector to discuss the prospective publication and determine if a co-authorship would be appropriate.
Misuse of shared data. Some investigators are concerned about misinterpretation of their data by scientists who reuse them independently. For example, a theorist might misinterpret data through incomplete understanding of experimental details or of known phenomena associated with certain datasets (for example, by "rediscovering" gamma rhythms in an electrophysiological dataset). Therefore, it is critical that data be cleaned and well-documented prior to sharing (see "What to Share," above). However, it can take several months to do this, with loss of time for the primary investigator. A potential solution to this problem is for different neurobiological disciplines to focus on collection of a small number of very high quality datasets. Also, theorists should be encouraged to gain expertise with collecting and analyzing datasets.
Need for ongoing support of shared datasets. Investigators may also worry about how much time they will have to devote to supporting shared data. For example, they may end up receiving and having to respond to questions from many users about how exactly the data was collected. One solution to this issue is for investigators to develop user manuals for their datasets so that they don't have to keep answering the same questions over and over. Also, scientists need to be trained differently from the start to keep their experimental records in a clean and detailed format (i.e., not just as jottings in a lab notebook) so that the records can be used by others. Standardized record-keeping could be assisted by the availability of web-based forms provided by platforms on which the data is eventually to be shared.
Regardless of how it is handled, however, support and maintenance of shared datasets takes time and money, and the need for both may well outlast the period of grant support for the original project through which the data was collected. Thus, NIH and other agencies will need to provide funding to support and maintain archives, not just for their initial development but also for their sustainability and long-term survival. Moreover, this effort should take place on an international rather than national level.
Lack of community awareness. Several of the symposium participants felt that a significant percentage of the neuroscience community would be willing to share data, but is simply unaware that platforms are available for doing so. Similarly, a large segment of the community lacks awareness of the public datasets they themselves could take advantage of, and expertise in how to make use of them (i.e., how to use the search engines and other analytical tools associated with the databases). Thus, increased effort should be devoted to training the broader neuroscience community in use of sharing platforms and analytical tools. This training should start at the graduate or even undergraduate level. Specific training methods could include courses offered by SfN or academic institutions (e.g., Informatics 101 for Neuroscientists), SfN symposia and workshops, and the development of competitions and awards encouraging innovative analyses of datasets. The new generation of investigators should be made aware that reuse of public datasets is a great way to develop hypotheses and collect pilot data without having to spend much money. At the same time, NIH reviewers should be trained to appreciate and reward that approach.
Possible approaches to overcoming specific barriers to data sharing are discussed above. However, for data sharing to become the norm rather than the exception, broader positive incentives must be established to promote it. For example:
(1) On the academic side, decisions about tenure and other forms of university advancement could include evaluation of the impact of data shared by an investigator.
(2) NIH grant review could include consideration of the number and quality of datasets previously shared by the applicant, and the number and impact of papers published by other scientists who used those datasets.
(3) Access to data resources could require that the user share their own data in return (as is currently the policy for the NIH Microarray Consortium).
Incentives like these should be put into place as soon as possible to minimize loss of data currently being collected, and to ensure that as much data as possible will be saved and shared with current and future neuroscientists.
Last Modified March 10, 2011