The scientific challenge for this project is to accelerate discovery and exploration of the synthetic biology design space. In particular, many parts used in synthetic biology come from or are initially tested in a simple bacterium, E. coli, but many potential applications in energy, agriculture, materials, and health require either different bacteria or higher-level organisms (yeast, for example). Currently, researchers use a trial-and-error approach because they cannot find reliable information about prior experiments with a given part of interest. This process cannot scale. To achieve scale, a wide range of data must be harnessed so that researchers can estimate, with confidence, the likelihood of success. The quantity of data and the exponential increase in publications generated by this field are creating a tipping point, but this data is not readily accessible to practitioners. To address this challenge, our multidisciplinary team of biological engineers, machine learning experts, data scientists, library scientists, and social scientists will build a knowledge system that integrates disparate data and publication repositories to deliver effective and efficient access to collectively available information; doing so will enable expedited, knowledge-based synthetic biology design research.
This project will develop an open and integrated synthetic biology knowledge system (SBKS) that leverages existing data repositories and publications to create a single interface that transforms the way researchers access this information. Access to up-to-date information in multiple, heterogeneous sources will be provided via a federated approach. New methods based on machine learning will be developed to automatically generate ontology annotations that connect data in various repositories with information extracted from publications. Provenance for each entity in SBKS will be tracked and used by new methods that assess bias and assign confidence scores to the knowledge returned for each entity. An intuitive, natural-language-based interface and visualization functionality will be implemented so that users can easily access and explore SBKS contents. Additionally, because ethics is necessarily a part of synthetic biology research, data from text sources related to ethical concerns in synthetic biology will also be incorporated to inform researchers about ethical debates relevant to their search queries. Finally, to test the SBKS API, a new genetic design tool, Kimera, will be developed that leverages the knowledge in SBKS to produce better designs. The proposed SBKS will accelerate discovery and innovation by enabling researchers to learn from others’ past experiences and to make the most of valuable experimental time by testing designs that have a higher likelihood of working when transformed into a new organism. This research thus has the potential for transformative outcomes in the field of synthetic biology by leveraging data science to improve the field’s epistemic culture.
This material is based upon work supported by the National Science Foundation under Grant Nos. 1939892, 1939929, 1939885, 1939887, 1939951, and 1939860. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.