Open PHACTS: Semantic interoperability for drug discovery

Finalist

Open PHACTS (Open Pharmacological Concept Triple Store, http://www.openphacts.org/) is a project partially funded under a European Union grant from the Innovative Medicines Initiative (IMI) and was born as a public-private partnership (PPP) between academia, publishers, small and medium-sized enterprises (SMEs), and pharmaceutical companies. The ultimate goal of the project is to deliver and sustain an ‘open pharmacological space’ (OPS). Open PHACTS uses and enhances state-of-the-art semantic web standards and technologies, and focuses on practical and robust applications that answer specific questions in drug discovery research. A detailed overview of the project has been published [1]; a summary is given in the following sections.

The Challenge: Research and discovery in the life sciences, including disease and its treatment, is amazingly complex. Present technologies are now generating enormous quantities of biomedical data, and data-driven life science research, including drug discovery, will increasingly rely on a community of collaborating partners to extract knowledge from these data sources to solve complex questions. Unfortunately, there is today a diverse landscape of data sources, with an inherent distribution of data quality, formats, standards, copyright and licensing.
Advances in technologies have led to the generation of so much data that humans cannot capture and synthesize the information. Most scientific investigations in life science now involve data from DNA and RNA sequencing, proteomics, metabolomics, screening, biomedical imaging, analysis of data in existing databases, data in the narrative literature and medical records. This data is growing at an ever accelerating speed, in size and in diversity. It provides an enormous opportunity for scientists to run queries and mine the data to develop insights that have not been possible before. A primary challenge is working with the complexity of the aggregated data. The construction of standardized, re-usable, stable, up-to-date and easy to use workflow elements is the primary manner by which to work with these complex data. A large interoperable open data space was needed to feed these workflows.
In addition, many companies have spent considerable time and energy creating their own integration frameworks to combine public data with their proprietary data. Ultimately, these systems mostly achieve the same results, highlighting the considerable duplication of effort in what arguably should be a precompetitive activity. Because of the large number and diversity of data sources this integration has been very limited. As a result there was a great need to develop a resource that addresses this problem. The Open PHACTS project took on both challenges by building an integrated linked open data system that can easily be extended with proprietary data.

The Approach: The vast number of publicly available databases in the biomedical area [2] requires that priorities be set to select the ones most useful for semantic integration. In the Open PHACTS project this selection was based on defining the most pertinent research questions in pharmaceutical research. The list was initiated by scientists from the European pharmaceutical companies in a brainstorming session, and was subsequently extended and refined within the wider Open PHACTS project consortium, including the academic researchers in the pharmacological domain. In this first round, questions that involve the integration of compound–target–pathway–disease/phenotype data were considered. A total of 83 questions were defined, which were clustered by domain and prioritized [3]. All questions required the integration of at least two different data sources. Some example questions are:
- Give me all oxidoreductase inhibitors active <100 nM in human and mouse
- The current Factor Xa lead series is characterized by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X
- For my given compound, which targets have been patented in the context of Alzheimer's disease?
- Who is working on the most relevant genes concerning Alzheimer's disease?
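To make concrete why such questions need at least two data sources, the first example question can be sketched in Python as a join between a bioactivity table and an enzyme classification. This is a minimal illustration only: the records, the identifiers (`C1`, `T1`, and so on) and the field names are invented placeholders, not Open PHACTS data or its API, and "in human and mouse" is assumed to mean that a qualifying activity must exist in both organisms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    compound: str
    target: str
    organism: str
    ic50_nm: float  # potency in nanomolar


# Hypothetical source A: target identifier -> enzyme class.
ENZYME_CLASS = {"T1": "oxidoreductase", "T2": "oxidoreductase", "T3": "hydrolase"}

# Hypothetical source B: measured activities (placeholder values).
ACTIVITIES = [
    Activity("C1", "T1", "human", 40.0),
    Activity("C1", "T2", "mouse", 75.0),
    Activity("C2", "T1", "human", 20.0),
    Activity("C2", "T2", "mouse", 500.0),
    Activity("C3", "T3", "human", 5.0),
    Activity("C3", "T3", "mouse", 5.0),
]


def potent_inhibitors(activities, enzyme_class, wanted_class, organisms, max_nm):
    """Compounds active below max_nm on wanted_class targets in every organism."""
    organisms_hit = {}
    for a in activities:
        # The join: the activity record supplies the target id,
        # the classification source supplies its enzyme class.
        if enzyme_class.get(a.target) == wanted_class and a.ic50_nm < max_nm:
            organisms_hit.setdefault(a.compound, set()).add(a.organism)
    return sorted(c for c, hit in organisms_hit.items() if set(organisms) <= hit)


result = potent_inhibitors(
    ACTIVITIES, ENZYME_CLASS, "oxidoreductase", ["human", "mouse"], 100.0
)
print(result)
```

Neither source can answer the question alone: the activity table does not know which targets are oxidoreductases, and the classification holds no potency values.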

The Data: Based on the research questions and public availability, the following databases were included in Open PHACTS: ChEBI, ChEMBL, ConceptWiki, DisGeNET, DrugBank, ENZYME, FAERS, Gene Ontology, neXtProt, UniProt, and WikiPathways, resulting in an RDF data set of more than 3 billion triples. These databases represent several scientific domains, including chemistry, pharmacology, disease, proteins, and pathways. Several crucial aspects needed to be worked out in detail: data ontologies and vocabularies, and data licensing and copyright.
It is well established that the name space of biomedicine is messy and ambiguous. Many synonyms circulate on the web for crucial concept categories (semantic types) such as proteins, genes, drugs and diseases, but also for institutes, authors and, for instance, units of measurement. Chemistry is, in many ways, even more challenged by ambiguity and degeneracy in its identifier systems. A single compound can be represented using systematic names generated according to several conventions. The complexity of the name space is approached through a system by which the individual concepts constituting the biological concept space (genes, proteins, drugs/chemicals, diseases, among others) as well as the social concept space (authors, articles, datasets, among others) are kept at the individual concept level. Rather than trying to force all data providers globally to refer to these individual concepts in a standard way, the OPS uses on-the-fly identity mapping to combine different terms and internationalized resource identifiers (IRIs) dynamically for the same physical entity. The advantage of this approach is that different rules can be applied at query time to fit the current query best (for instance deciding whether to treat genes and proteins, genes and gene probes, or different tautomers or stereoisomers of the same chemical compound as ‘the same physical entity’ for the purposes of a query).
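The idea of choosing equivalence rules at query time can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Open PHACTS implementation: the link types, the rule-set names (`strict`, `relaxed`) and all identifiers are hypothetical. Links between identifiers are stored separately from the data, and a query selects which link types it is willing to follow.

```python
from collections import defaultdict

# Hypothetical link sets: link type -> pairs of identifiers it connects.
LINKSETS = {
    "exact_match": [("dbA:compound1", "dbB:compound1")],
    "stereoisomer": [("dbA:compound1", "dbA:compound1_R")],
    "gene_protein": [("gene:G1", "protein:P1")],
}

# Hypothetical rule sets: which link types count as "the same entity".
RULE_SETS = {
    "strict": {"exact_match"},
    "relaxed": {"exact_match", "stereoisomer", "gene_protein"},
}


def equivalents(identifier, rule_set):
    """All identifiers treated as the same entity under the chosen rule set."""
    graph = defaultdict(set)
    for link_type in RULE_SETS[rule_set]:
        for a, b in LINKSETS[link_type]:
            graph[a].add(b)
            graph[b].add(a)
    # Follow the permitted links transitively from the starting identifier.
    found, stack = {identifier}, [identifier]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found


print(sorted(equivalents("dbA:compound1", "strict")))
print(sorted(equivalents("dbA:compound1", "relaxed")))
```

Because the links are not baked into the datasets, the same query can be rerun under a different rule set without touching the underlying data.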
Open PHACTS does not mandate the use of any one specific vocabulary for a certain semantic type (gene, protein, drug, among others), but provides recommendations as to how best to represent data within the system. Specifically, Open PHACTS recommends the use of open public vocabularies and identifier schemes, advocating use of resources such as the NCBO's BioPortal (http://bioportal.bioontology.org/ontologies) and EBI's Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup/), approved vocabularies, and public identifier mapping services such as BridgeDB (http://www.bridgedb.org/) and identifiers.org. The complexities of handling chemical compound data are managed by using the proven ChemSpider platform [4].
Another practical issue in the integration of multiple data sources is copyright. International law is not uniform regarding the copyrighting of data and this can lead to many practical problems. Copyright and licensing terms cannot be assumed if they are not explicitly provided by the data providers, and they are often missing. An explicit copyright and license statement is therefore crucial to enable the sharing and repurposing of data, which in itself is required for anyone to maintain, correct, mix and redistribute a dataset. Solving legal and practical issues around data access, sharing and licensing has therefore been a focus of Open PHACTS, and the project currently has explicit copyright and licensing terms for all included data sources.
Even though the efforts on name spaces and licensing are not very visible in the operational system, they are essential for a sustainable and extensible Linked Open Data system, and within the project the phrase "We do the boring stuff really well" was often used.

The Platform: Open PHACTS provides a semantic platform and consists of six components:
1. Data Sources. We rely on the existing original RDF data sources, hosted in an OpenLink Virtuoso triplestore. This encourages both originating data providers and third parties to continue to provide RDF data.
2. Linked Data Cache (LDC). We chose to centrally warehouse the data, as opposed to federate it, in a Linked Data Cache for reasons of reliability and performance.
3. Identity Resolution Service (IRS). The role of the IRS is to translate user-entered entity names (in free text form) into known entities within the system (i.e. that have a defined URI). These known entities can then be used in structured queries.
4. Identity Mapping Service (IMS). Equivalence is context-dependent: For example, when trying to find the targets that a particular chemical compound interacts with, some data sources may have created mappings to gene rather than protein identifiers: in such instances it may be acceptable to users to treat gene and protein IDs as being in some sense equivalent. However, in other situations this may not be acceptable and the platform needs to allow for this dynamic equivalence within a scientific context. Rather than hard coding the identity links into the datasets, the platform defers the links to be resolved during query execution by the Identity Mapping Service (IMS). Thus, by changing the set of dataset links used to execute the query, different interpretations over the data can be provided. The IMS also allows for queries with a novel concept: scientific lenses. Scientific lenses are a dynamic way of identity mapping, allowing more context-dependent searching. For instance, for a chemist it is important to distinguish between two salt forms of the same chemical, whereas for a biologist this is irrelevant. Scientific lenses allow the user to tune the level at which certain concepts are seen as identical.
5. Domain specific services. There are a variety of important pharmacological operations that are specific to a domain and have reliable and performant implementations. A good example is the mapping of compounds based on chemical structures and not on names. Instead of reimplementing this feature, we rely on an existing chemistry registration and normalisation service: ChemSpider [4].
6. Core API. The initial prototype of the Open PHACTS platform only provided a SPARQL endpoint through which the integrated data could be queried. This required each of the drug discovery applications to have an intimate knowledge of the data exposed and the ability to write the required SPARQL queries to retrieve the data desired. To address this problem the Core API was introduced into the architecture. The Core API provides a set of common methods that applications can call. This benefits application developers as they no longer need to formulate their own SPARQL queries.
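The division of labour between free-text resolution (component 3) and the Core API (component 6) can be sketched in Python. This is a hedged illustration, not the actual Open PHACTS API: the label index, the `example.org` URIs, the `ex:` vocabulary and the function names are all hypothetical. The point is only that the application calls a method with plain parameters, and the SPARQL is written once, behind that method.

```python
# Hypothetical label index standing in for the Identity Resolution Service.
LABEL_INDEX = {
    "aspirin": "http://example.org/compound/1",
    "acetylsalicylic acid": "http://example.org/compound/1",
}


def resolve(free_text):
    """Translate a user-entered name into a known URI, or None if unknown."""
    return LABEL_INDEX.get(free_text.strip().lower())


def pharmacology_query(compound_uri, max_nm):
    """Build the SPARQL that an API method could run on the caller's behalf."""
    return (
        "PREFIX ex: <http://example.org/vocab#>\n"  # hypothetical vocabulary
        "SELECT ?target ?value WHERE {\n"
        f"  ?activity ex:compound <{compound_uri}> ;\n"
        "            ex:target ?target ;\n"
        "            ex:valueNanomolar ?value .\n"
        f"  FILTER(?value < {float(max_nm)})\n"
        "}"
    )


def compound_pharmacology(name, max_nm=100):
    """API-style entry point: a name and a threshold in, a ready query out."""
    uri = resolve(name)
    if uri is None:
        raise LookupError(f"no known entity for {name!r}")
    return pharmacology_query(uri, max_nm)


print(compound_pharmacology("Aspirin"))
```

An application developer using such a method needs to know neither the graph layout nor the vocabulary, which is the benefit the Core API is described as providing.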

Detailed information on the technical implementation can be found in the following two publications [5,6].

The Applications: The real value of the Open PHACTS system will be realized through the applications that are built using its API. A user-friendly query interface called the Open PHACTS Explorer is publicly available (https://www.openphacts.org/2/sci/explorer.html) and allows users to query the data for chemical compounds, protein targets, and assay data. This provides easy access to the integrated data with several relatively basic queries. The real value of Open PHACTS lies in the diversity and integration of the multitude of data sources, and any user is free to build their own applications and queries using the existing API calls. The Open PHACTS project has built a number of modules in KNIME (open source) and Pipeline Pilot, which are software platforms that allow construction of complex query and analysis workflows using these modules. These modules and workflows are freely available through the project web site (http://www.openphacts.org/).
Within the Open PHACTS project team, several custom applications were built that are also freely accessible. They all have a different focus and support different scientific communities. Examples include ChemBioNavigator for visualising groups of related molecules, COMBINE for building and visualising interactive chemical and biological networks, and SciBites, providing a real-time drug discovery/pharma information portal, connecting the latest news on competitive intelligence for pharma and biotech companies directly to Open PHACTS pharmacology data (http://www.openphacts.org/2/sci/apps.html). Several applications are demonstrated on YouTube (https://www.youtube.com/user/OpenPHACTS).

The Impact: The Open PHACTS project is having a clear impact in several ways. The most straightforward impact is the use of the system in scientific research. Peer reviewed scientific publications have been coming out that make extensive use of the system to do analysis that was very difficult to do previously [7,8]. Many pharmaceutical companies are busy integrating their internal data with the Open PHACTS data, so they can easily query across all data that is available to them, both public and private.
A second impact is the demonstration that large amounts of diverse semantic data in RDF triple format can be queried in a performant manner. This was clearly not the consensus opinion of big data providers such as the European Bioinformatics Institute (EBI) or commercial data providers such as Thomson Reuters. The hard work and success of the Open PHACTS project have undoubtedly changed people's minds on the practicality of using linked data in biomedical research. This is supported by the fact that many data providers have now decided to offer their data in RDF format, which greatly helps to sustain the Open PHACTS system. The work that was done in Open PHACTS to make the data from different sources more interoperable, by using standardized ontologies and vocabularies, has also stimulated new biomedical data organizations and projects, such as the European Elixir initiative (https://www.elixir-europe.org/), to make data interoperability one of the key objectives in the many data sources that they will provide. The accomplishments of the Open PHACTS project have also been an important driver for the implementation of the FAIR Data initiative, strongly supported by the Dutch Techcentre for Life Sciences (http://www.dtls.nl/fair-data/). The results of Open PHACTS have also influenced discussions around the data that is generated in the multiple EU-sponsored IMI public-private projects (http://www.imi.europa.eu/). It is acknowledged that an approach like the one taken in Open PHACTS needs to be put in place to ensure the interoperability and sustainability of the data that is generated in the many ongoing and finished IMI projects.

The Future: Open PHACTS was started as a project to show that linked open data could be integrated and made publicly available for efficient querying. As described, this involved the efforts of many experts in the areas of biomedical science, data ontologies and vocabularies, licensing, computer science, software engineering, etc. The project has been very successful in its objectives, as evidenced by the fact that several large-scale biomedical initiatives and organizations (Elixir, EBI, Thomson Reuters, FAIR Data, etc.) have included interoperable data in RDF form as objectives in their efforts. Organizations such as Elixir will be the data providers on which a system like Open PHACTS can continue to provide real value to the biomedical community. Several biomedical data domains are clearly of high value to be added to the integrated data, such as genomics, proteomics, and metabolomics data. Every data source that gets added will enrich the type of queries and data mining that can be done, and we expect a large increase of scientific activity and publications on this multidomain data analysis in the upcoming years. The linked nature of the data in Open PHACTS is very well suited for network/graph based analytics, such as shortest path analysis. This will play an important role in guiding the direction of disease and pharmaceutical research, as only these methods will allow qualitative and quantitative analysis of the ever growing volumes of biomedical data. It is likely that in the future companies like Ontoforce (www.ontoforce.com) and Euretos (www.euretos.com) will develop more efficient software to mine the Open PHACTS data, and this will only increase the impact of the project. Biomedical data mining will then clearly have moved on from diverse isolated relational databases to (still diverse!) integrated semantic databases.
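As a minimal sketch of what shortest path analysis over linked data means in practice, the Python example below runs a breadth-first search over a tiny graph of compound, target, pathway and disease nodes. The node names and edges are invented placeholders, not Open PHACTS content, and a real analysis would run over far larger graphs with typed and weighted relations.

```python
from collections import deque

# Hypothetical toy graph: each pair is an undirected link between two entities.
EDGES = [
    ("compound:C1", "target:T1"),
    ("target:T1", "pathway:P1"),
    ("pathway:P1", "disease:D1"),
    ("compound:C1", "target:T2"),
    ("target:T2", "disease:D1"),
]


def shortest_path(edges, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor chain back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


path = shortest_path(EDGES, "compound:C1", "disease:D1")
print(path)
```

In this toy graph the search prefers the two-step route through `target:T2` over the three-step route through the pathway, which is the kind of compact compound-to-disease connection such analyses are meant to surface.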

The Team: Open PHACTS was conceived and built by a well-coordinated multidisciplinary team of academic researchers and small and large companies in the fields of bioinformatics, cheminformatics, computer science, software engineering, and pharmaceutical research. The consortium members come from all across Europe, including universities of Amsterdam (VU), Maastricht, Vienna, Denmark, Hamburg, Manchester, Santiago de Compostela, Bonn, the academic hospital of Leiden (LUMC), organizations PSMAR in Barcelona, Royal Society of Chemistry UK, the Spanish National Cancer Research Centre, Stichting Netherlands Bioinformatics Centre (NBIC), the Swiss Institute of Bioinformatics, the European Bioinformatics Institute (EBI), small companies BioSolveIT, Connected Discovery, OpenLink, and SciBite, and large pharma companies GSK, AstraZeneca, Novartis, Merck, Lundbeck, Eli Lilly, Janssen, and Almirall. The coordination of such a complex team was facilitated by frequent subteam meetings. Several hackathons were organized to brainstorm and develop tools and applications on the spot.
In addition there are 50+ so-called associate partners, who have declared a strong interest in the project and are kept up to date regularly on new developments. These include software developers, data providers, academic institutions and hospitals, pharma and biotech companies (http://www.openphacts.org/partners/associated-partners).

[1] Williams AJ et al, Drug Discovery Today 2012, 17, 1188-1198
[2] Galperin MY et al, Nucleic Acids Research 2015, 43, D1-D5
[3] Azzaoui K et al, Drug Discovery Today 2013, 18, 843-852
[4] Pence H and Williams AJ, J. Chem. Educ. 2010, 87, 1123-1124
[5] Gray AJG et al, Semantic Web 2014, 5, 101-113 http://dx.doi.org/10.3233/SW-2012-0088
[6] Groth PT et al, J. Web Semantics 2014, 29, 1-7 http://dx.doi.org/10.1016/j.websem.2014.03.003
[7] Ratnam J et al, PLoS One 2014, 9, e115460
[8] Chichester C et al, Drug Discovery Today 2015, 20, 399-405

The Open PHACTS Consortium

Faculty of Health, Life Sciences and Medicine Maastricht University, Dr Chris Evelo
Minderbroedersberg 4-6
6200 MD Maastricht
Netherlands