Tools and Methods

Session 4.4

Thursday, September 17, 2015 - 10:30 to 12:00
D2 Works. 1
Ruben Verborgh


A Semantic Method for Multiple Resources Exploitation

Extracting and exploiting information held in multiple resources (repositories, corpora, etc.) is essential to benefiting from the growing availability and complementary nature of data scattered across the World Wide Web. However, this raises a number of challenges, including the diverse structures of such resources, the different relationships among their data, and the overlapping and complementary nature of the information. A semantic method that can extract semantic information and hidden associations would help overcome these difficulties. This paper presents a new semantic method that exploits the overlap between resources with different structures (i.e. ontologies as forms of structured data and corpora as examples of unstructured data) and employs semantic relations, specifically sibling relations, to infer new information that may not exist in the original resources. The method then uses the new information in a content-based recommender system to improve the quality of the recommendations it provides (i.e. articles) in complex fields that are inherently characterised by varying relations and structures, such as bioinformatics. In addition, the method is accompanied by an automatic tool that tailors recommendations to each user based on his/her profile.
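
As a rough illustration of how sibling relations could feed a content-based recommender, the sketch below expands a user profile with the siblings of its terms (terms sharing a parent in an ontology) and scores articles against the expanded profile. It is not the authors' method: the ontology fragment, the `sibling_weight` discount, and all function names are hypothetical and chosen only for this example.

```python
# Hypothetical ontology fragment, stored as child -> parent.
PARENT_OF = {
    "kinase": "enzyme",
    "phosphatase": "enzyme",
    "protease": "enzyme",
    "enzyme": "protein",
    "receptor": "protein",
}


def siblings(term, parent_of):
    """Return the terms that share a parent with `term` (excluding `term`)."""
    parent = parent_of.get(term)
    if parent is None:
        return set()
    return {t for t, p in parent_of.items() if p == parent and t != term}


def expand_profile(profile, parent_of, sibling_weight=0.5):
    """Add sibling terms to a {term: weight} profile at a discounted weight.

    The discount factor is an assumption of this sketch, not a value
    taken from the paper.
    """
    expanded = dict(profile)
    for term, weight in profile.items():
        for sib in siblings(term, parent_of):
            inferred = weight * sibling_weight
            expanded[sib] = max(expanded.get(sib, 0.0), inferred)
    return expanded


def score(article_terms, profile):
    """Score an article as the sum of profile weights of its distinct terms."""
    return sum(profile.get(t, 0.0) for t in set(article_terms))


profile = expand_profile({"kinase": 1.0}, PARENT_OF)
article_score = score(["phosphatase", "receptor"], profile)
```

With this toy data, an article that mentions only "phosphatase" still receives a non-zero score for a user whose original profile contained only "kinase", which is the kind of inferred association the abstract describes.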

Streaming Transformation of XML to RDF using XPath-based Mappings

The Extensible Markup Language (XML) has become a widely adopted data interchange format. With the rise of Linked Data published using the Resource Description Framework (RDF), a number of tools for transforming XML to RDF have been developed. Specifying XML-to-RDF mappings for these tools often requires skills in programming languages such as XSLT or XQuery. Moreover, these tools are rarely able to deal with large XML inputs. We introduce an XML-to-RDF transformation approach based on mappings comprising RDF triple templates that employ simple XPath expressions. Because the restricted XPath expressions can be evaluated against a stream of XML data, our implementation can handle extremely large XML input files. To process the XML input efficiently, we employ XML filtering techniques and a strategy for selecting relevant XML nodes to generate RDF triples from. We show that the time complexity of our mapping algorithm is linear in the size of the XML input, and we demonstrate its practical efficiency with an evaluation on large real-world data.
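
The general idea of template-based streaming transformation can be sketched with Python's standard `xml.etree.ElementTree.iterparse`. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapping format, the `example.org` subject IRI, the sample document, and the restriction to a single absolute path with direct-child lookups are all invented for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping: for every element matched by `match`, emit one
# triple per (predicate, child element name) pair.
MAPPING = {
    "match": "/catalog/book",
    "subject": "http://example.org/book/{id}",  # {id} comes from the @id attribute
    "triples": [
        ("http://purl.org/dc/terms/title", "title"),
        ("http://purl.org/dc/terms/creator", "author"),
    ],
}

SAMPLE_XML = (
    b'<catalog>'
    b'<book id="b1"><title>Linked Data</title><author>A. Author</author></book>'
    b'<book id="b2"><title>Say "hi"</title></book>'
    b'</catalog>'
)


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def stream_triples(source, mapping):
    """Yield N-Triples lines while reading `source` incrementally."""
    path = []  # tags of the currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        current = "/" + "/".join(path)
        path.pop()
        if current != mapping["match"]:
            continue
        subject = mapping["subject"].format(id=elem.get("id"))
        for predicate, child_name in mapping["triples"]:
            for node in elem.findall(child_name):  # direct children only
                if node.text:
                    literal = escape_literal(node.text.strip())
                    yield f'<{subject}> <{predicate}> "{literal}" .'
        # Drop the processed record's children and text so memory use stays
        # small; the emptied element itself remains attached to its parent.
        elem.clear()


triples = list(stream_triples(io.BytesIO(SAMPLE_XML), MAPPING))
```

Each matched record is handled once, when its end tag is seen, so the work in this sketch grows with the size of the input rather than with the size of an in-memory tree.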

SemaGrow: Optimizing Federated SPARQL queries

Processing SPARQL queries involves the construction of an efficient query plan to guide query execution. Alternative plans can vary by orders of magnitude in the resources and the amount of time that they need, making planning crucial for efficiency. On the other hand, constructing optimal plans can become computationally intensive and relies on detailed, difficult-to-obtain metadata. In this paper we present Semagrow, a federated SPARQL querying system that uses metadata about the federated data sources in order to optimize query execution. We aim for a query optimizer that introduces little overhead and has appropriate fallbacks in the absence of metadata, but at the same time produces optimal plans in as many situations as possible. Semagrow also exploits non-blocking and asynchronous stream processing technologies to achieve query execution efficiency and robustness. We also present and analyse empirical results using the FedBench benchmark to compare Semagrow against FedX and SPLENDID. Semagrow clearly outperforms SPLENDID and is either on a par with or much faster than FedX.
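
To make the role of source metadata and fallbacks in planning concrete, the sketch below orders triple patterns with a simple greedy heuristic: it uses a cardinality estimate when one is available, falls back to a pessimistic default when it is not, and prefers patterns that share a variable with what is already planned. This is a generic textbook-style heuristic, not Semagrow's optimizer; the `Pattern` class, the `DEFAULT_CARDINALITY` value, and the example patterns and endpoints are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Hypothetical pessimistic estimate used when a source publishes no metadata.
DEFAULT_CARDINALITY = 1_000_000


@dataclass(frozen=True)
class Pattern:
    text: str                           # the triple pattern as written
    variables: FrozenSet[str]           # variables it mentions
    source: str                         # endpoint expected to answer it
    cardinality: Optional[int] = None   # None means "no metadata available"


def estimate(pattern: Pattern) -> int:
    """Return the metadata-based estimate, or the fallback if there is none."""
    if pattern.cardinality is not None:
        return pattern.cardinality
    return DEFAULT_CARDINALITY


def greedy_plan(patterns: List[Pattern]) -> List[Pattern]:
    """Order patterns: start with the most selective, then extend greedily."""
    if not patterns:
        return []
    remaining = sorted(patterns, key=estimate)
    plan = [remaining.pop(0)]
    bound = set(plan[0].variables)
    while remaining:
        # Prefer patterns joined to the plan through a shared variable
        # (avoiding cross products), then the smallest estimate.
        remaining.sort(key=lambda p: (not (p.variables & bound), estimate(p)))
        chosen = remaining.pop(0)
        plan.append(chosen)
        bound |= chosen.variables
    return plan


patterns = [
    Pattern("?drug a :Drug", frozenset({"drug"}), "endpointA", 5000),
    Pattern("?drug :target ?protein", frozenset({"drug", "protein"}), "endpointB"),
    Pattern("?protein :name ?name", frozenset({"protein", "name"}), "endpointC", 200),
]
plan = [p.text for p in greedy_plan(patterns)]
```

In this example the pattern without metadata is still scheduled second, because it is the only one that joins with the most selective pattern; the fallback estimate only affects ordering among otherwise equivalent candidates.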