The development of Big Data technologies offers new perspectives for building powerful disambiguation systems. New approaches can be devised to discover and normalize uncontrolled vocabularies such as named entities.
In this presentation, I will explain how Reportlinker.com, an award-winning market research solution, developed an inference engine based on supervised analysis to disambiguate the names of companies found in a corpus of unstructured documents.
Through several examples, I will explain the main steps of our approach:
- The discovery of non-verified facts (hypotheses) using a large volume of data
- The transformation of hypotheses into verified facts, using an iterative graph processing system
- The construction of a relational graph to attach new context to each normalized concept.
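The hypothesis-then-verification approach above can be sketched as follows. This is a minimal, hypothetical illustration, not Reportlinker.com's actual engine: the normalization key, the support threshold, and all names and counts are invented for the example.

```python
from collections import defaultdict

def discover_hypotheses(mentions):
    """Step 1: group raw company-name mentions by a crude
    normalization key; each group is an unverified hypothesis
    that its members denote the same company."""
    groups = defaultdict(set)
    for m in mentions:
        key = m.lower().replace(",", "").replace(".", "")
        key = key.replace(" inc", "").replace(" ltd", "").strip()
        groups[key].add(m)
    return dict(groups)

def verify(hypotheses, counts, min_support):
    """Step 2: promote a hypothesis to a verified fact when its
    variants occur often enough in the corpus."""
    return {k: v for k, v in hypotheses.items()
            if sum(counts.get(m, 0) for m in v) >= min_support}

# Illustrative corpus statistics (invented).
mentions = ["Acme Inc.", "ACME Inc", "acme inc", "Globex Ltd."]
counts = {"Acme Inc.": 40, "ACME Inc": 25, "acme inc": 5, "Globex Ltd.": 1}

hyps = discover_hypotheses(mentions)          # hypotheses per key
facts = verify(hyps, counts, min_support=10)  # verified merges only
```

In a real system the verification step would run iteratively over a graph of co-occurrence evidence rather than a single frequency threshold.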
The RDF data model allows the description of domain-level knowledge that is understandable by both humans and machines. RDF data can be derived from different source formats and diverse access points, ranging from databases or files in CSV format to data retrieved from Web APIs in JSON, Web services in XML, or other specialty formats. To this end, machine-interpretable mapping languages, such as RML, were introduced to uniformly define how data in multiple heterogeneous sources is mapped to the RDF data model, independently of its original format. However, the way in which this data is accessed and retrieved remains hard-coded, as corresponding descriptions are often not available or not taken into account. In this paper, we introduce an approach that takes advantage of widely accepted vocabularies originally used to advertise services or datasets, such as Hydra or DCAT, to define how to access Web-based or other data sources. Consequently, the generation of RDF representations is facilitated and further automated, while the machine-interpretable descriptions of the connectivity to the original data remain independent and interoperable, offering a granular solution for accessing and mapping data.
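To make the mapping idea concrete, the sketch below maps two heterogeneous sources, a CSV file and a JSON API response, onto the same RDF model and serializes the result as N-Triples. It is a simplified, RML-like illustration using only the standard library; the `http://example.org/` namespace and the column/key names are assumptions, not part of RML itself.

```python
import csv
import io
import json

EX = "http://example.org/"  # hypothetical namespace

def csv_to_triples(text):
    """Map each CSV row to RDF: one subject per row,
    one predicate per non-id column."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subj = f"<{EX}company/{row['id']}>"
        for col, val in row.items():
            if col != "id":
                triples.append((subj, f"<{EX}{col}>", f'"{val}"'))
    return triples

def json_to_triples(text):
    """Map a JSON array of objects onto the same RDF model,
    independently of the source format."""
    triples = []
    for obj in json.loads(text):
        subj = f"<{EX}company/{obj['id']}>"
        for key, val in obj.items():
            if key != "id":
                triples.append((subj, f"<{EX}{key}>", f'"{val}"'))
    return triples

csv_src = "id,name\n1,Acme\n"
json_src = '[{"id": 2, "name": "Globex"}]'

triples = csv_to_triples(csv_src) + json_to_triples(json_src)
ntriples = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
```

The point the paper makes is that while such subject/predicate mappings can be declared uniformly (e.g. in RML), the access step (reading the file, calling the API) is what vocabularies like Hydra or DCAT can describe declaratively instead of hard-coding it as done here.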
Tourpedia (http://tour-pedia.org) is an open initiative which contains a linked dataset of tourism places, i.e. accommodations, attractions, points of interest (POIs) and restaurants. Tourpedia extracts and integrates information about places from four different social media sources: Facebook, Foursquare, Google Places and Booking.com. The resulting knowledge base currently consists of more than 6M RDF triples and describes almost 500,000 places, each of which is identified by a globally unique identifier that can be dereferenced over the Web into an RDF description. This paper gives an overview of the Tourpedia knowledge base and illustrates how new relations are discovered among places through Named Entity Recognition (NER) tools.
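A toy version of that relation-discovery step might look like the following: spotting one place's name inside another place's free-text description and emitting a linking triple. The dictionary-lookup "NER", the `mentions` predicate, and the sample resources are all hypothetical stand-ins for the real NER tools and Tourpedia data.

```python
# Illustrative Tourpedia-style resources (invented content).
places = {
    "http://tour-pedia.org/resource/1": {
        "name": "Hotel Bella Vista",
        "description": "A short walk from the Colosseum.",
    },
    "http://tour-pedia.org/resource/2": {
        "name": "Colosseum",
        "description": "Ancient Roman amphitheatre.",
    },
}

def discover_relations(places):
    """Emit a (subject, predicate, object) triple whenever one
    place's description mentions another place's name."""
    triples = []
    for uri, place in places.items():
        for other_uri, other in places.items():
            if uri != other_uri and other["name"] in place["description"]:
                triples.append((uri, "http://example.org/mentions", other_uri))
    return triples

relations = discover_relations(places)
```

A real NER tool would also handle spelling variants and disambiguate names that match several places, rather than relying on exact string containment.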
This paper investigates necessities and pitfalls in existing data licensing practices on the World Wide Web. The authors analyzed four open data portals with respect to the available licenses and drew conclusions about the quantity and quality of the available licensing information. Additionally, the authors address reasoning issues with respect to the automatic detection and potential clearance of licensing conflicts when creating derivative works from multiple data sources. The issues raised in this paper should be taken into account when designing and implementing a Linked Data licensing policy.
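Automatic conflict detection of this kind can be sketched as a compatibility lookup over the licenses of the combined sources. The compatibility table below is a deliberately tiny, illustrative assumption (and not legal advice); the paper's actual reasoning machinery is more elaborate.

```python
# Hypothetical pairwise compatibility: which license may a
# derivative work carry when combining two licensed sources?
COMPATIBLE = {
    frozenset({"CC0"}): "CC0",
    frozenset({"CC-BY"}): "CC-BY",
    frozenset({"CC-BY-SA"}): "CC-BY-SA",
    frozenset({"CC0", "CC-BY"}): "CC-BY",
    frozenset({"CC0", "CC-BY-SA"}): "CC-BY-SA",
    frozenset({"CC-BY", "CC-BY-SA"}): "CC-BY-SA",
    # No-derivatives licenses are deliberately absent: any
    # combination involving them is a conflict.
}

def combine(licenses):
    """Fold the source licenses pairwise; return the license the
    derivative work may carry, or None on a licensing conflict."""
    result = licenses[0]
    for lic in licenses[1:]:
        key = frozenset({result, lic})
        if key not in COMPATIBLE:
            return None  # conflict detected
        result = COMPATIBLE[key]
    return result

mashup = combine(["CC0", "CC-BY", "CC-BY-SA"])
conflict = combine(["CC-BY", "CC-BY-ND"])
```

Even this toy shows why machine-readable license metadata matters: without it, the inputs to `combine` cannot be determined automatically in the first place.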