David Baehrens


Large-Scale Patent Classification at the European Patent Office

Monitoring competitors, contractors, or particular products is in high demand across many industries. Analyzing patents, research literature and expert blogs is one strategy for completing this complex puzzle. Such an analysis comprises the detailed assessment of large corpora with respect to specific technology fields. Searching, filtering and categorizing big data sets typically goes far beyond simple keyword search. Currently, gaining knowledge from such data is still tedious and time-consuming expert work. Current software often does not cover the full analytics lifecycle or does not achieve the precision required for automating analyses. Language technologies are a promising approach to supporting more fine-grained analysis of data.
Averbis' Information Discovery combines text-mining and machine-learning technologies to gain knowledge from unstructured data with both high precision and completeness. The software identifies hidden business-critical facts and relationships to effectively support information-driven decision making. By integrating heterogeneous data sources, we support IP professionals in competitor and patent landscaping analyses. In different real-world evaluations, our approach reaches accuracy rates on par with human expert judgements while reducing manual effort by up to 80%.

The European Patent Office (EPO) offers inventors a centralised patent granting process through which they can obtain patent protection in up to 40 European countries. The activities of the EPO depend on its capacity to process large volumes of information. One of the ways it achieves this is through the classification of documents. To allow efficient searching within existing patent documents, patent offices classify all patents according to hierarchical classification schemes. Since the beginning of 2013, the scheme used by the EPO has been the Cooperative Patent Classification (CPC), which is jointly managed by the EPO and the US Patent and Trademark Office (USPTO). CPC contains a total of around 250,000 subdivisions.

All patent applications received by the EPO are fully classified in CPC to a detailed technical level by EPO examiners. Prior to this, all applications are given a preliminary high-level classification within a few days of arrival at the EPO. The purpose of this 'pre-classification' is to route the application to the correct technical department within the EPO. This pre-classification of files also requires a lot of examiner and non-examiner time. The CPC scheme will be revised on a regular basis, and each revision will mean that a certain number of patent documents need to be re-classified. This re-classification can be a tedious task and takes a great deal of patent examiner time and resources.

The European Patent Office and Averbis recently entered into a collaboration on the pre-classification of incoming patent applications (use case 1) and the re-classification of existing patents after scheme revisions (use case 2). In this cooperation, various services are provided with the aim of automatically assigning patent applications to the right departments and automatically assigning new CPC codes to existing patents. The solution is based on complex linguistic and semantic analyses, as well as statistical machine learning methods. Up to 250,000 incoming patents are to be classified per year into up to 1,500 hierarchical categories. Due to the high number of categories, we applied a hierarchical classification approach with several concatenated classifiers on different levels of the hierarchy. The classification is a multi-label scenario, meaning that one patent may be classified into several categories. The classifiers themselves are based on Support Vector Machines. We use a fast and lightweight version that yields good classification performance as well as high training and classification speed. For training the classifiers, about 650,000 patents from the last ten years have been provided. On the available infrastructure (128 GB RAM, 16 cores), feature extraction and model creation take about 2 hours. The time needed for prediction is below one second per patent. EPO's minimal requirement is that up to 50% of the patents can be automatically routed to the right department with high precision. To this end, each classification is accompanied by a confidence score expressing the probability that the system is correct. Patents with high confidence are automatically routed to the respective department, while patents with low confidence are reviewed manually.
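The combination of top-down hierarchical classification, multiple labels per patent, and confidence-based routing can be illustrated with a minimal sketch. It is not the production system: the keyword scorers stand in for the trained Support Vector Machines, and the names `Node`, `keyword_scorer`, `classify` and `route`, the category labels, both thresholds, and the rule of taking the minimum confidence along a path are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A scorer maps a document to a confidence in [0, 1] for one category.
# In a real system this would be a trained classifier; here it is a stand-in.
Scorer = Callable[[str], float]


@dataclass
class Node:
    """One category in the hierarchy with its own (hypothetical) classifier."""
    label: str
    scorer: Scorer
    children: List["Node"] = field(default_factory=list)


def keyword_scorer(word: str, hit: float, miss: float = 0.05) -> Scorer:
    """Toy scorer: returns `hit` if the keyword occurs in the text, else `miss`."""
    return lambda doc: hit if word in doc else miss


def classify(doc: str, nodes: List[Node], threshold: float,
             parent_conf: float = 1.0) -> List[Tuple[str, float]]:
    """Descend into every category whose confidence passes the threshold.

    Several siblings may pass, so the result is multi-label. The confidence
    of a path is taken as the minimum along it (an assumption of this sketch).
    """
    results: List[Tuple[str, float]] = []
    for node in nodes:
        conf = min(parent_conf, node.scorer(doc))
        if conf < threshold:
            continue
        deeper = classify(doc, node.children, threshold, conf)
        # Keep the most specific labels; fall back to this node if no child passes.
        results.extend(deeper if deeper else [(node.label, conf)])
    return results


def route(doc: str, roots: List[Node], descend_threshold: float = 0.5,
          auto_threshold: float = 0.9) -> Tuple[str, List[Tuple[str, float]]]:
    """Route automatically if at least one label is confident enough."""
    labels = classify(doc, roots, descend_threshold)
    confident = [(label, conf) for label, conf in labels if conf >= auto_threshold]
    if confident:
        return "auto", confident
    return "manual", labels


# Hypothetical two-level hierarchy with illustrative labels.
ROOTS = [
    Node("F", keyword_scorer("engine", 0.95), [
        Node("F02", keyword_scorer("combustion", 0.92)),
        Node("F03", keyword_scorer("turbine", 0.6)),
    ]),
    Node("G", keyword_scorer("computer", 0.8)),
]

print(route("combustion engine with turbine", ROOTS))
print(route("computer program", ROOTS))
```

In this sketch the first document receives two candidate labels, but only the confident one is used for automatic routing, while the second document falls below the automatic threshold and is handed to a human. Raising `auto_threshold` trades a lower share of automatically routed patents for higher precision.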

The system is currently in its first year of productive use. In this talk, we present the EPO project in more detail and provide some technical background on the applied language technologies.


David Baehrens is project manager for knowledge management and information discovery solutions at Averbis GmbH, Freiburg, Germany. He is responsible for the technical direction of customer projects, supervises the customer key account for automatic classification of publications, and is the product owner of the Averbis Terminology Platform. David Baehrens holds a degree in computer science from the Technical University Berlin, Germany, where he worked in the field of machine learning and published methods in the domain of drug discovery and design. He has experience as a consultant, software architect and engineer of an information system for medical research labs, and as an analyst of data sets from neuroscience experiments. Before that, he developed software for collaborating on the curation and application of ontologies in the semantic web. He is familiar with a wide range of technologies, with in-depth expertise in web applications, knowledge organization systems, and machine learning applications.
