XML Stack Overflow: Filter patent data based on keywords using machine learning algorithm in R

Monday, 16 February 2015

Here are the ingredients for the question:

Data: patent data from the EPO-server

Quantity: approx. 10,000 files per year, 1980 to 2014

Format: XML

Example: http://ift.tt/1CzZPvl

Project: Using keywords such as "labor", "efficiency", and "automation", I would like to identify the patents that relate to the automation of a process and are therefore likely to replace human labor (e.g. supermarkets' self-checkout machines).

Aim: To obtain the share of patents per year (and per country) that are related to automation.

Question: I am new to machine learning, but from what I understand, this task calls for semi-supervised learning techniques. How do I incorporate the keywords mentioned above into a machine learning algorithm (e.g. k-nearest neighbours) in R? Also, do I need to merge all of the XML files into a data.frame beforehand? I am only interested in the ID number, application number, and description.
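To make the data.frame part of the question concrete, here is a minimal sketch of one possible approach using the xml2 package: parse each file on its own, keep only the three fields of interest, and stack the one-row results. The element names (id, application, description) and the two inline documents are placeholders invented for illustration; the real EPO schema is different, so the XPath expressions would need to be adapted.

```r
library(xml2)

# Two tiny stand-in documents. Real EPO files use a different, richer schema,
# so every element name below is a hypothetical placeholder.
docs <- c(
  "<patent><id>EP0000001</id><application>80100001</application><description>Self-checkout machine that automates payment.</description></patent>",
  "<patent><id>EP0000002</id><application>80100002</application><description>A new chemical compound for paint.</description></patent>"
)

# Extract only the three fields of interest from one document.
parse_patent <- function(x) {
  doc <- read_xml(x)  # x may be literal XML (as here) or a file path
  data.frame(
    id          = xml_text(xml_find_first(doc, "//id")),
    application = xml_text(xml_find_first(doc, "//application")),
    description = xml_text(xml_find_first(doc, "//description")),
    stringsAsFactors = FALSE
  )
}

# Parse each document separately and stack the one-row results.
patents <- do.call(rbind, lapply(docs, parse_patent))
print(patents)
```

With files on disk, docs could instead be the vector returned by list.files(dir, pattern = "\\.xml$", full.names = TRUE), where dir is the download folder. Under this approach the XML files themselves are never merged; only the extracted rows are combined.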

Any help is highly appreciated. A hands-on example would be amazing.
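As a starting point for the keyword question, the sketch below shows one way the keywords could enter a k-nearest-neighbour classifier: each description is turned into a vector of keyword counts, a few hand-labelled patents serve as training data, and knn() from the class package labels the rest. All texts, labels, and years are invented toy data, and this is plain supervised learning on a labelled sample rather than a semi-supervised method, so treat it as an illustration under those assumptions.

```r
library(class)  # provides knn(); ships with standard R installations

keywords <- c("labor", "efficiency", "automation")

# Build a matrix of keyword counts: one row per text, one column per keyword.
keyword_counts <- function(text, keywords) {
  text <- tolower(text)
  counts <- matrix(0, nrow = length(text), ncol = length(keywords),
                   dimnames = list(NULL, keywords))
  for (k in keywords) {
    counts[, k] <- lengths(regmatches(text, gregexpr(k, text, fixed = TRUE)))
  }
  counts
}

# Hypothetical hand-labelled training sample.
train_text <- c(
  "Automation of checkout reduces labor and improves efficiency.",
  "Robot arm for automation of assembly, saving labor.",
  "Chemical composition of a paint.",
  "Folding chair with a reinforced hinge."
)
train_label <- factor(c("automation", "automation", "other", "other"))

# Hypothetical unlabelled patents with their filing years.
new_text <- c(
  "Self-service kiosk: automation that replaces cashier labor.",
  "Recipe for a fruit beverage.",
  "Conveyor automation improving efficiency of sorting."
)
new_year <- c(1980, 1980, 1981)

# Classify each new patent by its single nearest labelled neighbour.
pred <- knn(train = keyword_counts(train_text, keywords),
            test  = keyword_counts(new_text, keywords),
            cl    = train_label, k = 1)

# Share of patents classified as automation-related, per year.
share_by_year <- tapply(pred == "automation", new_year, mean)
print(share_by_year)
```

Raw keyword counts are a deliberately crude feature set; a fuller document-term matrix and a larger labelled sample would likely be needed before the yearly shares mean much.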
