Filter patent data by keyword using a machine learning algorithm in R



Here are the ingredients for the question:



  • Data: patent data from the EPO-server

  • Quantity: ca. 10,000 files per year, from 1980 to 2014

  • Format: XML


  • Example: http://ift.tt/1CzZPvl




  • Project: Using keywords such as "labor", "efficiency", and "automation", I would like to identify the patents that relate to automating a process and would therefore replace human labor (e.g. supermarket self-checkout machines).




  • Aim: The goal is to obtain the share of patents per year (and per country) that are related to automation.




  • Question: Excuse me, I am new to machine learning, but from what I understand, the process requires semi-supervised learning techniques. How do I incorporate the keywords mentioned above into a machine learning algorithm (e.g. k-nearest neighbours) in R? Also: do I need to merge all of the XML files into a data.frame beforehand? I am only interested in the ID number, application number, and description.
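To make the data.frame part of the question concrete, here is a minimal sketch of pulling the three fields out of each file and stacking them into one data.frame. It assumes the xml2 package is installed. The element names (`<patent>`, `<id>`, `<application>`, `<description>`) are hypothetical stand-ins and not the real EPO schema, so the XPath expressions would need adjusting for the actual files.

```r
library(xml2)  # assumption: the xml2 package is installed

# Two tiny stand-in documents. The element names are hypothetical,
# not the real EPO schema.
xml_docs <- c(
  "<patent><id>EP0000001</id><application>80100001</application><description>Self-checkout machine that reduces labor.</description></patent>",
  "<patent><id>EP0000002</id><application>80100002</application><description>A new chemical compound.</description></patent>"
)

# Extract the three fields of interest from one document.
# read_xml() accepts literal XML (as here) or a file path.
parse_patent <- function(x) {
  doc <- read_xml(x)
  data.frame(
    id          = xml_text(xml_find_first(doc, "//id")),
    application = xml_text(xml_find_first(doc, "//application")),
    description = xml_text(xml_find_first(doc, "//description")),
    stringsAsFactors = FALSE
  )
}

# One row per patent, stacked into a single data.frame.
patents <- do.call(rbind, lapply(xml_docs, parse_patent))
print(patents)
```

With real files, `xml_docs` could presumably be replaced by a vector of paths from `list.files()`, since the parsing function is the same either way.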




Any help is highly appreciated. A hands-on example would be amazing.
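For illustration, the keyword step and the yearly share could be sketched in base R as below. This is plain keyword matching, not machine learning. The toy data and the column names (`year`, `country`, `description`, `automation`) are assumptions, and the rule "flag a patent if at least one keyword appears" is only one crude possibility.

```r
# Toy stand-in for the parsed patents; data and column names are assumptions.
patents <- data.frame(
  id      = c("EP1", "EP2", "EP3", "EP4"),
  year    = c(1980, 1980, 1981, 1981),
  country = c("DE", "DE", "FR", "FR"),
  description = c(
    "Self-checkout machine enabling automation of payment.",
    "A new chemical compound.",
    "Robot that improves labor efficiency in warehouses.",
    "Automation of bottle filling."
  ),
  stringsAsFactors = FALSE
)

keywords <- c("labor", "efficiency", "automation")

# One logical column per keyword: does the description mention it?
hits <- sapply(keywords, function(k) {
  grepl(k, patents$description, ignore.case = TRUE)
})

# Crude rule: flag a patent if at least one keyword appears.
patents$automation <- rowSums(hits) > 0

# Share of flagged patents per year and country
# (the mean of a logical vector is a proportion).
shares <- aggregate(automation ~ year + country, data = patents, FUN = mean)
print(shares)
```

The `hits` matrix could also serve as a simple feature matrix, and a hand-labelled subset of patents could then train a classifier such as `knn()` from the `class` package.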

