I am trying to scrape some data from TripAdvisor. I have been following the post by Rdabbler and was able to get the review text, the rating date and the rating (code reproduced below). However, I would also like to know more about the person who wrote each review: specifically, where they are from, how many cities they have visited and how many helpful votes they have. What XPath should I pass to the getNodeSet function to get these?
I am an R beginner and a complete novice in XML/XPath, so I would appreciate any help. Thanks in advance.
Code:
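# htmlTreeParse, getNodeSet, xmlAttrs and xmlValue come from the XML package
library(XML)
options(stringsAsFactors=FALSE)
urllink <- "http://ift.tt/1cIWVPr"
# parse the page into an internal document that can be queried with XPath
doc=htmlTreeParse(urllink,useInternalNodes=TRUE)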
## get node sets
# review id
ns_id=getNodeSet(doc,"//div[@class='quote']/a[@href]")
# top quote for a review
ns_topquote=getNodeSet(doc,"//div[@class='quote']/a[@href]/span")
# get partial entry for review that shows in the page
ns_partialentry=getNodeSet(doc,"//div[@class='col2of2']//p[@class='partial_entry'][1]")
# date of rating
ns_ratingdt=getNodeSet(doc,"//div[@class='col2of2']//span[@class='ratingDate relativeDate' or @class='ratingDate']")
# rating (number of stars)
ns_rating=getNodeSet(doc,"//div[@class='col2of2']//span[@class='rate sprite- rating_s rating_s']/img[@alt]")
## get actual values extracted from node sets
# review id
id=sapply(ns_id,function(x) xmlAttrs(x)["id"])
# top quote for the review
topquote=sapply(ns_topquote,function(x) xmlValue(x))
# rating date (couple of formats seem to be used and hence a and b below)
ratingdta=sapply(ns_ratingdt,function(x) xmlAttrs(x)["title"])
ratingdtb=sapply(ns_ratingdt,function(x) xmlValue(x))
# rating (number of stars)
rating=sapply(ns_rating,function(x) xmlAttrs(x)["alt"])
# partial entry for review
partialentry=sapply(ns_partialentry,function(x) xmlValue(x))
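For the reviewer details, one possible approach is to select each review container first and then run a relative XPath (starting with ".") inside it, so that a reviewer with a missing field yields NA instead of shifting the other values out of line. The following is only a minimal sketch on made-up markup: the class names 'review', 'location', 'cities' and 'helpful', and the helpers first_value and to_count, are hypothetical and not taken from the real page, so the actual class names would need to be looked up in the page source.

```r
library(XML)

# Made-up markup standing in for a reviews page. The class names are
# placeholders (assumptions), not TripAdvisor's real ones.
html <- paste0(
  "<html><body>",
  "<div class='review'>",
  "<div class='location'>Paris, France</div>",
  "<span class='cities'>12 cities</span>",
  "<span class='helpful'>5 helpful votes</span>",
  "</div>",
  "<div class='review'>",
  "<span class='helpful'>2 helpful votes</span>",
  "</div>",
  "</body></html>")
doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: text of the first node matching a relative XPath
# under `node`, or NA when nothing matches.
first_value <- function(node, path) {
  hits <- getNodeSet(node, path)
  if (length(hits) == 0) NA_character_ else xmlValue(hits[[1]])
}

# Hypothetical helper: keep only the digits, e.g. "12 cities" becomes 12.
to_count <- function(x) as.integer(gsub("[^0-9]", "", x))

# One node per review; every field is then looked up inside that node.
reviews <- getNodeSet(doc, "//div[@class='review']")

reviewers <- data.frame(
  location = sapply(reviews, first_value, path = ".//div[@class='location']"),
  cities   = to_count(sapply(reviews, first_value, path = ".//span[@class='cities']")),
  helpful  = to_count(sapply(reviews, first_value, path = ".//span[@class='helpful']")),
  stringsAsFactors = FALSE)

print(reviewers)
```

Because every lookup is anchored to one review node, the resulting data frame always has one row per review, which is not guaranteed when each field is collected with a separate document-wide "//" query.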