This question concerns a situation where I am able to achieve what I want, but I would like to know if there is a better, more efficient way, i.e. a solution that uses less memory and runs more quickly.
I am trying to put together a dataset of all the individual performances of cricketers in every game. Here, I am just focusing on batting results. These data are available at the cricinfo website. Using their search engine, it is possible to display up to 200 individual results per page. For this particular set of performances (T20, a type of cricket match), there are 8890 performances listed across 45 pages. All pages hold 200 results except the last, which has 90.
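As a quick sanity check, 44 full pages of 200 results plus the final page of 90 account for every performance listed:

44*200 + 90
#[1] 8890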
I stored the URLs for all 45 pages like this:
library(XML)
nums <- 1:45
urls <- paste("http://ift.tt/1r05Gbh",
              nums, ";size=200;template=results;type=batting;view=innings", sep = "")
urls                                       # 45 urls
names(urls) <- paste("x", nums, sep = "")  # out of habit, I named them
Next, I used the following for loop to extract, from each page, the table holding that page's 200 performances. For every webpage/URL, the table of interest is the 49th element of the list returned by readHTMLTable. I stored all the results in a list.
results.tables <- vector("list", length(nums))
for(i in 1:length(urls)){
  x <- readHTMLTable(urls[i])       # parses every table on the page
  results.tables[[i]] <- x[[49]]    # the batting innings table is the 49th
}
results.tables #contains all the data
The following just makes it all into one data frame and gets rid of two meaningless columns:
T20.bat <- do.call("rbind", results.tables)
T20.bat <- T20.bat[c(1:8, 10:12)]   # keep the 11 meaningful columns
head(T20.bat)
# Player Runs Mins BF 4s 6s SR Inns Opposition Ground Start Date
#1 AJ Finch (Aus) 156 70 63 11 14 247.61 1 v England Southampton 29 Aug 2013
#2 BB McCullum (NZ) 123 72 58 11 7 212.06 1 v Bangladesh Pallekele 21 Sep 2012
#3 RE Levi (SA) 117* 67 51 5 13 229.41 2 v New Zealand Hamilton 19 Feb 2012
#4 CH Gayle (WI) 117 75 57 7 10 205.26 1 v South Africa Johannesburg 11 Sep 2007
#5 BB McCullum (NZ) 116* 87 56 12 8 207.14 1 v Australia Christchurch 28 Feb 2010
#6 AD Hales (Eng) 116* 97 64 11 6 181.25 2 v Sri Lanka Chittagong 27 Mar 2014
The part that takes by far the longest is the for loop, and this is where I'm wondering if I am missing something: is there a more efficient way to run the readHTMLTable function over all 45 URLs?
I realize that running this in some other programming language would be quicker and less memory intensive, but I am looking for a better R strategy if possible, especially as I'd like to repeat this type of data collection for a much longer list of URLs.
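For reference, here is a minimal sketch (untested) of the kind of alternative I have in mind. It assumes that readHTMLTable's which argument can be used to return only the 49th table on each page (the same index used in the loop above); get.innings is just a throwaway helper name and the mc.cores value is arbitrary:

library(XML)
library(parallel)

# assumes the batting innings table is still the 49th table on every page
get.innings <- function(u) readHTMLTable(u, which = 49)

# lapply version of the loop above
results.tables <- lapply(urls, get.innings)

# or, on a multi-core (non-Windows) machine, fetch several pages at once:
# results.tables <- mclapply(urls, get.innings, mc.cores = 4)

T20.bat <- do.call("rbind", results.tables)

My understanding is that lapply on its own is unlikely to be much faster than the loop, since the time is dominated by downloading and parsing each page; any gains would come from not converting the other ~48 tables on each page into data frames and from fetching pages in parallel.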