This question concerns a situation where I am able to achieve what I want, but I would like to know if there is a better, more efficient way, i.e. a solution that uses less memory and runs more quickly.
I am trying to put together a dataset of all the individual performances of cricketers in every game. Here, I am just focusing on batting results. These data are available at the cricinfo website. Using their search engine, it is possible to display up to 200 individual results per page. For this particular set of performances (T20, a type of cricket match), there are 8890 performances listed across 45 pages. All pages hold 200 results except the last, which has 90.
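As a quick sanity check, 44 full pages of 200 results plus the final page of 90 account for every performance listed:

44*200 + 90
#[1] 8890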
I stored the URLs for all 45 pages like this:
library(XML)
nums <- 1:45
urls <- paste("http://ift.tt/1r05Gbh",
              nums, ";size=200;template=results;type=batting;view=innings", sep = "")
urls                                       # 45 urls
names(urls) <- paste("x", nums, sep = "")  # out of habit, I named them
Next, I used the following for loop to extract, from each page, the table holding that page's 200 performances. For every webpage/URL, the table of interest is the 49th element of the list returned by readHTMLTable. I stored all the results in a list.
results.tables <- vector("list", length(nums))
for(i in 1:length(urls)){
  x <- readHTMLTable(urls[i])       # parses every table on the page
  results.tables[[i]] <- x[[49]]    # the batting innings table is the 49th
}
results.tables #contains all the data
The following just makes it all into one data frame and gets rid of two meaningless columns:
T20.bat <- do.call("rbind", results.tables)
T20.bat <- T20.bat[c(1:8, 10:12)]   # keep the 11 meaningful columns
head(T20.bat)
# Player Runs Mins BF 4s 6s SR Inns Opposition Ground Start Date
#1 AJ Finch (Aus) 156 70 63 11 14 247.61 1 v England Southampton 29 Aug 2013
#2 BB McCullum (NZ) 123 72 58 11 7 212.06 1 v Bangladesh Pallekele 21 Sep 2012
#3 RE Levi (SA) 117* 67 51 5 13 229.41 2 v New Zealand Hamilton 19 Feb 2012
#4 CH Gayle (WI) 117 75 57 7 10 205.26 1 v South Africa Johannesburg 11 Sep 2007
#5 BB McCullum (NZ) 116* 87 56 12 8 207.14 1 v Australia Christchurch 28 Feb 2010
#6 AD Hales (Eng) 116* 97 64 11 6 181.25 2 v Sri Lanka Chittagong 27 Mar 2014
The part that takes by far the longest is the for loop, and this is where I'm wondering if I am missing something: is there a more efficient way to run the readHTMLTable function over all 45 URLs?
I realize that running this in some other programming language would be quicker and less memory intensive, but I am looking for a better R strategy if possible, especially as I'd like to repeat this type of data collection for a much longer list of URLs.
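For reference, here is a minimal sketch (untested) of the kind of alternative I have in mind. It assumes that readHTMLTable's which argument can be used to return only the 49th table on each page (the same index used in the loop above); get.innings is just a throwaway helper name and the mc.cores value is arbitrary:

library(XML)
library(parallel)

# assumes the batting innings table is still the 49th table on every page
get.innings <- function(u) readHTMLTable(u, which = 49)

# lapply version of the loop above
results.tables <- lapply(urls, get.innings)

# or, on a multi-core (non-Windows) machine, fetch several pages at once:
# results.tables <- mclapply(urls, get.innings, mc.cores = 4)

T20.bat <- do.call("rbind", results.tables)

My understanding is that lapply on its own is unlikely to be much faster than the loop, since the time is dominated by downloading and parsing each page; any gains would come from not converting the other ~48 tables on each page into data frames and from fetching pages in parallel.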