XML: memory management with the XML package in RStudio

I'm currently working on a simple web-scraping script that reads pieces of text from around 2000 web pages and stores them in a matrix. This works fine, but I have a RAM problem: while the script runs, RStudio's memory usage quickly climbs to 1.5–2 GB, and removing the large objects from the workspace does not seem to bring it back down.

Here is what my code looks like

    # Find the number of pages to search through
    URL <- some URL
    rawpage <- htmlTreeParse(URL, useInternal = TRUE)
    noOfPages <- as.numeric(gsub("\\.", "", gsub("\n", "",
        xpathSApply(rawpage, "//li[@class='first-last-page']/a", xmlValue))))

    # Create an empty matrix
    Data <- character(0)

    # Iterate through the pages
    for (i in 1:noOfPages) {
        provPage <- paste(URL, "/p", i, sep = "")
        rawpage <- htmlTreeParse(provPage, useInternal = TRUE)
        # Here I retrieve some URLs found on each page
        Data <- rbind(Data, as.matrix(xpathSApply(rawpage,
            "//div[@class='search-result-media']/a", xmlGetAttr, "href")))
    }

It seems that the last part of the for loop, where the Data matrix is updated with rbind(), is the main cause of the huge memory usage.
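For reference, the two things usually suggested for this pattern are: stop growing Data with rbind() inside the loop (collect each page's result in a preallocated list and bind once at the end, since rbind() copies the whole matrix on every iteration), and explicitly release each parsed document with free(), because useInternal = TRUE keeps a C-level document alive outside R's garbage collector. Below is a minimal sketch of the list-then-bind part; the XML-specific calls (htmlTreeParse, xpathSApply, free) are shown as comments, and the dummy hrefs are hypothetical stand-ins for the real scraped values:

```r
# Collect per-page results in a preallocated list instead of rbind-ing
# inside the loop.
noOfPages <- 3                        # stand-in for the value parsed above
results <- vector("list", noOfPages)  # one slot per page

for (i in seq_len(noOfPages)) {
    # In the real script this body would be:
    #   rawpage <- htmlTreeParse(paste(URL, "/p", i, sep = ""), useInternal = TRUE)
    #   hrefs   <- xpathSApply(rawpage, "//div[@class='search-result-media']/a",
    #                          xmlGetAttr, "href")
    #   free(rawpage)  # release the C-level document held by useInternal = TRUE
    hrefs <- paste0("/result/", i, "-", 1:2)  # dummy data standing in for hrefs
    results[[i]] <- as.matrix(hrefs)
}

# Combine everything in a single allocation at the end
Data <- do.call(rbind, results)
nrow(Data)
```

This keeps at most one parsed page in memory at a time and avoids the quadratic copying that repeated rbind() causes.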

If someone could help me understand this issue and decrease the RAM usage somehow that would be great.

Kind regards,

Andrew
