I'm currently working on a simple web-scraping script that reads pieces of text from around 2000 web pages and stores them in a matrix. This works fine, but I have some RAM issues: while the script runs, the memory used by RStudio quickly rises to 1.5-2 GB, and removing large objects from the workspace does not seem to decrease the memory usage.
Here is what my code looks like:

#Find the number of pages to search through
URL <- some URL
rawpage <- htmlTreeParse(URL, useInternal=TRUE)
noOfPages <- as.numeric(gsub("\\.","",gsub("\n","",xpathSApply(rawpage, "//li[@class='first-last-page']/a",xmlValue))))

#Create an empty matrix
Data <- character(0)

#Iterate through number of pages
for (i in 1:noOfPages) {
  provPage <- paste(URL,"/p",i,sep = "")
  rawpage <- htmlTreeParse(provPage, useInternal=TRUE)
  #Here I retrieve some URLs found on each page
  Data <- rbind(Data,as.matrix(xpathSApply(rawpage, "//div[@class='search-result-media']/a",xmlGetAttr, "href")))
}

The last part of the for loop, where the Data matrix is grown with rbind, seems to be the main cause of the huge memory usage.
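For reference, would an accumulation pattern along these lines be expected to behave better? This is only a minimal sketch I have not fully tested, assuming the same XML package functions and XPath expressions as above and using a placeholder URL; it collects the links in a pre-allocated list, releases each parsed document with free(), and only builds the matrix once after the loop.

library(XML)

URL <- "http://www.example.com/search"  # placeholder, not my real URL
rawpage <- htmlTreeParse(URL, useInternal = TRUE)
noOfPages <- as.numeric(gsub("\\.", "", gsub("\n", "", xpathSApply(rawpage, "//li[@class='first-last-page']/a", xmlValue))))
free(rawpage)  # release the C-level document held by the internal parser

# Pre-allocate a list with one slot per page
pages <- vector("list", noOfPages)

for (i in seq_len(noOfPages)) {
  provPage <- paste(URL, "/p", i, sep = "")
  rawpage <- htmlTreeParse(provPage, useInternal = TRUE)
  # Store this page's links in its own list slot instead of growing a matrix
  pages[[i]] <- xpathSApply(rawpage, "//div[@class='search-result-media']/a", xmlGetAttr, "href")
  free(rawpage)  # free the parsed document before the next iteration
}

# Combine everything into a one-column matrix in a single step
Data <- as.matrix(unlist(pages))

My understanding is that rbind inside the loop copies the whole matrix on every iteration, and that the internal documents created by htmlTreeParse(useInternal=TRUE) are not fully released by gc() alone, which is why the sketch calls free(). Is that the right explanation for what I'm seeing?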
If someone could help me understand this issue and suggest a way to decrease the RAM usage, that would be great.
Kind regards,
Andrew