I am writing a function that scrapes a huge XML file. Since there are about one million nodes, I would like to use the foreach package. The code and relevant comments are below.
library(XML)
xmlfile <- xmlParse(file = "./DATI/xmldata.xml") # read the file
#explore
if(exists("dbfromxml")) rm(dbfromxml)
root<-xmlRoot(xmlfile)
persons<-xmlChildren(root)
rm(root)
nrecords<-xmlSize(persons)
# set up the parallel framework
library(foreach)
library(doParallel)
cores <- getOption("mc.cores", detectCores())
cl<-makeCluster(cores,outfile="ciao.txt")
registerDoParallel(cl)
dbfromxml <- foreach(i = 1:10, .combine = rbind, .packages = "XML") %dopar% {
  personsxml <- persons[[i]]
  processaXMLPersona(personsxml) # this function works properly outside a doParallel environment
}
stopCluster(cl)
The problem arises when I set up the doParallel / makeCluster infrastructure and load the XML package on the workers with .packages = "XML": the R cluster crashes. The following errors appear: Error in unserialize(socklist[[n]]) : error reading from connection; Error in serialize(data, node$con) : error writing to connection
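One possible explanation, stated as an assumption rather than a confirmed diagnosis: the nodes returned by xmlChildren() are external pointers into C-level memory owned by the master process, and such pointers are not valid once they have been serialized to a worker. A minimal sketch of a workaround under that assumption is to convert each node to text with saveXML() before the loop and re-parse it inside each worker. The toy document and the data.frame() call are hypothetical stand-ins for the real file and for processaXMLPersona().

```r
library(XML)
library(foreach)
library(doParallel)

# Toy document standing in for ./DATI/xmldata.xml (hypothetical structure)
doc <- xmlParse(
  "<people><person><name>Ann</name></person><person><name>Bob</name></person></people>",
  asText = TRUE
)
persons <- xmlChildren(xmlRoot(doc))

# Turn every node into an ordinary character string; strings can be
# serialized and sent to the workers, unlike external pointers
persons_txt <- unname(vapply(persons, saveXML, character(1)))

cl <- makeCluster(2)
registerDoParallel(cl)

res <- foreach(txt = persons_txt, .combine = rbind, .packages = "XML") %dopar% {
  # Re-parse the string inside the worker so the node lives in its memory
  d <- xmlParse(txt, asText = TRUE)
  node <- xmlRoot(d)
  # Stand-in for processaXMLPersona(node)
  data.frame(name = xmlValue(node[["name"]]), stringsAsFactors = FALSE)
}

stopCluster(cl)
```

With about one million nodes, the per-task overhead may dominate, so sending chunks of strings rather than one string per task would be a natural refinement of this sketch.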