I know that Apache Spark was primarily developed to analyze unstructured data. However, I have to read and process a huge XML file (greater than 1 GB), and using Apache Spark is a requirement.
Googling a little, I found how an XML file can be read by a Spark job with proper partitioning. As described here, the hadoop-streaming
library can be used, like this:
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf, s"hdfs://$master:9000/data.xml")

// Load documents, splitting wrt <page> tag.
val documents = sparkContext.hadoopRDD(
  jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])
Every chunk of information can then be processed in a Scala/Java object using dom4j, or JAXB (more complex).
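For instance, each record's value holds the raw text of one <page>...</page> element, so a per-chunk dom4j parse could be sketched like this (a sketch, assuming the `documents` RDD from above, dom4j on the classpath, and a hypothetical <title> child element):

```scala
import org.dom4j.DocumentHelper

// Parse each chunk and pull out a field; DocumentHelper.parseText
// throws DocumentException if the fragment is not well-formed XML.
val titles = documents.map { case (_, value) =>
  val doc = DocumentHelper.parseText(value.toString)
  doc.getRootElement.elementText("title") // "title" is a hypothetical child element
}
```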
Now, the problem is the following: the XML file should be validated before processing it. How can I do this in a way that conforms to Spark? As far as I know, the StreamXmlRecordReader
used to split the file does not perform any validation.
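One workaround I am considering, since whole-file validation seems to conflict with splitting, is validating each record individually against a schema with the standard javax.xml.validation API (a sketch, assuming a hypothetical page.xsd describing a single <page> element and available to the workers):

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.SAXException

// Build one Validator per partition: Validator instances are neither
// serializable nor thread-safe, so they must be created on the workers.
val validated = documents.mapPartitions { iter =>
  val schema = SchemaFactory
    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    .newSchema(new StreamSource("page.xsd")) // hypothetical schema file
  val validator = schema.newValidator()
  iter.filter { case (_, value) =>
    try {
      validator.validate(new StreamSource(new StringReader(value.toString)))
      true
    } catch {
      case _: SAXException => false // drop records that fail validation
    }
  }
}
```

But this only checks each chunk in isolation, not the document as a whole, so I am not sure it counts as validating the file.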