XML: Validate an XML file using Apache Spark

I know that Apache Spark was primarily developed to analyze unstructured data. However, I have to read and process a huge XML file (larger than 1 GB), and using Apache Spark is a requirement.

After some googling, I found how an XML file can be read by a Spark job with proper partitioning. As described here, the hadoop-streaming library can be used, like this:

  import org.apache.hadoop.mapred.JobConf

  val jobConf = new JobConf()
  jobConf.set("stream.recordreader.class",
              "org.apache.hadoop.streaming.StreamXmlRecordReader")
  jobConf.set("stream.recordreader.begin", "<page")
  jobConf.set("stream.recordreader.end", "</page>")
  org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf,
    s"hdfs://$master:9000/data.xml")

  // Load documents, splitting with respect to the <page> tag.
  val documents = sparkContext.hadoopRDD(jobConf,
    classOf[org.apache.hadoop.streaming.StreamInputFormat],
    classOf[org.apache.hadoop.io.Text],
    classOf[org.apache.hadoop.io.Text])

Each chunk of information can then be unmarshalled into a Scala/Java object, for instance with dom4j or with JAXB (more complex).
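For illustration, a minimal sketch of what the per-chunk parsing could look like with dom4j; the <title> element is a hypothetical child of <page>, and with StreamXmlRecordReader the XML fragment is carried in the key of each record:

  import java.io.StringReader
  import org.dom4j.io.SAXReader

  // Parse each <page> fragment into a dom4j Document and extract a field.
  val titles = documents.map { case (fragment, _) =>
    val reader = new SAXReader()
    val doc = reader.read(new StringReader(fragment.toString))
    doc.getRootElement.elementText("title")  // hypothetical child element of <page>
  }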

Now, the problem is the following: the XML file should be validated before processing it. How can I do this in a way that conforms to the Spark model? As far as I know, the StreamXmlRecordReader used to split the file does not perform any validation.
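For example, one could validate each extracted fragment against an XSD inside the job, roughly as sketched below (page.xsd is a hypothetical schema file that would have to be reachable from every worker); but that only checks the individual fragments, not the document as a whole:

  import java.io.{File, StringReader}
  import javax.xml.XMLConstants
  import javax.xml.transform.stream.StreamSource
  import javax.xml.validation.SchemaFactory
  import org.xml.sax.SAXException

  val validated = documents.mapPartitions { iter =>
    // Build the validator once per partition: Schema and Validator are not serializable.
    val factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schema = factory.newSchema(new StreamSource(new File("page.xsd")))  // hypothetical XSD
    val validator = schema.newValidator()
    iter.map { case (fragment, _) =>
      val xml = fragment.toString
      val isValid =
        try { validator.validate(new StreamSource(new StringReader(xml))); true }
        catch { case _: SAXException => false }
      (xml, isValid)
    }
  }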
