SparkSQL with the Databricks spark-xml library: 'Malformed row' on valid XML

Suppose I'm running Spark 1.6.0 on Oracle JDK 1.8 (build 1.8.0_65-b17) in an ipython notebook session started with the following line:

  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark --packages com.databricks:spark-xml_2.10:0.3.1

This pulls in the Databricks spark-xml package (https://github.com/databricks/spark-xml). I then run the following code in pyspark:

  dmoz = '/Users/user/dummy.xml'
  v = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='Topic', failFast=True).load(dmoz)
  print v.schema

where dummy.xml contains this tiny fragment of a DMOZ dump (http://rdf.dmoz.org/):

  <?xml version="1.0" encoding="UTF-8"?>
  <RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
    <!-- Generated at 2016-01-24 00:05:51 EST from DMOZ 2.0 -->
    <Topic r:id="">
      <catid>1</catid>
    </Topic>
  </RDF>

It validates against every validator I've been able to find. The result is:

  ...
  Py4JJavaError: An error occurred while calling o82.load.
  : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.RuntimeException: Malformed row (failing fast): <Topic r:id="">
      <catid>1</catid>
    </Topic>
      at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:101)
      at com.databricks.spark.xml.util.InferSchema$$anonfun$3$$anonfun$apply$2.apply(InferSchema.scala:83)
  ...

It refers to this line of code: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/InferSchema.scala#L101. That line sits in the handler for XMLStreamException, so the exception must have been thrown by one of the javax.xml.stream classes used just above it.

Unfortunately, the handler discards the details of the exception, so I can't tell what exactly is wrong with the row. However, removing the namespace prefix from the attribute (i.e. r:id becomes just id) makes the error go away. I suspect I've hit some common pitfall and just need to know which one.
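Since the handler hides the parser's message, one way to get a hint is to feed the row from the error message to a namespace-aware parser outside Spark. The sketch below is only an analogy: it uses Python's standard xml.etree.ElementTree (expat), not the javax.xml.stream classes that spark-xml uses, and it assumes (I have not confirmed this in the spark-xml source) that each Topic element is parsed in isolation, cut off from the xmlns:r declaration on the enclosing RDF element. The names ROW, DOC and try_parse are made up for this illustration.

  ```python
  import xml.etree.ElementTree as ET

  # The row exactly as it appears in the "Malformed row" message:
  # the r: prefix is used, but nothing in this fragment declares it.
  ROW = '<Topic r:id=""><catid>1</catid></Topic>'

  # The same row inside a root element that declares the r: prefix,
  # mirroring the structure of dummy.xml.
  DOC = ('<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">'
         + ROW + '</RDF>')


  def try_parse(text):
      """Return None if text parses, otherwise the parser's error message."""
      try:
          ET.fromstring(text)
          return None
      except ET.ParseError as exc:
          return str(exc)


  row_error = try_parse(ROW)
  doc_error = try_parse(DOC)

  print("row alone:     %s" % row_error)
  print("full document: %s" % doc_error)
  ```

If the isolated row fails here while the full document parses, that would match the observation that dropping the r: prefix makes the error disappear. It does not prove that spark-xml fails for the same reason.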
