XML: Parsing a nested XML string from a Hive table using PIG

I'm trying to use PIG to extract some XML from a field in a Hive table, rather than from an XML file (which is the assumption of most of the examples I have read). The XML comes from a table arranged as follows:

  ID, {XML_string}    

The XML string contains an arbitrary number of rows, each containing at least one of up to 10 attributes. We can assume that attribute #1 will always be present and will be unique.

  <row>
    <att1></att1>
    <att2></att2>
    ...
  </row>
  <row>
    <att1></att1>
    <att2></att2>
    ...
  </row>
  ...

I want to transform this into a new table with each row in the XML string exploded out into a separate row in the new table, but I still want to include the ID from the existing table.

  ID  att1  att2  att3
  ==  ====  ====  ====
  1   1     xxx   xxx
  1   2     xxx   xxx
  1   3     xxx   xxx
  2   1     xxx   xxx

I've approached this so far in PIG by using XPathAll, having read a lot of advice that suggests avoiding regex for XML parsing.

  REGISTER /home/piggybank-0.12.0.jar
  DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

  A = LOAD 'HiveTable' USING org.apache.hive.hcatalog.pig.HCatLoader();

  -- each XPathAll(...) AS attN is one projection within the same GENERATE,
  -- so they are separated by commas rather than semicolons
  B = FOREACH A GENERATE id,
      XPathAll(xml_string, 'ROW/_ATT1') AS att1,
      XPathAll(xml_string, 'ROW/_ATT2') AS att2,
      XPathAll(xml_string, 'ROW/_ATT3') AS att3;

  DUMP B;

This results in the following output, assuming there are three <row> instances for ID 1:

  (1,(Att1-i1,Att1-i2,Att1-i3),(Att2-i1,Att2-i2,Att2-i3),(Att3-i1,Att3-i2,Att3-i3))

All of the information appears to be there; I just can't find the way to pull out the first element from each of the embedded tuples into a new row, then the second elements, and so on. In other words:

  (1, Att1-i1, Att2-i1, Att3-i1)
  (1, Att1-i2, Att2-i2, Att3-i2)
  (1, Att1-i3, Att2-i3, Att3-i3)
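The obvious tool seems to be FLATTEN, but as far as I understand Pig's semantics, flattening a tuple substitutes its fields into the same record, so it only widens the one row rather than splitting it into three. A sketch of what I mean, continuing from B above:

  -- flattening tuple-valued fields expands them in place, so with three
  -- instances per attribute this yields one wide row, not three rows:
  C = FOREACH B GENERATE id, FLATTEN(att1), FLATTEN(att2), FLATTEN(att3);
  -- (1,Att1-i1,Att1-i2,Att1-i3,Att2-i1,Att2-i2,Att2-i3,Att3-i1,Att3-i2,Att3-i3)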

I'm clinging to the hope this can be done using Hive + Pig without having to resort to Java, etc. I'd appreciate any insights. I'm not precious about the approach taken so far, so if I have gone the long way round, please tell me!
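One direction I've been considering, in case it helps frame an answer: splitting the string into one fragment per <row> before parsing, so that each XPathAll call can only ever match once. This is a rough, untested sketch; it assumes the \u0001 character never occurs in the data, uses paths matching the sample XML above rather than my real ROW/_ATTn paths, and would need guarding for rows where an attribute is absent:

  REGISTER /home/piggybank-0.12.0.jar
  DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

  A = LOAD 'HiveTable' USING org.apache.hive.hcatalog.pig.HCatLoader();

  -- tag each </row> with a \u0001 separator, then cut on it; FLATTEN of the
  -- bag returned by TOKENIZE gives one record per <row>...</row> fragment
  B = FOREACH A GENERATE id,
      FLATTEN(TOKENIZE(REPLACE(xml_string, '</row>', '</row>\u0001'), '\u0001'))
          AS row_xml;

  -- each fragment holds a single <row>, so each XPathAll returns a
  -- one-element tuple; FLATTEN turns that tuple into a plain field
  C = FOREACH B GENERATE id,
      FLATTEN(XPathAll(row_xml, 'row/att1')) AS att1,
      FLATTEN(XPathAll(row_xml, 'row/att2')) AS att2,
      FLATTEN(XPathAll(row_xml, 'row/att3')) AS att3;

  DUMP C;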
