I'm trying to use Pig to extract some XML from a field in a Hive table, rather than from an XML file (which is what most of the examples I have read assume). The XML comes from a table arranged as follows:

    ID, {XML_string}

The XML string contains any number of rows (n), each containing at least one of up to 10 attributes. We can assume that attribute #1 will always be present and will be unique.

    <row>
      <att1></att1>
      <att2></att2>
      ...
    </row>
    <row>
      <att1></att1>
      <att2></att2>
      ...
    </row>
    ...

I want to transform this into a new table with each row in the XML string exploded out into a separate row in the new table, but I still want to include the ID from the existing table.

    ID   att1  att2  att3
    ==   ====  ====  ====
    1    1     xxx   xxx
    1    2     xxx   xxx
    1    3     xxx   xxx
    2    1     xxx   xxx

I've approached this so far in Pig by using XPathAll. I've read a lot of advice that suggests avoiding regex for XML parsing.

    -- Load the piggybank jar and alias the XPathAll UDF
    REGISTER /home/piggybank-0.12.0.jar;
    DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();

    -- Read the Hive table through HCatalog
    A = LOAD 'HiveTable' USING org.apache.hive.hcatalog.pig.HCatLoader();

    -- One XPathAll call per attribute; the generated fields are separated by commas
    B = FOREACH A GENERATE
            id,
            XPathAll(xml_string, 'ROW/_ATT1') AS att1,
            XPathAll(xml_string, 'ROW/_ATT2') AS att2,
            XPathAll(xml_string, 'ROW/_ATT3') AS att3;

    DUMP B;

This results in the following output, assuming there are three row instances for item 1:

    (1,(Att1-i1,Att1-i2,Att1-i3),(Att2-i1,Att2-i2,Att2-i3),(Att3-i1,Att3-i2,Att3-i3))

All of the information appears to be there; I just can't work out how to pull the first element from each of the embedded tuples into a new row, then the second elements, and so on. In other words:

    (1, Att1-i1, Att2-i1, Att3-i1)
    (1, Att1-i2, Att2-i2, Att3-i2)
    (1, Att1-i3, Att2-i3, Att3-i3)

I'm clinging to the hope this can be done using Hive + Pig without having to resort to Java, etc. I'd appreciate any insights. I'm not precious about the approach taken so far, so if I have gone the long way round, please tell me!
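
To make the target reshaping concrete, here is a minimal Python sketch of the logic only, not a Pig or Hive solution. It parses the XML one row at a time instead of one attribute at a time, so a row that lacks an attribute yields an empty slot and the columns stay aligned. The function name `explode_rows`, the wrapper element `<rows>`, and the lowercase tag names `row`, `att1`, `att2`, `att3` are assumptions taken from the sample layout above. It also assumes the field holds bare sibling `<row>` elements with no XML declaration.

```python
import xml.etree.ElementTree as ET


def explode_rows(record_id, xml_string, attributes=("att1", "att2", "att3")):
    """Return one (id, att1, att2, att3) tuple per <row> element.

    Hypothetical helper for illustration; tag names are assumed from the
    sample XML in the question.
    """
    # The field holds sibling <row> elements with no single root element,
    # so wrap them in one before parsing (assumption).
    root = ET.fromstring("<rows>" + xml_string + "</rows>")
    result = []
    for row in root.findall("row"):
        # findtext returns None when the child element is absent,
        # which keeps every output tuple the same width.
        values = tuple(row.findtext(name) for name in attributes)
        result.append((record_id,) + values)
    return result


if __name__ == "__main__":
    sample = (
        "<row><att1>1</att1><att2>a</att2><att3>b</att3></row>"
        "<row><att1>2</att1><att3>c</att3></row>"
    )
    for out in explode_rows(1, sample):
        print(out)
```

In Pig, a function like this would normally return a bag that FLATTEN then expands into one output row per element; that wiring is not shown here.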