Wednesday, 13 April 2016

XML : Hive: do we really need an XML or JSON SerDe to parse XML or JSON data?

A typical task is reading data serialized in JSON, XML, or some other format that is not plain delimited text. Several Hive SerDes are suggested for this task:

XML Serde: https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources

JSON Serde: https://cwiki.apache.org/confluence/display/Hive/Json+SerDe

etc.

The question is:

Why use these SerDes if we can just modify the RecordReader in our InputFormat to build each row (the value the RecordReader emits) the way we want?

We can read the XML elements that we consider to be a single row, split them, and form a String encoded in a way the default SerDe understands. We can also add columns for the key if we want (Hive ignores the key by default). That gives a simple setup for JSON and XML processing. Why create special SerDes?
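To make the idea concrete, here is a minimal sketch of the flattening step such a custom RecordReader might perform on each XML record before handing the value to Hive. It uses only the JDK; the class name XmlRowFlattener, the method toHiveRow, and the sample tags are hypothetical and do not come from Hive or from either SerDe project. It assumes the table uses the default delimited-text SerDe with its default field delimiter (Ctrl-A) and NULL marker (\N), and that each column maps to the first element with a given tag name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns one XML record into one delimited text row.
final class XmlRowFlattener {
    // Assumed default field delimiter of Hive's delimited-text SerDe (Ctrl-A).
    static final char FIELD_DELIM = '\u0001';
    // Assumed default NULL marker of the same SerDe: backslash followed by N.
    static final String NULL_MARKER = "\\N";

    // Extracts the text of the first element for each tag, in column order.
    static String toHiveRow(String xmlRecord, String... columnTags) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                builder.parse(new InputSource(new StringReader(xmlRecord)));
            StringBuilder row = new StringBuilder();
            for (int i = 0; i < columnTags.length; i++) {
                if (i > 0) {
                    row.append(FIELD_DELIM);
                }
                NodeList nodes = doc.getElementsByTagName(columnTags[i]);
                // A missing tag is written as the NULL marker.
                row.append(nodes.getLength() > 0
                        ? nodes.item(0).getTextContent()
                        : NULL_MARKER);
            }
            return row.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed XML record", e);
        }
    }

    public static void main(String[] args) {
        String record = "<book><id>42</id><title>Hive</title></book>";
        // Show the delimiter as '|' so the row is readable on a terminal.
        System.out.println(
            toHiveRow(record, "id", "title", "author").replace(FIELD_DELIM, '|'));
        // prints: 42|Hive|\N
    }
}
```

In a real job this method would be called from the next() method of a RecordReader, with the result written into the Text value. Note that the sketch does not escape delimiter characters occurring inside field values and handles only flat records, which may be part of the answer to the question above.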
