A typical task is reading data serialized as JSON, XML, or some other format that is not plain delimited text. Several Hive SerDes are suggested for this task:
XML Serde: https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
JSON Serde: https://cwiki.apache.org/confluence/display/Hive/Json+SerDe
etc
The question is:
Why use these SerDes if we can just modify the RecordReader in our InputFormat to form the row (the value the RecordReader emits) in whatever way we want?
We can read the XML tags that we consider to be a single row, split them, and form a String encoded in a way the default SerDe would understand, adding columns for the key if we want (Hive ignores the key by default). That gives a simple setup for JSON and XML processing. Why create special SerDes?
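To make the proposed approach concrete, here is a minimal sketch of the flattening step such a custom RecordReader could call before handing the value to Hive. The class name XmlRowFlattener, the method toRow, and the sample tag names are hypothetical and only for illustration. It assumes the table uses the default field delimiter Ctrl-A (\u0001) and the default null marker \N, and that each record is a small, well-formed XML fragment whose child tags map to columns.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: turns one XML record into a delimited text row.
class XmlRowFlattener {
    // Ctrl-A, assumed here to be the table's field delimiter (Hive's default).
    static final String FIELD_DELIM = "\u0001";
    // Assumed null marker (Hive's default serialization.null.format).
    static final String NULL_MARKER = "\\N";

    // Emits the text of the first element matching each column name, in column order.
    static String toRow(String xmlRecord, List<String> columns) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xmlRecord)));

        StringBuilder row = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) {
                row.append(FIELD_DELIM);
            }
            NodeList nodes = doc.getElementsByTagName(columns.get(i));
            // A missing tag becomes the null marker instead of shifting later columns.
            row.append(nodes.getLength() > 0
                    ? nodes.item(0).getTextContent()
                    : NULL_MARKER);
        }
        return row.toString();
    }
}
```

The RecordReader would place the returned string into the value it emits. Note that this sketch does not escape delimiter characters that occur inside the data, and it only covers reading; both points would need handling in a real setup.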