I have a large XML file that I want to process in parallel. 'Hadoop in Practice' uses Mahout's XMLInputFormat, and I noticed that its getSplits() method is not overridden. In other words, it uses the getSplits() method it inherits from TextInputFormat. How does this method avoid splitting the file in the middle of a record, somewhere between a begin tag and its end tag?
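As far as I understand it, getSplits() does not avoid this at all: the splits are plain byte ranges and can cut straight through a tag or a record. The record reader is what compensates. It skips forward from the start of its split to the next begin tag, then reads on to the matching end tag even when that lies past the end of the split. The plain-Java sketch below simulates that idea on an in-memory byte array. It does not use Hadoop or Mahout, and the class name XmlSplitSketch, the helper names, the <page> tags and the 20-byte split size are all made up for illustration. It also assumes records are not nested.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a split-aware XML record reader (not Mahout code).
class XmlSplitSketch {

    // Index of the first occurrence of pattern in data at or after from, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the records "owned" by the byte range [start, end): every record
    // whose begin tag starts inside the range. A record may run past end.
    static List<String> readSplit(byte[] data, int start, int end,
                                  String beginTag, String endTag) {
        byte[] open = beginTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = start;
        while (true) {
            // Skip any partial record at the front of the split.
            int begin = indexOf(data, open, pos);
            if (begin < 0 || begin >= end) {
                break; // the next begin tag belongs to a later split, or none is left
            }
            // Read through to the end tag, ignoring the split boundary.
            int finish = indexOf(data, close, begin + open.length);
            if (finish < 0) {
                break; // unterminated record: drop it
            }
            int recordEnd = finish + close.length;
            records.add(new String(data, begin, recordEnd - begin, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<docs><page>alpha</page><page>beta</page><page>gamma</page></docs>";
        byte[] data = xml.getBytes(StandardCharsets.UTF_8);
        int splitSize = 20; // deliberately small so boundaries fall inside records
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("split [" + start + "," + end + "): "
                    + readSplit(data, start, end, "<page>", "</page>"));
        }
    }
}
```

Because each record is claimed only by the split that contains the first byte of its begin tag, every record is read exactly once, no matter where the byte boundaries fall. In a real job the reader would pull the bytes past its split end from the next block of the file, possibly over the network.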