I'm trying to process the Wikipedia dump found here, specifically the file enwiki-latest-pages-articles-multistream.xml.bz2, which is about 46GB uncompressed. I'm currently using a StAX parser in Java (Xerces) and can extract about 15K page elements per second. The parser seems to be the bottleneck; I've toyed with aalto-xml, but it hasn't helped.
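Roughly, the read loop I'm describing looks like the sketch below: it counts page elements straight off the compressed stream. Note this is a minimal sketch, not my exact code; BZip2CompressorInputStream comes from Apache Commons Compress (an assumed dependency, any streaming bz2 decompressor would do), and the `true` flag matters because the multistream dump is a series of concatenated bzip2 blocks.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class PageCounter {
    public static void main(String[] args) throws Exception {
        // Stream-decompress the dump; the second constructor argument (true)
        // tells Commons Compress to keep reading across the concatenated
        // bzip2 streams that make up the multistream dump.
        try (InputStream in = new BZip2CompressorInputStream(
                new BufferedInputStream(new FileInputStream(args[0])), true)) {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            long pages = 0;
            while (reader.hasNext()) {
                // Pull-parse event by event; a real extractor would also
                // read the title/revision/text children of each page here.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    pages++;
                }
            }
            System.out.println("pages: " + pages);
        }
    }
}
```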
Since I'm parsing every page element in the Storm spout, the spout is the bottleneck. However, I thought I could simply emit the raw text between the <page> and </page> tags and have several bolts parse those page elements in parallel, which would reduce the amount of work the spout has to perform. I'm not sure of the specific approach to take here, though: if I use an XML parser just to extract the content between those tags, it will still parse every single element from the start of the document to the end, so the spout ends up doing the expensive work anyway. Is there a way to eliminate this overhead in a standard SAX/StAX parser?
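What I'm picturing for the spout side is something like the sketch below: skip XML parsing entirely in the spout and just scan for the tag lines, emitting each raw page block for a bolt to parse properly. This is a sketch under two assumptions that aren't from my actual setup: that the dump always puts <page> and </page> on their own lines (which the MediaWiki export format appears to do), and that a plain Consumer can stand in for the Storm collector.

```java
import java.io.BufferedReader;
import java.util.function.Consumer;

public class PageSlicer {
    // Scans the raw dump text line by line and hands each complete
    // <page>...</page> block to the consumer without XML-parsing it.
    // Hypothetical helper: in a real topology the consumer would be
    // replaced by SpoutOutputCollector.emit(...).
    public static void slice(BufferedReader reader, Consumer<String> emit)
            throws Exception {
        StringBuilder page = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.equals("<page>")) {
                page = new StringBuilder();   // start of a new page block
            }
            if (page != null) {
                page.append(line).append('\n');
                if (trimmed.equals("</page>")) {
                    emit.accept(page.toString()); // ship the block to a bolt
                    page = null;
                }
            }
        }
    }
}
```

The idea is that this string matching is much cheaper than pull-parsing every event, and each bolt can then run a full StAX parse on its own small <page> fragment in parallel.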