I'm using pugixml with C++ in Xcode. While parsing a huge (~900 MB) XML file (a Wikibooks dump), my parser code stops reading at an arbitrary point midway through the file, without reporting any errors.
Here is what the sample XML file looks like:
    <page>
      <title>Organic Chemistry/Cover</title>
      <ns>0</ns>
      <id>5</id>
      <revision>
        <id>2835870</id>
        <parentid>2247133</parentid>
        <format>text/x-wiki</format>
        <text> *Insert many paragraphs here* </text>
      </revision>
    </page>
I believe the problem is with this block of code:
    pugi::xml_document doc;
    // inFile is the xml file that contains the entire corpus
    doc.load(inFile);
    for (auto point : doc.select_nodes("page")) {
        cout << point.node().child_value("title") << endl;
        cout << point.node().child_value("id") << endl;
        cout << point.node().child("revision").child_value("text");
    }
I believe what I'm doing here is going through each page in the corpus, extracting the title, id and text.
I thought the problem was only with extracting the "text" part, but if I extract just the title and id, it still doesn't get all the way to the end. I'm not sure whether Xcode simply can't handle a file this big (although I've also tried trimming it down to a smaller 5 MB file).
Any help will be appreciated.