I'm using pugixml with C++ in Xcode. While parsing a huge (~900 MB) XML file (a Wikibooks dump), my parsing code stops reading at an arbitrary point partway through the file, without reporting any errors.
Here is what the sample XML file looks like:

    <page>
      <title>Organic Chemistry/Cover</title>
      <ns>0</ns>
      <id>5</id>
      <revision>
        <id>2835870</id>
        <parentid>2247133</parentid>
        <format>text/x-wiki</format>
        <text> *Insert many paragraphs here* </text>
      </revision>
    </page>

I believe the problem is with this block of code:

    pugi::xml_document doc;
    //inFile is the xml file that contains the entire corpus
    doc.load(inFile);

    for (auto point : doc.select_nodes("page")){
        cout << point.node().child_value("title") << endl;
        cout << point.node().child_value("id") << endl;
        cout << point.node().child("revision").child_value("text");
    }

I believe what I'm doing here is going through each page in the corpus and extracting the title, id and text.
At first I thought the problem was only with extracting the "text" element, but if I extract just the title and id, the loop still doesn't reach the end of the file. I'm not sure whether Xcode simply can't handle a file this big, though I have also tried a trimmed-down 5 MB file.
Any help will be appreciated.
 