XML: pugiXML stops mid-document while parsing in C++

I'm using pugiXML from C++ in Xcode. While parsing a huge (~900 MB) XML file (a Wikibooks dump), my parser code stops reading at an arbitrary point partway through, without reporting any errors.

Here is what a sample of the XML file looks like:

    <page>
      <title>Organic Chemistry/Cover</title>
      <ns>0</ns>
      <id>5</id>
      <revision>
        <id>2835870</id>
        <parentid>2247133</parentid>
        <format>text/x-wiki</format>
        <text> *Insert many paragraphs here* </text>
      </revision>
    </page>

I believe the problem is with this block of code:

    pugi::xml_document doc;
    // inFile is the xml file that contains the entire corpus
    doc.load(inFile);

    for (auto point : doc.select_nodes("page")) {
        cout << point.node().child_value("title") << endl;
        cout << point.node().child_value("id") << endl;
        cout << point.node().child("revision").child_value("text");
    }

I believe what I'm doing here is going through each page in the corpus, extracting the title, id and text.

I thought the problem was only with extracting the "text" part, but if I extract just the title and id, it still doesn't reach the end of the file. I'm not sure whether Xcode simply can't handle a file this big, although I've also tried trimming it down to a smaller 5 MB file.

Any help will be appreciated.
