Sunday, 10 August 2014

What is the scheme for reading big XML data using "Memory Mapped Files"?



I have a big XML file (an OSM map data file) to parse. The initial code to process it looks like this:



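FILE* file = fopen(fileName.c_str(), "rb");
assert(file != NULL);
const size_t BUF_SIZE = 10 * 1024 * 1024;
char* buf = new char[BUF_SIZE];
string contents;
size_t ret;
// fread returns the number of items read (bytes here); 0 means end of file or error.
while ((ret = fread(buf, 1, BUF_SIZE, file)) > 0)
{
    contents.append(buf, ret); // append only the bytes actually read
}
delete[] buf;
fclose(file);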

size_t pos = 0;
while (true)
{
    pos = contents.find('<', pos);
    if (pos == string::npos) break;

    // Case: found new node.
    if (contents.compare(pos, 5, "<node") == 0)
    {
        // do something
    }
    // Case: found new way.
    else if (contents.compare(pos, 4, "<way") == 0)
    {
        // do something
    }

    ++pos; // move past this '<' so the next find() makes progress
}


Then people told me I should use a memory-mapped file to process such big data files; the details are here: how to read to huge file into buffer.


But I have a question. I understand that once a memory-mapped file is created for reading, I can go through its content line by line, for example to count the lines:



int file = open(path, O_RDONLY); //Open the file.
off_t fileLength = lseek(file, 0, SEEK_END); //Get its size.

//Map its contents into memory.
const char* contents = (const char*)mmap(NULL, fileLength, PROT_READ, MAP_SHARED, file, 0); //The cast is required when compiling as C++.

close(file); //The file can be closed right away, the mapping is not affected.

Inspect the file in any way you want, for example by counting lines:

off_t lineCount = 0;
for(off_t i = 0; i < fileLength; i++) if(contents[i] == '\n') lineCount++;


But when the big file to read is an XML file, its content is context-dependent and cannot be parsed line by line. For example, the end tag of an element may appear several lines after its start tag. How should I cope with this situation?


Could anyone familiar with processing big XML data give some explanation?


libxml2 can handle these things without leaving such considerations to the user.



xmlTextReaderPtr inputReader;
// Assumes fileName is the std::string path used in the first snippet.
inputReader = xmlNewTextReaderFilename(fileName.c_str());


Just create the XML reader and parsing can begin.


But I am interested in how this problem is solved. A big file can be read with a memory-mapped file, but when the file's content is not logically separated into independent pieces and what comes before is related to what comes after, what technique should be adopted?
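One technique I can imagine for such "before-after" content is to keep parser state between reads, so that a tag cut in half by a buffer boundary is completed when the next chunk arrives. The sketch below only illustrates that idea. TagNameScanner is a hypothetical class, not how libxml2 is actually implemented, and it assumes '>' never appears inside attribute values or comments.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical incremental scanner: feed() accepts arbitrary chunks and
// remembers an unfinished tag until a later chunk closes it.
class TagNameScanner
{
public:
    void feed(std::string_view chunk)
    {
        for (char c : chunk)
        {
            if (!insideTag_)
            {
                if (c == '<')
                {
                    insideTag_ = true;
                    pending_.clear();
                }
            }
            else if (c == '>')
            {
                insideTag_ = false;
                record(pending_);
            }
            else
            {
                pending_.push_back(c); // survives across feed() calls
            }
        }
    }

    // Names of all start tags completed so far, in document order.
    const std::vector<std::string>& names() const { return names_; }

private:
    void record(const std::string& tag)
    {
        // Keep only start tags; skip end tags, comments and declarations.
        if (tag.empty() || tag[0] == '/' || tag[0] == '!' || tag[0] == '?') return;
        std::size_t end = tag.find_first_of(" \t\r\n/");
        names_.push_back(tag.substr(0, end));
    }

    bool insideTag_ = false;         // true while between '<' and '>'
    std::string pending_;            // text of the tag seen so far
    std::vector<std::string> names_;
};
```

With this shape, the fread loop could pass each buffer to feed() instead of appending everything to one string, so memory use is bounded by the longest single tag rather than by the file size.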


I hope I have expressed my question clearly.

