I am building a search engine in c++ (using a 900 mb rapidXML file that contains pages from wikiBooks) and my objective is to parse the ~900 MB XML document using rapidXML so that the user can just enter one word in the search bar and receive the ACTUAL XML DOCUMENTS that contain that word (link).
I need to figure out how to store index of each token (aka each word within of each document) so that when the user wants to see the page numbers a certain word occurs, I can jump to that specific page.
I have been told to do the "file io offset" (where you store where in the file a word is so that you can jump to it) and I am having a hard time understanding what to do.
Questions:
-
Do I use the "seekg" and "tellg" in the istream library (to find the byte location that each document PAGE is stored at)? And if so, how?
-
How do I return the actual document back to the user (that contains many occurances of the searched word)?
No comments:
Post a Comment