I have two large files gathered from Stack Overflow, posts.xml and questions.txt, with the following structure:
posts.xml:
<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="322" ViewCount="21888" Body="..." />
  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="140" ViewCount="10912" Body="..." />
  ...
</posts>
A post can be either a question or an answer; both kinds are stored as row elements.
questions.txt:
Id,CreationDate,CreationDatesk,Score
123,2008-08-01 16:08:52,20080801,48
126,2008-08-01 16:10:30,20080801,33
...
I want to scan posts.xml only once and index the selected rows (those whose Id appears in questions.txt) with Lucene. Since the XML file is very large (about 50 GB), the time spent querying and indexing matters to me.
Now the question is: how can I find all the rows in posts.xml whose Id also appears in questions.txt?
This is my approach so far:
SAXParserDemo.java:
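import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SAXParserDemo {
    public static void main(String[] args) {
        try {
            File inputFile = new File("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Posts.xml");
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            // Handler is the DefaultHandler subclass defined in Handler.java
            Handler handler = new Handler();
            // Stream the file; the handler receives one callback per element
            saxParser.parse(inputFile, handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}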
Handler.java:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Handler extends DefaultHandler {

    // Reads the question ids (first CSV column) one line at a time
    public void getQuestiondId() {
        String path = "D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Q.txt";
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String qId;
            while ((qId = br.readLine()) != null) {
                qId = qId.split(",")[0]; // this is the question id
                findAndIndexOnPost(qId); // find this id in posts.xml, then index it
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Not implemented yet: this is the part I am asking about
    private void findAndIndexOnPost(String qID) {
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes)
            throws SAXException {
        if (qName.equalsIgnoreCase("row")) {
            System.out.println(attributes.getValue("Id"));
            switch (attributes.getValue("PostTypeId")) {
                case "1":
                    String id = attributes.getValue("Id"); // not used yet
                    break;
                case "2":
                    break;
                default:
                    break;
            }
        }
    }
}
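One direction that would keep this to a single pass, sketched here under assumptions rather than as a finished solution: load every Id from questions.txt into a HashSet first, then stream posts.xml once with SAX and test each row's Id against the set. The class name SelectedRowsSketch and the methods loadQuestionIds and findSelectedRows are hypothetical names chosen for illustration. The sketch assumes that the first line of questions.txt is the header shown above and that the set of ids fits in memory. It only collects the matching ids; the Lucene indexing call would go where the comment indicates.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SelectedRowsSketch {

    // Collect the first CSV column (the question Id) of every data line.
    // Assumption: the first line is the header "Id,CreationDate,...".
    static Set<String> loadQuestionIds(Reader source) throws IOException {
        Set<String> ids = new HashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            br.readLine(); // skip the header line
            String line;
            while ((line = br.readLine()) != null) {
                if (!line.isEmpty()) {
                    ids.add(line.split(",")[0].trim());
                }
            }
        }
        return ids;
    }

    // Stream the XML once and keep the Id of each row found in the set.
    static List<String> findSelectedRows(InputSource xml, Set<String> ids) throws Exception {
        List<String> matched = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (!"row".equals(qName)) {
                    return;
                }
                String id = attrs.getValue("Id");
                if (ids.contains(id)) { // constant-time lookup on average
                    matched.add(id);    // the Lucene indexing call would go here
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return matched;
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-ins for questions.txt and posts.xml
        String csv = "Id,CreationDate,CreationDatesk,Score\n"
                + "4,2008-08-01 16:08:52,20080801,48\n"
                + "123,2008-08-01 16:08:52,20080801,48\n";
        String xml = "<posts>"
                + "<row Id=\"4\" PostTypeId=\"1\" Score=\"322\"/>"
                + "<row Id=\"6\" PostTypeId=\"1\" Score=\"140\"/>"
                + "</posts>";
        Set<String> ids = loadQuestionIds(new StringReader(csv));
        List<String> matched = findSelectedRows(new InputSource(new StringReader(xml)), ids);
        System.out.println(matched);
    }
}
```

For the real files, the StringReader arguments would be replaced by a FileReader on questions.txt and an InputSource over posts.xml. The point of the design is that each row costs one set lookup, so posts.xml is never rescanned once per question id.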