Parse large XML file into database with Ruby and Nokogiri - use multiple threads



I have a large XML file (5GB+) that I want to parse into a MySQL database. I currently have a Ruby script that uses a Nokogiri SAX parser to insert every book into the database, but this approach is very slow because it inserts the records one at a time. I need to figure out a way to parse the large file with multiple concurrent threads.
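For reference, the current script is essentially the following (the table name, column names, and connection settings are simplified placeholders, not my real setup):

require 'nokogiri'
require 'mysql2'

# SAX handler that fires one INSERT per <book> element as it is parsed;
# this one-at-a-time behaviour is the bottleneck.
class BookHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[title description author]

  def initialize(db)
    @db = db
  end

  def start_element(name, attrs = [])
    @book = { 'ISBN' => Hash[attrs]['ISBN'] } if name == 'book'
    @current = name
  end

  def characters(text)
    # characters() can fire more than once per text node, so append
    @book[@current] = (@book[@current] || '') + text if @book && FIELDS.include?(@current)
  end

  def end_element(name)
    if name == 'book'
      @db.query(<<~SQL)
        INSERT INTO books (isbn, title, description, author)
        VALUES ('#{@db.escape(@book['ISBN'].to_s)}',
                '#{@db.escape(@book['title'].to_s)}',
                '#{@db.escape(@book['description'].to_s)}',
                '#{@db.escape(@book['author'].to_s)}')
      SQL
      @book = nil
    end
    @current = nil
  end
end

db = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'library')
Nokogiri::XML::SAX::Parser.new(BookHandler.new(db)).parse(File.open('library.xml'))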


I was thinking I could split the file into multiple subfiles and have a separate script work on each one. Or have the script send each item to a background job for inserting into the database, maybe using Delayed::Job, Resque, or Sidekiq.
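For the background-job route, something like this Sidekiq setup is what I had in mind: the SAX handler only collects books and enqueues them in batches, and the worker does one bulk insert per batch (Book.import comes from the activerecord-import gem; the class names and batch size are just placeholders):

require 'nokogiri'
require 'sidekiq'

# Hypothetical worker: receives an array of book hashes and writes them
# with a single multi-row INSERT via activerecord-import.
class BookImportWorker
  include Sidekiq::Worker

  def perform(books)
    Book.import(books.map { |b|
      Book.new(isbn: b['ISBN'], title: b['title'],
               description: b['description'], author: b['author'])
    })
  end
end

# SAX handler that never touches MySQL itself; it batches books and
# hands them to Sidekiq, so parsing and inserting run concurrently.
class EnqueueingHandler < Nokogiri::XML::SAX::Document
  BATCH_SIZE = 1_000
  FIELDS = %w[title description author]

  def initialize
    @batch = []
  end

  def start_element(name, attrs = [])
    @book = { 'ISBN' => Hash[attrs]['ISBN'] } if name == 'book'
    @current = name
  end

  def characters(text)
    @book[@current] = (@book[@current] || '') + text if @book && FIELDS.include?(@current)
  end

  def end_element(name)
    @current = nil
    return unless name == 'book'
    @batch << @book
    @book = nil
    flush if @batch.size >= BATCH_SIZE
  end

  def end_document
    flush
  end

  private

  def flush
    BookImportWorker.perform_async(@batch) unless @batch.empty?
    @batch = []
  end
end

Nokogiri::XML::SAX::Parser.new(EnqueueingHandler.new).parse(File.open('library.xml'))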


Does anyone have experience with this? With the current script it will take a year to load the database.



Here is a sample of the XML structure:

<?xml version="1.0"?>
<library>
  <NAME>cool name</NAME>
  <book ISBN="11342343">
    <title>To Kill A Mockingbird</title>
    <description>book desc</description>
    <author>Harper Lee</author>
  </book>
  <book ISBN="989894781234">
    <title>Catcher in the Rye</title>
    <description>another description</description>
    <author>J. D. Salinger</author>
  </book>
</library>
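
For the split-into-files idea, a rough line-based splitter like this might do it. It assumes each <book>...</book> block sits on its own lines exactly as in the sample above, and the chunk size and output file names are arbitrary:

# Stream the big file and write chunks of CHUNK_SIZE books, each wrapped
# in its own <library> root, so separate import scripts can run in parallel.
CHUNK_SIZE = 50_000

chunk_index = 0
books_in_chunk = 0
out = nil

open_chunk = lambda do
  out = File.open(format('chunk_%03d.xml', chunk_index), 'w')
  out.puts '<?xml version="1.0"?>'
  out.puts '<library>'
end

close_chunk = lambda do
  out.puts '</library>'
  out.close
  chunk_index += 1
  books_in_chunk = 0
end

open_chunk.call
File.foreach('library.xml') do |line|
  # skip the outer wrapper lines; only the <book> blocks get copied
  next if line =~ /<\?xml|<\/?library>|<NAME>/
  out.write(line)
  if line.include?('</book>')
    books_in_chunk += 1
    if books_in_chunk >= CHUNK_SIZE
      close_chunk.call
      open_chunk.call
    end
  end
end
close_chunk.call # the last chunk may be partially filled (or empty)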
