How to update large number of records via XML files



So this question is more of a discussion.


I'm the developer for a real estate syndication website and am currently having trouble figuring out a way to update massive numbers of listings/records efficiently (2,000,000+ listings).


We currently accept XML feeds, containing real estate listings, from about ~20 different websites. Most of the incoming feeds are small (~100 or so listings), but we have a couple of XML feeds that contain ~1,000,000 listings. The small feeds are parsed fast and easy, however, the large feeds are taking upwards of 2-3 hours each.


The current "live" database table that contains the listings for viewing on the site is MyISAM. I chose MyISAM because ~95% of the queries to the table are SELECT's. Really the only time there are writes (UPDATE/INSERT queries) are during the time the XML feeds are being processed.


The current process is as follows:




  1. There is a CRON in place that starts the main parsing script.




  2. It loops through a feeds table and grabs the external XML feed source files. It then runs through said file and for each record in the XML file it checks against the listings table to see if a listing needs to be updated or inserted (if it's a new listing).




This is all happening against the live table. What I'd like to find out is if anybody has any better logic to make these updates/inserts happen in the background so as to not slow down the production tables, and ultimately, the user experience.


Would a delta table be the best choice? Maybe do all the heavy work on a separate database and just copy the new table over to the production database? On a separate workhorse domain altogether? Should I have a separate listings table that does all the parsing which would be InnoDB instead of MyISAM?


What we're trying to accomplish is to have our system be able to update listings frequently throughout the day without slowing the site down. Our competitors boast that they are updating their listings every 5 minutes in some cases. I just don't see how that's even possible.


Sorry about the blabbing on. I'm working right now so this is more of a brain dump just to get the ball rolling. If anybody would like me to provide DB table schematics, I'd be more than happy.


In summary: I'm looking for a way to frequently update millions of records in our database (daily) via a couple dozen external XML feeds/files. I just need some logic on how to effectively, and efficiently, make this happen so as to not drag the production server down with it.


Cheers.


No comments:

Post a Comment