Saturday, 4 April 2015

Parse an XML sitemap and visit all URLs



I'm in the middle of an import process that involves executing a snippet of code on every single page. The problem is that there are literally thousands of pages, so opening every one of them by hand would be pretty inhumane.


After hitting a dead end trying to google a web crawler that could do this, I tried writing my own in PHP. It was awful.



<?php
// Let the script run as long as it needs to.
set_time_limit(0);
ini_set('max_execution_time', 3000);

// Load the sitemap and parse it with SimpleXML.
$file = file_get_contents("map.xml");
$xml = simplexml_load_string($file);

$count = 0;
// Every <url> entry in the sitemap has a <loc> child holding the page URL.
foreach ($xml->url as $val) {
    $curl = curl_init((string) $val->loc);
    curl_setopt($curl, CURLOPT_POST, 1);
    curl_exec($curl);
    curl_close($curl);
    $count++;
}


This printed the contents of those pages hundreds of times and crashed somewhere in the middle. The development server this runs on (4 GB of RAM, two cores) also took a dive. I suspect the printing came from curl_exec(), which echoes the response straight to the output unless CURLOPT_RETURNTRANSFER is set.
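
For reference, here is a minimal variant of the same loop (still assuming the same map.xml and the standard sitemap layout) that captures each response instead of echoing it, which should at least keep the output quiet:

<?php
// Same sitemap as above.
$xml = simplexml_load_string(file_get_contents("map.xml"));

foreach ($xml->url as $val) {
    $curl = curl_init((string) $val->loc);
    // Make curl_exec() return the body instead of printing it.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_exec($curl);
    curl_close($curl);
}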




I also tried Node.js, and failed with that too. So here's what I tried next:


Printing all the links onto one page and clicking them all open at once. That just ran the machine out of memory and CPU.


So how on earth could I do this without killing the server?


Any method that runs on Windows or Linux is fine.
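
A rough sketch of one direction that might work, assuming the same map.xml and that a plain GET is enough to trigger the import snippet on each page: run the script from the command line and throttle it, so the responses are thrown away and the server gets a pause between requests. The one-second pause and 30-second timeout below are guesses, not tuned values.

<?php
// Sketch: visit every URL in the sitemap one at a time, gently.
// Assumes map.xml sits next to this script.
set_time_limit(0);

$xml = simplexml_load_string(file_get_contents("map.xml"));

foreach ($xml->url as $entry) {
    $url = (string) $entry->loc;

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // don't echo the page
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);          // give up on slow pages
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow redirects

    if (curl_exec($curl) === false) {
        echo "FAILED: $url (" . curl_error($curl) . ")\n";
    } else {
        echo "OK: $url\n";
    }
    curl_close($curl);

    sleep(1); // breathing room for the server between requests
}

Running it from the command line (php visit.php, or whatever the file ends up being called) rather than through the web server means it isn't subject to the web server's own request limits and can be stopped cleanly if it misbehaves.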

