Remove unnecessary tag content from HTML source using Scrapy



I am extracting the HTML source of a web page using Scrapy and saving the output in .xml format. The page source has the following content:



<html>
<head>
<script type="text/javascript">var startTime = new Date().getTime(); </script><script type="text/javascript">var startTime = new Date().getTime();</script> <script type="text/javascript"> document.cookie = "jsEnabled=true";..........

...........<div style="margin: 0px">Required content</div>
</head>
</html>


From this I need to remove all <script>....</script> tags and retain the required content along with its enclosing tags. How can I do that using Scrapy?
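
One possible approach, shown only as a minimal sketch: since the unwanted parts of the sample above are script elements, they could be stripped from the raw HTML string with a regular expression before the output is saved. This uses Python's standard `re` module rather than Scrapy's own selectors, and the names `SCRIPT_RE`, `strip_scripts`, and `sample` are hypothetical, introduced here for illustration. The sample string is a shortened stand-in for the real page source.

```python
import re

# Matches a whole <script ...>...</script> element, including its body.
# DOTALL lets the body span several lines; IGNORECASE covers <SCRIPT>.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html: str) -> str:
    """Return html with every <script> element removed (hypothetical helper)."""
    return SCRIPT_RE.sub("", html)


# Shortened stand-in for the page source shown in the question.
sample = (
    '<html><head>'
    '<script type="text/javascript">var startTime = new Date().getTime();</script>'
    '<script type="text/javascript">document.cookie = "jsEnabled=true";</script>'
    '<div style="margin: 0px">Required content</div>'
    '</head></html>'
)

cleaned = strip_scripts(sample)
print(cleaned)
```

Inside a Scrapy callback, the same function could presumably be applied to `response.text` before the item is written out; how that fits into the spider is an assumption, since the spider code is not shown.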

