Tuesday, 2 December 2014

Separate HTML document



I am working with the XML library in R and would like to split an HTML document into chunks.



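library(XML)
# Parse into internal (C-level) nodes so that XPath queries can be run on the tree
myHTML <- htmlTreeParse("myHTMLfile.HTML", useInternalNodes = TRUE)
unlist(xpathApply(myHTML, "//div", xmlValue))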


This works fine and gives me one long character vector for the entire document. However, ideally I'd like to split my HTML into chunks. The HTML structure is the following:



<DOC>
<div>
Document 1 - Element 1
</div>

<div>
Document 1 - Element 2
</div>

<div>
Document 1 - Element 3
</div>

</DOC>

<DOC>
<div>
Document 2 - Element 1
</div>

<div>
Document 2 - Element 2
</div>

<div>
Document 2 - Element 3
</div>

</DOC>


So I would like to have a list in which each element corresponds to the content of one <DOC>, and each list element is a character vector containing Elements 1, 2 and 3 of that DOC.


I am struggling to (A) even query 'DOC' (is that because it is not part of the namespace?) and (B) get the output as this kind of list of string vectors.


So instead of this output:



[1] "Document 1 - Element 1"
[2] "Document 1 - Element 2"
[3] "Document 1 - Element 3"
[4] "Document 2 - Element 1"
[5] "Document 2 - Element 2"
[6] "Document 2 - Element 3"


I am looking to get this:



[[1]]
[1] "Document 1 - Element 1"
[2] "Document 1 - Element 2"
[3] "Document 1 - Element 3"
[[2]]
[1] "Document 2 - Element 1"
[2] "Document 2 - Element 2"
[3] "Document 2 - Element 3"
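
For reference, a minimal sketch of one way that list shape might be produced. It rests on two assumptions that are not confirmed above: that the HTML parser keeps the non-standard <DOC> tag as an element, and that it lowercases the tag name, so the XPath has to be '//doc' rather than '//DOC'. The inline html_text string is a hypothetical stand-in for "myHTMLfile.HTML".

```r
library(XML)

# Hypothetical inline sample standing in for "myHTMLfile.HTML"
html_text <- paste0(
  "<DOC>",
  "<div>\nDocument 1 - Element 1\n</div>",
  "<div>\nDocument 1 - Element 2\n</div>",
  "<div>\nDocument 1 - Element 3\n</div>",
  "</DOC>",
  "<DOC>",
  "<div>\nDocument 2 - Element 1\n</div>",
  "<div>\nDocument 2 - Element 2\n</div>",
  "<div>\nDocument 2 - Element 3\n</div>",
  "</DOC>"
)

myHTML <- htmlParse(html_text, asText = TRUE)

# Assumption: the HTML parser lowercases tag names, so <DOC> is matched as 'doc'
doc_nodes <- getNodeSet(myHTML, "//doc")

# For each <DOC> node, collect the text of its own <div> descendants.
# The leading '.' keeps the XPath relative to the current node.
chunks <- lapply(doc_nodes, function(node) {
  trimws(xpathSApply(node, ".//div", xmlValue))
})

print(chunks)
```

The relative path './/div' is what keeps each chunk separate; a plain '//div' inside the function would search the whole document again for every DOC node.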


Thanks a lot for your help!

