Merging Words in XML



In the following XML:



<w:body>
<w:p w:rsidR="00912B30" w:rsidRPr="00912B30" w:rsidRDefault="00912B30" w:rsidP="00912B30">
<w:pPr>
<w:autoSpaceDE w:val="0"/>
<w:autoSpaceDN w:val="0"/>
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="00912B30">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:t xml:space="preserve">Considering those situations, after 1970 The </w:t>
</w:r>
<w:r w:rsidRPr="00E155EC">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:strike/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:t>Agricultural Land Law</w:t>
</w:r>
<w:r w:rsidRPr="00912B30">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:t xml:space="preserve"> of 1952 was modified and changed the principle to permit renting and lending agricultural land. The way of thinking was as follows. If it was difficult to widen farmers’ size by buying agricultural land, expanding the size by renting would be possible. After that some positive framework to promote renting and lending agricultural land. For example, The </w:t>
</w:r>
<w:r w:rsidRPr="00E155EC">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:strike/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:t>Agricultural Land Use Promotion Project</w:t>
</w:r>
<w:r w:rsidRPr="00912B30">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:t xml:space="preserve"> had started in 1975 and The </w:t>
</w:r>
<w:r w:rsidRPr="00E155EC">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:strike/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:t>Agricultural Land Use Promotion Law</w:t>
</w:r>
<w:r w:rsidRPr="00912B30">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:t xml:space="preserve"> was established in 1980. Actually after that, area of agricultural land by transfer of ownership of owned agricultural land with compensation had been more than the area by transfer of rights for </w:t>
</w:r>
<w:r w:rsidRPr="00912B30">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
<w:snapToGrid w:val="0"/>
<w:kern w:val="0"/>
<w:szCs w:val="21"/>
</w:rPr>
<w:lastRenderedPageBreak/>
<w:t>lease.</w:t>
</w:r>
</w:p>
</w:body>


I need to extract the text of every run that carries a <w:strike> tag, where


w = 'http://ift.tt/JiuBoE'


The problem is that the struck-through words are not contiguous; they appear at arbitrary positions. When I extract and join them, the last word of one strike instance merges with the first word of the next.


My Approach:



from lxml import etree  # assumed import; the original snippet does not show it

text = ""  # accumulates the struck-through text
source = etree.parse(doc_xml)
for p in source.findall('.//{%s}p' % w):  # iterate over every p tag
    text += " "  # add a space to separate words in successive paragraphs
    for b in p.findall('.//{%(ns)s}strike/../..//{%(ns)s}t' % {'ns': w}):
        text += b.text or ''  # append the text of each struck-through run


Output:



text =" Agricultural Land LawAgricultural Land Use Promotion ProjectAgricultural Land Use Promotion Law"


Expected Output:



text = " Agricultural Land Law Agricultural Land Use Promotion Project Agricultural Land Use Promotion Law"


Crude fix: replace the last line of the code with:



text+=" " +''.join(b.text)


This fixes the case above, but there are many cases where a single word is split across two strike instances, so the crude fix may output "he lp" instead of "help". It is a bit tricky, and this is what I have thought of:

1. Extract the strike text.

2. Check the next text tag. If it does not have a strike tag, add a space to the text; if it does have a strike tag, join it directly.
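The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a tested solution for real documents: the function name struck_text, the tiny sample paragraph, and the namespace URI assigned to W are all assumptions made for the example (substitute the value of w used above). It walks the runs of a paragraph in document order, joins adjacent struck runs directly, and starts a new space-separated group whenever non-struck text comes in between.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for this sample only; use the question's `w` value instead.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def struck_text(paragraph, ns=W):
    """Return the struck-through text of one paragraph.

    Adjacent struck runs are concatenated directly; groups separated by
    non-struck text are joined with a single space.
    """
    groups = []       # one string per contiguous group of struck runs
    in_group = False  # True while the previous text-bearing run was struck
    for run in paragraph.iter('{%s}r' % ns):  # runs in document order
        run_text = ''.join(t.text or '' for t in run.findall('{%s}t' % ns))
        if not run_text:
            continue  # a run without text does not break a group
        # Note: this treats any <w:strike> element as "struck" and ignores w:val.
        if run.find('{%(ns)s}rPr/{%(ns)s}strike' % {'ns': ns}) is not None:
            if in_group:
                groups[-1] += run_text   # same word or phrase continues
            else:
                groups.append(run_text)  # new struck group starts here
            in_group = True
        else:
            in_group = False             # non-struck text ends the group
    return ' '.join(groups)


# Hypothetical sample: "help" is split across two struck runs.
sample = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">The </w:t></w:r>'
    '<w:r><w:rPr><w:strike/></w:rPr><w:t>he</w:t></w:r>'
    '<w:r><w:rPr><w:strike/></w:rPr><w:t>lp</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> of 1952 </w:t></w:r>'
    '<w:r><w:rPr><w:strike/></w:rPr><w:t>Land Law</w:t></w:r>'
    '</w:p>' % W
)
print(struck_text(ET.fromstring(sample)))  # help Land Law
```

To cover a whole document, the same function could be called once per p element and the per-paragraph results joined with a space, which keeps the paragraph separation of the original loop.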

