I want to use Python to extract the text from an XML file that contains tags, including tags nested within other tags.
This is what my file looks like:
<p>blablabla</p> <p>blablabla / blablabla,</p> <p>blablabla</p> <p>blablabla / blablabla / blablabla</p> <p>blablabla.</p>
First I want to find whole entries (one whole entry in the file looks like the one above), then split each entry into parts at every "/", and finally remove all remaining "<p>" and "</p>" tags.
Here is how I think this could be done (Python 2.7):
    import re
    import sys

    first_results = []
    lines = open(sys.argv[1])
    for l in lines:
        re.match(r'<p>[\s\S]*?\.<\/p>', l)
        l = l.split("/")
        first_results.append(l)
    for b in first_results:
        b = re.sub(r'(<p>)|(</p>)', r'', b)
This is somehow not working properly. I can match my entries with the regex, but I am not sure how to do the rest. Is there a better way to do this? In the end I want the text split at each "/" and the parts separated by tabs, something like this:
blablabla blablabla blablabla blablabla blablabla etc.
What would be the best method to do this? I should mention that I am new to Python, but already a big fan :)
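For reference, the three steps described above (find entries, strip the tags, split at "/") could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the whole file content is read into one string, an entry always ends with ".</p>" (as in the regex from the attempt above), and no earlier paragraph inside an entry ends with a period. The names `ENTRY_RE`, `TAG_RE`, `find_entries`, `entry_to_row` and the sample text are made up for illustration.

```python
import re

# Assumption: an entry runs from a "<p>" up to the first ".</p>".
ENTRY_RE = re.compile(r'<p>[\s\S]*?\.</p>')
# Matches both the opening and the closing paragraph tag.
TAG_RE = re.compile(r'</?p>')


def find_entries(text):
    """Return every whole entry found in the given text."""
    return ENTRY_RE.findall(text)


def entry_to_row(entry):
    """Strip the <p> tags, split at "/", and join the parts with tabs."""
    text = TAG_RE.sub('', entry)
    parts = [part.strip() for part in text.split('/')]
    return '\t'.join(parts)


# Hypothetical sample data with two entries.
sample = '<p>a</p> <p>b / c.</p>\n<p>d / e.</p>'
rows = [entry_to_row(entry) for entry in find_entries(sample)]
for row in rows:
    print(row)
```

Removing the tags before splitting means `re.sub` only ever sees a string, not the list that `split` returns. To run it on a real file, the sample string could be replaced with `open(sys.argv[1]).read()`.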