XML: Extract text from XML with Python

I want to use Python to extract the text from an XML file that contains tags, including tags nested within tags.

This is what my file looks like:

  <p>blablabla</p>
  <p>blablabla / blablabla,</p>
  <p>blablabla</p>
  <p>blablabla / blablabla / blablabla</p>
  <p>blablabla.</p>

First I want to find whole entries (one whole entry in the file looks like the one above), then split each entry at every "/", and finally remove all remaining "<p>" and "</p>" tags.

Here is how I think this could be done (Python 2.7):

  first_results = []
  lines = open(sys.argv[1])
  for l in lines:
      re.match(r'<p>[\s\S]*?\.<\/p>', l)
      l = l.split("/")
      first_results.append(l)
  for b in first_results:
      b = re.sub(r'(<p>)|(</p>)', r'', b)

My question is: this is somehow not working properly. I can match my entries with the regex, but I am not sure how to do the rest. Is there a better way to do this? In the end I want the text split at each "/" and separated by tabs, something like this:

  blablabla   blablabla   blablabla   blablabla   blablabla etc...

What would be the best way to do this? I should mention that I am new to Python, but already a big fan :)
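
One possible way to do the three steps, as a minimal sketch rather than a definitive answer: in the attempt above the result of `re.match` is discarded, and `l.split("/")` returns a list, so the later `re.sub` is handed a list instead of a string. The sketch below instead pulls the text out of each `<p>...</p>` pair with `re.findall`, splits it at "/", and joins the pieces with tabs. The names `P_RE` and `entry_to_fields` are made up for illustration, and the sample string is an assumption based on the entry shown in the question. It only handles flat `<p>` tags; for genuinely nested tags a real XML parser such as `xml.etree.ElementTree` is probably the safer choice.

```python
import re

# Captures the text between <p> and </p>; DOTALL lets one paragraph span several lines.
P_RE = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def entry_to_fields(entry):
    """Return the text of one entry as a list of fields, split at each "/"."""
    fields = []
    for paragraph in P_RE.findall(entry):
        for part in paragraph.split("/"):
            part = part.strip()  # drop the spaces around each "/"
            if part:             # skip empty pieces, e.g. from a trailing "/"
                fields.append(part)
    return fields


# Hypothetical sample, modelled on the entry in the question.
sample = "<p>blablabla</p>  <p>blablabla / blablabla,</p>  <p>blablabla.</p>"
print("\t".join(entry_to_fields(sample)))
```

To run this on a file, read the whole file into one string (or one entry at a time, if each entry sits on its own line) and pass it to `entry_to_fields`.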
