XML : Parsing XML of unequal number of tags to make of lists of equal length. openpyxl and Beautifulsoup

I have an XML file that have book records that contains tags like the author, publication date, labels and so on. I'm going to parse this file to make a 3 lists, one will have the book title, the author in another list, and finally the labels in a third list, later I will write these lists to Excel columns using openpyxl. The problem is that some of the book records don't have a label tag. Using regular parsing technique with Beautiful soup will produce the first two lists with the same length, but the label list will have a shorter length.

I have three question:

1- how do I create all three lists with equal length (empty entry for books that dont have the label tag? 2- The label list looks like this ['Energy;Green Buildings;High Performance Buildings', 'Computing', 'Computing;Design;Green Buildings',.......]. I have created about an additional 15 columns titled with the label names that I have, such as "Computing" and "Design". Is there any way that I can use openpyXL to create an X mark or colored cell for the book-label combination if a book contains a specific label, for example, if a book titled "Architecture" in row 5 has the "Design" label, I need to have either a X mark or colored cell in cell(row'5',col'Design'). 3- Is there an easier method to accomplish this (parsing the XML file and writing efficiently in Excel)?

Here is a snap shot of the XML file and the code that I wrote ( also both the XML file and the Python files can be downloaded from here: http://www.ranialabib.com/#!python/icfwa

  <?xml version="1.0" encoding="UTF-8"?>  <xml>      <records>          <record>              <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>              <ref-type name="Book">1</ref-type>              <contributors>                  <authors>                      <author>AIA Research Corporation</author>                  </authors>              </contributors>              <titles>                  <title>Regional guidelines for building passive energy conserving homes</title>              </titles>              <periodical/>              <keywords/>              <dates>                  <year>1978</year>              </dates>              <publisher>Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.</publisher>              <urls/>              <label>Energy;Green Buildings;High Performance Buildings</label>          </record>      <record>              <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>              <ref-type name="Book">1</ref-type>              <contributors>                  <authors>                      <author>Akinci, Burcu</author>                      <author>Ph, D</author>                  </authors>              </contributors>              <titles>                  <title>Computing in Civil Engineering</title>              </titles>              <periodical/>              <pages>692-699</pages>              <keywords/>              <dates>                  <year>2007</year>              </dates>              <publisher>American Society of Civil Engineers</publisher>              <isbn>9780784409374</isbn>              <electronic-resource-num>ISBN 978-0-7844-1302-9</electronic-resource-num>              <urls>                  <web-urls>                      <url>http://books.google.com/books?id=QigBgc-qgdoC</url>                  </web-urls>              </urls>          </record>        import xml.etree.ElementTree as ET  fhand = open('My_Collection.xml')  data = fhand.read()    Title=list()  Year=list()  Label=list()    tree = ET.fromstring(data)  titles = tree.findall('.//title')  years = tree.findall('.//year')  labels = tree.findall('.//label')      for t in titles :       Title.append(str(t.text))  print 'Titles: ', len(Title)  print Title    for y in years :       Year.append(str(y.text))  print 'years: ', len(Year)  print Year    for l in labels :       Label.append(str(l.text))  print 'Labels: ', len(Label)  print Label    from openpyxl import Workbook   wb = Workbook()   ws = wb.active     for row in zip(Title, Year, Label):           ws.append(row)     wb.save("Test2.xlsx")     

No comments:

Post a Comment