I have an XML file that have book records that contains tags like the author, publication date, labels and so on. I'm going to parse this file to make a 3 lists, one will have the book title, the author in another list, and finally the labels in a third list, later I will write these lists to Excel columns using openpyxl. The problem is that some of the book records don't have a label tag. Using regular parsing technique with Beautiful soup will produce the first two lists with the same length, but the label list will have a shorter length.
I have three question:
1- how do I create all three lists with equal length (empty entry for books that dont have the label tag? 2- The label list looks like this ['Energy;Green Buildings;High Performance Buildings', 'Computing', 'Computing;Design;Green Buildings',.......]. I have created about an additional 15 columns titled with the label names that I have, such as "Computing" and "Design". Is there any way that I can use openpyXL to create an X mark or colored cell for the book-label combination if a book contains a specific label, for example, if a book titled "Architecture" in row 5 has the "Design" label, I need to have either a X mark or colored cell in cell(row'5',col'Design'). 3- Is there an easier method to accomplish this (parsing the XML file and writing efficiently in Excel)?
Here is a snap shot of the XML file and the code that I wrote ( also both the XML file and the Python files can be downloaded from here: http://www.ranialabib.com/#!python/icfwa
<?xml version="1.0" encoding="UTF-8"?> <xml> <records> <record> <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database> <ref-type name="Book">1</ref-type> <contributors> <authors> <author>AIA Research Corporation</author> </authors> </contributors> <titles> <title>Regional guidelines for building passive energy conserving homes</title> </titles> <periodical/> <keywords/> <dates> <year>1978</year> </dates> <publisher>Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.</publisher> <urls/> <label>Energy;Green Buildings;High Performance Buildings</label> </record> <record> <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database> <ref-type name="Book">1</ref-type> <contributors> <authors> <author>Akinci, Burcu</author> <author>Ph, D</author> </authors> </contributors> <titles> <title>Computing in Civil Engineering</title> </titles> <periodical/> <pages>692-699</pages> <keywords/> <dates> <year>2007</year> </dates> <publisher>American Society of Civil Engineers</publisher> <isbn>9780784409374</isbn> <electronic-resource-num>ISBN 978-0-7844-1302-9</electronic-resource-num> <urls> <web-urls> <url>http://books.google.com/books?id=QigBgc-qgdoC</url> </web-urls> </urls> </record> import xml.etree.ElementTree as ET fhand = open('My_Collection.xml') data = fhand.read() Title=list() Year=list() Label=list() tree = ET.fromstring(data) titles = tree.findall('.//title') years = tree.findall('.//year') labels = tree.findall('.//label') for t in titles : Title.append(str(t.text)) print 'Titles: ', len(Title) print Title for y in years : Year.append(str(y.text)) print 'years: ', len(Year) print Year for l in labels : Label.append(str(l.text)) print 'Labels: ', len(Label) print Label from openpyxl import Workbook wb = Workbook() ws = wb.active for row in zip(Title, Year, Label): ws.append(row) wb.save("Test2.xlsx")
No comments:
Post a Comment