PDF form to CSV with Python or similar



I've got a bunch of PDF forms created with Adobe FormsCentral. They are all formatted the same way, and I want to extract the data in the fields to a CSV file. I'm getting a little familiar with Python and have tried a few libraries to extract the text via the XML tags, but I've reached the point where I'm way out of my depth. :(


I've managed to read the PDF with 'pdfquery' and/or 'beautifulsoup', but I can't find a simple tutorial anywhere to help me parse the PDF to CSV/Excel. I've searched SO and can't find anything directly relevant. The XML tree I've managed to extract gives me the tags for the field names (see below), but I'm not sure how to proceed from here. Has anyone had any experience with this kind of operation, or can anyone point me in the direction of any tutorials?


Any help gratefully received!


Thanks


Marty



<pdfxml ModDate="D:20140414114502+03'00'" CreationDate="D:20140407143830-04'00'" Producer="Adobe FormsCentral 889953 S" Creator="Adobe FormsCentral 738134">
<LTPage bbox="[0, 0, 595.27, 841.89]" height="841.89" pageid="1" rotate="0" width="595.27" x0="0" x1="595.27" y0="0" y1="841.89" page_index="0" page_label="">
<LTRect bbox="[0.0, 0.0, 595.27, 841.89]" height="841.89" linewidth="0" pts="[[0.0, 0.0], [595.27, 0.0], [595.27, 841.89], [0.0, 841.89]]" width="595.27" x0="0.0" x1="595.27" y0="0.0" y1="841.89">
<LTTextLineHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" width="99.816" word_margin="0.1" x0="34.015" x1="133.831" y0="732.217" y1="745.798"><LTTextBoxHorizontal bbox="[34.015, 732.217, 133.831, 745.798]" height="13.582" index="1" width="99.816" x0="34.015" x1="133.831" y0="732.217" y1="745.798">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
<LTTextLineHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" width="94.724" word_margin="0.1" x0="34.015" x1="128.739" y0="707.554" y1="721.135"><LTTextBoxHorizontal bbox="[34.015, 707.554, 128.739, 721.135]" height="13.582" index="2" width="94.724" x0="34.015" x1="128.739" y0="707.554" y1="721.135">Type of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
<LTTextBoxHorizontal bbox="[34.025, 631.024, 136.667, 657.37]" height="26.347" index="3" width="102.642" x0="34.025" x1="136.667" y0="631.024" y1="657.37"><LTTextLineHorizontal bbox="[34.025, 643.789, 136.667, 657.37]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="643.789" y1="657.37">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 631.024, 112.269, 645.166]" height="14.143" width="78.244" word_margin="0.1" x0="34.025" x1="112.269" y0="631.024" y1="645.166">members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
<LTTextBoxHorizontal bbox="[34.025, 581.871, 136.667, 620.462]" height="38.592" index="4" width="102.642" x0="34.025" x1="136.667" y0="581.871" y1="620.462"><LTTextLineHorizontal bbox="[34.025, 606.881, 136.667, 620.462]" height="13.582" width="102.642" word_margin="0.1" x0="34.025" x1="136.667" y0="606.881" y1="620.462">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 594.116, 134.963, 608.259]" height="14.143" width="100.938" word_margin="0.1" x0="34.025" x1="134.963" y0="594.116" y1="608.259">members aged 18-35 </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 581.871, 64.076, 596.014]" height="14.143" width="30.051" word_margin="0.1" x0="34.025" x1="64.076" y0="581.871" y1="596.014">(male) </LTTextLineHorizontal></LTTextBoxHorizontal>
<LTTextLineHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" width="78.836" word_margin="0.1" x0="34.025" x1="112.861" y0="557.728" y1="571.31"><LTTextBoxHorizontal bbox="[34.025, 557.728, 112.861, 571.31]" height="13.582" index="5" width="78.836" x0="34.025" x1="112.861" y0="557.728" y1="571.31">Location/Address </LTTextBoxHorizontal></LTTextLineHorizontal>
<LTTextBoxHorizontal bbox="[34.025, 494.974, 138.371, 533.045]" height="38.071" index="6" width="104.346" x0="34.025" x1="138.371" y0="494.974" y1="533.045"><LTTextLineHorizontal bbox="[34.025, 519.463, 99.821, 533.045]" height="13.582" width="65.795" word_margin="0.1" x0="34.025" x1="99.821" y0="519.463" y1="533.045">Type of waste </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 507.218, 138.371, 520.8]" height="13.582" width="104.346" word_margin="0.1" x0="34.025" x1="138.371" y0="507.218" y1="520.8">management activities </LTTextLineHorizontal><LTTextLineHorizontal bbox="[34.025, 494.974, 85.066, 508.555]" height="13.582" width="51.04" word_margin="0.1" x0="34.025" x1="85.066" y0="494.974" y1="508.555">carried out: </LTTextLineHorizontal></LTTextBoxHorizontal>
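
For reference, here is a minimal sketch of one way to turn a tree like the one above into a CSV header row, using only the standard library. It assumes the dump has been saved as well-formed XML (the excerpt above is truncated, so `SAMPLE_XML` below is a trimmed, hand-closed copy of two of its entries). The function names `extract_labels` and `labels_to_csv` are made up for illustration. Note that this dump only appears to contain the printed labels; the filled-in values are probably stored as form fields in the PDF and would have to be read separately, so the data row in the usage example is a placeholder.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed, hand-closed sample in the shape of the pdfquery dump above
# (most attributes dropped; only two of the labels kept).
SAMPLE_XML = """<pdfxml>
<LTPage pageid="1">
<LTTextLineHorizontal x0="34.015" y0="732.217"><LTTextBoxHorizontal index="1" x0="34.015" y0="732.217">Name of organisation: </LTTextBoxHorizontal></LTTextLineHorizontal>
<LTTextBoxHorizontal index="3" x0="34.025" y0="631.024"><LTTextLineHorizontal x0="34.025" y0="643.789">Number of employees/ </LTTextLineHorizontal><LTTextLineHorizontal x0="34.025" y0="631.024">members (male): </LTTextLineHorizontal></LTTextBoxHorizontal>
</LTPage>
</pdfxml>"""


def extract_labels(xml_text):
    """Return the label text of every text box, ordered top to bottom."""
    root = ET.fromstring(xml_text)
    found = []
    for box in root.iter("LTTextBoxHorizontal"):
        # itertext() also picks up text held in nested line elements,
        # so multi-line labels are joined into one string.
        text = " ".join("".join(box.itertext()).split())
        if text:
            label = text.rstrip(":").strip()
            found.append((float(box.get("y0", 0)), label))
    # PDF coordinates start at the bottom of the page, so a larger y0
    # means the box sits higher up.
    found.sort(key=lambda pair: -pair[0])
    return [label for _, label in found]


def labels_to_csv(labels, rows):
    """Write a header row of labels followed by the given data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(labels)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    labels = extract_labels(SAMPLE_XML)
    # Placeholder values: the real ones are not present in this dump.
    print(labels_to_csv(labels, [["Example Org", "12"]]))
```

With one such row per PDF, the same `csv.writer` could append a line for each form, since all the forms share the same layout.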
