I want to scrape the Interactions table from the Entrez Gene page.
The Interactions table is populated by a web server. When I tried the XML package in R, I could fetch the Entrez Gene page, but the body of the Interactions table was empty (the web server had not yet populated it).
The web server issue may be solvable in R (and I'd love to see how), but Biopython seemed like an easier path.
I put together the following, which gives me what I want for an example gene:
# Pull the Entrez gene page for MAP1B using Biopython
from Bio import Entrez
Entrez.email = "me@x"
handle = Entrez.efetch(db="gene", id="4131", retmode="xml")
record = Entrez.read(handle)
handle.close()
# Find the dictionary that contains the Interactions table
for comment in record[0]["Entrezgene_comments"]:
    # Match the comment block that has 'Interactions' among its values
    if 'Interactions' in comment.values():
        Interactions = comment
# Return the desired values: I want the Entrez ID and Gene symbol for each interacting protein
for x in range(0, len(Interactions['Gene-commentary_comment'])):
    print(Interactions['Gene-commentary_comment'][x]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id'])  # print the Entrez IDs
    print(Interactions['Gene-commentary_comment'][x]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_anchor'])  # print the gene symbols
This code works and gives me what I want. But I think it's ugly, and I am concerned that it will break if the format of the Entrez Gene page changes even slightly. In particular, there must be a better way to extract the desired information than specifying the full path, as I do with:
Interactions['Gene-commentary_comment'][x]['Gene-commentary_comment'][1]['Gene-commentary_source'][0]['Other-source_src']['Dbtag']['Dbtag_tag']['Object-id']['Object-id_id']
But I cannot figure out how to search through a dictionary of dictionaries without specifying each level I want to descend. When I try functions like find(), they operate on the next level down, but not all the way to the bottom.
Is there a wildcard symbol, a Python equivalent of "//", or a function I can use to get to ['Object-id_id'] without naming the full path? Other suggestions for cleaner code are also appreciated.
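To make concrete what I mean by a "//"-style search, here is a minimal sketch of a recursive generator that yields every value stored under a given key, at any depth. It assumes the parsed record behaves like ordinary nested dicts and lists. find_key is a hypothetical helper of my own, not part of Biopython, and the sample data is made up to mimic the shape of the record above.

```python
def find_key(node, key):
    """Yield every value stored under `key` at any depth of nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the key may also appear deeper inside this value
            for found in find_key(v, key):
                yield found
    elif isinstance(node, (list, tuple)):
        for item in node:
            for found in find_key(item, key):
                yield found
    # Anything else (strings, numbers) is a leaf: nothing to yield


# Made-up data shaped like a fragment of the Interactions block (hypothetical values)
sample = {
    'Gene-commentary_comment': [
        {'Gene-commentary_source': [{
            'Other-source_src': {'Dbtag': {'Dbtag_tag': {'Object-id': {'Object-id_id': '1111'}}}},
            'Other-source_anchor': 'GENE_A',
        }]},
        {'Gene-commentary_source': [{
            'Other-source_src': {'Dbtag': {'Dbtag_tag': {'Object-id': {'Object-id_id': '2222'}}}},
            'Other-source_anchor': 'GENE_B',
        }]},
    ]
}

ids = list(find_key(sample, 'Object-id_id'))
symbols = list(find_key(sample, 'Other-source_anchor'))
```

Something like list(find_key(Interactions, 'Object-id_id')) would then replace the long path, although it would also return matches from any other branch that happens to contain the same key, so the results may need filtering.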