XML: XPath for an HTML5 tag

I'm using the =importXML function in Google Sheets to scrape information from different sites, and I'm having a hard time getting the text inside an <article> tag using XPath.

Here is the source code:

  <div id="blog-post-body-ad" class="ad">      </div>          <article class="blog-post-body">          <p>Fox&#39;s <em>X-Men </em>drama <em>Hellfire </em>is making a change at the top.</p>  <p>Writers Evan Katz and Manny Coto, who co-created the drama, are exiting, <em>The Hollywood Reporter </em>has learned. Also out are Patrick McKay and John D. Payne, who came up the the story for the drama alongside Katz and Coto and were set to pen the script. A search is under way for a new writer.</p>  <p>The changes come as <em>Hellfire </em>is on a slower development track, insiders say. <em>Hellfire, </em>which previously was&nbsp;<a href="http://www.hollywoodreporter.com/live-feed/fox-nears-deal-x-men-813542">considered a live-action&nbsp;<em>X-Men</em></a>, follows a young special agent who learns that a power-hungry woman with extraordinary abilities is working with a clandestine society of millionaires &mdash; known as &quot;The Hellfire Club&quot; &mdash; to take over the world.</p>  <p>      <div class="embedded-content" data-nid="832221" data-nodetype="blog" data-template="readmore">        <script type="application/json">          {            "nid": 832221,            "type": "blog",            "title": "Marvel Sets &#039;Legion&#039; Pilot With Noah Hawley at FX, Readying &#039;Hellfire&#039; for Fox",            "path": "http://www.hollywoodreporter.com/live-feed/marvel-legion-noah-hawley-fx-832221",            "relative-path": "/live-feed/marvel-legion-noah-hawley-fx-832221"          }        </script>      </div></p>  <p>Sources say the <em>X-Men </em>drama is not likely to go to pilot this season as it remains on a slower track. 
The change comes as Katz and Coto are shifting their focus to Fox&#39;s <em><a href="http://www.hollywoodreporter.com/live-feed/fox-greenlights-prison-break-event-856203" target="_blank">24: Legacy</a>, </em>which received a formal pilot order Friday during Fox&#39;s time in front of the press at the Television Critics Association&#39;s winter press tour. The new take on 24 will feature an entirely new cast with a diverse lead as Fox has high hopes to reboot the franchise for a new era.</p>  <p>The change at the top should not worry diehard fans of the <em>X-Men </em>franchise. Sources say Fox remains committed to <em>Hellfire </em>and wants to get it completely right as the <em>X-Men </em>franchise remains a valuable asset for the company. Should <em>Hellfire</em> go to series and the network renew Batman prequel <em>Gotham, </em>the network would have dramas from both comic book powerhouses DC Comics and Marvel &mdash; a first for a broadcast network and something insiders would love to see on their schedule.</p>  <p>&nbsp;</p>            <footer class="blog-post-tags">                              <a href="/topic/tv-development" data-tracklabel="Story Well - Bottom Tags TV Development">TV Development</a>                      </footer>      </article>        <div class="blog-post-footer-ad">  

Using Google Chrome > Inspect > Copy XPath, I got:

  //*[@id="page-content"]/div[1]/article    

I tried it, but Google Sheets gives me a parsing error.

I also tried a solution from another question on Stack Overflow, but it is not working for me:

  =importXML(C2,"//article[contains(concat('', normalize-space(@class), ''), '')//div[@class='blog-post-body']]")    

What I'm trying to achieve is to get all the text inside the <article> tag. A BIG plus would be to get the text of the <article> excluding the <div class="embedded-content"> in the middle of the article.

Any help will be much appreciated.

Thanks
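
To make the desired output concrete, here is a minimal Python sketch that uses only the standard library's html.parser. It is not an IMPORTXML solution; the class name ArticleText and the shortened sample markup are made up for illustration. It collects the text inside <article class="blog-post-body"> and skips everything under <div class="embedded-content">. In XPath 1.0 terms the same selection would look roughly like //article[@class='blog-post-body']//text()[not(ancestor::div[@class='embedded-content'])], though whether IMPORTXML returns the expected result for this page is untested here. Note the single quotes: inside a Sheets formula the XPath is already wrapped in double quotes, so the double quotes in the expression Chrome copies would end the string early, which is a likely cause of the parsing error.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ArticleText(HTMLParser):
    """Hypothetical helper: text of <article class="blog-post-body">,
    minus any <div class="embedded-content"> subtree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0  # > 0 while inside the target <article>
        self.skip_depth = 0     # > 0 while inside the excluded <div>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth:
            self.skip_depth += 1
        elif self.article_depth:
            if tag == "div" and "embedded-content" in classes:
                self.skip_depth = 1
            else:
                self.article_depth += 1
        elif tag == "article" and "blog-post-body" in classes:
            self.article_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.article_depth:
            self.article_depth -= 1
            if tag == "p":
                self.parts.append(" ")  # keep paragraphs from running together

    def handle_data(self, data):
        if self.article_depth and not self.skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace (including &nbsp;) into single spaces.
        return " ".join("".join(self.parts).split())


# Shortened, made-up sample that mirrors the structure of the page source.
sample = (
    '<div class="ad"></div>'
    '<article class="blog-post-body">'
    '<p>Fox&#39;s <em>X-Men </em>drama.</p>'
    '<p><div class="embedded-content"><script type="application/json">'
    '{"nid": 832221}</script></div></p>'
    '<p>Sources say more.</p>'
    '</article>'
    '<div class="blog-post-footer-ad">ad</div>'
)

parser = ArticleText()
parser.feed(sample)
parser.close()
print(parser.text())
```

The sketch relies on the tags being balanced in the source, as they are in the snippet above; a page with unclosed tags would need a real HTML tree builder instead of depth counting.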
