XML : How do I access the CDATA in a title tag in XML with Nokogiri?

Here is an example of the XML I am working with:

    <item rdf:about="http://auburn.craigslist.org/cpg/5368609005.html">
      <title><![CDATA[Help Wanted for Online Business]]></title>
      <link>http://auburn.craigslist.org/cpg/5368609005.html</link>
      <description><![CDATA[Create a safer environment for your children and WORK FROM HOME helping others do the same.   NO Sales, No Home parties, No Tele-marketing! 1/2 computer 1/2 telephone .......No Risk Involved!........High speed Internet and telephone with long distance [...]]]></description>
      <dc:date>2016-01-16T09:14:35-06:00</dc:date>
      <dc:language>en-us</dc:language>
      <dc:rights>&#x26;copy; 2016 &#x3C;span class="desktop"&#x3E;craigslist&#x3C;/span&#x3E;&#x3C;span class="mobile"&#x3E;CL&#x3C;/span&#x3E;</dc:rights>
      <dc:source>http://auburn.craigslist.org/cpg/5368609005.html</dc:source>
      <dc:title><![CDATA[Help Wanted for Online Business]]></dc:title>
      <dc:type>text</dc:type>
      <dcterms:issued>2016-01-16T09:14:35-06:00</dcterms:issued>
    </item>

I parsed the feed and selected the items like this:

    doc = Nokogiri::HTML(open(content_url)) do |config|
      config.strict.noblanks
    end

    bq = doc.xpath("//item")

When I inspect this in pry, this is what bq.first looks like:

    [5] pry(main)> bq.first
    => #(Element:0x3fbfec8f9788 {
      name = "item",
      attributes = [ #(Attr:0x3fbfec8f195c { name = "rdf:about", value = "http://auburn.craigslist.org/cpg/5368609005.html" })],
      children = [
        #(Element:0x3fbfec8e939c { name = "title" }),
        #(Element:0x3fbfec8e0b98 { name = "link" }),
        #(Text "http://auburn.craigslist.org/cpg/5368609005.html\n"),
        #(Element:0x3fbfec8dd18c { name = "description" }),
        #(Element:0x3fbfed088e68 { name = "date", children = [ #(Text "2016-01-16T09:14:35-06:00")] }),
        #(Element:0x3fbfed079620 { name = "language", children = [ #(Text "en-us")] }),
        #(Element:0x3fbfec8d1044 { name = "rights", children = [ #(Text "&copy; 2016 <span class=\"desktop\">craigslist</span><span class=\"mobile\">CL</span>")] }),
        #(Element:0x3fbfed054050 { name = "source", children = [ #(Text "http://auburn.craigslist.org/cpg/5368609005.html")] }),
        #(Element:0x3fbfed025408 { name = "title" }),
        #(Element:0x3fbfec89d2a8 { name = "type", children = [ #(Text "text")] }),
        #(Element:0x3fbfec85e79c { name = "issued", children = [ #(Text "2016-01-16T09:14:35-06:00")] })]
      })

Notice that the three elements whose content is CDATA (title, description, and dc:title) appear with no children in Nokogiri's output.

Specifically, I am referring to these lines:

    <title><![CDATA[Help Wanted for Online Business]]></title>
    <description><![CDATA[Create a safer environment for your children and WORK FROM HOME helping others do the same.   NO Sales, No Home parties, No Tele-marketing! 1/2 computer 1/2 telephone .......No Risk Involved!........High speed Internet and telephone with long distance [...]]]></description>
    <dc:title><![CDATA[Help Wanted for Online Business]]></dc:title>

Which produced these results:

    [5] pry(main)> bq.first
        #(Element:0x3fbfec8e939c { name = "title" }),
        #(Element:0x3fbfec8dd18c { name = "description" }),
        #(Element:0x3fbfed025408 { name = "title" }),

Why are those values blank and how can I specifically look for and get that CDATA text?
