I have XML docs that contain other XML docs encapsulated as CDATA, like this:
<mds>
<md>
<value>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?><record xmlns:xsi="http://ift.tt/ra1lAU" xmlns:dc="http://ift.tt/mToXri" xmlns:dcterms="http://ift.tt/qxdZ4f">
<dc:title>some text containing &amp;</dc:title></record>]]>
</value>
</md>
</mds>
I extract this XML and the dc:title from it using LibXML:
$dcRawData = <get the CDATA from above>;  # placeholder: the text content of the <value> element
$dcDOM = $::PRSR->load_xml(expand_entities => 0, string => $dcRawData);
$dcTitle = $dcDOM->findvalue("//dc:title");  # string value of the matched node
Then I insert it into another XML section:
<mods:titleInfo>
<mods:title>some text containing &</mods:title>
</mods:titleInfo>
As you can see, the &amp; entity gets expanded and becomes a single &. This is a problem because the resulting XML no longer parses: after a bare &, any parser expects an entity reference.
Is there a way to prevent LibXML from expanding named entities when using findvalue, or to re-encode them before using the value? Other records might contain other entities. The expand_entities option does not make a difference.
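To make the "re-encode before using the value" option concrete, here is a minimal sketch in plain Perl. The helper name escape_xml_text is hypothetical and not part of XML::LibXML, and the hard-coded $dcTitle merely stands in for the string returned by findvalue. It assumes the target XML is assembled by string concatenation, as the snippet above suggests.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::LibXML function): re-escape the characters
# that are markup-significant inside XML character data.
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must run first so the escapes below are not double-encoded
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Stand-in for the decoded string returned by findvalue("//dc:title").
my $dcTitle = 'some text containing &';

# Build the target fragment with the escaped value.
my $modsXml = "<mods:titleInfo>\n"
            . '<mods:title>' . escape_xml_text($dcTitle) . "</mods:title>\n"
            . "</mods:titleInfo>\n";

print $modsXml;
```

If the target document is instead built through the DOM (for example with createElement and appendTextChild), the serializer should escape text nodes itself when toString is called, so no manual re-encoding would be needed in that case.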