Named entities in encapsulated XML cause parsing errors



I have XML docs that contain other XML docs encapsulated as CDATA, like this:



<mds>
<md>
<value>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?><record xmlns:xsi="http://ift.tt/ra1lAU" xmlns:dc="http://ift.tt/mToXri" xmlns:dcterms="http://ift.tt/qxdZ4f">
<dc:title>some text containing &amp;</dc:title></record>]]>
</value>
</md>
</mds>


I extract this XML and the dc:title from it using LibXML:



$dcRawData = <get the CDATA from above>;
$dcDOM = $::PRSR->load_xml(expand_entities => 0, string => $dcRawData);
$dcTitle = $dcDOM->findvalue("//dc:title");


Then I insert it into another XML section:



<mods:titleInfo>
<mods:title>some text containing &</mods:title>
</mods:titleInfo>


As you can see, the &amp; entity gets expanded and becomes a single &. This is a problem: the resulting XML no longer parses, because a bare & is not allowed in text content and the parser expects an entity reference there.


Is there a way to prevent LibXML from expanding named entities when using findvalue, or to re-encode them before using the value? Other records may contain other entities as well. The expand_entities option makes no difference.
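
For illustration, here is a minimal sketch of the re-encoding route, not a confirmed fix. It assumes the target section can be built through the DOM instead of by string concatenation, so that the serializer escapes the text again. The MODS namespace URI, the variable names, and the escape_xml_text helper are assumptions made for this example; the dc namespace URI is copied from the record above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Inner record, as it would look after being taken out of the CDATA section.
my $dc_raw = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<record xmlns:dc="http://ift.tt/mToXri">
<dc:title>some text containing &amp;</dc:title></record>
XML

my $dc_dom = XML::LibXML->load_xml(string => $dc_raw);

# Register the prefix explicitly so the XPath does not depend on context.
my $xpc = XML::LibXML::XPathContext->new($dc_dom);
$xpc->registerNs(dc => 'http://ift.tt/mToXri');

# findvalue returns the string value of the node: the decoded text,
# here with a literal "&".
my $title = $xpc->findvalue('//dc:title');

# Option 1: build the target through the DOM. Text nodes are escaped
# again when the document is serialized.
my $mods_ns = 'http://www.loc.gov/mods/v3';    # assumed namespace URI
my $out     = XML::LibXML::Document->new('1.0', 'UTF-8');
my $info    = $out->createElementNS($mods_ns, 'mods:titleInfo');
$out->setDocumentElement($info);
my $mods_title = $out->createElementNS($mods_ns, 'mods:title');
$info->appendChild($mods_title);
$mods_title->appendTextNode($title);

print $out->toString(1);

# Option 2 (hypothetical helper): escape by hand when the target XML
# is assembled as a plain string.
sub escape_xml_text {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first, or later entities get re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

print "<mods:title>", escape_xml_text($title), "</mods:title>\n";
```

Both options treat the value from findvalue as plain text and leave the escaping to the point where that text is written back into XML, which should also cover other characters that need escaping in other records.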

