Search and replace only within CDATA sections of an XML file



I'm not sure whether this is possible with a regular expression (e.g. in Notepad++) or whether I need a full script. I want to find newlines only within the CDATA sections of an XML file and replace them with a dummy comment or other neutral marker, in order to protect them during a subsequent filtering process.


All other newlines in the XML file should be ignored (those not in a CDATA section).


The background in more detail:


I have an XML file which is an export of several pages from a German-language WordPress-based website. I want to import this file into my translation memory system (memoQ from Kilgray), translate the contents, and then re-export the file for the webmaster to re-import into the WordPress site.


The translation memory software applies a cascading filter to filter out the code and select only the translatable contents, which it presents to me for translation. I translate them, it then reinserts those contents into the code and voila - I have the same file but with English contents instead of German.


The cascading filter in memoQ is an XML filter followed by a HTML filter.


This takes care of everything successfully but unfortunately doesn't preserve the newlines. I've tried tweaking the filter without success. WordPress exports newlines in the text parts of the website as simple newlines, not as HTML tags. So these need to be preserved, but somewhere in the above-mentioned cascading filter they are not being recognised.


This led me to try to protect them before importing into the translation memory software. I initially thought they were all double newlines, so I searched for double newlines and replaced each with a dummy comment <!--MORK_NEWLINE-->. This comment was preserved all the way through, and at the end I could search for it and replace it with a newline again.


However, some of the text in question does not have a double newline but only a single one. And there are other single newlines in the XML file which are not relevant for this, so I don't want to touch them. Hence I'm trying to find out how to replace only those in the CDATA sections.


The relevant code in the XML file looks like this:



<item>
<title>Interview title</title>
<link>http://ift.tt/1qDD3lB</link>
<pubDate>Wed, xx Feb example 06:xx:xx +0000</pubDate>
<dc:creator><![CDATA[mrsmith]]></dc:creator>
<guid isPermaLink="false">http://ift.tt/1nX0tg7</guid>
<description></description>
<content:encoded><![CDATA[<h3>Interview title</h3>
<em>Interview subtitle</em>

<strong>Question text1?</strong>

Answer text1.

<strong>Question text2?</strong>

Answer text2.

<strong>Question text3?</strong>

Answer text3.]]></content:encoded>

</item>


The sections don't always have three questions, and there are other sections containing e.g. an address:



line 1
line 2
line 3
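
For illustration, here is a minimal Python sketch of the behaviour being asked for. It assumes the whole file fits in memory as one string, that no CDATA section contains a literal `]]>` inside it, and that the marker is the `<!--MORK_NEWLINE-->` comment mentioned above. The function names `protect_newlines` and `restore_newlines` are hypothetical, chosen only for this example; this is not a memoQ or WordPress feature.

```python
import re

# The dummy comment from the question; any marker that survives the filter would do.
MARKER = "<!--MORK_NEWLINE-->"

# Non-greedy match from "<![CDATA[" to the first following "]]>".
# re.DOTALL lets "." match newlines, so multi-line sections are matched whole.
CDATA_SECTION = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)


def protect_newlines(xml_text: str) -> str:
    """Replace every newline inside a CDATA section with MARKER.

    Newlines outside CDATA sections are left untouched.
    """
    def swap(match: re.Match) -> str:
        section = match.group(0)
        # Normalise Windows line endings first so each line break gives one marker.
        return section.replace("\r\n", "\n").replace("\n", MARKER)

    return CDATA_SECTION.sub(swap, xml_text)


def restore_newlines(xml_text: str) -> str:
    """Turn every MARKER back into a plain newline after translation."""
    return xml_text.replace(MARKER, "\n")


sample = "<item>\n<content:encoded><![CDATA[<h3>Title</h3>\n\nAnswer.]]></content:encoded>\n</item>"
protected = protect_newlines(sample)
print(protected)
```

In this sketch the two newlines between `</h3>` and `Answer.` become two markers, while the newlines after `<item>` and before `</item>` stay as they are. Note that `restore_newlines` always writes `\n`, so Windows-style `\r\n` line endings inside CDATA sections would not be restored exactly.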


I hope that's enough information to be getting on with,


Thanks for your help


Craig


PS This is my first question here. I've tried searching and can't find anything that answers it directly; sorry if I overlooked anything.


PPS If the answer involves something like Python (which related posts refer to), I have to admit I don't know how to run a script :-( so I'd need a tip there, too!


PPPS If the answer involves serious scripting, I'm happy to commission someone via a freelance site to do it. Where should I go for that?

