XML: I need to check the sequence of link tags

I need to insert the <fgs> tag for link citation elements at the first occurrence of each unique link, in sequence, wherever it is placed in the paragraph. However, I have a checklist to follow when inserting it:

  1. Do not jump or skip numbers; follow the sequence.
  2. Do not insert the <fgs> tag for a duplicate link tag.
  3. If the link tags are shuffled within the same paragraph, the <fgs> tag must be inserted there. In that case, do not consider the first occurrence of the link tag.
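To make "sequence" and "shuffled" in the checklist concrete, here is a minimal Perl sketch that lists the figure numbers linked from one paragraph and reports whether they ever go down. The helper names `fig_sequence` and `is_shuffled` are hypothetical, and the sketch assumes that only `href` values of the form `#...-fig-NNNN` count as figure links (bibliography links are ignored).

```perl
use strict;
use warnings;

# Hypothetical helper: return the figure numbers linked from one paragraph,
# in document order. Assumption: only href values like "#c09-fig-0003"
# count; links such as "#c09-bib-0135" are ignored.
sub fig_sequence {
    my ($para) = @_;
    my @nums = $para =~ /<link\s+href="#[^"]*-fig-(\d+)"\s*\/>/g;
    return map { $_ + 0 } @nums;    # "0004" -> 4
}

# Hypothetical helper: returns 1 if the numbers ever decrease, i.e. the
# links are "shuffled" within the paragraph, otherwise 0.
sub is_shuffled {
    my @nums = @_;
    for my $i (1 .. $#nums) {
        return 1 if $nums[$i] < $nums[$i - 1];
    }
    return 0;
}

my $para = '<p xml:id="p1">(Figure <link href="#c09-fig-0004"/>) then '
         . '<link href="#c09-fig-0003"/> and <link href="#c09-fig-0005"/>.</p>';
my @seq = fig_sequence($para);
print "sequence: @seq\n";                       # sequence: 4 3 5
print "shuffled: ", is_shuffled(@seq), "\n";    # shuffled: 1
```

Whether a repeated number (for example 3, 3, 4) should also count as shuffled is not stated in the checklist; this sketch treats it as in order.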

Input:

  <p xml:id="c09-para-0007">The targets for which analysis <link href="#c09-fig-0001"/> is ...</p>
  <p xml:id="c09-para-0019">Antibodies (Abs) are the molecules the guard of the organism. Among them, lymphocytes B that maturate in the bone marrow are the producers of antibodies. These extraordinary proteins can be envisaged through a basic structure that is shown in Figure <link href="#c09-fig-0002"/>a.</p>
  <p xml:id="c09-para-0027">Antibody can be lyzed into different constitutive fragments using the Figure <link href="#c09-fig-0003"/>a). The other part is crystallizable and named Fc.</p>
  <p xml:id="c09-para-0028">Achieving smaller molecules with single aminoacidic chain (scFv, Figures <link href="#c09-fig-0003"/>a and <link href="#c09-fig-0004"/>). This is obtained by spontaneous association of V<sub>H</sub> and V<sub>L</sub> domains generated by recombinant techniques (genetic engineering) that are linked through a chain of 15 amino acids.</p>
  <p xml:id="c09-para-0029">As commented earlier biologically (through hybridoma technology) (Figure <link href="#c09-fig-0004"/>). To avoid the generation of Fv fragments (by the) <link href="#c09-fig-0003"/>of triabodies (trimeric) or tetrabodies (tetrameric). In the case of bispecific diabodies, produced to generate trivalent mono&hyphen; or tetraspecific tetrabodies<link href="#c09-fig-0005"/>.</p>
  <p xml:id="c09-para-0030">Antibodies can be obtained (Figure <link href="#c09-fig-0003"/>b). Another novel Ig with one variable domain, called <term xml:id="c09-term-0025">novel antigen receptor (V<sub>NAR</sub>)</term>, was discovered in cartilaginous fish, such as sharks <link href="#c09-bib-0135"/>. Both are small, highly soluble, possess superior <link href="#c09-fig-0004"/>stability, and seem very adequate for biosensing purposes.</p>
  <p xml:id="c09-para-0030">Known as <term xml:id="c09-term-0023">V<sub>HH</sub></term> or <term xml:id="c09-term-0024">nanobody</term><link href="#c09-bib-0134"/> (Figure <link href="#c09-fig-0006"/>b). Both are small, highly soluble, possess superior stability, and seem very adequate for biosensing purposes.</p>

Output:

  <p xml:id="c09-para-0007">The targets for which analysis <link href="#<fgs>c09-fig-0001</fgs>"/> is ...</p>
  <p xml:id="c09-para-0019">Antibodies (Abs) are the molecules the guard of the organism. Among them, lymphocytes B that maturate in the bone marrow are the producers of antibodies. These extraordinary proteins can be envisaged through a basic structure that is shown in Figure <link href="#<fgs>c09-fig-0002</fgs>"/>a.</p>
  <p xml:id="c09-para-0027">Antibody can be lyzed into different constitutive fragments using the Figure <link href="#c09-fig-0003"/>a). The other part is crystallizable and named Fc.</p>
  <p xml:id="c09-para-0028">Achieving smaller molecules with single aminoacidic chain (scFv, Figures <link href="#c09-fig-0003"/>a and <link href="#c09-fig-0004"/>). This is obtained by spontaneous association of V<sub>H</sub> and V<sub>L</sub> domains generated by recombinant techniques (genetic engineering) that are linked through a chain of 15 amino acids.</p>
  <p xml:id="c09-para-0029">As commented earlier biologically (through hybridoma technology) (Figure <link href="#<fgs>c09-fig-0004</fgs>"/>). To avoid the generation of Fv fragments (by the) <link href="#<fgs>c09-fig-0003</fgs>"/>of triabodies (trimeric) or tetrabodies (tetrameric). In the case of bispecific diabodies, produced to generate trivalent mono&hyphen; or tetraspecific tetrabodies<link href="#<fgs>c09-fig-0005</fgs>"/>.</p>
  <p xml:id="c09-para-0030">Antibodies can be obtained (Figure <link href="#c09-fig-0003"/>b). Another novel Ig with one variable domain, called <term xml:id="c09-term-0025">novel antigen receptor (V<sub>NAR</sub>)</term>, was discovered in cartilaginous fish, such as sharks <link href="#c09-bib-0135"/>. Both are small, highly soluble, possess superior <link href="#c09-fig-0004"/>stability, and seem very adequate for biosensing purposes.</p>
  <p xml:id="c09-para-0030">Known as <term xml:id="c09-term-0023">V<sub>HH</sub></term> or <term xml:id="c09-term-0024">nanobody</term><link href="#c09-bib-0134"/> (Figure <link href="#<fgs>c09-fig-0006</fgs>"/>b). Both are small, highly soluble, possess superior stability, and seem very adequate for biosensing purposes.</p>

Code:

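  readfile('myfile.xml', $tmpxml);   # readfile() is my own helper; it loads the file into $tmpxml
  # Match one <p ...>...</p> at a time. The opening-tag pattern was "<p(?: |^>)>",
  # which cannot match "<p xml:id=...>"; it now allows attributes.
  while ($tmpxml =~ m/<p(?:\s[^>]*)?>((?:(?!<\/p>).)*)<\/p>/sg)
  {
      $fpre = $fpre.$`; $fmatch = $&; $fpost = $';
      my $lsCnt = $cnt - 1; my $adCnt = $cnt + 1; my $dupMatch = $fmatch;
      $cnt = sprintf "%04d", $cnt; $lsCnt = sprintf "%04d", $lsCnt; $adCnt = sprintf "%04d", $adCnt;
      if ($dupMatch =~ m/-$cnt"/g)
      {
          my $nwfpost = $fpost;
          if ($fpost !~ m/\-$lsCnt"/g)
          {
              if ($nwfpost =~ m/$adCnt"/g)
              {
                  # The closing quote of href was missing from the pattern, and the
                  # "#" has to stay outside <fgs> to match the expected output.
                  $fmatch =~ s/<link href="#([^"]*)"\/>/<link href="#<fgs>$1<\/fgs>"\/>/g;
              }
          }
          $cnt++;
      }
      $fpre = $fpre.$fmatch; $tmpxml = $fpost;
  }
  if (length $fpre) { $tmpxml = $fpre.$fpost; }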
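As an alternative to the single loop above, here is a hedged two-pass sketch of one possible reading of the checklist. It is an assumption, not a confirmed specification: pass 1 finds paragraphs whose figure numbers go down at some point ("shuffled") and reserves every figure cited there for that paragraph; pass 2 tags a figure link only in its reserved paragraph, or at its first occurrence if it was never reserved, and never twice. The function name `tag_figures` is hypothetical, rule 1 (no skipped numbers) is not enforced, and the code uses only core Perl regexes rather than an XML parser.

```perl
use strict;
use warnings;

# Sketch only: tag_figures() and its two-pass rule are assumptions about
# what the checklist means, not a verified implementation of it.
sub tag_figures {
    my ($xml) = @_;

    # Pass 1: reserve the figures of each "shuffled" paragraph.
    my %reserved;    # figure number => index of the paragraph that owns it
    my $idx = 0;
    while ($xml =~ /<p\b[^>]*>(.*?)<\/p>/sg) {
        my $body = $1;
        $idx++;
        my @nums;
        push @nums, $1 + 0 while $body =~ /<link\s+href="#[^"]*-fig-(\d+)"\s*\/>/g;
        my $shuffled = grep { $nums[$_] < $nums[$_ - 1] } 1 .. $#nums;
        next unless $shuffled;
        for my $n (@nums) {
            $reserved{$n} = $idx unless exists $reserved{$n};
        }
    }

    # Pass 2: wrap the id in <fgs> where the rule above allows it.
    my %done;        # figure numbers that have already been tagged
    my $cur = 0;
    $xml =~ s{(<p\b[^>]*>)(.*?)(</p>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $cur++;
        $body =~ s{(<link\s+href="#)([^"]*-fig-(\d+))("\s*/>)}{
            my ($pre, $id, $n, $post) = ($1, $2, $3 + 0, $4);
            my $allowed = !exists($reserved{$n}) || $reserved{$n} == $cur;
            my $tag = $allowed && !$done{$n}++;
            $tag ? "$pre<fgs>$id</fgs>$post" : "$pre$id$post";
        }ge;
        $open . $body . $close;
    }gse;

    return $xml;
}

# Small made-up sample in the same shape as the input in the question.
my $in = join '',
    '<p>A <link href="#c09-fig-0001"/>.</p>',
    '<p>B <link href="#c09-fig-0003"/>.</p>',
    '<p>C <link href="#c09-fig-0004"/>, <link href="#c09-fig-0003"/>, <link href="#c09-fig-0005"/>.</p>',
    '<p>D <link href="#c09-fig-0003"/> <link href="#c09-bib-0135"/>.</p>';
my $out = tag_figures($in);
print "$_\n" for $out =~ /(<p\b.*?<\/p>)/sg;
```

On this sample the link in paragraph A and all three links in paragraph C are tagged, while paragraphs B and D are left unchanged. Traced by hand against the input in the question, the same reading appears to give the expected output shown there, but it has not been checked against other data.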
