Tuesday, 17 February 2015

Performance tuning and optimization of XML::Twig in Perl



Below is the code I have written to filter my 3 to 5 GB XML file based on the following four conditions:


1) All sub-stocks (STOCK elements nested inside another STOCK) should be filtered out.


2) Only stocks with a certain origin (ASIA in the code) should be kept. All others should be filtered out.


3) Inside each stock (trade) there is an event tag, which in turn contains a subevent tag. The subevent tag's 'code' attribute should have the value 'abc', and the event and subevent tags should have certain values for their other attributes (eventtype and narrative), as can be seen in the code.


4) Only the highest version (an attribute of stock) for a given ref (another attribute of stock) should be kept. All the rest should be deleted. (This is the most complicated condition.)
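Condition 4 amounts to keeping a running maximum per ref. Here is a minimal standalone sketch of that idea, separate from the XML handling. The (ref, version) pairs are made-up stand-ins for the STOCK attributes, and it assumes versions are plain numbers:

```perl
use strict;
use warnings;

# Hypothetical (ref, version) pairs standing in for STOCK attributes.
my @stocks = ( [ 12, 1 ], [ 12, 2 ], [ 13, 1 ] );

# First pass: record the highest version seen for each ref.
my %max_version;
for my $stock (@stocks) {
    my ($ref, $version) = @$stock;
    $max_version{$ref} = $version
        if !exists $max_version{$ref} or $version > $max_version{$ref};
}

# Second pass: keep only the entries that carry that highest version.
my @kept = grep { $_->[1] == $max_version{ $_->[0] } } @stocks;

print "ref=$_->[0] version=$_->[1]\n" for @kept;
```

Two passes are needed because the highest version for a ref is only known once every entry has been seen.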


My code is:



use strict;
use warnings;
use XML::Twig;

open( my $out, '>:utf8', 'out.xml') or die "cannot create output file out.xml: $!";
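my $twig = XML::Twig->new(
    twig_roots => {
        # condition 1: drop STOCK elements nested inside another STOCK
        '/STOCKEXT/STOCK/STOCK'              => sub { $_->delete },
        # condition 2: drop stocks whose origin is not ASIA
        '/STOCKEXT/STOCK[@origin != "ASIA"]' => sub { $_->delete },
        # condition 4, first half: record the highest version per ref
        '/STOCKEXT/STOCK'                    => \&trade_handler,
    },
    att_accessors => [ qw/ ref version / ],
    pretty_print  => 'indented',
);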

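my %max_version;
$twig->parsefile('1513.xml');

# Second pass over the parsed tree, once every max version is known.
for my $trade ($twig->root->children('STOCK'))
{
    my ($ref, $version) = ($trade->ref, $trade->version);

    # Keep the stock only if it is the highest version for its ref (condition 4)
    # and has a matching event/subevent pair (condition 3).
    if ($version == $max_version{$ref} &&
        grep { grep { $_->att('code') eq 'abc' and $_->att('narrative') eq 'def' }
               $_->children('subevent') } $trade->children('event[@eventtype="ghi"]'))
    {
        $trade->flush($out);
    }
    else
    {
        $trade->purge;
    }
}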

sub trade_handler
{
    my ($twig, $trade) = @_;
    my ($ref, $version) = ($trade->ref, $trade->version);

    # Remember the highest version seen so far for this ref.
    unless (exists $max_version{$ref} and $max_version{$ref} >= $version)
    {
        $max_version{$ref} = $version;
    }

    return 1;
}


Sample XML



<STOCKEXT>
  <STOCK origin="ASIA" ref="12" version="1">(Filtered out: lower version for ref=12)
    <event eventtype="ghi">
      <subevent code="abc" narrative="def" />
    </event>
  </STOCK>
  <STOCK origin="ASIA" ref="12" version="2">(Kept: highest version=2 for ref=12)
    <event eventtype="ghi">
      <subevent code="abc" narrative="def" />
    </event>
  </STOCK>
  <STOCK origin="ASI" ref="13" version="1">(Filtered out: origin "ASI" is wrong)
    <event eventtype="ghi">
      <subevent code="abc" narrative="def" />
    </event>
  </STOCK>
</STOCKEXT>


The code works and produces the required output, but it consumes a huge amount of memory, even though I have tried to use flush and purge. Can anybody please help with some optimization tips?

