xml - LibXML - Looping through nodes until -
I'm trying to parse the XML below using Perl's XML::LibXML library.
    <?xml version="1.0" encoding="utf-8" ?>
    <taggedpdf-doc>
      <part>
        <sect>
          <h4>2.1 study purpose </h4>
          <p>this study purpose content</p>
          <p>content 1</p>
          <p>content 2</p>
          <p>content 3 </p>
          <p>content 4</p>
          <p>3. header</p>
          <p>obj content 4</p>
          <p>obj content 2</p>
        </sect>
      </part>
    </taggedpdf-doc>
For the header "study purpose", I'm trying to display the related siblings. The expected output is:
    <h4>2.1 study purpose </h4>
    <p>this study purpose content</p>
    <p>content 1</p>
    <p>content 2</p>
    <p>content 3 </p>
    <p>content 4</p>
My Perl code is below. I can display the first node.
Given the value of the first node, "study purpose", is there a way I can loop and print nodes until I hit a node containing a digit followed by '.'?
My Perl implementation:
    my $purpose_str = 'purpose and rationale|study purpose|study rationale';
    $parser = XML::LibXML->new;
    #print "parser file $file is: $parser \n";
    $dom = $parser->parse_file($file);
    $root = $dom->getDocumentElement;
    $dom->setDocumentElement($root);

    for my $purpose_search ('/taggedpdf-doc/part/sect/h4') {
        $purpose_nodeset = $dom->find($purpose_search);
        foreach $purp_node ($purpose_nodeset->get_nodelist) {
            if ($purp_node =~ m/$purpose_str/i) {
                # get corresponding child nodes
                @childnodes = $purp_node->nonBlankChildNodes();
                $first_kid = shift @childnodes;
                $second_kid = $first_kid->nextNonBlankSibling();
                #$third_kid = $second_kid->nextNonBlankSibling();
                $first_kid->string_value;
                $second_kid->string_value;
                #$third_kid->string_value;
            }
            print "study purpose is: $first_kid\n.$second_kid\n";
        }
    }
Do not look at child nodes if you want siblings. Use textContent if you want to match the node's text content.
    #!/usr/bin/perl
    use warnings;
    use strict;

    use XML::LibXML;

    my $file = 'input.xml';
    my $purpose_str = 'purpose and rationale|study purpose|study rationale';

    my $dom = XML::LibXML->load_xml(location => $file);

    for my $purpose_search ('/taggedpdf-doc/part/sect/h4') {
        my $purpose_nodeset = $dom->find($purpose_search);
        for my $purp_node ($purpose_nodeset->get_nodelist) {
            # Match against the header's text, not the node object itself.
            if ($purp_node->textContent =~ m/$purpose_str/i) {
                # Collect every element sibling that follows the header.
                my @siblings = $purp_node->find('following-sibling::*')
                                         ->get_nodelist;
                # Cut the list at the first sibling that starts with "digits.".
                for my $i (0 .. $#siblings) {
                    if ($siblings[$i]->textContent =~ /^[0-9]+\./) {
                        splice @siblings, $i;
                        last;
                    }
                }
                print $_->textContent, "\n" for @siblings;
            }
        }
    }
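The script above prints only the text of the following siblings, while the expected output in the question also shows the header itself, with its tags. As a minimal sketch of one way to get that, the header's siblings can be walked one at a time with nextNonBlankSibling and serialized with toString. The helper name section_nodes and the inline sample document are assumptions made for illustration; they are not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the original answer): returns the header
# element followed by its siblings, stopping before the first sibling whose
# text starts with digits and a dot, i.e. the next numbered header.
sub section_nodes {
    my ($header) = @_;
    my @nodes = ($header);
    my $node  = $header->nextNonBlankSibling;
    while (defined $node) {
        # Leading whitespace is tolerated here; that is an assumption.
        last if $node->textContent =~ /^\s*[0-9]+\./;
        push @nodes, $node;
        $node = $node->nextNonBlankSibling;
    }
    return @nodes;
}

# Inline copy of the sample document from the question.
my $xml = <<'XML';
<taggedpdf-doc>
  <part>
    <sect>
      <h4>2.1 study purpose </h4>
      <p>this study purpose content</p>
      <p>content 1</p>
      <p>content 2</p>
      <p>content 3 </p>
      <p>content 4</p>
      <p>3. header</p>
      <p>obj content 4</p>
      <p>obj content 2</p>
    </sect>
  </part>
</taggedpdf-doc>
XML

my $dom = XML::LibXML->load_xml(string => $xml);

for my $h4 ($dom->findnodes('/taggedpdf-doc/part/sect/h4')) {
    next unless $h4->textContent =~ /study purpose/i;
    # toString keeps the element tags; textContent would drop them.
    print $_->toString, "\n" for section_nodes($h4);
}
```

Walking siblings directly avoids building the full following-sibling list and then trimming it, which may be convenient when a section is long. The stop condition is the same "digits followed by a dot" test, so a paragraph that happens to begin with a number and a dot would also end the section.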