I have an HTML document that I am extracting elements from with XPath. I'm trying to work out how to extract the full HTML branch of each of a given set of <DIV> tags so that I can process the branches further in subsequent graph nodes.
I found a similar question about doing this in PHP, but I am not familiar with that language:
http://stackoverflow.com/questions/1534 ... ery-in-php
In case it helps, this is the Mapping syntax I am currently using, which extracts hyperlinks and descriptions from each DIV where class='details':
<Context
xpath="//DIV[@class='details']"
outPort="0" >
<Mapping
xpath="./DIV[@class='vehicle']//A/@href"
cloverField="Record_TagA_href"/>
<Mapping
xpath="./DIV[@class='vehicle']//A"
cloverField="Record_TagA"/>
<Mapping
xpath="./DIV[@class='vehicle']//H5"
cloverField="Record_TagH5"/>
<Mapping
xpath=".//UL[@class='specifics']"
cloverField="Record_TagULspecifics"/>
</Context>
Instead, I would like the full HTML string of each DIV node where class='details', with everything else stripped out, e.g. leaving just:
<DIV class='details'>...</DIV>.
I originally tried this in a standard transform, using string manipulation with a regex:
foreach (string item : find($in.0.Document_XHTML,'<DIV class="details">(.*?)</DIV>'))
However, this didn't return the full outer DIV if there were any nested DIV tags within it.
Thanks
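
For anyone reading along, the nested-DIV problem described above can be reproduced outside CloverETL. The following is only a minimal Python sketch, not CTL and not part of any CloverETL API: the function name `extract_divs` and the `SAMPLE` string are made up for illustration. It shows that the non-greedy pattern stops at the first `</DIV>` it meets, and that counting the opening and closing DIV tags recovers the full outer block. It assumes upper-case tags, balanced markup, and no self-closing `<DIV/>`.

```python
import re

# Hypothetical sample input: the first details DIV contains a nested DIV.
SAMPLE = (
    '<BODY><DIV class="details"><DIV class="vehicle">'
    '<A href="/a">A</A></DIV><UL class="specifics"></UL></DIV>'
    '<P>x</P><DIV class="details">two</DIV></BODY>'
)

# The pattern from the question: (.*?) ends at the FIRST </DIV>,
# which belongs to the nested DIV, so the outer block is cut short.
NON_GREEDY = re.compile(r'<DIV class="details">(.*?)</DIV>')

OPENING = re.compile(r'<DIV class="details">')
ANY_DIV_TAG = re.compile(r'<DIV\b[^>]*>|</DIV>')


def extract_divs(html):
    """Return each outer <DIV class="details">...</DIV>, nested DIVs included."""
    blocks = []
    pos = 0
    while True:
        start = OPENING.search(html, pos)
        if start is None:
            break
        depth = 0
        end = None
        # Walk every DIV tag from the opening tag onward, tracking nesting depth.
        for tag in ANY_DIV_TAG.finditer(html, start.start()):
            depth += -1 if tag.group().startswith('</') else 1
            if depth == 0:
                end = tag.end()
                break
        if end is None:
            break  # unbalanced markup: stop rather than guess
        blocks.append(html[start.start():end])
        pos = end
    return blocks


if __name__ == "__main__":
    print(NON_GREEDY.findall(SAMPLE))
    print(extract_divs(SAMPLE))
```

Tag counting like this is fragile on real-world HTML (comments, attributes in a different order, mixed case), so it is meant to illustrate the failure, not to replace a proper parser.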
-
Hello chathaway,
If I understand correctly, the functionality you describe has been available in the XMLExtract component since CloverETL version 3.4.0; see https://bug.javlin.eu/browse/CL-2118
You can extract the whole XML subtree this way and then filter the result using a regexp to have only the DIVs with the desired class.
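
To illustrate the general idea of pulling out a whole subtree as a string, here is a rough sketch in Python using the standard library's `xml.etree.ElementTree`. This is not how XMLExtract is configured and it does not use any CloverETL API; the function name `details_subtrees` is hypothetical, and the sketch assumes the input is well-formed XHTML with upper-case tag names as in the question.

```python
import xml.etree.ElementTree as ET


def details_subtrees(xhtml, css_class="details"):
    """Serialize every DIV with the given class, including all nested content."""
    root = ET.fromstring(xhtml)
    results = []
    # iter() walks the whole tree (root included) in document order.
    for div in root.iter("DIV"):
        if div.get("class") != css_class:
            continue
        # tostring() would also emit text that follows the closing tag,
        # so drop the tail to keep only <DIV ...>...</DIV>.
        div.tail = None
        results.append(ET.tostring(div, encoding="unicode"))
    return results


if __name__ == "__main__":
    doc = (
        '<HTML><BODY><DIV class="details"><DIV class="vehicle">'
        '<A href="/a">Car</A></DIV></DIV>tail'
        '<DIV class="other">x</DIV></BODY></HTML>'
    )
    for block in details_subtrees(doc):
        print(block)
```

Because the parser tracks nesting itself, nested DIVs inside a matching DIV come along intact, which is what the regex approach in the question could not guarantee.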
Best regards,