Reading RSS Feeds

  • avackova
    Hello,
    please see the attached graph. It reads RSS data from the BBC server, passes the URL of each article to DataReader, and saves the full article to the output file. Note that you need to increase some default properties to run the graph successfully:
    DataParser.FIELD_BUFFER_LENGTH = 262304
    Record.MAX_RECORD_SIZE = 524608
    DataFormatter.FIELD_BUFFER_LENGTH = 262304
    DEFAULT_INTERNAL_IO_BUFFER_SIZE = 262304
  • ccasano
    Thanks Agata. That worked great. Can you also use an XML Extract component to parse HTML? I've been trying without much luck. I was trying to get the text in the <body> tags of the document. After doing some research, I think I'm going down the wrong path here...

    Thanks again
  • avackova
    Hello,
    this is a question about parsing an HTML document, which is not easy. If you know the exact structure of the document, you will probably be able to get the text content only, but I don't see how to get only the article's text in our example.
  • ccasano
    One of my co-workers suggested using a regex to grab the body of the web page, so I tried going down this path. I have the content of the web page coming through as a record. I'm then using a Reformat component to apply a regex and extract the body of the page. The regex works in the regex tester but doesn't seem to work in the CTL2 code using the find() function. When I debug the output of the Reformat component, no data is being placed into $0.body.

    Am I using the find() function incorrectly? If not, is there a better approach here?


    //#CTL2
    string strBody = "";

    // Transforms input record into output record.
    function integer transform() {
        foreach (string item : find($0.content,'\<body\>.\</body\>')) {
            strBody = concat(strBody, item);
        }

        $0.content = "none";
        $0.body = strBody;

        return ALL;
    }
  • avackova
    Hello,
    the problem is that CTL regexp doesn't support flags, so it is impossible to apply the pattern .* to multiline input (https://bug.javlin.eu/browse/CL-1929). As a workaround, you need to use a Java class (attached).
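The attached class is not reproduced in this thread. As a rough illustration only, a Java helper for this kind of workaround might look like the sketch below. The class name BodyExtractor and the method extractBody are hypothetical, and the pattern is an assumption about what "the body of the page" means here. It relies on Pattern.DOTALL so that "." also matches line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the class attached to the original post.
class BodyExtractor {

    // DOTALL lets '.' match line terminators, so the match can span lines.
    // CASE_INSENSITIVE also accepts <BODY>; the reluctant ".*?" stops at the first </body>.
    private static final Pattern BODY =
        Pattern.compile("<body\\b[^>]*>(.*?)</body>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between the body tags, or an empty string if no body is found.
    static String extractBody(String html) {
        if (html == null) {
            return "";
        }
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String page = "<html>\n<body class=\"x\">\nHello\nworld\n</body>\n</html>";
        System.out.println(extractBody(page));
    }
}
```

A regex like this only works for reasonably well-formed pages with a single body element; it is not a general HTML parser.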
  • ccasano
    Great! Thank you Agata.
  • avackova
    I finally realized that CTL regexp does support flags :-). This is described under the replace function on the String Functions page. So the following CTL code works as well:
    //#CTL2

    // Transforms input record into output record.
    function integer transform() {
        string strBody = "";

        foreach (string item : find($0.content,"(?s)<body.*/body>")) {
            strBody = concat(strBody, item);
        }

        $0.content = "none";
        $0.body = strBody;

        return ALL;
    }
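To see what the (?s) flag changes, the same pattern can be tried in plain Java. This is a small sketch that assumes CTL's find() follows java.util.regex syntax, which the embedded flag above suggests. The class name DotallFlagDemo and the sample page are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative demo; the class name and sample input are assumptions.
class DotallFlagDemo {

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String page = "<html>\n<body>\nline one\nline two\n</body>\n</html>";

        // Without the flag, '.' stops at the first line break, so nothing matches.
        System.out.println(firstMatch("<body.*/body>", page));

        // With the embedded (?s) flag, '.' also matches line breaks.
        System.out.println(firstMatch("(?s)<body.*/body>", page));
    }
}
```

Note that ".*" is greedy, so on a page containing more than one "/body>" the match would run to the last one; ".*?" would stop at the first.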
