
Filtering nodes do not support multi-language character sets


  • dpavlis
    Can you try using Unicode escape sequences in place of the characters (both in the regex and in the substring comparison)?

    Also, I am not sure what problem you are describing. Is it that 黄威 is not recognized as \w?

    Sorry, I am not familiar with Asian alphabets and need a hint here.
  • hwhwhw
    table structure:
    create table t1 (f1 varchar(50), f2 varchar(50));

    record content:
    黄威 20071976北京
    huangwei 20071976beijing

    extFilter node expression:
    $f2 ~= '^[0-9]{8}[a-z]*'

    outPort (0) output record
    huangwei 20071976beijing

    outPort (1) output record
    黄威 20071976北京
    ----------------------------------------------------------
    I want outPort (0) to output the record below:
    黄威 20071976北京
    ----------------------------------------------------------

    extFilter node expression:
    $f2 ~= '^[0-9]{8}\p{InHanzi}*'
    output error info:
    ERROR [WatchDog] - EXT_FILTER_0 ...FAILED !
    Parser error when parsing expression: Encountered "\'^[0-9]{8}" at line 1, column 8.
    Was expecting:
    <STRING_LITERAL> ...


    extFilter node expression:
    substring($f2,8,2)=='北京'

    outPort (0) output records: 0
    outPort (1) output records: 2
    黄威 20071976北京
    huangwei 20071976beijing
  • dpavlis
    If you use a \ (backslash) in a regex string in the transform language, you have to escape it, like this:

    $f2 ~= '^[0-9]{8}\\\\p{InHanzi}*'


    The reason is that the backslash is preprocessed twice: first when the expression is read from XML, where \\ is reduced to \, and then again by the TL parser, which also reduces \\ to \. Only then does the string reach Java's regex evaluator.

    We will try to fix this nuisance (present in 2.3.x and earlier) in the next release of Clover.

    I will look into the rest of the problem too, but please try the updated expression above.
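A minimal Java sketch of the double preprocessing described above, assuming each pass does nothing more than collapse \\ to \. The class and method names are hypothetical, and this is not Clover's actual code. It also swaps in InCJKUnifiedIdeographs, a block name that java.util.regex accepts, because InHanzi does not appear to be a block name Java recognizes.

```java
import java.util.regex.Pattern;

class EscapeLayersDemo {
    // Hypothetical stand-in for one preprocessing pass (XML reader or TL parser):
    // every pair of backslashes collapses to a single backslash.
    static String unescapeOnce(String s) {
        return s.replace("\\\\", "\\");
    }

    // Applies both passes, giving the string that would reach the regex engine.
    static String afterTwoPasses(String typed) {
        return unescapeOnce(unescapeOnce(typed));
    }

    public static void main(String[] args) {
        // Four backslashes as typed in the expression (eight in a Java literal).
        String typed = "^[0-9]{8}\\\\\\\\p{InCJKUnifiedIdeographs}*";
        String regex = afterTwoPasses(typed);
        System.out.println(regex); // one backslash left before 'p'

        Pattern p = Pattern.compile(regex);
        // The suffix is U+5317 U+4EAC, the two characters from the sample record.
        System.out.println(p.matcher("20071976\u5317\u4eac").matches()); // true
        System.out.println(p.matcher("20071976beijing").matches());      // false
    }
}
```

If only one of the two passes is actually applied, two backslashes survive and the regex engine sees a literal backslash followed by p{...}, which is worth checking when an "Illegal repetition" error appears.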
  • hwhwhw
    ERROR [WatchDog] - EXT_FILTER_0 ...FAILED !
    Error when parsing expression: Illegal repetition near index 11
    ^[0-9]{8}\\p{InHanzi}*


    --------------------------------

    substring($f2,8,2)=='北京'

    Why does the substring function not support "北京"?
  • dpavlis
    Well, that is an interesting problem with the regex... I will look into it.

    As for the substring: try using Unicode escapes (\uxxxx) in place of the two characters. You will have to find their Unicode code points.
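One way to find those escapes without looking them up by hand is a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it escapes each UTF-16 code unit above ASCII, which is the form Java-style escapes expect.

```java
class UnicodeEscapeDemo {
    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape
    // made of four lowercase hex digits; ASCII characters pass through unchanged.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Default input is U+5317 U+4EAC, the two characters compared in the filter.
        String input = args.length > 0 ? args[0] : "\u5317\u4eac";
        System.out.println(toUnicodeEscapes(input));
    }
}
```

The printed escapes can then be pasted into the filter expression in place of the original characters.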
  • avackova
    I've found that the following regex does not throw an exception:
    "^[0-9]{8}[\\p{InHanzi}]*"
  • hwhwhw
    Thank you for your response. The substring function issue has been resolved.
  • dpavlis
    Cool, can I ask how you solved it?
  • hwhwhw
    The solution is a bit inconvenient. The process is this:

    1)
    D:\javasoft\Jdk1.5.0_04\bin>native2ascii
    北京
    \u5317\u4eac

    2)
    extFilter node expression:
    substring($f2,8,2)=='\u5317\u4eac'

    ====================================
    extFilter node expression:
    $f2 ~= '^[0-9]{8}[\u4e00-\u9fa5]*'

    [\u4e00-\u9fa5] stands for the Asian regional character set. This approach is somewhat cumbersome.
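For anyone who wants to check the range outside Clover, here is a rough Java sketch that applies the same pattern to the two sample f2 values. It assumes the filter requires the whole field to match, as the results earlier in the thread suggest. The class name, the helper, and the combined pattern are illustrative only. Note that 4e00-9fa5 lies inside the CJK Unified Ideographs block and does not cover the extension blocks.

```java
import java.util.regex.Pattern;

class HanRangeFilterDemo {
    // Range from the thread; the regex engine resolves the escapes itself.
    static final Pattern HAN_ONLY =
        Pattern.compile("^[0-9]{8}[\\u4e00-\\u9fa5]*");

    // Hypothetical variant that accepts lowercase ASCII or the same range.
    static final Pattern HAN_OR_ASCII =
        Pattern.compile("^[0-9]{8}[a-z\\u4e00-\\u9fa5]*");

    // True when the whole field matches, i.e. the record would go to port 0.
    static boolean goesToPort0(Pattern filter, String f2) {
        return filter.matcher(f2).matches();
    }

    public static void main(String[] args) {
        String han = "20071976\u5317\u4eac"; // digits followed by U+5317 U+4EAC
        String ascii = "20071976beijing";

        System.out.println(goesToPort0(HAN_ONLY, han));       // true
        System.out.println(goesToPort0(HAN_ONLY, ascii));     // false
        System.out.println(goesToPort0(HAN_OR_ASCII, han));   // true
        System.out.println(goesToPort0(HAN_OR_ASCII, ascii)); // true
    }
}
```

The combined character class is an option if both records should end up on port 0.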
