Partition: cannot compare DataRecords using equals()??

Comments 12

  • achan
    Hi,

    anyone has any thought or solution on this issue?

    Thanks,
    al
  • twaller
    Hello Achan,

    I do not understand why you are using the Partition component. If you used ExtSort and Dedup, you would obtain the desired result.

    You should paste the ExtSort and Dedup components into your graph and connect the output port of ExtSort to the input port of Dedup with an edge.

    After that, you only need to define the Sort key in ExtSort (select all fields from the input metadata) and define the same Dedup key in the Dedup component (the same fields in the same order).

    You can propagate metadata through ExtSort.

    Then connect an edge to output port 0 of Dedup and another edge to its output port 1.

    You should propagate metadata through Dedup; the metadata of both of these edges will be the same as the metadata on the Dedup input.

    By default, Dedup sends only one record with the same field values through the output port 0 and sends all the other records with the same field values through the output port 1.

    This is the simplest solution.

    Best regards,

    Tomas Waller
  • achan
    Hi Tomas,

    Thanks for your suggestion, I will give it a try...

    However, that still does not explain why the equal() method for DataRecord does not work... Please verify that...

    Thanks,
    al
  • jurban
    Hi,
    could you please attach the graph, with its input data, metadata etc., so we can test this? Comparing DataRecords should work as you describe.

    Thanks!
    Jaro
  • achan
    Hi Tomas,

    I remember why I use PARTITION rather than SORT and DEDUP... my data can have up to 2000 fields, so it's not practical to set the sortKey and dedupKey to 2000 fields, right? Also, can Clover handle such a large number of key fields for SORT and DEDUP? That is why I resorted to using PARTITION: I don't have to specify a key and can do the comparison using the equal() method of the DataRecord object in the partitionClass... Then I found out the equal() method did not work as I expected...


    Hi Jaro,

    The stripped-down version of my graph is INPUT -> REFORMAT -> SORT -> PARTITION -> OUTPUT and looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <Graph name="GRAPH_1" description="">

    <Global>
    <Metadata id="INPUT_METADATA_0" fileURL="achan_testfile_1_csv_pre_INPUT.fmt"/>
    <Metadata id="INPUT_PARSER_METADATA_0" fileURL="achan_testfile_1_csv_INPUT.fmt"/>
    </Global>

    <Phase number="0">
    <Node id="INPUT_0" type="DATA_READER" fileURL="achan_testfile_1.csv" dataPolicy="Controlled" skipLeadingBlanks="false" trim="false" skipFirstLine="true" />
    <Node id="INPUT_PARSER_0" type="REFORMAT" transformClass="com.facorelogic.core.etl.transform.ParseInputData" variableLengthRows="true" quoteChar="&amp;quot;" />
    <Node id="SORT_0" type="EXT_SORT" sortKey="REFERENCE;" sortOrder="A;" />
    <Node id="PARTITION_0" type="PARTITION" partitionClass="com.facorelogic.core.etl.transform.DuplicateRowPartitioner">
    </Node>
    <Node id="LOG_EXACT_DUP_0" type="DELIMITED_DATA_WRITER" fileURL="achan_testfile_1_exact_dup_log.csv" />
    <Node id="OUTPUT_0" type="DELIMITED_DATA_WRITER" fileURL="achan_output_1.txt" />

    <Edge id="INPUT_PARSER_EDGE_0" fromNode="INPUT_0:0" toNode="INPUT_PARSER_0:0" metadata="INPUT_METADATA_0"/>
    <Edge id="IN_SORT_EDGE_0" fromNode="INPUT_PARSER_0:0" toNode="SORT_0:0" metadata="INPUT_PARSER_METADATA_0"/>
    <Edge id="SORT_REMOVE_EXACT_DUP_EDGE_0" fromNode="SORT_0:0" toNode="PARTITION_0:0" metadata="INPUT_PARSER_METADATA_0"/>
    <Edge id="LOG_EXACT_DUP_EDGE_0" fromNode="PARTITION_0:1" toNode="LOG_EXACT_DUP_0:0" metadata="INPUT_PARSER_METADATA_0"/>
    <Edge id="OUTPUT_EDGE_0" fromNode="PARTITION_0:0" toNode="OUTPUT_0:0" metadata="INPUT_PARSER_METADATA_0"/>
    </Phase>

    </Graph>



    INPUT_METADATA_0 looks like this :

    <?xml version="1.0" encoding="UTF-8"?>
    <Record name="RECORD_achan_testfile_2_csv_" type="delimited" recordDelimiter="\n">
    <Field name="ONE_RECORD" type="string" nullable="false" />
    </Record>



    INPUT_PARSER_METADATA_0 looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <Record name="RECORD_achan_testfile_1_csv_" type="delimited" fieldDelimiter="," recordDelimiter="\n">
    <Field name="REFERENCE" type="string" nullable="false" />
    <Field name="POSITION" type="string" nullable="true" />
    <Field name="AMOUNT_1" type="numeric" nullable="true" />
    </Record>



    The ParseInputData class basically takes the long single-string input and parses it, according to the "quoteChar" in the INPUT_PARSER_0 node in my graph and the fieldDelimiter in the INPUT_PARSER_METADATA_0 file, into the various fields (it works like a StringTokenizer)...
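
    In essence it does something like this (just a simplified sketch to show the idea, not the actual class):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified illustration only: split one input line into fields,
    // honouring the quote character (the real ParseInputData also fills
    // the output record according to the metadata).
    public static List<String> splitLine(String line, char fieldDelimiter, char quoteChar) {
        List<String> fields = new ArrayList<String>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quoteChar) {
                inQuotes = !inQuotes;                 // toggle quoted section, drop the quote itself
            } else if (c == fieldDelimiter && !inQuotes) {
                fields.add(field.toString().trim());  // delimiter outside quotes ends the field
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString().trim());          // last field
        return fields;
    }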


    my INPUT_0 looks like:

    REFERENCE, POSITION, AMOUNT_1
    "10272", "1", "100"
    "10273", "2", "0"
    "10273", "2", "0"
    "10274", "3", "200"



    I expected data rows 2 and 3 with "10273, 2, 0" to be exact duplicates when compared using the equal() method from the DataRecord class (see my DuplicateRowPartitioner in a previous posting), but the result is that they are treated as different records??


    However, if I reduce my graph to INPUT_PARSER -> SORT -> PARTITION -> OUTPUT, then the equal() method in my Partition class works, resulting in this output:

    INFO [PARTITION_0] - Comparing record : '#0|REFERENCE|S->10273
    #1|POSITION|S->5
    #2|AMOUNT_1|N->0.0
    ' to previous record : '#0|REFERENCE|S->10273
    #1|POSITION|S->5
    #2|AMOUNT_1|N->0.0
    '...

    INFO [PARTITION_0] - Found duplicate record...


    I cannot imagine this is an issue with Clover's SORT node, right? I am using the equal() method to compare the previous record and the current record coming into PARTITION_0, so it should not matter whether the input to PARTITION comes from INPUT or from REFORMAT before the SORT, right? I am guessing it's the equal() in the PARTITION that somehow does not compare the previous record and the current record (both have the same metadata and data) correctly... maybe a reference/pointer issue, as I did use record.duplicate() to set the previous record (see my DuplicateRowPartitioner in a previous posting)??


    Thank you to both of you for your time and help,
    albert :-)
  • mzatopek
    I checked out your code again. I still don't understand all of the observations you have reported. Nonetheless, I have a few suggestions. You have to be very careful whenever you use the DataRecord.equal() method, since, according to our internal rules, two empty/null fields are considered different. So two 'same' records with at least one null field are considered different.

    For this kind of record comparison I would recommend the RecordComparator class instead of the simple equal() method.

    The second suggestion is a small performance enhancement. Try to substitute the rather slow DataRecord.duplicate() (a new Java object has to be created) with the simple DataRecord.copyFrom(). Of course, you need to create a temporary record with a single duplicate() call first.
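
    In code, roughly like this (only a sketch; the surrounding partitioner method and the names previousRecord/currentRecord are just illustrative):

    DataRecord previousRecord = null;   // reused for the whole run

    void processRecord(DataRecord currentRecord) {
        if (previousRecord == null) {
            // the single duplicate() call - creates the temporary record once
            previousRecord = currentRecord.duplicate();
            return;
        }
        // ... compare currentRecord with previousRecord here ...
        // then just overwrite the temporary record - no new object is created
        previousRecord.copyFrom(currentRecord);
    }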

    Martin
  • achan
    Hi Martin,

    Thanks for clarifying the usage of DataRecord.equal(). I have data fields that are empty/null (I am only showing a stripped-down data set in my previous postings here), so that explains why records with the same values (and some empty/null fields) are treated as different even though the non-null field values are the same.

    Is there a performance hit if I compare the values in each field (between the current record and the previous record) in my Partition class instead of using RecordComparator?

    Thanks for your second suggestion too!

    albert :-)
  • mzatopek
    I don't think so. I definitely recommend using the prepared RecordComparator. For example, in this way:

    RecordComparator recordComparator = new RecordComparator(new int[] {0, 1, 2}); // specify the key fields on which the comparison will be done
    recordComparator.setEqualNULLs(true); // treat two null fields as equal
    if (recordComparator.compare(previousRecord, currentRecord) == 0) {
        // records are equal
    }


    That should be the least error-prone way.

    Martin
  • twaller
    Hi,

    One more thing from me:
    You can select all fields by pressing Ctrl+A in the Fields pane and, after that, copy them all with a single click on the Right arrow button in the wizard. This way, all 2000 fields will be moved at once to the pane on the right to create the Sort key.
    In the same way, you can copy them to the Dedup key.

    Best regards,

    Tomas Waller
  • achan
    Hi Martin,

    If my data has 2000 fields and 1 key field, do I just instantiate the RecordComparator with the key field, like this:

    RecordComparator recordComparator = new RecordComparator(new int[] {0});

    and the recordComparator.compare(previousRecord, currentRecord) would compare all 2000 fields?


    Hi Tomas,

    I am not using the Clover GUI; I am generating the graph file programmatically, depending on what my user needs, via my web application.


    Thanks to both of you :-)

    al
  • mzatopek
    No, the constructor parameter - the array of integers - specifies the list of compared fields. So if you need to compare two records according to all 2000 fields, you have to prepare the array 'new int[] {0, 1, 2, ..., 1999}' (field indices are 0-based).
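
    Since you are generating things programmatically anyway, you can of course build that array in a loop instead of writing out 2000 literals, for example:

    // key array covering all 2000 fields (field indices are 0-based)
    int[] keyFields = new int[2000];
    for (int i = 0; i < keyFields.length; i++) {
        keyFields[i] = i;
    }
    RecordComparator recordComparator = new RecordComparator(keyFields);
    recordComparator.setEqualNULLs(true);   // so that two null fields compare as equal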

    Martin
  • achan
    Got it. Thanks, Martin!

    al
