Hi,
How do I remove duplicate records without specifying all the fields? My record metadata has 2000 fields...
here is a subset of my input data, sorted by REFERENCE (primary key):
"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1"
"000000010271 ","WFB ","1"
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"
I want output like this:
"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1" (removed the duplicate)
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"
I know I can use DEDUP and set dedupKey="REFERENCE;NAME;NO" to get this output, but if my input data has 2000 fields, I do not want to list all 2000 of them in dedupKey. Besides, can dedupKey even be set to such a long string? So, is there a way to tell CloverETL to remove duplicate records when there are 2000 fields to match?
I would think DEDUP just needs a flag, say remove_only_if_all_fields_matches, that is set to true and makes it use the FMT for the list of fields. If the values of all corresponding fields match, the record is a duplicate and gets removed. That way DEDUP would not need dedupKey to be set to a large number of field names, right?
Just to make sure: DEDUP does not sort the records, right?
thanks,
al
-
Does anyone have an idea for a better solution than putting all 2000 fields in the "key"?
This is urgent for me, so any help or suggestion would be greatly appreciated :-)
al -
Hello,
The only idea I have is to use the Partition component instead of Dedup: in the partition function you can compare the current record with the previous one:
int getOutputPort(DataRecord record){
    if (record.compareTo(previous) != 0) {
        previous = record;
        return 0;
    } else {
        return 1;
    }
}
Then on port 0 you will have only the distinct records. -
Thanks for the suggestion :-)
I had to fix one thing: change
"previous = record;" to "previous = record.duplicate();".
Otherwise previous always equals the current record, since both refer to the same object (the same "pointer" or "address").
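
For anyone who wants to try the logic outside a graph, here is a minimal plain-Java sketch of the same idea. It is not CloverETL code: String[] stands in for DataRecord, Arrays.equals for compareTo, and clone() for duplicate(), and the class name AdjacentDedup is made up for illustration. The caller deliberately reuses one buffer for every row, which is an assumption meant to mimic the "same pointer" situation described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AdjacentDedup {
    // Copy of the last row that was sent to port 0 (hypothetical stand-in for a DataRecord).
    private String[] previous = null;

    // Returns 0 for a row that differs from the previous kept row, 1 for an adjacent repeat.
    int getOutputPort(String[] record) {
        if (previous == null || !Arrays.equals(record, previous)) {
            // Store a copy: the caller reuses the same buffer for every row,
            // so keeping the reference would make every later row look like a repeat.
            previous = record.clone();
            return 0;
        }
        return 1;
    }

    public static void main(String[] args) {
        String[][] input = {
            {"000000010271 ", "WFB ", "1"},
            {"000000010271 ", "WFB ", "1"},
            {"000000010272 ", "ABC ", "1"},
            {"000000010272 ", "ABC ", "2"},
        };
        AdjacentDedup dedup = new AdjacentDedup();
        String[] buffer = new String[3]; // one buffer reused for all rows
        List<String> kept = new ArrayList<>();
        for (String[] row : input) {
            System.arraycopy(row, 0, buffer, 0, row.length);
            if (dedup.getOutputPort(buffer) == 0) {
                kept.add(String.join(",", buffer));
            }
        }
        // Three of the four sample rows are kept; the second row is dropped as a repeat.
        System.out.println(kept);
    }
}
```

Note that this only catches duplicates that sit next to each other. With input sorted by REFERENCE alone, two fully identical rows could still be separated by a row that shares the REFERENCE but differs in another field, and in that case both copies would pass through.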