remove duplicate records and sorting in DEDUP??

Hi,

How do I remove duplicate records without specifying all the fields? My record metadata has 2000 fields...
Here is a subset of my input data, sorted by REFERENCE (primary key):

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1"
"000000010271 ","WFB ","1"
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

I want the output to look like this:

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1" (removed the duplicate)
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

I know I can use DEDUP with dedupKey="REFERENCE;NAME;NO" to get this output, but if my input data has 2000 fields, I do not want to list all 2000 of them in dedupKey. Besides, can dedupKey even be set to such a long string? So, is there a way to tell CloverETL to remove duplicate records when there are 2000 fields to match?

I would think DEDUP just needs a flag, say remove_only_if_all_fields_matches, which when set to true makes it take the list of fields from the FMT. If the values of all the corresponding fields match, the record is a duplicate and gets removed. That way dedupKey would not have to hold a large number of field names, right?

Just to make sure: DEDUP does not sort the records, right?

thanks,
al

achan

August 28, 2008 23:33

Does anyone have an idea for a better solution than putting all 2000 fields in the "key"?

This is urgent for me, so any help or suggestion would be greatly appreciated :-)

al

avackova

September 04, 2008 07:14

Hello,
the only idea I have is to use the Partition component instead of Dedup: in the partition function you can compare the current record with the previous one:


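// previous holds the last record that was sent to port 0
int getOutputPort(DataRecord record) {
    if (record.compareTo(previous) != 0) {
        previous = record;
        return 0; // differs from the previous record: keep it
    } else {
        return 1; // same as the previous record: duplicate
    }
}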

Then port 0 will carry only the distinct records and port 1 the duplicates.

September 06, 2008 07:19

Thanks for the suggestion :-)

I had to fix one thing: change
"previous = record;" to "previous = record.duplicate();"...

If not, previous will always be the current record, since both variables refer to the same object (basically the same "pointer" or "address"), and the comparison never sees a difference.
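
For anyone who wants to see the aliasing problem outside of CloverETL, here is a minimal sketch in plain Java. It is only an illustration under assumptions: the class AdjacentDedup, the method dedup and the reused buffer array are made-up names that stand in for a reader recycling one record object, and none of it is CloverETL API. It also assumes duplicates sit next to each other, as in the sorted sample above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Partition approach; not CloverETL API.
final class AdjacentDedup {

    // Keeps the first record of every run of identical adjacent records.
    // Assumes duplicates are adjacent, e.g. because the input is sorted.
    static List<String[]> dedup(List<String[]> rows) {
        List<String[]> distinct = new ArrayList<>();
        String[] buffer = null;   // reused for every incoming record, like a shared record object
        String[] previous = null; // private copy of the last record that was kept
        for (String[] row : rows) {
            if (buffer == null || buffer.length != row.length) {
                buffer = new String[row.length];
            }
            // Simulate the reader overwriting the same buffer with the next record.
            System.arraycopy(row, 0, buffer, 0, row.length);
            if (previous == null || !Arrays.equals(buffer, previous)) {
                // Copy the values. With "previous = buffer" both names would point
                // at the same array and the comparison above would always match.
                previous = buffer.clone();
                distinct.add(previous);
            }
        }
        return distinct;
    }
}
```

Calling AdjacentDedup.dedup on the four sample rows keeps three of them. The clone() call plays the same role as record.duplicate() in the fix above: all fields are compared without ever listing them in a key.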
