Comments (8)

  • imriskal
    Hi Anna,

    I am not sure about one thing: how do you know that the cat is joined with 555-555-5555 and the dog with 777-777-7777? All the IDs are 1, so why is the following output wrong?

    1,Weston,555-555-5555,dog
    1,Weston,777-777-7777,cat
    1,Weston,888-888-8888,

    I assume you just want to join the records based on order. If so, you can use the Combine component with the "Allow incomplete tuples" property set to true. ExtMergeJoin indeed has the same functionality as an SQL join; that is intentional.

    http://doc.cloveretl.com/documentation/ ... mbine.html

    Does this solution meet your requirement?
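    For illustration, here is a plain-Java sketch (hypothetical code, not the actual Combine component) of the positional pairing that "Allow incomplete tuples" produces, with the shorter input padded by empty fields:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class CombineSketch {
        // Pair elements by position; when one list runs out, pad with ""
        // (the "incomplete tuples" case).
        static List<String> combine(List<String> phones, List<String> pets) {
            List<String> out = new ArrayList<>();
            int n = Math.max(phones.size(), pets.size());
            for (int i = 0; i < n; i++) {
                String phone = i < phones.size() ? phones.get(i) : "";
                String pet = i < pets.size() ? pets.get(i) : "";
                out.add(phone + "," + pet);
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> rows = combine(
                    List.of("555-555-5555", "777-777-7777", "888-888-8888"),
                    List.of("dog", "cat"));
            rows.forEach(System.out::println);
            // 555-555-5555,dog
            // 777-777-7777,cat
            // 888-888-8888,
        }
    }
    ```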
  • anweston
    Heya,

    The second output is not wrong if the slaves came in in that order. This IS a join: a Person has one-to-many Phone Numbers AND a Person has one-to-many Pets; the pets and phone numbers are NOT related to each other, hence the cartesian-like product from this join. I need a single output file with Person-Phone Number-Pets, but the multiples caused by combining phone numbers and pets in the same output are undesirable, because in my real-world data this could cause large output files with essentially repeated data. I had all the IDs as 1 because I was trying to provide a simple example; sorry if I made it confusing!

    I looked at the Combine node and will play with it, but I'm not sure it will do what I need because there is no mechanism for joining the data on a key. I suspect what I need is something similar to ExtMergeJoin, except that it does not "rewind" the slaves to get all the "duplicate" combinations. I need the node to know there is a master key, iterate over the slaves to create a record, "pop" those slave records off, and iterate over the slaves again to create the next record, until an iteration shows there are no slaves with a key matching the master; it should then move on to the next master. I have a Java transform class, but the underlying parent MergeJoin class doesn't seem to be visible to it, so I can't check in that class whether I have seen a slave's data before and "skip" sending that row forward as a "duplicate."
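    A rough plain-Java sketch of that consuming behaviour (hypothetical names, not a CloverETL component), assuming each slave group for the current master key is already available as an iterator:

    ```java
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class ZipJoinSketch {
        // For one master record, walk all slave iterators in lock-step,
        // consuming ("popping") one record from each per output row
        // instead of rewinding to build every combination.
        static void flushMaster(String master, List<Iterator<String>> slaves,
                                List<String> out) {
            boolean stillHasData = true;
            while (stillHasData) {
                stillHasData = false;
                StringBuilder row = new StringBuilder(master);
                for (Iterator<String> slave : slaves) {
                    if (slave.hasNext()) {
                        row.append(',').append(slave.next());
                        stillHasData = true;
                    } else {
                        row.append(',');   // exhausted slave: emit empty field
                    }
                }
                if (stillHasData) {
                    out.add(row.toString());
                }
            }
        }

        public static void main(String[] args) {
            List<String> out = new ArrayList<>();
            flushMaster("1,Weston",
                    List.of(List.of("555-555-5555", "777-777-7777", "888-888-8888").iterator(),
                            List.of("dog", "cat").iterator()),
                    out);
            out.forEach(System.out::println);
            // 1,Weston,555-555-5555,dog
            // 1,Weston,777-777-7777,cat
            // 1,Weston,888-888-8888,
        }
    }
    ```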

    Thanks,
    Anna
  • admin
    Hi Anna,

    In that case, maybe you can consider http://doc.cloveretl.com/documentation/ ... ields.html, available in CloverETL 3.3+.

    So you would store your data as a single record using lists:
    1,Weston,[555-555-5555, 777-777-7777, 888-888-8888],[dog, cat]


    I believe this stores your data much better than 3 records duplicating the id and name.

    You can prepare such a record, for example, by following these steps:
    * do the join as Lubos advised (so you get 6 records)
    * using the Denormalizer component, feed these 6 records into a single record containing lists

    There are handy functions in CTL2 http://doc.cloveretl.com/documentation/ ... -ctl2.html
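    As a plain-Java illustration of the denormalizing step (not CTL2 code; the names are mine), collapsing the 6 joined rows for one key into a single record whose phone and pet fields are lists:

    ```java
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class DenormalizeSketch {
        // Collapse joined rows {phone, pet} for one key into a single record
        // with list fields; insertion-ordered sets drop the duplicates that
        // the cartesian-like join created.
        static String denormalize(String idName, String[][] joinedRows) {
            Set<String> phones = new LinkedHashSet<>();
            Set<String> pets = new LinkedHashSet<>();
            for (String[] row : joinedRows) {
                phones.add(row[0]);
                if (!row[1].isEmpty()) {
                    pets.add(row[1]);
                }
            }
            return idName + "," + phones + "," + pets;
        }

        public static void main(String[] args) {
            String record = denormalize("1,Weston", new String[][] {
                    {"555-555-5555", "dog"}, {"555-555-5555", "cat"},
                    {"777-777-7777", "dog"}, {"777-777-7777", "cat"},
                    {"888-888-8888", "dog"}, {"888-888-8888", "cat"},
            });
            System.out.println(record);
            // 1,Weston,[555-555-5555, 777-777-7777, 888-888-8888],[dog, cat]
        }
    }
    ```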

    I hope this helps.
  • anweston
    Heya,

    Unfortunately, I am required to produce the output I described because it is the format for a process further down the line. Well, with a few exceptions; the data actually looks like:

    1,Weston,555-555-5555,cat
    ,,777-777-7777,dog
    ,,888-888-8888,

    but in a fixed-width format. I could accomplish this by only passing the master record when I process the first set of slaves.
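    That "emit the master fields only on the first row of a group" idea can be sketched like this (plain Java, hypothetical names):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class BlankMasterSketch {
        // Emit the master fields (id, name) on the first row of a group only;
        // later rows of the same group get empty master fields.
        static List<String> formatGroup(String id, String name, String[][] slaveRows) {
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < slaveRows.length; i++) {
                String idField = (i == 0) ? id : "";
                String nameField = (i == 0) ? name : "";
                rows.add(idField + "," + nameField + ","
                        + slaveRows[i][0] + "," + slaveRows[i][1]);
            }
            return rows;
        }

        public static void main(String[] args) {
            formatGroup("1", "Weston", new String[][] {
                    {"555-555-5555", "cat"},
                    {"777-777-7777", "dog"},
                    {"888-888-8888", ""},
            }).forEach(System.out::println);
            // 1,Weston,555-555-5555,cat
            // ,,777-777-7777,dog
            // ,,888-888-8888,
        }
    }
    ```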

    I will try and figure out a custom solution to my problem since it doesn't look like you have a way to produce it. I thought about trying to create a component that extends off of MergeJoin.java, but it looks like the method I'd need to override (flushMin()) is private.

    BTW, we are using Clover 3.4.0

    Thanks,
    Anna
  • novakj
    Hi Anna,

    I prepared an example project for you with your desired functionality. The solution is based on Jaroslav's post (posted Thu Jan 23, 2014 9:12 am).

    In this example, the data from the files are joined by the ExtMergeJoin component, denormalized by the Denormalizer component (so all information about one person is in one record), and then normalized by the Normalizer component into your desired format.

    If you have any questions about the transformations or any other details, please do not hesitate to let us know.
  • anweston
    Heya,

    I will look to see if I can adapt this solution; I really do appreciate your assistance! I understand that what I am asking for may be unusual enough that it may not be covered in an easy manner.

    What I'm concerned about is the scalability of what you are suggesting. As noted below, my example is a simplistic representation of the data, and starting with a merge that creates all the duplicate combinations could produce a huge number of records just to squeeze back down into the set I need. Also, getting everything into one record like that could produce a very wide record if "pets" and "phones" have lots of attributes, and in my experience wide records can cause a performance hit.

    For example, we have property data where the main property details are the parent, a property has owners, a property has transactions, a property has features, a property has valuations, etc. All of this data needs to get into a file in the format I described. So if a single property has had 10 owners, 100 transactions, 5 features, and 150 valuations that is 750,000(?) records to create in a merge join to get down to 150(?) records - and that's just ONE property (we even have some examples where a property is owned by a native tribe, which has over 2,000 listed owners!).
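    The arithmetic above checks out: a cartesian-style join emits one row per combination of the group sizes, while a zip-style output only needs as many rows as the largest group. A quick sketch:

    ```java
    public class JoinSizeSketch {
        public static void main(String[] args) {
            int owners = 10, transactions = 100, features = 5, valuations = 150;
            // A cartesian-style join emits one row per combination...
            long joined = (long) owners * transactions * features * valuations;
            // ...while the desired output needs only max(group sizes) rows.
            int needed = Math.max(Math.max(owners, transactions),
                                  Math.max(features, valuations));
            System.out.println(joined + " joined rows vs " + needed + " needed");
            // 750000 joined rows vs 150 needed
        }
    }
    ```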

    Maybe I can try to create "key fields" for "phones" and "pets," use those, and then somehow merge all the other attributes in. I will play with it, but I am concerned about scalability :-)

    Thanks,
    Anna
  • anweston
    Heya,


    Since I was concerned about scalability, I decided to test out my theory about MergeJoin.java. I made a custom component that was the MergeJoin.java code with a modified flushMin():



    private boolean flushMin() throws IOException, InterruptedException,
            TransformException {

        // Load the first record from every port in this key group
        // (the master on port 0 and each matching slave).
        for (int i = 0; i < inputCnt; i++) {
            inRecords[i] = minIndicator[i] ? reader[i].next()
                    : NullRecord.NULL_RECORD;
        }

        while (true) {
            outRecords[0].reset();
            int transformResult = -1;

            try {
                transformResult = transformation.transform(inRecords, outRecords);
            } catch (Exception exception) {
                transformResult = transformation.transformOnError(exception,
                        inRecords, outRecords);
            }

            if (transformResult >= 0) {
                outPort.writeRecord(outRecords[0]);
            } else {
                handleException(transformation, transformResult);
            }

            // Advance each slave (ports 1..n) to its next record instead of
            // rewinding; the master record on port 0 is kept as-is.
            boolean stillHasData = false;
            for (int i = 1; i < inputCnt; i++) {
                inRecords[i] = minIndicator[i] ? reader[i].next()
                        : NullRecord.NULL_RECORD;
                if (inRecords[i] == null) {
                    inRecords[i] = NullRecord.NULL_RECORD;
                } else if (minIndicator[i]) {
                    stillHasData = true;
                }
            }

            // Stop once every slave in this key group is exhausted.
            if (!stillHasData) {
                return true;
            }
        }
    }



    So far my tests show that this produces the output I require without creating a bunch of extra rows and then having to denormalize and normalize. Although I'm not fond of a custom solution, what I'm trying to do is odd enough that it may not be useful to anyone else. I will run some more tests, but it looks like this is the solution to my problem.

    Thanks,
    Anna
  • novakj
    Great work, thank you very much for sharing your code.