Customer Portal

Updating Dimensions

Comments 8

  • oldforum
    You might want to try the new "Data Intersect" component. It can give you the dimension inserts, deletes, and updates in different output streams in a single pass.
  • oldforum
    Think about updating dimensions as a full outer join problem:

    current FULL OUTER JOIN previous (on key) -->

    Port 0 --> Records in current but not in previous -- New
    Port 1 --> Records in both current and previous -- Update
    Port 2 --> Records in previous but not in current -- Delete

    I would slap each of the above streams into a file and bulk load them (really fast) into the database. Assign surrogate keys and the rest from within the database.

    This is one of the fastest ways of updating your dimensions, which I have used in the past. I used Ab Initio, though, not Clover.

    Single inserts/updates/deletes via JDBC are very slow.
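    Outside any particular ETL tool, the three-way split described above can be sketched in a few lines (a Python sketch; the keyed dicts standing in for the current and previous record streams are my own assumption, not Clover's API):

```python
def split_dimension(current, previous):
    """Partition current vs. previous records (dicts keyed on the
    business key) into new / update / delete streams, mirroring the
    three output ports of a full outer join on the key."""
    new = [rec for key, rec in current.items() if key not in previous]       # Port 0
    updates = [rec for key, rec in current.items() if key in previous]      # Port 1
    deletes = [rec for key, rec in previous.items() if key not in current]  # Port 2
    return new, updates, deletes

current = {1: ("A", "v2"), 2: ("B", "v1"), 3: ("C", "v1")}
previous = {1: ("A", "v1"), 2: ("B", "v1"), 4: ("D", "v1")}
new, updates, deletes = split_dimension(current, previous)
```

    Each of the three lists would then be written to its own file for bulk loading, as described above.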
  • oldforum
    Hi Peter!

    In reverse order ;-)

    B) You can have it within one graph. If you need to do it in two steps, use phases. An edge which crosses a phase boundary is internally buffered, so you can keep sending records into it; they are buffered until the next phase starts, at which point the edge starts delivering the buffered records.

    A) Any approach which ensures that you generate unique artificial (surrogate) keys for your dimension is OK. A sequence seems to me less problematic than the hash-table size. For one thing (as you mentioned), when records are removed, the size shrinks, so there is a potential danger of generating duplicate keys.
    If you don't want to access the DB during processing, then store the value in a file which you read during the transformation's init() function. Then, within the reformat function, increment it by one for each new record.
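    A minimal sketch of that file-backed counter (Python; the file name and the record loop are illustrative assumptions, not Clover specifics):

```python
import os

COUNTER_FILE = "next_key.txt"  # hypothetical location for the stored value

def init_counter():
    """Read the stored surrogate-key counter at transformation init()."""
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            return int(f.read().strip())
    return 1  # first run: start counting from 1

def save_counter(value):
    """Persist the next value so the following run knows where to resume."""
    with open(COUNTER_FILE, "w") as f:
        f.write(str(value))

next_key = init_counter()
for record in ["red", "green", "blue"]:  # stand-ins for new dimension records
    surrogate_key = next_key             # assign the key to this record
    next_key += 1                        # one key per new record
save_counter(next_key)
```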

    David.
  • oldforum
    Hi David,

    reversing the order again!

    A) Yes, that makes perfect sense.

    B) Not sure how to do this. The REFORMAT component only seems to write the transformed records to the output record (from target[0]). As a side effect of the transformation I will also produce a list of new records that need to be added to the dimension.

    Ideally, I'd like to be able to grab a reference to another output port for the REFORMAT Node and write the new records to that (so they can be processed as part of the next phase of the graph).

    For the moment, I write them to a file using a Formatter and re-read them again.

    C) One issue I have with using formatters is that there doesn't seem to be a way of escaping data that may contain the delimiter character. I guess I should write an escape/unescape transform before going to/from a delimited stream? Or is there an existing solution?

    Thanks again,
    Peter
  • oldforum
    Hi David,

    Thanks for the prompt reply.

    I guess the problem you are thinking of with using the size is that, if records were removed from the dimension, the size would no longer generate a unique id? (Or is there something else I should be careful about?)

    I'm trying to avoid needing online access to a database (to maintain the sequence), so for the moment, I will check the existing data in the lookup table at startup and find the next highest unused id.
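    That startup scan can be sketched in a couple of lines (Python; the list of existing ids stands in for the keys read from the lookup table):

```python
def next_unused_id(existing_ids):
    """Scan the surrogate keys already present at startup and return
    the next highest unused one; an empty table starts at 1."""
    return max(existing_ids, default=0) + 1

first_free = next_unused_id([10, 3, 7])  # → 11
```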

    I was originally planning to write the new dimension records to a file, but it might be nicer if I could write them to one of the current node edges, so they could stay part of the current transformation graph.

    I don't think this is possible in the REFORMAT component, though. Do you think it's A) a good idea? B) easy to do?

    Thanks,
    Peter
  • oldforum
    Hi Akhilesh,

    I did look at using it, but I guess I would need several passes to

    a) identify the new rows for my dimension
    b) assign a new artificial key to each unique row
    c) re-process the data to get the new artificial keys

    I would be interested in finding out how other people handle the problem.

    The solution I have, essentially a dynamically updated SimpleLookupTable, allows me to update several relatively small, slowly changing dimensions and include the artificial keys in the fact table in a single pass through the data.
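    The single-pass idea can be sketched roughly like this (a Python stand-in; `DynamicLookup` and its fields are my own names, not the actual SimpleLookupTable API):

```python
class DynamicLookup:
    """Tiny stand-in for a dynamically updated lookup table: a miss
    assigns a fresh surrogate key, so fact rows get their keys and the
    new dimension rows are collected in the same pass."""
    def __init__(self, start_key=1):
        self.table = {}
        self.next_key = start_key
        self.new_records = []  # dimension rows to flush at the end

    def key_for(self, natural_key):
        if natural_key not in self.table:
            self.table[natural_key] = self.next_key
            self.new_records.append((self.next_key, natural_key))
            self.next_key += 1
        return self.table[natural_key]

lookup = DynamicLookup()
fact_keys = [lookup.key_for(c) for c in ["red", "green", "red", "blue"]]
```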

    Thanks,
    Peter
  • oldforum
    Hello Peter!

    Your description of the steps required is quite correct. However, I would not suggest using any kind of size number for assigning artificial keys.
    The better approach would be to use an internal counter for generating the keys and to store the final value somewhere for next time (so you know where to start counting).
    The other possibility is to use a database SEQUENCE - you could probably use DBLookupTable not to search for something but just to execute a query which asks the SEQUENCE object for the next value.

    Like "select SEQ.nextval from dual" instead of
    "select mykey from mytable where x=?".

    Unfortunately, the support for sequences in Clover is not yet finished...

    David.
    david.pavlis<at>centrum.cz
  • oldforum
    Hello Peter!

    To solve your B) problem - the "standard" way, given the graph topology you describe, would be to output all records (old & new) to the same port. Just add one field which indicates whether the record is old or new. Then connect a FILTER component after the REFORMAT which separates the old records (discarded - sent to TRASH) from the new ones, which are inserted or further processed.
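    The flag-plus-FILTER topology can be sketched like this (Python stand-in; the record shape and function names are assumptions for illustration):

```python
def tag_records(records, known):
    """REFORMAT stand-in: one output stream, with an extra boolean
    field marking records not yet seen in the dimension."""
    tagged = []
    for rec in records:
        tagged.append({"value": rec, "is_new": rec not in known})
        known.add(rec)  # remember it so duplicates are flagged as old
    return tagged

def filter_new(tagged):
    """Downstream FILTER stand-in: keep new records, discard (TRASH) old."""
    return [t["value"] for t in tagged if t["is_new"]]

known = {"red"}
stream = tag_records(["red", "green", "blue", "green"], known)
new_only = filter_new(stream)
```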

    C) Currently, there is really no standard way of escaping data (strings). You could probably modify one of the current DataFormatters to do it for you. Or, if you need to process the data only with Clover, use some other delimiter. As you may know, Clover can work with delimited data where every field is delimited by a different character (or set of characters - up to 32).
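    An escape/unescape transform of the kind Peter asked about could look roughly like this (a Python sketch; the `|` delimiter and `\` escape character are assumed choices):

```python
def escape_field(value, delim="|", esc="\\"):
    """Escape the escape character first, then the delimiter, so the
    field can safely contain either."""
    return value.replace(esc, esc + esc).replace(delim, esc + delim)

def unescape_field(value, delim="|", esc="\\"):
    """Invert escape_field: any character following the escape
    character is taken literally."""
    out, i = [], 0
    while i < len(value):
        if value[i] == esc and i + 1 < len(value):
            out.append(value[i + 1])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)
```

    Escaping the escape character before the delimiter matters; doing it in the other order would corrupt fields that contain a backslash followed by a delimiter.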

    David.
