Customer Portal

Quesion about TableLookup

Comments 5

  • Avatar
    avackova
    0
    Comment actions Permalink
    Hello Anna,
    In 2.8 version Lookups work as you describe (I've reported it to our bug tracking system: http://bug.cloveretl.com/view.php?id=3474) and you certainly shouldn't use the same Lookup object in more nodes. Maybe the best way is to create a new LookupTable in each Reformat and free it in finished method.
    Since 2.9 version some steps are moved to preExecute() and postExecute() method.
  • Avatar
    anweston
    0
    Comment actions Permalink
    Heya Agata,

    Thank you for the reply to my post!

    I did some further research and it looks like the best solution may be to use SimpleLookup. I read the lookup table's information into a LookupTable (simpleLookup type) in Phase 0 using a DB_INPUT_TABLE/LOOKUP_TABLE_READER_WRITER combo. The 4 nodes in Phase 1 all create their own Lookup/DataRecord objects in init() and the last node calls LookupTable.free() in its finished() method.

    I used this example as a template for reading in the lookup information: http://wiki.clovergui.net/files/example ... Reader.grf.

    Although this method didn't improve a smaller dataset all that much:

    -20,861 lookup records, 1 node using them.
    -69,841 records in output file.
    -35min [using DBLookup w/no cache] to 13min [using dbLookup w/cache] to 12min [using in-memory].

    It really made a difference in a larger dataset:

    -42,369 lookup records, 2 nodes using them.
    -982,385 records in output file.
    -92+hrs [using DBLookup w/no cache] to 27hr 20min [using dbLookup w/cache] to 5hr 33min [using SimpleLookup in-memory]).

    (NOTE: I'm providing these numbers as an example. The run time between different datasets (small vs large) is really unimportant to us - we are interested in the time difference within a dataset when we switched the lookup implementation in its graph. I know I need to do repetitive tests for more accurate performance analysis.) :-)

    Since each SimpleLookup has a pointer to the same data Map, it seems to resolve my issue with multiple cache, too. Each node still needs to have its own Lookup/DataRecord objects (to avoid multithreading issues), but its lookup is done against the same data Map and SimpleLookup does not create a "cache." At least that's what I thinks it's doing...

    This solution trades off higher memory usage for improved performance, so I will probably have to see how it does when there are multiple lookup tables or really large lookup datasets. I could see where the Phase 0 could take excessive time if there are a lot of lookup datasets to read into memory and it could also cause my graph to run out of memory if the amount of data is huge.

    Any thoughts? Does this seem like a valid approach?

    Thanks,
    Anna

    P.S. For DBLookup, your bug item says that "Freeing cache of DBLookup should be done after finishing execute method, not in free only" but in my specific scenario it seems like I would take a performance hit? For example, if node 1 caches 100 records and frees them after the node finishes, node 2 would have a re-retrieve and cache those 100 records again. That might be better than having to deal with synchronization issues on a global cache (probably haved to be stored in DbLookupTable instead of DBLookup), but is there a best-of-both-worlds solution? Just putting the thought out there...
  • Avatar
    avackova
    0
    Comment actions Permalink
    Hi Anna,
    Thank you for the deep analysis. You can also try to use PersistentLookup instead of SimpleLookup. This would help you to avoid memory exhausted problem.
    I've talked with our developers about the DBLookup cache behavior and they have promised me to look out for the solution in the 2.8.2 version, which is to be released in few days.
  • Avatar
    mzatopek
    0
    Comment actions Permalink
    Hello Anna,

    I think your consideration about our lookup tables is pretty good and your elaboration is really exhaustive.

    Currently we are not considering to make Lookup objects synchronized due performance issue of this solution. If you have four reformats with similar functionality - 'lookuping' same date, can you consolidate all these lookups at one component? This would be probably the fastest solution. Your latest approach with the simple lookup table is also very good and straightforward. If most of database data will be used by the lookup, what is probably your case, it is definitely fastest solution to unload data from database at once and then use it for simple lookup building (hash join component is also possible to use, for big lookup datasets you can sort the data first and use the merge join component instead). To avoid out-of-memory problem with this solution you can experiment with the persistence lookup table as was already mentioned.

    Theoretically it is possible to share same instance of lookup object through all reformats via our Dictionary, what is something like key-value shared memory for inter-component communication. If you want to know more about this option check out our documentation at http://www.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/dictionary.html or let me know for futher information. You have to also take care about synchronization of multi-thread working with this shared instance.

    If I miss something let me know I will complete my answer. Martin
  • Avatar
    anweston
    0
    Comment actions Permalink
    Heya,

    Hopefully these posts can help other people if they find themsleves with a similar situation. :-)

    I implemented the SimpleLookup solution more-or-less as I outlined about. The only change I made was to call TableLookup.free() in a special "last node" in our graph that does some cleanup/statistics gathering after all the output data has passed through. I confirmed with my business folks that our lookup data is small enough that we shouldn't have to worry about memory issues.

    I will keep in mind the PersistentLookup if I do run into any issues. We cannot combine the lookups into a single component or use JOINs because we use lookups inside a "rule engine," not directly in the grpah. I really appreciate your suggestions, though!

    Thank you both for your replies!

    Anna

Please sign in to leave a comment.