
How to Proceed with Large Joins

Comments 9

  • avackova
    Hello Brad,
    ExtHashJoin should only be used for joining data with a small slave set, as it reads all the slave data into memory. For joining large data sets you should rather use ExtMergeJoin. It requires the data to be sorted, but sorting the data with ExtSort or FastSort is still better than fighting an OutOfMemory error. Another solution would be LookupJoin, if you are able to store the slave data in a lookup table: a database lookup for duplicate keys, or a PersistentLookup if the key is unique.
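
    To make the memory trade-off concrete, here is a minimal Java sketch. It is illustrative only, not ExtHashJoin's actual implementation: the point is simply that a hash join keeps the whole slave set in memory while the master side streams through.

    import java.util.*;

    public class HashJoinSketch {
        public static void main(String[] args) {
            // key -> city: the "slave" side, loaded entirely into memory.
            // This is the step that can exhaust the heap for large slaves.
            Map<String, String> slaveByKey = new HashMap<>();
            slaveByKey.put("1", "Toronto");
            slaveByKey.put("2", "Prague");

            // The "master" side streams through one record at a time.
            List<String[]> master = List.of(
                    new String[] {"1", "Brad"},
                    new String[] {"2", "Agata"});
            for (String[] m : master) {
                String city = slaveByKey.get(m[0]);   // O(1) probe per record
                if (city != null) {
                    System.out.println(m[1] + " -> " + city);
                }
            }
        }
    }

    A merge join, by contrast, only needs the current record from each sorted input at any moment, which is why it scales past available memory.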
  • bmazurek
    Hi Agata,

    I ran a set of test data through both the Hash method and the FastSort-to-MergeJoin method. The Hash method significantly outperformed the MergeJoin: 2084 versus 2732 seconds.

    Switching obviously didn't increase performance, and I'm wondering if you could speculate as to why that might be. I've switched to a 64-bit JRE and we have a lot of available RAM on the server (100+ GB). Would that be a sufficient explanation for the difference?

    I tried using FastSort with the InMemoryOnly flag on, but that seemed to take even longer.

    The data we are joining is already in a database, but I have been asked to investigate the performance implications of moving the join into Clover. Do you have a gut reaction to that endeavour?

    Thank you,
    Brad
  • bmazurek
    Any thoughts?

    Based upon what I've seen, I suspect that the DB will win the head-to-head join. Where I suspect Clover will begin to take back ground is on the parallelization front: its ability to perform many joins and transformations in parallel, instead of being constrained to running them serially.

    Any sense of when the advantages of parallelization will overcome a series of single-threaded DB joins?

    Brad
  • julrych
    Hi Brad,

    Database vs. Clover join
    I wouldn't be too surprised if the performance of a join in the database is better than in Clover. The reason is that the data is already stored in the database, so an in-database join can be really efficient, provided that the database indexes are built and up to date. On the other hand, reading the data out of the database to join it outside in Clover may be less efficient. In general it depends on
      What the join selectivity is. Does the join produce significantly less data than what is available in the two data sources?

      What the target system is.

      What the subsequent processing of the joined data looks like, if there is any at all.


    On the other hand, joining data in-database may cause other complications. In such cases, Clover can offer similar or better performance. Below I list just a few typical scenarios where using Clover for external joins is beneficial:
      (Unexpected) database server utilization during in-database joins may be undesirable.

      Database tables may be missing some of the indexes required for efficient in-database joins. If it is not possible to build these indexes (due to server utilization, space, etc.), an in-database join may be very inefficient.

      Joining data from various sources would require loading the data into the database first in order to join it.

      Performing further transformations would require considerable implementation effort, e.g. coding in PL/SQL.

      Pipeline processing of the data throughout the whole transformation. This means that although the join itself may be slower in Clover, the whole transformation with further processing may be faster than an equivalent one implemented in the database in PL/SQL (see the sketch below).
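
    A minimal Java sketch of the pipeline idea, with a BlockingQueue standing in for a Clover edge; the two threads are hypothetical stand-ins for components, not Clover internals. The consumer starts working as soon as the first record arrives, whereas a sort would have to buffer all records before emitting the first one.

    import java.util.concurrent.*;

    public class PipelineSketch {
        public static void main(String[] args) throws Exception {
            BlockingQueue<Integer> pipe = new ArrayBlockingQueue<>(16);

            // Producer: emits records one at a time, like a streaming reader.
            Thread producer = new Thread(() -> {
                for (int i = 0; i < 100; i++) {
                    try { pipe.put(i); } catch (InterruptedException e) { return; }
                }
            });

            // Consumer: overlaps with the producer -- downstream work starts
            // on record 1, so total wall time shrinks.
            Thread consumer = new Thread(() -> {
                for (int i = 0; i < 100; i++) {
                    try {
                        int r = pipe.take();   /* transform r here */
                    } catch (InterruptedException e) { return; }
                }
            });

            producer.start(); consumer.start();
            producer.join(); consumer.join();
            System.out.println("pipeline finished");
        }
    }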


    Heap Space Issues
    After switching to 64-bit Java, you should be able to allocate several gigabytes of memory for it. If you are running a JDK, I would also recommend using the -server option, which enables better optimizations and offers higher performance.
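
    For reference, a quick way to verify that the 64-bit JVM really received the larger heap; the -Xmx value in the comment is only an example, not a recommendation.

    // Launch with, for example:  java -server -Xmx8g HeapCheck
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap available to this JVM: %.1f GB%n",
                    maxBytes / (1024.0 * 1024.0 * 1024.0));
        }
    }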

    Best Regards,
    Jan
  • bmazurek
    Hi Jan,

    Thanks for the response.

    The joins are almost always left joins (sometimes up to 190 of them) and therefore will produce a lot more data. There is filtering done, and it seems like there will definitely be benefits to playing with component ordering to optimize the impact of that filtering.

    The database is definitely up to date, and it is only being used as an ETL alternative. All indices identified by query optimizers and execution paths have been built. As the data comes out of the DB there is a small amount of processing to perform, but nothing substantial. The target system is a server process, and the data is communicated from CloverETL to that server process via Web Services.

    Thoughts?
    Brad
  • julrych
    Hi Brad,

    join optimization
    You are absolutely correct: if you can, filter out data before actually performing any sorts, joins, or other processing. Reducing the data volume improves performance significantly (depending on the selectivity of the filtering condition). Similarly, if there is some work to be done on the not-joined data, it might be worth doing it before the joins, as joins are likely to multiply the number of records on the output.
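
    A minimal sketch with made-up numbers, just to make the selectivity argument concrete: a 1%-selective filter placed before the join shrinks the work the join (and any sort feeding it) must do by two orders of magnitude.

    import java.util.stream.LongStream;

    public class FilterFirst {
        public static void main(String[] args) {
            long total = 1_000_000;
            // Apply the (here 1%-selective) filter before the join:
            long reachingJoin = LongStream.range(0, total)
                    .filter(k -> k % 100 == 0)
                    .count();
            System.out.println(reachingJoin + " of " + total
                    + " records ever reach the join");   // 10000 of 1000000
        }
    }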

    outbound processing
    It looks rather strange to me that you are going to push such a huge volume of data through a web service interface. If a batch interface is available, I would expect it to be more efficient.

    Anyway, do I understand correctly that you would like to move all your processing from the database into an ETL tool? If so, will the source still be a database? Is there anything I can help with?

    Regards,
    Jan
  • bmazurek
    Hi Jan,

    Thanks for the response.

    In the "reducing data" vein, is there any benefit to making the records as small as possible during most of the processing (say only the critical join key information), then perform the joins we need to perform (~190 of them) and then fatten the records up at the end with a final series of joins?

    All data sent to the platform we are working with goes over web services (I believe, although that is abstracted away from us). It does batch incoming records, and that batch size is a configurable parameter on the CloverETL component that has been provided.

    Yes, we are evaluating how much, and precisely which processing it makes sense to move into CloverETL. At this point, I think we're going into the whole thing with an open mind either way.

    The source is originally flat files, but we currently load them into a database, where they live. The one benefit of that is that we build indices once rather than every time a set of data passes through for processing (at the expense of additional space, of course).

    Cheers,
    Brad
  • rmorrison100
    Thank you for the answers.
  • julrych
    Hi Brad,

    You are correct about reducing the number and size of records during processing. The general recommendations for efficient transformation development are:

    • Do filtering as soon as possible. Typical scenarios include filtering before sorting, and filtering before complex reformats and joins.

    • Drop unused fields. If you won't need some of the fields coming from the source, get rid of them as soon as possible (see the sketch after this list). Reducing record size optimizes memory and disk usage while the transformation runs, which allows you to run the transformation faster and/or run more transformations in parallel on the same HW resources.

    • Avoid breaking the pipeline processing. Some components break pipeline processing - typically sorts and unsorted aggregations. Use such components only when necessary.

    • Avoid unnecessary sorting. Many components preserve the order of incoming records; therefore, if the input is sorted by the first field, the output of the component is still sorted by the first field. Some components break this rule - see the Component Reference section in our CloverETL documentation for details.
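
    A minimal Java sketch of the "drop unused fields" advice; the field names are made up for the example. Projecting a wide record down to just the fields the rest of the graph needs means every buffer, sort run, and network hop carries less data.

    import java.util.*;

    public class ProjectFields {
        record Wide(String id, String name, String address, String phone, String notes) {}
        record Narrow(String id, String name) {}

        public static void main(String[] args) {
            List<Wide> source = List.of(
                    new Wide("1", "Brad", "123 Main St", "555-0100", "free-form notes"));

            // Keep only the two fields downstream components actually use.
            List<Narrow> slim = source.stream()
                    .map(w -> new Narrow(w.id(), w.name()))
                    .toList();

            System.out.println(slim);
        }
    }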


    If I understand your last post correctly, you suggest first dropping all the fields except the join key, doing the joins, and then joining the required fields back at the end. I don't think that would be faster than doing the joins with all the required fields directly.

    Hope this helps.

    Regards,
    Jan
