
Performance issue when processing large CSV files with joins


  • Vladimir Barton
    Hi Eric,
    there is no easy way to force the FlatFileReader component to split records into batches and process them one batch at a time. However, having looked at your graph, I can suggest a few tips that might help you resolve the memory overflow caused by the ExtHashJoin component.

    • As you rightly pointed out, the ExtHashJoin component waits until all records from the slave port have arrived before it starts joining. Generally, we recommend ExtHashJoin only when the number of slave-port records to be joined is relatively small. That is not the case here, since the master and slave ports carry the same number of records, so I would recommend using the ExtMergeJoin component instead. Unlike ExtHashJoin, it does no caching, so processing can be significantly faster. You can apply the same Master/Slave key definition and mapping to the ExtMergeJoin component as you did to the ExtHashJoin component. However, because ExtMergeJoin expects sorted input, you would need to place a new ExtSort component before the SimpleCopy component in your graph. The sort key should match the master key defined in the ExtMergeJoin component.

    • Judging from the snippet of your graph, I presume that the order of your records does not change anywhere between the FlatFileReader and the ExtHashJoin component. If so, there may be an even simpler way to reduce memory consumption. Try replacing the ExtHashJoin component with a new Combine component and apply the same mapping as you did for the ExtHashJoin component. The Combine component does not cache records either, and it does not need to join by keys, so performance should be fairly good.
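
    To illustrate the difference behind the first tip, here is a minimal conceptual sketch in Python. It is not CloverETL code: the function names hash_join and merge_join, the record layout, and the field name "id" are all assumptions made for illustration. It only shows why a hash-style join has to buffer the whole slave input while a merge-style join over sorted inputs holds just one key group at a time.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]  # hypothetical record shape: field name -> value


def hash_join(master: Iterable[Record], slave: Iterable[Record],
              key: str) -> Iterator[Tuple[Record, Record]]:
    """Inner join that buffers ALL slave records before emitting anything."""
    table: Dict[str, List[Record]] = {}
    for rec in slave:  # the entire slave input ends up in memory
        table.setdefault(rec[key], []).append(rec)
    for m in master:
        for s in table.get(m[key], []):
            yield m, s


def merge_join(master: Iterable[Record], slave: Iterable[Record],
               key: str) -> Iterator[Tuple[Record, Record]]:
    """Inner join of two inputs that are both sorted ascending by `key`.

    Only the slave records of the current key are held in memory.
    """
    def get(rec: Record) -> str:
        return rec[key]

    m_groups = groupby(master, key=get)
    s_groups = groupby(slave, key=get)
    m_item = next(m_groups, None)
    s_item = next(s_groups, None)
    while m_item is not None and s_item is not None:
        m_key, m_recs = m_item
        s_key, s_recs = s_item
        if m_key < s_key:
            m_item = next(m_groups, None)   # master key has no slave match
        elif m_key > s_key:
            s_item = next(s_groups, None)   # slave key has no master match
        else:
            s_buf = list(s_recs)            # buffer one key group only
            for m in m_recs:
                for s in s_buf:
                    yield m, s
            m_item = next(m_groups, None)
            s_item = next(s_groups, None)


# Hypothetical sample data, already sorted by "id".
master_rows = [{"id": "1", "name": "a"}, {"id": "2", "name": "b"},
               {"id": "3", "name": "c"}]
slave_rows = [{"id": "2", "amount": "10"}, {"id": "3", "amount": "20"},
              {"id": "3", "amount": "30"}, {"id": "4", "amount": "40"}]

merged = list(merge_join(master_rows, slave_rows, "id"))
hashed = list(hash_join(master_rows, slave_rows, "id"))
```

    Both functions produce the same pairs on this sample, but merge_join only gives correct results when both inputs are sorted by the join key, which is the role ExtSort would play in the graph.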

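    The second tip can be pictured the same way. The sketch below is again plain Python rather than CloverETL code, and the names combine and merge_fields are hypothetical. It assumes that both inputs arrive in the same order, so records can be paired purely by position, with no keys and no buffering.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]  # hypothetical record shape: field name -> value


def combine(first: Iterable[Record], second: Iterable[Record],
            transform: Callable[[Record, Record], Record]) -> Iterator[Record]:
    """Pair the n-th record of each input and map the pair to one output.

    Nothing is cached: each pair is emitted as soon as both records are read.
    Note that zip() stops at the shorter input.
    """
    for a, b in zip(first, second):
        yield transform(a, b)


def merge_fields(a: Record, b: Record) -> Record:
    """Illustrative mapping: copy all fields of both records into one."""
    return {**a, **b}


# Hypothetical sample data; both lists are in the same record order.
names = [{"id": "1", "name": "a"}, {"id": "2", "name": "b"}]
amounts = [{"id": "1", "amount": "10"}, {"id": "2", "amount": "20"}]

combined = list(combine(names, amounts, merge_fields))
```

    This only works if the record order really is identical on both branches. If it is not, records will be paired with the wrong partners without any error being raised.
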
    Best regards,
