Customer Portal

Fuzzy Match Join

Comments 2

  • Avatar
    the_goat
    0
    Comment actions Permalink
    Hi Michael,

    well it depends a bit on what version of CloverETL do you have. I believe this component was deprecated for a moment but in later versions reinstated.

    Anyway, what you should have as well (behaves similarly and is easier to configure) is this: http://doc.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/aspell-lookup-table.html. The question is, about how many records are we talking about here..? Keep in mind, both component and lookup uses memory to store "slave" records (the ones you're joining against).
  • Avatar
    Vladimir Barton
    0
    Comment actions Permalink
    Hi Mike,
    as the_goat already mentioned, the Aspell Lookup feature looks like a good candidate for your case (in consideration of the memory consumption as stated above). Let me provide you with an example project where the usage of the Aspell Lookup is demonstrated.
    • The idea is to first configure the Aspell Lookup table, namely its source file (Data file URL) which would be one of the 2 input files of yours, its metadata and lookup field key.

    • In the graph itself, you would read your second input file and utilize a LookupJoin component for joining it with the lookup table (your first input file).

    • The key feature of the Aspell Lookup can be configured (again in the lookup edit mode) by defining the Spelling threshold and Costs. The higher the threshold, the more tolerant is the component to spelling errors. Words with this value higher than the specified limit are not included in the results.
      In this example, I set the Spelling threshold to 60 and the cost of every operation to 20. For instance, when Clover tries to join the first master record “AAA1234567890” with the slave record “1234567890” it needs to perform 3 INSERT operations on the slave record (insert “A” character 3 times). As the cost of the INSERT operation is 20, the total cost of the used operations is 60 which is exactly within the given threshold so the records are joined successfully.
      When Clover tries to join the second master record “001122334455_1” to the slave record “1122334455” it needs to perform 4 INSERT operations on the slave record. As the cost of the INSERT operation is 20, the total cost of the used operations is 80 which is above the given threshold and the records are not joined.
      You can try changing the Spelling threshold to various values (20, 40, 80 and 120) and see how the results change.

    Kind regards,

Please sign in to leave a comment.