Comments (8)

  • imriskal
    Hi Anna,

    I am not sure about one thing: how do you know that the cat is joined with 555-555-5555 and the dog with 777-777-7777? All the IDs are 1, so why is the following output wrong?

    1,Weston,555-555-5555,dog
    1,Weston,777-777-7777,cat
    1,Weston,888-888-8888,

    I assume you just want to join the records based on order. If so, you can use the Combine component with the "Allow incomplete tuples" property set to true. ExtMergeJoin indeed has the same functionality as an SQL join; that is intentional.

    http://doc.cloveretl.com/documentation/ ... mbine.html

    Does this solution meet your requirement?
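    For illustration, here is a plain-Java sketch (hypothetical code, not the actual Combine component) of the positional pairing that "Allow incomplete tuples" produces, with the shorter input padded by empty fields:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class CombineSketch {
        // Pair elements by position; when one list runs out, pad with ""
        // (the "incomplete tuples" case).
        static List<String> combine(List<String> phones, List<String> pets) {
            List<String> out = new ArrayList<>();
            int n = Math.max(phones.size(), pets.size());
            for (int i = 0; i < n; i++) {
                String phone = i < phones.size() ? phones.get(i) : "";
                String pet = i < pets.size() ? pets.get(i) : "";
                out.add(phone + "," + pet);
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> rows = combine(
                    List.of("555-555-5555", "777-777-7777", "888-888-8888"),
                    List.of("dog", "cat"));
            rows.forEach(System.out::println);
            // 555-555-5555,dog
            // 777-777-7777,cat
            // 888-888-8888,
        }
    }
    ```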
  • anweston
    Heya,

    The second output is not wrong if the slaves came in in that order. This IS a join: a Person has one-to-many Phone Numbers AND a Person has one-to-many Pets; the pets and phone numbers are NOT related to each other, hence the cartesian-like product from this join. I need a single output file with Person-Phone Number-Pets, but the multiples caused by combining phone numbers and pets in the same output are undesirable, because in my real-world data this could cause large output files with essentially repeated data. I had all the IDs as 1 because I was trying to provide a simple example; sorry if I made it confusing!

    I looked at the Combine node and will play with it, but I'm not sure it will do what I need because there is no mechanism for joining the data on a key. I suspect what I need is something similar to ExtMergeJoin, except that it does not "rewind" the slaves to get all the "duplicate" combinations. I need the node to know there is a master key, iterate over the slaves to create a record, "pop" those slave records off, and iterate over the slaves again to create the next record, until an iteration shows there are no slaves with a key matching the master; it should then move on to the next master. I have a Java transform class, but the underlying parent MergeJoin class doesn't seem to be visible to it, so I can't check in that class whether I have seen a slave's data before and "skip" sending that row forward as a "duplicate."
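    A rough plain-Java sketch of that consuming behaviour (hypothetical names, not a CloverETL component), assuming each slave group for the current master key is already available as an iterator:

    ```java
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class ZipJoinSketch {
        // For one master record, walk all slave iterators in lock-step,
        // consuming ("popping") one record from each per output row
        // instead of rewinding to build every combination.
        static void flushMaster(String master, List<Iterator<String>> slaves,
                                List<String> out) {
            boolean stillHasData = true;
            while (stillHasData) {
                stillHasData = false;
                StringBuilder row = new StringBuilder(master);
                for (Iterator<String> slave : slaves) {
                    if (slave.hasNext()) {
                        row.append(',').append(slave.next());
                        stillHasData = true;
                    } else {
                        row.append(',');   // exhausted slave: emit empty field
                    }
                }
                if (stillHasData) {
                    out.add(row.toString());
                }
            }
        }

        public static void main(String[] args) {
            List<String> out = new ArrayList<>();
            flushMaster("1,Weston",
                    List.of(List.of("555-555-5555", "777-777-7777", "888-888-8888").iterator(),
                            List.of("dog", "cat").iterator()),
                    out);
            out.forEach(System.out::println);
            // 1,Weston,555-555-5555,dog
            // 1,Weston,777-777-7777,cat
            // 1,Weston,888-888-8888,
        }
    }
    ```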

    Thanks,
    Anna
  • admin
    Hi Anna,

    In that case, maybe you can consider http://doc.cloveretl.com/documentation/ ... ields.html, available in CloverETL 3.3+.

    So you would store your data as a single record using lists:
    1,Weston,[555-555-5555, 777-777-7777, 888-888-8888],[dog, cat]


    I believe this stores your data much better than 3 records duplicating the id and name.

    You can prepare such a record, for example, by following these steps:
    * do the join as Lubos advised (so you get 6 records)
    * using the Denormalizer component, feed these 6 records into a single record containing lists

    There are handy functions in CTL2 http://doc.cloveretl.com/documentation/ ... -ctl2.html
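    As a plain-Java illustration of the denormalizing step (not CTL2 code; the names are mine), collapsing the 6 joined rows for one key into a single record whose phone and pet fields are lists:

    ```java
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class DenormalizeSketch {
        // Collapse joined rows {phone, pet} for one key into a single record
        // with list fields; insertion-ordered sets drop the duplicates that
        // the cartesian-like join created.
        static String denormalize(String idName, String[][] joinedRows) {
            Set<String> phones = new LinkedHashSet<>();
            Set<String> pets = new LinkedHashSet<>();
            for (String[] row : joinedRows) {
                phones.add(row[0]);
                if (!row[1].isEmpty()) {
                    pets.add(row[1]);
                }
            }
            return idName + "," + phones + "," + pets;
        }

        public static void main(String[] args) {
            String record = denormalize("1,Weston", new String[][] {
                    {"555-555-5555", "dog"}, {"555-555-5555", "cat"},
                    {"777-777-7777", "dog"}, {"777-777-7777", "cat"},
                    {"888-888-8888", "dog"}, {"888-888-8888", "cat"},
            });
            System.out.println(record);
            // 1,Weston,[555-555-5555, 777-777-7777, 888-888-8888],[dog, cat]
        }
    }
    ```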

    I hope this helps.
  • anweston
    Heya,

    Unfortunately, I am required to produce the output I described because it is the format for a process further down the line. Well, with a few exceptions; the data actually looks like:

    1,Weston,555-555-5555,cat
    ,,777-777-7777,dog
    ,,888-888-8888,

    but in a fixed-width format. I could accomplish this by only passing the master record when I process the first set of slaves.
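    That "emit the master fields only on the first row of a group" idea can be sketched like this (plain Java, hypothetical names):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class BlankMasterSketch {
        // Emit the master fields (id, name) on the first row of a group only;
        // later rows of the same group get empty master fields.
        static List<String> formatGroup(String id, String name, String[][] slaveRows) {
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < slaveRows.length; i++) {
                String idField = (i == 0) ? id : "";
                String nameField = (i == 0) ? name : "";
                rows.add(idField + "," + nameField + ","
                        + slaveRows[i][0] + "," + slaveRows[i][1]);
            }
            return rows;
        }

        public static void main(String[] args) {
            formatGroup("1", "Weston", new String[][] {
                    {"555-555-5555", "cat"},
                    {"777-777-7777", "dog"},
                    {"888-888-8888", ""},
            }).forEach(System.out::println);
            // 1,Weston,555-555-5555,cat
            // ,,777-777-7777,dog
            // ,,888-888-8888,
        }
    }
    ```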

    I will try and figure out a custom solution to my problem since it doesn't look like you have a way to produce it. I thought about trying to create a component that extends off of MergeJoin.java, but it looks like the method I'd need to override (flushMin()) is private.

    BTW, we are using Clover 3.4.0

    Thanks,
    Anna
  • novakj
    Hi Anna,

    I prepared an example project for you with your desired functionality. The solution is based on Jaroslav's post (posted Thu Jan 23, 2014 9:12 am).

    In this example, the data from the files are joined by the ExtMergeJoin component, denormalized by the Denormalizer component (so all information about one person is in one record), and then normalized by the Normalizer component into your desired format.

    If you have any questions about the transformations or any other details, please do not hesitate to let us know.
  • anweston
    Heya,

    I will look to see if I can adapt this solution; I really do appreciate your assistance! I understand that what I am asking for may be unusual enough that it may not be covered in an easy manner.

    What I'm concerned about is the scalability of what you are suggesting. As noted below, my example is a simplistic representation of the data, and starting with a merge that creates all the duplicate combinations could produce a huge number of records just to squeeze back down into the set I need. Also, getting everything into one record like that could produce a very wide record if "pets" and "phones" have lots of attributes, and in my experience wide records can cause a performance hit.

    For example, we have property data where the main property details are the parent, a property has owners, a property has transactions, a property has features, a property has valuations, etc. All of this data needs to get into a file in the format I described. So if a single property has had 10 owners, 100 transactions, 5 features, and 150 valuations that is 750,000(?) records to create in a merge join to get down to 150(?) records - and that's just ONE property (we even have some examples where a property is owned by a native tribe, which has over 2,000 listed owners!).
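    The arithmetic above checks out: a cartesian-style join emits one row per combination of the group sizes, while a zip-style output only needs as many rows as the largest group. A quick sketch:

    ```java
    public class JoinSizeSketch {
        public static void main(String[] args) {
            int owners = 10, transactions = 100, features = 5, valuations = 150;
            // A cartesian-style join emits one row per combination...
            long joined = (long) owners * transactions * features * valuations;
            // ...while the desired output needs only max(group sizes) rows.
            int needed = Math.max(Math.max(owners, transactions),
                                  Math.max(features, valuations));
            System.out.println(joined + " joined rows vs " + needed + " needed");
            // 750000 joined rows vs 150 needed
        }
    }
    ```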

    Maybe I can try to create "key fields" for "phones" and "pets," use those, and then somehow merge all the other attributes in. I will play with it, but I am concerned about scalability :-)

    Thanks,
    Anna
  • anweston
    Heya,


    Since I was concerned about scalability, I decided to test out my theory about MergeJoin.java. I made a custom component that was the MergeJoin.java code with a modified flushMin():



    private boolean flushMin() throws IOException, InterruptedException,
            TransformException {

        // Load the first record from every port in this key group
        // (the master on port 0 and each matching slave).
        for (int i = 0; i < inputCnt; i++) {
            inRecords[i] = minIndicator[i] ? reader[i].next()
                    : NullRecord.NULL_RECORD;
        }

        while (true) {
            outRecords[0].reset();
            int transformResult = -1;

            try {
                transformResult = transformation.transform(inRecords, outRecords);
            } catch (Exception exception) {
                transformResult = transformation.transformOnError(exception,
                        inRecords, outRecords);
            }

            if (transformResult >= 0) {
                outPort.writeRecord(outRecords[0]);
            } else {
                handleException(transformation, transformResult);
            }

            // Advance each slave (ports 1..n) to its next record instead of
            // rewinding; the master record on port 0 is kept as-is.
            boolean stillHasData = false;
            for (int i = 1; i < inputCnt; i++) {
                inRecords[i] = minIndicator[i] ? reader[i].next()
                        : NullRecord.NULL_RECORD;
                if (inRecords[i] == null) {
                    inRecords[i] = NullRecord.NULL_RECORD;
                } else if (minIndicator[i]) {
                    stillHasData = true;
                }
            }

            // Stop once every slave in this key group is exhausted.
            if (!stillHasData) {
                return true;
            }
        }
    }



    So far my tests show that this produces the output I require without creating a bunch of extra rows and then having to denormalize and normalize. Although I'm not fond of a custom solution, what I'm trying to do is odd enough that it may not be useful to anyone else. I will run some more tests, but it looks like this is the solution to my problem.

    Thanks,
    Anna
  • novakj
    Great work, thank you very much for sharing your code.