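It sounds like you basically just want to do a join (it isn't clear from the question whether that should be INNER, LEFT, RIGHT, or FULL). I think @SNeumann basically has the answer written up already, but I'll add some code to make it clearer.
Assuming the data looks like: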
data1 = 'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
...
data2 = 'value1' 'result1'
'value2' 'result2'
...
I'd do something like this (untested):
A = load 'data1' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data2' as ( v:chararray, r:chararray );
A_flattened = FOREACH A GENERATE item, d, things.thing AS thing, things.d1 AS d1, FLATTEN(things.values) AS value;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1'
--'item1', 111, 'thing1', 222, 'value2'
A_B_joined = JOIN A_flattened BY value, B BY v;
--This looks like:
--'item1', 111, 'thing1', 222, 'value1', 'value1', 'result1'
--'item1', 111, 'thing1', 222, 'value2', 'value2', 'result2'
A_B_joined1 = FOREACH A_B_joined GENERATE item, d, thing, d1, A_flattened::value AS value, r AS result;
A_B_grouped = GROUP A_B_joined1 BY (value, result);
From there, it should be trivial to rebag however you like.
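As one possible rebagging, here is a sketch that rebuilds a single bag of (value, result) pairs per original row. It assumes that (item, d, thing, d1) identifies a row, which the question does not confirm, and the aliases A_B_regrouped, A_B_rebagged, and results are made up for illustration. Like the rest of this answer, it is untested.

```pig
-- Group on the outer fields so each original row gets its pairs back together.
A_B_regrouped = GROUP A_B_joined1 BY (item, d, thing, d1);
-- Unpack the group key and keep only (value, result) inside the rebuilt bag.
A_B_rebagged = FOREACH A_B_regrouped GENERATE FLATTEN(group) AS (item, d, thing, d1), A_B_joined1.(value, result) AS results;
```

If you would rather nest by a different key, only the BY clause and the FLATTEN(group) field list should need to change.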
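EDIT: The above should have used '.' as the projection operator on tuples. I've switched that in. It also assumes things is one big tuple, which it isn't: it's a bag containing one item. If the OP never plans to put more than one item in that bag, I'd highly recommend using a tuple instead and loading as: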
A = load 'data1' as (item:chararray, d:int, things:(thing:chararray, d1:int, values:bag{(v:chararray)}));
Then use the rest of the code essentially as is (note: still untested).
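If a bag is absolutely required, then the whole problem changes, and it becomes unclear what the OP wants to happen when the things bag holds multiple items. Bag projection is also quite a bit more complicated, as noted here: http://ofps.oreilly.com/titles/9781449302641/intro_pig_latin.html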
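If things really can hold several items, one reading is that each item should be handled independently. That is my assumption, not something the OP stated. Under it, flattening in two steps sidesteps bag projection altogether. The aliases A_things and A_flat2 are made up for illustration, and this is untested as well.

```pig
-- Assumes A was loaded with the original bag schema for things.
-- Step 1: one output row per element of the things bag.
A_things = FOREACH A GENERATE item, d, FLATTEN(things) AS (thing, d1, values);
-- Step 2: one output row per element of the inner values bag.
A_flat2 = FOREACH A_things GENERATE item, d, thing, d1, FLATTEN(values) AS value;
```

A_flat2 should then be usable in place of A_flattened in the JOIN above.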