This can be done by using a series of self-joins to find the other rooms in the same category, then collecting the results into 2 maps.
Code
CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;
SELECT
  customer_rooms.customer,
  collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
  collect(
    COALESCE(customer_category_rooms.room, category_rooms.room),
    COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
  AND category_rooms.category = customer_category_rooms.category
  AND category_rooms.room = customer_category_rooms.room
WHERE (
  customer_rooms.customer = customer_category_rooms.customer AND
  customer_rooms.category = customer_category_rooms.category AND
  customer_rooms.room = customer_category_rooms.room AND
  customer_rooms.date = customer_category_rooms.date
)
OR (
  customer_category_rooms.customer IS NULL AND
  customer_category_rooms.category IS NULL AND
  customer_category_rooms.room IS NULL AND
  customer_category_rooms.date IS NULL
)
GROUP BY
  customer_rooms.customer
;
Result Set
1 {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2 {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3 {"ee":"d6"} {"ee":"d6"}
Explanation
FROM `table` AS customer_rooms
First, we pull results from the initial table. We'll name this relation customer_rooms. As you already noted in the question, this is enough to build map_customer_room_date.
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
The first self-join identifies all rooms that have the same category as the rooms explicitly mentioned in the customer_rooms rows. We'll name this relation category_rooms.
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
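The second self-join takes the rooms that we identified in category_rooms and attempts to find out whether that room is already held by the customer identified in customer_rooms. We'll name this relation customer_category_rooms. This is a LEFT OUTER JOIN because we want to preserve all rows from the prior join. The result will be that either 1) the values from customer_rooms and customer_category_rooms are identical, because the customer already holds this room, or 2) the values from customer_category_rooms are all NULL, because the customer does not hold this room, but it is a room in one of the same categories. This distinction becomes important so that we can preserve the customer's date if they already hold the room.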
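To see both outcomes concretely, it can help to run the two joins without the WHERE clause or the GROUP BY and look at the raw rows for a single customer. The following is only a sketch against the sample table above: the output aliases (held_room, category_room, matched_date and so on) are made up for illustration, and `date` is backtick-quoted on the assumption that your Hive version treats it as a reserved word.

```sql
-- Raw output of the two self-joins for customer 1, before filtering and aggregation.
SELECT
  customer_rooms.category        AS category,
  customer_rooms.room            AS held_room,
  customer_rooms.`date`          AS held_date,
  category_rooms.customer        AS other_customer,
  category_rooms.room            AS category_room,
  category_rooms.`date`          AS category_date,
  -- NULL when customer 1 holds no row for category_room in this category.
  customer_category_rooms.`date` AS matched_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
  AND category_rooms.category = customer_category_rooms.category
  AND category_rooms.room = customer_category_rooms.room
WHERE customer_rooms.customer = 1
;
```

With the sample data, the rows where category_room is ee should come back with a NULL matched_date, since customer 1 holds no ee room; those are the rows that fall under case 2.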
Next, we need to filter.
WHERE (
customer_rooms.customer = customer_category_rooms.customer AND
customer_rooms.category = customer_category_rooms.category AND
customer_rooms.room = customer_category_rooms.room AND
customer_rooms.date = customer_category_rooms.date
)
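This includes the rooms that the customer explicitly holds in the original table. Matched rows in which customer_category_rooms is a different room held by the same customer satisfy neither this condition nor the NULL condition that follows, so they are dropped.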
OR (
customer_category_rooms.customer IS NULL AND
customer_category_rooms.category IS NULL AND
customer_category_rooms.room IS NULL AND
customer_category_rooms.date IS NULL
)
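This includes the rooms that are not held by the customer but belong to the same category as rooms that the customer does hold.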
collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
map_customer_room_date can be built by collecting the original data from the table, which we aliased as customer_rooms.
collect(
COALESCE(customer_category_rooms.room, category_rooms.room),
COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
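Building map_category_room_date is more complex. If the customer explicitly holds a room, then we want to preserve the date for that room. However, if the customer does not explicitly hold the room, then we want to use the room and date from another row with an overlapping category. To accomplish that, we use the Hive COALESCE https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions function to choose the first value that is not NULL. If the customer already holds the room (as indicated by non-NULL values in customer_category_rooms), then we'll use that. If not, then we'll use the values from category_rooms instead.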
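As a quick illustration of how COALESCE behaves in the two cases, the following standalone query can be run on its own. It is only a sketch: the string literals stand in for customer_category_rooms.date and category_rooms.date, the column aliases are illustrative, and it assumes a Hive version that allows SELECT without a FROM clause.

```sql
SELECT
  -- Customer holds the room: the first argument is non-NULL, so it is returned.
  COALESCE('d3', 'd1') AS held_case,                      -- 'd3'
  -- Customer does not hold the room: the first argument is NULL, so the fallback is returned.
  COALESCE(CAST(NULL AS STRING), 'd5') AS not_held_case   -- 'd5'
;
```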
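Note that there is still room for some ambiguity if the same category/room combination can map to multiple date values. If that's significant, then you might need to put more work into picking the right date based on some business rule (e.g. use the soonest date) or map to multiple date values instead of a single value. If there are additional requirements like that, this should give you a good starting point.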
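As one possible starting point for a "soonest date" rule, category_rooms could be replaced by a subquery that keeps a single date per category/room pair. This is a sketch under assumptions, not a tested solution: it assumes that MIN over the date column orders the dates correctly (true for ISO-formatted strings or a real DATE column; the sample values 'd1' to 'd6' only compare lexicographically), it backtick-quotes `date` in case your Hive version reserves the word, and it leaves the rest of the query unchanged.

```sql
SELECT
  customer_rooms.customer,
  collect(customer_rooms.room, customer_rooms.`date`) AS map_customer_room_date,
  collect(
    COALESCE(customer_category_rooms.room, category_rooms.room),
    COALESCE(customer_category_rooms.`date`, category_rooms.`date`)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN (
  -- Assumed business rule: one row per category/room pair, keeping the soonest date.
  SELECT category, room, MIN(`date`) AS `date`
  FROM `table`
  GROUP BY category, room
) AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
  AND category_rooms.category = customer_category_rooms.category
  AND category_rooms.room = customer_category_rooms.room
WHERE (
  customer_rooms.customer = customer_category_rooms.customer AND
  customer_rooms.category = customer_category_rooms.category AND
  customer_rooms.room = customer_category_rooms.room AND
  customer_rooms.`date` = customer_category_rooms.`date`
)
OR customer_category_rooms.customer IS NULL  -- unmatched rows from the LEFT OUTER JOIN
GROUP BY
  customer_rooms.customer
;
```

This only settles which fallback date is used within one category/room pair. A room that appears under several categories can still contribute more than one date for the same map key, so that case would need its own rule.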