Select a random entry from a group after grouping by a value (not a column)?

2023-12-29

I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails side is not really relevant here.

Table structure:

measurement
 - int id
 - int anchor_id
 - Point groundtruth
 - data (not important for the query)

Example data:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 2         | POINT(1 4)  | ...
4  | 3         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 1         | POINT(4 3)  | ...
10 | 5         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...

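The table is a kind of manually created view for faster lookups (millions of rows). Otherwise we would have to join 8 tables and it would get even slower. But that is not part of the problem.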

Simple version:

Parameters:

Point p
int d

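What the query should do:

1. The query looks for all groundtruth points that have a distance < d from point p.

The SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d

2. Now we have a list of groundtruth points with their anchor_ids. As you can see in the table above, there can be multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3 and groundtruth=POINT(1 4).

3. Next, I'd like to eliminate the identical tuples by choosing one of them at random (!). Why not simply take the first one? Because the data column differs.

Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1

My problem with all of this: I can imagine a solution using SQL LOOPs and lots of subqueries, but there surely is a solution using GROUP BY or some other method that would make it faster.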

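Full version:

Basically the same as above, with one difference: the input parameters change:

lots of points p1 ... p312456345
still one d

If the simple query works, this could be done using a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!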

Solution

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
    *
FROM measurement m
WHERE EXISTS (
    SELECT 1
    FROM ps
    WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();

Thanks to Erwin Brandstetter!

To eliminate duplicates, this might be the most efficient query in PostgreSQL:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d

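More about this query style:

Select first row in each GROUP BY group? https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564

As mentioned in the comments, this gives you an arbitrary pick. If you need a random pick, it is somewhat more expensive: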

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d
ORDER  BY anchor_id, groundtruth, random()
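
To see the effect in isolation, here is a minimal sketch that runs in plain PostgreSQL without PostGIS. It is an illustration only: groundtruth is stored as text, the WHERE clause is left out, and the data values 'a' to 'd' are made-up placeholders for the elided column.

```sql
-- Stand-in for the first four sample rows of the measurement table.
-- groundtruth is plain text here so the sketch does not need PostGIS;
-- the data values are hypothetical placeholders.
WITH measurement(id, anchor_id, groundtruth, data) AS (
   VALUES
     (1, 1, 'POINT(1 4)', 'a')
   , (2, 3, 'POINT(1 4)', 'b')
   , (3, 2, 'POINT(1 4)', 'c')
   , (4, 3, 'POINT(1 4)', 'd')
   )
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
ORDER  BY anchor_id, groundtruth, random();
```

This returns one row per (anchor_id, groundtruth) pair. For anchor_id = 3 the row is either id 2 or id 4, chosen at random on each execution.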

The second part is harder to optimize. An EXISTS semi-join (https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY-EXISTS) is probably the fastest choice. For a given table ps (p point):

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
ORDER  BY anchor_id, groundtruth, random();

This can stop evaluating as soon as one p is close enough, and it keeps the rest of the query simple.

Be sure to back this up with a matching GiST index http://blog.opengeo.org/2011/09/28/indexed-nearest-neighbour-search-in-postgis/.
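
A minimal sketch of such an index, assuming the table and column names from the question; the index name is made up for illustration. Keep in mind that ST_DWithin() can use a spatial index, while a plain st_distance(...) < d comparison generally cannot.

```sql
-- Hypothetical index name; table and column are taken from the question.
CREATE INDEX measurement_groundtruth_gist_idx
ON     measurement USING gist (groundtruth);
```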

If you have an array as input, create a CTE https://www.postgresql.org/docs/current/queries-with.html with unnest() https://www.postgresql.org/docs/current/functions-array.html#ARRAY-FUNCTIONS-TABLE on the fly:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...
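
As a small sketch of what the CTE produces, assuming PostGIS is installed and using two literal sample points in place of p_array (the values are taken from the example data, the array itself is an assumption):

```sql
-- Assumed sample input: two points standing in for p_array (requires PostGIS).
WITH ps AS (
   SELECT unnest(ARRAY['POINT(1 4)'::geometry, 'POINT(3 2)'::geometry]) AS p
   )
SELECT ST_AsText(p) AS p
FROM   ps;
```

The CTE yields one row per array element, so ps can be used like the table ps (p point) in the queries above.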

Update according to comment

If you only need a single row as the answer, you can simplify:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
LIMIT  1;

Faster with `ST_DWithin()`

It is probably more efficient with the function ST_DWithin() https://postgis.net/docs/ST_DWithin.html (and a matching GiST index!).
To get one row (using a sub-select instead of a CTE here):

SELECT *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT  1;

To get one row for every point p within distance d:

SELECT DISTINCT ON (ps.p) *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)

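Adding ORDER BY random() will make this query more expensive. Without random(), Postgres can just pick the first matching row from the GiST index. Otherwise all possible matches have to be retrieved and ordered randomly.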
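
For the original requirement (one random row per (anchor_id, groundtruth) pair for a whole array of points), the EXISTS query from earlier can presumably be combined with ST_DWithin(). The following is a sketch under that assumption, not a query from the original answer; the two literal points and the distance 2 are sample values standing in for p_array and d.

```sql
-- Sketch only: literal sample values stand in for p_array and d.
SELECT DISTINCT ON (m.anchor_id, m.groundtruth) m.*
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   unnest(ARRAY['POINT(1 4)'::geometry, 'POINT(4 3)'::geometry]) AS ps(p)
   WHERE  ST_DWithin(ps.p, m.groundtruth, 2)  -- can use a GiST index on groundtruth
   )
ORDER  BY m.anchor_id, m.groundtruth, random();
```

As noted above, the random() sort key means every matching row has to be fetched and sorted, so this trades some speed for a truly random pick per group.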

BTW, LIMIT 1 inside EXISTS is pointless. Read the manual at the link I provided https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY-EXISTS or this related question https://stackoverflow.com/questions/7710153/what-is-easier-to-read-in-exists-subqueries.
