这是一个很常见的问题。
Plain B-Tree
索引不适合这样的查询:
SELECT measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM intervals
JOIN measures
ON measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
time ASC
索引适合搜索给定范围内的值,如下所示:
,但不适用于搜索包含给定值的边界,如下所示:
我博客中的这篇文章更详细地解释了这个问题:
- 邻接表与嵌套集:MySQL http://explainextended.com/2009/09/29/adjacency-list-vs-nested-sets-mysql/
(嵌套集合模型处理类似类型的谓词)。
您可以将索引设置为time
,这样一来intervals
将在连接中领先,范围时间将在嵌套循环内使用。这将需要排序time
.
您可以创建空间索引intervals
(可用于MySQL
using MyISAM
存储),其中包括start
and end
在一个几何列中。这边走,measures
可以引导连接并且不需要排序。
然而,空间索引速度更慢,因此只有在度量很少但间隔很多的情况下这才会有效。
由于您的间隔很少但度量很多,只需确保您有一个索引measures.time
:
CREATE INDEX ix_measures_time ON measures (time)
Update:
这是一个要测试的示例脚本:
BEGIN
DBMS_RANDOM.seed(20091223);
END;
/
CREATE TABLE intervals (
entry_time NOT NULL,
exit_time NOT NULL
)
AS
SELECT TO_DATE('23.12.2009', 'dd.mm.yyyy') - level,
TO_DATE('23.12.2009', 'dd.mm.yyyy') - level + DBMS_RANDOM.value
FROM dual
CONNECT BY
level <= 1500
/
CREATE UNIQUE INDEX ux_intervals_entry ON intervals (entry_time)
/
CREATE TABLE measures (
time NOT NULL,
measure NOT NULL
)
AS
SELECT TO_DATE('23.12.2009', 'dd.mm.yyyy') - level / 720,
CAST(DBMS_RANDOM.value * 10000 AS NUMBER(18, 2))
FROM dual
CONNECT BY
level <= 1080000
/
ALTER TABLE measures ADD CONSTRAINT pk_measures_time PRIMARY KEY (time)
/
CREATE INDEX ix_measures_time_measure ON measures (time, measure)
/
这个查询:
SELECT SUM(measure), AVG(time - TO_DATE('23.12.2009', 'dd.mm.yyyy'))
FROM (
SELECT *
FROM (
SELECT /*+ ORDERED USE_NL(intervals measures) */
*
FROM intervals
JOIN measures
ON measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
time
)
WHERE rownum <= 500000
)
uses NESTED LOOPS
并返回1.7
秒。
这个查询:
SELECT SUM(measure), AVG(time - TO_DATE('23.12.2009', 'dd.mm.yyyy'))
FROM (
SELECT *
FROM (
SELECT /*+ ORDERED USE_MERGE(intervals measures) */
*
FROM intervals
JOIN measures
ON measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
time
)
WHERE rownum <= 500000
)
uses MERGE JOIN
之后我不得不停止它5
分钟。
更新2:
您很可能需要使用如下提示强制引擎在连接中使用正确的表顺序:
SELECT /*+ LEADING (intervals) USE_NL(intervals, measures) */
measures.measure as measure,
measures.time as time,
intervals.entry_time as entry_time,
intervals.exit_time as exit_time
FROM intervals
JOIN measures
ON measures.time BETWEEN intervals.entry_time AND intervals.exit_time
ORDER BY
time ASC
The Oracle
的优化器不够智能,无法看到间隔不相交。这就是为什么它很可能会使用measures
作为领先表(如果间隔相交,这将是一个明智的决定)。
更新3:
WITH splits AS
(
SELECT /*+ MATERIALIZE */
entry_range, exit_range,
exit_range - entry_range + 1 AS range_span,
entry_time, exit_time
FROM (
SELECT TRUNC((entry_time - TO_DATE(1, 'J')) * 2) AS entry_range,
TRUNC((exit_time - TO_DATE(1, 'J')) * 2) AS exit_range,
entry_time,
exit_time
FROM intervals
)
),
upper AS
(
SELECT /*+ MATERIALIZE */
MAX(range_span) AS max_range
FROM splits
),
ranges AS
(
SELECT /*+ MATERIALIZE */
level AS chunk
FROM upper
CONNECT BY
level <= max_range
),
tiles AS
(
SELECT /*+ MATERIALIZE USE_MERGE (r s) */
entry_range + chunk - 1 AS tile,
entry_time,
exit_time
FROM ranges r
JOIN splits s
ON chunk <= range_span
)
SELECT /*+ LEADING(t) USE_HASH(m t) */
SUM(LENGTH(stuffing))
FROM tiles t
JOIN measures m
ON TRUNC((m.time - TO_DATE(1, 'J')) * 2) = tile
AND m.time BETWEEN t.entry_time AND t.exit_time
此查询将时间轴分割为多个范围并使用HASH JOIN
将度量值和时间戳加入到范围值中,并在稍后进行精细过滤。
有关其工作原理的更详细说明,请参阅我的博客中的这篇文章:
- Oracle:连接时间戳和时间间隔 http://explainextended.com/2009/12/28/oracle-joining-timestamps-and-time-intervals/