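I have time series data with irregularly missing values for each user, and I want to convert it to regular intervals by linear interpolation using BigQuery Standard SQL.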
+------+---------------------+-------+
| name | time | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 | 3 |
| Jane | 2020-11-14 09:05:00 | 5 |
| Jane | 2020-11-14 09:07:00 | 1 |
| Jane | 2020-11-14 09:09:00 | 8 |
| Jane | 2020-11-14 09:10:00 | 4 |
| Kay | 2020-11-14 09:01:00 | 7 |
| Kay | 2020-11-14 09:04:00 | 1 |
| Kay | 2020-11-14 09:05:00 | 10 |
| Kay | 2020-11-14 09:09:00 | 6 |
| Kay | 2020-11-14 09:10:00 | 7 |
+------+---------------------+-------+
I would like to convert it as follows:
+------+---------------------+-------+-----------------+
| name | time | value | |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3 | |
| Jane | 2020-11-14 09:02:00 | 3.5 | <= interpolated |
| Jane | 2020-11-14 09:03:00 | 4 | <= interpolated |
| Jane | 2020-11-14 09:04:00 | 4.5 | <= interpolated |
| Jane | 2020-11-14 09:05:00 | 5 | |
| Jane | 2020-11-14 09:06:00 | 3 | <= interpolated |
| Jane | 2020-11-14 09:07:00 | 1 | |
| Jane | 2020-11-14 09:08:00 | 4.5 | <= interpolated |
| Jane | 2020-11-14 09:09:00 | 8 | |
| Jane | 2020-11-14 09:10:00 | 4 | |
| Kay | 2020-11-14 09:01:00 | 7 | |
| Kay | 2020-11-14 09:02:00 | 5 | <= interpolated |
| Kay | 2020-11-14 09:03:00 | 3 | <= interpolated |
| Kay | 2020-11-14 09:04:00 | 1 | |
| Kay | 2020-11-14 09:05:00 | 10 | |
| Kay | 2020-11-14 09:06:00 | 9 | <= interpolated |
| Kay | 2020-11-14 09:07:00 | 8 | <= interpolated |
| Kay | 2020-11-14 09:08:00 | 7 | <= interpolated |
| Kay | 2020-11-14 09:09:00 | 6 | |
| Kay | 2020-11-14 09:10:00 | 7 | |
+------+---------------------+-------+-----------------+
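Is there a clever solution for this?
Addendum: this is an applied variant of this Stack Overflow question: https://stackoverflow.com/questions/64816885/how-to-fill-irregularly-missing-values-with-linear-interepolation-in-bigquery. It is very similar, but differs in that the data here is a time series and carries a name for each user.
Thanks.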
Below is for BigQuery Standard SQL
#standardSQL
select name, time,
  -- keep known values; otherwise interpolate between the nearest known neighbours
  ifnull(value, start_value
    + (end_value - start_value) / timestamp_diff(end_tick, start_tick, minute) * timestamp_diff(time, start_tick, minute)
  ) as value_interpolated
from (
  select name, time, value,
    -- win1 looks backwards in time: nearest earlier row that has a value
    first_value(tick ignore nulls ) over win1 as start_tick,
    first_value(value ignore nulls) over win1 as start_value,
    -- win2 looks forwards in time: nearest later row that has a value
    first_value(tick ignore nulls ) over win2 as end_tick,
    first_value(value ignore nulls) over win2 as end_value
  from (
    -- tick is null for generated minutes that have no row in the source table
    select name, time, t.time as tick, value
    from (
      -- one-minute grid per name, from its first to its last observation
      select name, generate_timestamp_array(min(time), max(time), interval 1 minute) times
      from `project.dataset.table`
      group by name
    )
    cross join unnest(times) time
    left join `project.dataset.table` t
    using(name, time)
  )
  window
    win1 as (partition by name order by time desc rows between current row and unbounded following),
    win2 as (partition by name order by time rows between current row and unbounded following)
)
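Read as a formula, the expression inside `ifnull` appears to be ordinary two-point linear interpolation. The symbols below are introduced only for illustration: (t_0, v_0) stands for `start_tick` / `start_value`, (t_1, v_1) for `end_tick` / `end_value`, and t for the generated `time`, with all differences measured in minutes.

```latex
v(t) = v_0 + \frac{v_1 - v_0}{t_1 - t_0}\,(t - t_0)
```

For example, for Jane at 09:06 the neighbours are (09:05, 5) and (09:07, 1), which gives 5 + (1 - 5) / 2 * 1 = 3.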
If applied to the sample data from your question, the output is
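As a cross-check that does not depend on BigQuery, the same per-user interpolation can be sketched in plain Python. This is a minimal sketch, not a translation of the window-function query: it walks consecutive known points instead. The names `ROWS` and `interpolate_per_minute` are hypothetical, and the data is copied from the sample table in the question.

```python
from datetime import datetime, timedelta

# Sample rows copied from the question: (name, time, value).
ROWS = [
    ("Jane", "2020-11-14 09:01:00", 3),
    ("Jane", "2020-11-14 09:05:00", 5),
    ("Jane", "2020-11-14 09:07:00", 1),
    ("Jane", "2020-11-14 09:09:00", 8),
    ("Jane", "2020-11-14 09:10:00", 4),
    ("Kay", "2020-11-14 09:01:00", 7),
    ("Kay", "2020-11-14 09:04:00", 1),
    ("Kay", "2020-11-14 09:05:00", 10),
    ("Kay", "2020-11-14 09:09:00", 6),
    ("Kay", "2020-11-14 09:10:00", 7),
]

FMT = "%Y-%m-%d %H:%M:%S"


def interpolate_per_minute(rows):
    """Return one (name, time, value) row per minute for each name.

    Hypothetical helper: missing minutes are filled by linear
    interpolation between the nearest known points of the same name.
    """
    by_name = {}
    for name, ts, value in rows:
        by_name.setdefault(name, []).append((datetime.strptime(ts, FMT), value))

    result = []
    for name, points in by_name.items():
        points.sort()
        # Each pair of consecutive known points brackets a (possibly empty) gap.
        for (t0, v0), (t1, v1) in zip(points, points[1:]):
            gap = int((t1 - t0).total_seconds() // 60)
            for k in range(gap):
                # k == 0 reproduces the known value v0 itself.
                value = v0 + (v1 - v0) / gap * k
                stamp = (t0 + timedelta(minutes=k)).strftime(FMT)
                result.append((name, stamp, value))
        # The last known point is not covered by any pair above.
        last_t, last_v = points[-1]
        result.append((name, last_t.strftime(FMT), float(last_v)))
    return result


if __name__ == "__main__":
    for row in interpolate_per_minute(ROWS):
        print(row)
```

Running it prints ten rows per name, which can be compared line by line against the desired table in the question.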