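I am trying to run SQL from R using dbplyr, and I ran into an incorrect result when using the last() function in dbplyr code. I am not sure whether I am doing something wrong or whether this is actually a bug.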
Objective: use dbplyr to get the latest status of each employee from a table.
I have tried to reproduce the issue with the code below.
1. Libs
library(tidyverse)
library(lubridate)
library(dbplyr)
library(RSQLite)
2. Data
df_emp_status <- data.frame(
  Emp_id = c(121, 321, 451, 121, 451, 451, 321, 755),
  # Parse the day-month-year strings so TimeStamp is a date-time,
  # consistent with the <dttm> column shown in the output further down
  TimeStamp = dmy_hms(c('29-07-2019 08:55:55', '29-07-2019 09:02:55', '29-07-2019 09:05:50',
                        '29-07-2019 10:05:50', '29-07-2019 10:07:50', '29-07-2019 10:10:10',
                        '29-07-2019 10:20:10', '29-07-2019 11:00:00')),
  Status = c('IN', 'IN', 'IN', 'OUT', 'OUT', 'IN', 'OUT', 'IN')
)
3. Database connection and data
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_emp_status_sql <- copy_to(con, df_emp_status, overwrite = TRUE)
4. Objective: get the last Status of each employee
4.1 R code
Using summarise()
df_emp_status %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
summarise(last_status = last(Status))
# A tibble: 4 × 2
Emp_id last_status
<dbl> <chr>
1 121 OUT
2 321 OUT
3 451 IN
4 755 IN
Using mutate()
df_emp_status %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
mutate(last_status = last(Status))
# A tibble: 8 × 4
# Groups: Emp_id [4]
Emp_id TimeStamp Status last_status
<dbl> <dttm> <chr> <chr>
1 121 2019-07-29 08:55:55 IN OUT
2 121 2019-07-29 10:05:50 OUT OUT
3 321 2019-07-29 09:02:55 IN OUT
4 321 2019-07-29 10:20:10 OUT OUT
5 451 2019-07-29 09:05:50 IN IN
6 451 2019-07-29 10:07:50 OUT IN
7 451 2019-07-29 10:10:10 IN IN
8 755 2019-07-29 11:00:00 IN IN
The results above are correct and are what I expected.
4.2 SQL/dbplyr replication (gives wrong results)
Since last() does not work with summarise() when using dbplyr, I use mutate() instead.
df_emp_status_sql %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
mutate(last_status = last(Status))
# Source: SQL [8 x 4]
# Database: sqlite 3.39.4 [:memory:]
# Groups: Emp_id
# Ordered by: Emp_id, TimeStamp
Emp_id TimeStamp Status last_status
<dbl> <dbl> <chr> <chr>
1 121 1564390555 IN IN
2 121 1564394750 OUT OUT
3 321 1564390975 IN IN
4 321 1564395610 OUT OUT
5 451 1564391150 IN IN
6 451 1564394870 OUT OUT
7 451 1564395010 IN IN
8 755 1564398000 IN IN
The results above are wrong: last_status simply repeats each row's own Status instead of the last Status per employee.
4.3 Cross-checking the query
df_emp_status_sql %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
mutate(last_status = last(Status)) %>%
show_query()
<SQL>
SELECT
*,
LAST_VALUE(`Status`) OVER (PARTITION BY `Emp_id` ORDER BY `Emp_id`, `TimeStamp`) AS `last_status`
FROM `df_emp_status`
ORDER BY `Emp_id`, `TimeStamp`
I tried running the query above on its own, and it also returns the wrong results.
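One likely explanation, assuming SQLite applies its documented default window frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) whenever a window has an ORDER BY but no explicit frame: the frame stops at the current row, so LAST_VALUE() returns the current row's own value. The sketch below checks this on a small hypothetical table t (not part of my real data) by comparing the query without a frame against the same query with that default frame written out.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical three-row table, used only to illustrate the frame behaviour
dbWriteTable(con, "t", data.frame(
  id     = c(1, 1, 1),
  ts     = c(1, 2, 3),
  status = c("IN", "OUT", "IN")
))

# No frame clause: SQLite falls back to its default frame
implicit <- dbGetQuery(con, "
  SELECT ts,
         LAST_VALUE(status) OVER (PARTITION BY id ORDER BY ts) AS last_status
  FROM t
  ORDER BY ts")

# Same window with the assumed default frame spelled out explicitly
explicit <- dbGetQuery(con, "
  SELECT ts,
         LAST_VALUE(status) OVER (
           PARTITION BY id ORDER BY ts
           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_status
  FROM t
  ORDER BY ts")

# If the assumption holds, both queries return the same rows
print(identical(implicit, explicit))

dbDisconnect(con)
```

If the two results match, the wrong output would come from the missing frame clause rather than from the data or the ordering.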
4.4 Correct query
I was able to find a query that gives the desired result, based on this guide to using LAST_VALUE() in SQL: https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-last_value-function/
dbGetQuery(con,'
SELECT
*,
LAST_VALUE(`Status`) OVER (
PARTITION BY `Emp_id`
ORDER BY `Emp_id`, `TimeStamp`
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING) AS `last_status`
FROM `df_emp_status`
ORDER BY `Emp_id`, `TimeStamp`
')
Emp_id TimeStamp Status last_status
1 121 1564390555 IN OUT
2 121 1564394750 OUT OUT
3 321 1564390975 IN OUT
4 321 1564395610 OUT OUT
5 451 1564391150 IN IN
6 451 1564394870 OUT IN
7 451 1564395010 IN IN
8 755 1564398000 IN IN
5. Conclusion:
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
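The frame clause above seems to be essential for getting the correct result from a SQL query that uses LAST_VALUE, so dbplyr should also include it in its R-to-SQL translation of last(); otherwise the translated query gives wrong results.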
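A possible workaround on the dbplyr side, sketched under the assumption that dbplyr's window_order() and window_frame() verbs are the intended way to control the OVER clause: setting the frame to (-Inf, Inf) should make the window span the whole partition. The table emp and its ts column are hypothetical stand-ins for my real data, and I have not confirmed that this is the recommended approach.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical, smaller stand-in for df_emp_status (ts is a plain number here)
emp <- copy_to(con, data.frame(
  Emp_id = c(121, 121, 451, 451, 451),
  ts     = c(1, 4, 3, 5, 6),
  Status = c("IN", "OUT", "IN", "OUT", "IN")
), name = "emp")

res <- emp %>%
  group_by(Emp_id) %>%
  window_order(ts) %>%          # ordering used inside the OVER clause
  window_frame(-Inf, Inf) %>%   # frame covering the whole partition
  mutate(last_status = last(Status)) %>%
  ungroup()

# Inspect the generated SQL to confirm a frame clause is now present
show_query(res)

result <- collect(res)
print(result)

DBI::dbDisconnect(con)
```

Compared with arrange(), window_order() only affects the ordering inside the window function, so an arrange() can still be added at the end if the final output needs to be sorted.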