我希望在一个数据框中选择行,data.1
,部分匹配第二个数据帧中的行,keep.these
,得到desired.result
。我在这里发现了几个基于一列匹配的问题,但我想在三列上匹配:STATE
, COUNTY
and CITY
。到目前为止,我已经提出了三种解决方案,但似乎没有一个是理想的。
请注意,每一行都包含一个唯一的组合STATE
, COUNTY
and CITY
在我的真实数据中。
当我使用merge
我必须重新order
。功能match
似乎可行,但我对它不熟悉,不知道我对这个功能的使用是否符合预期。这apply
下面的解决方案显然太复杂了。
The merge
如果我不必对结果重新排序,这种方法将是理想的。对于大型数据集,重新排序可能非常耗时。这match
如果有人可以确认这是一个合理的方法,那么该方法似乎还可以。
有没有更好的解决方案,最好是在基地R
?
data.1 <- read.table(text= "
CITY COUNTY STATE AA
1 1 1 2
2 1 1 4
1 2 1 6
2 2 1 8
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 3 200
2 1 3 400
1 2 3 600
2 2 3 800
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
1 1 5 20000
2 1 5 40000
1 2 5 60000
2 2 5 80000
", header=TRUE, na.strings=NA)
keep.these <- read.table(text= "
CITY COUNTY STATE BB
1 1 2 -10
2 1 2 -11
1 2 2 -12
2 2 2 -13
1 1 4 -14
2 1 4 -15
1 2 4 -16
2 2 4 -17
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
CITY COUNTY STATE AA
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
", header=TRUE, na.strings=NA)
##########
# this works, but I need to reorder
new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))
new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]
rownames(desired.result) <- NULL
rownames(new.data.a) <- NULL
all.equal(desired.result, new.data.a)
##########
# this seems to work, but match is unfamiliar
new.data.2 <- data.1[match(data.1$CITY , keep.these$CITY , nomatch=0) &
match(data.1$STATE , keep.these$STATE , nomatch=0) &
match(data.1$COUNTY, keep.these$COUNTY, nomatch=0),]
rownames(desired.result) <- NULL
rownames(new.data.2) <- NULL
all.equal(desired.result, new.data.2)
##########
# this works, but is too complex
data.1b <- data.frame(my.group = apply( data.1[,1:3], 1, paste, collapse = "."), data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)
data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]
rownames(desired.result) <- NULL
rownames(data.1b) <- NULL
all.equal(desired.result, data.1b)
##########