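Your data is in "long format" (multiple rows per company, source, year, ...).
You want to total amount.inkg for each company and year, for several sets of source values. Specifically, you want to aggregate with a condition on the "source" field.
Again, please give us a reproducible example. (Thanks, josilber.)
Here is a four-liner using either split-apply-combine (ddply) or logical indexing: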
df = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                years.raw = c(1, 1, 1, 1, 2, 2),
                source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                amount.inkg = c(5, 2, 10, 15, 14, 18))
# OPTION 1. Split-Apply-Combine: ddply(...summarize) with a conditional on the data
require(plyr) # dplyr if performance on large d.f. becomes an issue
ddply(df, .(company.raw, years.raw), summarize,
      amount.vector1 = sum(amount.inkg[source %in% c('Tea','Coffee')]),
      amount.vector2 = sum(amount.inkg[source %in% c('Ink','Printer')]),
      amount.vector3 = sum(amount.inkg[source %in% c('Recycling','Combusted')])
)
# OPTION 2. sum with logical indexing on the df:
# (This is from before you modified the question to one-row-per-company-and-per-year)
df$amount.vector1 <- sum( df[(df$source %in% c('Tea','Coffee')),]$amount.inkg )
# josilber clarifies you want one-row-per-company
...
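For the one-row-per-company-and-year layout, the same logical indexing can be applied inside each group using only base R. The following is a minimal sketch, not the code elided above: the helper name sum_by_source is hypothetical, and the toy data frame is repeated so the block runs on its own.

```r
# Same toy data as in the answer
df <- data.frame(
  company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
  years.raw   = c(1, 1, 1, 1, 2, 2),
  source      = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
  amount.inkg = c(5, 2, 10, 15, 14, 18)
)

# Split into one piece per (company, year) combination that actually occurs
pieces <- split(df, list(df$company.raw, df$years.raw), drop = TRUE)

# Hypothetical helper: conditional sums for a single (company, year) piece
sum_by_source <- function(d) {
  data.frame(
    company.raw    = d$company.raw[1],
    years.raw      = d$years.raw[1],
    amount.vector1 = sum(d$amount.inkg[d$source %in% c("Tea", "Coffee")]),
    amount.vector2 = sum(d$amount.inkg[d$source %in% c("Ink", "Printer")]),
    amount.vector3 = sum(d$amount.inkg[d$source %in% c("Recycling", "Combusted")])
  )
}

# Apply the helper to every piece and combine into one data frame
result <- do.call(rbind, lapply(pieces, sum_by_source))
rownames(result) <- NULL
print(result)
```

Groups with no matching source get a total of 0 rather than being dropped, because the sum of an empty selection is 0.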
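OPTION 3. You can also use aggregate (manual page: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/aggregate.html) together with subset(...), although it is overkill for a plain sum.
sub <- subset(df, source %in% c('Tea','Coffee'))
aggregate(sub$amount.inkg, by = list(company.raw = sub$company.raw, years.raw = sub$years.raw), FUN = sum)
The by argument of aggregate must be a list and defines the groups (here company and year); the conditional selection itself is done by subset(...).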
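One thing to watch with the subset approach is that company/year groups with no matching source disappear from the result. A minimal sketch of the difference, using the formula interface of aggregate; the column name tea.coffee is an assumption introduced only for this illustration:

```r
df <- data.frame(
  company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
  years.raw   = c(1, 1, 1, 1, 2, 2),
  source      = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
  amount.inkg = c(5, 2, 10, 15, 14, 18)
)

# Subset first, then aggregate: groups without Tea/Coffee rows are dropped
dropped <- aggregate(amount.inkg ~ company.raw + years.raw,
                     data = subset(df, source %in% c("Tea", "Coffee")),
                     FUN = sum)

# Zero out non-matching rows instead, so every group survives with a total
df$tea.coffee <- ifelse(df$source %in% c("Tea", "Coffee"), df$amount.inkg, 0)
kept <- aggregate(tea.coffee ~ company.raw + years.raw, data = df, FUN = sum)

print(dropped)  # C1 / year 1 is missing here
print(kept)     # C1 / year 1 is present with a total of 0
```

Which behaviour is preferable depends on whether an absent group should read as "zero" or as "no data".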
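Note: %in% does a lookup of each element in the value set (it is built on match()), so if your vectors and data frame get large, or for scalability, you may want to break it into explicit vectorized boolean comparisons and benchmark the two: (source=='Tea' | source=='Coffee')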
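The two forms select the same rows on clean data, but they differ when the column contains NA, which is worth knowing before swapping one for the other. A small sketch; the vector names src and with.na are made up for the illustration:

```r
src <- c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea")

# Membership test versus explicit comparisons
by.in <- src %in% c("Tea", "Coffee")
by.or <- src == "Tea" | src == "Coffee"
print(identical(by.in, by.or))  # TRUE when there are no missing values

# With a missing value, %in% yields FALSE but == propagates NA
with.na <- c("Tea", NA)
in.na <- with.na %in% c("Tea", "Coffee")         # TRUE FALSE
or.na <- with.na == "Tea" | with.na == "Coffee"  # TRUE NA
print(in.na)
print(or.na)
```

An NA in a logical index produces an NA row when subsetting, so the == form may need an extra !is.na(...) guard.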
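As for guarding against NA sums when a subset is empty: sum(c()) = 0, so there is no need to worry about that case. If NAs do turn up, either use na.omit or apply ifelse(is.na(x),0,x) to the final result.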
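A short sketch of these cases, using made-up values. The na.rm = TRUE argument of sum is not mentioned above; it is included as a common alternative to na.omit.

```r
amount <- c(5, NA, 10)

# An empty selection sums to 0, not NA
empty.total <- sum(amount[FALSE])

# A missing value in the input makes the plain sum NA
plain.total <- sum(amount)

# Two ways to ignore missing inputs
omit.total <- sum(na.omit(amount))
narm.total <- sum(amount, na.rm = TRUE)

# Replacing NA in a vector of final results with 0
totals <- c(7, NA, 3)
cleaned <- ifelse(is.na(totals), 0, totals)

print(c(empty.total, plain.total, omit.total, narm.total))
print(cleaned)
```

Note that replacing NA with 0 in the final result hides the fact that data was missing, so it is only appropriate when a missing total really should count as zero.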