Get the names of zero-variance columns using dplyr


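I'm trying to find any variables in my data that have zero variance (i.e. constant continuous variables). I figured out how to do it with lapply, but I would like to use dplyr because I'm trying to follow tidy data principles. I can create a vector containing just the variances using dplyr, but it's the last step, where I find the values not equal to zero and return the variable names, that is confusing me.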


library(readr)  # read_csv()
library(dplyr)  # %>%, select(), summarise_all(), filter_if()

# read in data from analysis script
df <- read_csv("./hybrid.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   Exp = col_character(),
#>   Pedi = col_character(),
#>   Harvest = col_character()
#> )
#> See spec(...) for full column specifications.

# checking for missing values
# df %>% 
#     select_if(function(x) any(is.na(x))) %>% 
#     summarise_all(funs(sum(is.na(.))))

# grab month for analysis
may <- df %>% 
june <- df %>% 
july <- df %>% 
aug <- df %>% 
sept <- df %>% 
oct <- df %>% 

# check for zero variance in continuous covariates
numericVars <- grep("Min|Max",names(june))

zero <- which(lapply(june[numericVars],var)==0,useNames = TRUE)

noVar <- june %>% 
    select(numericVars) %>% 
    summarise_all(var) %>% 
    filter_if(all, all_vars(. != 0))

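With a reproducible example, I think what you are aiming for is below. Please note that, as pointed out by Colin, I have not dealt with the issue of selecting variables with a character variable. See his answer for details on that.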

# reproducible data
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7


library(dplyr)

mtcars2 %>% 
  summarise_all(var) %>% 
  select_if(function(.) . == 0) %>% 
  names()
# [1] "mpg"  "qsec"
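
As an aside, the same idea can probably be written without the intermediate summary row. This is a minimal sketch that assumes dplyr 1.0.0 or later, where `select()` accepts `where()`; the helper `is_zero_var` is a hypothetical name, not part of dplyr or of the original answer. It also skips non-numeric columns and treats an `NA` variance as "not zero", which the pipeline above does not do.

```r
library(dplyr)

# Reproducible data: two columns forced to a constant value.
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# Hypothetical helper: TRUE only for numeric columns whose variance is
# exactly zero. isTRUE() turns an NA variance into FALSE.
is_zero_var <- function(x) is.numeric(x) && isTRUE(var(x, na.rm = TRUE) == 0)

# Keep only the columns that satisfy the predicate, then read off their names.
zero_var_names <- mtcars2 %>%
  select(where(is_zero_var)) %>%
  names()

zero_var_names
# [1] "mpg"  "qsec"
```

Keeping the test in a named predicate makes the intent readable at the call site, at the cost of one extra definition.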

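Personally, I think the dplyr pipeline obfuscates what you are doing. One of the following, using the purrr package (if you wish to remain in the tidyverse), would be my preference, together with a well-written comment.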


library(purrr)

# Return a character vector of variable names which have 0 variance
names(mtcars2)[which(map_dbl(mtcars2, var) == 0)]
names(mtcars2)[map_lgl(mtcars2, function(x) var(x) == 0)]

If you'd like to optimize for speed, stick with base R, though:

# Return a character vector of variable names which have 0 variance
names(mtcars2)[vapply(mtcars2, function(x) var(x) == 0, logical(1))]
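
To tie the base R version back to the question's `grep("Min|Max", ...)` selection, here is a rough sketch. The data frame below is made-up toy data standing in for `june` (the column names `TempMin`, `TempMax`, `RainMax`, and `Pedi` are assumptions, not the real columns of `hybrid.csv`), and the `NA` handling is an addition that the original answer does not include.

```r
# Hypothetical toy data standing in for the question's `june` tibble.
june <- data.frame(
  TempMin = c(1.5, 2.0, 3.5),
  TempMax = c(9, 9, 9),        # constant, so variance is 0
  RainMax = c(0, 0, 0),        # constant, so variance is 0
  Pedi    = c("a", "b", "c"),  # character column, excluded by the name pattern
  stringsAsFactors = FALSE
)

# Restrict to columns whose names match the pattern, as in the question.
# value = TRUE returns the names themselves rather than their positions.
numeric_vars <- grep("Min|Max", names(june), value = TRUE)

# vapply() enforces exactly one logical per column;
# isTRUE() treats an NA variance as FALSE.
is_constant <- vapply(
  june[numeric_vars],
  function(x) isTRUE(var(x, na.rm = TRUE) == 0),
  logical(1)
)

zero_var <- numeric_vars[is_constant]
zero_var
# [1] "TempMax" "RainMax"
```

Selecting by name first keeps `var()` away from character columns, where it would otherwise raise an error.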

