如果没有的话我真的不会做这件事lot合理性,因为人们通常期望箱线图的最小/最大/箱值对应于相同的分位数位置,但这是可以做到的。
使用的数据(添加极值以显示异常值):
set.seed(12)
u <- data.frame(A = sample.int(4, 100, replace = TRUE), B = rnorm(100))
u$B[c(30, 70, 76)] <- c(4, -4, -5)
解决方案1:您可以预先计算值,而无需通过基本 R 路线,并在同一步骤中包括异常值的计算。我会完全在 Hadley 的 tidyverse 库中完成它,我发现它更整洁:
library(dplyr)
library(tidyr)
u %>%
group_by(A) %>%
summarise(lower = quantile(B, qlower),
upper = quantile(B, qupper),
middle = quantile(B, qmiddle),
IQR = diff(c(lower, upper)),
ymin = max(quantile(B, qymin), lower - 1.5 * IQR),
ymax = min(quantile(B, qymax), upper + 1.5 * IQR),
outliers = list(B[which(B > upper + 1.5 * IQR |
B < lower - 1.5 * IQR)])) %>%
ungroup() %>%
ggplot(aes(x = A)) +
geom_boxplot(aes(lower = lower, upper = upper,
middle = middle, ymin = ymin, ymax = ymax ),
stat="identity") +
geom_point(data = . %>%
filter(sapply(outliers, length) > 0) %>%
select(A, outliers) %>%
unnest(),
aes(y = unlist(outliers)))
解决方案2:您可以覆盖 ggplot 使用的实际分位数规范。计算为geom_boxplot()
的分位数实际上是在StatBoxplot
's compute_group()
函数,发现here https://github.com/cran/ggplot2/blob/master/R/stat-boxplot.r:
compute_group = function(data, scales, width = NULL, na.rm = FALSE, coef = 1.5) {
qs <- c(0, 0.25, 0.5, 0.75, 1)
if (!is.null(data$weight)) {
mod <- quantreg::rq(y ~ 1, weights = weight, data = data, tau = qs)
stats <- as.numeric(stats::coef(mod))
} else {
stats <- as.numeric(stats::quantile(data$y, qs))
}
... (omitted for space)
The qs
矢量定义分位数位置。它不受传递给的参数的影响compute_group()
,所以改变它的唯一方法是改变定义compute_group()
itself:
# save a copy of the original function, in case you need to revert
original.function <- environment(ggplot2::StatBoxplot$compute_group)$f
# define new function (only the first line for qs is changed, but you'll have to
# copy & paste the whole thing)
new.function <- function (data, scales, width = NULL, na.rm = FALSE, coef = 1.5) {
qs <- c(0.1, 0.2, 0.5, 0.8, 0.9)
if (!is.null(data$weight)) {
mod <- quantreg::rq(y ~ 1, weights = weight, data = data,
tau = qs)
stats <- as.numeric(stats::coef(mod))
}
else {
stats <- as.numeric(stats::quantile(data$y, qs))
}
names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
iqr <- diff(stats[c(2, 4)])
outliers <- data$y < (stats[2] - coef * iqr) | data$y > (stats[4] +
coef * iqr)
if (any(outliers)) {
stats[c(1, 5)] <- range(c(stats[2:4], data$y[!outliers]),
na.rm = TRUE)
}
if (length(unique(data$x)) > 1)
width <- diff(range(data$x)) * 0.9
df <- as.data.frame(as.list(stats))
df$outliers <- list(data$y[outliers])
if (is.null(data$weight)) {
n <- sum(!is.na(data$y))
}
else {
n <- sum(data$weight[!is.na(data$y) & !is.na(data$weight)])
}
df$notchupper <- df$middle + 1.58 * iqr/sqrt(n)
df$notchlower <- df$middle - 1.58 * iqr/sqrt(n)
df$x <- if (is.factor(data$x))
data$x[1]
else mean(range(data$x))
df$width <- width
df$relvarwidth <- sqrt(n)
df
}
Result:
# toggle between the two definitions
environment(StatBoxplot$compute_group)$f <- original.function
ggplot(u, aes(x = A, y = B, group = A)) +
geom_boxplot() +
ggtitle("original definition for calculated quantiles")
environment(StatBoxplot$compute_group)$f <- new.function
ggplot(u, aes(x = A, y = B, group = A)) +
geom_boxplot() +
ggtitle("new definition for calculated quantiles")
请注意,当您更改定义时,它会影响环境中的每个 ggplot 对象。因此,如果您创建了 ggplot boxplot 对象before定义更改,并将其打印出来然后,箱线图将遵循新的定义。 (对于上面的并排比较,我必须立即将每个 ggplot 转换为 grob 对象,以保留差异。)