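I am looking for a simple way to vectorize a for loop in R. I have the following data frame with sentences and two dictionaries of positive and negative words: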

# Create data.frame with sentences
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
                         "wouldnt bad notebook", "very good quality", "orgtop",
                         "great improvement for that bad product but overall is not good", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
                         stringsAsFactors = FALSE)

# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
          "extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
          "wouldnt bad")
negWords <- c("hate","bad","not good","horrible")


# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000,sent$words))
library(zoo) # provides coredata()
sent <- coredata(sent)[rep(seq(nrow(sent)),100000),]
rownames(sent) <- NULL

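In the next step, I have to sort the dictionary words, together with their sentiment scores (positive word = 1, negative word = -1), by length in descending order.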

# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
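
The ordering matters because multi-word entries have to be matched and removed before the shorter words they contain. A minimal sketch of that idea, assuming a hypothetical two-entry dictionary (`demo` is illustrative only and not part of the data above):

```r
# Hypothetical mini-dictionary: "not good" (-1) contains "good" (+1)
demo <- data.frame(words = c("good", "not good"), value = c(1, -1),
                   stringsAsFactors = FALSE)
demo$lengths <- nchar(demo$words)        # nchar() is already vectorized
demo <- demo[order(-demo$lengths), ]     # longest entry first

sentence <- "notebook is not good"
score <- 0
for (i in seq_len(nrow(demo))) {
  pattern <- paste0("\\b", demo$words[i], "\\b")
  if (grepl(pattern, sentence)) {
    score <- score + demo$value[i]
    # remove the match so that "good" is not counted a second time
    sentence <- gsub(pattern, " ", sentence)
  }
}
score
```

With the longest entry first, "not good" is consumed before "good" is tested, so the sentence scores -1. With the order reversed, "good" would match first and the sentence would score +1.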

Then I define the following function with a for loop:

# Sentiment score function
library(qdapRegex) # provides rm_white()
scoreSentence2 <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
    score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
    sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), ' ', sentence) # remove matched words from wordsDF
    sentence <- rm_white(sentence) # collapse repeated whitespace
  }
  score
}


# Apply scoreSentence function to sentences
SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))
# Time consumption for 700,000 sentences in sent data.frame:
# user       system    elapsed
# 1054.19    0.09      1056.17
# Add sentiment score to the original sent data.frame
sent <- cbind(sent, SentimentScore2)


Words                                             user      SentimentScore2
just right size and i love this notebook          1         2
benefits great laptop                             2         1
wouldnt bad notebook                              3         1
very good quality                                 4         1
orgtop                                            5         0


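Could anyone please help me reduce the computing time of my original approach? I have reached the limit of my beginner R programming skills :-) Any help or advice would be greatly appreciated. Thank you very much in advance.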


  1. Make a copy of your code: you are going to mess it up!

  2. Find the bottleneck:


    nRep  <- 100
    df.expanded <- as.data.frame(replicate(nRep,sent$words))
    sent <- coredata(sent)[rep(seq(nrow(sent)),nRep),]


    sentRef <- sent




    SentimentScore2 <- unlist(lapply(sent$words, scoreSentence2))

    1d: Look at the results, using base R:
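
A minimal sketch of base R profiling with Rprof() and summaryRprof(). Here `slow_fn` is a hypothetical stand-in for scoreSentence2, and the input is made up for illustration; it is not the code or data from the question:

```r
# Hypothetical stand-in for scoreSentence2: rebuilds its patterns on every call
slow_fn <- function(s) {
  for (w in c("great", "bad", "good")) {
    pattern <- paste0("\\b", w, "\\b")
    s <- gsub(pattern, " ", s)
  }
  s
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                 # start the sampling profiler
res <- unlist(lapply(rep("great but bad notebook", 20000), slow_fn))
Rprof(NULL)                      # stop profiling

# Time spent per function, sorted by self time
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

The `by.self` table lists the functions in which the time was actually spent, which is usually enough to spot a call such as paste() or gsub() dominating the run.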


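    There are better tools; you can check the packages profileR or lineprof.

    lineprof is my tool of choice. It adds real value here, narrowing the problem down to these two lines: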

    matchWords <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(matchWords,sentence)) # count them
  3. Fix it.

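    3.1 Fortunately, the main issue is fairly simple: the first line does not need to be inside the function, so just move it in front of it. By the way, the same applies to your paste0(). Your code becomes: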

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- length(grep(matchWords[x],sentence)) # count them
            score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
            sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
            # sentence <- rm_white(sentence)
        }
        score
    }

    That changes the execution time for 1000 repetitions from
    5.64 s to 2.32 s. A nice return on investment!

    3.2 The next bottleneck is the "count" line, where length(grep()) can be replaced with grepl():

    matchWords <- paste("\\<",wordsDF[,1],'\\>', sep="") # matching exact words
    matchedWords <- paste0('\\s*\\b', wordsDF[,1], '\\b\\s*')
    # Sentiment score function
    scoreSentence2 <- function(sentence){
        score <- 0
        for(x in 1:nrow(wordsDF)){
            count <- grepl(matchWords[x],sentence) # TRUE/FALSE: does the word occur?
            score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
            sentence <- gsub(matchedWords[x],' ', sentence) # remove matched words from wordsDF
            # sentence <- rm_white(sentence)
        }
        score
    }

That brings it down to 0.18 s, or 31 times faster...
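
A possible further step, not included in the timings above: since grepl() and gsub() are vectorized over their input, the loop over dictionary words can process all sentences at once instead of being called once per sentence through lapply(). The following is only a sketch under that assumption; `dict`, `sentences`, `countPattern`, `removePattern` and `scoreAll` are hypothetical names, and the data is a small subset of the question's example:

```r
# Small self-contained data (subset of the question's example)
sentences <- c("just right size and i love this notebook",
               "great improvement for that bad product but overall is not good",
               "orgtop")
dict <- data.frame(words = c("great improvement", "not good", "great",
                             "right", "love", "good", "bad"),
                   value = c(1, -1, 1, 1, 1, 1, -1),
                   stringsAsFactors = FALSE)
dict <- dict[order(-nchar(dict$words)), ]   # longest entries first

# Patterns are built once, outside the function
countPattern  <- paste0("\\<", dict$words, "\\>")
removePattern <- paste0("\\s*\\b", dict$words, "\\b\\s*")

scoreAll <- function(sentences) {
  score <- numeric(length(sentences))
  for (i in seq_len(nrow(dict))) {
    hit <- grepl(countPattern[i], sentences)   # one TRUE/FALSE per sentence
    score <- score + hit * dict$value[i]
    sentences <- gsub(removePattern[i], " ", sentences)  # drop matched words
  }
  score
}

scoreAll(sentences)   # 2 -1 0
```

The loop now runs once per dictionary entry rather than once per sentence and entry, so its cost in R-level iterations no longer grows with the number of sentences. It is worth comparing the output against the reference result kept in sentRef before trusting it.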


