Split vectors of words every n words (vectors in a list)

2024-04-02

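What is the best way to split vectors of words that are stored in a list? This is what I'm currently doing (thanks to geektrader's answer here: https://stackoverflow.com/a/15832050/1036500), but it makes RStudio stutter and freeze quite a lot. This question is closely related to my previous one. What is new here is the list structure, which is closer to my actual use case.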

# reproducible data
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem" 
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following:  Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. 
Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."

# make a big list of character vectors containing words
list_examps <- list(examp1, examp2, examp3, examp4, examp5)
list_examps <- rep(list_examps, 2000)

# my current method
n <- 30 # number of words in each chunk
temp1 <- vector("list", length(list_examps))
temp2 <- vector("list", length(list_examps))
for (i in seq_along(list_examps))
{
  temp1[[i]] <- unlist(strsplit(list_examps[[i]], " "))
  # subtract 1 so the first chunk holds n words rather than n - 1
  temp2[[i]] <- split(unlist(strsplit(temp1[[i]], " ")),
                      (seq_along(unlist(strsplit(temp1[[i]], " "))) - 1) %/% n)
}
listofnwords <- unlist(temp2, recursive = FALSE) # desired output

Is there a more efficient way to do this?

UPDATE 1: adding the (fairly modest) machine specs

 > sessionInfo()
    R version 3.0.0 (2013-04-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
    [5] LC_TIME=English_United States.1252    

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    loaded via a namespace (and not attached):
    [1] tools_3.0.0

    > gc()
              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  434928 23.3     818163 43.7   667722 35.7
    Vcells 7086406 54.1   12291671 93.8 12291671 93.8

    > Sys.getenv()  # relevant excerpts
    NUMBER_OF_PROCESSORS    "4"
    OS                      "Windows_NT"
    PROCESSOR_ARCHITECTURE  "AMD64"
    PROCESSOR_IDENTIFIER    "Intel64 Family 6 Model 42 Stepping 7, GenuineIntel"
    PROCESSOR_LEVEL         "6"
    PROCESSOR_REVISION      "2a07"

UPDATE 2: a speed test of the answers, using the data above on an EC2 instance (free tier). The winner is... SimonO101! Thanks to everyone for the help.

# make a big list of character vectors containing words
list_examps <- list(examp1, examp2, examp3, examp4, examp5)
list_examps <- rep(list_examps, 200)

# my current method
n <- 30 # number of words in each chunk

me <- function(list_examps){
  # allocate the temporary lists inside the function instead of relying on globals
  temp1 <- vector("list", length(list_examps))
  temp2 <- vector("list", length(list_examps))
  for (i in seq_along(list_examps))
  {
    temp1[[i]] <- unlist(strsplit(list_examps[[i]], " "))
    # subtract 1 so the first chunk holds n words rather than n - 1
    temp2[[i]] <- split(unlist(strsplit(temp1[[i]], " ")),
                        (seq_along(unlist(strsplit(temp1[[i]], " "))) - 1) %/% n)
  }
  unlist(temp2, recursive = FALSE) # desired output
}

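dr <- function(list_examps){
  f <- function(txt) {
    y <- unlist(strsplit(txt, " "))
    ly <- length(y)
    # ceiling(ly / n) levels avoids an empty trailing chunk when ly is a multiple of n
    split(y, gl(ceiling(ly / n), n, ly))
  }

  # lapply always returns a list, whereas sapply may simplify the result
  listofnwords <- lapply(list_examps, f)
  unlist(listofnwords, recursive = FALSE)
}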

si <- function( list_examps ){

  words <- unlist( lapply( list_examps , strsplit , " " ) )
  # stop at the last word so the final chunk is not padded with NA
  starts <- seq( 0, length(words) - 1 , by = n )
  lapply( starts , function(x) words[ (x+1):min(x+n, length(words)) ] )

}

er <- function( x ){
  x <- do.call(paste, x)
  x <- strsplit(x, ' ')[[1]]
  # end the breaks at length(x) so the words after the last multiple of n are kept
  breaks <- unique(c(seq(0, length(x), by = n), length(x)))
  split(x, cut(seq_along(x), breaks = breaks))
}


library(rbenchmark)

benchmark(
  me(list_examps),
  dr(list_examps),
  si(list_examps),
  er(list_examps),
  replications = 10)

The results were as follows:

             test replications elapsed relative user.self sys.self user.child sys.child
2 dr(list_examps)           10  48.104    1.119    47.907    0.000          0         0
4 er(list_examps)           10  71.316    1.660    70.645    0.568          0         0
1 me(list_examps)           10  48.156    1.121    48.543    0.000          0         0
3 si(list_examps)           10  42.971    1.000    42.875    0.000          0         0
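
When reading these timings, it may help to keep in mind that the four functions do not chunk in quite the same way: me() and dr() split each list element separately, while si() and er() first flatten everything into one long word vector. A minimal sketch of the difference on toy data (the names toy, n_toy, per_element and flattened are illustrative and not part of the original code):

```r
# Toy input standing in for list_examps; a chunk size of 2 keeps the output small.
toy <- list("a b c d e", "f g h")
n_toy <- 2

# Per-element chunking: each string is split into words and chunked on its own,
# so a chunk never spans two list elements.
per_element <- unlist(lapply(toy, function(txt) {
  w <- strsplit(txt, " ", fixed = TRUE)[[1]]
  unname(split(w, ceiling(seq_along(w) / n_toy)))
}), recursive = FALSE)

# Flattened chunking: all words are pooled first, so chunks can cross
# the boundary between list elements.
all_words <- unlist(strsplit(unlist(toy), " ", fixed = TRUE))
flattened <- unname(split(all_words, ceiling(seq_along(all_words) / n_toy)))

length(per_element)  # 5 chunks: "a b", "c d", "e", "f g", "h"
length(flattened)    # 4 chunks: "a b", "c d", "e f", "g h"
```

Which behaviour is wanted depends on whether a chunk is allowed to mix words from different documents.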

How about this?

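x <- do.call(paste, list_examps)
x <- strsplit(x, ' ')[[1]]
# end the breaks at length(x) so the words after the last multiple of 30 are kept
breaks <- unique(c(seq(0, length(x), by = 30), length(x)))
result <- split(x, cut(seq_along(x), breaks = breaks))
result[[1]]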
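
Whichever variant is used, it may be worth confirming that no words are lost or padded along the way. A minimal sketch of such a check on toy data, grouping with ceiling(seq_along(words) / n_toy) rather than cut(); the names words, n_toy, chunks, all_kept and sizes are illustrative only:

```r
# Toy word vector; 7 words with a chunk size of 3 leaves a short final chunk.
words <- strsplit("a b c d e f g", " ", fixed = TRUE)[[1]]
n_toy <- 3

# Words 1..3 fall in group 1, words 4..6 in group 2, word 7 in group 3.
chunks <- split(words, ceiling(seq_along(words) / n_toy))

# Gluing the chunks back together should reproduce the input exactly.
all_kept <- identical(unlist(chunks, use.names = FALSE), words)

# No chunk should hold more than n_toy words.
sizes <- vapply(chunks, length, integer(1))

stopifnot(all_kept, all(sizes <= n_toy))
```

The same two checks can be pointed at the output of any of the functions above, using the flattened word vector as the reference.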