Split a vector of words every n words (vectors are in a list)

2024-04-02

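What is the best way to split the word vectors held in a list? This is what I am currently doing (thanks to geektrader's answer here: https://stackoverflow.com/a/15832050/1036500), but it makes RStudio shudder and freeze quite a lot. This question is closely related to my previous one. What is new here is the list structure, which is closer to my actual use case.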

# reproducible data
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem" 
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following:  Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. 
Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."

# make a big list of character vectors containing words
list_examps <- lapply(1:5, function(i) eval(parse(text=paste0("examp",i))))
list_examps <- rep(list_examps, 2000)

# my current method
n <- 30 # number of words in each chunk
temp1 <- vector("list", length(list_examps))
temp2 <- vector("list", length(list_examps))
for(i in 1:length(list_examps))
{
  temp1[[i]] <- unlist(strsplit(list_examps[[i]], " "))
  temp2[[i]] <- split(unlist(strsplit(temp1[[i]] , " ")),
                      seq_along(unlist(strsplit(temp1[[i]], " ")))%/%n)
}
listofnwords <- unlist(temp2, recursive = FALSE) # desired output

Is there a more efficient way to do this?
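Two details of the loop above are worth noting. It calls `strsplit()` three times per element although one call is enough, and `seq_along(x) %/% n` puts only n - 1 words in the first chunk, because indices 1 to 29 map to group 0. The following is a minimal sketch of a tighter version, not a benchmarked answer; `chunk_words` and `toy` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not from the original post): split one string into
# chunks of at most n words, calling strsplit() only once.
chunk_words <- function(txt, n = 30) {
  w <- strsplit(txt, " ", fixed = TRUE)[[1]]
  # (i - 1) %/% n gives 0 for words 1..n, 1 for words n+1..2n, and so on,
  # so every chunk except possibly the last holds exactly n words.
  split(w, (seq_along(w) - 1) %/% n)
}

# Small stand-in for list_examps
toy <- list("a b c d e f g", "h i j k")

# Chunk each element, then flatten one level into a single list of chunks
chunks <- unlist(lapply(toy, chunk_words, n = 3),
                 recursive = FALSE, use.names = FALSE)

# Number of words in each chunk
chunk_sizes <- sapply(chunks, length)
print(chunk_sizes)
```

Chunks here never cross the boundary between two list elements, which matches the behaviour of the loop in the question. Whether it is actually faster on the full `list_examps` would need to be measured.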

UPDATE 1: adding my (fairly modest) machine specs

 > sessionInfo()
    R version 3.0.0 (2013-04-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
    [5] LC_TIME=English_United States.1252    

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    loaded via a namespace (and not attached):
    [1] tools_3.0.0

    > gc()
              used (Mb) gc trigger (Mb) max used (Mb)
    Ncells  434928 23.3     818163 43.7   667722 35.7
    Vcells 7086406 54.1   12291671 93.8 12291671 93.8

    > Sys.getenv()  # relevant excerpts
    NUMBER_OF_PROCESSORS     "4"
    OS                       "Windows_NT"
    PROCESSOR_ARCHITECTURE   "AMD64"
    PROCESSOR_IDENTIFIER     "Intel64 Family 6 Model 42 Stepping 7, GenuineIntel"
    PROCESSOR_LEVEL          "6"
    PROCESSOR_REVISION       "2a07"

UPDATE 2: a speed test of the answers, using the data above on an EC2 instance (free tier). The winner is... SimonO101! Thanks to everyone for the help.

# make a big list of character vectors containing words
list_examps <- lapply(1:5, function(i) eval(parse(text=paste0("examp",i))))
list_examps <- rep(list_examps, 200)

# my current method
n <- 30 # number of words in each chunk
temp1 <- vector("list", length(list_examps))
temp2 <- vector("list", length(list_examps))

me <- function(list_examps){
  for(i in 1:length(list_examps))
  {
    temp1[[i]] <- unlist(strsplit(list_examps[[i]], " "))
    temp2[[i]] <- split(unlist(strsplit(temp1[[i]] , " ")),
                        seq_along(unlist(strsplit(temp1[[i]], " ")))%/%n)
  }
  listofnwords <- unlist(temp2, recursive = FALSE) # desired output
}

dr <- function(list_examps){
  f <- function(list_examps) {
    y <- unlist(strsplit(list_examps, " "))
    ly <- length(y)
    split(y, gl(ly%/%n+1, n, ly))
  }

  listofnwords <- sapply(list_examps, f)
  listofnwords <- unlist(listofnwords, recursive = FALSE)
}

si <- function( list_examps ){

  words <- unlist( ( sapply( list_examps , strsplit , " " ) ) )
  results <- lapply( seq( 0, length(words) , by = n ) , function(x) c( words[(x+1):(x+n)] ) )

}

er <- function( x ){
  x <- do.call(paste, x)
  x <- strsplit(x, ' ')[[1]]
  result <- split(x, cut(seq_along(x), breaks = seq(0, by = n, length(x)) , include.lowest = TRUE ) )
}


library(rbenchmark)

benchmark(
  me(list_examps),
  dr(list_examps),
  si(list_examps),
  er(list_examps),
  replications = 10)

The results are as follows:

             test replications elapsed relative user.self sys.self user.child sys.child
2 dr(list_examps)           10  48.104    1.119    47.907    0.000          0         0
4 er(list_examps)           10  71.316    1.660    70.645    0.568          0         0
1 me(list_examps)           10  48.156    1.121    48.543    0.000          0         0
3 si(list_examps)           10  42.971    1.000    42.875    0.000          0         0

How about this?

x <- do.call(paste, list_examps)
x <- strsplit(x, ' ')[[1]]
result <- split(x, cut(seq_along(x), breaks = seq(0, by = 30, length(x) ) ) )
result[[1]]
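One caveat with the `cut()` approach above, as far as I can tell: the breaks stop at the last multiple of 30 that does not exceed `length(x)`, so any trailing words fall outside every interval, receive `NA` as their group, and are dropped by `split()`. Also, `do.call(paste, ...)` joins all list elements into one string, so chunks can run across element boundaries. Below is a minimal sketch on toy input that compares the `cut()` grouping with a `ceiling()` grouping that keeps the remainder; the variable names are illustrative only.

```r
n <- 3
x <- strsplit("a b c d e f g", " ", fixed = TRUE)[[1]]

# cut() version: breaks are 0, 3, 6, so word 7 ("g") gets NA and is dropped
by_cut <- split(x, cut(seq_along(x), breaks = seq(0, length(x), by = n)))

# ceiling() version: group ids are 1,1,1,2,2,2,3, so the trailing word is kept
by_ceiling <- split(x, ceiling(seq_along(x) / n))

print(length(by_cut))
print(length(by_ceiling))
```

If dropping the incomplete final chunk is intended, the `cut()` version is fine as it stands; the `ceiling()` grouping is only a swapped-in alternative for the case where every word must be kept.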


Tags: r, list, vector, split
