How to read text in a table from a CSV file

2024-02-23

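I am new to the tm package. I want to read a CSV file into a corpus; one column holds 2000 texts and the second holds a yes/no factor variable. My goal is to convert the texts into a matrix and use the factor variable as the prediction target. I also need to split the corpus into training and test sets. I have read several documents, such as tm.pdf, and found the documentation relatively limited. Here is my attempt after following another thread on the same topic: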

TexTest <- read.csv("C:/Test.csv")
m <- list(Text = "Text", Clasification = "Classification")
corpus1 <- Corpus(x = TexTest, readerControl = list(reader = readTabular(mapping = m), language = "en"))

Error in if (x$Length > 0) vector("list", as.integer(x$Length)) else list() : 
  argument is of length zero

Using

corpus1 <- Corpus(VectorSource(TexTest))

results in

A corpus with 2 text documents

rather than 2000 text documents.
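A likely explanation for the count of 2, sketched in base R: a data frame is a list of columns, so anything that iterates over its elements sees one element per column, not one per row. The small `df` below is a hypothetical stand-in for `TexTest`.

```r
# Hypothetical stand-in for the CSV contents: one text column, one label column.
df <- data.frame(
  Text = c("first text", "second text", "third text"),
  Classification = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

print(length(df))       # number of columns (the list elements of a data frame)
print(nrow(df))         # number of rows
print(length(df$Text))  # one element per text
```

Assuming the text column is a character vector, passing it directly, as in `VectorSource(TexTest$Text)`, should give one document per row, although the labels then have to be attached separately.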

What is the standard procedure here? Thanks.

You need to use DataframeSource in the Corpus function. That is where your example differs from the one on page 4 of the tm package vignette "Extensions: How to Handle Custom File Formats" (http://cran.r-project.org/web/packages/tm/vignettes/extensions.pdf).
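A note for readers on newer versions of tm: as far as I know, `readTabular()` was removed in tm 0.7, and `DataframeSource()` now expects a data frame whose first two columns are named `doc_id` and `text`, with any further columns kept as document-level metadata. The following is a minimal sketch under that assumption; the `raw` data frame and its contents are hypothetical, and the guard simply skips the example when tm is not installed.

```r
# Assumption: tm >= 0.7. The guard skips the sketch if tm is not available.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # Hypothetical stand-in for read.csv("Test.csv", stringsAsFactors = FALSE)
  raw <- data.frame(
    Text = c("first example text", "second example text", "third example text"),
    Classification = c("Yes", "No", "Yes"),
    stringsAsFactors = FALSE
  )

  # DataframeSource expects the first two columns to be doc_id and text;
  # the remaining columns are carried along as document-level metadata.
  docs <- data.frame(
    doc_id = as.character(seq_len(nrow(raw))),
    text = raw$Text,
    Classification = raw$Classification,
    stringsAsFactors = FALSE
  )

  corpus <- VCorpus(DataframeSource(docs))
  print(length(corpus))               # one document per row
  print(meta(corpus)$Classification)  # labels stored as metadata

  # Turn the texts into a document-term matrix (documents x terms).
  dtm <- DocumentTermMatrix(corpus)
  print(dim(dtm))
}
```

Keeping the labels as metadata means they stay aligned with the documents when the corpus is filtered or reordered.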

Some reproducible data:

TexTest <- structure(list(Text = c("When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?", 
"Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem", 
"You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system.", 
"Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation", 
"Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following:  Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. 
Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
), Classification = c("Yes", "No", "Yes", "No", "Yes")), .Names = c("Text", 
"Classification"), class = "data.frame", row.names = c(NA, -5L
))

Make a corpus containing five documents (one for each row of the CSV file):

# TexTest<-read.csv("Test.csv", stringsAsFactors = FALSE)
m <- list(Content = "Text", Topic = "Classification")
library(tm)
myReader <- readTabular(mapping = m)
(corpus <- Corpus(DataframeSource(TexTest), readerControl = list(reader = myReader)))

A corpus with 5 text documents
# as expected, one doc per row of the CSV file

corpus[[1]]

When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?

# as expected, the first row of the CSV file

Is this what you were trying to do?
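The question also asks about a training/test split, which the answer does not cover. One common approach, sketched here in base R, is to draw random row indices before building the corpus. The 80/20 proportion, the seed, and the stand-in data frame are all illustrative assumptions, not something taken from the original thread.

```r
set.seed(42)  # arbitrary seed so the split is repeatable

# Hypothetical stand-in for the 2000-row CSV
df <- data.frame(
  Text = paste("document", 1:10),
  Classification = rep(c("Yes", "No"), 5),
  stringsAsFactors = FALSE
)

# Draw 80% of the row indices for training; the remaining rows form the test set.
n <- nrow(df)
train_idx <- sample(n, size = floor(0.8 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

print(nrow(train))
print(nrow(test))
```

The same index vector can also be used to subset a corpus or a document-term matrix, for example `corpus[train_idx]` or `dtm[train_idx, ]`, so that texts and labels stay aligned.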
