How can I find the most common sequences in my data using R?

2024-03-13

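I'm trying to figure out how to use the rollapply function (from the zoo package) to find the most common sequences of strings in a dataset, but I also need to group by certain variables (e.g. date, row, etc.).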

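Before going any further, it's worth noting that this query builds on a question I posted here previously: How can I find most common sequences of strings in my data using Tableau? https://stackoverflow.com/questions/67064314/how-can-i-find-most-common-sequences-of-strings-in-my-data-using-tableau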

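The solution provided there worked really well, but I now want to apply it to a different dataset, which presents some new challenges! Here is a sample of the data I'm using in this new dataset: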

df <- structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith", 
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show", 
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death", 
"Bollywood: The World's Biggest Film Industry", "One Hot Summer", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day", 
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer", 
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock", 
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock", 
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty", 
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States", 
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends", 
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama", 
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport", 
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy", 
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary", 
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport", 
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi", 
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary", 
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama", 
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary", 
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured", 
"Featured", "Featured", "Featured", "This Weekend's Football", 
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured", 
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10", 
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured", 
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", 
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux", 
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1", 
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", 
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", 
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", 
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", 
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2", 
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2", 
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5", 
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", 
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA, 
-56L), class = "data.frame") 

Apologies, I'm not really sure of the best practice for sharing data. Hopefully the above works. It should look something like this:

   Title             Programme_Genre   Programme_Category  date        column  row
1  Dragons' Den      Entertainment     Featured            13/08/2018  1       1
2  One Hot Summer    Documentary       Featured            13/08/2018  2       1
3  Keeping Faith     Drama             Featured            13/08/2018  3       1
4  Cuckoo            New SeriesComedy  Featured            13/08/2018  4       1
5  Match of the Day  Sport             This Weekend's...   13/08/2018  1       2
6  Sportscene        Sport             This Weekend's...   13/08/2018  2       2

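What I'd like to do is use the rollapply function in a similar way to what was suggested in my previous question (see the link above), but only look for sequences that occur on the same date and within a specific range of columns. For example, I'd like to know what the most common sequences of genres ("Programme_Genre") are, but I only want rollapply to do this across columns 1-4 of each row on each date. I'm sure I haven't explained this very well (if you hadn't guessed, I don't have a data science background), so I'm more than happy to elaborate if necessary. Thanks in advance!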


Using tidyverse, zoo, and lubridate, try:

library(tidyverse)
library(zoo)
library(lubridate)

df %>% 
  mutate(date = lubridate::dmy(date)) %>% # Optional. Properly parses date as Date class. Makes sorting easier.
  filter(as.integer(column) <= 4) %>% # Step 1. Exclude observations with `column` values above 4 (`column` is stored as character, so convert it before comparing).
  group_split(row, date) %>% # Step 2. Splits the DF into smaller DFs representing row and date groups.
  # Step 3 (below). Loops the solution to the previous question, gets a DF, and assigns the date and row signals to each observation.
  map_df(.x = . ,
         .f = ~(rollapply(data = .x$Programme_Genre , 3, c) %>% 
                  as_tibble() %>% 
                  mutate(date = unique(.x$date), row = unique(.x$row)))) %>% 
  group_by_all() %>% 
  tally() %>% 
  arrange(date, row, n)

# A tibble: 26 x 6
# Groups:   V1, V2, V3, date [26]
   V1            V2            V3               date       row       n
   <chr>         <chr>         <chr>            <date>     <chr> <int>
 1 Documentary   Drama         New SeriesComedy 2018-08-13 1         1
 2 Entertainment Documentary   Drama            2018-08-13 1         1
 3 Sport         Sport         Sport            2018-08-13 2         2
 4 Drama         Entertainment Documentary      2018-08-13 3         1
 5 Sport         Drama         Entertainment    2018-08-13 3         1
 6 Comedy        Drama         Comedy           2018-08-13 4         1
 7 Drama         Comedy        Documentary      2018-08-13 4         1
 8 Crime Drama   Documentary   Documentary      2018-08-14 1         1
 9 Documentary   Documentary   Documentary      2018-08-14 1         1
10 Comedy        Drama         Comedy           2018-08-14 2         1
# ... with 16 more rows
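
The result above counts each sequence separately for every date and row. If what you want is the most common sequence overall, while still making sure no window crosses a date or row boundary, one possible variation is to collapse each window into a single string and count across all groups. Below is a minimal base-R sketch of that idea on made-up toy data. The names `toy` and `rolling_windows` and the " > " separator are my own illustrative choices, not part of the pipeline above, and the helper is a hand-rolled stand-in for the rolling-window step rather than a call to rollapply.

```r
# Toy data (hypothetical): two dates, two rows per date, four columns per row.
toy <- data.frame(
  date  = rep(c("13/08/2018", "14/08/2018"), each = 8),
  row   = rep(rep(c("1", "2"), each = 4), times = 2),
  genre = c("Comedy", "Drama", "Comedy", "Documentary",  # 13/08, row 1
            "Sport", "Sport", "Sport", "Sport",          # 13/08, row 2
            "Comedy", "Drama", "Comedy", "Documentary",  # 14/08, row 1
            "Sport", "Sport", "Sport", "Sport"),         # 14/08, row 2
  stringsAsFactors = FALSE
)

# Hypothetical helper: every run of `width` consecutive items in `x`,
# pasted into one string per window.
rolling_windows <- function(x, width = 3) {
  n <- length(x)
  if (n < width) return(character(0))
  vapply(seq_len(n - width + 1),
         function(i) paste(x[i:(i + width - 1)], collapse = " > "),
         character(1))
}

# Split by date and row first, so a window never spans two groups.
groups  <- split(toy$genre, list(toy$date, toy$row), drop = TRUE)
windows <- unlist(lapply(groups, rolling_windows), use.names = FALSE)

# Count each sequence across all groups, most frequent first.
seq_counts <- sort(table(windows), decreasing = TRUE)
print(seq_counts)
```

On this toy data each group of four items yields two windows, so there are eight windows in total, and "Sport > Sport > Sport" comes out on top with a count of 4. With the real data you would swap `toy$genre` for `df$Programme_Genre` and split on `df$date` and `df$row`, after filtering to the columns you care about.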