我正在尝试弄清楚如何使用rollapply
函数(从Zoo
包)来查找数据集中最常见字符串的序列,但我还需要对某些变量(例如日期、行等)进行分组
在进一步讨论之前,值得注意的是,该查询建立在我之前在此发布的一个问题的基础上:如何使用 Tableau 找到数据中最常见的(字符串)序列? https://stackoverflow.com/questions/67064314/how-can-i-find-most-common-sequences-of-strings-in-my-data-using-tableau
那里提供的解决方案效果非常好,但我现在想将其应用到不同的数据集,这带来了一些新的挑战!以下是我在这个新数据集中使用的数据示例:
structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith",
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show",
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death",
"Bollywood: The World's Biggest Film Industry", "One Hot Summer",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day",
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer",
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock",
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock",
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle",
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps",
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty",
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States",
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends",
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama",
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport",
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy",
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary",
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport",
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi",
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary",
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama",
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary",
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured",
"Featured", "Featured", "Featured", "This Weekend's Football",
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured",
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets",
"Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10",
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured",
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets",
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular",
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux",
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018",
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018",
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018",
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1",
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2",
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3",
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4",
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1",
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2",
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2",
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5",
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3",
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA,
-56L), class = "data.frame")
抱歉,但我不太确定共享数据的最佳实践。希望以上方法有效。它应该看起来像这样:
Title Programme_Genre Programme_Category date column row
1 Dragons Den Entertainment Featured 13/08/2018 1 1
2 One Hot Summer Documentary Featured 13/08/2018 2 1
3 Keeping Faith Drama Featured 13/08/2018 3 1
4 Cuckoo New Series Comedy Featured 13/08/2018 4 1
5 Match of the Day Sport This Weekends... 13/08/2018 1 2
6 Sportscene Sport This Weekends... 13/08/2018 2 2
我想做的是使用rollapply
功能类似于我在上一个问题中建议的方式(请参阅上面的链接),但仅查找出现在同一日期和特定列范围内的序列。例如,我想知道最常见的流派序列(“Programme_Genre”)是什么,但我只想要rollapply
函数在每个日期的每一行的第 1-4 列中执行此操作。我确信我没有很好地解释这一点(如果你没有猜到,我没有数据科学背景)所以如果有必要的话我非常乐意详细说明。提前致谢!