I have a dataset that looks like this:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), Date = c("2020-01-04",
"2020-04-03", "2020-12-10", "2020-09-12", "2020-11-19", "2020-04-03",
"2020-06-03", "2020-05-03", "2020-08-09", "2020-10-10"), Name = c("Jon",
"Mike", "", "Rodney", "Jon", "Mike", "", "Ryan", "Ryan", "Ryan"
), Phone = c("555-555-5555", "123-456-7890", "123-456-7890",
"333-333-3333", "", "123-456-7890", "098-765-4321", "", "", "444-444-4444"
), Email = c("[email protected]", "[email protected]", "[email protected]",
"[email protected]", "", "", "", "[email protected]", "", "[email protected]"
), Address = c("123 Main Street", "456 Washingto Avenue", "",
"16 Henderson St", "", "456 Washingto Avenue", "123 Lincoln Avenue",
"123 Lincoln Avenue", "", "156 Jefferson Street"), Group = c("1",
"2", "2", "3", "1", "2", "4", "4", "4", "5")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
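I would like to end up with a dataset like the one below. (Note that the numbers in the Score column are not exactly the numbers I want; I only added them as placeholders. I will let the method determine the correct scores, but 1 should mean a perfect score.)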
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), Date = c("2020-01-04",
"2020-04-03", "2020-12-10", "2020-09-12", "2020-11-19", "2020-04-03",
"2020-06-03", "2020-05-03", "2020-08-09", "2020-10-10"), Name = c("Jon",
"Mike", "", "Rodney", "Jon", "Mike", "", "Ryan", "Ryan", "Ryan"
), Phone = c("555-555-5555", "123-456-7890", "123-456-7890",
"333-333-3333", "", "123-456-7890", "098-765-4321", "", "", "444-444-4444"
), Email = c("[email protected]", "[email protected]", "[email protected]",
"[email protected]", "", "", "", "[email protected]", "[email protected]",
"[email protected]"), Address = c("123 Main Street", "456 Washingto Avenue",
"", "16 Henderson St", "", "456 Washingto Avenue", "123 Lincoln Avenue",
"123 Lincoln Avenue", "", "156 Jefferson Street"), Group = c("1",
"2", "2", "3", "1", "2", "4", "4", "4", "5"), Score = c("1",
"1", ".88", "1", ".96", ".96", "1", "1", ".25", "1")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
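The numbers in the Score column are arbitrary. I am fine with other numbers, as long as they follow from the rules of the fuzzy-matching process. The idea in my head is that, based on the long dataset, the script finds that there are four groups. These groups correspond to 1, 2, 3, and 4, referring to Jon, Mike, Rodney, and Ryan respectively. Note that one Ryan row has a score of 0.25 because it contains only his name and no other information such as phone or email. The score is relative within the group, not relative to the whole dataset.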
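A complete set of
Col <- c("Name", "Phone", "Email", "Address")
should count as a perfect match, no question. A row with 3 of the 4 fields should score higher than a row with 2 of the 4, and so on. How would I go about this? Is it even possible?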
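To make the scoring rule concrete, here is a minimal base R sketch of a completeness score. It is not a fuzzy-matching method, only an illustration of "4 of 4 scores 1, 3 of 4 beats 2 of 4". The data frame is a four-row subset modelled on IDs 1, 5, 9, and 10 of the input above; the email values are hypothetical placeholders, and the names `filled` and `RelScore` are made up for this example.

```r
# Illustrative subset of the question's input data.
# Email values are hypothetical placeholders, not the real ones.
df <- data.frame(
  ID      = c(1, 5, 9, 10),
  Name    = c("Jon", "Jon", "Ryan", "Ryan"),
  Phone   = c("555-555-5555", "", "", "444-444-4444"),
  Email   = c("jon@example.com", "", "", "ryan@example.com"),
  Address = c("123 Main Street", "", "", "156 Jefferson Street"),
  Group   = c("1", "1", "4", "5"),
  stringsAsFactors = FALSE
)

Col <- c("Name", "Phone", "Email", "Address")

# TRUE where a field is present (neither NA nor an empty string)
filled <- !is.na(df[Col]) & df[Col] != ""

# Score: fraction of the fields in Col that are filled in each row
df$Score <- rowSums(filled) / length(Col)

# Assumption: "relative within the group" is read here as dividing by the
# best score in the same Group, so the most complete row per group gets 1.
# A group whose rows are all empty would yield NaN.
df$RelScore <- df$Score / ave(df$Score, df$Group, FUN = max)

print(df[c("ID", "Group", "Score", "RelScore")])
```

With this subset, the rows with all four fields get a Score of 1 and the name-only rows get 0.25. Whether the group-relative rescaling in `RelScore` is what "relative within the group" should mean is a judgment call; it is only one possible reading.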