Day2 Data Management I

2023-11-16

This article is based on R 4.1.2.

1. Data import/Export

1.1 Import data

1.1.1 import data by base R

  • Example:
# General import function: 
read.table(“file name”, header = T, as.is = T, sep = “,”)

# Simplified version for .csv files: 
read.csv(“file name.csv”, header = T, as.is = T)
  • Detail (R 4.1.2)
    General import function:
    read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = FALSE, fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
    Simplified version for .csv files:
    read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
    Other uncommon used import function in base R:
    read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
    read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
    read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)

  • 常用参数解析(其余可以直接翻阅 ?read.table)

参数 简析
file 即要读取的file名,可以是绝对路径or相对路径。file也可以是readable text-mode connection OR complete URL
header Logical value. 指示原文件的第一行是否作为各列的名字。 如果缺少,此值由文件格式决定:即当且仅当第一行包含的字段比列数少一个时,header会被被设置为TRUE。
dec the character used in the file for decimal points. 原文件file中用来表示小数点的符号,如逗号(,)或点(.)。
skip integer: the number of lines of the data file to skip before beginning to read data.
na.strings character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.
sep the field separator character. Values on each line of the file are separated by this character. If sep = “” (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.

Other arguments chould be: numerals, row.names, col.names, as.is, colClasses, nrows, check.names, fill, strip.white, blank.lines.skip, comment.char, allowEscapes, flush, stringsAsFactors, fileEncoding, encoding, text, skipNul
NB:It can be favorable to specify both the file path and file name when reading in data! (这是为了reproducible:1.别人可以很容易看见路径和文件名。别人改的时候只需要改一下路径,不需要整体改。2.我们想要改其他文件,只需要改file的名字)
example:

# Reading in the file with name “mydata.csv” in the folder with path “../raw_data”
dir <- “../raw_data/”
file <- “mydata.csv”
file_path <- paste(dir, file, sep = “”)
data_raw <- read.csv(file = file_path, header = T, as.is = T)

1.1.2 import data by readr

  • R’s ability to import large datasets can be limited and the speed can be greatly increased with the package:
    – readr (work for many types of ‘rectangular’/fixed format data)
    常用函数:read_table() & read_csv()
# detail read_table()
read_table(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = "NA", skip = 0, n_max = Inf, guess_max = min(n_max, 1000), progress = show_progress(), comment = "", show_col_types = should_show_types(), skip_empty_rows = TRUE)

# detail read_csv:
read_csv(file, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy())

example:
data_raw <- read_table(file = “bigdata", col_names = T, col_types = cols(.default = col_double()))

readr 还包含了一些其他函数:read_delim(), read_tsv(), read_file(), read_lines() etc

1.1.3 read Excel or SAS data

  • R is natively not compatible with Excel or SAS files, but solutions exists through user-written packages:
    • readxl (work for Excel files such as .xls and .xlsx)
    • Hmisc (work for SAS databases, multiple formats)
      example:
library(“readxl”)
read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)

1.2 Export data

常用函数:

  • General export function:

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ", eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE, qmethod = c("escape", "double"), fileEncoding = "")

  • Simplified version for export to .csv format:

write.csv(...)

Example:

write.table(object, “file name”, quote = F, sep = “,”)
write.csv(object, ‘file name.csv’, quote = T, row.names = F)

NB: Specify both the export path and desired file name when exporting an object!
Example:

# This would export the object to a .csv file named “manipulated_data_raw_YYYY-MM-DD.csv”
# Sys.time() prints the system time in the specified format (time, month, year, date, weekday, etc.)

final_path <- “raw_data1/"
final_name <- “manipulated_data_raw"
write.csv(object, paste(final_path,final_name,"_", format(Sys.time(),'%Y-%m-%d'), ".csv", sep = ""), row.names = F, quote = F, na = ".")

2. Data Overview and Summary

Data Overview:

  • 常用函数:
    • View(data) (也可以直接在Rstudio点开)
    • str()
    • names(data) & colnames(data) & rownames(data)

Data Check:

函数 作用
mean() calculate the mean value
median() calculate the median value
hist() to examine a distribution of values
quantile() calculate the quantiles of value

临床数据处理的一般注意(不处理临床数据的童鞋可跳过):
(1) Maintain original data and use a programmatic script to clean, fix mistakes, add missing information (avoid overwritting the original dataset).
(2) A tidy data means: [a]Observations are in rows. [b]variables are in columns. [c]Contained in a single dataset.
(3)常用检查方法检查数据是否正常导入:检查[a] number of rows/columns [b] Missing values [c] number of subjects/experiments [d] Range, mean, median of values
(4) 因为在tidy data里面,一个病人一行,有的时候一个病人会有好多行。这样在计算比如整体的体重的均值就可能出错。需要去除重复的行

# A solution is to use !duplicated() to remove duplicate rows:
data_ind <- data[!duplicated(data$ID), ]

Data Summary:

  1. summary() works good for continuous variables (weight, age) but not discrete (i.e. categories)
  • summary() returns the minimum and maximum values, the mean and median, as well as the 25th and 75th percentiles for each column of a data frame.
  1. unique() & table() works good for categorical variables(sex, race)
  • unique() returns any unique values and can be combined with length(unique()) to return the number of unique values. example like unique(data$SEX) OR length(unique(data$SEX))
  • table() can be used to summarize categorical data
> unique(data$SEX)
[1]	1	2
> length(unique(data$SEX))
[1]	2
> table(data$SEX)
1	2
10	12 # 10 patients sex==1 while 12 patients sex==2 

3. Subsetting

3.1 Cutting the data

Subsetting allows to select range of elements according to a specified criteria
Three main ways to subset an object in R:

  1. subset(data, SEX == 1, select = c(ID, TIME)) 留下sex==1的ID,TIME
  2. $ can select one column e.g. data$ID
  3. [ ] can be used with any object:
    A[element] (for vectors)
    data[rows, columns] (for data frames)
    my_list[[level]][sublevel] (for lists)

3.2 multiple matching: match() & %in%

实例问题:
To retrieve the IDs present in both Study 1 and 2:

study1 <- c(1, 2, 3, 4, 5, 7, 9)
study2 <- c(1, 3, 4, 6, 8, 9, 10)

In this case == or != do not work as Study 1 and 2 have multiple values.
Here, we could used match(x, table). This function return the position(千万注意这里返回的是位置) of each first match of x in table.

match(study2, study1)
[1] 1 3 4 NA NA 9 NA

If we want to use for sub setting, then we can add argument nomatch = 0 (default NA) which is the value to be generated if a value cannot be matched

study1[match(study2, study1, nomatch=0)] #这样subset match 的值
[1] 1 3 4 9 

match(study2, study1, nomatch = 0)
[1] 1 3 4 0 0 7 0

study1[match(study2, study1)] # 如果不加nomatch=0会返回不必要的NA
[1]  1  3  4 NA NA  9 NA

We can also use another more intuitive function x %in% table.

# Using %in% never returns NA values.
> Study1 %in% Study2 
[1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE
> Study1[Study1 %in% Study2]
[1] 1 3 4 9

R 4.1.2 中关于两个函数的详细描述:
match(x, table, nomatch = NA_integer_, incomparables = NULL) returns a vector of the positions of (first) matches of its first argument in its second.
x %in% table is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.
具体参数:

参数 描述
x vector or NULL: the values to be matched. Long vectors are supported.
table vector or NULL: the values to be matched against. Long vectors are not supported.
nomatch the value to be returned in the case when no match is found. Note that it is coerced to integer.
incomparables a vector of values that cannot be matched. Any value in x matching a value in this vector is assigned the nomatch value. For historical reasons, FALSE is equivalent to NULL.

4. Merging Data

4.1 Character manipulation

4.1.1 Creat words or sentences paste(), substr() & nchar()

函数 简述
paste (..., sep = " ", collapse = NULL, recycle0 = FALSE) combine objects into a character string
substr(x, start, stop) extracts or replaces parts of a string.
nchar() will return the number of characters of a string
> string <- paste('a','b','c',sep = "_")
> string
[1] "a_b_c"

> substr(string,start = 2,stop = 4) #提取出string里面从下标2到下标4的元素。注意在R里面字符串是不能用`[]`切片的
[1] "_b_"
> substr(string,start = 3,stop = 3)
[1] "b"
# 错误示范:
> string[3] #注意区分python,python里字符串是可以用`[]`切片的,而R里是不可以的
[1] NA
> string[2:4]
[1] NA NA NA

> nchar(string)
[1] 5
> string2 <- "a b c"
> nchar(string)
[1] 5

4.1.2 Splitting sentences or words paste(), substr() & nchar()

函数 简述
strsplit(x, split, fixed = F) cuts x based on split
grep(pattern, x, ignore.case = F, value = F) looks for the elements of x(x should be vector) that matches pattern and returns the elements position in the object (value = F) or the element that matched the pattern (value = T). pattern 可以用正则表达式
sub(pattern, replacement, x, ignore.case = F) looks for the elements of x matching the pattern and replaces them with the replacement. pattern 可以用正则表达式
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) similar to grep, but here we can use text instead of x vector. Also this can give more information see example below

Example for strsplit & sub:

> string <- "make america great again"
> strsplit(string, " ")
[[1]]
[1] "make"    "america" "great"   "again"  

> sub(pattern = "america", replacement = "china", x = string)
[1] "make china great again"

Example for grep & gregexpr

########### grep()函数
# grep里的x一般是输入一个vector,如果要在一个string里找pattern的位置可以尝试 `gregexpr(pattern, text)` 函数
> grep("america", string) #这里只有一个string,在string里有america就会匹配并返回1。
[1] 1

> list_my <- strsplit(string," ")
> list_my
[[1]]
[1] "make"    "america" "great"   "again"  
> grep("america", list_my) #grep在list中貌似只会返回包含pattern的layer,还是别这么用吧,官方文件里grep的x应该是个vector
[1] 1

官方文档对参数x的解释:
x, text	:a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

> vector_my <- list_my[[1]]
> vector_my
[1] "make"    "america" "great"   "again"  
> grep(pattern = "america", x = vector_my)
[1] 2
> vector_my2 <- c(1, 2,1, 3, 4, 1, 1, 5, 6) #这里按照解释是把vector_my2先as.character 转化成character vector
> grep("1",vector_my2)
[1] 1 3 6 7

########### gregexpr()函数
> string <- "make america great again"
> gregexpr(" ", string)
[[1]]
[1]  5 13 19
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

> gregexpr("america", string)
[[1]]
[1] 6
attr(,"match.length")
[1] 7
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

4.1.3 Sorting vectors or columns sort() & order()

函数 简述
sort(A) will sort the values of A by ascending order, or descending if option “decreasing = T” is supplied
order(A) will return the order in which each element should reordered. this also can be useful for sorting data frames by multiple columns values (see example)

Example:

> A <- c(1, 5, 2, 7, 8, 9, 3, 6, 4)
> sort(A)
[1] 1 2 3 4 5 6 7 8 9

> order(A)  #返回的是每个位置排序后的数字对应原来的位置 如:下方第二个值是3,意思是第二个值应该是原来A的第三个元素
[1] 1 3 7 9 2 8 4 5 6
> A[order(A)]
[1] 1 2 3 4 5 6 7 8 9

# order() used in dataframe: 先按照ID排序,之后再按照Time1排序
> x  
  ID Time1 
1  1     1
2  2     4
3  4     6
4  2     2
5  3     6
> ox <- x[order (x$ID, x$Time1), ] #注意后面的这个逗号,order (x$ID, x$Time1)返回的对应的row的数字排序
> ox  
  ID Time1
1  1     1
4  2     2
2  2     4
5  3     6
3  4     6

4.1.4 Binding objects together

函数 简述
c() can both create vectors or bind them together
cbind() can be used to bind elements of same length() or nrow() into columns
rbind() can be used to bind elements of same length() or ncol() into rows

Example:

> A <- 1:3 
> B <- 3:1

> c(A,B)
[1] 1 2 3 3 2 1
> cbind(A,B)
     A B
[1,] 1 3
[2,] 2 2
[3,] 3 1
> rbind(A,B)
  [,1] [,2] [,3]
A    1    2    3
B    3    2    1

# More example
> q 
    [,1] [,2]
[1,]    1    1
[2,]    2    2
> w
     [,1] [,2] [,3]
[1,]    3    3    3
[2,]    4    4    4
> e 
    [,1] [,2]
[1,]    5    6
[2,]    6    5
[3,]    5    6
> cbind(q,w)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    3    3    3
[2,]    2    2    4    4    4
> rbind(q,e)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    5    6
[4,]    6    5
[5,]    5    6

4.1.5 Binding objects together merge()

The function merge(x, y, by = “”, all = F) is used when:

  • “x” and “y” are two data frames to be merged.
  • “by” is a column name or a vector of column names existing in both “x” and “y” on which the merge will be based
  • If there are unmatched elements in the “by” variable(s), these elements will by default (“all=F”) be removed. This can result in loss of data if many unmatched elements
  • “all=T” tells merge to keep all elements of “x” and “y”.
  • 比如:x 里面有病人ID和体重;y里面有病人ID和实验数据。merge两个,by ID (通过ID来merge两个到一起)
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

Day2 Data Management I 的相关文章

  • python中全部注释_Python中的注释

    1 1 注释的目的 通过用自己熟悉的语言 在程序中对某些代码进行标注说明 这就是注释的作用 能够大大增强程序的可读性 1 2 注释的分类 1 2 1 单行注释 以 开头 右边的所有东西当做说明 而不是真正要执行的程序 起辅助说明作用 我是注

随机推荐

  • 管道

    p The hero s birthday is birthday date format uppercase p
  • CentOS 8安装postgresql 13

    安装postgresql 登陆网址 https www postgresql org download linux redhat 选择postgresql版本和操作系统平台等信息 直接复制命令 依次执行 Install the reposi
  • ZK相关(转载)

    Paxos算法 https www douban com note 208430424 zAB https blog csdn net junchenbb0430 article details 77583955 utm medium di
  • static函数与普通函数

    全局变量 外部变量 的说明之前再冠以static 就构成了静态的全局变量 全局变量本身就是静态存储方式 静态全局变量当然也是静态存储方式 这两者在存储方式上并无不同 这两者的区别虽在于非静态全局变量的作用域是整个源程序 当一个源程序由多个源
  • react-hooks常用API基本简绍

    介绍 react hooks 是 React 16 8 的新增特性 它可以让我们在函数组件中使用 state 生命周期以及其他 react 特性 而不仅限于 class 组件 react hooks 的出现 标示着 react 中不会在存在
  • 【小技巧】强制重启无线网卡,解决“区域中找不到无线网络,请确定您计算机上的无线开关已启用”问题...

    1 找不到无线网络的错误 笔记本有点旧了 无缘无故找不到网络 出现 区域中找不到无线网络 请确定您计算机上的无线开关已启用 的错误 尝试恢复之 重新关闭并开启wifi开关 居然还是这个问题 所以估计是软件问题了 2 右击无线图标尝试修复 2
  • MRI T1、T2mapping

    T1 T2mapping的步骤 1 水模放到线圈中并推入 此时开启激光对准 是水模处于腔体中心位置 2 定位 匀场 选择合适的SE序列 将图像中各层移到图像中心 3 现在心中估计T1 T2值 设置TR TE参数 一般T1 mapping时选
  • 施努卡:基于深度学习的裂缝检测技术

    不同于传统的裂缝检测方法 基于深度学习的检测方法不需要人工干预 能自主学习有代表性的特征 直接对输入图像的裂缝进行检测 随着深度学习方法在许多应用场景中取得了很好的检测结果 国内众多研究人员也开始把基于深度学习的目标检测模型和语义分割模型用
  • 连续自然数求和(3)

    题目描述 求 1 2 3 100 要求分别用 do while 语句 while 语句和 for 语句实现 输入 无 输出 计算结果 注意末尾输出换行 样例输入 无 样例输出 5050 代码 Java版 do while 语句实现 publ
  • pytorch自学笔记

    pytorch和tensorflow是机器学习的两大框架 上一篇帖子已经完整梳理了TensorFlow自学的知识点 这一篇把自学pytorch的知识点也整理下来 内容参考了网上很多帖子 不一一引用了 如有侵权请联系我 目录 0 前言 0 1
  • Windows BitLocker驱动器加密

    BitLocker驱动器加密 首先来说下什么是BitLocker吧 Windows BitLocker驱动器加密通过加密Windows操作系统卷上存储的所有数据可以更好地保护计算机中的数据 为什么要拿来说一说呢 因为这个问题 我两天没工作了
  • 详解JS 和CSS 代码利用率统计利器Coverage以及高版本chrome中coverage的一个致命问题

    在做前端性能测试时较大的文件需要更多时间来下载 并可能导致我们的网站加载缓慢 从而导致用户体验欠佳 因此删除JS和CSS中未被使用的代码就很有必要了 Coverage工具就可以非常方便的实现这一需求 Coverage使用方法 Coverag
  • uniapp打包之配置MacOS虚拟机生成iOS打包证书

    前言 uniapp是一款跨端开发框架 可用于快速开发iOS Android H5等多端应用 本文将详细介绍如何实现uniapp开发的iOS应用打包 详细步骤 一 下载苹果原版镜像文件 点击此处下载 二 安装VMware uniapp打包iO
  • 【android】巧用android:divider属性设置LinearLayout中元素之间的间隔

    如上图 要想实现3个button线性排列并且使它们的大小相同 间隔相等 而且整体填充满整个linearlayout 我们一般的做法是在每两个button之间放一个固定宽度的view 然后设置button的宽度为0 layout weight
  • IDEA maven 项目修改后代码不生效,需要mvn install 后才生效

    错误描述 每次更新完代码 还需要手动install一下才生效 太令人头疼了 错误原因 应该是idea的问题 解决方案 删除idea文件夹 重新打开项目就好了
  • nas linux手机照片备份,本身着手,组建简洁好用的NAS!(存储同步篇)

    之前老张给人人带来了NAS安装 戳这里 和根基设置 戳这里 过程 那么接下来就是正戏了 此次包含了共享设置 速度测试 多盘设置 照片视频同步设置的过程 也是NAS在内网情况中的本职工作 事实要怎么做呢 一路来看看吧 原文来自 www pp0
  • Word文档标题编号调整

    现象 标题目录编号错乱 工具 Word2016 1 先选中第一个一级目录 自定义多级目录样式 包含所有的文档目录级别 1 2 3 4 5 2 将第2章标题下的2级目录标题 2 1环境说明 使用格式刷刷成1级标题中设置的样式 3 选中 第3章
  • Python小游戏————滑雪小游戏代码开源

    作者 小刘在这里 每天分享云计算网络运维课堂笔记 励志爬上IT介顶峰 努力不一定有回报 但一定会有收获加油 一起努力 共赴美好人生 夕阳下 是最美的 绽放 愿所有的美好 再疫情结束后如约而至 目录 一 主代码 二 cfg 三 README
  • Java实现音频播放

    import java io File import java io IOException import javax sound sampled AudioFormat import javax sound sampled AudioIn
  • Day2 Data Management I

    This article is based on R 4 1 2 1 Data import Export 1 1 Import data 1 1 1 import data by base R Example General import