Day2 Data Management I

2023-11-16

This article is based on R 4.1.2.

1. Data import/Export

1.1 Import data

1.1.1 import data by base R

Example：

# General import function: 
read.table(“file name”, header = T, as.is = T, sep = “,”)

# Simplified version for .csv files: 
read.csv(“file name.csv”, header = T, as.is = T)

Detail (R 4.1.2)
General import function:
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = FALSE, fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
Simplified version for .csv files:
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
Other uncommon used import function in base R:
read.csv2(file, header = TRUE, sep = ";", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
read.delim(file, header = TRUE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
常用参数解析（其余可以直接翻阅 ?read.table)

参数	简析
`file`	即要读取的file名，可以是绝对路径or相对路径。file也可以是readable text-mode connection OR complete URL
`header`	Logical value. 指示原文件的第一行是否作为各列的名字。如果缺少，此值由文件格式决定:即当且仅当第一行包含的字段比列数少一个时，header会被被设置为TRUE。
`dec`	the character used in the file for decimal points. 原文件file中用来表示小数点的符号，如逗号(,)或点(.)。
`skip`	integer: the number of lines of the data file to skip before beginning to read data.
`na.strings`	character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.
`sep`	the field separator character. Values on each line of the file are separated by this character. If sep = “” (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.

Other arguments chould be: numerals, row.names, col.names, as.is, colClasses, nrows, check.names, fill, strip.white, blank.lines.skip, comment.char, allowEscapes, flush, stringsAsFactors, fileEncoding, encoding, text, skipNul。
NB：It can be favorable to specify both the file path and file name when reading in data! (这是为了reproducible：1.别人可以很容易看见路径和文件名。别人改的时候只需要改一下路径，不需要整体改。2.我们想要改其他文件，只需要改file的名字)
example:

# Reading in the file with name “mydata.csv” in the folder with path “../raw_data”
dir <- “../raw_data/”
file <- “mydata.csv”
file_path <- paste(dir, file, sep = “”)
data_raw <- read.csv(file = file_path, header = T, as.is = T)

1.1.2 import data by readr

R’s ability to import large datasets can be limited and the speed can be greatly increased with the package:
– readr (work for many types of ‘rectangular’/fixed format data)
常用函数：read_table() & read_csv()

# detail read_table()
read_table(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = "NA", skip = 0, n_max = Inf, guess_max = min(n_max, 1000), progress = show_progress(), comment = "", show_col_types = should_show_types(), skip_empty_rows = TRUE)

# detail read_csv:
read_csv(file, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale = default_locale(), na = c("", "NA"), quoted_na = TRUE, quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf, guess_max = min(1000, n_max), name_repair = "unique", num_threads = readr_threads(), progress = show_progress(), show_col_types = should_show_types(), skip_empty_rows = TRUE, lazy = should_read_lazy())

example:
data_raw <- read_table(file = “bigdata", col_names = T, col_types = cols(.default = col_double()))

readr 还包含了一些其他函数：read_delim(), read_tsv(), read_file(), read_lines() etc

1.1.3 read Excel or SAS data

R is natively not compatible with Excel or SAS files, but solutions exists through user-written packages:
- readxl (work for Excel files such as .xls and .xlsx)
- Hmisc (work for SAS databases, multiple formats)
  example:

library(“readxl”)
read_excel(path, sheet = 1, col_names = TRUE, col_types = NULL, na = "", skip = 0)

1.2 Export data

常用函数：

General export function:

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ", eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE, qmethod = c("escape", "double"), fileEncoding = "")

Simplified version for export to .csv format:

write.csv(...)

Example:

write.table(object, “file name”, quote = F, sep = “,”)
write.csv(object, ‘file name.csv’, quote = T, row.names = F)

NB: Specify both the export path and desired file name when exporting an object!
Example:

# This would export the object to a .csv file named “manipulated_data_raw_YYYY-MM-DD.csv”
# Sys.time() prints the system time in the specified format (time, month, year, date, weekday, etc.)

final_path <- “raw_data1/"
final_name <- “manipulated_data_raw"
write.csv(object, paste(final_path,final_name,"_", format(Sys.time(),'%Y-%m-%d'), ".csv", sep = ""), row.names = F, quote = F, na = ".")

2. Data Overview and Summary

Data Overview:

常用函数：
- View(data) (也可以直接在Rstudio点开）
- str()
- names(data) & colnames(data) & rownames(data)

Data Check:

函数	作用
`mean()`	calculate the mean value
`median()`	calculate the median value
`hist()`	to examine a distribution of values
`quantile()`	calculate the quantiles of value

临床数据处理的一般注意（不处理临床数据的童鞋可跳过）:
(1) Maintain original data and use a programmatic script to clean, fix mistakes, add missing information (avoid overwritting the original dataset).
(2) A tidy data means: [a]Observations are in rows. [b]variables are in columns. [c]Contained in a single dataset.
(3)常用检查方法检查数据是否正常导入：检查[a] number of rows/columns [b] Missing values [c] number of subjects/experiments [d] Range, mean, median of values
(4) 因为在tidy data里面，一个病人一行，有的时候一个病人会有好多行。这样在计算比如整体的体重的均值就可能出错。需要去除重复的行

# A solution is to use !duplicated() to remove duplicate rows:
data_ind <- data[!duplicated(data$ID), ]

Data Summary:

summary() works good for continuous variables (weight, age) but not discrete (i.e. categories)

summary() returns the minimum and maximum values, the mean and median, as well as the 25th and 75th percentiles for each column of a data frame.

unique() & table() works good for categorical variables(sex, race)

unique() returns any unique values and can be combined with length(unique()) to return the number of unique values. example like unique(data$SEX) OR length(unique(data$SEX))
table() can be used to summarize categorical data

> unique(data$SEX)
[1]	1	2
> length(unique(data$SEX))
[1]	2
> table(data$SEX)
1	2
10	12 # 10 patients sex==1 while 12 patients sex==2

3. Subsetting

3.1 Cutting the data

Subsetting allows to select range of elements according to a specified criteria
Three main ways to subset an object in R:

subset(data, SEX == 1, select = c(ID, TIME)) 留下sex==1的ID，TIME
$ can select one column e.g. data$ID
[ ] can be used with any object:
– A[element] (for vectors)
– data[rows, columns] (for data frames)
– my_list[[level]][sublevel] (for lists)

3.2 multiple matching: `match()` & `%in%`

实例问题：
To retrieve the IDs present in both Study 1 and 2：

study1 <- c(1, 2, 3, 4, 5, 7, 9)
study2 <- c(1, 3, 4, 6, 8, 9, 10)

In this case == or != do not work as Study 1 and 2 have multiple values.
Here, we could used match(x, table). This function return the position(千万注意这里返回的是位置) of each first match of x in table.

match(study2, study1)
[1] 1 3 4 NA NA 9 NA

If we want to use for sub setting, then we can add argument nomatch = 0 (default NA) which is the value to be generated if a value cannot be matched

study1[match(study2, study1, nomatch=0)] #这样subset match 的值
[1] 1 3 4 9 

match(study2, study1, nomatch = 0)
[1] 1 3 4 0 0 7 0

study1[match(study2, study1)] # 如果不加nomatch=0会返回不必要的NA
[1]  1  3  4 NA NA  9 NA

We can also use another more intuitive function x %in% table.

# Using %in% never returns NA values.
> Study1 %in% Study2 
[1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE
> Study1[Study1 %in% Study2]
[1] 1 3 4 9

R 4.1.2 中关于两个函数的详细描述：
match(x, table, nomatch = NA_integer_, incomparables = NULL) returns a vector of the positions of (first) matches of its first argument in its second.
x %in% table is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.
具体参数：

参数	描述
`x`	vector or NULL: the values to be matched. Long vectors are supported.
`table`	vector or NULL: the values to be matched against. Long vectors are not supported.
`nomatch`	the value to be returned in the case when no match is found. Note that it is coerced to integer.
`incomparables`	a vector of values that cannot be matched. Any value in x matching a value in this vector is assigned the nomatch value. For historical reasons, FALSE is equivalent to NULL.

4. Merging Data

4.1 Character manipulation

4.1.1 Creat words or sentences `paste()`, `substr()` & `nchar()`

函数	简述
`paste (..., sep = " ", collapse = NULL, recycle0 = FALSE)`	combine objects into a character string
`substr(x, start, stop)`	extracts or replaces parts of a string.
`nchar()`	will return the number of characters of a string

> string <- paste('a','b','c',sep = "_")
> string
[1] "a_b_c"

> substr(string,start = 2,stop = 4) #提取出string里面从下标2到下标4的元素。注意在R里面字符串是不能用`[]`切片的
[1] "_b_"
> substr(string,start = 3,stop = 3)
[1] "b"
# 错误示范：
> string[3] #注意区分python，python里字符串是可以用`[]`切片的，而R里是不可以的
[1] NA
> string[2:4]
[1] NA NA NA

> nchar(string)
[1] 5
> string2 <- "a b c"
> nchar(string)
[1] 5

4.1.2 Splitting sentences or words `paste()`, `substr()` & `nchar()`

函数	简述
`strsplit(x, split, fixed = F)`	cuts x based on split
`grep(pattern, x, ignore.case = F, value = F)`	looks for the elements of x(x should be vector) that matches pattern and returns the elements position in the object (value = F) or the element that matched the pattern (value = T). pattern 可以用正则表达式
`sub(pattern, replacement, x, ignore.case = F)`	looks for the elements of x matching the pattern and replaces them with the replacement. pattern 可以用正则表达式
`gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)`	similar to grep, but here we can use text instead of x vector. Also this can give more information see example below

Example for strsplit & sub:

> string <- "make america great again"
> strsplit(string, " ")
[[1]]
[1] "make"    "america" "great"   "again"  

> sub(pattern = "america", replacement = "china", x = string)
[1] "make china great again"

Example for grep & gregexpr

########### grep()函数
# grep里的x一般是输入一个vector，如果要在一个string里找pattern的位置可以尝试 `gregexpr(pattern, text)` 函数
> grep("america", string) #这里只有一个string，在string里有america就会匹配并返回1。
[1] 1

> list_my <- strsplit(string," ")
> list_my
[[1]]
[1] "make"    "america" "great"   "again"  
> grep("america", list_my) #grep在list中貌似只会返回包含pattern的layer，还是别这么用吧，官方文件里grep的x应该是个vector
[1] 1

官方文档对参数x的解释：
x, text	：a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

> vector_my <- list_my[[1]]
> vector_my
[1] "make"    "america" "great"   "again"  
> grep(pattern = "america", x = vector_my)
[1] 2
> vector_my2 <- c(1, 2,1, 3, 4, 1, 1, 5, 6) #这里按照解释是把vector_my2先as.character 转化成character vector
> grep("1",vector_my2)
[1] 1 3 6 7

########### gregexpr()函数
> string <- "make america great again"
> gregexpr(" ", string)
[[1]]
[1]  5 13 19
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

> gregexpr("america", string)
[[1]]
[1] 6
attr(,"match.length")
[1] 7
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

4.1.3 Sorting vectors or columns `sort()` & `order()`

函数	简述
`sort(A)`	will sort the values of A by ascending order, or descending if option “decreasing = T” is supplied
`order(A)`	will return the order in which each element should reordered. this also can be useful for sorting data frames by multiple columns values (see example)

Example:

> A <- c(1, 5, 2, 7, 8, 9, 3, 6, 4)
> sort(A)
[1] 1 2 3 4 5 6 7 8 9

> order(A)  #返回的是每个位置排序后的数字对应原来的位置 如：下方第二个值是3，意思是第二个值应该是原来A的第三个元素
[1] 1 3 7 9 2 8 4 5 6
> A[order(A)]
[1] 1 2 3 4 5 6 7 8 9

# order() used in dataframe: 先按照ID排序，之后再按照Time1排序
> x  
  ID Time1 
1  1     1
2  2     4
3  4     6
4  2     2
5  3     6
> ox <- x[order (x$ID, x$Time1), ] #注意后面的这个逗号，order (x$ID, x$Time1)返回的对应的row的数字排序
> ox  
  ID Time1
1  1     1
4  2     2
2  2     4
5  3     6
3  4     6

4.1.4 Binding objects together

函数	简述
`c()`	can both create vectors or bind them together
`cbind()`	can be used to bind elements of same `length()` or `nrow()` into columns
`rbind()`	can be used to bind elements of same `length()` or `ncol()` into rows

Example:

> A <- 1:3 
> B <- 3:1

> c(A,B)
[1] 1 2 3 3 2 1
> cbind(A,B)
     A B
[1,] 1 3
[2,] 2 2
[3,] 3 1
> rbind(A,B)
  [,1] [,2] [,3]
A    1    2    3
B    3    2    1

# More example
> q 
    [,1] [,2]
[1,]    1    1
[2,]    2    2
> w
     [,1] [,2] [,3]
[1,]    3    3    3
[2,]    4    4    4
> e 
    [,1] [,2]
[1,]    5    6
[2,]    6    5
[3,]    5    6
> cbind(q,w)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    3    3    3
[2,]    2    2    4    4    4
> rbind(q,e)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    5    6
[4,]    6    5
[5,]    5    6

4.1.5 Binding objects together `merge()`

The function merge(x, y, by = “”, all = F) is used when:

“x” and “y” are two data frames to be merged.
“by” is a column name or a vector of column names existing in both “x” and “y” on which the merge will be based
If there are unmatched elements in the “by” variable(s), these elements will by default (“all=F”) be removed. This can result in loss of data if many unmatched elements
“all=T” tells merge to keep all elements of “x” and “y”.
比如：x 里面有病人ID和体重；y里面有病人ID和实验数据。merge两个，by ID （通过ID来merge两个到一起）

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

R basic

R语言