Unix and Perl Primer for Biologists - Part 2: Advanced Unix - Reading Notes (U37-U45)

2023-11-19

U37: Counting with grep
Running grep -c simply counts how many lines match the specified pattern.

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c i2 intron_IME_data.fasta 
9785
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCCCCA" intron_IME_data.fasta 
0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCA" intron_IME_data.fasta 
11
# grep -c prints only the count, so piping it to less shows a single number, not the matched lines
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCA" intron_IME_data.fasta | less
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# pwd
/root/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# ls
At_genes.gff  At_proteins.fasta  chr1.fasta  intron_IME_data.fasta
# Search every file with the .fasta extension in the current directory
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCCCA"  *.fasta
At_proteins.fasta:0
chr1.fasta:2
intron_IME_data.fasta:0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACGT" *.fasta
At_proteins.fasta:70
chr1.fasta:50612
intron_IME_data.fasta:11924
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "^ATG.*ACACAC.*TGA$"  *.fasta
At_proteins.fasta:0
chr1.fasta:3
intron_IME_data.fasta:0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "AC.GT" *.fasta
At_proteins.fasta:47
chr1.fasta:62327
intron_IME_data.fasta:17998
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "AC*GT" *.fasta
At_proteins.fasta:2600
chr1.fasta:288454
intron_IME_data.fasta:103917
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "A*C*G*T*" *.fasta
At_proteins.fasta:269436
chr1.fasta:385224
intron_IME_data.fasta:250978
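
The counts above grow as the patterns loosen. A dot matches exactly one character of any kind, while a star means "zero or more of the preceding character", so A*C*G*T* can match nothing at all and therefore counts every line. Here is a minimal sketch of that difference, using a hypothetical four-line file (toy_seqs.txt) that is not part of the course data:

```shell
# Hypothetical toy file, not part of the course data
printf 'ACGT\nACTGT\nAGT\nTTTT\n' > toy_seqs.txt

# AC.GT: exactly one character (any character) between AC and GT
dot_count=$(grep -c "AC.GT" toy_seqs.txt)

# AC*GT: an A, then zero or more C characters, then GT
star_count=$(grep -c "AC*GT" toy_seqs.txt)

# A*C*G*T*: every part is optional, so the empty string matches on every line
empty_count=$(grep -c "A*C*G*T*" toy_seqs.txt)

echo "AC.GT:    $dot_count"
echo "AC*GT:    $star_count"
echo "A*C*G*T*: $empty_count"
```

On this toy file only ACTGT matches AC.GT, both ACGT and AGT match AC*GT, and all four lines match A*C*G*T*.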

U38: Regular expressions in less
If you are viewing a file with less, you can type a forward slash (/) followed by a pattern, and less will search for (and highlight) all matches to that pattern. Technically it searches forward from whatever point you are at in the file. You can also type a question mark (?) to search backwards. The real bonus is that the patterns you specify can be regular expressions.

Task U38.1
Try viewing a sequence file with less and then searching for a pattern such as ATCG.*TAG$. This should make it easier to see exactly where your regular expression pattern matches. After typing a forward-slash (or a question-mark), you can press the up and down arrows to select previous searches.

(Screenshot: searching forward with / and backward with ? in less)

U39: Let me transl(iter)ate that for you
To convert upper-case characters to lower-case characters, use the Unix command tr (short for transliterate).
(Screenshot: using tr to convert upper case to lower case)
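
As a minimal sketch of the tr usage described here (the input string is made up for illustration), the first argument lists the characters to look for and the second lists their replacements:

```shell
# Hypothetical input; tr reads from standard input, so any text stream works
lower=$(echo "ACGTTGCA" | tr 'A-Z' 'a-z')
echo "$lower"
```

Because tr only reads standard input, a file has to be piped or redirected into it rather than named as an argument.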

U40: That’s what she sed
To change a particular pattern into something completely different, use sed, a command capable of performing a variety of text manipulations. The ‘s’ part of the sed command puts sed in ‘substitute’ mode, where you specify one pattern (between the first two forward slashes) to be replaced by another pattern (between the second and third forward slashes).
(Screenshot: output of head piped into sed)
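
The substitute mode can be sketched on a small made-up FASTA file (toy.fasta is hypothetical, not course data). The trailing g flag in the second command is not covered in these notes; it is included only to show that a plain s/// changes just the first match on each line:

```shell
# Hypothetical two-sequence FASTA file
printf '>seq1\nATGCATGC\n>seq2\nGGATGA\n' > toy.fasta

# Replace only the first ATG on each of the first two lines
first_only=$(head -n 2 toy.fasta | sed 's/ATG/atg/')

# With the g flag, every ATG on each line is replaced
all_matches=$(head -n 2 toy.fasta | sed 's/ATG/atg/g')

echo "$first_only"
echo "$all_matches"
```

Neither command changes toy.fasta itself; sed writes the edited text to standard output.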

U41: Word up
It is useful to get a feeling for how large a file is before you start running lots of commands against it. In particular, it helps to know how many ‘lines’ it has, because many Unix commands like grep and sed work on a line-by-line basis. The Unix command wc (word count) counts the number of lines, words and bytes in the specified file(s). Running wc -l shows just the line count.
(Screenshot: wc output)
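
A small sketch of wc on a made-up file (toy_wc.txt is hypothetical):

```shell
# Hypothetical two-line file containing three words
printf 'one two\nthree\n' > toy_wc.txt

# Default output: lines, words, bytes, then the file name
wc toy_wc.txt

# Only the line count; reading via < keeps the file name out of the output
line_count=$(wc -l < toy_wc.txt)
echo "lines: $line_count"
```

Reading the file through < is a convenience when the number is going to be stored in a variable, since wc then prints the count without the file name.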

U42: GFF and the art of redirection
This section uses a GFF file. This is a common file format in bioinformatics, and GFF files are used to describe the location of various features on a DNA sequence. Features can be exons, genes, binding sites etc., and the sequence can be a single gene or (more commonly) an entire chromosome. The first step is to create a new (smaller) file that contains a subset of the original.
Rather than printing the output to the screen, we want to redirect it into an actual file, and that is what the > symbol does; it is one of three redirection operators in Unix.
For now, all you really need to know is that every GFF file has 9 fields, each separated by a tab character. There should always be some text at every position (even if it is just a ‘.’ character). The last field is often used to store a lot of text.
(Screenshot: GFF features)
(Screenshot: redirecting a GFF subset into a new file)
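
The redirection step can be sketched as follows. The file names and the feature lines are invented for illustration and only imitate the 9-field, tab-separated layout described in this section; they are not taken from At_genes.gff:

```shell
# Hypothetical GFF-style file: 9 tab-separated fields per line
printf 'Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=geneA\n' > toy.gff
# >> appends to an existing file instead of overwriting it
printf 'Chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=geneA\n' >> toy.gff

# > sends the output of head into a new file instead of the screen
head -n 1 toy.gff > toy_subset.gff

# Count the fields on that line by turning each tab into a newline
field_count=$(head -n 1 toy_subset.gff | tr '\t' '\n' | wc -l)
subset_lines=$(wc -l < toy_subset.gff)
echo "fields: $field_count, lines in subset: $subset_lines"
```

Note that > silently overwrites the target file if it already exists, so it is worth checking the file name before pressing enter.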

U43: Not just a pipe dream
The 2nd and/or 3rd fields of a GFF file are usually used to describe some sort of biological feature. We might be interested in seeing how many different features are in our file:
(Screenshot: a cut, sort and uniq pipeline)

1. The cut command first takes the At_genes_subset.gff file and ‘cuts’ out just the 3rd column (as specified by the -f option). Luckily, the default behavior of the cut command is to split text files into columns based on tab characters (if the columns were separated by another character such as a comma, we would need the -d option to specify the comma).
2. The sort command takes the output of the cut command and sorts it alphanumerically.
3. The uniq command (in its default mode) collapses each run of identical adjacent lines into a single line, which is why the input is sorted first. The result is one line per distinct feature (otherwise you would see thousands of lines which said ‘curated’, ‘Coding_transcript’ etc.)
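
The three steps above can be sketched on a made-up, tab-separated file (toy_features.gff is hypothetical and has only five columns for brevity). The second pipeline leaves out sort to show why it matters: uniq only merges duplicates that sit next to each other.

```shell
# Hypothetical feature table: gene and exon lines interleaved
printf 'Chr1\tsrc\tgene\t100\t900\n'   > toy_features.gff
printf 'Chr1\tsrc\texon\t100\t300\n'  >> toy_features.gff
printf 'Chr1\tsrc\tgene\t950\t1500\n' >> toy_features.gff
printf 'Chr1\tsrc\texon\t950\t1200\n' >> toy_features.gff

# cut column 3, sort it, then collapse adjacent duplicates
features=$(cut -f 3 toy_features.gff | sort | uniq)
echo "$features"

# Without sort the duplicates are never adjacent, so nothing is collapsed
unsorted_lines=$(cut -f 3 toy_features.gff | uniq | wc -l)
echo "lines without sort: $unsorted_lines"
```

Running uniq -c instead of uniq in the first pipeline would additionally print how many times each feature occurs.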

Suppose we want to find which features start earliest in the chromosome sequence. The start coordinate of a feature is always specified by column 4 of the GFF file, so we cut out just the two columns of interest (3 and 4). The -f option of the cut command lets us specify which columns we want to extract. By default sort sorts alphanumerically, so we use the -n option to make it sort numerically. Because we could sort on either column, -k 2 specifies that the second column is the sort key. Finally, we use the head command to get just the first 10 rows of output: the lines from the GFF file with the lowest starting coordinates.
(Screenshot: a cut, sort and head pipeline)
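
A minimal sketch of this pipeline, assuming a made-up three-line file (toy_coords.gff is hypothetical) and keeping two rows instead of ten:

```shell
# Hypothetical feature table with the start coordinate in column 4
printf 'Chr1\tsrc\tgene\t5000\t9000\n'  > toy_coords.gff
printf 'Chr1\tsrc\texon\t120\t300\n'   >> toy_coords.gff
printf 'Chr1\tsrc\tgene\t1000\t1500\n' >> toy_coords.gff

# Keep columns 3 and 4, sort numerically on the (new) second column,
# then keep the two features with the lowest start coordinate
earliest=$(cut -f 3,4 toy_coords.gff | sort -n -k 2 | head -n 2)
echo "$earliest"
```

After cut, the start coordinate is the second column of the remaining text, which is why the sort key is 2 rather than 4.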

U44: The end of the line
Pressing enter generates one of two different events, depending on what computer you are using. It produces a newline character, which is represented internally by either a line feed or a carriage return character (Windows actually uses a combination of both to represent a newline). As a result, a text file created on another system can look unreadable in a Unix text viewer. In Unix (and in Perl and other programming languages) the patterns \n (line feed) and \r (carriage return) are used to denote these characters. A common fix is to replace each \r with \n.

Use less to look at the Data/Misc/excel_data.csv file. This is a simple 4-line file that was exported from a Mac version of Microsoft Excel. You should see that if you use less, then this appears as one line with the newlines replaced with ^M characters. You can convert these carriage returns into Unix-friendly line-feed characters by using the tr command like so:
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv
(Screenshot: the newline characters as displayed by less)

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# pwd
/root/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# ls
excel_data.csv  oligos.txt
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv 

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# tr '\r' '\n'  < excel_data.csv 
sequence 1,acacagagag
sequence 2,acacaggggaaa
sequence 3,ttcacagaga
sequence 4,cacaccaaacac
sequence 5,tttatatttaatata
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv 

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# tr '\r' '\n'  < excel_data.csv  >excel_data_formatted.csv
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# ls
excel_data.csv  excel_data_formatted.csv  oligos.txt
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data_formatted.csv 

(Screenshot: tr with the redirection operators)
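
The effect of the conversion can be checked with wc -l, which counts line-feed characters. This sketch uses a made-up file (toy_mac.csv is hypothetical) in which every record ends with a carriage return; that is an assumption for the example, not a claim about excel_data.csv:

```shell
# Hypothetical file whose records end in carriage returns, not line feeds
printf 'sequence 1,acac\rsequence 2,ttga\rsequence 3,ggca\r' > toy_mac.csv

# No line feeds yet, so wc -l sees zero lines
before=$(wc -l < toy_mac.csv)

# < feeds the file to tr, > stores the converted text in a new file
tr '\r' '\n' < toy_mac.csv > toy_unix.csv

after=$(wc -l < toy_unix.csv)
echo "lines before: $before, lines after: $after"
```

The input and output file names must differ: redirecting the output back onto the file being read would empty it before tr gets to read anything.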

U45: This one goes to 11
The Arabidopsis file intron_IME_data.fasta contains intron sequences. Every intron sequence in this file has a header line that contains the following pieces of information:
1. gene name
2. intron position in gene
3. distance of intron from transcription start site (TSS)
4. type of sequence that intron is located in (either CDS or UTR)

The task is to extract five sequences from this file that are: a) from first introns, b) in the 5’ UTR, and c) closest to the TSS. Notice that the solution uses one of the other redirection operators, <, to read from a file.
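
One way such a pipeline could look is sketched below. It is not necessarily the command from the primer. It assumes that headers have the form >gene_iN_distance_type (for example _i1_ for a first intron and 5UTR as the type) and that the characters @ and # never occur in the data; the gene names and sequences are invented for illustration.

```shell
# Hypothetical FASTA file; header layout assumed: >gene_iN_distance_type
cat > toy_introns.fasta <<'EOF'
>AT1G00001.1_i1_204_CDS
GTAAGT
>AT1G00002.1_i1_35_5UTR
GTTTGA
>AT1G00003.1_i2_500_5UTR
GTCCAG
>AT1G00004.1_i1_12_5UTR
GTAAAA
EOF

# 1. tr joins the whole file into one line, marking old line ends with @
# 2. sed + tr start a new line before every >, so each record is one line
# 3. grep keeps first introns (_i1_) located in the 5' UTR
# 4. sort orders records numerically by the third _-separated field (distance)
# 5. head keeps the closest record (use head -n 5 for five sequences)
# 6. the final tr restores the original line breaks
closest=$(tr '\n' '@' < toy_introns.fasta \
  | sed 's/>/#>/g' \
  | tr '#' '\n' \
  | grep "_i1_.*5UTR" \
  | sort -t '_' -k 3 -n \
  | head -n 1 \
  | tr '@' '\n')
echo "$closest"
```

Putting each record on a single line first is what lets line-based tools such as grep and sort treat a header and its sequence as one unit.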

Summary
If you have learnt (and understood) all of the Unix commands so far then you probably will never need to learn anything more in order to do a lot of productive Unix work. But keep on dipping into the man page for all of these commands to explore them in even further detail. If you include the three, as-yet-unmentioned, commands in the last column, then you will probably be able to achieve >95% of everything that you will ever want to do in Unix (remember, you can use the man command to find out more about top, ps, and kill). The power comes from how you can use combinations of these commands.
(Table image: summary of Unix commands)
