Unix and Perl Primer for Biologists - Part 2: Advanced Unix - Reading Notes (U37-U45)

2023-11-19

U37: Counting with grep
Running grep -c simply counts how many lines match the specified pattern.

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c i2 intron_IME_data.fasta 
9785
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCCCCA" intron_IME_data.fasta 
0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCA" intron_IME_data.fasta 
11
# grep -c prints only the count, so piping it to less shows just the number, not the matched lines
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCA" intron_IME_data.fasta | less
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# pwd
/root/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# ls
At_genes.gff  At_proteins.fasta  chr1.fasta  intron_IME_data.fasta
# Count matches in every file in the current directory that has the .fasta extension
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACCCCCCCCA"  *.fasta
At_proteins.fasta:0
chr1.fasta:2
intron_IME_data.fasta:0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "ACGT" *.fasta
At_proteins.fasta:70
chr1.fasta:50612
intron_IME_data.fasta:11924
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "^ATG.*ACACAC.*TGA$"  *.fasta
At_proteins.fasta:0
chr1.fasta:3
intron_IME_data.fasta:0
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "AC.GT" *.fasta
At_proteins.fasta:47
chr1.fasta:62327
intron_IME_data.fasta:17998
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "AC*GT" *.fasta
At_proteins.fasta:2600
chr1.fasta:288454
intron_IME_data.fasta:103917
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Arabidopsis# grep -c "A*C*G*T*" *.fasta
At_proteins.fasta:269436
chr1.fasta:385224
intron_IME_data.fasta:250978
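
The differences between these counts are easier to see on a tiny file. The following is a minimal sketch, assuming a made-up file called demo.fasta (it is not part of the course data). It illustrates that `.` matches exactly one character, that `C*` matches zero or more C characters, and that a pattern such as `A*C*G*T*` can match the empty string, so it counts every line, header lines included.

```shell
# Hypothetical demo file; the name and sequences are made up for illustration.
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\nACTGT\nAGT\nTTTT\n' > "$tmpdir/demo.fasta"

# Literal pattern: only the line containing ACGT matches.
n_literal=$(grep -c "ACGT" "$tmpdir/demo.fasta")

# '.' matches exactly one character: ACTGT matches, ACGT does not.
n_dot=$(grep -c "AC.GT" "$tmpdir/demo.fasta")

# 'C*' means zero or more C characters: ACGT and AGT both match.
n_star=$(grep -c "AC*GT" "$tmpdir/demo.fasta")

# Every element is optional, so the empty string matches and every line is counted.
n_all=$(grep -c "A*C*G*T*" "$tmpdir/demo.fasta")

echo "$n_literal $n_dot $n_star $n_all"
rm -r "$tmpdir"
```

This is why the last command in the transcript reports such large numbers: it is effectively counting the lines in each file.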

U38: Regular expressions in less
If you are viewing a file with less, you can type a forward-slash / character and then specify a pattern; less will search for (and highlight) all matches to that pattern. Technically it searches forward from whatever point you are at in the file. You can also type a question-mark ? to search backwards. The real bonus is that the patterns you specify can be regular expressions.

Task U38.1
Try viewing a sequence file with less and then searching for a pattern such as ATCG.*TAG$. This should make it easier to see exactly where your regular expression pattern matches. After typing a forward-slash (or a question-mark), you can press the up and down arrows to select previous searches.

[Screenshot: searching forward with / and backward with ? in less]

U39: Let me transl(iter)ate that for you
To change upper-case characters to lower-case characters, use the Unix command tr (short for transliterate).
[Screenshot: using tr to convert upper case to lower case]
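
A minimal sketch of tr, using a made-up sequence rather than the course data. tr takes two sets of characters and maps each character of the first set to the character at the same position in the second set.

```shell
# Convert an upper-case DNA string to lower case; the sequence is made up.
lower=$(echo "ACGTTGCA" | tr 'A-Z' 'a-z')
echo "$lower"

# tr maps characters by position, so this yields the complement
# (not the reverse complement) of the same string.
comp=$(echo "ACGTTGCA" | tr 'ACGT' 'TGCA')
echo "$comp"
```

Note that tr reads from standard input only, so a file has to be piped or redirected into it.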

U40: That’s what she sed
To change a particular pattern into something completely different, use sed, a command capable of performing a variety of text manipulations. The ‘s’ part of the sed command puts sed in ‘substitute’ mode, where you specify one pattern (between the first two forward slashes) to be replaced by another pattern (between the second and third forward slashes).
[Screenshot: head piped into sed]
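
A minimal sketch of sed in substitute mode. The header line below is made up for illustration; the real first line of chr1.fasta may look different.

```shell
# Made-up FASTA header; not copied from the course data.
header=">Chr1 example header"
renamed=$(echo "$header" | sed 's/Chr1/Chromosome 1/')
echo "$renamed"

# Without the trailing g flag, only the first match on each line is replaced.
first_only=$(echo "ACACAC" | sed 's/AC/TG/')
all=$(echo "ACACAC" | sed 's/AC/TG/g')
echo "$first_only $all"
```

sed writes the changed text to standard output and leaves the input untouched, so the result has to be redirected if it should be kept.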

U41: Word up
It is useful to get a feeling for how large a file is before you start running lots of commands against it, and in particular to know how many ‘lines’ it has, because many Unix commands like grep and sed work on a line-by-line basis. The Unix command wc (word count) counts the number of lines, words, and bytes in the specified file(s). Running wc -l shows just the line count.
[Screenshot: wc word count]
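
A minimal sketch of wc on a throwaway file (demo.txt is a made-up name). It shows the default three-number output and the line count on its own.

```shell
tmpdir=$(mktemp -d)
printf 'line one\nline two\nline three\n' > "$tmpdir/demo.txt"

# Default output: lines, words, and bytes, followed by the file name.
wc "$tmpdir/demo.txt"

# Reading from standard input drops the file name from the output;
# tr removes the padding spaces that some versions of wc add.
n_lines=$(wc -l < "$tmpdir/demo.txt" | tr -d ' ')
echo "$n_lines"
rm -r "$tmpdir"
```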

U42: GFF and the art of redirection
The next example uses a GFF file. This is a common file format in bioinformatics, and GFF files are used to describe the location of various features on a DNA sequence. Features can be exons, genes, binding sites etc., and the sequence can be a single gene or (more commonly) an entire chromosome. To make it easier to work with, create a new (smaller) file that contains a subset of the original.
By default the output of a command is printed to the screen. To redirect the output into an actual file, use the > symbol, which is one of three redirection operators in Unix.
For now, all you really need to know is that every GFF file has 9 fields, each separated by a tab character. There should always be some text at every position (even if it is just a ‘.’ character). The last field is often used to store a lot of text.
[Screenshot: GFF features]
[Screenshot: GFF subset redirection]
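
A minimal sketch of redirection with a made-up two-line GFF file (the file names and field values are illustrative only, not taken from At_genes.gff). It writes a subset to a new file with `>` and confirms that a line has 9 tab-separated fields.

```shell
tmpdir=$(mktemp -d)
# Made-up GFF lines: 9 tab-separated fields, with '.' where there is no value.
printf 'Chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1\n' > "$tmpdir/demo.gff"
# '>>' appends to a file instead of overwriting it.
printf 'Chr1\tdemo\texon\t100\t300\t.\t+\t.\tParent=gene1\n' >> "$tmpdir/demo.gff"

# '>' sends the output of head to a new file instead of the screen.
head -n 1 "$tmpdir/demo.gff" > "$tmpdir/demo_subset.gff"
n_subset=$(wc -l < "$tmpdir/demo_subset.gff" | tr -d ' ')

# Turn each tab into a newline and count the lines to get the number of fields.
n_fields=$(head -n 1 "$tmpdir/demo_subset.gff" | tr '\t' '\n' | wc -l | tr -d ' ')
echo "$n_subset $n_fields"
rm -r "$tmpdir"
```

Be careful with `>`: it overwrites an existing file of the same name without asking.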
U43: Not just a pipe dream
The 2nd and/or 3rd fields of a GFF file are usually used to describe some sort of biological feature. We might be interested in seeing how many different features are in our file:
[Screenshot: cut, sort, and uniq combined with pipes]

1. The cut command first takes the At_genes_subset.gff file and ‘cuts’ out just the 3rd column (as specified by the -f option). Luckily, the default behavior of the cut command is to split text files into columns based on tab characters (if the columns were separated by another character such as a comma, we would need the -d option to specify it).
2. The sort command takes the output of the cut command and sorts it alphanumerically.
3. The uniq command (in its default mode) collapses runs of identical adjacent lines into a single line, which is why the input is sorted first (otherwise you would see thousands of lines which said ‘curated’, ‘Coding_transcript’ etc.).
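
The three steps above can be sketched on a small made-up GFF file (the feature names and coordinates are illustrative only). The second pipeline leaves out sort to show why it is needed before uniq.

```shell
tmpdir=$(mktemp -d)
# Made-up GFF-like lines with 9 tab-separated fields each.
{
  printf 'Chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1\n'
  printf 'Chr1\tdemo\texon\t100\t300\t.\t+\t.\tParent=g1\n'
  printf 'Chr1\tdemo\tgene\t1500\t2400\t.\t-\t.\tID=g2\n'
  printf 'Chr1\tdemo\texon\t1500\t1800\t.\t-\t.\tParent=g2\n'
} > "$tmpdir/demo.gff"

# cut keeps column 3, sort groups identical values, uniq collapses each group.
features=$(cut -f 3 "$tmpdir/demo.gff" | sort | uniq)
echo "$features"

# Without sort, the alternating gene/exon lines are never adjacent,
# so uniq collapses nothing and all four lines remain.
unsorted=$(cut -f 3 "$tmpdir/demo.gff" | uniq | wc -l | tr -d ' ')
echo "$unsorted"
rm -r "$tmpdir"
```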

Suppose we want to find which features start earliest in the chromosome sequence. The start coordinate of a feature is always specified by column 4 of the GFF file, so we cut out just the two columns of interest (3 and 4). The -f option of the cut command lets us specify which columns we want to extract. By default sort orders lines alphanumerically, so we use the -n option to make it sort numerically. We could sort on either column; -k 2 specifies that the second column is used. Finally, the head command keeps just the first 10 rows of output, which are the lines from the GFF file with the lowest starting coordinates.
[Screenshot: cut, sort, and head]
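
A minimal sketch of this pipeline on a made-up three-line GFF file (names and coordinates are illustrative). Here head keeps 2 rows instead of 10 because the demo file is so small.

```shell
tmpdir=$(mktemp -d)
# Made-up GFF-like lines, deliberately not in coordinate order.
{
  printf 'Chr1\tdemo\tgene\t1500\t2400\t.\t-\t.\tID=g2\n'
  printf 'Chr1\tdemo\texon\t900\t1200\t.\t+\t.\tParent=g3\n'
  printf 'Chr1\tdemo\tgene\t100\t800\t.\t+\t.\tID=g1\n'
} > "$tmpdir/demo.gff"

# Keep columns 3 and 4, sort numerically on the second remaining column,
# then keep the two features with the lowest start coordinates.
earliest=$(cut -f 3,4 "$tmpdir/demo.gff" | sort -n -k 2 | head -n 2)
echo "$earliest"
rm -r "$tmpdir"
```

Without -n, sort compares the coordinates character by character, so 1500 would be placed before 900.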

U44: The end of the line
Pressing enter generates one of two different events, depending on what computer you are using. It produces a newline, which is represented internally by either a line feed or a carriage return character (Windows actually uses a combination of both). When the convention differs from the one Unix expects, a text file can look unreadable in a Unix text viewer. In Unix (and in Perl and other programming languages) the pattern \n denotes a line feed and \r a carriage return. A common fix is to replace every \r with \n.

Use less to look at the Data/Misc/excel_data.csv file. This is a small file that was exported from a Mac version of Microsoft Excel. In less it appears as a single line, with the newlines shown as ^M characters. You can convert these carriage returns into Unix-friendly line-feed characters by using the tr command like so:
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv
[Screenshot: newline characters shown as ^M in less]

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# pwd
/root/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# ls
excel_data.csv  oligos.txt
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv 

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# tr '\r' '\n'  < excel_data.csv 
sequence 1,acacagagag
sequence 2,acacaggggaaa
sequence 3,ttcacagaga
sequence 4,cacaccaaacac
sequence 5,tttatatttaatata
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data.csv 

root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# tr '\r' '\n'  < excel_data.csv  >excel_data_formatted.csv
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# ls
excel_data.csv  excel_data_formatted.csv  oligos.txt
root@kali:~/Downloads/Unix_and_perl_primer_course_material/Unix_and_Perl_course/Data/Misc# less excel_data_formatted.csv

[Screenshot: tr output redirected to a new file]
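
A minimal sketch of the same fix on throwaway files (all file names and contents are made up). It also covers the Windows case as an assumption beyond the notes: there each line ends in \r\n, so the \r is deleted with tr -d rather than translated.

```shell
tmpdir=$(mktemp -d)
# Made-up file with carriage-return line endings: records separated by \r, no \n at all.
printf 'sequence 1,acac\rsequence 2,ttga\r' > "$tmpdir/mac.csv"

# Before conversion there are no line feeds, so wc -l reports 0 lines.
before=$(wc -l < "$tmpdir/mac.csv" | tr -d ' ')

# Translate every carriage return into a line feed and save the result.
tr '\r' '\n' < "$tmpdir/mac.csv" > "$tmpdir/unix.csv"
after=$(wc -l < "$tmpdir/unix.csv" | tr -d ' ')

# Windows-style input ends lines with \r\n, so the \r is deleted instead.
printf 'a,1\r\nb,2\r\n' | tr -d '\r' > "$tmpdir/from_windows.csv"
win=$(wc -l < "$tmpdir/from_windows.csv" | tr -d ' ')

echo "$before $after $win"
rm -r "$tmpdir"
```

Translating \r to \n on a Windows file would instead leave an empty line after every record.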

U45: This one goes to 11
This task uses the Arabidopsis file intron_IME_data.fasta. Every intron sequence in this file has a header line that contains the following pieces of information:
1. gene name
2. intron position in gene
3. distance of intron from transcription start site (TSS)
4. type of sequence that intron is located in (either CDS or UTR)

The goal is to extract five sequences from this file that are: a) from first introns, b) in the 5’ UTR, and c) closest to the TSS. Notice that the solution uses one of the other redirect operators, <, to read from a file.
[Screenshot: U45]
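
One way to approach this task is sketched below on a made-up FASTA file. Several things here are assumptions rather than facts from the notes: the header layout `>name_iN_distance_type`, the one-line sequences, and the use of `@` and `#` as temporary markers (which only works if neither character occurs in the file). The idea is to join each header with its sequence on one line, filter and sort those lines, then split them apart again.

```shell
tmpdir=$(mktemp -d)
# Made-up records; the header layout name_position_distance_type is an assumption.
{
  printf '>GENE1_i1_300_5UTR\nACGTAC\n'
  printf '>GENE2_i2_150_CDS\nGGGTTT\n'
  printf '>GENE3_i1_120_5UTR\nTTTAAA\n'
  printf '>GENE4_i1_90_CDS\nCCCGGG\n'
} > "$tmpdir/demo.fasta"

# 1. tr joins the whole file into one line, marking old newlines with @.
# 2. sed and tr start a new line before every '>' so each record is one line.
# 3. grep keeps first introns (i1) located in the 5' UTR.
# 4. sort orders records numerically by the third '_'-separated field (distance).
# 5. head keeps the closest record; the last tr restores the original newlines.
closest=$(tr '\n' '@' < "$tmpdir/demo.fasta" \
  | sed 's/>/#>/g' \
  | tr '#' '\n' \
  | grep "i1_.*5UTR" \
  | sort -t _ -n -k 3,3 \
  | head -n 1 \
  | tr '@' '\n')
echo "$closest"
rm -r "$tmpdir"
```

For the real task, head -n 5 would keep five records instead of one.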

Summary
If you have learnt (and understood) all of the Unix commands so far then you probably will never need to learn anything more in order to do a lot of productive Unix work. But keep on dipping into the man page for all of these commands to explore them in even further detail. If you include the three, as-yet-unmentioned, commands in the last column, then you will probably be able to achieve >95% of everything that you will ever want to do in Unix (remember, you can use the man command to find out more about top, ps, and kill). The power comes from how you can use combinations of these commands.
[Image: summary table of Unix commands]
