pagetitle |
---|
SARS-CoV-2 Genomics |
:::highlight
Questions
- How can I inspect the content of and manipulate text files?
- How can I find and replace text patterns within files?
Learning Objectives
- Inspect the content of text files (
head
,tail
,cat
,zcat
,less
). - Use the
*
wildcard to work with multiple files at once. - Redirect the output of a command to a file (
>
,>>
). - Find a pattern in a text file (
grep
) and do basic pattern replacement (sed
).
:::
:::note This section has an accompanying slide deck. :::
Often we want to investigate the content of a file, without having to open it in a text editor. This is especially useful if the file is very large (as is often the case in bioinformatic applications).
For example, let's take a look at the file artic_primers_pool1.bed
.
We will start by printing the whole content of the file with the cat
command, which stands for "concatenate" (we will see why it's called this way in a little while):
$ cat artic_primers_pool1.bed
This outputs a lot of text, because the file is quite long!
Instead, it is often more useful to look only at the top few lines of the file.
We can do this with the head
command:
$ head artic_primers_pool1.bed
MN908947.3 30 54 nCoV-2019_1_LEFT 1 +
MN908947.3 385 410 nCoV-2019_1_RIGHT 1 -
MN908947.3 642 664 nCoV-2019_3_LEFT 1 +
MN908947.3 1004 1028 nCoV-2019_3_RIGHT 1 -
MN908947.3 1242 1264 nCoV-2019_5_LEFT 1 +
MN908947.3 1623 1651 nCoV-2019_5_RIGHT 1 -
MN908947.3 1875 1897 nCoV-2019_7_LEFT 1 +
MN908947.3 1868 1890 nCoV-2019_7_LEFT_alt0 1 +
MN908947.3 2247 2269 nCoV-2019_7_RIGHT 1 -
MN908947.3 2242 2264 nCoV-2019_7_RIGHT_alt5 1 -
By default, head
prints the first 10 lines of the file.
We can change this using the -n
option, followed by a number, for example:
$ head -n 2 artic_primers_pool1.bed
MN908947.3 30 54 nCoV-2019_1_LEFT 1 +
MN908947.3 385 410 nCoV-2019_1_RIGHT 1 -
Similarly, we can look at the bottom few lines of a file with the tail
command:
$ tail -n 2 artic_primers_pool1.bed
MN908947.3 29288 29316 nCoV-2019_97_LEFT 1 +
MN908947.3 29665 29693 nCoV-2019_97_RIGHT 1 -
Finally, if we want to open the file and browse through it, we can use the less
command:
$ less artic_primers_pool1.bed
less
will open the file and you can use ↑ and ↓ to move line-by-line or the Page Up and Page Down keys to move page-by-page.
You can exit less
by pressing Q (for "quit").
This will bring you back to the console.
Finally, it can sometimes be useful to count how many lines, words and characters a file has.
We can use the wc
command for this:
wc artic_primers_pool*.bed
110 660 4961 artic_primers_pool1.bed
108 648 4878 artic_primers_pool2.bed
218 1308 9839 total
In this case, we used the *
wildcard to count lines, words and characters (in that order, left-to-right) of both our primer files.
Often, we only want to count one of these things, and wc
has options for all of them:
-l
counts lines only.-w
counts words only.-c
counts characters only.
For example, if we just want to know how many primers we have (the number of lines in the files):
wc -l artic_primers_pool*.bed
110 artic_primers_pool1.bed
108 artic_primers_pool2.bed
218 total
:::exercise
- Use the
less
command to look inside the filedata/reference_genome.fa
. - How many lines does this file contain?
- Use the
less
command again but with the option-S
. Can you understand what this option does?
Answer
We can investigate the content of the reference file using less data/reference_genome.fa
.
From this view, it looks like this file contains several lines of content: the genome is almost 30kb long, so it's not surprising we see so much text!
We can use Q to quit and go back to the console.
To check the number of lines in the file, we can use the wc -l data/reference_genome.fa
command.
The answer is only 2.
If we use less -S data/reference_genome.fa
the display is different this time.
We see only two lines in the output.
If we use the → and ← arrows we can see that the text now goes "out of the screen".
So, what happens is that by default less
will "wrap" long lines, so if a line of text is too long, it will continue it on the next line of the screen.
When we use the option -S
it instead displays each line individually, and we can use the arrow keys to see the content that does not fit on the screen.
:::note The primer files we just looked into are in a format called BED. This is a standard bioinformatic file format used to store coordinates of genomic regions. In this case, it corresponds to the coordinates of each primer start and end position in the SARS-CoV-2 reference genome (Wuhan-Hu-1).
In the exercise we looked at another standard file format called FASTA. This one is used to store nucleotide or amino acid sequences. In this case, the complete nucleotide sequence of the SARS-CoV-2 reference genome.
We will learn more about these files in Intro to NGS. :::
We said that the cat
command we used above stands for "concatenate".
This is because this command can be used to concatenate (combine) several files together.
For example, if we wanted to combine both sets of primer pools into a single file:
cat artic_primers_pool1.bed artic_primers_pool2.bed
The cat
command we just used printed the output to the screen.
But what if we wanted to save it into a file?
We can achieve this by sending (or redirecting) the output of the command to a file using the >
operator.
cat artic_primers_pool1.bed artic_primers_pool2.bed > artic_primers.bed
Now, the output is not printed to the console, but instead sent to a new file.
We can check that the file was created with ls
.
If we use >
and the output file already exists, its content will be replaced.
If what we want to do is append the result of the command to the existing file, we should use >>
instead.
Let's see this in practice in the next exercise.
:::exercise
- List the files in the
data/sequencing_run1/
directory. Save the output in a file called "sequencing_files.txt". - What happens if you run the command
ls data/sequencing_run2/ > sequencing_files.txt
? - The operator
>>
can be used to append the output of a command to an existing file. Try re-running both of the previous commands, but instead using the>>
operator. What happens now?
Answer
Task 1
To list the files in the directory we use ls
, followed by >
to save the output in a file:
$ ls data/sequencing_run1/ > sequencing_files.txt
We can check the content of the file:
$ cat sequencing_files.txt
sample1_run1.fastq
sample2_run1.fastq
sample3_run1.fastq
sample4_run1.fastq
Task 2
If we run ls data/sequencing_run2/ > sequencing_files.txt
, we will replace the content of the file:
$ cat sequencing_files.txt
sample1_run2.fastq
sample2_run2.fastq
sample3_run2.fastq
sample4_run2.fastq
sample5_run2.fastq
sample6_run2.fastq
Task 3
If we start again from the beggining, but instead use the >>
operator the second time we run the command, we will append the output to the file instead of replacing it:
$ ls data/sequencing_run1/ > sequencing_files.txt
$ ls data/sequencing_run2/ >> sequencing_files.txt
$ cat sequencing_files.txt
sample1_run1.fastq
sample2_run1.fastq
sample3_run1.fastq
sample4_run1.fastq
sample1_run2.fastq
sample2_run2.fastq
sample3_run2.fastq
sample4_run2.fastq
sample5_run2.fastq
sample6_run2.fastq
Something it can be very useful to find lines of a file that match a particular text pattern.
We can use the tool grep
("global regular expression print") to achieve this.
For example, let's find the word "nCoV-2019_46" in our primers file:
$ grep "nCoV-2019_46" artic_primers.bed
MN908947.3 13599 13621 nCoV-2019_46_LEFT 2 +
MN908947.3 13602 13625 nCoV-2019_46_LEFT_alt1 2 +
MN908947.3 13962 13984 nCoV-2019_46_RIGHT 2 -
MN908947.3 13961 13984 nCoV-2019_46_RIGHT_alt2 2 -
We can see the result is all the lines that matched this word pattern.
:::exercise
Consider the file we previously saved in results/variants.tsv
.
(Note, if you don't have this file, run the following commands: mkdir -p results; mv output.tsv results/variants.tsv
)
- Create a new file called
results/alpha.tsv
that contains only the Alpha variant samples.Hint
You can usegrep
to find a pattern in a file. You can use>
to redirect the output of a command to a new file. - How many samples are you left with?
Answer
Task 1
We can use grep
to find a pattern in our text file and use >
to save the output in a new file:
$ grep "Alpha" results/variants.tsv > results/alpha.tsv
We could investigate the output of our command using less results/alpha.tsv
.
Task 2
We can use wc
to count the lines of the newly created file:
$ wc -l results/alpha.tsv
Giving us 31 as the result.
One of the most prominent text-processing utilities is the sed
command, which is short for "stream editor".
A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).
sed
contains several sub-commands, but the main one we will use is the substitute or s
command.
The syntax is:
sed 's/pattern/replacement/options'
Where pattern
is the word we want to substitute and replacement
is the new word we want to use instead.
There are also other "options" added at the end of the command, which change the default behaviour of the text substitution.
Some of the common options are:
g
: by defaultsed
will only substitute the first match of the pattern. If we use theg
option ("global"), thensed
will substitute all matching text.i
: by defaultsed
matches the pattern in a case-sensitive manner. For example 'A' (Uppercase A) and 'a' (Lowercase A) are treated as different. If we use thei
option ("case-insensitive") thensed
will treat 'A' and 'a' as the same.
For example, let's create a file with some text inside it:
$ echo "Hello world. How are you world?" > hello.txt
(Note: the echo
command is used to print some text on the console. In this case we are sending that text to a file to use in our example.)
If we do:
$ sed 's/world/participant/' hello.txt
This is the result
Hello participant. How are you world?
We can see that the first "world" word was replaced with "participant".
This is the default behaviour of sed
: only the first pattern it finds in a line of text is replaced with the new word.
We can modify this by using the g
option after the last /
:
$ sed 's/world/participant/g' hello.txt
Hello participant. How are you participant?
:::note Regular Expressions
Finding patterns in text can be a very powerful skill to master. In our examples we have been finding a literal word and replacing it with another word. However, we can do more complex text substitutions by using special keywords that define a more general pattern. These are known as regular expressions.
For example, in regular expression syntax, the character .
stands for "any character".
So, for example, the pattern H.
would match a "H" followed by any character, and the expression:
$ sed 's/H./X/g' hello.txt
Results in:
Xllo world. Xw are you world?
Notice how both "He" (at the start of the word "Hello") and "Ho" (at the start of the word "How") are replaced with the letter "X".
Because both of them match the pattern "H followed by any character" (H.
).
To learn more see this Regular Expression Cheatsheet. :::
You may have asked yourself, if the /
character is used to separate parts of the sed
substitute command, then how would we replace the "/" character itself in a piece of text?
For example, let's add a new line of text to our file:
$ echo "Welcome to this workshop/course." >> hello.txt
Let's say we wanted to replace "workshop/course" with "tutorial" in this text. If we did:
$ sed 's/workshop/course/tutorial/' hello.txt
We would get an error:
sed: -e expression #1, char 5: unknown option to `s'
This is because we ended up with too many /
in the command, and sed
uses that to separate its different parts of the command.
In this situation we need to tell sed
to ignore that /
as being a special character but instead treat it as the literal "/" character.
To to this, we need to use \
before /
, which is called the "escape" character.
That will tell sed
to treat the /
as a normal character rather than a separator of its commands.
So:
$ sed 's/workshop\/course/tutorial/' hello.txt
↑
This / is "escaped" with \ beforehand
This looks a little strange, but the main thing to remember is that \/
will be interpreted as the character "/" rather than the separator of sed
's substitute command.
:::exercise
The file in data/envelope_protein.fa
contains the amino acid sequence of the SARS-CoV-2 envelope protein for 4 patient samples.
$ cat data/envelope_protein.fa
>patient01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Patient02/incomplete
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFY...................
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
We will cover FASTA files in the following section, for now all we need to know is that each sequence has a name in a line that starts with the character >
.
Use sed
to achive the following:
- Substitute the word
patient
withsample
.Hint
Similarly to how you can use theg
option to go "global" substitution, you can also use thei
option to do case-insensitive text substitution. - Substitute the word
/incomplete
with-missing
. - The character
.
is also a keyword used in regular expressions to mean "any character". See what happens if you run the commandsed 's/./X/g' data/envelope_protein.fa
. How would you fix this command to literally only substitute the character.
withX
?
Answer
Task 1
To replace the word patient
with sample
, we can do:
$ sed 's/patient/sample/i' data/envelope_protein.fa
>sample01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>sample02/incomplete
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFY...................
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
We have to use the option i
to tell the sed
that we want to match the word patient
in case-insensitive manner.
Task 2
For the second task, if we do:
$ sed 's//incomplete/-missing/' data/envelope_protein.fa
We will get an error.
We need to use \
to escape the /
keyword in our pattern, so:
$ sed 's/\/incomplete/-missing/' data/envelope_protein.fa
>patient01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Patient02-missing
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFY...................
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
Task 3
Finally, to replace the character .
, if we do:
$ sed 's/./X/g' data/envelope_protein.fa
XXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Everything becomes "X"!
That's because .
is a keyword used in regular expressions to mean "any character".
Because we are using the g
option (for "global substitution"), we replaced every single character with "X".
To literally replace the character ".", we need to again use the \
escape character, so:
$ sed 's/\./X/g' data/envelope_protein.fa
>patient01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Patient02/incomplete
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFYXXXXXXXXXXXXXXXXXXX
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
:::
:::highlight Key Points
- The
head
andtail
commands can be used to look at the top or bottom of a file, respectively. - The
less
command can be used to interactively investigate the content of a file. Use ↑ and ↓ to browse the file and Q to quit and return to the console. - The
cat
command can be used to combine multiple files together. Thezcat
command can be used instead if the files are compressed. - The
>
operator redirects the output of a command into a file. If the file already exists, it's content will be overwritten. - The
>>
operator also redictects the output of a command into a file, but appends it to any content that already exists. - The
grep
command can be used to find the lines in a text file that match a text pattern. - The
sed
tool can be used for advanced text manipulation. The "substitute" command can be used to text replacement:sed 's/pattern/replacement/options'
. :::