pagetitle |
---|
The Unix Shell: sed - stream editor |
:::highlight
Questions
- What is
sed
? How to usesed
? - What is the
sed
script? - What is the
s
command? How to use thes
command? - What is an escape character?
- How to use wildcards in the
sed
?
Learning Objectives
- Explain the basic usage of the
sed
. - A closer look to
sed
script. - Demonstrate the usage of the
sed
in processing text.
:::
The sed
command is short for stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).
sed
is one of the most prominent text-processing utilities on GNU/Linux. sed
performs operations on text coming from standard input or a file.
:::note
In this tutorial, we'll use the GNU version of sed
(available on Ubuntu and other Linux operating systems). The macOS has the BSD version of sed
which has different options and arguments. You can install the GNU version of sed
with Homebrew using brew install gnu-sed
.
:::
There are many instances when we want to substitute a text in a line or filter out specific lines. In such cases, we can take advantage of sed
.
sed
operates on a stream of text which it gets either from a text file or from standard input (STDIN). It means you can use the output of another command as the input of sed
-- in short, you can combine sed
with other commands.
By default, sed
outputs everything to standard output (STDOUT). It means, unless redirected, sed
will print its output onto the terminal/screen instead of saving it in a file.
:::note
sed
edits line-by-line and in a non-interactive way.
:::
The basic usage of sed
:
$ sed [OPTIONS] SCRIPT INPUT-FILES
To begin this tutorial, create a file 'input.txt' and write the following line and save.
hello world
Now we'll use the sed
to replace 'hello' with 'world':
$ sed 's/hello/world/' input.txt
# Output
world world
Just follow this tutorial we will explain all the magic which just happened.
Let's take a closer look at the sed
script command. In the above script (s/hello/world
), 's' is called the sed
command. There are many sed
commands, but we'll mainly focus on the 's' command. To see a list of all sed
's commands, visit https://www.gnu.org/software/sed/manual/html_node/sed-commands-list.html.
The basic syntax for a sed
command is:
[ADDRESS]X[OPTION]
Here, 'X' is the single letter sed
command. [ADDRESS] is optional. If [ADDRESS] is specified, the command X will be executed only on the matched lines. Now, [ADDRESS] can be a single line number, regular expression, or a range of lines. For some sed
commands we use additional [OPTION].
The 's' command, also known as substitute command, is probably the most important in sed
and has a lot of different options. The basic syntax for 's' command is:
s/regexp/replacement/[flags]
As you might have guessed, 's' is for substitution. It will tell sed
that we are going make some substitution. 's' is followed by '/' you can think of it as a delimiter.
After that, we have regexp, which is short for the regular expression. We use the regular expression to match the pattern against the content of a line.
:::note What is a regular expression?
A regular expression is a pattern formed in a standardised way that helps search in a string. Regular expressions are useful for matching common patterns of string such as email addresses, phone numbers, URLs, etc. To learn more visit: https://www.keycdn.com/support/regex-cheatsheet :::
The third one is the replacement. So, if sed
finds the match, it will replace the matched text/pattern with the replacement text.
So, the basic concept of the 's' command is: it attempts to match the pattern in a line, if the match is successful, then that portion of the pattern is replaced with replacement.
At the last of the 's' command, you can see [flags]. [flags] are optional. An 's' command can have zero or more flags. Some of the common flags are:
-
g: By default
sed
command will only substitute the first match. If the line has more than one match, all will not get replaced. It replaces only the first match. But, if we use the g flag, then thesed
command will replace all matching text. -
i:
sed
matches the regular expression in a case-sensitive manner. It means 'A' (Uppercase A) and 'a' (Lowercase A) are different. But, if we use the i flag, then thesed
will match the regular expression in a case-insensitive manner. This means now thesed
will treat 'A' and 'a' as same.
To learn more about the 's' command visit https://www.gnu.org/software/sed/manual/html_node/The-_0022s_0022-Command.html.
Let's go back to our first example.
$ sed 's/hello/world' input.txt
Now, I guess it is somewhat clear to you what is happening here. The sed
is substituting 'hello' with 'world' at every line. And we are getting the input from a file called 'input.txt'.
Create a new file 'input2.txt' and write the following text in it:
# Output
Hello, this is a test line. This is a very short line.
This is test line two. In this line, we have two occurrences of the test.
This line has many occurrences of the Test with different cases. test tEst TesT.
Now try to replace 'test' with 'hello'. We can do something like this:
$ sed 's/test/hello/' input2.txt
# Output
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the test.
This line has many occurrences of the Test with different case. hello tEst TesT.
You may have noticed that lines two and three still have 'test'. This is because we ask the sed
to only replace the first text which matches. To replace all the matches, we have to use g flag. Let's try with g flag.
$ sed 's/test/hello/g' input2.txt
# Output
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the Test with different case. hello tEst TesT.
Ah, something is still wrong with the third line. It is not replacing 'Test', 'tEst' and 'TesT'. We have to do something to tell the sed
that we want to replace all of them. We can do this by using i flag. Let's add one more flag:
$ sed 's/test/hello/gi' input2.txt
# Output
Hello, this is a hello line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the hello with different case. hello hello hello.
Wonderful!
Let's say now we only want to replace all the occurrences of the 'test' at only line 3 or from line 2 to 3. Try to remember the basic syntax for the sed
command. Remember! You can add the address of the line or a range of lines that you want to edit. Here is an example:
$ sed '3s/test/hello/gi' input2.txt
# Output
Hello, this is a test line. This is a very short line.
This is test line two. In this line, we have two occurrences of the test.
This line has many occurrences of the hello with different case. hello hello hello.
See only the third line is executed by the sed
. The first two lines are as it is. At the beginning of the 's' command, we are adding the line number which we want to edit. We can also add a range of lines. Here is one more example:
$ sed '2,3s/test/hello/gi' input2.txt
As you may have got the idea, it will edit lines 2 and 3. The output will be:
# Output
Hello, this is a test line. This is a very short line.
This is hello line two. In this line, we have two occurrences of the hello.
This line has many occurrences of the hello with different case. hello hello hello.
You can combine multiple sed
commands. Let's say in the 'input2.txt' we first want to remove all the occurrences of 'test', then we want to replace 'line' with 'world'. Here it is clear that we have to do two substitutions. One for removing 'test' and one for replacing 'line'. We can do something like this:
$ sed 's/test//gi' input2.txt | sed 's/line/world/gi'
Here first we are removing 'test' with empty text, and then we are passing the output of first sed
as the input of second sed
. The second sed
is replacing 'line' with 'world'. You will get an output like this:
# Output
Hello, this is a world. This is a very short world.
This is world two. In this world, we have two occurrences of the .
This world has many occurrences of the with different case. .
An escape character is a character that invokes an alternative interpretation of the following character. Sometimes it is also used to insert unallowed characters in a string. An escape character is a backslash \
followed by a character (or characters). Some of the keywords/characters which you want to escape are as follows:
*
: Asterisk..
: Dot.[
: Left square bracket.]
: Right square bracket.?
: Question mark.$
: Dollar sign.^
: Caret/
: Forward slash\
: Backward slash
Let's try to understand this with examples. For this tutorial, create a file 'input3.txt' and put the following text in it:
This software/application is a part of this workshop.
Now, what if we want to replace 'software/application' with 'material'. If we try to do something like: sed 's/software/application/material/' input3.txt
, the sed
will throw an error. Because here we are using '/' as delimiter, sed will think it has to replace 'software' with 'application' but, why there is a third option. To solve this error, we have to escape one '/', we can do this by adding a \
in front of '/'. It will tell the sed
not to interpret '/' as a special character.
$ sed 's/software\/application/material/' input3.txt
# Output
This material is a part of this workshop.
Escape characters are also used to provide visual representations of non-printing characters and characters that usually have special meanings. The list of commonly used escape characters in the sed
is as follows:
\n
: a newline.\r
: a carriage return.\t
: a horizontal tab.
For more details visit https://www.gnu.org/software/sed/manual/sed.html#Escapes.
We have already covered wildcards on the "Basic Commands" page; here, we will discuss the use of .
, *
and ?
in the sed
. These wildcards are part of the regular expression.
-
You can use
.
as a placeholder for any character except newline (\n
) or empty text. For example, if you use.
in your regular expression likex.z
then it will match to strings likexaz
,xbz
,x1z
,xzz
, etc., but, it will not match toxz
. -
You can use
*
to match 0 or more occurrences of the previous character. For example,xy*z
will match to strings likexz
(0 occurrences of y),xyz
(1 occurrence of y),xyyz
and so on. -
?
is a bit similar to*
. The difference is it will only match for 0 or 1 occurrence of the previous character. For example,xy?z
will match to strings likexz
(0 occurrences of y),xyz
(1 occurrence of y) but not toxyyz
.
Now, let's do some coding. Create a file 'input4.txt' and copy the following sequence:
ATGCCTGATTGGGCTACGTCGTAAGCGATGGCTAGGTATCGTAAAGGGGTTTGGGAACCCCAATCACTAGCT
Let's say we want to replace anything between A
and G
with U
. We can do this using the sed
:
$ sed 's/A.G/AUG/g' input4.txt
#Output
AUGCCTGATTGGGCTAUGTCGTAUGCGAUGGCTAUGTATCGTAAUGGGGTTTGGGAACCCCAATCACTAGCT
:::exercise
Replace Me
Click on the following link to download the fasta file containing the protein sequence for one of the SARS-CoV-2, envelope_gene.fa
. You have four tasks to do on this file. They are:
- Replace the word
patient
withsample
. Make sure to replace all words matching topatient
in a case insensitive manner. - Replace all the
.
withX
. - Remove
/incomplete
from one of the sample names. - Save the out to
envelope_gene_processed.fa
.
Answer
We will walk through the solution step by step. At first, to replace the word patient
with sample
, we can do something like this:
$ sed 's/patient/sample/i' envelope_gene.fa
# Output
>sample01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>sample02/incomplete
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFY...................
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
We have to use the flag i
to tell the sed
that we want to match the word patient
in case insensitive manner.
Second we have to replace all the .
with X
. Remember the .
is a keyword (or has special meaning in the sed
). So, to have the literal meaning of .
, we have to escape the .
with \
. We can replace all the .
as follows:
$ sed 's/\./X/g' envelope_gene.fa
# Output
>patient01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Patient02/incomplete
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFYXXXXXXXXXXXXXXXXXXX
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
In line 4, you can see that all the .
is replaced by X
. Note that to replace all the .
we also have to use g
flag.
Third, we have remove \incomplete
. Again /
is a keyword in the sed
. We have to do something similar to the previous step.
$ sed 's/\/incomplete//' envelope_gene.fa
# Output
>patient01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Patient02
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFY...................
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
Now, we have to combine all three steps (this we can do by using pipe|
) and then redirect the output to a file rather than the default output.
$ sed 's/patient/sample/i' < envelope_gene.fa | sed 's/\./X/g' | sed 's/\/incomplete//' > envelope_gene_processed.fa
# You will see no output in the terminal. But you will see a new file
# `envelope_gene_processed.fa`, which will look similar to the following
>sample01
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>sample02
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNLSLVKPSFYXXXXXXXXXXXXXXXXXXX
>sample04
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
>Sample05
MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV*
To learn more about sed visit https://www.gnu.org/software/sed/manual/sed.html