sed use cases


In this post, I'm going to look at sed ("stream editor"), which is commonly used in the IT world; most DevOps and Ops engineers use it many times every single day. When it comes to implementations, there are two main versions: GNU and BSD. There are some small differences between them, and it's better to know them before a troubleshooting session :). I will pay attention to sed use cases.

The amount of information in the manual pages is a bit overwhelming; that's why I decided to write this post, which tries to cover the most important use cases of the sed command.


Before I start

Sample text

All cases will be tested against the same text file, which I downloaded with the command below.

wget https://raw.githubusercontent.com/dwyl/english-words/master/words.txt

2
1080
&c
10-point
10th
11-point
12-point
16-point
18-point
1st

[..]

Zwinglianism
Zwinglianist
zwitter
zwitterion
zwitterionic
Zwolle
Zworykin
ZZ
zZt
ZZZ

Approaches

To process text with sed you can use 4 approaches:

  • In-place

Basically, it's sed -i FILE, where sed edits the file in place. You don't have to manage temporary files yourself, so everything stays clean and tidy. However, it goes hand in hand with one serious danger: there is no way back. So use it only if you are 100% sure what the command does. To mitigate the risk, a good approach is to make a backup before you proceed; sed can do it for you when you attach a suffix to the flag, e.g. sed -i.bak.

In addition, in-place processing differs between the GNU and BSD versions. For GNU it's enough to add the -i flag: sed -i '/pattern/' FILE does the job, whereas the BSD version requires a small hack, an explicit empty backup suffix: sed -i '' '/pattern/' FILE

  • Pipe

Roughly speaking, this is the command | sed structure. From my perspective this is the most common one. Nothing is edited in place, so there is no risk to a single file that has no backup. It is also the one and only option when you are working with command output.

  • Redirect to temporary file

If you are afraid of in-place processing, you can do it step by step like below:

sed '/pattern/' FILE > TMP_FILE
mv TMP_FILE FILE

or do it as a one-liner (with && the original file is replaced only if sed succeeded)

sed '/pattern/' FILE > TMP_FILE && mv TMP_FILE FILE

  • Print to standard output

This is a subset of the previous point.

sed '/pattern/' FILE

This is the simplest one, and therefore I'm going to use it in most of the examples.
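
To make the in-place approach with a backup a bit more concrete, here is a minimal sketch. The file sample.txt and its contents are made up for illustration, and I'm assuming the attached-suffix form -i.bak, which to my knowledge both GNU and BSD sed accept.

```shell
# Hypothetical sample file, created in a throwaway directory.
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/sample.txt"

# Edit in place, but keep the original as sample.txt.bak.
sed -i.bak '/beta/d' "$workdir/sample.txt"

cat "$workdir/sample.txt"      # the edited file
cat "$workdir/sample.txt.bak"  # the untouched original
```

If the result is not what you expected, moving the .bak file back over the edited one restores the original.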

Use cases of sed

Finally, it's high time to head to the use cases. As you probably know, sed can be used in many ways. I've selected the most commonly used ones.

Delete lines which contain a specific pattern using sed

To delete the matching lines and print the result to standard output, use the structure below

sed '/pattern/d' FILE

Example

$ head words.txt 
2
1080
&c
10-point
10th
11-point
12-point
16-point
18-point
1st

$ head words.txt | sed '/10th/d'
2
1080
&c
10-point
11-point
12-point
16-point
18-point
1st

$ head words.txt | sed '/point/d'
2
1080
&c
10th
1st

Delete line which contains specific pattern - alternative

As an alternative you can use grep or awk

grep

grep -v pattern FILE

awk

awk '!/pattern/' FILE

Example of alternative

$ head -100 words.txt | tail -10
aah
aahed
aahing
aahs
AAII
aal
Aalborg
Aalesund
aalii
aaliis

$ head -100 words.txt | tail -10 | grep -v aah
AAII
aal
Aalborg
Aalesund
aalii
aaliis

$ head -100 words.txt | tail -10 | awk '!/aah/'
AAII
aal
Aalborg
Aalesund
aalii
aaliis
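
As a small self-contained sketch, the three tools can be compared on the same input. The four sample words are just a made-up subset, and the variable names are mine. The last command shows that the pattern is a regular expression, so anchoring it with ^ and $ deletes only the line that is exactly "aah".

```shell
# Made-up sample input, kept in a variable so every tool sees the same text.
sample=$(printf 'aah\naahed\nAAII\naal\n')

# Three ways to drop every line that contains "aah".
with_sed=$(printf '%s\n' "$sample" | sed '/aah/d')
with_grep=$(printf '%s\n' "$sample" | grep -v 'aah')
with_awk=$(printf '%s\n' "$sample" | awk '!/aah/')

# Anchored pattern: only the line that is exactly "aah" is deleted.
exact_only=$(printf '%s\n' "$sample" | sed '/^aah$/d')

printf '%s\n' "$with_sed"
printf '%s\n' "$exact_only"
```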

Phew, first use case - done!


Delete empty lines using sed

How to remove empty lines? This is a really common problem, especially when you process the output of commands. The answer is 'jump to point no. 1'. Yes, an empty line is just a line matching a specific pattern (in the meaning of regular expressions) which consists of two items, ^$: the metacharacter for the beginning of the line followed by the metacharacter for the end of the line.

So, the schema is the same

sed '/pattern/d' FILE

But the pattern is ^$

sed '/^$/d' FILE

Example

Please have a look at the well-known ping output; in the second run sed removed the empty line

$ ping 1.1.1.1 -c 3
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=20.7 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=57 time=21.2 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=57 time=11.3 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 11.397/17.783/21.241/4.523 ms

$ ping 1.1.1.1 -c 3 | sed '/^$/d'
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=13.5 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=57 time=12.1 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=57 time=14.4 ms
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 12.141/13.372/14.418/0.948 ms

Alternative

When it comes to empty lines we have more options: grep, awk, and tr.

grep

Just a dot, which matches any line that contains at least one character

grep .

awk

We are using the NF built-in variable, which contains the number of fields in the current input record. An empty line has zero fields, which awk treats as false, so the line is not printed. I would treat it as a hack.

awk NF

tr

This approach squeezes runs of repeated newlines into a single one

tr -s '\n' '\n'

Example of alternative

$ ping 1.1.1.1 -c 3 | grep .
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=15.7 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=57 time=35.6 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=57 time=17.8 ms
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 15.765/23.113/35.699/8.942 ms

$ ping 1.1.1.1 -c 3 | awk NF
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=12.8 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=57 time=15.9 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=57 time=13.6 ms
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 12.860/14.143/15.952/1.322 ms

$ ping 1.1.1.1 -c 3 |  tr -s '\n' '\n'
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=16.3 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=57 time=13.9 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=57 time=17.4 ms
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 13.998/15.943/17.458/1.452 ms
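
One detail worth a quick sketch: these commands do not treat a line that holds only spaces the same way. The input below is made up, and the variable names are mine.

```shell
# Made-up input: one truly empty line and one line holding only spaces.
input=$(printf 'first\n\n   \nlast')

# /^$/d removes only the truly empty line; the spaces-only line stays.
strict=$(printf '%s\n' "$input" | sed '/^$/d')

# A POSIX character class also catches lines made of whitespace only.
loose=$(printf '%s\n' "$input" | sed '/^[[:space:]]*$/d')

# awk NF behaves like the second form: a whitespace-only line has zero fields.
with_awk=$(printf '%s\n' "$input" | awk 'NF')

printf '%s\n' "$strict"
printf '%s\n' "$loose"
```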

Replace string using sed

Another common task where you can use sed is replacing text: for instance IDs, data formats, or names of instances. Replacing has many variants. The most basic one replaces the first occurrence of the pattern in every line

sed 's/pattern_in/pattern_out/'

To extend the replacement to all occurrences of the pattern in the line, the command should be adjusted a bit by adding the g flag:

sed 's/pattern_in/pattern_out/g'

Example

$ head -200 words.txt | tail -10
Abagail
Abagtha
abay
abayah
Abailard
abaisance
abaised
abaiser
abaisse
abaissed

$ head -200 words.txt | tail -10 | sed 's/a/_/'
Ab_gail
Ab_gtha
_bay
_bayah
Ab_ilard
_baisance
_baised
_baiser
_baisse
_baissed

$ head -200 words.txt | tail -10 | sed 's/a/_/g'
Ab_g_il
Ab_gth_
_b_y
_b_y_h
Ab_il_rd
_b_is_nce
_b_ised
_b_iser
_b_isse
_b_issed
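
Two more variants of the s command that I find handy, as a short sketch: a number used as a flag picks which occurrence to replace, and the delimiter does not have to be a slash, which helps with paths. The sample strings and variable names are made up.

```shell
# A number as the flag replaces only that occurrence (here: the second "a").
second=$(printf 'abayah\n' | sed 's/a/_/2')

# Any other character can serve as the delimiter, so slashes in paths
# need no escaping.
path=$(printf '/usr/local/bin\n' | sed 's|/usr/local|/opt|')

printf '%s\n' "$second"   # ab_yah
printf '%s\n' "$path"     # /opt/bin
```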

Alternative to replace using sed

You can use tr or awk, but tr is limited.

tr

This tool was created to simplify single-character replacement. Of course you can pass a string as pattern_in, but every character will be processed separately. For instance, tr 'ABC' 'XYZ' is equivalent to tr 'A' 'X' and tr 'B' 'Y' and tr 'C' 'Z'

tr 'pattern_in' 'pattern_out'

awk

You can use awk with the gsub function; the general schema is below. awk is an extremely powerful tool

awk '{ gsub(/pattern_in/, "pattern_out", $1); print $1 }' FILE

$1 refers to the first field of the line in awk's terms

Example of replacing

$ head -2000 words.txt | tail -10
accommodatingly
accommodatingness
accommodation
accommodational
accommodationist
accommodations
accommodative
accommodatively
accommodativeness
accommodator

$ head -2000 words.txt | tail -10 | tr 'acm' 'AC_'
ACCo__odAtingly
ACCo__odAtingness
ACCo__odAtion
ACCo__odAtionAl
ACCo__odAtionist
ACCo__odAtions
ACCo__odAtive
ACCo__odAtively
ACCo__odAtiveness
ACCo__odAtor

$ head -2000 words.txt | tail -10  | awk '{ gsub("c","_c_",$1); print $1 }'
a_c__c_ommodatingly
a_c__c_ommodatingness
a_c__c_ommodation
a_c__c_ommodational
a_c__c_ommodationist
a_c__c_ommodations
a_c__c_ommodative
a_c__c_ommodatively
a_c__c_ommodativeness
a_c__c_ommodator

Back reference in replacement

What is a back reference? The best answer I found is in the manual 😀

Back-references are regular expression commands which refer to a previous part of the matched regular expression. Back-references are specified with backslash and a single digit (e.g. ‘\1’).

manual

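The general schema is like below (in sed's default basic regular expressions the group is written with escaped parentheses):

sed 's/pattern_with_\(.*\)/\1/'

And \1 refers to whatever the group in parentheses matched. This can be useful when you need to extract a specific string surrounded by known patterns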

Example

$ head -200 words.txt | tail -10
Abagail
Abagtha
abay
abayah
Abailard
abaisance
abaised
abaiser
abaisse
abaissed

$ head -200 words.txt | tail -10 | sed 's/ab\(.*\)ss.*/\1/'
Abagail
Abagtha
abay
abayah
Abailard
abaisance
abaised
abaiser
ai
ai
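
A short sketch with two groups may make this clearer. It swaps the parts around a hyphen, first with basic regular expressions and then with the -E flag (extended regular expressions), where the parentheses need no backslashes. The input string and variable names are only an illustration.

```shell
# Basic regular expressions (the default): groups are written \( and \).
bre=$(printf '10-point\n' | sed 's/\(.*\)-\(.*\)/\2-\1/')

# Extended regular expressions (-E): plain parentheses form the groups.
ere=$(printf '10-point\n' | sed -E 's/(.*)-(.*)/\2-\1/')

printf '%s\n' "$bre"   # point-10
printf '%s\n' "$ere"   # point-10
```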

Alternative to sed back reference

Any ideas? Of course, awk is king. Note that gensub is a GNU awk (gawk) extension.

awk '{ print gensub(/pattern_with(.*)/, "New pattern with back reference: \\1", "g", $1);}'

The example explains it all

Example

Everything after the string mod is the match which the back-reference refers to

$ head -2000 words.txt | tail -10
accommodatingly
accommodatingness
accommodation
accommodational
accommodationist
accommodations
accommodative
accommodatively
accommodativeness
accommodator

$ head -2000 words.txt | tail -10  | awk '{ print gensub(/mod(.*)/, " New pattern with BR: \\1", "g", $1);}'
accom New pattern with BR: atingly
accom New pattern with BR: atingness
accom New pattern with BR: ation
accom New pattern with BR: ational
accom New pattern with BR: ationist
accom New pattern with BR: ations
accom New pattern with BR: ative
accom New pattern with BR: atively
accom New pattern with BR: ativeness
accom New pattern with BR: ator

Viewing a range of lines using sed

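When you would like to limit text processing to known lines, you can use this structure:

sed -n "${start_line},${end_line}p" FILE

This is mostly self-explanatory: -n suppresses the default output and p prints only the lines in the range. The double quotes matter when the line numbers come from shell variables, because single quotes would stop the shell from expanding them.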

Example

$ head -50000 words.txt | tail -10
Bravin
braving
bravish
bravissimo
bravo
bravoed
bravoes
bravoing
bravoite
bravos

$ head -50000 words.txt | tail -10 | sed -n '2,5p'
braving
bravish
bravissimo
bravo

Alternative to range viewing

Traditionally awk can help, and so can the newcomers: head & tail combined

awk

Another built-in variable, NR, helps here; it keeps a running count of the input records read so far from all data files. The schema is below (the shell values are passed in with -v):

awk -v first="${start_line}" -v last="${end_line}" 'NR >= first && NR <= last' FILE

tail & head

As you probably noticed, I'm using them a lot in this post to fetch 10 lines from the words.txt file, just to have a small static set of lines to process. The schema is simple except for calculating the head value, which is the number of lines in the range (end minus start plus one)

tail -n +${start_line} | head -n $((end_line - start_line + 1))

Example of alternatives

$ head -50000 words.txt | tail -10
Bravin
braving
bravish
bravissimo
bravo
bravoed
bravoes
bravoing
bravoite
bravos

$ head -50000 words.txt | tail -10 | awk 'NR >= 2 && NR <= 5'
braving
bravish
bravissimo
bravo

$ head -50000 words.txt | tail -10 | tail -n +2 | head -n $((5-2+1))
braving
bravish
bravissimo
bravo
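
Here is a minimal sketch of the three variants driven by shell variables. I'm using seq as stand-in input and a range of 3 to 6; the variable names are mine. Note that the head count is end minus start plus one.

```shell
start_line=3
end_line=6

# seq prints the numbers 1..10, one per line, as stand-in input.
with_sed=$(seq 1 10 | sed -n "${start_line},${end_line}p")
with_awk=$(seq 1 10 | awk -v first="$start_line" -v last="$end_line" 'NR >= first && NR <= last')
with_tail_head=$(seq 1 10 | tail -n +"$start_line" | head -n "$((end_line - start_line + 1))")

printf '%s\n' "$with_sed"
```

All three variables should hold the lines 3, 4, 5 and 6.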

Viewing all lines except a given range using sed

This point is similar to the previous one and, you are right, the syntax is similar too.

sed "${start_line},${end_line}d" FILE

Example

$ head -60000 words.txt | tail -10
Cardozo
card-perforating
cardplayer
cardplaying
card-printing
cardroom
cards
cardshark
cardsharp
cardsharper

$ head -60000 words.txt | tail -10 | sed '2,5d'
Cardozo
cardroom
cards
cardshark
cardsharp
cardsharper

Alternative

To be honest, there is only one real option: awk. Of course using tail & head here is feasible, but not elegant

awk -v first="${start_line}" -v last="${end_line}" 'NR < first || NR > last' FILE

The NR variable is used again, but this time with the OR operator, which fits this case

Example of alternative

$ head -60000 words.txt | tail -10
Cardozo
card-perforating
cardplayer
cardplaying
card-printing
cardroom
cards
cardshark
cardsharp
cardsharper

$ head -60000 words.txt | tail -10 | awk 'NR < 2 || NR > 5'
Cardozo
cardroom
cards
cardshark
cardsharp
cardsharper

dos2unix

In many cases when you work on Windows and Linux at the same time, moving files from Windows to Linux leads to issues with end-of-line characters. On Windows a line ends with \r\n, whereas on Linux there is only one character, \n. Many editors have built-in features to detect the file format and adjust to the OS, but when you work with a raw file, or when an application you are using works with the file, you have to take care of it yourself. You can install the dos2unix command or use sed. To tell the truth, changing it is really simple because it is the well-known replacement, with a regex which matches the \r character at the end of the line. Note that GNU sed understands \r in the pattern; other implementations may not.

sed 's/\r$//' FILE

Example

$ cat windows_file 
test
windows
file

$ od -c windows_file 
0000000   t   e   s   t  \r  \n   w   i   n   d   o   w   s  \r  \n   f
0000020   i   l   e  \r  \n
0000025

$ file windows_file 
windows_file: ASCII text, with CRLF line terminators

$ sed 's/\r$//' windows_file > unix_file

$ cat unix_file 
test
windows
file

$ od -c unix_file 
0000000   t   e   s   t  \n   w   i   n   d   o   w   s  \n   f   i   l
0000020   e  \n
0000022

$ file unix_file 
unix_file: ASCII text
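
If your sed does not understand \r, one possible workaround (a sketch, not the only way) is to let printf produce the carriage return and put it into the sed script through a shell variable. tr -d '\r' is another option, with the caveat that it removes every \r, not only those at the end of a line. The file names and the variable cr below are made up.

```shell
# Build a small CRLF sample in a throwaway directory.
tmpdir=$(mktemp -d)
printf 'test\r\nwindows\r\nfile\r\n' > "$tmpdir/windows_file"

# A literal carriage return, for sed builds that do not understand \r.
cr=$(printf '\r')
sed "s/${cr}\$//" "$tmpdir/windows_file" > "$tmpdir/unix_sed"

# tr deletes every carriage return in the input.
tr -d '\r' < "$tmpdir/windows_file" > "$tmpdir/unix_tr"

od -c "$tmpdir/unix_sed"
```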

Add string before and after the matching pattern using sed

Sometimes you have to add a line before or after the line matching a pattern. For instance, to add a header to a section which starts with the matching pattern. The syntax is below (the one-line form shown here is a GNU extension).

To add after

sed '/pattern/a LINE_TO_ADD' FILE

To add before

sed '/pattern/i LINE_TO_ADD' FILE

Example

And now time for a real example

$ head -400000 words.txt  | tail -10
tchervonets
tchervonetz
tchervontzi
Tchetchentsish
Tchetnitsi
tchetvert
Tchi
tchick
tchincou
tchr

$ head -400000 words.txt  | tail -10 | sed '/tchetvert/a <------'
tchervonets
tchervonetz
tchervontzi
Tchetchentsish
Tchetnitsi
tchetvert
<------
Tchi
tchick
tchincou
tchr

$ head -400000 words.txt  | tail -10 | sed '/tchetvert/i <------'
tchervonets
tchervonetz
tchervontzi
Tchetchentsish
Tchetnitsi
<------
tchetvert
Tchi
tchick
tchincou
tchr

So, as you can see this is really simple.
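
One portability note, sketched below: the form that POSIX describes puts a backslash and a newline after a or i, with the text on the next line. As far as I know this layout works in both GNU and BSD sed. The two input lines and the variable names are made up.

```shell
# POSIX form: "a\" followed by a newline, then the text to append.
after=$(printf 'Tchi\ntchick\n' | sed '/Tchi/a\
<------')

# The same layout works for "i\" (insert before the matching line).
before=$(printf 'Tchi\ntchick\n' | sed '/Tchi/i\
<------')

printf '%s\n' "$after"
printf '%s\n' "$before"
```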

Case-insensitive matching in sed

As you probably noticed in the previous points, sed is case-sensitive, so pattern does not equal Pattern. In many cases you surely need to process, e.g., names with capital letters or some invalid DB inserts to make the text cleaner and more uniform. sed is, of course, ready to handle it with the i flag of the s command (the flag is not part of POSIX, so availability may vary between implementations):

sed 's/pattern_in/pattern_out/i' FILE

Example

$ head -400000 words.txt  | tail -10
tchervonets
tchervonetz
tchervontzi
Tchetchentsish
Tchetnitsi
tchetvert
Tchi
tchick
tchincou
tchr

$ head -400000 words.txt  | tail -10 | sed 's/Tch/___/i'
___ervonets
___ervonetz
___ervontzi
___etchentsish
___etnitsi
___etvert
___i
___ick
___incou
___r
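
When the i flag is not available, a bracket expression that spells out both cases is a simple fallback. This is only a sketch with made-up input, and it gets tedious for long patterns.

```shell
# [Tt] matches either case of the first letter; no special flag is needed.
result=$(printf 'Tchi\ntchick\n' | sed 's/[Tt]ch/___/')

printf '%s\n' "$result"
```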

Many sed

Remember, you can use many sed commands in one command line. There are two approaches here: pipes and the -e flag.

Syntax:

sed '/pattern1/' FILE | sed '/pattern2/' | sed '/pattern3/'

or

sed -e '/pattern1/' -e '/pattern2/' -e '/pattern3/' FILE

Obviously, the second approach is much better because it runs a single sed process (each pipe stage starts a new process), which is why it consumes fewer resources. In addition, it mitigates problems with sudo or shell variables, which then have to be handled only once.

Example

$ head -400000 words.txt  | tail -10
tchervonets
tchervonetz
tchervontzi
Tchetchentsish
Tchetnitsi
tchetvert
Tchi
tchick
tchincou
tchr

$ head -400000 words.txt  | tail -10 | sed -e 's/Tch/___/i'
___ervonets
___ervonetz
___ervontzi
___etchentsish
___etnitsi
___etvert
___i
___ick
___incou
___r

$ head -400000 words.txt  | tail -10 | sed -e 's/Tch/___/i' -e '/ick/d'
___ervonets
___ervonetz
___ervontzi
___etchentsish
___etnitsi
___etvert
___i
___incou
___r
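
Besides -e, the same commands can be written as one script separated by semicolons, or stored in a file and passed with -f. A minimal sketch, with made-up input and a temporary script file whose name comes from mktemp:

```shell
input=$(printf 'Tchi\ntchick\ntchr\n')

# Several -e expressions, applied to each line in order.
with_e=$(printf '%s\n' "$input" | sed -e 's/[Tt]ch/___/' -e '/ick/d')

# The same commands in one script, separated by a semicolon.
with_semicolon=$(printf '%s\n' "$input" | sed 's/[Tt]ch/___/; /ick/d')

# The same commands read from a script file, one command per line.
script=$(mktemp)
printf '%s\n' 's/[Tt]ch/___/' '/ick/d' > "$script"
with_file=$(printf '%s\n' "$input" | sed -f "$script")

printf '%s\n' "$with_e"
```

The semicolon form is fine for commands like s and d; for a, i and c it is safer to use separate -e options or a script file.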

Bonus - all in one

Thanks for reading all these points. I really appreciate it.


The cake is for me, but for you I've prepared something special: a sed command which uses the cases mentioned above.

$ sed -n '300000,300020p' words.txt | sed -e 's/plantation/PLA/i' -e '/planta/d' -e 's/Plan\(.*\)gi.*/_\1_/' -e '/animal/a ---> DOG <---'
Plantagenet
_ta_
_ta_
Plantago
plant-animal
---> DOG <---
PLA
PLAlike
PLAs
PLA's

Voila!
