It is common to need to extract specific strings from a file based on a regex, and capture groups are an efficient way to pull several values out of the same line.
I have found that sed is the easiest way to do so on the Linux command line.
Given the following input file:
This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2
Let’s say that you want to extract only the year and the month digits from each line.
You would run the following command defining two capture groups:
sed -rn 's/.*year=([0-9]+).*month=([0-9]+).*/my year: \1, my month: \2/p' input.txt
Which will output:
my year: 2020, my month: 12
my year: 2021, my month: 1
my year: 2021, my month: 2
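To try this without creating the file by hand, here is a minimal sketch that writes the four sample lines to a temporary file and runs the same command. It assumes GNU sed (for -r) and the mktemp utility; the variable names input and output are only for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed (for -r) and mktemp are available.
input=$(mktemp)

# Recreate the sample input, one record per line.
cat > "$input" <<'EOF'
This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2
EOF

# Same command as above. The second line has no year=/month= pair,
# so -n together with the p flag leaves it out of the output.
output=$(sed -rn 's/.*year=([0-9]+).*month=([0-9]+).*/my year: \1, my month: \2/p' "$input")
printf '%s\n' "$output"

rm -f "$input"
```

Only three of the four input lines produce output, because the substitution never matches the line without year= and month=.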
The -rn flags tell sed to use extended regular expressions in the script (-r) and to suppress automatic printing (-n), so a line is printed only when we explicitly ask for it after a match.
year=([0-9]+) and month=([0-9]+)
Define two capture groups: one for one or more contiguous digits after year= and another for one or more contiguous digits after month=.
.* tells sed to match any number of any type of characters, so the text before, between, and after the capture groups is consumed by the pattern and dropped from the output.
my year: \1, my month: \2/p
Tells sed how to format the output to include each capture group: \1 for capture group 1 and \2 for capture group 2. The trailing p flag prints the line only when the substitution was made.
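If only the digits are needed, the replacement can be reduced to the two backreferences and the result read into shell variables. The sketch below assumes a sed that accepts -E (GNU and BSD sed both do; in GNU sed it means the same as -r). The function name extract_year_month and the variables pairs, year, and month are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints "<year> <month>" for each input line that has both.
extract_year_month() {
    sed -En 's/.*year=([0-9]+).*month=([0-9]+).*/\1 \2/p'
}

# Feed three of the sample lines through the helper.
pairs=$(printf '%s\n' \
    'This is a line of text with a year=2020 month=12 in it' \
    'This line of text does not have a year or month in it' \
    'This is the year=2021 the month=2' | extract_year_month)

# Split each "<year> <month>" pair into two variables.
printf '%s\n' "$pairs" | while read -r year month; do
    printf 'year is %s, month is %s\n' "$year" "$month"
done
```

One thing to keep in mind: the leading .* is greedy, so on a line that contains year= more than once, the group captures the digits after the last occurrence.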