The string you're looking for always has MOM: before it, but you have not said whether it always has " after it. For the purpose of this answer, I will assume that the strings you are looking for may contain only uppercase or lowercase letters, digits, or underscores. These are known as word characters in the terminology of regular expressions. Matching such "words" of text is useful enough that most dialects of regular expressions have features to help do so. If this isn't what you want, you can modify this solution accordingly or use the techniques in the other answers.
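To make the word-character assumption concrete, here is a minimal sketch using the same grep options that are explained further down. The two sample lines and the variable names a and b are made up for illustration, and it assumes GNU grep built with PCRE support (-P).

```shell
# \w matches letters, digits, and underscores, so the match
# stops at the first character that is anything else.
a="$(printf '%s\n' 'MOM:user_42"></td>' | grep -oP 'MOM:\K\w+')"
b="$(printf '%s\n' 'MOM:MARY-ANN"></td>' | grep -oP 'MOM:\K\w+')"

printf '%s\n' "$a"   # the underscore and digits are kept
printf '%s\n' "$b"   # the match ends before the hyphen
```

If your real values can contain hyphens, spaces, or other non-word characters, a pattern that reads up to the closing " is probably a better fit.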
I echo David Foerster's, Zanna's, and JJoao's wise warnings about parsing HTML with regex and about this not being robust. Please be careful, and consider whether what you have requested is really what you want to do.
In your example code you assigned the path to the input file to the variable $file, so I will assume this has been done. You've assigned the output of your command to $y, so I will do the same.
With grep
This is similar to JJoao's method, and you can use that method with command substitution as well if the regular expression there is more suited to your needs.
y="$(grep -oPm1 'MOM:\K\w+' "$file")"
-oPm1 is just a more compact way to write -o -P -m 1.
-o prints only the matches, not the whole line.
-P uses PCRE, which supports \K to drop the text matched so far, so it's not included in the matched text that is returned.
-m 1 stops reading after the first line that contains a match. This way, you assign only the matches from that line to the variable rather than matches from every matching line, separated by newlines.
Note that you can also add -m1 to the command in JJoao's answer so that it uses only matches from the first line that has any.
If the first line with a match contains multiple matches, this grep method gives you all of them. For example, if that line is MOM:MANIKA MOM:JANE"></td><br> then $y will hold the value:
MANIKA
JANE
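If you want only the very first match, one option is to keep just the first line of grep's output. The following is a minimal sketch, not part of the original command: the temporary file, its contents, and the variable name first are assumptions for illustration, and it relies on GNU grep with PCRE support.

```shell
# Throwaway sample file (hypothetical content built around the example line).
file="$(mktemp)"
printf '%s\n' \
  '<td>nothing here</td>' \
  'MOM:MANIKA MOM:JANE"></td><br>' \
  'MOM:LATER"></td>' > "$file"

# Every match on the first matching line, one per output line.
y="$(grep -oPm1 'MOM:\K\w+' "$file")"
printf '%s\n' "$y"

# Only the first of those matches: keep the first line of the output.
first="$(grep -oPm1 'MOM:\K\w+' "$file" | head -n 1)"
printf '%s\n' "$first"

rm -f "$file"
```

Because -m1 already limits grep to one line of input, head only has to discard any additional matches from that same line.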
With sed
This resembles Zanna's method.
y="$(sed -rn '0,/.*MOM:(\w+).*/ s//\1/p' "$file")"
Besides being enclosed as a command substitution, the differences are that I:
- stop after the first line that contains a match
- match one or more word characters (\w+) instead of characters up to a " ([^"]+)
- consume zero or more arbitrary characters (.*) first, so that MOM: doesn't have to appear at the very beginning of the line
- use a more compact syntax that avoids writing the pattern twice.
The technique I used for this requires GNU sed, but that's the sed implementation provided in Ubuntu.
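If the first line with a match contains multiple matches, this sed method gives you just the last one, because the leading .* is greedy and consumes as much of the line as it can before MOM: is matched. From MOM:MANIKA MOM:JANE"></td><br> you get: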
JANE
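The sketch below reproduces that behavior on a throwaway file and then shows one possible variation that returns the first match instead. The variation is my own assumption rather than something from the other answers; it also requires GNU sed, since it relies on \w and on \n in the replacement text. The sample file contents and the names y_last and y_first are made up for illustration.

```shell
# Throwaway sample file (hypothetical content built around the example line).
file="$(mktemp)"
printf '%s\n' \
  '<td>nothing here</td>' \
  'MOM:MANIKA MOM:JANE"></td><br>' \
  'MOM:LATER"></td>' > "$file"

# The command from this answer: the greedy leading .* means the last
# MOM: on the first matching line is the one that gets captured.
y_last="$(sed -rn '0,/.*MOM:(\w+).*/ s//\1/p' "$file")"
printf '%s\n' "$y_last"

# Variation: on the first matching line, replace everything from the
# leftmost MOM: onward with a newline plus the captured word, delete
# everything up to that newline, print the result, and quit.
y_first="$(sed -rn '/MOM:\w/{s/MOM:(\w+).*/\n\1/; s/.*\n//; p; q}' "$file")"
printf '%s\n' "$y_first"

rm -f "$file"
```

The newline serves as a marker that cannot already be present in a single input line, which makes it safe to strip whatever came before the first MOM:.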