RegEx with Grep



What they say about being humble because there’s plenty you don’t know is true of course. After a confusing lecture last week (when I was already exhausted), this week’s was slightly better as we took a walk down the lane of Regular Expressions. Which is somewhat of a coincidence, since Ruben just covered these in his Web Services Development class last week — not to be confused with Web Systems as we are wont to do.

Using grep

What we did today was in the context of the grep/egrep commands, however. grep searches the input files for lines that contain at least one match for a given pattern and prints the matching lines. The command takes the form grep (pattern/string to match) (filename).
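To make the basic form concrete, here is a minimal sketch. The file name notes.txt and its contents are made up for illustration, not something from the lecture.

```shell
# Hypothetical sample file, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'start the engine' 'starting soon' 'nothing here' > "$tmpdir/notes.txt"

# grep PATTERN FILE: print every line of FILE that contains PATTERN.
basic_matches=$(grep 'start' "$tmpdir/notes.txt")
printf '%s\n' "$basic_matches"

# Clean up the throwaway directory.
rm -r "$tmpdir"
```

The line 'nothing here' has no match for the pattern, so it is the only one left out.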

grep is the original command and uses basic regular expressions, whereas extended grep, or egrep, uses extended regular expressions and can also be written as grep -E. Let’s not get into fgrep.
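A quick way to see the difference between the two flavours is the | character. The sketch below assumes a made-up file called pets.txt, and uses grep -E in place of egrep since the two are equivalent.

```shell
# Hypothetical sample file, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' 'dog' 'cat|dog' > "$tmpdir/pets.txt"

# Basic regular expressions: | is an ordinary character here,
# so only the line containing the literal text cat|dog matches.
bre_matches=$(grep 'cat|dog' "$tmpdir/pets.txt")

# Extended regular expressions: | means "or",
# so every line containing cat or dog matches.
ere_matches=$(grep -E 'cat|dog' "$tmpdir/pets.txt")

printf 'basic:\n%s\n' "$bre_matches"
printf 'extended:\n%s\n' "$ere_matches"

rm -r "$tmpdir"
```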

There are of course options for grep too, like -c to suppress normal output and print a count of matching lines, and -l to again suppress normal output and print the name of each input file containing at least one matching line. The -w option matches the search string only as a whole word, so if the search string was ‘start’, it would return lines containing the word ‘start’, but not lines containing only ‘starting’ or ‘started’.
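Here is a rough sketch of those three options side by side. The files a.txt and b.txt and their contents are invented for the example.

```shell
# Hypothetical sample files, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'start the engine' 'starting now' 'we started late' > "$tmpdir/a.txt"
printf '%s\n' 'nothing to see' > "$tmpdir/b.txt"

# -c: print only the number of matching lines.
match_count=$(grep -c 'start' "$tmpdir/a.txt")

# -l: print only the names of files that contain a match.
# Run from inside the directory so the output is just the file name.
files_with_match=$(cd "$tmpdir" && grep -l 'start' a.txt b.txt)

# -w: match 'start' only as a whole word.
whole_word=$(grep -w 'start' "$tmpdir/a.txt")

printf 'count: %s\nfiles: %s\nwhole word: %s\n' \
  "$match_count" "$files_with_match" "$whole_word"

rm -r "$tmpdir"
```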

You can also pipe the output of one grep into another grep command, if that’s your kind of fun… Strangely useful, when it works.
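Chaining two greps with a pipe might look something like this. The file log.txt is a made-up example.

```shell
# Hypothetical sample file, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'start the engine now' 'starting now' 'stop now' > "$tmpdir/log.txt"

# The first grep keeps lines containing 'now';
# the second grep filters those down to lines with the whole word 'start'.
piped_matches=$(grep 'now' "$tmpdir/log.txt" | grep -w 'start')
printf '%s\n' "$piped_matches"

rm -r "$tmpdir"
```

The second grep reads from the pipe rather than from a file, so it only ever sees the lines the first one let through.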

Onto RegEx

That guy Reg does get around. No, I kid. () can be used for grouping and | for union (alternation). Characters that have special meaning to the shell need to be quoted, or escaped with a preceding backslash, in order to be passed to grep (rather than interpreted by bash). For example:

$ egrep 'ad|i' file1

This matches lines containing ‘ad’ or ‘i’. The following matches lines containing ‘ad’ followed by a space, or ‘i’ followed by a space:

$ egrep '(ad|i) ' file1
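To see the two patterns behave differently, here is a small sketch. The contents of file1 are my own invention, and grep -E stands in for egrep.

```shell
# Hypothetical contents for file1, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'bad apple' 'hi there' 'kiwi' 'salad' 'plum' > "$tmpdir/file1"

# Lines containing 'ad' or 'i' anywhere.
either=$(grep -E 'ad|i' "$tmpdir/file1")

# Lines containing 'ad' or 'i' immediately followed by a space.
# The parentheses group the alternatives so the space applies to both.
either_then_space=$(grep -E '(ad|i) ' "$tmpdir/file1")

printf 'ad|i:\n%s\n' "$either"
printf '(ad|i) :\n%s\n' "$either_then_space"

rm -r "$tmpdir"
```

Without the parentheses, 'ad|i ' would mean ‘ad’, or ‘i’ followed by a space, because the space would only attach to the second alternative.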

This could go on all day, so I’m going to gloss over some finer details. Let’s go to some bullet points:

  • A bracket expression matches any single character listed within the brackets, and the list can include a range such as [a-z].
  • Caret (^) is an anchor that matches the empty string at the beginning of a line, and dollar ($) is an anchor that matches the empty string at the end of a line.
  • When caret (^) is the first character inside brackets, however, it negates the list: the expression matches any single character that is not listed.
  • \< and \> match the empty string at the beginning and end of a word (letters, digits and underscore). \b matches the empty string at the edge of a word.
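The points above can be tried out with a sketch like the following. The file words.txt is made up, and the word anchors \< and \> are assumed to be available as they are in GNU grep.

```shell
# Hypothetical sample file, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'cat' 'cot' 'cut' 'scatter' 'the cat sat' > "$tmpdir/words.txt"

# Bracket expression: 'c', then 'a' or 'o', then 't', anywhere in the line.
bracket=$(grep 'c[ao]t' "$tmpdir/words.txt")

# Anchors: the whole line must be exactly cat or cot.
anchored=$(grep '^c[ao]t$' "$tmpdir/words.txt")

# Caret inside brackets: the middle character is anything except 'a'.
negated=$(grep '^c[^a]t$' "$tmpdir/words.txt")

# Word anchors: 'cat' must stand alone as a word (assumes GNU grep).
word=$(grep '\<cat\>' "$tmpdir/words.txt")

printf 'bracket:\n%s\n' "$bracket"
printf 'anchored:\n%s\n' "$anchored"
printf 'negated:\n%s\n' "$negated"
printf 'word:\n%s\n' "$word"

rm -r "$tmpdir"
```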

And the wildcard and repetition operators:

  • . = matches any single character
  • ? = the preceding item is optional and matched at most once
  • * = the preceding item will be matched zero or more times
  • + = the preceding item will be matched one or more times
  • {n} = the preceding item is matched exactly n times
  • {n,} = the preceding item is matched n or more times
  • {n,m} = the preceding item is matched at least n times, but not more than m times
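Each operator in the list can be checked against a small test file. This is only a sketch: spellings.txt and its contents are invented, and the patterns are anchored with ^ and $ so that each one has to match the whole line.

```shell
# Hypothetical sample file, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'color' 'colour' 'colouur' 'gd' 'god' 'good' 'goood' > "$tmpdir/spellings.txt"
f="$tmpdir/spellings.txt"

any_char=$(grep -E '^g.d$' "$f")         # . : exactly one character between g and d
optional=$(grep -E '^colou?r$' "$f")     # ? : the 'u' may appear zero or one time
zero_plus=$(grep -E '^go*d$' "$f")       # * : zero or more 'o'
one_plus=$(grep -E '^go+d$' "$f")        # + : one or more 'o'
exactly_two=$(grep -E '^go{2}d$' "$f")   # {n} : exactly two 'o'
two_plus=$(grep -E '^go{2,}d$' "$f")     # {n,} : two or more 'o'
one_to_two=$(grep -E '^go{1,2}d$' "$f")  # {n,m} : between one and two 'o'

# Print each result as a single space-separated row.
for result in "$any_char" "$optional" "$zero_plus" "$one_plus" \
              "$exactly_two" "$two_plus" "$one_to_two"; do
  printf '%s\n' "$result" | tr '\n' ' '
  printf '\n'
done

rm -r "$tmpdir"
```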

Phew! Much more difficult in practice though. I’ll let you know how it goes with practicing this RegEx over the weekend. \b still puzzles me, as well as why certain commands seem to work fine on bash 4.1, but not bash 4.0.
