Regular Expressions

The following was nicked from Zhou Chen :: Regular Expression, which in turn was nicked from Regular Expressions by Ashley J.S Mills.

Introduction

You have probably used regular expressions before. For instance, if you have specified that you want to delete *.*, referring to any basename followed by a dot followed by any extension, then you have used the concepts of regular expressions at least once. Regular expressions are a particular kind of pattern matching, located in the Regular Language subclass of pattern matching languages. They are considered the least complex of the pattern matching languages but are very useful.

Basics

Regular expressions consist of literal characters and meta characters. Literal characters are the actual characters you want to find; meta characters are special characters, like the Kleene star, and are the core concept behind regular expressions. Hence we will begin this section with a brief introduction to the most common meta characters.

Single Character

A single character such as Q is a regular expression: it matches every string that contains the character Q, so it would match Quick, Quiet and Quantum but not quick.

Any Character: .

The period, or full stop as we call it in Britain, stands in for any single character in the search. For example, ".t.m" would match atom, item and stem, and probably some other words too. A fun example of using this character can be found at http://www.oneacross.com/, where it is used to help people find words for their crosswords; they also use the character '?' as an alternative.
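
As a minimal sketch of the dot in practice, the snippet below uses Python's re module (an assumption; the text above is not tied to any one language) to test ".t.m" against a few words. The word list is made up for illustration.

```python
import re

# Hypothetical word list, chosen only for illustration.
words = ["atom", "item", "stem", "team", "tm"]

# fullmatch requires the whole word to fit: any character, t, any character, m.
pattern = re.compile(r".t.m")
matches = [w for w in words if pattern.fullmatch(w)]

print(matches)
```

"team" is rejected because its second letter is not t, and "tm" because each dot must consume exactly one character.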

The Escape Character: \

\ is used to signify that we want to use a meta character as a literal character. This is necessary because otherwise the character in question would be interpreted as a meta character. The character being escaped is the one immediately following the escape character. For example, "\*" would match the string "*" (or any string containing "*").

The converse can also be true: sometimes \ is used to signify that we want to use a literal character as a meta character, for example within a double quoted string in an implementation that requires meta characters to be escaped. You should read the documentation of the particular regular expression implementation you are using to find out which approach it takes.
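
The first use of the escape character can be sketched in Python's re module (an assumed choice of implementation, in which a backslash turns a meta character into a literal one). The sample strings are hypothetical.

```python
import re

# An escaped star is a literal asterisk rather than a repetition operator.
literal_star = re.compile(r"\*")

print(bool(literal_star.search("2 * 3")))  # the string contains a literal '*'
print(bool(literal_star.search("2 + 3")))  # no '*' anywhere in the string

# re.escape adds whatever backslashes are needed to treat a string literally.
escaped = re.escape("a.b")
print(bool(re.fullmatch(escaped, "a.b")))  # the dot now matches only a dot
print(bool(re.fullmatch(escaped, "axb")))  # so 'x' is no longer accepted
```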

The Caret: ^

^, known as a caret, is used to match the beginning of a line, so "^CAPITAL" would match "CAPITAL's signify emphasised speech, anger or SHOUTING" but would not match "Your such a CAPITAL idiot!".

The Dollar Symbol: $

$ is used to match the end of a line, so "here$" would match "I like it here" but would not match "here is a potato".
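
A short sketch of both anchors, again assuming Python's re module. The sample lines are hypothetical and kept short on purpose.

```python
import re

# Anchored patterns from the text above.
starts_with_capital = re.compile(r"^CAPITAL")
ends_with_here = re.compile(r"here$")

# Hypothetical sample lines, chosen only for illustration.
print(bool(starts_with_capital.search("CAPITAL letters")))  # pattern sits at the start
print(bool(starts_with_capital.search("a CAPITAL idea")))   # pattern occurs mid-line only
print(bool(ends_with_here.search("I like it here")))        # pattern sits at the end
print(bool(ends_with_here.search("here is a potato")))      # pattern occurs at the start only
```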

The Kleene star: *

The star * is used to match zero or more occurrences of the regular expression immediately preceding the meta character. "10*" would match "1", "10", "100", "1000" and so on.

The Kleene plus: +

+ is used to match one or more occurrences of the regular expression immediately preceding the meta character. "10+" would match "10", "100", "1000" and so on but would not match "1".

Note: (regular expression)+ is the same as (regular expression)(regular expression)*.
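
The difference between the star and the plus can be checked with a small sketch, assuming Python's re module. fullmatch is used so that the whole candidate string has to fit the pattern; the candidate list is made up.

```python
import re

# Hypothetical candidates; "11" is included as a string neither pattern accepts in full.
candidates = ["1", "10", "100", "1000", "11"]

star = [s for s in candidates if re.fullmatch(r"10*", s)]
plus = [s for s in candidates if re.fullmatch(r"10+", s)]

print(star)  # "1" is accepted because the star allows zero zeros
print(plus)  # "1" is rejected because the plus needs at least one zero

# The note above: (x)+ accepts the same strings as (x)(x)*.
same = all(
    bool(re.fullmatch(r"(ab)+", s)) == bool(re.fullmatch(r"(ab)(ab)*", s))
    for s in ["", "ab", "abab", "aba"]
)
print(same)
```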

Ranges: [ ], [cn-cm] and [^cn-cm]

[ ] is used to signify that any one of the characters enclosed within the brackets may be matched. "1[123]512" would match "11512", "12512" and "13512".

[cn-cm] is used to specify an inclusive range of characters that may be matched at this point in the regular expression. "[b-f]oo" would match "boo", "coo", "doo", "eoo" and "foo" but not "goo".

[^cn-cm] is used to exclude a range of characters from a match. Notice that the caret has been used again: when it appears immediately after an opening [ it has this special meaning. If you want to exclude the caret itself then, in implementations that allow escapes inside brackets, you would escape it: "[^\^]". "[^1-8]00" would match "900" (and "000") but not any of the hundreds from "100" to "800", such as "500".
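
The range examples can be tried out with the following sketch, which assumes Python's re module. The input lists are hypothetical.

```python
import re

# An inclusive range: any one letter from b to f, followed by "oo".
words = ["boo", "coo", "foo", "goo"]
in_range = [w for w in words if re.fullmatch(r"[b-f]oo", w)]
print(in_range)

# A negated range: any one character that is NOT a digit from 1 to 8.
numbers = ["000", "100", "500", "800", "900"]
not_in_range = [n for n in numbers if re.fullmatch(r"[^1-8]00", n)]
print(not_in_range)
```

Note that a negated range accepts every character outside the range, so "000" passes as well as "900".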

Grouping: \( \)

\( \) is used to treat the regular expression contained within the (escaped in this case) brackets as a group. This group can then be back referenced later, using \1 to refer to the first group defined. How this is implemented varies between programs that use regular expressions: some tools do not require you to escape the brackets, and some use different conventions to back reference defined groups. For instance, a program may use "$1" to refer to the first bracketed group instead of "\1". There may also be limits on the number of groups that can be referenced in this way; sometimes it is a maximum of nine. In the program grep, "\(a\)b\1" would match "aba".
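
As a sketch of how conventions differ between tools, Python's re module (an assumed example implementation) writes groups with unescaped brackets but still uses \1 for the back reference. The sample sentence is made up.

```python
import re

# Group "a", then "b", then whatever the first group matched.
pattern = re.compile(r"(a)b\1")
print(bool(pattern.fullmatch("aba")))  # the back reference repeats "a"
print(bool(pattern.fullmatch("abb")))  # "b" is not what group 1 captured

# A common use: find a word that is immediately repeated.
doubled = re.compile(r"\b(\w+) \1\b")
found = doubled.search("it was the the best")
print(found.group(1))
```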

Alternatives: |

| is the OR operator. Its operands are the regular expressions on either side of it, and the whole expression matches if either the first expression OR the second expression matches. For example, "^aba\|b$" will match the lines "aba" and "abb" but not "abc": the alternatives are "^aba" (a line beginning with aba) and "b$" (a line ending in b). The | meta character may or may not need to be escaped depending on the program.
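
Because the OR operator splits the whole expression, the example pattern accepts more lines than one might expect. The sketch below, assuming Python's re module (where | is written unescaped), contrasts it with a grouped variant; the grouped pattern and the extra sample lines are illustrative additions.

```python
import re

# Alternation binds loosely: this reads as (^aba) OR (b$).
loose = re.compile(r"^aba|b$")

# Grouping restricts the choice to the third character only.
strict = re.compile(r"^ab(a|b)$")

# Hypothetical sample lines.
lines = ["aba", "abb", "abc", "abacus", "crab"]

loose_hits = [line for line in lines if loose.search(line)]
strict_hits = [line for line in lines if strict.search(line)]

print(loose_hits)
print(strict_hits)
```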

Repetition: \{n\}, \{,n\}, \{n,\}, \{n,m\}

\{n\} is used to specify that the regular expression immediately preceding must be matched n times exactly. "^10\{3\}$" will match the line "1000" but not "100" or "10000".

\{,n\} is used to specify that the regular expression immediately preceding may be matched up to a maximum of n times. "^10\{,3\}$" will match the lines "1", "10", "100" and "1000" but will not match "10000".

\{n,\} is used to specify that the regular expression immediately preceding must be matched at least n times. "^10\{3,\}$" will match the lines "1000", "10000", "100000" and so on but will not match "100".

Note: This is an alternative to using the Kleene star and the Kleene plus, which may not be supported in your implementation. "a\{0,\}" is the same as "a*" and "a\{1,\}" is the same as "a+".

\{n,m\} is used to specify that the regular expression immediately preceding must be matched at least n times but may not exceed m matches. "^10\{3,4\}$" will match the lines "1000" and "10000" but not "100" or "100000". The necessity to escape the characters may vary. Not all programs support all the types of repetition described.
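
The four repetition forms can be compared side by side with a small sketch. It assumes Python's re module, where the braces are written without backslashes; the helper zero_counts is a hypothetical name introduced only for this illustration.

```python
import re

def zero_counts(pattern):
    """Return which counts of zeros (0 to 5) after a leading 1 the pattern accepts."""
    return [k for k in range(6) if re.fullmatch(pattern, "1" + "0" * k)]

exactly_three = zero_counts(r"10{3}")
at_most_three = zero_counts(r"10{,3}")
at_least_three = zero_counts(r"10{3,}")
three_or_four = zero_counts(r"10{3,4}")

print(exactly_three, at_most_three, at_least_three, three_or_four)
```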

Grep Regular Expressions

'[' followed by ']' can be used to match a range of characters, and some special ranges are already defined. These named classes are written inside a bracket expression, for example "[[:digit:]]":

[:alnum:] matches [0-9a-zA-Z]
[:alpha:] matches [a-zA-Z]
[:cntrl:] matches control characters
[:digit:] matches [0-9]
[:lower:] matches [a-z]
[:punct:] matches punctuation characters
[:upper:] matches [A-Z]
[:space:] matches any white space
$ matches the end of a line
\< matches the beginning of a word
\> matches the end of a word
\b matches the empty string at the edge of a word
\B matches the empty string provided it is not at the edge of a word
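
The word-edge operators can be sketched outside grep as well. The example below assumes Python's re module, which supports \b and \B but does not provide \< and \>; the sample text is made up.

```python
import re

# Hypothetical sample text containing "cat" as a word and inside longer words.
text = "cat concatenate bobcat"

# \b on both sides: "cat" must stand alone as a word.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B on both sides: "cat" must be strictly inside a longer word.
inside_words = [m.start() for m in re.finditer(r"\Bcat\B", text)]

print(whole_words)   # position of the standalone "cat"
print(inside_words)  # position of the "cat" inside "concatenate"
```

The "cat" at the end of "bobcat" satisfies neither pattern: it has a word edge on its right but not on its left.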

Example: Suppose you wanted to match the h1, h2, h3... etc. elements in an HTML file. Assume the text file html.txt:

<h1 blah="cool">Title1</h1>
<h2>Title2</h2>

<h3>Title3</h3>

<h4>Title4</h4>
<h5>Title5</h5>
<h6>Title5</h6>
<h1>Title2</h2>

One could use the following:

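grep -e "^<h\([1-6]\)[^>]*>[^<]*</h\1>" html.txt

Which says to match, from the start of a line: "<h" followed by a digit from 1 to 6 (remembered as the first group), then anything but a '>' character (so that the opening tag may contain attributes), then '>', then anything but a '<' character, then "</h" followed by the same digit and '>'. A line whose opening and closing tags disagree, such as the last line of html.txt, is not matched.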
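
The same idea can be sketched in Python's re module. This assumes the intent is to pair each opening heading tag with a closing tag of the same level via a back reference; the sample lines are modelled on html.txt rather than read from the file.

```python
import re

# Sample lines modelled on html.txt; the last line has mismatched tags.
lines = [
    '<h1 blah="cool">Title1</h1>',
    "<h2>Title2</h2>",
    "<h3>Title3</h3>",
    "<h1>Title2</h2>",
]

# ([1-6]) captures the heading level; \1 requires the same digit in the closing tag.
heading = re.compile(r"^<h([1-6])[^>]*>[^<]*</h\1>")

matched = [line for line in lines if heading.search(line)]
for line in matched:
    print(line)
```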

Emacs Regular Expressions

Emacs has built-in regular expression support. Regular expressions may be used within searches by typing the Emacs command sequence C-M-s, which is CTRL-ALT-s on most computers. The Emacs command C-M-r performs a reverse regular expression search.

More useful however is the regular expression search and replace function. It is activated by typing C-M-%, that is CTRL-ALT-% on most computers. When this command sequence is entered, the user is asked to enter an expression to find the text to replace and then an expression to use to replace the text found. The regular expression syntax is shown below:

Regular Expressions
any single character except a newline              .   (dot)
zero or more repeats                               *
one or more repeats                                +
zero or one repeat                                 ?
any character in the set                           [ ... ]
any character not in the set                       [^ ... ]
beginning of line                                  ^
end of line                                        $
quote a special character c                        \c
alternative ("or")                                 \|
grouping                                           \( ... \)
nth group                                          \n
beginning of buffer                                \`
end of buffer                                      \'
word break                                         \b
not beginning or end of word                       \B
beginning of word                                  \<
end of word                                        \>
any word-syntax character                          \w
any non-word-syntax character                      \W
character with syntax c                            \sc
character with syntax not c                        \Sc

If you had an HTML file and you wanted to replace every occurrence of "<table>" with "<table border="1">" you could use:

Query replace regexp: <table> with: <table border="1">

You can use back-references too:

Query replace regexp: \(this\)\(.*\)\(that\) with: \3\2\1

When operated on:

Switch this and that!
Switch this and then switch that!
Take this! and take that too!

Produces:

Switch that and this!
Switch that and then switch this!
Take that! and take this too!
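
For comparison, the same swap can be sketched with Python's re.sub. This is an assumed stand-in for illustration, not how Emacs implements the command; Python writes the groups without backslashes but uses the same \1, \2, \3 back references in the replacement.

```python
import re

lines = [
    "Switch this and that!",
    "Switch this and then switch that!",
    "Take this! and take that too!",
]

# Three groups: "this", everything in between, and "that"; the replacement reverses them.
swapped = [re.sub(r"(this)(.*)(that)", r"\3\2\1", line) for line in lines]

for line in swapped:
    print(line)
```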

java.util.regex, Java 1.4

java.util.regex provides classes for matching character sequences against regular expressions. Its two main classes are Matcher and Pattern. Pattern holds a compiled representation of a regular expression. Matcher provides the methods needed to match a character sequence against a Pattern. The java.util.regex entry in the Java API 1.4 can be found at http://java.sun.com/j2se/1.4/docs/api/java/util/regex/package-summary.html.

Last modified: 07/05/2006 (most likely earlier as a site migration in 2006 reset some dates)
