Tuesday, June 7, 2011

Regular Expressions - User Guide

Simple Matching

::Example::

STRING1 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2 Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)


Search for

m
STRING1 match Finds the m in compatible
STRING2 no match There is no lower case m in this string. Searches are case sensitive unless you take special action.

a/4
STRING1 match Found in Mozilla/4.0 - any combination of characters can be used for the match
STRING2 match Found in same place as in STRING1

5 [
STRING1 no match The search is looking for a pattern of '5 [' and this does NOT exist in STRING1. Spaces are valid in searches.
STRING2 match Found in Mozilla/4.75 [en]

in
STRING1 match found in Windows
STRING2 match Found in Linux

le
STRING1 match found in compatible
STRING2 no match There is an l and an e in this string but they are not adjacent (or contiguous).

[ ]

Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.

-

The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9].

You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c).

NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.

^

The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.

in[du]
STRING1 match finds ind in Windows
STRING2 match finds inu in Linux

x[0-9A-Z]
STRING1 no match Again the tests are case sensitive to find the xt in DigExt we would need to use [0-9a-z] or [0-9A-Zt]. We can also use this format for testing upper and lower case e.g. [Ff] will check for lower and upper case F.
STRING2 match Finds x2 in Linux2

[^A-M]in
STRING1 match Finds Win in Windows
STRING2 no match We have excluded the range A to M in our search so Linux is not found but linux (if it were present) would be found.

?

The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color (0 times) and colour (1 time).

*

The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree (2 times) and tread (1 time) and trough (0 times).

+

The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree (2 times) and tread (1 time) but not trough (0 times).

{n}

Matches the preceding character, or character range, n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567.

No comments: