Tuesday, June 7, 2011

Regular Expressions - User Guide

Simple Matching


STRING1 Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
STRING2 Mozilla/4.75 [en](X11;U;Linux2.2.16-22 i586)

Search for

STRING1 match Finds the m in compatible
STRING2 no match There is no lower case m in this string. Searches are case sensitive unless you take special action.

STRING1 match Found in Mozilla/4.0 - any combination of characters can be used for the match
STRING2 match Found in same place as in STRING1

5 [
STRING1 no match The search is looking for a pattern of '5 [' and this does NOT exist in STRING1. Spaces are valid in searches.
STRING2 match Found in Mozilla/4.75 [en]

STRING1 match found in Windows
STRING2 match Found in Linux

STRING1 match found in compatible
STRING2 no match There is an l and an e in this string but they are not adjacent (or contiguous).

[ ]

Match anything inside the square brackets for ONE character position once and only once, for example, [12] means match the target to 1 and if that does not match then match the target to 2 while [0123456789] means match to any character in the range 0 to 9.


The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9].

You can define more than one range inside a list, for example, [0-9A-C] means check for 0 to 9 and A to C (but not a to c).

NOTE: To test for - inside brackets (as a literal) it must come first or last, that is, [-0-9] will test for - and 0 to 9.


The ^ (circumflex or caret) inside square brackets negates the expression (we will see an alternate use for the circumflex/caret outside square brackets later), for example, [^Ff] means anything except upper or lower case F and [^a-z] means everything except lower case a to z.

STRING1 match finds ind in Windows
STRING2 match finds inu in Linux

STRING1 no match Again the tests are case sensitive to find the xt in DigExt we would need to use [0-9a-z] or [0-9A-Zt]. We can also use this format for testing upper and lower case e.g. [Ff] will check for lower and upper case F.
STRING2 match Finds x2 in Linux2

STRING1 match Finds Win in Windows
STRING2 no match We have excluded the range A to M in our search so Linux is not found but linux (if it were present) would be found.


The ? (question mark) matches the preceding character 0 or 1 times only, for example, colou?r will find both color (0 times) and colour (1 time).


The * (asterisk or star) matches the preceding character 0 or more times, for example, tre* will find tree (2 times) and tread (1 time) and trough (0 times).


The + (plus) matches the previous character 1 or more times, for example, tre+ will find tree (2 times) and tread (1 time) but not trough (0 times).


Matches the preceding character, or character range, n times exactly, for example, to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567.

No comments: