Tuesday, June 7, 2011

Regular Expression in PHP with Examples


What is a Regular Expression?

A regular expression is a set of characters that specify a pattern. The term "regular" has nothing to do with a high-fiber diet. It comes from a term used to describe grammars and formal languages.

Regular expressions are used when you want to search for specific lines of text containing a particular pattern. Most of the UNIX utilities operate on ASCII files a line at a time. Regular expressions search for patterns on a single line, and not for patterns that start on one line and end on another.

It is simple to search for a specific word or string of characters. Almost every editor on every computer system can do this. Regular expressions are more powerful and flexible. You can search for words of a certain size. You can search for a word with four or more vowels that ends with an "s". Numbers, punctuation characters, you name it, a regular expression can find it. What happens once the program you are using finds it is another matter. Some just search for the pattern. Others print out the line containing the pattern. Editors can replace the string with a new pattern. It all depends on the utility.

Regular expressions confuse people because they look a lot like the file matching patterns the shell uses. They even act the same way, almost. The square brackets are similar, and the shell's asterisk acts similarly, but not identically, to the asterisk in a regular expression. In particular, the Bourne shell, C shell, find, and cpio use file name matching patterns and not regular expressions.
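The difference in how the asterisk behaves is easiest to see side by side. The following is only a sketch, assuming a POSIX shell and a standard grep; the string "abc" and the variable names are made up for the illustration.

```shell
# In a shell file name pattern, "*" by itself matches any run of characters,
# so the pattern a* accepts any string that starts with "a".
case "abc" in
  a*) glob_result="match" ;;
  *)  glob_result="no match" ;;
esac

# In a regular expression, "*" repeats the previous character set,
# so "^a*$" accepts only lines made up entirely of the letter "a".
if printf '%s\n' "abc" | grep '^a*$' >/dev/null; then
  regex_result="match"
else
  regex_result="no match"
fi

echo "shell pattern: $glob_result"
echo "regular expression: $regex_result"
```

The same two characters, "a*", accept "abc" as a file name pattern but reject it as an anchored regular expression.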

Remember that shell meta-characters are expanded before the shell passes the arguments to the program. To prevent this expansion, the special characters in a regular expression must be quoted when passed as an option from the shell. You already know how to do this because I covered this topic in last month's tutorial.
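As a small illustration of why the quoting matters, the sketch below uses echo as a stand-in for any program that receives a pattern as an argument. The scratch directory and the file names a1 and a2 are hypothetical, and mktemp -d is assumed to be available.

```shell
# Work in a scratch directory so the file names are known in advance.
tmpdir=$(mktemp -d) || exit 1
touch "$tmpdir/a1" "$tmpdir/a2"

# Unquoted: the shell replaces a* with the matching file names
# before the program ever sees the argument.
unquoted=$(cd "$tmpdir" && echo a*)

# Quoted: the pattern reaches the program unchanged.
quoted=$(cd "$tmpdir" && echo 'a*')

rm -rf "$tmpdir"

echo "unquoted argument: $unquoted"
echo "quoted argument:   $quoted"
```

With single quotes around the pattern, a utility such as grep receives the regular expression itself rather than a list of file names.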

The Structure of a Regular Expression

There are three important parts to a regular expression. Anchors are used to specify the position of the pattern in relation to a line of text. Character Sets match one or more characters in a single position. Modifiers specify how many times the previous character set is repeated. A simple example that demonstrates all three parts is the regular expression "^#*". The caret ("^") is an anchor that indicates the beginning of the line. The character "#" is a simple character set that matches the single character "#". The asterisk is a modifier. In a regular expression it specifies that the previous character set can appear any number of times, including zero. This is a useless regular expression, as you will see shortly.

There are also two types of regular expressions: the "basic" regular expression and the "extended" regular expression. A few utilities like awk and egrep use the extended expression. Most use the basic regular expression. From now on, if I talk about a "regular expression," it describes a feature in both types.

Here is a table of the Solaris (around 1991) commands that allow you to specify regular expressions:

Utility    Regular Expression Type
vi         Basic
sed        Basic
grep       Basic
csplit     Basic
dbx        Basic
dbxtool    Basic
more       Basic
ed         Basic
expr       Basic
lex        Basic
pg         Basic
nl         Basic
rdist      Basic
awk        Extended
nawk       Extended
egrep      Extended
EMACS      EMACS Regular Expressions
PERL       PERL Regular Expressions

The Anchor Characters: ^ and $

Most UNIX text facilities are line oriented. Searching for patterns that span several lines is not easy to do. You see, the end of line character is not included in the block of text that is searched. It is a separator. Regular expressions examine the text between the separators. If you want to search for a pattern that is at one end or the other, you use anchors. The character "^" is the starting anchor, and the character "$" is the end anchor. The regular expression "^A" will match all lines that start with a capital A. The expression "A$" will match all lines that end with a capital A. If the anchor characters are not used at the proper end of the pattern, then they no longer act as anchors. That is, the "^" is only an anchor if it is the first character in a regular expression. The "$" is only an anchor if it is the last character. The expression "$1" does not have an anchor. Neither does "1^". If you need to match a "^" at the beginning of the line, or a "$" at the end of a line, you must escape the special characters with a backslash. Here is a summary:

Pattern    Matches
^A         "A" at the beginning of a line
A$         "A" at the end of a line
A^         "A^" anywhere on a line
$A         "$A" anywhere on a line
^^         "^" at the beginning of a line
$$         "$" at the end of a line
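The rows of this summary can be tried out with grep, which uses basic regular expressions. This is a sketch only: the sample lines and the helper function count are invented for the illustration, and a POSIX shell with a standard grep is assumed.

```shell
# Hypothetical sample text; single quotes keep "$A" and "^" literal.
sample='Apple
BANANA
A^B
cost $A
^caret first
dollar last$'

# Count the lines of $sample that match the basic regular expression in $1.
count() {
  printf '%s\n' "$sample" | grep -c -e "$1"
}

start_a=$(count '^A')         # "Apple" and "A^B" begin with "A"
end_a=$(count 'A$')           # "BANANA" and "cost $A" end with "A"
literal_caret=$(count 'A^')   # "^" is not first, so it is an ordinary character
literal_dollar=$(count '$A')  # "$" is not last, so it is an ordinary character
caret_first=$(count '^^')     # an anchor followed by a literal "^"
dollar_last=$(count '$$')     # a literal "$" followed by an anchor

echo "$start_a $end_a $literal_caret $literal_dollar $caret_first $dollar_last"
```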

The use of "^" and "$" as indicators of the beginning or end of a line is a convention other utilities use. The vi editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses "!^" to specify the first argument of the previous line, and "!$" is the last argument on the previous line.

It is one of those choices that other utilities go along with to maintain consistency. For instance, "$" can refer to the last line of a file when using ed and sed. The command cat -e marks the end of each line with a "$". You might see it in other programs as well.

Matching a character with a character set

The simplest character set is a character. The regular expression "the" contains three character sets: "t," "h" and "e". It will match any line with the string "the" inside it. This would also match the word "other". To prevent this, put spaces before and after the pattern: " the ". You can combine the string with an anchor. The pattern "^From: " will match the lines of a mail message that identify the sender. Use this pattern with grep to print every address in your incoming mail box:

grep '^From: ' /usr/spool/mail/$USER
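To try these patterns without a real mail box, the sketch below runs grep over a few invented lines. The addresses, the sample text, and the variable names are all hypothetical.

```shell
# Hypothetical lines of ordinary text.
text='the other day
another one
in the end'

# Hypothetical lines of a mail message.
mailbox='From: alice@example.com
Subject: From: is not only a header
From: bob@example.com'

# "the" also matches inside "other" and "another".
plain=$(printf '%s\n' "$text" | grep -c 'the')

# " the " needs a space on both sides of the word.
spaced=$(printf '%s\n' "$text" | grep -c ' the ')

# The anchor restricts the match to lines that start with "From: ".
senders=$(printf '%s\n' "$mailbox" | grep -c '^From: ')

echo "plain=$plain spaced=$spaced senders=$senders"
```

Note that " the " skips the first sample line, because a "the" at the very start of a line has no space in front of it.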

Some characters have a special meaning in regular expressions. If you want to search for such a character, escape it with a backslash.

Match any character with .

The character "." is one of those special meta-characters. By itself it will match any character, except the end-of-line character. The pattern that will match a line with a single character is

^.$
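A short sketch of the period as a meta-character, and of the backslash that turns its special meaning off. The sample lines and variable names are invented; a standard grep is assumed.

```shell
# Hypothetical sample text.
sample='a
ab
.
a.b
axb'

# Lines that consist of exactly one character: "a" and ".".
single=$(printf '%s\n' "$sample" | grep -c '^.$')

# An unescaped "." matches any character: both "a.b" and "axb".
any=$(printf '%s\n' "$sample" | grep -c 'a.b')

# An escaped "\." matches only a literal period: "a.b".
literal=$(printf '%s\n' "$sample" | grep -c 'a\.b')

echo "single=$single any=$any literal=$literal"
```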

Specifying a Range of Characters with [...]

If you want to match specific characters, you can use the square brackets to identify the exact characters you are searching for. The pattern that will match any line of text that consists of exactly one digit, and nothing else, is

^[0123456789]$

This is verbose. You can use the hyphen between two characters to specify a range:

^[0-9]$

You can intermix explicit characters with character ranges. This pattern will match a single character that is a letter, number, or underscore:

[A-Za-z0-9_]
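The sketch below checks that the verbose and the range forms behave the same, using invented sample lines. Setting LC_ALL=C is an assumption added here so that the ranges follow plain ASCII order regardless of the local language settings.

```shell
# Hypothetical sample text, one candidate per line.
sample='7
42
x
_
-'

# The explicit list and the range select the same lines: only "7".
listed=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^[0123456789]$')
ranged=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^[0-9]$')

# A single letter, number, or underscore: "7", "x" and "_".
word_char=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^[A-Za-z0-9_]$')

echo "listed=$listed ranged=$ranged word_char=$word_char"
```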

Character sets can be combined by placing them next to each other. If you wanted to search for a word that

1. started with a capital letter "T",
2. was the first word on a line,
3. had a lower case letter as its second letter,
4. was exactly three letters long, and
5. had a vowel as its third letter,

the regular expression would be "^T[a-z][aeiou] ".
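Here is one way to try that expression, as a sketch only. The sample lines are invented, and LC_ALL=C is an added assumption so that "[a-z]" means exactly the lower case ASCII letters.

```shell
# Hypothetical sample text.
sample='The cat
Tea time
The
Then what
the end
THE END
A Tea'

# Print the lines that start with a three-letter word of the required shape.
matched=$(printf '%s\n' "$sample" | LC_ALL=C grep '^T[a-z][aeiou] ')
matched_count=$(printf '%s\n' "$sample" | LC_ALL=C grep -c '^T[a-z][aeiou] ')

echo "$matched"
echo "matching lines: $matched_count"
```

Only "The cat" and "Tea time" are selected. The line that holds just "The" is skipped because the expression requires a space after the third letter.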

Exceptions in a character set

You can easily search for all characters except those in square brackets by putting a "^" as the first character after the "[". To match all characters except vowels use "[^aeiou]".

Just as "^" and "$" lose their special meaning when they are not in an anchor position, the characters "]" and "-" do not have a special meaning if they directly follow "[". Here are some examples:

Regular Expression    Matches
[]                    The characters "[]"
[0]                   The character "0"
[0-9]                 Any number
[^0-9]                Any character other than a number
[-0-9]                Any number or a "-"
[0-9-]                Any number or a "-"
[^-0-9]               Any character except a number or a "-"
[]0-9]                Any number or a "]"
[0-9]]                Any number followed by a "]"
[0-9-z]               Any number, or any character between "9" and "z"
[0-9\-a\]]            Any number, or a "-", a "a", or a "]"
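A few of these rows can be checked with grep. The following is a sketch under the usual assumptions: a POSIX shell, a standard grep, and invented sample lines. The helper function count is hypothetical.

```shell
# Hypothetical sample text.
sample='aeiou
a-b
x]
42
sky'

# Count the lines of $sample that match the basic regular expression in $1.
count() {
  printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

non_vowel=$(count '[^aeiou]')      # every line except "aeiou"
bracket_or_digit=$(count '[]0-9]') # "x]" and "42"
dash_first=$(count '[-0-9]')       # "a-b" and "42"
dash_last=$(count '[0-9-]')        # same lines as above
neither=$(count '[^-0-9]')         # every line except "42"

echo "$non_vowel $bracket_or_digit $dash_first $dash_last $neither"
```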

Repeating character sets with *

The third part of a regular expression is the modifier. It is used to specify how many times you expect to see the previous character set. The special character "*" matches zero or more copies. That is, the regular expression "0*" matches zero or more zeros, while the expression "[0-9]*" matches zero or more numbers.

This explains why the pattern "^#*" is useless, as it matches any number of "#'s" at the beginning of the line, including zero. Therefore this will match every line, because every line starts with zero or more "#'s".

At first glance, it might seem that starting the count at zero is stupid. Not so. Looking for an unknown number of characters is very important. Suppose you wanted to look for a number at the beginning of a line, and there may or may not be spaces before the number. Just use "^ *" to match zero or more spaces at the beginning of the line. If you need to match one or more, just repeat the character set. That is, "[0-9]*" matches zero or more numbers, and "[0-9][0-9]*" matches one or more numbers.
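The effect of "zero or more" can be seen by counting matching lines. This sketch assumes a standard grep; the sample lines and variable names are made up for the illustration.

```shell
# Hypothetical sample text (the third line starts with two spaces).
sample='# comment
no hash
  42 apples
7 days
x 9'

# "^#*" matches every line, because zero "#" characters is enough.
every=$(printf '%s\n' "$sample" | grep -c '^#*')

# "^ *[0-9]*" also matches every line: zero spaces, then zero numbers.
zero_or_more=$(printf '%s\n' "$sample" | grep -c '^ *[0-9]*')

# Repeating the set demands at least one number after the optional spaces.
one_or_more=$(printf '%s\n' "$sample" | grep -c '^ *[0-9][0-9]*')

echo "every=$every zero_or_more=$zero_or_more one_or_more=$one_or_more"
```

Only "  42 apples" and "7 days" pass the last test, since they are the only lines whose first non-space character is a number.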

Matching a specific number of sets with \{ and \}

You can continue the above technique if you want to specify a minimum number of character sets. You cannot specify a maximum number of sets with the "*" modifier. There is a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between "\{" and "\}". The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. A period is matched by a "\." and an asterisk is matched by a "\*".

If a backslash is placed before a "<", ">", "{", "}", "(", ")", or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of "{" would have broken old expressions. This is a horrible crime punishable by a year of hard labor writing COBOL programs. Instead, adding a backslash added functionality without breaking old programs. Rather than complain about the asymmetry, view it as evolution.

Having convinced you that "\{" isn't a plot to confuse you, an example is in order. The regular expression to match 4, 5, 6, 7 or 8 lower case letters is

[a-z]\{4,8\}

Any number between 0 and 255 can be used. The second number may be omitted, which removes the upper limit. If the comma and the second number are omitted, the pattern must be duplicated the exact number of times specified by the first number.
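The three forms can be compared with grep, as sketched below. The sample lines and the helper function count are invented, and LC_ALL=C is an added assumption. The anchors are added here so that the upper limit is visible: without them the expression only has to match somewhere inside the line.

```shell
# Hypothetical sample text: words of 3, 4, 8 and 9 lower case letters.
sample='abc
abcd
abcdefgh
abcdefghi'

# Count the lines of $sample that match the basic regular expression in $1.
count() {
  printf '%s\n' "$sample" | LC_ALL=C grep -c -e "$1"
}

between=$(count '^[a-z]\{4,8\}$')   # 4 to 8 letters: "abcd", "abcdefgh"
at_least=$(count '^[a-z]\{4,\}$')   # 4 or more letters: all but "abc"
exactly=$(count '^[a-z]\{4\}$')     # exactly 4 letters: "abcd"
unanchored=$(count '[a-z]\{4,8\}')  # any line containing 4 letters in a row

echo "between=$between at_least=$at_least exactly=$exactly unanchored=$unanchored"
```

The unanchored form also accepts the nine-letter line, because eight of its letters are enough to satisfy the expression.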
