Even experienced programmers sometimes do not know everything about PHP regular expressions, and some useful things are not really stressed in the official PHP manual. This article is intended for beginner PHP programmers: people who already know what regular expressions are but have not used them very often.

1. How to make PHP regular expressions ungreedy?

PHP regular expressions are "greedy" by default. It means that quantifiers like *, + and ? consume as many characters as possible. The quantifiers *, + and ? mean:

* means 0 or more repetitions of the preceding item, same as {0,}
+ means 1 or more repetitions of the preceding item, same as {1,}
? means 0 or 1 occurrence of the preceding item, same as {0,1}

The "greediness" of the quantifiers *, +, ? can be illustrated by the example:
<?php
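A minimal sketch of such an example, assuming the sample string aaaaabbbbb and the pattern /(.*)(b+)/ (both inferred from the output described in this article, not taken from its listing). The square brackets in the replacement are only an illustrative way to mark what the first group captured:

```php
<?php
// Assumed sample data and pattern, inferred from the described output.
$string = 'aaaaabbbbb';

// Wrap the text captured by the first group in [ ] and keep the second group.
// The limit of 1 stops after the first match.
$marked = preg_replace('/(.*)(b+)/', '[$1]$2', $string, 1);

echo $marked, "\n"; // [aaaaabbbb]b
```

Without the marker brackets the output is the plain string aaaaabbbbb.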
which would produce: aaaaabbbbb

You see that the first capturing pattern (.*) has consumed 5 letters a and 4 letters b, i.e. as many characters as possible while still letting the rest of the pattern match. This is because it is "greedy".

To make the quantifiers *, +, ? "ungreedy", it is enough to put ? right after them. It means you have to use *? instead of *, +? instead of + and ?? instead of ?. So let us slightly modify the previous example:
<?php
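A sketch of the modified example under the same kind of assumptions: the string aaaaabbbbb and the pattern /(.*?)(b+)/ are guesses consistent with the text, and the square brackets are just a hypothetical marker for the first group:

```php
<?php
// Assumed sample data; only the first quantifier is made ungreedy with ?.
$string = 'aaaaabbbbb';

// (.*?) now stops as early as possible, so (b+) gets all the letters b.
$marked = preg_replace('/(.*?)(b+)/', '[$1]$2', $string, 1);

echo $marked, "\n"; // [aaaaa]bbbbb
```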
We only added ? after the *, but now the example would produce: aaaaabbbbb

This is because the first capturing pattern (.*?) is not "greedy" any more. Now it consumes as few characters as possible.

You could also make ALL the quantifiers in a regular expression "ungreedy" by using the U modifier. I.e. the following example
<?php
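A possible form of this example, again assuming the string aaaaabbbbb and the pattern /(.*)(b+)/ with the U modifier appended (the brackets are an illustrative marker, not part of the article's code):

```php
<?php
// Assumed sample data; the U modifier makes every quantifier ungreedy.
$string = 'aaaaabbbbb';

// Both (.*) and (b+) now match as little as possible; the rest of the
// string is left untouched because the limit is 1.
$marked = preg_replace('/(.*)(b+)/U', '[$1]$2', $string, 1);

echo $marked, "\n"; // [aaaaa]bbbbb
```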
would again produce: aaaaabbbbb

Please be careful! The question mark ? changes the behavior of the quantifiers *, +, ? from "greedy" to "ungreedy" in a "greedy" regular expression. But the same question mark ? changes the behavior of the quantifiers *, +, ? from "ungreedy" to "greedy" in an "ungreedy" regular expression! Let us illustrate this idea by the example:
<?php
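A sketch of how this could look, assuming the string aaaaabbbbb and the pattern /(.*?)(b+)/U (inferred from the description; the brackets are a hypothetical marker for the first group):

```php
<?php
// Assumed sample data; U inverts greediness, and ? inverts it back.
$string = 'aaaaabbbbb';

// Under U, (.*?) is greedy again while (b+) stays ungreedy.
$marked = preg_replace('/(.*?)(b+)/U', '[$1]$2', $string, 1);

echo $marked, "\n"; // [aaaaabbbb]b
```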
It would produce: aaaaabbbbb as in the very first example of the article. Although we made the whole regular expression "ungreedy" by using the U modifier, the first capturing pattern (.*?) is still "greedy". It is "greedy" because we changed the behaviour of the quantifier * from "ungreedy" to "greedy" by adding a question mark ? to it. I.e. a question mark ? flips the "greediness" of the quantifiers *, +, ? in both "ungreedy" and "greedy" regular expressions.

2. How to denote a backslash in a regular expression.

It is very common to have to replace backslashes "\" in some string with common slashes "/". Of course, if this is all that has to be done, using a regular expression would be overkill. You could do it with the function str_replace():
<?php
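A minimal sketch, assuming a sample Windows-style path chosen to give the output quoted in this article (the variable names are illustrative):

```php
<?php
// Assumed sample path; '\\' inside single quotes is one literal backslash.
$path = 'c:\\somepath\\somefile.php';

// Search for a single backslash and replace it with a forward slash.
$converted = str_replace('\\', '/', $path);

echo $converted, "\n"; // c:/somepath/somefile.php
```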
It would produce: c:/somepath/somefile.php

Please notice that we used '\\', not '\' in str_replace(). '\' would produce a parse error. This is because the PHP parser would consider the second single quote in the string '\' to be escaped by the backslash "\". This is why we had to escape "\" with another "\".

Generally, in single quoted strings only 2 characters have to be escaped to denote themselves: the single quote "'" and the backslash "\". Other characters in single quoted strings are not interpreted by the PHP parser. E.g. the single quoted string '\n' means 2 characters, \ and the letter n, not a line break as it would in a double quoted string.

Still, it could be useful to know how to replace backslashes "\" in strings with common slashes "/" with a regular expression. The following code:
<?php
would produce a parse error. To replace "\" with "/" correctly, we would have to use the following code:
<?php
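A sketch of the working replacement, assuming the same kind of sample path as the output quoted in this article (variable names are illustrative):

```php
<?php
// Assumed sample path containing two literal backslashes.
$path = 'c:\\somepath\\somefile.php';

// PHP turns the 4 backslashes into 2; the regex engine then reads
// those 2 as one escaped, literal backslash.
$converted = preg_replace('/\\\\/', '/', $path);

echo $converted, "\n"; // c:/somepath/somefile.php
```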
It would produce: c:/somepath/somefile.php

In this example we had to use 4 backslashes "\\\\" in the regular expression to denote just 1 backslash. This is because every backslash in a C-like string must be escaped by one more backslash, so we get 2 backslashes instead of 1. But each backslash in a regular expression must be escaped by another backslash too, so we get 4 backslashes. The same result could be achieved in the following way:
<?php
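A sketch of the character class variant, assuming the same sample path. Note that PCRE keeps treating the backslash as an escape character inside a character class, so this sketch writes the class with 4 backslashes in the PHP source; with only 2, the regex engine would see [\] and read \] as an escaped bracket:

```php
<?php
// Assumed sample path containing two literal backslashes.
$path = 'c:\\somepath\\somefile.php';

// The regex engine receives [\\], a class holding one literal backslash.
$converted = preg_replace('/[\\\\]/', '/', $path);

echo $converted, "\n"; // c:/somepath/somefile.php
```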
Here we use a backslash inside a character class. Be careful: PCRE treats a backslash as an escape character inside a character class too, so the regular expression itself has to contain [\\], and in a PHP string that is again written with 4 backslashes. A pattern whose class is just [\] does not compile, because \] is read as an escaped closing bracket. So the character class does not save any backslashes here.

3. How to match a variable name in a regular expression.

Sometimes it is necessary to match a variable name in a regular expression. Not the variable value, but the variable name. E.g. it could be necessary to match a string like this:
$string = "\$a";
or (the same):
$string = '$a';
Of course we could match it with a regular expression like this:
<?php
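A minimal sketch of such a check, using the single quoted pattern '/\$a/' discussed in this section (the variable names and the "not matched" branch are illustrative assumptions):

```php
<?php
$string = '$a'; // the two characters $ and a, not the value of a variable

// Single quoted pattern: one backslash is enough to make $ literal.
$result = preg_match('/\$a/', $string) ? 'matched' : 'not matched';

echo $result, "\n"; // matched
```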
This would produce "matched". We used a single quoted string for the regular expression: '/\$a/'. So we had to place only 1 backslash before "$". The backslash is necessary because "$" has a special meaning in regular expressions (it denotes the end of a string). But sometimes it could be necessary to use double quoted strings in regular expressions. In this case the code would look like this:
<?php
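A sketch of the double quoted variant, assuming the same subject string as before (variable names are illustrative). It also compares the two spellings to show that PHP reduces both to the same pattern:

```php
<?php
$string = '$a'; // the two characters $ and a

// Double quoted pattern: "\\" becomes \ and "\$" becomes $.
$pattern = "/\\\$a/";
$samePattern = ($pattern === '/\$a/'); // true: both spell /\$a/

$result = preg_match($pattern, $string) ? 'matched' : 'not matched';

echo $result, "\n"; // matched
```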
Here we have 3 backslashes before $a. When the string is parsed by the PHP parser, "\\" will become "\" and "\$" will become "$". So we will still have /\$a/ as the regular expression pattern, like in the previous case.

4. How to use a binary zero in regular expressions.

Sometimes we have to process binary data files with regular expressions. It is not unheard of to meet a binary zero in such data. In a C-like string the binary zero "\x00" would be considered as the end of the string. So to use a binary zero in a regular expression, we have to write it like that: "\\x00". I.e. we have to escape it with a backslash, so that the regex engine receives the four characters \x00 and interprets them itself.

5. How to use recursion in regular expressions.

Recursion could be used to match a recursively repeated pattern in a string. The most common example of recursive patterns is the nested parentheses problem. To understand recursion in PHP regular expressions, let us consider a simple example. Imagine you have a pattern enclosed in correctly nested brackets somewhere inside the text. The number of nested brackets could be unlimited. You'd like to capture this bracketed pattern. You could do it like this:
<?php
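A minimal sketch of such a recursive match. The sample text and the exact pattern are assumptions, chosen because they give the output shown in this article:

```php
<?php
// Assumed sample text with a correctly nested bracketed part.
$text = 'some text (a(b(c)d)e) more text';

// \( ... \): a bracketed group whose content is either a run of
// non-bracket characters or, via (?R), the whole pattern again.
preg_match('/\(([^()]+|(?R))*\)/', $text, $matches);

print_r($matches);
```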
This example produces: Array
(
[0] => (a(b(c)d)e)
[1] => e
)
As you see, the array element $matches[0] captures the piece of text we were looking for (the piece of text enclosed in correctly nested brackets). Let's consider how it works. We made the regular expression recursive by adding (?R) to it. (?R) means recursive substitution of the entire regular expression: each time the regex engine reaches (?R), it applies the entire regular expression again at that position. So in our particular case almost the same result could be obtained by using the regular expression:
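A sketch of what such a hand-expanded expression might look like. The pattern here is an illustrative assumption, not the article's own expression: it nests the bracket rule three levels deep and uses non-capturing groups, so no (?R) is needed:

```php
<?php
// Hypothetical hand-expanded pattern, built level by level.
$depth1 = '\([^()]*\)';                       // brackets with no nesting
$depth2 = '\((?:[^()]+|' . $depth1 . ')*\)';  // up to 2 levels
$depth3 = '\((?:[^()]+|' . $depth2 . ')*\)';  // up to 3 levels
$fixed  = '/' . $depth3 . '/';

preg_match($fixed, 'some text (a(b(c)d)e) more text', $threeLevels);
echo $threeLevels[0], "\n"; // (a(b(c)d)e)

// With a fourth level the outermost brackets no longer match,
// so only an inner part is found.
preg_match($fixed, '(a(b(c(d)e)f)g)', $fourLevels);
echo $fourLevels[0], "\n"; // (b(c(d)e)f)
```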
But it works only for not more than 3 nested brackets. So if we do not know the nesting depth in advance, we have to use (?R) instead, which allows us unlimited nesting depth and simplifies the regular expression syntax. Let us check manually how the pattern works.
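One way to check the behaviour by hand is to run the pattern on progressively deeper inputs and look at what the capturing group holds afterwards. The sketch below assumes the pattern /\(([^()]+|(?R))*\)/, which is consistent with the output shown in this article but is not taken from its listing; the input strings are made up for illustration:

```php
<?php
// Assumed recursive pattern with one capturing group that is repeated by *.
$pattern = '/\(([^()]+|(?R))*\)/';

$inputs = ['(a)', '(a(b))', '(a(b)c)', '(a(b(c)d)e)'];
$lastCaptures = [];

foreach ($inputs as $input) {
    preg_match($pattern, $input, $matches);
    // $matches[1] holds only what the group matched on its last repetition.
    $lastCaptures[$input] = $matches[1];
    echo $input, ' => ', $matches[1], "\n";
}
// (a) => a
// (a(b)) => (b)
// (a(b)c) => c
// (a(b(c)d)e) => e
```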
If this is how it works, it is clear why the second array element $matches[1] is equal to "e". The substring "e" is what the capturing group matched on its last repetition, and only the value captured on the last repetition is saved in the array. If we want to capture only $matches[0], we could do it like:
<?php
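A sketch of the non-capturing variant, assuming the same hypothetical sample text and the same pattern shape with "(?: )" in place of the capturing brackets:

```php
<?php
// Assumed sample text with a correctly nested bracketed part.
$text = 'some text (a(b(c)d)e) more text';

// (?: ... ) groups without capturing, so only $matches[0] is filled.
preg_match('/\((?:[^()]+|(?R))*\)/', $text, $matches);

print_r($matches);
```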
which produces: Array
(
[0] => (a(b(c)d)e)
)
Here we changed the capturing brackets "( )" to non-capturing ones "(?: )". Or we could do it even better:
<?php
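A sketch of the once-only variant, assuming the same hypothetical sample text and the same pattern shape with "(?> )" as the group:

```php
<?php
// Assumed sample text with a correctly nested bracketed part.
$text = 'some text (a(b(c)d)e) more text';

// (?> ... ) is a once-only (atomic) group: it does not capture, and the
// engine does not backtrack into it once it has matched.
preg_match('/\((?>[^()]+|(?R))*\)/', $text, $matches);

print_r($matches);
```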
which produces the same result: Array
(
[0] => (a(b(c)d)e)
)
Here we used a so-called once-only pattern "(?> )" (which is non-capturing) instead of capturing brackets "( )". Using once-only patterns (where possible) is recommended by the PHP manual, as they should make the regular expression faster. Once-only patterns are quite simple, so I do not describe them here; they are explained in detail in the official PHP manual. I did not use once-only patterns from the start in the examples for the sake of simplicity.

If you'd like to learn more about Perl Compatible Regular Expressions (PCRE) you could do it here. You could hire the coder who wrote this article at
E-mail: support@skdevelopment.com