Regular Expressions - Matching Rules
Everything starts from the basics. A pattern is the most fundamental element of a regular expression: a sequence of characters that describes the characteristics of a string. A pattern can be simple, consisting only of ordinary characters, or very complex, using special characters to represent a range of characters, repetition, or context. For example:
^once
This pattern contains the special character ^, which indicates that the pattern only matches strings that start with once. For example, it matches the string "once upon a time", but does not match "There once was a man from New York". Just as the ^ symbol marks the beginning of a string, the $ symbol matches strings that end with the given pattern.
bucket$
This pattern matches "Who kept all of his cash in a bucket", but does not match "buckets". When the characters ^ and $ are used together, they indicate an exact match (the string is identical to the pattern). For example:
^bucket$
This pattern matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern:
once
matches the string
There once was a man from New York
Who kept all of his cash in a bucket.
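The anchoring rules above can be checked with a short sketch. Python's `re` module is used here purely for illustration (the choice of language is an assumption, not something the text prescribes), and the variable names are hypothetical.

```python
import re

# Sample strings taken from the text above.
line1 = "once upon a time"
line2 = "There once was a man from New York"
line3 = "Who kept all of his cash in a bucket"

# ^once matches only when the string starts with "once".
starts_1 = re.search(r"^once", line1) is not None  # True
starts_2 = re.search(r"^once", line2) is not None  # False

# bucket$ matches only when the string ends with "bucket".
ends_3 = re.search(r"bucket$", line3) is not None            # True
ends_buckets = re.search(r"bucket$", "buckets") is not None  # False

# ^bucket$ requires the whole string to be exactly "bucket".
exact = re.search(r"^bucket$", "bucket") is not None     # True
exact_longer = re.search(r"^bucket$", line3) is not None  # False

# Without anchors, the pattern matches anywhere in the string.
contains = re.search(r"once", line2) is not None  # True
```

`re.search` is used instead of `re.match` so that the anchors in the pattern, not the function, decide where a match may begin.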
The letters (o-n-c-e) in this pattern are literal characters, meaning they represent the letters themselves. Digits work the same way. For slightly more complex characters, such as punctuation and whitespace (spaces, tabs, etc.), escape sequences are used. All escape sequences start with a backslash (\). The escape sequence for a tab is \t. So, to detect whether a string starts with a tab, we can use this pattern:
^\t
Similarly, \n represents a newline and \r represents a carriage return. Other special symbols can be escaped by placing a backslash in front of them: the backslash itself is written as \\, the period as \., and so on.
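A minimal sketch of these escape sequences, assuming Python's `re` module as the engine (an illustrative choice; the sample strings are made up):

```python
import re

# ^\t: the string must start with a tab character.
tab_start = re.search(r"^\t", "\tindented") is not None  # True
no_tab = re.search(r"^\t", "not indented") is not None   # False

# An unescaped period matches any character; \. matches only a literal period.
loose = re.search(r"^a.c$", "abc") is not None     # True
strict = re.search(r"^a\.c$", "abc") is not None   # False
literal = re.search(r"^a\.c$", "a.c") is not None  # True

# \\ in the pattern matches one literal backslash in the text.
backslash = re.search(r"\\", "C:\\temp") is not None  # True
```

The patterns are written as raw strings (`r"..."`) so that Python passes the backslashes through to the regex engine unchanged.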
In Internet applications, regular expressions are often used to validate user input. When a user submits a form, the application needs to determine whether the entered phone number, address, email address, credit card number, and so on are valid. Ordinary literal characters are not enough for this.
Therefore, a more flexible way to describe the desired pattern is needed: the character class. To create a character class representing all vowels, place all the vowel characters inside square brackets:
[AaEeIiOoUu]
This pattern matches any vowel, but it represents only a single character. A hyphen can be used to represent a range of characters, such as:
[a-z]         // Matches all lowercase letters
[A-Z]         // Matches all uppercase letters
[a-zA-Z]      // Matches all letters
[0-9]         // Matches all digits
[0-9\.\-]     // Matches all digits, periods, and hyphens
[ \f\r\t\n]   // Matches all whitespace characters
Each of these also represents only a single character, which is very important. If you want to match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] covers the range of 26 letters, here it can only match the first character of the string, which must be a lowercase letter.
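The point that a character class stands for exactly one character can be verified with a small sketch. It assumes the vowel class is `[AaEeIiOoUu]` and the letter-plus-digit pattern is `^[a-z][0-9]$`, and it uses Python's `re` module only as an illustrative engine.

```python
import re

# A character class matches one character at a time.
vowels = re.findall(r"[AaEeIiOoUu]", "Regular Expression")
# -> ['e', 'u', 'a', 'E', 'e', 'i', 'o']

# One lowercase letter followed by one digit, and nothing else.
pattern = re.compile(r"^[a-z][0-9]$")
should_match = ["z2", "t6", "g7"]
should_not_match = ["ab2", "r2d3", "b52"]

# Keep only the candidates the pattern accepts.
matched = [s for s in should_match + should_not_match if pattern.search(s)]
# -> ['z2', 't6', 'g7']
```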
It was mentioned earlier that ^ represents the beginning of a string, but it has another meaning. When used inside a set of square brackets, it means "not" or "exclude", often used to exclude a certain character. Using the previous example, we require that the first character cannot be a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7", and "-2", but does not match "12" or "66". Here are a few examples of excluding specific characters:
[^a-z]      // All characters except lowercase letters
[^\\\/\^]   // All characters except (\), (/), and (^)
[^\"\']     // All characters except double quotes (") and single quotes (')
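As a quick check of negated classes, the sketch below assumes the pattern from the example is `^[^0-9][0-9]$` (a non-digit followed by a digit) and runs it through Python's `re` module, which is an illustrative choice of engine.

```python
import re

# First character must NOT be a digit, second character must be a digit.
pattern = re.compile(r"^[^0-9][0-9]$")
results = {s: bool(pattern.search(s)) for s in ["&5", "g7", "-2", "12", "66"]}
# -> {'&5': True, 'g7': True, '-2': True, '12': False, '66': False}

# [^a-z] picks out every character that is not a lowercase letter.
non_lower = re.findall(r"[^a-z]", "abC1-d")
# -> ['C', '1', '-']
```

Note that ^ means "start of string" outside the brackets and "not" only when it is the first character inside them.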
The special character . (period) in a regular expression represents any character except a newline. So the pattern ^.5$ matches any two-character string that ends with the digit 5 and starts with any non-newline character. The pattern .* matches any run of characters other than newlines (\n, and in some engines also \r).
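The behavior of the period can be sketched as follows, assuming Python's `re` module with default flags, where the period matches everything except \n. Other engines may treat \r differently.

```python
import re

two_char = re.compile(r"^.5$")

a = bool(two_char.search("a5"))    # True: any character, then 5
b = bool(two_char.search("55"))    # True: a digit is also "any character"
c = bool(two_char.search("\n5"))   # False: the period does not match a newline
d = bool(two_char.search("ab5"))   # False: three characters, not two

# .* consumes as much as it can but stops at the first newline.
first_line = re.match(r".*", "line one\nline two").group(0)
# -> 'line one'
```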
PHP's regular expressions have some built-in generic character classes, listed as follows:
| Character Class | Description |
|---|---|
| [[:alpha:]] | Any letter |
| [[:digit:]] | Any digit |
| [[:alnum:]] | Any letter or digit |
| [[:space:]] | Any whitespace character |
| [[:upper:]] | Any uppercase letter |
| [[:lower:]] | Any lowercase letter |
| [[:punct:]] | Any punctuation mark |
| [[:xdigit:]] | Any hexadecimal digit, equivalent to [0-9a-fA-F] |
So far, you know how to match a letter or a digit, but in more cases, you may need to match a word or a set of numbers. A word consists of several letters, and a set of numbers consists of several digits. The curly braces {} following a character or character class are used to determine the number of repetitions of the preceding content.
| Pattern | Description |
|---|---|
| ^[a-zA-Z_]$ | Any single letter or underscore |
| ^[[:alpha:]]{3}$ | All 3-letter words |
| ^a$ | The letter a |
| ^a{4}$ | aaaa |
| ^a{2,4}$ | aa, aaa, or aaaa |
| ^a{1,3}$ | a, aa, or aaa |
| ^a{2,}$ | Strings consisting of two or more a's |
| ^a{2,} | Like: aardvark and aaab, but not apple |
| a{2,} | Like: baad and aaa, but not Nantucket |
| \t{2} | Two tabs |
| .{2} | Any two characters |
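Several rows of the table above can be checked with a small sketch. It uses Python's `re` module as an assumed engine; the helper `matches` is a hypothetical name introduced only for this example.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# {4}: exactly four repetitions.
exactly = [s for s in ["aaa", "aaaa", "aaaaa"] if matches(r"^a{4}$", s)]
# -> ['aaaa']

# {2,4}: between two and four repetitions.
between = [s for s in ["a", "aa", "aaa", "aaaa", "aaaaa"] if matches(r"^a{2,4}$", s)]
# -> ['aa', 'aaa', 'aaaa']

# {2,} anchored at the start only: at least two a's at the beginning.
prefix = [s for s in ["aardvark", "aaab", "apple"] if matches(r"^a{2,}", s)]
# -> ['aardvark', 'aaab']

# {2,} with no anchors: at least two consecutive a's anywhere.
anywhere = [s for s in ["baad", "aaa", "Nantucket"] if matches(r"a{2,}", s)]
# -> ['baad', 'aaa']
```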
These examples describe three different uses of curly braces. A number {x} means the preceding character or character class appears exactly x times; a number followed by a comma {x,} means the preceding content appears x or more times; two numbers separated by a comma {x,y} means the preceding content appears at least x times but no more than y times. We can extend the pattern to more words or numbers:
^[a-zA-Z0-9_]{1,}$      // All strings containing one or more letters, digits, or underscores
^[1-9][0-9]{0,}$        // All positive integers
^\-{0,1}[0-9]{1,}$      // All integers
^[-]?[0-9]+\.?[0-9]+$   // All floating-point numbers
The last example is not easy to understand, is it? Read it this way: from the start of the string (^), an optional minus sign ([-]?), followed by one or more digits ([0-9]+), an optional decimal point (\.?), one or more further digits ([0-9]+), and nothing else after that ($). A simpler way to write such patterns is described below.
The special character ? is equivalent to {0,1}; they both represent: 0 or 1 of the preceding content or the preceding content is optional. So the previous example can be simplified to:
^\-?[0-9]{1,}\.?[0-9]{1,}$
The special character * is equivalent to {0,}; they both represent 0 or more of the preceding content. Finally, the character + is equivalent to {1,}, representing 1 or more of the preceding content. So the above four examples can be written as:
^[a-zA-Z0-9_]+$           // All strings containing one or more letters, digits, or underscores
^[1-9][0-9]*$             // All positive integers
^\-?[0-9]+$               // All integers
^[-]?[0-9]+(\.[0-9]+)?$   // All floating-point numbers
Of course, this does not technically reduce the complexity of regular expressions, but it makes them easier to read.
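The equivalence between the brace forms and the shorthand characters ?, * and + can be sketched as below. The patterns are written with their character classes spelled out, which is an assumption about the intended examples, and Python's `re` module is used only as an illustrative engine. The sample strings and the helper `verdicts` are hypothetical.

```python
import re

# Each pair holds a brace form and its shorthand equivalent.
pairs = [
    (r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$"),  # {1,} vs +
    (r"^[1-9][0-9]{0,}$", r"^[1-9][0-9]*$"),      # {0,} vs *
    (r"^-{0,1}[0-9]{1,}$", r"^-?[0-9]+$"),        # {0,1} vs ?
]
samples = ["abc_1", "42", "-42", "007", "", "3.14", "x y"]

def verdicts(pattern):
    """Return one True/False verdict per sample string."""
    return [re.search(pattern, s) is not None for s in samples]

# Both forms of each pair accept and reject exactly the same samples.
equivalent = all(verdicts(a) == verdicts(b) for a, b in pairs)  # True

# Grouping the fractional part makes it optional as a whole.
float_pattern = re.compile(r"^[-]?[0-9]+(\.[0-9]+)?$")
floats = [s for s in ["3.14", "-0.5", "42", "3.", ".5"] if float_pattern.search(s)]
# -> ['3.14', '-0.5', '42']
```

In this sketch the grouped pattern rejects "3." and ".5", because digits are required on both sides of the decimal point whenever it is present.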