Regular Expressions - Syntax
Regular expressions are a powerful tool for matching and manipulating text. They consist of ordinary characters and special characters (called "metacharacters") used to describe the text pattern to be matched.
Regular expressions can be used to find, replace, extract, and validate specific patterns within text.
For example:
- runoo+b can match runoob, runooob, runoooooob, etc. The + sign means the preceding character must appear at least once (1 or more times).
- runoo*b can match runob, runoob, runoooooob, etc. The * sign means the preceding character can appear zero times, once, or multiple times (0, 1, or more).
- colou?r can match color or colour. The ? question mark means the preceding character can appear at most once (0 or 1 time).
Regular expressions are constructed the same way as mathematical expressions: various metacharacters and operators combine small expressions into larger ones. The components of a regular expression can be a single character, a set of characters, a range of characters, a choice between characters, or any combination of these components.
A regular expression acts as a template, matching a character pattern against the searched string.
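The three quantifier examples above can be checked with a short script. This is a minimal sketch: the variable names are illustrative, and the ^ and $ anchors (covered later in this article) are added here only so that the whole word has to match.

```javascript
// Each pattern is anchored with ^ and $ so the whole word must match.
const onePlus = /^runoo+b$/;  // the second "o" must appear one or more times
const zeroPlus = /^runoo*b$/; // the second "o" may appear zero or more times
const optional = /^colou?r$/; // "u" may appear zero or one time

const plusResults = ["runob", "runoob", "runoooooob"].map((s) => onePlus.test(s));
const starResults = ["runob", "runoob", "runoooooob"].map((s) => zeroPlus.test(s));
const optResults = ["color", "colour", "colouur"].map((s) => optional.test(s));

console.log(plusResults); // [ false, true, true ]
console.log(starResults); // [ true, true, true ]
console.log(optResults);  // [ true, true, false ]
```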
Ordinary Characters
Ordinary characters include all printable and non-printable characters that are not explicitly specified as metacharacters, including all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.
| Character | Description |
|---|---|
| [ABC] | Matches any character listed in [...]. For example, [aeiou] matches all 'e', 'o', 'u', 'a' letters in the string "google taobao". |
| [^ABC] | Matches any character not listed in [...]. For example, [^aeiou] matches all characters except 'e', 'o', 'u', 'a' in the string "google taobao". |
| [A-Z] | [A-Z] represents a range, matching all uppercase letters; [a-z] matches all lowercase letters; [0-9] matches all digits. |
| . | Matches any single character except the newline characters (\n, \r), equivalent to [^\n\r]. |
| [\s\S] | Matches any character (including newline). \s matches any whitespace character (including newline), \S matches any non-whitespace character (excluding newline). Combining them matches any character. |
| \w | Matches letters, digits, and underscores, equivalent to [A-Za-z0-9_]. |
| \d | Matches any single Arabic numeral (0 to 9), equivalent to [0-9]. |
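The character classes in the table can be tried against the "google taobao" string it mentions. The following is a small sketch; the other sample strings and all variable names are made up for illustration.

```javascript
const text = "google taobao";

// [aeiou] collects every listed vowel; [^aeiou] collects everything else.
const vowels = text.match(/[aeiou]/g).join("");
const nonVowels = text.match(/[^aeiou]/g).join("");
console.log(vowels);    // ooeaoao
console.log(nonVowels); // ggl tb

// The dot skips the newline, while [\s\S] includes it.
const twoLines = "ab\ncd";
const dotCount = twoLines.match(/./g).length;
const anyCount = twoLines.match(/[\s\S]/g).length;
console.log(dotCount, anyCount); // 4 5

// \w keeps letters, digits and underscores; \d keeps digits only.
const wordChars = "a_1-b".match(/\w/g).join("");
const digits = "tel 555 ext 42".match(/\d/g).join("");
console.log(wordChars); // a_1b
console.log(digits);    // 55542
```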
Non-printable Characters
Non-printable characters can also be part of a regular expression. The following table lists escape sequences representing non-printable characters:
| Character | Description |
|---|---|
| \cx | Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return. The value of x must be one of A-Z or a-z; otherwise, c is treated as a literal 'c' character. |
| \f | Matches a form feed character, equivalent to \x0c and \cL. |
| \n | Matches a newline character, equivalent to \x0a and \cJ. |
| \r | Matches a carriage return character, equivalent to \x0d and \cM. |
| \s | Matches any whitespace character, including space, tab, form feed, etc., equivalent to [ \f\n\r\t\v]. Note that Unicode regular expressions will match full-width space characters. |
| \S | Matches any non-whitespace character, equivalent to [^ \f\n\r\t\v]. |
| \t | Matches a tab character, equivalent to \x09 and \cI. |
| \v | Matches a vertical tab character, equivalent to \x0b and \cK. |
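As a quick illustration of the escape sequences in this table, the sketch below inspects the whitespace in a small sample string. The string and the variable names are assumptions chosen for the example.

```javascript
const sample = "a\tb\nc d";

// \s matches each whitespace character: tab (9), newline (10), space (32).
const codes = sample.match(/\s/g).map((ch) => ch.charCodeAt(0));
console.log(codes); // [ 9, 10, 32 ]

// \S is the complement and keeps only the visible characters.
const visible = sample.match(/\S/g).join("");
console.log(visible); // abcd

// \t, \x09 and \cI are three spellings of the same tab character.
const tabForms = [/\t/, /\x09/, /\cI/].map((re) => re.test("\t"));
console.log(tabForms); // [ true, true, true ]
```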
Special Characters
Special characters are characters with special meanings, like the * in runoo*b mentioned above, which means "zero or more of the preceding character." To find the literal * symbol in a string, you need to escape it by placing a backslash before it, so runo\*ob matches the string runo*ob.
Many metacharacters require special treatment when attempting to match them. To match these special characters, you must first "escape" the character by placing a backslash before it. The following table lists special characters in regular expressions:
| Special Character | Description |
|---|---|
| $ | Matches the end position of the input string. If the RegExp object's Multiline property is set, $ also matches before '\n' or '\r'. To match the $ character itself, use \$. |
| ( ) | Mark the start and end positions of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \). |
| * | Matches the preceding subexpression zero or more times. To match the * character itself, use \*. |
| + | Matches the preceding subexpression one or more times. To match the + character itself, use \+. |
| . | Matches any single character except the newline character \n. To match . itself, use \.. |
| [ | Marks the start of a bracket expression. To match [, use \[. |
| ? | Matches the preceding subexpression zero or one time, or specifies a non-greedy quantifier. To match the ? character itself, use \?. |
| \ | Marks the next character as a special character, literal character, backreference, or octal escape. For example, 'n' matches the character 'n', '\n' matches a newline, '\\' matches "\", and '\(' matches "(". |
| ^ | Matches the start position of the input string. When used inside a bracket expression, it negates the character set (matches any character not in the set). To match the ^ character itself, use \^. |
| { | Marks the start of a quantifier expression. To match {, use \{. |
| \| | Indicates a choice between two items. To match this character itself, place a backslash before it. |
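The effect of escaping can be seen by comparing an escaped and an unescaped asterisk. The sketch below also includes escapeRegExp, a hypothetical helper (not a built-in function) that backslash-escapes every special character in a string so the string can be searched for literally.

```javascript
// \* is a literal asterisk; an unescaped * quantifies the preceding "o".
const escapedStar = /runo\*ob/;
const plainStar = /runo*ob/;
console.log(escapedStar.test("runo*ob")); // true
console.log(escapedStar.test("runoob"));  // false
console.log(plainStar.test("runoob"));    // true
console.log(plainStar.test("runo*ob"));   // false

// Hypothetical helper: prefix every special character with a backslash.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const price = new RegExp(escapeRegExp("$9.99 (sale)"));
console.log(price.source);                   // \$9\.99 \(sale\)
console.log(price.test("now $9.99 (sale)")); // true
console.log(price.test("now $9x99 (sale)")); // false
```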
Quantifiers
Quantifiers specify how many times a given component of a regular expression must occur to satisfy a match. There are 6 quantifiers in total: *, +, ?, {n}, {n,}, {n,m}.
| Character | Description |
|---|---|
| * | Matches the preceding subexpression zero or more times. For example, zo* can match "z" and "zoo". * is equivalent to {0,}. |
| + | Matches the preceding subexpression one or more times. For example, zo+ can match "zo" and "zoo", but cannot match "z". + is equivalent to {1,}. |
| ? | Matches the preceding subexpression zero or one time. For example, do(es)? can match "do", "does", and the "do" in "dog". ? is equivalent to {0,1}. |
| {n} | n is a non-negative integer, matching exactly n times. For example, o{2} cannot match the 'o' in "Bob", but can match the two 'o's in "food". |
| {n,} | n is a non-negative integer, matching at least n times. For example, o{2,} cannot match the 'o' in "Bob", but can match all 'o's in "foooood". o{1,} is equivalent to o+, and o{0,} is equivalent to o*. |
| {n,m} | m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, o{1,3} will match the first three 'o's in "fooooood". o{0,1} is equivalent to o?. Note: There can be no spaces between the comma and the two numbers. |
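The table's examples can be reproduced with String.prototype.match and the g flag, which returns every match in the input. This is only a sketch; the input strings combine the words from the table, and the variable names are illustrative.

```javascript
const star = "z zo zoo".match(/zo*/g);
const plus = "z zo zoo".match(/zo+/g);
const optional = "do does dog".match(/do(es)?/g);
const exactlyTwo = "Bob food".match(/o{2}/g);
const twoOrMore = "foooood".match(/o{2,}/g);
const oneToThree = "fooooood".match(/o{1,3}/g);

console.log(star);       // [ 'z', 'zo', 'zoo' ]
console.log(plus);       // [ 'zo', 'zoo' ]
console.log(optional);   // [ 'do', 'does', 'do' ] (only the "do" of "dog")
console.log(exactlyTwo); // [ 'oo' ]
console.log(twoOrMore);  // [ 'ooooo' ]
console.log(oneToThree); // [ 'ooo', 'ooo' ]
```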
The following regular expression matches a positive integer. [1-9] sets the first digit to not be 0, and [0-9]* represents any number of digits:
/[1-9][0-9]*/
Please note that the quantifier appears after the range expression, so it applies to the entire range expression. In this case, it specifies digits from 0 to 9 (including 0 and 9).
The + quantifier is not used here because a digit is not necessarily required in the second or subsequent positions. The ? character is also not used because using ? would limit the integer to only two digits.
If you want to match two-digit numbers from 0 to 99, you can use the following expression to specify at least one digit and at most two digits:
/[0-9]{1,2}/
The above expression has a drawback: it can only match numbers up to two digits and will match values like 0 and 00, which may not be expected. An improved expression to match positive integers from 1 to 99 is as follows:
/[1-9][0-9]?/
or
/[1-9][0-9]{0,1}/
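To use these patterns for validation, one option is to anchor them with ^ and $ so that the entire input must match. The anchors and the variable names below are additions for this sketch and are not part of the expressions above.

```javascript
const positiveInt = /^[1-9][0-9]*$/;  // no leading zero, any number of digits
const zeroTo99Loose = /^[0-9]{1,2}$/; // one or two digits, also accepts "0" and "00"
const oneTo99 = /^[1-9][0-9]?$/;      // 1 to 99 only

const positiveChecks = ["7", "120", "007", "0"].map((s) => positiveInt.test(s));
const looseChecks = ["0", "00", "99", "100"].map((s) => zeroTo99Loose.test(s));
const strictChecks = ["1", "99", "0", "100"].map((s) => oneTo99.test(s));

console.log(positiveChecks); // [ true, true, false, false ]
console.log(looseChecks);    // [ true, true, true, false ]
console.log(strictChecks);   // [ true, true, false, false ]
```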
The * and + quantifiers are greedy because they match as much text as possible. Adding a ? after them makes them non-greedy or lazy (minimal match).
For example, you might search an HTML document for content inside an h1 tag. The HTML code is as follows:
<h1>-</h1>
Greedy: The following expression matches everything from the opening less-than sign (<) to the closing greater-than sign (>) of the h1 tag:
/<.*>/
Non-greedy: If you only need to match the opening and closing h1 tags, the following non-greedy expression matches only <h1>:
/<.*?>/
You can also use the following regular expression to match the opening h1 tag:
/<\w+?>/
By placing a ? after the *, +, or ? quantifier, the expression changes from "greedy" to "non-greedy" (minimal match).
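The difference between greedy and non-greedy matching shows up clearly when the same input is run through both forms. In this sketch the heading text "Title" is placeholder content chosen for the example.

```javascript
const html = "<h1>Title</h1>"; // "Title" is placeholder content

const greedy = html.match(/<.*>/)[0];      // runs to the last ">"
const lazy = html.match(/<.*?>/)[0];       // stops at the first ">"
const wordOnly = html.match(/<\w+?>/)[0];  // only word characters between < and >
const everyTag = html.match(/<.*?>/g);     // with g, each tag is matched separately

console.log(greedy);   // <h1>Title</h1>
console.log(lazy);     // <h1>
console.log(wordOnly); // <h1>
console.log(everyTag); // [ '<h1>', '</h1>' ]
```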
Anchors
Anchors allow you to fix a regular expression to the start or end of a line, and can also describe word boundary positions.
^ and $ refer to the start and end of the string, respectively. \b describes a word boundary (before or after a word), and \B represents a non-word boundary.
| Character | Description |
|---|---|
| ^ | Matches the start position of the input string. If the RegExp object's Multiline property is set, ^ also matches positions after \n or \r. |
| $ | Matches the end position of the input string. If the RegExp object's Multiline property is set, $ also matches positions before \n or \r. |
| \b | Matches a word boundary, i.e., the position between a word and a space. |
| \B | Matches a non-word boundary, i.e., a position not at the beginning or end of a word. |
Note: Quantifiers cannot be used with anchors. Since there cannot be more than one position immediately before or after a newline or word boundary, expressions like ^* are not allowed.
To match text at the beginning of a line, use the ^ character at the start of the regular expression. Be careful not to confuse this usage of ^ with its negation usage inside a bracket expression.
To match text at the end of a line, use the $ character at the end of the regular expression.
To use anchors when searching for chapter titles, the following regular expression matches a chapter title that appears at the beginning of a line and is followed by a one- or two-digit number:
/^Chapter [1-9][0-9]{0,1}/
A real chapter title not only appears at the beginning of a line but is also the only text on that line. The following expression ensures that the specified match only matches chapter titles and not cross-references by limiting both the start and end of the line:
/^Chapter [1-9][0-9]{0,1}$/
Word boundaries allow precise control over the match range. The following expression matches the first three characters of the word "Chapter" because these characters appear after a word boundary:
/\bCha/
The position of \b is crucial: at the start of the pattern, it looks for a match at the beginning of a word; at the end of the pattern, it looks for a match at the end of a word. For example, the following expression matches the string "ter" in the word "Chapter" because it appears before a word boundary:
/ter\b/
The following expression matches the string "apt" in "Chapter" but does not match the string "apt" in "aptitude":
/\Bapt/
The string "apt" appears at a non-word boundary in the word "Chapter" but at a word boundary in the word "aptitude". The \B non-word boundary operator cannot match the beginning or end of a word, so the following expression does not match "Cha" in "Chapter":
/\BCha/
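The anchor and word-boundary expressions from this section can be exercised as follows. This is a sketch: the test strings other than "Chapter" and "aptitude" are invented for the example, and the m flag is used to show how ^ and $ behave per line.

```javascript
const title = /^Chapter [1-9][0-9]{0,1}$/;
const titleChecks = ["Chapter 12", "Chapter 123", "See Chapter 12"].map((s) => title.test(s));
console.log(titleChecks); // [ true, false, false ]

// With the m flag, ^ and $ also match at the start and end of each line.
const page = "Intro\nChapter 3\nText";
const multiline = page.match(/^Chapter [1-9][0-9]{0,1}$/m)[0];
console.log(multiline); // Chapter 3

// \b requires a word boundary, \B requires the absence of one.
const boundaryChecks = [
  /\bCha/.test("Chapter"),   // "Cha" starts the word
  /ter\b/.test("Chapter"),   // "ter" ends the word
  /\Bapt/.test("Chapter"),   // "apt" is inside the word
  /\Bapt/.test("aptitude"),  // "apt" starts the word, so \B fails
  /\BCha/.test("Chapter"),   // "Cha" starts the word, so \B fails
];
console.log(boundaryChecks); // [ true, true, true, false, false ]
```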
Alternation
Use parentheses () to enclose all choices, with choices separated by |.
() denotes a capturing group. () stores the matched value from each group, and multiple matched values can be accessed via the number n (where n is a number representing the content of the nth capturing group).
Using parentheses produces a side effect: the related matched content is cached (captured). If you do not need to capture, you can use ?: at the beginning of the first option to eliminate this side effect. In this case, the parentheses are only for grouping and do not save the matched content.
Among them, ?: is one of the non-capturing metacharacters. There are two other non-capturing metacharacters: ?= and ?!: the former is a positive lookahead, matching the search string at any position where the regular expression pattern inside the parentheses begins to match; the latter is a negative lookahead, matching the search string at any position where the regular expression pattern does not begin to match.
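A short sketch of why the choices are enclosed in parentheses, and of what ?: changes. The words used here (gray, grey, industries) are arbitrary examples, not taken from the text above.

```javascript
// Without parentheses, | splits the whole pattern: (^gray) or (grey$).
const loose = /^gray|grey$/;
// With parentheses, ^ and $ apply to both choices.
const grouped = /^(gray|grey)$/;
console.log(loose.test("grayish"));   // true
console.log(grouped.test("grayish")); // false
console.log(grouped.test("grey"));    // true

// A capturing group stores what it matched at index 1 of the result.
const captured = "industries".match(/industr(y|ies)/);
console.log(captured[0], captured[1]); // industries ies

// With ?: the group only groups, so the result holds the full match alone.
const uncaptured = "industries".match(/industr(?:y|ies)/);
console.log(uncaptured.length); // 1
```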
Differences between ?=, ?<=, ?!, ?<!
exp1(?=exp2): Finds exp1 followed by exp2 (positive lookahead).
(?<=exp2)exp1: Finds exp1 preceded by exp2 (positive lookbehind).
exp1(?!exp2): Finds exp1 not followed by exp2 (negative lookahead).
(?<!exp2)exp1: Finds exp1 not preceded by exp2 (negative lookbehind).
For more information, refer to: Regular Expressions Lookahead and Lookbehind
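The four assertions can be compared by recording where each one matches. The sample strings below are made up for this sketch, and the lookbehind forms assume a JavaScript engine that supports them (ES2018 or later).

```javascript
const systems = "Windows95 Windows98 WindowsNT";
const amounts = "USD100 EUR100";

// Collect the start index of every match of a global pattern.
const indexes = (text, pattern) => [...text.matchAll(pattern)].map((m) => m.index);

const followedBy95 = indexes(systems, /Windows(?=95)/g);    // positive lookahead
const notFollowedBy95 = indexes(systems, /Windows(?!95)/g); // negative lookahead
const afterUSD = indexes(amounts, /(?<=USD)100/g);          // positive lookbehind
const notAfterUSD = indexes(amounts, /(?<!USD)100/g);       // negative lookbehind

console.log(followedBy95);    // [ 0 ]
console.log(notFollowedBy95); // [ 10, 20 ]
console.log(afterUSD);        // [ 3 ]
console.log(notAfterUSD);     // [ 10 ]
```

In every case the assertion itself consumes no characters: the matched text is only "Windows" or "100".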
Backreferences
Adding parentheses around a regular expression pattern or part of a pattern causes the related matches to be stored in a temporary buffer. Each captured sub-match is stored in the order it appears from left to right in the regular expression pattern. The buffer numbers start from 1 and can store up to 99 captured subexpressions. Each buffer can be accessed using \n, where n is a one- or two-digit decimal number identifying a specific buffer.
You can use non-capturing metacharacters ?:, ?=, or ?! to override capturing, ignoring the saving of related matches.
One of the simplest and most useful applications of backreferences is to find two identical adjacent words in text. Take the following sentence as an example:
Is is the cost of of gasoline going up up?
The above sentence has multiple repeated words. The following regular expression uses a single subexpression to locate these repetitions:
Example
Find duplicate words:
// Find adjacent duplicate words, ignoring case, across the whole string.
var str = "Is is the cost of of gasoline going up up";
var patt1 = /\b([a-z]+) \1\b/igm;
document.write(str.match(patt1));
The captured expression [a-z]+ matches one or more letters. The second part of the regular expression, \1, is a backreference to the first sub-match (the content captured in parentheses), requiring that it be immediately followed by the same word as the first one.
The word boundary metacharacter \b ensures that only whole words are detected; otherwise, phrases like "is issued" or "this is" would be wrongly reported as duplicates.
The g (global) flag at the end of the expression specifies that the expression should be applied to all matches found in the input string; the i (ignore case) flag specifies case-insensitivity; the m (multiline) flag specifies that potential matches may appear on either side of a newline.
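The same search can be run outside a browser by logging the result instead of writing it to the page. The sketch below also shows what goes wrong without \b; the phrase "this is issued" is an invented test input.

```javascript
const sentence = "Is is the cost of of gasoline going up up";

// \1 must repeat whatever group 1 captured; the i flag ignores case.
const duplicates = sentence.match(/\b([a-z]+) \1\b/gi);
console.log(duplicates); // [ 'Is is', 'of of', 'up up' ]

// Without \b, the "is" inside "this" pairs with the following "is".
const withBoundary = /\b([a-z]+) \1\b/i.test("this is issued");
const withoutBoundary = /([a-z]+) \1/i.test("this is issued");
console.log(withBoundary, withoutBoundary); // false true
```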
Backreferences can also be used to break down a URI into its components. Suppose you want to break down a URI into protocol (ftp, http, etc.), domain address, port, and path.
The following regular expression provides this functionality:
Example
Output all matched data:
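var str = ""; // put the URI to split here
var patt1 = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;
var arr = str.match(patt1) || []; // match() returns null when nothing matches
for (var i = 0; i < arr.length; i++) {
    document.write(arr[i]); // index 0 is the whole match, 1-4 are the groups
    document.write("<br>");
}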
str.match(patt1) returns an array containing 5 elements: index 0 corresponds to the entire matched string, indices 1-4 correspond to each parenthesized capturing group, and so on.
- First parenthesized subexpression (\w+): Captures the protocol part of the web address, matching any word before the colon and two forward slashes. Result: https
- Second parenthesized subexpression ([^/:]+): Captures the domain address part, matching one or more characters that are not : or /. Result: www..com
- Third parenthesized subexpression (:\d*): Captures the port number (if present), matching zero or more digits after the colon. Result: :80
- Fourth parenthesized subexpression ([^# ]*): Captures the path and page information, matching any sequence of characters that does not include # or a space. Result: /html/html-tutorial.html
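The decomposition can be sketched end to end with a stand-in address. The URL below uses the hypothetical host www.example.com and is an assumption made for this example, as are the variable names.

```javascript
// Hypothetical URI used only for illustration.
const uri = "https://www.example.com:80/html/html-tutorial.html";
const uriPattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;

// Index 0 is the whole match; indices 1-4 are the four capturing groups.
const parts = uri.match(uriPattern);
const [whole, protocol, host, port, path] = parts;

console.log(parts.length); // 5
console.log(protocol);     // https
console.log(host);         // www.example.com
console.log(port);         // :80
console.log(path);         // /html/html-tutorial.html
```

Because the port group is optional, a URI without a port leaves that slot undefined rather than shifting the path into it.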