C# Regular Expressions

**Regular expressions** are search patterns used to describe and match text. You can think of them as a specialized wildcard language for precisely finding, validating, or replacing content within strings. Typical uses include checking whether a user's input is a valid email address or phone number, or extracting all dates from a piece of text.

.NET includes a full-featured regular expression engine, exposed through the `System.Text.RegularExpressions` namespace.

A regular expression pattern consists of one or more characters, operators, and constructs that together describe the text to be matched. If you are new to regular expressions, consider reading an introductory regular expression tutorial first.

Regular expressions are composed of the following building blocks, each responsible for a different matching function:

* **Character Escapes**: Allow special characters to be treated as ordinary characters (e.g., `\.` matches a literal period).
* **Character Classes**: Match "a class" of characters (e.g., `\d` matches any digit).
* **Anchors**: Specify the "position" where a match occurs (e.g., `^` denotes the start of a line).
* **Grouping Constructs**: Combine sub-patterns together for easy capture or reference.
* **Quantifiers**: Control how many times an element must appear (e.g., `+` means one or more times).
* **Backreference Constructs**: Reference previously captured content for further matching.
* **Alternation Constructs**: Implement "OR" logic (e.g., `cat|dog`).
* **Substitutions**: Reference captured content in replacement operations.
* **Miscellaneous Constructs**: Inline options, comments, and other auxiliary functions.
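The validation and extraction scenarios mentioned above can be sketched roughly as follows. The class name `IntroDemo`, the ten-digit phone format, and the `yyyy-MM-dd` date format are assumptions chosen for illustration; real-world formats usually need stricter patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IntroDemo
{
    // Assumed, deliberately simple date shape: 4 digits, dash, 2 digits, dash, 2 digits.
    private const string DatePattern = @"\b\d{4}-\d{2}-\d{2}\b";

    // Returns every substring of the input that looks like a yyyy-MM-dd date.
    public static string[] ExtractDates(string input)
    {
        MatchCollection matches = Regex.Matches(input, DatePattern);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    // Returns true when the whole input consists of exactly ten digits
    // (an assumed phone number format).
    public static bool LooksLikePhone(string input)
    {
        return Regex.IsMatch(input, @"^\d{10}$");
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikePhone("5551234567")); // True
        Console.WriteLine(LooksLikePhone("555-1234"));   // False

        // Prints 2021-03-04 and 2021-11-30, one per line.
        foreach (string date in ExtractDates("Released 2021-03-04, patched 2021-11-30."))
        {
            Console.WriteLine(date);
        }
    }
}
```

Both patterns are written as verbatim strings (`@"..."`) so that each backslash reaches the regular expression engine unchanged.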
### Character Escapes

In regular expressions, the backslash character (`\`) has two functions: first, it gives the following **ordinary character a special meaning** (e.g., `\n` represents a newline), and second, it **escapes a special character to its literal meaning** (e.g., `\.` matches a literal dot instead of "any character").

A common beginner mistake: in C# strings, `\` is also an escape character, so to write `\d` in a regular expression you must write `"\\d"` in a regular C# string, or use a verbatim string `@"\d"` (recommended).

The following table lists common escape characters:

| Escape Character | Description | Pattern | Match |
| --- | --- | --- | --- |
| `\a` | Matches the bell (alert) character `\u0007`. | `\a` | `"\u0007"` in `"Warning!" + '\u0007'` |
| `\b` | In a character class, matches the backspace character `\u0008`. (Outside a character class, `\b` represents a word boundary; see the "Anchors" section.) | `[\b]{3,}` | `"\b\b\b\b"` in `"\b\b\b\b"` |
| `\t` | Matches the tab character `\u0009`. Often used to match tab-separated text. | `(\w+)\t` | `"Name\t"` and `"Addr\t"` in `"Name\tAddr\t"` |
| `\r` | Matches the carriage return character `\u000D`. (`\r` is not equivalent to the newline character `\n`. Windows line endings are typically `\r\n`.) | `\r\n(\w+)` | `"\r\nHello"` in `"\r\nHello\nWorld."` |
| `\v` | Matches the vertical tab character `\u000B`. | `[\v]{2,}` | `"\v\v\v"` in `"\v\v\v"` |
| `\f` | Matches the form feed character `\u000C`. | `[\f]{2,}` | `"\f\f\f"` in `"\f\f\f"` |
| `\n` | Matches the newline character `\u000A`. Unix/Linux systems typically use only `\n` for line endings. | `\r\n(\w+)` | `"\r\nHello"` in `"\r\nHello\nWorld."` |
| `\e` | Matches the escape character `\u001B`. | `\e` | `"\x001B"` in `"\x001B"` |
| `\nnn` | Specifies a character using its octal representation (nnn consists of two or three digits). | `\w\040\w` | "a b" and "c d" in "a bc d" |
| `\xnn` | Specifies a character using its hexadecimal representation (nn consists of exactly two digits). | `\w\x20\w` | "a b" and "c d" in "a bc d" |
| `\cX` `\cx` | Matches the ASCII control character specified by X or x, where X or x is the letter of the control character. | `\cC` | `"\x0003"` in `"\x0003"` (Ctrl-C) |
| `\unnnn` | Matches a Unicode character using its hexadecimal representation (nnnn is exactly four digits). | `\w\u0020\w` | "a b" and "c d" in "a bc d" |
| `\` | When followed by a character that is not recognized as an escaped character, matches that character. | `\d+[\+-x\*]\d+` | "2+2" and "3*9" in "(2+2) * 3*9" |

### Character Classes

Character classes are used to match **any single character** from "a certain class". For example, `[aeiou]` matches any single vowel, and `\d` matches any single digit. This is one of the most commonly used basic features in regular expressions.

The following table lists character classes:

| Character Class | Description | Pattern | Match |
| --- | --- | --- | --- |
| `[character_group]` | Matches any single character in character_group. By default, matching is case-sensitive. | `[mn]` | "m" in "mat", "m" and "n" in "moon" |
| `[^character_group]` | Negation: matches any single character not in character_group. By default, characters in character_group are case-sensitive. | `[^aei]` | "v" and "l" in "avail" |
| `[first-last]` | Character range: matches any single character in the range from first to last. | `[B-D]irds` | "Birds", "Cirds", and "Dirds" |
| `.` | Wildcard: matches any single character except `\n`. To match the literal period character (`.` or `\u002E`), you must escape it with a backslash (`\.`). | `a.e` | "ave" in "have", "ate" in "mate" |
| `\p{name}` | Matches any single character in the Unicode general category or named block specified by _name_. | `\p{Lu}` | "C" and "L" in "City Lights" |
| `\P{name}` | Matches any single character not in the Unicode general category or named block specified by _name_. | `\P{Lu}` | "i", "t", and "y" in "City" |
| `\w` | Matches any word character (letter, digit, underscore). Equivalent to `[a-zA-Z0-9_]` within the ASCII range. | `\w` | "R", "o", "o", "m", and "1" in "Room#1" |
| `\W` | Matches any non-word character. The opposite of `\w`. | `\W` | "#" in "Room#1" |
| `\s` | Matches any whitespace character (space, tab, newline, etc.). | `\w\s` | "D " in "ID A1.3" |
| `\S` | Matches any non-whitespace character. The opposite of `\s`. | `\s\S` | " _" in "int __ctr" |
| `\d` | Matches any decimal digit. Equivalent to `[0-9]` for ASCII input. | `\d` | "4" in "4 = IV" |
| `\D` | Matches any character that is not a decimal digit. The opposite of `\d`. | `\D` | " ", "=", " ", "I", and "V" in "4 = IV" |

### Anchors

Anchors (also called "assertions") do not match any specific character but rather match a **position** within the string. They are "zero-width" and do not consume any characters; they simply assert that the current position meets a certain condition. For example, `^\d{3}` means "three digits at the start of the string", and `\b` means "the boundary between a word character and a non-word character".

The following table lists anchors:

| Assertion | Description | Pattern | Match |
| --- | --- | --- | --- |
| `^` | The match must start at the beginning of the string (or of the line, in multiline mode). | `^\d{3}` | "567" in "567-777-" |
| `$` | The match must occur at the end of the string or before the `\n` at the end of the string (or of the line, in multiline mode). | `-\d{4}$` | "-2012" in "8-12-2012" |
| `\A` | The match must occur at the start of the string (not affected by multiline mode; always the start of the entire string). | `\A\w{4}` | "Code" in "Code-007-" |
| `\Z` | The match must occur at the end of the string or before the `\n` at the end of the string. | `-\d{3}\Z` | "-007" in "Bond-901-007" |
| `\z` | The match must occur at the very end of the string (strict end; no trailing `\n` allowed). | `-\d{3}\z` | "-333" in "-901-333" |
| `\G` | The match must occur at the point where the previous match ended. Often used in continuous matching scenarios. | `\G\(\d\)` | "(1)", "(3)", and "(5)" in "(1)(3)(5)[7](9)" |
| `\b` | Matches a word boundary, that is, the position between a word character (`\w`) and a non-word character (`\W`) or the start or end of the string. | `er\b` | Matches "er" in "never", but not in "verb". |
| `\B` | Matches a non-word boundary. | `er\B` | Matches "er" in "verb", but not in "never". |

### Grouping Constructs

Grouping constructs use parentheses `( )` to enclose part of a regular expression, forming a sub-expression. Grouping has two main purposes:

* **Capture**: "Saves" the matched substring for later extraction or reference in replacements.
* **Scope Limitation**: Makes quantifiers or alternation constructs apply only to the sub-expression within the group.

Grouping constructs, especially the lookahead and lookbehind assertions, can be difficult to understand at first.

The following table lists grouping constructs:

| Grouping Construct | Description | Pattern | Match |
| --- | --- | --- | --- |
| `(subexpression)` | Captures the matched sub-expression and assigns it an ordinal number. Explicit groups are numbered from 1; group 0 is the entire match. | `(\w)\1` | "ee" in "deep" |
| `(?<name>subexpression)` | Captures the matched sub-expression into a named group. Named groups are more readable than numbered groups and are recommended for complex regular expressions. | `(?<double>\w)\k<double>` | "ee" in "deep" |
| `(?<name1-name2>subexpression)` | Defines a balancing group definition. Used for matching nested structures (like balanced parentheses); an advanced feature. | `(((?'Open'\()[^\(\)]*)+((?'Close-Open'\))[^\(\)]*)+)*(?(Open)(?!))$` | "((1-3)*(3-1))" in "3+2^((1-3)*(3-1))" |
| `(?:subexpression)` | Defines a non-capturing group. Used when grouping (for scope) is needed but saving the match content is not. Can be slightly faster than a capturing group. | `Write(?:Line)?` | "WriteLine" in "Console.WriteLine()" |
| `(?imnsx-imnsx:subexpression)` | Applies or disables the specified options within _subexpression_. | `A\d{2}(?i:\w+)\b` | "A12xl" and "A12XL" in "A12xl A12XL a12xl" |
| `(?=subexpression)` | Zero-width positive lookahead. Matches a position followed immediately by subexpression, but does not consume characters. | `\w+(?=\.)` | "is", "ran", and "out" in "He is. The dog ran. The sun is out." |
| `(?!subexpression)` | Zero-width negative lookahead. Matches a position not followed by subexpression. | `\b(?!un)\w+\b` | "sure" and "used" in "unsure sure unity used" |
| `(?<=subexpression)` | Zero-width positive lookbehind. Matches a position preceded immediately by subexpression. | `(?<=19)\d{2}\b` | "99", "50", and "05" in "1851 1999 1950 1905 2003" |
| `(?<!subexpression)` | Zero-width negative lookbehind. Matches a position not preceded by subexpression. | `(?<!19)\d{2}\b` | "51" and "03" in "1851 1999 1950 1905 2003" |
| `(?>subexpression)` | Non-backtracking (atomic group) sub-expression. Once matched successfully, backtracking into the group is not allowed, which can improve performance in certain scenarios. | `[13579](?>A+B+)` | "1ABB", "3ABB", and "5AB" in "1ABB 3ABBC 5AB 5AC" |

## Example

    using System;
    using System.Text.RegularExpressions;

    public class Example
    {
        public static void Main()
        {
            string input = "1851 1999 1950 1905 2003";
            // Positive lookbehind: two digits at the end of a word, preceded by "19".
            string pattern = @"(?<=19)\d{2}\b";

            // Prints 99, 50, and 05, one per line.
            foreach (Match match in Regex.Matches(input, pattern))
                Console.WriteLine(match.Value);
        }
    }

### Quantifiers

Quantifiers specify how many times the preceding element (character, character class, or group) must appear for a match to be successful. Quantifiers are **greedy** by default, meaning they match as many characters as possible. Adding `?` after a quantifier makes it **lazy (non-greedy)**, meaning it matches as few characters as possible. Beginners can first master the four basic quantifiers: `*`, `+`, `?`, and `{n}`.

The following table lists quantifiers:

| Quantifier | Description | Pattern | Match |
| --- | --- | --- | --- |
| `*` | Matches the previous element zero or more times. (Zero times is also a match.) | `\d*\.\d` | ".0", "19.9", "219.9" |
| `+` | Matches the previous element one or more times. (At least once.) | `be+` | "bee" in "been", "be" in "bent" |
| `?` | Matches the previous element zero or one time. (The element is optional.) | `rai?n` | "ran", "rain" |
| `{n}` | Matches the previous element exactly n times. | `,\d{3}` | ",043" in "1,043.6"; ",876", ",543", and ",210" in "9,876,543,210" |
| `{n,}` | Matches the previous element at least n times. | `\d{2,}` | "166", "29", "1930" |
| `{n,m}` | Matches the previous element at least n times, but not more than m times. | `\d{3,5}` | "166", "17668", and "19302" in "193024" |
| `*?` | Matches the previous element zero or more times, but as few times as possible (lazy mode). | `\d*?\.\d` | ".0", "19.9", "219.9" |
| `+?` | Matches the previous element one or more times, but as few times as possible (lazy mode). | `be+?` | "be" in "been", "be" in "bent" |
| `??` | Matches the previous element zero or one time, but as few times as possible (lazy mode). | `rai??n` | "ran", "rain" |
| `{n}?` | Matches the previous element exactly n times. (The lazy modifier has no effect on an exact count.) | `,\d{3}?` | ",043" in "1,043.6"; ",876", ",543", and ",210" in "9,876,543,210" |
| `{n,}?` | Matches the previous element at least n times, but as few times as possible. | `\d{2,}?` | "166", "29", and "1930" |
| `{n,m}?` | Matches the previous element between n and m times, but as few times as possible. | `\d{3,5}?` | "166", "17668", and "193" and "024" in "193024" |

### Backreference Constructs

Backreferences allow you to reference **previously captured group content** within the same regular expression for further matching. For example, `(\w)\1` can find consecutive repeated characters like "ee" or "ll".

The following table lists backreference constructs:

| Backreference Construct | Description | Pattern | Match |
| --- | --- | --- | --- |
| `\number` | Backreference. Matches the value of a numbered sub-expression. | `(\w)\1` | "ee" in "seek" |
| `\k<name>` | Named backreference. Matches the value of a named expression. More readable than numbered references and recommended. | `(?<char>\w)\k<char>` | "ee" in "seek" |

### Alternation Constructs

Alternation constructs use the vertical bar `|` to implement "OR" logic, allowing a regular expression to match any one of multiple candidate patterns. This is similar to the `||` operator in programming languages.

The following list describes the alternation constructs:

* `|`: Matches any one of the elements separated by the vertical bar character. For example, `th(e|is|at)` matches "this" and "the" in "this is the day. ".
* `(?(expression)yes|no)`: If the pattern specified by _expression_ matches, matches _yes_; otherwise, matches the optional _no_ part. The expression is interpreted as a zero-width assertion. For example, `(?(A)A\d{2}\b|\b\d{3}\b)` matches "A10" and "910" in "A10 C103 910".
* `(?(name)yes|no)`: If the named or numbered capture group _name_ has a match, matches _yes_; otherwise, matches the optional _no_ part. For example, `(?<quoted>")?(?(quoted).+?"|\S+\s)` matches `Dogs.jpg ` (including the trailing space) and `"Yiska playing.jpg"` in `Dogs.jpg "Yiska playing.jpg"`.

### Substitutions

Substitution syntax is used in the **replacement pattern string** of the `Regex.Replace()` method. It allows you to reference previously captured group content using `$` followed by a number or name, enabling flexible text restructuring.

The following table lists the substitution elements:

| Character | Description | Pattern | Replacement Pattern | Input String | Result String |
| --- | --- | --- | --- | --- | --- |
| `$number` | Substitutes the substring matched by group _number_. | `\b(\w+)(\s)(\w+)\b` | `$3$2$1` | "one two" | "two one" |
| `${name}` | Substitutes the substring matched by the named group _name_. | `\b(?<word1>\w+)(\s)(?<word2>\w+)\b` | `${word2} ${word1}` | "one two" | "two one" |
| `$$` | Substitutes a literal "$" character. | `\b(\d+)\s?USD` | `$$$1` | "103 USD" | "$103" |
| `$&` | Substitutes a copy of the entire match. | `\$?\d*\.?\d+` | `**$&**` | "$1.30" | `**$1.30**` |
| `` $` `` | Substitutes all text in the input string before the match. | `B+` | `` $` `` | "AABBCC" | "AAAACC" |
| `$'` | Substitutes all text in the input string after the match. | `B+` | `$'` | "AABBCC" | "AACCCC" |
| `$+` | Substitutes the last captured group. | `B+(C+)` | `$+` | "AABBCCDD" | "AACCDD" |
| `$_` | Substitutes the entire input string. | `B+` | `$_` | "AABBCC" | "AAAABBCCCC" |

### Miscellaneous Constructs

The following table lists various miscellaneous constructs:

| Construct | Description | Example |
| --- | --- | --- |
| `(?imnsx-imnsx)` | Sets or disables options such as case-insensitivity in the middle of a pattern. | `\bA(?i)b\w+\b` matches "ABA" and "Able" in "ABA Able Act" |
| `(?#comment)` | Inline comment. The comment terminates at the first closing parenthesis. | `\bA(?#Matches words starting with A)\w+\b` |
| `#` | X-mode comment (requires the `x` option). The comment starts at an unescaped `#` and continues to the end of the line. | `(?x)\bA\w+\b#Matches words starting with A` |

The `Regex` class is the core class for using regular expressions in .NET, located in the `System.Text.RegularExpressions` namespace. Before using it, you need to add the following at the top of your file:

    using System.Text.RegularExpressions;

The following table lists some commonly used methods of the `Regex` class:

| No. | Method & Description |
| --- | --- |
| 1 | `public bool IsMatch(string input)` Determines whether the input string contains content that matches the regular expression pattern. Often used for form validation, such as validating phone numbers or email formats. |
| 2 | `public bool IsMatch(string input, int startat)` Performs a match check starting from a specified position in the string. |
| 3 | `public static bool IsMatch(string input, string pattern)` A static method that allows you to pass the pattern string directly for matching without first creating a `Regex` object. Suitable for one-time use. |
| 4 | `public MatchCollection Matches(string input)` Searches the input string for all matches and returns a `MatchCollection`, which can be iterated with `foreach` to access each match result. |
| 5 | `public string Replace(string input, string replacement)` Replaces all content in the input string that matches the regular expression pattern with the specified string. |
| 6 | `public string[] Split(string input)` Splits the input string into an array of substrings based on delimiters defined by the regular expression pattern. More flexible than `string.Split()` and supports complex delimiter rules. |

For a complete list of the methods and properties of the `Regex` class, please refer to the Microsoft .NET documentation.

The following example matches words starting with 'S':

Analysis: `\b` is a word boundary, `S` matches the uppercase letter S, and `\S*` matches zero or more non-whitespace characters. Combined, the pattern means "a word starting with S, up to the next whitespace".

## Example

    using System;
    using System.Text.RegularExpressions;

    namespace RegExApplication
    {
        class Program
        {
            // Prints the pattern, then every match found in the text.
            private static void showMatch(string text, string expr)
            {
                Console.WriteLine("The Expression: " + expr);
                MatchCollection mc = Regex.Matches(text, expr);
                foreach (Match m in mc)
                {
                    Console.WriteLine(m);
                }
            }

            static void Main(string[] args)
            {
                string str = "A Thousand Splendid Suns";

                Console.WriteLine("Matching words that start with 'S': ");
                showMatch(str, @"\bS\S*");
                Console.ReadKey();
            }
        }
    }

When the above code is compiled and executed, it produces the following result:

    Matching words that start with 'S': 
    The Expression: \bS\S*
    Splendid
    Suns

The following example matches words starting with 'm' and ending with 'e':

Analysis: `\bm` matches an m at the start of a word, `\S*` matches any non-whitespace characters in between, and `e\b` requires the word to end with e followed by a word boundary. Together, they find all words starting with m and ending with e.
## Example

    using System;
    using System.Text.RegularExpressions;

    namespace RegExApplication
    {
        class Program
        {
            // Prints the pattern, then every match found in the text.
            private static void showMatch(string text, string expr)
            {
                Console.WriteLine("The Expression: " + expr);
                MatchCollection mc = Regex.Matches(text, expr);
                foreach (Match m in mc)
                {
                    Console.WriteLine(m);
                }
            }

            static void Main(string[] args)
            {
                string str = "make maze and manage to measure it";

                Console.WriteLine("Matching words that start with 'm' and end with 'e':");
                showMatch(str, @"\bm\S*e\b");
                Console.ReadKey();
            }
        }
    }

When the above code is compiled and executed, it produces the following result:

    Matching words that start with 'm' and end with 'e':
    The Expression: \bm\S*e\b
    make
    maze
    manage
    measure

Extra spaces can be replaced in a similar way. Analysis: `\s+` matches one or more consecutive whitespace characters (spaces, tabs, etc.); replacing each such run with a single space consolidates the extra whitespace.
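A minimal sketch of this whitespace consolidation, assuming a hypothetical `WhitespaceDemo` class and a made-up sample input, could look like the following. It uses the static `Regex.Replace(input, pattern, replacement)` overload, so no `Regex` object has to be created first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Replaces every run of one or more whitespace characters with a single space.
    public static string CollapseWhitespace(string input)
    {
        return Regex.Replace(input, @"\s+", " ");
    }

    public static void Main()
    {
        // Assumed sample input: three spaces, then space + tab + space.
        string input = "Hello   World \t again";

        Console.WriteLine("Original:  " + input);
        Console.WriteLine("Collapsed: " + CollapseWhitespace(input));
        // The second line prints: Collapsed: Hello World again
    }
}
```

Leading and trailing runs are also reduced to a single space rather than removed, so a call to `Trim()` on the result may be needed if the string should not start or end with whitespace.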