Regular Expressions - Introduction
A Regular Expression (often abbreviated as regex or regexp) is a powerful text processing tool that uses a special string to describe and match a series of strings that conform to a certain syntactic rule.
You can think of it as a super wildcard. Ordinary wildcards (like * representing any character) have limited functionality, while regular expressions can define extremely complex and precise text patterns, capable of everything from simple word matching to complex structured data extraction.
For example, you have probably used the ? and * wildcards to find files on your hard drive. The ? wildcard matches 0 or 1 character in a filename, while the * wildcard matches zero or more characters. Regular expressions express the same ideas with quantifiers. A pattern like data(\w)?\.dat will find the following files:
Example
data.dat
data1.dat
data2.dat
datax.dat
dataN.dat
Using the * character instead of the ? character increases the number of files found. data.*\.dat matches all of the following files:
Example
data.dat
data1.dat
data2.dat
data12.dat
datax.dat
dataXYZ.dat
This kind of search is useful, but it is still limited. Understanding how the * wildcard works introduces the idea that regular expressions build on, and regular expressions themselves are far more powerful and flexible.
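The two filename patterns above can be tried out with a short script. This is a minimal sketch: the file list is hypothetical (taken from the examples, plus one non-matching name), and the ^ and $ anchors are added here so that the whole filename has to match.

```javascript
// Hypothetical file list based on the examples above.
const files = ["data.dat", "data1.dat", "data12.dat", "datax.dat", "dataXYZ.dat", "notes.txt"];

// (\w)? allows at most one word character after "data".
const zeroOrOne = /^data(\w)?\.dat$/;
// .* allows any number of characters after "data".
const zeroOrMore = /^data.*\.dat$/;

const matchesZeroOrOne = files.filter((name) => zeroOrOne.test(name));
const matchesZeroOrMore = files.filter((name) => zeroOrMore.test(name));

console.log(matchesZeroOrOne);  // [ 'data.dat', 'data1.dat', 'datax.dat' ]
console.log(matchesZeroOrMore); // every name except 'notes.txt'
```

Note the \. in both patterns: an unescaped . would match any character, not just a literal dot.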
Regular expressions let you achieve powerful results with simple means. Here is a simple example first:
^[0-9]+abc$
- ^ matches the start position of the input string.
- [0-9]+ matches multiple digits: [0-9] matches a single digit, and + matches one or more.
- abc$ matches the letters abc at the end of the string, where $ matches the end position of the input string.
When writing a user registration form, if we only allow the username to contain letters, numbers, underscores, and hyphens (-), and set a length limit for the username, we can use the following regular expression:
^[a-zA-Z0-9_-]{3,15}$
- ^ indicates the start of the string to be matched.
- [a-zA-Z0-9_-] represents a character set containing lowercase letters, uppercase letters, numbers, underscores, and hyphens (-).
- {3,15} indicates that the preceding character set must appear at least 3 times and at most 15 times, thereby limiting the username length to between 3 and 15 characters.
- $ indicates the end of the string to be matched.
The above regular expression matches tutorial1, run-oob, and run_oob, but does not match ru, because it is too short (fewer than 3 characters). It also does not match any string containing $, because $ is not in the character set.
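The username rule can be wrapped in a small helper. This is a sketch only: the function name isValidUsername is hypothetical, and the pattern is the character set and length limit described above.

```javascript
// Letters, digits, underscore and hyphen, 3 to 15 characters in total.
const usernamePattern = /^[a-zA-Z0-9_-]{3,15}$/;

// Hypothetical helper: returns true when the whole string fits the rule.
function isValidUsername(name) {
  return usernamePattern.test(name);
}

console.log(isValidUsername("tutorial1")); // true
console.log(isValidUsername("run-oob"));   // true
console.log(isValidUsername("ru"));        // false: shorter than 3 characters
console.log(isValidUsername("run$oob"));   // false: $ is not in the character set
```

Placing the hyphen last inside the brackets keeps it literal, so it is not read as a range such as a-z.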
Example
Match a string that starts with one or more digits and ends with abc.
var str = "123abc";
var patt1 = /^[0-9]+abc$/;
// match() returns an array whose first element is the matched text
document.write(str.match(patt1));
The following marked text is the matched expression:
123abc
Regular Expression Metacharacters and Features
Character Matching
- Literal Characters: Literal characters are matched by their literal meaning. For example, matching the letter "a" will match the "a" character in the text.
- Metacharacters: Metacharacters have special meanings. For example, \d matches any digit character, \w matches any letter, digit, or underscore, and . matches any character (except the newline character).
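A quick way to see the difference between these metacharacters is to collect every match in a short string. The sample strings below are made up for illustration.

```javascript
const sample = "A1_b-2";

// \d: digits only.
const digits = sample.match(/\d/g);
// \w: letters, digits and the underscore (the hyphen is skipped).
const wordChars = sample.match(/\w/g).join("");
// . matches any character except a newline.
const anyChars = "a\nb".match(/./g);

console.log(digits);    // [ '1', '2' ]
console.log(wordChars); // A1_b2
console.log(anyChars);  // [ 'a', 'b' ]
```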
Quantifiers
- *: Matches the preceding pattern zero or more times.
- +: Matches the preceding pattern one or more times.
- ?: Matches the preceding pattern zero or one time.
- {n}: Matches the preceding pattern exactly n times.
- {n,}: Matches the preceding pattern at least n times.
- {n,m}: Matches the preceding pattern at least n times and no more than m times.
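The quantifiers can be compared side by side by testing each one against strings of zero to four a characters. This is an illustrative sketch; the helper name matching is hypothetical, and the patterns are anchored so the whole string must match.

```javascript
const samples = ["", "a", "aa", "aaa", "aaaa"];

// Hypothetical helper: which sample strings does the pattern accept in full?
function matching(regex) {
  return samples.filter((s) => regex.test(s));
}

const star = matching(/^a*$/);          // zero or more
const plus = matching(/^a+$/);          // one or more
const optional = matching(/^a?$/);      // zero or one
const exactlyTwo = matching(/^a{2}$/);  // exactly 2
const atLeastTwo = matching(/^a{2,}$/); // 2 or more
const twoToThree = matching(/^a{2,3}$/); // between 2 and 3

console.log(optional);   // [ '', 'a' ]
console.log(exactlyTwo); // [ 'aa' ]
console.log(twoToThree); // [ 'aa', 'aaa' ]
```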
Character Classes
- [ ]: Matches any one character inside the brackets. For example, [abc] matches the characters "a", "b", or "c".
- [^ ]: Matches any one character except those inside the brackets. For example, [^abc] matches any character except "a", "b", or "c".
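As a small illustration, the two kinds of character class split a word into complementary sets of characters. The word used here is arbitrary.

```javascript
const word = "cabbage";

// [abc]: every character that is a, b or c.
const inSet = word.match(/[abc]/g).join("");
// [^abc]: every character that is not a, b or c.
const notInSet = word.match(/[^abc]/g).join("");

console.log(inSet);    // cabba
console.log(notInSet); // ge
```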
Boundary Matchers
- ^: Matches the start of the string.
- $: Matches the end of the string.
- \b: Matches a word boundary.
- \B: Matches a non-word boundary.
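The boundary matchers consume no characters; they only assert a position. A minimal sketch using the made-up search word cat:

```javascript
const startsWithCat = /^cat/.test("catalog");    // "cat" at the start of the string
const endsWithCat = /cat$/.test("bobcat");       // "cat" at the end of the string
const wholeWord = /\bcat\b/.test("a cat sat");   // "cat" as a separate word
const insideWord = /\bcat\b/.test("concatenate"); // no boundary on either side
const embedded = /\Bcat\B/.test("concatenate");  // "cat" strictly inside a word

console.log(startsWithCat, endsWithCat, wholeWord); // true true true
console.log(insideWord, embedded);                  // false true
```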
Grouping and Capturing
- ( ): Used for grouping and capturing subexpressions.
- (?: ): Used for grouping without capturing the subexpression.
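The difference between the two group types shows up in the array returned by match(). In this sketch the date string is an arbitrary example.

```javascript
const date = "2023-12-25";

// Three capturing groups: year, month, day.
const capturing = date.match(/(\d{4})-(\d{2})-(\d{2})/);
// The year is grouped but not captured, so only month and day are stored.
const nonCapturing = date.match(/(?:\d{4})-(\d{2})-(\d{2})/);

console.log(capturing[1], capturing[2], capturing[3]); // 2023 12 25
console.log(nonCapturing[1], nonCapturing[2]);         // 12 25
console.log(capturing.length, nonCapturing.length);    // 4 3
```

Index 0 of each result holds the full match, which is why the lengths are one more than the number of capturing groups.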
Special Characters
- \: Escape character, used to match the special character itself.
- .: Matches any character (except the newline character).
- |: Used to specify a choice between multiple patterns.
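A short sketch of how escaping changes the meaning of the dot, and how | chooses between alternatives. The test strings are invented for illustration.

```javascript
const dotMatchesAny = /a.c/.test("abc");      // unescaped . accepts the "b"
const escapedRejects = /a\.c/.test("abc");    // \. needs a literal dot
const escapedAccepts = /a\.c/.test("a.c");

// | picks one of the alternatives inside the group.
const isDog = /^(cat|dog)$/.test("dog");
const isBird = /^(cat|dog)$/.test("bird");

console.log(dotMatchesAny, escapedRejects, escapedAccepts); // true false true
console.log(isDog, isBird);                                 // true false
```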
Definition and Purpose of Regular Expressions
Technically, a regular expression is a string composed of literal characters (such as letters a to z) and special characters (called metacharacters). This string forms a search pattern, which is used to perform search, match, replace, or split operations on text.
Main Purposes
Regular expressions have three main purposes:
- Text Search and Matching: Quickly determine if a piece of text contains a substring that conforms to a specific pattern. For example, checking if a string is a valid email address format.
- Text Replacement: Replace all parts of the text that match a specific pattern with new content. For example, changing all date formats in a document from YYYY-MM-DD to MM/DD/YYYY.
- Text Extraction and Splitting: Precisely extract the parts we care about from a large block of text, or split the text into an array based on a specific delimiter. For example, extracting all IP addresses from a log file, or splitting a CSV string by commas.
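The three purposes map onto a handful of standard JavaScript string and RegExp methods. The sketch below uses made-up input text, and its IP pattern is deliberately loose (it only checks the shape, so it would also accept out-of-range numbers such as 999.1.1.1).

```javascript
const note = "Released on 2023-12-25.";

// 1. Search and match: does the text contain a YYYY-MM-DD date?
const hasDate = /\d{4}-\d{2}-\d{2}/.test(note);

// 2. Replace: rewrite YYYY-MM-DD as MM/DD/YYYY using the captured groups.
const reformatted = note.replace(/(\d{4})-(\d{2})-(\d{2})/g, "$2/$3/$1");

// 3a. Extract: pull everything shaped like an IPv4 address out of a log line.
const logLine = "10.0.0.1 GET /, 192.168.1.20 POST /login";
const ips = logLine.match(/\b\d{1,3}(?:\.\d{1,3}){3}\b/g);

// 3b. Split: break a simple comma-separated string into fields.
const fields = "a,b,c".split(/,/);

console.log(hasDate);     // true
console.log(reformatted); // Released on 12/25/2023.
console.log(ips);         // [ '10.0.0.1', '192.168.1.20' ]
console.log(fields);      // [ 'a', 'b', 'c' ]
```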
Application Scenarios of Regular Expressions
Regular expressions are almost ubiquitous in programming and daily text processing. Here are some of the most common application scenarios:
1. Data Validation
This is one of the most classic applications of regular expressions, ensuring that user input data conforms to the expected format.
- Validating Email Addresses: Checking if the input looks like username@domain.com.
- Validating Phone Numbers: Checking if it conforms to the phone number format of a country/region (e.g., 11 digits in China).
- Validating Password Strength: Requiring passwords to contain uppercase and lowercase letters, numbers, and special characters.
- Validating Date Formats: Ensuring the date is in a valid format like 2023-12-25 or 12/25/2023.
- Validating ID Numbers: Matching ID numbers that follow specific encoding rules.
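A few of these checks could look like the following. These are simplified sketches, not production-grade validators: the email pattern only checks the rough shape, the date pattern checks the format but not whether the date exists, and the minimum password length of 8 is an assumption added for the example.

```javascript
// Rough shape only: something@something.something, with no spaces.
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
// Exactly 11 digits, as in the phone number example above.
const phonePattern = /^\d{11}$/;
// YYYY-MM-DD format only; does not verify that the date is real.
const isoDatePattern = /^\d{4}-\d{2}-\d{2}$/;
// Each (?=...) lookahead demands one kind of character somewhere in the string.
// The minimum length of 8 is an assumed policy, not taken from the text.
const strongPasswordPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^A-Za-z0-9]).{8,}$/;

console.log(emailPattern.test("username@domain.com")); // true
console.log(emailPattern.test("username@domain"));     // false
console.log(phonePattern.test("12345678901"));         // true
console.log(isoDatePattern.test("2023-12-25"));        // true
console.log(strongPasswordPattern.test("Abcdef1!"));   // true
console.log(strongPasswordPattern.test("abcdefgh"));   // false
```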
2. Text Search and Filtering
Quickly locating information within large amounts of text.
- Log Analysis: Searching for all ERROR or WARN level records in server logs.
- Code Search: In an IDE or editor, using regular expressions to search for all function definitions (e.g., function xxx(...)) or specific variable names.
- Document Content Search: Finding all phone numbers or URLs in a long document.
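The log and code searches can be sketched in a few lines. The log lines and the source snippet below are invented sample data, and the function-definition pattern only covers the plain function name(...) form.

```javascript
// Invented sample log.
const logLines = [
  "2023-12-25 10:00:01 INFO Server started",
  "2023-12-25 10:00:05 WARN Disk usage at 91%",
  "2023-12-25 10:00:09 ERROR Connection refused",
];

// Keep only the lines whose level is ERROR or WARN.
const problemLines = logLines.filter((line) => /\b(?:ERROR|WARN)\b/.test(line));

// Invented source snippet; capture the name in each "function name(" definition.
const source = "function add(a, b) { return a + b; }\nfunction  sub (a, b) { return a - b; }";
const functionNames = [...source.matchAll(/function\s+(\w+)\s*\(/g)].map((m) => m[1]);

console.log(problemLines.length); // 2
console.log(functionNames);       // [ 'add', 'sub' ]
```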
3. Text Replacement and Cleaning
Batch modifying text content to standardize it.
- Data Formatting: Formatting a phone number from 12345678901 to 123-4567-8901.
- Data Cleaning: Removing extra whitespace characters (like multiple consecutive spaces or tabs) from text.
- Sensitive Information Masking: Replacing part of each ID number in text with asterisks, e.g., 110101199001011234 -> 110101********1234.
- Code Refactoring: Batch renaming variables or function names.
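Each of these clean-up tasks comes down to a single replace() call. The sketch below reuses the sample values from the list; the variable names in the renaming example (id, userId, width) are hypothetical.

```javascript
// Formatting: split 11 digits into groups of 3-4-4.
const formattedPhone = "12345678901".replace(/^(\d{3})(\d{4})(\d{4})$/, "$1-$2-$3");

// Cleaning: collapse any run of spaces or tabs into one space.
const cleaned = "a  b \t c".replace(/\s+/g, " ");

// Masking: keep the first 6 and last 4 digits of an 18-digit ID.
const masked = "110101199001011234".replace(/^(\d{6})\d{8}(\d{4})$/, "$1********$2");

// Renaming: \b keeps the "id" inside "width" from being touched.
const renamed = "id = width + id".replace(/\bid\b/g, "userId");

console.log(formattedPhone); // 123-4567-8901
console.log(cleaned);        // a b c
console.log(masked);         // 110101********1234
console.log(renamed);        // userId = width + userId
```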
4. Text Extraction and Parsing
Extracting structured data from unstructured text.
- Web Scraping: Extracting all links (href="...") or image addresses (src="...") from HTML code.
- Parsing Configuration Files: Reading configuration files in key = value format.
- Extracting Specific Data: Extracting all amounts (like ￥100.50 or $99.99) from a piece of text.
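The extraction tasks above can be sketched with matchAll() and match(). All input strings here are invented, and the patterns are intentionally simple: for real-world HTML a proper parser is usually a safer choice than a regular expression.

```javascript
// Invented HTML fragment.
const html = '<a href="https://example.com/a">A</a> <img src="/img/logo.png"> <a href="/b">B</a>';
const hrefs = [...html.matchAll(/href="([^"]*)"/g)].map((m) => m[1]);
const srcs = [...html.matchAll(/src="([^"]*)"/g)].map((m) => m[1]);

// Invented key = value configuration; the m flag makes ^ and $ work per line.
const configText = "host = localhost\nport = 8080\n# comment\n";
const config = Object.fromEntries(
  [...configText.matchAll(/^[ \t]*([\w.]+)[ \t]*=[ \t]*(.*?)[ \t]*$/gm)].map((m) => [m[1], m[2]])
);

// Amounts: a currency sign, digits, and an optional two-digit fraction.
const amounts = "Lunch cost ￥100.50 and the book was $99.99.".match(/[¥￥$]\d+(?:\.\d{2})?/g);

console.log(hrefs);   // [ 'https://example.com/a', '/b' ]
console.log(srcs);    // [ '/img/logo.png' ]
console.log(config);  // { host: 'localhost', port: '8080' }
console.log(amounts); // [ '￥100.50', '$99.99' ]
```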
5. String Splitting
Splitting strings using complex rules, not just a single character.
- Splitting a sentence using one or more spaces, commas, or semicolons.
- Parsing CSV or TSV data based on different delimiters (like ",", ";", or \t).
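Passing a regular expression to split() covers both cases. A minimal sketch with made-up input (it does not handle quoted CSV fields):

```javascript
// One or more spaces, commas or semicolons count as a single separator.
const words = "apple, banana;cherry   date".split(/[\s,;]+/);

// Any one of comma, semicolon or tab separates two fields.
const cells = "a,b;c\td".split(/[,;\t]/);

console.log(words); // [ 'apple', 'banana', 'cherry', 'date' ]
console.log(cells); // [ 'a', 'b', 'c', 'd' ]
```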
The following flowchart summarizes the core workflow of regular expressions in data processing:
Development History
The ancestors of regular expressions can be traced back to early research on how the human nervous system works. Two neurophysiologists, Warren McCulloch and Walter Pitts, developed a mathematical way to describe these neural networks.
In 1951, a mathematician named Stephen Kleene, building on the early work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets and Finite Automata," introducing the concept of regular expressions. Regular expressions are expressions used to describe what he called the algebra of regular sets, hence the term regular expression.
Kleene's work was later applied in early research on computational search algorithms by Ken Thompson, one of the principal inventors of Unix. The first practical application of regular expressions was Thompson's QED editor, and they later reached Unix through tools such as ed and grep.
The general development history is as follows:
- 1951: Stephen Kleene, an American mathematician and one of the founders of computability theory, first proposed the concept of regular languages and described them formally. This laid the theoretical foundation for the development of regular expressions.
- 1960s: Ken Thompson, later a co-creator of the Unix operating system, wrote the first programs to apply regular expressions in practice, starting with his version of the QED editor. This marked the practical application of regular expressions.
- 1970s: Regular expression matching became a standard part of Unix through tools such as ed and grep, which played a key role in the popularization of regular expressions.
- 1992: IEEE released the POSIX.2 standard (IEEE Std 1003.2), which included standard specifications for regular expressions, making the behavior of regular expressions more consistent across different Unix systems.
- 1997: Philip Hazel began developing the PCRE (Perl Compatible Regular Expressions) library, which allows Perl-style regular expressions to be used from other programming languages.
- After 2000: Regular expressions became increasingly popular in computer programming and text processing. Programming languages and tools supporting regular expressions became richer and more powerful, such as Perl, Python, Java, JavaScript, etc.
- Currently: Regular expressions remain an important tool for text processing and data extraction, with widespread applications in fields like data science, text analysis, web scraping, string search and replacement, etc.
Application Fields
Today, regular expressions are supported across a wide range of software, including operating systems such as *nix (Linux, Unix, etc.) and HP systems, development environments such as PHP, C#, and Java, and many applications.
C# Regular Expressions
The C# Regular Expressions chapter of our C# tutorial covers regular expressions in C#.
Java Regular Expressions
The Java Regular Expressions chapter of our Java tutorial covers regular expressions in Java.
JavaScript Regular Expressions
The JavaScript RegExp Object chapter of our JavaScript tutorial covers regular expressions in JavaScript. We also provide a complete JavaScript RegExp Object Reference Manual.
Python Regular Expressions
The Python Regular Expressions chapter of our Python basic tutorial covers regular expressions in Python.
Ruby Regular Expressions
The Ruby Regular Expressions chapter of our Ruby tutorial covers regular expressions in Ruby.
The following table shows how many of ten common constructs each command or environment supports. The constructs compared are: . [ ] ^ $ \( \) \{ \} ? + | ( )
| Command or Environment | Constructs Supported (of 10) | Notes |
|---|---|---|
| vi | 5 | |
| Visual C++ | 5 | |
| awk | 8 | awk also supports interval expressions, but you need to add the --posix or --re-interval parameter on the command line. See the interval expression in man awk. |
| sed | 6 | |
| delphi | 9 | |
| python | 10 | |
| java | 10 | |
| javascript | 9 | |
| php | 5 | |
| perl | 9 | |
| C# | 8 | |
Below is a comparison of regular expression support in mainstream programming languages:
| Programming Language | Support Method | Common Class/Module | Simple Example (Match Numbers) |
|---|---|---|---|
| JavaScript | Native language support | RegExp object, or literals /.../ | /\d+/ or new RegExp("\\d+") |
| Python | Standard library re module | re module | re.compile(r"\d+") |
| Java | Standard library java.util.regex package | Pattern and Matcher classes | Pattern.compile("\\d+") |
| PHP | Built-in PCRE functions | preg_ series functions (e.g., preg_match) | preg_match("/\d+/", $text) |
| C# | System.Text.RegularExpressions namespace | Regex class | new Regex(@"\d+") |
| Go | Standard library regexp package | regexp package | regexp.MustCompile(`\d+`) |
| Ruby | Native language support, core class | Regexp class | /\d+/ |
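One detail worth checking in the JavaScript row is escaping. A regular expression literal takes \d as written, while the RegExp constructor receives an ordinary string, so the backslash has to be doubled. The sketch below, with a made-up input string, shows that the two forms are equivalent and what happens when the backslash is left out.

```javascript
const fromLiteral = /\d+/;
const fromConstructor = new RegExp("\\d+"); // "\\d+" is the string \d+
const missingBackslash = new RegExp("d+");  // matches the letter d, not digits

const input = "order 123 added";

const literalMatch = input.match(fromLiteral)[0];
const constructorMatch = input.match(fromConstructor)[0];
const wrongMatch = input.match(missingBackslash)[0];

console.log(fromLiteral.source === fromConstructor.source); // true
console.log(literalMatch, constructorMatch);                // 123 123
console.log(wrongMatch);                                    // d
```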