# Java Regular Expressions
Regular expressions define patterns for strings.
Regular expressions can be used to search, edit, or process text.
Regular expressions are not specific to one language, but each language's implementation differs in subtle ways.
Java provides the `java.util.regex` package, which contains the `Pattern` and `Matcher` classes for handling regular expression matching operations.
### Regular Expression Examples
A string itself is a simple regular expression. For example, the regular expression **Hello World** matches the string "Hello World".
`.` (dot) is also a regular expression, which matches any single character, such as "a" or "1".
The following table lists some regular expression examples and their descriptions:
| Regular Expression | Description |
| --- | --- |
| this is text | Matches the string "this is text" |
| `this\s+is\s+text` | Note the `\s+` in the pattern. The `\s+` after the word "this" matches one or more whitespace characters, then the pattern matches the string "is", then `\s+` matches one or more whitespace characters again, followed by the string "text". Can match this example: this is text |
| `^\d+(\.\d+)?` | `^` defines the start, `\d+` matches one or more digits, `?` makes the parenthesized group optional, `\.` matches ".". Can match examples: "5", "1.5", and "2.21". |
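
As a rough check of the table entries, the sketch below runs the two patterns that contain escapes through `Pattern.matches`. The class name `TablePatternsDemo` and the helper `fullMatch` are hypothetical names chosen for this illustration. In Java source, each regex backslash is written twice inside the string literal.

```java
import java.util.regex.Pattern;

class TablePatternsDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \s+ accepts one or more whitespace characters between the words
        String words = "this\\s+is\\s+text";
        System.out.println(fullMatch(words, "this is text"));      // true
        System.out.println(fullMatch(words, "this   is   text"));  // true
        System.out.println(fullMatch(words, "thisistext"));        // false

        // ^\d+(\.\d+)? accepts digits with an optional decimal part
        String number = "^\\d+(\\.\\d+)?";
        System.out.println(fullMatch(number, "5"));     // true
        System.out.println(fullMatch(number, "1.5"));   // true
        System.out.println(fullMatch(number, "2.21"));  // true
        System.out.println(fullMatch(number, "1."));    // false
    }
}
```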
* * *
## java.util.regex Package
The `java.util.regex` package is part of the Java standard library for supporting regular expression operations.
The `java.util.regex` package mainly includes the following three classes:
* **Pattern class:**
A Pattern object is a compiled representation of a regular expression. The Pattern class has no public constructors. To create a Pattern object, you must first call its public static compile method, which returns a Pattern object. This method takes a regular expression as its first parameter.
* **Matcher class:**
A Matcher object is an engine that interprets and performs matching operations on an input string. Like the Pattern class, Matcher also has no public constructors. You need to call the matcher method of a Pattern object to obtain a Matcher object.
* **PatternSyntaxException:**
PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern.
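
The way the three classes fit together can be sketched as follows. The class `ThreeClassesDemo` and the helper `containsMatch` are hypothetical names for this illustration, and returning `false` for a malformed pattern is just one possible design choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ThreeClassesDemo {
    // Hypothetical helper: reports whether the regex occurs anywhere in the input.
    // Returns false if the regex itself is malformed.
    static boolean containsMatch(String regex, String input) {
        try {
            Pattern p = Pattern.compile(regex);  // static factory, no public constructor
            Matcher m = p.matcher(input);        // a Matcher is obtained from the Pattern
            return m.find();
        } catch (PatternSyntaxException e) {
            System.out.println("Bad pattern: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("c[ao]t", "a cot and a cat")); // true
        // The character class is never closed, so compile throws
        System.out.println(containsMatch("c[ao", "a cot and a cat"));   // false
    }
}
```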
The following example uses the regular expression `.*tutorial.*` to check whether a string contains the substring `tutorial`:
## Example
```java
import java.util.regex.*;

class RegexExample1 {
    public static void main(String[] args) {
        String content = "I am noob " + "from example.com.";
        // .* on both sides lets "tutorial" appear anywhere in the input
        String pattern = ".*tutorial.*";
        // Pattern.matches requires the whole input to match the pattern
        boolean isMatch = Pattern.matches(pattern, content);
        System.out.println("Whether the string contains 'tutorial' substring? " + isMatch);
    }
}
```

The input string does not contain `tutorial`, so the output of the example is:

```
Whether the string contains 'tutorial' substring? false
```
* * *
## Capturing Groups
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing characters inside a set of parentheses.
For example, the regular expression `(dog)` creates a single group containing "d", "o", and "g".
Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression `((A)(B(C)))`, there are four such groups:
* `((A)(B(C)))`
* `(A)`
* `(B(C))`
* `(C)`
You can find out how many groups are in an expression by calling the `groupCount` method on a matcher object. The `groupCount` method returns an `int` value indicating the number of capturing groups the matcher object currently has.
There is also a special group, `group(0)`, which always represents the entire match. This group is not included in the return value of `groupCount`.
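
The numbering rule can be checked with a short sketch. The class `GroupCountDemo` and the helper `groups` are hypothetical names for this illustration; the helper collects `group(0)` through `group(groupCount())` after a full match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Hypothetical helper: returns group(0)..group(groupCount()), or an empty list if no match.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        System.out.println(m.groupCount());                // 4 (group 0 is not counted)
        System.out.println(groups("((A)(B(C)))", "ABC"));  // [ABC, ABC, A, BC, C]
    }
}
```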
## Example
The following example illustrates how to find numeric strings from a given string:
## RegexMatches.java File Code:
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        // Group 1: non-digits, group 2: digits, group 3: the rest of the line
        String pattern = "(\\D*)(\\d+)(.*)";

        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(line);
        if (m.find()) {
            System.out.println("Found value: " + m.group(0));
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
            System.out.println("Found value: " + m.group(3));
        } else {
            System.out.println("NO MATCH");
        }
    }
}
```

The compiled and running result of the above example is as follows:

```
Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT
Found value: 3000
Found value: ! OK?
```
* * *
## Regular Expression Syntax
In other languages, `\\` means: **I want to insert a literal (plain) backslash into the regular expression, please do not give it any special meaning.**
In Java, `\\` means: **I want to insert a regular expression backslash, so the following character has special meaning.**
Therefore, in other languages (like Perl), a single backslash `\` is sufficient for escaping, while in a Java string literal two backslashes are needed to produce the same escape. Put simply, two `\\` in Java source represent one `\` in other languages. This is why the regular expression for a single digit is written `\\d`, and the one for a literal backslash is written `\\\\`.

```java
System.out.print("\\");    // Outputs \
System.out.print("\\\\");  // Outputs \\
```

| Character | Description |
| --- | --- |
| `\` | Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, `n` matches the character "n", while `\n` matches a newline character. The sequence `\\` matches `\`, and `\(` matches `(`. |
| `^` | Matches the position at the start of the input string. If the `Pattern.MULTILINE` flag is set, `^` also matches the position after a line terminator such as `\n` or `\r`. |
| `$` | Matches the position at the end of the input string. If the `Pattern.MULTILINE` flag is set, `$` also matches the position before a line terminator such as `\n` or `\r`. |
| `*` | Matches the preceding character or subexpression zero or more times. For example, `zo*` matches "z" and "zoo". `*` is equivalent to `{0,}`. |
| `+` | Matches the preceding character or subexpression one or more times. For example, `zo+` matches "zo" and "zoo", but not "z". `+` is equivalent to `{1,}`. |
| `?` | Matches the preceding character or subexpression zero or one time. For example, `do(es)?` matches "do" or "does". `?` is equivalent to `{0,1}`. |
| `{n}` | `n` is a nonnegative integer. Matches exactly `n` times. For example, `o{2}` does not match the "o" in "Bob", but matches the two "o"s in "food". |
| `{n,}` | `n` is a nonnegative integer. Matches at least `n` times. For example, `o{2,}` does not match the "o" in "Bob", but matches all the o's in "foooood". `o{1,}` is equivalent to `o+`. `o{0,}` is equivalent to `o*`. |
| `{n,m}` | `m` and `n` are nonnegative integers, where `n <= m`. Matches at least `n` and at most `m` times. |

According to the Java Language Specification, backslashes in Java source code string literals are interpreted as Unicode escapes or other character escapes. Therefore, two backslashes must be used in string literals to protect the regular expression from being interpreted by the Java compiler. For example, when interpreted as a regular expression, the string literal `"\b"` matches a single backspace character, while `"\\b"` matches a word boundary. The string literal `"\(hello\)"` is illegal and causes a compile-time error; to match the string `(hello)`, you must use the string literal `"\\(hello\\)"`.
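
The escaping rules above can be tried out with a minimal sketch. The class `EscapingDemo` and the helper `found` are hypothetical names for this illustration; the comments show which regex each string literal actually produces.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Hypothetical helper: true when the regex occurs anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\d" in source is the regex \d (one digit)
        System.out.println(found("\\d", "room 7"));               // true
        // "\\\\" in source is the regex \\, which matches one literal backslash
        System.out.println(found("\\\\", "C:\\temp"));            // true
        // "\\b" is a word boundary
        System.out.println(found("\\bcat\\b", "a cat nap"));      // true
        System.out.println(found("\\bcat\\b", "concatenate"));    // false
        // "\b" is a backspace character, which this input does not contain
        System.out.println(found("\b", "a cat nap"));             // false
        // "\\(hello\\)" matches the literal text (hello)
        System.out.println(found("\\(hello\\)", "say (hello)"));  // true
    }
}
```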
* * *
## Methods of the Matcher Class
## Index Methods
Index methods provide useful index values, precisely indicating where in the input string a match was found:
| **No.** | **Method and Description** |
| --- | --- |
| 1 | **public int start()** Returns the start index of the previous match. |
| 2 | **public int start(int group)** Returns the start index of the subsequence captured by the given group during the previous match operation. |
| 3 | **public int end()** Returns the offset after the last character matched. |
| 4 | **public int end(int group)** Returns the offset after the last character of the subsequence captured by the given group during the previous match operation. |
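
A small sketch of the four index methods follows. The class `IndexMethodsDemo` and the helper `firstMatchOffsets` are hypothetical names, and the helper assumes the regex passed in has at least one capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexMethodsDemo {
    // Hypothetical helper: {start(), end(), start(1), end(1)} for the first match,
    // or null if there is none. Assumes the regex defines capturing group 1.
    static int[] firstMatchOffsets(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(), m.end(), m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String input = "id=42;";
        int[] o = firstMatchOffsets("id=(\\d+)", input);
        System.out.println("whole match: [" + o[0] + ", " + o[1] + ")");  // whole match: [0, 5)
        System.out.println("group 1:     [" + o[2] + ", " + o[3] + ")");  // group 1:     [3, 5)
        // end is exclusive, so substring(start, end) returns the captured text
        System.out.println(input.substring(o[2], o[3]));                  // 42
    }
}
```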
## Study Methods
Study methods are used to examine the input string and return a boolean value indicating whether the pattern is found:
| **No.** | **Method and Description** |
| --- | --- |
| 1 | **public boolean lookingAt()** Attempts to match the input sequence, starting at the beginning of the region, against the pattern. |
| 2 | **public boolean find()** Attempts to find the next subsequence of the input sequence that matches the pattern. |
| 3 | **public boolean find(int start)** Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index. |
| 4 | **public boolean matches()** Attempts to match the entire region against the pattern. |
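
The four study methods can be compared side by side in a short sketch. The class `StudyMethodsDemo` and the helper `countMatches` are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StudyMethodsDemo {
    // Hypothetical helper: counts matches by calling find() until it returns false.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ab12cd345";
        System.out.println(countMatches("\\d+", input));  // 2

        Matcher m = Pattern.compile("\\d+").matcher(input);
        System.out.println(m.lookingAt());  // false: the input does not start with a digit
        System.out.println(m.matches());    // false: the whole input is not digits
        System.out.println(m.find(4));      // true: resets, then searches from index 4
        System.out.println(m.group());      // 345
    }
}
```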
## Replacement Methods
Replacement methods are methods for replacing text in the input string:
| **No.** | **Method and Description** |
| --- | --- |
| 1 | **public Matcher appendReplacement(StringBuffer sb, String replacement)** Implements a non-terminal append-and-replace step. |
| 2 | **public StringBuffer appendTail(StringBuffer sb)** Implements a terminal append-and-replace step. |
| 3 | **public String replaceAll(String replacement)** Replaces every subsequence of the input sequence that matches the pattern with the given replacement string. |
| 4 | **public String replaceFirst(String replacement)** Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string. |
| 5 | **public static String quoteReplacement(String s)** Returns a literal replacement string for the specified string. This method produces a string that will work as a literal replacement `s` in the `appendReplacement` method of the `Matcher` class. |
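
In a replacement string, `$1`, `$2`, and so on refer to capturing groups, which is why `quoteReplacement` exists. The sketch below contrasts the two cases; the class `QuoteReplacementDemo` and both helper methods are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementDemo {
    // Hypothetical helper: replaces every match with the replacement taken literally.
    static String replaceLiteral(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Hypothetical helper: replaces every match; $1, $2, ... refer to capturing groups.
    static String replaceWithGroups(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "price: 10 EUR, tax: 2 EUR";
        // $1 is replaced by the digits captured by group 1
        System.out.println(replaceWithGroups("(\\d+) EUR", input, "$1 euros"));
        // price: 10 euros, tax: 2 euros

        // quoteReplacement escapes $ so it is inserted as a plain character
        System.out.println(replaceLiteral("EUR", input, "$"));
        // price: 10 $, tax: 2 $
    }
}
```

Without `quoteReplacement`, a bare `$` in the replacement would be read as the start of a group reference rather than as a literal character.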
## start and end Methods
Here is an example that counts the number of times the word "cat" appears in the input string:
## RegexMatches.java File Code:
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    // \b is a word boundary, so only the standalone word "cat" matches
    private static final String REGEX = "\\bcat\\b";
    private static final String INPUT = "cat cat cat cattie cat";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT);
        int count = 0;
        while (m.find()) {
            count++;
            System.out.println("Match number " + count);
            System.out.println("start(): " + m.start());
            System.out.println("end(): " + m.end());
        }
    }
}
```

The compiled and running result of the above example is as follows:

```
Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11
Match number 4
start(): 19
end(): 22
```
It can be seen that this example uses word boundaries to ensure that the letters "c", "a", "t" are not just a substring of a longer word. It also provides some useful information about where in the input string the matches occurred.
The `start` method returns the start index of the previous match, and the `end` method returns the index of the last matched character plus one.
## matches and lookingAt Methods
Both the `matches` and `lookingAt` methods attempt to match an input sequence against a pattern. Their difference is that `matches` requires the entire sequence to match, while `lookingAt` does not.
Both methods always start at the beginning of the input string: `lookingAt` does not require the entire sequence to match, but the match must still begin at the first character.
We use the following example to explain this functionality:
## RegexMatches.java File Code:
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    private static final String REGEX = "foo";
    private static final String INPUT = "fooooooooooooooooo";
    private static final String INPUT2 = "ooooofoooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;
    private static Matcher matcher2;

    public static void main(String[] args) {
        pattern = Pattern.compile(REGEX);
        matcher = pattern.matcher(INPUT);
        matcher2 = pattern.matcher(INPUT2);

        System.out.println("Current REGEX is: " + REGEX);
        System.out.println("Current INPUT is: " + INPUT);
        System.out.println("Current INPUT2 is: " + INPUT2);

        // INPUT starts with "foo" but contains more characters after it
        System.out.println("lookingAt(): " + matcher.lookingAt());
        System.out.println("matches(): " + matcher.matches());
        // INPUT2 contains "foo", but not at the beginning
        System.out.println("lookingAt(): " + matcher2.lookingAt());
    }
}
```

The compiled and running result of the above example is as follows:

```
Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
Current INPUT2 is: ooooofoooooooooooo
lookingAt(): true
matches(): false
lookingAt(): false
```
## replaceFirst and replaceAll Methods
The `replaceFirst` and `replaceAll` methods are used to replace text that matches a regular expression. The difference is that `replaceFirst` replaces the first match, while `replaceAll` replaces all matches.
The following example explains this functionality:
## RegexMatches.java File Code:
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    private static String REGEX = "dog";
    private static String INPUT = "The dog says meow. " + "All dogs say meow.";
    private static String REPLACE = "cat";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // Get a matcher object for the input
        Matcher m = p.matcher(INPUT);
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}
```

The compiled and running result of the above example is as follows:

```
The cat says meow. All cats say meow.
```
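
For contrast, here is a minimal sketch that applies both `replaceFirst` and `replaceAll` to the same input. The class `ReplaceFirstDemo` and its two helpers are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;

class ReplaceFirstDemo {
    // Hypothetical helper: replaces only the first match.
    static String first(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
    }

    // Hypothetical helper: replaces every match.
    static String all(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "The dog says meow. All dogs say meow.";
        System.out.println(first("dog", input, "cat"));  // The cat says meow. All dogs say meow.
        System.out.println(all("dog", input, "cat"));    // The cat says meow. All cats say meow.
    }
}
```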
## appendReplacement and appendTail Methods
The `Matcher` class also provides the `appendReplacement` and `appendTail` methods for text replacement. The following example explains this functionality:
## RegexMatches.java File Code:
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    private static String REGEX = "a*b";
    private static String INPUT = "aabfooaabfooabfoobkkk";
    private static String REPLACE = "-";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // Get a matcher object for the input
        Matcher m = p.matcher(INPUT);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Appends the text before the match, then the replacement
            m.appendReplacement(sb, REPLACE);
        }
        // Appends the text after the last match
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}
```

The compiled and running result of the above example is as follows:

```
-foo-foo-foo-kkk
```
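
One reason to use `appendReplacement` rather than `replaceAll` is that the replacement can be computed separately for each match. The sketch below doubles every integer in a string. The class `AppendReplacementDemo` and the helper `doubleNumbers` are hypothetical names, and the code assumes the numbers are small enough for `int`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Hypothetical helper: doubles every integer in the input and copies the rest unchanged.
    // Assumes each number, once doubled, still fits in an int.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Appends the text since the previous match, then the computed replacement
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        // Appends whatever follows the last match
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears"));  // 6 apples and 24 pears
    }
}
```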
## Methods of the PatternSyntaxException Class
`PatternSyntaxException` is an unchecked exception class that indicates a syntax error in a regular expression pattern.
The `PatternSyntaxException` class provides the following methods to help us see what went wrong.
| **No.** | **Method and Description** |
| --- | --- |
| 1 | **public String getDescription()** Retrieves the error description. |
| 2 | **public int getIndex()** Retrieves the error index. |
| 3 | **public String getPattern()** Retrieves the erroneous regular expression pattern. |
| 4 | **public String getMessage()** Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern. |
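
A minimal sketch of how these methods might be used is shown below. The class `SyntaxErrorDemo` and the helper `compileError` are hypothetical names for this illustration. The exact wording of the description and message comes from the JDK, so no output is shown here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Hypothetical helper: returns the exception raised by compiling the regex,
    // or null if the regex is valid.
    static PatternSyntaxException compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // The opening parenthesis is never closed, so compilation fails
        PatternSyntaxException e = compileError("(abc");
        if (e != null) {
            System.out.println("Description: " + e.getDescription());
            System.out.println("Index:       " + e.getIndex());
            System.out.println("Pattern:     " + e.getPattern());
            System.out.println(e.getMessage());
        }
    }
}
```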