Unicode Regular Expressions Cheat Sheet
Most regex cheat sheets don't address Unicode; this one summarizes the most useful parts. The Notes section links to the actual characters represented by each 'Property' named alias.
Unicode Regex Syntax
- \p{xx}: a character with the Unicode property alias xx (see the category list below)
- \P{xx}: a character without the Unicode property alias xx
- \x ("hex"): hexadecimal escape. Matches a specific character by its hex code. Usually followed by two digits (\xHH), or by braces in some engines (\x{HHHH}). For example, \x41 matches the letter A.
- \X ("eXtended"): Unicode grapheme cluster. Matches a "user-perceived character": a base character plus any combining marks (such as accents).
- \p{L} can be shortened to \pL in engines that accept single-letter property names without braces.
Engine support varies. For example, .NET supports \p{xx} and \P{xx} but not \X or the brace-less \pL form.
Why \X is different
In the Unicode world, some "characters" are actually multiple code points combined. For example, the emoji 👨‍👩‍👧 is one "human-perceived character" but is made of several individual code points (man, zero-width joiner, woman, zero-width joiner, girl).
. (the dot) might only match the first part of that emoji. \X will match the entire sequence as one unit.
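A minimal Python sketch of why the dot falls short on the family emoji. The emoji is written here with explicit zero-width joiners, which is an assumption about the intended character. Python's built-in re module supports neither \X nor \p{xx}, so this only shows the code-point view; the third-party regex package is documented as supporting both.

```python
import re

# Family emoji: man + ZWJ + woman + ZWJ + girl (assumed spelling).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# One perceived character, but five code points.
code_points = [f"U+{ord(ch):04X}" for ch in family]

# The dot consumes a single code point: only the first part of the emoji.
first_dot_match = re.match(r".", family).group()

# \x41 is the hex escape for the letter A.
hex_match = re.fullmatch(r"\x41", "A")

print(code_points)
print(len(first_dot_match), hex_match is not None)
```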
MS .NET Regex Cheat Sheet
For detailed information and examples, see http://aka.ms/regex
Test at http://regexlib.com/RETester.aspx
Or test using Netspresso Lite (scroll to the bottom; it highlights substitutions).
Single characters
| Use | To match any character |
| [set] | In that set |
| [^set] | Not in that set |
| [a-z] | In the a-z range |
| [^a-z] | Not in the a-z range |
| . | Any except \n (new line) |
| \char | Escaped special character |
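A quick sketch of the single-character forms above, using Python's re module as a stand-in (these particular constructs behave the same way there; the sample strings are made up for illustration).

```python
import re

# [set]: any one character from the set.
vowels = re.findall(r"[aeiou]", "regex")

# [^set] with a range: anything that is not a lowercase a-z letter.
not_lower = re.findall(r"[^a-z]", "a1b2")

# [a-c]: any one character in the range.
in_range = re.findall(r"[a-c]", "abcdef")

# The dot skips the newline by default.
dot = re.findall(r".", "a\nb")

# An escaped dot matches only a literal dot.
escaped = re.findall(r"\.", "a.b.c")

print(vowels, not_lower, in_range, dot, escaped)
```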
Control characters
| Use | To match | Unicode |
| \t | Horizontal tab | \u0009 |
| \v | Vertical tab | \u000B |
| \b | Backspace (inside a [set]; elsewhere \b is a word boundary) | \u0008 |
| \e | Escape | \u001B |
| \r | Carriage return | \u000D |
| \f | Form feed | \u000C |
| \n | New line | \u000A |
| \a | Bell (alarm) | \u0007 |
| \c char | ASCII control character | |
Non-ASCII codes
| Use | To match character with |
| \octal | 2-3 digit octal character code |
| \x hex | 2-digit hex character code |
| \u hex | 4-digit hex character code |
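The escapes in the two tables above can be checked with a short Python sketch. Python's re module is used as a stand-in here: it accepts the octal, \x and \u forms and most of the control escapes, but, as far as I know, not \e or \c, so those are left out.

```python
import re

# Three spellings of the same code point, U+0041 (the letter A).
patterns = [r"\x41", r"\u0041", r"\101"]
results = [re.fullmatch(p, "A") is not None for p in patterns]

# Control-character escapes match the code points listed in the table.
tab_ok = re.fullmatch(r"\t", "\u0009") is not None
vtab_ok = re.fullmatch(r"\v", "\u000B") is not None
bell_ok = re.fullmatch(r"\a", "\u0007") is not None

# \b means backspace only inside a character class.
backspace_ok = re.fullmatch(r"[\b]", "\u0008") is not None

print(results, tab_ok, vtab_ok, bell_ok, backspace_ok)
```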
Character classes
| Use | To match character |
| \p{category} | In that Unicode category or block |
| \P{category} | Not in that Unicode category or block |
| \w | Word character |
| \W | Non-word character |
| \d | Decimal digit |
| \D | Not a decimal digit |
| \s | White-space character |
| \S | Non-white-space char |
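A small sketch of the shorthand classes on non-ASCII text. It uses Python's re module, where \w, \d and \s are Unicode-aware for text patterns; \p{category} is not available there, so it is not shown. The sample strings are illustrative only.

```python
import re

# \d matches any Unicode decimal digit, here ASCII and Arabic-Indic digits.
digits = re.findall(r"\d+", "12 \u0663\u0664")

# \w covers accented letters as well as ASCII ones.
words = re.findall(r"\w+", "na\u00EFve caf\u00E9!")

# \W is the complement of \w.
non_words = re.findall(r"\W", "a-b c")

# \D strips everything that is not a decimal digit.
only_digits = re.sub(r"\D", "", "tel: 555-0100")

# \s+ treats any run of white space as one separator.
parts = re.split(r"\s+", "a \t\n b")

print(digits, words, non_words, only_digits, parts)
```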
Quantifiers
| Greedy | Lazy | Matches |
| * | *? | 0 or more times |
| + | +? | 1 or more times |
| ? | ?? | 0 or 1 time |
| {n} | {n}? | Exactly n times |
| {n,} | {n,}? | At least n times |
| {n,m} | {n,m}? | From n to m times |
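Greedy versus lazy is easiest to see side by side. A minimal sketch in Python's re module, which uses the same quantifier syntax; the inputs are invented for illustration.

```python
import re

html = "<a><b>"

# Greedy: + takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.+>", html).group()

# Lazy: +? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", html).group()

# {n}: exactly n repetitions.
exact = re.findall(r"\d{3}", "1234567")

# {n,}: at least n repetitions.
at_least = re.findall(r"\d{2,}", "1 22 333")

# {n,m} versus {n,m}?: the lazy form settles for the minimum.
ranged_greedy = re.search(r"\d{2,4}", "12345").group()
ranged_lazy = re.search(r"\d{2,4}?", "12345").group()

print(greedy, lazy, exact, at_least, ranged_greedy, ranged_lazy)
```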
Anchors
| Use | To specify position |
| ^ | At start of string or line |
| \A | At start of string |
| \z | At end of string |
| \Z | At end (or before \n at end) of string |
| $ | At end (or before \n at end) of string or line |
| \G | Where previous match ended |
| \b | On word boundary |
| \B | Not on word boundary |
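A sketch of the anchors using Python's re module. Two differences from the table are worth flagging as assumptions about portability: Python's \Z means the absolute end of the string (what the table calls \z), and Python has no \G, so neither is a drop-in equivalent.

```python
import re

text = "one two\nthree four\n"

# ^ matches at every line start only in multiline mode.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# \A matches only at the start of the string, even in multiline mode.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# $ in multiline mode matches at the end of each line.
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# Without multiline, $ matches at the end or before a final newline.
last_word = re.findall(r"\w+$", text)

# Python's \Z is the absolute end, so the trailing newline blocks the match.
absolute_end = re.findall(r"\w+\Z", text)

# \b requires a word boundary; \B requires the absence of one.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
inner = re.findall(r"\Bcat\b", "cat concat")

print(line_starts, string_start, line_ends, last_word, absolute_end)
print(whole_words, inner)
```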
Groups
| Use | To define |
| (exp) | Indexed group |
| (?<name>exp) | Named group |
| (?<name1-name2>exp) | Balancing group |
| (?:exp) | Non-capturing group |
| (?=exp) | Zero-width positive look-ahead |
| (?!exp) | Zero-width negative look-ahead |
| (?<=exp) | Zero-width positive look-behind |
| (?<!exp) | Zero-width negative look-behind |
| (?>exp) | Non-backtracking (greedy) |
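A sketch of the grouping constructs in Python's re module. Note the syntax difference: Python spells a named group (?P<name>exp) rather than (?<name>exp). Balancing groups are .NET-specific and are not shown; the sample strings are illustrative.

```python
import re

# Indexed and named groups capture the same text here.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-05")
year_by_index = m.group(1)
year_by_name = m.group("year")
month = m.group("month")

# Non-capturing group: groups for repetition without creating a capture.
repeats = re.findall(r"(?:ab)+", "ababab ab")

# Look-ahead: the colon must follow but is not part of the match.
keys = re.findall(r"\w+(?=:)", "key: value")

# Negative look-ahead: numbers not followed by a percent sign.
plain_numbers = re.findall(r"\b\d+\b(?!%)", "50% of 80")

# Look-behind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Negative look-behind: digits not preceded by a dollar sign.
not_prices = re.findall(r"(?<!\$)\b\d+", "cost $15 or 20")

print(year_by_index, year_by_name, month, repeats)
print(keys, plain_numbers, prices, not_prices)
```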
Inline options
| Option | Effect on match |
| i | Case-insensitive |
| m | Multiline mode |
| n | Explicit (named) |
| s | Single-line mode |
| x | Ignore white space |
Inline options .NET special instruction
| Use | To |
| (?imnsx-imnsx) | Set or disable the specified options |
| (?imnsx-imnsx:exp) | Set or disable the specified options within the expression |
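A sketch of inline options in Python's re module, which shares the i, m, s and x letters but has no n option. Python also expects a whole-pattern option such as (?i) at the very start of the pattern, so treat this as an approximation of the .NET behaviour.

```python
import re

# (?i) at the start makes the whole pattern case-insensitive.
whole = re.fullmatch(r"(?i)hello", "HeLLo") is not None

# (?i:exp) limits the option to the enclosed expression.
scoped_hit = re.fullmatch(r"(?i:h)ello", "Hello") is not None
scoped_miss = re.fullmatch(r"(?i:h)ello", "HELLO") is not None

# (?-i:exp) switches the option off inside the expression.
off_hit = re.fullmatch(r"(?i)a(?-i:b)", "Ab") is not None
off_miss = re.fullmatch(r"(?i)a(?-i:b)", "AB") is not None

# (?s): single-line mode lets the dot match a newline.
dot_all = re.fullmatch(r"(?s)a.b", "a\nb") is not None
dot_default = re.fullmatch(r"a.b", "a\nb") is not None

# (?m): multiline mode makes ^ match at each line start.
firsts = re.findall(r"(?m)^\w", "ab\ncd")

# (?x): white space in the pattern is ignored.
spaced = re.fullmatch(r"(?x) \d+ \s* kg", "12 kg") is not None

print(whole, scoped_hit, scoped_miss, off_hit, off_miss)
print(dot_all, dot_default, firsts, spaced)
```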
Back References
| Use | To match |
| \n | Indexed group (n is the group number) |
| \k<name> | Named group |
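Back references are handy for finding repeated text. A minimal sketch in Python's re module; the numbered form is the same, but the named form is spelled (?P=name) there instead of \k<name>.

```python
import re

# \1 must repeat whatever group 1 captured: finds doubled words.
doubled = re.findall(r"\b(\w+)\s+\1\b", "it is is what it it is")

# Named back reference: the closing quote must equal the opening quote.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", 'say "hi" now').group()

print(doubled, quoted)
```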
Alternation
| Use | To match |
| a|b | Either a or b |
| (?(exp)yes|no) | yes if exp is matched, otherwise no |
| (?(name)yes|no) | yes if the group name is matched, otherwise no |
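A sketch of alternation and of the group-based conditional in Python's re module. Python supports the (?(name)yes|no) form but, to my knowledge, not the expression-based (?(exp)yes|no) form, so only the former is shown. The tag pattern is a made-up example.

```python
import re

# Plain alternation: either branch may match.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Conditional: require '>' only if the optional '<' group matched.
tag = re.compile(r"(?P<open><)?\w+(?(open)>)")

both = tag.fullmatch("<tag>") is not None
neither = tag.fullmatch("tag") is not None
missing_close = tag.fullmatch("<tag") is not None
stray_close = tag.fullmatch("tag>") is not None

print(pets, both, neither, missing_close, stray_close)
```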
Substitution
| Use | To substitute |
| $n | Substring matched by group number n |
| ${name} | Substring matched by group name |
| $$ | Literal $ character |
| $& | Copy of whole match |
| $` | Text before the match |
| $' | Text after the match |
| $+ | Last captured group |
| $_ | Entire input string |
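The same ideas in Python's re.sub, as a rough comparison only. Python's replacement syntax differs: \1 stands in for $1, \g<name> for ${name}, and \g<0> for $&. There is no built-in counterpart to the text-before-match token, so a replacement function emulates it below.

```python
import re

# Numbered groups: swap two words (like $2 $1).
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# Named groups (like ${last}, ${first}).
renamed = re.sub(
    r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Ada Lovelace"
)

# Whole match (like $&): wrap each number in brackets.
wrapped = re.sub(r"\d+", r"[\g<0>]", "a1b22")

# Text before the match, emulated with a function on the match object.
before = re.sub(r"b", lambda m: m.string[: m.start()], "abc")

print(swapped, renamed, wrapped, before)
```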
Comments
| Use | To |
| (?# comment) | Add inline comment |
| # | Add x-mode comment |
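Both comment styles, sketched in Python's re module, where x mode is the re.VERBOSE flag. The phone-number pattern and its group names are invented for illustration.

```python
import re

# x-mode: white space is ignored and # starts a comment.
phone = re.compile(
    r"""
    (?P<area>\d{3})    # three-digit prefix
    [-\s]?             # optional separator
    (?P<number>\d{4})  # four-digit number
    """,
    re.VERBOSE,
)
m = phone.fullmatch("555-0100")

# (?# ...) works without x mode and is skipped by the engine.
inline = re.fullmatch(r"\d{3}(?#prefix)-\d{4}", "555-0100") is not None

print(m.group("area"), m.group("number"), inline)
```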
Supported Unicode Categories
| Category | Description |
| Lu | Letter, uppercase |
| Ll | Letter, lowercase |
| Lt | Letter, title case |
| Lm | Letter, modifier |
| Lo | Letter, other |
| L | Letter, all |
| Mn | Mark, non-spacing combining |
| Mc | Mark, spacing combining |
| Me | Mark, enclosing combining |
| M | Mark, all diacritic |
| Nd | Number, decimal digit |
| Nl | Number, letter-like |
| No | Number, other |
| N | Number, all |
| Pc | Punctuation, connector |
| Pd | Punctuation, dash |
| Ps | Punctuation, opening mark |
| Pe | Punctuation, closing mark |
| Pi | Punctuation, initial quote mark |
| Pf | Punctuation, final quote mark |
| Po | Punctuation, other |
| P | Punctuation, all |
| Sm | Symbol, math |
| Sc | Symbol, currency |
| Sk | Symbol, modifier |
| So | Symbol, other |
| S | Symbol, all |
| Zs | Separator, space |
| Zl | Separator, line |
| Zp | Separator, paragraph |
| Z | Separator, all |
| Cc | Control code |
| Cf | Format control character |
| Cs | Surrogate code point |
| Co | Private-use character |
| Cn | Unassigned |
| C | Other, all (Cc, Cf, Cs, Co, Cn) |
For named character set blocks (e.g., Cyrillic), search for "supported named blocks" in the MSDN Library.
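To see which category a given character falls in, Python's standard unicodedata module reports the same two-letter codes. The sketch below emulates a \p{category} test with a hypothetical helper, in_category, since Python's built-in re module has no \p{...} support; it is an approximation for exploring the table, not how a regex engine implements it.

```python
import unicodedata


def in_category(ch: str, category: str) -> bool:
    """Hypothetical helper: True if ch is in the category, e.g. 'Lu' or 'L'."""
    # A one-letter name such as 'L' covers every sub-category (Lu, Ll, ...).
    return unicodedata.category(ch).startswith(category)


# A few characters and the categories they are expected to report.
samples = {
    "A": "Lu", "a": "Ll", "5": "Nd", "_": "Pc", "-": "Pd",
    "(": "Ps", ")": "Pe", "!": "Po", "+": "Sm", "$": "Sc",
    " ": "Zs", "\n": "Cc", "\u0301": "Mn", "\u200D": "Cf",
}
reported = {ch: unicodedata.category(ch) for ch in samples}

# Rough equivalent of matching \p{L} against each character.
letters = "".join(ch for ch in "a1-B_\u00E9" if in_category(ch, "L"))

print(reported == samples, letters)
```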