This appendix is a reference for regular expressions.
QED (short for Quick Editor) was originally written for the Berkeley Time-Sharing System, which ran on the Scientific Data Systems SDS 940. A rewrite of the original QED editor by Ken Thompson for MIT’s Compatible Time-Sharing System yielded one of the earliest (if not the first) practical implementations of regular expressions in computing. Table A-1, taken from pages 3 and 4 of a 1970 Bell Labs memo, outlines the regex features in QED. It amazes me that most of this syntax has remained in use to this day, over 40 years later.
Table A-1. QED regular expressions
There are 14 metacharacters used in regular expressions, each with special meaning, as described in Table A-2. If you want to use one of these characters as a literal, you must precede it with a backslash to escape it. For example, you would escape the dollar sign like this: \$, or a backslash like this: \\.
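The escaping rule above can be sketched in Python. The helper name `escape_literal` and the sample string are made up for illustration; the metacharacter list is the one given in Table A-2.

```python
import re

# The 14 metacharacters from Table A-2 (raw string, so the backslash is literal).
METACHARACTERS = r".\|^$?*+[]{}()"


def escape_literal(text: str) -> str:
    """Precede each metacharacter in text with a backslash (hypothetical helper)."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)


# Build a pattern that matches the sample text literally.
sample = "$5.00 (approx.)"
pattern = escape_literal(sample)
print(pattern)
print(re.fullmatch(pattern, sample) is not None)
```

Python's standard library also provides `re.escape`, which does the same job but may escape additional characters, such as spaces, that are not in the table.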
Table A-2. Metacharacters in regular expressions
Metacharacter | Name | Code Point | Purpose |
---|---|---|---|
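. | Full Stop | U+002E | Match any character |
\ | Backslash | U+005C | Escape a character |
| | Vertical Line (Bar) | U+007C | Alternation (or) |
^ | Circumflex | U+005E | Beginning-of-line anchor |
$ | Dollar Sign | U+0024 | End-of-line anchor |
? | Question Mark | U+003F | Zero-or-one quantifier |
* | Asterisk | U+002A | Zero-or-more quantifier |
+ | Plus Sign | U+002B | One-or-more quantifier |
[ | Left Square Bracket | U+005B | Open character class |
] | Right Square Bracket | U+005D | Close character class |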
{ | Left Curly Brace | U+007B | Open quantifier or block |
} | Right Curly Brace | U+007D | Close quantifier or block |
( | Left Parenthesis | U+0028 | Open group |
) | Right Parenthesis | U+0029 | Close group |
Table A-3 lists character shorthands used in regular expressions.
Table A-4 is a list of character shorthands for whitespace.
Whitespace characters in Unicode are listed in Table A-5.
Table A-5. Whitespace characters in Unicode
Abbreviation or Nickname | Name | Unicode Code Point | Regex |
---|---|---|---|
HT | Horizontal tab | U+0009 | \u0009 or \t |
LF | Line feed | U+000A | \u000A or \n |
VT | Vertical tab | U+000B | \u000B or \v |
FF | Form feed | U+000C | \u000C or \f |
CR | Carriage return | U+000D | \u000D or \r |
SP | Space | U+0020 | \u0020 or \s[a] |
NEL | Next line | U+0085 | \u0085 |
NBSP | No-break space | U+00A0 | \u00A0 |
— | Ogham space mark | U+1680 | \u1680 |
MVS | Mongolian vowel separator | U+180E | \u180E |
BOM | Byte order mark | U+FEFF | \uFEFF |
NQSP | En quad | U+2000 | \u2000 |
MQSP, Mutton Quad | Em quad | U+2001 | \u2001 |
ENSP, Nut | En space | U+2002 | \u2002 |
EMSP, Mutton | Em space | U+2003 | \u2003 |
3MSP, Thick space | Three-per-em space | U+2004 | \u2004 |
4MSP, Mid space | Four-per-em space | U+2005 | \u2005 |
6/MSP | Six-per-em space | U+2006 | \u2006 |
FSP | Figure space | U+2007 | \u2007 |
PSP | Punctuation space | U+2008 | \u2008 |
THSP | Thin space | U+2009 | \u2009 |
HSP | Hair space | U+200A | \u200A |
ZWSP | Zero width space | U+200B | \u200B |
LSEP | Line separator | U+2028 | \u2028 |
PSEP | Paragraph separator | U+2029 | \u2029 |
NNBSP | Narrow no-break space | U+202F | \u202F |
MMSP | Medium mathematical space | U+205F | \u205F |
IDSP | Ideographic space | U+3000 | \u3000 |
[a] Also matches other whitespace. |
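As a quick check of which characters a given engine treats as whitespace, the following sketch tests a few entries from Table A-5 against `\s` in Python's `re` module. Python is an assumption here, not one of the engines discussed in this appendix, and the helper name `is_regex_whitespace` is hypothetical. Results differ between engines and Unicode versions, so treat this as a way to test your own environment rather than as a definitive list.

```python
import re


def is_regex_whitespace(ch: str, ascii_only: bool = False) -> bool:
    """Report whether \\s matches the single character ch (hypothetical helper)."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\s", ch, flags) is not None


# A few abbreviations and code points taken from Table A-5.
samples = {
    "HT": "\u0009",
    "SP": "\u0020",
    "NBSP": "\u00a0",
    "EMSP": "\u2003",
    "IDSP": "\u3000",
    "ZWSP": "\u200b",
}

for name, ch in samples.items():
    unicode_match = is_regex_whitespace(ch)
    ascii_match = is_regex_whitespace(ch, ascii_only=True)
    print(f"{name} U+{ord(ch):04X}: unicode={unicode_match} ascii={ascii_match}")
```

With the `re.ASCII` flag, `\s` is restricted to the ASCII whitespace characters, so the no-break space and the wider Unicode spaces stop matching.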
Table A-6 shows a way to match control characters in regular expressions.
Table A-6. Matching control characters
Control Character | Unicode Value | Abbreviation | Name |
---|---|---|---|
\c@[a] | U+0000 | NUL | Null |
\cA | U+0001 | SOH | Start of heading |
\cB | U+0002 | STX | Start of text |
\cC | U+0003 | ETX | End of text |
\cD | U+0004 | EOT | End of transmission |
\cE | U+0005 | ENQ | Enquiry |
\cF | U+0006 | ACK | Acknowledge |
\cG | U+0007 | BEL | Bell |
\cH | U+0008 | BS | Backspace |
\cI | U+0009 | HT | Character tabulation or horizontal tab |
\cJ | U+000A | LF | Line feed (newline, end of line) |
\cK | U+000B | VT | Line tabulation or vertical tab |
\cL | U+000C | FF | Form feed |
\cM | U+000D | CR | Carriage return |
\cN | U+000E | SO | Shift out |
\cO | U+000F | SI | Shift in |
\cP | U+0010 | DLE | Data link escape |
\cQ | U+0011 | DC1 | Device control one |
\cR | U+0012 | DC2 | Device control two |
\cS | U+0013 | DC3 | Device control three |
\cT | U+0014 | DC4 | Device control four |
\cU | U+0015 | NAK | Negative acknowledge |
\cV | U+0016 | SYN | Synchronous idle |
\cW | U+0017 | ETB | End of transmission block |
\cX | U+0018 | CAN | Cancel |
\cY | U+0019 | EM | End of medium |
\cZ | U+001A | SUB | Substitute |
\c[ | U+001B | ESC | Escape |
\c\ | U+001C | FS | Information separator four |
\c] | U+001D | GS | Information separator three |
\c^ | U+001E | RS | Information separator two |
\c_ | U+001F | US | Information separator one |
[a] Can use upper- or lowercase. |
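The mapping in Table A-6 follows a simple rule: the code point of the control character is the code point of the symbol after \c with bit 0x40 flipped. Python's `re` module does not support the \cX syntax, so the sketch below (with the hypothetical helpers `control_char` and `control_pattern`) translates a \cX symbol into an equivalent \xhh escape instead.

```python
import re


def control_char(symbol: str) -> str:
    """Return the control character that \\cX denotes for the symbol X."""
    code = ord(symbol.upper()) ^ 0x40  # flip bit 0x40: 'A' (0x41) -> 0x01
    if not (code <= 0x1F or code == 0x7F):
        raise ValueError(f"{symbol!r} does not name a control character")
    return chr(code)


def control_pattern(symbol: str) -> str:
    """Build a \\xhh escape equivalent to \\cX, for engines without \\c support."""
    return "\\x{:02x}".format(ord(control_char(symbol)))


# \cI is the horizontal tab, so this pattern finds the tab in the sample text.
tab_pattern = control_pattern("I")
print(tab_pattern)
print(re.search(tab_pattern, "a\tb") is not None)
```

The helper uppercases its argument first, which mirrors the footnote's remark that either case can be used.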
Table A-7 lists character property names for use with \p{property} or \P{property}.
Table A-7. Character properties[2]
Property | Description |
---|---|
C | Other |
Cc | Control |
Cf | Format |
Cn | Unassigned |
Co | Private use |
Cs | Surrogate |
L | Letter |
Ll | Lowercase letter |
Lm | Modifier letter |
Lo | Other letter |
Lt | Title case letter |
Lu | Uppercase letter |
L& | Ll, Lu, or Lt |
M | Mark |
Mc | Spacing mark |
Me | Enclosing mark |
Mn | Non-spacing mark |
N | Number |
Nd | Decimal number |
Nl | Letter number |
No | Other number |
P | Punctuation |
Pc | Connector punctuation |
Pd | Dash punctuation |
Pe | Close punctuation |
Pf | Final punctuation |
Pi | Initial punctuation |
Po | Other punctuation |
Ps | Open punctuation |
S | Symbol |
Sc | Currency symbol |
Sk | Modifier symbol |
Sm | Mathematical symbol |
So | Other symbol |
Z | Separator |
Zl | Line separator |
Zp | Paragraph separator |
Zs | Space separator |
[2] See pcresyntax(3) at http://www.pcre.org/pcre.txt. |
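Python's built-in `re` module does not accept \p{property}, but the standard `unicodedata` module exposes the same general category codes listed in Table A-7. The sketch below uses them to approximate a property test. The function names `has_property` and `find_all` are hypothetical, and this emulates the idea rather than reproducing any regex engine's implementation.

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Approximate \\p{prop} for one character using its general category."""
    category = unicodedata.category(ch)  # two-letter code such as "Lu" or "Nd"
    if prop == "L&":
        # L& stands for Ll, Lu, or Lt, as noted in Table A-7.
        return category in ("Ll", "Lu", "Lt")
    # A one-letter property such as "L" covers every category starting with it.
    return category.startswith(prop)


def find_all(text: str, prop: str) -> list:
    """Return the characters of text that have the given property."""
    return [ch for ch in text if has_property(ch, prop)]


print(find_all("Price: $5", "Sc"))
print(find_all("Price: $5", "Lu"))
```

Negating the result gives the effect of \P{property}.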
Table A-8 shows the language script names for use with \p{property} or \P{property}.
Table A-8. Script names[3]
Arabic (Arab) | Glagolitic (Glag) | Lepcha (Lepc) | Samaritan (Samr) |
Armenian (Armn) | Gothic (Goth) | Limbu (Limb) | Saurashtra (Saur) |
Avestan (Avst) | Greek (Grek) | Linear B (Linb) | Shavian (Shaw) |
Balinese (Bali) | Gujarati (Gujr) | Lisu (Lisu) | Sinhala (Sinh) |
Bamum (Bamu) | Gurmukhi (Guru) | Lycian (Lyci) | Sundanese (Sund) |
Bengali (Beng) | Han (Hani) | Lydian (Lydi) | Syloti Nagri (Sylo) |
Bopomofo (Bopo) | Hangul (Hang) | Malayalam (Mlym) | Syriac (Syrc) |
Braille (Brai) | Hanunoo (Hano) | Meetei Mayek (Mtei) | Tagalog (Tglg) |
Buginese (Bugi) | Hebrew (Hebr) | Mongolian (Mong) | Tagbanwa (Tagb) |
Buhid (Buhd) | Hiragana (Hira) | Myanmar (Mymr) | Tai Le (Tale) |
Canadian Aboriginal (Cans) | Katakana or Hiragana (Hrkt) | New Tai Lue (Talu) | Tai Tham (Lana) |
Carian (Cari) | Imperial Aramaic (Armi) | Nko (Nkoo) | Tai Viet (Tavt) |
Cham (None) | Inherited (Zinh/Qaai) | Ogham (Ogam) | Tamil (Taml) |
Cherokee (Cher) | Inscriptional Pahlavi (Phli) | Ol Chiki (Olck) | Telugu (Telu) |
Common (Zyyy) | Inscriptional Parthian (Prti) | Old Italic (Ital) | Thaana (Thaa) |
Coptic (Copt/Qaac) | Javanese (Java) | Old Persian (Xpeo) | Thai (None) |
Cuneiform (Xsux) | Kaithi (Kthi) | Old South Arabian (Sarb) | Tibetan (Tibt) |
Cypriot (Cprt) | Kannada (Knda) | Old Turkic (Orkh) | Tifinagh (Tfng) |
Cyrillic (Cyrl) | Katakana (Kana) | Oriya (Orya) | Ugaritic (Ugar) |
Deseret (Dsrt) | Kayah Li (Kali) | Osmanya (Osma) | Unknown (Zzzz) |
Devanagari (Deva) | Kharoshthi (Khar) | Phags Pa (Phag) | Vai (Vaii) |
Egyptian Hieroglyphs (Egyp) | Khmer (Khmr) | Phoenician (Phnx) | Yi (Yiii) |
Ethiopic (Ethi) | Lao (Laoo) | Rejang (Rjng) | |
Georgian (Geor) | Latin (Latn) | Runic (Runr) | |
[3] See pcresyntax(3) at http://www.pcre.org/pcre.txt or http://ruby.runpaint.org/regexps#properties. |
Table A-9 shows a list of POSIX character classes.
Tables A-10 and A-11 list options and modifiers.
Table A-10. Options in regular expressions
Option | Description | Supported by |
---|---|---|
(?d) | Unix lines | Java |
(?i) | Case insensitive | PCRE, Perl, Java |
(?J) | Allow duplicate names | PCRE[a] |
(?m) | Multiline | PCRE, Perl, Java |
(?s) | Single line (dotall) | PCRE, Perl, Java |
(?u) | Unicode case | Java |
(?U) | Default match lazy | PCRE |
(?x) | Ignore whitespace, comments | PCRE, Perl, Java |
(?-…) | Unset or turn off options | PCRE |
[a] See “Named Subpatterns” in http://www.pcre.org/pcre.txt. |
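To see a few of these options in action, here is a minimal sketch using Python's `re` module, which is an assumption on my part since the table covers only PCRE, Perl, and Java. Python accepts the inline forms `(?i)`, `(?m)`, `(?s)`, and `(?x)` at the start of a pattern; the sample text and variable names are made up for illustration.

```python
import re

text = "First line\nsecond LINE\n"

# Case insensitive: matches both "line" and "LINE".
ci_matches = re.findall(r"(?i)line", text)

# Multiline: ^ anchors at the start of every line, not only the start of the text.
line_starts = re.findall(r"(?m)^\w+", text)

# Single line (dotall): the dot also matches the newline between the two lines.
without_dotall = re.match(r"First.*LINE", text)
with_dotall = re.match(r"(?s)First.*LINE", text)

# Ignore whitespace and comments: the pattern can be laid out and annotated.
phone = re.compile(r"""(?x)
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""")

print(ci_matches)
print(line_starts)
print(without_dotall is None, with_dotall is not None)
print(phone.fullmatch("555-1234") is not None)
```

Each option can also be passed as a flag argument, such as `re.IGNORECASE`, instead of being embedded in the pattern.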
Table A-12 is an ASCII code chart with regex cross-references.
Table A-12. ASCII code chart
Binary | Oct | Dec | Hex | Char | Kybd | Regex | Name |
---|---|---|---|---|---|---|---|
00000000 | 0 | 0 | 0 | NUL | ^@ | \c@ | Null character |
00000001 | 1 | 1 | 1 | SOH | ^A | \cA | Start of header |
00000010 | 2 | 2 | 2 | STX | ^B | \cB | Start of text |
00000011 | 3 | 3 | 3 | ETX | ^C | \cC | End of text |
00000100 | 4 | 4 | 4 | EOT | ^D | \cD | End of transmission |
00000101 | 5 | 5 | 5 | ENQ | ^E | \cE | Enquiry |
00000110 | 6 | 6 | 6 | ACK | ^F | \cF | Acknowledgment |
00000111 | 7 | 7 | 7 | BEL | ^G | \a, \cG | Bell |
00001000 | 10 | 8 | 8 | BS | ^H | [\b], \cH | Backspace |
00001001 | 11 | 9 | 9 | HT | ^I | \t, \cI | Horizontal tab |
00001010 | 12 | 10 | 0A | LF | ^J | \n, \cJ | Line feed |
00001011 | 13 | 11 | 0B | VT | ^K | \v, \cK | Vertical tab |
00001100 | 14 | 12 | 0C | FF | ^L | \f, \cL | Form feed |
00001101 | 15 | 13 | 0D | CR | ^M | \r, \cM | Carriage return |
00001110 | 16 | 14 | 0E | SO | ^N | \cN | Shift out |
00001111 | 17 | 15 | 0F | SI | ^O | \cO | Shift in |
00010000 | 20 | 16 | 10 | DLE | ^P | \cP | Data link escape |
00010001 | 21 | 17 | 11 | DC1 | ^Q | \cQ | Device control 1 (XON) |
00010010 | 22 | 18 | 12 | DC2 | ^R | \cR | Device control 2 |
00010011 | 23 | 19 | 13 | DC3 | ^S | \cS | Device control 3 (XOFF) |
00010100 | 24 | 20 | 14 | DC4 | ^T | \cT | Device control 4 |
00010101 | 25 | 21 | 15 | NAK | ^U | \cU | Negative acknowledgement |
00010110 | 26 | 22 | 16 | SYN | ^V | \cV | Synchronous idle |
00010111 | 27 | 23 | 17 | ETB | ^W | \cW | End of transmission block |
00011000 | 30 | 24 | 18 | CAN | ^X | \cX | Cancel |
00011001 | 31 | 25 | 19 | EM | ^Y | \cY | End of medium |
00011010 | 32 | 26 | 1A | SUB | ^Z | \cZ | Substitute |
00011011 | 33 | 27 | 1B | ESC | ^[ | \e, \c[ | Escape |
00011100 | 34 | 28 | 1C | FS | ^\ | \c\ | File separator |
00011101 | 35 | 29 | 1D | GS | ^] | \c] | Group separator |
00011110 | 36 | 30 | 1E | RS | ^^ | \c^ | Record separator |
00011111 | 37 | 31 | 1F | US | ^_ | \c_ | Unit separator |
00100000 | 40 | 32 | 20 | SP | SP | \s, [ ] | Space |
00100001 | 41 | 33 | 21 | ! | ! | ! | Exclamation mark |
00100010 | 42 | 34 | 22 | " | " | " | Quotation mark |
00100011 | 43 | 35 | 23 | # | # | # | Number sign |
00100100 | 44 | 36 | 24 | $ | $ | \$ | Dollar sign |
00100101 | 45 | 37 | 25 | % | % | % | Percent sign |
00100110 | 46 | 38 | 26 | & | & | & | Ampersand |
00100111 | 47 | 39 | 27 | ' | ' | ' | Apostrophe |
00101000 | 50 | 40 | 28 | ( | ( | \(, [(] | Left parenthesis |
00101001 | 51 | 41 | 29 | ) | ) | \), [)] | Right parenthesis |
00101010 | 52 | 42 | 2A | * | * | \*, [*] | Asterisk |
00101011 | 53 | 43 | 2B | + | + | \+, [+] | Plus sign |
00101100 | 54 | 44 | 2C | , | , | , | Comma |
00101101 | 55 | 45 | 2D | - | - | - | Hyphen-minus |
00101110 | 56 | 46 | 2E | . | . | \., [.] | Full stop |
00101111 | 57 | 47 | 2F | / | / | / | Solidus |
00110000 | 60 | 48 | 30 | 0 | 0 | \d, [0] | Digit zero |
00110001 | 61 | 49 | 31 | 1 | 1 | \d, [1] | Digit one |
00110010 | 62 | 50 | 32 | 2 | 2 | \d, [2] | Digit two |
00110011 | 63 | 51 | 33 | 3 | 3 | \d, [3] | Digit three |
00110100 | 64 | 52 | 34 | 4 | 4 | \d, [4] | Digit four |
00110101 | 65 | 53 | 35 | 5 | 5 | \d, [5] | Digit five |
00110110 | 66 | 54 | 36 | 6 | 6 | \d, [6] | Digit six |
00110111 | 67 | 55 | 37 | 7 | 7 | \d, [7] | Digit seven |
00111000 | 70 | 56 | 38 | 8 | 8 | \d, [8] | Digit eight |
00111001 | 71 | 57 | 39 | 9 | 9 | \d, [9] | Digit nine |
00111010 | 72 | 58 | 3A | : | : | : | Colon |
00111011 | 73 | 59 | 3B | ; | ; | ; | Semicolon |
00111100 | 74 | 60 | 3C | < | < | < | Less-than sign |
00111101 | 75 | 61 | 3D | = | = | = | Equals sign |
00111110 | 76 | 62 | 3E | > | > | > | Greater-than sign |
00111111 | 77 | 63 | 3F | ? | ? | \?, [?] | Question mark |
01000000 | 100 | 64 | 40 | @ | @ | @ | Commercial at |
01000001 | 101 | 65 | 41 | A | A | \w, [A] | Latin capital letter A |
01000010 | 102 | 66 | 42 | B | B | \w, [B] | Latin capital letter B |
01000011 | 103 | 67 | 43 | C | C | \w, [C] | Latin capital letter C |
01000100 | 104 | 68 | 44 | D | D | \w, [D] | Latin capital letter D |
01000101 | 105 | 69 | 45 | E | E | \w, [E] | Latin capital letter E |
01000110 | 106 | 70 | 46 | F | F | \w, [F] | Latin capital letter F |
01000111 | 107 | 71 | 47 | G | G | \w, [G] | Latin capital letter G |
01001000 | 110 | 72 | 48 | H | H | \w, [H] | Latin capital letter H |
01001001 | 111 | 73 | 49 | I | I | \w, [I] | Latin capital letter I |
01001010 | 112 | 74 | 4A | J | J | \w, [J] | Latin capital letter J |
01001011 | 113 | 75 | 4B | K | K | \w, [K] | Latin capital letter K |
01001100 | 114 | 76 | 4C | L | L | \w, [L] | Latin capital letter L |
01001101 | 115 | 77 | 4D | M | M | \w, [M] | Latin capital letter M |
01001110 | 116 | 78 | 4E | N | N | \w, [N] | Latin capital letter N |
01001111 | 117 | 79 | 4F | O | O | \w, [O] | Latin capital letter O |
01010000 | 120 | 80 | 50 | P | P | \w, [P] | Latin capital letter P |
01010001 | 121 | 81 | 51 | Q | Q | \w, [Q] | Latin capital letter Q |
01010010 | 122 | 82 | 52 | R | R | \w, [R] | Latin capital letter R |
01010011 | 123 | 83 | 53 | S | S | \w, [S] | Latin capital letter S |
01010100 | 124 | 84 | 54 | T | T | \w, [T] | Latin capital letter T |
01010101 | 125 | 85 | 55 | U | U | \w, [U] | Latin capital letter U |
01010110 | 126 | 86 | 56 | V | V | \w, [V] | Latin capital letter V |
01010111 | 127 | 87 | 57 | W | W | \w, [W] | Latin capital letter W |
01011000 | 130 | 88 | 58 | X | X | \w, [X] | Latin capital letter X |
01011001 | 131 | 89 | 59 | Y | Y | \w, [Y] | Latin capital letter Y |
01011010 | 132 | 90 | 5A | Z | Z | \w, [Z] | Latin capital letter Z |
01011011 | 133 | 91 | 5B | [ | [ | \[ | Left square bracket |
01011100 | 134 | 92 | 5C | \ | \ | \\ | Reverse solidus |
01011101 | 135 | 93 | 5D | ] | ] | \] | Right square bracket |
01011110 | 136 | 94 | 5E | ^ | ^ | \^ | Circumflex accent |
01011111 | 137 | 95 | 5F | _ | _ | _, [_] | Low line |
01100000 | 140 | 96 | 60 | ` | ` | \` | Grave accent |
01100001 | 141 | 97 | 61 | a | a | \w, [a] | Latin small letter A |
01100010 | 142 | 98 | 62 | b | b | \w, [b] | Latin small letter B |
01100011 | 143 | 99 | 63 | c | c | \w, [c] | Latin small letter C |
01100100 | 144 | 100 | 64 | d | d | \w, [d] | Latin small letter D |
01100101 | 145 | 101 | 65 | e | e | \w, [e] | Latin small letter E |
01100110 | 146 | 102 | 66 | f | f | \w, [f] | Latin small letter F |
01100111 | 147 | 103 | 67 | g | g | \w, [g] | Latin small letter G |
01101000 | 150 | 104 | 68 | h | h | \w, [h] | Latin small letter H |
01101001 | 151 | 105 | 69 | i | i | \w, [i] | Latin small letter I |
01101010 | 152 | 106 | 6A | j | j | \w, [j] | Latin small letter J |
01101011 | 153 | 107 | 6B | k | k | \w, [k] | Latin small letter K |
01101100 | 154 | 108 | 6C | l | l | \w, [l] | Latin small letter L |
01101101 | 155 | 109 | 6D | m | m | \w, [m] | Latin small letter M |
01101110 | 156 | 110 | 6E | n | n | \w, [n] | Latin small letter N |
01101111 | 157 | 111 | 6F | o | o | \w, [o] | Latin small letter O |
01110000 | 160 | 112 | 70 | p | p | \w, [p] | Latin small letter P |
01110001 | 161 | 113 | 71 | q | q | \w, [q] | Latin small letter Q |
01110010 | 162 | 114 | 72 | r | r | \w, [r] | Latin small letter R |
01110011 | 163 | 115 | 73 | s | s | \w, [s] | Latin small letter S |
01110100 | 164 | 116 | 74 | t | t | \w, [t] | Latin small letter T |
01110101 | 165 | 117 | 75 | u | u | \w, [u] | Latin small letter U |
01110110 | 166 | 118 | 76 | v | v | \w, [v] | Latin small letter V |
01110111 | 167 | 119 | 77 | w | w | \w, [w] | Latin small letter W |
01111000 | 170 | 120 | 78 | x | x | \w, [x] | Latin small letter X |
01111001 | 171 | 121 | 79 | y | y | \w, [y] | Latin small letter Y |
01111010 | 172 | 122 | 7A | z | z | \w, [z] | Latin small letter Z |
01111011 | 173 | 123 | 7B | { | { | \{ | Left curly brace |
01111100 | 174 | 124 | 7C | | | | | \| | Vertical line (Bar) |
01111101 | 175 | 125 | 7D | } | } | \} | Right curly brace |
01111110 | 176 | 126 | 7E | ~ | ~ | \~ | Tilde |
01111111 | 177 | 127 | 7F | DEL | ^? | \c? | Delete |
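The numeric columns of Table A-12 can be regenerated for any ASCII character, which is a handy way to double-check a row. The sketch below assumes Python, and the function name `ascii_row` is hypothetical. It pads hex values to two digits, so the first few rows print as 00, 01, and so on rather than 0 and 1.

```python
def ascii_row(ch: str) -> str:
    """Format the Binary, Oct, Dec, and Hex columns for one ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is outside the ASCII range")
    # 08b = 8-digit binary, o = octal, 02X = 2-digit uppercase hex.
    return f"{code:08b} | {code:o} | {code} | {code:02X}"


for ch in ("A", "`", "\n"):
    print(ascii_row(ch))
```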
You can find Ken Thompson and Dennis Ritchie’s QED memo-cum-manual at http://cm.bell-labs.com/cm/cs/who/dmr/qedman.pdf.