Appendix A. Regular Expression Reference

This appendix is a reference for regular expressions.

Regular Expressions in QED

QED (short for Quick Editor) was originally written for the Berkeley Time-Sharing System, which ran on the Scientific Data Systems SDS 940. A rewrite of the original QED editor by Ken Thompson for MIT’s Compatible Time-Sharing System yielded one of the earliest (if not the first) practical implementations of regular expressions in computing. Table A-1, taken from pages 3 and 4 of a 1970 Bell Labs memo, outlines the regex features in QED. It amazes me that most of this syntax has remained in use to this day, over 40 years later.

Table A-1. QED regular expressions

Feature	Description
literal	“a) An ordinary character [literal] is a regular expression which matches that character.”
^	“b) ^ is a regular expression which matches the null character at the beginning of a line.”
$	“c) $ is a regular expression which matches the null character before the character <nl> [newline] (usually at the end of a line).”
.	“d) . is a regular expression which matches any character except <nl> [newline].”
[<string>]	“e) “[<string>]” is a regular expression which matches any of the characters in the <string> and no others.”
[^<string>]	“f) “[^<string>]” is a regular expression which matches any character but <nl> [newline] and the characters of the <string>.”
*	“g) A regular expression followed by “*” is a regular expression which matches any number (including zero) of adjacent occurrences of the text matched by the regular expression.”
	“h) Two adjacent regular expressions form a regular expression which matches adjacent occurrences of the text matched by the regular expressions.”
|	“i) Two regular expressions separated by “|” form a regular expression which matches the text matched by either of the regular expressions.”
( )	“j) A regular expression in parentheses is a regular expression which matches the same text as the original regular expression. Parentheses are used to alter the order of evaluation implied by g), h), and i): a(b|c)d will match abd or acd, while ab|cd matches ab or cd.”
{ }	“k) If “<regexp>” is a regular expression, “{<regexp>}x” is a regular expression, where x is any character. This regular expression matches the same things as <regexp>; it has certain side effects as explained under the Substitute command.” [The Substitute command was formed (.,.)S/<regexp>/<string>/ (see page 13 of the memo), similar to the way it is still used in programs like sed and Perl.]
\E	“l) If <rexname> is the name of a regular expression named by the E command (below), then “\E<rexname>” is a regular expression which matches the same things as the regular expression specified in the E command. More discussion is presented under the E command.” [The \E command allowed you to name a regular expression and repeat its use by name.]
	“m) The null regular expression standing alone is equivalent to the last regular expression encountered. Initially the null regular expression is undefined; it also becomes undefined after an erroneous regular expression and after use of the E command.”
	“n) Nothing else is a regular expression.”
	“o) No regular expression will match text spread across more than one line.”
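
Items g) through j) still behave the same way in modern engines. The following is a minimal sketch that tries the memo's own example patterns from item j); using Python's re module here is my assumption, since the memo describes QED, not Python.

```python
import re

# Patterns taken from item j) of the memo.
grouped = re.compile(r"a(b|c)d")    # alternation limited by the parentheses
ungrouped = re.compile(r"ab|cd")    # alternation spans each whole side

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [text for text in candidates if pattern.fullmatch(text)]

candidates = ["abd", "acd", "ab", "cd"]
grouped_hits = full_matches(grouped, candidates)
ungrouped_hits = full_matches(ungrouped, candidates)
print(grouped_hits)    # ['abd', 'acd']
print(ungrouped_hits)  # ['ab', 'cd']
```

The helper name full_matches is hypothetical and exists only for this illustration.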

Metacharacters

There are 14 metacharacters used in regular expressions, each with a special meaning, as described in Table A-2. If you want to use one of these characters as a literal, you must precede it with a backslash to escape it. For example, you would escape the dollar sign like this: \$, and a backslash like this: \\.

Table A-2. Metacharacters in regular expressions

Metacharacter	Name	Code Point	Purpose
.	Full Stop	U+002E	Match any character
\	Backslash	U+005C	Escape a character
|	Vertical Bar	U+007C	Alternation (or)
^	Circumflex	U+005E	Beginning of a line anchor
$	Dollar Sign	U+0024	End of a line anchor
?	Question Mark	U+003F	Zero or one quantifier
*	Asterisk	U+002A	Zero or more quantifier
+	Plus Sign	U+002B	One or more quantifier
[	Left Square Bracket	U+005B	Open character class
]	Right Square Bracket	U+005D	Close character class
{	Left Curly Brace	U+007B	Open quantifier or block
}	Right Curly Brace	U+007D	Close quantifier or block
(	Left Parenthesis	U+0028	Open group
)	Right Parenthesis	U+0029	Close group
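
To see backslash escaping at work, here is a short sketch in Python (my choice of language, not something the table prescribes). It escapes each of the 14 metacharacters by hand and also uses re.escape, a standard library function that escapes a whole string for you. The helper name match_literally is hypothetical.

```python
import re

# The 14 metacharacters listed in Table A-2.
METACHARACTERS = r".\|^$?*+[]{}()"

def match_literally(text, subject):
    """Search subject for text, treating every character of text as a literal."""
    return re.search(re.escape(text), subject)

# A backslash in front of any metacharacter makes it match itself.
escaped_ok = all(re.fullmatch("\\" + ch, ch) for ch in METACHARACTERS)

price = match_literally("$5.00", "Total: $5.00")
hand_escaped = re.search(r"\$5\.00", "Total: $5.00")
print(escaped_ok)            # True
print(price.group())         # $5.00
print(hand_escaped.group())  # $5.00
```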

Character Shorthands

Table A-3 lists character shorthands used in regular expressions.

Table A-3. Character shorthands

Character Shorthand	Description
\a	Alert
\b	Word boundary
[\b]	Backspace character
\B	Non-word boundary
\cx	Control character
\d	Digit character
\D	Non-digit character
\dxxx	Decimal value for a character
\f	Form feed character
\r	Carriage return
\n	Newline character
\oxxx	Octal value for a character
\s	Whitespace character
\S	Non-whitespace character
\t	Horizontal tab character
\v	Vertical tab character
\w	Word character
\W	Non-word character
\0	Null character
\xxx	Hexadecimal value for a character
\uxxxx	Unicode value for a character
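
Support for these shorthands varies from engine to engine, so treat the following as a sketch for one engine only. It tries several entries from Table A-3 with Python's re module, which I am assuming here purely for illustration; the sample string is made up.

```python
import re

sample = "Room 42\tbuilt 1970"

digits = re.findall(r"\d+", sample)        # runs of digit characters
words = re.findall(r"\b\w+\b", sample)     # whole words between word boundaries
has_tab = re.search(r"\t", sample) is not None
backspace = re.fullmatch(r"[\b]", "\x08")  # inside a class, \b means backspace
by_hex = re.fullmatch(r"\x41", "A")        # \x followed by two hex digits
by_unicode = re.fullmatch(r"\u0041", "A")  # \u followed by four hex digits

print(digits)  # ['42', '1970']
print(words)   # ['Room', '42', 'built', '1970']
```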

Whitespace

Table A-4 is a list of character shorthands for whitespace.

Table A-4. Whitespace characters

Character Shorthand	Description
\f	Form feed
\h	Horizontal whitespace
\H	Not horizontal whitespace
\n	Newline
\r	Carriage return
\t	Horizontal tab
\v	Vertical whitespace
\V	Not vertical whitespace

Unicode Whitespace Characters

Whitespace characters in Unicode are listed in Table A-5.

Table A-5. Whitespace characters in Unicode

Abbreviation or Nickname	Name	Unicode Code Point	Regex
HT	Horizontal tab	U+0009	\u0009 or \t
LF	Line feed	U+000A	\u000A or \n
VT	Vertical tab	U+000B	\u000B or \v
FF	Form feed	U+000C	\u000C or \f
CR	Carriage return	U+000D	\u000D or \r
SP	Space	U+0020	\u0020 or \s^[a]
NEL	Next line	U+0085	\u0085
NBSP	No-break space	U+00A0	\u00A0
—	Ogham space mark	U+1680	\u1680
MVS	Mongolian vowel separator	U+180E	\u180E
BOM	Byte order mark	U+FEFF	\uFEFF
NQSP	En quad	U+2000	\u2000
MQSP, Mutton Quad	Em quad	U+2001	\u2001
ENSP, Nut	En space	U+2002	\u2002
EMSP, Mutton	Em space	U+2003	\u2003
3MSP, Thick space	Three-per-em space	U+2004	\u2004
4MSP, Mid space	Four-per-em space	U+2005	\u2005
6/MSP	Six-per-em space	U+2006	\u2006
FSP	Figure space	U+2007	\u2007
PSP	Punctuation space	U+2008	\u2008
THSP	Thin space	U+2009	\u2009
HSP	Hair space	U+200A	\u200A
ZWSP	Zero width space	U+200B	\u200B
LSEP	Line separator	U+2028	\u2028
PSEP	Paragraph separator	U+2029	\u2029
NNBSP	Narrow no-break space	U+202F	\u202F
MMSP	Medium mathematical space	U+205F	\u205F
IDSP	Ideographic space	U+3000	\u3000
^[a]Also matches other whitespace.
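
The \u forms in the Regex column can be used directly in a pattern where the engine supports them. As a small sketch, assuming Python's re module (which accepts \uhhhh in string patterns), the following matches three of the characters from Table A-5 and then replaces them with an ordinary space. The sample text is invented for the example.

```python
import re

# A few rows from Table A-5, written with \u escapes inside the pattern.
NO_BREAK_SPACE = re.compile(r"\u00A0")
EM_SPACE = re.compile(r"\u2003")
IDEOGRAPHIC_SPACE = re.compile(r"\u3000")

text = "a\u00a0b\u2003c\u3000d"

found = [
    bool(NO_BREAK_SPACE.search(text)),
    bool(EM_SPACE.search(text)),
    bool(IDEOGRAPHIC_SPACE.search(text)),
]
# Replace each listed character with an ordinary space (U+0020).
normalized = re.sub(r"[\u00A0\u2003\u3000]", " ", text)
print(found)       # [True, True, True]
print(normalized)  # a b c d
```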

Control Characters

Table A-6 shows a way to match control characters in regular expressions.

Table A-6. Matching control characters

Control Character	Unicode Value	Abbreviation	Name
\c@^[a]	U+0000	NUL	Null
\cA	U+0001	SOH	Start of heading
\cB	U+0002	STX	Start of text
\cC	U+0003	ETX	End of text
\cD	U+0004	EOT	End of transmission
\cE	U+0005	ENQ	Enquiry
\cF	U+0006	ACK	Acknowledge
\cG	U+0007	BEL	Bell
\cH	U+0008	BS	Backspace
\cI	U+0009	HT	Character tabulation or horizontal tab
\cJ	U+000A	LF	Line feed (newline, end of line)
\cK	U+000B	VT	Line tabulation or vertical tab
\cL	U+000C	FF	Form feed
\cM	U+000D	CR	Carriage return
\cN	U+000E	SO	Shift out
\cO	U+000F	SI	Shift in
\cP	U+0010	DLE	Data link escape
\cQ	U+0011	DC1	Device control one
\cR	U+0012	DC2	Device control two
\cS	U+0013	DC3	Device control three
\cT	U+0014	DC4	Device control four
\cU	U+0015	NAK	Negative acknowledge
\cV	U+0016	SYN	Synchronous idle
\cW	U+0017	ETB	End of transmission block
\cX	U+0018	CAN	Cancel
\cY	U+0019	EM	End of medium
\cZ	U+001A	SUB	Substitute
\c[	U+001B	ESC	Escape
\c\	U+001C	FS	Information separator four
\c]	U+001D	GS	Information separator three
\c^	U+001E	RS	Information separator two
\c_	U+001F	US	Information separator one
^[a]Can use upper- or lowercase. For example, `\cA` or `\ca` are equivalent; however, Java implementations require uppercase.
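
The pattern in Table A-6 is regular: the code point is that of the character after \c with bit 6 flipped (an exclusive OR with 0x40). Python's re module does not provide a \cX escape, so the sketch below, which is only an illustration with hypothetical helper names, computes the code point and builds an equivalent pattern from it.

```python
import re

def control_code_point(letter):
    # Uppercase the character, then flip bit 6 (XOR with 0x40).
    return ord(letter.upper()) ^ 0x40

def control_pattern(letter):
    # Compile a pattern that matches the control character itself.
    return re.compile(re.escape(chr(control_code_point(letter))))

bell = control_pattern("G")
print(control_code_point("@"))        # 0
print(control_code_point("["))        # 27
print(bool(bell.search("ding\x07")))  # True
```

Because the helper uppercases its argument, "a" and "A" give the same result.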

Character Properties

Table A-7 lists character property names for use with \p{property} or \P{property}.

Table A-7. Character properties^[2]

Property	Description
C	Other
Cc	Control
Cf	Format
Cn	Unassigned
Co	Private use
Cs	Surrogate
L	Letter
Ll	Lowercase letter
Lm	Modifier letter
Lo	Other letter
Lt	Title case letter
Lu	Uppercase letter
L&	Ll, Lu, or Lt
M	Mark
Mc	Spacing mark
Me	Enclosing mark
Mn	Non-spacing mark
N	Number
Nd	Decimal number
Nl	Letter number
No	Other number
P	Punctuation
Pc	Connector punctuation
Pd	Dash punctuation
Pe	Close punctuation
Pf	Final punctuation
Pi	Initial punctuation
Po	Other punctuation
Ps	Open punctuation
S	Symbol
Sc	Currency symbol
Sk	Modifier symbol
Sm	Mathematical symbol
So	Other symbol
Z	Separator
Zl	Line separator
Zp	Paragraph separator
Zs	Space separator
^[2]See pcresyntax(3) at http://www.pcre.org/pcre.txt.
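
Not every engine accepts \p{property}. Python's built-in re module, for one, does not, so the following sketch approximates a property test with unicodedata.category from the standard library, which returns the same two-letter codes listed in Table A-7. The function names has_property and find_all are hypothetical, and this is a rough stand-in, not a replacement for real property support.

```python
import unicodedata

def has_property(char, prop):
    """Rough stand-in for a property test against the general category."""
    category = unicodedata.category(char)   # two-letter code such as 'Lu'
    if prop == "L&":
        return category in ("Ll", "Lu", "Lt")
    # A one-letter property such as 'L' covers every category starting with it.
    return category.startswith(prop)

def find_all(text, prop):
    """Return the characters of text that have the given property."""
    return [ch for ch in text if has_property(ch, prop)]

sample = "A\u00f1o 2024 \u20ac"          # 'Año 2024 €'
uppercase = find_all(sample, "Lu")       # ['A']
letters = find_all(sample, "L")          # ['A', '\u00f1', 'o']
digits = find_all(sample, "Nd")          # ['2', '0', '2', '4']
currency = find_all(sample, "Sc")        # ['\u20ac']
```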

Script Names for Character Properties

Table A-8 shows the language script names for use with \p{property} or \P{property}.

Table A-8. Script names^[3]

Arabic (Arab)	Glagolitic (Glag)	Lepcha (Lepc)	Samaritan (Samr)
Armenian (Armn)	Gothic (Goth)	Limbu (Limb)	Saurashtra (Saur)
Avestan (Avst)	Greek (Grek)	Linear B (Linb)	Shavian (Shaw)
Balinese (Bali)	Gujarati (Gujr)	Lisu (Lisu)	Sinhala (Sinh)
Bamum (Bamu)	Gurmukhi (Guru)	Lycian (Lyci)	Sundanese (Sund)
Bengali (Beng)	Han (Hani)	Lydian (Lydi)	Syloti Nagri (Sylo)
Bopomofo (Bopo)	Hangul (Hang)	Malayalam (Mlym)	Syriac (Syrc)
Braille (Brai)	Hanunoo (Hano)	Meetei Mayek (Mtei)	Tagalog (Tglg)
Buginese (Bugi)	Hebrew (Hebr)	Mongolian (Mong)	Tagbanwa (Tagb)
Buhid (Buhd)	Hiragana (Hira)	Myanmar (Mymr)	Tai Le (Tale)
Canadian Aboriginal (Cans)	Katakana or Hiragana (Hrkt)	New Tai Lue (Talu)	Tai Tham (Lana)
Carian (Cari)	Imperial Aramaic (Armi)	Nko (Nkoo)	Tai Viet (Tavt)
Cham (Cham)	Inherited (Zinh/Qaai)	Ogham (Ogam)	Tamil (Taml)
Cherokee (Cher)	Inscriptional Pahlavi (Phli)	Ol Chiki (Olck)	Telugu (Telu)
Common (Zyyy)	Inscriptional Parthian (Prti)	Old Italic (Ital)	Thaana (Thaa)
Coptic (Copt/Qaac)	Javanese (Java)	Old Persian (Xpeo)	Thai (Thai)
Cuneiform (Xsux)	Kaithi (Kthi)	Old South Arabian (Sarb)	Tibetan (Tibt)
Cypriot (Cprt)	Kannada (Knda)	Old Turkic (Orkh)	Tifinagh (Tfng)
Cyrillic (Cyrl)	Katakana (Kana)	Oriya (Orya)	Ugaritic (Ugar)
Deseret (Dsrt)	Kayah Li (Kali)	Osmanya (Osma)	Unknown (Zzzz)
Devanagari (Deva)	Kharoshthi (Khar)	Phags Pa (Phag)	Vai (Vaii)
Egyptian Hieroglyphs (Egyp)	Khmer (Khmr)	Phoenician (Phnx)	Yi (Yiii)
Ethiopic (Ethi)	Lao (Laoo)	Rejang (Rjng)
Georgian (Geor)	Latin (Latn)	Runic (Runr)
^[3]See pcresyntax(3) at http://www.pcre.org/pcre.txt or http://ruby.runpaint.org/regexps#properties.

POSIX Character Classes

Table A-9 shows a list of POSIX character classes.

Table A-9. POSIX character classes

Character Class	Description
[[:alnum:]]	Alphanumeric characters (letters and digits)
[[:alpha:]]	Alphabetic characters (letters)
[[:ascii:]]	ASCII characters (all 128)
[[:blank:]]	Blank characters
[[:cntrl:]]	Control characters
[[:digit:]]	Digits
[[:graph:]]	Graphic characters
[[:lower:]]	Lowercase letters
[[:print:]]	Printable characters
[[:punct:]]	Punctuation characters
[[:space:]]	Whitespace characters
[[:upper:]]	Uppercase letters
[[:word:]]	Word characters
[[:xdigit:]]	Hexadecimal digits
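
Engines that lack POSIX classes can often get by with plain ranges. The sketch below rewrites a handful of the classes into ASCII-only ranges and runs the result with Python's re module. The mapping is my own assumption and is valid only for ASCII text in the POSIX ("C") locale; the names POSIX_ASCII and expand_posix are hypothetical.

```python
import re

# ASCII-only equivalents for a few classes. This mapping is an illustrative
# assumption, not something defined in Table A-9.
POSIX_ASCII = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace each [:name:] token with its ASCII range text.

    Raises KeyError for a class name that is not in POSIX_ASCII.
    """
    return re.sub(
        r"\[:(\w+):\]",
        lambda m: POSIX_ASCII[m.group(1)],
        pattern,
    )

expanded = expand_posix("[[:upper:]][[:lower:]]+[[:digit:]]*")
print(expanded)                                 # [A-Z][a-z]+[0-9]*
print(bool(re.fullmatch(expanded, "Qed1970")))  # True
```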

Options/Modifiers

Tables A-10 and A-11 list options and modifiers.

Table A-10. Options in regular expressions

Option	Description	Supported by
`(?d)`	Unix lines	Java
`(?i)`	Case insensitive	PCRE, Perl, Java
`(?J)`	Allow duplicate names	PCRE^[a]
`(?m)`	Multiline	PCRE, Perl, Java
`(?s)`	Single line (dotall)	PCRE, Perl, Java
`(?u)`	Unicode case	Java
`(?U)`	Default match lazy	PCRE
`(?x)`	Ignore whitespace, comments	PCRE, Perl, Java
`(?-…)`	Unset or turn off options	PCRE
^[a]See “Named Subpatterns” in http://www.pcre.org/pcre.txt.

Table A-11. Perl modifiers (flags)^[4]

Modifier	Description
a	Match `\d`, `\s`, `\w` and POSIX in ASCII range only
c	Keep current position after match fails
d	Use default, native rules of the platform
g	Global matching
i	Case-insensitive matching
l	Use current locale’s rules
m	Multiline strings
p	Preserve the matched string
s	Treat strings as a single line
u	Use Unicode rules when matching
x	Ignore whitespace and comments
^[4]See http://perldoc.perl.org/perlre.html#Modifiers.
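
Several of the inline options in Table A-10 work in other engines as well. As a hedged illustration, Python's re module (not listed in the table, so this is my assumption about a different engine) accepts (?i), (?m), (?s), and (?x) at the start of a pattern. The sample text is invented.

```python
import re

text = "QED\nqed editor"

case_insensitive = re.findall(r"(?i)qed", text)  # ['QED', 'qed']
multiline = re.findall(r"(?m)^\w+", text)        # ['QED', 'qed']
without_dotall = re.search(r"QED.qed", text)     # None: . stops at a newline
with_dotall = re.search(r"(?s)QED.qed", text)    # . now matches the newline
verbose = re.search(r"""(?x)
    q e d      # whitespace and comments are ignored
    \s+
    editor
""", text)
print(verbose.group())  # qed editor
```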

ASCII Code Chart with Regex

Table A-12 is an ASCII code chart with regex cross-references.

Table A-12. ASCII code chart

Binary	Oct	Dec	Hex	Char	Kybd	Regex	Name
00000000	0	0	0	NUL	^@	\c@	Null character
00000001	1	1	1	SOH	^A	\cA	Start of header
00000010	2	2	2	STX	^B	\cB	Start of text
00000011	3	3	3	ETX	^C	\cC	End of text
00000100	4	4	4	EOT	^D	\cD	End of transmission
00000101	5	5	5	ENQ	^E	\cE	Enquiry
00000110	6	6	6	ACK	^F	\cF	Acknowledgment
00000111	7	7	7	BEL	^G	\a, \cG	Bell
00001000	10	8	8	BS	^H	[\b], \cH	Backspace
00001001	11	9	9	HT	^I	\t, \cI	Horizontal tab
00001010	12	10	0A	LF	^J	\n, \cJ	Line feed
00001011	13	11	0B	VT	^K	\v, \cK	Vertical tab
00001100	14	12	0C	FF	^L	\f, \cL	Form feed
00001101	15	13	0D	CR	^M	\r, \cM	Carriage return
00001110	16	14	0E	SO	^N	\cN	Shift out
00001111	17	15	0F	SI	^O	\cO	Shift in
00010000	20	16	10	DLE	^P	\cP	Data link escape
00010001	21	17	11	DC1	^Q	\cQ	Device control 1 (XON)
00010010	22	18	12	DC2	^R	\cR	Device control 2
00010011	23	19	13	DC3	^S	\cS	Device control 3 (XOFF)
00010100	24	20	14	DC4	^T	\cT	Device control 4
00010101	25	21	15	NAK	^U	\cU	Negative acknowledgement
00010110	26	22	16	SYN	^V	\cV	Synchronous idle
00010111	27	23	17	ETB	^W	\cW	End of transmission block
00011000	30	24	18	CAN	^X	\cX	Cancel
00011001	31	25	19	EM	^Y	\cY	End of medium
00011010	32	26	1A	SUB	^Z	\cZ	Substitute
00011011	33	27	1B	ESC	^[	\e, \c[	Escape
00011100	34	28	1C	FS	^\	\c\	File separator
00011101	35	29	1D	GS	^]	\c]	Group separator
00011110	36	30	1E	RS	^^	\c^	Record separator
00011111	37	31	1F	US	^_	\c_	Unit separator
00100000	40	32	20	SP	SP	\s, [ ]	Space
00100001	41	33	21	!	!	!	Exclamation mark
00100010	42	34	22	"	"	"	Quotation mark
00100011	43	35	23	#	#	#	Number sign
00100100	44	36	24	$	$	\$	Dollar sign
00100101	45	37	25	%	%	%	Percent sign
00100110	46	38	26	&	&	&	Ampersand
00100111	47	39	27	'	'	'	Apostrophe
00101000	50	40	28	(	(	(, \(	Left parenthesis
00101001	51	41	29	)	)	), \)	Right parenthesis
00101010	52	42	2A	*	*	\*	Asterisk
00101011	53	43	2B	+	+	\+	Plus sign
00101100	54	44	2C	,	,	,	Comma
00101101	55	45	2D	-	-	-	Hyphen-minus
00101110	56	46	2E	.	.	\., [.]	Full stop
00101111	57	47	2F	/	/	/	Solidus
00110000	60	48	30	0	0	\d, [0]	Digit zero
00110001	61	49	31	1	1	\d, [1]	Digit one
00110010	62	50	32	2	2	\d, [2]	Digit two
00110011	63	51	33	3	3	\d, [3]	Digit three
00110100	64	52	34	4	4	\d, [4]	Digit four
00110101	65	53	35	5	5	\d, [5]	Digit five
00110110	66	54	36	6	6	\d, [6]	Digit six
00110111	67	55	37	7	7	\d, [7]	Digit seven
00111000	70	56	38	8	8	\d, [8]	Digit eight
00111001	71	57	39	9	9	\d, [9]	Digit nine
00111010	72	58	3A	:	:	:	Colon
00111011	73	59	3B	;	;	;	Semicolon
00111100	74	60	3C	<	<	<	Less-than sign
00111101	75	61	3D	=	=	=	Equals sign
00111110	76	62	3E	>	>	>	Greater-than sign
00111111	77	63	3F	?	?	\?	Question mark
01000000	100	64	40	@	@	@	Commercial at
01000001	101	65	41	A	A	\w, [A]	Latin capital letter A
01000010	102	66	42	B	B	\w, [B]	Latin capital letter B
01000011	103	67	43	C	C	\w, [C]	Latin capital letter C
01000100	104	68	44	D	D	\w, [D]	Latin capital letter D
01000101	105	69	45	E	E	\w, [E]	Latin capital letter E
01000110	106	70	46	F	F	\w, [F]	Latin capital letter F
01000111	107	71	47	G	G	\w, [G]	Latin capital letter G
01001000	110	72	48	H	H	\w, [H]	Latin capital letter H
01001001	111	73	49	I	I	\w, [I]	Latin capital letter I
01001010	112	74	4A	J	J	\w, [J]	Latin capital letter J
01001011	113	75	4B	K	K	\w, [K]	Latin capital letter K
01001100	114	76	4C	L	L	\w, [L]	Latin capital letter L
01001101	115	77	4D	M	M	\w, [M]	Latin capital letter M
01001110	116	78	4E	N	N	\w, [N]	Latin capital letter N
01001111	117	79	4F	O	O	\w, [O]	Latin capital letter O
01010000	120	80	50	P	P	\w, [P]	Latin capital letter P
01010001	121	81	51	Q	Q	\w, [Q]	Latin capital letter Q
01010010	122	82	52	R	R	\w, [R]	Latin capital letter R
01010011	123	83	53	S	S	\w, [S]	Latin capital letter S
01010100	124	84	54	T	T	\w, [T]	Latin capital letter T
01010101	125	85	55	U	U	\w, [U]	Latin capital letter U
01010110	126	86	56	V	V	\w, [V]	Latin capital letter V
01010111	127	87	57	W	W	\w, [W]	Latin capital letter W
01011000	130	88	58	X	X	\w, [X]	Latin capital letter X
01011001	131	89	59	Y	Y	\w, [Y]	Latin capital letter Y
01011010	132	90	5A	Z	Z	\w, [Z]	Latin capital letter Z
01011011	133	91	5B	[	[	\[	Left square bracket
01011100	134	92	5C	\	\	\\	Reverse solidus
01011101	135	93	5D	]	]	\]	Right square bracket
01011110	136	94	5E	^	^	^, [^]	Circumflex accent
01011111	137	95	5F	_	_	_, [_]	Low line
01100000	140	96	60	`	`	\`	Grave accent
01100001	141	97	61	a	a	\w, [a]	Latin small letter A
01100010	142	98	62	b	b	\w, [b]	Latin small letter B
01100011	143	99	63	c	c	\w, [c]	Latin small letter C
01100100	144	100	64	d	d	\w, [d]	Latin small letter D
01100101	145	101	65	e	e	\w, [e]	Latin small letter E
01100110	146	102	66	f	f	\w, [f]	Latin small letter F
01100111	147	103	67	g	g	\w, [g]	Latin small letter G
01101000	150	104	68	h	h	\w, [h]	Latin small letter H
01101001	151	105	69	i	i	\w, [i]	Latin small letter I
01101010	152	106	6A	j	j	\w, [j]	Latin small letter J
01101011	153	107	6B	k	k	\w, [k]	Latin small letter K
01101100	154	108	6C	l	l	\w, [l]	Latin small letter L
01101101	155	109	6D	m	m	\w, [m]	Latin small letter M
01101110	156	110	6E	n	n	\w, [n]	Latin small letter N
01101111	157	111	6F	o	o	\w, [o]	Latin small letter O
01110000	160	112	70	p	p	\w, [p]	Latin small letter P
01110001	161	113	71	q	q	\w, [q]	Latin small letter Q
01110010	162	114	72	r	r	\w, [r]	Latin small letter R
01110011	163	115	73	s	s	\w, [s]	Latin small letter S
01110100	164	116	74	t	t	\w, [t]	Latin small letter T
01110101	165	117	75	u	u	\w, [u]	Latin small letter U
01110110	166	118	76	v	v	\w, [v]	Latin small letter V
01110111	167	119	77	w	w	\w, [w]	Latin small letter W
01111000	170	120	78	x	x	\w, [x]	Latin small letter X
01111001	171	121	79	y	y	\w, [y]	Latin small letter Y
01111010	172	122	7A	z	z	\w, [z]	Latin small letter Z
01111011	173	123	7B	{	{	{	Left curly brace
01111100	174	124	7C	|	|	\|	Vertical line (Bar)
01111101	175	125	7D	}	}	}	Right curly brace
01111110	176	126	7E	~	~	\~	Tilde
01111111	177	127	7F	DEL	^?	\c?	Delete
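
The numeric columns of Table A-12 are the same value written in four bases. A minimal sketch in Python (my choice for illustration; the function name ascii_row is hypothetical) derives them from a code point with the built-in format function. It does not zero-pad the hexadecimal column, so small values may be formatted differently from the chart.

```python
def ascii_row(code):
    """Format one code point as binary, octal, decimal, and hexadecimal."""
    return "\t".join([
        format(code, "08b"),   # 8-bit binary, zero padded
        format(code, "o"),     # octal
        str(code),             # decimal
        format(code, "X"),     # uppercase hexadecimal
    ])

print(ascii_row(ord("A")).split("\t"))  # ['01000001', '101', '65', '41']
print(ascii_row(ord("`")).split("\t"))  # ['01100000', '140', '96', '60']
```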

Technical Notes

You can find Ken Thompson and Dennis Ritchie’s QED memo-cum-manual at http://cm.bell-labs.com/cm/cs/who/dmr/qedman.pdf.