Appendix A. Regular Expression Reference

This appendix is a reference for regular expressions.

QED (short for Quick Editor) was originally written for the Berkeley Time-Sharing System, which ran on the Scientific Data Systems SDS 940. Ken Thompson’s rewrite of the original QED editor for MIT’s Compatible Time-Sharing System yielded one of the earliest (if not the first) practical implementations of regular expressions in computing. Table A-1, taken from pages 3 and 4 of a 1970 Bell Labs memo, outlines the regex features in QED. It amazes me that most of this syntax has remained in use to this day, over 40 years later.

Table A-1. QED regular expressions

Feature

Description

literal

“a) An ordinary character [literal] is a regular expression which matches that character.”

^

“b) ^ is a regular expression which matches the null character at the beginning of a line.”

$

“c) $ is a regular expression which matches the null character before the character <nl> [newline] (usually at the end of a line).”

.

“d) . is a regular expression which matches any character except <nl> [newline].”

[<string>]

“e) “[<string>]” is a regular expression which matches any of the characters in the <string> and no others.”

[^<string>]

“f) “[^<string>]” is a regular expression which matches any character but <nl> [newline] and the characters of the <string>.”

*

“g) A regular expression followed by “*” is a regular expression which matches any number (including zero) of adjacent occurrences of the text matched by the regular expression.”

“h) Two adjacent regular expressions form a regular expression which matches adjacent occurrences of the text matched by the regular expressions.”

|

“i) Two regular expressions separated by “|” form a regular expression which matches the text matched by either of the regular expressions.”

( )

“j) A regular expression in parentheses is a regular expression which matches the same text as the original regular expression. Parentheses are used to alter the order of evaluation implied by g), h), and i): a(b|c)d will match abd or acd, while ab|cd matches ab or cd.”

{ }

“k) If “<regexp>” is a regular expression, “{<regexp>}x” is a regular expression, where x is any character. This regular expression matches the same things as <regexp>; it has certain side effects as explained under the Substitute command.” [The Substitute command was formed (.,.)S/<regexp>/<string>/ (see page 13 of the memo), similar to the way it is still used in programs like sed and Perl.]

\E

“l) If <rexname> is the name of a regular expression named by the E command (below), then “\E<rexname>” is a regular expression which matches the same things as the regular expression specified in the E command. More discussion is presented under the E command.” [The \E command allowed you to name a regular expression and repeat its use by name.]

“m) The null regular expression standing alone is equivalent to the last regular expression encountered. Initially the null regular expression is undefined; it also becomes undefined after an erroneous regular expression and after use of the E command.”

“n) Nothing else is a regular expression.”

“o) No regular expression will match text spread across more than one line.”
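The precedence example in rule j) still behaves the same way in modern engines. The following is a minimal sketch using Python's re module, not QED itself; the helper name matches_whole is made up for illustration.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

candidates = ("abd", "acd", "ab", "cd")

# Parentheses limit the alternation to b or c.
grouped = [s for s in candidates if matches_whole(r"a(b|c)d", s)]

# Without parentheses, | splits the whole expression into ab or cd.
ungrouped = [s for s in candidates if matches_whole(r"ab|cd", s)]

print(grouped)    # ['abd', 'acd']
print(ungrouped)  # ['ab', 'cd']
```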

There are 14 metacharacters used in regular expressions, each with special meaning, as described in Table A-2. If you want to use one of these characters as a literal, you must precede it with a backslash to escape it. For example, you would escape the dollar sign as \$ and a backslash as \\.
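As a quick illustration of escaping, here is a small sketch in Python (one engine among many; the sample strings are invented for the example). It contrasts an unescaped dollar sign with an escaped one and uses re.escape() to escape arbitrary text.

```python
import re

price = "Total: $5.00 (approx.)"

# An unescaped $ is an end-of-line anchor, so it cannot match the dollar sign itself.
anchor_hit = re.search(r"$5", price)

# Escaped, \$ matches the literal character.
literal_hit = re.search(r"\$5", price)

# A literal backslash is written as \\ in the pattern.
backslash_hit = re.search(r"\\", "C:\\temp")

# re.escape() adds the backslashes for you when the text comes from elsewhere.
needle = "(approx.)"
escaped_hit = re.search(re.escape(needle), price)

print(anchor_hit)             # None
print(literal_hit.group())    # $5
print(escaped_hit.group())    # (approx.)
```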

Table A-3 lists character shorthands used in regular expressions.

Table A-4 is a list of character shorthands for whitespace.

Whitespace characters in Unicode are listed in Table A-5.

Table A-6 shows a way to match control characters in regular expressions.

Table A-7 lists character property names for use with \p{property} or \P{property}.

Table A-7. Character properties[2]

Property

Description

C

Other

Cc

Control

Cf

Format

Cn

Unassigned

Co

Private use

Cs

Surrogate

L

Letter

Ll

Lowercase letter

Lm

Modifier letter

Lo

Other letter

Lt

Title case letter

Lu

Uppercase letter

L&

Ll, Lu, or Lt

M

Mark

Mc

Spacing mark

Me

Enclosing mark

Mn

Non-spacing mark

N

Number

Nd

Decimal number

Nl

Letter number

No

Other number

P

Punctuation

Pc

Connector punctuation

Pd

Dash punctuation

Pe

Close punctuation

Pf

Final punctuation

Pi

Initial punctuation

Po

Other punctuation

Ps

Open punctuation

S

Symbol

Sc

Currency symbol

Sk

Modifier symbol

Sm

Mathematical symbol

So

Other symbol

Z

Separator

Zl

Line separator

Zp

Paragraph separator

Zs

Space separator

[2] See pcresyntax(3) at http://www.pcre.org/pcre.txt.
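Python's standard re module does not support \p{property}, but its unicodedata module reports the same two-letter general categories listed in Table A-7. The sketch below uses that to approximate a property test; has_property is a hypothetical helper, not part of any library, and it does not handle the L& shorthand.

```python
import unicodedata

def has_property(char: str, prop: str) -> bool:
    """Rough stand-in for \\p{prop}: compare against the character's general category.

    A one-letter property such as "L" matches every subcategory (Lu, Ll, ...).
    """
    return unicodedata.category(char).startswith(prop)

# Print the general category of a few sample characters.
for ch in ("A", "a", "5", "$", "_", " "):
    print(repr(ch), unicodedata.category(ch))

print(has_property("A", "Lu"))  # True: uppercase letter
print(has_property("A", "L"))   # True: any letter
print(has_property("5", "L"))   # False: 5 is a decimal number (Nd)
```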

Table A-8 shows the language script names for use with \p{property} or \P{property}.

Table A-8. Script names[3]

Arabic (Arab)

Glagolitic (Glag)

Lepcha (Lepc)

Samaritan (Samr)

Armenian (Armn)

Gothic (Goth)

Limbu (Limb)

Saurashtra (Saur)

Avestan (Avst)

Greek (Grek)

Linear B (Linb)

Shavian (Shaw)

Balinese (Bali)

Gujarati (Gujr)

Lisu (Lisu)

Sinhala (Sinh)

Bamum (Bamu)

Gurmukhi (Guru)

Lycian (Lyci)

Sundanese (Sund)

Bengali (Beng)

Han (Hani)

Lydian (Lydi)

Syloti Nagri (Sylo)

Bopomofo (Bopo)

Hangul (Hang)

Malayalam (Mlym)

Syriac (Syrc)

Braille (Brai)

Hanunoo (Hano)

Meetei Mayek (Mtei)

Tagalog (Tglg)

Buginese (Bugi)

Hebrew (Hebr)

Mongolian (Mong)

Tagbanwa (Tagb)

Buhid (Buhd)

Hiragana (Hira)

Myanmar (Mymr)

Tai Le (Tale)

Canadian Aboriginal (Cans)

(Hrkt: Katakana or Hiragana)

New Tai Lue (Talu)

Tai Tham (Lana)

Carian (Cari)

Imperial Aramaic (Armi)

Nko (Nkoo)

Tai Viet (Tavt)

Cham (Cham)

Inherited (Zinh/Qaai)

Ogham (Ogam)

Tamil (Taml)

Cherokee (Cher)

Inscriptional Pahlavi (Phli)

Ol Chiki (Olck)

Telugu (Telu)

Common (Zyyy)

Inscriptional Parthian (Prti)

Old Italic (Ital)

Thaana (Thaa)

Coptic (Copt/Qaac)

Javanese (Java)

Old Persian (Xpeo)

Thai (Thai)

Cuneiform (Xsux)

Kaithi (Kthi)

Old South Arabian (Sarb)

Tibetan (Tibt)

Cypriot (Cprt)

Kannada (Knda)

Old Turkic (Orkh)

Tifinagh (Tfng)

Cyrillic (Cyrl)

Katakana (Kana)

Oriya (Orya)

Ugaritic (Ugar)

Deseret (Dsrt)

Kayah Li (Kali)

Osmanya (Osma)

Unknown (Zzzz)

Devanagari (Deva)

Kharoshthi (Khar)

Phags Pa (Phag)

Vai (Vaii)

Egyptian Hieroglyphs (Egyp)

Khmer (Khmr)

Phoenician (Phnx)

Yi (Yiii)

Ethiopic (Ethi)

Lao (Laoo)

Rejang (Rjng)

 

Georgian (Geor)

Latin (Latn)

Runic (Runr)

 

Table A-9 shows a list of POSIX character classes.

Tables A-10 and A-11 list options and modifiers.

Table A-11. Perl modifiers (flags)[4]

Modifier

Description

a

Match \d, \s, \w, and POSIX character classes in the ASCII range only

c

Keep current position after match fails

d

Use default, native rules of the platform

g

Global matching

i

Case-insensitive matching

l

Use current locale’s rules

m

Multiline strings

p

Preserve the matched string

s

Treat strings as a single line

u

Use Unicode rules when matching

x

Ignore whitespace and allow comments in the pattern
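Several of these Perl modifiers have close counterparts in other engines. As a hedged illustration, the sketch below shows the Python re flags that roughly correspond to i, m, s, x, and a; the sample text is invented, and the behavior of the remaining modifiers is specific to Perl. Python has no g flag, so re.findall() plays that role here.

```python
import re

text = "first line\nSecond line"

# i: case-insensitive matching
case_hits = re.findall(r"second", text, re.IGNORECASE)

# m: ^ and $ match at every line boundary, not just the ends of the string
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# s: the dot also matches a newline
across_lines = re.search(r"line.Second", text, re.DOTALL)

# x: whitespace and # comments inside the pattern are ignored
verbose = re.compile(r"""
    (\w+)   # a word
    \s+     # one or more whitespace characters
    line    # the literal text "line"
""", re.VERBOSE)
first_word = verbose.search(text).group(1)

# a: restrict \d, \s, and \w to ASCII
unicode_digit = re.fullmatch(r"\d", "\u0663")         # ARABIC-INDIC DIGIT THREE
ascii_only = re.fullmatch(r"\d", "\u0663", re.ASCII)

print(case_hits)                  # ['Second']
print(line_starts)                # ['first', 'Second']
print(across_lines is not None)   # True
print(first_word)                 # first
print(unicode_digit is not None)  # True
print(ascii_only)                 # None
```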

Table A-12 is an ASCII code chart with regex cross-references.

Table A-12. ASCII code chart

Binary

Oct

Dec

Hex

Char

Kybd

Regex

Name

00000000

0

0

0

NUL

^@

\c@

Null character

00000001

1

1

1

SOH

^A

\cA

Start of header

00000010

2

2

2

STX

^B

\cB

Start of text

00000011

3

3

3

ETX

^C

\cC

End of text

00000100

4

4

4

EOT

^D

\cD

End of transmission

00000101

5

5

5

ENQ

^E

\cE

Enquiry

00000110

6

6

6

ACK

^F

\cF

Acknowledgment

00000111

7

7

7

BEL

^G

\a, \cG

Bell

00001000

10

8

8

BS

^H

[\b], \cH

Backspace

00001001

11

9

9

HT

^I

\t, \cI

Horizontal tab

00001010

12

10

0A

LF

^J

\n, \cJ

Line feed

00001011

13

11

0B

VT

^K

\v, \cK

Vertical tab

00001100

14

12

0C

FF

^L

\f, \cL

Form feed

00001101

15

13

0D

CR

^M

\r, \cM

Carriage return

00001110

16

14

0E

SO

^N

\cN

Shift out

00001111

17

15

0F

SI

^O

\cO

Shift in

00010000

20

16

10

DLE

^P

\cP

Data link escape

00010001

21

17

11

DC1

^Q

\cQ

Device control 1 (XON)

00010010

22

18

12

DC2

^R

\cR

Device control 2

00010011

23

19

13

DC3

^S

\cS

Device control 3 (XOFF)

00010100

24

20

14

DC4

^T

\cT

Device control 4

00010101

25

21

15

NAK

^U

\cU

Negative acknowledgment

00010110

26

22

16

SYN

^V

\cV

Synchronous idle

00010111

27

23

17

ETB

^W

\cW

End of transmission block

00011000

30

24

18

CAN

^X

\cX

Cancel

00011001

31

25

19

EM

^Y

\cY

End of medium

00011010

32

26

1A

SUB

^Z

\cZ

Substitute

00011011

33

27

1B

ESC

^[

\e, \c[

Escape

00011100

34

28

1C

FS

^\

\c\

File separator

00011101

35

29

1D

GS

^]

\c]

Group separator

00011110

36

30

1E

RS

^^

\c^

Record separator

00011111

37

31

1F

US

^_

\c_

Unit separator

00100000

40

32

20

SP

SP

\s, [ ]

Space

00100001

41

33

21

!

!

!

Exclamation mark

00100010

42

34

22

"

"

"

Quotation mark

00100011

43

35

23

#

#

#

Number sign

00100100

44

36

24

$

$

\$

Dollar sign

00100101

45

37

25

%

%

%

Percent sign

00100110

46

38

26

&

&

&

Ampersand

00100111

47

39

27

'

'

'

Apostrophe

00101000

50

40

28

(

(

(, \(

Left parenthesis

00101001

51

41

29

)

)

), \)

Right parenthesis

00101010

52

42

2A

*

*

\*

Asterisk

00101011

53

43

2B

+

+

\+

Plus sign

00101100

54

44

2C

,

,

,

Comma

00101101

55

45

2D

-

-

-

Hyphen-minus

00101110

56

46

2E

.

.

\., [.]

Full stop

00101111

57

47

2F

/

/

/

Solidus

00110000

60

48

30

0

0

\d, [0]

Digit zero

00110001

61

49

31

1

1

\d, [1]

Digit one

00110010

62

50

32

2

2

\d, [2]

Digit two

00110011

63

51

33

3

3

\d, [3]

Digit three

00110100

64

52

34

4

4

\d, [4]

Digit four

00110101

65

53

35

5

5

\d, [5]

Digit five

00110110

66

54

36

6

6

\d, [6]

Digit six

00110111

67

55

37

7

7

\d, [7]

Digit seven

00111000

70

56

38

8

8

\d, [8]

Digit eight

00111001

71

57

39

9

9

\d, [9]

Digit nine

00111010

72

58

3A

:

:

:

Colon

00111011

73

59

3B

;

;

;

Semicolon

00111100

74

60

3C

<

<

<

Less-than sign

00111101

75

61

3D

=

=

=

Equals sign

00111110

76

62

3E

>

>

>

Greater-than sign

00111111

77

63

3F

?

?

\?

Question mark

01000000

100

64

40

@

@

@

Commercial at

01000001

101

65

41

A

A

\w, [A]

Latin capital letter A

01000010

102

66

42

B

B

\w, [B]

Latin capital letter B

01000011

103

67

43

C

C

\w, [C]

Latin capital letter C

01000100

104

68

44

D

D

\w, [D]

Latin capital letter D

01000101

105

69

45

E

E

\w, [E]

Latin capital letter E

01000110

106

70

46

F

F

\w, [F]

Latin capital letter F

01000111

107

71

47

G

G

\w, [G]

Latin capital letter G

01001000

110

72

48

H

H

\w, [H]

Latin capital letter H

01001001

111

73

49

I

I

\w, [I]

Latin capital letter I

01001010

112

74

4A

J

J

\w, [J]

Latin capital letter J

01001011

113

75

4B

K

K

\w, [K]

Latin capital letter K

01001100

114

76

4C

L

L

\w, [L]

Latin capital letter L

01001101

115

77

4D

M

M

\w, [M]

Latin capital letter M

01001110

116

78

4E

N

N

\w, [N]

Latin capital letter N

01001111

117

79

4F

O

O

\w, [O]

Latin capital letter O

01010000

120

80

50

P

P

\w, [P]

Latin capital letter P

01010001

121

81

51

Q

Q

\w, [Q]

Latin capital letter Q

01010010

122

82

52

R

R

\w, [R]

Latin capital letter R

01010011

123

83

53

S

S

\w, [S]

Latin capital letter S

01010100

124

84

54

T

T

\w, [T]

Latin capital letter T

01010101

125

85

55

U

U

\w, [U]

Latin capital letter U

01010110

126

86

56

V

V

\w, [V]

Latin capital letter V

01010111

127

87

57

W

W

\w, [W]

Latin capital letter W

01011000

130

88

58

X

X

\w, [X]

Latin capital letter X

01011001

131

89

59

Y

Y

\w, [Y]

Latin capital letter Y

01011010

132

90

5A

Z

Z

\w, [Z]

Latin capital letter Z

01011011

133

91

5B

[

[

\[

Left square bracket

01011100

134

92

5C

\

\

\\

Reverse solidus

01011101

135

93

5D

]

]

\]

Right square bracket

01011110

136

94

5E

^

^

\^, [\^]

Circumflex accent

01011111

137

95

5F

_

_

_, [_]

Low line

01100000

140

96

60

`

`

\`

Grave accent

01100001

141

97

61

a

a

\w, [a]

Latin small letter A

01100010

142

98

62

b

b

\w, [b]

Latin small letter B

01100011

143

99

63

c

c

\w, [c]

Latin small letter C

01100100

144

100

64

d

d

\w, [d]

Latin small letter D

01100101

145

101

65

e

e

\w, [e]

Latin small letter E

01100110

146

102

66

f

f

\w, [f]

Latin small letter F

01100111

147

103

67

g

g

\w, [g]

Latin small letter G

01101000

150

104

68

h

h

\w, [h]

Latin small letter H

01101001

151

105

69

i

i

\w, [i]

Latin small letter I

01101010

152

106

6A

j

j

\w, [j]

Latin small letter J

01101011

153

107

6B

k

k

\w, [k]

Latin small letter K

01101100

154

108

6C

l

l

\w, [l]

Latin small letter L

01101101

155

109

6D

m

m

\w, [m]

Latin small letter M

01101110

156

110

6E

n

n

\w, [n]

Latin small letter N

01101111

157

111

6F

o

o

\w, [o]

Latin small letter O

01110000

160

112

70

p

p

\w, [p]

Latin small letter P

01110001

161

113

71

q

q

\w, [q]

Latin small letter Q

01110010

162

114

72

r

r

\w, [r]

Latin small letter R

01110011

163

115

73

s

s

\w, [s]

Latin small letter S

01110100

164

116

74

t

t

\w, [t]

Latin small letter T

01110101

165

117

75

u

u

\w, [u]

Latin small letter U

01110110

166

118

76

v

v

\w, [v]

Latin small letter V

01110111

167

119

77

w

w

\w, [w]

Latin small letter W

01111000

170

120

78

x

x

\w, [x]

Latin small letter X

01111001

171

121

79

y

y

\w, [y]

Latin small letter Y

01111010

172

122

7A

z

z

\w, [z]

Latin small letter Z

01111011

173

123

7B

{

{

{

Left curly brace

01111100

174

124

7C

|

|

\|

Vertical line (Bar)

01111101

175

125

7D

}

}

}

Right curly brace

01111110

176

126

7E

~

~

\~

Tilde

01111111

177

127

7F

DEL

^?

\c?

Delete
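The numeric columns and the keyboard (caret) notation in Table A-12 follow directly from the character code. The following sketch, with a made-up helper name ascii_row, derives them in Python; it always pads hex to two digits, and it prints the space character itself rather than SP.

```python
def ascii_row(code: int) -> tuple:
    """Return (binary, octal, decimal, hex, keyboard) for a 7-bit ASCII code."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code < 32 or code == 127:
        # Caret notation: XOR with 0x40, so 0x01 -> ^A, 0x1B -> ^[, 0x7F -> ^?
        keyboard = "^" + chr(code ^ 0x40)
    else:
        keyboard = chr(code)
    return (format(code, "08b"), format(code, "o"), str(code),
            format(code, "02X"), keyboard)

print(ascii_row(10))   # ('00001010', '12', '10', '0A', '^J')
print(ascii_row(27))   # ('00011011', '33', '27', '1B', '^[')
print(ascii_row(65))   # ('01000001', '101', '65', '41', 'A')
print(ascii_row(127))  # ('01111111', '177', '127', '7F', '^?')
```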

You can find Ken Thompson and Dennis Ritchie’s QED memo-cum-manual at http://cm.bell-labs.com/cm/cs/who/dmr/qedman.pdf.