Character classes group characters with a common trait, such as digits, letters, vowels, or hexadecimal characters, into a single unit that a regular expression can match against. In this section, we introduce the basics of character classes, extend them with ranges and negations, and then write a full regular expression that matches dates.
Our introduction to character classes begins with vowels: the letters A, E, I, O, and U. Almost every English word has a vowel in it. Let's see if we can write a character class that matches a vowel:
Here we have the word "dog". To begin a character class, we use square brackets; inside them we list our vowels, aeiou, and we ask for the result of the match as a Bool. A character class matches a single character that is any one of the characters listed inside the square brackets.
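A minimal sketch of this check, assuming the `=~` operator from `Text.Regex.Posix` (the regex-posix package); the name `dogHasVowel` is hypothetical and only there for illustration:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Does "dog" contain at least one lowercase vowel?
-- The Bool type tells (=~) to report only whether a match exists.
dogHasVowel :: Bool
dogHasVowel = "dog" =~ "[aeiou]"

main :: IO ()
main = print dogHasVowel  -- prints True
```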
For this case, there is a vowel in the word dog, so this results in True. Let's try another word, for example, why:
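The same check against "why" might look like the following sketch, again assuming `=~` from `Text.Regex.Posix`; `whyHasVowel` is a hypothetical name:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- None of the letters w, h, y appears in the class [aeiou].
whyHasVowel :: Bool
whyHasVowel = "why" =~ "[aeiou]"

main :: IO ()
main = print whyHasVowel  -- prints False
```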
This turns out to be False, because none of the letters a, e, i, o, u appears in the word why. The match succeeds as soon as any one character of the string is found in the character class; here, none is.
Now, we will create a character class for digits. Given the string "123", we can ask whether it contains any digits. This requires a character class consisting of the digits 0 to 9:
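One way to write this, as a sketch that assumes `=~` from `Text.Regex.Posix` and uses the hypothetical helper `hasDigit`, is to list every digit explicitly inside the brackets:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- True when the string contains at least one of the ten listed digits.
hasDigit :: String -> Bool
hasDigit s = s =~ "[0123456789]"

main :: IO ()
main = print (hasDigit "123")  -- prints True
```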
That resulted in True. Let's do the same thing with dog:
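Applied to "dog", the explicit digit class could be sketched as follows (assuming `=~` from `Text.Regex.Posix`; `dogHasDigit` is a hypothetical name):

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- "dog" has no character among the ten listed digits.
dogHasDigit :: Bool
dogHasDigit = "dog" =~ "[0123456789]"

main :: IO ()
main = print dogHasDigit  -- prints False
```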
We see that it gives False. We could also create a character class for the letters of the alphabet by typing abcd all the way to z, but that would be error-prone: if you missed a letter, that letter would never match. For this reason, character classes support ranges. Inside a character class, put a hyphen between two characters and you get every character from the first to the second, inclusive. Note that character classes are case-sensitive, as the examples that follow show. Let's go through a few examples, starting by rewriting our previous digit example with a range:
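A sketch of the digit check rewritten with a range, assuming `=~` from `Text.Regex.Posix`; `hasDigit` is a hypothetical helper name:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- [0-9] is shorthand for [0123456789].
hasDigit :: String -> Bool
hasDigit s = s =~ "[0-9]"

main :: IO ()
main = print (hasDigit "123")  -- prints True
```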
We can see that adding a range of 0-9 worked, and we got True. We could also do the same for the word dog and check whether it contains any lowercase letters:
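The lowercase check might be sketched like this, assuming `=~` from `Text.Regex.Posix` and a hypothetical helper `hasLower`:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- True when the string contains at least one lowercase ASCII letter.
hasLower :: String -> Bool
hasLower s = s =~ "[a-z]"

main :: IO ()
main = print (hasLower "dog")  -- prints True
```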
Of course, it does. So, this results in True. Let's do the same thing again, but we are checking for uppercase letters now in the word dog:
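For the uppercase range, a comparable sketch (assuming `=~` from `Text.Regex.Posix`; `hasUpper` is a hypothetical helper) would be:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- True when the string contains at least one uppercase ASCII letter.
hasUpper :: String -> Bool
hasUpper s = s =~ "[A-Z]"

main :: IO ()
main = print (hasUpper "dog")  -- prints False
```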
We see that it does not match, hence we got False.
To match letters of either case, place two ranges back-to-back in the same character class, one for uppercase and one for lowercase. Let's check the word DoG:
It returned True. One more feature of character classes is the negated character class, which matches any character not listed in the class. To negate a character class, make ^ the first character inside the square brackets. So, let's check whether the word why contains any consonants:
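Combining both ranges in one class could look like the sketch below, which assumes `=~` from `Text.Regex.Posix`; the helper name `hasLetter` is hypothetical:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Two ranges side by side: any uppercase or lowercase ASCII letter.
hasLetter :: String -> Bool
hasLetter s = s =~ "[A-Za-z]"

main :: IO ()
main = print (hasLetter "DoG")  -- prints True
```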
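A sketch of the negated vowel class, assuming `=~` from `Text.Regex.Posix`; `hasNonVowel` is a hypothetical helper. Note that the class matches any character that is not a lowercase vowel, which includes consonants but also digits and symbols:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- The leading ^ negates the class: any character except a, e, i, o, u.
hasNonVowel :: String -> Bool
hasNonVowel s = s =~ "[^aeiou]"

main :: IO ()
main = print (hasNonVowel "why")  -- prints True
```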
As you can see, in the character class, the very first character is ^ followed by the vowels, aeiou. So, this means no vowels; in other words, anything that's not a vowel. And, it is True as there are characters that are not vowels in the word why.
What about any symbols that aren't letters of the alphabet? There is a singer in America by the name of Kesha, and Kesha spells her name Ke$ha. What we would like to ask is, does Kesha's name have any non-letter characters? We will be using the ^ character:
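The non-letter check could be sketched by negating the combined letter ranges, assuming `=~` from `Text.Regex.Posix`; `hasNonLetter` is a hypothetical helper:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Any character that is not an ASCII letter, such as '$'.
hasNonLetter :: String -> Bool
hasNonLetter s = s =~ "[^A-Za-z]"

main :: IO ()
main = print (hasNonLetter "Ke$ha")  -- prints True
```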
We can see that she does have a non-letter character, hence it returned True.
Recall that in our last section we talked about how to express a date that only works for the 1900s or the 2000s. Let's say that we have a date, 1969-07-20, and we would like to check whether it truly is a date. To verify that, let's write a regular expression to match a date, using several character classes to make sure that digits appear in the right places:
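A first version of the date expression might look like the following sketch. It assumes `=~` from `Text.Regex.Posix`; the names `datePattern` and `looksLikeDate` are hypothetical:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Century 19 or 20, two more year digits, then -DD-DD (any digits).
datePattern :: String
datePattern = "(19|20)[0-9][0-9]-[0-9][0-9]-[0-9][0-9]"

looksLikeDate :: String -> Bool
looksLikeDate s = s =~ datePattern

main :: IO ()
main = print (looksLikeDate "1969-07-20")  -- prints True
```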
We begin with 19|20, grouped in parentheses so that the alternation covers only the century, followed by two digits and a dash, then two digits and a dash, and then two digits. We can see that the result is True. Let's play around with this expression. We will reuse it to check a different date, say, 1901-40-99:
We see that it is also True, so this expression accepts strings that are not valid dates. Let's see if we can modify the expression so that this case resolves to False. A month can only be a number from 01 to 12, for January to December. So, we are going to modify the month portion of our regular expression:
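Reusing the same digits-only expression on the new input can be sketched as follows (assuming `=~` from `Text.Regex.Posix`; `datePattern` and `looksLikeDate` are hypothetical names):

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Same shape check as before: it only verifies digits and dashes.
datePattern :: String
datePattern = "(19|20)[0-9][0-9]-[0-9][0-9]-[0-9][0-9]"

looksLikeDate :: String -> Bool
looksLikeDate s = s =~ datePattern

main :: IO ()
main = print (looksLikeDate "1901-40-99")  -- prints True
```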
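A sketch of the expression with a stricter month portion, assuming `=~` from `Text.Regex.Posix`; `monthCheckedPattern` and `looksLikeDate` are hypothetical names:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Month is 01-09 (0 then 1-9) or 10-12 (1 then 0, 1, or 2).
monthCheckedPattern :: String
monthCheckedPattern = "(19|20)[0-9][0-9]-(0[1-9]|1[012])-[0-9][0-9]"

looksLikeDate :: String -> Bool
looksLikeDate s = s =~ monthCheckedPattern

main :: IO ()
main = print (looksLikeDate "1901-40-99")  -- prints False
```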
The month is now either a 0 followed by the character class 1-9, or a 1 followed by the character class 012. Together, these two alternatives match exactly the two-digit strings 01 through 12. We got False because 40 doesn't qualify as a valid month. The expression is still incomplete, though: the day portion would accept 99, and the result is False only because of the month. To handle the day, we need to allow only days of the month from 01 up to 31. So, let's elaborate on that:
The day is a 0 followed by 1-9, or a 1 or a 2 followed by a digit from 0-9, or, after another |, a 3 followed by a 0 or a 1. This is a very lengthy regular expression just to match a day of the month. We see that it also resulted in False.
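With the day portion restricted as well, the expression might be sketched like this. It assumes `=~` from `Text.Regex.Posix`; `fullDatePattern` and `looksLikeDate` are hypothetical names, and the second check with a valid month is an extra illustration that isolates the day rule:

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Day is 01-09, 10-29, or 30-31.
fullDatePattern :: String
fullDatePattern =
  "(19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])"

looksLikeDate :: String -> Bool
looksLikeDate s = s =~ fullDatePattern

main :: IO ()
main = do
  print (looksLikeDate "1901-40-99")  -- prints False
  print (looksLikeDate "1901-12-99")  -- prints False (day 99 rejected)
```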
Now, we should test this under a variety of circumstances, but we are going to test this out again with our original date of 1969-07-20:
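Re-running the original date against the stricter expression could be sketched as follows (assuming `=~` from `Text.Regex.Posix`; `fullDatePattern` and `looksLikeDate` are hypothetical names):

```haskell
-- Assumes the regex-posix package is installed.
import Text.Regex.Posix ((=~))

-- Century 19/20, month 01-12, day 01-31.
fullDatePattern :: String
fullDatePattern =
  "(19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])"

looksLikeDate :: String -> Bool
looksLikeDate s = s =~ fullDatePattern

main :: IO ()
main = print (looksLikeDate "1969-07-20")  -- prints True
```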
This one is still True. So, at least we know that this regular expression works for these two cases; further testing would be needed before trusting it more broadly. For example, it does not check that a day is valid for its particular month, so a date such as 1969-02-31 would still pass. Regular expressions can be very hairy and they can also be very hard to test, but once you figure out the language, you can do lots with them. In our next section, we are going to use regular expressions in the context of a CSV file.