Chapter 10. String Functions

A string is a value that can contain alphabetical characters, digits, and other characters (e.g., the ampersand, the dollar sign). Although a string can contain numbers, they are not considered numeric values. It’s a matter of context and perspective. For instance, postal codes in the United States are all digits, but you shouldn’t store them as integers because the postal code for 02138 would become 2138. You should use a string to store the postal code.

To make the handling of strings easier, MySQL provides many built-in functions. You can format text for nicer results, make better expressions in a WHERE clause, or otherwise extract and manipulate data from a string or column. Therefore, in this chapter, we’ll go through several string functions, grouping them by similar features, and provide examples of how they might be used.

Formatting Strings

Several string functions can format or reconstitute text for a better display. They allow you to store data in columns in a raw form or in separate components and then create the display you want when you retrieve the data.

For instance, in the humans table, we are able to store each member’s title, first name, and last name in separate columns because we can put them together when needed. Breaking apart the names allows us to sort easily based on last name or first name. You’ll see how this is done in the next subsection.

Concatenating Strings

The CONCAT() function is very useful for pasting together the contents of different columns, or adding some other text to the results retrieved from a column. This is probably the most used string function—we’ve already used it in several examples in previous chapters. Within the parentheses of the function, in a comma-separated list, you give the strings, columns, and other elements that you want to merge together into one string.

Let’s look at an example of how it might be used within a SELECT statement. Suppose we want to get a list of a few members and birds that they’ve seen. We could enter an SQL statement like this:

SELECT CONCAT(formal_title, '. ', name_first, SPACE(1), name_last) AS Birder,
CONCAT(common_name, ' - ', birds.scientific_name) AS Bird,
time_seen AS 'When Spotted'
FROM birdwatchers.bird_sightings
JOIN birdwatchers.humans USING(human_id)
JOIN rookery.birds USING(bird_id)
GROUP BY human_id DESC
LIMIT 4;

+----------------------+----------------------------------+--------------------+
| Birder               | Bird                             | When Spotted       |
+----------------------+----------------------------------+--------------------+
| Ms. Marie Dyer       | Red-billed Curassow - Crax blu...| 2013-10-02 07:39:44|
| Ms. Anahit Vanetsyan | Bar-tailed Godwit - Limosa lap...| 2013-10-01 05:40:00|
| Ms. Katerina Smirnova| Eurasian Curlew - Numenius arq...| 2013-10-01 07:06:46|
| Ms. Elena Bokova     | Eskimo Curlew - Numenius borea...| 2013-10-01 05:09:27|
+----------------------+----------------------------------+--------------------+

The first field displayed by this SQL statement is not a single column from the table, but a CONCAT() function that merges the bird-watcher’s title, first name, and last name. We added a period in quotes after the title, as we’ve decided to store the titles without a period. We used quote marks to add spaces where needed. For the second field, we concatenated the common name of each bird species with the scientific name, and put spaces and a hyphen between them.

Without CONCAT(), we might be tempted to combine text in one column that really should be separated. For instance, we might put the common and scientific names of bird species in one column. Keeping values in separate columns makes a database more efficient and flexible. String functions like CONCAT() alleviate the need to do otherwise.

A less common concatenating function is CONCAT_WS(). It puts together columns with a separator between each. The first argument is the element you want to use as a separator (e.g., a space) and the rest of the arguments are the values to be separated. This can be useful when making data available for other programs.

For instance, suppose we have embroidered patches made with the name of the Rookery site on them and we want to mail one to each premium member. To do this, we use an advertising and marketing agency that will handle the mailing. The agency needs the names and addresses of members, and would like that data in a text file, with the values of each field separated by vertical bars. To do this, we’ll run mysql on the command line, passing a single statement to it:

mysql -p --skip-column-names -e \
"SELECT CONCAT_WS('|', formal_title, name_first, name_last,
street_address, city, state_province, postal_code, country_id)
FROM birdwatchers.humans WHERE membership_type = 'premium'
AND membership_expiration > CURDATE();" > rookery_patch_mailinglist.txt

This example uses mysql with several options. The --skip-column-names option tells MySQL not to display the column headings—we want just the data separated by bars. The -e option says that what follows within quotes is to be executed. We then put the SQL statement within double quotes. The first argument to CONCAT_WS() is the vertical bar that the company wants as a separator. The remaining arguments are the columns to be strung together. After the closing double quotes, we use > to redirect the results to a text file that we’ll email to the agency. There is a potential problem with the SQL statement we used. If a column has a NULL value, nothing will be exported and no bar will be put in the file to indicate an empty field. Here’s an example of how the text file would look:

Ms|Rusty|Osborne|ch
Ms|Elena|Bokova|ru

We have only four fields for these members, although we told MySQL to export eight fields. If these two records were in the midst of thousands of records, they would cause errors that might not be obvious when imported. Although it’s more cumbersome, we should wrap each column name in an IFNULL() function. Then we can give a value to display if the column is NULL, such as the word unknown or a blank space. Here’s the same example again, but with the IFNULL() function:

mysql -p --skip-column-names -e \
"SELECT CONCAT_WS('|', IFNULL(formal_title, ' '), IFNULL(name_first, ' '),
IFNULL(name_last, ' '), IFNULL(street_address, ' '),
IFNULL(city, ' '), IFNULL(state_province, ' '),
IFNULL(postal_code, ' '), IFNULL(country_id, ' '))
FROM birdwatchers.humans WHERE membership_type = 'premium'
AND membership_expiration > CURDATE();" > rookery_patch_mailinglist.txt

It looks daunting and excessive, but it’s simple to MySQL. The new contents of the text file follow:

Ms|Rusty|Osborne| | | | |ch
Ms|Elena|Bokova| | | | |ru

That’s a manageable data file. When the results are like this, the marketing company can import all of the records without errors and then contact us to try to get the missing information. They can add it to their system without having to reimport the text file.

Setting Case and Quotes

Occasionally, you might want to convert the text from a column to either all lowercase letters or all uppercase letters. For these situations, there are LOWER() and UPPER(), which can also be spelled LCASE() and UCASE(), respectively. In the example that follows, the output of the first column is converted to lowercase and the second to uppercase:

SELECT LCASE(common_name) AS Species,
UCASE(bird_families.scientific_name) AS Family
FROM birds
JOIN bird_families USING(family_id)
WHERE common_name LIKE '%Wren%'
ORDER BY Species
LIMIT 5;

+-------------------------+---------------+
| Species                 | Family        |
+-------------------------+---------------+
| apolinar's wren         | TROGLODYTIDAE |
| band-backed wren        | TROGLODYTIDAE |
| banded wren             | TROGLODYTIDAE |
| bar-winged wood-wren    | TROGLODYTIDAE |
| bar-winged wren-babbler | TIMALIIDAE    |
+-------------------------+---------------+

The QUOTE() function takes a string and returns it enclosed in single quotes. But it does a good deal more: it makes it input-safe by marking certain characters that could cause trouble in SQL statements or other programming languages. These characters are single quotes, backslashes, null (zero) bytes, and Ctrl-Z characters. The QUOTE() function precedes each of these with a backslash so that they won’t be interpreted in some way or (in the case of a single quote) cause SQL to prematurely terminate the string.

In the following example, we’re selecting a list of bird species named for a Prince or Princess:

SELECT QUOTE(common_name)
FROM birds
WHERE common_name LIKE "%Prince%"
ORDER BY common_name;

+----------------------------------+
| QUOTE(common_name)               |
+----------------------------------+
| 'Prince Henry\'s Laughingthrush' |
| 'Prince Ruspoli\'s Turaco'       |
| 'Princess Parrot'                |
+----------------------------------+

Notice in the results that because of the QUOTE() function, the strings returned are enclosed in single quotes, and any single quotes within the strings are escaped with a backslash. This can prevent errors if the value is passed to another program.

Trimming and Padding Strings

One of the problems with allowing the public to enter data into a website is that they’re not always careful. They do things like adding spaces before and after the text. There are a few functions for trimming any leading or trailing spaces from the values of a column. The LTRIM() function eliminates any leading spaces to the left. For columns with spaces on the right, RTRIM() will remove them. A more versatile trimming function, though, is TRIM(). With it, you can trim both left and right spaces.

These trim functions can be useful for cleaning data with the UPDATE statement. Let’s look at an example of their use. In these SQL statements, we’ll use LTRIM() and RTRIM() to eliminate both leading and trailing spaces:

UPDATE humans
SET name_first = LTRIM(name_first),
name_last = LTRIM(name_last);

UPDATE humans
SET name_first = RTRIM(name_first),
name_last = RTRIM(name_last);

In this example, we trimmed the leading spaces with the first UPDATE and the trailing spaces with the second one. Notice that we set the value of the columns to the same values, but with the strings trimmed. We can combine these functions into one SQL statement like so:

UPDATE humans
SET name_first = LTRIM( RTRIM(name_last) ),
name_last = LTRIM( RTRIM(name_last) );

You can always combine functions like this for a more dynamic result. In this case, though, the TRIM() function is a better alternative. Here’s the same SQL statement using it:

UPDATE humans
SET name_first = TRIM(name_first),
name_last = TRIM(name_last);

The TRIM() function also offers more options. You can specify something other than spaces to remove. For instance, suppose we receive a small table with bird sightings from another bird-watcher club, as we did in Row Subqueries. However, in this table, the scientific names of bird species are within double quotes. If we wanted to insert that data into our bird_sightings table, we could use the same SQL query as we did before, with the addition of the TRIM() function. Here is the relevant excerpt, the last lines on which we join their table to our birds table:

…
JOIN rookery.birds
ON(scientific_name = TRIM(BOTH '"' FROM science_name) ) );

It may be difficult to see, but we’re enclosing the character that we want trimmed—a double quote—within single quotes. The keyword BOTH isn’t actually necessary because it’s the default—that’s why we didn’t specify it in the previous example. If you don’t want to remove the string given from one end or the other, you can specify LEADING or TRAILING, thus making TRIM() work like LTRIM() or RTRIM(). The default string to trim is a space, as we have seen.

When displaying data in web forms and other such settings, it’s sometimes useful to pad the data displayed with dots or some other filler. This can be necessary when dealing with VARCHAR columns where the width varies. Padding the results of a column selected can help the user to see the column limits. There are two functions that may be used for padding: LPAD() and RPAD(). There is also SPACE(), which pads the string with spaces:

SELECT CONCAT(RPAD(common_name, 20, '.' ),
RPAD(Families.scientific_name, 15, '.'),
Orders.scientific_name) AS Birds
FROM birds
JOIN bird_families AS Families USING(family_id)
JOIN bird_orders AS Orders
WHERE common_name != ''
AND Orders.scientific_name = 'Ciconiiformes'
ORDER BY common_name LIMIT 3;

+--------------------------------------------------+
| Birds                                            |
+--------------------------------------------------+
| Abbott's Babbler....Pellorneidae...Ciconiiformes |
| Abbott's Booby......Sulidae........Ciconiiformes |
| Abbott's Starling...Sturnidae......Ciconiiformes |
+--------------------------------------------------+

Notice how all the bird families and orders are aligned vertically. This is because we padded each value out to its maximum width using RPAD(). The first argument was the column to read, the second was the total size of the resulting string we want, and the third was a period so that periods apear for columns that have less text. This happens to work because MySQL uses a fixed-width font. We could uses spaces instead of dots for a similar effect. For web display, we might use   as padding element for non-breaking spaces.

Extracting Text

There are a few functions for extracting a piece of text from a string. You indicate the point from which to start selecting text and how much text you want. There are four such functions: LEFT(), MID(), RIGHT(), and SUBSTRING(). The SUBSTRING_INDEX() function is also related. We’ll look at each one here.

Let’s look at the LEFT(), MID(), and RIGHT() functions first. Suppose our marketing agency acquires a table called prospects containing a list of people who are known to be bird-watchers. Each person’s title and first and last name is stored in a column called prospect_name, with email addresses in another column. The prospect_name column is a fixed character length data type, CHAR(54). The marketing agency tells us that the title is contained in the first four characters, the first name in the next 25, and the last name in the remaining 25. For the titles, they’re using only Mr. and Ms. with a space after each—hence the first four characters—but we will extract just the first two characters for our tables. Let’s see how that column looks by executing a simple SELECT to retrieve four names:

SELECT prospect_name
FROM prospects LIMIT 4;

+--------------------------------------------------------+
| prospect_name                                          |
+--------------------------------------------------------+
| Ms. Caryn-Amy                Rose                      |
| Mr. Colin                    Charles                   |
| Mr. Kenneth                  Dyer                      |
| Ms. Sveta                    Smirnova                  |
+--------------------------------------------------------+

As you can see, the data is a fixed width for each element. Normally, with a CHAR column, MySQL would not store the trailing spaces. Whoever created this table enforced the rigid format (4, 25, and 25 characters) by executing SET sql_mode = 'PAD_CHAR_TO_FULL_LENGTH'; before inserting data into the column.

With an INSERT INTO…SELECT statement and a few functions, we can extract and separate the data we need and put these prospects in a new table we created that we call membership_prospects. Let’s execute the SELECT first to test our organization of the functions before we insert the data:

SELECT LEFT(prospect_name, 2) AS title,
MID(prospect_name, 5, 25) AS first_name,
RIGHT(prospect_name, 25) AS last_name
FROM prospects LIMIT 4;

+-------+---------------------------+---------------------------+
| title | first_name                | last_name                 |
+-------+---------------------------+---------------------------+
| Ms    | Caryn-Amy                 | Rose                      |
| Mr    | Kenneth                   | Dyer                      |
| Mr    | Colin                     | Charles                   |
| Ms    | Sveta                     | Smirnova                  |
+-------+---------------------------+---------------------------+

In the example’s LEFT() function, the starting point for extracting data is the first character. The number we gave as an argument (i.e., 2), is the number of characters we want to extract starting from the first. The RIGHT() function is similar, but it starts from the last character on the right, counting left. The MID() function is a little different. With it, you can specify the starting point (i.e., the fifth character in our example) and how many characters you want (i.e., 25 characters).

The SUBSTRING() function is synonymous with MID() and their syntax is the same. By default, if the number of characters to capture isn’t specified, it’s assumed that all the remaining ones are to be extracted. This makes these functions work like the LEFT() function. If the second argument to SUBSTRING() or MID() is a negative number, the function will start from the end of the string, making it like the RIGHT() function.

Because the SUBSTRING() function is so versatile, we can use it to accomplish all the text extraction in the previous example. The equivalent SELECT would look like this:

SELECT SUBSTRING(prospect_name, 1, 2) AS title,
SUBSTRING(prospect_name FROM 5 FOR 25) AS first_name,
SUBSTRING(prospect_name, -25) AS last_name
FROM prospects LIMIT 3;

This example shows three ways to use SUBSTRING():

SUBSTRING(prospect_name, 1, 2) AS title: This has the same syntax we have used for other functions in this section: three arguments to specify the column with the text, the starting point for extracting text, and the number of characters to extract.
SUBSTRING(prospect_name FROM 5 FOR 25) AS first_name: This shows a different, wordier syntax. The starting point here is 5 and the number of characters to extract is 25.
SUBSTRING(prospect_name, -25) AS last_name: This specifies a starting point of −25 characters. Because it doesn’t specify how many to extract, MySQL takes the remaining characters from that starting point.

You can use whatever style you prefer.

The SUBSTRING_INDEX() is similar to the previous functions, but looks for elements that separate data within a string. For example, suppose the prospect_name column was constructed differently. Suppose that instead of having fixed width for the title and names, the text had vertical bars between them. This would be odd for data in a column, but it is possible. Here’s how we could separate the same column containing the vertical bar character as the separator (the first and third third lines using SUBSTRING_INDEX() are fairly understandable, but the second one is more complex):

SELECT SUBSTRING_INDEX(prospect_name, '|', 1) AS title,
SUBSTRING_INDEX( SUBSTRING_INDEX(prospect_name, '|', 2), '|', -1) AS first_name,
SUBSTRING_INDEX(prospect_name, '|', -1) AS last_name
FROM prospects WHERE prospect_id = 7;

The second argument to SUBSTRING_INDEX() tells MySQL how to break the string into the pieces of text we want. In our example, we use '|' to specify the vertical bar. The number in the third argument tells how many elements to take. So in the first line here we’re saying to get the first element. In the third line, because it has a negative sign in front of the number, we’re saying to count from the end and get one element there. In the second line, we’re using SUBSTRING_INDEX() twice, one call embedded inside the other. The inner call extracts the first two elements. Using those results, we then use an outer call to extract its first element starting from the end.

Using SUBSTRING() is much nicer, but you need to know the starting point and how many characters to take. In our vertical bar example, we’d need to know exactly where the vertical bars are in each name. To do that, you will need to use other functions to search strings. Those are covered in the next section.

Searching Strings and Using Lengths

MySQL and MariaDB do not have comprehensive functions for searching string based on patterns. Yes, there’s the REGEXP operator that permits some pattern matching. But this isn’t as robust and isn’t fine tuned as easily as the capabilities offered by programming languages like PHP and Perl. But there are a few functions that assist in searching strings. We’ll look at some of them in this section.

Locating Text Within a String

MySQL and MariaDB have a few built-in functions that can find characters within a string. These functions return the location where the search parameter was found.

The LOCATE() function returns the numeric starting point just left of the first occurrence of a given substring in a given string. It does not search beyond this point. Let’s look at an example. Suppose we want a list of Avocet birds—they’re a type of shore birds that is part of the Recurvirostridae family. We could enter something like this:

SELECT common_name AS 'Avocet'
FROM birds
JOIN bird_families USING(family_id)
WHERE bird_families.scientific_name = 'Recurvirostridae'
AND birds.common_name LIKE '%Avocet%';

+-------------------+
| Avocet            |
+-------------------+
| Pied Avocet       |
| Red-necked Avocet |
| Andean Avocet     |
| American Avocet   |
+-------------------+

Now suppose we want to eliminate the word Avocet from the names returned. There are a few ways we might do that: one way is to use the LOCATE() function to find the word Avocet, and extract all text before it with the SUBSTRING() function:

SELECT
SUBSTRING(common_name, 1, LOCATE(' Avocet', common_name) ) AS 'Avocet'
FROM birds
JOIN bird_families USING(family_id)
WHERE bird_families.scientific_name = 'Recurvirostridae'
AND birds.common_name LIKE '%Avocet%';

+-------------+
| Avocet      |
+-------------+
| Pied        |
| Red-necked  |
| Andean      |
| American    |
+-------------+

That’s a cumbersome example, but it shows you how you can use LOCATE() in conjunction with other functions to get what you want from a string. Let’s look at another example.

Earlier in this chapter, in Trimming and Padding Strings, we had some examples involving merging data from another bird-watcher group. That included using the TRIM() function to remove quotes from around the scientific names of birds spotted by people in that group. Let’s use that column again, but assume that it doesn’t have quotes. Instead, the bird species is given with its bird family in this format: bird species - bird family. For this, we can use the LOCATE() function to locate the hyphen and then the SUBSTRING() to get the family name for the JOIN clause in that earlier example. Here’s just the excerpt from the JOIN clause:

…
JOIN rookery.birds
ON(scientific_name = SUBSTRING(science_name, LOCATE(' - ', science_name) + 3 ) );

Let’s pull this apart to understand it better. First, let’s focus on the inner function, the LOCATE(). The search parameter it’s given is a hyphen surrounded by spaces. The science_name column is the string to search. This function will return the position in the string where the search parameter is found. We’re adding 3 to that because the search parameter is three characters long—in other words, LOCATE() gives us the point before the separator and we want to get the substring after the end of the separator. So the results of LOCATE() + 3 is given as the starting point for the SUBSTRING() function. Because we’re not specifying how many characters we want, MySQL will extract the remaining characters. That will give us the scientific name of the bird in the table we’re joining to birds.

The POSITION() function works like LOCATE(), except that it takes the keyword IN instead of a comma between the substring you’re searching for and the containing string:

POSITION(' - ' IN science_name)

In addition, LOCATE() accepts an optional argument to indicate the starting point for the search, which is not available in POSITION().

Another function for searching a string is FIND_IN_SET(). If you have a string that contains several pieces of data separated by commas, this function tells you which element in that set of data contains the search pattern you give it. To understand this better, suppose that we want to get a list of members from Russia, but ordered by the date when the members joined. We would enter this:

SELECT human_id,
CONCAT(name_first, SPACE(1), name_last) AS Name,
join_date
FROM humans
WHERE country_id = 'ru'
ORDER BY join_date;

+----------+-------------------+------------+
| human_id | Name              | join_date  |
+----------+-------------------+------------+
|       19 | Elena Bokova      | 2011-05-21 |
|       27 | Anahit Vanetsyan  | 2011-10-01 |
|       26 | Katerina Smirnova | 2012-02-01 |
+----------+-------------------+------------+

Now suppose that we want to know the position of the member Anahit Vanetsyan in the list of Russian members. We can see easily from the results just shown that she is the third member from Russia to join. That’s because there are very few results here. Imagine if the results contained hundreds of names. We could use FIND_IN_SET() with a subquery to determine this:

SELECT FIND_IN_SET('Anahit Vanetsyan', Names) AS Position
  FROM
    (SELECT GROUP_CONCAT(Name ORDER BY join_date) AS Names
     FROM
       ( SELECT CONCAT(name_first, SPACE(1), name_last) AS Name,
         join_date
         FROM humans
         WHERE country_id = 'ru')
       AS derived_1 )
    AS derived_2;

+----------+
| Position |
+----------+
|        2 |
+----------+

This is a pretty complex SQL statement. The innermost SELECT is essentially the query we saw earlier, but returning just the full name and join date for each Russian person. These results are fed to GROUP_CONCAT, which produces a single huge string containing all the names. The outermost SELECT finds the name we want and returns its position.

Note

When you put a SELECT statement inside parentheses and derive a table from it that you will use with an outer statement, you must give that derived table a name using AS. For naming simplicity, we’ve named the derived tables in this chapter derived_1 and derived_2. Almost any unique name is fine.

The statement can be useful if we associate it with a user profile page on the Rookery website. We might want to use it to show members where they rank in different lists, such as most sightings of birds or most sightings of birds in a particular category.

FIND_IN_SET() returns 0 if the string is not found in the set or if the string list is empty. It returns NULL if the value of either argument is NULL.

String Lengths

There will be times you want to know how long a string is. There are a few functions that return the character length of a string. This can be useful when adjusting formatting or making other decisions related to a string, and they are commonly used with functions like LOCATE() and SUBSTRING().

The CHAR_LENGTH() or CHARACTER_LENGTH() function returns the number of characters in a string. This could be helpful when different rows have different-length strings in a particular column.

For instance, suppose we want to display on the Rookery website a list of the birds most recently sighted by members, as recorded in the bird_sightings table. We’ll include the common and scientific name and other information about the bird species. Suppose that we want to also include the comments that the member entered when they recorded the sighting. Because this column can contain a lot of text, we want to know how many characters it contains when displaying it. If there’s too much (i.e., more than 100 characters), we’ll limit the text and include a link on the web page to view all of the text. To check the length, we could construct an SQL statement like this that would be part of a program:

SELECT IF(CHAR_LENGTH(comments) > 100), 'long', 'short')
FROM bird_sightings
WHERE sighting_id = 2;

Here we’re using CHAR_LENGTH() to count the number of characters in the comments column for the row selected. We’re using the IF() function to determine whether the character length of the comments is greater than 100 characters. If it is, the function will return the word long. If not, it will return short. If this SQL statement was used in an API script, the value in the WHERE clause for the sighting_id could be dynamically replaced for each bird sighting.

CHAR_LENGTH() understands the character set in current use, as we touched on in Creating a Database. Characters that take up multiple bytes—usually present in Asian languages—are still considered one character. In contrast, the LENGTH() function returns the number of bytes in a given string. Note that there are eight bits to a byte and that Western languages normally use one byte for each letter. If you want to count the number of bits, use the BIT_LENGTH() function.

As an example, suppose we notice that the comments column of the bird_sightings table contains some odd binary characters. They have been entered into the column through the mobile application we provide to members. To narrow the list of rows that have these odd characters so that we can remove them, we can execute the following SQL statement:

SELECT sighting_id
FROM bird_sightings
WHERE CHARACTER_LENGTH(comments) != LENGTH(comments);

This will give us the sighting_id for the rows in which the number of characters does not equal the number of bytes in the comments column.

Comparing and Searching Strings

The previous subsection used the output of CHAR_LENGTH() as input to an IF() statement so that we had a choice of what to return. In this subsection, we’ll look at some functions that compare strings, which can also be handy when used with a logical function such as IF() or in a WHERE clause.

Let’s consider a situation where we might use one of these functions—specifically, the STRCMP() function. The name of the function, in the manner much loved by computer programmers, is a compressed version of “string compare.”

Email addresses are critical for communicating with members so we decide to require new members to enter their email address twice during the registration process to ensure accuracy. However, in case the connection is lost in the process or the joining member does not correct a problem with their email address, we want to keep both addresses until they do. So we’ll add a row to the humans table to store whatever information they give us, and then store both email addresses in another table to compare them. For that comparison, we could use the STRCMP() function in an SQL statement.

This scenario is the kind of situation that you would automate with an API program, a program you would create to interface with MySQL or MariaDB. It would store the SQL statements needed for processing the information the new member enters from the website. To start the process related to checking the email, we might create a table that will store the member’s identification number and the two email addresses. We could do that like so:

CREATE TABLE possible_duplicate_email
(human_id INT,
email_address1 VARCHAR(255),
email_address2 VARCHAR(255),
entry_date datetime );

Now when new members register, after their information has been stored in the humans table, our web interface can store conditionally the two email addresses provided in the possible_duplicate_email table. It might look like this:

INSERT IGNORE INTO possible_duplicate_email
(human_id, email_address_1, email_address_2, entry_date)
VALUES(LAST_INSERT_ID(), 'bobyfischer@mymail.com', 'bobbyfischer@mymail.com')
WHERE ABS( STRCMP('bobbyrobin@mymail.com', 'bobyrobin@mymail.com') ) = 1 ;

For the email addresses, I’ve displayed the plain text. But in a more realistic example, this SQL statement might be embedded in a PHP script and would refer to variables (e.g., $email_1 and $email_2) where the email addresses are here.

Using the STRCMP() in the WHERE clause, if the email addresses match, STRCMP() returns 0. If the addresses don’t match, it will return 1 or -1. It returns -1 if the first value is alphabetically before the second. To allow for that possibility, we put it inside of ABS(), which changes the value to the absolute value—it makes negative values positive. So, if the two email addresses don’t match, the statement will insert the addresses into the possible_duplicate_email table for an administrator to review. Incidentally, that would normally return an error message, but IGNORE flag tells MySQL to ignore errors.

Another comparison function is MATCH() AGAINST(), which searches for a string and returns matching rows from the table. It even ranks the rows by relevance, but that is beyond the scope of this chapter. Among the complications of MATCH() AGAINST(), it works only on columns that have been indexed with a special FULLTEXT index. To test this function, we’ll first add a FULLTEXT index to the bird_sightings table, basing it on the comments column because that’s a TEXT column:

CREATE FULLTEXT INDEX comment_index
ON bird_sightings (comments);

Now you can use MATCH() AGAINST(). It is commonly found in WHERE clauses as a condition to find columns containing a given string. Text in the given string, which is delimited by spaces or quotes, is parsed into separate words. Small words (i.e., three characters or fewer) are generally ignored. Here is an example:

SELECT CONCAT(name_first, SPACE(1), name_last) AS Name,
common_name AS Bird,
SUBSTRING(comments, 1, 25) AS Comments
FROM birdwatchers.bird_sightings
JOIN birdwatchers.humans USING(human_id)
JOIN rookery.birds USING(bird_id)
WHERE MATCH (comments) AGAINST ('beautiful');

+-------------------+-----------------+---------------------------+
| Name              | Bird            | Comments                  |
+-------------------+-----------------+---------------------------+
| Elena Bokova      | Eskimo Curlew   | It was a major effort get |
| Katerina Smirnova | Eurasian Curlew | Such a beautiful bird. I  |
+-------------------+-----------------+---------------------------+

In the WHERE clause, we’re able now to match the comments column against the string beautiful. The comments column from the birdwatchers.bird_sightings is combined in the results with three other columns: common_name from rookery.birds and name_first and name_last from birdwatchers.humans.

We’re using the SUBSTRING function to limit the amount of text displayed. This cuts off the text abruptly. You could use the CONCAT() function to append ellipses to indicate there is more text. You might also use the IF() function to determine whether there is more text before appending ellipses. There are other functions you can use for locating the beautiful within the column so that you can display only the text around it. We’ll cover that kind of function later in this chapter.

Replacing and Inserting into Strings

If you want to insert or replace certain text from a column (but not all of its contents), you could use the INSERT() function. Don’t confuse this with the INSERT statement. The syntax of this function consists of the string or column into which you want to insert text, followed by the position in which to insert text. You may specify also how much text to delete from that point, if you want. Finally, you give the text to insert. Let’s look at some examples of this function.

We’ll start with a simple example. Suppose that on a page of the Rookery site, we are thinking of adding some text to the common names of bird species with the word Least in their name. We want to explain that it means Smallest, so that uninformed birders don’t think it means these birds are the least important. To test this, we enter this SQL statement:

SELECT INSERT(common_name, 6, 0, ' (i.e., Smallest)')
AS 'Smallest Birds'
FROM birds
WHERE common_name LIKE 'Least %' LIMIT 1;

+------------------------------+
| Smallest Birds               |
+------------------------------+
| Least (i.e., Smallest) Grebe |
+------------------------------+

The first argument is the column containing the string we’re manipulating. The second argument is the starting point for inserting text. Based on the WHERE clause, we’re looking for common names that start with Least. That’s 5 characters. We add 1 to that because the starting point for INSERT is 1. The third argument specifies how many characters after the starting point should be replaced. In this case, we’re just inserting text, not replacing any.

The SQL statement uses INSERT() to change the results set, not the data in the table. So we could use the INSERT() function to display the common names like this to new members for the first month who have identified themselves as new to bird-watching. We would have to construct a more complex SQL statement to check who is new, but this example shows you how to insert text within a string. Let’s look now at an example in which we will replace data using INSERT().

Suppose we discover that parts of some of the common bird species names are abbreviated in the birds table (e.g., Great is abbreviated as Gt.). We prefer not to have any abbreviations for the common names. Before changing the data, we’ll execute a SELECT statement to test our use of the INSERT() function:

SELECT common_name AS Original,
INSERT(common_name, LOCATE('Gt.', common_name), 3, 'Great') AS Adjusted
FROM birds
WHERE common_name REGEXP 'Gt.' LIMIT 1;

+------------------+--------------------+
| Original         | Adjusted           |
+------------------+--------------------+
| Gt. Reed-Warbler | Great Reed-Warbler |
+------------------+--------------------+

We’ve already reviewed the arguments of the INSERT() function in the previous example. The extra twist here is in the second argument, which contains the LOCATE(). We’re using that function to determine the position in the string where text is to be replaced. In the previous example, we assumed that the common name would start with the string we wanted to modify. In this case, we’re not assuming the position of the string within the column. Instead, we’re letting MySQL find it for us.

Another difference in this example is the third element: we’re telling the function to replace three characters (i.e., the length of Gt.) from the starting point with the text given for the fourth argument (i.e., Great). Although the text we’re adding is more than three characters, it’s fine because when we update the table later, we’re updating a column with plenty of space to hold the results.

If LOCATE() does not find the string we give it, it returns 0. A value of 0 for the position in the INSERT() function negates it and returns the value of common_name unchanged. So with this usage of INSERT(), because of the inclusion of LOCATE() for the starting location, the WHERE clause is unnecessary—except to see that it works where we want it to.

Now that we’ve verified that our combination of functions works correctly, we can update the data by entering the following SQL statement:

UPDATE birds
SET common_name = INSERT(common_name, LOCATE('Gt.', common_name), 3, 'Great')
WHERE common_name REGEXP 'Gt.';

There is an alternative to using INSERT() for replacing text in a string. In the previous example, we had to use the LOCATE() function to determine the location of the text where we wanted to insert text and we had to tell it how many characters to replace. A simpler function for replacing text is REPLACE(). We could use this function to replace all occurrences of Gt. with Great in the common_name column. Let’s test that with a SELECT statement like so:

SELECT common_name AS Original,
REPLACE(common_name, 'Gt.', 'Great') AS Replaced
FROM birds
WHERE common_name REGEXP 'Gt.' LIMIT 1;

+------------------+--------------------+
| Original         | Replaced           |
+------------------+--------------------+
| Gt. Reed-Warbler | Great Reed-Warbler |
+------------------+--------------------+

This works much better. We can use the REPLACE() with the arguments we have here and enter the following UPDATE to change the data in the table:

UPDATE birds
SET common_name = REPLACE(common_name, 'Gt.', 'Great');

Query OK, 8 rows affected (0.23 sec)
Rows matched: 28891  Changed: 8  Warnings: 0

Notice that we didn’t include the WHERE clause, but the results message says that only eight rows were changed. This is because there were only eight rows that contained Gt. in the common_name column. Updating data in a table with that many rows is intimidating and dangerous without a WHERE clause. That’s why it’s good to use them and to test the parameters with a SELECT statement first.

Converting String Types

There may be times when you will have to work with tables created by people who might not have made the best choices for column data types. Sometimes you can alter the tables, but sometimes you may not be allowed to do so. For manipulating data from such tables or for importing data from them, you can use the CAST() or CONVERT() functions to change the data type of columns. The effect just takes place within your SQL statement, not the database itself. Let’s look at some examples of how and why you might use these two functions, which are basically synonymous except for a minor syntax difference.

Suppose we’re given a table containing images of birds in a particular area, showing female, male, and juvenile color patterns. One of the columns contains numbers for ordering birds based loosely on the type of bird and the date when usually seen in the area. This column isn’t a numeric data type like INT, but is CHAR. When we sort the data based on this column, MySQL will sort the rows lexically, not numerically. Here’s an example of how that might look:

SELECT sorting_id, bird_name, bird_image
FROM bird_images
ORDER BY sorting_id
LIMIT 5;

+------------+-----------------+----------------------------+
| sorting_id | bird_name       | bird_image                 |
+------------+-----------------+----------------------------+
| 11         | Arctic Loon     | artic_loon_male.jpg        |
| 111        | Wilson's Plover | wilson_plover_male.jpg     |
| 112        | Wilson's Plover | wilson_plover_female.jpg   |
| 113        | Wilson's Plover | wilson_plover_juvenile.jpg |
| 12         | Pacific Loon    | pacific_loon_male.jpg      |
+------------+-----------------+----------------------------+

Notice that the rows with a sorting_id starting with 11n are listed before one with the value of 12. That’s because MySQL is reading the data as characters and not numbers. The two Loons should be together, before the Plovers are listed.

We can use the CAST() function to cast the values taken from sorting_id into the INT data type:

SELECT sorting_id, bird_name, bird_image
FROM bird_images ORDER BY CAST(sorting_id AS INT) LIMIT 5;

+------------+-----------------+----------------------------+
| sorting_id | bird_name       | bird_image                 |
+------------+-----------------+----------------------------+
| 11         | Arctic Loon     | artic_loon_male.jpg        |
| 12         | Pacific Loon    | pacific_loon_male.jpg      |
| 111        | Wilson's Plover | wilson_plover_male.jpg     |
| 112        | Wilson's Plover | wilson_plover_female.jpg   |
| 113        | Wilson's Plover | wilson_plover_juvenile.jpg |
+------------+-----------------+----------------------------+

That worked correctly. Let’s suppose now that we don’t want to use sorting_id, but instead the gender_age column. This is an ENUM column specifying that the image file is for a male, female, or a juvenile. The color patterns of most birds deviate based on these factors. Let’s see how the results will look if we sort based on this column:

SELECT bird_name, gender_age, bird_image
FROM bird_images
WHERE bird_name LIKE '%Plover%'
ORDER BY gender_age
LIMIT 5;

+-----------------+------------+----------------------------+
| bird_name       | gender_age | bird_image                 |
+-----------------+------------+----------------------------+
| Wilson's Plover | male       | wilson_plover_male.jpg     |
| Snowy Plover    | male       | snowy_plover_male.jpg      |
| Wilson's Plover | female     | wilson_plover_female.jpg   |
| Snowy Plover    | female     | snowy_plover_female.jpg    |
| Wilson's Plover | juvenile   | wilson_plover_juvenile.jpg |
+-----------------+------------+----------------------------+

Notice that the rows are grouped together based on the gender_age column, but those values are not in alphabetical order (i.e., female rows should be before male rows). This is because of how the enumerated values are listed in the gender_age column:

SHOW COLUMNS FROM bird_images LIKE 'gender_age' \G

*************************** 1. row ***************************
  Field: gender_age
   Type: enum('male','female','juvenile')
   Null: YES
    Key:
Default: NULL
  Extra:

To MySQL, the value of male for the gender_age column is stored as 1, and female as 2. This controls the order of the display, even though the values are rendered as text. If we use though the CAST() or the CONVERT() function in the ORDER BY clause, MySQL will sort the results based on their rendered values and not their column values. Here’s how that would look:

SELECT bird_name, gender_age, bird_image
FROM bird_images
WHERE bird_name LIKE '%Plover%'
ORDER BY CONVERT(gender_age, CHAR)
LIMIT 5;

+-----------------+------------+----------------------------+
| bird_name       | gender_age | bird_image                 |
+-----------------+------------+----------------------------+
| Wilson's Plover | female     | wilson_plover_female.jpg   |
| Snowy Plover    | female     | snowy_plover_female.jpg    |
| Wilson's Plover | juvenile   | wilson_plover_juvenile.jpg |
| Snowy Plover    | juvenile   | snowy_plover_juvenile.jpg  |
| Wilson's Plover | male       | wilson_plover_male.jpg     |
+-----------------+------------+----------------------------+

Notice that for the CONVERT() function, a comma is used to separate the string given from the data type instead of the AS keyword. The data type given as the second argument can be BINARY, CHAR, DATE, DATETIME, SIGNED [INTEGER], TIME, or UNSIGNED [INTEGER]. BINARY converts a string to a binary string. You can add also CHARACTER SET to use a different character set from the default for the value given. To convert the character set of a given string to another, you have to use the USING option, like so:

SELECT bird_name, gender_age, bird_image
FROM bird_images
WHERE bird_name LIKE '%Plover%'
ORDER BY CONVERT(gender_age USING utf8)
LIMIT 5;

Compressing Strings

Some column data types allow large amounts of data. For instance, the BLOB column can store plenty. To reduce the size of tables that use this column data type, you can compress the data it contains when inserting the data. The COMPRESS() function compresses a string and the UNCOMPRESS() function decompresses a compressed string. If you want to use them, MySQL has to have been compiled with a compression library (i.e., zlib). If it wasn’t, a NULL value will be returned when using COMPRESS(). Let’s look at some examples of their use.

The humans table has a column for birding_background which is a BLOB. Members can write as much as they like about themselves, which could result in pages of information on their experiences and education as bird-watchers. This could potentially slow down queries and updates if many members do this. So we decide to use COMPRESS() to compress the member’s background when inserting it into the humans table. Here’s how that might look:

INSERT INTO humans
(formal_title, name_first, name_last, join_date, birding_background)
VALUES('Ms', 'Melissa', 'Lee', CURDATE(), COMPRESS("lengthy background..."));

This SQL statement inserts a new member’s information into the humans table—it has more columns than shown here, but we’re trying to keep this example simple. The statement uses the COMPRESS() function to compress the background information given (which isn’t much for this simple example). You would normally get such data from an API variable using something like PHP to store text entered by the user through a web page. So instead of the text shown here, you would use a variable (e.g., $birding_background).

To see how the data looks in the compressed form, we could do this:

SELECT birding_background AS Background
FROM humans
WHERE name_first = 'Melissa' AND name_last = 'Lee' \G

*************************** 1. row ***************************
Background:    x#####/ɨTHJL##/######## Z######

Notice that the results are not normal text. The mysql client substitutes a hash sign (#) for binary values. In order to see the text contained in this compressed format, we would use UNCOMPRESS(). It returns NULL if the string is not compressed or if MySQL wasn’t compiled with zlib:

SELECT UNCOMPRESS(birding_background) AS Background
FROM humans
WHERE name_first = 'Melissa' AND name_last = 'Lee' \G

*************************** 1. row ***************************
Background: lengthy background...

For small amounts of text like this, compression takes more space than the plain text. But for large amounts of text, it will save plenty of space. So use it sparingly and where appropriate.

Summary

There are more string functions available in MySQL and MariaDB. A few of the functions mentioned here have aliases or close alternatives. There are also functions for converting between ASCII, binary, hexadecimal, and octal strings. And there are also string functions related to text encryption and decryption that were not mentioned. However, I think this chapter has given you a good collection of common string functions that will assist you in building more powerful SQL statements and formatting results to be more attractive.

Exercises

String functions are very necessary to developing databases in MySQL and MariaDB. You need to know them well. To become an expert, you need to practice using them, so be sure to complete all of the following exercises.

One of the most commonly used string functions is CONCAT(). Construct a SELECT statement to query the humans table. Use the CONCAT() function to merge together values from the name_first column with the name_last column. Use the SPACE() function to put a space between them in the results. Give that field an alias of Full Name—and remember to put quotes around this alias, as it contains a space. Limit the results to four people. Execute it to be sure it has no errors.
Add a WHERE clause to that SELECT statement. For the condition of the WHERE clause, copy the CONCAT() you just assembled. List rows where the name is in a set of the following names: Lexi Hollar, Michael Zabalaoui, and Rusty Johnson.
After you successfully execute the SELECT with that WHERE clause, add an ORDER BY clause to sort the data based on the concatenated name. Do it without using CONCAT().
Construct a SELECT statement that selects, from the birds table, the common_name and the scientific_name. Use a string function to change the scientific_name to all lowercase letters. Use the CONCAT() function to put them into one field, with a space after the common name, followed by the scientific name in parentheses—for example, African Desert Warbler (sylvia deserti). Don’t use the SPACE() function. Instead, put the spaces and parentheses within single quote marks within the CONCAT(). Give the resulting field an alias of Bird Species. Limit the results to 10 rows.
After you’ve successfully executed that SQL statement, modify that statement to join in the bird_families and the bird_orders tables. The JOIN statement was covered extensively in Unifying Results. Then add the scientific_name columns from both of these tables to the fields returned.
Execute this modified statement to make sure your joins are correct. When they are, move the scientific_name columns for the two additional tables into the CONCAT(). Using the RPAD() function, put dots after the bird species name, before the bird family and the bird order names. The results for a field will look like this:
```
Speckled Warbler (pyrrholaemus sagittatus)...Acanthizidae...Passeriformes
```
This will probably require you to use CONCAT() twice. Use a WHERE clause to list only Warblers. Limit the results to 10 rows.
Construct another SELECT statement to list all of the common names of bird species from the birds table, where the common name contains the word Shrike. When you execute that statement you should see some names with a hyphen after the word Shrike. Add the REPLACE() function to the SELECT statement to replace those hyphens with a space in the results, and then execute the SQL statement again.
Some of the names of the birds in the results from the SELECT statement in the previous exercise have more than one hyphen (e.g., Yellow-browed Shrike-Vireo). Redo that SQL statement to replace only the hyphens after the word Shrike (e.g., to look like this: Yellow-browed Shrike Vireo). In order to do this, use LOCATE() with REPLACE(). You will need to use LOCATE() twice: one within another.
True Shrikes are of the Laniidae family. Construct another SELECT to select the common bird names with the word Shrike, but belonging to Laniidae. This will require a join to the bird_families table. Use one of the substring functions like SUBSTRING() to extract the words before Shrike. To do this, you will need to use LOCATE() or a similar function. Then use CONCAT() to display that value extracted after Shrike with a comma and space in between. The results for each field should look like this: Shrike, Rufous-tailed. Give the field an alias of Shrikes.
The humans table contains entries in which the member used either all lowercase letters or all uppercase letters to enter their first and last names (e.g., andy oram and MICHAEL STONE). Use UPDATE to change the names to title case (i.e., the first letter capital and the rest lowercase). First experiment with SELECT to make sure you have the functions organized properly. Use the UCASE() and LCASE() functions to set the cases. You will need to use SUBSTRING() or a similar function a few times, and CONCAT() a couple of times.