Chapter 3. Installing Cassandra

For those among us who like instant gratification, let’s start by installing Cassandra. Because Cassandra introduces a lot of new vocabulary, there might be some unfamiliar terms as you walk through this. That’s OK; the idea here is to get set up quickly in a simple configuration to make sure everything is running properly. This will serve as an orientation. Then, we’ll take a step back and explain Cassandra in its larger context.

Installing the Apache Distribution

While there are a number of options available for installing Cassandra on various operating systems, let’s start your journey by downloading the Apache distribution from http://cassandra.apache.org so you can get a good look at what’s inside. We’ll explore other installation options in “Other Cassandra Distributions”.

Click the link on the Cassandra home page to download a version as a gzipped archive. Typically, multiple versions of Cassandra are provided. The latest version is the current recommended version for use in production. There are other supported releases that are still viable for production usage and receive bug fixes. The project goal is to limit the number of supported releases, but reasonable accommodations are made. For example, the 2.2 and 2.1 releases were considered to be officially maintained through the release of 4.0. For all releases, the prebuilt binary is named apache-cassandra-x.x.x-bin.tar.gz, where x.x.x represents the version number. The download for Cassandra 4.0 is around 40 MB.

Extracting the Download

You can unpack the compressed file using any utility that handles gzipped TAR archives. On Unix-based systems such as Linux or macOS, gzip extraction utilities should be preinstalled; on Windows, you’ll need to get a program such as WinZip, which is commercial, or something like 7-Zip, which is free.

Open your extracting program. You might have to decompress the gzip file and unpack the TAR file in separate steps. Once you have a folder on your filesystem called apache-cassandra-x.x.x, you’re ready to run Cassandra.
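
On Unix-based systems, the tar utility can usually handle the gzip and TAR steps in one pass. The following is a minimal sketch, assuming a tar that supports the -z option (both GNU tar and BSD tar do); unpack_cassandra is a hypothetical helper name, not something shipped with Cassandra.

```shell
# Hypothetical helper: unpack a gzipped Cassandra tarball into a target directory.
# Usage: unpack_cassandra <archive.tar.gz> <destination-directory>
unpack_cassandra() {
  archive=$1
  dest=$2
  mkdir -p "$dest" || return 1
  # -x extract, -z gunzip on the fly, -f read from the named archive,
  # -C change to the destination directory before extracting.
  tar -xzf "$archive" -C "$dest"
}

# Example (substitute the real version number for x.x.x):
#   unpack_cassandra apache-cassandra-x.x.x-bin.tar.gz "$HOME"
```

Keeping the destination as a parameter makes it easy to unpack several versions side by side.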

What’s in There?

Once you decompress the tarball, you’ll see that the Cassandra binary distribution includes several files and directories.

The files include the NEWS.txt file, which includes the release notes describing features included in the current and prior releases, and the CHANGES.txt file, which is similar but focuses on bug fixes. You’ll want to make sure to review these files whenever you are upgrading to a new version so you know what changes to expect. The LICENSE.txt and NOTICE.txt files contain the Apache 2.0 license used by Cassandra, and copyright notices for Cassandra and included software, respectively.

Let’s take a moment now to look around in the different directories and see what’s there:

bin: This directory contains the executables to run Cassandra as well as clients, including the query language shell (cqlsh). It also has scripts to run the nodetool, which is a utility for inspecting a cluster to determine whether it is properly configured, and to perform a variety of maintenance operations. We look at nodetool in depth later. The directory also contains several utilities for performing operations on SSTables, the files in which Cassandra stores its data on disk. We’ll discuss these utilities in Chapter 12.
conf: This directory contains the files for configuring your Cassandra instance. The configuration files you may use most frequently include the cassandra.yaml file, which is the primary configuration for running Cassandra, and the logback.xml file, which lets you change the logging settings to suit your needs. Additional files can be used to configure Java Virtual Machine (JVM) settings, the network topology, metrics reporting, archival and restore commands, and triggers. You’ll learn how to use these configuration files in Chapter 10.
doc: Traditionally, documentation has been one of the weaker areas of the project, but a concerted effort for the 4.0 release, including sponsorship from the Google Season of Docs project, yielded significant progress to the documentation included in the Cassandra distribution as well as the documentation on the Cassandra website. The documentation includes a getting started guide, an architectural overview, and instructions for configuring and operating Cassandra.
javadoc: This directory contains a documentation website generated using Java’s JavaDoc tool. Note that JavaDoc reflects only the comments that are stored directly in the Java code, and as such does not represent comprehensive documentation. It’s helpful if you want to see how the code is laid out. Moreover, Cassandra is a wonderful project, but the code contains relatively few comments, so you might find the JavaDoc’s usefulness limited. It may be more fruitful to simply read the class files directly if you’re familiar with Java. Nonetheless, to read the JavaDoc, open the javadoc/index.html file in a browser.
lib: This directory contains all of the external libraries that Cassandra needs to run. For example, it uses two different JSON serialization libraries, the Google collections project, and several Apache Commons libraries.
pylib: This directory contains Python libraries that are used by cqlsh.
tools: This directory contains tools that are used to maintain your Cassandra nodes. You’ll learn about these tools in Chapter 12.

Additional Directories

If you’ve already run Cassandra using the default configuration, you will notice two additional directories under the main Cassandra directory: data and logs. We’ll discuss the contents of these directories momentarily.

Building from Source

Cassandra uses Apache Ant for its build scripting language and Maven for dependency management.

Downloading Ant

You can download Ant from http://ant.apache.org. You don’t need to download Maven separately just to build Cassandra.

Building from source requires a complete Java 8 JDK (or later version), not just the Java Runtime Environment (JRE). If you see a message about how Ant is missing tools.jar, either you don’t have the full JDK or you’re pointing to the wrong path in your environment variables. Maven downloads files from the internet, so if your connection is down or Maven cannot determine the proxy settings, the build will fail.
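
Before starting a build, it can help to confirm that the required programs are on your path. This is a minimal sketch; missing_tools is a hypothetical helper name, and checking for javac is used here as a rough indicator that a full JDK, rather than only a JRE, is installed.

```shell
# Hypothetical helper: print the name of each listed command that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: report anything missing for a source build.
#   missing_tools git ant javac
```

If the function prints nothing, all of the named commands were found.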

Downloading Development Builds

If you want to download the latest Cassandra builds or view test results, you can find these in Jenkins, which the Cassandra project uses as its continuous integration tool. See this Jenkins page for the latest builds and test coverage information.

If you are interested in having a look at the Cassandra source, you can get the trunk version of the Cassandra source using this command:

$ git clone https://github.com/apache/cassandra.git

Because Maven takes care of all the dependencies, it’s easy to build Cassandra once you have the source. Just make sure you’re in the root directory of your source download and execute the ant program, which will look for a file called build.xml in the current directory and execute the default build target. Ant and Maven take care of the rest. To execute the Ant program and start compiling the source, just type:

$ ant

That’s it. Maven will retrieve all of the necessary dependencies, and Ant will build the hundreds of source files and execute the tests. If all went well, you should see a BUILD SUCCESSFUL message. If all did not go well, make sure that your path settings are all correct, that you have the most recent versions of the required programs, and that you downloaded a stable Cassandra build. You can check the Jenkins report to make sure that the source you downloaded actually can compile.

More Build Output

If you want to see detailed information on what is happening during the build, you can pass Ant the -v option to cause it to output verbose details regarding each operation it performs.

Additional Build Targets

To compile the server, you can simply execute ant, as shown previously. This command executes the default target, jar. This target will perform a complete build, including unit tests, and output a file into the build directory called apache-cassandra-x.x.x.jar.

If you want to see a list of all of the targets supported by the build file, simply pass Ant the -p option to get a description of each target. Here are a few others you might be interested in:

test: Users will probably find this the most helpful, as it executes the battery of unit tests. You can also check out the unit test sources themselves for some useful examples of how to interact with Cassandra.
stress-build: This target builds the Cassandra stress tool, which you will learn to use in Chapter 13.
clean: This target removes locally created artifacts such as generated source files and classes and unit test results. The related target realclean performs a clean and additionally removes the Cassandra distribution JAR files and JAR files downloaded by Maven.

Running Cassandra

The Cassandra developers have done a terrific job of making it very easy for new users to start using Cassandra immediately, as you can start a single node without making any changes to the default configuration. We’ll note some of the available configuration options in Chapter 10.

Required Java Version

Cassandra versions from 3.0 onward require a Java 8 JVM or later, preferably the latest stable version. It has been tested on both the OpenJDK and Oracle’s JDK. Cassandra 4.0 has been compiled and tested against both Java 8 and Java 11. You can check your installed Java version by opening a command prompt and executing java -version. If you need a JDK, you can get one at this Java SE Downloads page or the jdk.java.net page.
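
If you want to check the Java version from a script, the version string needs a little parsing, because Java 8 and earlier report themselves as 1.8 and so on, while later releases report 11, 12, and so on. The following is a minimal sketch under that assumption; java_major is a hypothetical helper name.

```shell
# Hypothetical helper: reduce a Java version string to its major version number.
# Handles the legacy "1.8.0_25" scheme as well as the newer "11.0.2" scheme.
java_major() {
  v=$1
  case $v in
    1.*) v=${v#1.} ;;   # legacy scheme: drop the leading "1."
  esac
  echo "${v%%[._-]*}"   # keep everything before the first '.', '_' or '-'
}

# Example: java -version writes to stderr, so redirect it before parsing.
#   ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
#   java_major "$ver"
```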

Setting the Environment

Once you have the binary (or the source downloaded and compiled), you’re ready to start the database server.

Setting the JAVA_HOME environment variable is recommended. To do this on a Windows system, click the Start button and then right-click Computer. Click Advanced System Settings, and then click the Environment Variables… button. Click New… to create a new system variable. In the Variable Name field, type JAVA_HOME. In the Variable Value field, type the path to your Java installation. This is probably something like C:\Program Files\Java\jre1.8.0_25 or /usr/java/jre1.8.0_.
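
On Unix-based systems, JAVA_HOME is usually set in a shell startup file instead. The following is a minimal sketch; java_home_ok is a hypothetical helper name, and the path in the example is a placeholder that you would replace with your own Java installation directory.

```shell
# Hypothetical helper: succeed only if the directory looks like a Java installation,
# that is, it contains an executable bin/java.
java_home_ok() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Example for a shell startup file such as ~/.profile (the path is a placeholder):
#   export JAVA_HOME=/path/to/your/java
#   java_home_ok "$JAVA_HOME" || echo "JAVA_HOME does not point at a Java installation" >&2
```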

Once you’ve started the server for the first time, Cassandra will add directories to your system to store its datafiles. The default configuration creates directories under the CASSANDRA_HOME directory:

data

Datafile Locations

The datafile locations are configurable in the cassandra.yaml file, located in the conf directory. The properties are called data_file_directories, commitlog_directory, hints_directory, and saved_caches_directory. We’ll discuss the recommended configuration of these directories in Chapter 10.

Many users on Unix-based systems prefer to use the /var/lib directory for data storage. If you are changing this configuration, you will need to edit the conf/cassandra.yaml file and create the referenced directories for Cassandra to store its data, making sure to configure write permissions for the user that will be running Cassandra:

$ sudo mkdir -p /var/lib/cassandra
$ sudo chown -R username /var/lib/cassandra

Instead of username, substitute your own username, of course.
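
If you want a separate subdirectory for each of the datafile locations, you can create them in one step. This is a minimal sketch; make_cassandra_dirs is a hypothetical helper name, and the subdirectory names are an assumption chosen to mirror the four configuration properties.

```shell
# Hypothetical helper: create one subdirectory per datafile location under a base directory.
# The subdirectory names are an assumption, chosen to mirror the four properties.
make_cassandra_dirs() {
  base=$1
  for sub in data commitlog hints saved_caches; do
    mkdir -p "$base/$sub" || return 1
  done
}

# Example (run with sufficient privileges, then point the cassandra.yaml
# properties at the new directories):
#   make_cassandra_dirs /var/lib/cassandra
```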

Starting the Server

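To start the Cassandra server on any OS, open a command prompt or terminal window, navigate to the <cassandra-directory>/bin directory where you unpacked Cassandra, and run the command cassandra -f to start your server. On Unix-based systems, you may need to type ./cassandra -f if the current directory is not on your path.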

Starting Cassandra in the Foreground

Using the -f switch tells Cassandra to stay in the foreground instead of running as a background process, so that all of the server logs will print to standard out (stdout in Unix systems) and you can see them in your terminal window, which is useful for testing. In either case, the logs will append to the system.log file.

In a clean installation, you should see quite a few log statements as the server gets running. The exact syntax of logging statements will vary depending on the release you’re using, but there are a few highlights you can look for. If you search for cassandra.yaml, you’ll quickly run into the following:

INFO  [main] 2019-08-25 17:42:11,712 YamlConfigurationLoader.java:89 -
  Configuration location:
  file:/Users/jeffreycarpenter/cassandra/conf/cassandra.yaml
INFO  [main] 2019-08-25 17:42:11,855 Config.java:598 - Node configuration:[
  allocate_tokens_for_keyspace=null;
  ...

These log statements indicate the location of the cassandra.yaml file containing the configured settings. The Node configuration statement lists out the settings read from the config file.

Now search for JVM and you’ll find something like this:

INFO  [main] 2019-08-25 17:42:12,308 CassandraDaemon.java:487 -
  JVM vendor/version: OpenJDK 64-Bit Server VM/12.0.1
INFO  [main] 2019-08-25 17:42:12,309 CassandraDaemon.java:488 -
  Heap size: 3.900GiB/3.900GiB

These log statements provide information describing the JVM being used, including memory settings.

Next, search for the versions in use—Cassandra version, CQL version, Native protocol supported versions:

INFO  [main] 2019-08-25 17:42:17,847 StorageService.java:610 -
  Cassandra version: 4.0-alpha3
INFO  [main] 2019-08-25 17:42:17,848 StorageService.java:611 -
  CQL version: 3.4.5
INFO  [main] 2019-08-25 17:42:17,848 StorageService.java:612 -
  Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)

You can also find statements where Cassandra is initializing internal data structures, such as caches:

INFO [main] 2015-12-08 06:02:43,633 CacheService.java:115 -
  Initializing key cache with capacity of 24 MBs.
INFO [main] 2015-12-08 06:02:43,679 CacheService.java:137 -
  Initializing row cache with capacity of 0 MBs
INFO [main] 2015-12-08 06:02:43,686 CacheService.java:166 -
  Initializing counter cache with capacity of 12 MBs

If you search for terms like JMX, gossip, and listening, you’ll find statements like the following:

WARN  [main] 2019-08-25 17:42:12,363 StartupChecks.java:168 -
  JMX is not enabled to receive remote connections.
  Please see cassandra-env.sh for more info.
INFO  [main] 2019-08-25 17:42:18,354 StorageService.java:814 -
  Starting up server gossip
INFO  [main] 2019-08-25 17:42:18,070 InboundConnectionInitiator.java:130 -
  Listening on address: (127.0.0.1:7000), nic: lo0, encryption: enabled (openssl)

These log statements indicate the server is beginning to initiate communications with other servers in the cluster and expose publicly available interfaces. By default, the management interface via the Java Management Extensions (JMX) is disabled for remote access. We’ll explore the management interface in Chapter 11.

Finally, search for state jump and you’ll see the following:

INFO  [main] 2019-08-25 17:42:18,581 StorageService.java:1507 -
  JOINING: Finish joining ring
INFO  [main] 2019-08-25 17:42:18,591 StorageService.java:2508 -
  Node 127.0.0.1:7000 state jump to NORMAL

Congratulations! Now your Cassandra server should be up and running with a new single-node cluster called “Test Cluster,” ready to interact with other nodes and clients. If you continue to monitor the output, you’ll begin to see periodic output such as memtable flushing and compaction, which you’ll learn about soon.
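
The searches described above can also be run in one pass against the log file. This is a minimal sketch; startup_highlights is a hypothetical helper name, the search phrases are taken from the sample output shown earlier, and the exact wording may differ in the release you’re using.

```shell
# Hypothetical helper: print the startup milestones discussed above from a log file.
startup_highlights() {
  logfile=$1
  # -E enables extended regular expressions so the alternatives can be joined with '|'.
  grep -E 'Configuration location|JVM vendor/version|Cassandra version|Starting up server gossip|state jump to NORMAL' "$logfile"
}

# Example, assuming the log is at logs/system.log under the Cassandra directory:
#   startup_highlights logs/system.log
```

If the last phrase is missing from the output, the node has not finished joining the ring.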

Starting Over

The committers work hard to ensure that data is readable from one minor dot release to the next and from one major version to the next. The commit log, however, needs to be completely cleared out from version to version (even minor versions).

If you have any previous versions of Cassandra installed, you may want to clear out the data directories for now, just to get up and running. If you’ve messed up your Cassandra installation and want to get started cleanly again, you can delete the data folders.

Stopping Cassandra

Now that you’ve successfully started a Cassandra server, you may be wondering how to stop it. You may have noticed the stop-server command in the bin directory. Let’s try running that command. Here’s what you’ll see on Unix systems:

$ ./stop-server
please read the stop-server script before use

So you see that the server has not been stopped, but instead you are directed to read the script. Taking a look inside with your favorite code editor, you’ll learn that the way to stop Cassandra is to kill the JVM process that is running Cassandra. The file suggests a couple of different techniques by which you can identify the JVM process and kill it.

The first technique is to start Cassandra using the -p option, which provides Cassandra with the name of a file to which it should write the process identifier (PID) upon starting up. This is arguably the most straightforward approach to making sure you kill the right process.

However, because you did not start Cassandra with the -p option, you’ll need to find the process yourself and kill it. The script suggests using pgrep to locate processes for the current user containing the term “cassandra”:

user=`whoami`
pgrep -u $user -f cassandra | xargs kill -9
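
For comparison, the first technique, using a PID file written by the -p option, might look like the following. This is a minimal sketch; stop_from_pidfile is a hypothetical helper name and the PID file location in the example is a placeholder.

```shell
# Hypothetical helper: stop the process whose PID was recorded in a PID file.
stop_from_pidfile() {
  pidfile=$1
  [ -r "$pidfile" ] || { echo "cannot read $pidfile" >&2; return 1; }
  pid=$(cat "$pidfile")
  # kill sends SIGTERM by default, which asks the process to shut down
  # rather than ending it abruptly as kill -9 does.
  kill "$pid"
}

# Example (the PID file location is a placeholder):
#   bin/cassandra -p /tmp/cassandra.pid
#   stop_from_pidfile /tmp/cassandra.pid
```

Because the PID comes from the file Cassandra wrote, there is no risk of matching an unrelated process whose command line happens to contain the term “cassandra.”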

Stopping Cassandra on Windows

On Windows installations, you can find the JVM process and kill it using the Task Manager.

Other Cassandra Distributions

The preceding instructions showed you how to install the Apache distribution of Cassandra. In addition to the Apache distribution, there are a couple of other ways to get Cassandra:

DataStax Enterprise Edition

We’ll take a deeper look at several options for deploying Cassandra in production environments, including Kubernetes and cloud computing environments, in Chapter 10.

Selecting the right distribution will depend on your deployment environment; your needs for scale, stability, and support; and your development and maintenance budgets. Having both open source and commercial deployment options provides the flexibility to make the right choice for your organization.

Running the CQL Shell

Now that you have a Cassandra installation up and running, let’s give it a quick try to make sure everything is set up properly. You’ll use the CQL shell (cqlsh) to connect to your server and have a look around.

Deprecation of the CLI

If you’ve used Cassandra in releases prior to 3.0, you may also be familiar with the command-line client interface known as cassandra-cli. The CLI was removed in the 3.0 release because it depends on the legacy Thrift API, which was deprecated in 3.0 and removed entirely in 4.0.

To run the shell, create a new terminal window, change to the Cassandra home directory, and type the following command (you should see output similar to that shown here):

$ bin/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 4.0-alpha3 | CQL spec 3.4.5 | Native protocol v4]
Use HELP for help.

Because you did not specify a node to which you wanted to connect, the shell helpfully checks for a node running on the local host, and finds the node you started earlier. The shell also indicates that you’re connected to a Cassandra server cluster called “Test Cluster.” That’s because this cluster of one node at localhost is set up for you by default.

Renaming the Default Cluster

In a production environment, be sure to change the cluster name to something more suitable to your application.

To connect to a specific node, specify the hostname and port on the command line. For example, the following will connect to your local node:

$ bin/cqlsh localhost 9042

The port number can be omitted if the node uses the default value (9042). Another alternative for configuring the cqlsh connection is to set the environment variables $CQLSH_HOST and $CQLSH_PORT. This approach is useful if you will be frequently connecting to a specific node on another host. The environment variables will be overridden if you specify the host and port on the command line.
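
The precedence just described (command line first, then environment variables, then defaults) can be illustrated with a few lines of shell. This is a minimal sketch that only models the rule; cqlsh_target is a hypothetical helper name, it is not how cqlsh itself is implemented, and the default host of 127.0.0.1 is an assumption based on the connection message shown earlier.

```shell
# Hypothetical helper: show which host and port a cqlsh invocation would target,
# following the precedence described above. It only illustrates the rule.
cqlsh_target() {
  host=${1:-${CQLSH_HOST:-127.0.0.1}}
  port=${2:-${CQLSH_PORT:-9042}}
  echo "$host:$port"
}

# Example (the hostname is a placeholder):
#   export CQLSH_HOST=cassandra1.example.com
#   bin/cqlsh                  # targets the host from the environment
#   bin/cqlsh localhost 9042   # command-line arguments take precedence
```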

Connection Errors

Have you run into an error like this while trying to connect to a server?

Exception connecting to localhost/9042. Reason:
  Connection refused.

If so, make sure that a Cassandra instance is started at that host and port, and that you can ping the host you’re trying to reach. There may be firewall rules preventing you from connecting.

To see a complete list of the command-line options supported by cqlsh, type the command cqlsh --help.

Basic cqlsh Commands

Let’s take a quick tour of cqlsh to learn what kinds of commands you can send to the server. You’ll see how to use the basic environment commands and how to do a round trip of inserting and retrieving some data.

Case in cqlsh

The cqlsh commands are all case insensitive. For the examples in this book, we adopt the convention of uppercase to be consistent with the way the shell describes its own commands in help topics and output.

cqlsh Help

To get help for cqlsh, type HELP or ? to see the list of available commands:

cqlsh> help

Documented shell commands:
===========================
CAPTURE  CLS          COPY  DESCRIBE  EXPAND  LOGIN   SERIAL  SOURCE   UNICODE
CLEAR    CONSISTENCY  DESC  EXIT      HELP    PAGING  SHOW    TRACING

CQL help topics:
================
AGGREGATES               CREATE_KEYSPACE           DROP_TRIGGER      TEXT
ALTER_KEYSPACE           CREATE_MATERIALIZED_VIEW  DROP_TYPE         TIME
ALTER_MATERIALIZED_VIEW  CREATE_ROLE               DROP_USER         TIMESTAMP
ALTER_TABLE              CREATE_TABLE              FUNCTIONS         TRUNCATE
ALTER_TYPE               CREATE_TRIGGER            GRANT             TYPES
ALTER_USER               CREATE_TYPE               INSERT            UPDATE
APPLY                    CREATE_USER               INSERT_JSON       USE
ASCII                    DATE                      INT               UUID
BATCH                    DELETE                    JSON
BEGIN                    DROP_AGGREGATE            KEYWORDS
BLOB                     DROP_COLUMNFAMILY         LIST_PERMISSIONS
BOOLEAN                  DROP_FUNCTION             LIST_ROLES
COUNTER                  DROP_INDEX                LIST_USERS
CREATE_AGGREGATE         DROP_KEYSPACE             PERMISSIONS
CREATE_COLUMNFAMILY      DROP_MATERIALIZED_VIEW    REVOKE
CREATE_FUNCTION          DROP_ROLE                 SELECT
CREATE_INDEX             DROP_TABLE                SELECT_JSON

cqlsh Help Topics

You’ll notice that the help topics listed differ slightly from the actual command syntax. For example, the CREATE_TABLE help topic describes how to use the CREATE TABLE … syntax.

To get additional documentation about a particular command, type HELP <command>. Many cqlsh commands may be used with no parameters, in which case they print out the current setting. Examples include CONSISTENCY, EXPAND, and PAGING.

Describing the Environment in cqlsh

Now that you have connected to your Cassandra instance Test Cluster, to learn more about the cluster you’re working in, type:

cqlsh> DESCRIBE CLUSTER;
Cluster: Test Cluster
Partitioner: Murmur3Partitioner

To see which keyspaces are available in the cluster, issue this command:

cqlsh> DESCRIBE KEYSPACES;
system_traces  system_auth  system_distributed     system_views
system_schema  system       system_virtual_schema

Initially this list will consist of several system keyspaces. Once you have created your own keyspaces, they will be shown as well. The system keyspaces are managed internally by Cassandra, and aren’t for you to put data into. In this way, these keyspaces are similar to the master and tempdb databases in Microsoft SQL Server. Cassandra uses these keyspaces to store the schema, tracing, and security information. We’ll learn more about these keyspaces in Chapter 6.

You can use the following command to learn the client, server, and protocol versions in use:

cqlsh> SHOW VERSION;
[cqlsh 5.0.1 | Cassandra 4.0-alpha3 | CQL spec 3.4.5 | Native protocol v4]

You may have noticed that this version info is printed out when cqlsh starts. There are a variety of other commands with which you can experiment. For now, let’s add some data to the database and get it back out again.

Creating a Keyspace and Table in cqlsh

A Cassandra keyspace is sort of like a relational database. It defines one or more tables. When you start cqlsh without specifying a keyspace, the prompt will look like this: cqlsh>, with no keyspace specified.

Now you’ll create your own keyspace so you have something to write data to. In creating your keyspace, there are some required options. To walk through these options, you could use the command HELP CREATE_KEYSPACE, but instead you can use the helpful command-completion features of cqlsh. Type the following and then press the Tab key:

cqlsh> CREATE KEYSPACE my_keyspace WITH

When you press the Tab key, cqlsh begins completing the syntax of your command:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '

This is informing you that in order to specify a keyspace, you also need to specify a replication strategy. Tab again to see what options you have:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '
NetworkTopologyStrategy    OldNetworkTopologyStrategy SimpleStrategy

Now cqlsh is giving you three strategies to choose from. You’ll learn more about these strategies in Chapter 6. For now, choose the SimpleStrategy by typing the name, and indicate you’re done with a closing quote and Tab again:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class':
  'SimpleStrategy', 'replication_factor':

The next option you’re presented with is a replication factor. For the simple strategy, this indicates how many nodes the data in this keyspace will be written to. For a production deployment, you’ll want copies of your data stored on multiple nodes, but because you’re just running a single node at the moment, you’ll ask for a single copy. Specify a value of “1” and a space and Tab again:

cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class':
  'SimpleStrategy', 'replication_factor': 1};

You see that cqlsh has now added a closing brace, indicating you’ve completed all of the required options. Complete the command with a semicolon and return, and your keyspace will be created.

Keyspace Creation Options

For a production keyspace, you would probably never want to use a value of 1 for the replication factor. There are additional options on creating a keyspace depending on the replication strategy that is chosen. The command completion feature will walk through the different options.
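
Once you know the syntax, you can also create a keyspace without the interactive completion. This is a minimal sketch; create_keyspace_cql is a hypothetical helper name that simply prints the same SimpleStrategy statement built above with a replication factor of your choice, and the example relies on the -e option of cqlsh to execute a statement and exit.

```shell
# Hypothetical helper: print a CREATE KEYSPACE statement that uses SimpleStrategy
# with the given replication factor.
create_keyspace_cql() {
  name=$1
  rf=$2
  printf "CREATE KEYSPACE %s WITH replication = {'class': 'SimpleStrategy', 'replication_factor': %s};\n" "$name" "$rf"
}

# Example: pass the generated statement to cqlsh with its -e (execute) option.
#   bin/cqlsh -e "$(create_keyspace_cql my_keyspace 1)"
```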

Have a look at your keyspace using the DESCRIBE KEYSPACE command:

cqlsh> DESCRIBE KEYSPACE my_keyspace
CREATE KEYSPACE my_keyspace WITH replication = {'class':
  'SimpleStrategy', 'replication_factor': '1'} AND
  durable_writes = true;

We see that the keyspace has been created with the SimpleStrategy, a replication_factor of one, and durable writes. Notice that your keyspace is described in much the same syntax that we used to create it, with one additional option that we did not specify: durable_writes = true. Don’t worry about this option now; we’ll return to it in Chapter 6.

After you have created your own keyspace, you can switch to it in the shell by typing:

cqlsh> USE my_keyspace;
cqlsh:my_keyspace>

Notice that the prompt has changed to indicate that we’re using the keyspace.

Now that you have a keyspace, you can create a table in your keyspace. To do this in cqlsh, use the following command:

cqlsh:my_keyspace> CREATE TABLE user ( first_name text ,
  last_name text, title text, PRIMARY KEY (last_name, first_name)) ;

This creates a new table called “user” in your current keyspace with three columns to store first and last names and a title, all of type text. The text and varchar types are synonymous and are used to store strings. You’ve specified a primary key for this table consisting of the first_name and last_name and taken the defaults for other table options. You’ll learn more about primary keys and the significance of your choice of primary key in Chapter 4, but for now let’s think of that combination of names as identifying unique rows in your table. The title column is the only one in your table that is not part of the primary key.

Using Keyspace Names in cqlsh

You could have also created this table without switching to your keyspace by using the syntax CREATE TABLE my_keyspace.user.

You can use cqlsh to get a description of the table you just created using the DESCRIBE TABLE command:

cqlsh:my_keyspace> DESCRIBE TABLE user;
CREATE TABLE my_keyspace.user (
    first_name text,
    last_name text,
    title text,
    PRIMARY KEY (last_name, first_name)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.
      SizeTieredCompactionStrategy', 'max_threshold': '32',
      'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class':
      'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99p';

You’ll notice that cqlsh prints a nicely formatted version of the CREATE TABLE command that you just typed in but also includes default values for all of the available table options that you did not specify. We’ll describe these settings later. For now, you have enough to get started.

Writing and Reading Data in cqlsh

Now that you have a keyspace and a table, you can write some data to the database and read it back out again. It’s OK at this point not to know quite what’s going on. You’ll come to understand Cassandra’s data model in depth later. For now, you have a keyspace (database), which has a table, which holds columns, the atomic unit of data storage.

To write rows, you use the INSERT command:

cqlsh:my_keyspace> INSERT INTO user (first_name, last_name, title)
  VALUES ('Bill', 'Nguyen', 'Mr.');

Here you have created a new row with three columns, identified by the primary key values Nguyen and Bill, to store a set of related values. The column names are first_name, last_name, and title.

Now that you have written some data, you can read it back using the SELECT command:

cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill' AND
  last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |   Mr.

(1 rows)

In this command, you requested to return rows matching the primary key including all columns. For this query, you specified both of the columns referenced by the primary key. What happens when you only specify one of the values? Let’s find out:

cqlsh:my_keyspace> SELECT * FROM user where last_name = 'Nguyen';

 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |   Mr.

(1 rows)
cqlsh:my_keyspace> SELECT * FROM user where first_name = 'Bill';
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance.
If you want to execute this query despite the
performance unpredictability, use ALLOW FILTERING"

This behavior might not seem intuitive at first, but it has to do with the composition of the primary key you used for this table. This is your first clue that there might be something a bit different about accessing data in Cassandra as compared to what you might be used to in SQL. We’ll explain the significance of your primary key selection and the ALLOW FILTERING option in Chapter 4 and other chapters.

Counting Data and Full Table Scans

Many new Cassandra users, especially those who are coming from a relational background, will be inclined to use the SELECT COUNT command as a way to ensure data has been written. For example, you could use the following command to verify your write to the user table:

cqlsh:my_keyspace> SELECT COUNT (*) FROM user;
 count
-------
     1

(1 rows)

Warnings :
Aggregation query used without partition key

Note that when you execute this command, cqlsh gives you the correct count of rows, but also gives you a warning. This is because you’ve asked Cassandra to perform a full table scan. In a multi-node cluster with potentially large amounts of data, this COUNT could be a very expensive operation. Throughout the rest of the book, you’ll encounter various ways in which Cassandra tries to warn you or constrain your ability to perform operations that will perform poorly at scale in a distributed architecture.
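If what you really want is the number of rows for one last name, restricting the count to a single partition avoids the full table scan. The following is a minimal sketch, run from the operating system shell rather than from inside cqlsh. It assumes that last_name is the partition key of the user table (as the query behavior shown earlier suggests) and that cqlsh is on your PATH. The function name count_users_by_last_name is hypothetical.

```shell
# Count rows within one partition instead of scanning the whole table.
# Assumption: last_name is the partition key of my_keyspace.user.
# $1 is the last name to count; -k selects the keyspace, -e runs one statement.
count_users_by_last_name() {
  cqlsh -k my_keyspace -e "SELECT COUNT(*) FROM user WHERE last_name = '$1';"
}
```

Calling count_users_by_last_name Nguyen should return the count for that partition only. Because the query names the partition key, we would not expect the aggregation warning to appear.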

You can delete a column using the DELETE command. Here you will delete the title column from the row inserted previously:

cqlsh:my_keyspace> DELETE title FROM user WHERE
  first_name='Bill' AND last_name='Nguyen';

You can perform this delete because the title column is not part of the primary key. To make sure that the value has been removed, you can query again:

cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill'
  AND last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |  null

(1 rows)

Now you’ll clean up after yourself by deleting the entire row. It’s the same command, but you don’t specify a column name:

cqlsh:my_keyspace> DELETE FROM USER WHERE first_name='Bill'
  AND last_name='Nguyen';

To make sure that it’s removed, you can query again:

cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill'
  AND last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------

(0 rows)

If you really want to clean things up, you can remove all data from the table using the TRUNCATE command, or even delete the table schema using the DROP TABLE command:

cqlsh:my_keyspace> TRUNCATE user;
cqlsh:my_keyspace> DROP TABLE user;

cqlsh Command History

Now that you’ve been using cqlsh for a while, you may have noticed that you can navigate through commands you’ve executed previously with the up and down arrow keys. This history is stored in a file called cqlsh_history, which is located in a hidden directory called .cassandra within your home directory. This acts like your bash shell history, listing the commands in a plain-text file in the order you entered them. Nice!
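Because the history is a plain-text file, you can inspect it with ordinary shell tools. Here is a small sketch that prints the most recent entries. The function name show_cqlsh_history and the default of 10 lines are arbitrary choices made for illustration.

```shell
# Print the last N commands recorded by cqlsh (N defaults to 10).
# The file location follows the description above: ~/.cassandra/cqlsh_history.
show_cqlsh_history() {
  history_file="$HOME/.cassandra/cqlsh_history"
  if [ -f "$history_file" ]; then
    tail -n "${1:-10}" "$history_file"
  else
    echo "No cqlsh history found at $history_file" >&2
    return 1
  fi
}
```

For example, show_cqlsh_history 5 lists the five most recent commands, or reports that no history file exists yet.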

Running Cassandra in Docker

Over the past few years, containers have become a very popular alternative to full machine virtualization for deployment of applications and supporting infrastructure such as databases.

Given the high popularity of Docker and its image format, the Apache project has begun supporting official Docker images of Cassandra.

If you have a Docker environment installed on your machine, it’s extremely simple to start a Cassandra node. After making sure you’ve stopped any Cassandra node started previously, start a new node in Docker using the following two commands:

$ docker pull cassandra
$ docker run --name my-cassandra cassandra

The first command pulls the Docker image marked with the tag latest from Docker Hub (https://hub.docker.com/_/cassandra/):

Using default tag: latest
latest: Pulling from library/cassandra
9fc222b64b0a: Pull complete
33b9abeacd73: Pull complete
d28230b01bc3: Pull complete
6e755ec31928: Pull complete
b881e4d8c78e: Pull complete
d8b058ab9240: Pull complete
3ddfff7126ed: Pull complete
94de8e3674c4: Pull complete
61d4f90c97c4: Pull complete
a3d009e31ea4: Pull complete
Digest: sha256:0f188d784235e1bedf191361096e6eeab330f9579eac7d2e68e14a5c29f75ad6
Status: Downloaded newer image for cassandra:latest
docker.io/library/cassandra:latest

The second command starts an instance of Cassandra with default options. Note that you could have used the -d option to start the container in the background without printing out the logs.
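As a sketch of the background variant, the two helper functions below start the container detached and then follow its logs only when you ask for them. The function names are hypothetical, and the container name defaults to my-cassandra. Docker will refuse to create a second container with a name that is already in use.

```shell
# Start a Cassandra container in the background (-d) under the given name.
start_cassandra_detached() {
  docker run -d --name "${1:-my-cassandra}" cassandra
}

# Stream the logs of a running container; press Ctrl-C to stop following.
follow_cassandra_logs() {
  docker logs -f "${1:-my-cassandra}"
}
```

Running start_cassandra_detached returns you to the prompt right away, and follow_cassandra_logs lets you watch the node start up whenever you want to check on it.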

You used the --name option to specify a name for the container, which allows you to reference the container by name when using other Docker commands. For example, you can stop the container by using the command:

$ docker stop my-cassandra

If you don’t provide a name for the container, the Docker runtime will assign a randomly selected name such as breezy_ensign. Docker also creates a unique identifier for each container, which the run command prints when you start the container in the background with the -d option. Either the name or ID may be used to reference a specific container in Docker commands.

If you’d like to start an instance of cqlsh, the simplest way is to use the copy inside the container by executing a command in the running container:

$ docker exec -it my-cassandra cqlsh

This will give you a cqlsh prompt, with which you could execute the same commands you’ve practiced in this chapter, or any other commands you’d like.
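You can also run a single CQL statement without opening an interactive prompt by passing it to cqlsh with the -e option. The helper below is a minimal sketch, and its name run_cql_in_container is hypothetical.

```shell
# Run one CQL statement inside a running Cassandra container.
# $1 is the container name, $2 is the CQL statement to execute.
run_cql_in_container() {
  docker exec "$1" cqlsh -e "$2"
}
```

For example, run_cql_in_container my-cassandra "DESCRIBE KEYSPACES;" prints the keyspaces defined on the node and then exits.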

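Up to this point, you’ve only created a single Cassandra node in Docker, which is not accessible from outside Docker’s internal network. In order to access this node from outside Docker for CQL queries, you’ll want to make sure the standard CQL port is published when the container is created. Port mappings cannot be added with docker start, so if the my-cassandra container already exists, remove it first with docker rm my-cassandra, and then create it again with the -p option placed before the image name:

$ docker run --name my-cassandra -p 9042:9042 cassandra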

There are several other configuration options available for running Cassandra in Docker, documented on the Docker Hub page referenced earlier. One exercise you may find interesting is to launch multiple nodes in Docker to create a small cluster.
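As a starting point for that exercise, here is a sketch of a two-node cluster based on the CASSANDRA_SEEDS environment variable described on the Docker Hub page. The network and container names (cassandra-net, cassandra-node1, cassandra-node2) and the function name are hypothetical. The sketch does not wait for the first node to finish starting, so on a slow machine you may prefer to run the commands one at a time.

```shell
# Create a user-defined network so the containers can reach each other by name,
# start a first node, then start a second node that uses the first as its seed.
start_two_node_cluster() {
  docker network create cassandra-net &&
    docker run -d --name cassandra-node1 --network cassandra-net cassandra &&
    docker run -d --name cassandra-node2 --network cassandra-net \
      -e CASSANDRA_SEEDS=cassandra-node1 cassandra
}
```

Once both containers are running, docker exec cassandra-node1 nodetool status should eventually list two nodes. Two Cassandra nodes on one machine can need several gigabytes of memory, so the second node may take a while to join.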

Summary

Now you should have a Cassandra installation up and running. You’ve worked with the cqlsh client to insert and retrieve some data, and you’re ready to take a step back and get the big picture on Cassandra before really diving into the details.