For those among us who like instant gratification, let’s start by installing Cassandra. Because Cassandra introduces a lot of new vocabulary, there might be some unfamiliar terms as you walk through this. That’s OK; the idea here is to get set up quickly in a simple configuration to make sure everything is running properly. This will serve as an orientation. Then, we’ll take a step back and explain Cassandra in its larger context.
While there are a number of options available for installing Cassandra on various operating systems, let’s start your journey by downloading the Apache distribution from http://cassandra.apache.org so you can get a good look at what’s inside. We’ll explore other installation options in “Other Cassandra Distributions”.
Click the link on the Cassandra home page to download a version as a gzipped archive. Typically, multiple versions of Cassandra are provided. The latest version is the current recommended version for use in production. There are other supported releases that are still viable for production usage and receive bug fixes. The project goal is to limit the number of supported releases, but reasonable accommodations are made. For example, the 2.2 and 2.1 releases were considered to be officially maintained through the release of 4.0. For all releases, the prebuilt binary is named apache-cassandra-x.x.x-bin.tar.gz, where x.x.x represents the version number. The download for Cassandra 4.0 is around 40 MB.
You can unpack the compressed file using any regular ZIP utility. On Unix-based systems such as Linux or macOS, gzip extraction utilities should be preinstalled; on Windows, you’ll need to get a program such as WinZip, which is commercial, or something like 7-Zip, which is freeware.
Open your extracting program. You might have to extract the ZIP file and the TAR file in separate steps. Once you have a folder on your filesystem called apache-cassandra-x.x.x, you’re ready to run Cassandra.
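On Linux or macOS, the extraction described above can usually be done in one step from a terminal. The following is a minimal sketch rather than an official procedure: the function name extract_cassandra is our own (hypothetical), and it assumes a tar that understands the -z flag for gzip, as GNU and BSD tar do.

```shell
# Hypothetical helper: unpack a gzipped tarball into a target directory.
# Usage: extract_cassandra <archive.tar.gz> [target-dir]
extract_cassandra() {
    archive="$1"
    target="${2:-.}"          # default to the current directory
    mkdir -p "$target" || return 1
    # -x extract, -z decompress with gzip, -f read from the named file,
    # -C change into the target directory before extracting
    tar -xzf "$archive" -C "$target"
}
```

For example, extract_cassandra apache-cassandra-x.x.x-bin.tar.gz "$HOME" would leave the apache-cassandra-x.x.x folder under your home directory.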
Once you decompress the tarball, you’ll see that the Cassandra binary distribution includes several files and directories.
The files include the NEWS.txt file, which includes the release notes describing features included in the current and prior releases, and the CHANGES.txt file, which is similar but focuses on bug fixes. You’ll want to make sure to review these files whenever you are upgrading to a new version so you know what changes to expect. The LICENSE.txt and NOTICE.txt files contain the Apache 2.0 license used by Cassandra, and copyright notices for Cassandra and included software, respectively.
Let’s take a moment now to look around in the different directories and see what’s there:
The bin directory contains the executables to run Cassandra as well as clients, including the query language shell (cqlsh). It also has scripts to run nodetool, which is a utility for inspecting a cluster to determine whether it is properly configured, and for performing a variety of maintenance operations. We look at nodetool in depth later. The directory also contains several utilities for performing operations on SSTables, the files in which Cassandra stores its data on disk. We’ll discuss these utilities in Chapter 12.
The conf directory contains the files for configuring your Cassandra instance. The configuration files you may use most frequently include the cassandra.yaml file, which is the primary configuration for running Cassandra, and the logback.xml file, which lets you change the logging settings to suit your needs. Additional files can be used to configure Java Virtual Machine (JVM) settings, the network topology, metrics reporting, archival and restore commands, and triggers. You’ll learn how to use these configuration files in Chapter 10.
Traditionally, documentation has been one of the weaker areas of the project, but a concerted effort for the 4.0 release, including sponsorship from the Google Season of Docs project, yielded significant progress to the documentation included in the Cassandra distribution as well as the documentation on the Cassandra website. The documentation includes a getting started guide, an architectural overview, and instructions for configuring and operating Cassandra.
The javadoc directory contains a documentation website generated using Java’s JavaDoc tool. Note that JavaDoc reflects only the comments that are stored directly in the Java code, and as such does not represent comprehensive documentation. It’s helpful if you want to see how the code is laid out. Moreover, Cassandra is a wonderful project, but the code contains relatively few comments, so you might find the JavaDoc’s usefulness limited. It may be more fruitful to simply read the source files directly if you’re familiar with Java. Nonetheless, to read the JavaDoc, open the javadoc/index.html file in a browser.
The lib directory contains all of the external libraries that Cassandra needs to run. For example, it uses two different JSON serialization libraries, the Google collections project, and several Apache Commons libraries.
The pylib directory contains Python libraries that are used by cqlsh.
The tools directory contains tools that are used to maintain your Cassandra nodes. You’ll learn about these tools in Chapter 12.
Cassandra uses Apache Ant for its build scripting language and Maven for dependency management.
You can download Ant from http://ant.apache.org. You don’t need to download Maven separately just to build Cassandra.
Building from source requires a complete Java 8 JDK (or later version), not just the Java Runtime Environment (JRE). If you see a message about how Ant is missing tools.jar, either you don’t have the full JDK or you’re pointing to the wrong path in your environment variables. Maven downloads files from the internet, so if your connection is invalid or Maven cannot determine the proxy, the build will fail.
If you want to download the latest Cassandra builds or view test results, you can find these in Jenkins, which the Cassandra project uses as its continuous integration tool. See this Jenkins page for the latest builds and test coverage information.
If you are interested in having a look at the Cassandra source, you can get the trunk version of the Cassandra source using this command:
$ git clone https://github.com/apache/cassandra.git
Because Maven takes care of all the dependencies, it’s easy to build Cassandra once you have the source. Just make sure you’re in the root directory of your source download and execute the ant program, which will look for a file called build.xml in the current directory and execute the default build target. Ant and Maven take care of the rest. To execute the Ant program and start compiling the source, just type:
$ ant
That’s it. Maven will retrieve all of the necessary dependencies, and Ant will build the hundreds of source files and execute the tests. If all went well, you should see a BUILD SUCCESSFUL message. If all did not go well, make sure that your path settings are all correct, that you have the most recent versions of the required programs, and that you downloaded a stable Cassandra build. You can check the Jenkins report to make sure that the source you downloaded actually can compile.
If you want to see detailed information on what is happening during the build, you can pass Ant the -v option to cause it to output verbose details regarding each operation it performs.
To compile the server, you can simply execute ant, as shown previously. This command executes the default target, jar. This target will perform a complete build, including unit tests, and output a file into the build directory called apache-cassandra-x.x.x.jar.
If you want to see a list of all of the targets supported by the build file, simply pass Ant the -p option to get a description of each target. Here are a few others you might be interested in:
Users will probably find the test target the most helpful, as it executes the battery of unit tests. You can also check out the unit test sources themselves for some useful examples of how to interact with Cassandra.
The stress-build target builds the Cassandra stress tool, which you will learn to use in Chapter 13.
The clean target removes locally created artifacts such as generated source files and classes and unit test results. The related target realclean performs a clean and additionally removes the Cassandra distribution JAR files and JAR files downloaded by Maven.
The Cassandra developers have done a terrific job of making it very easy for new users to start using Cassandra immediately, as you can start a single node without making any changes to the default configuration. We’ll note some of the available configuration options in Chapter 10.
Cassandra versions from 3.0 onward require a Java 8 JVM or later, preferably the latest stable version. It has been tested on both the OpenJDK and Oracle’s JDK. Cassandra 4.0 has been compiled and tested against both Java 8 and Java 11. You can check your installed Java version by opening a command prompt and executing java -version. If you need a JDK, you can get one at this Java SE Downloads page or the jdk.java.net page.
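If you would rather check the Java version from a script than read it by eye, the version string can be reduced to a major version number. This is only a sketch: the function name java_major_version is our own invention, and it assumes version strings of the forms commonly printed by java -version, such as 1.8.0_25 for Java 8 or 11.0.2 for Java 11.

```shell
# Hypothetical helper: reduce a Java version string to its major version.
# Usage: java_major_version <version-string>
java_major_version() {
    version="$1"
    case "$version" in
        1.*) version="${version#1.}" ;;   # Java 8 and earlier report "1.x"
    esac
    # Keep only what precedes the first ".", "_", "+", or "-"
    echo "${version%%[._+-]*}"
}

# To feed it the installed version (java -version writes to stderr):
#   v=$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
#   java_major_version "$v"
```

A result of 8 or 11 matches the versions against which Cassandra 4.0 was compiled and tested.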
Once you have the binary (or the source downloaded and compiled), you’re ready to start the database server.
Setting the JAVA_HOME environment variable is recommended. To do this on a Windows system, click the Start button and then right-click Computer. Click Advanced System Settings, and then click the Environment Variables… button. Click New… to create a new system variable. In the Variable Name field, type JAVA_HOME. In the Variable Value field, type the path to your Java installation. This is probably something like C:\Program Files\Java\jre1.8.0_25 or /usr/java/jre1.8.0_.
Once you’ve started the server for the first time, Cassandra will add directories to your system to store its datafiles. The default configuration creates directories under the CASSANDRA_HOME directory:
The data directory is where Cassandra stores its data. By default, there are subdirectories under the data directory, corresponding to the various datafiles Cassandra uses: commitlog, data, hints, and saved_caches. We’ll explore the significance of each of these datafiles in Chapter 6. If you’ve been trying different versions of the database and aren’t worried about losing data, you can delete these directories and restart the server as a last resort.
The logs directory is where Cassandra stores its logs in a file called system.log. If you encounter any difficulties, consult the log to see what might have happened.
The datafile locations are configurable in the cassandra.yaml file, located in the conf directory. The properties are called data_file_directories, commitlog_directory, hints_directory, and saved_caches_directory. We’ll discuss the recommended configuration of these directories in Chapter 10.
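Before changing any of these locations, it can help to see what your copy of the file currently says. The sketch below prints the relevant settings; the function name show_storage_dirs is our own (hypothetical), and it assumes a grep that supports the -E and -A options, as GNU and BSD grep do. Note that the commit log property is spelled commitlog_directory in the cassandra.yaml file shipped with the distribution, and that the shipped file may leave some or all of these settings commented out to indicate that the defaults apply.

```shell
# Hypothetical helper: print the storage-location settings from a cassandra.yaml.
# Usage: show_storage_dirs [path-to-cassandra.yaml]
show_storage_dirs() {
    config="${1:-conf/cassandra.yaml}"
    # '^#? *' also matches settings that are commented out with "# ".
    # -A 1 prints the line after each match as well, which picks up the
    # "- /path" list entry that follows data_file_directories.
    grep -E -A 1 \
        '^#? *(data_file_directories|commitlog_directory|hints_directory|saved_caches_directory):' \
        "$config"
}
```

Run from the Cassandra home directory with no arguments, this reads conf/cassandra.yaml.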
Many users on Unix-based systems prefer to use the /var/lib directory for data storage. If you are changing this configuration, you will need to edit the conf/cassandra.yaml file and create the referenced directories for Cassandra to store its data, making sure to configure write permissions for the user that will be running Cassandra:
$ sudo mkdir -p /var/lib/cassandra
$ sudo chown -R username /var/lib/cassandra
Instead of username, substitute your own username, of course.
To start the Cassandra server on any OS, open a command prompt or terminal window, navigate to the <cassandra-directory>/bin directory where you unpacked Cassandra, and run the command cassandra -f to start your server.
Using the -f switch tells Cassandra to stay in the foreground instead of running as a background process, so that all of the server logs will print to standard out (stdout in Unix systems) and you can see them in your terminal window, which is useful for testing. In either case, the logs will append to the system.log file.
In a clean installation, you should see quite a few log statements as the server gets running. The exact syntax of logging statements will vary depending on the release you’re using, but there are a few highlights you can look for. If you search for cassandra.yaml, you’ll quickly run into the following:
INFO [main] 2019-08-25 17:42:11,712 YamlConfigurationLoader.java:89 - Configuration location: file:/Users/jeffreycarpenter/cassandra/conf/cassandra.yaml
INFO [main] 2019-08-25 17:42:11,855 Config.java:598 - Node configuration:[ allocate_tokens_for_keyspace=null; ...
These log statements indicate the location of the cassandra.yaml file containing the configured settings. The Node configuration statement lists out the settings read from the config file.
Now search for JVM and you’ll find something like this:
INFO [main] 2019-08-25 17:42:12,308 CassandraDaemon.java:487 - JVM vendor/version: OpenJDK 64-Bit Server VM/12.0.1
INFO [main] 2019-08-25 17:42:12,309 CassandraDaemon.java:488 - Heap size: 3.900GiB/3.900GiB
These log statements provide information describing the JVM being used, including memory settings.
Next, search for the versions in use: Cassandra version, CQL version, and Native protocol supported versions:
INFO [main] 2019-08-25 17:42:17,847 StorageService.java:610 - Cassandra version: 4.0-alpha3
INFO [main] 2019-08-25 17:42:17,848 StorageService.java:611 - CQL version: 3.4.5
INFO [main] 2019-08-25 17:42:17,848 StorageService.java:612 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
You can also find statements where Cassandra is initializing internal data structures, such as caches:
INFO [main] 2015-12-08 06:02:43,633 CacheService.java:115 - Initializing key cache with capacity of 24 MBs.
INFO [main] 2015-12-08 06:02:43,679 CacheService.java:137 - Initializing row cache with capacity of 0 MBs
INFO [main] 2015-12-08 06:02:43,686 CacheService.java:166 - Initializing counter cache with capacity of 12 MBs
If you search for terms like JMX, gossip, and listening, you’ll find statements like the following:
WARN [main] 2019-08-25 17:42:12,363 StartupChecks.java:168 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
INFO [main] 2019-08-25 17:42:18,354 StorageService.java:814 - Starting up server gossip
INFO [main] 2019-08-25 17:42:18,070 InboundConnectionInitiator.java:130 - Listening on address: (127.0.0.1:7000), nic: lo0, encryption: enabled (openssl)
These log statements indicate the server is beginning to initiate communications with other servers in the cluster and expose publicly available interfaces. By default, the management interface via the Java Management Extensions (JMX) is disabled for remote access. We’ll explore the management interface in Chapter 11.
Finally, search for state jump and you’ll see the following:
INFO [main] 2019-08-25 17:42:18,581 StorageService.java:1507 - JOINING: Finish joining ring
INFO [main] 2019-08-25 17:42:18,591 StorageService.java:2508 - Node 127.0.0.1:7000 state jump to NORMAL
Congratulations! Now your Cassandra server should be up and running with a new single-node cluster called “Test Cluster,” ready to interact with other nodes and clients. If you continue to monitor the output, you’ll begin to see periodic output such as memtable flushing and compaction, which you’ll learn about soon.
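Because the same statements are appended to system.log, you can pull out the highlights described above without scrolling through the terminal. The following is a rough sketch: the function names startup_highlights and node_is_normal are our own (hypothetical), the default path assumes you run them from the Cassandra home directory, and the search phrases are taken from the sample output shown here, so they may differ in other releases.

```shell
# Hypothetical helper: print the startup highlights from a Cassandra log.
# Usage: startup_highlights [path-to-system.log]
startup_highlights() {
    logfile="${1:-logs/system.log}"
    grep -E 'Configuration location|JVM vendor/version|Cassandra version|CQL version|state jump' "$logfile"
}

# Hypothetical helper: succeed only if the log reports the node reaching NORMAL.
# Usage: node_is_normal [path-to-system.log]
node_is_normal() {
    logfile="${1:-logs/system.log}"
    grep -q 'state jump to NORMAL' "$logfile"
}
```

For instance, node_is_normal && echo "node is up" prints a confirmation once the state jump to NORMAL has been logged.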
The committers work hard to ensure that data is readable from one minor dot release to the next and from one major version to the next. The commit log, however, needs to be completely cleared out from version to version (even minor versions).
If you have any previous versions of Cassandra installed, you may want to clear out the data directories for now, just to get up and running. If you’ve messed up your Cassandra installation and want to get started cleanly again, you can delete the data folders.
Now that you’ve successfully started a Cassandra server, you may be wondering how to stop it. You may have noticed the stop-server command in the bin directory. Let’s try running that command. Here’s what you’ll see on Unix systems:
$ ./stop-server
please read the stop-server script before use
So you see that the server has not been stopped, but instead you are directed to read the script. Taking a look inside with your favorite code editor, you’ll learn that the way to stop Cassandra is to kill the JVM process that is running Cassandra. The file suggests a couple of different techniques by which you can identify the JVM process and kill it.
The first technique is to start Cassandra using the -p option, which provides Cassandra with the name of a file to which it should write the process identifier (PID) upon starting up. This is arguably the most straightforward approach to making sure you kill the right process.
However, because you did not start Cassandra with the -p option, you’ll need to find the process yourself and kill it. The script suggests using pgrep to locate processes for the current user containing the term “cassandra”:
user=`whoami`
pgrep -u $user -f cassandra | xargs kill -9
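For the first technique, a small wrapper can read the PID file and stop the process it names. This is a sketch under assumptions, not part of the Cassandra distribution: the function name stop_by_pidfile and the path /tmp/cassandra.pid mentioned below are hypothetical, and the function sends the default SIGTERM rather than the kill -9 used in the pgrep example, so that the JVM has a chance to shut down cleanly.

```shell
# Hypothetical helper: stop the process whose PID is recorded in a file.
# Usage: stop_by_pidfile <pidfile>
stop_by_pidfile() {
    pidfile="$1"
    if [ ! -f "$pidfile" ]; then
        echo "no PID file at $pidfile" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    # Plain kill sends SIGTERM; remove the PID file only if the signal was sent
    kill "$pid" && rm -f "$pidfile"
}
```

If you had started the server with bin/cassandra -p /tmp/cassandra.pid, then stop_by_pidfile /tmp/cassandra.pid would stop that specific process and no other.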
The preceding instructions showed you how to install the Apache distribution of Cassandra. In addition to the Apache distribution, there are a couple of other ways to get Cassandra:
DataStax provides a fully supported version certified for production use. The product line provides an integrated database platform with support for complementary data technologies such as Apache Solr for search, Apache Spark for analytics, Apache TinkerPop for graph, as well as advanced security and other enterprise features. We’ll explore some of these integrations in Chapter 15.
A frequent model for deployment of Cassandra is to package one of the preceding distributions in a virtual machine image. For example, multiple such images are available in the Amazon Web Services (AWS) Marketplace.
It has become increasingly popular to run Cassandra in Docker containers, especially in development environments. We’ll provide some simple instructions for running the Apache distribution in Docker in “Running Cassandra in Docker”.
There are a few providers of Cassandra as a managed service, in which the provider hosts and manages Cassandra clusters on your behalf. These include Instaclustr and Aiven. DataStax offers Apache Cassandra as a service under the name Astra.
We’ll take a deeper look at several options for deploying Cassandra in production environments, including Kubernetes and cloud computing environments, in Chapter 10.
Selecting the right distribution will depend on your deployment environment; your needs for scale, stability, and support; and your development and maintenance budgets. Having both open source and commercial deployment options provides the flexibility to make the right choice for your organization.
Now that you have a Cassandra installation up and running, let’s give it a quick try to make sure everything is set up properly. You’ll use the CQL shell (cqlsh) to connect to your server and have a look around.
If you’ve used Cassandra in releases prior to 3.0, you may also be familiar with the command-line client interface known as cassandra-cli. The CLI was removed in the 3.0 release because it depends on the legacy Thrift API, which was deprecated in 3.0 and removed entirely in 4.0.
To run the shell, create a new terminal window, change to the Cassandra home directory, and type the following command (you should see output similar to that shown here):
$ bin/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 4.0-alpha3 | CQL spec 3.4.5 | Native protocol v4]
Use HELP for help.
Because you did not specify a node to which you wanted to connect, the shell helpfully checks for a node running on the local host, and finds the node you started earlier. The shell also indicates that you’re connected to a Cassandra server cluster called “Test Cluster.” That’s because this cluster of one node at localhost is set up for you by default.
In a production environment, be sure to change the cluster name to something more suitable to your application.
To connect to a specific node, specify the hostname and port on the command line. For example, the following will connect to your local node:
$ bin/cqlsh localhost 9042
The port number can be omitted if the node uses the default value (9042). Another alternative for configuring the cqlsh connection is to set the environment variables $CQLSH_HOST and $CQLSH_PORT. This approach is useful if you will be frequently connecting to a specific node on another host. The environment variables will be overridden if you specify the host and port on the command line.
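As a brief illustration of the environment variable approach, the following lines could go in your shell session or startup file. The hostname is a placeholder of our own (hypothetical); substitute a node you can actually reach.

```shell
# Hypothetical host name; replace with the address of your own node
export CQLSH_HOST=cassandra-node1.example.com
# 9042 is the default port mentioned above
export CQLSH_PORT=9042
# With both variables set, running bin/cqlsh with no arguments should
# connect to that node instead of the local host
```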
Have you run into an error like this while trying to connect to a server?
Exception connecting to localhost/9042. Reason: Connection refused.
If so, make sure that a Cassandra instance is started at that host and port, and that you can ping the host you’re trying to reach. There may be firewall rules preventing you from connecting.
To see a complete list of the command-line options supported by cqlsh, type the command cqlsh --help.
Let’s take a quick tour of cqlsh to learn what kinds of commands you can send to the server. You’ll see how to use the basic environment commands and how to do a round trip of inserting and retrieving some data.
The cqlsh commands are all case insensitive. For the examples in this book, we adopt the convention of uppercase to be consistent with the way the shell describes its own commands in help topics and output.
To get help for cqlsh, type HELP or ? to see the list of available commands:
cqlsh> help
Documented shell commands:
===========================
CAPTURE  CLS          COPY  DESCRIBE  EXPAND  LOGIN   SERIAL  SOURCE   UNICODE
CLEAR    CONSISTENCY  DESC  EXIT      HELP    PAGING  SHOW    TRACING
CQL help topics:
================
AGGREGATES               CREATE_KEYSPACE           DROP_TRIGGER      TEXT
ALTER_KEYSPACE           CREATE_MATERIALIZED_VIEW  DROP_TYPE         TIME
ALTER_MATERIALIZED_VIEW  CREATE_ROLE               DROP_USER         TIMESTAMP
ALTER_TABLE              CREATE_TABLE              FUNCTIONS         TRUNCATE
ALTER_TYPE               CREATE_TRIGGER            GRANT             TYPES
ALTER_USER               CREATE_TYPE               INSERT            UPDATE
APPLY                    CREATE_USER               INSERT_JSON       USE
ASCII                    DATE                      INT               UUID
BATCH                    DELETE                    JSON
BEGIN                    DROP_AGGREGATE            KEYWORDS
BLOB                     DROP_COLUMNFAMILY         LIST_PERMISSIONS
BOOLEAN                  DROP_FUNCTION             LIST_ROLES
COUNTER                  DROP_INDEX                LIST_USERS
CREATE_AGGREGATE         DROP_KEYSPACE             PERMISSIONS
CREATE_COLUMNFAMILY      DROP_MATERIALIZED_VIEW    REVOKE
CREATE_FUNCTION          DROP_ROLE                 SELECT
CREATE_INDEX             DROP_TABLE                SELECT_JSON
You’ll notice that the help topics listed differ slightly from the actual command syntax. The CREATE_TABLE help topic describes how to use the syntax CREATE TABLE …, for example.
To get additional documentation about a particular command, type HELP <command>. Many cqlsh commands may be used with no parameters, in which case they print out the current setting. Examples include CONSISTENCY, EXPAND, and PAGING.
Now that you have connected to your Cassandra instance Test Cluster, to learn more about the cluster you’re working in, type:
cqlsh> DESCRIBE CLUSTER;
Cluster: Test Cluster
Partitioner: Murmur3Partitioner
To see which keyspaces are available in the cluster, issue this command:
cqlsh> DESCRIBE KEYSPACES;
system_traces  system_auth  system_distributed     system_views
system_schema  system       system_virtual_schema
Initially this list will consist of several system keyspaces. Once you have created your own keyspaces, they will be shown as well. The system keyspaces are managed internally by Cassandra, and aren’t for you to put data into. In this way, these keyspaces are similar to the master and tempdb databases in Microsoft SQL Server. Cassandra uses these keyspaces to store the schema, tracing, and security information. We’ll learn more about these keyspaces in Chapter 6.
You can use the following command to learn the client, server, and protocol versions in use:
cqlsh> SHOW VERSION;
[cqlsh 5.0.1 | Cassandra 4.0-alpha3 | CQL spec 3.4.5 | Native protocol v4]
You may have noticed that this version info is printed out when cqlsh starts. There are a variety of other commands with which you can experiment. For now, let’s add some data to the database and get it back out again.
A Cassandra keyspace is sort of like a relational database. It defines one or more tables. When you start cqlsh without specifying a keyspace, the prompt will look like this: cqlsh>, with no keyspace specified.
Now you’ll create your own keyspace so you have something to write data to. In creating your keyspace, there are some required options. To walk through these options, you could use the command HELP CREATE_KEYSPACE, but instead you can use the helpful command-completion features of cqlsh. Type the following and then press the Tab key:
cqlsh> CREATE KEYSPACE my_keyspace WITH
When you press the Tab key, cqlsh begins completing the syntax of your command:
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '
This is informing you that in order to specify a keyspace, you also need to specify a replication strategy. Tab again to see what options you have:
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': '
NetworkTopologyStrategy  OldNetworkTopologyStrategy  SimpleStrategy
Now cqlsh is giving you three strategies to choose from. You’ll learn more about these strategies in Chapter 6. For now, choose the SimpleStrategy by typing the name, and indicate you’re done with a closing quote and Tab again:
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor':
The next option you’re presented with is a replication factor. For the simple strategy, this indicates how many nodes the data in this keyspace will be written to. For a production deployment, you’ll want copies of your data stored on multiple nodes, but because you’re just running a single node at the moment, you’ll ask for a single copy. Specify a value of “1” and a space and Tab again:
cqlsh> CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
You see that cqlsh has now added a closing bracket, indicating you’ve completed all of the required options. Complete the command with a semicolon and return, and your keyspace will be created.
For a production keyspace, you would probably never want to use a value of 1 for the replication factor. There are additional options on creating a keyspace depending on the replication strategy that is chosen. The command completion feature will walk through the different options.
Have a look at your keyspace using the DESCRIBE KEYSPACE command:
cqlsh> DESCRIBE KEYSPACE my_keyspace
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
We see that the keyspace has been created with the SimpleStrategy, a replication_factor of one, and durable writes. Notice that your keyspace is described in much the same syntax that you used to create it, with one additional option that you did not specify: durable_writes = true. Don’t worry about this option now; we’ll return to it in Chapter 6.
After you have created your own keyspace, you can switch to it in the shell by typing:
cqlsh> USE my_keyspace;
cqlsh:my_keyspace>
Notice that the prompt has changed to indicate that we’re using the keyspace.
Now that you have a keyspace, you can create a table in your keyspace. To do this in cqlsh, use the following command:
cqlsh:my_keyspace> CREATE TABLE user (first_name text, last_name text, title text, PRIMARY KEY (last_name, first_name));
This creates a new table called “user” in your current keyspace with three columns to store first and last names and a title, all of type text. The text and varchar types are synonymous and are used to store strings. You’ve specified a primary key for this table consisting of the first_name and last_name and taken the defaults for other table options. You’ll learn more about primary keys and the significance of your choice of primary key in Chapter 4, but for now let’s think of that combination of names as identifying unique rows in your table. The title column is the only one in your table that is not part of the primary key.
You could have also created this table without switching to your keyspace by using the syntax CREATE TABLE my_keyspace.user.
You can use cqlsh to get a description of the table you just created using the DESCRIBE TABLE command:
cqlsh:my_keyspace> DESCRIBE TABLE user;
CREATE TABLE my_keyspace.user (
    first_name text,
    last_name text,
    title text,
    PRIMARY KEY (last_name, first_name)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99p';
You’ll notice that cqlsh prints a nicely formatted version of the CREATE TABLE command that you just typed in but also includes default values for all of the available table options that you did not specify. We’ll describe these settings later. For now, you have enough to get started.
Now that you have a keyspace and a table, you can write some data to the database and read it back out again. It’s OK at this point not to know quite what’s going on. You’ll come to understand Cassandra’s data model in depth later. For now, you have a keyspace (database), which has a table, which holds columns, the atomic unit of data storage.
To write rows, you use the INSERT command:
cqlsh:my_keyspace> INSERT INTO user (first_name, last_name, title) VALUES ('Bill', 'Nguyen', 'Mr.');
Here you have created a new row with three columns, identified by the primary key values Nguyen and Bill, to store a set of related values. The column names are first_name, last_name, and title.
Now that you have written some data, you can read it back using the SELECT command:
cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill' AND last_name='Nguyen';
 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |   Mr.
(1 rows)
In this command, you requested to return rows matching the primary key including all columns. For this query, you specified both of the columns referenced by the primary key. What happens when you only specify one of the values? Let’s find out:
cqlsh:my_keyspace> SELECT * FROM user where last_name = 'Nguyen';
 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |   Mr.
(1 rows)
cqlsh:my_keyspace> SELECT * FROM user where first_name = 'Bill';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
This behavior might not seem intuitive at first, but it has to do with the composition of the primary key you used for this table. This is your first clue that there might be something a bit different about accessing data in Cassandra as compared to what you might be used to in SQL. We’ll explain the significance of your primary key selection and the ALLOW FILTERING option in Chapter 4 and other chapters.
Many new Cassandra users, especially those who are coming from a relational background, will be inclined to use the SELECT COUNT command as a way to ensure data has been written. For example, you could use the following command to verify your write to the user table:
cqlsh:my_keyspace> SELECT COUNT (*) FROM user;
 count
-------
     1
(1 rows)
Warnings :
Aggregation query used without partition key
Note that when you execute this command, cqlsh gives you the correct count of rows, but also gives you a warning. This is because you’ve asked Cassandra to perform a full table scan. In a multi-node cluster with potentially large amounts of data, this COUNT could be a very expensive operation. Throughout the rest of the book, you’ll encounter various ways in which Cassandra tries to warn you or constrain your ability to perform operations that will perform poorly at scale in a distributed architecture.
You can delete a column using the DELETE command. Here you will delete the title column from the row inserted previously:
cqlsh:my_keyspace> DELETE title FROM USER WHERE first_name='Bill' AND last_name='Nguyen';
You can perform this delete because the title column is not part of the primary key. To make sure that the value has been removed, you can query again:
cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill' AND last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------
    Nguyen |       Bill |  null

(1 rows)
Now you’ll clean up after yourself by deleting the entire row. It’s the same command, but you don’t specify a column name:
cqlsh:my_keyspace> DELETE FROM USER WHERE first_name='Bill' AND last_name='Nguyen';
To make sure that it’s removed, you can query again:
cqlsh:my_keyspace> SELECT * FROM user WHERE first_name='Bill' AND last_name='Nguyen';

 last_name | first_name | title
-----------+------------+-------

(0 rows)
If you really want to clean things up, you can remove all data from the table using the TRUNCATE command, or even delete the table schema using the DROP TABLE command:
cqlsh:my_keyspace> TRUNCATE user;
cqlsh:my_keyspace> DROP TABLE user;
Now that you’ve been using cqlsh for a while, you may have noticed that you can navigate through commands you’ve executed previously with the up and down arrow keys. This history is stored in a file called cqlsh_history, which is located in a hidden directory called .cassandra within your home directory. This acts like your bash shell history, listing the commands in a plain-text file in the order you entered them. Nice!
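Because the history is a plain-text file, you can inspect it with ordinary shell tools. The following sketch prints the five most recent entries, assuming the default location described above; the HISTORY_FILE variable name is just for illustration.

```shell
# Default location described in the text; adjust if your setup differs.
HISTORY_FILE="$HOME/.cassandra/cqlsh_history"

if [ -f "$HISTORY_FILE" ]; then
  # Show the five most recent entries.
  tail -n 5 "$HISTORY_FILE"
else
  echo "No history file at $HISTORY_FILE yet"
fi
```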
Over the past few years, containers have become a very popular alternative to full machine virtualization for deployment of applications and supporting infrastructure such as databases.
Given the high popularity of Docker and its image format, the Apache project has begun supporting official Docker images of Cassandra.
If you have a Docker environment installed on your machine, it’s extremely simple to start a Cassandra node. After making sure you’ve stopped any Cassandra node started previously, start a new node in Docker using the following two commands:
$ docker pull cassandra
$ docker run --name my-cassandra cassandra
The first command pulls the Docker image marked with the tag latest from Docker Hub (https://hub.docker.com/_/cassandra/):
Using default tag: latest
latest: Pulling from library/cassandra
9fc222b64b0a: Pull complete
33b9abeacd73: Pull complete
d28230b01bc3: Pull complete
6e755ec31928: Pull complete
b881e4d8c78e: Pull complete
d8b058ab9240: Pull complete
3ddfff7126ed: Pull complete
94de8e3674c4: Pull complete
61d4f90c97c4: Pull complete
a3d009e31ea4: Pull complete
Digest: sha256:0f188d784235e1bedf191361096e6eeab330f9579eac7d2e68e14a5c29f75ad6
Status: Downloaded newer image for cassandra:latest
docker.io/library/cassandra:latest
The second command starts an instance of Cassandra with default options. Note that you could have used the -d option to start the container in the background without printing out the logs.
You used the --name option to specify a name for the container, which allows you to reference the container by name when using other Docker commands. For example, you can stop the container by using the command:
$ docker stop my-cassandra
If you don’t provide a name for the container, the Docker runtime will assign a randomly selected name such as breezy_ensign. Docker also creates a unique identifier for each container, which is returned from the initial run command. Either the name or ID may be used to reference a specific container in Docker commands.
If you’d like to start an instance of cqlsh, the simplest way is to run the copy that ships inside the container:
$ docker exec -it my-cassandra cqlsh
This will give you a cqlsh prompt, with which you could execute the same commands you’ve practiced in this chapter, or any other commands you’d like.
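Up to this point, you’ve only created a single Cassandra node in Docker, which is not accessible from outside Docker’s internal network. In order to access this node from outside Docker for CQL queries, you’ll want to make sure the standard CQL port is published when the node is created. Port mappings are set by docker run, not docker start, so remove the earlier container before creating a new one with the same name:
$ docker rm -f my-cassandra
$ docker run --name my-cassandra -p 9042:9042 cassandra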
There are several other configuration options available for running Cassandra in Docker, documented on the Docker Hub page referenced earlier. One exercise you may find interesting is to launch multiple nodes in Docker to create a small cluster.
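As a starting point for that exercise, here is a rough sketch of a two-node cluster. The network name, container names, and the 60-second pause are assumptions chosen for illustration; CASSANDRA_SEEDS is the environment variable the Docker Hub page describes for pointing a new node at an existing one. The script prints the commands by default so you can review them first; set DRY_RUN=0 to actually execute them, and expect to tune the pause for your machine.

```shell
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

NETWORK="cassandra-net"   # hypothetical network name
SEED="cassandra-1"        # hypothetical name of the first (seed) container

# A user-defined network lets the containers reach each other by name.
run docker network create "$NETWORK"

# First node; publish the CQL port so clients on the host can connect.
run docker run --name "$SEED" --network "$NETWORK" -p 9042:9042 -d cassandra

# Crude wait for the first node to start (an assumption; tune as needed).
run sleep 60

# Second node; CASSANDRA_SEEDS names the node to contact when joining.
run docker run --name cassandra-2 --network "$NETWORK" \
  -e CASSANDRA_SEEDS="$SEED" -d cassandra

# Once both nodes are up, each should appear in the status listing.
run docker exec "$SEED" nodetool status
```

Each Cassandra container needs a fair amount of memory, so a laptop may only handle two or three nodes comfortably.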
Now you should have a Cassandra installation up and running. You’ve worked with the cqlsh client to insert and retrieve some data, and you’re ready to take a step back and get the big picture on Cassandra before really diving into the details.