Bytes.toBytes("val1"));

put.add(Bytes.toBytes("cf2"), Bytes.toBytes("col2"),

Bytes.toBytes("val2"));

table.put(put);

Now let's discuss the checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) method. For example, let's put a new value val2 in row1, column family cf1, column col1 using a Put instance if row1, column family cf1, column col1 currently has the value val1:

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "table1");

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("val2"));
boolean bool = table.checkAndPut(Bytes.toBytes("row1"), Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("val1"), put);
System.out.println("New value put: " + bool);

If the value passed in the value argument is null, the check is for the non-existence of the column, implying that the new value is put only if the column does not already have a value.
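For example, the following is a minimal sketch (reusing the table instance from above) that puts val2 into row1, column family cf1, column col1 only if that column does not already exist:

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("val2"));
// A null value argument makes checkAndPut verify that cf1:col1 is absent in row1;
// the Put is applied only if the column does not exist.
boolean applied = table.checkAndPut(Bytes.toBytes("row1"), Bytes.toBytes("cf1"), Bytes.toBytes("col1"), null, put);
System.out.println("New value put: " + applied);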


Summary

In this chapter, I discussed the salient methods in the HTable class. In the next chapter, I will discuss using the HBase shell.

Using the HBase Shell

All names in the Apache HBase shell, such as table names, row keys, and column names, should be quoted. Constants do not need to be quoted. Successful HBase shell commands return an exit code of 0. However, a non-0 exit status does not necessarily indicate failure; it could indicate other issues, such as loss of connectivity.
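For example, in the following get command (covered later in this chapter) the table name, row key, and column name are quoted, while the COLUMN and VERSIONS constants and the numeric value are not:

get 't1', 'r1', {COLUMN => 'cf1:c1', VERSIONS => 3}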

The HBase shell is started with the following command if the HBase bin directory is in the PATH environment variable:

hbase shell

HBase shell commands may also be run from a .txt script file. For example, if the commands are stored in the file shell_commands.txt, use the following command to run the commands in the script:

hbase shell shell_commands.txt
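For example, a hypothetical shell_commands.txt might contain a short sequence of commands such as the following:

list
create 't2', 'cf1'
put 't2', 'r1', 'cf1:c1', 'val1'
scan 't2'
exit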

Creating a Table

A table is created using the create command. The syntax of the create command is as follows:

create '/path/tablename', {NAME => 'cfname'}, {NAME => 'cfname'}

The first arg to the command is the table name, followed by a dictionary of specifications per column family, followed optionally by a dictionary of table configuration.
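For example, the table configuration may supply split points so that the table is pre-split into regions; the following is a sketch, and the split keys are assumptions:

create 't1', 'f1', SPLITS => ['r10', 'r20', 'r30']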

The following command creates a table called t1 with column families f1, f2, and f3:

create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}

The short form of the preceding command is as follows:

create 't1', 'f1', 'f2', 'f3'



The following command creates a table called t1 with column families f1, f2, and f3. The maximum number of versions for each column family is also specified.

create 't1', {NAME => 'f1', VERSIONS => 1}, {NAME => 'f2', VERSIONS => 3}, {NAME => 'f3', VERSIONS => 5}

The following command also sets the BLOCKCACHE option, which specifies whether to use the block cache, to true for the f1 column family:

create 't1', {NAME => 'f1', VERSIONS => 3, BLOCKCACHE => true}

Altering a Table

The alter command is used to alter the column family schema. The args provide the table name and the new column family specification. The following command alters column families f1 and f2: the number of versions in f1 is set to 2, and the number of versions in f2 is set to 3.

alter 't1', {NAME => 'f1', VERSIONS => 2}, {NAME => 'f2', VERSIONS => 3}

The following command deletes the column families f1 and f2 from table t1:

alter 't1', {NAME => 'f1', METHOD => 'delete'}, {NAME => 'f2', METHOD => 'delete'}

Table scope attributes such as those in Table 19-1 may also be set.

Table 19-1. Table Scope Attributes

MAX_FILESIZE          Maximum size of the store file, after which a region split occurs
DEFERRED_LOG_FLUSH    Indicates whether the deferred log flush option is enabled
MEMSTORE_FLUSHSIZE    Maximum size of the MemStore, after which the MemStore is flushed to disk

The following command sets the maximum file size to 256MB:

alter 't1', {METHOD => 'table_att', MAX_FILESIZE => '268435456'}
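Similarly, the MEMSTORE_FLUSHSIZE attribute from Table 19-1 could be set; the following sketch sets the MemStore flush size to 128MB (the value is an assumption):

alter 't1', {METHOD => 'table_att', MEMSTORE_FLUSHSIZE => '134217728'}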


Adding Table Data

The put command is used to add data. The syntax of the put command is as follows:

put '/path/tablename', 'rowkey', 'cfname:colname', 'value', 'timestamp'

The timestamp is optional. The following command puts a value at coordinates {t1, r1, c1}, that is, table t1, row r1, and column c1, with timestamp ts1:

put 't1', 'r1', 'c1', 'value', 'ts1'

The column may be specified using a column family name and column qualifier. For example, the following command puts value val into column cf1:c1 in the row with row key r1 in table t1 with timestamp ts1:

put 't1', 'r1', 'cf1:c1', 'val', 'ts1'

Describing a Table

The syntax of the describe command is as follows:

describe '/path/table'

To describe a table t1, run the following command:

describe 't1'

Finding If a Table Exists

To find whether table t1 exists, run the exists command:

exists 't1'

Listing Tables

The list command lists all the tables.

list


Scanning a Table

The scan command is used to scan a table. The syntax of the scan command is as follows:

scan '/path/tablename'

By default, the complete table is scanned. The following command scans the complete table t1:

scan 't1'

The output of the scan command has the following format:

ROW                COLUMN+CELL

row1 column=cf1:c1, timestamp=12345, value=val1

row1 column=cf1:c2, timestamp=34567, value=val2

row1 column=cf1:c3, timestamp=678910, value=val3

The following command scans table t1. The COLUMNS option specifies that only columns c2 and c5 are scanned. The LIMIT option limits the scan to five rows, and the STARTROW option sets the start row from which the table is to be scanned. The STOPROW option sets the row at which the scan stops. An HBase table is sorted lexicographically by row key, and the STARTROW and STOPROW values do not have to exist in the table. If the STARTROW does not exist, the table is scanned lexicographically starting from the first row after the STARTROW. Similarly, if the STOPROW does not exist, the table is scanned up to the last row before the STOPROW.

scan 't1', {COLUMNS => ['c2', 'c5'], LIMIT => 5, STARTROW => 'row1234', STOPROW => 'row78910'}

The following command scans table t1 starting with row key c and stopping with row key n:

scan 't1', { STARTROW => 'c', STOPROW => 'n'}

The following command scans a table based on a time range:

scan 't1', {TIMERANGE => [123456, 124567]}

A filter such as the ColumnPaginationFilter may be set using the FILTER option. The ColumnPaginationFilter takes two arguments, limit and offset, both of type int.

scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(5, 1)}


Block caching, which is enabled by default, may be disabled using the CACHE_BLOCKS option.

scan 't1', {COLUMNS => ['c2', 'c3'], CACHE_BLOCKS => false}

Enabling and Disabling a Table

The enable command enables a table, and the disable command disables a table.

enable 't1'

disable 't1'

Dropping a Table

The drop command drops a table. Before a table is dropped, the table must be disabled.

disable 't1'

drop 't1'

Counting the Number of Rows in a Table

To count the number of rows in table t1, run the following command. By default, the row count is shown every 1,000 rows.

count 't1'

The row count interval may be set to a non-default value as follows:

count 't1', 10000

Getting Table Data

The get command is used to get table row data. The syntax of the get command is as follows:

get '/path/tablename', 'rowkey'

Either a row of data or individual cell data may be accessed. A row with row key r1 from table t1 is accessed as follows:

get 't1', 'r1'


The output has the following format:

COLUMN CELL

cf1:c1 timestamp=123, value=val1

cf1:c2 timestamp=234, value=val2

cf1:c3 timestamp=123, value=val3

Cell data for a single column c1 is accessed as follows:

get 't1', 'r1', {COLUMN => 'c1'}

The COLUMN specification may include the column family and the column qualifier.

get 't1', 'r1', {COLUMN => 'cf1:c1'}

Or just the column family may be specified.

get 't1', 'r1', {COLUMN => 'cf1'}

Cell data for a single column like c1 with a particular timestamp is accessed as follows:

get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}

Cell data for multiple columns c1, c3, and c5 from column family cf1 is accessed as follows:

get 't1', 'r1', {COLUMN => ['cf1:c1', 'cf1:c3', 'cf1:c5']}

A get is essentially a scan limited to one row.

Truncating a Table

The truncate command disables, drops, and recreates a table.

truncate 't1'

Deleting Table Data

The delete command may be used to delete a row, a column family, or a specific cell in a row in a table. The syntax for the delete command is as follows:

delete 'tablename', 'rowkey', 'columnfamily:columnqualifier'


The following command deletes row r1 from table t1:

delete 't1', 'r1'

The following command deletes column family cf1 from row r1 in table t1:

delete 't1', 'r1', 'cf1'

The following command deletes the cell for column c1 in column family cf1 in row r1 in table t1:

delete 't1', 'r1', 'cf1:c1'

The deleteall command deletes all cells in a given row. For example, the following command deletes row r1 in table t1:

deleteall 't1', 'r1'

Optionally, a column and a timestamp may be specified. The following command deletes column c1 in column family cf1 from row key r1 in table t1 with timestamp ts1:

deleteall 't1', 'r1', 'cf1:c1', ts1

Summary

In this chapter, I discussed the salient HBase shell commands. In the next chapter, I will discuss bulk loading data into HBase.

Bulk Loading Data

The org.apache.hadoop.hbase.mapreduce.ImportTsv utility and the completebulkload tool are used to bulk load data into HBase. The procedure to upload is as follows:

1. Put the data file, which is a TSV file, to be uploaded into HDFS (see the sketch after this list).
2. Run the ImportTsv utility to generate multiple HFiles from the TSV file.
3. Run the completebulkload tool to bulk load the HFiles into HBase.
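For example, step 1 could be performed with the Hadoop filesystem shell; the following is a sketch, assuming input.tsv is in the local working directory and using the same HDFS path notation as the importtsv command later in this chapter:

hadoop fs -put input.tsv hdfs://input.tsv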

Let’s discuss an example. You can use the following sample data file input.tsv in HDFS:

r1 c1 c2 c3

r2 c1 c2 c3

r3 c1 c2 c3

r4 c1 c2 c3

r5 c1 c2 c3

r6 c1 c2 c3

r7 c1 c2 c3

r8 c1 c2 c3

r9 c1 c2 c3

r10 c1 c2 c3

Run the following importtsv command to generate HFiles from the input file input.tsv:

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:c1,cf1:c2,cf1:c3 -Dimporttsv.bulk.output=hdfs://output t1 hdfs://input.tsv



The command without the classpath settings is as follows:

hadoop jar hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:c1,cf1:c2,cf1:c3 -Dimporttsv.bulk.output=hdfs://output t1 hdfs://input.tsv

The -Dimporttsv.columns option specifies the column names of the TSV data. Comma-separated column names are to be provided, with each column name given as either a column family or a columnfamily:columnqualifier. A column name must be included for each column of data in the input TSV file. The HBASE_ROW_KEY column specifies the row key, and only one may be specified. The -Dimporttsv.bulk.output=/path/for/output option specifies that HFiles are to be generated at the given output path, which should be either a relative path on HDFS or an absolute HDFS path, as in the example hdfs://output. The target table specified is t1; if a target table is not specified, a table is created with the default column family descriptors. The input file on HDFS is specified as hdfs://input.tsv.

The -Dimporttsv.bulk.output option must be specified to generate HFiles for bulk upload; without it, the input data is uploaded directly into an HBase table. The other options that may be specified are listed in Table 20-1.

Table 20-1. The importtsv Command Options

-Dimporttsv.skip.bad.lines    Whether to skip invalid data lines. Default is false.
-Dimporttsv.separator         Input data separator. Default is tab.
-Dimporttsv.timestamp         The timestamp to use. Default is currentTimeAsLong.
-Dimporttsv.mapper.class      User-defined mapper to use instead of the default, org.apache.hadoop.hbase.mapreduce.TsvImporterMapper.

When importtsv is run, HFiles are generated at the specified output path.
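For example, a hypothetical run that imports a comma-separated input file, skips invalid lines, and applies a fixed timestamp could combine several of these options as follows (the input path and timestamp value are assumptions):

hadoop jar hbase-VERSION.jar importtsv -Dimporttsv.separator=, -Dimporttsv.skip.bad.lines=true -Dimporttsv.timestamp=1469577600000 -Dimporttsv.columns=HBASE_ROW_KEY,cf1:c1,cf1:c2,cf1:c3 -Dimporttsv.bulk.output=hdfs://output t1 hdfs://input.csv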

Next, use the completebulkload utility to bulk upload the HFiles into an HBase table. The completebulkload tool may be run in one of two modes: as an explicit class name or via the driver. Using the class name, the syntax is as follows:

bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://output> <tablename>


With the driver, the syntax is as follows:

hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] <hdfs://output> <tablename>

To load the HFile generated in hdfs://output from importtsv, run the following command:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://output t1

Bulk loading bypasses the write path completely: it does not use the WAL and does not cause a split storm. A custom MapReduce job may also be used for bulk loading, as sketched below.
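The following is a minimal sketch of such a custom job, not the book's code; it assumes the HBase 1.x client API and a hypothetical mapper class TsvToPutMapper that emits ImmutableBytesWritable keys and Put values:

Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "custom-bulk-load");
job.setJarByClass(TsvToPutMapper.class);
job.setMapperClass(TsvToPutMapper.class);               // hypothetical mapper emitting (ImmutableBytesWritable, Put)
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
FileInputFormat.addInputPath(job, new Path("hdfs://input.tsv"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://output"));

Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("t1"));
RegionLocator regionLocator = connection.getRegionLocator(TableName.valueOf("t1"));
// Configures the partitioner and reducer so that the generated HFiles
// align with the region boundaries of table t1
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
job.waitForCompletion(true);
// The HFiles in hdfs://output are then loaded with completebulkload, as shown above.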

Summary

In this chapter, I discussed bulk loading data into HBase. This chapter concludes the book. The book is a primer on the core concepts of Apache HBase: the HBase data model, schema design and architecture, the HBase API, and administration. For usage of HBase, including installation, configuration, and creating an HBase table, please refer to another book from Apress, Practical Hadoop Ecosystem.
