Bytes.toBytes("val1"));
put.add(Bytes.toBytes("cf2"), Bytes.toBytes("col2"),
Bytes.toBytes("val2"));
table.put(put);
Now let's discuss the checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) method. For example, let's put a new value val2 in row1, column family cf1, column col1 using a Put instance, but only if column col1 in column family cf1 of row1 currently has the value val1:
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "table1");
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"),
Bytes.toBytes("val2"));
boolean bool = table.checkAndPut(Bytes.toBytes("row1"), Bytes.toBytes("cf1"),
    Bytes.toBytes("col1"), Bytes.toBytes("val1"), put);
System.out.println("New value put: " + bool);
If the value passed in the value argument is null, the check is for the lack (non-existence) of the column, implying that the new value is put only if a value does not already exist.
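As a minimal sketch (assuming the same table1 setup as above, not taken from the book), the following puts val1 into cf1:col1 of row1 only if that column does not already exist; passing null as the expected value makes checkAndPut test for the column's absence:

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
// null as the expected value: apply the Put only if cf1:col1 has no value yet
boolean applied = table.checkAndPut(Bytes.toBytes("row1"), Bytes.toBytes("cf1"),
    Bytes.toBytes("col1"), null, put);
System.out.println("New value put: " + applied);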
Summary
In this chapter, I discussed the salient methods in the HTable class. In the next chapter, I will discuss using the HBase shell.
Using the HBase Shell

All names in the Apache HBase shell should be quoted, such as the table name, row key, and column name. Constants don't need to be quoted. Successful HBase commands return an exit status of 0. However, a non-0 exit status does not necessarily indicate failure; it could indicate other issues, such as loss of connectivity.
The HBase shell is started with the following command if the HBase bin directory is in the PATH environment variable:
hbase shell
HBase shell commands may also be run from a .txt script file. For example, if the commands are stored in the file shell_commands.txt , use the following command to run the commands in the script:
hbase shell shell_commands.txt
Creating a Table
A table is created using the create command. The syntax of the create command is as follows:
create '/path/tablename', {NAME =>'cfname'}, {NAME =>'cfname'}

The first argument to the command is the table name, followed by a dictionary of specifications per column family, optionally followed by a dictionary of table configuration.
The following command creates a table called t1 with column families f1, f2, and f3:

create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
The short form of the preceding command is as follows:
create 't1', 'f1', 'f2', 'f3'
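For comparison, the same table can be created from the Java client API used in the earlier chapters. The following is a minimal sketch (not from the book) using HBaseAdmin, HTableDescriptor, and HColumnDescriptor:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("t1"));
// Add the column families f1, f2, and f3
tableDescriptor.addFamily(new HColumnDescriptor("f1"));
tableDescriptor.addFamily(new HColumnDescriptor("f2"));
tableDescriptor.addFamily(new HColumnDescriptor("f3"));
admin.createTable(tableDescriptor);
admin.close();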
The following command creates a table called t1 with column families f1, f2, and f3. The maximum number of versions for each column family is also specified:

create 't1', {NAME => 'f1', VERSIONS => 1}, {NAME => 'f2', VERSIONS => 3}, {NAME => 'f3', VERSIONS => 5}
The following command also sets the block cache option BLOCKCACHE to true for the f1 column family:
create 't1', {NAME => 'f1', VERSIONS => 3, BLOCKCACHE => true}

Altering a Table
The alter command is used to alter the column family schema. The arguments provide the table name and the new column family specification. The following command alters column families f1 and f2. The number of versions in f1 is set to 2 and the number of versions in f2 is set to 3.
alter 't1', {NAME => 'f1', VERSIONS => 2}, {NAME => 'f2', VERSIONS => 3}

The following command deletes the column families f1 and f2 from table t1:

alter 't1', {NAME => 'f1', METHOD => 'delete'}, {NAME => 'f2', METHOD => 'delete'}
Table scope attributes such as those in Table 19-1 may also be set.

Table 19-1. Table Scope Attributes

MAX_FILESIZE        Maximum size of the store file, after which a region split occurs
DEFERRED_LOG_FLUSH  Indicates whether the deferred log flush option is enabled
MEMSTORE_FLUSHSIZE  Maximum size of the MemStore, after which the MemStore is flushed to disk
The following command sets the maximum file size to 256MB:

alter 't1', {METHOD => 'table_att', MAX_FILESIZE => '268435456'}
Adding Table Data
The put command is used to add data. The syntax of the put command is as follows:

put '/path/tablename', 'rowkey', 'cfname:colname', 'value', 'timestamp'

The timestamp is optional. The following command puts a value at coordinates {t1, r1, c1}, that is, table t1, row r1, and column c1, with timestamp ts1:

put 't1', 'r1', 'c1', 'value', 'ts1'
The column may be specified using a column family name and column qualifier. For example, the following command puts value val into column cf1:c1 in row with row key r1 in table t1 with timestamp ts1 :
put 't1', 'r1', 'cf1:c1', 'val', 'ts1'
Describing a Table
The syntax of the describe command is as follows:
describe '/path/table'
To describe a table t1 , run the following command:
describe 't1'
Finding If a Table Exists
To find out if table t1 exists, run the exists command:
exists 't1'
Listing Tables
The list command lists all the tables.
list
Scanning a Table
The scan command is used to scan a table. The syntax of the scan command is as follows:

scan '/path/tablename'
By default, the complete table is scanned. The following command scans the complete table t1 :
scan 't1'
The output of the scan command has the following format:

ROW               COLUMN+CELL
row1 column=cf1:c1, timestamp=12345, value=val1
row1 column=cf1:c2, timestamp=34567, value=val2
row1 column=cf1:c3, timestamp=678910, value=val3
The following command scans table t1. The COLUMNS option specifies that only columns c2 and c5 are scanned. The LIMIT option limits the scan to five rows, and the STARTROW option sets the start row from which the table is to be scanned. The STOPROW option sets the row at which the scan stops. An HBase table is sorted lexicographically, and the STARTROW and STOPROW may not even be in the table. If the STARTROW does not exist, the table is scanned lexicographically from the first row after the STARTROW. Similarly, if the STOPROW does not exist, the table is scanned up to the last row before the STOPROW.
scan 't1', {COLUMNS => ['c2', 'c5'], LIMIT => 5, STARTROW => 'row1234', STOPROW => 'row78910'}
The following command scans table t1 starting with row key c and stopping with row key n :
scan 't1', { STARTROW => 'c', STOPROW => 'n'}
The following command scans a table based on a time range:

scan 't1', {TIMERANGE => [123456, 124567]}
A filter such as the ColumnPaginationFilter may be set using the FILTER option. The ColumnPaginationFilter takes two arguments, limit and offset, both of type int.

scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(5, 1)}
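The same filter can be applied through the Java client API. The following is a minimal sketch (not from the book, assuming the HTable setup from Chapter 18):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "t1");
Scan scan = new Scan();
// ColumnPaginationFilter(limit, offset): return at most 5 columns per row,
// skipping the first column
scan.setFilter(new ColumnPaginationFilter(5, 1));
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
  System.out.println(result);
}
scanner.close();
table.close();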
Block caching, which is enabled by default, may be disabled using the CACHE_BLOCKS option.
scan 't1', {COLUMNS => ['c2', 'c3'], CACHE_BLOCKS => false}

Enabling and Disabling a Table
The enable command enables a table and the disable command disables a table.

enable 't1'
disable 't1'
Dropping a Table
The drop command drops a table. Before a table is dropped, the table must be disabled.

disable 't1'
drop 't1'
Counting the Number of Rows in a Table
To count the number of rows in table t1, run the following command. By default, the row count is shown every 1,000 rows.
count 't1'
The row count interval may be set to a non-default value as follows:

count 't1', 10000
Getting Table Data
The get command is used to get table row data. The syntax of the get command is as follows:
get '/path/tablename', 'rowkey'
A row of data or an individual cell may be accessed. A row with row key r1 from table t1 is accessed as follows:
get 't1', 'r1'
The output has the following format:
COLUMN CELL
cf1:c1 timestamp=123, value=val1
cf1:c2 timestamp=234, value=val2
cf1:c3 timestamp=123, value=val3
Cell data for a single column c1 is accessed as follows:
get 't1', 'r1', {COLUMN => 'c1'}
The COLUMN specification may include the column family and the column qualifier.

get 't1', 'r1', {COLUMN => 'cf1:c1'}
Or just the column family may be specified.
get 't1', 'r1', {COLUMN => 'cf1'}
Cell data for a single column like c1 with a particular timestamp is accessed as follows:
get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}

Cell data for multiple columns c1, c3, and c5 from column family cf1 is accessed as follows:

get 't1', 'r1', {COLUMN => ['cf1:c1', 'cf1:c3', 'cf1:c5']}

A get is essentially a scan limited to one row.
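To make that equivalence concrete, here is a minimal Java sketch (not from the book) that reads the same row r1 both with a Get and with a Scan bounded to a single row:

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "t1");

// Read row r1 with a Get
Get get = new Get(Bytes.toBytes("r1"));
Result getResult = table.get(get);

// The equivalent single-row Scan: the stop row is exclusive, so append a
// zero byte to the start row to stop right after r1
byte[] row = Bytes.toBytes("r1");
Scan scan = new Scan(row, Bytes.add(row, new byte[] {0}));
ResultScanner scanner = table.getScanner(scan);
Result scanResult = scanner.next();
scanner.close();
table.close();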
Truncating a Table
The truncate command disables, drops, and recreates a table.

truncate 't1'
Deleting Table Data
The delete command may be used to delete a row, a column family, or a specific cell in a row in a table. The syntax for the delete command is as follows:

delete 'tablename', 'rowkey', 'columnfamily:columnqualifier'
The following command deletes row r1 from table t1 :
delete 't1', 'r1'
The following command deletes column family cf1 from row r1 in table t1:

delete 't1', 'r1', 'cf1'
The following command deletes column cell c1 in column family cf1 in row r1 in table t1 :
delete 't1', 'r1', 'cf1:c1'
The deleteall command deletes all cells in a given row. For example, the following command deletes row r1 in table t1:
deleteall 't1', 'r1'
Optionally, a column and a timestamp may be specified. The following command deletes column c1 in column family cf1 from row key r1 in table t1 with timestamp ts1:

deleteall 't1', 'r1', 'cf1:c1', 'ts1'
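For comparison, a minimal Java sketch (not from the book, using the HTable API of Chapter 18) that deletes the same cell through the client API; here ts1 is a placeholder for a concrete long timestamp:

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "t1");
Delete delete = new Delete(Bytes.toBytes("r1"));
// Delete the version of cf1:c1 written at timestamp ts1 (placeholder value)
long ts1 = 123456789L;
delete.deleteColumn(Bytes.toBytes("cf1"), Bytes.toBytes("c1"), ts1);
table.delete(delete);
table.close();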
Summary
In this chapter, I discussed the salient HBase shell commands. In the next chapter, I will discuss bulk loading data into HBase.
Bulk Loading Data

The org.apache.hadoop.hbase.mapreduce.ImportTsv utility and the completebulkload tool are used to bulk load data into HBase. The procedure to upload is as follows:

1. Put the data file, which is a TSV file, to be uploaded into HDFS.
2. Run the ImportTsv utility to generate multiple HFiles from the TSV file.
3. Run the completebulkload tool to bulk load the HFiles into HBase.
Let’s discuss an example. You can use the following sample data file input.tsv in HDFS:
r1 c1 c2 c3
r2 c1 c2 c3
r3 c1 c2 c3
r4 c1 c2 c3
r5 c1 c2 c3
r6 c1 c2 c3
r7 c1 c2 c3
r8 c1 c2 c3
r9 c1 c2 c3
r10 c1 c2 c3
Run the following importtsv command to generate HFiles from the input file input.tsv:

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:c1,cf1:c2,cf1:c3 -Dimporttsv.bulk.output=hdfs://output t1 hdfs://input.tsv
The command without the classpath settings is as follows:

hadoop jar hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,cf1:c1,cf1:c2,cf1:c3 -Dimporttsv.bulk.output=hdfs://output t1 hdfs://input.tsv
The -Dimporttsv.columns option specifies the column names of the TSV data. Comma-separated column names are to be provided, with each column name being either a column family or a columnfamily:columnqualifier. A column name must be included for each column of data in the input TSV file. The HBASE_ROW_KEY column specifies the row key, and exactly one must be specified. The -Dimporttsv.bulk.output=/path/for/output option specifies that HFiles are to be generated at the specified output path, which should be either a relative path on the HDFS or an absolute HDFS path, as in the example, hdfs://output. The target table specified is t1; if a target table is not specified, a table is created with the default column family descriptors. The input file on the HDFS is specified as hdfs://input.tsv.
The -Dimporttsv.bulk.output option must be specified to generate HFiles for bulk upload because without it the input data is uploaded directly into an HBase table. The other options that may be specified are in Table 20-1 .
Table 20-1. The importtsv Command Options

-Dimporttsv.skip.bad.lines   Whether to skip invalid data lines. Default is false.
-Dimporttsv.separator        Input data separator. Default is tab.
-Dimporttsv.timestamp        The timestamp to use. Default is currentTimeAsLong.
-Dimporttsv.mapper.class     User-defined mapper to use instead of the default, org.apache.hadoop.hbase.mapreduce.TsvImporterMapper.
When importtsv is run, HFiles get generated.
Next, use the completebulkload utility to bulk upload the HFiles into an HBase table. The completebulkload may be run in one of two modes : as an explicit class name or via the driver. Using the class name, the syntax is as follows:
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://output> <tablename>
With the driver, the syntax is as follows:
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] <hdfs://output> <tablename>
To load the HFile generated in hdfs://output from importtsv, run the following command:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://output t1

Bulk load bypasses the write path completely: it does not use the WAL and does not cause a split storm. A custom MapReduce job may also be used for bulk loading.
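The bulk load can also be triggered programmatically. The following is a minimal sketch (not from the book) that uses the LoadIncrementalHFiles class from the Java client API to load the HFiles generated under hdfs://output into table t1:

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "t1");
// Load the HFiles produced by importtsv into table t1
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("hdfs://output"), table);
table.close();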
Summary
In this chapter, I discussed bulk loading data into HBase. This chapter concludes the book. The book is a primer on the core concepts of Apache HBase: the HBase data model, schema design and architecture, the HBase API, and administration. For usage of HBase, including installation, configuration, and creating an HBase table, please refer to another book from Apress, Practical Hadoop Ecosystem.