Chaining multiple values into a hash

One of the bigger problems when using a DBM file with the DB_HASH storage mechanism is that the keys against which the data is stored must be unique. For example, if we stored two different values with the key of "Wiltshire," say for Stonehenge and Avebury, only the last value inserted into the hash would survive in the database. This is a bit problematic, to say the least.
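A minimal sketch of the problem, assuming %database is already tied to a DB_HASH file as in the earlier examples:

### With DB_HASH, the second store against the same key silently
### replaces the first value
$database{'Wiltshire'} = 'Stonehenge';
$database{'Wiltshire'} = 'Avebury';

### Prints "Avebury" -- the Stonehenge value is lost
print "$database{'Wiltshire'}\n";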

In a good database design, the primary key of any data structure should generally be unique in order to speed up searches. But quick-and-dirty databases, badly designed ones, or databases with suboptimal data quality may not be able to enforce this uniqueness. Similarly, using referential hashtables to provide non-primary-key searching of the database runs into the same problem.

A natural Perl solution to this problem is to push the multiple values onto an array and store a reference to that array within the hash element. This technique works fine while the program is running, because the array reference is still valid, but when the hash element is written out to the DBM file, the reference is stringified, so the data is invalid when the database is reloaded.
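The following minimal sketch (the sites.dat filename and the site data are hypothetical) shows the reference being lost on its way into the DBM file:

#!/usr/bin/perl -w
#
# A sketch only: build an in-memory hash of array references, then
# copy it into a tied DB_File hash. The reference survives in memory
# but is stringified as it is stored in the DBM file.

use DB_File;

### Build the in-memory hash: this works as expected
my %sites;
push @{ $sites{'Wiltshire'} }, 'Stonehenge', 'Avebury';
print "In memory: @{ $sites{'Wiltshire'} }\n";   ### Stonehenge Avebury

### Copy it into a tied DBM hash: the reference is stringified
my %database;
tie %database, 'DB_File', 'sites.dat', O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "Can't initialize database: $!\n";
$database{$_} = $sites{$_} for keys %sites;

print "From DBM:  $database{'Wiltshire'}\n";     ### e.g. ARRAY(0x8101bd0)
untie %database;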

Therefore, to solve this problem, we need to look at using a different Berkeley DB storage mechanism: DB_BTREE, which stores its keys in sorted order. With this mechanism, it is possible to have duplicate keys, because the underlying DBM file is structured as a B-Tree rather than a hashtable. Fortunately, you still reference the DBM file via a Perl hashtable, so DB_BTREE is no harder to use. The main downside to DB_BTREE storage is a performance penalty, since a B-Tree is generally slightly slower than a hashtable for data retrieval.

The following short program creates a Berkeley DB using the DB_BTREE storage mechanism and also specifies a flag to indicate that duplicate keys are allowed. Several rows are inserted with duplicate keys, and finally the database is dumped to show what has actually been stored:

#!/usr/bin/perl -w
#
# ch02/DBM/dupkey1: Creates a Berkeley DB with the DB_BTREE mechanism and
#                   allows for duplicate keys. We then insert some test
#                   object data with duplicate keys and dump the final
#                   database.

use DB_File;
use Fcntl ':flock';
use Megalith;

### Set Berkeley DB BTree mode to handle duplicate keys
$DB_BTREE->{'flags'} = R_DUP;

### Remove any existing database files
unlink 'dupkey1.dat';

### Open the database up
my %database;
my $db = tie %database, 'DB_File', 'dupkey1.dat',
               O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "Can't initialize database: $!\n";

### Exclusively lock the database to ensure no one else accesses it
my $fd = $db->fd();
open DATAFILE, "+<&=$fd"
    or die "Can't safely open file: $!\n";
print "Acquiring exclusive lock...";
flock( DATAFILE, LOCK_EX )
    or die "Unable to acquire lock: $!. Aborting";
print "Acquired lock. Ready to update database!\n\n";

### Create, pack, and insert some rows with duplicate keys
$database{'Wiltshire'} =
  new Megalith( 'Avebury',
                'Wiltshire',
                'SU 103 700',
                'Stone Circle and Henge',
                'Largest stone circle in Britain' )->pack();

$database{'Wiltshire'} =
  new Megalith( 'Stonehenge',
                'Wiltshire',
                'SU 123 400',
                'Stone Circle and Henge',
                'The most popularly known stone circle in the world' )->pack();

$database{'Wiltshire'} =
  new Megalith( 'The Sanctuary',
                'Wiltshire',
                'SU 118 680',
                'Stone Circle ( destroyed )',
                'No description available' )->pack();

### Dump the database
foreach my $key ( keys %database ) {

    ### Unpack the record into a new megalith object
    my $megalith = new Megalith( $database{$key} );

    ### And display the record
    $megalith->dump();
}

### Close the database
undef $db;
untie %database;

### Close the filehandle to release the lock
close DATAFILE;

exit;

The output you get from running this program is not exactly what we’d hoped for:

Acquiring exclusive lock...Acquired lock. Ready to update database!

The Sanctuary ( Stone Circle ( destroyed ) )
============================================
Location:      Wiltshire
Map Reference: SU 118 680
Description:   No description available

The Sanctuary ( Stone Circle ( destroyed ) )
============================================
Location:      Wiltshire
Map Reference: SU 118 680
Description:   No description available

The Sanctuary ( Stone Circle ( destroyed ) )
============================================
Location:      Wiltshire
Map Reference: SU 118 680
Description:   No description available

It seems that we’ve managed to successfully store three copies of the same record instead of three different records!

Fortunately, this isn't actually the case. We have, in fact, correctly stored three different records against the same key in the DBM file. The problem lies in the way we've tried to read these records back out of the DBM file. A simple lookup via the hash key obviously doesn't work, since the hash interface returns only a single value for each key, as we already know.

To get around this limitation, we can use the seq() method supplied by the DB_File module, which is used to traverse chained records stored within a single hash element. Figure 2-2 illustrates the principle of chained record traversal within a hash element.

Figure 2-2. Chained record traversal

The record-dumping chunk, corrected to use seq(), looks like this:

### Dump the database
my ($key, $value, $status) = ('', '', 0);
for ( $status = $db->seq( $key, $value, R_FIRST ) ;
      $status == 0 ;
      $status = $db->seq( $key, $value, R_NEXT ) ) {

    ### Unpack the record into a new megalith object
    my $megalith = new Megalith( $value );

    ### And display the record
    $megalith->dump();
}

Running this corrected version produces the output we expected, i.e., records for three different megalithic sites.

The seq() method is quite simple to use and understand, and it works well in conjunction with a for loop, as shown above. The method takes three arguments: the hash key, the hash value, and a flag signifying which element within the chain should be returned. The first two arguments are populated with the appropriate hash key and hash value when seq() is called. Exactly which key/value pair is returned depends on the value of the third argument:

R_FIRST
    Returns the first key/value pair in the database.

R_LAST
    Returns the last key/value pair in the database.

R_NEXT
    Returns the key/value pair following the current one.

R_PREV
    Returns the key/value pair preceding the current one.

R_CURSOR
    Returns the key/value pair associated with the given key.

In the database dumping example shown above, we are simply starting at the beginning of the record chain and traversing through it in a forward direction. We could have performed a backward search by writing:

for ( $status = $db->seq( $key, $value, R_LAST ) ;
      $status == 0 ;
      $status = $db->seq( $key, $value, R_PREV ) ) {
    ...
}

A quicker and easier utility method for querying duplicate values also exists: get_dup(). In scalar context, this method returns the number of records stored against the given key; in list context, it returns an array of those records (or, if a true second argument is supplied, a hash mapping each record to its count). For example, given that we have three records in our database with the key of Wiltshire, we could verify that fact by writing:

### Displays the number of records inserted against 
### the "Wiltshire" key
my $numRecords = $db->get_dup( 'Wiltshire' );
print "Number of Wiltshire records: $numRecords\n";
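Since get_dup() also returns the records themselves in list context, we could fetch and dump all three Wiltshire records without a seq() loop. The following sketch assumes the same Megalith class used above:

### Fetch all records stored against the "Wiltshire" key
### and dump each one
my @records = $db->get_dup( 'Wiltshire' );
foreach my $packedRecord ( @records ) {
    my $megalith = new Megalith( $packedRecord );
    $megalith->dump();
}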