Databases are designed to maintain data for businesses, governments, and individuals. User requirements continue to increase and database capacity and efficiency must improve constantly to meet such requirements.
Objective 17.1 This chapter discusses applications of B trees, bloomjoins, and the CouchDB structure to resolve search-related database issues.
The complexity of a sorting and searching algorithm depends on the number of comparisons performed. For example, for a binary search of 1 million records, a specific record can be located with at most 20 comparisons. The large database is usually kept in the disk drives and the time to read a record on a disk drive (magnetic tapes) far exceeds the time needed to compare keys once the record is available. The time to read an element from a hard disk involves seek time and rotation time. Seek time is usually 0 to 20 ms, and sometimes more. The rotation time for a Seagate ST350032NS 7200 rpm drive is around 8.33 ms. For simplicity, assume that seeking data from a hard disk requires 10 ms. The time to process 20 disk reads to locate a single record among 1 million (at 10 ms per read) is 0.2 second. A little time saving results because individual records are grouped into disk blocks that hold approximately 16 Kb. About 100 records per block could be saved if each record holds 160 bytes.
If the index of the database is improved then the time consumed during searching can also be improved. Search range in the above example can be improved by creating an auxiliary index containing the first record in each disk block. This is known as a sparse index. The auxiliary index decreases search time but it occupies around 1% of the total database. It also identifies the block where the search is to be done, eliminating the cost of searching an enormous database. The auxiliary index can hold up to 10,000 entries, so it would take at most 14 comparisons. The index could be searched in about 8 disk blocks and the desired record will be accessed in around 9 disk reads. The trick to optimize the search is to create a new auxiliary index of an auxiliary index. This will reduce search further and it can be accommodated on one disk. So instead of reading all 14 disk blocks to find a desired record, we need only 3 block reads. Auxiliary indices can reduce search time to one-fifth of the processing time with no indices. If the main database is used frequently, the auxiliary disk will reside at the disk cache to avoid repetitive disk reads.
17.1.3 Insertion deletion in database
Index compilation becomes easy what a database is fixed. However, if a database changes over time, index management becomes complex. Operations like deleting a record have little impact because only the record in the index will be marked as deleted. If deletions become frequent, search operations become less efficient. Insertion of data in a sorted record is expensive because it requires shifts of all previous records.
We suggest storing all records sparsely in a block to create free space between the records. The free space can handle insertions and deletions efficiently.
If inserted data will not fit on a block after use of the above technique, free space on an out-of-sequence disk may be used.
17.2 B and B+ Trees for Database Creation and Block Search
Most of the self-balanced search trees (like AVL and red-black trees) assume that all data is stored in the main memory. This is a constraint when large data operations must be performed.
Disk access requires more time than accessing main memory.
B trees can reduce the frequency of disk access. Search, insert, delete, max, and min require O(h) accesses where h is tree height. The high of a B tree can be reduced by applying the largest key possible to its nodes. Generally, B tree node size is equivalent to block size. B tree height is low in comparison to other BSTs like AVL and red-black [12].
17.2.1 Applications of B trees in databases and file systems
•B trees keep the keys in sorted order so that the sequential traversal becomes easy.
•To minimize the number of disk reads, B trees use hierarchical indexing.
•B trees use partially full blocks to speed insertions and deletions.
•B trees utilize elegant recursive algorithms to keep their indices balanced.
B trees utilize only half their interior nodes t minimize time requirements.
B trees can also handle any number of insertions and deletions.
A binary search of a sorted table (N records) requires roughly (log2 N) comparisons.
B trees can serve as file systems to quickly access arbitrary blocks without need to access the main database.
Apache CouchDB is a new product intended to enhance database management. It uses a B tree data structure to index its documents and views. It also acts as a B tree manager with an HTTP interface.
CouchDB uses a B+ tree, a variation of the B tree that trades a bit of (disk) space for speed,
CouchDB uses a B+ structure to store huge amounts of data. B trees typical have single-digit heights even when storing millions of entries. The advantage of a B tree is the ability to store leaves on a slow medium such as a hard drive. CouchDB utilizes that feature. An operating system is likely to cache the upper tree nodes so only the final leaf nodes are stored on a hard disk. B tree access time is fewer than 10 ms, even for extremely large amounts of data. CouchDB, B tree appends data only to the database file that keeps the B tree on disk and append-only leaf nodes. B Tree provides a robust database file. Inclusion of B tree features in CouchDB helps avoid data corruption by not overwriting existing data on hard disks.
CouchDB restarts after problems like power failures by backward reading of database files. If the first 2k (footer with checksum) is corrupted, CouchDB relaces it with the second footer. If that footer is corrupted, the first 2k is copied. When both footers are replaced successfully to the disk, the write operation is successful. Documents indexed in a B tree are given names (DocID) and sequence identifications. Each database update generates a new sequence number that will be used later to find changes in the database. Indices are updated by saving or deleting documents.
Index updates occur after files are appended or modified.
A bloomjoin is a algorithm used for combining database attributes in distributed environments. The sequence of steps is shown below. Consider an Internet scenario involving Site 1 and Site 2; both have data sets designated T1 and T2 respectively.
•Site 1 computes a bloom filter F(T1)-based table of T1 records by hashing h1(T1) and sends the table to Site 2.
•Site 2 computes a bloom filter F(T2) and filters out all records that do not belong to F(T1). Assume T′ is the required record.
•Site 2 then sends T′ to Site 1 where the join is computed.
This algorithm is not limited to only 2 sites. However, the algorithm does not specify any method to minimize the network cost. Assigning optimal configurations to records such as table structures, limiting numbers of records in tables, and joining data can reduce message size [190].
Project 17.1 — Question paper generator system. Create a smart exam paper generation system that will allow an administrator to input questions and answers. The system should restrict the ability to determine weights and complexities of questions to the administrator. The system will be capable of incorporating modifications to the exam after it is converted to pdf format and emailed to colleges. [Hint: Use a bloomjoin.]
Project 17.2 — Online tutorial using Couch DB. Use CouchDB to implement membership and revenue aspects of an online tutorial system. A user will register on a website, then pay a fee, after which the user is treated as a member and can view and download educational content. Include a procedure that will allow a user to refer new members and and collect a 30% rebate if a potential member joins. Your design should include four referral levels with decreasing rebate levels.
Project 17.3 — College governance system. Create a chart of accounts for various departments (finance, computer science, administration) with access to all accounts available to the top official. The first task is to choose any data structure you want. The requirements are listed below:
•Separate sections for all departments (you can choose any number of departments.
•Online access and processing of data restricted to certain officials.
•Administrator accounts for top officials.
•Separate sections for communications to and from various entities: A2S = Administraton to Students; A2F = Administration to Faculty; S2A = Students to Administration; A2P = Administration to Parents.
Project 17.4 — Farming assistance application. Design a Web project to increase profitability of farmers through direct communications between farmers and between farmers and supplies. Dealers should have the capability to advertise to farmers. Farmers should be able to rate dealer products and services. All parties will access the system via login identifications and passwords. [Hint: Database should utilize red-black trees.]