Hash Tables

Hash tables are a convenient mechanism to store and retrieve objects identified by unique keys.

Why use hash tables?

The main advantage of hash tables is the efficiency of the basic operations: store (put) and retrieve (item, remove).
The idea behind hash tables is to try to emulate the data structure that provides the ultimate in efficiency: the array. On an array a, for some integer i whose value lies within the bounds ofa, the basic operations are

a.put (x, i)
x := a.item (i)
x := a @ i

The first causes the value of a at index i to be x; the second (and the third, which is simply a syntactical variant) access the value at index i and assign it to x. With the usual computer architectures, these operations are very fast: because arrays items are stored contiguously in memory, a computer will need just one addition (base address plus index) and one memory access to perform a put or item.
Not only are the operation times small; they are constant (or more precisely bounded by a constant). This is a great advantage over structures such as lists or trees which you must traverse at least in part to retrieve an item, so that access and modification times grow with the number of items. With an array, disregarding the influence of other factors such as memory paging, the time for a put or item is for all practical purposes the same whether the array has five items or five hundred thousand. These properties make arrays excellent data structures for keeping objects. Unfortunately, they are only applicable if the objects satisfy three requirements:

A1. For each object there must be an associated integer, which for the purpose of this discussion we may call the object’s index (since it will serve as index for the object in the array.)
A2. No two objects may have the same index.
A3. If we want to avoid wasting huge amount of storage, all the indices must lie in a contiguous or almost contiguous range.

Hash tables may be viewed as a rehabilitation mechanism for objects that do not naturally possess these three properties. If we are unable to find a natural index, we can sometimes devise an artificial one. To do so we must be able to find a key. Each key must uniquely identify the corresponding object; this is the same as property A2, making keys similar to indices. But keys are not necessarily integers (violating property A1), although it must be possible to associate an integer with each key. The mechanism that maps keys to integers is called the hashing function.
Thanks to the hashing mechanism we will indeed be able to store suitable objects into arrays, approaching the optimal efficiency of this data structure. The efficiency will not be quite as good, however, for two reasons:

We must pay the price of computing the hash function whenever we store or retrieve an object.
Different keys may hash into the same integer value, requiring extra processing to find an acceptable index.

With good implementations, however, it is possible to use hash tables with a performance that is not much worse than that of arrays and, most importantly, may be treated as if the time for a put, an item or a remove were constant. This will mean that you can consider operations such as

h.put (x, k)
h := a.item (k)

where h is a hash-table and k is a key (for example a string) as conceptually equivalent to the array operations mentioned above.
The quality of a hashed implementation will depend both on the data structure that will store the objects, and on the choice of hashing function. Class HASH_TABLE attempts to address the first concern; for the second, client developers will be responsible for choosing the proper hashing function, although Base provides a few predefined functions, in particular for class STRING.

When hash tables are appropriate

You may keep objects in a hash table if for each one of these objects you can find a key that uniquely identifies it. The objects and their keys may be of many possible kinds:

H1. In a simple example, the objects are integers; each integer serves as its own key. (More precisely we will use its absolute value, since it is convenient to have non-negative keys only.) This case is of more than theoretical interest, since it makes hash tables appropriate for storing a set of integers with widely scattered values, for which simple array storage would be a waste of space (see requirement A3 above).
H2. Frequently, the objects will be composite, that is to say, instances of a developer-defined class, and one of the attributes of that class, of type STRING, can serve as the key. For example if you were writing an Eiffel compiler you would probably need to keep a data structure that includes information about classes of the system. Each class is represented by an object with several fields describing the properties of the class; one of these fields, the class name, corresponding to an attribute of type STRING, will serve as key.
H3. Instead of being the full object (as in case H1) or one of the object’s fields (as in case H2), the key may have to be computed through a function of the generating class, which will take into account several attributes of the class (that is to say, for each object, several fields).

What this practically means is that in all cases you will need, in the generating class of the objects to be stored, a query (attribute or function) that gives the key. The type of the key is highly variable but must in all cases be a descendant of HASHABLE. This is true of both INTEGER (case H1) and STRING (case H2). The requirements for being a HASHABLE are not harsh: all you need is a function hash_code that returns a non-negative integer.

Using hash tables

Class HASH_TABLE takes two generic parameters:

class HASH_TABLE [G, H -> HASHABLE] ...

G represents the type of the objects to be stored in the hash table, H the type of their keys.
When viewed as an implementation of containers, HASH_TABLE, in a strict sense, represents bags rather than sets: unlike the other classes in this chapter, it allows an object to have two or more distinct occurrences in a single container. But this is only true if we consider a hash table as a repository of objects of type G. In reality each item of the table is identified by a pair of values, one from G and one from H. Because the keys must uniquely identify objects, the hash table viewed as a container of such pairs is indeed a set. The creation procedure make takes an integer argument, as in

create my_table.make (n)

The value of n indicates how many items the hash table is expected to have to accommodate. This number of items is not a hardwired size, just information passed to the class. In particular:

The actual size of the underlying array representation will be higher than n since efficient operation of hash table algorithms require the presence of enough breathing space - unoccupied positions.
If the number of items in the table grows beyond the initial allocation, the table will automatically be resized.

It is useful, however, to use a reasonable n upon creation: not too large to avoid wasting space, but not too small to avoid frequent applications of resizing, an expensive operation.