Hash tables are a convenient mechanism to store and retrieve objects identified by unique keys.
a.put
(x,
i)
x :=
a.item
(i)
x :=
a
@
i
The first causes the value of a at index i to be
x; the second (and the third, which is
simply a syntactical variant) access the value at index i
and assign it to x. With the usual
computer architectures, these operations are very fast: because arrays items are stored
contiguously in memory, a computer will need just one addition (base address plus
index) and one memory access to perform a
put or
item.
Not only are the operation times small; they are constant (or more precisely
bounded by a constant). This is a great advantage over structures such as lists or trees
which you must traverse at least in part to retrieve an item, so that access and
modification times grow with the number of items. With an array, disregarding the
influence of other factors such as memory paging, the time for a put or item is for all
practical purposes the same whether the array has five items or five hundred thousand.
These properties make arrays excellent data structures for keeping objects.
Unfortunately, they are only applicable if the objects satisfy three requirements:
Hash tables may be viewed as a rehabilitation mechanism for objects that do not
naturally possess these three properties. If we are unable to find a natural index, we can
sometimes devise an artificial one. To do so we must be able to find a key. Each key
must uniquely identify the corresponding object; this is the same as property A2,
making keys similar to indices. But keys are not necessarily integers (violating property
A1), although it must be possible to associate an integer with each key. The mechanism
that maps keys to integers is called the hashing function.
Thanks to the hashing mechanism we will indeed be able to store suitable objects
into arrays, approaching the optimal efficiency of this data structure. The efficiency will
not be quite as good, however, for two reasons:
With good implementations, however, it is possible to use hash tables with a performance that is not much worse than that of arrays and, most importantly, may be treated as if the time for a put, an item or a remove were constant. This will mean that you can consider operations such as
where h is a hash-table and k
is a key (for example a string) as conceptually equivalent to the array operations mentioned above.
The quality of a hashed implementation will depend both on the data structure that
will store the objects, and on the choice of hashing function. Class
HASH_TABLE attempts to
address the first concern; for the second, client developers will be responsible for choosing
the proper hashing function, although Base provides a few predefined functions, in particular
for class STRING.
You may keep objects in a hash table if for each one of these objects you can find a key that uniquely identifies it. The objects and their keys may be of many possible kinds:
What this practically means is that in all cases you will need, in the generating class of the objects to be stored, a query (attribute or function) that gives the key. The type of the key is highly variable but must in all cases be a descendant of HASHABLE. This is true of both INTEGER (case H1) and STRING (case H2). The requirements for being a HASHABLE are not harsh: all you need is a function hash_code that returns a non-negative integer.
Class HASH_TABLE takes two generic parameters:
class HASH_TABLE [G, H -> HASHABLE] ...
G represents the type of the objects to be stored in the hash table, H the type of
their keys.
When viewed as an implementation of containers,
HASH_TABLE, in a strict sense, represents
bags rather than sets: unlike the other classes in this chapter, it allows an object to have two or
more distinct occurrences in a single container. But this is only true if we consider a hash table
as a repository of objects of type G. In reality each item of the table is identified by a pair of
values, one from G and one from H. Because the keys must uniquely identify objects, the hash
table viewed as a container of such pairs is indeed a set.
The creation procedure
make
takes an integer argument, as in
create my_table.make (n)
The value of n indicates how many items the hash table is expected to have to accommodate. This number of items is not a hardwired size, just information passed to the class. In particular:
It is useful, however, to use a reasonable n upon creation: not too large to avoid wasting space, but not too small to avoid frequent applications of resizing, an expensive operation.