From 6a364ced497e407ab3ffb2554d4ef2c78f801832 Mon Sep 17 00:00:00 2001
From: Karsten Blees <karsten.blees@gmail.com>
Date: Thu, 14 Nov 2013 20:17:54 +0100
Subject: add a hashtable implementation that supports O(1) removal

The existing hashtable implementation (in hash.[ch]) uses open addressing
(i.e. resolve hash collisions by distributing entries across the table).
Thus, removal is difficult to implement with less than O(n) complexity.
Resolving collisions of entries with identical hashes (e.g. via chaining)
is left to the client code.

Add a hashtable implementation that supports O(1) removal and is slightly
easier to use due to builtin entry chaining.

It supports all basic operations: init, free, get, add, remove and iteration.

Also includes ready-to-use hash functions based on the public domain FNV-1
algorithm (http://www.isthe.com/chongo/tech/comp/fnv).
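
For illustration only, a minimal sketch of 32-bit FNV-1, assuming the
standard published offset basis and prime. The names fnv1_mem and fnv1_str
are hypothetical and are not the functions added by this patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV parameters (assumed here, not quoted from the patch). */
#define FNV32_OFFSET_BASIS UINT32_C(0x811c9dc5)
#define FNV32_PRIME UINT32_C(0x01000193)

/* FNV-1 over a memory range: multiply by the prime, then XOR in the byte. */
static uint32_t fnv1_mem(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t hash = FNV32_OFFSET_BASIS;

	while (len--)
		hash = (hash * FNV32_PRIME) ^ *p++;
	return hash;
}

/* FNV-1 over a 0-terminated string. */
static uint32_t fnv1_str(const char *s)
{
	uint32_t hash = FNV32_OFFSET_BASIS;

	while (*s)
		hash = (hash * FNV32_PRIME) ^ (unsigned char)*s++;
	return hash;
}
```

A case-insensitive variant would presumably fold each byte to one case
before mixing it in.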

The per-entry data structure (hashmap_entry) is piggybacked in front of
the client's data structure to save memory. See test-hashmap.c for usage
examples.

The hashtable is resized by a factor of four when 80% full. With these
settings, average memory consumption is about 2/3 of hash.[ch], and
insertion is about twice as fast due to less frequent resizing.
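
The growth policy described above could be sketched roughly as follows.
The macro and function names are hypothetical, and whether the threshold
comparison is strict is an assumption of this sketch:

```c
#include <stddef.h>

/* Hypothetical constants mirroring the settings named in this message. */
#define LOAD_FACTOR_PERCENT 80 /* grow once more than 80% of the slots are used */
#define RESIZE_BITS 2          /* grow by a factor of 2^2 = 4 */

/*
 * Return the table size to use once the map holds nr_entries entries:
 * four times the current size if the load factor is exceeded, otherwise
 * the current size unchanged.
 */
static size_t next_tablesize(size_t tablesize, size_t nr_entries)
{
	if (nr_entries * 100 > tablesize * LOAD_FACTOR_PERCENT)
		return tablesize << RESIZE_BITS;
	return tablesize;
}
```

With a 64-slot table, this sketch keeps the size at 51 entries and grows to
256 slots at 52 entries.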

Lookups are also slightly faster, because entries are strictly confined to
their bucket (i.e. no data of other buckets needs to be traversed).
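
To illustrate why chaining makes removal cheap, here is a minimal sketch of
unlinking one entry from a single bucket chain. struct node and chain_remove
are hypothetical stand-ins, not the code added by this patch:

```c
#include <stddef.h>

/* Hypothetical chained entry: each bucket is a singly linked list. */
struct node {
	struct node *next;
	unsigned int hash;
};

/*
 * Unlink `target` from the chain that starts at *head.
 * Returns the removed node, or NULL if it is not in the chain.
 * Only this one chain is walked; no other bucket is touched.
 */
static struct node *chain_remove(struct node **head, const struct node *target)
{
	struct node **p = head;
	struct node *removed;

	while (*p && *p != target)
		p = &(*p)->next;
	if (!*p)
		return NULL;
	removed = *p;
	*p = removed->next;   /* bypass the removed node */
	removed->next = NULL;
	return removed;
}
```

The cost is proportional to the length of one chain, which stays short as
long as the load factor is bounded, so removal is O(1) on average.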

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/api-hashmap.txt | 235 ++++++++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 Documentation/technical/api-hashmap.txt

(limited to 'Documentation/technical')

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
new file mode 100644
index 0000000000..b2280f1b0d
--- /dev/null
+++ b/Documentation/technical/api-hashmap.txt
@@ -0,0 +1,235 @@
+hashmap API
+===========
+
+The hashmap API is a generic implementation of hash-based key-value mappings.
+
+Data Structures
+---------------
+
+`struct hashmap`::
+
+	The hash table structure.
++
+The `size` member keeps track of the total number of entries. The `cmpfn`
+member is a function used to compare two entries for equality. The `table` and
+`tablesize` members store the hash table and its size, respectively.
+
+`struct hashmap_entry`::
+
+	An opaque structure representing an entry in the hash table, which must
+	be used as the first member of user data structures. Ideally it should be
+	followed by an int-sized member to prevent unused memory on 64-bit
+	systems due to alignment.
++
+The `hash` member is the entry's hash code and the `next` member points to the
+next entry in case of collisions (i.e. if multiple entries map to the same
+bucket).
+
+`struct hashmap_iter`::
+
+	An iterator structure, to be used with hashmap_iter_* functions.
+
+Types
+-----
+
+`int (*hashmap_cmp_fn)(const void *entry, const void *entry_or_key, const void *keydata)`::
+
+	User-supplied function to test two hashmap entries for equality. Shall
+	return 0 if the entries are equal.
++
+This function is always called with non-NULL `entry` / `entry_or_key`
+parameters that have the same hash code. When looking up an entry, the `key`
+and `keydata` parameters to hashmap_get and hashmap_remove are always passed
+as second and third argument, respectively. Otherwise, `keydata` is NULL.
+
+Functions
+---------
+
+`unsigned int strhash(const char *buf)`::
+`unsigned int strihash(const char *buf)`::
+`unsigned int memhash(const void *buf, size_t len)`::
+`unsigned int memihash(const void *buf, size_t len)`::
+
+	Ready-to-use hash functions for strings, using the FNV-1 algorithm (see
+	http://www.isthe.com/chongo/tech/comp/fnv).
++
+`strhash` and `strihash` take 0-terminated strings, while `memhash` and
+`memihash` operate on arbitrary-length memory.
++
+`strihash` and `memihash` are case-insensitive versions.
+
+`void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function, size_t initial_size)`::
+
+	Initializes a hashmap structure.
++
+`map` is the hashmap to initialize.
++
+The `equals_function` can be specified to compare two entries for equality.
+If NULL, entries are considered equal if their hash codes are equal.
++
+If the total number of entries is known in advance, the `initial_size`
+parameter may be used to preallocate a sufficiently large table and thus
+prevent expensive resizing. If 0, the table is dynamically resized.
+
+`void hashmap_free(struct hashmap *map, int free_entries)`::
+
+	Frees a hashmap structure and allocated memory.
++
+`map` is the hashmap to free.
++
+If `free_entries` is true, each hashmap_entry in the map is freed as well
+(using stdlib's free()).
+
+`void hashmap_entry_init(void *entry, int hash)`::
+
+	Initializes a hashmap_entry structure.
++
+`entry` points to the entry to initialize.
++
+`hash` is the hash code of the entry.
+
+`void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)`::
+
+	Returns the hashmap entry for the specified key, or NULL if not found.
++
+`map` is the hashmap structure.
++
+`key` is a hashmap_entry structure (or user data structure that starts with
+hashmap_entry) that has at least been initialized with the proper hash code
+(via `hashmap_entry_init`).
++
+If an entry with matching hash code is found, `key` and `keydata` are passed
+to `hashmap_cmp_fn` to decide whether the entry matches the key.
+
+`void *hashmap_get_next(const struct hashmap *map, const void *entry)`::
+
+	Returns the next equal hashmap entry, or NULL if not found. This can be
+	used to iterate over duplicate entries (see `hashmap_add`).
++
+`map` is the hashmap structure.
++
+`entry` is the hashmap_entry to start the search from, obtained via a previous
+call to `hashmap_get` or `hashmap_get_next`.
+
+`void hashmap_add(struct hashmap *map, void *entry)`::
+
+	Adds a hashmap entry. This allows adding duplicate entries (i.e.
+	separate values with the same key according to hashmap_cmp_fn).
++
+`map` is the hashmap structure.
++
+`entry` is the entry to add.
+
+`void *hashmap_put(struct hashmap *map, void *entry)`::
+
+	Adds or replaces a hashmap entry. If the hashmap contains duplicate
+	entries equal to the specified entry, only one of them will be replaced.
++
+`map` is the hashmap structure.
++
+`entry` is the entry to add or replace.
++
+Returns the replaced entry, or NULL if not found (i.e. the entry was added).
+
+`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`::
+
+	Removes a hashmap entry matching the specified key. If the hashmap
+	contains duplicate entries equal to the specified key, only one of
+	them will be removed.
++
+`map` is the hashmap structure.
++
+`key` is a hashmap_entry structure (or user data structure that starts with
+hashmap_entry) that has at least been initialized with the proper hash code
+(via `hashmap_entry_init`).
++
+If an entry with matching hash code is found, `key` and `keydata` are
+passed to `hashmap_cmp_fn` to decide whether the entry matches the key.
++
+Returns the removed entry, or NULL if not found.
+
+`void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter)`::
+`void *hashmap_iter_next(struct hashmap_iter *iter)`::
+`void *hashmap_iter_first(struct hashmap *map, struct hashmap_iter *iter)`::
+
+	Used to iterate over all entries of a hashmap.
++
+`hashmap_iter_init` initializes a `hashmap_iter` structure.
++
+`hashmap_iter_next` returns the next hashmap_entry, or NULL if there are no
+more entries.
++
+`hashmap_iter_first` is a combination of both (i.e. initializes the iterator
+and returns the first entry, if any).
+
+Usage example
+-------------
+
+Here's a simple usage example that maps long keys to double values.
+[source,c]
+------------
+struct hashmap map;
+
+struct long2double {
+	struct hashmap_entry ent; /* must be the first member! */
+	long key;
+	double value;
+};
+
+static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused)
+{
+	return !(e1->key == e2->key);
+}
+
+void long2double_init(void)
+{
+	hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
+}
+
+void long2double_free(void)
+{
+	hashmap_free(&map, 1);
+}
+
+static struct long2double *find_entry(long key)
+{
+	struct long2double k;
+	hashmap_entry_init(&k, memhash(&key, sizeof(long)));
+	k.key = key;
+	return hashmap_get(&map, &k, NULL);
+}
+
+double get_value(long key)
+{
+	struct long2double *e = find_entry(key);
+	return e ? e->value : 0;
+}
+
+void set_value(long key, double value)
+{
+	struct long2double *e = find_entry(key);
+	if (!e) {
+		e = malloc(sizeof(struct long2double));
+		hashmap_entry_init(e, memhash(&key, sizeof(long)));
+		e->key = key;
+		hashmap_add(&map, e);
+	}
+	e->value = value;
+}
+------------
+
+Using variable-sized keys
+-------------------------
+
+The `hashmap_get` and `hashmap_remove` functions expect an ordinary
+`hashmap_entry` structure as key to find the correct entry. If the key data is
+variable-sized (e.g. a FLEX_ARRAY string) or quite large, it is undesirable
+to create a full-fledged entry structure on the heap and copy all the key data
+into the structure.
+
+In this case, the `keydata` parameter can be used to pass
+variable-sized key data directly to the comparison function, and the `key`
+parameter can be a stripped-down, fixed-size entry structure allocated on the
+stack.
+
+See test-hashmap.c for an example using arbitrary-length strings as keys.
-- 
cgit v1.3


From efc684245b81ae0fb8f0afbd06dc1c3101c4e5a0 Mon Sep 17 00:00:00 2001
From: Karsten Blees <karsten.blees@gmail.com>
Date: Thu, 14 Nov 2013 20:23:12 +0100
Subject: remove old hash.[ch] implementation

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/api-hash.txt |  52 -----------------
 Makefile                             |   2 -
 cache.h                              |   1 -
 hash.c                               | 110 -----------------------------------
 hash.h                               |  50 ----------------
 test-hashmap.c                       |  84 --------------------------
 6 files changed, 299 deletions(-)
 delete mode 100644 Documentation/technical/api-hash.txt
 delete mode 100644 hash.c
 delete mode 100644 hash.h

diff --git a/Documentation/technical/api-hash.txt b/Documentation/technical/api-hash.txt
deleted file mode 100644
index e5061e0677..0000000000
--- a/Documentation/technical/api-hash.txt
+++ /dev/null
@@ -1,52 +0,0 @@
-hash API
-========
-
-The hash API is a collection of simple hash table functions. Users are expected
-to implement their own hashing.
-
-Data Structures
----------------
-
-`struct hash_table`::
-
-	The hash table structure. The `array` member points to the hash table
-	entries. The `size` member counts the total number of valid and invalid
-	entries in the table. The `nr` member keeps track of the number of
-	valid entries.
-
-`struct hash_table_entry`::
-
-	An opaque structure representing an entry in the hash table. The `hash`
-	member is the entry's hash key and the `ptr` member is the entry's
-	value.
-
-Functions
----------
-
-`init_hash`::
-
-	Initialize the hash table.
-
-`free_hash`::
-
-	Release memory associated with the hash table.
-
-`insert_hash`::
-
-	Insert a pointer into the hash table. If an entry with that hash
-	already exists, a pointer to the existing entry's value is returned.
-	Otherwise NULL is returned.  This allows callers to implement
-	chaining, etc.
-
-`lookup_hash`::
-
-	Lookup an entry in the hash table. If an entry with that hash exists
-	the entry's value is returned. Otherwise NULL is returned.
-
-`for_each_hash`::
-
-	Call a function for each entry in the hash table. The function is
-	expected to take the entry's value as its only argument and return an
-	int. If the function returns a negative int the loop is aborted
-	immediately.  Otherwise, the return value is accumulated and the sum
-	returned upon completion of the loop.
diff --git a/Makefile b/Makefile
index d8d3d6705b..f495dd4c13 100644
--- a/Makefile
+++ b/Makefile
@@ -677,7 +677,6 @@ LIB_H += git-compat-util.h
 LIB_H += gpg-interface.h
 LIB_H += graph.h
 LIB_H += grep.h
-LIB_H += hash.h
 LIB_H += hashmap.h
 LIB_H += help.h
 LIB_H += http.h
@@ -809,7 +808,6 @@ LIB_OBJS += gettext.o
 LIB_OBJS += gpg-interface.o
 LIB_OBJS += graph.o
 LIB_OBJS += grep.o
-LIB_OBJS += hash.o
 LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
diff --git a/cache.h b/cache.h
index 1f11e24cd0..407145c364 100644
--- a/cache.h
+++ b/cache.h
@@ -3,7 +3,6 @@
 
 #include "git-compat-util.h"
 #include "strbuf.h"
-#include "hash.h"
 #include "hashmap.h"
 #include "advice.h"
 #include "gettext.h"
diff --git a/hash.c b/hash.c
deleted file mode 100644
index 749ecfe484..0000000000
--- a/hash.c
+++ /dev/null
@@ -1,110 +0,0 @@
-/*
- * Some generic hashing helpers.
- */
-#include "cache.h"
-#include "hash.h"
-
-/*
- * Look up a hash entry in the hash table. Return the pointer to
- * the existing entry, or the empty slot if none existed. The caller
- * can then look at the (*ptr) to see whether it existed or not.
- */
-static struct hash_table_entry *lookup_hash_entry(unsigned int hash, const struct hash_table *table)
-{
-	unsigned int size = table->size, nr = hash % size;
-	struct hash_table_entry *array = table->array;
-
-	while (array[nr].ptr) {
-		if (array[nr].hash == hash)
-			break;
-		nr++;
-		if (nr >= size)
-			nr = 0;
-	}
-	return array + nr;
-}
-
-
-/*
- * Insert a new hash entry pointer into the table.
- *
- * If that hash entry already existed, return the pointer to
- * the existing entry (and the caller can create a list of the
- * pointers or do anything else). If it didn't exist, return
- * NULL (and the caller knows the pointer has been inserted).
- */
-static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
-{
-	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
-
-	if (!entry->ptr) {
-		entry->ptr = ptr;
-		entry->hash = hash;
-		table->nr++;
-		return NULL;
-	}
-	return &entry->ptr;
-}
-
-static void grow_hash_table(struct hash_table *table)
-{
-	unsigned int i;
-	unsigned int old_size = table->size, new_size;
-	struct hash_table_entry *old_array = table->array, *new_array;
-
-	new_size = alloc_nr(old_size);
-	new_array = xcalloc(sizeof(struct hash_table_entry), new_size);
-	table->size = new_size;
-	table->array = new_array;
-	table->nr = 0;
-	for (i = 0; i < old_size; i++) {
-		unsigned int hash = old_array[i].hash;
-		void *ptr = old_array[i].ptr;
-		if (ptr)
-			insert_hash_entry(hash, ptr, table);
-	}
-	free(old_array);
-}
-
-void *lookup_hash(unsigned int hash, const struct hash_table *table)
-{
-	if (!table->array)
-		return NULL;
-	return lookup_hash_entry(hash, table)->ptr;
-}
-
-void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
-{
-	unsigned int nr = table->nr;
-	if (nr >= table->size/2)
-		grow_hash_table(table);
-	return insert_hash_entry(hash, ptr, table);
-}
-
-int for_each_hash(const struct hash_table *table, int (*fn)(void *, void *), void *data)
-{
-	int sum = 0;
-	unsigned int i;
-	unsigned int size = table->size;
-	struct hash_table_entry *array = table->array;
-
-	for (i = 0; i < size; i++) {
-		void *ptr = array->ptr;
-		array++;
-		if (ptr) {
-			int val = fn(ptr, data);
-			if (val < 0)
-				return val;
-			sum += val;
-		}
-	}
-	return sum;
-}
-
-void free_hash(struct hash_table *table)
-{
-	free(table->array);
-	table->array = NULL;
-	table->size = 0;
-	table->nr = 0;
-}
diff --git a/hash.h b/hash.h
deleted file mode 100644
index 1d43ac0ba0..0000000000
--- a/hash.h
+++ /dev/null
@@ -1,50 +0,0 @@
-#ifndef HASH_H
-#define HASH_H
-
-/*
- * These are some simple generic hash table helper functions.
- * Not necessarily suitable for all users, but good for things
- * where you want to just keep track of a list of things, and
- * have a good hash to use on them.
- *
- * It keeps the hash table at roughly 50-75% free, so the memory
- * cost of the hash table itself is roughly
- *
- *	3 * 2*sizeof(void *) * nr_of_objects
- *
- * bytes.
- *
- * FIXME: on 64-bit architectures, we waste memory. It would be
- * good to have just 32-bit pointers, requiring a special allocator
- * for hashed entries or something.
- */
-struct hash_table_entry {
-	unsigned int hash;
-	void *ptr;
-};
-
-struct hash_table {
-	unsigned int size, nr;
-	struct hash_table_entry *array;
-};
-
-extern void *lookup_hash(unsigned int hash, const struct hash_table *table);
-extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
-extern int for_each_hash(const struct hash_table *table, int (*fn)(void *, void *), void *data);
-extern void free_hash(struct hash_table *table);
-
-static inline void init_hash(struct hash_table *table)
-{
-	table->size = 0;
-	table->nr = 0;
-	table->array = NULL;
-}
-
-static inline void preallocate_hash(struct hash_table *table, unsigned int elts)
-{
-	assert(table->size == 0 && table->nr == 0 && table->array == NULL);
-	table->size = elts * 2;
-	table->array = xcalloc(sizeof(struct hash_table_entry), table->size);
-}
-
-#endif
diff --git a/test-hashmap.c b/test-hashmap.c
index 581d2964e4..7e86f886d8 100644
--- a/test-hashmap.c
+++ b/test-hashmap.c
@@ -126,85 +126,6 @@ static void perf_hashmap(unsigned int method, unsigned int rounds)
 	}
 }
 
-struct hash_entry
-{
-	struct hash_entry *next;
-	char key[FLEX_ARRAY];
-};
-
-/*
- * Test performance of hash.[ch]
- * Usage: time echo "perfhash method rounds" | test-hashmap
- */
-static void perf_hash(unsigned int method, unsigned int rounds)
-{
-	struct hash_table map;
-	char buf[16];
-	struct hash_entry **entries, **res, *entry;
-	unsigned int *hashes;
-	unsigned int i, j;
-
-	entries = malloc(TEST_SIZE * sizeof(struct hash_entry *));
-	hashes = malloc(TEST_SIZE * sizeof(int));
-	for (i = 0; i < TEST_SIZE; i++) {
-		snprintf(buf, sizeof(buf), "%i", i);
-		entries[i] = malloc(sizeof(struct hash_entry) + strlen(buf) + 1);
-		strcpy(entries[i]->key, buf);
-		hashes[i] = hash(method, i, entries[i]->key);
-	}
-
-	if (method & TEST_ADD) {
-		/* test adding to the map */
-		for (j = 0; j < rounds; j++) {
-			init_hash(&map);
-
-			/* add entries */
-			for (i = 0; i < TEST_SIZE; i++) {
-				res = (struct hash_entry **) insert_hash(
-						hashes[i], entries[i], &map);
-				if (res) {
-					entries[i]->next = *res;
-					*res = entries[i];
-				} else {
-					entries[i]->next = NULL;
-				}
-			}
-
-			free_hash(&map);
-		}
-	} else {
-		/* test map lookups */
-		init_hash(&map);
-
-		/* fill the map (sparsely if specified) */
-		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
-		for (i = 0; i < j; i++) {
-			res = (struct hash_entry **) insert_hash(hashes[i],
-					entries[i], &map);
-			if (res) {
-				entries[i]->next = *res;
-				*res = entries[i];
-			} else {
-				entries[i]->next = NULL;
-			}
-		}
-
-		for (j = 0; j < rounds; j++) {
-			for (i = 0; i < TEST_SIZE; i++) {
-				entry = lookup_hash(hashes[i], &map);
-				while (entry) {
-					if (!strcmp(entries[i]->key, entry->key))
-						break;
-					entry = entry->next;
-				}
-			}
-		}
-
-		free_hash(&map);
-
-	}
-}
-
 #define DELIM " \t\r\n"
 
 /*
@@ -218,7 +139,6 @@ static void perf_hash(unsigned int method, unsigned int rounds)
  * size -> tablesize numentries
  *
  * perfhashmap method rounds -> test hashmap.[ch] performance
- * perfhash method rounds -> test hash.[ch] performance
  */
 int main(int argc, char *argv[])
 {
@@ -324,10 +244,6 @@ int main(int argc, char *argv[])
 
 			perf_hashmap(atoi(p1), atoi(p2));
 
-		} else if (!strcmp("perfhash", cmd) && l1 && l2) {
-
-			perf_hash(atoi(p1), atoi(p2));
-
 		} else {
 
 			printf("Unknown command %s\n", cmd);
-- 
cgit v1.3


From b6aad994737458177ddf68939719f90e7909f656 Mon Sep 17 00:00:00 2001
From: Karsten Blees <karsten.blees@gmail.com>
Date: Wed, 18 Dec 2013 14:41:27 +0100
Subject: hashmap.h: use 'unsigned int' for hash-codes everywhere

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/api-hashmap.txt | 2 +-
 hashmap.h                               | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
index b2280f1b0d..42ca2347ed 100644
--- a/Documentation/technical/api-hashmap.txt
+++ b/Documentation/technical/api-hashmap.txt
@@ -80,7 +80,7 @@ prevent expensive resizing. If 0, the table is dynamically resized.
 If `free_entries` is true, each hashmap_entry in the map is freed as well
 (using stdlib's free()).
 
-`void hashmap_entry_init(void *entry, int hash)`::
+`void hashmap_entry_init(void *entry, unsigned int hash)`::
 
 	Initializes a hashmap_entry structure.
 +
diff --git a/hashmap.h b/hashmap.h
index f5b3b61073..a816ad47b1 100644
--- a/hashmap.h
+++ b/hashmap.h
@@ -43,7 +43,7 @@ extern void hashmap_free(struct hashmap *map, int free_entries);
 
 /* hashmap_entry functions */
 
-static inline void hashmap_entry_init(void *entry, int hash)
+static inline void hashmap_entry_init(void *entry, unsigned int hash)
 {
 	struct hashmap_entry *e = entry;
 	e->hash = hash;
-- 
cgit v1.3