ocaml - Data structure for occurrence counting in long tail distribution -
i have big list of elements (tens of millions). trying count number of occurrence of several subset of these elements. occurrence distribution long-tailed.
the data structure looks (in ocaml-ish flavor):
type element_key type element_aggr_key type raw_data = element_key list type element_stat = { occurrence : (element_key, int) hashtbl.t; } type stat = { element_stat_hashtable : (element_aggr_key, element_stat) hashtbl.t; }
element_stat use hashtable key each elements , value integer. however, inefficient because when many elements have single occurrence, occurrence hashtable resized many times. cannot avoid resizing occurrence hashtable setting big initial size because there many element_stat instances (the size of hashtable in stat big).
i know if there more efficient (memory-wise and/or insertion-wise) data structure use-case. found lot of existing data structure trie, radix tree, judy array. have trouble understanding differences , whether fit problem.
what have here table mapping element_aggr_key
tables in turn map element_key
int
. practical purposes, equivalent single table maps element_aggr_key * element_key
int
, do:
type stat = (element_aggr_key * element_key, int) hashtbl.t
then have single hash table, , can give huge initial size.
Comments
Post a Comment