ocaml - Data structure for occurrence counting in long tail distribution -

- March 15, 2010

i have big list of elements (tens of millions). trying count number of occurrence of several subset of these elements. occurrence distribution long-tailed.

the data structure looks (in ocaml-ish flavor):

type element_key type element_aggr_key  type raw_data = element_key list  type element_stat = {      occurrence : (element_key, int) hashtbl.t; }  type stat = {     element_stat_hashtable : (element_aggr_key, element_stat) hashtbl.t; }

element_stat use hashtable key each elements , value integer. however, inefficient because when many elements have single occurrence, occurrence hashtable resized many times. cannot avoid resizing occurrence hashtable setting big initial size because there many element_stat instances (the size of hashtable in stat big).

i know if there more efficient (memory-wise and/or insertion-wise) data structure use-case. found lot of existing data structure trie, radix tree, judy array. have trouble understanding differences , whether fit problem.

what have here table mapping element_aggr_key tables in turn map element_key int. practical purposes, equivalent single table maps element_aggr_key * element_key int, do:

type stat = (element_aggr_key * element_key, int) hashtbl.t

then have single hash table, , can give huge initial size.

Search This Blog

KBPS

ocaml - Data structure for occurrence counting in long tail distribution -

Comments

Post a Comment

Popular posts from this blog

node.js - StackOverflow API not returning JSON -

python - Subclassed QStyledItemDelegate ignores Stylesheet -

java - HttpClient 3.1 Connection pooling vs HttpClient 4.3.2 -