node.js - Fast repeated row counting in vast data - what format?
My Node.js app needs to index several gigabytes of timestamped CSV data in such a way that it can get the row count for any combination of values, either for each minute in a day (1440 queries) or for each hour in a couple of months (also 1440). Let's say in half a second.
The column values themselves will not be read, only the row counts per interval for a given permutation. Reducing the time resolution to whole minutes is OK. There are rather few possible values per column, between 2 and 10, and some depend on other columns. It's fine to do preprocessing and store the counts in whatever format is suitable for this single task - but what would that format be?
Storing the actual values is probably a bad idea, with millions of rows and little variation.
It might be feasible to generate a short code for each combination and match it with a regex, but since these codes would have to be duplicated for each minute, I'm not sure it's a good approach.
Or I could use an embedded database like SQLite, NeDB or TingoDB, but I'm not entirely convinced, since they don't have native enum-like types and might or might not be made for this kind of counting. But maybe it would work just fine?
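This must be a common problem with an idiomatic solution, but I haven't figured out what it might be called. Knowing what to call this and how to think about it would be very helpful!
I will answer with my own findings for now, but I'm still interested to know more about the theory behind this problem.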
NeDB was not a good solution here, as it saved my values as normal JSON under the hood, repeating the key names for each row and adding unique IDs. It wasted lots of space and would surely have been too slow, if only because of disk I/O.
SQLite might be better at compressing and indexing the data, but I have yet to try it. I will update with my results if I do.
Instead I went with the other approach I mentioned: assign a unique letter to each column value we come across, so that each row becomes a short string representing its permutation. Then, for each minute, add these strings as keys iff they occur, with the number of occurrences as values. We can later use our dictionary to create a regex that matches any set of combinations, and run it on this small index very quickly.
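A minimal sketch of this approach in Node.js, not the actual implementation. It assumes one letter per (column, value) pair, an in-memory `Map` per minute, and a regex built from one character class per column. All function names (`createIndex`, `addRow`, `buildMatcher`, `countPerMinute`) and the sample rows are hypothetical.

```javascript
// Letters and digits only, so symbols never need escaping inside a regex.
const ALPHABET =
  'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';

function createIndex() {
  // symbols: "column\0value" -> letter; minutes: minute -> Map(code -> count)
  return { symbols: new Map(), minutes: new Map() };
}

// Return the letter for a (column, value) pair, assigning one on first sight.
function symbolFor(index, column, value) {
  const key = `${column}\u0000${value}`;
  let symbol = index.symbols.get(key);
  if (symbol === undefined) {
    if (index.symbols.size >= ALPHABET.length) {
      throw new Error('alphabet exhausted');
    }
    symbol = ALPHABET[index.symbols.size];
    index.symbols.set(key, symbol);
  }
  return symbol;
}

// Encode one row as a short code and bump its counter for that minute.
function addRow(index, minute, values) {
  const code = values.map((v, col) => symbolFor(index, col, v)).join('');
  let counts = index.minutes.get(minute);
  if (!counts) {
    counts = new Map();
    index.minutes.set(minute, counts);
  }
  counts.set(code, (counts.get(code) || 0) + 1);
}

// wanted: one entry per column, either an array of accepted values
// or null for "any value".
function buildMatcher(index, wanted) {
  const parts = wanted.map((accepted, col) => {
    if (accepted === null) return '.';
    const chars = accepted
      .map((v) => index.symbols.get(`${col}\u0000${v}`))
      .filter((s) => s !== undefined);
    // '(?!)' never matches: none of the requested values was ever seen.
    return chars.length ? `[${chars.join('')}]` : '(?!)';
  });
  return new RegExp(`^${parts.join('')}$`);
}

// Sum the counts of all matching codes, minute by minute.
function countPerMinute(index, matcher) {
  const result = new Map();
  for (const [minute, counts] of index.minutes) {
    let total = 0;
    for (const [code, n] of counts) {
      if (matcher.test(code)) total += n;
    }
    result.set(minute, total);
  }
  return result;
}

// Hypothetical sample rows with two columns each.
const index = createIndex();
addRow(index, '2015-01-01T00:00', ['GET', 'ok']);
addRow(index, '2015-01-01T00:00', ['GET', 'ok']);
addRow(index, '2015-01-01T00:00', ['POST', 'error']);
addRow(index, '2015-01-01T00:01', ['GET', 'error']);

const perMinute = countPerMinute(index, buildMatcher(index, [['GET'], null]));
console.log(perMinute.get('2015-01-01T00:00')); // 2
console.log(perMinute.get('2015-01-01T00:01')); // 1
```

Since each minute only stores the codes that actually occur, the work per query is bounded by the number of distinct permutations per minute rather than by the number of rows.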
This was easy enough to implement, but it would of course have been trickier if I had had more possible column values than the 70 I found.
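One possible workaround if the number of values outgrows the available letters (an assumption on my part, not something I needed or tested at scale) is to hand out symbols from the Unicode Private Use Area, U+E000 to U+F8FF. That gives 6400 single-character symbols, none of which is a regex metacharacter. The name `makeSymbolAllocator` and the sample values are hypothetical.

```javascript
const FIRST_SYMBOL = 0xe000; // start of the BMP Private Use Area
const LAST_SYMBOL = 0xf8ff;  // end of the BMP Private Use Area

// Returns a function that maps each (column, value) pair to a unique
// one-character symbol, allocating a new code point on first sight.
function makeSymbolAllocator() {
  const symbols = new Map();
  return function symbolFor(column, value) {
    const key = `${column}\u0000${value}`;
    let symbol = symbols.get(key);
    if (symbol === undefined) {
      const codePoint = FIRST_SYMBOL + symbols.size;
      if (codePoint > LAST_SYMBOL) {
        throw new Error('private use area exhausted');
      }
      symbol = String.fromCharCode(codePoint);
      symbols.set(key, symbol);
    }
    return symbol;
  };
}

const symbolFor = makeSymbolAllocator();
const a = symbolFor(0, 'red');   // hypothetical column values
const b = symbolFor(1, 'large');

// The symbols can go straight into a character class, no escaping needed.
const matcher = new RegExp(`^[${a}].$`);
console.log(matcher.test(a + b)); // true
```

The symbols are not human-readable, so the dictionary has to be kept alongside the index to translate codes back into column values.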