python - What is the efficient way of handling pycassa multiget for 1 million rowkeys -


I'm a complete newbie to Cassandra.

Right now I have my code working for my problem scenario on a relatively small set of data.

However, when I try a multiget on 1 million row keys, it fails with the message "retried 6 times. last failure was timeout: timed out".

e.g.: colfam.multiget([rowkey1, ..., rowkey_million])

Basically, the column family I am trying to query has 1 million records with 28 columns each.

Here I am running a 2-node Cassandra cluster on a single Ubuntu VirtualBox machine with the following configuration:

RAM: 3 GB, Processor: 1 CPU

So how do I handle a multiget over this many row keys efficiently, and also bulk inserts into the same Cassandra column family?

Thanks in advance :) :)

I responded to this on the pycassa mailing list (please try not to post in multiple places), but I'll copy my answer here in case anyone else sees this:

multiget is an expensive operation for Cassandra, because each row in a multiget can require a couple of disk seeks on the server. pycassa automatically splits the query into smaller chunks, but it is still expensive.
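
Even though pycassa chunks internally, splitting the key list yourself lets you control the batch size and process results incrementally instead of holding a million rows in memory. A minimal sketch, assuming a hypothetical keyspace, column family, and chunk size that you would tune for your own setup:

    import pycassa

    # Hypothetical pool and column family names; substitute your own.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    colfam = pycassa.ColumnFamily(pool, 'MyColumnFamily')

    def chunked_multiget(cf, keys, chunk_size=1024):
        """Fetch keys in small batches, yielding (key, columns) pairs."""
        for i in range(0, len(keys), chunk_size):
            batch = keys[i:i + chunk_size]
            # multiget returns an ordered mapping of key -> {column: value}
            for key, columns in cf.multiget(batch).items():
                yield key, columns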

If you're trying to read the whole column family, use get_range() instead.
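
As a rough illustration, reusing the colfam object from the sketch above: get_range() is a generator that streams rows back in pages rather than issuing one lookup per key, so it scales to a full scan of the column family. handle_row() here is a hypothetical placeholder for your own processing:

    # Iterate over every row; rows are fetched lazily, page by page.
    for key, columns in colfam.get_range():
        handle_row(key, columns)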

If you're trying to read a subset of the rows in the column family (based on some attribute) and you need to do this frequently, you probably need to use a different data model.

Since you're new to this, I suggest spending some time learning about data modeling in Cassandra: http://wiki.apache.org/cassandra/DataModel. (Note: some of those examples use CQL3, which pycassa does not support. If you want to work with CQL3 instead, use the new DataStax Python driver: https://github.com/datastax/python-driver)
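
On the bulk-insert half of the question, which the reply above does not cover: pycassa's batch mutator groups many inserts into fewer round trips. A minimal sketch, assuming rows is an iterable of (key, column_dict) pairs and that a queue_size of 200 suits your cluster:

    # Queue up inserts; pycassa flushes automatically every 200 mutations.
    b = colfam.batch(queue_size=200)
    for key, columns in rows:
        b.insert(key, columns)
    b.send()  # flush whatever is still queued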

