python - What is the efficient way of handling pycassa multiget for 1 million rowkeys -


I'm a complete newbie to Cassandra.

Right now I have my code working for my problem scenario on a relatively small set of data.

However, when I try a multiget on 1 million row keys, it fails with the message "retried 6 times. last failure was timeout: timed out".

e.g.: colfam.multiget([rowkey1, ..., rowkey_million])

Basically, the column family I am trying to query has 1 million records with 28 columns each.

Here I am running a 2-node Cassandra cluster on a single Ubuntu VirtualBox machine with the following configuration:

RAM: 3 GB, Processor: 1 CPU

So how do I handle a multiget over this many row keys efficiently, and also bulk inserts into the same Cassandra column family?

Thanks in advance :) :)

I responded to this on the pycassa mailing list (please try not to post in multiple places), but I'll copy my answer here in case anyone else sees this:

multiget is an expensive operation for Cassandra, because each row in a multiget can require a couple of disk seeks on the server. pycassa automatically splits the query into smaller chunks, but it is still expensive.
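
Even though pycassa chunks internally, splitting the key list yourself lets you control the batch size and process results incrementally instead of holding a million rows in memory. A minimal sketch, assuming a hypothetical keyspace, column family, and chunk size that you would tune for your own setup:

    import pycassa

    # Hypothetical pool and column family names; substitute your own.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    colfam = pycassa.ColumnFamily(pool, 'MyColumnFamily')

    def chunked_multiget(cf, keys, chunk_size=1024):
        """Fetch keys in small batches, yielding (key, columns) pairs."""
        for i in range(0, len(keys), chunk_size):
            batch = keys[i:i + chunk_size]
            # multiget returns an ordered mapping of key -> {column: value}
            for key, columns in cf.multiget(batch).items():
                yield key, columns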

If you're trying to read the whole column family, use get_range() instead.
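
As a rough illustration, reusing the colfam object from the sketch above: get_range() is a generator that streams rows back in pages rather than issuing one lookup per key, so it scales to a full scan of the column family. handle_row() here is a hypothetical placeholder for your own processing:

    # Iterate over every row; rows are fetched lazily, page by page.
    for key, columns in colfam.get_range():
        handle_row(key, columns)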

If you're trying to read a subset of the rows in the column family (based on some attribute) and you need to do this frequently, you probably need to use a different data model.

Since you're new to this, I suggest spending some time learning about data modeling in Cassandra: http://wiki.apache.org/cassandra/DataModel. (Note: some of those examples use CQL3, which pycassa does not support. If you want to work with CQL3 instead, use the new DataStax Python driver: https://github.com/datastax/python-driver)
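
On the bulk-insert half of the question, which the reply above does not cover: pycassa's batch mutator groups many inserts into fewer round trips. A minimal sketch, assuming rows is an iterable of (key, column_dict) pairs and that a queue_size of 200 suits your cluster:

    # Queue up inserts; pycassa flushes automatically every 200 mutations.
    b = colfam.batch(queue_size=200)
    for key, columns in rows:
        b.insert(key, columns)
    b.send()  # flush whatever is still queued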

