CQL3 offers a few new ways of dealing with partitions and clustering of wide rows. This blog entry is about how to make the best use of those new features in cassandra using cql3
Primary key syntax:
given a column family:
by defining a
primary key ( (a1, a2, ...), b1, b2, ... )
it's implied that:
a1, a2, ... are fields used to craft a row key in order to:
- determine how the data is partitioned
- determine what is phisically stored in a single row
- referred as row key or partition key
b1, b2, ... are column family fields used to cluster a row key in order to:
- create logical sets inside a single row
- allow more flexible search schemes such as range queries
- referred as column key or cluster key
All the remaining fields are effectively multiplexed for every possible combination of values defined in the partition colum keys. Here below i present an example about howcomposite keys (partition + clustering keys) work.
Example / Use case:
You need to create a list of followers (a-la twitter) and for each follower you would like to report a number of facts, such as when it first started following, the number of private messages exchanged, the number of favourited posts, etc.
Craft your own keys:
You could craft a key such as <followed_id>_<follower_id>, where followed_id and follower_id are some sort of account id's and then define properties for this relation as fields of this column family:
This works great, up to the moment that someone comes in with the request "Can I please get also the count of how many followers are related to a given account?". The main issue here is that since the followed-follower relation is embedded in the key, I cannot directly range search for it. The speed of hashed keys comes at the cost of limited search for instance by proding a partial key.
I could stop here and enable a ByteOrderedPartitioner for my primary partition key. However this is just a last resort which comes with a number of serious drawbacks (quoting datastax):
Can we do something better?
Use a compound row key:
The difficulties about counting the amount of followers are very much related to the way we have crafted our row key. We could actually explicitely pass the to two account ids, respectively follower_id and followed_id instead:
This will store separate rows for each followed_follower pair.
The syntax primary key ( (followed_id, followed_id) ) will create a compound row key ( in the documentation is also referred as the partition key) using the fields followed_id, followed_id. Note the double parentheses: Since these followed_id, followed_id parameters are actually used to craft a row key, this key is still subject to the same limitation of row keys, for instance can only be search with direct access. Since each entry is partitioned by the compound followed_id , follower_id, those entries would actually be three physical rows in the column family.
Use a row keys and column keys:
In other words, all the followers of a given followed account id will be stored in a single internal row, determined by the followed_id key. The result is that the colum key will be used to fold a single row to a number of "internal rows". And because these internal rows are actually set of columns, they can be now sorted and queried using for instance range queries, just as stated above.
The difference with the previous case is the definition of the compound primary key. The way CQL3 works, it will map the first component (followed_id) of the primary key to the internal row key and the second component (follower_id) to the internal cell name. And the remaining CQL3 columns (created, messages_count, likes_count) will be mapped to the cell value. That is how CQL3 allows access to wide rows: by transposing one internal wide rows into multiple CQL3 rows, one per cell of the wide row.
Although it looks very similar to the previous example, it is actually stored in a different way. In this form range queries for instance on the column key are allowed. However when it the data is actually produced by the client driver it looks as if the column keys are just ordinary keys and multiple row are produced. however in the last example those rows are actually folded subset of the same row, hence the name clustering key.