Upcoming talks and demos:

Jupyter Con - New York 23-25 Aug

View Natalino Busa's profile on LinkedIn

Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Thursday, October 3, 2013

Wide column folding in Cassandra using CQL3

CQL3 offers a few new ways of dealing with partitions and clustering of wide rows. This blog entry is about how to make the best use of those new features in cassandra using cql3

Primary key syntax:

given a column family:

by defining a
primary key ( (a1, a2, ...), b1, b2, ... )

it's implied that:

a1, a2, ... are fields used to craft a row key in order to:
  • determine how the data is partitioned
  • determine what is phisically stored in a single row
  • referred as row key or partition key

b1, b2, ... are column family fields used to cluster a row key in order to:
  • create logical sets inside a single row
  • allow more flexible search schemes such as range queries
  • referred as column key or cluster key
All the remaining fields are effectively multiplexed for every possible combination of values defined in the partition colum keys. Here below i present an example about howcomposite keys (partition + clustering keys) work.

Example / Use case:

You need to create a list of followers (a-la twitter) and for each follower you would like to report a number of facts, such as when it first started following, the number of private messages exchanged, the number of favourited posts, etc.

Craft your own keys:

You could craft a key such as <followed_id>_<follower_id>, where followed_id and follower_id are some sort of account id's and then define properties for this relation as fields of this column family:

This works great, up to the moment that someone comes in with the request "Can I please get also the count of how many followers are related to a given account?". The main issue here is that since the followed-follower relation is embedded in the key, I cannot directly range search for it. The speed of hashed keys comes at the cost of limited search for instance by proding a partial key.

I could stop here and enable a ByteOrderedPartitioner for my primary partition key. However this is just a last resort which comes with a number of serious drawbacks (quoting datastax):

Unless absolutely required by your application, DataStax strongly recommends against using the ordered partitioner for the following reasons:
  • Sequential writes can cause hot spots: If your application tends to write or update a sequential block of rows at a time, then these writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.
  • More administrative overhead to load balance the cluster: An ordered partitioner requires administrators to manually calculate token ranges based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
  • Uneven load balancing for multiple column families: If your application has multiple column families, chances are that those column families have different row keys and different distributions of data. An ordered partitioner that is balanced for one column family may cause hot spots and uneven distribution for another column family in the same cluster.

Can we do something better?

Use a compound row key:

The difficulties about counting the amount of followers are very much related to the way we have crafted our row key. We could actually explicitely pass the to two account ids, respectively follower_id and followed_id instead:

This will store separate rows for each followed_follower pair.

The syntax primary key ( (followed_id, followed_id) ) will create a compound row key ( in the documentation is also referred as the partition key) using the fields followed_id, followed_id. Note the double parentheses: Since these followed_id, followed_id parameters are actually used to craft a row key, this key is still subject to the same limitation of row keys, for instance can only be search with direct access. Since each entry is partitioned by the compound  followed_id , follower_id, those entries would actually be three physical rows in the column family.

Use a row keys and column keys: 

Instead of declaring follower_id as part of the compound partition key, you can declar as a colum key:

 In other words, all the followers of a given followed account id will be stored in a single internal row, determined by the followed_id key. The result is that the colum key will be used to fold a single row to a number of "internal rows". And because these internal rows are actually set of columns, they can be now sorted and queried using for instance range queries, just as stated above.

The difference with the previous case is the definition of the compound primary key. The way CQL3 works, it will map the first component (followed_id) of the primary key to the internal row key and the second component (follower_id) to the internal cell name. And the remaining CQL3 columns (created, messages_count, likes_count) will be mapped to the cell value. That is how CQL3 allows access to wide rows: by transposing one internal wide rows into multiple CQL3 rows, one per cell of the wide row.

Although it looks very similar to the previous example, it is actually stored in a different way. In this form range queries for instance on the column key are allowed. However when it the data is actually produced by the client driver it looks as if the column keys are just ordinary keys and multiple row are produced. however in the last example those rows are actually folded subset of the same row, hence the name clustering key.