Cassandra is a distributed storage system for managing structured data that is designed to scale to a very large size across many commodity servers, with no single point of failure. Hence Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different datacenters). At this scale, small and large components fail continuously; the way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service.
Cassandra has achieved several goals – scalability, high performance, high availability and applicability. In many ways Cassandra resembles a database and shares many design and implementation strategies with databases. However, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format.
- Every row is identified by a unique String key with no limit on its size.
- An instance of Cassandra has one table, which is made up of one or more user-defined column families.
- There is no limit on the number of column families, but only a few are expected in practice.
- Each column family can contain one of two structures: supercolumns or columns. There is no limit on the number of either.
- Columns are constructs that have a name, a value and a user-defined timestamp associated with them.
- Supercolumns are constructs that have a name and an unbounded number of columns associated with them.
- Data is distributed across the nodes in the cluster using consistent hashing based on an order-preserving hash function.
- Cluster membership is maintained via a gossip-style membership algorithm.
- High availability is achieved through replication; data is actively replicated across data centers.
- The system scales incrementally: adding capacity is as easy as dropping new nodes into the cluster and having them automatically bootstrapped with data.
Use cases of Cassandra
Persistent caching: The short explanation is that key-value stores are designed to provide much better I/O rates than RDBMSs at the expense of flexibility. If all you need to do is write data and fetch it by a specific key such as a URL (as opposed to performing arbitrary queries over many values with many restrictions), you can handle the same number of operations with much less hardware.
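The access pattern described here, writing a value and fetching it back by a key such as a URL, can be sketched against the data model outlined earlier (row key, column family, column with a user-defined timestamp). The sketch is an in-memory toy, not a Cassandra client: `Table`, `Column`, `insert`, `get` and the `"pages"` column family are hypothetical names, supercolumns are left out, and letting the column with the newer timestamp win is an assumption about how conflicting writes are resolved.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: bytes
    timestamp: int  # supplied by the caller, not by the store


class Table:
    """Toy in-memory table: row key -> column family -> column name -> Column."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = {}

    def insert(self, row_key, family, column):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        columns = self.rows.setdefault(row_key, {}).setdefault(family, {})
        current = columns.get(column.name)
        # Assumption: a write only replaces a column that is not newer.
        if current is None or column.timestamp >= current.timestamp:
            columns[column.name] = column

    def get(self, row_key, family, name):
        column = self.rows.get(row_key, {}).get(family, {}).get(name)
        return None if column is None else column.value


cache = Table(column_families=["pages"])
url = "http://example.com/index.html"
cache.insert(url, "pages", Column("body", b"<html>v1</html>", timestamp=1))
cache.insert(url, "pages", Column("body", b"<html>v2</html>", timestamp=2))
body = cache.get(url, "pages", "body")  # the value written with timestamp 2
```

Every read is a direct lookup by row key, column family and column name; there is no query planner and no join, which is the trade of flexibility for I/O rate that the paragraph above describes.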