This tutorial gives you an overview and talks about the fundamentals of Apache Cassandra.
Why NOSQL and What is NOSQL?
The RDBMS has been the de-facto standard for managing data since it first appeared from IBM in the mid-1980s. The RDBMS really exploded in the 1990s with Oracle, Sybase, Microsoft SQL Server, and other similar databases appearing in the data centers of nearly every enterprise – databases you likely use today.
With the first wave of Web applications, open source RDBMS’s such as MySQL and Postgres emerged and became a standard at many companies that desired alternatives to expensive proprietary databases sold by vendors such as Oracle.
However, it wasn’t long before things began to change, and the application and data center requirements of key Internet players like Amazon, Facebook, and Google began to outgrow the RDBMS. The need for more flexible data models that supported agile development methodologies and the requirements to consume large amounts of fast-incoming data from millions of Web and mobile users around the globe – while maintaining extreme amounts of performance and uptime – necessitated the introduction of a new data management platform.
Today, with every company utilizing modern Web and mobile applications, the data problems originally encountered by the Internet giants have become common issues for every company, including yours. This means that you and your team of database administrators must realize that it is no longer a question of if you will be deploying and managing NoSQL database systems, but when, and how much of your company’s data will eventually be stored on NoSQL platforms.
Types of NOSQL Databases
There are different types of NoSQL databases, with the primary difference characterized by their underlying data model and method for storing data. The main categories of NoSQL databases are:
Wide Row Store– Also known as wide-column stores, these databases store data in rows and users are able to perform some query operations via column-based access. A wide-row store offers very high performance and a highly scalable architecture. Examples include: Cassandra, HBase, and Google BigTable.
Key/Value– These NoSQL databases are some of the least complex as all of the data within consists of an indexed key and a value. Examples include Amazon DynamoDB, Riak, and Oracle NoSQL database.
Document– Expands on the basic idea of key-value stores where “documents” are more complex, in that they contain data and each document is assigned a unique key, which is used to retrieve the document. These are designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Examples include MongoDB and CouchDB.
Graph– Designed for data whose relationships are well represented as a graph structure and has elements that are interconnected; with an undetermined number of relationships between them. Examples include: Neo4J and TitanDB.
What is Cassandra
Cassandra is a fully distributed, masterless database, offering superior scalability and fault tolerance to traditional single master databases. Compared with other popular distributed databases like Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with accompanying discussions of what Cassandra has to offer in each category.
Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database’s storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity, and a more powerful server isn’t available, the data set must be sharded among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs.
Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application’s standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read or written to any instance in the cluster, referred to as a node; this node will forward the request to the instance where the data actually belongs.
The result is that Cassandra deployments have an almost limitless capacity to store and process data; when additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes care of rebalancing the existing data so that each node in the expanded cluster has a roughly equal share.
The simplest database deployments are run as a single instance on a single server. This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application’s ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely.
A master-follower architecture improves this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances. The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn’t completely provide high availability.
Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine, and will queue the operations and update the failed node when it rejoins the cluster. This means in a typical configuration, two nodes must fail simultaneously for there to be any application-visible interruption in Cassandra’s availability.
Click for more Information: