Working with Azure Managed Instance for Cassandra

Use open-source tools to build big data systems that bridge on premises and cloud.

Building cloud-native applications at scale requires choosing your stack carefully. One popular tool is Apache Cassandra, a NoSQL database designed to scale rapidly without affecting application performance. It’s an ideal platform for working with big data, with Hadoop-based MapReduce integration as well as its own query language, CQL. Originally developed at Facebook, it’s since been used at CERN, Netflix, and Uber.

Azure initially offered Cassandra support through DataStax’s offerings in the Azure Marketplace before adding Cassandra API support to its own distributed Cosmos DB, as well as providing guidance for users who wanted to build and deploy their own Cassandra systems on Azure VMs. It’s now developing its own Cassandra implementation, with a public preview of a set of managed instances of Cassandra, designed to work alongside Cosmos DB.

Apache Cassandra on Azure

Cassandra is a distributed database in which nodes share state with one another via a gossip protocol. Nodes run on multiple machines, organized as a data center and deployed as rings of nodes. All nodes are peers, so if any one node is lost, the system can keep operating while a replacement starts. Rings can peer with other rings, too, allowing you to have on-premises systems work with cloud-hosted systems, or one region with others for global resilience. Nodes can be added to or removed from a ring as necessary, offering close to linear scaling. To roughly double performance or capacity, all you need to do is double the number of nodes.
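The ring idea is easier to see in a toy model. The sketch below is not Cassandra’s implementation: it assumes one token per node and uses MD5 as a stand-in for Cassandra’s partitioner, and the `Ring` class and node names are hypothetical. It only illustrates why adding a node moves some keys to the newcomer and leaves the rest where they were.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is a stand-in hash, an assumption)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy token ring: each node owns the keys that hash at or below its token."""

    def __init__(self, nodes):
        self._tokens = sorted((token(node), node) for node in nodes)

    def add_node(self, node: str) -> None:
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key: str) -> str:
        # First node whose token is >= the key's token, wrapping around the ring.
        index = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[index % len(self._tokens)][1]


ring = Ring(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(200)]
before = {key: ring.owner(key) for key in keys}

ring.add_node("node-d")
after = {key: ring.owner(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

A real cluster adds replication and usually many tokens per node, which spreads the load more evenly than this single-token version does.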

Microsoft’s Azure Managed Instance for Apache Cassandra is perhaps best thought of as a way of extending on-premises data into Cosmos DB. There’s been demand for on-premises Cosmos DB since shortly after launch, but its deep integration with the Azure platform makes it hard for Microsoft to separate the two. With integration between Azure’s Cassandra implementation and Cosmos DB, it’s now possible to set up an Azure-hosted Cassandra ring and peer it with both on-premises rings and Cosmos DB. You can now replicate data between on premises and the cloud, taking advantage of Cosmos DB’s capabilities to run global-scale distributed applications while working with local Cassandra instances to handle regulated data operations in your own data center.

There are other advantages to using Managed Instances, as you can hand over much of the day-to-day operations of a Cassandra ring to Azure. It will automatically deliver upgrades and updates, handling patching so your database always runs the most secure version of the software. With less management overhead, you can concentrate on building applications rather than maintaining your stack.

Getting started with Managed Instances

There’s not much difference between setting up and running Azure’s Apache Cassandra and any of its other managed open source databases. Start by logging in to the Azure Portal, then search for Managed Instance for Apache Cassandra to create a cluster.

You’ll need to follow most of the steps for adding an Azure service to a subscription, from adding it to a resource group to choosing a location. At the same time, choose a name and pick a host VM type. In the current preview, you’re limited to DS14_v2 servers, each attached to four P30 disks. These are quite powerful Xeon-based systems, with 16 vCPUs, 112GB of memory, and a 224GB SSD. There’s support for as many as 64 data disks and 8 network cards, with 12,000Mbps of bandwidth. Expect to pay at least $2.11 an hour per server, depending on where you are provisioning the service. P30 disks offer 1TB of storage per disk and cost at least $122.88 a month (with additional charges for mounts).

Running Cassandra in Azure won’t be cheap, but then it’s not for small applications. You’re going to be shifting a lot of data around your application even if you’re only using it as a gateway to Cosmos DB.
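To put a rough number on that, here is a back-of-the-envelope estimate using the floor prices quoted above. It’s a sketch, not a quote: the 730-hour month and the three-node cluster are assumptions, and it ignores mount charges, bandwidth, and regional price differences.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_cost(nodes: int,
                 vm_hourly: float = 2.11,      # DS14_v2 floor price per hour
                 disks_per_node: int = 4,      # four P30 disks per server
                 disk_monthly: float = 122.88  # P30 floor price per month
                 ) -> float:
    """Estimate the minimum monthly cost in dollars for a cluster of `nodes` servers."""
    compute = nodes * vm_hourly * HOURS_PER_MONTH
    storage = nodes * disks_per_node * disk_monthly
    return round(compute + storage, 2)


# A hypothetical three-node cluster comes to about $6,095 a month at these rates.
print(monthly_cost(3))
```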

The next step links your instance to either a new or existing Azure virtual network. Any VNet needs to have internet access, as it needs to link to several different Azure services. These include support for virtual machine scaling, managing encryption keys and certificates, and integrating with Azure’s security and authentication services. If you’re connecting to an existing VNet, you must add appropriate permissions from the Azure CLI; otherwise your deployment will fail.
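As a sketch of what that permissions step can look like, the helper below assembles an `az role assignment create` command for an existing VNet. The choice of the Network Contributor role is an assumption on my part, and the assignee (the service principal that the managed service runs as) is deliberately left as a parameter: take both from the current Azure documentation rather than from this example. The function names are hypothetical.

```python
import subprocess


def vnet_scope(subscription_id: str, resource_group: str, vnet_name: str) -> str:
    """Build the Azure resource ID of a virtual network."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.Network/virtualNetworks/{vnet_name}")


def role_assignment_command(assignee: str, scope: str,
                            role: str = "Network Contributor") -> list:
    """Return the az CLI argument list; the default role is an assumption."""
    return ["az", "role", "assignment", "create",
            "--assignee", assignee, "--role", role, "--scope", scope]


def grant_vnet_access(assignee: str, scope: str) -> None:
    """Run the command; needs the Azure CLI installed and a logged-in session."""
    subprocess.run(role_assignment_command(assignee, scope), check=True)
```

Building the argument list separately from running it lets you print and review the command before anything changes in your subscription.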

You’re now ready to create your cluster. Once it’s deployed, your next step is to create a management virtual machine with support for the Cassandra libraries. This will allow you to use the Cassandra query tools to manage your database, using the admin password you set up when you created the cluster. You can now start to work with Cassandra.
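If you’d rather connect from application code than from the query shell, a minimal sketch with the open source DataStax Python driver (`cassandra-driver`) might look like this. It assumes the driver is installed on your management VM, that the cluster listens on the default CQL port 9042, and that client connections use TLS. The host addresses, username, and function names are placeholders, not values from the service.

```python
import ssl


def build_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Create a TLS context; turn verification off only for test clusters."""
    context = ssl.create_default_context()
    if not verify:
        # Assumption: a preview cluster may present a certificate your VM
        # does not trust. Never disable verification in production.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def connect(hosts, username: str, password: str, verify: bool = True):
    """Open a session using the admin credentials set at cluster creation."""
    # Imported here so the rest of the module loads without the driver.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    cluster = Cluster(
        contact_points=hosts,
        port=9042,
        auth_provider=PlainTextAuthProvider(username=username, password=password),
        ssl_context=build_ssl_context(verify),
    )
    return cluster.connect()


# Hypothetical usage from the management VM:
# session = connect(["10.0.1.4"], "cassandra", "<admin password>")
# print(session.execute("SELECT release_version FROM system.local").one())
```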

Building hybrid clusters in hybrid clouds

If you’re thinking of using Cassandra in Azure as a bridge to Cosmos DB, you need to configure your Azure resources as a hybrid cluster. As before, create and deploy a Cassandra cluster in Azure, setting its name and connecting it to an Azure VNet. You will need to configure Cassandra for node-to-node encryption, so if your on-premises install isn’t using it, enable it. Export your encryption certificates and use the Azure CLI to install them in your Azure-hosted cluster. These will enable your two sites to communicate over encrypted gossip connections.

The VNet will need to connect to your local network, either over dedicated ExpressRoute connections or using a site-to-site VPN. What you use will depend on how much data you intend to ship to Azure, although experimental clusters are likely to use a VPN to avoid the cost of setting up a dedicated multiprotocol label switching (MPLS) connection.

You will need to create a new data center in your managed cluster, using the Azure CLI to get details of its seed nodes. Add these to the configuration of your on-premises system, then define your site-to-site replication strategy. This process is surprisingly simple, needing just a couple of lines in Cassandra’s query language.
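Those couple of lines are typically `ALTER KEYSPACE` statements that switch a keyspace to `NetworkTopologyStrategy` and give each data center a replica count. The helper below generates such a statement as a sketch. The keyspace name, data center names, and replica counts are hypothetical, so substitute the names your own clusters report.

```python
def alter_keyspace_replication(keyspace: str, replicas_by_dc: dict) -> str:
    """Build a CQL statement that replicates a keyspace across data centers."""
    for dc, count in replicas_by_dc.items():
        if not isinstance(count, int) or count < 1:
            raise ValueError(f"replica count for {dc!r} must be a positive integer")
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {count}" for dc, count in replicas_by_dc.items()]
    body = ", ".join(options)
    return f'ALTER KEYSPACE "{keyspace}" WITH REPLICATION = {{{body}}};'


# Hypothetical names: an on-premises data center and the new Azure one.
print(alter_keyspace_replication("orders", {"onprem-dc": 3, "azure-dc": 3}))
```

You would run the resulting statement from the query shell for each keyspace you want replicated, then let Cassandra stream the existing data to the new data center.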

Using Managed Cassandra with other Azure services

One interesting aspect of the service is support for Azure’s Apache Spark–based analytics tool, Databricks. If you install Databricks in the same VNet as your Managed Cassandra service and then use the Apache Spark Cassandra connector to link to your endpoints, you can then use Spark and Databricks notebooks to run analytics on your Cassandra-hosted data.
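In a Databricks notebook, that connection might be sketched as follows. This assumes the Spark Cassandra connector library is attached to the cluster and that the notebook’s predefined `spark` session is passed in. The host address, credentials, keyspace, and table names are placeholders, and the two helper functions are hypothetical.

```python
def cassandra_spark_conf(host: str, username: str, password: str) -> dict:
    """Connector settings for a TLS-enabled Cassandra endpoint (TLS is an assumption)."""
    return {
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.port": "9042",
        "spark.cassandra.connection.ssl.enabled": "true",
        "spark.cassandra.auth.username": username,
        "spark.cassandra.auth.password": password,
    }


def read_table(spark, conf: dict, keyspace: str, table: str):
    """Load a Cassandra table as a Spark DataFrame using the session passed in."""
    for key, value in conf.items():
        spark.conf.set(key, value)
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load())


# Hypothetical usage inside a notebook, where `spark` already exists:
# conf = cassandra_spark_conf("10.0.1.4", "cassandra", "<admin password>")
# orders = read_table(spark, conf, "orders", "by_customer")
# orders.groupBy("customer_id").count().show()
```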

It’s interesting to see how Microsoft’s commitment to hybrid cloud operations translates to working with data. By offering a managed route to running Cassandra, the company provides a natural bridge for NoSQL data between your on-premises tools and the cloud. It’s a two-way connection, enabling local processing of sensitive data while taking advantage of cloud scale for your applications (and eventually expanding into the global scale of Cosmos DB).

Cassandra’s own replication protocols provide the bridge, while Azure ensures that it’s up to date and secure. The result is an effective set of tools that solve many of the problems associated with linking cloud and data center, one that can take advantage of tools like Apache Spark to deliver that data to other Azure services that rely on big data.

Copyright © 2021 IDG Communications, Inc.