When to Use Apache Cassandra, Kafka, Spark, and Elasticsearch (and When Not To)

In almost all technology decision-making, two basic criteria should be considered: the decision should enable you to meet your business goals, and it should work well alongside your other technologies. When it comes to choosing data-layer technologies for an application architecture, open source projects such as Apache Cassandra, Apache Kafka, Apache Spark, and Elasticsearch are popular choices. However, they are not the right choice for everything. Below is a closer look at each of these technologies, the cases they suit, and the cases they do not.

Apache Cassandra

First created by Facebook in 2007, Cassandra combines the Dynamo architecture with the Bigtable data model to provide a NoSQL database with high availability and scalability.

  • When to use Apache Cassandra:

Cassandra is an ideal option for those who need the highest levels of availability. The database is also particularly well suited to organizations that run large workloads, or that want to be sure their services can grow resiliently as workloads expand (they benefit from Cassandra's easy scalability), as well as to those that need data redundancy and active-active operation across several nodes.
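To make the availability point concrete, here is a minimal sketch using the DataStax cassandra-driver package for Python; the contact points, keyspace, and table names are hypothetical examples, not part of the original article.

```python
from cassandra.cluster import Cluster

# Connect through several contact points so the client keeps working
# even if one node is down.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# A replication factor of 3 keeps three copies of every row in the data
# center, which is what enables redundant, active-active operation.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
        customer_id uuid,
        order_id    timeuuid,
        total       decimal,
        PRIMARY KEY (customer_id, order_id)
    )
""")
```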

  • When you should not use Apache Cassandra:

Cassandra is more resource-intensive than the alternatives when it is used as a store for data warehousing or analytics, even with the Spark connectors and the Tableau and Hadoop plugins. It is also unsuitable for real-time analytics (where analysis happens within seconds or minutes of new data arriving), especially ad-hoc or custom queries from end users, because the need to run that logic in application-side code can get complicated. In addition, Cassandra does not meet most ACID requirements.

Apache Kafka

Apache Kafka was first created by the engineering team at LinkedIn. It is a highly scalable and available streaming platform that acts as a message conduit. Kafka works as a distributed log in which new messages are appended to the end of the log, and readers (consumers) read them according to an offset.
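As a rough illustration of that log-and-offset model, here is a minimal sketch using the kafka-python package; the broker address and topic name are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer

# Every send appends the message to the end of the topic's log.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user-42 signed up")
producer.flush()

# Each consumer tracks its own position in the log as an offset;
# 'earliest' replays the log from the first retained message.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```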

  • When to use Apache Kafka:

In general, Apache Kafka is a smart choice for tasks involving microservices and service-oriented architectures. It can also act as a highly effective work queue, coordinating separate lines of work and scaling processing capacity, with workers simply listening and waiting until new work arrives. The platform's stream-processing capabilities include roll-ups and aggregations. Kafka is also a very good option for event sourcing, for reconciling data across microservices, and for providing an external commit log to distributed systems. Other tasks Kafka suits well are log collection, data masking, and data filtering.
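For the work-queue use case, one common pattern is a consumer group: workers that share a group_id split the topic's partitions between them, so adding workers adds capacity. A minimal sketch, again with kafka-python; the topic, group, and handler names are hypothetical.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "jobs",
    bootstrap_servers="localhost:9092",
    group_id="resize-workers",    # workers sharing this id split the partitions
    enable_auto_commit=False,     # commit only after the work has succeeded
)

for message in consumer:
    handle_job(message.value)     # hypothetical job handler
    consumer.commit()             # record progress so finished jobs are not redone
```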

  • When you should not use Apache Kafka:

Although it may be tempting in some cases, using Kafka as a database or as a system of record is not good advice, at least not without a thorough understanding of Kafka's limitations and characteristics for those use cases; working with a real database is almost always easier and more flexible. Kafka is also an inappropriate choice wherever the goal is to move data packets straight to their final destination as fast as possible, such as real-time audio and video, or where occasional loss is acceptable; for that kind of processing, organizations should use other solutions instead of Kafka.

Apache Spark

Apache Spark is a cluster computing framework suited to use cases that require processing large volumes of data. Spark partitions the data and runs computations over the partitions; all the work within a partition can proceed independently as long as it does not require data from other partitions. This design gives Spark tremendous availability and scalability while also making it highly resilient to data loss.
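A minimal PySpark sketch of that partitioned model; the input path is a hypothetical example. Each partition is aggregated independently, and only the partial results are combined.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Spark splits the input into partitions as it reads it.
df = spark.read.json("s3://example-bucket/events/")
print(df.rdd.getNumPartitions())

# The aggregation runs on each partition in parallel; partitions only
# exchange data when the partial counts are combined.
df.groupBy("country").count().show()
```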

  • When to use Apache Spark:

Spark is well suited to tasks involving large-scale analytics, especially cases where data arrives from several different sources. It is a powerful solution for ETL (extract, transform, load), or for any use case that involves moving data between systems. Spark is also a strong fit for organizations building machine learning pipelines on top of existing data and for high-latency workloads with a lot of interactive analysis. In addition, Spark helps with masking (obscuring data), filtering, and validating large data sets, which lets organizations meet their compliance needs.
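As a sketch of the ETL and masking points, assuming a CSV source, an email column, and Parquet output (all hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw records from one system.
raw = spark.read.csv("s3://example-bucket/raw/users.csv", header=True)

# Transform: drop bad rows and mask a sensitive column, the kind of
# step that helps with compliance requirements.
cleaned = (
    raw.filter(F.col("email").isNotNull())
       .withColumn("email", F.sha2(F.col("email"), 256))
)

# Load: write the result into another system or format.
cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean/users/")
```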

  • When you should not use Apache Spark:

In general, Spark is not a good option for tasks involving real-time or low-latency processing. (Apache Kafka or other technologies with lower end-to-end latency are better suited to real-time processing.) Spark is also not a good option when working with small data sets, and when it comes to data warehousing, it is best to use higher-level technologies instead of Apache Spark itself.

Elasticsearch

Elasticsearch offers a full-text search engine with a wide range of capabilities for searching and analyzing unstructured data. The technology delivers linearly scalable, near-real-time search along with significant search and analytics capabilities.
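To show what near-real-time full-text search looks like in practice, here is a minimal sketch with the official elasticsearch Python client; the index name and document fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document; it typically becomes searchable within about a second.
es.index(index="articles", id=1, document={
    "title": "Choosing a data layer",
    "body": "Cassandra, Kafka, Spark and Elasticsearch compared",
})

# Full-text search with relevance-ranked results.
hits = es.search(index="articles", query={"match": {"body": "data layer"}})
print(hits["hits"]["total"])
```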

  • When to use Elasticsearch:

Elasticsearch is a great fit for use cases such as full-text search, geospatial search, scraping and combining public data, logging and log analysis, and visualizations.
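For the log-analysis case, a small aggregation sketch (index and field names hypothetical) that counts log lines per severity level:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

result = es.search(
    index="app-logs",
    size=0,  # skip the documents, return only the aggregation
    aggs={"by_level": {"terms": {"field": "level.keyword"}}},
)
for bucket in result["aggregations"]["by_level"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```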

  • When you shouldn’t use Elasticsearch:

Elasticsearch should not be used as a database or a system of record, with relational data, or to meet ACID requirements.

Selecting complementary technologies

Obviously, evaluating the solutions on their own is not enough to choose the best combination of technologies for an organization. Decision makers also need to picture how the organization will adopt and use each solution as part of its overall technology stack. Apache Cassandra, Apache Kafka, Apache Spark, and Elasticsearch offer a complementary set of technologies that organizations do well to combine, and their open source nature frees users from vendor licence fees and lock-in. By teaming these technologies and drawing on their combined benefits, organizations can achieve their goals and build applications that are scalable, available, portable, and flexible.

In closing,

Mirbazorgi’s website aims to help you learn and solve your problems through articles and practical experience. Email me if you have any questions.