Software Alternatives, Accelerators & Startups

Apache Spark

Apache Spark is an engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Apache Spark Reviews and details

Weekly Apache Spark live Code Review -- look at StringIndexer multi-col (Scala) & Python testing

What's New in Apache Spark 3.0.0

Apache Spark for Data Engineering and Analysis - Overview

Social recommendations and mentions

We have tracked the following product recommendations or mentions on various public social media platforms and blogs. They can help you see what people think about Apache Spark and what they use it for.
  • How I've implemented the Medallion architecture using Apache Spark and Apache Hadoop
    In this project, I'm exploring the Medallion Architecture, a data design pattern that organizes data into different layers based on structure and/or quality. I'm creating a fictional scenario in which a large enterprise has several branches across the country. Each branch receives purchase orders from an app and delivers the goods to its customers. The enterprise wants to identify the branch that... - Source: / 26 days ago
  • Shades of Open Source - Understanding The Many Meanings of "Open"
    In contrast, Databricks maintains internal forks of Spark, Delta Lake, and Unity Catalog, using the same names for both the open-source versions and the features specific to the Databricks platform. While they do provide separate documentation, online discussions often reflect confusion about how to use features in the open-source versions that only exist on the Databricks platform. This creates a "muddying of the... - Source: / 28 days ago
  • Groovy 🎷 Cheat Sheet - 01 Say "Hello" from Groovy
    Recently I had to revisit the "JVM languages universe" again. Yes, language(s), plural! Java isn't the only language that uses the JVM. I previously used Scala, which is a JVM language, to use Apache Spark for Data Engineering workloads, but this is for another post 😉. - Source: / 4 months ago
  • 🦿🛴Smarcity garbage reporting automation w/ ollama
    Consume the data in third-party software (for example OpenSearch, Apache Spark, or Apache Pinot) for analysis and data science, in GIS systems (so you can put reports on a map), or in any ticket management system. - Source: / 6 months ago
  • Go concurrency simplified. Part 4: Post office as a data pipeline
    Also, this knowledge applies to learning more about data engineering, as this field of software engineering relies heavily on the event-driven approach via tools like Spark, Flink, Kafka, etc. - Source: / 7 months ago
  • Five Apache projects you probably didn't know about
    Apache SeaTunnel is a data integration platform that offers the three pillars of data pipelines: sources, transforms, and sinks. It offers an abstract API over three possible engines: the Zeta engine from SeaTunnel or a wrapper around Apache Spark or Apache Flink. Be careful, as each engine comes with its own set of features. - Source: / 7 months ago
  • Spark – A micro framework for creating web applications in Kotlin and Java
    A JVM-based framework named "Spark", when Apache Spark already exists? - Source: Hacker News / about 1 year ago
  • Rest in Peas: The Unrecognized Death of Speech Recognition (2010)
    You could of course search for yourself, but it's a Python library for interfacing with "Spark", the Apache large-scale data processing framework. - Source: Hacker News / about 1 year ago
  • Integrate Apache Spark and QuestDB for Time-Series Analytics
    Spark is an analytics engine for large-scale data engineering. Despite its long history, it still has its well-deserved place in the big data landscape. QuestDB, on the other hand, is a time-series database with a very high data ingestion rate. This means that Spark desperately needs data, a lot of it! ...and QuestDB has it, a match made in heaven. - Source: / over 1 year ago
  • Query Real Time Data in Kafka Using SQL
    Additionally, one of the challenges of working with Kafka is how to efficiently analyze and extract insights from the large volumes of data stored in Kafka topics. Traditional batch processing approaches, such as Hadoop MapReduce or Apache Spark, can be slow and expensive, and may not be suitable for real-time analytics. To address this challenge, you can use SQL queries with Kafka to analyze and extract insights... - Source: / over 1 year ago
  • Apache Iceberg as storage for on-premise data store (cluster)
    Spark for your transformation compute engine. Get Spark to talk to Nessie. Source: over 1 year ago
  • 5 Best Practices For Data Integration To Boost ROI And Efficiency
    There are different ways to implement parallel dataflows, such as using parallel data processing frameworks like Apache Hadoop, Apache Spark, and Apache Flink, or using cloud-based services like Amazon EMR and Google Cloud Dataflow. It is also possible to use parallel dataflow frameworks to handle big data and distributed computing, like Apache NiFi and Apache Kafka. Source: over 1 year ago
  • Beginner question about transformation
    You should also consider distribution of data because in a company that has machine learning workflows, the same data may need to go through different workflows using different technologies and stored in something other than a data warehouse, e.g. Feature engineering in Spark and loaded/stored in binary format such as Parquet in a data lake/object store. Source: over 1 year ago
  • Patterns for work across computers
    Because I could talk about things like Apache Spark, but you can't properly understand what it's doing until you have the right foundation. Source: over 1 year ago
  • Forward Compatible Enum Values in API with Java Jackson
    We’re not discussing the technical details behind the deduplication process. It could be Apache Flink, Apache Spark, or Kafka Streams. Anyway, it’s out of the scope of this article. - Source: / over 1 year ago
  • Databricks explained for busy engineers | Databricks quick start | Databricks Data Security
    Databricks is built on top of Apache Spark, which provides a fast and general-purpose cluster-computing framework for big data processing. - Source: / over 1 year ago
  • DataOps 101: An Introduction to the Essential Approach of Data Management Operations and Observability
    DataOps is a collaborative effort within an organization, with many different teams of people working together to ensure that DataOps functions properly and delivers data value [3]. So, before the data is delivered to end users, it is subjected to a number of treatments and refinements from multiple teams. Data scientists first use their data science techniques, such as machine learning and deep learning to build... - Source: / over 1 year ago
  • Apache Pulsar vs Apache Kafka - How to choose a data streaming platform
    Both Kafka and Pulsar provide some kind of stream processing capability, but Kafka is much further along in that regard. Pulsar stream processing relies on the Pulsar Functions interface which is only suited for simple callbacks. On the other hand, Kafka Streams and ksqlDB are more complete solutions that could be considered replacements for Apache Spark or Apache Flink, state-of-the-art stream-processing... - Source: / over 1 year ago
  • What is the separation of storage and compute in data platforms and why does it matter?
    However, once your data reaches a certain size or you reach the limits of vertical scaling, it may be necessary to distribute your queries across a cluster, or scale horizontally. This is where distributed query engines like Trino and Spark come in. Distributed query engines make use of a coordinator to plan the query and multiple worker nodes to execute them in parallel. - Source: / over 1 year ago
  • Deequ for generating data quality reports
    AWS documentation: Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is... - Source: / over 1 year ago
  • In One Minute : Hadoop
    Spark, a fast and general engine for large-scale data processing. - Source: / over 1 year ago

External sources with reviews and comparisons of Apache Spark

15 data science tools to consider using in 2021
Apache Spark is an open source data processing and analytics engine that can handle large amounts of data -- upward of several petabytes, according to proponents. Spark's ability to rapidly process data has fueled significant growth in the use of the platform since it was created in 2009, helping to make the Spark project one of the largest open source communities among big data technologies.
Top 15 Kafka Alternatives Popular In 2021
Apache Spark is a well-known, general-purpose, open-source analytics engine for large-scale, core data processing. It is known for its high-performance quality for data processing – batch and streaming with the help of its DAG scheduler, query optimizer, and engine. Data streams are processed in real-time and hence it is quite fast and efficient. Its machine learning competencies are also quite accurate.
5 Best-Performing Tools that Build Real-Time Data Pipeline
Apache Spark is an open-source and flexible in-memory framework which serves as an alternative to MapReduce for handling batch, real-time analytics and data processing workloads. It provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning and graph processing. From its beginning in the AMPLab at UC Berkeley in 2009, Apache Spark has...

This is an informative page about Apache Spark. You can review and discuss the product here. The primary details have not been verified within the last quarter, and they might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated, as they help everyone in the community make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.