Why we don’t use Spark

Big Data Data Dashboard Big Data Tools

Apache Spark Landing Page
1

Apache Spark

Apache Spark is an engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Pricing:
- Open Source
Most people working in big data know Spark (if you don't, check out their website) as the standard tool to Extract, Transform & Load (ETL) their heaps of data. Spark, the successor of Hadoop & MapReduce, works a lot like Pandas, a data science package where you run operators over collections of data. These operators then return new data collections, which allows the chaining of operators in a functional way while keeping scalability in mind.

#Databases #Big Data #Big Data Analytics 70 social mentions
Google Cloud Pub/Sub Landing Page
2

Google Cloud Pub/Sub

Cloud Pub/Sub is a flexible, reliable, real-time messaging service for independent applications to publish & subscribe to asynchronous events.
Pricing:
- Open Source
The Kafka-Dataproc combination served us well for some time, until GCP released its own message queue implementation: Google Cloud Pub/Sub. At the time, we investigated the value of switching. There is always an inherent switching cost, but what we had underestimated with Kafka is that there is a substantial overhead in maintaining the system. This is especially true if the ingested data volume increases rapidly. As an example: the Kafka service requires you to manually shard the data streams while a managed service like Pubsub does the (re)sharding behind the scenes. Pubsub on the other hand also had some downsides, e.g. It didn’t allow for longer-term data retention which can easily be worked around by storing the data on Cloud Storage after processing. Persisting the data and keeping logs on the interesting messages made Kafka obsolete for our use case.

#Stream Processing #Data Integration #Web Service Automation 15 social mentions
Google Cloud Dataproc Landing Page

3

Google Cloud Dataproc

Managed Apache Spark and Apache Hadoop service which is fast, easy to use, and low cost

Specifically, we heavily rely on managed services from our cloud provider, Google Cloud Platform (GCP), for hosting our data in managed databases like BigTable and Spanner. For data transformations, we initially heavily relied on DataProc - a managed service from Google to manage a Spark cluster.

#Data Dashboard #Big Data #Data Warehousing 3 social mentions
Google Cloud Dataflow Landing Page

4

Google Cloud Dataflow

Google Cloud Dataflow is a fully-managed cloud service and programming model for batch and streaming big data processing.

It was clear we needed something that was built specifically for our big-data SaaS requirements. Dataflow was our first idea, as the service is fully managed, highly scalable, fairly reliable and has a unified model for streaming & batch workloads. Sadly, the cost of this service was quite large. Secondly, at that moment in time, the service only accepted Java implementations, of which we had little knowledge within the team. This would have been a major bottleneck in developing new types of jobs, as we would either need to hire the right people, or apply the effort to dive deeper in Java. Finally, the data-point processing happens mainly in our API, making much of the benefits not weigh up against the disadvantages. Small spoiler, we didn't choose DataFlow as our main processor. We still use DataFlow within the company currently, but for fairly specific and limited jobs that require very high scalability.

#Big Data #Data Dashboard #Data Warehousing 14 social mentions