
What is the separation of storage and compute in data platforms and why does it matter?

Apache Spark, Apache Parquet, Amazon S3
  1. Apache Spark is an engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
    Pricing:
    • Open Source
    However, once your data reaches a certain size, or you hit the limits of vertical scaling, it may be necessary to distribute your queries across a cluster, i.e. to scale horizontally. This is where distributed query engines like Trino and Spark come in. A distributed query engine uses a coordinator to plan the query and multiple worker nodes to execute its stages in parallel.

    #Databases #Big Data #Big Data Analytics 56 social mentions
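The coordinator/worker split described above can be sketched in a few lines of Python. This is only an illustration: threads stand in for worker nodes, and the partition data and function names (`worker_partial_sum`, `coordinator_query`) are invented here, not part of Spark or Trino. A real engine plans the query, ships tasks to remote workers, and merges their partial results in much the same scatter-gather shape.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a dataset; in a real cluster these would
# live on separate worker nodes or in object storage.
PARTITIONS = [
    [("us", 3), ("eu", 1), ("us", 2)],
    [("eu", 4), ("ap", 5)],
    [("us", 1), ("ap", 2), ("eu", 3)],
]

def worker_partial_sum(partition):
    """Worker task: aggregate only its own partition (SUM ... GROUP BY region)."""
    totals = {}
    for region, value in partition:
        totals[region] = totals.get(region, 0) + value
    return totals

def coordinator_query(partitions):
    """Coordinator: fan the partial aggregation out, then merge the results."""
    merged = {}
    with ThreadPoolExecutor() as pool:  # threads stand in for worker nodes
        for partial in pool.map(worker_partial_sum, partitions):
            for region, value in partial.items():
                merged[region] = merged.get(region, 0) + value
    return merged

result = coordinator_query(PARTITIONS)  # {'us': 6, 'eu': 8, 'ap': 7}
```

Because each worker only ever sees its own partition, adding more workers (or more partitions) scales the aggregation horizontally; the coordinator's merge step is the only serial part.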

  2. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem.
    Pricing:
    • Open Source
    Apache Parquet is a file format that stores data in a columnar layout, which enables faster analytical reads and smaller files on disk. Not only is the Parquet format open source, but it also has an entire ecosystem of tools to help you create, read and transform data.

    #Databases #Big Data #Relational Databases 19 social mentions
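A minimal, pure-Python sketch of why a columnar layout helps. The sample records are invented, and real Parquet adds row groups, per-column encodings and compression on top of this idea; the point here is just that grouping a column's values together lets a query touch only the columns it needs.

```python
import json

# Hypothetical records, as a row-oriented store would lay them out.
rows = [
    {"id": 1, "city": "Paris",  "temp": 21},
    {"id": 2, "city": "Oslo",   "temp": 9},
    {"id": 3, "city": "Lisbon", "temp": 26},
]

# Columnar layout: each column's values are stored contiguously,
# as Parquet does (minus the encodings and compression we skip here).
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Reading one column touches only that column's data, not whole rows.
avg_temp = sum(columns["temp"]) / len(columns["temp"])

# Runs of similar values also compress well; as a crude size proxy,
# the columnar form avoids repeating the field names per record.
row_size = len(json.dumps(rows))
col_size = len(json.dumps(columns))
```

A column scan like `avg_temp` is the access pattern analytical queries are dominated by, which is why engines such as Spark read Parquet so efficiently.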

  3. Amazon S3 is an object storage service where users can store business data on a safe, cloud-based platform. Amazon S3 operates in 54 availability zones within 18 geographic regions and 1 local region.
    Object Storage: Services like AWS S3 can be used to store files on demand with effectively "infinite" storage. There is no need to provision or manage capacity, and the storage is completely decoupled from the instance or serverless function that is doing the processing.

    #Cloud Hosting #Object Storage #Cloud Storage 172 social mentions
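The decoupling above is the crux of storage/compute separation, and it can be sketched without any cloud dependency. The `ObjectStore` class below is a toy in-memory stand-in for S3 (its `put_object`/`get_object` names merely echo common object-store APIs; real S3 access would go through a client library such as boto3). The jobs and data are invented for illustration: each "compute" function is ephemeral and stateless, so workers can be replaced or scaled to zero while the data persists.

```python
# Toy object store: a flat namespace of (bucket, key) -> bytes/str.
class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put_object(self, bucket, key, body):
        self._objects[(bucket, key)] = body

    def get_object(self, bucket, key):
        return self._objects[(bucket, key)]

# Ephemeral "compute": each function runs, touches the store, and exits.
# No state lives on the compute side.
def ingest(store):
    store.put_object("sales", "2024/01.csv", "region,amount\nus,3\neu,1\n")

def report(store):
    body = store.get_object("sales", "2024/01.csv")
    lines = body.strip().splitlines()[1:]  # skip the CSV header
    return sum(int(line.split(",")[1]) for line in lines)

store = ObjectStore()
ingest(store)          # one short-lived job writes
total = report(store)  # a different job, later, reads the same data
```

Because the report job never depends on the ingest job's machine still existing, the two can run on different clusters, at different scales, at different times, which is exactly what S3-backed data platforms exploit.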
