
Apache Parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem.

Apache Parquet Reviews and details

Screenshots and images

  • Apache Parquet landing page (screenshot, 2022-06-17)


Social recommendations and mentions

We have tracked the following product recommendations or mentions on various public social media platforms and blogs. They can help you see what people think about Apache Parquet and what they use it for.
  • [D] Is there other better data format for LLM to generate structured data?
    The Apache Spark / Databricks community prefers Apache Parquet or the Linux Foundation's delta.io over JSON. Source: 5 months ago
  • Demystifying Apache Arrow
    Apache Parquet (Parquet for short) is nowadays an industry standard for storing columnar data on disk. It compresses data with high efficiency and provides fast read and write speeds. As written in the Arrow documentation, "Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files". (See the Arrow round-trip sketch after this list.) - Source: dev.to / 12 months ago
  • Parquet: more than just "Turbo CSV"
    Googling that suggests this page: https://parquet.apache.org/. Source: about 1 year ago
  • Beginner question about transformation
    You should also consider how the data is distributed, because in a company with machine learning workflows the same data may need to go through different workflows using different technologies and be stored in something other than a data warehouse, e.g. feature engineering in Spark with the results loaded/stored in a binary format such as Parquet in a data lake/object store. Source: about 1 year ago
  • Pandas Free Online Tutorial In Python — Learn Pandas Basics In 5 Lessons!
    This section will teach you how to read and write data to and from a variety of file types, including CSV, Excel, SQL, HTML, Parquet, and JSON. You’ll also learn how to manipulate data from other sources, such as databases and websites. Source: about 1 year ago
  • What pandas can do and polars can’t?
    Loading or writing Parquet files is lightning fast. Pandas uses PyArrow (the Python bindings for Arrow) to load Parquet files into memory, but it has to copy that data into pandas' own memory. With Polars there is no extra cost due to copying, as we read Parquet directly into Arrow memory and keep it there. (See the pandas/Polars sketch after this list.) Source: about 1 year ago
  • Help me figure out ETL and storage for user search and click logs. So lost in all the DB alternatives.
    Write a Python script to convert the JSON data you receive to Parquet and put the Parquet data in the S3 bucket. Parquet is an open-source data serialization format. Almost all the major tooling (including pandas) supports it, so even if you move solutions (to Snowflake, etc.) your data will already be set up. It also compresses the data and makes aggregate queries like counts and sums faster. (See the JSON-to-Parquet sketch after this list.) Source: over 1 year ago
  • Transition to cloud. What Warehouse to choose?
    Why are you trying to work with CSVs anyway? Compared to the other options you have in Azure, it's more effective to pull data into your Azure environment and save it as compressed Parquet; you'd most likely save quite a bit of space. Working with CSVs isn't really something people do at large scale unless the source format is CSV - there are far more efficient ways of storing and working with data files. Source: over 1 year ago
  • What is the separation of storage and compute in data platforms and why does it matter?
    Apache Parquet is a file format that stores data in a columnar layout, which allows for faster reads and reduced disk space. Not only is the Parquet format open source, it also has an entire ecosystem of tools to help you create, read, and transform data. (See the column-pruning sketch after this list.) - Source: dev.to / over 1 year ago
  • FOSS, cloud native, log storage and query engine built with Apache Arrow & Parquet, written in Rust and React.
    CLP is interesting; we are already using a columnar format called Parquet, which is open and used by many other platforms, so it'd be interesting to see how CLP stacks up against it. CLP is also harder to integrate with the query engine that we are using (DataFusion). Source: over 1 year ago
  • [D] What are some statistical packages you use in R that aren't available in Python?
    I started my career in Data Science in R, but since my first job required Python, I switched. What I miss the most is ggplot; Python plotting is not there in terms of usability. There are quite a few statistical modeling packages that you can only find in R because that's the language the author knows. Fortunately, R <> Python interoperability is getting better by the day with projects like Parquet; so now it... Source: over 1 year ago
  • Perform computation over 500 million vectors
    I would guess that Apache Spark would be an okay choice, with data stored locally in Avro or Parquet files. Just processing the data in Python would also work, IMO. (See the PySpark sketch after this list.) Source: almost 2 years ago
  • Arrowdantic 0.1.0 released
    Arrowdantic is a small Python library backed by a mature Rust implementation of Apache Arrow that can interoperate with Parquet, Apache Arrow, and ODBC (databases). Source: about 2 years ago
  • Spice.ai v0.6-alpha is now available!
    Spice.ai joins other major projects including Apache Spark, pandas, and InfluxDB in being powered by Apache Arrow. This also paves the way for high-performance data connections to the Spice.ai runtime using Apache Arrow Flight and import/export of data using Apache Parquet. We're incredibly excited about the potential this architecture has for building intelligent applications on top of a high-performance... Source: about 2 years ago
  • How Does The Data Lakehouse Enhance The Customer Data Stack?
    So, having S3, using something like Parquet for storing the data, and using Trino or Spark for processing the data gives us a lean but capable Data Lake. - Source: dev.to / about 2 years ago
  • AWS EMR Cost Optimization Guide
    Data formatting is another place to make gains. When dealing with huge amounts of data, finding the data you need can take up a significant amount of your compute time. Apache Parquet and Apache ORC are columnar data formats optimized for analytics that pre-aggregate metadata about columns. If your EMR queries are column-intensive, with aggregates like sum, max, or count, you can see significant speed improvements by reformatting... (See the filter-pushdown sketch after this list.) - Source: dev.to / over 2 years ago
  • Hydrating a Data Lake using Query-based CDC with Apache Kafka Connect and Kubernetes on AWS
    This post describes how to use Kafka Connect to move data out of an Amazon RDS for PostgreSQL relational database and into Kafka. It continues by moving the data out of Kafka into a data lake built on Amazon Simple Storage Service (Amazon S3). The data imported into S3 will be converted to Apache Parquet columnar storage file format, compressed, and partitioned for optimal analytics performance by Kafka Connect. Source: over 2 years ago
  • Apache Hudi - The Streaming Data Lake Platform
    The following stack captures layers of software components that make up Hudi, with each layer depending on and drawing strength from the layer below. Typically, data lake users write data out once using an open file format like Apache Parquet/ORC stored on top of extremely scalable cloud storage or distributed file systems. Hudi provides a self-managing data plane to ingest, transform and manage this data, in a... - Source: dev.to / almost 3 years ago
  • Please ELI5 what Parquet is for, and NOT for
    I am trying to understand what Apache Parquet is good for. - Source: dev.to / almost 3 years ago
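
The sketches below illustrate some of the technical points from the mentions above. They are minimal, non-authoritative examples: file names, bucket names, and column names are hypothetical, and each assumes the relevant libraries are installed.

Arrow round-trip sketch (for the "Demystifying Apache Arrow" mention): Arrow is the in-memory representation, Parquet the on-disk one, and pyarrow moves data between them.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build an Arrow table in memory, persist it as Parquet, read it back.
    table = pa.table({"city": ["Berlin", "Tokyo"], "temp_c": [21.5, 27.0]})
    pq.write_table(table, "weather.parquet", compression="zstd")

    roundtrip = pq.read_table("weather.parquet")
    assert roundtrip.equals(table)  # lossless round trip through Parquet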
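
Pandas/Polars sketch (for the "What pandas can do and polars can’t?" mention): both libraries read Parquet via Arrow, but pandas copies the data into its own memory while Polars keeps it in Arrow memory.

    import pandas as pd
    import polars as pl

    # pandas reads through PyArrow, then copies the data into pandas memory.
    pdf = pd.read_parquet("data.parquet", engine="pyarrow")

    # Polars reads Parquet directly into Arrow memory and keeps it there.
    pldf = pl.read_parquet("data.parquet")

    print(pdf.head())
    print(pldf.head())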
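
JSON-to-Parquet sketch (for the search-and-click-logs mention): convert incoming JSON to compressed Parquet and put it in S3. Assumes newline-delimited JSON; the file and bucket names are made up.

    import pandas as pd
    import boto3

    # Convert newline-delimited JSON to compressed Parquet.
    df = pd.read_json("clicks.json", lines=True)
    df.to_parquet("clicks.parquet", compression="snappy")

    # Upload the Parquet file to a (hypothetical) S3 bucket.
    s3 = boto3.client("s3")
    s3.upload_file("clicks.parquet", "my-data-lake-bucket", "logs/clicks.parquet")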
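
Column-pruning sketch (for the separation-of-storage-and-compute mention): the "faster reads" of a columnar format come largely from fetching only the columns a query needs instead of scanning whole rows.

    import pyarrow.parquet as pq

    # Read just two columns; the other columns are never touched on disk.
    table = pq.read_table("events.parquet", columns=["user_id", "amount"])
    print(table.num_rows, table.schema)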
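
PySpark sketch (for the 500-million-vectors mention): Spark reads Parquet natively and in parallel, so an aggregate over a large dataset stays simple. The path and column names are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

    # Spark scans the Parquet files in parallel across partitions.
    df = spark.read.parquet("s3a://my-bucket/vectors/")
    df.groupBy("label").agg(F.count("*").alias("n")).show()

    spark.stop()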
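
Filter-pushdown sketch (for the AWS EMR cost-optimization mention): Parquet stores per-row-group min/max statistics for each column, and engines use them to skip row groups that cannot match a filter. With pyarrow this is exposed as a filters argument.

    import pyarrow.parquet as pq

    # Row groups whose min/max statistics cannot satisfy the predicate
    # are skipped without being read.
    table = pq.read_table("metrics.parquet", filters=[("value", ">", 100)])
    print(table.num_rows)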


Generic Apache Parquet discussion


This is an informative page about Apache Parquet. You can review and discuss the product here. The primary details have not been verified within the last quarter and might be outdated. If you think we are missing something, please use the means on this page to comment or suggest changes. All reviews and comments are highly encouraged and appreciated, as they help everyone in the community make an informed choice. Please always be kind and objective when evaluating a product and sharing your opinion.