-
Apache Spark is an engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.Pricing:
- Open Source
You should also consider distribution of data because in a company that has machine learning workflows, the same data may need to go through different workflows using different technologies and stored in something other than a data warehouse, e.g. Feature engineering in Spark and loaded/stored in binary format such as Parquet in a data lake/object store.
#Databases #Big Data #Big Data Analytics 56 social mentions
-
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem.Pricing:
- Open Source
You should also consider distribution of data because in a company that has machine learning workflows, the same data may need to go through different workflows using different technologies and stored in something other than a data warehouse, e.g. Feature engineering in Spark and loaded/stored in binary format such as Parquet in a data lake/object store.
#Databases #Big Data #Relational Databases 19 social mentions