Speaker: Alena Melnikova
In the Spark 3.0 release, all the built-in file source connectors (Parquet, ORC, JSON, Avro, CSV, and Text) are re-implemented using the new Data Source API V2. We will give a technical overview of how Spark reads and writes these file formats based on user-specified data layouts. We will also present a mechanism for performing Dynamic Partition Pruning at runtime by reusing the dimension table's broadcast results in hash joins, which shows significant improvements for most TPC-DS queries.
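To make the pruning idea concrete, here is a minimal sketch in plain Python, not Spark code and not the implementation covered in the talk. It assumes a toy fact table stored as partitions keyed by a hypothetical date_id column and a small dimension table; every name in it (fact_partitions, dim_dates, build_broadcast, join_with_pruning) is invented for illustration. The point it tries to show is that the hash table built for the broadcast join can double as a partition filter, so fact partitions with no matching dimension key are never scanned.

```python
from collections import defaultdict

# Hypothetical toy data (not from the talk): a fact table laid out as
# partitions keyed by date_id, and a small dimension table.
fact_partitions = {
    1: [{"date_id": 1, "amount": 10}, {"date_id": 1, "amount": 5}],
    2: [{"date_id": 2, "amount": 7}],
    3: [{"date_id": 3, "amount": 20}],
}
dim_dates = [
    {"date_id": 1, "year": 2019},
    {"date_id": 2, "year": 2020},
    {"date_id": 3, "year": 2020},
]


def build_broadcast(dim_rows, predicate):
    """Filter the dimension table and build the hash table that a
    broadcast hash join would ship to every worker."""
    table = defaultdict(list)
    for row in dim_rows:
        if predicate(row):
            table[row["date_id"]].append(row)
    return table


def join_with_pruning(partitions, broadcast):
    """Reuse the broadcast keys as a runtime partition filter, then
    probe the same hash table to produce the join result."""
    scanned = []
    joined = []
    for key, rows in partitions.items():
        if key not in broadcast:
            # Pruned at runtime: this partition is never read.
            continue
        scanned.append(key)
        for fact in rows:
            for dim in broadcast[fact["date_id"]]:
                joined.append({**fact, **dim})
    return scanned, joined


broadcast = build_broadcast(dim_dates, lambda r: r["year"] == 2020)
scanned, joined = join_with_pruning(fact_partitions, broadcast)
total_amount = sum(r["amount"] for r in joined)
print("partitions scanned:", scanned)
print("total amount:", total_amount)
```

In this sketch only the partitions for date_id 2 and 3 are read, because the filter year == 2020 is known only after the dimension table has been evaluated. In Spark 3.0 itself the optimization is controlled by the configuration key spark.sql.optimizer.dynamicPartitionPruning.enabled.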
Alena Melnikova is a Data Engineer at Refinitiv and an inspiring Apache Spark practitioner based in Singapore, working on challenging batch and streaming data pipelines.
Produced by Engineers.SG