Scaling TB's of data with Apache Spark & Scala DSL at Production - Chetan Khatri - FOSSASIA 2018

Published on: Saturday, 24 March 2018

Speaker: Chetan Khatri

Apache Spark is one of the top big-data processing platforms and has driven the adoption of Scala in many industry and academic settings. Because the entire Apache Spark framework is written in Scala, it is a real pleasure to explore the beauty of the functional Scala DSL with Spark.

This talk intends to present:

Usage of the primary data structures (RDD, Dataset, DataFrame) in general large-scale data processing with HBase (data lake) and Hive (analytical engine).

Case study: the speaker will go through the importance of physical data-splitting techniques such as coalesce, partition, and repartition, along with other important Spark internals, in scaling TBs of data (~17 billion records).

The talk also offers an interesting way to understand a crucial part of parallel and concurrent distributed data processing: tuning memory, cache, disk I/O, memory leaks, internal shuffle, the Spark executor, the Spark driver, and more.

Room: Training room 2-1
Track: Database
Date: Saturday, 24th March, 2018

