PySpark case - using Random Forest for binary classification problem - PyDataSG

Published on: Tuesday, 8 November 2016

Speaker: Weimin Wang

Synopsis: A binary classification problem (product recommendation) solved with PySpark on a Hadoop platform is presented. Using an IPython notebook, the presentation goes through: 1) data pre-processing, 2) using the MLlib random forest classifier for binary classification, 3) measuring performance with the AUC score, and 4) different strategies for handling an unbalanced dataset.
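
To make points 3) and 4) of the synopsis concrete, here is a minimal plain-Python sketch. It is not the speaker's notebook code and does not use Spark or MLlib. The function names `auc_score` and `undersample_majority` are hypothetical, and undersampling the majority class is only one common strategy for unbalanced data, assumed here for illustration; the talk may cover others.

```python
import random


def auc_score(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula.

    Equals the probability that a randomly chosen positive example is
    scored above a randomly chosen negative one (ties count as half).
    """
    pairs = sorted(zip(scores, labels))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        # Find the run of tied scores and give each the average rank.
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def undersample_majority(rows, ratio=1.0, seed=0):
    """Keep every positive row and a random subset of negative rows.

    rows  : list of (label, features) tuples, label 1 assumed to be rare
    ratio : negatives kept per positive (1.0 gives a balanced set)
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r[0] == 1]
    neg = [r for r in rows if r[0] == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    balanced = pos + rng.sample(neg, n_keep)
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    rows = [(1, i) for i in range(5)] + [(0, i) for i in range(95)]
    print(len(undersample_majority(rows)))
```

AUC is a natural metric for this kind of problem because, unlike accuracy, it does not reward a model for always predicting the majority class. When undersampling, the evaluation set should normally keep the original class proportions so that the reported score reflects real-world conditions.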

Speaker: Weimin Wang works as a Data Scientist at Merck Singapore, where he focuses on Advanced Analytics and Bioinformatics Research. With solid knowledge of Data Mining and Machine Learning, Weimin is also actively involved in data science competitions such as Kaggle and Data Science Game. His interests lie in Machine Learning, Deep Learning and Natural Language Processing.

Event Page:

Produced by Engineers.SG
