In a data science project, one of the biggest bottlenecks (in terms of time) is the constant wait for data processing code to finish executing. Slow code and intermittent connections to web and remote instances affect every step of a typical data science pipeline: data collection, data pre-processing/parsing, feature engineering, and so on. Sometimes the enormous execution times even make a project infeasible, and they often force a data scientist to work with only a subset of the dataset, depriving them of the insights and performance improvements that a larger dataset could provide.
In this talk, I will share common bottlenecks in data processing within a data science pipeline, especially as they appear in a young data science team getting started with real-world data. I will also explore approaches such as parallel processing and Just-In-Time (JIT) compilation that can speed up your data processing code so that you can focus on getting value out of your data.
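As a small taste of the kind of speed-up the talk covers, below is a minimal sketch combining the two approaches mentioned above, JIT compilation and parallel processing, using Numba. The pairwise-distance workload and all names in it are illustrative assumptions, not examples taken from the talk itself; it assumes NumPy and Numba are installed.

    # Minimal sketch: JIT-compile a hot loop and spread it across CPU cores.
    # Assumed toy workload: mean pairwise distance between random 3-D points.
    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)  # compile to machine code; parallelize the outer loop
    def pairwise_mean_distance(points):
        n = points.shape[0]
        total = 0.0
        for i in prange(n):              # prange distributes iterations over cores
            for j in range(i + 1, n):
                d = 0.0
                for k in range(points.shape[1]):
                    diff = points[i, k] - points[j, k]
                    d += diff * diff
                total += np.sqrt(d)       # scalar reduction handled by Numba
        return total / (n * (n - 1) / 2)

    points = np.random.rand(2000, 3)
    print(pairwise_mean_distance(points))  # first call compiles; later calls run at native speed

The same loop written in plain Python would be orders of magnitude slower; the point of the sketch is only to show how little code is needed to apply these techniques, under the assumptions stated above.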