I put together a tech talk on Machine Learning and Databricks, the 3rd part of a 9-part Data Science for Dummies series: Data Engineering with Titanic dataset + Databricks + Python.
Data preparation and feature engineering highlighted the importance of domain knowledge, even with something as simple as a 10-column dataset! It also aptly demonstrated how much time is spent ingesting and prepping data for machine learning versus doing the actual modelling. I also get asked how important maths and statistics are for getting started. There's no doubt they are essential for this field; however, I personally enjoy the data engineering/DataOps role and am happy to hand over to a dedicated data scientist when it gets too hairy. It's important for all roles involved to have an idea of the end-to-end workflow. With tools like AutoML, I can focus on data engineering and architecture.
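To give a flavour of the domain-driven feature engineering covered in the talk, here is a minimal pandas sketch. The column names follow the standard Kaggle Titanic schema, but the sample rows and specific steps are illustrative, not the exact notebook from the talk:

```python
import pandas as pd

# Toy rows using the standard Kaggle Titanic columns (hypothetical sample data).
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Embarked": ["S", "C", None],
})

# Domain knowledge: the title embedded in Name ("Mr", "Mrs", "Miss", ...) is a
# useful survival signal and lets us impute missing ages per title group.
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
df["Age"] = df.groupby("Title")["Age"].transform(lambda s: s.fillna(s.median()))
df["Age"] = df["Age"].fillna(df["Age"].median())  # fallback for all-missing groups

# Family size combines siblings/spouses and parents/children plus the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Fill the rare missing embarkation port with the mode, then one-hot encode.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)
```

Even on a tiny 10-column dataset, each of these steps is a judgement call that needs some understanding of what the columns actually mean.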
I’ll be back for Part 2 where we’ll finish the feature engineering and then run the training data through a series of machine learning classifiers to determine which gives the best accuracy.
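Comparing classifiers by accuracy can be sketched with cross-validation. The talk series uses Spark ML and Azure Machine Learning for this; the scikit-learn version below, on synthetic stand-in data, just illustrates the idea:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered Titanic features (hypothetical data).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy per model; the highest mean score wins.
scores = {name: cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
          for name, clf in classifiers.items()}
best = max(scores, key=scores.get)
```

The same loop-over-models pattern carries over to Spark ML pipelines, just with Spark's own evaluators in place of `cross_val_score`.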
PM me if you’d like to give your dev team some technical training on how to get started with Machine Learning or Azure/Databricks/Spark for advanced analytics.
Slides can be found here (note: the PowerPoint animation is not working so well 😉): [slideshare id=150788201&doc=datasciencefordummies-1titanicwithdatabricks-190620044318]
Here’s the rest of the series: https://data-driven.ai/blog/tag/data-science-for-dummies/
- Data Science overview with Databricks
- Titanic survival prediction with Azure Machine Learning Studio + Kaggle
- Data Engineering with Titanic dataset + Databricks + Python
- Titanic with Databricks + Spark ML
- Titanic with Databricks + Azure Machine Learning Service
- Titanic with Databricks + MLS + AutoML
- Titanic with Databricks + MLFlow
- Titanic with .NET Core + ML.NET
- Deployment, DevOps/MLOps and Productionisation