Hack On Data

Since July 4th, I have been taking a free, mostly online course on Apache Spark organized by the good people at HackOn(Data). I managed to recruit a few people from my company to take the course as well, thinking that we could help each other out and that a bit of peer pressure to complete the weekly assignments wouldn’t hurt!

The course started off with an in-person session at the LoyaltyOne building in downtown Toronto. I was very impressed by the number of people who showed up; I think there were more than 100 in total. I was also glad to reconnect with some old co-workers whom I hadn’t seen in months or even years.

The self-sustaining model of this course is very impressive. It’s sponsored by companies, such as Fleep and LoyaltyOne, that are looking to recruit data scientists. We use the community edition of Databricks, which is free, and which I guess acts as a promotion for them. At the end of the course, there will be a hackathon where the top team(s) might get interview offers from the sponsors.

Every week, new material is delivered through an online video session and a lab, which must be completed within a week. We also use Slack to communicate among ourselves and with the instructors. The lab itself consists of writing Python code against the Spark API in a Jupyter notebook, which is then submitted as a public URL through Google Forms. Every student then gets an e-mail with 3 randomly selected notebooks from other students that they have to grade. Having several graders per notebook averages out the differences between graders, and it also makes it very unlikely that someone’s work goes ungraded. It’s really neat how all these free services are tied together to provide the infrastructure for the course, and how crowd-sourced marking of the labs keeps the few people who run the course from becoming a bottleneck.
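The random hand-out of notebooks for grading could be sketched roughly as follows. This is only an illustration and not the organizers' actual process: the function names `assign_reviews` and `review_counts`, the placeholder student names, and the choice to have each grader sample independently from everyone else are all assumptions.

```python
import random
from collections import Counter


def assign_reviews(students, per_grader=3, seed=None):
    """Give each student `per_grader` randomly chosen notebooks from other students.

    Assumption: each grader samples independently of the others.
    """
    if len(students) <= per_grader:
        raise ValueError("need more students than notebooks per grader")
    rng = random.Random(seed)
    assignments = {}
    for grader in students:
        # A student never grades their own notebook.
        others = [s for s in students if s != grader]
        assignments[grader] = rng.sample(others, per_grader)
    return assignments


def review_counts(assignments):
    """Count how many graders each student's notebook was assigned to."""
    counts = Counter({student: 0 for student in assignments})
    for picked in assignments.values():
        counts.update(picked)
    return counts


if __name__ == "__main__":
    students = [f"student_{i}" for i in range(100)]  # hypothetical names
    assignments = assign_reviews(students, seed=42)
    counts = review_counts(assignments)
    ungraded = [s for s, n in counts.items() if n == 0]
    print(f"{len(ungraded)} of {len(students)} notebooks received no grader")
```

Because each grader samples independently in this sketch, a notebook can occasionally end up with no grader at all; `review_counts` makes that easy to check for a given run.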

As I go through the course, I’ve also started to tinker with Azure’s Spark offerings and its big data ecosystem, including Azure Data Lake. I will write another post with more details about that, but for now I think I will start by uploading a data set of almost 3 million tick-level temperature and humidity readings that I have been collecting since early June via sensors on my Raspberry Pi at home (more on how I set it up in a later post). I think it’s a neat way to tie the Raspberry Pi hobby project to what I am learning in this course.

Written on July 16, 2017