Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with a 15+ year track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Wednesday, October 30, 2013

BigML: a 30-minute evaluation using MASS Cars93


BigML belongs to a new wave of cloud-based machine learning solutions. These solutions bet that you would rather pay per use than spend large sums upfront on hosting in-house statistical software and hiring expensive data scientists.

So, I gave it a try. The dataset I took is a classic: Cars93 from the MASS library. This library was originally provided as support material for the book 'Modern Applied Statistics with S' by Venables and Ripley (4th edition, 2002). It is available in R and contains several datasets on which to flex your statistics and data science muscles.

Problem:
Devise a model to predict the type of the car ("Small", "Sporty", "Compact", "Midsize", "Large" or "Van"), given the fields available in the Cars93 dataset.

First things first: I imported the Cars93 dataset into BigML as a comma-separated values file (cars93.csv).
Exporting the dataset from the MASS library to a file takes two lines of R: load the library with library(MASS), then write the Cars93 data frame to disk with write.csv.



Yes. It's that simple. :) Once the file was produced, it was time to upload it to my brand new BigML account. The BigML user experience is easy and intuitive from the start. Uploading the file worked with no problems, and the column names and column types were correctly identified and sorted into numeric and literal fields.


A few clicks further, the source can be converted into a dataset, and a distribution diagram is produced for each field. This happens quickly, and it already provides some insight into the spread of specific fields in the dataset.


To move towards the goal of building a predictor, I split the dataset into a training set (80% of the data) and a test set (20%). A handy slider in the form generates those two datasets from the original one. From this point on, the model can be built using the training dataset, keeping the test dataset for model validation and scoring.
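To make the split concrete, here is a minimal Python sketch of a random 80/20 split over row indices. It is not BigML's implementation: the function name split_indices, the fixed seed and the truncating rounding are my own assumptions. The row count of 93 is the size of the Cars93 dataset.

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row indices and cut them into a train and a test group."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = int(n_rows * train_fraction)  # truncates, e.g. 74.4 -> 74
    return indices[:n_train], indices[n_train:]


# Cars93 has 93 rows; an 80% cut leaves 74 rows for training.
train_idx, test_idx = split_indices(93)
print(len(train_idx), len(test_idx))
```

Every row ends up in exactly one of the two groups, which is what keeps the test score honest.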



Decision tree model. 

I selected "Type" as the target variable. The training set contains 74 instances. The model created by default is a decision tree. It can be tuned further with more advanced options, such as advanced sampling.


Navigating the decision tree of the model is a very smooth experience, with a cute dynamic HTML5 UI. Furthermore, the tree can be filtered by support, confidence and prediction outcome to better understand the generated model. Next to the traditional tree breakdown of the model, you can also visualize the decision tree as circular rings.
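The filtering idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how BigML stores or filters its trees: the leaf records, the field names and the thresholds are all hypothetical. I am assuming here that support is the share of training instances reaching a leaf, and that confidence is a score between 0 and 1 for how reliable the leaf's prediction is.

```python
# Hypothetical leaf records; field names and values are illustrative only.
leaves = [
    {"prediction": "Small", "support": 0.20, "confidence": 0.75},
    {"prediction": "Van", "support": 0.08, "confidence": 0.60},
    {"prediction": "Midsize", "support": 0.25, "confidence": 0.40},
    {"prediction": "Small", "support": 0.03, "confidence": 0.30},
]


def filter_leaves(leaves, min_support=0.0, min_confidence=0.0, prediction=None):
    """Keep leaves that meet both thresholds and, optionally, one predicted class."""
    return [
        leaf
        for leaf in leaves
        if leaf["support"] >= min_support
        and leaf["confidence"] >= min_confidence
        and (prediction is None or leaf["prediction"] == prediction)
    ]


# Leaves that cover enough data and are reasonably reliable.
strong = filter_leaves(leaves, min_support=0.05, min_confidence=0.5)
# All leaves that predict one specific car type.
small_only = filter_leaves(leaves, prediction="Small")
print(len(strong), len(small_only))
```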



Time to test the model. By evaluating the model on the test dataset, we can verify how good it is at identifying car types. You can compare the overall score against the score for a specific car type. You also get the well-known confusion matrix, with false negatives, false positives, and so forth.
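For readers who have not met a confusion matrix before, the sketch below shows how one can be computed from two lists of labels in plain Python. The labels are made up for illustration and are not the results of my evaluation; the helper names are my own.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def per_class_counts(matrix, label):
    """Return (true positives, false positives, false negatives) for one class."""
    tp = matrix[(label, label)]
    fp = sum(n for (a, p), n in matrix.items() if p == label and a != label)
    fn = sum(n for (a, p), n in matrix.items() if a == label and p != label)
    return tp, fp, fn


# Hypothetical labels for illustration only, not results from the evaluation.
actual = ["Small", "Small", "Van", "Midsize", "Midsize", "Large"]
predicted = ["Small", "Compact", "Van", "Midsize", "Large", "Large"]

matrix = confusion_matrix(actual, predicted)
# Overall accuracy: share of rows where the prediction matches the actual type.
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(actual)
print(per_class_counts(matrix, "Large"), round(accuracy, 2))
```

Per-class counts are what let you compare the overall score with the score for a single car type.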

You can also test your model versus a random selection or other simple baseline models.
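One common simple baseline is to always predict the most frequent class in the training data. The Python sketch below shows that idea; whether BigML's baselines are computed exactly this way is an assumption on my part, and the labels are invented for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most frequent training label and score it on the test set."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Hypothetical labels for illustration only.
train_labels = ["Midsize", "Midsize", "Midsize", "Small", "Small", "Van"]
test_labels = ["Midsize", "Small", "Midsize", "Large"]

label, baseline_accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(label, baseline_accuracy)
```

A model is only worth keeping if it beats this kind of trivial predictor on the test set.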

I wanted to try a random forest (called an ensemble model in BigML), but by then my free credits were gone. BigML makes it quite simple to pay as you go.



So after coughing up a few dollars, I created a random forest with 10 decision trees.
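As a reminder of what such an ensemble does, here is a minimal Python sketch of the two ingredients of bagging: each tree is trained on a bootstrap sample of the training rows, and the trees' predictions are combined by plurality vote. This is the textbook recipe, not BigML's code; the helper names and the votes are hypothetical, and the tree training itself is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
train_rows = list(range(74))  # stand-in for the 74 training rows
sample = bootstrap_sample(train_rows, rng)  # one such sample per tree

# Hypothetical votes from 10 trees for a single car.
votes = ["Midsize"] * 6 + ["Compact"] * 3 + ["Large"]
print(len(sample), majority_vote(votes))
```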

As a final test, I compared the confusion matrix of the random forest with that of the decision tree, and the single decision tree was doing better. This is not uncommon on small datasets, and it can be a sign of over-fitting the training set, although with a test set this small the difference may also be noise. I am pretty sure that I could do better by using the advanced tuning parameters while defining the ensemble model. But my 30-minute evaluation stopped there ... Hope you enjoyed it.

Conclusion

I loved the UI. The system works flawlessly and the animations and the visualizations are very fresh.
On a more critical note, I think that if you are used to more advanced statistical tools (R, S, SPSS, Matlab), you will probably miss the flexibility of those environments. I hope that BigML will offer a richer selection of machine learning techniques and algorithms in the future.

All in all, I do welcome more cloud-based web solutions in the machine learning arena. This sort of web application is very compelling for companies that would like to work with machine learning but do not yet have the budget or the ambition to build an in-house data science team and the IT infrastructure it requires.