Upcoming talks and demos:

Codemotion - Amsterdam - 16 May
DevDays - Vilnius - 17 May
Strata - London - 22 May

View Natalino Busa's profile on LinkedIn
Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning. ​

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Tuesday, April 15, 2014

How to keep Data Science "Scientific"

We talk a lot these days about data science, and how it will pave our paths with beautiful insights and unexpected new relations and connections in our given datasets, and even across datasets.

But how to maintain the "Science" part in "Data Science"? After some time working in this field I appreciate more and more the critical thinking which has characterized the progress in science.

Here below some food for thoughts in a slideshare presentation

Hypothesis, facts, prove and/or disprove the thesis. This is how science has progressed in the past centuries. This method has been formalized by Popper and categorize as non-science all disciplines where the statements cannot be falsified. In other words, if a statement cannot be disproved, we cannot talk of science, since there is no mechanism to left to verify the solution or to prove it wrong.

When that happens the argument can still be accepted, but not scientifically accepted. Ways of accepting or refuting a non falsifiable statement are for instance based on aesthetic, authority or pragmatic or philosophical considerations. All valid but not scientific. This applies for instance to statements in the disciplines of politics, teology, ethics, etc.

Science has definitely progressed since then. For instance, Bayesian networks and statistical inductions are currently part of the arsenal of the (data) scientist weapons. But, no matter how the baseline is set, critical thinking and a rigorous method are definitely helpful in understanding the results produced by science in particular when this is based on large amount of data and computational in nature, rather than formula/model driven.

Data Science has currently many different connotations. On one side it praises the "artistry", the genius of laying out connections between disciplines and concepts. This is a truly great aspect of scientists and creativity is definitely very welcome in all data science profiles.

With the fun of creating new insights and new data golden eggs, a data scientist has to put up with those annoying criteria of reproducibility, falsifiability and peer reviewing. Sometimes these elements are postponed or left behind in name of the artistry. Granted,  it's just hard to find metrics and baselines in order to compare models and data science solutions.  But the scientific method has proven to be solid over the centuries and has proven to allow factual scientific discussion  between scientists and a to allow selection between models based on objective agreed criteria.