Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with a track record of 15+ years in research, development and management of distributed architectures, scalable services and data-driven applications.

Monday, March 3, 2014

Unsupervised Learning: Model Selection and Evaluation

In terms of model validation and selection, supervised learning is easy. You know what the outcome is, you devise the model, you train it, you test it, maybe several times, then you cross-validate it. And there you go, you know how good your model is. You can create model competitions, you can improve your model. Great.

But, wait a second. How do you achieve the same for unsupervised learning?

How to interpret unsupervised learning: from scikit http://scikit-learn.org/stable/index.html
In unsupervised learning, there is no reference result. How good is the model at determining classes? Are those classes what you, as a user, actually want? How do you interpret the quality of a model, and when is one model more discriminating than another?

The term “unsupervised” refers to the fact that there is no “target” to predict and thus nothing resembling an accuracy measure to guide the selection of a best model.

This means that there is no objectively "correct" clustering algorithm; as has been noted, "clustering is in the eye of the beholder." Unless there is a mathematical reason to prefer one cluster model over another, the most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally.

Unsupervised learning can be described in several roughly equivalent ways:
  • Finding groups in data
  • Finding patterns in data
  • A form of data compression
  • A form of dimensionality reduction
Regardless of the definition we choose, one central matter when dealing with unsupervised learning is how to measure the quality of the clustering. What does confidence mean in the context of mapping features to groups in unsupervised learning?

Unsupervised learning can be done in many ways, the most common being neural networks, clustering (k-means, etc.), and dimensionality reduction techniques such as PCA. In what follows, let us call the extracted features a clustering, regardless of how they were obtained.

Clustering classification

Clusterings can be roughly distinguished as:
  • hard clustering: each object belongs to one cluster only. It is a many-to-one mapping: each sample is associated with one and only one cluster.
  • soft clustering: also known as fuzzy clustering. Each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster).
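
To make the distinction concrete, here is a minimal sketch in plain Python. The centroids, the sample point and the softmax-style weighting (with its temperature parameter) are illustrative assumptions, not something prescribed by a particular algorithm; fuzzy c-means or a Gaussian mixture would compute the degrees differently.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_assign(point, centroids):
    """Hard clustering: the index of the single nearest centroid."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_assign(point, centroids, temperature=1.0):
    """Soft clustering: one membership degree per cluster, summing to 1.

    Uses a softmax over negative squared distances, which is only one
    possible (assumed) way of turning distances into degrees.
    """
    distances = [squared_distance(point, c) for c in centroids]
    closest = min(distances)  # subtracted for numerical stability
    weights = [math.exp(-(d - closest) / temperature) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]  # hypothetical cluster centers
point = (1.5, 0.5)                                # hypothetical sample

hard_label = hard_assign(point, centroids)
memberships = soft_assign(point, centroids)
print("hard label:", hard_label)
print("soft memberships:", [round(m, 3) for m in memberships])
```

The hard label is simply the cluster with the highest membership degree, so a soft clustering can always be reduced to a hard one, but not the other way around.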

Measuring quality

Internal evaluation methods

One way is to define an internal evaluation method. Such a method does not rely on any external knowledge; it simply describes a set of desired characteristics of the mapping:
  • by the definition of an optimization function (for instance, minimize SSE in k-means)
  • by creating an error metric

SSE method
Plot the sum of squared errors (SSE) against the number of clusters.
With an optimal fit, SSE decreases monotonically as the number of clusters increases, so the lowest SSE alone cannot pick the model.
The knee points on the curve suggest good candidates for the number of clusters.
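
The SSE curve can be sketched with a small k-means written in plain Python. Everything here is an assumption made for illustration: the toy data (three tight groups), the deterministic farthest-point seeding, and the crude knee heuristic based on the largest second difference. In practice a library implementation of k-means would normally be used instead.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = mean(members)
    return centroids, labels

def sse(points, centroids, labels):
    """Sum of squared distances of each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))

# Toy data: three tight groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]

sse_by_k = {}
for k in range(1, 6):
    centroids, labels = kmeans(points, k)
    sse_by_k[k] = sse(points, centroids, labels)
    print(k, round(sse_by_k[k], 2))

# Crude knee heuristic: the k where the curve bends the most,
# i.e. the largest second difference of the SSE values.
knee = max(range(2, 5),
           key=lambda k: (sse_by_k[k - 1] - sse_by_k[k])
                         - (sse_by_k[k] - sse_by_k[k + 1]))
print("knee at k =", knee)
```

On this toy data the SSE keeps shrinking beyond three clusters, but only marginally, which is exactly the bend in the curve that the knee criterion looks for.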

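Spectral clustering
Measure the eigengap, i.e. the difference between consecutive eigenvalues of the graph Laplacian, and choose the number of clusters that maximizes it.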

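Penalty method
Use a criterion such as the Bayesian Information Criterion (BIC), which rewards goodness of fit but penalizes models with more parameters.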
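
For reference, the usual textbook form of the BIC is given below. The symbols are not from the post itself: p is the number of free parameters of the model, n the number of samples, and \hat{L} the maximized likelihood. With this sign convention, lower values are better.

```latex
\mathrm{BIC} = p \ln n - 2 \ln \hat{L}
```

For clustering, this presupposes a probabilistic model, for instance a Gaussian mixture. Adding clusters raises \hat{L} but also p, and the p \ln n term is the penalty that keeps the model from growing indefinitely.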

Stability-based method

• Stability: a good model repeatedly produces similar clusterings on data originating from the same source.
• A high level of agreement among a set of clusterings suggests that the clustering model (k) is appropriate for the data.
• Evaluate multiple models, and select the model resulting in the highest level of stability.
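
The three points above can be sketched as follows, in plain Python. The choices made here are assumptions for the sake of the example: k-means as the clustering model, random subsamples of 80% of the data as "data from the same source", eight runs per k, and the Rand index as the agreement measure between two clusterings.

```python
import random
from itertools import combinations

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))

def kmeans_centroids(points, k, max_iter=100):
    """Small k-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which the two clusterings agree
    (both put the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

def stability(points, k, n_runs=8, fraction=0.8, seed=0):
    """Average pairwise agreement of clusterings fitted on random subsamples."""
    rng = random.Random(seed)
    labelings = []
    for _ in range(n_runs):
        sample = rng.sample(points, int(fraction * len(points)))
        centroids = kmeans_centroids(sample, k)
        # Label the full data set so that all runs are comparable.
        labelings.append([nearest(p, centroids) for p in points])
    scores = [rand_index(a, b) for a, b in combinations(labelings, 2)]
    return sum(scores) / len(scores)

# Toy data: three well-separated groups of 20 points each.
rng = random.Random(42)
centers = [(0, 0), (10, 0), (5, 9)]
points = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
          for cx, cy in centers for _ in range(20)]

stability_by_k = {k: stability(points, k) for k in range(2, 6)}
for k, score in stability_by_k.items():
    print(k, round(score, 3))
```

On this toy data, k = 3 recovers the same partition on every subsample, so its stability is the maximum possible. Note that stability alone can also be high for a k that is too small, so it is best read together with the other criteria.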

External evaluation methods

If true class labels (ground truth) are known, the validity of a clustering can be verified by comparing the class labels and clustering labels.

Convert it to a supervised model

  • by means of a panel (most of the time made up of human experts)
  • by means of ground truth (for instance, by accessing other data which classifies the samples)

Some methods have been developed for evaluating unsupervised models against ground truth; see the Rand Index and the Normalized Rand Index. Purity and the Normalized Mutual Information index can also be used to assess the quality of the model.
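
As a rough illustration of these measures, here is a plain-Python sketch of the Rand index, purity and normalized mutual information. The label vectors are made up for the example, and the NMI normalization by the arithmetic mean of the two entropies is one of several conventions in use, so treat it as an assumption. The chance-corrected variant of the Rand index is left out for brevity.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(truth, pred):
    """Fraction of sample pairs on which class labels and cluster labels
    agree (pair together in both, or apart in both)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j])
                for i, j in pairs)
    return agree / len(pairs)

def purity(truth, pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    total = 0
    for cluster in set(pred):
        classes = Counter(t for t, p in zip(truth, pred) if p == cluster)
        total += max(classes.values())
    return total / len(truth)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_info(truth, pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(truth)
    count_t, count_p = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum((c / n) * math.log(n * c / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())
    denom = entropy(truth) + entropy(pred)
    return 2 * mi / denom if denom > 0 else 1.0

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # hypothetical ground-truth classes
pred = [1, 1, 0, 0, 0, 0, 2, 2, 2]       # hypothetical cluster labels
relabeled = [5, 5, 5, 7, 7, 7, 9, 9, 9]  # same partition as truth, other names

print("Rand index:", round(rand_index(truth, pred), 3))
print("purity:", round(purity(truth, pred), 3))
print("NMI:", round(normalized_mutual_info(truth, pred), 3))
```

All three measures depend only on how samples are grouped, not on the names of the clusters, which is why a clustering that matches the ground truth up to relabeling still gets the top score.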



Clustering methodologies

Density estimation

Kernels separation

Slides and presentations