Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with a 15+ year track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Wednesday, March 12, 2014

Big Data and APIs

Big Data: more users, more interactions, more data science
photo: by m_namkung
Many complex data science problems, such as sparse matrix factorization, have very concrete applications today. These algorithms form the foundation of the machine learning engines that help us, the users, filter, categorize and tag information. Data can be processed in two ways: by analysing large chunks of data at once (which mostly translates into lengthy batch operations), or by event-driven processing per request (which mostly translates into responsive, reactive APIs and real-time stream computing).
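
To make the matrix factorization idea concrete, here is a minimal sketch in plain Python. It assumes the sparse matrix is a set of observed (user, item) ratings stored in a dictionary, and it fits two small factor matrices with stochastic gradient descent over the observed entries only. The data, the function names (factorize, predict, rmse) and the hyperparameters are illustrative assumptions, not taken from any particular engine.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, steps=500, lr=0.02, reg=0.02, seed=0):
    """Approximate a sparse matrix {(user, item): value} as P * Q^T.

    Only observed entries contribute to the updates, which is what makes
    the approach practical for very sparse interaction data.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - sum(P[u][k] * Q[i][k] for k in range(rank))
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on the squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted value for any (user, item) pair, observed or not."""
    return sum(p * q for p, q in zip(P[u], Q[i]))

def rmse(ratings, P, Q):
    """Root mean squared error over the observed entries."""
    se = sum((r - predict(P, Q, u, i)) ** 2 for (u, i), r in ratings.items())
    return (se / len(ratings)) ** 0.5

# Hypothetical toy data: 4 users, 3 items, 8 of the 12 cells observed.
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1,
           (2, 1): 1, (2, 2): 5, (3, 0): 1, (3, 2): 4}

P0, Q0 = factorize(ratings, 4, 3, steps=0)   # untrained factors
P, Q = factorize(ratings, 4, 3)              # trained factors
error_before = rmse(ratings, P0, Q0)
error_after = rmse(ratings, P, Q)
guess = predict(P, Q, 0, 2)                  # an entry that was never observed
```

Run as a batch job, a loop like this scans the whole data set many times; the resulting factors can then answer a single predict call per request, which is where the two processing styles meet.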

These two approaches can be combined in a unified architecture where large inferences and broad analyses are performed in batch mode on Hadoop YARN, while faster, reactive analytics are performed by actor-based, event-driven distributed computing in Akka.

These two run-time frameworks have very different latency, throughput, data and compute characteristics. NoSQL and NewSQL distributed database technologies can be used to bridge them.
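
The combined architecture can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: ServingStore is a hypothetical in-memory stand-in for a distributed database, batch_job stands in for a job that would run on Hadoop, and on_event stands in for the message handler of an actor. None of these names come from Hadoop or Akka.

```python
from collections import Counter

class ServingStore:
    """Hypothetical in-memory stand-in for the bridging database."""

    def __init__(self):
        self.batch_view = {}             # written in bulk by the batch job
        self.realtime_view = Counter()   # updated one event at a time

    def load_batch_view(self, view):
        # Bulk replace: high throughput, high latency. The realtime
        # increments are dropped because the new view already covers them.
        self.batch_view = dict(view)
        self.realtime_view.clear()

    def increment(self, key):
        # Small, low-latency write issued by the event-driven side.
        self.realtime_view[key] += 1

    def read(self, key):
        # A query merges the precomputed view with the recent increments.
        return self.batch_view.get(key, 0) + self.realtime_view[key]

def batch_job(event_log):
    """Recompute the full view from the complete event log (batch side)."""
    return Counter(item for _user, item in event_log)

def on_event(store, event_log, event):
    """Handle a single incoming event (reactive side)."""
    event_log.append(event)
    _user, item = event
    store.increment(item)

# Hypothetical interaction events: (user, item).
event_log = [("alice", "song-a"), ("bob", "song-a"), ("carol", "song-b")]
store = ServingStore()
store.load_batch_view(batch_job(event_log))

on_event(store, event_log, ("dave", "song-a"))
count_before_rebuild = store.read("song-a")   # batch view plus one increment

store.load_batch_view(batch_job(event_log))   # the next batch run catches up
count_after_rebuild = store.read("song-a")
```

The sketch ignores events that arrive while a batch run is in progress; a real deployment would need versioned views or timestamps so that those events are neither lost nor counted twice.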


References


Data science and machine learning

http://en.wikipedia.org/wiki/Matrix_decomposition