













Principal Data Scientist, Director for Data Science, AI, Big Data Technologies. O’Reilly author on distributed computing and machine learning.



Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.


Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands on system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with a 15+ year track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Wednesday, March 12, 2014

Big Data and APIs

Big Data: more users, more interactions, more data science
photo: by m_namkung
Many complex data science problems, such as sparse matrix factorization, have very concrete applications today. These algorithms form the foundation of the machine learning engines that help us, the users, to filter, categorize and tag information. The data can be processed in two ways: by analysing large chunks of data at once (which mostly translates into lengthy batch operations), or by per-request, event-driven processing (which mostly translates into responsive, reactive APIs and real-time stream computing).
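To make the batch side concrete, here is a minimal sketch of sparse matrix factorization trained with stochastic gradient descent, in plain Python. It is only an illustration, not the method of any particular engine: the function names (`factorize`, `predict`, `rmse`), the `{(user, item): value}` input format and all hyperparameters are assumptions chosen for readability.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, epochs=200,
              lr=0.05, reg=0.02, seed=0):
    """Approximate a sparse matrix, given as {(user, item): value},
    by the product of two small dense factor matrices U and V."""
    rng = random.Random(seed)
    # Small random initial factors (hypothetical choice of range).
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    entries = sorted(ratings.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        # Only the observed entries are visited: this is what makes
        # the factorization "sparse".
        for (u, i), r in entries:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

def predict(U, V, u, i):
    """Estimated value for a (user, item) pair, observed or not."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def rmse(ratings, U, V):
    """Root mean squared error over the observed entries."""
    total = sum((r - predict(U, V, u, i)) ** 2 for (u, i), r in ratings.items())
    return (total / len(ratings)) ** 0.5
```

Each epoch is a full pass over the observed data, which is why this kind of computation sits naturally on the batch side; the resulting factors are small and can then be queried per request through `predict`.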

These two approaches can be combined in a unified architecture in which large inferences and broad analyses are performed in batch mode on the Hadoop YARN framework, while faster, reactive analytics are performed by actor-based, event-driven distributed computing in Akka.
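The reactive half of this architecture can be sketched with a toy actor. The example below uses Python's `asyncio` as a stand-in and is not Akka code: the `CounterActor` class, the `ask` helper and the message format are all hypothetical. It only illustrates the core idea of an actor: private state, a mailbox, and one message handled at a time.

```python
import asyncio

class CounterActor:
    """Toy actor: owns its state and processes mailbox messages one by one."""

    def __init__(self):
        self.mailbox = asyncio.Queue()
        self.counts = {}  # private state, only touched inside run()

    async def run(self):
        while True:
            msg = await self.mailbox.get()
            if msg is None:  # hypothetical "stop" message
                break
            user, reply = msg
            self.counts[user] = self.counts.get(user, 0) + 1
            reply.set_result(self.counts[user])

async def ask(actor, user):
    """Send one event to the actor and wait for its answer."""
    reply = asyncio.get_running_loop().create_future()
    await actor.mailbox.put((user, reply))
    return await reply

async def demo():
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    results = [await ask(actor, u) for u in ["ann", "bob", "ann"]]
    await actor.mailbox.put(None)  # shut the actor down
    await task
    return results

def run_demo():
    return asyncio.run(demo())
```

Because every event is answered as soon as it arrives, latency stays low, at the price of each answer being based only on the state the actor has seen so far.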

These two run-time frameworks have very different latency, throughput, data and compute characteristics. NoSQL and NewSQL distributed database technologies can be used to bridge them.
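One way to picture the bridge is a shared key-value store that both run-times write to and that the request path reads from. The sketch below uses an in-memory dictionary as a stand-in for a distributed database; the `InMemoryStore` class, the `("batch", user)` and `("speed", user)` key layout and the function names are assumptions made for this illustration.

```python
class InMemoryStore:
    """Hypothetical stand-in for a distributed key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

def batch_job(store, history):
    """Slow path: scan the full event history, store per-user item counts."""
    totals = {}
    for user, item in history:
        counts = totals.setdefault(user, {})
        counts[item] = counts.get(item, 0) + 1
    for user, counts in totals.items():
        store.put(("batch", user), counts)

def on_event(store, user, item):
    """Fast path: update a small per-user delta for each incoming event."""
    delta = dict(store.get(("speed", user), {}))
    delta[item] = delta.get(item, 0) + 1
    store.put(("speed", user), delta)

def serve(store, user):
    """Request path: merge the batch view with the real-time delta."""
    merged = dict(store.get(("batch", user), {}))
    for item, n in store.get(("speed", user), {}).items():
        merged[item] = merged.get(item, 0) + n
    return merged
```

For example, after `batch_job(store, [("ann", "a"), ("ann", "b"), ("ann", "a")])` followed by `on_event(store, "ann", "b")`, calling `serve(store, "ann")` returns `{"a": 2, "b": 2}`. The batch job and the event handler never call each other; the store is their only point of contact.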


References


Data science and machine learning

http://en.wikipedia.org/wiki/Matrix_decomposition