Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with a 15+ year track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Tuesday, February 7, 2017

The AI scene in the valley: A trip report

A few weeks back I was lucky enough to attend and present at the Global AI Summit in the bay area. This is my personal trip report about the people I have met and some of the projects and startups I came across.

AI is eating the world.
First of all, let me start by saying that literally *everybody* is doing (or claiming to do) AI in the bay area. AI has inflamed the spirits of pretty much every single software engineer, data scientist, business developer, talent scout, and VC in the greater San Francisco area.

All tools and services presented at the conference embed some form of machine intelligence, and scientists are the new cool kids on the block. Software engineering has probably reached an all-time low in terms of coolness in the bay area, and is regarded almost as the "necessary evil" needed to unleash the next AI interface. This is somewhat counter-intuitive, as Machine Learning and AI are actually more like the raisins in raisin bread, as Peter Norvig and Marcos Sponton say.

What's behind AI?
Good engineering and great business focus are still the foundation of many AI-powered tools and services. In my opinion, there is no silver bullet: AI-powered applications must still be based on good engineering practices if they want to succeed.

We have seen a similar wave during the dot-com bubble at the beginning of the millennium, when web shops were popping up with little understanding of the underlying retail and marketplace businesses. Since then, web applications have matured, and today we value those services as much for their digital journey as for their operational excellence and their ability to deliver. I believe that a similar maturing path will happen for AI-powered applications.

AI is still a very opaque concept. In the worst case it could be just a scripted process; more often it is a set of predictive machine learning models. Because of the vagueness of the term, others are coining new terms in order to differentiate themselves: machine intelligence, cognitive/sentient computing, intelligent computing. Advertising more AI-related terms is not really helping to clarify what is running under the hood. After some digging, startups operating in the AI space are mainly interpreting AI as some form of machine learning (aka weak AI) tailored to very specific tasks.

Today, with some exceptions, the term AI is used to describe Artificial Neural Networks (ANNs) and Deep Learning (DL), mostly related to text, speech, image, and video processing. Putting the hype aside for a moment, without any doubt we can acknowledge that the renaissance of deep learning has contributed to the development of conversational interfaces.

The core of this new generation of services might still be hard-coded or scripted, but the interface is going to be more and more flexible, understanding our spoken, written, and visual cues. This human-centric approach to UIs is definitely going to shape the way we interact with devices. This trend goes under the buzz of Natural/Zero UIs.

Let's go deeper into the stack, away from the front-end and human-machine Natural UIs. Narrow AI, in particular deep learning and hierarchical predictive models, is gaining traction in core data components, in particular for applications such as recommender systems, fraud detection, churn and propensity models, anomaly detection and data auditing.

Before moving on to the following list: I am not associated with any of these companies; however, I did find their approaches worth mentioning and good food for thought for the entrepreneurs and the data people following this blog. So, as always, take the following with a pinch of salt and apply your critical & analytical thinking to it. Enjoy :)

Numenta is tackling one of the most important scientific challenges of all time: reverse engineering the neocortex. Studying how the brain works helps us understand the principles of intelligence and build machines that work on the same principles. They have invested heavily in time series research, anomaly detection and natural language processing. By converting text, and geo-spatial data to time series Numenta can detect patterns and anomalies in temporal data.

Recognos' main product, the "Smart Data Platform", is meant to normalize and integrate data that is stored in unstructured, semi-structured and structured content. The data unification process is driven by the business ontology. Data extraction, taxonomy, semantic tagging and structured data mapping are all steps in this modern approach to data preparation and normalization. Recognos' ultimate goal is to make the data stored in unstructured, semi-structured and structured content usable through a unified semantic meaning and a unified query language.

Appzen is picking up the challenge of automating the auditing of expenses in real-time. This service collects all receipts, tickets, and other documentation provided and produces a full understanding of the who, where, and why of every expense. Appzen's machine learning engine verifies receipts, eliminates duplicates, and searches through hundreds of data sources to verify merchant identities and validate the truthfulness of every expense – ensuring there is no fraud, misuse, or violation of laws.

Talla is a lightweight system which plugs into messaging systems as a virtual team member, executing on-boarding, e-learning, and team-building process flows, engaging with the various team members, and taking over tasks usually done by team managers, scrum masters, and team facilitators. It employs a combination of natural language processing, robotic process automation, and user- and company-defined rules.

Inbenta has built an extensive multilingual knowledge base supporting over 25 native languages and counting, including English, Spanish, Italian, Dutch and German. Its NLP engine understands the nuances of human conversation, so it answers a query based on meaning, not individual keywords. This is a good example of a company which relies on good business development and a great core team of linguists and language experts. By combining these elements with NLP and Deep Learning techniques, they can power chatbots, email management, surveys and other text-based use cases for a number of verticals.


Lymba is also tackling textual information, with the goal of extracting insights and non-trivial bits of knowledge from a semantic graph of heterogeneous linked data elements. Lymba offers transformative capabilities that enable the discovery of actionable knowledge from structured and unstructured data, providing businesses with invaluable insights. Lymba has developed a technology for deep semantic processing that extracts semantic relations between all concepts in a text. One of Lymba's products, the "K-extractor", enables intelligent search, question answering, document summarization, generation of scientific profiles, etc.


Jetlore is bringing personalization one step further by creating websites and mobile apps which are extremely tailored to each user, both in terms of content as well as layout, color, highlights, promotions, images, and offers. All site assets are ranked individually for each customer, and selected at the time of interaction based on the layout's configuration. Jetlore can select the best categories, brands, and collections of products from your inventory for each user, and automatically feature the best images to represent them. 


Ownerlisten is a smart message pipelining solution, rerouting messages in a given organization to the right person or process depending on the nature of the message. Users can filter and process messages combining business rules as well as automated text processing. This is essential in businesses where machine-learned models might not provide sufficient accuracy for certain topics. Ownerlisten is another good example of how AI and NLP can be organically combined with user-defined messaging and communication flows. By combining domain expertise, solid engineering and NLP engines, Ownerlisten can deliver very smooth user and customer journeys in a number of different industries and use cases.


This list would not be complete without at least one company offering AI, Machine Learning, data plumbing, data engineering, API and application engineering services. Software and data engineering might be less cool than the land of scientists, but they are still the backbone on top of which all those awesome solutions and products are built. Data Monster is one of those great studios accelerating MVP and product development, with a strong affinity for data processing at scale and all the right techs in the basket (Scala, Python, R, Java, JavaScript, Hadoop, Spark, Hive, Play, Akka, MySQL, PostgreSQL, AWS, Cassandra, etc.), and as a SMACK stack fan I cannot disagree with their list!

I finish this post mentioning and thanking a number of great people I have met during this trip, for their charisma and inspiring ideas and conversations: Hamid Pirahesh, David Talby, Alexy Khabrov, Alexander Tsyplikhin, Christopher Moody, Michael Feng, Delip Rao, Eldar Sadikov, Michelle Casbon, Mustafa Eisa, Ahmed Bakhaty, Adi Bittan, Jordi Torras, and Francesco Corea.

Thursday, February 2, 2017

Data Science Q&A: Natalino Busa


I was kindly asked by Prof. Roberto Zicari to answer a few questions on Data Science and Big Data for www.odbms.org - Let me know what you think of it; looking forward to your feedback in the comments below. Cheers, Natalino

Q1. Is domain knowledge necessary for a data scientist?

It’s not strictly necessary, but it does not harm either. You can produce accurate models without having to understand the domain. However, some domain knowledge will speed up the process of selecting relevant features and will provide a better context for knowledge discovery in the available datasets.

Q2. What should every data scientist know about machine learning?

First of all, the foundations: statistics, algebra and calculus. Vector, matrix and tensor math is absolutely a must. Let’s not forget that datasets, after all, can be handled as large matrices! Moving on to the topic of machine learning specifically: a good understanding of the role of bias and variance for predictive models, understanding the reasons for model and parameter regularization, model cross-validation techniques, data bootstrapping and bagging. Also, I believe that cost-based, iterative gradient optimization methods are a must, as they implement the “learning” for four very powerful classes of machine learning algorithms: GLMs, boosted trees, SVMs and kernel methods, and neural networks. Last but not least, an introduction to Bayesian statistics, as many machine learning methods can also be framed in Bayesian terms.
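To make the bias/variance and regularization points concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (not data from the interview): it sweeps the regularization strength of a logistic regression, a regularized GLM, and scores each setting with 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification problem (stand-in for a real dataset).
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=42)

# C is the inverse regularization strength of this regularized GLM:
# small C = strong L2 penalty (more bias), large C = weak penalty (more variance).
for C in (0.001, 0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, penalty="l2", max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"C={C:<6} mean AUC={scores.mean():.3f} +/- {scores.std():.3f}")
```

Too much regularization underfits (high bias), too little overfits (high variance); cross-validation makes the trade-off visible.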

Q3. What are the most effective machine learning algorithms?

Regularized Generalized Linear Models, and their further generalization as Artificial Neural Networks (ANNs), Boosted and Random Forests. Also, I am very interested in dimensionality reduction and unsupervised machine learning algorithms, such as T-SNE, OPTICS, and TDA.

Q4. What is your experience with data blending?

Blending data from different domains and sources might increase the explanatory power of the model. However, it’s not always easy to determine beforehand if this data will improve the models. Data blending provides more features, and they may or may not be correlated with what you wish to predict. It’s therefore very important to carefully validate the trained model using cross-validation and other statistical methods, such as variance analysis, on the augmented dataset.
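A hedged sketch of how such a validation could look in practice (synthetic data; the feature split into "in-house" and "external" is illustrative, not from the interview): train the same model on the original features and on the blended feature set, and compare cross-validated scores.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
X_base = rng.normal(size=(n, 8))     # in-house features
X_ext = rng.normal(size=(n, 4))      # blended-in external data source
y = (X_base[:, 0] + 0.5 * X_ext[:, 0]
     + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0)
base_auc = cross_val_score(model, X_base, y, cv=5, scoring="roc_auc").mean()
blend_auc = cross_val_score(model, np.hstack([X_base, X_ext]), y,
                            cv=5, scoring="roc_auc").mean()
print(f"baseline AUC: {base_auc:.3f}   blended AUC: {blend_auc:.3f}")
# Keep the blended source only if the lift holds up across folds (and over time).
```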

Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?

Let’s tackle feature extraction and feature engineering separately. Extraction can be as simple as getting a number of fields from a database table, and as complicated as extracting information from a scanned paper document using OCR and image processing techniques. Feature extraction can easily be the hardest task in a given data science engagement.
Extracting the right features and raw data fields usually requires a good understanding of the organization, the processes and the physical/digital data building blocks deployed in a given enterprise. It’s a task which should never be underestimated as usually the predictive model is just as good as the data which is used to train it.

After extraction, there comes feature engineering. This step consists of a number of data transformations, oftentimes dictated by a combination of intuition, data exploration, and domain knowledge. Engineered features are usually added to the original samples’ features and provided as the input data to the model.

Before the renaissance of neural networks and hierarchical machine learning, feature engineering was essential, as the models were too shallow to properly transform the input data within the model itself. For instance, decision trees can only split the data space along the features’ axes; therefore, to correctly classify donut-shaped classes you will need feature engineering to transform the space to polar coordinates.
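The donut example can be reproduced in a few lines. This sketch (scikit-learn, synthetic make_circles data) shows a shallow decision tree struggling on raw x/y coordinates and succeeding once a radius feature is engineered:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Two concentric rings ("donut" classes).
X, y = make_circles(n_samples=1000, factor=0.5, noise=0.05, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)

# Raw Cartesian features: a shallow tree can only cut along x and y.
raw_acc = cross_val_score(tree, X, y, cv=5).mean()

# Engineered polar feature: the radius makes the classes separable with one split.
radius = np.sqrt((X ** 2).sum(axis=1)).reshape(-1, 1)
polar_acc = cross_val_score(tree, radius, y, cv=5).mean()

print(f"accuracy on raw x/y: {raw_acc:.3f}")
print(f"accuracy on radius:  {polar_acc:.3f}")
```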

In recent years, however, models usually have multiple layers, as machine learning experts are deploying increasingly “deeper” models. Those models can usually “embed” feature engineering as part of the internal state representation of the data, rendering manual feature engineering less relevant. For some examples applied to text, check the section “Visualizing the predictions and the “neuron” firings in the RNN” in The Unreasonable Effectiveness of Recurrent Neural Networks. These models are also usually referred to as “end-to-end” learning, although this definition is still vague and not unanimously accepted in the AI and Data Science communities.

So what about feature engineering today? Personally, I do believe that some feature engineering is still relevant to build good predictive systems, but it should not be overdone, as many features can now be learned by the model itself, especially in the audio, video, text, and speech domains.

Q6. Can data ingestion be automated?

Yes. But beware of metadata management. In particular, I am a big supporter of “closed loop” analytics on metadata, where changes in the data source formats or semantics are detected by means of analytics and machine learning on the metadata itself.
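As a simplified, rule-based illustration of that closed loop (the interview suggests analytics and machine learning on the metadata itself; this toy sketch only compares a batch's schema metadata against expectations, and the field names are hypothetical):

```python
expected_schema = {"customer_id": "int", "amount": "float", "currency": "str"}

def schema_drift(batch_schema: dict) -> list:
    """Return human-readable findings about drift in a newly ingested batch."""
    findings = []
    for column, dtype in expected_schema.items():
        if column not in batch_schema:
            findings.append(f"missing column: {column}")
        elif batch_schema[column] != dtype:
            findings.append(f"type change on {column}: {dtype} -> {batch_schema[column]}")
    for column in sorted(set(batch_schema) - set(expected_schema)):
        findings.append(f"new column appeared: {column}")
    return findings

# Example: the source silently dropped a field and changed a type.
print(schema_drift({"customer_id": "int", "amount": "str", "iban": "str"}))
```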

Q7. How do you ensure data quality?

I tend to rely on the “wisdom of the crowd” by implementing similar analyses using multiple techniques and machine learning algorithms. When the results diverge, I compare the methods to gain insight about the quality of both the data and the models. This approach also works well to validate the quality of streaming analytics: in this case the historical batch data can be used to double-check the results produced in streaming mode, providing, for instance, end-of-day or end-of-month reporting for data correction and reconciliation.
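One possible, simplified reading of this “wisdom of the crowd” check in code (scikit-learn, synthetic data): fit a few different model families on the same data and flag the records on which they disagree the most as candidates for a data-quality review.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=100, random_state=1),
          SVC(probability=True, random_state=1)]

# Score every record with each model and measure how much the models disagree.
probas = np.column_stack([m.fit(X, y).predict_proba(X)[:, 1] for m in models])
disagreement = probas.std(axis=1)

# The most contested records are good candidates for a data-quality review.
suspects = np.argsort(disagreement)[-10:]
print("records to review:", suspects.tolist())
```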

Q8. What techniques can you use to leverage interesting metadata?

Fingerprinting is definitely an interesting field for metadata generation. I have worked extensively in the past on audio and video fingerprinting. However, this technique is very general and can be applied to any sort of data: structured data, time series, etc. Data fingerprinting can be used to summarize web pages retrieved by users or to characterize the nature of data flows in network traffic. I also often work with time (event time, stream time, capture time), network data (IP/MAC addresses, payloads, etc.) and geolocated information to produce rich metadata for my data science projects and tutorials.
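As a toy illustration of fingerprinting on structured data (a hypothetical helper, far simpler than the audio/video techniques mentioned above): reduce each record to a short, stable hash so duplicates and recurring payloads can be spotted in the metadata.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Stable fingerprint: canonical JSON of the record, hashed and truncated."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:16]

event = {"src_ip": "10.0.0.7", "url": "/checkout", "ts": "2017-01-12T10:00:00Z"}
print(fingerprint(event))  # identical payloads always map to the same fingerprint
```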

Q9. How do you evaluate if the insight you obtain from data analytics is "correct" or "good" or "relevant" to the problem domain?

Initially, I interact with domain experts for a first review of the results. Subsequently, I make sure that the model is brought into “action”. Relevant insights, in my opinion, can always be assessed by measuring their positive impact on the overall application. If human interaction is in the loop, the easiest method is actually to measure the impact of the relevant insights on the users’ digital journey.

Q10. What were your most successful data projects? And why?

1. Geolocated data pattern analysis, because of its application to fraud prevention and personalized recommendations. 2. Time series analytics for anomaly detection and forecasting of temporal signals, in particular for enterprise processes and KPIs. 3. Converting images to features, because it allows images/videos to be indexed and classified using standard BI tools.

Q11. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?

Aggregating too much will most of the time “flatten” signals in large datasets. To prevent this, try using more features and/or a finer segmentation of the data space. Another common problem is “burying” the signals provided by a small class of samples under those of a dominating class. Models discriminating unbalanced classes tend to perform worse as the dataset grows. To solve this problem, try to rebalance the classes by applying stratified resampling, weighting the results, or boosting on the weak signals.
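A small sketch of the rebalancing advice (scikit-learn, synthetic 99:1 imbalance): compare a plain model against one trained with balanced class weights.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 1% positive class: the minority signal risks being "buried" by the majority.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.99, 0.01], random_state=7)

plain = LogisticRegression(max_iter=1000)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")

print("plain recall:   ", cross_val_score(plain, X, y, cv=5, scoring="recall").mean())
print("weighted recall:", cross_val_score(weighted, X, y, cv=5, scoring="recall").mean())
# Stratified resampling or boosting on the minority class are alternatives
# when reweighting alone is not enough.
```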

Q12. What are the ethical issues that a data scientist must always consider?

1. Respect individual privacy and possibly enforce it algorithmically. 2. Be transparent and fair on the use of the provided data with the legal entities and the individuals who have generated the data. 3. Avoid, as much as possible, building models that discriminate or score based on race, religion, sex, age, etc., and be aware of the implications of reinforcing decisions based on the available data labels.

On this last point, I would like to close this interview with an interesting idea around “equal opportunity” for ethical machine learning. This concept is visually explained on the Google Research page "Attacking discrimination with smarter machine learning", based on a recent paper by Hardt, Price, and Srebro.

Thursday, January 12, 2017

Streaming Analytics for Chain Monitoring

Many enterprises are moving from batch to streaming data processing. This engineering innovation provides great improvements to many enterprise data pipelines, both on the primary processes such as front-facing services and core operations, as well as on secondary processes such as chain monitoring and operational risk management.


Streaming Analytics is the evolution of Big Data, where data throughput (velocity) and low latency are important business KPIs. In such systems, data signals are ingested and produced at high speed, often in the range of millions of events per second. On top of that, the system still has to operate on large volumes of heterogeneous resources, and it must execute complex processing to verify the completeness and accuracy of the data. Finally, the output and data transformations must be produced fast enough to be relevant and actionable.
Batch vs Streaming
A batch chain is normally a series of transformations which happen sequentially, from source data to final results. Data moves one batch at a time from one step to the next one. Batch systems usually rely on schedulers to trigger the next step(s) in the pipeline, depending on the status of the previous step.

This approach suffers from a number of limitations:
  • It usually introduces unnecessary latency from the moment the initial data is provided to the moment the results are produced. If those results were in fact insights, they might lose their "actionable" power because it is already too late to act.
  • Responses and results are delivered after the facts, and the only analysis which can be done is a retrospective analysis; it is too late to steer or correct the system, or to avoid incidents in the pipeline.
  • Decisions are made on results from aged or stale data, and they might be incorrect as the results no longer reflect the state of the system. This could produce over- and under-steering of the system.
  • Data is at rest. This is not necessarily a drawback, but batch systems tend to be passive, with time spent extracting and loading data from file systems to databases and back, with peaks and congestion on the enterprise network rather than a continuous flow of data.
A queue-centric approach to data processing
From 10'000 feet high, a streaming analytics system can best be described as a queue. This logical, distributed queue connects agents producing data to those consuming data. Many components can function both as sources and sinks for data streams. By highlighting the queue rather than the processing, we stress the fact that data is flowing, and data processing is always interleaved with data transfer.
The same data element on the queue can potentially be required by many consumers. The pattern that best describes this is the publisher/subscriber pattern.

As the data transiting on the queue can be consumed at different rates, such a queue should also provide a certain persistence, acting as a buffer while producers and consumers are free to access the data independently of one another.
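A minimal publisher/subscriber sketch of this queue-centric view, assuming a local Kafka broker and the kafka-python client (topic name, broker address and payload are illustrative):

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer side: a source agent publishes events onto the distributed queue.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("etl-metadata", b'{"job": "load_customers", "status": "ok"}')
producer.flush()

# Consumer side: any number of independent subscribers read at their own pace;
# the broker buffers the log so producers and consumers stay decoupled.
consumer = KafkaConsumer("etl-metadata",
                         bootstrap_servers="localhost:9092",
                         group_id="chain-monitoring",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
```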
Streaming Analytics: Definition
Here below are a couple of definitions which are widely accepted in the industry:
"Continuous processing on unbounded tables" - Apache Flink, Data Artisans
"Software that can filter, aggregate, enrich and analyze a high throughput of data from multiple disparate live data sources and in any data format to identify simple and complex patterns to visualize business in real-time, detect urgent situations, and automate immediate actions" - Forrester
Streaming Analytics provides the following advantages w.r.t batch processing: 

  • Events are analyzed and processed in real-time as they arrive
  • Decisions are timely, contextual, and based on fresh data
  • The latency from raw events to actionable insights is small
  • Data is in motion and flows through the architecture
Furthermore, batch processing can be easily implemented on streaming computing architectures, by simply scanning the files or datasets. The opposite is not always possible, because the latency and processing overhead of batch systems is usually not negligible when handling small batches or single events.
Streaming Analytics: Events Streams vs Micro-Batches
Next to latency and throughput, another important parameter which defines a streaming analytics system is the granularity of processing. If the system handles streams one event at a time, we define it as an event-based streaming architecture; if the stream gets consumed in packets/groups of events, we call it a micro-batching streaming architecture. In fact, you could consider a batch pipeline a streaming architecture, albeit a very slow one, handling the streaming data in very large chunks!
The following two pictures give an intuition of how those two paradigms work:
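In addition to the pictures, the difference can be sketched in a few lines of plain Python (purely illustrative, no streaming engine involved): the same event source handled one event at a time versus in small groups.

```python
import itertools
import time

def events():
    """Pretend source: an unbounded stream of readings."""
    for i in itertools.count():
        yield {"id": i, "value": i % 10}
        time.sleep(0.01)

def process(batch):
    print(f"processing {len(batch)} event(s), last id = {batch[-1]['id']}")

# Event-at-a-time: lowest latency, one processing call per event.
for event in itertools.islice(events(), 5):
    process([event])

# Micro-batching: group events into small fixed-size chunks,
# trading a little latency for throughput.
stream = events()
for _ in range(3):
    process(list(itertools.islice(stream, 4)))
```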
Why Streaming Analytics for Chain Monitoring
Enterprise BI processing chains tend to be very complex, because of the volume, but also because of the number of regulations and compliance measures taken. Hence it's not uncommon that process changes and unforeseen load strain parts of the chain, oftentimes with big consequences. When incidents occur, several steps, if not the entire chain, must be re-run. These incidents are often a source of delays, reduced service levels and, in general, lower quality of internal BI process measures and KPIs.
Streaming Analytics can be effectively used as the processing paradigm to control and act on metadata produced by BI chains:
  1. Build models using a large amount of sensor meta-data, events, and facts, and determine which patterns are normal and which are anomalous in the received data
  2. Score, forecast and predict trends on newly generated data, and provide real-time actionable insights
Use ETL logs and meta-data to forecast data quality and process operational KPIs
  • Forecasting for Time, Volume, Query Types
  • Forecasting on Errors and Incidents
Rationale:
Data values tend to be stable around some statistics; therefore we could collect the logs and build a model based on the statistics of incidents and other monitored values in order to determine the chance of success of a given ETL pipeline.
The same sort of analysis can be applied to variables such as ETL job logs to monitor processed volumes, processing times, and query types. This information can be captured at run-time as the ETL jobs are executing. Here below are a few examples of anomaly prediction and time series forecasting on machine logs; a minimal code sketch follows after the objectives below.
Objectives:
  • Detect anomalies in log data variables
  • Predict behaviour of ETL processes
  • Predict the probability of future incidents
ref: https://github.com/twitter/AnomalyDetection
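As a minimal sketch of this rationale (pandas, synthetic run times instead of real ETL logs, unrelated to the R package linked above): flag runs whose processing time drifts away from its recent rolling statistics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
runtime = pd.Series(rng.normal(loc=120, scale=5, size=200))  # seconds per ETL run
runtime.iloc[180] = 300                                      # an injected incident

rolling = runtime.rolling(window=30)
zscore = (runtime - rolling.mean()) / rolling.std()

anomalies = runtime[zscore.abs() > 4]
print(anomalies)  # candidate incidents to alert on before the chain breaks
```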
Use ETL logs and meta-data to identify records and data anomalies
  • Detect missing/duplicated data
  • Anomaly detection on patterns and query types
Rationale:
Data values tend to be stable around some statistics; therefore we could use these statistics to characterize future data and detect potential incidents early in the ETL process.
This analysis exploits the nature of the data being processed as well as the metadata provided by the ETL tools themselves, to increase the chances of both prediction and detection (see the sketch after the objectives below).
Objectives:
  • Monitor the records of specific products or data categories
  • Cluster and group Data Logs specific to given categories and collections
  • Detect Anomalies based on Volume, Query types, error count, log count, time, etc
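A small pandas sketch of these checks (record and column names are illustrative assumptions): count duplicated keys and missing values per batch, to be compared against the historical statistics of the feed.

```python
import pandas as pd

batch = pd.DataFrame({
    "record_id": [1, 2, 2, 4],
    "amount":    [10.0, 20.0, 20.0, None],
})

duplicates = batch[batch.duplicated(subset="record_id", keep=False)]
missing = batch[batch["amount"].isna()]

print(f"{len(duplicates)} duplicated record(s), {len(missing)} record(s) with gaps")
# In a streaming setup the same checks run per micro-batch and the counts are
# compared against the historical statistics of the feed.
```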

Use ETL logs and meta-data to cluster and classify queries and data transformations
Cluster processing based on query types, query result statuses, and access logs, and provide an indication of the "norms" for data and process quality, as well as detect possible intrusions and cyber security attacks.
Rationale:
ETL metadata is a rich source of information. Normally this information is manually curated. However, metadata is data, and as such it can be processed as text: text extraction techniques can be applied to DB logs, query logs and access logs.
Once the data is structured, machine learning and data science techniques can be applied to detect clusters and (semi-)automatically classify datasets, providing higher SLAs, better data quality, and better prevention of both incidents and cybersec attacks. A short sketch follows the objectives below.
Objectives
  • Extract patterns and information from machine logs
  • Combine multiple sources
  • Normalize the data into a single format
  • Apply machine learning algorithms to cluster and classify the given information
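A minimal sketch of treating ETL metadata as text (scikit-learn, toy query log): vectorize the queries with TF-IDF and cluster them, so that unusual query shapes stand out from the established groups.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

query_log = [
    "SELECT * FROM customers WHERE country = 'NL'",
    "SELECT * FROM customers WHERE country = 'BE'",
    "INSERT INTO staging_orders SELECT * FROM raw_orders",
    "INSERT INTO staging_orders SELECT * FROM raw_orders WHERE ds = '2017-01-11'",
    "DROP TABLE tmp_scratch",
]

# Tokenize on identifiers/keywords and cluster the resulting TF-IDF vectors.
X = TfidfVectorizer(token_pattern=r"[A-Za-z_]+").fit_transform(query_log)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for label, query in zip(labels, query_log):
    print(label, query)
```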
Data Governance on Streaming Data
Streaming data is still data. Hence, it must be managed and governed. One way of managing data is by logically partitioning it in semantic layers, from raw data sources to actionable output. In particular, streaming data can also be layered: from raw events to alerts and notifications.
Streaming Data Components:
As depicted in the diagram on the right, a streaming analytics system can be logically split into three function classes:
  1. Data Capture
  2. Data Exploration
  3. Data Exploitation
This can be mapped on 6 logical components:
  1. Data Extraction
  2. Data Transport
  3. Data Storage
  4. Data Analytics
  5. Data Visualization
  6. Data Signaling
Selection of Streaming Analytics Components
If we consider the streaming components as a stack, we can select for each component a number of tools available in the market. Therefore, we can define a number of bundles or recipes depending on the technology used for each component of the stack. In the diagram below you can see a few of those streaming analytics bundles.
Some of those bundles are composed of open source projects, others of proprietary closed-source technologies. This first classification positions technologies such as Splunk, HP, Teradata, and SQLStream in one group and the SMACK, ELK, and Flink stacks in another. Moreover, some bundles are fully delivered and maintained by a single company (Splunk, HP Arcsight, Elastic) while other bundles are composed of tools maintained by different companies and dev centers (Teradata, Flink, SMACK).
Also, considering the streaming analytics use cases, some of these bundles are better tuned to specific domains (cyber security, marketing, operational excellence, infrastructural monitoring) while others are less specialized and can be tuned or customized to a specific set of use cases.
While the following diagram is not exhaustive, it provides a selection of some of the most talked-about and widely adopted streaming analytics components available today in the market.
Scorecard
The following scorecard can be used to determine which individual components and which bundles are more appropriate and fit for purpose, given the use cases, the organization, the capabilities in terms of people, tools, and technology, the business and financial goals and constraints, and the culture of the given enterprise.
Metrics and criteria, with their rationale:
  • Open Source: Sharing the source code provides a higher level of transparency.
  • Ease of Use: How easy is it to implement new use cases, or to modify existing ones?
  • Vendor Specific: Some components, once used, might be hard to swap for others because of the level of tuning and customization, and can create technology lock-ins.
  • Documentation: Is the tool well documented? What about installation, configuration, and examples?
  • Community: An active community stimulates and steers the innovation process and provides feedback on features, bugs and best practices.
  • Ease of IT Integration: How straightforward is it to integrate the component with the existing IT landscape?
  • Longevity: The number of years a given technology has been in the market provides an indication of the maturity of the solution.
  • Libraries: Are plugins and 3rd-party libraries available? Is there a marketplace, and a community of satellite companies contributing to the technology?
  • Maintenance: SLAs may vary depending on the use case and other requirements.
  • Performance: How fast are streams processed? How efficient is the solution, given the same amount of IT resources?
  • Release cycle: How often are new releases delivered?
  • TCO: What is the estimated total cost of ownership for the selected components?
  • Data Integration: Can the available data sources be directly used? What about data models and formats?
  • Expertise: Are experts available in the job market? Can they be easily acquired?
  • Data Volumes: How well can the selected technology cope with the data volumes generated?
  • Learning Curve: How much time does it take to master this technology from a user/dev/ops perspective?
  • Data Aggregation: When models require a large context, how well can the technology join and merge data?
  • User and Access Management: How well does this solution fit the security and auditing measures set up in the enterprise?
Streaming Meta-Data: Monitoring BI chains
From a logical architecture perspective, streaming analytics processing can be seen as data transformations or computing steps which fetch data from a distributed queue and push results back to the queue, as previously explained in the log-centric conceptual diagram of streaming computing.
In the previous diagram the logical functions of a streaming analytics system are divided into groups, depending on the nature of the processing. You could govern streaming analytical functions according to the following taxonomy:

  • Capturing
    • Object Store
    • File Store
  • Data Logging
    • Data Acquisition via APIs
    • Data Listeners (files, sockets)
    • Data Agents (browsers, devices)

  • Transformation
    • Data Cleaning
    • Data Augmentation
    • Data Filtering
    • Data Standardization
    • Sessioning, Grouping
    • Data Formatting
    • Data Compaction
  • Modeling
    • Data Modeling
    • Clustering
    • Pattern Extraction
    • Feature Engineering
    • Histogram Analysis
    • Norms Extraction
  • Machine Learning / AI
    • Anomaly Detection
    • Forecasting
    • Recommendation
    • Classification
  • Signaling
    • Alerting
    • Notification

Streaming Analytics: Conceptual Architecture
Before diving into the details of the architectural blueprint, let us analyze the main components of such a system. The diagram here below provides a simplified description of the different parts constituting a streaming analytics architecture.
Starting from the bottom, we define two storage layers; the top two layers are analytics and visualization.
The first is a typical Big Data layer for long-term storage of data. It provides an excellent and cost-efficient solution to store raw stream events and meta-data. Data on this layer is most efficiently stored in large files. This layer is usually not great for random access of specific records, but works well to stream out large files and have them processed in engines such as Presto, Hive, and Spark.
The second storage layer is more tailored toward objects and documents. The characteristic of this layer is that access is fast. This form of storage provides better data search and exploration functions. Moreover, a document store provides fast searches by indexing textual data, and fast access to individual stream events/elements. This layer is typically realized using NoSQL technologies, two of which, Cassandra and Elasticsearch, are discussed in more detail in the following sections.
The third layer is meant for model building and data exploration. Presto and Hive are SQL engines in the Hadoop ecosystem, tuned respectively for interactive exploratory queries and large batch analysis on big data volumes. Spark is also an interesting component, as it allows interleaving Machine Learning operations with both SQL queries and data transformations using languages such as Scala and Python.
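A hedged PySpark sketch of that interleaving (paths, table and column names are assumptions for illustration, not part of the blueprint): the same Spark session runs a SQL aggregation over the long-term storage layer and then fits a clustering model on the result.

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chain-monitoring").getOrCreate()

# Long-term storage layer: raw ETL metadata events kept as large files.
events = spark.read.parquet("/datalake/etl_metadata/")
events.createOrReplaceTempView("etl_events")

# SQL for exploration and aggregation...
hourly = spark.sql("""
    SELECT job_name,
           hour(event_time) AS hour,
           count(*)         AS n_events,
           avg(duration_s)  AS avg_duration
    FROM etl_events
    GROUP BY job_name, hour(event_time)
""")

# ...interleaved with machine learning on the very same engine.
features = VectorAssembler(inputCols=["n_events", "avg_duration"],
                           outputCol="features").transform(hourly)
clusters = KMeans(k=4, seed=1).fit(features)
clusters.transform(features).select("job_name", "hour", "prediction").show()
```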
The top layer is populated by data visualization tools. These tools usually access the underlying analytical layer in order to perform the computations, and then display the results using dashboards, graphs and widgets, often via a Web UX.
Streaming Analytics: Architectural Blueprint and Data Landscape
The following architectural blueprint provides a possible implementation for meta-data management and chain monitoring. It consists of three main parts.

Description:
This open source blueprint serves a number of goals:
  • long term storage of the raw events (data lake)
  • Data exploration and validation of models and hypotheses
  • Implementation and development of ad-hoc  use cases
  • Model creation and model validation using data science and machine learning tools.
Considerations
The above blueprint architecture is a possible end state for chain monitoring and operational excellence. It can definitely be phased in stages according to the organization's appetite, roadmap and strategy for streaming analytics and real-time data processing.
One general remark is that each streaming technology and each component of the above blueprint has its "sweet spot" in the overall data landscape.
Elasticsearch is extremely efficient at storing, capturing and displaying time series data. However, because of the way the data is structured, complex queries and joins are usually not performed efficiently within this platform. This is why, for complex queries, Elasticsearch can be complemented by other solutions such as Spark, Presto, Hive, Cassandra, or other analytical systems such as enterprise data warehouses, which act as a "powerhouse" for complex queries and aggregations.

See diagram here below:

The proposed combination of file and object data stores, topped by Spark, is quite powerful and probably provides the highest level of flexibility to implement each specific use case in a tailored and customized way. Spark's uniqueness comes from the fact that it provides a unified data programming paradigm: it combines SQL, Python, Scala, Java, R, as well as streaming and machine learning capabilities under the same programming model, using the very same engine to perform this variety of computations.
Recommendations
The suggested blueprint of course requires further analysis: it's advised to determine which scoring criteria should weigh more in the selection, and which components or bundles in the architecture should be prioritized.
It's also probably wise, given the vast choice of components, tools, libraries and solutions, to identify which level of integration (libraries or ready-made packaged solutions) is preferred in the organization. Depending on the availability of devops resources, you can trade flexibility and customization for pre-canned, use-case-specific solutions.

Active human-manned monitoring is becoming unfeasible, especially when hundreds of dashboards are produced by systems such as Kibana. It's therefore highly recommended to complement the dashboarding approach with a more data-driven solution where patterns and anomalies are learned and detected autonomously by the system.
Also, the availability of raw metadata signals as part of this architecture, stored in a data lake and transported on Kafka, will probably constitute a great substrate to create and develop other use cases in other domains (fraud, cybersecurity, marketing, personalized recommenders, predictive services, etc.).
Streaming Analytics Engines: Open Source Projects
For further reading, let's focus on computing engines, as streaming computing engines are the foundation for any domain-specific streaming application. Here below is a selection of stream processing technologies which have been developed over the last few years:
For a detailed description of those technologies, have a look at this post:
https://www.linkedin.com/pulse/streaming-analytics-story-many-tales-natalino-busa

Wednesday, January 11, 2017

AI Q&A: Natalino Busa

In preparation for my next talk at the Global Artificial Intelligence (AI) Conference on January 19th, 20th, and 21st 2017, I have written down a few thoughts on AI and how it could possibly contribute to the enterprise, and in particular to Financial and Telecom services. See you guys live in Santa Clara, California!

PS. I have 5 complimentary tickets for the AI enthusiasts among you! 
Send me a message via linkedin for the promo code. Cheers - Natalino

Natalino, tell us about yourself and your background.
I have worked in industrial research at Philips, where I spent several years researching and developing on-chip distributed computing algorithms, especially applied to video, image and audio processing. When I moved on in my career path, I held on to my analytical background, and it turned out to be a wise move. When Big Data, Machine Learning and modern AI emerged as part of the digital transformation in the past decade, I was happy to apply distributed computing and machine learning techniques on large datasets to solve enterprise challenges and provide innovative solutions for banking and telecom applications.

What have you been working on recently?
I have worked on streaming computing, in particular using the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). Also I have been quite busy with anomaly detection techniques for the financial market, in particular detecting fraud and cyber security attacks in retail banking. I have spent quite some time lately in understanding how dimensionality reduction, manifolds and topological analysis can be applied to bank transactions in order to extract patterns and efficiently cluster data.

Tell me about the right tools you used recently to solve a financial customer problem?
I have been using a number of mixed Machine Learning and AI techniques. For example, I have been using DBSCAN clustering and stacked auto-encoders for feature extraction, and I have also done some data exploration and visualization using TDA and T-SNE. Tensorflow, Keras and Scikit-Learn are great tools to analyze financial datasets and identify payment clusters.
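A rough sketch of such a workflow with scikit-learn (synthetic data stands in for the real payment records; the stacked auto-encoders and Keras parts mentioned above are omitted here): scale the features, project them with t-SNE for visual exploration, and cluster them with DBSCAN.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic "transactions": three behavioural groups in a 6-dimensional feature space.
rng = np.random.default_rng(42)
transactions = np.vstack([rng.normal(loc=c, scale=0.3, size=(300, 6))
                          for c in (0.0, 2.0, 4.0)])

X = StandardScaler().fit_transform(transactions)

# 2-D embedding for visual exploration (scatter-plot it in a notebook).
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# Density-based clustering; points labelled -1 are potential outliers to review.
labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(X)
print("clusters found:", sorted(set(labels) - {-1}))
print("flagged as outliers:", int((labels == -1).sum()))
```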

Where are we now today in terms of the state of artificial intelligence, and where do you think we’ll go over the next five years?
Artificial Intelligence today works extremely well under two conditions: the availability of large amounts of data/examples, and labeled data (supervised learning). Given the current state of the art in AI, these are the typical conditions to train AI systems, and in particular deep neural networks. I think that this alone will provide a tremendous boost in areas where data can be accumulated easily. However, when the data is scarce, the current deep learning and AI methods are less effective. I do expect more abstraction in the next 5 years: now that we have taught AI to read, write, speak, and listen, the next goal, in my opinion, should be to teach AI to perform abstract thinking.

There is a negative perception around AI and even some leading technology folks have come out against it or saying that it’s actually potentially harmful to society. Where are you coming down on those discussions? How do you explain this in a way that maybe has a more positive beneficial impact for society?
The negative perception of AI usually rests on two interpretations. Let's discuss them one by one. The first one is "It's harmful because machines will replace people". Actually, this is true, but it is hardly news. For centuries, in our progress, we have created machines to reduce labour and manual work, and so far this has only provided more work, albeit of a different sort. AI is not very different in this respect from what happened during the industrial revolution with steam engines and factories. The second negative interpretation of AI is "It's harmful because machines will autonomously take decisions about and for people." I think that this second danger is indeed real, and we need to make sure that AI systems are ethically and legally compliant. I believe that AI should be part of monitoring and control mechanisms. We have to make sure that AI is not seen as a free-running system but rather as a very powerful data analysis tool that we can use in order to accelerate progress, for instance in science and healthcare. Personally, I see AI as a sort of "data lens" that we can put on when we want to better understand the patterns in our datasets.

When you’re hiring, what types of people are you hiring? For traditional programmers and engineers, it is very difficult to get into the AI space. Are you hiring from that talent pool, or is that a different talent pool? In terms of talent, how do you go about ensuring you get the best AI people at your company?
A good understanding of Machine Learning and Data Science is fundamental to step into the wider AI space. In particular, considering that AI and Deep Learning are very much hyped, it's extremely important to be critical of the methods used, rather than blindly applying one ML algorithm after the other. Notions such as over/under-fitting, (cross-)validation, and cost optimization must be well understood, especially if the candidate comes from an engineering background. Practical experience is also paramount, as AI, Machine Learning and Deep Learning require coding and hands-on swiftness. So a good mix of theory and practice will get you there, provided sufficient study and practice.

Will progress in AI and robotics take away the majority of jobs currently done by humans? Which jobs are most at risk?
Definitely, some of the jobs we have today might disappear in the future. Some attempts are already happening: think for instance of the race for "chatbots". Secretarial, call center, and clerk jobs could be replaced by machines. Ironically, data scientists, in particular data analysts who collect data and provide reporting and aggregated results on data, could also be replaced by AI systems. In general, all tasks where an AI is able to extract and learn the "recipe" can potentially be done by machines.

What can AI systems do now?
AI systems today are extremely good at recognizing patterns and learning from examples. They work extremely well at detecting anomalies and recommending items and actions based on past history. The latest AI models are quite good at extracting semantics and meaning from organic data such as videos, images, text and speech. These AI systems are currently used to categorize content in many applications, ranging from security to forensics, predictive maintenance, social networks and personalized marketing.

When will AI systems become more intelligent than people?
One day for sure. 
When this singularity will happen, I can't really tell, but not in our lifetimes in my opinion.

Which AI scientists do you admire the most?
I love the roots of the field, in particular Yoshua Bengio and Peter Norvig for the foundational work they have done in the past 30 years. From the new school my favs are Richard Socher and Nando de Freitas.

You’ve already hired Y number of people approximately. What would be your pitch to folks out there to join your organization? Why does your organization matter in the world?
Data is the new electricity which is powering the world. And AI, ML and Data Science tools are the turbines and the engines which are conveying this electricity and transforming it into the most amazing Financial and Telecom applications. The slogan? "AI for Finance: Help the customer, Protect the customer"

Is AI going to change Financial Services?
Yes, I do think so. Most financial services are currently very basic in the way they interact with customers. AI will provide new ways to better understand the needs and the behavior of individuals, and to provide relevant, personalized and helpful hints for everyone to better manage their financial life.

Is Deep Learning directly applicable to finance?
Yes, many use cases have already accumulated large amounts of data to work on, and the outcomes/labels are known (aka supervised learning). For these use cases Deep Learning can be applied straight away. However, there are still some scenarios where the data is not that abundant and where deep learning techniques need to operate in an unsupervised way. We need more results from the scientific community on that front.

What are some of the best takeaways that the attendees can have from your "AI and Big Data in Commerce"  talk?
  • There is definitely tons of data which is currently unused.
  • AI can be effectively applied to existing hard problems such as better regulatory compliance, risk management, fraud and cyber security defence.
  • AI works well in less "sexy" domains such as Financial Services.
  • Tensorflow, Keras, T-SNE and Scikit-Learn are great tools to build and train AI applications for financial services.

What are the top 5 AI Use cases in enterprises?

  • Personalized recommenders, 
  • Cybersecurity, 
  • Personalized Marketing,
  • Operational Excellence, 
  • Predictive Services
Which company do you think is winning the global AI race?
The one with the largest group of AI scientists/engineers, and the biggest collection of data.

Any closing remarks?
Looking forward to seeing you guys live in Santa Clara, California, at the Global Artificial Intelligence (AI) Conference on January 19th, 20th, and 21st 2017.