Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning.

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

Prior to that, he worked as a senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round technology manager, product developer, and innovator with a 15+ year track record in the research, development and management of distributed architectures, scalable services and data-driven applications.

Wednesday, June 25, 2014

Introduction to Elasticsearch

Elasticsearch is a distributed service for textual search and analytics, exposed through a RESTful API for both data ingestion and querying.
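As a sketch of what that API looks like (assuming a node listening on localhost:9200; the `blog` index and document fields are made up for illustration), indexing and searching are both plain HTTP calls:

```python
import json
from urllib import request

ES = "http://localhost:9200"  # assumed local Elasticsearch node

# Ingestion: PUT a JSON document under /<index>/<type>/<id>
doc = {"title": "Introduction to Elasticsearch", "tags": ["search", "analytics"]}
req = request.Request(
    ES + "/blog/post/1",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)

# Querying: GET /<index>/_search with a URI query string
search_url = ES + "/blog/_search?q=title:elasticsearch"

# With a cluster running, request.urlopen(req) would index the document,
# and request.urlopen(search_url) would return the matching hits as JSON.
print(req.full_url)
print(search_url)
```

Everything is JSON over HTTP, so any language with an HTTP client can talk to the cluster directly.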

It lets you ingest real-time data, which it stores in transaction logs on multiple nodes in the cluster. Data is stored as structured JSON documents, and all fields of the JSON document are indexed by default. Search and indexing are built on top of the Apache Lucene codebase.

Search comes with multi-language support, a powerful query language, support for geolocation, context-aware did-you-mean suggestions, autocomplete, and search snippets.
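To give a taste of that query language, here is a sketch of a search body combining three of those features in a single request (the `articles` index and `body` field are hypothetical; this JSON would be POSTed to `/articles/_search`):

```python
import json

# One request body: full-text match, highlighted snippets,
# and a "did you mean" term suggestion for the (misspelled) input.
query = {
    "query": {
        "match": {"body": "distributed serach"}  # note the typo
    },
    "highlight": {                 # search snippets
        "fields": {"body": {}}
    },
    "suggest": {                   # did-you-mean corrections
        "fix-my-typo": {
            "text": "distributed serach",
            "term": {"field": "body"}
        }
    },
}

print(json.dumps(query, indent=2))
```

The response would carry the hits, the highlighted fragments, and the spelling suggestions together, so one round trip serves the whole search box experience.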


Since the data is schema-less, Elasticsearch can handle very different data domains. It is traditionally used for logs, but it can actually handle all sorts of documents. You could therefore just as well build a business intelligence pipeline or marketing campaign analytics with it.


Indexing happens as soon as the document is posted to the Elasticsearch cluster. This means that you can also use the cluster for real-time analytics.
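For example, an aggregation over freshly ingested events can bucket them per minute (a sketch; the `logs` index and `@timestamp`/`level` fields are illustrative, and the body would be POSTed to `/logs/_search`):

```python
import json

# Real-time analytics sketch: count error events per minute over
# the documents indexed so far -- usable moments after ingestion.
agg = {
    "query": {"match": {"level": "error"}},
    "size": 0,  # we only want the buckets, not the individual hits
    "aggs": {
        "errors_per_minute": {
            "date_histogram": {"field": "@timestamp", "interval": "1m"}
        }
    },
}

print(json.dumps(agg))
```

Because documents become searchable shortly after they are posted, a dashboard polling this query sees new events almost immediately.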


Elasticsearch integrates with a number of other great open source projects:

Hadoop: Elasticsearch can index unstructured documents located on the Hadoop filesystem (HDFS). There are also a number of connectors available for vanilla MapReduce, Cascading, Pig and Hive. This is quite something: by connecting a Hadoop data transformation pipeline to Elasticsearch, you can run text queries as part of your ETL/ELT data flow, or store the results of your Hive and Pig queries as new documents in Elasticsearch and text-search them later.
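A sketch of the Hive side of that integration (table and index names are illustrative; the storage handler class comes from the elasticsearch-hadoop connector):

```sql
-- External Hive table backed by an Elasticsearch index,
-- via the elasticsearch-hadoop connector.
CREATE EXTERNAL TABLE logs_es (ts TIMESTAMP, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs/event');

-- An INSERT OVERWRITE TABLE logs_es SELECT ... would then push Hive
-- query results into Elasticsearch as searchable documents.
```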

Logstash: a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (such as searching).
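A minimal Logstash pipeline sketch showing that collect/parse/store flow (the file path is illustrative):

```
# logstash.conf -- collect, parse, store
input  { file { path => "/var/log/app.log" } }
filter { grok { match => [ "message", "%{COMBINEDAPACHELOG}" ] } }
output { elasticsearch { host => "localhost" } }
```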

Kibana: a web console to visualize logs and time-stamped data. Elasticsearch works seamlessly with Kibana to let you see and interact with your data.

Below are a few examples of dashboards you can build using Kibana:

[Kibana dashboard screenshots]