Tuesday, July 1, 2014

Build a Wikipedia live search engine using just 3 Python scripts.

This tutorial demonstrates how to build a live search engine for Wikipedia pages, using Hadoop YARN map-reduce and Cassandra.

This simple but effective text indexing pipeline provides the data layer for a live search web application. The client web application can search through the abstracts of all English pages and responds within milliseconds with a list of the most relevant pages.

Four technologies help us here:

  • Hadoop YARN map-reduce
    Hadoop sifts through the 3.5 GB of Wikipedia page abstracts.
    This demo uses only the core Hadoop map-reduce framework.
  • Cassandra NoSQL database
    Cassandra exposes the results of Hadoop for low-latency access.
    Cassandra has client libraries in several languages, including Java, Scala, and Python.
  • Angular.js
    A framework for dynamic single-page applications.
  • Flask
    A no-frills web application server in Python, very concise and excellent for prototyping.

… and of course Wikipedia
Wikipedia provides the abstracts in XML format for download under the Wikipedia terms of use.

Let’s start with the data model.

The first table, “wikipedia.pages”, is simply a way to collect the abstract and some information about the page (text length, number of sections, the title) and expose them as a key-value entity in the Cassandra database. The unique key in this case is the Wikipedia URL.

The second table, “wikipedia.inverted”, holds the inverted index, where the key is an English word and the values are the list of URLs where the word can be found, sorted by relevance.
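
Below is a minimal sketch of how these two tables could be created with the DataStax Python driver. The table names and the key choices come from the description above; the column names, types and clustering order are assumptions, not the exact schema of the demo.

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS wikipedia
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # key-value entity: one row per page, keyed by its URL
    session.execute("""
        CREATE TABLE IF NOT EXISTS wikipedia.pages (
            url         text PRIMARY KEY,
            title       text,
            abstract    text,
            text_length int,
            sections    int
        )
    """)

    # inverted index: one partition per word, clustered by decreasing relevance
    session.execute("""
        CREATE TABLE IF NOT EXISTS wikipedia.inverted (
            word      text,
            relevance int,
            url       text,
            PRIMARY KEY (word, relevance, url)
        ) WITH CLUSTERING ORDER BY (relevance DESC, url ASC)
    """)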


Hadoop YARN map-reduce


Hadoop is used to sift through the XML of the Wikipedia page abstracts and build an inverted index. The keys of the inverted index are the English words present in the abstracts, and the values are the lists of pages where each word can be found, sorted by a relevance value.

Hadoop YARN map-reduce consists of two phases, map and reduce, which both, in essence, just produce tuples in the form of (key, value). The map phase produces key-value pairs from an input file (stored on HDFS), while the reduce phase "reduces" all tuples provided by the map phase: all tuples sharing the same key are combined into a new tuple.
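
As a toy illustration of the two phases (plain Python, no Hadoop involved): the mapper emits (key, value) tuples and the reducer folds all values that share a key. The word-count example below is only meant to show the mechanics.

    from itertools import groupby
    from operator import itemgetter

    def mapper(line):
        # map phase: emit a (key, value) tuple per word
        for word in line.split():
            yield (word, 1)

    def reducer(key, values):
        # reduce phase: fold all values sharing the same key into one tuple
        return (key, sum(values))

    lines = ["to be or not to be"]
    tuples = sorted(t for line in lines for t in mapper(line))
    counts = [reducer(k, (v for _, v in g)) for k, g in groupby(tuples, key=itemgetter(0))]
    print(counts)   # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]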

Map-reduce dataflow: the stream is tee-split to Cassandra during the map and reduce phases.


While the Hadoop map-reduce job runs, data is also pushed on the fly to Cassandra, as it advances through the YARN map-reduce pipeline as files.

Data is exported from Hadoop to Cassandra using two different approaches:
  • map-phase export
Map-only operations work well when only local information is needed, for instance just extracting data from an available XML source. We use a map-phase export to fill the table of abstracts, using the page URLs as keys (see the sketch after this list).
  • reduce-phase export
When it is necessary to collect some global state, for instance the top 10 relevant URLs for a given word, then exposing the data from Hadoop to Cassandra can be done during the reduce step. Global properties require collecting data from multiple mappers, which may be located on different machines in the distributed Hadoop architecture.
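
Here is a minimal sketch of the map-phase export, assuming the wikipedia.pages table sketched above: each mapper inserts the page it has just parsed directly into Cassandra, keyed by its URL. The helper name export_page and the column names are illustrative assumptions.

    from cassandra.cluster import Cluster

    # one session per mapper process; the contact point is an assumption
    session = Cluster(['127.0.0.1']).connect('wikipedia')
    insert_page = session.prepare(
        "INSERT INTO pages (url, title, abstract, text_length, sections) "
        "VALUES (?, ?, ?, ?, ?)")

    def export_page(url, title, abstract, sections):
        # local information only: no coordination with other mappers is needed
        session.execute(insert_page, (url, title, abstract, len(abstract), sections))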

Mapper


The mapper consumes the XML file and produces triplets, representing respectively the extracted word, the relevance score, and the page URL, as follows:

… and repeats the process for all Wikipedia abstracts.



The core of the mapper performs the following operations:

  1. Extract the words from the given page's XML abstract section
  2. Export the page abstract text plus some stats about the page to Cassandra
  3. Emit tuples in the form of word, relevance, and originating URL

The relevance "algorithm" is not really smart, but it fits in one line of Python code: an article is more relevant in proportion to the length of its abstract and the number of sections (intra-links) in the Wikipedia page. For tutorial purposes it is good enough, and it produces some interesting results every now and then. :) A sketch of such a mapper follows.
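
This is a minimal sketch of such a streaming mapper, assuming the element names of the Wikipedia abstract dump (doc, url, abstract, sublink); the parsing details and the tab-separated output format are assumptions, and the Cassandra export of step 2 is omitted here (see the earlier map-phase export sketch).

    #!/usr/bin/env python
    # Streaming mapper sketch: reads the abstract XML dump from stdin and emits
    # tab-separated (word, relevance, url) triplets on stdout.
    import re
    import sys
    import xml.etree.ElementTree as ET

    for _, elem in ET.iterparse(sys.stdin):
        if elem.tag != 'doc':
            continue
        url = elem.findtext('url', '')
        abstract = elem.findtext('abstract', '') or ''
        sections = len(elem.findall('.//sublink'))

        # one-line relevance: longer abstracts with more sections rank higher
        relevance = len(abstract) + sections

        for word in set(re.findall(r'[a-z]+', abstract.lower())):
            print('%s\t%d\t%s' % (word, relevance, url))
        elem.clear()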

Reducer


The reducer collects all triplets belonging to the same key (i.e. our words) and sorts the URLs by relevance, as follows:



The reducer produces tuples with the word as key and (relevance, url) as value. Those entries fill the Cassandra inverted index table, ordered by decreasing relevance.
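
This is a minimal sketch of such a streaming reducer, assuming the wikipedia.inverted table sketched earlier and the tab-separated mapper output above; the top-10 cut-off follows the earlier example of keeping the 10 most relevant URLs per word, and the exact line format is an assumption.

    #!/usr/bin/env python
    # Streaming reducer sketch: the input arrives sorted by word, so all triplets
    # for the same word are adjacent; keep the most relevant URLs per word and
    # push them to the Cassandra inverted index.
    import sys
    from itertools import groupby
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('wikipedia')
    insert = session.prepare(
        "INSERT INTO inverted (word, relevance, url) VALUES (?, ?, ?)")

    def parse(line):
        word, relevance, url = line.rstrip('\n').split('\t')
        return word, int(relevance), url

    triplets = (parse(line) for line in sys.stdin if line.strip())
    for word, group in groupby(triplets, key=lambda t: t[0]):
        # sort this word's pages by decreasing relevance and keep the 10 best
        top = sorted(group, key=lambda t: t[1], reverse=True)[:10]
        for _, relevance, url in top:
            session.execute(insert, (word, relevance, url))
            print('%s\t%d\t%s' % (word, relevance, url))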



Single-page web app


The results available in Cassandra can now be exposed with a RESTful API and consumed by an Angular.js client-side application. Below you can see how the top pages for a given word are queried from the Cassandra database and marshaled into a JSON object, ready for use in the client application.
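
A minimal sketch of such a Flask endpoint, assuming the wikipedia.inverted table sketched earlier; the route and the JSON payload shape are assumptions, not the exact API of the demo repository.

    from cassandra.cluster import Cluster
    from flask import Flask, jsonify

    app = Flask(__name__)
    session = Cluster(['127.0.0.1']).connect('wikipedia')

    @app.route('/api/search/<word>')
    def search(word):
        # the table clusters by decreasing relevance, so LIMIT returns the top pages
        rows = session.execute(
            "SELECT url, relevance FROM inverted WHERE word=%s LIMIT 10", (word,))
        return jsonify(results=[{'url': r.url, 'relevance': r.relevance} for r in rows])

    if __name__ == '__main__':
        app.run(debug=True)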




YARN map-reduce


The mapper and the reducer scripts can actually run without Hadoop at all. The following script does exactly what Hadoop YARN map-reduce would do, minus the scalability and the performance speed-up.
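
Since the original single-machine script is not reproduced here, this is a hedged sketch of how it could look: pipe the dump through the mapper, sort the tuples by key, and pipe the result through the reducer. The file names mapper.py, reducer.py and enwiki-latest-abstract.xml are assumptions.

    import subprocess

    # local stand-in for the YARN pipeline: mapper | sort | reducer
    with open('enwiki-latest-abstract.xml') as xml_in:
        mapper = subprocess.Popen(['python', 'mapper.py'],
                                  stdin=xml_in, stdout=subprocess.PIPE)
        sorter = subprocess.Popen(['sort'],
                                  stdin=mapper.stdout, stdout=subprocess.PIPE)
        reducer = subprocess.Popen(['python', 'reducer.py'], stdin=sorter.stdout)
        mapper.stdout.close()   # let SIGPIPE propagate if a downstream stage exits early
        sorter.stdout.close()
        reducer.wait()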


Running the same mapper and reducer in distributed mode would start tens of mapper and reducer tasks, operating in parallel on different blocks of the original file. This is how you would run the mapper and the reducer scripts using the streaming interface:


Some details about which parameters can be used in the streaming interface can be found at http://wiki.apache.org/hadoop/HadoopStreaming


Final remarks


Hadoop YARN map-reduce is excellent at streaming large files and at processing large amounts of information in batch mode. Cassandra provides a low-latency interface for data stored in a two-dimensional key-value NoSQL format (rows -> words, columns -> relevant wiki pages).

The combination of these two technologies allows us to build a system that stays highly available for the client-side application while the XML files are re-parsed to produce and update the inverted index.

A presentation about this tutorial is available on SlideShare:
http://www.slideshare.net/natalinobusa/fast-queries-on-data-lakes-a-wikipedia-search-tutorial

The demo is available on GitHub: https://github.com/natalinobusa/wikipedia
It requires Hadoop and Cassandra to be installed on your system, at least as a single-node setup.

References

http://hadoop.apache.org/
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html
http://wiki.apache.org/hadoop/HadoopStreaming

http://cassandra.apache.org/
https://github.com/datastax/python-driver

https://angularjs.org/