This simple but effective text-indexing pipeline provides the data layer for a live search web application. The client web application can search through the abstracts of all English pages and respond within milliseconds with a list of the most relevant pages.
Four technologies help us here:
- Hadoop YARN map-reduce
Hadoop sifts through the 3.5 GB of Wikipedia page abstracts.
This demo uses only the core Hadoop map-reduce framework.
- Cassandra NoSQL database
Cassandra exposes the results of Hadoop for low-latency access.
Cassandra has client libraries in several languages, including Java, Scala, and Python.
- Angular.js
A framework for dynamic single-page applications.
- A no-frills web application server in Python
Very concise and excellent for prototyping.
… and of course Wikipedia
Let’s start with the data model.
The first table, “wikipedia.pages”, simply collects the abstract and some information about the page (text length, number of sections, title) and exposes them as a key-value entity in the Cassandra database. The unique key in this case is the Wikipedia URL.
The second table, “wikipedia.inverted”, holds the inverted index: the key is an English word, and the values are the list of URLs where the word can be found, sorted by relevance.
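The two tables could be sketched in CQL as follows (the column names here are my assumptions for illustration, not necessarily the demo's exact schema):

```sql
-- page abstracts and stats, keyed by the wikipedia url
CREATE TABLE wikipedia.pages (
    url      text PRIMARY KEY,
    title    text,
    length   int,
    sections int,
    abstract text
);

-- inverted index: one row per (word, relevance, url),
-- clustered so the highest-relevance urls come first
CREATE TABLE wikipedia.inverted (
    word      text,
    relevance int,
    url       text,
    PRIMARY KEY (word, relevance, url)
) WITH CLUSTERING ORDER BY (relevance DESC, url ASC);
```

Making relevance a clustering column means Cassandra keeps each word's URLs pre-sorted on disk, so the "top pages for a word" query needs no sorting at read time.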
Hadoop yarn map reduce
Hadoop YARN map-reduce consists of two phases, map and reduce; in essence, both just produce tuples in the form (key, value). The map phase produces key-value pairs from an input file (stored on HDFS), while the reduce phase "reduces" the tuples produced by the map phase: all tuples sharing the same key are reduced to a new tuple.
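The two phases can be illustrated with a toy word count in plain Python (no Hadoop involved, just the same tuple discipline):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # map phase: emit a (key, value) tuple for every word in the line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(pairs):
    # reduce phase: all tuples sharing the same key collapse into one
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

pairs = [pair for line in ["to be or not to be"] for pair in mapper(line)]
counts = dict(reducer(pairs))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The `sorted` call stands in for Hadoop's shuffle step, which delivers all tuples with the same key to the same reducer.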
|map-reduce dataflow: splits exported to Cassandra during the map and reduce phases|
Hadoop is used to sift through the XML of the Wikipedia page abstracts and build an inverted index. The inverted index's keys are the English words present in the pages' abstracts, and its values are the URLs where each word can be found, sorted by a relevance value.
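Stripped of Hadoop, an inverted index is just a map from word to the documents containing it. A minimal sketch over a made-up two-page corpus (the URLs are illustrative):

```python
from collections import defaultdict

# toy corpus: url -> abstract text (made-up, for illustration only)
pages = {
    "https://en.wikipedia.org/wiki/Cat": "the cat sat on the mat",
    "https://en.wikipedia.org/wiki/Mat": "a mat is flat",
}

# invert it: each word maps to the list of urls containing it
inverted = defaultdict(list)
for url, abstract in pages.items():
    for word in set(abstract.split()):
        inverted[word].append(url)
```

Hadoop does exactly this, but spread over many machines and with a relevance score attached to every URL.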
Data is exported from Hadoop to Cassandra using two different approaches:
- map-phase export
- reduce-phase export
The core of the mapper performs the following operations:
- Extract the words from the given page's XML abstract section
- Export the page abstract text, plus some stats about the page, to Cassandra
- Emit tuples in the form of word, relevance, and originating URL
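The steps above could be sketched as a streaming mapper like this (a hedged sketch, not the repository's exact code: it assumes the abstracts dump carries `<url>` and `<abstract>` tags one per line, and uses abstract length as a stand-in for the relevance score):

```python
# mapper.py -- illustrative sketch of the map phase
import re

def map_lines(lines):
    url = None
    for line in lines:
        m = re.search(r"<url>(.*?)</url>", line)
        if m:
            url = m.group(1)
            continue
        m = re.search(r"<abstract>(.*?)</abstract>", line)
        if m and url:
            abstract = m.group(1)
            relevance = len(abstract)  # toy stand-in for the real score
            for word in set(re.findall(r"[a-z]+", abstract.lower())):
                # tab-separated (word, relevance, url), one tuple per line
                yield "%s\t%d\t%s" % (word, relevance, url)

# in the streaming job this would be:
#   for row in map_lines(sys.stdin): print(row)
```

Hadoop streaming only sees tab-separated lines on stdout, which is what makes plain Python scripts usable as mappers and reducers.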
The relevance "algorithm" is not really smart, but it fits in one line of Python code: an article is more relevant in proportion to the length of its abstract and the number of sections (intra-links) in the Wikipedia page. For tutorial purposes it is good enough, and it produces some interesting results every now and then. :)
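Such a one-liner could look like this (the exact weighting is my assumption, not the demo's):

```python
def relevance(abstract, num_sections):
    # illustrative weighting only: longer abstracts and
    # more sections make a page rank higher
    return len(abstract) * (1 + num_sections)
```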
The reducer collects all triplets belonging to the same key (i.e. our words) and sorts the page URLs by relevance.
The reducer produces tuples in the form of word (key) and relevance,url (value). Those entries fill the Cassandra inverted-index table, ordered by decreasing relevance.
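A hedged sketch of such a reducer (Hadoop streaming hands it the mapper's tab-separated `(word, relevance, url)` lines, already grouped by key):

```python
# reducer.py -- illustrative sketch of the reduce phase
from itertools import groupby

def reduce_lines(lines):
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(rows, key=lambda row: row[0]):
        # sort this word's urls by decreasing relevance
        ranked = sorted(((int(rel), url) for _, rel, url in group),
                        reverse=True)
        for rel, url in ranked:
            yield "%s\t%d\t%s" % (word, rel, url)

# in the streaming job this would be:
#   for row in reduce_lines(sys.stdin): print(row)
```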
Single page web app
The results available in Cassandra can now be exposed through a RESTful API and consumed by an Angular.js client-side application. The top pages for a given word are queried from the Cassandra database and marshaled into a JSON object, ready for use in the client application.
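The marshaling step could be sketched as below; the query shown in the comment and all names are my assumptions rather than the demo's exact code, and the web framework itself is left out for brevity:

```python
import json

def top_pages_json(word, rows, limit=10):
    # rows: iterable of (relevance, url) pairs, e.g. from
    #   SELECT relevance, url FROM wikipedia.inverted WHERE word = ?
    pages = [{"url": url, "relevance": rel}
             for rel, url in sorted(rows, reverse=True)[:limit]]
    return json.dumps({"word": word, "pages": pages})
```

The REST handler for `/search/<word>` would then just run the query and return this string with a `application/json` content type.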
YARN map-reduce
Running the same mapper and reducer in distributed mode would start tens of mapper and reducer scripts, operating in parallel on different blocks of the original file. This is how you would run the mapper and the reducer scripts using the streaming interface:
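A sketch of such an invocation (the streaming jar path, HDFS paths, and script names are illustrative and depend on your installation):

```shell
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input   /wikipedia/abstracts.xml \
    -output  /wikipedia/inverted \
    -mapper  mapper.py \
    -reducer reducer.py \
    -file    mapper.py \
    -file    reducer.py
```

The `-file` options ship the two scripts to every node in the cluster so each task can execute them locally.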
Details about the parameters accepted by the streaming interface can be found at http://wiki.apache.org/hadoop/HadoopStreaming
The combination of these two technologies allows us to build a system that stays highly available for our client-side application while re-parsing the XML files to produce and update the inverted index.
A presentation about this tutorial is available on slideshare:
The demo is available on github: https://github.com/natalinobusa/wikipedia
Running it requires Hadoop and Cassandra to be installed on your system, at least in a single-node setup.