Twitter Indexing Demo

Objective

The demo offers a real-time search of the Twitter stream. The key feature we are demonstrating here is that as soon as a tweet is posted, it is indexed and made available for search.

Try it here!

Architecture

This demo is a web interface to an underlying infrastructure that indexes streams of data (e.g. tweets) in real time using Terrier. The index is distributed across different computing nodes on a cluster. This infrastructure will be the backbone of the SMART search engine.

This is done with the recently emerging Storm framework, which provides a distributed processing environment similar to MapReduce but can handle streams of data in real time. We use Storm to distribute the workload of indexing the tweet stream with Terrier across multiple machines in a cluster. Terrier has been enhanced to use real-time, in-memory indices, so that as soon as a tweet is posted and received, it is indexed and made available for search. On-disk inverted indices are typically compressed. We studied whether compression is still appropriate for in-memory indices and confirmed that it is: compression increases retrieval speed as well as the number of documents that can be indexed in a fixed amount of space. In particular, we use Elias-Gamma coding for docid deltas and unary coding for term frequencies (unary is suitable because a term usually occurs once, or at most twice, in a tweet).
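To make the compression scheme concrete, here is a minimal sketch of Elias-Gamma coding of docid deltas combined with unary coding of term frequencies. It is illustrative only and is not Terrier's implementation: the function names, the bit-string representation, the unary convention (n - 1 one-bits followed by a zero) and the choice of starting the delta from -1 so that docid 0 can be encoded are all assumptions made for this example.

```python
def unary(n):
    """Unary code for n >= 1: (n - 1) one-bits, then a terminating zero."""
    if n < 1:
        raise ValueError("unary coding needs n >= 1")
    return "1" * (n - 1) + "0"


def elias_gamma(n):
    """Elias-Gamma code for n >= 1: floor(log2 n) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias-Gamma coding needs n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_postings(postings):
    """Encode (docid, tf) pairs, sorted by strictly increasing docid."""
    bits = []
    previous = -1  # assumption: lets docid 0 produce a delta of 1
    for docid, tf in postings:
        bits.append(elias_gamma(docid - previous))  # docid delta
        bits.append(unary(tf))                      # term frequency
        previous = docid
    return "".join(bits)


def decode_postings(bits):
    """Inverse of encode_postings."""
    postings = []
    pos = 0
    previous = -1
    while pos < len(bits):
        # Elias-Gamma: count leading zeros, then read that many + 1 bits.
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        delta = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        # Unary: count one-bits up to the terminating zero.
        tf = 1
        while bits[pos] == "1":
            tf += 1
            pos += 1
        pos += 1  # skip the terminating zero
        previous += delta
        postings.append((previous, tf))
    return postings


postings = [(3, 1), (7, 2), (8, 1)]
encoded = encode_postings(postings)
print(len(encoded), "bits:", encoded)
print(decode_postings(encoded) == postings)
```

With this convention a term frequency of 1 costs a single bit, which is why unary coding suits tweets, where most terms occur only once.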

When a query is issued, results are aggregated from different index "shards" (currently 5 "shards", representing 5 distributed computing nodes). Once a tweet exceeds a configurable age, it is removed from the search results. Currently, the search demo uses a baseline ranking model for tweets. We have previously deployed more advanced and effective ranking models, as described in our paper entitled "University of Glasgow at TREC 2011: Experiments with Terrier in Crowdsourcing, Microblog, and Web Tracks".
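The aggregation step described above could be sketched as follows. This is a hypothetical illustration rather than the demo's actual code: the hit fields ("id", "score", "posted_at"), the function name and the idea of filtering by age at merge time are assumptions, and the sketch also assumes that scores from different shards are directly comparable.

```python
import heapq


def merge_shard_results(shard_results, now, max_age_seconds, k=10):
    """Merge per-shard hit lists into one top-k list, skipping old tweets.

    shard_results: one list of hits per shard; each hit is a dict with
    hypothetical keys "id", "score" and "posted_at" (seconds).
    """
    fresh_hits = (
        hit
        for hits in shard_results
        for hit in hits
        if now - hit["posted_at"] <= max_age_seconds  # age cutoff
    )
    # nlargest returns the k best-scoring hits in descending score order.
    return heapq.nlargest(k, fresh_hits, key=lambda hit: hit["score"])


shards = [
    [{"id": "a", "score": 2.0, "posted_at": 100},
     {"id": "b", "score": 5.0, "posted_at": 10}],   # too old, filtered out
    [{"id": "c", "score": 3.0, "posted_at": 95}],
]
top = merge_shard_results(shards, now=110, max_age_seconds=30, k=2)
print([hit["id"] for hit in top])
```

Filtering at merge time is only one option. The same age parameter could equally be applied inside each shard, for example by discarding old tweets from the in-memory index.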

Future features

Enhanced user interface to visualise the results.
Refined ranking models for tweets on top of the Storm framework, e.g. by deploying learning to rank.
Indexing streams from sensors.
Aggregating results from both social and sensor streams.
