As the athletes have been training for the London 2012 Olympic Games, so have our Twitter collectors. You may have seen the maps we created from data collected by the very first iteration of the Big Data Toolkit’s Twitter collector, which produced some great visualisations. Over the past few weeks I have rewritten some of the major components of the system so that multiple machines can connect together and form a swarm of collectors. This allows the swarm to collect data from different locations over the same period of time. With a certain international event happening on our doorstep, we thought there could be no better way to test the system out. At CASA we are lucky enough to have our own private cloud resources, which allow us to spawn machines when we need extra infrastructure. So for the Olympic collectors we have 22 machines collecting tweets from each of the Olympic venues all over London, and we have managed to collect over 1.4 million tweets from the last 14 days of the Olympics. (Each tweet was sent from the vicinity of one of the venues, which is why the individual numbers are low.)
One central server manages the swarm and asks each collector every 5 seconds to send back statistics, giving us a live view of each server just in case a machine stalls. This is a big step towards a completed system that lets users initiate collectors on services such as Amazon’s EC2 to tackle large-scale data capture and still get responsive, live statistics on what each individual machine is collecting.
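To make the polling idea concrete, here is a minimal Python sketch of a central server asking each collector for its statistics every 5 seconds and flagging any machine that stops replying. This is not the Big Data Toolkit's actual code: the names `CollectorStatus`, `poll_once`, `run_monitor` and `fetch_stats`, and the three-missed-polls threshold, are all assumptions made for illustration. Only the 5 second interval comes from the description above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

POLL_INTERVAL = 5.0  # seconds between polls, as described in the post
STALL_AFTER = 3      # assumed threshold: missed polls before a machine counts as stalled


@dataclass
class CollectorStatus:
    """Last known state of one collector machine (hypothetical structure)."""
    name: str
    tweet_count: int = 0
    missed_polls: int = 0

    @property
    def stalled(self) -> bool:
        # A collector is treated as stalled once it has failed to answer
        # STALL_AFTER polls in a row.
        return self.missed_polls >= STALL_AFTER


def poll_once(statuses: Dict[str, CollectorStatus],
              fetch_stats: Callable[[str], int]) -> None:
    """Ask every collector for its tweet count and record the result.

    `fetch_stats` stands in for whatever transport the real system uses;
    it takes a collector name and returns its tweet count, or raises an
    exception if the machine does not reply.
    """
    for name, status in statuses.items():
        try:
            status.tweet_count = fetch_stats(name)
            status.missed_polls = 0
        except Exception:
            status.missed_polls += 1


def run_monitor(statuses: Dict[str, CollectorStatus],
                fetch_stats: Callable[[str], int],
                rounds: int,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll all collectors `rounds` times, pausing POLL_INTERVAL between polls."""
    for _ in range(rounds):
        poll_once(statuses, fetch_stats)
        for status in statuses.values():
            flag = "STALLED" if status.stalled else "ok"
            print(f"{status.name}: {status.tweet_count} tweets [{flag}]")
        sleep(POLL_INTERVAL)
```

In use, you would build one `CollectorStatus` per machine, pass in a `fetch_stats` function that talks to your collectors, and call `run_monitor`. Keeping the transport behind a plain callable means the stall check can be tried out with a fake function before any real machines are involved.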
We have already incorporated this data back into City DashBoard, which you can read more about at Oliver O’Brien’s blog.
Update: I’ve now enabled the relevant WebSocket ports on our servers, so you can see the stats from the live collectors here.