Apache Agora

What does a virtual community look like?

I'm a visual type of guy. I love to see things and to visualize them, hopefully in aesthetically pleasing ways. I'm also fascinated by human dynamics and sociology. For these reasons, I started wondering about the shape of a virtual community: whether it was possible to harvest data about communities of individuals as transparently as possible, and how to visualize this incredible amount of information so that it yields insights that were hard to see before.

Inspirations

Several things inspired me:

Advogato is a social experiment on trust. It analyzes the topology of trust information to come up with a tamper-proof way to estimate how trustworthy a person might be in a particular social context. This inspired me to create a system that gives the people around the ASF a way to discover their social relationships.

Google is a search engine that bases its page ranking techniques on hyperlink topology analysis. This gave me the idea that trends too small to be appreciated on a small scale might, on a global scale, provide amazing insights into the information that exhibits them. I call this concept "data emergence".

The concept of blogging ecosystems (kudos to Sam Ruby for pointing me to this) persuaded me against using a central web application (as Advogato does) in favor of a more 'laissez-faire' approach: one that doesn't force people to use any particular technology to express social relationships, but simply relies on hyperlinks between their homepages, just like the blogging community does. This lets people use whatever tool, system or technology they like to create their homepages, and imposes nothing on them except linking to the other ASF people they know and like.

Unfortunately, unlike blogging fans, the individuals who participate in Apache communities use email as their favorite communication tool, and therefore most of the social relationship data is contained in mail archives. (kudos to Ben Laurie for suggesting this to me)

The basic idea

The basic idea is that every time you reply to somebody's message, you are creating a social relationship between you and that person.

That's more or less it.

Yes, really, I mean it: we are used to thinking of mailing list archives as flat lists of messages (date view, author view) or, in more advanced visualizations, as trees of replies (thread view). But a tree is only one of the possible ways to visualize the information contained in the concept of one message replying to another. In fact, this act of replying is a relation between two otherwise independent messages.
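
As a minimal sketch of this idea (not Agora's actual code), assume each message has been reduced to a hypothetical record with "id", "sender" and "in_reply_to" fields. The replies can then be counted as relations between people rather than between messages. Skipping replies to oneself is also an assumption made here.

```python
from collections import Counter

def build_reply_graph(messages):
    """Return a Counter mapping (replier, original_author) to a reply count."""
    # Look up who wrote each message, keyed by message id.
    author_of = {m["id"]: m["sender"] for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("in_reply_to")
        if parent is None or parent not in author_of:
            continue  # not a reply, or the original is not in the archive
        replier, author = m["sender"], author_of[parent]
        if replier != author:  # assumption: replies to oneself are ignored
            edges[(replier, author)] += 1
    return edges

# Hypothetical sample data, for illustration only.
messages = [
    {"id": "1", "sender": "alice@example.org", "in_reply_to": None},
    {"id": "2", "sender": "bob@example.org", "in_reply_to": "1"},
    {"id": "3", "sender": "alice@example.org", "in_reply_to": "2"},
    {"id": "4", "sender": "bob@example.org", "in_reply_to": "1"},
]
edges = build_reply_graph(messages)
```

Each key is a directed pair, so the two directions of a relationship can carry different weights.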

In my quest to remove any personal judgement (or algorithmic heuristics), I wondered how to visualize the shape of a community in a 2D space without any human intervention. So instead of coming up with a static image rendering technique, I opted for an interactive run-time, molecular-style simulation, where a simple model describes the behavior of a single molecule (the person, in our case) and how it interacts with the others in the same environment.

A simple math model for human behavior

The basic idea for the community simulation is that every individual perceives a private space surrounding him/her, and his/her level of comfort is reduced if this space is violated by living entities (human beings or animals) that are not recognized (this can easily be identified as an instinctive protection system).

Modelling this behavior in a human being is almost impossible given the amount of variables that influence it (culture, age, personal history, mood, health), so I decided to use a very basic and static model where each individual is treated exactly the same. I'm perfectly aware that this is a pretty strong assumption of the model, but I had to start somewhere.

The math model of this private space is a repulsive force field that can be pictured in the following graph:

This graph means that beyond a safe distance there is no repulsion force between two individuals, so they are free to move as long as they remain outside this safe circle. When this space is invaded, the perceived tension is expressed as a tendency to repel the intruder and restore the original safe distance. The closer the intruder, the more intense the repulsion.
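
The description above can be sketched as a small function. The exact curve is not specified in the text, so the linear ramp used here is an assumption, as are the names "safe_distance" and "strength".

```python
def repulsion(distance, safe_distance=1.0, strength=1.0):
    """Repulsive force between two individuals at the given distance.

    Assumption: the force grows linearly as the intruder gets closer.
    The text only says it is zero beyond the safe distance and more
    intense at shorter distances.
    """
    if distance >= safe_distance:
        return 0.0  # outside the safe circle: no repulsion at all
    return strength * (safe_distance - distance)
```

Any other curve that is zero beyond the safe distance and increases toward zero distance would fit the description equally well.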

Social relationships are equally complex to describe, given their incredible variety. Again, a very basic force field model is applied. The idea is to model social relationships with an attraction force field, which can be pictured as in the following graph:

where the slope of the force/distance line represents the 'intensity' of the relation. Here, the attractive tension is resolved when the distance between individuals is zero. This might seem a little unnatural at first, but it makes sense if we consider that the repulsion force field is still present, even when a relationship is established. In fact, the pull of attraction and the defensive tendency of repulsion are two counter-balancing forces that create a feedback mechanism which brings the relationship to equilibrium.

It is also nice to note that if only one of the two individuals exhibits attraction (or if the attraction intensities are not balanced), there is no point of equilibrium and the couple will tend to move. Thus, while stability is a sign of relational balance, movement is a sign of unbalanced attraction.
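
To make the balanced case concrete, here is a small sketch of where the two forces cancel out. It assumes attraction is a line through the origin with slope "intensity" (as the text describes) and, as a further assumption, that repulsion is a linear ramp that vanishes at "safe_distance". All names are illustrative.

```python
def attraction(distance, intensity):
    """Attractive force: a line through the origin with slope `intensity`."""
    return intensity * distance

def repulsion(distance, safe_distance=1.0, strength=1.0):
    """Assumed linear repulsion, zero at and beyond the safe distance."""
    return max(0.0, strength * (safe_distance - distance))

def equilibrium_distance(intensity, safe_distance=1.0, strength=1.0):
    """Distance at which attraction and repulsion cancel out.

    Solves intensity * d = strength * (safe_distance - d) for d.
    """
    return strength * safe_distance / (intensity + strength)

d = equilibrium_distance(intensity=1.0)
```

Under these assumptions a stronger relationship (a larger "intensity") gives a shorter equilibrium distance, but never zero, because the repulsion is still there.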

Overview of the application

The application is divided into two regions:

the left part deals with the message archives, generates the community model and loads it into the graph visualizer (the right part)
the right part is the graph visualizer. It doesn't modify the model, but allows you to interact with it.

Creating the community model

In order to create the community model, you need to tell Agora which message archives to process. These message archives have been pre-processed from regular MBOX mail archives with the script available in the distribution, which extracts from the email headers of each message the metadata that Agora requires.
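
The script itself is not reproduced here. As a rough sketch of what such a pre-processing step might look like, the following uses Python's standard mailbox module. The function name, the choice of headers and the output record layout are assumptions, not the format that Agora actually consumes.

```python
import mailbox
from email.utils import parseaddr

def extract_metadata(mbox_path):
    """Return one small record per message found in an MBOX file."""
    records = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            # parseaddr splits "Name <user@host>" into (name, address).
            _, address = parseaddr(str(msg.get("From", "")))
            records.append({
                "id": msg.get("Message-ID"),
                "sender": address.lower(),
                "in_reply_to": msg.get("In-Reply-To"),  # None if not a reply
            })
    finally:
        box.close()
    return records
```

Only headers are read. The message bodies are never touched, which matches the point that Agora works on metadata alone.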

To select a mailbox, browse the tree of available mailboxes and click on the file (they are normally split into monthly archives). Agora will fetch this archive from the network and, when finished, replace the white file icon with a reddish one, indicating that the file has been loaded.

After you have selected the mailboxes that you want to visualize, click the "Load" button and the graph will appear on the right part.

Visualizing the community model

When the model is created, the nodes are placed randomly. To start processing the community and see its shape, click the "Start" button in the "Controls" tab.

When the process starts, you can play with it interactively. Here is a list of things that you can do:

clicking on a node highlights the email address of that node (alternatively, you can use the "highlight" tab)

dragging a node forces its location; the graph adjusts accordingly.

double-clicking on a node makes the node "sticky": the node turns green, its email address remains visible and the node becomes unmovable. You can do this for as many nodes as you want.

right-clicking (or control-clicking with single-button mice) explodes the graph using a gaussian coordinate system instead of a cartesian one. Also, the nodes closer to the mouse pointer are highlighted and drawn radially to avoid

If the graph is complex and contains lots of nodes and edges, the drawing might be slow. Use the "Drawing" tab to remove edges and speed up drawing until the graph has stabilized.

You can also play with the "Parameters" tab to provide different physical parameters (such as the mass of the nodes, the drag of the surface, and the repulsion or attraction coefficients) and change the behavior of the graph (in some big graphs, in fact, the system might become unstable and tweaking the parameters might be necessary).
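
To give a feel for how such parameters might interact, here is a simplified sketch of one simulation step. It is not Agora's Java engine: the integration scheme, the linear force laws, the single symmetric weight per pair and all the names are assumptions made for illustration.

```python
import math

def step(positions, velocities, edges, dt=0.05, mass=1.0, drag=1.0,
         repulsion=1.0, attraction=1.0, safe_distance=1.0):
    """Advance every node by one time step, updating the lists in place."""
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # every pair once: this is the N*N part
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # coincident nodes have no defined direction
            # pull > 0 draws the pair together, pull < 0 pushes it apart
            pull = attraction * edges.get((i, j), 0.0) * dist
            if dist < safe_distance:
                pull -= repulsion * (safe_distance - dist)
            fx, fy = pull * dx / dist, pull * dy / dist
            forces[i][0] += fx
            forces[i][1] += fy
            forces[j][0] -= fx
            forces[j][1] -= fy
    for i in range(n):
        for axis in (0, 1):
            # drag opposes the current velocity and damps oscillations
            accel = (forces[i][axis] - drag * velocities[i][axis]) / mass
            velocities[i][axis] += accel * dt
            positions[i][axis] += velocities[i][axis] * dt

# Two related nodes that start far apart settle at a stable distance.
positions = [[0.0, 0.0], [3.0, 0.0]]
velocities = [[0.0, 0.0], [0.0, 0.0]]
edges = {(0, 1): 1.0}
for _ in range(4000):
    step(positions, velocities, edges)
distance = math.hypot(positions[1][0] - positions[0][0],
                      positions[1][1] - positions[0][1])
```

In a sketch like this, too little drag or too large a time step makes the nodes overshoot and oscillate, which is one way a big graph can become unstable.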

Implementation

The MBOX processing script is written in Python. The visualization engine is written in Java and works both as an applet and as a standalone application. The Java application is self-contained and optimized for speed and size. It uses Java2D for drawing, which is hardware accelerated on those JVMs that support it (such as the one on Mac OS X).

Challenges and Future Work

I see a few, both technological and social. Technologically, the application will not scale easily, since the computational complexity of the mathematical model grows with the square of the number of nodes involved (N*N). This is very similar to the math problem that molecular chemists face when they have to visualize the exact shape of complex molecules. There are no known (to me, at least) algorithmic methods to reduce that computational complexity down to, say, N*log(N), which would be much more reasonable. If you know more, I want to hear from you!
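
One avenue that might be worth exploring, sketched here as an untested assumption rather than anything Agora implements: since the repulsion vanishes beyond the safe distance, only pairs of nodes that are close to each other need to be checked. Sorting the nodes into a uniform grid whose cells are one safe distance wide limits the search to neighboring cells. The function name and the grid layout are illustrative.

```python
import math
from collections import defaultdict

def close_pairs(positions, safe_distance):
    """Return the set of index pairs (i, j), i < j, closer than safe_distance."""
    # Bucket every node into a square cell one safe distance wide.
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        cell = (math.floor(x / safe_distance), math.floor(y / safe_distance))
        grid[cell].append(i)
    pairs = set()
    for (cx, cy), members in grid.items():
        # Two nodes within the safe distance are always in the same cell
        # or in adjacent cells, so only the 3x3 neighborhood is scanned.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.hypot(
                                positions[i][0] - positions[j][0],
                                positions[i][1] - positions[j][1]) < safe_distance:
                            pairs.add((i, j))
    return pairs
```

This only helps while the nodes are spread out: if everybody crowds into a few cells, the cost is back to N*N. The attraction forces are not affected, since they only need to be computed once per relationship.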

Another challenge is finding out how to collapse the different email addresses used by the same person so that the data is more meaningful. Today, the data is harvested using the email address as a unique key. In the future, the goal is either to harvest this information transparently, somewhere and somehow, or to have a database where users tell the system which addresses belong to them, so that the dataclouds can be renormalized accordingly. It might be possible to process the .forward files of a given domain (the ASF, for example), but this raises privacy concerns.
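
As a sketch of the second option, suppose users have supplied a table that maps each of their secondary addresses to one primary address. The table, the function names and the edge format below are all hypothetical. Renormalizing the data then amounts to rewriting the keys and summing the weights.

```python
def canonicalize(address, aliases):
    """Map an address to its owner's primary address, if one is known."""
    address = address.strip().lower()
    return aliases.get(address, address)

def merge_edges(edges, aliases):
    """Rewrite (replier, author) -> count edges using canonical addresses."""
    merged = {}
    for (src, dst), weight in edges.items():
        a, b = canonicalize(src, aliases), canonicalize(dst, aliases)
        if a == b:
            continue  # the same person under two addresses: not a relation
        merged[(a, b)] = merged.get((a, b), 0) + weight
    return merged

# Hypothetical alias table and reply counts, for illustration only.
aliases = {"alice@home.example.net": "alice@example.org"}
edges = {
    ("alice@example.org", "bob@example.org"): 2,
    ("alice@home.example.net", "bob@example.org"): 3,
    ("alice@home.example.net", "alice@example.org"): 1,
}
merged = merge_edges(edges, aliases)
```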

Another technical challenge (kudos to Pier Fumagalli for pointing this out) is to parallelize the simulation routine across different threads so that multi-processor machines can really execute the application in parallel. While this degrades performance on a single-processor machine (although not by much), it speeds it up incredibly on a multi-processor one. Since I estimate that today the overall ASF datacloud might account for something around 15000 to 20000 nodes (that is, email addresses that had a significant replying impact on any ASF community), it might not be possible to actually play with it in real time, but at least it could be possible to run the application in the background for several weeks and see what happens. For this, we really need to go multiprocessor.

Disclaimer

People have an automatic repulsion toward any mathematical model that tries to observe and simulate human behavior. While this is understandable, it must also be acknowledged that the simple analysis of human-created hyperlink topologies produced the most successful search engine (Google), and that it is human semantic inference that leads to the perception of its smartness. I humbly try to follow the same model: avoid heuristic analysis and concentrate on the extraction of human-inferred metadata (in this case, replying is Agora's metadata, where hyperlinking is Google's).

At the same time, Agora does NOT want to be a measurement tool for virtual communities, and users should not rely only on Agora-inferred data to understand how a community behaves or how various communities interact.

Last but not least, while Agora discards the content of the email messages, even their metadata can raise security concerns. Analyzing somebody else's email without their permission is unethical, and illegal in many countries: make sure you have permission from the senders before analyzing their messages.

For example, analyzing email messages that were posted to a public mailing list, newsgroup, BBS or forum and made publicly available is perfectly legal. Processing your employer's mailbox without telling them, well, that's not, unless they gave you explicit permission.

Anyway, under no circumstances will the Apache Software Foundation be liable for abuse of its software, as explicitly stated in the license under which Agora is distributed.

Frequently Asked Questions

Why the name Agora?
The term originates in ancient Greece and can be described as follows:

A central feature of the polis. Originally a marketplace, the agora also served as the chief social and political meeting place. Along with the acropolis (the upper fortified part of a city), the agora housed the most important buildings of the city-state. [original]

I thought that the visualizer could be seen as the view from above of a virtual meeting place where all individuals are gathered in one location.

Why is it hosted here and not in a more public apache location?
Simply because I didn't know where to put it and wanted to hear feedback before doing anything. I'll be very happy to move it to a more visible location if there is a desire for it.

Can I run Agora on my own mail lists?
Sure, Agora is not bound in any way to Apache or Apache-specific data. Agora is able to generate dataclouds from any collection of MBOX mail archives.

Do you know any similar efforts?
Paul Mutton independently applied a similar concept (a reply implies a social connection) to IRC channels for PieSpy. [Kudos to Nicola Ken Barozzi for forwarding me the link]