OK - some comments from me:
"It is possible for the broker to recieve frames at a rate faster than it can process. When this occurs a large number of Jobs and Events are produced. This can further slow down the system by increasing memory usage, causing the GC to to run frequently and generally compound the issue. This is undesireable"
The maximum number of jobs is equal to the number of currently active connections, so it's pretty useless to try to trigger anything off the number of active jobs (I know, I've tried, before slapping myself on the forehead and realising why it's a pointless thing to do).
1. To completely implement the above solution, a number of changes are required.
2. The broker needs to be able to determine currently used / available memory. This can be obtained via JMX.
I think it may be easier to instead work off approximations/upper bounds on the amount of memory consumed. We can account for message sizes as they come in, and should be able to use approximations for memory use per connection / subscription / delivery to a queue / etc... In the first stage we can just use an arbitrary hard limit per queue based on underlying message size.
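As a rough illustration of what that accounting could look like (the class name and the per-entity overhead figures below are made up for the sketch, not measured values or existing broker types):

    // Rough sketch of the "approximate, don't measure" approach: keep a running
    // upper bound by charging a fixed overhead per connection/subscription and
    // the payload size per message. The overhead constants are placeholders.
    import java.util.concurrent.atomic.AtomicLong;

    public class MemoryEstimate
    {
        // Assumed per-entity overheads (bytes) - illustrative numbers only.
        private static final long CONNECTION_OVERHEAD   = 16 * 1024;
        private static final long SUBSCRIPTION_OVERHEAD = 4 * 1024;

        private final AtomicLong estimatedBytes = new AtomicLong();

        public void connectionOpened()         { estimatedBytes.addAndGet(CONNECTION_OVERHEAD); }
        public void connectionClosed()         { estimatedBytes.addAndGet(-CONNECTION_OVERHEAD); }
        public void subscriptionAdded()        { estimatedBytes.addAndGet(SUBSCRIPTION_OVERHEAD); }
        public void subscriptionRemoved()      { estimatedBytes.addAndGet(-SUBSCRIPTION_OVERHEAD); }
        public void messageEnqueued(long size) { estimatedBytes.addAndGet(size); }
        public void messageDequeued(long size) { estimatedBytes.addAndGet(-size); }

        /** Current upper-bound approximation of memory attributable to messaging state. */
        public long estimate()
        {
            return estimatedBytes.get();
        }
    }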
3. The threads which process Jobs and Events need to be signalled to pause, and to resume.
Pausing the threads isn't what you want to do, since this will make the broker lock up (these are threads from a threadpool... if they just pause then they can't be used to perform actions which will reduce the number of messages in the broker). What you really need to do is "suspend" the inbound socket connection, possibly linking this to some sort of message-level flow control so you can unsuspend to receive acks.
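For the suspension route, something along these lines could sit near the protocol session. It's only a sketch: it assumes MINA's IoSession (which exposes suspendRead()/resumeRead()), and the wrapper class and the point at which it gets called are hypothetical.

    // Hypothetical sketch: suspend/resume reads on the inbound connection rather
    // than pausing pool threads. Assumes MINA's IoSession; the memory-threshold
    // trigger that would call these methods is not shown.
    import org.apache.mina.common.IoSession;

    public class InboundFlowController
    {
        private final IoSession session;
        private volatile boolean suspended;

        public InboundFlowController(IoSession session)
        {
            this.session = session;
        }

        /** Called when a memory threshold is crossed. */
        public void suspendInbound()
        {
            if (!suspended)
            {
                suspended = true;
                session.suspendRead();   // stop reading more data from the socket
            }
        }

        /** Called once enough messages have been consumed/acknowledged. */
        public void resumeInbound()
        {
            if (suspended)
            {
                suspended = false;
                session.resumeRead();
            }
        }
    }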
4. Protectio in the MINA layer of both the client and the broker needs to be enabled by default.
First it needs to work; when it was enabled previously we saw errors, I believe.
5. When a memory threshold is reached, the broker should fire an event which signals the job processing threads to pause. In future this event should be listened for by other mechanisms designed to mitigate the issue - such as flow to disk.
As above - thread "pausing" is not the answer. What I suggest is that queues have threshold limits; when these are triggered you would expect the queue to move into a "flow-to-disk" mode.
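To illustrate the threshold idea (the class and listener names here are purely illustrative, not existing broker APIs):

    // Hypothetical sketch of per-queue threshold limits: crossing the threshold
    // switches the queue into flow-to-disk mode and notifies a listener, rather
    // than pausing any worker threads.
    public class ThresholdedQueue
    {
        /** Fired once when the queue's memory estimate first crosses its threshold. */
        public interface ThresholdListener
        {
            void thresholdExceeded(String queueName, long estimatedBytes, long thresholdBytes);
        }

        private final String name;
        private final long thresholdBytes;
        private final ThresholdListener listener;
        private long estimatedBytes;
        private boolean flowToDisk;

        public ThresholdedQueue(String name, long thresholdBytes, ThresholdListener listener)
        {
            this.name = name;
            this.thresholdBytes = thresholdBytes;
            this.listener = listener;
        }

        public synchronized void enqueue(long messageSize)
        {
            estimatedBytes += messageSize;
            if (!flowToDisk && estimatedBytes > thresholdBytes)
            {
                flowToDisk = true;   // the queue changes mode; no threads are paused
                listener.thresholdExceeded(name, estimatedBytes, thresholdBytes);
            }
        }

        public synchronized void dequeue(long messageSize)
        {
            estimatedBytes -= messageSize;
        }
    }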
Fundamentally what we are trying to achieve is to bound all our unbounded buffers.
Theoretically we can have buffers at the following points in our code (i.e. discounting TCP stack buffers):
i) Undecoded bytes read from the wire
ii) Decoded but unprocessed frames
iii) Messages on queues
iv) Unencoded Frames to be sent to clients
v) Encoded bytes to be sent on the wire
In the work that Rafi and I did previously, we looked to replace i) and v) with fixed-size byte buffers, and we removed all buffers at points ii) and iv).
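A much-simplified sketch of what bounding i) and v) amounts to (this is not the actual protectio code, just the shape of the idea): a fixed-capacity buffer where the producer blocks once the bound is hit, so undecoded/unencoded data cannot grow without limit.

    // Simplified illustration of bounding the wire-level buffers (points i and v):
    // a fixed-capacity queue of byte chunks; the producer blocks when full, so the
    // amount of buffered data is bounded by maxChunks * chunk size.
    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BoundedWireBuffer
    {
        private final BlockingQueue<ByteBuffer> chunks;

        public BoundedWireBuffer(int maxChunks)
        {
            this.chunks = new ArrayBlockingQueue<ByteBuffer>(maxChunks);
        }

        /** Producer side (network read or frame encode): blocks when the bound is reached. */
        public void put(ByteBuffer chunk) throws InterruptedException
        {
            chunks.put(chunk);
        }

        /** Consumer side (decoder or socket write): frees space as data is processed. */
        public ByteBuffer take() throws InterruptedException
        {
            return chunks.take();
        }
    }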
For fixing the queue size problem you need to take action at a higher level than the i/o layer.
Having said all that, if we can get protectio reliably working then that is a good first step.