- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
I M M E D I A T E   T O D O
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
O P E N   B U G S
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

BUG: DEADEND LEAK IN SERVER WHEN A MESSAGE IS REJECTED
==========================================================

description
-----------
when a message arrives at the server and is rejected (forwarded), the message leaks.

symptoms
--------
leaks are logged.

analysis
--------
the message is, I believe, session_notify()'ed, and is apparently never destroyed
at that terminal.
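a possible shape of the fix (sketch only -- the actual session_notify signature and
ownership rules must be confirmed against the etch session interface; the handler name
here is hypothetical):

    /* hypothetical terminal notify handler: if the rejected message is handed to
     * session_notify and terminates here, this owner must destroy it, otherwise
     * it leaks exactly as observed. */
    int xxxx_session_notify (void* thisx, objmask* evt)
    {
        /* ... log or otherwise record the rejection ... */
        ETCHOBJ_DESTROY(evt);   /* destroy the unwanted message payload */
        return 0;
    }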
BUG: CLIENT READ THREAD DESTRUCTION EXCEPTION IN destroy_hashtable()
=====================================================================

SYMPTOMS
1. usually happens in destroy_plainmailboxmgr():: ETCHOBJ_DESTROY(thisx->mailboxes);
   but if this is commented out, it will occur at some other destroy_hashtable().
2. occurs at the jenkins refree():: free(x); call, which raises an exception in the
   CRT _output_l() while trying to write some humongous text.

ANALYSIS
1. the fact that the exception moves somewhere else when we comment out the offending
   call may mean that we have recently corrupted the heap.
   a. back up from the crash and try to find where the heap may have been clobbered.
2. it always happens on the stress run, even with a single message, but it does not
   *usually* occur during the singles run, so what is the defining event?
3. it does not occur outside the debugger. it always occurs in the debugger, at
   ETCHOBJ_DESTROY(thisx->mailboxes);
4. call stack:
   perftest_threadproc_add():
   a. result = remote->remote_base->start_waitup (remote->remote_base, p->waitms);
   b. etch_int32* resultobj = remote->add (remote, new_int32(1000000000), new_int32(2000000000));
   c. ETCHOBJ_DESTROY(resultobj);
   d. result = remote->remote_base->stop_waitdown (remote->remote_base, p->waitms);

   perftest_threadproc_add()            calls ETCHOBJ_DESTROY(remote);
   destroy_perf_remote_server()         calls ETCHOBJ_DESTROY(thisx->client_factory);
   destroy_etch_client_factory()        calls ETCHOBJ_DESTROY(thisx->dsvc);
   destroy_delivery_service_via_interface() calls deliveryservice_implobj->destroy (deliveryservice_implobj);
   destroy_tcp_delivery_service()       calls ETCHOBJ_DESTROY(((etch_mailbox_manager*) thisx->mailboxmgr));
   destroy_plainmailboxmgr()            calls ETCHOBJ_DESTROY(thisx->mailboxes);
   destroy_hashtable_default()          calls destroy_hashtable()
   destroy_hashtable()                  calls map->vtab->hdestroy(map->realtable)
   jenkins_destroy()                    calls hdestroy((htab*) realtable);
   hdestroy()                           calls refree()
   refree()                             calls free()
   free()                               calls _free_dbg()
   _free_dbg()                          calls HeapFree()
   msvcr80d.dll::_output_l crashes
5. experiment: remove stop_waitdown() from perftest_threadproc_add() to see if the
   exception stops.
   a. if the exception does stop, then stop_waitdown() clobbered the heap.
   b. first thing we noticed was that with stop_waitdown() removed, the behavior was
      unchanged outside the debugger, i.e., no leaks, good result.
   c. however, inside the debugger the exception happens at the same spot.
   d. CONCLUSION: the problem must be higher in perftest_threadproc_add()
6. log state on breaking just prior to ETCHOBJ_DESTROY(remote):
   0202 165504 CONX D connxn 1 start receive pump
   0202 165504 CONX D connxn 1 thread 1 receiving ...
   0202 165504 CONX X connxn 1 begin blocking receive ...
   0202 165506 CONX X connxn 1 begin send
   0202 165506 CONX D connxn 1 end send 46 bytes
   0202 165508 TRAN X reading mailbox (wait 3000) ...
   0202 165508 CONX D connxn 1 end receive 46 bytes
   0202 165508 MBOX X etchmbox_message acquiring mailbox read lock
   0202 165510 MBOX X tcpdelsvc_endcall acquiring mailbox read lock
   0202 165510 MBOX X etchmbox_message releasing mailbox read lock
   0202 165510 MBOX X tcpdelsvc_endcall releasing mailbox read lock
   0202 165510 CONX D connxn 1 thread 1 receiving ...
   0202 165510 CONX X connxn 1 begin blocking receive ...
7. experiment: try removing the read lock set and reset.
   a. tried that, but it reintroduced the race condition crash that existed before we
      put the locks in.
8. experiment: try removing the remote->add and just do the start and stop.
   a. OK, that gets rid of the exception at the ETCHOBJ_DESTROY(remote); hdestroy();
      jenkhtab::refree() calls.
   b. this is a good clue.
   c. now add stuff back - first determine what we are in the process of freeing at the
      exception, then find where this gets previously freed after remote->add.
   d. ETCHOBJ_DESTROY(mboxmgr->mailboxes);
   e. find the destructor being called and breakpoint there to see if it is called twice.
      the dtor is just destroy_hashtable.
   f. so instead, find where "mailboxes" is instantiated, save the pointer globally, and
      in destroy_hashtable, test whether the hashtable is that global address
      (a sketch of this instrumentation follows the test points below).
   g. when we remove the particular hdestroy(), the crash simply moves elsewhere, to a
      HeapAlloc() or a HeapFree(), meaning the problem is not with a hashtable at all,
      it is due to a corrupt heap.
   h. ask: what is the difference between the verify tests and this stress test call?
      run a verify add() again and see if the same heap problem exists.
      yes, it exists there as well.
9. FOUND CODE triggering heap corruption
   ----------------------------------------------
   *** if we comment out the imailbox free, the heap corruption is no longer evident,
       however the imailbox object leaks.
   *** perhaps then this free is not the actual problem but somehow triggers the problem.

   xmpl_perf_remote_server:: perf_remote_add()
       result = perf_remote_dispose_mailbox (thisx, &mbox);
   int destroy_mailbox (etch_plainmailbox* thisx)
       etch_free(thisx->imailbox);  // comment out and heap issue gone, but this item leaks
10. TEST POINTS
    0. xmpl_perf_client_main.c:: int _tmain (int argc, _TCHAR* argv[])
         result = perf_xmpl_test_server (uri, new_perf_client, waitms);
    1. xmpl_perf_client_tests.c:: int perf_xmpl_test_server (wchar_t* uri, new_client_funcptr user_ctor, const int waitms)
         #if(1)
         etchlog (ETCHTEST, ETCHLOG_INFO, "run perf add test ...\n");
         result = run_perftest ("add", uri, waitms, runtime, trials, numthreads, perftest_threadproc_add, user_ctor);
         #endif
    2. xmpl_perf_client_tests.c:: int run_perftest (char* descr, wchar_t* uri, const int waitms, const int runtime, const int count, const int threads, perftest_threadproc testproc, new_client_funcptr new_perfclient)
         //mask = "%s %d took %f -- trying %d to get >= 1 second\n";
         //while(time < 1.0)
         //{
         //    if (time > 0.1)
         //        etchlog(ETCHTEST, ETCHLOG_INFO, mask, descr, oldn, time, n);
         //    oldn = n;
         //    time = perftest_run_one(&perftest, n, testproc);
         //    n *= 2;
         //}
         n = oldn;
    3. xmpl_perf_client_tests.c:: int perftest_threadproc_add (perftestobj* p)
         //while(iterations--)  /* execute perf.add() n times */
         //{
         //    etch_int32* resultobj = remote->add (remote, new_int32(1000000000), new_int32(2000000000));
         //    assert(resultobj);
         //    ETCHOBJ_DESTROY(resultobj);
         //}
    4. perf_remote_server.c:: etch_int32* perf_remote_add (perf_remote_server* thisx, etch_int32* x, etch_int32* y)
         i_mailbox* mbox = perf_remote_add_begin (thisx, x, y);
         if (mbox)
         {
             // TESTING NOTE we can comment the _end() call out without leaking anything
             resultobj = perf_remote_add_end (thisx, mbox);
             // another thing to try - comment prior mailbox access to see if problem is there.
             result = perf_remote_dispose_mailbox (thisx, &mbox);
         }
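sketch of the instrumentation proposed in (8f), to confirm or rule out a double destroy.
names are hypothetical; this is a debugging aid, not a fix:

    #include <assert.h>

    static void* g_watched_table = NULL;      /* the hashtable instance under suspicion */
    static int   g_watched_freed = 0;

    void watch_table (void* table)            /* call where mailboxmgr->mailboxes is created */
    {
        g_watched_table = table;
        g_watched_freed = 0;
    }

    void check_table_destroy (void* table)    /* call at the top of destroy_hashtable() */
    {
        if (table && table == g_watched_table)
        {   assert(!g_watched_freed);         /* fires on a second destroy of the same table */
            g_watched_freed = 1;
        }
    }

the same guard can be pointed at thisx->imailbox to test the suspicion in (9) that the
etch_free(thisx->imailbox) is freeing memory that something else already freed or still owns.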
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
F I X E D   B U G S
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

BUG: CLIENT MAILBOX READ TIMEOUT            *** F I X E D ***
=================================

description
-----------
a. there is a race condition wherein a mailbox write does not complete through notify
   before a mailbox read completes and the service destroys itself. mutexes and waits
   do not seem to work; they deadlock, apparently on the same thread. we added a sleep
   in stop_perf_client() to see if this forces the write to complete; however we would
   prefer to have the write complete before the read starts.
   *** could we possibly signal the read wait when we finish writing, rather than at
   etchqueue_put_withwait(), earlier in the same function (etchmbox_message)? currently
   etch_apr_queue_push() does apr_thread_cond_signal(queue->not_empty); we would have
   to pull this way out to the tcp client receive listener on_data call, in order to
   leave as little of the write stack as possible unwound after the mailbox read kicks
   in and completes. see "Q: who calls etchmbox_message()" in the FAQ for the entire
   call stack from the mailbox down to the receive listener proc.
   HOW ABOUT if we establish a callback on the mailbox close which accesses the mutex?
   need to start at "stopping perf client ..." and find the best spot on the main thread
   to wait on the mutex set in mailbox.message(). if we can pass a mutex into perf from
   client main, which gets acquired in etchmbox_message(), we can also acquire it in
   stop_perf_client(), which causes the thread to continue. first trace start_perf_client
   to see how we can get the mutex up to the mailbox. we should pass the mutex in via the
   parameter bundle. we should still be able to prevent reads from occurring until the
   write has completed and vice versa, but we need to find where we can lock the read
   from outside of the receive thread to do so. this should be in tcpdelsvc_endcall().
   (a sketch of this serialization idea appears at the end of this entry.)
b. if the mailbox read times out, the message leaks. it seems to be just the message,
   not the wrapper. it would seem that the message is eventually in the mailbox,
   however, is it? if so it should be wrapped, but it also should be destroyed with the
   mailbox. if not, what happens to the message? ideally the message is either
   redelivered, destroyed if not redelivered, or remains in the mailbox and is destroyed
   with the mailbox. so why does the message leak?

notes - leak case
-----------------
___ in etch_plainmailbox.c: etchmbox_read_result()
    *out = new_mailbox_element (NULL, NULL);   **** LEAKS ****
___ in perf_remote_server.c: perf_remote_add_end() we return the leaked object to perf_remote_add()
___ in perf_remote_server.c: perf_remote_add() we return the leaked object to perf_xmpl_verify_server()
___ in perf_xmpl_verify_server() the leaked object is addresult. we ETCHOBJ_DESTROY(addresult);

todo
----
___ the short-circuited code is in etchremote_stop_waitdown.   NO LONGER SHORTED
___ etch_tcpconx_transport_control CLASSID_CONTROL_STOP: now if the client does
    etch_tcpclient_stop_listener, it appears as if either client or server can just do
    etch_tcpconx_close()
___ after etch_tcpconx_close, the etch_tcpclient object can perhaps be destroyed in the
    perf_remote_server destructor, which is invoked from [main].
___ mailbox read times out even though wait is infinite.
    THOUGHT TO BE FIXED, BUT STILL TIMES OUT 50% OF THE TIME
    a. first do the mailbox test to test the wait mechanism
       ALL TESTS PASS INCLUDING READ WAITS
    b. test wait ms -1    OK, this means DO NOT WAIT
    c. then trace start waitup
___ determine where waitms gets set to -1, since it starts out zero
    NO IT DOES NOT: it is the start/wait that is zero; the mailbox wait is per-type.
    we now specify ETCH_DEFAULT_MAILBOX_READ_WAITMS to force a short wait.

to investigate:
---------------
___ see redeliver() in etchmbox_close_read() for possible leaks.
OK_ TODO close the socket forcibly instead of sending data, in etch_tcpclient_stop_listener
___ there may be connection closes and mailbox reads going on at the same time;
    investigate and synchronize perhaps. we've got a stop_perf_client() going on on top
    of a mbox.close_read(). the stop_perf_client() closes the client listener; the
    close_read closes what?
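sketch of the serialization idea from (a), using APR mutexes. names are hypothetical;
the point is only that the receive thread and the teardown path contend on one lock,
so teardown cannot begin until a pending mailbox write has completed:

    #include <apr_thread_mutex.h>

    /* created in client main and handed down via the parameter bundle */
    apr_thread_mutex_t* g_mbox_write_lock;

    /* receive thread: wrapped around the mailbox write in etchmbox_message() */
    void mbox_write_guarded (void)
    {
        apr_thread_mutex_lock   (g_mbox_write_lock);
        /* ... queue the message to the mailbox and fire notify ... */
        apr_thread_mutex_unlock (g_mbox_write_lock);
    }

    /* main thread: at the top of stop_perf_client() */
    void teardown_gate (void)
    {
        apr_thread_mutex_lock   (g_mbox_write_lock);   /* blocks until any in-progress write finishes */
        apr_thread_mutex_unlock (g_mbox_write_lock);
        /* ... now safe to stop the listener and tear the service down ... */
    }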
BUG: URI DISAPPEARS AFTER ONE ROUND TRIP TO SERVER          * * * F I X E D * * *
==================================================

description
-----------
the uri allocated on the stack in main is no longer present after one test.

analysis
--------
it may be the case that it is simply cleared out by the URL parser. investigate.

call sequence
-------------
 1. _tmain
 2. perf_xmpl_test_server(uri)
 3. run_perftest("add", uri, perftest_threadproc_add);
 4. perftest_run_one(perftestparams, perftest_threadproc_add)
 5. perftest_run_threads(perftestparams, perftest_threadproc_add)
 6. perftest_threadproc_add
 7. perf_remote_server* remote = new_remote_server (p->uri, impl_factory);
 8. deliverysvc = new_etch_transport (uri, (etch_factory_params*) p, NULL);
 9a. etch_url* url = new_url(uri);
 9b. i_delivery_service* newds = new_etch_transport_a (url, params, conximpl);
     uri is still OK to this point. it is 0x00454070
 9c. url->destroy(url);   uri is still OK
10. (perf_remote_add) i_mailbox* mbox = perf_remote_add_begin (thisx, x, y);
11. (perf_remote_add) resultobj = perf_remote_add_end (thisx, mbox);   uri is still OK
12. (perf_remote_add) perf_remote_dispose_mailbox (thisx, &mbox);
13. (perftest_threadproc_add) return from etch_int32* resultobj = remote->add (remote, ...);
14. (perftest_threadproc_add) ETCHOBJ_DESTROY(remote);
15. (destroy_perf_remote_server) ETCHOBJ_DESTROY(thisx->remote_base);
16. (destroy_perf_remote_server) ETCHOBJ_DESTROY(thisx->server_base);
17. (destroy_perf_remote_server) ETCHOBJ_DESTROY(thisx->client_factory);
18. (etch_transport.c destroy_etch_client_factory)
19. (etch_transport.c destroy_etch_client_factory) ETCHOBJ_DESTROY(thisx->dsvc);
20. (etch_transport.c destroy_tcp_delivery_service) ETCHOBJ_DESTROY(((etch_mailbox_manager*) thisx->mailboxmgr));
    *** the above triggers the stack problem - note that we previously destroyed the
        mailbox in (12), this is significant.
21. (destroy_etch_client_factory) uri is still OK after destroying the delivery service.
22. (xmpl_perf.client_tests.c perf_xmpl_test_server) i is now 1, numthreadcounts is now 2,
    uri is still OK
    run_perftest ("add", uri, perftest_threadproc_add);   see (3)
23. (xmpl_perf.client_tests.c run_perftest) perftest_run_one (perftest_threadproc_add);
24. (xmpl_perf.client_tests.c perftest_run_one) perftest_run_threads (perftestobj, perftest_threadproc_add)
    note that uri is still good and that perftestobj.uri is 0x00454070
25. (xmpl_perf.client_tests.c perftest_run_threads) for(; i < p->threads; i++)
    etch_thread* newthread = pool->run (pool, runner, p);
26. (etch_thread.c etch_threadpool_run_freethread)
27. (etch_threadpool_run_freethread) newthread->start (newthread);
28. (xmpl_perf.client_tests.c perftest_threadproc_add (perftestobj* p))
    // OK the perftestobj here is now invalid
    perf_remote_server* remote = new_remote_server (p->uri, impl_factory);
    NOTE that the original uri memory is still OK, but the perftestobj object passed to
    the threadproc is bad, therefore the uri extracted from that object is garbage.

**** THE PROBLEM is that etch_apr_threadproc passes etch_threadparams* to the threadproc
     (perftestobj is params.data), whereas when we're unthreaded we pass a perftestobj.
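one possible shape of the fix, given the conclusion above (a sketch only; type and field
names are as recalled from these notes and should be verified against etch_thread.h):
when launched from the pool, the threadproc must unwrap the perftestobj from the
etch_threadparams it actually receives, and the unthreaded caller must wrap its
perftestobj the same way.

    int perftest_threadproc_add (etch_threadparams* tp)
    {
        /* the pool wraps userdata in threadparams, so do not cast the argument
         * directly to perftestobj* */
        perftestobj* p = (perftestobj*) tp->data;
        /* ... run the start_waitup / add / stop_waitdown sequence against p->uri ... */
        return 0;
    }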
BUG: ETCH HEAP ERROR AFTER DESTROYING CLIENT RESOURCES      * * * F I X E D * * *
=======================================================

description
-----------
when a client runs a service on simultaneous threads, all threads after the first crash
on etch_transport::destroy_etch_client_factory(): ETCHOBJ_DESTROY(thisx->in_resx);
WRONG: actually the crash is in the perftest_threadproc_add thread: since
threadparams.is_own_data = TRUE, we etch_free(userdata). recall that userdata is the
perftestobj.

symptoms
--------
0211 150537 TRAN D destroying resources ...
0211 150537 MEMX E etch heap tracking store missing key '0012feb8'

analysis
--------
the same resources are passed to all threads. one or more objects in resources are
destroyed in the first thread, and then an attempt is made to destroy them again.
a. eventually we want to get rid of the resources map; not sure if this is the time to do so.
b. explore the possibility of the factory not owning the resources.

1. Q: where is the client factory allocated/populated?
   A: etch_transport::new_client_factory (objmask* session, i_session* isession, new_client_funcptr)
      called from: helper::start_perf_client (wchar_t* uri, new_client_funcptr);
      HOWEVER resources have not been populated yet.
2. Q: where are resources populated to the client factory?
   A: helper::new_remote_server (uri, etch_client_factory)
      there does not seem to be anything volatile populated in resources there.
3. Q: what does the resx destruction crash on?
   A: it doesn't. the crash is on the message thread exit, when the thread exit handler
      frees userdata. and userdata is the perftestobj.
4. Q: where is the perftestobj allocated?
   A: it is allocated on the run_perftest() stack (!!!!)
5. Q: since it is on the stack, why didn't it crash on the main thread test?
   A: because the main thread test does not go through the thread exit proc.
6. Q: should we try to clone the user params, or should we set the thread to not destroy params?
   A: threadpool.is_free_data *should* control whether the pool's threads free their
      userdata. THAT FIXED IT!
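for reference, a sketch of the fix as applied (member names as recalled from the analysis
above; verify against etch_thread.h): since the perftestobj lives on the run_perftest()
stack, the pool's threads must not free their userdata on exit.

    /* in run_perftest(), before dispatching the worker threads */
    pool->is_free_data = FALSE;     /* userdata (the perftestobj) is stack-allocated here */
    etch_thread* newthread = pool->run (pool, perftest_threadproc_add, &perftest);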
BUG: FOPEN() ABSOLUTE PATHS CHANGE WHEN RUNNING IN DEBUGGER     * * * F I X E D * * *
============================================================

description
-----------
when running from the command line, "../logsrv/etchserver.log" resolves to
"C:\workspace\etch\src\etch\bindings\c\project\logsrv"
when running in the debugger, it resolves to
"C:\workspace\etch\src\etch\bindings\c\project\xamples\xmpl_perf\xmpl_perf_server\logsrv"
NOTE that this will be a similar problem for the config file.

symptoms
--------
fails to open the log file, since the subdirectory does not exist.

analysis
--------
possible solution 1: ensure the directory exists using the c runtime
possible solution 2: don't use a subdirectory; use file extensions of ".srvlog" and ".clilog"

resolution
----------
1. create the log directory prior to log open: ../srvlog or ../clilog
2. expect the config file in ../etch-c-server.properties or ../etch-c-client.properties


BUG: LOG FILES WILL CONTINUE TO ACCUMULATE      * * * F I X E D * * *
===========================================

description
-----------
log files will accumulate in the log directories without manual intervention.

analysis
--------
need a log file backup filename generator.
need portable directory file enumeration code.
at startup, or when log file lines exceed the configured maximum, need code to rename the
log file to a backup. when the log file count exceeds the configured maximum, need code
to purge some number of files from the log directory.

resolution
----------
1. coded and tested POSIX-style (dirent) functions for directory browsing (etchutl.h).
   these are compiled in windows builds. presumably in linux builds the c runtime dirent
   functions will work the same.
2. added code to create backup log file names
3. added code to backup log files
4. added code to purge log files
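a compact sketch of the log-directory maintenance flow described above, written against
the POSIX dirent interface (the windows build would go through the etchutl.h equivalents;
the directory name, extension, and limits are illustrative only):

    #include <sys/stat.h>
    #include <dirent.h>
    #include <string.h>

    void ensure_and_trim_logdir (const char* logdir, int max_files)
    {
        int count = 0;
        struct dirent* entry;
        DIR* dir = opendir (logdir);
        if (NULL == dir)
        {   mkdir (logdir, 0755);   /* create the log directory before fopen() (resolution 1) */
            return;
        }
        while (NULL != (entry = readdir (dir)))     /* enumerate backed-up log files */
            if (strstr (entry->d_name, ".log"))
                count++;
        closedir (dir);
        if (count > max_files)
        {   /* purge some number of the oldest backups here (resolution 4) */
        }
    }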
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
F A Q
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Q: where do we catch the mailbox read timeout?
A: etch_transport::tcpdelsvc_endcall() calls mbox->read_withwait()

Q: what is the call stack leading up to the mailbox read?
A: perf_xmpl_verify_server()   calls remote->add(1,2);
   perf_remote_add()           calls perf_remote_add_end (mailbox)
   perf_remote_add_end()       calls remote_base->end_call (imailbox, &result);
   etchremote_endcall()        calls deliveryservice->end_call(mailbox);
   tcpdelsvc_endcall()         calls mailbox->read_withwait(timeoutms, &mailboxelt);
   etchmbox_read_withwait()    calls etchqueue_get_withwait(delay, out);
   etchqueue_get_withwait()    calls etch_apr_queue_pop(wait_usec, out);

Q: what is the call stack for client data receive?
A: etch_tcpclient_listenerproc
   etch_tcpclient_on_data()
   etch_pktizer_session_data
   etch_msgizer_session_packet
   pmboxmgr_session_message        *** here we do pmboxmgr_get_mailbox() *** check if mbox closed here
   etchmbox_message
   etchmbox_fire_notify (etch_plainmailbox* thisx)

Q: what are the values for infinite wait and no wait?
A: #define ETCH_INFWAIT 0
   #define ETCH_NOWAIT (-1)

Q: where is the timeout value for mailbox reads determined?
A: tcpdelsvc_endcall()
   etch_type_impl* typeinfo = response_type? (etch_type_impl*) response_type->impl: NULL;
   timeout = typeinfo->timeout > 0? typeinfo->timeout: ETCH_DEFAULT_MAILBOX_READ_WAITMS;

Q: where is the message return object created, what is the return chain, and where is it destroyed?
A: the return object is created either in etchmbox_check_read_result(), called from
   etchmbox_read() or etchmbox_read_withwait(), or in etchmbox_message(), calling
   new_mailbox_element() when the message is queued to the mailbox. if the etchmbox_read
   is successful, the etch_mailbox_element return object is the object read from the
   mailbox queue; otherwise the return object is instantiated only to host the resultant
   exception.

Q: who calls etchmbox_message() to queue a message to the mailbox?
A: pmboxmgr_session_message(), called by etch_msgizer_session_packet() via
      is_message_handled = (0 == thisx->session->session_message (thisx->session->thisx, whofrom, msg));
      *** this is also where the inbound message is created and can be destroyed if unwanted
   this is called by etch_pktizer_session_data via
      pzr->session->session_packet (pzr->session->thisx, whofrom, fbuf);
   this is called by etch_tcpconxn.c::etch_tcpclient_on_data() via
      session->session_data (session->thisx, NULL, data)
   this is called by etch_tcpclient_listenerproc() via
      cx->on_data (cx, 0, result, fbuf);
   SO the tcp receive listener is at the head of the stack, owning the call stack up
   through the mailbox, waiting to be unwound. IT WOULD BE GOOD IF WE COULD EXIT THIS
   THREAD FIRST rather than last when we tear down.

Q: where is the inbound message created?
A: see the previous question.

Q: what is the return path of an exception arriving at the client, returned from a one-way message?
A: * messagizer.session_packet calls tdi.read_message().
   * messagizer.session_packet calls session.session_message(), with the message.
   * mailboxmgr.session_message calls get_mailbox(in_reply_to), returning an error because there is none.
   * messagizer.session_packet calls session.session_notify(), passing a new unwanted message.
   * mailboxmgr.session_notify calls deliverysvc.session_notify().
   * deliverysvc.session_notify calls stub.session_notify().
   * etchstub.session_notify calls etchstub_session_notify_ex with the stub's _obj, which is the EXECLIENTIMPL.
   * etchstub_session_notify_ex gets the session callbacks from the client impl object.
   * etchsession_def_session_notify destroys the payload (the exception message) and returns.

Q: what is the call sequence for client main calling stop_xxxx_client()?
A: * etchremote_stop_waitdown()
   * etchremote_transport_control()
   * tcpdelsvc_transport_control()
   * pmboxmgr_transport_control()
   * msgizer_transport_control()
   * pktizer_transport_control()
   * tcpconx_transport_control()
   * tcpclient_stop_listener()
   * tcpconx_closex()