Server hangs, yet server.exe does not terminate

ATF · May 2014

This is a server-side problem only.

Description:
- Server process does not exit
- Network traffic on the server goes towards 0 and stays like that
- CPU usage for server.exe fluctuates erratically
(Detail: The CPU core where most CPU time was used during normal operation goes towards 0)
- All players get red plug and all clients time out.
- Server browser reports the server as active, with the player count/performance value frozen at what it was right before the crash
(Detail: When "Filter Empty" is checked, the server appears for a second then disappears; when unchecked it's visible.
If you open the server details, it shows everything as if it was fine, and each player's connection time keeps counting up)
- Some official servers have remained like that for weeks, as nobody would manually reset them.

Appearance:
- I've personally seen it happen on servers with 8 to 42 players.
- Modded, unmodded, or official servers all behave exactly the same. Several community members can confirm this.
- Servers with webadmin active or not, all the same.
- The logs show nothing out of the ordinary, neither does the console. The players timing out is not logged.

Reproducibility:
- None so far.
- It happens every now and then. Sometimes 3 times a day, sometimes not for 4 days.
- It appears to happen only when there's considerable activity (unlike for example right after mapcycle)
while at the same time it's not related to extreme load, high entity counts, etc.
- Round time ranges from very few minutes to long games
- Fairly sure I've never seen it happen before or after a round. Only during rounds.

Relief efforts:
- Server console: You can press q, but nothing happens even if you wait.
If you press Ctrl-C however, it will write something about memory leaks and actually exit, sometimes immediately, sometimes after a considerable delay.
- It does not react to commands such as changemap ns2_veil. You can type the command at the console, but nothing happens.
- If you do nothing, it will stay like this forever.
- It needs to be manually closed and restarted, there is no workaround.

In my opinion this crash appears to affect only one server thread, the main one.
That would explain why the CPU usage of that thread/core goes down while the other(s) fluctuate, and why the clients stop receiving network packets while the status request (server browser) side still responds.

This issue has been around for a good while now, a couple of months I'd say.
I can't easily create plogs as it'd have to run like that for quite a while to reproduce one of those.
I will try to remember to save a memory dump of it and send a link to Ironhorse. Since I don't have access to an unmodded server, data from a modded one is all I can provide.
You can use data from the official servers, as it happens there too.

Thanks for reading. I invite other server folks to chime in on this and hopefully provide more details!

GhoulofGSG9 · May 2014

RAM usage is the important thing here. I "only" had that issue when ns2.exe ran out of memory and corrupted stacks were born, yeah I love those.

I kind of "fixed" it by doing a server restart every ~12 hours and having a script which restarts the server as soon as the server query doesn't change at all for 2 mins.
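
The "restart when the query stops changing" idea can be sketched roughly as below. This is only an illustration, not the script referred to above: `StallDetector` and its parameters are made-up names, and how the query result is obtained is left to the caller. Since the opening post notes that the browser keeps counting connection times during a hang, the value fed in should probably leave out per-player durations and use fields such as map, player count and scores.

```python
import time
from typing import Callable, Hashable, Optional


class StallDetector:
    """Flags a server whose query result has not changed for `timeout` seconds.

    Hypothetical helper: the caller polls the server however it likes and
    passes in a hashable summary of the answer (for example a tuple).
    """

    def __init__(self, timeout: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be tested offline
        self._last_value: Optional[Hashable] = None
        self._last_change: Optional[float] = None

    def observe(self, value: Hashable) -> bool:
        """Record one query result; return True once it has been static too long."""
        now = self.clock()
        if self._last_change is None or value != self._last_value:
            # First sample, or something changed: restart the timer.
            self._last_value = value
            self._last_change = now
            return False
        return now - self._last_change >= self.timeout
```

A polling loop would call `observe()` every few seconds and trigger the restart when it returns True. An empty, idle server would also look static, so a real script would need to treat that case separately.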

DC_Darkling · May 2014

Not running any server myself whatsoever, I'm still gonna attempt to butt in. :P
Did you check which thread in the executable is running and which one is hanging or gone?

Setting up process explorer with symbols should get you far in that regard.

ATF · May 2014

GhoulofGSG9 wrote: »

RAM usage is the important thing here. I "only" had that issue when ns2.exe ran out of memory and corrupted stacks were born, yeah I love those.

Cannot confirm this. The server has no issues with RAM usage. That's so old, and such an obvious first place to look, that I forgot to mention it.

Can you tell me more about how you query the server status with a script? That in combination with measuring the current network traffic could potentially lead to a way to automate it properly.
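
As a rough sketch of what combining the two signals could look like: the function below takes samples of (timestamp, total bytes sent, reported player count) and reports a hang only if players are listed while the send rate stays near zero for the whole window. `looks_hung`, the 1 KiB/s threshold and the 120 s window are assumptions for illustration; where the byte counter comes from (OS counters, a monitoring tool) is left open.

```python
from typing import Sequence, Tuple

# (timestamp in seconds, cumulative bytes sent, player count from the query)
Sample = Tuple[float, int, int]


def looks_hung(samples: Sequence[Sample], min_rate: float = 1024.0,
               window: float = 120.0) -> bool:
    """True if players are listed but the send rate stayed below `min_rate`
    (bytes/second) between every pair of samples, over at least `window` seconds.

    The caller is expected to pass only the most recent samples.
    """
    if len(samples) < 2 or samples[-1][0] - samples[0][0] < window:
        return False  # not enough history to judge
    for (t0, b0, _), (t1, b1, players) in zip(samples, samples[1:]):
        if t1 <= t0 or players == 0:
            return False  # bad timestamps, or an empty server that is just idle
        if (b1 - b0) / (t1 - t0) >= min_rate:
            return False  # traffic is still flowing
    return True
```

Requiring a non-zero player count avoids restarting a server that is merely empty, which a pure "traffic is low" check would do.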

Darkling: I've looked into the symbols thing for procexp. I'm getting "Installation failed" when installing the Debugging Tools for Windows.

DC_Darkling · May 2014

Odd, but you don't need the full install for this.
Just install it on a similar client Windows machine, grab the dbghelp.dll, copy it to the server and set up the symbol paths. Note the .dll is not the same for 32-bit as for 64-bit.

(as a side note, you probably could not install because you lack something it needs like visual basic versions)

Also note 'running out of memory' is utterly vague. WHAT part of memory?
- physical?
- Commit?
- paged or nonpaged?

It's unlikely for a program run by many, but processes can also hit the max process and thread limits, the latter not shown by default. (Process Explorer of course shows these.)
Beyond this it gets more complicated, with limits on handles, user objects and GDI objects.
I hope never to see such things in production, as tracking down the last 3 is downright disastrous in my opinion.

ATF · June 2014

Often, but not always, I can observe a couple of these in the log right before it goes:

entity 512 both deleted and created on the same snapshot write (class 8->10)!
entity 1449 both deleted and created on the same snapshot write (class 66->8)!
entity 1757 both deleted and created on the same snapshot write (class 164->10)!
entity 2086 both deleted and created on the same snapshot write (class 10->8)!
entity 2372 both deleted and created on the same snapshot write (class 175->8)!
entity 2556 both deleted and created on the same snapshot write (class 10->108)!
entity 2765 both deleted and created on the same snapshot write (class 7->10)!
entity 3716 both deleted and created on the same snapshot write (class 10->142)!
entity 3965 both deleted and created on the same snapshot write (class 162->8)!
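
For anyone wanting to collect these warnings across several logs, a small parser along the following lines could help. It assumes only the line format shown above; `Reuse`, `parse_reuses` and `class_counts` are names made up for this sketch, and what the class numbers mean is not known from the log itself.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple

# Matches the warning exactly as it appears in the log excerpt above.
PATTERN = re.compile(
    r"entity (\d+) both deleted and created on the same snapshot write "
    r"\(class (\d+)->(\d+)\)!"
)


class Reuse(NamedTuple):
    entity: int
    old_class: int
    new_class: int


def parse_reuses(lines: Iterable[str]) -> List[Reuse]:
    """Extract every matching warning; all other log lines are ignored."""
    found = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.append(Reuse(*map(int, match.groups())))
    return found


def class_counts(reuses: Iterable[Reuse]) -> Counter:
    """Count how often each class id appears on either side of the arrow."""
    counts: Counter = Counter()
    for reuse in reuses:
        counts[reuse.old_class] += 1
        counts[reuse.new_class] += 1
    return counts
```

Run over the nine lines quoted above, `class_counts` reports class 10 six times and class 8 five times, and every line involves at least one of the two. Whether that is meaningful or just reflects how common those classes are would need more logs to tell.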

IronHorse · June 2014

Need vanilla server logs and dumps void of any mods or hex edited executables in order to investigate further.

ATF · June 2014

That's a shame.

ATF · September 2014

Over the last weeks, we've sent heaps of additional dumps and data on it.
Thanks to matso's commendable perseverance there's finally a hint =D> :

Those crashdumps for half-crashing were useful; it indicates that
the server gets into an infinite loop inside a very gnarly piece of
extremely optimized code (which I wrote ... so I know just how gnarly).

Can't see exactly WHY that happened (it must be a pretty rare set of
circumstances that causes it considering that it works as well as it
does...)
but knowing where it occurs will allow me to add some extra debugging
code that should trigger when that happens.

We will continue to support the investigation as best as we can.

METROID · September 2014

GhoulofGSG9 wrote: »

I kind of "fixed" it by doing a server restart every ~12 hours and having a script which restarts the server as soon as the server query doesn't change at all for 2 mins.

Good advice.
A restart every 24H is good enough to keep the server from freezing.

IronHorse · October 2014

Awesome @ATF, keep us updated with your findings please
