Server hangs, yet server.exe does not terminate
ATF
Join Date: 2014-05-09 Member: 195944Members
This is a server-side problem only.
Description:
- Server process does not exit
- Network traffic on the server drops towards 0 and stays there
- CPU usage for server.exe fluctuates erratically
(Detail: The CPU core where most CPU time was used during normal operation drops towards 0)
- All players get the red plug and all clients time out.
- Server browser reports the server as active, with the player count and performance value frozen at what they were right before the crash
(Detail: When "Filter Empty" is checked, the server appears for a second, then disappears; when unchecked, it stays visible.
If you open the server details, everything looks fine, and each player's connection time keeps counting up.)
- Some official servers have remained like that for weeks, because nobody manually reset them.
Appearance:
- I've personally seen it happen on servers with 8 to 42 players.
- Modded, unmodded, or official servers are all affected in exactly the same way. Several community members can confirm this.
- It makes no difference whether webadmin is active or not.
- The logs show nothing out of the ordinary, neither does the console. The player timeouts are not logged.
Reproducibility:
- None so far.
- It happens every now and then. Sometimes 3 times a day, sometimes not for 4 days.
- It appears to happen only when there's considerable activity (unlike, for example, right after a mapcycle),
while at the same time it's not related to extreme load, high entity counts, etc.
- Round time ranges from a few minutes to long games.
- Fairly sure I've never seen it happen before or after a round, only during rounds.
Relief efforts:
- Server console: You can press q, but nothing happens even if you wait.
If you press CTRL-C, however, it will write something about memory leaks and actually exit, sometimes immediately, sometimes after a considerable delay.
- It will not react to commands such as changemap ns2_veil. You can type the command at the console, but nothing happens.
- If you do nothing, it will stay like this forever.
- It needs to be manually closed and restarted; there is no workaround.
In my opinion this hang appears to affect only one server thread, the main one.
That would explain why the CPU usage of that thread/core drops while the others fluctuate, and why the clients receive no further network packets while the status request (server browser) side still responds.
This issue has been around for a good while now, a couple of months I'd say.
I can't easily create plogs, as the server would have to run like that for quite a while to reproduce one of these hangs.
I will try to remember to save a memory dump of it and send a link to Ironhorse. Since I don't have access to an unmodded server, data from a modded one is all I can provide.
You can use data from the official servers, as it happens there too.
Thanks for reading. I invite other server folks to chime in on this and hopefully provide more details!
Comments
I kind of "fixed" it by restarting the server every ~12 hours and running a script which restarts the server as soon as the server query result hasn't changed at all for 2 minutes.
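For anyone wanting to try the same approach, here is a minimal sketch of the "restart when the query result stops changing" idea. It is not the script described above. The names StallWatchdog, query_status and restart_server are hypothetical, and the snapshot is assumed to be whatever your query tool returns (for example player count plus performance value). How to actually query and restart the server is left to the two callables you pass in.

```python
import time
from typing import Callable, Hashable, Optional


class StallWatchdog:
    """Flags a server as hung when its query snapshot stops changing."""

    def __init__(self, stall_seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.stall_seconds = stall_seconds  # 2 minutes, as in the post above
        self._clock = clock
        self._last_snapshot: Optional[Hashable] = None
        self._last_change: float = clock()

    def observe(self, snapshot: Hashable) -> bool:
        """Record one query result; return True if it has been unchanged too long."""
        now = self._clock()
        if snapshot != self._last_snapshot:
            # Something changed, so the server is still updating its state.
            self._last_snapshot = snapshot
            self._last_change = now
            return False
        return now - self._last_change >= self.stall_seconds


def run(query_status: Callable[[], Hashable],
        restart_server: Callable[[], None],
        poll_seconds: float = 10.0) -> None:
    """Poll forever. Both callables are hypothetical and must be supplied by you."""
    watchdog = StallWatchdog()
    while True:
        if watchdog.observe(query_status()):
            restart_server()
            watchdog = StallWatchdog()  # start fresh after the restart
        time.sleep(poll_seconds)
```

Note that an empty, idle server may also return an unchanging snapshot, so this sketch can restart a healthy empty server. That is usually harmless, but it is worth knowing.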
Did you check which thread in the executable is running and which one is hanging or gone?
Setting up Process Explorer with symbols should get you far in that regard.
Cannot confirm this. The server has no issues with RAM usage. That's such an old issue and such an obvious first place to look that I forgot to mention it.
Can you tell me more about how you query the server status with a script? That in combination with measuring the current network traffic could potentially lead to a way to automate it properly.
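To illustrate the combination suggested here, a rough sketch: take periodic samples of the server's cumulative byte counter, turn them into rates, and only call the server hung when the query status is unchanged and traffic has stayed near zero. The function names and the 1024 bytes/sec threshold are assumptions for illustration. Where the byte counter comes from (OS counters, a monitoring tool) is not covered.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, cumulative bytes sent + received)


def traffic_rates(samples: Sequence[Sample]) -> List[float]:
    """Bytes per second between each pair of consecutive samples."""
    rates: List[float] = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        # Clamp at 0 in case the counter was reset between samples.
        rates.append(max(b1 - b0, 0) / dt)
    return rates


def looks_hung(samples: Sequence[Sample], status_unchanged: bool,
               min_bytes_per_sec: float = 1024.0) -> bool:
    """True only if the status is frozen AND every measured rate is below the threshold."""
    rates = traffic_rates(samples)
    return status_unchanged and bool(rates) and all(r < min_bytes_per_sec for r in rates)
```

Requiring both signals should avoid restarting a server that is merely quiet on one of them, but the threshold would need tuning against real traffic on your machine.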
Darkling: I've looked into the symbols thing for procexp. I get "Installation failed" when installing the Debugging Tools for Windows.
Just install it on a similar Windows client machine, grab dbghelp.dll, copy it to the server, and set up the symbol paths. Note that the .dll is not the same for 32-bit as for 64-bit.
(As a side note, the install probably failed because something it needs is missing, such as Visual Basic versions.)
Also note that 'running out of memory' is utterly vague. WHAT part of memory?
- Physical?
- Commit?
- Paged or nonpaged?
It's unlikely for a program run by many, but processes can also hit the maximum process and thread limits; the latter is not shown by default (Process Explorer does show these, of course).
Beyond this it gets more complicated with limits on handles, user objects and GDI objects.
I hope never to see such things in production, as tracking down the last three is downright disastrous in my opinion.
entity 512 both deleted and created on the same snapshot write (class 8->10)!
entity 1449 both deleted and created on the same snapshot write (class 66->8)!
entity 1757 both deleted and created on the same snapshot write (class 164->10)!
entity 2086 both deleted and created on the same snapshot write (class 10->8)!
entity 2372 both deleted and created on the same snapshot write (class 175->8)!
entity 2556 both deleted and created on the same snapshot write (class 10->108)!
entity 2765 both deleted and created on the same snapshot write (class 7->10)!
entity 3716 both deleted and created on the same snapshot write (class 10->142)!
entity 3965 both deleted and created on the same snapshot write (class 162->8)!
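If you have a longer console log with lines like these, a small sketch such as the following could tally which class IDs are involved most often. The regular expression is written against the exact wording of the lines above. The function name is hypothetical, and what the class numbers mean is not known from this thread.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   entity 512 both deleted and created on the same snapshot write (class 8->10)!
PATTERN = re.compile(
    r"entity (\d+) both deleted and created on the same snapshot write "
    r"\(class (\d+)->(\d+)\)!"
)


def tally_classes(lines: Iterable[str]) -> Counter:
    """Count how many matching log lines involve each class ID (old or new side)."""
    counts: Counter = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # ignore unrelated log lines
        old_class, new_class = int(match.group(2)), int(match.group(3))
        # Use a set so a line with identical old and new class counts once.
        for class_id in {old_class, new_class}:
            counts[class_id] += 1
    return counts
```

Calling tally_classes(open("server.log")) (the file name is just an example) and then .most_common(5) on the result would show the most frequently involved class IDs.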
Thanks to matso's commendable perseverance there's finally a hint =D> :
We will continue to support the investigation as best as we can.
Good advice.
A restart every 24 hours is good enough to keep the server from freezing.