Slow system shutdown on linux systems (systemd service)

Summary

With Kolibri installed on the Raspberry Pi there is a “stop job” that is delaying the system from shutting down for up to 90 seconds depending on the hardware. I have seen this disappear by setting up the facility but others are still reporting the issue. I’m wondering if there’s a way around this or what this stop job might be? thanks.

Technical details

  • 0.13.2
  • Raspbian Buster Lite

James

Hi @JamesK

Thanks for reporting this. I have been having exactly the same issue. @jredrejo is working on a fix, and I released a beta of this work here:

https://launchpad.net/~learningequality/+archive/ubuntu/kolibri-proposed

Would you be able to test 0.3.7~beta1from this PPA? (available in 10 minutes)

1 Like

Implementation details of the changes are here: https://github.com/learningequality/kolibri-server/pull/54

Thanks @benjamin. I will test the fix today and let you know how it goes.

James

Hi @benjamin,

Since there’s no distribution template for Rasbian/buster I can’t use “sudo add-apt-repository”. I usually add ithe other way from the Raspberry Pi instructions but I can’t seem to get it going. How would you add this on a Pi? Thanks.

James

Ah yes, the PPA does indeed not have anything for Raspbian, but all the packages are binary copies, meaning that they work on any Debian-based system (Xenial or later). This will get you started:

 # Might be missing on some systems
sudo apt-get install dirmngr
# Add the "bionic" repository
sudo su -c 'echo "deb http://ppa.launchpad.net/learningequality/kolibri-proposed/ubuntu bionic main" > /etc/apt/sources.list.d/learningequality-ubuntu-kolibri.list'
# Add Learning Equality's PPA signing key
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys DC5BAA93F9E4AE4F0411F97C74F88ADB3194DD81
# Refresh
sudo apt-get update
# Install
sudo apt-get install kolibri-server
1 Like

Thanks. Switching to bionic did the trick.

I’m implementing Kolibri for RACHEL-Pi which uses apache2 right now so I’m just installing kolibri and not kolibri-server. Are the changes applied only to kolibri-server? It seems to be hanging on the stop job still for the default 90 second time out period. If I stop kolibri first it takes about 15-20 seconds on a pi 4 to stop. Then the stop job during shutdown disappears.

James

Hi @JamesK - I wrongly assumed you were already on kolibri-server.

This seems like something we need to investigate - I’m one of many Kolibri developers - we should definitely look into this.

I made a test of a blank Kolibri installation and indeed I have the same issue with a slow shutdown when no Facility is installed.

Here’s my log ( sudo journalctl --unit kolibri )

Mar 27 21:30:15 kolibritest kolibri[2161]: INFO     Running Kolibri as background process
Mar 27 21:30:15 kolibritest kolibri[2161]: INFO     Kolibri running on:
Mar 27 21:30:15 kolibritest kolibri[2161]:         http://192.168.1.125:8080/
Mar 27 21:30:15 kolibritest kolibri[2161]:         http://127.0.0.1:8080/
Mar 27 21:30:15 kolibritest kolibri[2161]: INFO     Going to background mode, logging to /home/test/.kolibri/logs/kolibri.txt
Mar 27 21:30:16 kolibritest systemd[1]: Started LSB: kolibri daemon, an offline education platform.
Mar 27 21:32:57 kolibritest systemd[1]: Stopping LSB: kolibri daemon, an offline education platform...
Mar 27 21:32:58 kolibritest su[2429]: Successful su for test by root
Mar 27 21:32:58 kolibritest su[2429]: + ??? root:test
Mar 27 21:32:58 kolibritest su[2429]: pam_unix(su:session): session opened for user test by (uid=0)
Mar 27 21:32:58 kolibritest su[2429]: pam_systemd(su:session): Failed to connect to system bus: Broken pipe
Mar 27 21:33:00 kolibritest kolibri[2385]: ERROR    During graceful shutdown, server says: Not responding (5)
Mar 27 21:33:00 kolibritest kolibri[2385]: ERROR    Not responding, killing with force
Mar 27 21:33:00 kolibritest kolibri[2385]: INFO     Initiating shutdown of Kolibri
Mar 27 21:33:00 kolibritest kolibri[2385]: INFO     Waiting for Kolibri to finish shutting down
Mar 27 21:33:31 kolibritest su[2429]: pam_unix(su:session): session closed for user test
Mar 27 21:33:31 kolibritest systemd[1]: Stopped LSB: kolibri daemon, an offline education platform.
-- Reboot --

I’m a little confused, as to why Kolibri provokes this, but I have a log that seems to say that Kolibri was already shutdown when some other service hangs, however every time I disable Kolibri, this stops happening.

Notice the timestamps in this log: There’s a 1:30 gap between the shutdown and the reboot, and the service “Session c1 of user test” is the one that hangs – and notably stops hanging during reboot when the Kolibri service is disabled.

I’ll investigate this some more to see, if we can isolate the behavior for a good and precise issue report.

Hi @benjamin

I was actually just writing this reply when I saw your message. I haven’t seen the same things in my journalctl but I’ll try to do another clean install to see if I do. I’ve been digging into this issue more. I’m new to the internals of Kolibri so maybe I’m way off but there are two reasons this is taking so long to stop or shutdown.

I did some timing and the varying stop time ( 6s-21s ) I saw when running “sudo systemctl stop kolibri” seems to be from the unregister_zeroconf_service() call in server.py on line 117.

github server.py line 117

It gets imported right before and is actually in “”/usr/lib/python3/dist-packages/kolibri/core/discovery/utils/network/search.py" which I couldn’t find on github but is defined at line 203. Within that it’s the “ZEROCONF_STATE[“zeroconf”].close()” call that seems to cause the slowdown. Strangely, when I started using the logger.info() and checking elapsed time in unregister_zeroconf_service() to see if it was .cleanup() or .close() that was taking so long, the stop now always takes 6s. I’ve confirmed this by running htop in one window and stopping Kolibri in another. So I can reliably stop Kolibri in 6s now. I’m going to keep looking at why that might be.

The 90s shutdown is definitely with systemd which I saw mentioned on that pull request. It’s probably hitting DefaultTimeoutStopSec=90s. Since I know Kolibri is stopping in 6s, I’m not really sure why that is but maybe it’s to do with the signal sent to it?

James

@JamesK thanks so much for debugging the code! It sounds like this portion of code? https://github.com/learningequality/kolibri/pull/6081/files

I can’t see that any recent issues or pull requests have addressed the exact symptoms, but I have found some unpredictable behaviors before in the way that the zeroconf service exits: https://github.com/learningequality/kolibri/issues/6610

It could be that these two odd behaviors are related: If Kolibri doesn’t shut down cleanly and is unreliable about how the zeroconf service exits, then it might not clean up all processes, leaving systemd waiting for it until the 90 second timeout.

@benjamin no problem! I don’t see that code in the cli.py on my system. It looks like it’s already been moved to server.py. https://github.com/learningequality/kolibri/blob/27d11d9c838655f43a77aa297c0d5429e8e1cf65/kolibri/utils/server.py#L117 is the line that I’m hitting. This is that function I’m seeing in my kolibri/core/discovery/network/search.py

I’ve now observed cleanup() and close() both taking quite a bit of time. I still think these are two separate issues though but I’ve confirmed that the session c1 that has the stop job is only running kolibri. “sudo systemctl stop kolibri” definitely kills all of the processes every time I do it manually but maybe something in the order of the shutdown could cause this too

On RACHEL-Pi I install Kiwix and it is set up in a similar way. When I ran “sudo edit --full kiwix” I saw “SuccessExitStatus=5 6” and SuccessExitStatus= isn’t set for Kolibri. It may not be necessary but I’m now wondering where the final exit of Kolibri is happening to see if there’s a sys.exit(0) missing at the end or a sys.exit() with some other code that maybe isn’t set in that. Kind of a wild goose chase isn’t it.

James

@benjamin = Just an update. I’m still trying to figure out how everything works here as it uses handlers and stuff, but adding “sys.exit(0)” to the end of the stop() function inside of ServicesPlugin on server.py makes the system shut down right away. So it looks like it’s capable of shutting down normally without changing any of the systemd configuration.

James

My guess is that somewhere in here, it is getting caught on a thread, after either .join is called or waiting on .close from a socket: https://github.com/learningequality/python-zeroconf-py2compat/blob/master/zeroconf.py#L2259

It’s worth noting that there are a lot of different threads being waited on when you ask Kolibri to stop - first all the cherrypy threads, then all the threads for the task scheduler, and workers, and the threads inside the zeroconf service.

“sys.exit(0)” to the end of the stop() function inside of ServicesPlugin on server.py

Note that this will interrupt cherrypy’s shutdown process, so we can’t apply this as a fix.

Hi Richard,

Yeah it’s definitely not a good idea to force anything to stop. I was just thinking the systemd issue might be that it’s not getting a successful exit even when everything is stopped.

I took a look through zeroconf.py and it looks like close() -> remove_all_service_listeners()-> remove_service_listener() -> self.browsers[listener].cancel() -> self.join() is causing the slowdown

I also got the following warning a few times and when it does that it stops right away without the ~20s delay from join().

/usr/lib/python3/dist-packages/kolibri/dist/cherrypy/process/wspbus.py:258: RuntimeWarning: The main thread is exiting, but the Bus is in the states.STOPPING state; shutting it down automatically now. You must either call bus.block() after start(), or call bus.exit() before the main thread exits.

James

Thanks for the pointers, @JamesK - zeroconf run() method doubles self.delay until it’s a 20 seconds maximum delay: That could be the cause of why it takes up to 20 seconds for each thread to find out that self.done is true.

I encountered a new type of exception upon exit and added it here: https://github.com/learningequality/kolibri/issues/6610

Hi @JamesK - opened up the following issue, I was wondering if you can contribute some of the other systemd debugging symptoms that you had DM’ed me in that Github issue?

Once again thanks so much for helping us out on this one, really appreciate your patience with Kolibri and systemd, especially the last one is a very very complicated thing to debug :slight_smile:

1 Like

Hi @benjamin,

No problem! I’m happy to help and I just hope we can figure this out. I’ll have some time later today so I’ll try and organize what I’ve found and comment.on the github.

James

Just an update. I believe I’ve finally found the source of 90s timeout issue ( :grinning: ) and posted what I’ve found on the github in more detail. It seems to be directly related to the /etc/init.d/kolibri file starting Kolibri in a user session that gets abandoned right away. I changed this to run it directly and the second stop job that was hanging went away. I’m looking at whether your zeroconf disabling option brings that 23s stop time down next.

James

After installing your zeroconf disable build I’m happy to say that the pi shuts down pretty much instantly (5s). There’s no 23s delay and no 90s delay. Thanks for adding that option to test! I’m going to do another fresh install with the initd change and then set up the facility fully to see if that gets rid of the delay. If it doesn’t then I think that’s pretty good confirmation of the 23s delay being zeroconf.

update: Setting up the facility didn’t change the shutdown time. Disabling zeroconf does.

James

Hi @JamesK - we’ve recently merged a fix for this in Kolibri. It will be released in the 0.14.0 release, but if you can run from source, checking against the release-v0.14.x branch would allow you to see if this has fixed the issue you are seeing.

Hope this helps resolve the issue!