Docker can become unresponsive, but containers continue to run


ilike2burnthing
Posts: 367
Joined: Thu Apr 09, 2020 8:01 pm

Re: Docker can become unresponsive, but containers continue to run

Post by ilike2burnthing »

Code:

/$ mount|grep docker|grep -v overlay
/dev/md1 on /volume1/.@plugins/AppCentral/docker-ce/docker_lib type btrfs (rw,relatime,ssd,discard,space_cache,subvolid=259,subvol=/.@plugins)
/dev/md1 on /volume1/.@plugins/AppCentral/docker-ce/docker_lib/btrfs type btrfs (rw,relatime,ssd,discard,space_cache,subvolid=259,subvol=/.@plugins)
nsfs on /var/run/docker/netns/a23df0881d74 type nsfs (rw)
nsfs on /var/run/docker/netns/53914be5934c type nsfs (rw)
nsfs on /var/run/docker/netns/1bc861af2a80 type nsfs (rw)
nsfs on /var/run/docker/netns/be8156de2fed type nsfs (rw)
nsfs on /var/run/docker/netns/f34ff78cb33a type nsfs (rw)
nsfs on /var/run/docker/netns/998d848a4c74 type nsfs (rw)
nsfs on /var/run/docker/netns/4e42719281a2 type nsfs (rw)
nsfs on /var/run/docker/netns/74a4fbed29a2 type nsfs (rw)
nsfs on /var/run/docker/netns/eff3890210bd type nsfs (rw)

Code:

/$ docker info
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., v2.15.1)

Server:
 Containers: 9
  Running: 9
  Paused: 0
  Stopped: 0
 Images: 8
 Server Version: 20.10.22
 Storage Driver: btrfs
  Build Version: Btrfs v5.10.1 
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 78f51771157abb6c9ed224c22013cdf09962315d
 runc version: v1.1.4-0-g5fd4c4d1
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.13.0.x
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 1.857GiB
 Name: NameHere
 ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 Docker Root Dir: /volume1/.@plugins/AppCentral/docker-ce/docker_lib
 Debug Mode: false
 Username: UsernameHere
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Re: Docker can become unresponsive, but containers continue to run

Post by ilike2burnthing »

Appending "-H tcp://127.0.0.1:2375" to DOCKERD_OPT in start-stop.sh and restarting dockerd resulted in:

Code:

/$ docker ps -a
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Reverting resulted in:

Code:

/$ /usr/local/AppCentral/docker-ce/CONTROL/start-stop.sh start
Starting Docker daemon...
rpcbind: another rpcbind is already running. Aborting
time="2023-03-23T02:20:40.884944570Z" level=info msg="Starting up"
time="2023-03-23T02:20:40.888723032Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
failed to load listeners: can't create unix socket /var/run/docker.sock: is a directory
mkdir: can't create directory '/sys/fs/cgroup/systemd': File exists
mount: mounting cgroup on /sys/fs/cgroup/systemd failed: Device or resource busy
Deleting the newly created docker.sock directory and rerunning worked fine.
Nazar78
Posts: 1984
Joined: Wed Jul 17, 2019 10:21 pm
Location: Singapore

Re: Docker can become unresponsive, but containers continue to run

Post by Nazar78 »

ilike2burnthing wrote:
Storage Driver: btrfs
Build Version: Btrfs v5.10.1
OK, I'll try something else; my volume bind is probably still on ext4. But so far running on btrfs has caused no issues, even when doing snapshots.

ilike2burnthing wrote:
Appending "-H tcp://127.0.0.1:2375" to DOCKERD_OPT in start-stop.sh and restarting dockerd resulted in:
The idea was to have docker use TCP instead of the unix socket, just to rule out the socket itself. Once any `-H` is given, dockerd stops listening on the default unix socket unless you list it too, which is why the plain `docker ps -a` could no longer connect. The daemon can listen on several endpoints at once (unix, tcp, fd); just append them to DOCKERD_OPT, e.g. `-H unix:///var/run/docker.sock -H tcp://0.0.0.0:2375` (bind to 0.0.0.0 instead of 127.0.0.1 if you want access from outside localhost). The client can pick an endpoint, including ssh, through the environment or -H, e.g. `DOCKER_HOST="tcp://127.0.0.1:2375" docker ps -a` or `docker -H tcp://127.0.0.1:2375 ps -a`, or put `export DOCKER_HOST="tcp://127.0.0.1:2375"` in your .profile. All of this is covered in the Docker documentation. So try dropping the unix socket and using TCP for troubleshooting.
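
To compare the two transports side by side, a small probe along these lines may help. It is only a sketch: `probe_docker_hosts` and the `DOCKER_BIN` variable are names made up here, and it assumes a `timeout` utility that accepts `timeout SECONDS COMMAND` (older BusyBox builds may want `timeout -t 5` instead), so that a hung daemon cannot block the shell.

```shell
# Hypothetical helper: try each daemon endpoint with a deadline so that a
# hung socket cannot block the shell. DOCKER_BIN is an illustrative
# override, defaulting to the normal docker client.
DOCKER_BIN="${DOCKER_BIN:-docker}"

probe_docker_hosts() {
    for host in "$@"; do
        # `docker -H <endpoint> ps -q` exits non-zero if the daemon is unreachable.
        if timeout 5 "$DOCKER_BIN" -H "$host" ps -q >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
        fi
    done
}

# Example (needs dockerd listening on both endpoints):
# probe_docker_hosts unix:///var/run/docker.sock tcp://127.0.0.1:2375
```

If both endpoints fail at the same moment, the transport is probably not the culprit.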
AS5304T - 16GB DDR4 - ADM-OS modded on 2GB RAM
Internal:
- 4x10TB Toshiba RAID10 Ext4-Journal=Off
External 5 Bay USB3:
- 4x2TB Seagate modded RAID0 Btrfs-Compression
- 480GB Intel SSD for modded dm-cache (initramfs auto update patch) and Apps

When posting, consider checking the box "Notify me when a reply is posted" to get faster response

Re: Docker can become unresponsive, but containers continue to run

Post by Nazar78 »

Nazar78 wrote:
OK, I'll try something else; my volume bind is probably still on ext4. But so far running on btrfs has caused no issues, even when doing snapshots.
Got it running on the actual btrfs storage driver (v5.10.1) by appending `-s btrfs` to the docker opts. My bind-mounted volumes are still on ext4 (external SSD cache, /dev/mapper/md1_c). I had to pull all the images again and then recreate all the containers; no biggie. Will monitor further...

Code:

root@Nimbustor4:~# mount|grep -i docker
/dev/mapper/md1_c on /share/Docker type ext4 (rw,relatime,nobarrier,stripe=256,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group)
/dev/sde1 on /volume1/Docker type ext4 (rw,noatime,nobarrier)
/dev/sde1 on /share/Docker type ext4 (rw,noatime,nobarrier)
/dev/sde1 on /share/USB31-1/chroot/share/Docker type ext4 (rw,noatime,nobarrier)
/dev/loop3 on /volume1/.@plugins/AppCentral/docker-ce type btrfs (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)
/dev/loop3 on /volume1/.@plugins/AppCentral/docker-ce/docker_lib type btrfs (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)
/dev/loop3 on /volume1/.@plugins/AppCentral/docker-ce/docker_lib/btrfs type btrfs (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)
nsfs on /var/run/docker/netns/0ea725c269cc type nsfs (rw)
nsfs on /var/run/docker/netns/default type nsfs (rw)

Re: Docker can become unresponsive, but containers continue to run

Post by Nazar78 »

Some updates.

It has been slightly over 5 days running on btrfs (counting from my last post; my NAS uptime is much longer, since the last firmware upgrade I think). On the 3rd day, IIRC, I tried to recreate one of the containers (watchtower) to update the environment for my auto-upgrade email notifications. It was created successfully, but I was baffled that it wouldn't start: the error said the image was missing an entrypoint. Suspecting a typo in my command, I retried a few times; in the end, deleting the image and pulling it again fixed it. Note that the image was last updated 2 months ago, so it can't have been a failed self-upgrade. Just to add, I'm not using any docker volumes, even for portainer; all persistent data is bind-mounted to an ext4 partition. There's plenty of space left on the btrfs partition: 11GB used out of 20GB.

Then today, the 5th day, I was in portainer messing with some macvlan settings and listing the images when it suddenly stalled. Even watchtower registered failures (it had run some upgrades a few hours earlier). It seems /var/run/docker.sock is not responding. At first `docker ps -a` worked, then after I ran `docker images`, it hung. Now every docker command follows suit: none of them return. But strangely, as you mentioned earlier, all the containers continue to work normally, except watchtower of course, as it depends on /var/run/docker.sock.

I also noticed that the ADM start-stop script (whether run from a terminal or from App Central) doesn't work, because one of the commands in the stop procedure is `docker stop $(docker ps -a -q)`, which stops all the containers. Since docker is no longer responding, the stop hangs too. Force killing did the trick: `ps -ef|grep docker|grep -v grep|awk '{print $1}'|xargs kill -9`, then start it again with `start-stop start`. If you're still using the script to auto restart, I've updated the previous post to comment it out. You should also revert to the socket instead of tcp if you already changed it, because I don't think it's a connection issue.
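
Since the stop procedure blocks on a docker call that never returns, one option is to give such calls a deadline and fall back to killing only when it expires. A minimal sketch, assuming GNU coreutils `timeout` (which exits with status 124 on expiry; BusyBox builds may behave differently); `with_deadline` is a name made up here and is not part of the ADM script:

```shell
# Run a command, but give up after a deadline instead of blocking forever.
# Usage: with_deadline SECONDS COMMAND [ARGS...]
with_deadline() {
    secs="$1"; shift
    rc=0
    timeout "$secs" "$@" || rc=$?
    # GNU timeout reports an expired deadline with exit status 124.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${secs}s: $*" >&2
    fi
    return "$rc"
}

# Example (the 30-second deadline is an arbitrary choice):
# with_deadline 30 sh -c 'docker stop $(docker ps -a -q)' \
#     || echo "docker did not answer, falling back to kill"
```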

I suspect some btrfs corruption, but running btrfs scrub doesn't show any issue. I scoured the internet and nothing comes as close as this: https://github.com/moby/moby/issues/44903.

I'm now running docker in debug mode (opt `-D`) to see if I can catch anything in the next few days...

Code:

root@Nimbustor4:~# tail -f ~/docker.log
time="2023-03-29T04:48:28.421351282+08:00" level=debug msg="event forwarded" ns=moby topic=/tasks/exec-added type=containerd.events.TaskExecAdded
time="2023-03-29T04:48:28.421759547+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-added
time="2023-03-29T04:48:28.503742635+08:00" level=debug msg="event forwarded" ns=moby topic=/tasks/exec-started type=containerd.events.TaskExecStarted
time="2023-03-29T04:48:28.504007551+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-started
time="2023-03-29T04:48:28.829347362+08:00" level=debug msg="event forwarded" ns=moby topic=/tasks/exit type=containerd.events.TaskExit
time="2023-03-29T04:48:28.829704740+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
time="2023-03-29T04:48:28.829834803+08:00" level=debug msg="attach: stdout: end"
time="2023-03-29T04:48:28.829902399+08:00" level=debug msg="attach: stderr: end"
time="2023-03-29T04:48:28.829950667+08:00" level=debug msg="attach done"
time="2023-03-29T04:48:28.829994153+08:00" level=debug msg="Health check for container b67899587d2c599f9d78beef950a9f1f451d7002a4a906ed13951272e806b7aa done (exitCode=0)"

Re: Docker can become unresponsive, but containers continue to run

Post by ilike2burnthing »

I'm not going mad then! Thanks for verifying this :D

Re: Docker can become unresponsive, but containers continue to run

Post by Nazar78 »

Code:

root@Nimbustor4:~# docker ps -aq
root@Nimbustor4:~# ^C
root@Nimbustor4:~# tail ~/docker.log
2023-04-01 04:05:33.107876 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:34.108862 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:35.109030 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:36.109843 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:37.110840 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:38.111841 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:39.112845 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:40.114004 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:41.114840 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2023-04-01 04:05:42.115844 I | http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
root@Nimbustor4:~# grep -E '(^Limit|open)' /proc/$(pidof dockerd)/limits
Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files
root@Nimbustor4:~# chroot ~/.chroot prlimit --pid $(pidof dockerd) --nofile=10240:10240
root@Nimbustor4:~# docker ps -aq
a7253fb8824d
9d710ed18897
295eeccf1a93
6ead7d5c52c1
7929f36bf077
624fbadaa8b5
b67899587d2c
6f8a0e09b5da
a26dbd56cb43
Looks like an open files limit issue. I can't find the prlimit command (util-linux package) in the Entware repo, so I ran prlimit from my Debian chroot. Bam, the docker command works again (the limit affects both the unix socket and tcp).
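
To confirm this kind of exhaustion before raising the limit, the number of descriptors a process holds can be compared with its soft limit. This is a sketch that relies only on the Linux /proc layout; `fd_usage` is a hypothetical helper name, and reading another user's /proc/PID/fd needs root:

```shell
# Print how many file descriptors a process has open next to its soft limit.
# Usage: fd_usage PID
fd_usage() {
    pid="$1"
    # Each entry under /proc/PID/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # Column 4 of the "Max open files" row is the soft limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft_limit=$soft"
}

# Example: fd_usage "$(pidof dockerd)"
```

An `open` value sitting at or near `soft_limit` would match the `accept4: too many open files` errors in the log.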

In your case, you can kill the dockerd processes, add `ulimit -SHn 10240` to the docker-ce start-stop file just before dockerd is launched, then start it normally with `./start-stop start`. The 10240 soft/hard limit is up to you to decide; the default here is 1024/4096. There's one container where I set it to 2147483584 for its intended purpose. EDITED: removed comments about ulimit init.d script, refer to the later part of the post.
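
The same idea can be written as a wrapper, in case changing the limit for the whole script is not wanted. The limit is set inside a subshell, so it applies only to the launched command and its children. `run_with_nofile` is an illustrative name, and the dockerd line in the comment is a placeholder, not the actual line from the ADM start-stop script. Raising the hard limit above its current value needs root.

```shell
# Launch a command with its own open-files limit (soft and hard).
# Usage: run_with_nofile LIMIT COMMAND [ARGS...]
run_with_nofile() {
    limit="$1"; shift
    # The subshell keeps the new limit from leaking into the calling shell.
    ( ulimit -SHn "$limit" && exec "$@" )
}

# Placeholder example, assuming DOCKERD_OPT holds the daemon options:
# run_with_nofile 10240 dockerd $DOCKERD_OPT
```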

The thing is, I never encountered this while using ext4. I could be wrong, but it seems the docker btrfs driver holds far more files open. Whatever it is, it's case closed for now; I'm reverting to ext4.
Last edited by Nazar78 on Sat Apr 01, 2023 11:52 pm, edited 2 times in total.

Re: Docker can become unresponsive, but containers continue to run

Post by ilike2burnthing »

Do you have an example S00ulimit script I could use?

That's instead of amending start-stop? If not, where are you suggesting adding `ulimit -SHn 10240`?

If I wanted to chase this issue up further, who am I looking at? Docker/Moby? Asustor? Btrfs?

Thanks again!

Re: Docker can become unresponsive, but containers continue to run

Post by Nazar78 »

EDITED: removed comments about ulimit init.d script, refer to the later part of the post.

If you also want it to work from App Central, you'll have to modify the start-stop script too, and that change will be lost whenever docker-ce gets updated.

Yes, it's quite a hurdle, because unlike a standard distro, Asustor has no /etc/security/limits.conf.

ilike2burnthing wrote:
If I wanted to chase this issue up further, who am I looking at? Docker/Moby? Asustor? Btrfs?
That's a good question, and TBH I'm not sure. But ulimit exists for a good reason: to control system resources. It should be up to the system admin, in this case ourselves, to regulate these restrictions. Imagine a process using up all the resources, or ransomware going berserk.
Last edited by Nazar78 on Sat Apr 01, 2023 11:54 pm, edited 1 time in total.

Re: Docker can become unresponsive, but containers continue to run

Post by ilike2burnthing »

Oh you really meant simple, my bad :P Will do that now.

I'll try on Moby/Moby as they'll probably reply the quickest, even if it's to tell me it's not their problem. I'll post the issue in a bit.