General protection fault and container restart #53
When running the new 0.9.20 release I get a general protection fault every time I try to start the container using docker compose, after it's tried to connect to the printer for a while:
[ 1331.878832] traps: python[6058] general protection fault ip:7f149fca0527 sp:7f149d206810 error:0 in libc.so.6[a1527,7f149fc27000+163000]
After this fault the container restarts.
I also noticed that the logs were flooded with many connection issues with intermittent success sprinkled in, but that was already the case in 0.9.19 afaik, so that may be a separate issue.
Thanks @BasK! A few quick questions to narrow this down:
uname -r and uname -m
printer_ip in your config?
First of all, thanks for the lightning-fast response!
I'm running the proxy on a Debian 13 host in Proxmox V9. Results of uname -r and uname -m: 6.12.90+deb13.1-amd64, x86_64
I'm using an IP address.
Feel free to ask me any follow-up questions, I'll be more than glad to get to the bottom of this together.
Thanks! We have a Proxmox setup ourselves and would like to reproduce this. One question: are you running Docker directly, inside an LXC container, or inside a VM? And are you using the Docker image or the standalone Linux binary?
Sorry, was away for a few hours. I'm using https://community-scripts.org/scripts/docker-vm
Update: I saw that I missed part of your question. The script I used creates a Debian 13 VM with Docker preinstalled. Within it I'm using your docker-compose.yml.
I tested by getting the latest source version and running 'docker compose build' and 'docker compose up', with the same result as before. I did manage to catch a secondary error in the host console though:
[ 251.042052] python[1771]: segfault at 30 ip 00007feaba0b6950 sp 00007feab912a588 error 4 in libcrypto.so.3[108950,7feaba073000+27d000] likely on CPU 5 (core 1, socket 1)
[ 251.045989] Code: 0Less than a second 0.0.0.0:7125-7130->7125-7130/tcp kx 1f 80 00 00 00 00 f7 d6 21 77 30 31 f6 31 ff c3 66 0f 1f-bridge-release-kx-bridge-131 f6 31 ff c3 66 0f 1f 44 00 00 09 77 30 31 f6 31
Thanks for the additional details, @BasK!
The segfault in libcrypto.so.3 is a strong hint that this is a CPU feature emulation issue in your Proxmox VM.
By default, Proxmox VMs use the kvm64 CPU model, which only exposes a minimal set of CPU features. OpenSSL (used internally for the MQTT TLS connection) performs CPU feature detection at startup and can crash if the CPU model advertises features that the hypervisor does not fully implement.
Could you check which CPU type is set for your VM in Proxmox?
Proxmox UI: VM → Hardware → Processors → Type
If it is set to kvm64 (or any other emulated model), please try changing it to host. This passes the actual host CPU flags through to the VM and eliminates the mismatch.
After changing the CPU type, a full VM reboot is required (not just a container restart). Please let us know if that resolves the segfault!
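As a side note for anyone checking the same thing: one way to see which CPU features the guest actually exposes is to read the flags line of /proc/cpuinfo inside the VM. The helper below is only a sketch; the function name parse_cpu_flags is made up for illustration and is not part of the bridge.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no flags line found (e.g. non-x86 or empty input)


# Usage inside the VM (Linux only):
#   with open("/proc/cpuinfo") as f:
#       flags = parse_cpu_flags(f.read())
#   print("aes" in flags, "avx2" in flags)
```

Comparing the result on the Proxmox host and in the guest shows whether the selected CPU type hides any features.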
The CPU type was already set to host, but your suggestion did prompt me to do more in-depth testing. I added the following environment variables to docker-compose.yml:
environment:
  - PYTHONFAULTHANDLER=1
  - PYTHONASYNCIODEBUG=1
This led to a full segfault log, which I've attached.
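For reference, PYTHONFAULTHANDLER=1 makes Python call faulthandler.enable() at startup, so the same traces can be switched on from code when changing the environment is not convenient. A minimal sketch; the helper name enable_crash_traces is hypothetical and not taken from the bridge:

```python
import faulthandler


def enable_crash_traces(log_file):
    """Dump the Python tracebacks of all threads to log_file on a fatal signal
    (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL)."""
    # log_file must stay open and must have a real file descriptor.
    faulthandler.enable(file=log_file, all_threads=True)
    return faulthandler.is_enabled()


# Usage at program start:
#   import sys
#   enable_crash_traces(sys.stderr)
```

The traceback is written by the signal handler itself, which is why it still appears when the crash happens inside a C library such as libcrypto.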
According to Claude (which may be completely wrong, I know): _this is a heap corruption error, not a kernel/VM/seccomp issue at all. Something is corrupting the memory allocator's internal structures, and then when another thread tries to allocate memory (in this case the logging thread calling formatTime), glibc detects the corruption and aborts.
What's Happening
Two threads are racing:
Thread 1 (_poll_loop → publish → _reconnect → _do_connect) is doing an SSL read via ssl.py
Thread 2 (_read_loop) is simultaneously doing its own SSL operations and then tries to log a warning
Both threads are sharing an SSL context or socket object without proper locking, and the concurrent access to OpenSSL's internal state corrupts the heap.
This is a thread-safety bug in the application code (kobrax_client.py), not a VM or OS issue. It only manifests on this machine likely because of timing differences — slightly different CPU speed, scheduling, or load causes the race to be hit reliably here but rarely on other machines._
@BasK — your analysis was spot on, and the fault-handler trace was exactly what we needed. Thank you for digging in. 🙏
You (and Claude) were right: this is not a VM/CPU/seccomp issue, it's a genuine thread-safety bug in kobrax_client.py. The MQTT-over-TLS client shares a single SSL socket between the reader thread (recv) and the sender threads (sendall/publish) without serializing them. CPython's ssl module does not allow concurrent read and write on the same socket, so the overlap corrupted OpenSSL's internal record-layer state → heap corruption → the segfault in libcrypto.so.3. It manifested reliably on your machine purely because of timing (CPU speed / scheduling), which is why it's rare elsewhere. Your earlier note about the _poll_loop/_read_loop race was exactly the mechanism.
Fix (in v0.9.25): all socket access (recv / sendall / close / reconnect) is now serialized under a single lock. To avoid the reader starving the senders, the reader probes readiness with select() outside the lock and only takes the lock for the actual recv. Reconnect and disconnect now swap the socket atomically. I stress-tested it with many concurrent publishes during the status stream under PYTHONFAULTHANDLER=1, with no more crashes.
v0.9.25 is up now (Docker :latest / :0.9.25 and the standalone binaries). Could you pull it and let me know whether the segfault is gone on your Proxmox VM? You can drop the PYTHONFAULTHANDLER env vars again. Thanks again for the great debugging.
Thank you for the quick fix. I've started the bridge again and so far it has been running without crashing for 10 minutes, whereas before it crashed within minutes. The connection to the printer still seems very unreliable, however, so I'll file a separate bug for that in the hope that we can solve that as well.
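For readers who land here with a similar crash, the locking pattern described for v0.9.25 can be sketched roughly as follows. This is not the actual code from kobrax_client.py: the class SerializedSocket and its method names are made up for illustration, and a plain socket stands in for the TLS socket. With a real ssl.SSLSocket, already-decrypted data may sit in OpenSSL's buffer without showing up in select(), so SSLSocket.pending() would also need to be checked.

```python
import select
import socket
import threading


class SerializedSocket:
    """Hypothetical wrapper: one lock guards every operation on the socket."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def send(self, data):
        # Writers from any thread take the same lock as the reader.
        with self._lock:
            self._sock.sendall(data)

    def recv(self, bufsize=4096, timeout=0.5):
        # Probe readiness WITHOUT the lock so senders are not starved
        # while the reader waits for data.
        sock = self._sock
        try:
            readable, _, _ = select.select([sock], [], [], timeout)
        except (OSError, ValueError):
            return None  # socket was closed by a concurrent replace()
        if not readable:
            return None
        with self._lock:
            if sock is not self._sock:
                return None  # socket was swapped; caller should retry
            return sock.recv(bufsize)

    def replace(self, new_sock):
        # Reconnect: swap the socket under the lock, close the old one after.
        with self._lock:
            old, self._sock = self._sock, new_sock
        old.close()
```

Holding the lock only for the actual recv keeps the critical section short, while still guaranteeing that a read and a write never touch the socket at the same time.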