General protection fault and container restart #53

Closed
opened 2026-06-12 15:27:51 +02:00 by BasK · 9 comments

When running the new 0.9.20 release I get a general protection fault every time I try to start the container using docker compose, after it's tried to connect to the printer for a while:
[ 1331.878832] traps: python[6058] general protection fault ip:7f149fca0527 sp:7f149d206810 error:0 in libc.so.6[a1527,7f149fc27000+163000]
After this fault the container restarts.

I also noticed that the logs were flooded with many connection issues with intermittent success sprinkled in, but that was already the case in 0.9.19 afaik, so that may be a separate issue.

Owner

Thanks @BasK! A few quick questions to narrow this down:

  • What OS/platform is your Docker host? (Linux, Windows with Docker Desktop, Synology, Raspberry Pi, ...)
  • If Linux: uname -r and uname -m
  • Are you using an IP address or a hostname for printer_ip in your config?
viewit added the question label 2026-06-12 15:37:44 +02:00
Author

First of all, thanks for the lightning-fast response!

I'm running the proxy on a Debian 13 host in Proxmox V9. The uname -r and uname -m results: 6.12.90+deb13.1-amd64, x86_64
I'm using an IP address.

Feel free to ask me any follow-up questions, I'll be more than glad to get to the bottom of this together.

Owner

Thanks! We have a Proxmox setup ourselves and would like to reproduce this. One question: are you running Docker directly, inside an LXC container, or inside a VM? And are you using the Docker image or the standalone Linux binary?

Author

Sorry, was away for a few hours. I'm using https://community-scripts.org/scripts/docker-vm

Update: I saw that I missed part of your question. The script linked above creates a Debian 13 VM with Docker preinstalled. Within it I'm using your docker-compose.yml

Author

I tested by getting the latest source version and running 'docker compose build' and 'docker compose up', with the same result as before. I did manage to catch a secondary error in the host console though:
[ 251.042052] python[1771]: segfault at 30 ip 00007feaba0b6950 sp 00007feab912a588 error 4 in libcrypto.so.3[108950,7feaba073000+27d000] likely on CPU 5 (core 1, socket 1)
[ 251.045989] Code: 0Less than a second 0.0.0.0:7125-7130->7125-7130/tcp kx 1f 80 00 00 00 00 f7 d6 21 77 30 31 f6 31 ff c3 66 0f 1f-bridge-release-kx-bridge-131 f6 31 ff c3 66 0f 1f 44 00 00 09 77 30 31 f6 31

Owner

Thanks for the additional details, @BasK!

The segfault in libcrypto.so.3 is a strong hint that this is a CPU feature emulation issue in your Proxmox VM.

By default, Proxmox VMs use the kvm64 CPU model, which only exposes a minimal set of CPU features. OpenSSL (used internally for the MQTT TLS connection) performs CPU feature detection at startup and can crash if the CPU model advertises features that the hypervisor does not fully implement.

Could you check which CPU type is set for your VM in Proxmox?

Proxmox UI: VM → Hardware → Processors → Type

If it is set to kvm64 (or any other emulated model), please try changing it to host — this passes through the actual host CPU flags to the VM and eliminates the mismatch.

After changing the CPU type, a full VM reboot is required (not just a container restart). Please let us know if that resolves the segfault!
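To compare what the guest actually sees under different CPU types, the feature flags can be read from inside the VM. The following is a minimal sketch, assuming an x86 Linux guest where /proc/cpuinfo contains a "flags" line; the function names are hypothetical and not part of the bridge.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # On x86 Linux each processor entry has a line like "flags : fpu sse2 ..."
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def guest_cpu_flags(path="/proc/cpuinfo"):
    """Read the flags of the first listed processor on a Linux guest."""
    return parse_cpu_flags(Path(path).read_text())
```

Running guest_cpu_flags() once with each CPU type and diffing the two sets shows which features the type change adds or removes.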

Author

The CPU type was host, but it did trigger me to perform more in-depth testing. I added the following flags to docker-compose.yml:
environment:
- PYTHONFAULTHANDLER=1
- PYTHONASYNCIODEBUG=1

This led to a full segfault log, which I've attached.
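For reference, setting PYTHONFAULTHANDLER=1 has the same effect as enabling Python's standard faulthandler module at startup, which dumps the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. A minimal sketch of the in-code equivalent follows; the helper name is hypothetical and this is not taken from the bridge's source.

```python
import faulthandler
import sys


def enable_crash_traces(stream=None):
    """Dump Python tracebacks of all threads to `stream` on a fatal signal."""
    if stream is None:
        stream = sys.stderr
    # The stream must be backed by a real file descriptor, because the
    # handler writes to it from inside a signal handler.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

The environment variable is usually the easier option in a container, since it needs no code change.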

According to Claude (which may be completely wrong, I know): _this is a heap corruption error, not a kernel/VM/seccomp issue at all. Something is corrupting the memory allocator's internal structures, and then when another thread tries to allocate memory (in this case the logging thread calling formatTime), glibc detects the corruption and aborts.
What's Happening
Two threads are racing:

Thread 1 (_poll_loop → publish → _reconnect → _do_connect) is doing an SSL read via ssl.py
Thread 2 (_read_loop) is simultaneously doing its own SSL operations and then tries to log a warning

Both threads are sharing an SSL context or socket object without proper locking, and the concurrent access to OpenSSL's internal state corrupts the heap.
This is a thread-safety bug in the application code (kobrax_client.py), not a VM or OS issue. It only manifests on this machine likely because of timing differences — slightly different CPU speed, scheduling, or load causes the race to be hit reliably here but rarely on other machines._

viewit added the bug label 2026-06-17 07:02:49 +02:00
Owner

@BasK — your analysis was spot on, and the fault-handler trace was exactly what we needed. Thank you for digging in. 🙏

You (and Claude) were right: this is not a VM/CPU/seccomp issue — it's a genuine thread-safety bug in kobrax_client.py. The MQTT-over-TLS client shares a single SSL socket between the reader thread (recv) and the sender threads (sendall/publish) without serializing them. CPython's ssl module does not allow concurrent read and write on the same socket, so the overlap corrupted OpenSSL's internal record-layer state → heap corruption → the segfault in libcrypto.so.3. It manifested reliably on your machine purely because of timing (CPU speed / scheduling), which is why it's rare elsewhere — your earlier note about the _poll_loop / _read_loop race was exactly the mechanism.

Fix (in v0.9.25): all socket access (recv / sendall / close / reconnect) is now serialized under a single lock. To avoid the reader starving the senders, the reader probes readiness with select() outside the lock and only takes the lock for the actual recv. Reconnect and disconnect now swap the socket atomically. I stress-tested it with many concurrent publishes during the status stream under PYTHONFAULTHANDLER=1 — no more crash.
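The locking pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual kobrax_client.py code: the class name SerializedSocket and its methods are hypothetical, and a plain socket stands in for the TLS connection.

```python
import select
import socket
import threading


class SerializedSocket:
    """Hypothetical wrapper: one lock guards every operation on the socket."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def send(self, data):
        # Senders hold the lock only for the duration of one sendall().
        with self._lock:
            self._sock.sendall(data)

    def recv_if_ready(self, timeout=0.1, bufsize=4096):
        """Return received bytes, or None if nothing was readable in time."""
        with self._lock:
            sock = self._sock  # snapshot the current socket
        try:
            # Wait for readiness WITHOUT the lock so senders are not starved.
            readable, _, _ = select.select([sock], [], [], timeout)
        except (OSError, ValueError):
            return None  # socket was closed while we were waiting
        if not readable:
            return None
        with self._lock:
            if sock is not self._sock:
                return None  # a reconnect replaced the socket in the meantime
            return sock.recv(bufsize)

    def swap(self, new_sock):
        # Reconnect: replace and close the old socket in one locked step.
        with self._lock:
            old, self._sock = self._sock, new_sock
            old.close()
```

One caveat when applying this to a real ssl.SSLSocket: select() only sees the underlying TCP socket, so data that the TLS layer has already buffered will not make it readable. SSLSocket.pending() reports such buffered bytes and would need to be checked (under the lock) as well.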

v0.9.25 is up now (Docker :latest / :0.9.25 and the standalone binaries). Could you pull it and let me know whether the segfault is gone on your Proxmox VM? You can drop the PYTHONFAULTHANDLER env vars again. Thanks again for the great debugging.

Author

Thank you for the quick fix. I've started the bridge again and so far it has been running without crashing for 10 minutes, while it crashed within minutes before. The connection to the printer still seems very unreliable, but I'll file a separate bug for that in the hope that we can solve that as well.

BasK closed this issue 2026-06-17 17:11:15 +02:00
Reference: viewit/KX-Bridge-Release#53