General protection fault and container restart #53

BasK · 2026-06-12T15:27:51+02:00

BasK commented

2026-06-12 15:27:51 +02:00

When running the new 0.9.20 release I get a general protection fault every time I try to start the container using docker compose, after it's tried to connect to the printer for a while:
[ 1331.878832] traps: python[6058] general protection fault ip:7f149fca0527 sp:7f149d206810 error:0 in libc.so.6[a1527,7f149fc27000+163000]
After this fault the container restarts.

I also noticed that the logs were flooded with many connection issues with intermittent success sprinkled in, but that was already the case in 0.9.19 afaik, so that may be a separate issue.

viewit commented

2026-06-12 15:37:21 +02:00

Thanks @BasK! A few quick questions to narrow this down:

- What OS/platform is your Docker host? (Linux, Windows with Docker Desktop, Synology, Raspberry Pi, ...)
- If Linux: uname -r and uname -m
- Are you using an IP address or a hostname for printer_ip in your config?

viewit added the

question

label 2026-06-12 15:37:44 +02:00

BasK commented

2026-06-12 16:31:11 +02:00

First of all, thanks for the lightning-fast response!

I'm running the proxy on a Debian 13 host in Proxmox V9. Output of uname -r and uname -m: 6.12.90+deb13.1-amd64, x86_64
I'm using an IP address.

Feel free to ask me any follow-up questions, I'll be more than glad to get to the bottom of this together.

viewit commented

2026-06-12 18:52:22 +02:00

Thanks! We have a Proxmox setup ourselves and would like to reproduce this. One question: are you running Docker directly, inside an LXC container, or inside a VM? And are you using the Docker image or the standalone Linux binary?

BasK commented

2026-06-12 21:09:35 +02:00

Sorry, was away for a few hours. I'm using https://community-scripts.org/scripts/docker-vm

Update: I saw that I missed part of your question. The script above creates a Debian 13 VM with Docker preinstalled. Within it I'm using your docker-compose.yml

BasK commented

2026-06-15 23:40:02 +02:00

I tested by getting the latest source version and running 'docker compose build' and 'docker compose up', with the same result as before. I did manage to catch a secondary error in the host console though:
[ 251.042052] python[1771]: segfault at 30 ip 00007feaba0b6950 sp 00007feab912a588 error 4 in libcrypto.so.3[108950,7feaba073000+27d000] likely on CPU 5 (core 1, socket 1)
[ 251.045989] Code: 0Less than a second 0.0.0.0:7125-7130->7125-7130/tcp kx 1f 80 00 00 00 00 f7 d6 21 77 30 31 f6 31 ff c3 66 0f 1f-bridge-release-kx-bridge-131 f6 31 ff c3 66 0f 1f 44 00 00 09 77 30 31 f6 31

viewit commented

2026-06-16 13:10:34 +02:00

Thanks for the additional details, @BasK!

The segfault in libcrypto.so.3 is a strong hint that this is a CPU feature emulation issue in your Proxmox VM.

By default, Proxmox VMs use the kvm64 CPU model, which only exposes a minimal set of CPU features. OpenSSL (used internally for the MQTT TLS connection) performs CPU feature detection at startup and can crash if the CPU model advertises features that the hypervisor does not fully implement.

Could you check which CPU type is set for your VM in Proxmox?

Proxmox UI: VM → Hardware → Processors → Type

If it is set to kvm64 (or any other emulated model), please try changing it to host — this passes through the actual host CPU flags to the VM and eliminates the mismatch.

After changing the CPU type, a full VM reboot is required (not just a container restart). Please let us know if that resolves the segfault!
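
To see which CPU flags the guest actually exposes, something like the following can be run inside the VM. This is only a sketch for x86 Linux guests: the helper name parse_cpu_flags and the list of flags to report are chosen here for illustration, not taken from the bridge or from OpenSSL.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Illustrative subset only: flags commonly used by accelerated crypto code.
FLAGS_OF_INTEREST = ("aes", "pclmulqdq", "avx", "avx2", "sha_ni")

# Small built-in sample so the parser can be checked without a real /proc.
sample = "processor\t: 0\nflags\t\t: fpu sse2 aes avx\n"
sample_flags = parse_cpu_flags(sample)

cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    guest_flags = parse_cpu_flags(cpuinfo.read_text())
    for name in FLAGS_OF_INTEREST:
        print(f"{name:10s} {'yes' if name in guest_flags else 'no'}")
```

Comparing this output between the Proxmox host and the VM shows whether the configured CPU type hides any of these flags from the guest.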

BasK commented

2026-06-16 19:49:05 +02:00

The CPU type was already set to host, but your question did prompt me to do more in-depth testing. I added the following variables to docker-compose.yml:
environment:
  - PYTHONFAULTHANDLER=1
  - PYTHONASYNCIODEBUG=1

This led to a full segfault log, which I've attached.
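
For anyone who cannot change the container environment, the standard-library faulthandler module can be enabled from code with a similar effect. A minimal sketch, assuming a hypothetical log file name (bridge_fault.log) that is not part of the bridge:

```python
import faulthandler
import os
import tempfile

# Hypothetical log location, chosen only for this sketch.
log_path = os.path.join(tempfile.gettempdir(), "bridge_fault.log")
fault_log = open(log_path, "w")  # must stay open while the handler is enabled

# Similar to PYTHONFAULTHANDLER=1, but the crash report goes to a file we pick
# and covers every thread, which matters for races between threads.
faulthandler.enable(file=fault_log, all_threads=True)

# Stacks can also be dumped on demand, e.g. while the process looks stuck.
faulthandler.dump_traceback(file=fault_log, all_threads=True)

with open(log_path) as fh:
    report = fh.read()
print(report)
```

On a fatal signal such as SIGSEGV the handler writes the Python stack of each thread to that file, which is the kind of trace attached to this comment.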

According to Claude (which may be completely wrong, I know): _this is a heap corruption error, not a kernel/VM/seccomp issue at all. Something is corrupting the memory allocator's internal structures, and then when another thread tries to allocate memory (in this case the logging thread calling formatTime), glibc detects the corruption and aborts.
What's Happening
Two threads are racing:

Thread 1 (_poll_loop → publish → _reconnect → _do_connect) is doing an SSL read via ssl.py
Thread 2 (_read_loop) is simultaneously doing its own SSL operations and then tries to log a warning

Both threads are sharing an SSL context or socket object without proper locking, and the concurrent access to OpenSSL's internal state corrupts the heap.
This is a thread-safety bug in the application code (kobrax_client.py), not a VM or OS issue. It only manifests on this machine likely because of timing differences — slightly different CPU speed, scheduling, or load causes the race to be hit reliably here but rarely on other machines._

log_segfault.txt

7.2 KiB

viewit added the

bug

label 2026-06-17 07:02:49 +02:00

viewit commented

2026-06-17 07:39:57 +02:00

@BasK — your analysis was spot on, and the fault-handler trace was exactly what we needed. Thank you for digging in. 🙏

You (and Claude) were right: this is not a VM/CPU/seccomp issue — it's a genuine thread-safety bug in kobrax_client.py. The MQTT-over-TLS client shares a single SSL socket between the reader thread (recv) and the sender threads (sendall/publish) without serializing them. CPython's ssl module does not allow concurrent read and write on the same socket, so the overlap corrupted OpenSSL's internal record-layer state → heap corruption → the segfault in libcrypto.so.3. It manifested reliably on your machine purely because of timing (CPU speed / scheduling), which is why it's rare elsewhere — your earlier note about the _poll_loop / _read_loop race was exactly the mechanism.

Fix (in v0.9.25): all socket access (recv / sendall / close / reconnect) is now serialized under a single lock. To avoid the reader starving the senders, the reader probes readiness with select() outside the lock and only takes the lock for the actual recv. Reconnect and disconnect now swap the socket atomically. I stress-tested it with many concurrent publishes during the status stream under PYTHONFAULTHANDLER=1 — no more crash.
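
The locking pattern described above can be sketched roughly as follows. This is not the code from kobrax_client.py: the class name SerializedSocket and its methods are invented for illustration, and the demo uses a plain socketpair instead of a TLS connection.

```python
import select
import socket
import threading


class SerializedSocket:
    """Hypothetical wrapper: every operation on the shared socket takes one lock."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def recv(self, bufsize, timeout=0.5):
        sock = self._sock  # snapshot; replace() may swap it meanwhile
        if sock is None:
            return None
        try:
            # Wait for readiness OUTSIDE the lock so senders are not starved.
            readable, _, _ = select.select([sock], [], [], timeout)
        except (OSError, ValueError):
            return None  # socket was closed while we were waiting
        if not readable:
            return None
        with self._lock:
            if self._sock is not sock:
                return None  # socket was swapped during the wait
            return sock.recv(bufsize)

    def sendall(self, data):
        with self._lock:
            if self._sock is None:
                raise ConnectionError("not connected")
            self._sock.sendall(data)

    def replace(self, new_sock):
        """Swap the socket under the lock (reconnect); pass None to disconnect."""
        with self._lock:
            old, self._sock = self._sock, new_sock
        if old is not None:
            old.close()


# Demo on a local socket pair standing in for the printer connection.
local_end, remote_end = socket.socketpair()
link = SerializedSocket(local_end)

link.sendall(b"ping")
echoed = remote_end.recv(4)

remote_end.sendall(b"pong")
reply = link.recv(4)

link.replace(None)
remote_end.close()
```

With a real ssl.SSLSocket, already-decrypted data can sit in OpenSSL's buffer (see SSLSocket.pending()), so select() alone may under-report readiness, and a blocking recv can still hold the lock while waiting for the rest of a TLS record. A production version would have to account for both.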

v0.9.25 is up now (Docker :latest / :0.9.25 and the standalone binaries). Could you pull it and let me know whether the segfault is gone on your Proxmox VM? You can drop the PYTHONFAULTHANDLER env vars again. Thanks again for the great debugging.

BasK commented

2026-06-17 17:11:13 +02:00

Thank you for the quick fix. I've started the bridge again and so far it has been running without crashing for 10 minutes, whereas before it crashed within minutes. The connection to the printer still seems very unreliable, however, so I'll file a separate bug for that in the hope that we can solve it as well.

BasK closed this issue

2026-06-17 17:11:15 +02:00

Reference: viewit/KX-Bridge-Release#53