Protocol field notes
PWP-0222 closes by describing itself as a first draft and soliciting feedback. This page is that feedback, from one implementation effort: a collection of observations made while writing this library, its mock server, and the integration suite that drives the real xrouter — together with some ideas for what a hypothetical RHPv2.1 or v3 might tighten up.
Spirit of this page
RHPv2 is a good protocol, and this page only exists because it was pleasant enough to implement that we kept going until the corners were reachable. Everything below was found by an implementation written against the published papers and a live xrouter; some of it may turn out to be a misreading of intent rather than a gap, and corrections are very welcome. Where an observation already has a workaround in this library, it's linked — none of this blocks building real applications today.
What RHPv2 gets right
It's worth being explicit about the foundations, because they're the reason the rest of this page is about refinements rather than problems:
- JSON over TCP with a 2-byte length prefix is debuggable with tcpdump, implementable in any language in an evening, and a huge step up from binary RHPv1. Keeping it should be a design goal of any future revision.
- The combined open (socket + bind + listen/connect in one message) is genuinely good ergonomics; most applications never need anything else.
- id-correlated request/reply plus seqno-tagged notifications is a sound async model that maps directly onto modern client libraries (it became one RequestAsync method and five events here).
- The family/mode matrix (ax25/netrom/inet/unix × stream/dgram/trace/raw/…) cleanly exposes everything xrouter can do without inventing per-protocol message types.
The whole client, mock server and CLI fit in a few thousand lines. Small protocol, small implementations — that's the compliment.
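To make the "implementable in an evening" claim concrete, here is a minimal Python sketch of the length-prefixed framing from the first bullet above. The byte order of the 2-byte prefix is not stated on this page, so big-endian is an assumption here, and the function names `encode_frame` and `decode_frames` are hypothetical rather than this library's API.

```python
import json
import struct


def encode_frame(message: dict) -> bytes:
    """Serialise one message as a 2-byte length prefix followed by JSON."""
    # json.dumps escapes non-ASCII by default, so the body is pure ASCII.
    body = json.dumps(message, separators=(",", ":")).encode("ascii")
    if len(body) > 0xFFFF:
        raise ValueError("frame body does not fit a 2-byte length prefix")
    # ASSUMPTION: the prefix is big-endian (network byte order).
    return struct.pack(">H", len(body)) + body


def decode_frames(buffer: bytes) -> tuple[list[dict], bytes]:
    """Split a receive buffer into complete messages plus unconsumed bytes."""
    messages = []
    while len(buffer) >= 2:
        (length,) = struct.unpack(">H", buffer[:2])
        if len(buffer) < 2 + length:
            break  # partial frame: wait for more bytes from the socket
        messages.append(json.loads(buffer[2:2 + length]))
        buffer = buffer[2 + length:]
    return messages, buffer
```

A reader loop would append each TCP read to the leftover bytes and call `decode_frames` again, which is the usual way to cope with frames split across reads.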
Observations
These are ordered roughly by how much they affected the implementation, not by severity. The protocol primer lists the raw spec-vs-reality deltas; this section is about the design consequences.
1. A successful status query has no correlatable reply
Per the spec, statusReply is only sent when a status query fails;
on success the server emits a status notification — which carries
a seqno but not the request's id. A client therefore cannot
distinguish "the answer to my query" from "an unrelated link-state
change that happened to arrive at the same moment".
This library's
QueryStatusAsync
has to race an event subscription against the error-reply path, and its
own doc comment admits the race can be satisfied by a coincidental
push. Roughly forty lines of careful code stand in for one echoed
field.
Suggestion: every request that carries an id gets exactly one
reply echoing that id — including the success case of status.
Nothing stops the server also pushing the usual notification. This
is the single highest-value, lowest-cost change on this page.
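The difference an echoed id makes can be sketched with a small id-keyed dispatcher in Python. This is an illustration, not this library's code: the `Correlator` class is hypothetical, the `handle` field name in the notification is assumed, and the successful `statusReply` shown is the suggested behaviour rather than anything a current server sends.

```python
import asyncio


class Correlator:
    """Routes incoming messages to waiting requests by id (hypothetical helper)."""

    def __init__(self) -> None:
        self._pending: dict = {}
        self.notifications: list = []

    def expect(self, request_id: int) -> asyncio.Future:
        """Register interest in the reply that echoes request_id."""
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        return future

    def dispatch(self, message: dict) -> None:
        """Resolve the matching request, or file the message as a push."""
        future = self._pending.pop(message.get("id"), None)
        if future is not None:
            future.set_result(message)
        else:
            # No id (or an unknown one): nothing ties this to a request.
            self.notifications.append(message)


async def demo() -> tuple:
    correlator = Correlator()
    reply = correlator.expect(7)
    # Today's success path: a status push with a seqno but no id.
    correlator.dispatch({"type": "status", "seqno": 41, "handle": 3, "flags": 0})
    # Suggested behaviour (hypothetical): a reply echoing the request id.
    correlator.dispatch({"type": "statusReply", "id": 7, "errCode": 0, "errText": "Ok"})
    return await reply, correlator.notifications
```

With the echoed id the waiting request resolves unambiguously, while the id-less push lands in the notification list, where nothing distinguishes it from an unrelated link-state change.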
2. connectReply.errCode mirrors the handle on success
A successful connect returns errCode equal to the socket handle
(rather than 0), with errText:"Ok". This breaks the otherwise
universal "errCode == 0 means success" contract, and it forces
clients to fall back to string-comparing errText as the success
signal — which is what this library does
(pinned at the wire level
so we notice when it's fixed).
Suggestion: errCode: 0 on success, like every other reply.
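A success check that tolerates both the current behaviour and the suggested fix might look like the following Python sketch. The function name `connect_succeeded` is hypothetical, and the exact failure text used in the usage note is illustrative only.

```python
def connect_succeeded(reply: dict) -> bool:
    """Decide whether a connectReply reports success.

    Accepts errCode == 0 (the suggested behaviour) as well as
    errText == "Ok" (the current behaviour, where errCode mirrors
    the socket handle and so cannot be trusted on its own).
    """
    if reply.get("type") != "connectReply":
        raise ValueError("not a connectReply")
    return reply.get("errCode") == 0 or reply.get("errText") == "Ok"
```

For example, a reply with `errCode` 5 and `errText` "Ok" counts as success, while a reply with `errCode` 15 and any other text does not.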
3. Oversized send.data vanishes silently
send.data above ~8 KB (the cliff sits between 8 100 and 8 200 bytes)
is dropped with no sendReply, no error, and a healthy-looking TCP
connection. A client awaiting the reply hangs forever. Compounding
it, the limit isn't advertised anywhere, so a client can't even avoid
the cliff on purpose — it has to know folklore.
Suggestion: reply with errCode: 13 ("No buffers" — it already
exists and fits), and advertise the actual limit somewhere a client
can read it (see the next item).
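Until a limit is advertised, a client can only stay below the cliff defensively. The Python sketch below splits a payload into pieces under a conservative ceiling. The 8000-byte default is an assumption chosen to sit under the observed 8 100 to 8 200 range, it is unclear from observation alone whether the limit counts raw bytes or escaped characters, and splitting like this only makes sense on stream sockets where message boundaries carry no meaning.

```python
def chunk_payload(data: bytes, limit: int = 8000) -> list:
    """Split data into consecutive pieces of at most `limit` bytes.

    ASSUMPTION: `limit` is a guess that stays under the observed cliff;
    the server does not advertise the real value.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [data[i:i + limit] for i in range(0, len(data), limit)]
```

Each piece would then go out as its own send, ideally with a timeout on the matching sendReply so that a silently dropped message fails the operation instead of hanging it.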
4. There's no way to ask the server what it supports
No version number, no capability list, no limits. A client can't learn
which families/modes this xrouter build supports, what the maximum
data size is, or whether any given quirk on this page has been fixed
— except by trying things and watching what happens. This also makes
the protocol hard to evolve: any behavioural fix risks breaking
deployed clients, because there's no way for either side to detect the
other's vintage.
This matters more now that RHP is no longer a one-server protocol: besides xrouter there's BPQ's server-side implementation (deliberately partial — enough to support WhatsPac), WhatsPac as a client, and the Samoyed soundmodem's implementation in progress. A client today can't even discover which subset it's talking to.
There is, happily, an accident of the current implementation that makes
this retrofittable: xrouter answers an unknown type by appending
Reply to it and setting errCode: 2. So a hypothetical hello request
is already cleanly detectable today: an old server answers
helloReply / errCode: 2 (→ assume baseline v2), a new server could
answer errCode: 0 plus capability fields:
{
"type": "helloReply", "id": 1, "errCode": 0, "errText": "Ok",
"proto": "2.1", "impl": "xrouter 504j",
"pfams": ["ax25", "netrom", "inet", "unix"],
"maxData": 8100,
"enc": ["latin1", "b64"]
}
Suggestion: an optional hello/helloReply exchange. It costs a
few lines in the server, is fully backwards compatible, and is the
prerequisite for fixing everything else without requiring every client
and node on the network to upgrade at the same moment.
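On the client side, the fallback logic described above is short. The following Python sketch interprets a helloReply using the field names from the example reply. The `ServerCapabilities` type, the function name, and the baseline defaults (protocol "2", Latin-1 only, unknown limit) are all assumptions for illustration, since no such exchange exists yet.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ServerCapabilities:
    """What a client may assume about the server (hypothetical type)."""
    proto: str = "2"                      # baseline when hello is unknown
    impl: Optional[str] = None
    families: list = field(default_factory=list)   # empty means unknown
    max_data: Optional[int] = None        # None means not advertised
    encodings: list = field(default_factory=lambda: ["latin1"])


def interpret_hello_reply(reply: dict) -> ServerCapabilities:
    """Map a helloReply onto capabilities, falling back to baseline v2."""
    if reply.get("type") != "helloReply":
        raise ValueError("not a helloReply")
    if reply.get("errCode") == 2:
        # Old server: it rejected the unknown type, so assume baseline v2.
        return ServerCapabilities()
    if reply.get("errCode") != 0:
        raise RuntimeError(f"hello failed: {reply.get('errText')}")
    return ServerCapabilities(
        proto=reply.get("proto", "2"),
        impl=reply.get("impl"),
        families=list(reply.get("pfams", [])),
        max_data=reply.get("maxData"),
        encodings=list(reply.get("enc", ["latin1"])),
    )
```

A client would send hello once after connecting and keep the result for the life of the connection, for instance to pick a chunk size from `max_data` when it is present.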
5. Binary payloads are underspecified
The spec says control characters in data must be JSON-escaped, but
JSON strings are sequences of Unicode code points, not bytes — so for
bytes 0x80–0xFF there are two equally spec-compliant serialisations of
the same JSON document that put different octets on the wire: the
six-character escape \u00ff, or the two-byte UTF-8 encoding of U+00FF.
A conforming JSON library is free to emit either.
This library adopts the convention that works against the real xrouter:
bytes map 1:1 to code points U+0000–U+00FF (Latin-1) and everything
non-ASCII is \u00XX-escaped, which
round-trips byte-perfectly over real RF-path AX.25
(see Binary_Bytes_Round_Trip_Via_Dgram_Through_Real_Xrouter). But
nothing in the published papers says this is the convention, and the
cost is real: six wire bytes per high byte means a worst-case binary
payload inflates 6:1 — under a 65 535-byte frame cap and an ~8 KB
data ceiling. Compressed FBB forwarding, YAPP, anything 8-bit hits
this.
Suggestion: two parts. For v2.1, write the existing code-point
convention into the spec as normative, so all implementations agree.
Alongside it, add an optional "enc": "b64" field on
send/sendto/recv (negotiated via hello) — base64 is 1.33:1
instead of 6:1, and removes the encoding ambiguity entirely.
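The code-point convention and its cost are easy to reproduce with Python's standard `json` module, which escapes every non-ASCII character as \u00XX when `ensure_ascii` is on. This is a sketch of the convention as described above, not this library's encoder, and the helper names are hypothetical.

```python
import base64
import json


def encode_data_latin1(payload: bytes) -> str:
    """Return the JSON string literal (quotes included) for a byte payload.

    Each byte becomes the code point of the same value (Latin-1), and
    ensure_ascii=True escapes everything outside ASCII as \\u00XX.
    """
    return json.dumps(payload.decode("latin-1"), ensure_ascii=True)


def decode_data_latin1(literal: str) -> bytes:
    """Invert encode_data_latin1: JSON string literal back to bytes."""
    return json.loads(literal).encode("latin-1")


# Worst case: 100 bytes that all need the six-character escape.
worst_case = bytes([0xFF]) * 100
escaped_literal = encode_data_latin1(worst_case)
# The same bytes as a base64 JSON string, for comparison.
base64_literal = json.dumps(base64.b64encode(worst_case).decode("ascii"))
```

Comparing `len(escaped_literal)` with `len(base64_literal)` shows the gap: 602 characters against 138 for the same 100 bytes, quotes included.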
6. A failed connect never says why
An AX.25 connect to an unreachable station returns
connectReply/"Ok" immediately (correctly — the handshake is
asynchronous), and then, after FRACK × N retries, the failure arrives
as a bare status with flags: 0 followed by a close notification.
No reason is given. Retry-exhausted, DM received, busy, link reset —
AX.25 has distinct failure modes, and applications genuinely want them:
a BBS forwarding scheduler treats "busy, try later" very differently
from "no such route". Notably, errCode: 15 ("No Route") already
exists in the spec and would be a natural fit, but doesn't appear on
this path.
Suggestion: an optional reason field (numeric, reusing the
existing error-code table where it fits) on server-initiated status
and close notifications.
7. Socket handles are global, not per-connection
Handles are allocated from a single pool inside xrouter, and — as the
integration suite pins
(Handles_Are_Globally_Numbered_Across_Connections) — a handle
allocated on one RHP connection can be operated on from another.
With a single trusted client this is harmless; with several clients on
one node (a BBS, a chat server, and a monitor, say) it means any of
them can close or send on the others' sockets, by accident or
otherwise, by using small integers.
Suggestion: track the owning RHP connection per handle and reject
operations from elsewhere (errCode: 3 fits). Per-connection handle
numbering would be the v3 version, but ownership enforcement alone
removes the sharp edge compatibly.
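Ownership enforcement needs very little state on the server side. The Python sketch below keeps global handle numbering and adds one owner lookup per operation. The `HandleTable` class and the integer connection ids are hypothetical, and it says nothing about how xrouter stores handles internally.

```python
class HandleTable:
    """Global handle pool that remembers which connection owns each handle."""

    ERR_OK = 0
    ERR_NOT_OWNER = 3  # the errCode suggested above for foreign handles

    def __init__(self) -> None:
        self._owner: dict = {}
        self._next_handle = 1

    def allocate(self, connection_id: int) -> int:
        """Hand out the next global handle and record its owner."""
        handle = self._next_handle
        self._next_handle += 1
        self._owner[handle] = connection_id
        return handle

    def check(self, connection_id: int, handle: int) -> int:
        """Return the errCode for operating on `handle` from this connection."""
        if self._owner.get(handle) == connection_id:
            return self.ERR_OK
        return self.ERR_NOT_OWNER

    def release_connection(self, connection_id: int) -> None:
        """Forget every handle owned by a connection that has gone away."""
        self._owner = {h: c for h, c in self._owner.items() if c != connection_id}
```

Because numbering stays global, existing clients see the same handle values as before; only operations on somebody else's handle change outcome.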
8. One failed auth wedges the whole connection
After a rejected auth, every subsequent request on that TCP
connection — whatever its type — is answered with
authReply/errCode: 14. The only recovery is to reconnect. A
mistyped password in an interactive client costs the user their whole
session, and (in our case) it dictated test-harness design: each
error-path test needs a fresh connection.
Suggestion: a failed auth fails that request; the client may
retry auth on the same connection.
9. Two complete lifecycles do the same job
The combined open and the BSD-style
socket/bind/listen/connect path are alternative ways to reach
identical states. Supporting both doubles the message catalogue, the
documentation, the test surface, and the room for divergence — and
empirically the divergence is real: most of the quirks on this page
(connectReply mirroring, the dgram-listen rejection, the inet bind
nondeterminism) live on the BSD path. The only thing the long way
genuinely adds is being able to tell a bind failure from a connect
failure.
Suggestion: bless one path (the combined open seems the natural
keeper, perhaps with an optional richer error report) and document the
other as legacy. Even just declaring which path is canonical would
help future implementers.
10. Flow control is advisory, and arrives after the fact
The BUSY flag — in async status pushes and in sendReply.status —
tells a client the transmit queue is saturated, but only after the
sends that saturated it have been issued. A client pipelining sends at
TCP speed into a 1200-baud link can overrun the queue before the first
BUSY arrives, and (per observation 3) the overflow may then vanish
silently. There's no window, no advertised queue depth, no hard
contract for how much may be outstanding.
Suggestion (v3-sized): credit-based flow control — openReply
grants an initial byte budget, the server tops it up with small
credit notifications as the queue drains, and a client never sends
beyond its credit. This is the standard solution wherever a fast
transport feeds a slow link (SSH channels, HTTP/2, AMQP), and packet
radio is about as extreme a fast-feeds-slow boundary as exists.
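The client half of a credit scheme is a small counter. This Python sketch tracks a byte budget, releases only what the budget allows, and is topped up by credit notifications. The `CreditWindow` name and the byte-based accounting are assumptions; nothing like this exists in RHPv2 today.

```python
class CreditWindow:
    """Client-side byte budget for credit-based flow control (hypothetical)."""

    def __init__(self, initial_credit: int) -> None:
        # Would come from openReply in the suggested design.
        self.available = initial_credit

    def grant(self, amount: int) -> None:
        """Apply a credit notification from the server."""
        if amount < 0:
            raise ValueError("credit grants must not be negative")
        self.available += amount

    def take(self, data: bytes) -> bytes:
        """Return the prefix of `data` that may be sent now and deduct it."""
        allowed = data[:self.available]
        self.available -= len(allowed)
        return allowed
```

A sender would call `take` before each send, transmit only the returned prefix, and hold the remainder until the next `grant` arrives, so the queue on the slow side can never be overrun.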
11. Notes on the security model
Three things worth flagging, all about the TCP/WebSocket side (where — unlike the RF side — encryption and real credentials are legal):
- RFC1918 is not a trust boundary. CGNAT ranges, shared LANs, hotel/club networks and container networks all put strangers inside 10/8 and 192.168/16. An explicit allowlist (the ACCESS.SYS mechanism) is a stronger default than implicit RFC1918 trust.
- The WebSocket endpoint needs an Origin check. The same-origin policy does not stop a web page from opening a WebSocket to ws://node:9000/rhp, so any site visited by a browser on the LAN could drive the node (and its transmitter) unless the server validates the Origin header. Worth doing before the WS endpoint sees wide use.
- Plaintext auth is a stated, reasonable tradeoff for the amateur context, but a simple challenge-response (server sends a nonce, client returns a hash) would cost little and avoid replayable credentials on the wire, and TLS on the TCP listener is always an option for internet-exposed nodes.
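Two of the points above fit in a few lines each. The Python sketch below shows an Origin allowlist check and a nonce-based challenge-response. HMAC-SHA256 is used here as one concrete choice of "hash"; the page does not prescribe an algorithm. All function names are hypothetical, and admitting requests with no Origin header (non-browser clients) is a policy assumption, not a recommendation from the spec.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def origin_allowed(origin: Optional[str], allowed: set) -> bool:
    """Accept a WebSocket upgrade only from allowlisted browser origins.

    ASSUMPTION: a missing Origin header means a non-browser client,
    which is let through to the normal auth step.
    """
    return origin is None or origin in allowed


def make_nonce() -> str:
    """Server side: a fresh random challenge for each auth attempt."""
    return secrets.token_hex(16)


def client_response(password: str, nonce: str) -> str:
    """Client side: prove knowledge of the password without sending it."""
    return hmac.new(password.encode("utf-8"), nonce.encode("ascii"),
                    hashlib.sha256).hexdigest()


def server_verify(password: str, nonce: str, response: str) -> bool:
    """Server side: recompute the expected value and compare in constant time."""
    return hmac.compare_digest(client_response(password, nonce), response)
```

Because the nonce changes on every attempt, a captured response cannot be replayed later, which is the property plaintext auth lacks.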
12. port and special addresses live outside the spec
The port field is, in practice, an opaque server-defined identifier:
on xrouter and BPQ it's a radio port number, but a single-interface
packet engine (a soundmodem, say) has nothing natural to put there, and
the spec doesn't say what a server should do with a null or omitted
port. In xrouter, port "0" (or omitting it) means "all ports"
for datagram and raw binds, while NetRom streams don't use it at all —
none of which is written down. There's also at least one magic
address: a remote of SWITCH connects to a node's command
interpreter. That's a node convention rather than part of RHP (it has
no meaning on a pure packet engine, and BPQ ties its null-port
handling to it), which is exactly why it belongs in the spec as a
non-normative note — an implementer can't currently learn it exists
without reading other people's source.
Suggestion: define port as an opaque, server-defined string;
specify null/omitted-port behaviour per mode; and add a non-normative
appendix listing well-known conventions (SWITCH, port "0") so
implementations can interoperate with nodes without reverse-engineering
them.
A hypothetical v2.1 (backwards compatible)
Everything here can ship without breaking a single deployed client:
| # | Change | Fixes observation |
|---|---|---|
| 1 | Every id-carrying request gets exactly one id-echoing reply | 1 |
| 2 | connectReply.errCode = 0 on success | 2 |
| 3 | Oversized send → errCode: 13, never silence | 3 |
| 4 | Optional hello/helloReply with version, capabilities, maxData | 4 |
| 5 | Normative byte ↔ code-point mapping for data; optional enc:"b64" | 5 |
| 6 | Optional reason on server-initiated status/close | 6 |
| 7 | Handle ownership enforcement across RHP connections | 7 |
| 8 | Failed auth fails the request, not the connection | 8 |
| 9 | Document the published errata (casing, port types, TRACE fields, errCode: 17) as normative | none |
Item 9 deserves a sentence: this library's integration suite is in effect a machine-checkable errata document for PWP-0222 — several tests are deliberately written to fail when xrouter changes behaviour, so a v2.1 spec could largely be transcribed from it, and conformance of future builds checked against it.
A hypothetical v3 (breaking, someday, maybe)
If a breaking revision is ever on the table, the v2.1 list plus:
- Keep JSON and the length-prefix framing: readability is the protocol's superpower; CBOR would buy little.
- Base64 data everywhere; drop the code-point convention.
- One lifecycle: keep open/accept, retire the BSD message family.
- id required on every request; a uniform reply envelope (type, id, errCode, errText, then payload fields) with no field that means different things on success and failure.
- Credit-based flow control (observation 10).
- Per-connection handle namespaces.
- Token or challenge-response auth; Origin validation on WebSocket; optional TLS.
Closing
None of this is a request so much as a record: these are the points where an independent implementation had to stop and experiment rather than read. RHPv2 already does the most important thing a protocol can do — it exists, it works, and real applications talk to real radios through it. The hope is that this page makes the next implementer's evening even shorter, and gives the protocol's author useful raw material if and when a revision happens.
Discussion welcome — the OARC #packet channels or this repo's
issue tracker are both
good places.