Overview of the AudioSocket protocol
Paul Tagliamonte 2023-12-13 protocol
The Asterisk VoIP project has a built-in protocol called “AudioSocket”. AudioSocket is built on top of TCP and streams int16 samples at a sample rate of 8 kHz; neither of those options is configurable (by design). AudioSocket will stream audio from the connected phone to the TCP server, and play audio samples sent from the TCP server to the phone.
Data is exchanged over AudioSocket by framing data into packets. This should be a pretty natural concept for anyone who’s worked on other line encoding schemes. Each packet has three fields: the type is a uint8, the length is transmitted as a uint16, and the payload is a variable sized block of data.
The header is encoded using network byte order (big endian). The only field this really matters for is the length field, since the type field is a single byte. The payload format is dependent on the type of message.
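As a rough illustration of the framing described above, here is a minimal Python sketch. The helper name `encode_packet` is mine (hypothetical) and not part of any AudioSocket library; it only assumes the layout given in the text: a uint8 type, a big endian uint16 length, then the payload.

```python
import struct


def encode_packet(kind: int, payload: bytes = b"") -> bytes:
    """Frame a payload as: uint8 type, uint16 big endian length, payload."""
    if not 0 <= kind <= 0xFF:
        raise ValueError("type must fit in a uint8")
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a uint16 length field")
    # ">BH" = network byte order, one unsigned byte, one unsigned 16-bit int
    return struct.pack(">BH", kind, len(payload)) + payload


if __name__ == "__main__":
    # A type 0x00 packet with no payload, then the header of a 320 byte payload
    print(encode_packet(0x00).hex(" "))
    print(encode_packet(0x10, bytes(320))[:3].hex(" "))
```

The byte order only shows up in the length field: a 320 byte payload has a length of 0x0140, which goes on the wire as 0x01 followed by 0x40.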
A full list of Commands, and the semantics of their Arguments, is detailed in the table below.
|Type|Command|Argument
|0x00|Terminate|None
|0x01|UUID|16-byte UUID encoded as raw bytes.
|0x10|Audio|variable length buffer of little endian signed 16 bit integers sampled at 8 kHz
|0xFF|Error|byte (see table below)
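The payload encodings listed above can be sketched in Python roughly as follows. The function names are hypothetical helpers of my own; the only things taken from the text are that audio is little endian signed 16 bit samples and that the UUID is sent as its 16 raw bytes.

```python
import struct
import uuid


def audio_payload(samples: list[int]) -> bytes:
    """Pack signed 16 bit samples as little endian bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)


def audio_samples(payload: bytes) -> list[int]:
    """Unpack an audio payload back into signed 16 bit samples."""
    if len(payload) % 2:
        raise ValueError("audio payload must hold whole 16 bit samples")
    return list(struct.unpack(f"<{len(payload) // 2}h", payload))


def uuid_payload(stream_id: uuid.UUID) -> bytes:
    """Return the 16 raw bytes of a UUID (not the 36 character text form)."""
    return stream_id.bytes
```

Note the mix of byte orders: the packet header is big endian, while the samples inside an audio payload are little endian. At 8 kHz, 20 ms of audio works out to 160 samples, or 320 bytes of payload.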
The simplest (and also shortest) command for AudioSocket is the “Terminate” command, which can be used to indicate that the connection should be torn down. It has a type of 0x00 and no payload (length of 0, no body). This would be encoded as [0x00, 0x00, 0x00].
Well known Error Codes
The length of the Error packet is not defined, and may be any length. According to an AudioSocket Go library, Asterisk has the following well known error codes (although I can’t seem to find these in the source, if anyone has a link). Given the most common implementation is Asterisk, I suspect mandating a 1-byte Error code is not a bad idea.
|Caller has hung up the Connection
|Error forwarding the Frame to the caller
|Internal memory allocation error
Terminate the connection:
0x00 0x00 0x00
Indicate an error state of 0x11:
0xFF 0x00 0x01 0x11
Send 2 audio samples of 0x0001 and 0xFFFF:
0x10 0x00 0x04 0x01 0x00 0xFF 0xFF
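To double check the example encodings above, here is a small Python sketch that splits each one back into its type and payload. `decode_packet` is a hypothetical helper, and it assumes the whole packet is already in memory rather than arriving over a socket.

```python
import struct


def decode_packet(data: bytes) -> tuple[int, bytes]:
    """Split one framed packet into (type, payload)."""
    # Header: uint8 type, then uint16 length in network byte order
    kind, length = struct.unpack(">BH", data[:3])
    payload = data[3:3 + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    return kind, payload


examples = [
    bytes([0x00, 0x00, 0x00]),
    bytes([0xFF, 0x00, 0x01, 0x11]),
    bytes([0x10, 0x00, 0x04, 0x01, 0x00, 0xFF, 0xFF]),
]

if __name__ == "__main__":
    for raw in examples:
        kind, payload = decode_packet(raw)
        print(f"type={kind:#04x} length={len(payload)} payload={payload.hex(' ')}")
```

Decoding the audio payload of the last example as two little endian signed 16 bit integers, for instance with `struct.unpack("<2h", payload)`, gives the sample values 1 and -1.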
After a TCP connection is established, the client is expected to send a
UUID Packet to the server, which has an application dependent meaning. It
could indicate the audio stream to attach to, an identity, or an API key
depending on how the server uses it.
Once the UUID packet is sent, both the Client and the Server begin to send
Audio packets to their peer until the TCP connection is closed, the
Terminate command is issued, or an
Error packet is sent.
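The receiving half of that flow, as seen from the TCP server, might look something like the Python sketch below. All of the names are hypothetical. The type values 0x00, 0x10 and 0xFF come from the examples in this post; 0x01 for the UUID packet is my assumption based on the Asterisk AudioSocket documentation. A real server would also be sending its own Audio packets back on the same connection, which this sketch leaves out.

```python
import socket
import struct
import uuid

# 0x01 for UUID is an assumption taken from the Asterisk AudioSocket docs
TERMINATE, UUID, AUDIO, ERROR = 0x00, 0x01, 0x10, 0xFF


def read_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv() may return fewer than asked for."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return bytes(buf)


def read_packet(conn: socket.socket) -> tuple[int, bytes]:
    """Read one framed packet: 3 byte header, then `length` bytes of payload."""
    kind, length = struct.unpack(">BH", read_exact(conn, 3))
    return kind, read_exact(conn, length)


def serve_stream(conn: socket.socket, on_audio) -> uuid.UUID:
    """Expect a UUID packet first, then pass Audio payloads to on_audio."""
    kind, payload = read_packet(conn)
    if kind != UUID or len(payload) != 16:
        raise ValueError("expected a 16 byte UUID packet first")
    stream_id = uuid.UUID(bytes=payload)
    while True:
        try:
            kind, payload = read_packet(conn)
        except ConnectionError:
            break  # TCP connection closed
        if kind == AUDIO:
            on_audio(payload)
        elif kind in (TERMINATE, ERROR):
            break  # either one ends the stream
    return stream_id
```

What the server does with the returned UUID is up to the application, as noted above.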
Because the audio stream needs to be very low latency, it’s advisable to set
TCP_NODELAY, in order to disable Nagle’s algorithm on the TCP connection.
The reason is that we’re sending many small packets with time sensitive
audio information which need to be sent right away, even if there is more data
to be sent very shortly after.
Additionally, Asterisk specifically will be very upset if you send a header and reading the body then takes more than 5ms, even if there’s a buffer you never exhaust. This state is hard to hit when the audio data is contained in an IP packet, but it’s very easy to trigger when you’re operating under Nagle’s algorithm, since your packet is likely to be split along non-packet boundaries.
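In Python, the two precautions above could be sketched like this. The function names are hypothetical; `setsockopt` with `TCP_NODELAY` is the standard socket API. Building the header and payload into one buffer and handing it to a single `sendall` call does not strictly guarantee that they share a TCP segment, but it avoids writing the header on its own and the body some time later.

```python
import socket
import struct


def configure_low_latency(sock: socket.socket) -> None:
    """Disable Nagle's algorithm so small writes go out right away."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def send_packet(sock: socket.socket, kind: int, payload: bytes = b"") -> None:
    """Send the header and payload together in one write."""
    header = struct.pack(">BH", kind, len(payload))
    sock.sendall(header + payload)
```

A server would call `configure_low_latency` on each accepted connection before sending any Audio packets.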