Overview of the AudioSocket protocol

Paul Tagliamonte 2023-12-13 protocol

The Asterisk VoIP project has a built-in protocol called “AudioSocket”. AudioSocket is built on top of TCP and streams int16 samples at a sample rate of 8 kHz; neither of those options is configurable (by design). AudioSocket streams audio from the connected phone to the TCP server, and plays audio samples sent from the TCP server to the phone.

This documentation is a work in progress, and a result of source code spelunking and reverse engineering. It may contain errors or outright lies. The names used here may not match the original names, but they’ve been documented on a best-effort basis to help future engineering efforts.

AudioSocket Packet

Data is exchanged over AudioSocket by framing it into TLV (type, length, value) packets. This should be a pretty natural concept for anyone who’s worked with other wire encoding schemes like ASN.1, SSH, PGP, or protobuf.

The type is a uint8, the length is transmitted as a uint16, and the payload is a variable-sized block of data whose size in bytes is given by the length field.

The header is encoded using network byte order (big endian). The only field this really matters for is the length field, since the type field is uint8. The payload format is dependent on the type of message.

type (uint8) | length (uint16, big endian) | payload (length bytes)
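The framing described above can be sketched in a few lines of Python. This is only an illustration of the layout as documented here, not code taken from Asterisk; the names HEADER, encode_packet and decode_packet are made up for this example.

```python
import struct

# Header layout from the text: uint8 type, then uint16 length, big endian.
HEADER = struct.Struct(">BH")


def encode_packet(kind: int, payload: bytes = b"") -> bytes:
    """Frame a payload as type + length + payload."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload does not fit in a uint16 length field")
    return HEADER.pack(kind, len(payload)) + payload


def decode_packet(buf: bytes) -> tuple[int, bytes, bytes]:
    """Split one packet off the front of buf; returns (type, payload, rest)."""
    if len(buf) < HEADER.size:
        raise ValueError("need more data for the header")
    kind, length = HEADER.unpack_from(buf)
    end = HEADER.size + length
    if len(buf) < end:
        raise ValueError("need more data for the payload")
    return kind, buf[HEADER.size:end], buf[end:]
```

Returning the unconsumed remainder from decode_packet makes it easy to call in a loop over a buffer that holds several packets back to back.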

A full list of commands, and the semantics of their payloads, is detailed in the table below.

Command | Definition    | Payload
0x00    | Terminate     | none
0x01    | UUID          | 16-byte UUID encoded as raw bytes
0x10    | Audio Samples | variable-length buffer of little-endian signed 16-bit integers sampled at 8 kHz
0xFF    | Error         | byte (see table below)
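As a rough sketch, the command table maps onto a small enum plus a few payload helpers. The names PacketType, parse_uuid, parse_samples and pack_samples are invented for this example, and treating the 16 raw bytes as a UUID in its standard byte order is an assumption.

```python
import enum
import struct
import uuid


class PacketType(enum.IntEnum):
    """Type bytes from the command table (names are illustrative)."""
    TERMINATE = 0x00
    UUID = 0x01
    AUDIO = 0x10
    ERROR = 0xFF


def parse_uuid(payload: bytes) -> uuid.UUID:
    """Interpret a UUID payload; assumes standard UUID byte order."""
    if len(payload) != 16:
        raise ValueError("UUID payload must be exactly 16 bytes")
    return uuid.UUID(bytes=payload)


def parse_samples(payload: bytes) -> list[int]:
    """Decode an Audio payload into little-endian signed 16-bit samples."""
    if len(payload) % 2:
        raise ValueError("audio payload must hold whole 16-bit samples")
    return list(struct.unpack(f"<{len(payload) // 2}h", payload))


def pack_samples(samples: list[int]) -> bytes:
    """Encode samples back into an Audio payload."""
    return struct.pack(f"<{len(samples)}h", *samples)
```

Note the mix of byte orders: the packet header is big endian, while the samples inside an Audio payload are little endian.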

The simplest (and shortest) AudioSocket command is “Terminate”, which indicates that the connection should be torn down. It has a type of 0x00 and no payload (a length of 0 and no body), so it is encoded as [0x00, 0x00, 0x00].

Well known Error Codes

The length of the Error packet’s payload is not defined, and may be anything. According to an AudioSocket Go library (github.com/CyCoreSystems/audiosocket), Asterisk has the following well-known error codes (although I can’t seem to find these in the source, if anyone has a link). Given that the most common implementation is Asterisk, I suspect mandating a 1-byte Error code is not a bad idea.

Code | Impl     | Description
0x01 | Asterisk | Caller has hung up the Connection
0x02 | Asterisk | Error forwarding the Frame to the caller
0x04 | Asterisk | Internal memory allocation error
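A receiver might turn an Error payload into a log message along these lines. This is a sketch that assumes the 1-byte convention suggested above and falls back gracefully when the payload is some other length; ERROR_CODES and describe_error are names made up for the example.

```python
# Codes from the table above, as reported by the Go library.
ERROR_CODES = {
    0x01: "caller has hung up the connection",
    0x02: "error forwarding the frame to the caller",
    0x04: "internal memory allocation error",
}


def describe_error(payload: bytes) -> str:
    """Describe an Error payload, assuming a 1-byte error code."""
    if len(payload) != 1:
        # The payload length is not defined, so do not assume too much.
        return f"unrecognized error payload: {payload.hex()}"
    code = payload[0]
    return ERROR_CODES.get(code, f"unknown error code 0x{code:02X}")
```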

Example Packets

Terminate the connection:

0x00 0x00 0x00

Indicate an error state of 0x11:

0xFF 0x00 0x01 0x11

Send 2 audio samples of +1 and -1:

0x10 0x00 0x04 0x01 0x00 0xFF 0xFF
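The three example packets can be reproduced with a short script, which is a handy way to check the byte order of each field. The helper names packet and show are made up for this sketch.

```python
import struct


def packet(kind: int, payload: bytes = b"") -> bytes:
    """Big-endian header (uint8 type, uint16 length) followed by the payload."""
    return struct.pack(">BH", kind, len(payload)) + payload


def show(data: bytes) -> str:
    """Format bytes the same way the examples above are written."""
    return " ".join(f"0x{b:02X}" for b in data)


terminate = packet(0x00)
error = packet(0xFF, bytes([0x11]))
# Samples are little endian: +1 is 01 00 and -1 is FF FF.
audio = packet(0x10, struct.pack("<2h", 1, -1))

print(show(terminate))  # 0x00 0x00 0x00
print(show(error))      # 0xFF 0x00 0x01 0x11
print(show(audio))      # 0x10 0x00 0x04 0x01 0x00 0xFF 0xFF
```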

Handshake

After a TCP connection is established, the client is expected to send a UUID packet to the server, which has an application-dependent meaning. It could indicate the audio stream to attach to, an identity, or an API key, depending on how the server uses it.

After the UUID packet is sent, both the Client and the Server begin to send Audio packets to their peer until the TCP connection is closed, the Terminate command is issued, or an Error packet is sent.
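The receiving half of that exchange, seen from the server, might look roughly like the sketch below. It only reads: a real server would also send Audio packets back on the same socket. The function names are hypothetical, and silently skipping unknown packet types is an assumption rather than documented behaviour.

```python
import socket
import struct
import uuid


def read_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or raise EOFError if the peer closes first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the connection")
        buf += chunk
    return bytes(buf)


def read_packet(sock: socket.socket) -> tuple[int, bytes]:
    """Read one packet: 3-byte big-endian header, then the payload."""
    kind, length = struct.unpack(">BH", read_exact(sock, 3))
    return kind, read_exact(sock, length)


def handle_connection(sock: socket.socket) -> tuple[uuid.UUID, bytes]:
    """Expect a UUID packet first, then collect audio until the stream ends."""
    kind, payload = read_packet(sock)
    if kind != 0x01 or len(payload) != 16:
        raise ValueError("expected a 16-byte UUID packet first")
    stream_id = uuid.UUID(bytes=payload)
    audio = bytearray()
    while True:
        try:
            kind, payload = read_packet(sock)
        except EOFError:
            break  # TCP connection closed
        if kind == 0x10:
            audio += payload  # raw little-endian int16 samples
        elif kind in (0x00, 0xFF):
            break  # Terminate or Error ends the stream
    return stream_id, bytes(audio)
```

In practice the audio would be processed as it arrives rather than accumulated; collecting it here just keeps the sketch short.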

Implementation Notes

Because the audio stream needs to be very low latency, it’s advisable to set TCP_NODELAY to disable Nagle’s algorithm on the TCP connection. The reason is that we’re sending many small packets of time-sensitive audio, which need to be sent right away even if more data will follow very shortly after.
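In Python, for instance, the option is set with setsockopt on the socket. The helper name enable_nodelay is made up for this sketch.

```python
import socket


def enable_nodelay(sock: socket.socket) -> None:
    """Disable Nagle's algorithm so small writes are sent immediately."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```

This would typically be called once, right after connecting or accepting, and before any AudioSocket packets are written.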

Additionally, Asterisk specifically will be very upset if you send a header and reading the body then takes more than 5ms, even if its buffer of audio never runs dry. This state is hard to hit when the header and the audio data arrive in the same IP packet, but it’s very easy to trigger when you’re operating under Nagle’s algorithm, since your data is likely to be split at points that don’t line up with AudioSocket packet boundaries.
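One way to lower the odds of a header arriving without its body is to build each packet in a single buffer and hand it to the kernel in one call, instead of writing the header and the payload separately. This is a sketch and not a guarantee: TCP is still free to split the bytes however it likes. The name send_packet is hypothetical.

```python
import socket
import struct


def send_packet(sock: socket.socket, kind: int, payload: bytes = b"") -> None:
    """Write header and payload with one sendall call, never separately."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload does not fit in a uint16 length field")
    frame = struct.pack(">BH", kind, len(payload)) + payload
    sock.sendall(frame)
```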