Advice on debugging MQTT packet corruption?

We’re using Losant on a few early-stage projects. In one project, we’ve noticed Losant receiving corrupt MQTT packets. Just wondering if you have any advice on where to start.

Of course we’re looking at our application to see if we’re simply transmitting invalid packets, but it doesn’t look that way right now. Before and after transmitting, all our buffers contain the expected string values.

On the Losant side, we can see that the corrupted packets basically have some characters swapped out for invalid ones. My best guess is that this is a true transmission error, but I haven’t seen this on other devices using the same modem, so I’m not sure why it would be happening so frequently on this device.

Do you accept CRCs or checksums for MQTT packets?

Is there any way we can view a historical log of non-accepted messages? Right now I’m debugging by watching the live Application Log window, but it’s pretty inconvenient if we want to investigate something that happened in the past.

A couple of questions about what you are seeing - so the only thing corrupted is the payload? The packet as a whole is accepted by Losant, on the expected topic, but the payload is not what you expect? Is it possible that it isn’t corruption, but an encoding mismatch/display error? If you are sending binary data, the application log could display it very strangely (the browser may be coercing the data to UTF-8 when trying to display it).

There is no built-in mechanism for CRCs or checksums in MQTT that I know of (I think it just relies on the checksumming built into TCP), so Losant doesn’t have anything built in for that. It is certainly something you can add to your payloads, though, and then verify the checksum, hash, or CRC using a workflow.
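Purely as an illustration (none of this is a built-in Losant feature), the device-side half of that could look something like the C sketch below: compute a CRC-32 over the state data and send it along as an extra field, then have a workflow recompute the CRC over the same bytes and compare before trusting the values. The field name, JSON shape, and helper names here are all invented for the example, so you’d adapt them to whatever your actual state payload looks like.

```c
#include <stdint.h>
#include <stdio.h>

/* Standard CRC-32 (IEEE, reflected, polynomial 0xEDB88320), bitwise version. */
static uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Build a payload that carries its own CRC. The "crc" field and the exact
 * JSON shape are invented for this example; a workflow on the Losant side
 * would recompute the CRC over the "data" portion and compare before using
 * the values. */
static int build_payload(char *out, size_t out_len, double temperature)
{
    char data[64];
    int n = snprintf(data, sizeof(data), "{\"temperature\":%.2f}", temperature);
    if (n < 0 || (size_t)n >= sizeof(data))
        return -1;

    uint32_t crc = crc32((const uint8_t *)data, (size_t)n);
    n = snprintf(out, out_len, "{\"data\":%s,\"crc\":\"%08lX\"}",
                 data, (unsigned long)crc);
    return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}

int main(void)
{
    char payload[160];
    if (build_payload(payload, sizeof(payload), 21.37) > 0)
        printf("%s\n", payload);   /* this string is what gets published over MQTT */
    return 0;
}
```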

  1. We’re seeing random corruption anywhere in the packet; this can include the topic, or the word ‘losant’ itself, or the device ID, or the payload.

  2. No, I’m pretty sure this is not an encoding mismatch. For example, I see the device ID appearing correctly on some transmitted messages in Losant, but occasionally I see something very close to the correct ID with a couple of characters garbled.

  3. It didn’t occur to me to use a workflow; I will keep that in mind. I was hoping for an MQTT-level mechanism so that the transmitting end wouldn’t even receive an ACK if transmission errors were detected. The issue right now is that (conceivably) we could transmit valid but corrupted JSON, and our transmitting device would think the packet was transmitted successfully because it gets the ACK from MQTT. Using a workflow, I don’t really see how we could achieve the same effect, unless we have some complex closed-loop system in the firmware to poll the server and ensure every measurement arrived with a valid CRC.

And, to reiterate: is it possible to view a historical log of application errors? If not, I would really like to have this feature added. Otherwise, we have to log on the device end to diagnose transmit failures. Something like a basic server log of rejected messages would be very useful.

I’m kinda surprised that you are even seeing messages when the topics are corrupted (especially the device ID) - because Losant will actually not accept a message at all on topics such as /losant//state where the device ID is invalid - it will actually close the MQTT connection. Also, if random characters in the entire MQTT packet are being corrupted, I’m surprised the connection is able to stay open at all - i.e., I’m surprised the packets are parseable. Our broker will close the connection when a packet is un-parseable. To be honest, I’ve never seen the kind of situation you are describing; the mechanisms built into TCP itself should be preventing that kind of corruption (if it is a network transmission error).

There is no historical application log - we don’t store data from messages except in specific circumstances (like device state). I do know customers have used workflows to send the data to third parties that are better equipped for things like long-term log analysis (for instance, using a service like Loggly, or dumping data to an S3 bucket). We do have a historical connection log, though, with timestamps and disconnection reasons - I’d be interested to know if your device is getting disconnected randomly with protocol errors (not sure off the top of my head what the error message would be for corrupted packets).

I think we’re in agreement; Losant isn’t accepting the packets, it’s just showing in the Application Log window that it received a packet and rejected it. The malformed packet data isn’t actually making its way into the device’s data history. It’s possible the connection is being closed on the Losant end when it sees malformed packets; I haven’t yet investigated the socket behavior we see when a corrupted packet is transmitted.

I would copy the exact error message into this thread, but it’s pretty device-specific and I didn’t want to reveal product information.

So, it’s not an issue of us seeing incorrect data being displayed as if it was correct data. The issue is providing an indication to our transmitting device that a particular MQTT packet was rejected.

I’m not familiar with how MQTT acknowledgement works (I’ve just jumped into this project), so I’ll probably start looking there in our device firmware. It sounds like detecting a missing ACK or a NACK from the server would be sufficient to treat a transmission as failed, and queue it for retry. At least this would catch issues like a corrupt device ID or topic, even if it may not catch a corrupted data-field value, right?

Edit: To address a couple more of your points,

  1. I agree with what you’re saying about TCP; it should handle any transmit errors between our modem and your server, so the issue is likely not at that point.
  2. I suspect the error is between our MCU and the modem, but I’d like to be able to catch it at the server-response level if the MCU->modem transmission somehow fails. We’re currently working on debugging and validating the MCU->modem payload transmission to catch transmit errors at that stage as well.

If you are publishing packets to Losant with a QoS level of 1 (we support QoS levels 0 and 1), then the broker replies with an ACK when it accepts a packet. For any corruption that would cause a packet to be rejected out of hand (i.e., a bad topic or an un-parseable packet), the Losant broker would not respond with an ACK - so a missing ACK should be enough to know to resend.
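As a rough sketch of what that looks like from the device side - assuming your port exposes the stock embedded Paho MQTTClient-C API (MQTTPublish, MQTTMessage, QOS1; names may differ in your port) - you can treat anything other than SUCCESS from a QoS 1 publish as “no PUBACK, assume it wasn’t delivered” and retry:

```c
#include <string.h>
#include "MQTTClient.h"   /* embedded Paho MQTTClient-C wrapper (assumed) */

/* Publish one payload at QoS 1 and treat a missing PUBACK as a failure.
 * With QoS 1, MQTTPublish() only returns SUCCESS once the broker's PUBACK
 * arrives within the command timeout; anything else means "assume not
 * delivered" and retry or queue the payload.
 * "client" is assumed to already be connected; the topic is illustrative. */
static int publish_state(MQTTClient *client, const char *topic, char *payload)
{
    MQTTMessage msg;
    memset(&msg, 0, sizeof(msg));
    msg.qos        = QOS1;
    msg.retained   = 0;
    msg.payload    = payload;
    msg.payloadlen = strlen(payload);

    for (int attempt = 0; attempt < 3; attempt++) {
        if (MQTTPublish(client, topic, &msg) == SUCCESS)
            return 0;            /* PUBACK received, packet accepted */
        /* No PUBACK (rejected packet, timeout, dropped connection):
         * back off and/or reconnect here before retrying, as appropriate. */
    }
    return -1;                   /* hand the payload to the failed-transmit buffer */
}
```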

There is still the potential for a message to be accepted by the broker, however, if the corruption is limited to the payload (as long as the message as a whole is well-formed), since the broker itself doesn’t care about the contents of the message (or know what the correct contents even should be). In that case, the packet would still be ACKed, the device would remain connected, and the potentially corrupted packet would still be pushed through the rest of Losant (where, for instance, if it was a device state message, it would probably be ignored because it wouldn’t match what is expected for a device state message).

What kind of device/modem/library are you using? I’m asking around here in the office to see if anyone else has experienced anything like what you are seeing.

Thanks - that info about the QoS levels you support is exactly what I was looking for. I do understand the situation you’re describing where the payload data itself could be corrupted but accepted; we can handle that separately if needed.

We’re using a u-blox GSM modem and a ported version of the Paho MQTT library. To be honest, I suspect this is related to some hairy details of our application code (e.g., whether we check every ACK in the right order, at the right time, when publishing payloads) rather than to our stack. We have another product with the same modem but using HTTP POST, and we don’t experience the same issue.

There’s some fairly complex logic in our application around the connect/publish/disconnect sequence, sleeping, and keeping a circular buffer of failed transmits, all performed with the modem in transparent mode (as opposed to using AT commands).

Our plan of attack right now is to modify our bulk-transmit operation so that it publishes only a single payload inside each connect/publish/disconnect cycle, making each publish an atomic operation. This will allow us to examine server responses on that socket with better granularity.
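For what it’s worth, here’s roughly the shape of what we’re planning - very much a sketch, assuming the stock embedded Paho MQTTClient-C wrapper and leaving out the transparent-mode transport setup; the credentials and topic below are placeholders, not our real ones:

```c
#include <string.h>
#include "MQTTClient.h"   /* embedded Paho MQTTClient-C wrapper (assumed) */

/* One connect / publish / disconnect cycle per payload, so each message is
 * effectively atomic and the broker's responses on that socket are easy to
 * attribute to a single publish. The Network handle is whatever the
 * transparent-mode socket layer provides. */
static int send_one(Network *net, const char *topic, char *payload)
{
    MQTTClient client;
    unsigned char sendbuf[512], readbuf[512];
    MQTTPacket_connectData conn = MQTTPacket_connectData_initializer;
    MQTTMessage msg;
    int rc = -1;

    MQTTClientInit(&client, net, 5000 /* command timeout, ms */,
                   sendbuf, sizeof(sendbuf), readbuf, sizeof(readbuf));

    conn.clientID.cstring  = "device-id";        /* placeholder credentials */
    conn.username.cstring  = "access-key";
    conn.password.cstring  = "access-secret";
    conn.keepAliveInterval = 30;
    conn.cleansession      = 1;

    if (MQTTConnect(&client, &conn) == SUCCESS) {
        memset(&msg, 0, sizeof(msg));
        msg.qos        = QOS1;                   /* wait for the broker's PUBACK */
        msg.payload    = payload;
        msg.payloadlen = strlen(payload);

        rc = (MQTTPublish(&client, topic, &msg) == SUCCESS) ? 0 : -1;
        MQTTDisconnect(&client);
    }
    return rc;   /* non-zero: leave the payload in the failed-transmit buffer */
}
```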

Thank you again for the support - if you hear anything around the office about the stack we’re using, I would be interested to know.