Edge device/gateway connection reliability, MQTT heartbeat and QoS

Hi all,

Is there any more information on how Losant uses MQTT under the hood with regard to heartbeat messages, commands, QoS, etc.? I am looking to understand a few things:

  1. How many messages, and what type of messages, are sent by edge devices (like the LEA), so I can better understand the data consumption for a device on an LTE connection? For instance, are you sending heartbeat messages to verify that the connection is still alive? How often?
  2. How does Losant detect edge devices going online/offline? If a device doesn’t update its state for a long while, how do you know it’s still online? Especially since you only support QoS 0, could it be that devices are actually offline because their network connection dropped and they didn’t actively disconnect from the MQTT broker? I guess you are relying on the TCP connection, TCP ACK messages, and the TCP connection timeout to detect that a connection to the broker dropped when a command is sent to an edge device? Would a TCP connection timeout trigger the device going offline in the cloud? And on the edge device itself? What are the connection timeouts used for this? Since this is all stuff Losant takes care of internally for its users, I would be interested to know how you handle it under the hood, so I can understand the consequences for the reliability of commands being sent to devices and state being sent from devices.
  3. Do you have a write-up somewhere on how to handle connection reliability in general? I know there are docs on individual nodes that help with this (like the state inactivity timer), and documentation about which parts of the MQTT protocol are supported, but an article piecing together the bigger picture and describing common design patterns would be useful. Things I would be interested in are: how do I know whether commands have been received by edge devices? How do I make sure from the cloud that any commands that weren’t received by an edge device get re-sent once it reconnects? And on the edge device: if I send state over MQTT, how do I make sure the state messages are received by the broker (i.e., how do I set up QoS 1 for publishing state messages in Losant edge workflows)? Is there a common design pattern in Losant to make sure that past state messages that weren’t received by the broker get sent as soon as the connection comes back? How quickly can I detect on the edge side that network connectivity has dropped, so I can go into a fail-safe state and snap out of it when the connection comes back?
  4. This question has been asked many times, but as far as I can see there seems to be no real activity on the product side: what is the status of supporting retained messages and QoS 1? The ability to integrate message acknowledgement on both the edge and cloud side (to detect whether devices are online, act on it if they go offline, and confirm whether state/command messages have been received) seems valuable for device control where reliability and fail-safe scenarios are important.

Hey @Dolf_Andringa,

Thanks for the good questions. I’ll answer as many as I can.

For instance, are you sending heartbeat messages to verify if the connection is still alive?

This is done through MQTT’s built-in KeepAlive functionality. If the client is not otherwise publishing messages, it will send a PINGREQ packet to keep the connection alive. The GEA’s default KeepAlive interval is 60 seconds; however, this can be changed by setting the MQTT_KEEPALIVE environment variable.
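To get a feel for what this means for LTE data consumption (question 1 above), here is a rough back-of-the-envelope estimate. The MQTT-layer figures (2-byte PINGREQ/PINGRESP) come from the MQTT spec; the per-packet IP/TCP overhead is an assumption and real usage will vary with the network and any TLS framing:

```python
# Rough estimate of MQTT keepalive overhead on a metered (LTE) link.
# PINGREQ and PINGRESP are 2 bytes each at the MQTT layer; plain IPv4 +
# TCP headers add roughly 40 bytes per packet. TLS adds more, and
# carriers count traffic differently, so treat this as an approximation.

def keepalive_bytes_per_month(keepalive_s, overhead_per_packet=40, mqtt_ping=2):
    pings_per_month = (30 * 24 * 3600) // keepalive_s
    # Each keepalive exchange is one PINGREQ out and one PINGRESP in.
    bytes_per_exchange = 2 * (mqtt_ping + overhead_per_packet)
    return pings_per_month * bytes_per_exchange

# Default GEA keepalive of 60 seconds: ~3.6 MB/month at these assumptions.
print(keepalive_bytes_per_month(60))
# A 10-minute keepalive cuts that roughly 10x.
print(keepalive_bytes_per_month(600))
```

Note this is worst-case idle overhead: per the reply above, pings are only sent when the client is not already publishing.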

How does losant detect edge devices going online/offline?

Either the underlying TCP socket disconnects or a KeepAlive timeout occurs.
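As a rough bound on detection latency: the MQTT 3.1.1 spec allows a broker to wait up to 1.5x the keepalive interval after the last control packet before closing a connection, so a silently-dropped device can appear online in the cloud for about that long (actual broker behavior may vary):

```python
# Worst-case time a silently-dropped client can appear "online", based
# on the MQTT 3.1.1 rule that the server may wait 1.5x the keepalive
# interval for the next control packet. Illustrative, not Losant-specific.

def worst_case_offline_detection_s(keepalive_s):
    return 1.5 * keepalive_s

print(worst_case_offline_detection_s(60))   # 90.0 seconds for the GEA default
```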

Would a TCP connection timeout trigger the device going offline in the cloud? And on the edge device itself? What are the connection timeouts that are being used for this?

Yes, a TCP timeout will cause a disconnect, but it’s much more likely that the MQTT KeepAlive timeout will cause the disconnect first. The TCP timeout is 20 minutes. This long timeout period was implemented specifically for cellular connections: when using cellular, you can configure a long KeepAlive interval so the device can remain idle for long periods without generating any data.

how do I make sure the state commands are being received by the broker (aka, how do I setup QoS1 for publishing state messages on losant edge workflows)?

The GEA publishes all messages with QoS 1 automatically. It will not remove a message from its offline buffer until a PUBACK is received from the broker, indicating the broker has received the message.
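The buffering behavior described above can be sketched as follows. This is an illustrative model of the pattern (hold each message until its packet id is PUBACKed, re-send the rest on reconnect), not the GEA’s actual implementation:

```python
# Minimal sketch of a QoS 1 publish buffer: a message stays buffered
# until the broker's PUBACK for its packet id arrives; anything still
# unacknowledged is re-sent after a reconnect.

class QoS1Buffer:
    def __init__(self):
        self._pending = {}   # packet_id -> message
        self._next_id = 1

    def publish(self, message):
        """Queue a message; returns the packet id awaiting a PUBACK."""
        packet_id = self._next_id
        self._next_id += 1
        self._pending[packet_id] = message
        return packet_id

    def on_puback(self, packet_id):
        """Broker acknowledged receipt; safe to drop the message."""
        self._pending.pop(packet_id, None)

    def unacked(self):
        """Messages to re-publish after a reconnect."""
        return list(self._pending.values())

buf = QoS1Buffer()
pid = buf.publish({"temp": 21.5})
print(buf.unacked())   # [{'temp': 21.5}]
buf.on_puback(pid)
print(buf.unacked())   # []
```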

How quickly can I detect on the edge side the network connectivity has dropped to go into a fail-safe state, and snap out of it when the connection comes back.

The GEA will fire the Device: Disconnect trigger as soon as the connection to the MQTT broker is lost. Every GEA trigger also adds an isConnectedToLosant property (true | false) for the current connection status.
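A fail-safe guard driven by that flag might look like the sketch below. The payload shape and actuator structure are assumptions for illustration; check the actual payload in your workflow’s debug output:

```python
# Sketch of a fail-safe guard keyed off the isConnectedToLosant property
# that GEA triggers add to the payload. The actuator dict here stands in
# for whatever local output the edge workflow controls.

def apply_failsafe(payload, actuator):
    if payload.get("isConnectedToLosant"):
        actuator["mode"] = "normal"
    else:
        # Connection lost: enter a safe state, e.g. close a valve.
        actuator["mode"] = "failsafe"
    return actuator

state = apply_failsafe({"isConnectedToLosant": False}, {"mode": "normal"})
print(state["mode"])   # failsafe
```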

Things I would be interested would be: how do I know whether commands have been received by edge devices? How do I make sure from the cloud that any commands that weren’t received by an edge device, get re-sent once they reconnect?

This is a known challenge in Losant. At the moment, our recommended best practice is for the device to respond to the command with either a state update or a custom MQTT message. Using the Workflow Output Node, you can schedule an action to be taken if no response is received. If a response is received, the scheduled workflow run can be canceled.
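The schedule-then-cancel pattern described above can be sketched as a pending-command tracker: record each sent command with a response deadline, cancel it when an ack arrives, and re-send anything past its deadline. Names and structure here are illustrative, not Losant workflow nodes:

```python
# Cloud-side sketch of the command-acknowledgement pattern: track each
# command until the device responds, and surface unacknowledged commands
# for re-sending once the deadline passes (e.g. after a reconnect).

import time

class CommandTracker:
    def __init__(self, timeout_s=30):
        self.timeout_s = timeout_s
        self._pending = {}   # command_id -> (command, deadline)

    def sent(self, command_id, command, now=None):
        now = time.time() if now is None else now
        self._pending[command_id] = (command, now + self.timeout_s)

    def acked(self, command_id):
        """Device responded (state update or custom message); cancel."""
        self._pending.pop(command_id, None)

    def due_for_resend(self, now=None):
        now = time.time() if now is None else now
        return [cmd for cmd, deadline in self._pending.values() if now >= deadline]

tracker = CommandTracker(timeout_s=30)
tracker.sent("cmd-1", {"action": "reboot"}, now=0)
tracker.sent("cmd-2", {"action": "setpoint"}, now=0)
tracker.acked("cmd-1")
print(tracker.due_for_resend(now=31))   # [{'action': 'setpoint'}]
```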

The question has been asked many times, but there seems no real activity as far as I can see on the product side: what is the status of supporting retaining messages and QoS1?

There are some initiatives we’d like to pursue that do require more MQTT features (specifically retain). I can’t guarantee any timelines, but it is something we’d like to move forward with.

Hi Brandon,

Thanks for the many answers - that covers everything consistently. I think it would be useful for application architects to have a write-up of these questions and answers, instead of having to piece the picture together from individual bits of documentation, but my questions are all answered.

Cheers,

Dolf

Hi,
a good read, thanks @Brandon_Cannaday! I have the same architectural challenge around how best to acknowledge commands, given that some device types allow immediate feedback, while others may sleep for long periods and then send an acknowledgement. I don’t want to use custom MQTT topics, as I’m emulating a ‘Gateway’ device via an MQTT client in the Tartabit IoT Bridge SaaS platform, meaning only the default Losant state and command topics can be used. That leaves me with the option of utilising (Peripheral) device Attributes for command-acknowledgement state instead. All good and workable in a fashion, but correct me if I’m wrong: all the Attribute ‘state’ updates end up stored in your time-series DB/storage as well, correct? Wouldn’t managing command acknowledgements in this manner also consume the organisation’s allocated Losant storage faster as a consequence? Or is there an easy way to delete all but the last 2-3 Attribute values used for this command-ack purpose to manage this issue?

You are correct that if device state is used as command acknowledgements, they will be written to the time-series database. This will result in a payload being consumed (same as if a custom MQTT topic were used). As for the data storage, there’s no concern. Customers do not have storage limits other than retention (data is automatically deleted after a certain amount of time - defaults to 6 months for paying customers). Command acks are also likely a very small amount of data compared to telemetry data.


@Brandon_Cannaday Thank you for the reply - clear. Commands and command acknowledgements can also result in reasonably complex state machines, in scenarios where the device may be sleeping, protocols such as CoAP and LwM2M are in use, and a proxy like the Tartabit IoT Bridge sits in between (seen by Losant as a ‘Gateway’-type device). Here, we may want the proxy Gateway to acknowledge that the command has been received, but also report additional state such as “queued for delivery” and a command-ack timeout value, plus a unique ID (e.g., a NanoID value) associated with the command for later acknowledgement correlation in a workflow. So there may in fact be quite a number of device Attributes to report into for such states. However, as you say, it’s not resource-consumptive in the scheme of things.
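The kind of proxy-side state machine described above, keyed by a correlation id so a later ack can be matched to its command, could be sketched like this. The state names and transitions are illustrative assumptions, not anything defined by Losant or Tartabit:

```python
# Sketch of a command-acknowledgement state machine keyed by a
# correlation id (e.g. a NanoID). Illegal transitions raise, so a sleepy
# device's late or duplicate acks can't corrupt the tracked state.

VALID_TRANSITIONS = {
    "sent": {"received"},
    "received": {"queued", "timed_out"},
    "queued": {"delivered", "timed_out"},
    "delivered": set(),
    "timed_out": set(),
}

class CommandStateMachine:
    def __init__(self):
        self._state = {}   # correlation_id -> state name

    def create(self, correlation_id):
        self._state[correlation_id] = "sent"

    def advance(self, correlation_id, new_state):
        current = self._state[correlation_id]
        if new_state not in VALID_TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._state[correlation_id] = new_state

    def state(self, correlation_id):
        return self._state[correlation_id]

sm = CommandStateMachine()
sm.create("V1StGXR8_Z5jdHi6B")          # correlation id, e.g. a NanoID
sm.advance("V1StGXR8_Z5jdHi6B", "received")
sm.advance("V1StGXR8_Z5jdHi6B", "queued")
print(sm.state("V1StGXR8_Z5jdHi6B"))    # queued
```

Each state could map to one device Attribute update, which is the approach discussed above.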


I don’t know much about Losant’s internals, but from my experience, implementing MQTT with QoS 1 alongside a heartbeat feature has been a game-changer for maintaining reliable connections between edge devices and gateways. The combination ensures message delivery even during network fluctuations and helps detect device downtime quickly for efficient troubleshooting.