Edge device/gateway connection reliability, MQTT heartbeat and QoS

Hi all,

Is there any more information on how Losant uses MQTT under the hood with respect to heartbeat messages, commands, QoS, etc.? I am looking to understand a few things:

  1. How many messages, and what types of messages, are being sent by edge devices (like LEA)? I want to better understand what the data consumption will be for a device on an LTE connection. For instance, are you sending heartbeat messages to verify if the connection is still alive? How often?
  2. How does Losant detect edge devices going online/offline? If a device doesn’t update its state for a long while, how do you know it’s still online? Especially since you only support QoS 0, could it be that devices are actually offline because their network connection dropped and they didn’t actively disconnect from the MQTT broker? I guess you are relying on the TCP connection, TCP ACK messages, and the TCP connection timeout to detect that a connection to the broker dropped when a command is sent to an edge device? Would a TCP connection timeout trigger the device going offline in the cloud? And on the edge device itself? What are the connection timeouts that are being used for this? Since Losant takes care of all of this internally for its users, I would be interested to know how you handle it under the hood, so I can understand the consequences for the reliability of commands being sent to devices and state being sent from devices.
  3. Do you have a write-up somewhere on how to handle connection reliability in general? I know there are docs on individual nodes that help with this (like the state inactivity timer), and documentation about which parts of the MQTT protocol are supported, but an article piecing together the bigger picture and describing common design patterns would be useful. Things I would be interested in: how do I know whether commands have been received by edge devices? How do I make sure from the cloud that any commands that weren’t received by an edge device get re-sent once it reconnects? And on the edge device: if I send the state over MQTT, how do I make sure the state messages are being received by the broker (i.e., how do I set up QoS 1 for publishing state messages in Losant edge workflows)? Is there a common design pattern in Losant to make sure that past state messages that haven’t been received by the broker get sent as soon as a connection comes back? How quickly can I detect on the edge side that network connectivity has dropped, so I can go into a fail-safe state and snap out of it when the connection comes back?
  4. The question has been asked many times, but there seems to be no real activity on the product side as far as I can see: what is the status of supporting retained messages and QoS 1? The ability to integrate message acknowledgement on both the edge and cloud side to detect whether devices are online, act upon it if they go offline, and confirm whether state/command messages have been received seems valuable for device control where reliability and fail-safe scenarios are important.

Hey @Dolf_Andringa,

Thanks for the good questions. I’ll answer as many as I can.

For instance, are you sending heartbeat messages to verify if the connection is still alive?

This is done through MQTT’s built-in KeepAlive functionality. If the client is not otherwise publishing messages, it will send a PINGREQ to keep the connection alive. The GEA’s default KeepAlive timeout is 60 seconds; however, this can be changed by setting the MQTT_KEEPALIVE environment variable.
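For the LTE data-consumption question, a rough lower bound on idle KeepAlive traffic can be sketched as follows. The PINGREQ and PINGRESP packets are 2 bytes each per the MQTT specification; the 40-byte TCP/IP header figure is an assumption (IPv4, no TCP options), and TLS record overhead and TCP ACK segments are ignored, so real usage will be higher. The function name is hypothetical.

```python
# Rough estimate of idle KeepAlive traffic. Overhead figures are assumptions.
PINGREQ_BYTES = 2          # MQTT PINGREQ is a 2-byte fixed header
PINGRESP_BYTES = 2         # MQTT PINGRESP is a 2-byte fixed header
TCP_IP_HEADER_BYTES = 40   # assumed: IPv4 (20) + TCP (20), no options, no TLS


def idle_keepalive_bytes_per_day(keepalive_seconds: int) -> int:
    """Bytes per day spent on PINGREQ/PINGRESP by a client that sends nothing else."""
    pings_per_day = 86_400 // keepalive_seconds
    bytes_per_exchange = (
        (PINGREQ_BYTES + TCP_IP_HEADER_BYTES)
        + (PINGRESP_BYTES + TCP_IP_HEADER_BYTES)
    )
    return pings_per_day * bytes_per_exchange


if __name__ == "__main__":
    for seconds in (60, 300, 900):
        print(f"KeepAlive {seconds}s: {idle_keepalive_bytes_per_day(seconds)} bytes/day")
```

Under these assumptions, the default 60-second KeepAlive costs roughly 121 kB per day for a fully idle device, and lengthening the interval reduces that proportionally.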

How does losant detect edge devices going online/offline?

Either the underlying TCP socket disconnects or a KeepAlive timeout occurs.

Would a TCP connection timeout trigger the device going offline in the cloud? And on the edge device itself? What are the connection timeouts that are being used for this?

Yes, a TCP timeout will cause a disconnect, but it’s much more likely that the MQTT KeepAlive will cause the disconnect first. The TCP timeout is 20 minutes. This long timeout period was specifically implemented for cellular connections. When using cellular, you can configure a long KeepAlive interval and the device can remain idle for long periods of time without generating any data.
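To put a number on the KeepAlive path: the MQTT 3.1.1 specification requires a broker to close the connection when it receives no control packet from the client within one and a half times the KeepAlive period. The sketch below applies that rule; whether Losant’s broker uses exactly this factor is an assumption, and the function names are hypothetical.

```python
KEEPALIVE_GRACE_FACTOR = 1.5  # from the MQTT 3.1.1 spec; assumed to apply here


def keepalive_deadline(last_packet_at: float, keepalive_seconds: float) -> float:
    """Latest time (in seconds) at which the broker still treats the client as alive."""
    return last_packet_at + KEEPALIVE_GRACE_FACTOR * keepalive_seconds


def is_considered_offline(now: float, last_packet_at: float, keepalive_seconds: float) -> bool:
    """True once the KeepAlive grace period has elapsed without any packet."""
    if keepalive_seconds == 0:
        # A KeepAlive of 0 disables the mechanism; only a socket close is detected.
        return False
    return now > keepalive_deadline(last_packet_at, keepalive_seconds)


if __name__ == "__main__":
    # With a 60 s KeepAlive, a silently dropped device is flagged after 90 s.
    print(keepalive_deadline(last_packet_at=0.0, keepalive_seconds=60.0))
```

With the default 60-second KeepAlive, this gives a worst case of about 90 seconds between the last packet and the disconnect being detected. A longer KeepAlive saves cellular data but delays detection by the same factor.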

How do I make sure the state messages are being received by the broker (i.e., how do I set up QoS 1 for publishing state messages in Losant edge workflows)?

QoS 1 is available for publishing, and the GEA publishes all messages with QoS 1 automatically. The GEA will not remove a message from its offline buffer until it receives a PUBACK from the broker confirming that the message has been received.
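The buffering behavior described above can be illustrated with a minimal sketch. This is not the GEA’s actual implementation; the class and method names are hypothetical, and packet IDs are simplified to an ever-increasing counter rather than MQTT’s 16-bit range.

```python
from collections import OrderedDict


class OfflineBuffer:
    """Holds QoS 1 messages until the broker acknowledges them (illustrative only)."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[int, tuple[str, bytes]]" = OrderedDict()
        self._next_id = 1

    def enqueue(self, topic: str, payload: bytes) -> int:
        """Store a message and return the packet ID it would be published with."""
        packet_id = self._next_id
        self._next_id += 1
        self._pending[packet_id] = (topic, payload)
        return packet_id

    def on_puback(self, packet_id: int) -> None:
        """Drop a message only after the broker has confirmed receipt."""
        self._pending.pop(packet_id, None)

    def unacknowledged(self) -> list:
        """Messages to re-send after a reconnect, oldest first."""
        return list(self._pending.items())


if __name__ == "__main__":
    buffer = OfflineBuffer()
    first = buffer.enqueue("state", b'{"temp": 21}')
    buffer.enqueue("state", b'{"temp": 22}')
    buffer.on_puback(first)         # broker confirmed the first message
    print(buffer.unacknowledged())  # only the second message remains
```

The key design point is that a message is removed on acknowledgement, not on send, so anything published while the link was down is still available for re-delivery when the connection returns.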

How quickly can I detect on the edge side that network connectivity has dropped, so I can go into a fail-safe state and snap out of it when the connection comes back?

The GEA will fire the Device: Disconnect trigger as soon as the connection to the MQTT broker is lost. Every GEA trigger also adds an isConnectedToLosant property (true | false) for the current connection status.
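As an illustration of the fail-safe logic, here is a minimal sketch of the decision an edge workflow could make on each trigger. It assumes the isConnectedToLosant property appears as a top-level key of the payload; the function name and mode labels are hypothetical.

```python
def next_mode(payload: dict) -> str:
    """Pick an operating mode from the connection flag on a trigger payload.

    Assumes `isConnectedToLosant` is a top-level boolean; a missing flag is
    treated as disconnected so the device errs toward the fail-safe state.
    """
    connected = payload.get("isConnectedToLosant", False)
    return "normal" if connected else "fail-safe"


if __name__ == "__main__":
    print(next_mode({"isConnectedToLosant": True}))
    print(next_mode({"isConnectedToLosant": False}))
```

Because the flag is present on every trigger, the same check both enters the fail-safe state and leaves it once the connection is restored.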

Things I would be interested in: how do I know whether commands have been received by edge devices? How do I make sure from the cloud that any commands that weren’t received by an edge device get re-sent once it reconnects?

This is a known challenge in Losant. At the moment, our recommended best practice is for the device to respond to the command with either a state update or a custom MQTT message. Using the Workflow Output Node, you can schedule an action to be taken if no response is received. If a response is received, the scheduled workflow run can be canceled.
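The acknowledge-or-retry pattern described above can be sketched independently of any Losant API. The class below is hypothetical bookkeeping only: it records when each command was sent, clears it when the device responds, and reports which commands have gone unanswered for longer than a chosen timeout so they can be re-sent.

```python
class PendingCommands:
    """Tracks commands awaiting a device response (illustrative, not a Losant API)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._sent_at: dict = {}

    def sent(self, command_id: str, now: float) -> None:
        """Record that a command was sent at time `now` (seconds)."""
        self._sent_at[command_id] = now

    def acknowledged(self, command_id: str) -> None:
        """Clear a command once the device responds with state or a custom message."""
        self._sent_at.pop(command_id, None)

    def expired(self, now: float) -> list:
        """Command IDs with no response within the timeout; candidates for re-send."""
        return [
            command_id
            for command_id, sent_at in self._sent_at.items()
            if now - sent_at >= self.timeout_seconds
        ]


if __name__ == "__main__":
    pending = PendingCommands(timeout_seconds=10)
    pending.sent("open-valve", now=0)
    pending.sent("close-valve", now=5)
    pending.acknowledged("open-valve")  # device replied to the first command
    print(pending.expired(now=16))      # the second command has timed out
```

Giving each command its own identifier, echoed back in the device’s response, is what lets the cloud side match responses to commands; that identifier scheme is an assumption of this sketch.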

The question has been asked many times, but there seems to be no real activity on the product side as far as I can see: what is the status of supporting retained messages and QoS 1?

There are some initiatives we’d like to pursue that do require more MQTT features (specifically retain). I can’t guarantee any timelines, but it is something we’d like to move forward.

Hi Brandon,

Thanks for the many answers. I think it would be useful for application architects to have a write-up covering these questions, instead of having to piece it together from individual bits of documentation, but my questions are all answered.

Cheers,

Dolf