Losant Edge disconnected + MQTT pending payloads prevents Edge from ever reconnecting

Jason_Barbee · February 28, 2022, 4:24pm

I have what I think is a bug. The Losant Agent is not able to recover/reconnect and goes into a perpetual loop, staying offline. Here’s what happens.

I run a fleet of OBD2 + GPS vehicles. If a vehicle loses cell connectivity for a while, it continues to collect data, sending MQTT messages with GPS and OBD2 metrics to the Losant Local MQTT Agent. Small OBD2/GPS updates build up in the queue. When the Losant Agent reconnects, it tries to send those state updates to the cloud.
The local agent then hits the rate limit defined here:
Resource Limits | Losant Documentation
That’s 300 messages within 15 seconds. The agent is then blocked for 30 seconds, and it continues this cycle until it is blocked for 1 hour. It tries again and fails the same way, perpetually, until we manually destroy the running container to clear the queue.

We have no way in a workflow to clear the queue, check the queue, or throttle how Losant reconnects back to the cloud, so if this happens, the Edge is bricked until restarted.

We have looked at the data to see if we can discard them by detecting when the Edge agent is offline. I think this is possible, but ultimately, it’s GPS tracking, and safety information. We would like to try to find a way to keep it and not discard it.

Is there something the Losant team or anyone else who may have hit this can suggest here?

Heath · February 28, 2022, 6:28pm

Hi @Jason_Barbee,

What you are seeing is intended behavior.

When the GEA reconnects, buffered messages are sent to the Losant broker as fast as your device will send them, up to 20 messages per second for 5 minutes. After that 5-minute window, the Losant broker resumes enforcing the rate limit of 2 messages per second. If there are still buffered messages at that point, the device will keep trying to send them at the elevated rate and may be disconnected by the broker for violating the rate limit.
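To put rough numbers on that window, here is a back-of-the-envelope sketch. It assumes the device really can sustain the full 20 messages per second and that it keeps producing live data while the backlog drains; the function names and the 1 message per second production rate are illustrative only, not anything from the GEA.

```python
BURST_RATE = 20           # messages/second allowed after reconnect (from the post above)
BURST_WINDOW_S = 5 * 60   # length of the elevated-rate window, in seconds


def backlog_capacity(live_rate):
    """Largest backlog (in messages) that can drain inside the burst window.

    live_rate is the hypothetical rate at which the device keeps producing
    new messages while it is draining the backlog.
    """
    return (BURST_RATE - live_rate) * BURST_WINDOW_S


def max_offline_seconds(produce_rate):
    """Longest offline period whose backlog still fits in the burst window,
    assuming the device produces produce_rate messages/second throughout."""
    return backlog_capacity(produce_rate) / produce_rate


if __name__ == "__main__":
    print(backlog_capacity(0))      # backlog that fits with no live traffic
    print(max_offline_seconds(1))   # offline budget at 1 message/second
```

Under these assumptions, a device producing 1 message per second can be offline for roughly 95 minutes before its backlog outlasts the burst window. Anything longer leaves messages in the buffer when the 2 messages per second limit comes back.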

The device “ban” is progressive, as you have noticed. Here’s what it looks like:

Exceed the limit of 2 messages per second: banned for 30 seconds.
Exceed the limit again within 15 minutes (after the 5-minute window of elevated rate): the ban doubles, up to a maximum of 1 hour.
The ban resets back to 30 seconds whenever a device does not violate the rate limit for 15 minutes.
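As a rough illustration of how the ban escalates, here is a minimal sketch of the rules above. It is a model for reasoning about the behavior, not the broker's actual implementation. In particular, it assumes the 15-minute quiet period is counted from the end of the previous ban, which the rules above do not spell out.

```python
INITIAL_BAN_S = 30        # first ban length, in seconds
MAX_BAN_S = 60 * 60       # bans are capped at 1 hour
RESET_AFTER_S = 15 * 60   # quiet period that resets the ban length


def ban_durations(violation_times):
    """Return the ban length (seconds) applied at each violation.

    violation_times: ascending timestamps, in seconds, at which the device
    exceeds the rate limit. Assumption: the quiet period is measured from
    the end of the previous ban.
    """
    bans = []
    current = INITIAL_BAN_S
    last_ban_end = None
    for t in violation_times:
        if last_ban_end is None or t - last_ban_end >= RESET_AFTER_S:
            current = INITIAL_BAN_S                  # first offence, or clean for 15 minutes
        else:
            current = min(current * 2, MAX_BAN_S)    # repeat offence: double, capped
        bans.append(current)
        last_ban_end = t + current
    return bans


if __name__ == "__main__":
    # A device that violates the limit again the moment each ban ends.
    print(ban_durations([0, 30, 90, 210, 450, 930, 1890, 3810]))
```

In this model, a device that floods the broker as soon as each ban expires reaches the 1-hour cap on its eighth consecutive violation, which matches the stuck-offline cycle described at the top of the thread.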

We are updating the documentation right now to reflect this, as well as internally discussing how we can update this process.

Thank you,
Heath

Jason_Barbee · February 28, 2022, 8:44pm

Thanks @Heath
Looking forward to hearing an option here.

Also, is there a way to alert if the device is experiencing this problem? It shows up in the connection log, but I don’t see a way to build a workflow that alerts on it.

Heath · February 28, 2022, 10:28pm

@Jason_Barbee,

Building a workflow to alert for this problem is possible, but does involve a few nodes. Here’s what you can do (note: this is for an Application Workflow):

Use a Device: Disconnect Trigger that fires for the device(s) that you want to monitor for this issue (e.g. based on a tag value).
Add a Mutate Node that uses the #includes helper to check the disconnectReason for “throughput limit exceeded” and places a value of true or false on the payload. Note that the Treat value as JSON option is checked.
Add a Conditional Node that branches on the value placed on the payload in the previous step.

From the Conditional node on, you can implement the alerting of your choice: Email, SMS, or even Slack.
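If it helps to see the logic outside the workflow editor, the check that the Mutate and Conditional nodes perform together amounts to a substring test. The sketch below is plain Python rather than workflow configuration, and the payload shape (a `data.disconnectReason` string) is an assumption for illustration, so compare it against a real Device: Disconnect payload before relying on it.

```python
THROTTLE_MARKER = "throughput limit exceeded"


def is_throttle_disconnect(payload):
    """Return True if the disconnect reason mentions the throughput limit.

    The payload layout (payload["data"]["disconnectReason"]) is assumed
    for illustration. The comparison is lowercased here for robustness,
    which the workflow helper itself may not do.
    """
    reason = (payload.get("data") or {}).get("disconnectReason") or ""
    return THROTTLE_MARKER in reason.lower()


if __name__ == "__main__":
    example = {"data": {"disconnectReason": "Topic inbound throughput limit exceeded"}}
    if is_throttle_disconnect(example):
        print("alert: device was disconnected for exceeding the rate limit")
```

The true or false result corresponds to the value the Mutate Node places on the payload, and the `if` corresponds to the Conditional Node.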

Here’s a picture of what that would look like:

The Mutate Node configuration:

Let me know if this is something that works for you.

Thank you,
Heath

Jason_Barbee · March 1, 2022, 3:36pm

@Heath - Thanks. Yes, we monitor when devices connect and disconnect. I added the disconnect reason, and it’s logging correctly. Found another one this morning stuck in a throttled state, repeatedly disconnecting.
You are still discussing ways to resolve this, right? It’s certainly not ideal for us to wait for a notification trend in chat, VPN in, and manually destroy a container with data that will be lost forever.

Heath · March 1, 2022, 4:06pm

@Jason_Barbee,

Yes, we are discussing, internally, ways to update this process. I will be sure to reach out to you as soon as I learn more about it from our engineering team.

Thank you,
Heath
