I think I have found a bug: the Losant Agent is unable to recover or reconnect and goes into a perpetual loop, staying offline. Here’s what happens.
I run a fleet of OBD2 + GPS vehicles. If a vehicle loses cell connectivity for a while, it continues to collect data, sending MQTT messages with GPS and OBD2 metrics to the Losant Local MQTT Agent. Small OBD2/GPS updates build up in the queue. When the Losant Agent reconnects, it tries to send the queued state updates to the cloud.
The local agent hits the rate limit defined here: Resource Limits | Losant Documentation
That’s 300 messages within 15 seconds. The agent is then blocked for 30 seconds, retries, and repeats this cycle until it is blocked for 1 hour. It then tries again and fails the same way, perpetually, until we manually destroy the running container to clear the queue.
We have no way in a workflow to clear the queue, check the queue, or throttle how Losant reconnects back to the cloud, so if this happens, the Edge is bricked until restarted.
We have looked at whether we can discard the data by detecting when the Edge agent is offline. I think this is possible, but ultimately it’s GPS tracking and safety information, so we would like to find a way to keep it rather than discard it.
Is there something the Losant team or anyone else who may have hit this can suggest here?
When the GEA reconnects, buffered messages are sent to the Losant broker as fast as your device will send them, up to 20 messages per second, for 5 minutes. After that 5-minute window, the Losant broker resumes enforcing the rate limit of 2 messages per second. If there are still buffered messages after the 5-minute window, the device will continue trying to send them at the elevated rate and may be disconnected by the broker for violating the rate limit.
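To put rough numbers on that window, here is a back-of-the-envelope sketch. It uses only the figures quoted above (20 messages per second for 5 minutes). The function names are hypothetical, and it assumes the agent drains at the full elevated rate and that the vehicle keeps producing new messages at a steady rate while the backlog drains. Neither assumption is confirmed in this thread.

```python
# Figures taken from the reply above; everything else is an assumption.
BURST_RATE_PER_S = 20      # elevated rate allowed right after reconnect
BURST_WINDOW_S = 5 * 60    # length of the elevated-rate window, in seconds


def max_burst_messages():
    """Most messages that can be sent inside the elevated-rate window."""
    return BURST_RATE_PER_S * BURST_WINDOW_S


def fits_in_burst_window(buffered_messages):
    """True if a backlog of this size can drain before the window closes."""
    return buffered_messages <= max_burst_messages()


def longest_safe_outage_s(collect_rate_per_s):
    """Longest outage (seconds) whose backlog still drains inside the window.

    Assumes a steady collection rate during the outage AND during the drain,
    so only (20 - rate) messages per second are available for the backlog.
    """
    if not 0 < collect_rate_per_s < BURST_RATE_PER_S:
        raise ValueError("collection rate must be between 0 and the burst rate")
    spare_rate = BURST_RATE_PER_S - collect_rate_per_s
    return spare_rate * BURST_WINDOW_S / collect_rate_per_s


if __name__ == "__main__":
    print(max_burst_messages())
    print(longest_safe_outage_s(1.0))
```

Under these assumptions, a vehicle producing 1 message per second could be offline for about 95 minutes before its backlog no longer fits in the window. Longer outages would leave messages in the buffer when normal enforcement resumes.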
The device “ban” is progressive, as you have noticed. Here’s what it looks like:
Exceed the 2 messages per second limit: banned for 30 seconds.
Exceed the limit again within 15 minutes (after the 5-minute window of elevated rate): the ban doubles, up to a maximum of 1 hour.
The ban resets to 30 seconds whenever a device goes 15 minutes without violating the rate limit.
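As a rough illustration of how quickly that escalates, the sketch below lists the ban length for each consecutive violation. It assumes each ban is exactly double the previous one and is capped at 3600 seconds. The thread only says the ban "doubles up to 1 hour", so the exact steps are an assumption, and the function names are hypothetical.

```python
BASE_BAN_S = 30      # first ban, from the list above
MAX_BAN_S = 60 * 60  # 1 hour cap, from the list above


def ban_duration_s(violation_number):
    """Ban length for the Nth violation (1-based) in an unbroken streak.

    A streak is broken, and the count restarts at 1, once the device goes
    15 minutes without violating the rate limit.
    """
    if violation_number < 1:
        raise ValueError("violation_number starts at 1")
    return min(BASE_BAN_S * 2 ** (violation_number - 1), MAX_BAN_S)


def ban_schedule(violations):
    """Ban lengths for the first `violations` consecutive violations."""
    return [ban_duration_s(n) for n in range(1, violations + 1)]


if __name__ == "__main__":
    print(ban_schedule(9))
```

With strict doubling, the cap is reached on the eighth consecutive violation. A device that keeps retrying with a full buffer would get there after roughly an hour of accumulated ban time.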
We are updating the documentation right now to reflect this, as well as internally discussing how we can update this process.
Thanks @Heath
Looking forward to hearing an option here.
Also, is there a way to alert if the device is experiencing this problem? It shows up in the connection log, but I don’t see a way to build a workflow that alerts on it.
Building a workflow to alert for this problem is possible, but does involve a few nodes. Here’s what you can do (note: this is for an Application Workflow):
A Mutate Node that uses the #includes helper to check the disconnectReason for “throughput limit exceeded”, which places a value of true or false on the payload. Note that the Treat value as JSON option is checked.
A Conditional Node based on the value placed on the payload in the previous step.
From the Conditional node on, you can implement the alerting of your choice: Email, SMS, or even Slack.
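For anyone who wants to see the logic of those two nodes outside the workflow editor, here is a minimal Python sketch of the same check. It is not Losant workflow syntax. The payload shape (`data.disconnectReason`) and the function name are assumptions for illustration only, so check the actual payload of your disconnect trigger.

```python
THROTTLE_MARKER = "throughput limit exceeded"


def is_throttled(payload):
    """Return True if the disconnect reason mentions the throughput limit.

    Mirrors the Mutate Node (substring check) plus the Conditional Node
    (branch on the resulting boolean). The payload layout is assumed.
    """
    data = payload.get("data") or {}
    reason = data.get("disconnectReason") or ""
    return THROTTLE_MARKER in reason.lower()


if __name__ == "__main__":
    example = {"data": {"disconnectReason": "Throughput limit exceeded"}}
    if is_throttled(example):
        print("send alert")  # stand-in for the Email, SMS, or Slack node
```

The check is case-insensitive and treats a missing reason as "not throttled", so ordinary disconnects do not trigger an alert.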
@Heath - Thanks. Yes, we monitor when devices connect and disconnect. I added the disconnect reason, and it’s logging correctly. I found another one this morning stuck in a throttled state, repeatedly disconnecting.
You are still discussing ways to resolve this, right? It’s certainly not ideal for us to wait for a notification trend in chat, VPN in, and manually destroy a container whose data will be lost forever.
Yes, we are discussing, internally, ways to update this process. I will be sure to reach out to you as soon as I learn more about it from our engineering team.