I am having an issue with an edge device that is disconnecting due to a “Topic inbound throughput limit exceeded on mqtt topic” error.
Now, I think I understand why this is happening. I had a workflow on the device that wrote to an MQTT topic whenever a Modbus device it was trying to read from was not connected. This was compounded when it was trying to talk to 10 disconnected devices in a row. The end result was that the Edge device was writing to this MQTT topic ~10 times in less than a second, which then caused the Edge device to be kicked from the Broker.
The problem now is that I have removed the offending workflow, but the device still disconnects with the same error every time the Broker allows it to connect. In fact, I have removed all the workflows from the device, and it still gets kicked.
This means that, as far as anything I can do from Losant goes, the device is essentially bricked.
Why is this still happening, when I have removed the workflows?
Is there a way to fix it?
Does it sound like it is working as intended?
Cheers,
(P.S. Note: I have full access to the device, and it is in a test environment at the moment)
Hi, @Cameron_Kelly. Welcome to the Losant Forums.
I’m looking into this for you right now. If you can provide me with the application ID and the ID of the workflow, that would help me get to the bottom of this. The easiest way to get those is to visit the workflow in question in your browser and copy the URL.
Resource IDs are not considered sensitive information, so you can paste them as a reply here, or if you’d rather, you can send them to me as a private message.
Thanks.
@Cameron_Kelly how soon after the device connects to the broker is it getting kicked off again? Is it instantaneous or does it take a few minutes?
I found what I think is the application and the workflow in question and I can confirm that the workflow has been removed from the device. However, removing the workflow does not remove any MQTT messages that have been queued up by that workflow. So when the device connects again, it will try and process all the pending MQTT publishes that were generated by previous workflow runs.
My best guess at the moment is that the device is still trying to work its way through that batch of publish messages. You could confirm this by making an application workflow that listens on that topic via an MQTT trigger, and connect that trigger to a debug node. Then, with that workflow open in a browser window, connect the device and see if you get a bunch of debug messages.
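If you’d rather confirm this from outside the platform, you could also subscribe to the topic with a standalone MQTT client and watch the backlog drain as the device reconnects. Here’s a quick sketch using the paho-mqtt Python client (1.x API) - just a sketch, not official Losant tooling. The topic and credentials are placeholders; our broker expects a registered device ID as the client ID, with an access key/secret as the username/password:

```python
# Minimal backlog watcher - prints each queued publish as it arrives.
# Assumes paho-mqtt 1.x; DEVICE_ID, ACCESS_KEY, ACCESS_SECRET, and
# the topic are placeholders for your own values.
import paho.mqtt.client as mqtt

TOPIC = "your/custom/topic"

def on_connect(client, userdata, flags, rc):
    print("connected, rc =", rc)
    client.subscribe(TOPIC)

def on_message(client, userdata, msg):
    # Each line here is one queued publish arriving from the edge device.
    print(msg.topic, msg.payload)

client = mqtt.Client(client_id="DEVICE_ID")
client.username_pw_set("ACCESS_KEY", "ACCESS_SECRET")
client.tls_set()  # broker.losant.com uses TLS on port 8883
client.on_connect = on_connect
client.on_message = on_message
client.connect("broker.losant.com", 8883)
client.loop_forever()
```

If the device’s backlog is the culprit, you’ll see a flood of messages here right after it reconnects.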
Now, we do allow “bursting” when a device connects, which lets the device publish more than our usual maximum number of publishes per second, for a period of time. This is to allow for this exact scenario, where the edge device has gone offline and needs to catch the cloud up with what it’s been up to in that time period (state reports and custom topic publishes.)
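If it helps to picture the mechanics, the behavior is conceptually like a token bucket: a steady refill rate plus extra headroom when the device first connects. The sketch below uses made-up numbers purely for illustration - it is not our actual enforcement logic, and these are not our real limits:

```python
import time

class TokenBucket:
    """Illustrative only: steady 'rate' publishes/sec plus a 'burst' allowance."""
    def __init__(self, rate, burst):
        self.rate = rate              # tokens refilled per second
        self.capacity = burst         # maximum tokens the bucket holds
        self.tokens = burst           # start full: a fresh connection gets its burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit; a broker might drop the message or the client

# With rate=10 and burst=300, a reconnecting device could flush ~300 queued
# publishes immediately, then ~10/sec - but a backlog of tens of thousands
# would still outrun the bucket and trip the throughput limit.
bucket = TokenBucket(rate=10, burst=300)
```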
If my theory is correct, I believe you could resolve this by blowing away the Docker container that is running the edge agent and reconnecting it - but if you are using a custom path for the agent’s storage and you reuse that same path, we may run into this problem again. So you may need to use a different file storage path, or else delete the existing data file at that path.
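For reference, that teardown can be scripted. Here’s a rough sketch using the Docker SDK for Python (`pip install docker`); the container name and storage path are placeholders for whatever you used when you originally started the agent, and I’ve omitted the config/credential mounts for brevity:

```python
import docker
import shutil

CONTAINER_NAME = "edge-agent"                 # placeholder: your container's name
DATA_DIR = "/var/lib/losant-edge-agent/data"  # placeholder: your storage path

client = docker.from_env()

# Stop and remove the running agent container.
container = client.containers.get(CONTAINER_NAME)
container.stop()
container.remove()

# Clear the persisted data so old publishes aren't replayed on reconnect.
shutil.rmtree(DATA_DIR, ignore_errors=True)

# Recreate the container from the same image. Mirror the volumes, config,
# and environment from however you originally ran the agent.
client.containers.run(
    "losant/edge-agent",
    detach=True,
    name=CONTAINER_NAME,
    volumes={DATA_DIR: {"bind": "/data", "mode": "rw"}},
)
```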
Hey Dylan,
I have DMed you the URL of the workflow in question.
To answer your other question: it seems to get kicked instantaneously. In the Edge device log I get a message saying the device has “authenticated”, but it never actually connects, and then ~30s later (on the initial run) it “authenticates” again, never reaching the “Connection succeeded” stage. The log file on the device ends up as a stream of “[info] Connecting to: mqtts://broker.losant.com …” messages.
I have deleted the local data and reloaded the Docker image to try to get the device to behave again. So far, so good. Is there a way to clear the cache remotely (through Losant), so I can apply a remote fix if this ever pops up in a production install?
Main takeaway: I’ll make sure I’m not putting MQTT writes in short loops any more.
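For anyone who lands here later: the equivalent guard in plain client code is a per-topic throttle on publishes. A rough sketch in Python (names are mine, nothing Losant-specific; `client` is any MQTT client object with a `publish` method, e.g. paho):

```python
import time

_last_publish = {}  # topic -> monotonic timestamp of the last publish

def publish_throttled(client, topic, payload, min_interval=5.0):
    """Publish at most once per min_interval seconds per topic.
    Suppressed messages are simply dropped here; you could aggregate
    them instead (e.g., one 'N devices offline' message)."""
    now = time.monotonic()
    if now - _last_publish.get(topic, 0.0) >= min_interval:
        _last_publish[topic] = now
        client.publish(topic, payload)
        return True
    return False  # ten offline Modbus units no longer mean ten rapid publishes
```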
Cheers,
Thanks, Cameron. I took a look in your account and it actually appears your device is connecting for exactly five minutes and then getting disconnected. Here’s a screenshot from your edge compute device.
I’d call this confirmation of the theory that led me to suggest blowing away the storage / container: your workflow had queued up tens of thousands of messages, and while some were getting through to the cloud during the five-minute reconnect burst period, there were always more left to process, which ultimately led to the device getting kicked off again.
As for remotely clearing the cache, I don’t have a way to do that, but you could write a workflow that lets you see the size of the file in question (or even read it) from the cloud, and proactively truncate or delete the file before hitting this point again. There is a library template that demonstrates some of these principles; you could import that and modify it to suit your use case.
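And since you mentioned having full access to the device, a belt-and-braces option outside of Losant is a small watchdog on the box itself. A sketch only - the path and threshold below are placeholders, not the agent’s actual defaults, so check where your agent really keeps its store:

```python
import os

QUEUE_FILE = "/var/lib/losant-edge-agent/data/queue"  # placeholder path
MAX_BYTES = 50 * 1024 * 1024                          # example threshold: 50 MB

def clear_if_oversized(path=QUEUE_FILE, max_bytes=MAX_BYTES):
    """Delete the agent's local store if it grows past a threshold.
    Run this (e.g., from cron) with the agent stopped, then restart
    the agent so it recreates a fresh store on reconnect."""
    try:
        if os.path.getsize(path) > max_bytes:
            os.remove(path)
            return True
    except FileNotFoundError:
        pass
    return False
```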