Large numbers of MQTT disconnects in a short period of time across multiple devices

Tim_Hoffman · March 1, 2017, 12:58am

Hi

Over the last few days we have seen multiple occurrences of many gateways losing and re-establishing MQTT connections.
This is happening right now, and has been occurring since 6:22am GMT+8 through to the present, 8:58am GMT+8.

Given the equipment is literally all over the country (Australia, QLD and WA) I am inclined to think it might not be Telstra but possibly something happening at Losant. The service status page doesn’t show any issues.

Are you aware of any issues with MQTT connections?

Thanks

Tim

Tim_Hoffman · March 1, 2017, 1:00am

Actually it might be Telstra. A little difficult to tell.

Will keep investigating.

T

Tim_Hoffman · March 1, 2017, 1:08am

I am even having trouble connecting to Losant dashboards.

Tim_Hoffman · March 1, 2017, 1:10am

Hmm, definitely not Telstra, as I am not having any trouble logging into the remote gateway devices.

At least not Telstra within Australia. It could be international links, but I am not having trouble getting to other services.

Tim_Hoffman · March 1, 2017, 1:24am

Further diagnosis.

We use a mosquitto broker as a bridge on each local gateway.

It seems the connection can be initiated.

Normally we see
1488331031: Bridge 578881XXXXX sending CONNECT
1488331033: Received CONNACK on connection local.57888XXXX

However when the problem is occurring we don’t see the CONNACK from the attempted bridge connect.

So it seems to be an issue with broker.losant.com
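A minimal sketch of how the check described above could be automated: scan the mosquitto log for bridge CONNECT lines that never get a CONNACK. The regular expressions are based only on the two log lines quoted in this post. The function name `unanswered_connects` and the 30-second timeout are assumptions for illustration, and the sketch assumes a single bridge per log, so any CONNACK is taken to answer the pending CONNECT.

```python
import re

# Patterns derived from the two log lines quoted above.
CONNECT_RE = re.compile(r"^(\d+): Bridge (\S+) sending CONNECT$")
CONNACK_RE = re.compile(r"^(\d+): Received CONNACK on connection (\S+)$")


def unanswered_connects(lines, timeout_s=30):
    """Return timestamps of bridge CONNECTs with no CONNACK within timeout_s.

    Assumes one bridge per log, so any CONNACK answers the pending CONNECT.
    A CONNECT still pending at the end of the log is counted as unanswered.
    """
    unanswered = []
    pending = None  # timestamp of the CONNECT still awaiting a CONNACK
    for line in lines:
        line = line.strip()
        m = CONNECT_RE.match(line)
        if m:
            if pending is not None:
                # A new CONNECT before any CONNACK: the earlier attempt failed.
                unanswered.append(pending)
            pending = int(m.group(1))
            continue
        m = CONNACK_RE.match(line)
        if m and pending is not None:
            if int(m.group(1)) - pending > timeout_s:
                unanswered.append(pending)  # acknowledged, but too late
            pending = None
    if pending is not None:
        unanswered.append(pending)
    return unanswered
```

Feeding it the gateway's mosquitto log (for example `unanswered_connects(open(path))`, where `path` is wherever your broker writes its log) gives a list of epoch timestamps that can be compared across gateways to see whether failures line up in time.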

Tim_Hoffman · March 1, 2017, 1:26am

Connection failures, and reconnect failures are still occurring as at 9:26am GMT+8

Tim_Hoffman · March 1, 2017, 2:12am

At the moment we are also having trouble logging into Losant.

Just got

504 Gateway Time-out

The server didn’t respond in time.

from https://accounts.losant.com/

Brandon_Cannaday · March 1, 2017, 2:51am

Just an update that we have received your message and are investigating.

Brandon_Cannaday · March 1, 2017, 3:08am

The login issue is due to a failed accounts worker that did not fail over properly. It’s unrelated to any known MQTT issue.

Our broker connection monitoring was showing an above-average rate of activity (between 8:00PM ET and 9:30PM ET, roughly 2 hours ago), but has since returned to stable levels. We are reviewing logs for the underlying cause. Are you still experiencing disconnect issues at this time?

Tim_Hoffman · March 1, 2017, 3:11am

Hi Brandon

We are still seeing a higher than normal rate of disconnects but the rate has dropped off compared with 3 hours ago.

We don’t normally see a gateway disconnect/reconnect for days: typically at most one in a 24-hour period, and quite often much less frequently than that. When it does happen, it has more to do with the mobile network, as it affects a specific gateway rather than several at once.

Most recent events GMT+8

5822736XXX 03/01/2017 11:08:50 AM
5822736XXX 03/01/2017 11:08:50 AM 0 0 Device establishing new connection, closing previous connection
5822736XXX 03/01/2017 11:08:41 AM
5822736XXX 03/01/2017 11:07:41 AM 157 0 Publish Error - Error: read ECONNRESET
5822736XXX 03/01/2017 11:00:20 AM
5822736XXX 03/01/2017 10:59:47 AM 565 0 Publish Error - Error: read ECONNRESET

Brandon_Cannaday · March 1, 2017, 3:15am

Thanks for the update - it’s very helpful. We’re continuing to investigate.

Tim_Hoffman · March 1, 2017, 3:19am

Just had a bunch of disconnects.

And the following new message

Device DACIAN GATEWAY disconnected from Losant
Device establishing new connection, closing previous connection
Wed Mar 1 2017 11:18:16 GMT+08:00
Device BLACKHAM GATEWAY disconnected from Losant
Message throughput limit exceeded on 585764097XXX/losant/58acf7aba316XXXXXcc/state

We typically only send one packet per machine every 10 seconds. The BLACKHAM gateway has one device.
So I imagine mosquitto was backing up messages while disconnected.

And from the connect log

582273678XXXXX 03/01/2017 11:20:13 AM 28 0 Publish Error - Error: read ECONNRESET
582273678XXXXX 03/01/2017 11:18:06 AM
582273678XXXXX 03/01/2017 11:17:34 AM 52 0 Publish Error - Error: read ECONNRESET
582273678XXXXX 03/01/2017 11:14:59 AM
582273678XXXXX 03/01/2017 11:14:26 AM 114 0 Publish Error - Error: read ECONNRESET
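If the guess above is right, and the "Message throughput limit exceeded" disconnect came from a backlog being flushed in a burst after reconnecting, one way to avoid it in a client you control is to pace the backlog. The sketch below is illustrative only: `drain_paced`, `publish`, and `max_per_second` are hypothetical names, the actual Losant limit is not stated in this thread, and this is not a mosquitto bridge setting.

```python
import time
from collections import deque


def drain_paced(backlog, publish, max_per_second=1.0, sleep=time.sleep):
    """Publish queued messages oldest-first, no faster than max_per_second.

    publish is any callable that sends one message (hypothetical here).
    sleep is injectable so the pacing can be tested without real delays.
    Returns the number of messages sent.
    """
    interval = 1.0 / max_per_second
    queue = deque(backlog)
    sent = 0
    while queue:
        publish(queue.popleft())
        sent += 1
        if queue:
            sleep(interval)  # wait between sends, but not after the last one
    return sent
```

Pacing trades latency for stability: a backlog of 60 state messages at one per second takes about a minute to clear, but never exceeds the chosen rate.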

Tim_Hoffman · March 1, 2017, 3:29am

Still getting disconnects.

I have workflows in place to send emails

Some customers receive them as well, because they rely on alerts if a pump stops, and losing the connection to Losant affects that alerting.

T

Tim_Hoffman · March 1, 2017, 4:00am

Hi Brandon

Still having lots of disconnects for extended periods.
Seems to be happening about every 10 mins.

582273678XXXXX 03/01/2017 11:59:30 AM
582273678XXXXX 03/01/2017 11:56:51 AM 0 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:56:50 AM
582273678XXXXX 03/01/2017 11:56:18 AM 1 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:56:13 AM
582273678XXXXX 03/01/2017 11:55:08 AM 24 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:54:07 AM
582273678XXXXX 03/01/2017 11:53:35 AM 24 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:52:35 AM

Brandon_Cannaday · March 1, 2017, 4:02am

Thanks for the report. We have narrowed the issue down to the offending instance. We are redeploying workers for various services that will cause some reconnects to occur.

Tim_Hoffman · March 1, 2017, 4:09am

Hi Brandon

Looks like all gateways are back online. But will keep an eye on it.

However I am still having a lot of timeouts trying to use the Losant user interface.

504 Gateway Time-out
The server didn’t respond in time.

This happens at login, and once logged in there are constant spinners in the device and application lists.

Brandon_Cannaday · March 1, 2017, 4:11am

Thanks, we see that as well. The underlying instance that was having issues was running workers for various services, which has caused unexpected ripples. Everything should be back online very soon.

Brandon_Cannaday · March 1, 2017, 4:16am

We are still cycling services, so some disconnects should be expected over the next 30-60 minutes.

Michael_Kuehl · March 1, 2017, 4:03pm

Tim, are you still seeing unstable MQTT connections?

We know the root cause of the unstable MQTT connections between approximately 18:45 EST and 23:15 EST and of the partial accounts.losant.com outage (it was up or down depending on where DNS resolved to) between 20:20 EST and 23:10 EST: two of our underlying instances became unstable enough to cause issues, but not unstable enough for our health checks to kick them out of our pool altogether.

Our device canaries are not currently sensitive to MQTT connects/disconnects (we obviously expect some degree of connection cycling) as long as state and commands still end up getting through within acceptable timeframes. We will be modifying that, though, as a result of this incident, to throw alerts if connections start cycling frequently.

However, if you have been seeing unstable connection issues prior to yesterday’s incident, or are still seeing connection issues, we would like to investigate further.
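The idea of alerting when connections start cycling frequently can be sketched as a sliding-window counter. This is not Losant's canary implementation. The class name `FlapDetector` and the default thresholds (more than 5 disconnects in 600 seconds) are assumptions chosen for illustration.

```python
from collections import deque


class FlapDetector:
    """Flag a device whose disconnects exceed max_events within window_s seconds."""

    def __init__(self, max_events=5, window_s=600):
        self.max_events = max_events
        self.window_s = window_s
        self._events = deque()  # disconnect timestamps inside the window

    def record_disconnect(self, ts):
        """Record a disconnect at time ts (seconds).

        Returns True when the number of disconnects in the trailing
        window exceeds max_events, i.e. when an alert should fire.
        """
        self._events.append(ts)
        # Drop disconnects that have aged out of the window.
        while self._events and self._events[0] <= ts - self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_events
```

With one detector per gateway, an occasional mobile-network drop stays below the threshold, while the repeated reconnects reported in this thread would trip it quickly.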

Tim_Hoffman · March 2, 2017, 12:04am

Hi

We did see another similar, but less extreme, incident on 21 Feb between 3:00am GMT+8 and 5:00am GMT+8.

This affected all gateways we have with about 10 disconnect/reconnects during that period.

Thanks

Tim
