Large numbers of MQTT disconnects in a short period of time across multiple devices

#1

Hi

Over the last few days we have seen multiple occurrences of many gateways losing and re-establishing MQTT connections.
This is happening right now, and has been occurring since 6:22am GMT+8 through to the present, 8:58am GMT+8.

Given the equipment is literally all over the country (Australia, QLD and WA) I am inclined to think it might not be Telstra but possibly something happening at Losant. The service status page doesn’t show any issues.

Are you aware of any issues with MQTT connections?

Thanks

Tim

#2

Actually it might be Telstra. A little difficult to tell.

Will keep investigating.

T

#3

I am even having trouble connecting to Losant dashboards.

#4

Hmm, definitely not Telstra, as I am not having any trouble logging into the remote gateway devices.

At least not Telstra within Australia. It could be international links, but I am not having trouble getting to other services.

#5

Further diagnosis.

We use a mosquitto broker as a bridge on each local gateway.

It seems the connection can be initiated.

Normally we see
1488331031: Bridge 578881XXXXX sending CONNECT
1488331033: Received CONNACK on connection local.57888XXXX

However when the problem is occurring we don’t see the CONNACK from the attempted bridge connect.

So it seems to be an issue with broker.losant.com
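
For anyone wanting to check their own gateway logs for the same symptom, here is a minimal sketch that scans a mosquitto log for bridge CONNECT attempts that never get a CONNACK. It assumes the line format shown in the two log lines above and one bridge per gateway log. The function name `unanswered_connects` and the `timeout_s` value are made up for illustration, not anything from mosquitto or Losant.

```python
import re

# Patterns based on the two mosquitto log lines quoted in this post.
CONNECT_RE = re.compile(r"^(\d+): Bridge (\S+) sending CONNECT")
CONNACK_RE = re.compile(r"^(\d+): Received CONNACK on connection (\S+)")


def unanswered_connects(lines, timeout_s=30):
    """Return (timestamp, bridge) for each CONNECT with no timely CONNACK.

    Assumes a single bridge per log, so a CONNACK is matched to the most
    recent CONNECT. timeout_s is an arbitrary illustrative threshold.
    """
    pending = None  # last CONNECT still waiting for its CONNACK
    missed = []
    for line in lines:
        connect = CONNECT_RE.match(line)
        if connect:
            if pending is not None:
                # A new CONNECT arrived before the previous one was answered.
                missed.append(pending)
            pending = (int(connect.group(1)), connect.group(2))
            continue
        connack = CONNACK_RE.match(line)
        if connack and pending is not None:
            if int(connack.group(1)) - pending[0] > timeout_s:
                missed.append(pending)  # answered, but too late
            pending = None
    if pending is not None:
        # Log ended with a CONNECT still unanswered.
        missed.append(pending)
    return missed
```

Feeding it the lines of the mosquitto log file should give an empty list on a healthy day and a list of timestamps during the problem periods.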

#6

Connection failures and reconnect failures are still occurring as at 9:26am GMT+8.

#7

At the moment we are also having trouble logging into Losant.

Just got

504 Gateway Time-out

The server didn’t respond in time.

from https://accounts.losant.com/

#8

Just an update that we have received your message and are investigating.

#9

The login issue is due to a failed accounts worker that did not fail over properly. It’s unrelated to any known MQTT issue.

Our broker connection monitoring was showing an above-average rate of activity (between 8:00PM ET and 9:30PM ET, roughly 2 hours ago), but has since returned to stable levels. We are reviewing logs for the underlying cause. Are you still experiencing disconnect issues at this time?

#10

Hi Brandon

We are still seeing a higher than normal rate of disconnects but the rate has dropped off compared with 3 hours ago.

We don’t normally see a gateway disconnect/reconnect for days: typically at most one in a 24 hour period, and quite often the gap is much longer than that. Those disconnects have more to do with the mobile network, as they affect one specific gateway rather than several at once.

Most recent events GMT+8

5822736XXX 03/01/2017 11:08:50 AM
5822736XXX 03/01/2017 11:08:50 AM 0 0 Device establishing new connection, closing previous connection
5822736XXX 03/01/2017 11:08:41 AM
5822736XXX 03/01/2017 11:07:41 AM 157 0 Publish Error - Error: read ECONNRESET
5822736XXX 03/01/2017 11:00:20 AM
5822736XXX 03/01/2017 10:59:47 AM 565 0 Publish Error - Error: read ECONNRESET

#11

Thanks for the update - it’s very helpful. We’re continuing to investigate.

#12

Just had a bunch of disconnects.

And the following new message

Device DACIAN GATEWAY disconnected from Losant
Device establishing new connection, closing previous connection
Wed Mar 1 2017 11:18:16 GMT+08:00
Device BLACKHAM GATEWAY disconnected from Losant
Message throughput limit exceeded on 585764097XXX/losant/58acf7aba316XXXXXcc/state

We typically only send a packet for every machine every 10 secs. The Blackham gateway has one device.
So I imagine mosquitto was backing up messages while disconnected and then sending them in a burst on reconnect.

And from the connect log

582273678XXXXX 03/01/2017 11:20:13 AM 28 0 Publish Error - Error: read ECONNRESET
582273678XXXXX 03/01/2017 11:18:06 AM
582273678XXXXX 03/01/2017 11:17:34 AM 52 0 Publish Error - Error: read ECONNRESET
582273678XXXXX 03/01/2017 11:14:59 AM
582273678XXXXX 03/01/2017 11:14:26 AM 114 0 Publish Error - Error: read ECONNRESET

#13

Still getting disconnects.

I have workflows in place to send emails :wink:

And some customers get them too, since a loss of connection to Losant means they need to be alerted in case a pump stops.

T

#14

Hi Brandon

Still having lots of disconnects for extended periods.
Seems to be happening about every 10 mins.

82273678XXXXX 03/01/2017 11:59:30 AM
582273678XXXXX 03/01/2017 11:56:51 AM 0 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:56:50 AM
582273678XXXXX 03/01/2017 11:56:18 AM 1 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:56:13 AM
582273678XXXXX 03/01/2017 11:55:08 AM 24 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:54:07 AM
582273678XXXXX 03/01/2017 11:53:35 AM 24 0 Publish Error - Error: connect ECONNREFUSED 104.197.77.155:443
582273678XXXXX 03/01/2017 11:52:35 AM
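
To put numbers on how long each drop lasts, a rough sketch like the one below can be run over connection log rows in the format pasted above. It rests on two assumptions that the log itself does not confirm: rows carrying a message (for example "Publish Error ...") mark a disconnect, and bare rows mark the following reconnect. The date is read as month/day/year. The helper names `parse_rows` and `outage_seconds` are illustrative only.

```python
import re
from datetime import datetime

# Row layout assumed from the pasted log: device id, date, time, AM/PM,
# then optionally two numeric columns and a message.
ROW_RE = re.compile(
    r"^(?P<device>\S+)\s+"
    r"(?P<when>\d{2}/\d{2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)"
    r"(?:\s+\d+\s+\d+\s+(?P<reason>.+))?$"
)


def parse_rows(lines):
    """Return (timestamp, reason-or-None) tuples sorted oldest first."""
    events = []
    for line in lines:
        match = ROW_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log row
        when = datetime.strptime(match.group("when"), "%m/%d/%Y %I:%M:%S %p")
        events.append((when, match.group("reason")))
    events.sort(key=lambda event: event[0])
    return events


def outage_seconds(events):
    """Pair each assumed disconnect with the next assumed reconnect."""
    outages = []
    down_since = None
    for when, reason in events:
        if reason is not None:
            if down_since is None:
                down_since = when  # first disconnect of this outage
        elif down_since is not None:
            outages.append((when - down_since).total_seconds())
            down_since = None
    return outages
```

A disconnect with no later reconnect row is left out of the result, so the newest row in a paste may not be counted.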

#15

Thanks for the report. We have narrowed the issue down to the offending instance. We are redeploying workers for various services, which will cause some reconnects to occur.

#16

Hi Brandon

Looks like all gateways are back online. But will keep an eye on it.

However I am still having a lot of timeouts trying to use the Losant user interface.

504 Gateway Time-out
The server didn’t respond in time.

This happens at login, and once logged in I get constant spinners in the device and application lists.

#17

Thanks, we see that as well. The underlying instance that was having issues was running workers for various services, which has caused unexpected ripples. Everything should be back online very soon.

#18

We are still cycling services, so some disconnects should be expected over the next 30-60 minutes.

#19

Tim, are you still seeing unstable MQTT connections?

We know the root cause of the unstable MQTT connections between approximately 18:45 EST and 23:15 EST and of the partial accounts.losant.com outage (it was up or down depending on where DNS resolved to) between 20:20 EST and 23:10 EST: two of our underlying instances became unstable enough to cause issues, but not unstable enough for our health checks to kick them out of our pool altogether.

Our device canaries are not currently sensitive to MQTT connects/disconnects (we obviously expect some degree of connection cycling) as long as state and commands still end up getting through within acceptable timeframes. We will be modifying that, though, as a result of this incident, to throw alerts if connections start cycling frequently.
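
As a rough illustration of what "alert if connections start cycling frequently" could look like, here is a minimal sliding-window sketch. It is not our actual canary implementation; the class name `FlapDetector` and the thresholds (more than 3 disconnects in 10 minutes) are assumptions picked for the example.

```python
from collections import deque


class FlapDetector:
    """Flag a device whose disconnects exceed a count within a time window."""

    def __init__(self, max_disconnects=3, window_s=600):
        # Both thresholds are illustrative, not tuned values.
        self.max_disconnects = max_disconnects
        self.window_s = window_s
        self._events = {}  # device id -> deque of disconnect timestamps

    def record_disconnect(self, device_id, ts):
        """Record a disconnect at ts (seconds); return True if flapping."""
        recent = self._events.setdefault(device_id, deque())
        recent.append(ts)
        # Drop disconnects that have fallen out of the window.
        while recent and ts - recent[0] > self.window_s:
            recent.popleft()
        return len(recent) > self.max_disconnects
```

With these defaults, a single disconnect per day never alerts, while the fourth disconnect from one device inside ten minutes does.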

However, if you have been seeing unstable connection issues prior to yesterday’s incident, or are still seeing connection issues, we would like to investigate further.

#20

Hi

We did see another similar, but less extreme, event on 21 Feb between 3:00am GMT+8 and 5:00am GMT+8.

This affected all of our gateways, with about 10 disconnect/reconnects during that period.

Thanks

Tim