Edge workflow doesn't seem to retry if it can't resolve broker.losant.com on startup

Hi

We have encountered the following a few times now.

A device running an Edge workflow and its modem are powered up at the same time.
The device starts up quicker than the 4G modem connection, so the Edge workflow can’t resolve Losant. At this point the Edge workflow doesn’t appear to retry the name resolution, and therefore never connects to Losant until the agent is restarted.

The Losant logs report being unable to resolve broker.losant.com.

To try and work around this, I have a delayed start on bootup (60 seconds).

But we have found that’s not always long enough, so a manual restart of the agent is required once the network comes up.

A different site running an Agent Workflow lost comms around midnight; it was back up this morning with the agent running but not sending data. So it appears the agent doesn’t really recover from a loss of comms.
We will have to be careful moving forward and monitor this more closely before deploying more Edge agents.

Tim,

I’m surprised at both of your scenarios - we explicitly test for both of them. The agent continually tries to reconnect to Losant, attempting a connection every 15 seconds until it succeeds. For example, here is a log from an agent where I was just double-checking this behavior:

Starting with no internet connectivity:

2018-08-10T01:16:49.586Z [info] Agent Starting...
2018-08-10T01:16:49.590Z [info] Agent Version: 1.2.1
2018-08-10T01:16:49.736Z [info] Connecting to: mqtts://broker.losant.com ...
2018-08-10T01:16:49.787Z [warn] Unable to connect to mqtts://broker.losant.com, could not resolve host.
2018-08-10T01:16:49.788Z [warn] MQTT client disconnected.
2018-08-10T01:16:49.795Z [info] Webserver started on port: 8080
2018-08-10T01:16:49.796Z [info] Workflows initialized and running...
2018-08-10T01:17:04.795Z [info] Attempting to reconnect to mqtts://broker.losant.com...
2018-08-10T01:17:04.798Z [warn] Unable to connect to mqtts://broker.losant.com, could not resolve host.
2018-08-10T01:17:19.817Z [info] Attempting to reconnect to mqtts://broker.losant.com...
2018-08-10T01:17:19.823Z [warn] Unable to connect to mqtts://broker.losant.com, could not resolve host.
2018-08-10T01:17:34.842Z [info] Attempting to reconnect to mqtts://broker.losant.com...
2018-08-10T01:17:34.847Z [warn] Unable to connect to mqtts://broker.losant.com, could not resolve host.

And then enabling internet again:

2018-08-10T01:17:49.865Z [info] Attempting to reconnect to mqtts://broker.losant.com...
2018-08-10T01:17:50.516Z [info] Connected to: mqtts://broker.losant.com
2018-08-10T01:17:50.612Z [info] Workflows received from Losant and applied.

So I’m definitely interested in more details about what you are seeing, because these are key parts of the value proposition for the edge agent. It is designed to deal with no network connectivity for long stretches, and to connect to the platform when it can to send and receive data.
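For reference, the reconnect behavior described above is conceptually just a fixed-interval retry loop. Here is an illustrative Python sketch - not the agent's actual code; the `connect` callable, its return convention, and the injectable `sleep` are assumptions made for the example:

```python
import time


def connect_with_retry(connect, retry_interval=15.0, sleep=time.sleep):
    """Keep attempting to connect until one attempt succeeds.

    `connect` is any callable that returns True on success and False on
    failure. A raised OSError (e.g. a DNS resolution failure like
    "could not resolve host") is treated like any other failed attempt.
    `sleep` is injectable so the loop can be exercised without real delays.
    Returns the number of attempts it took to connect.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            if connect():
                return attempts
        except OSError:
            pass  # DNS/network error; fall through and retry
        sleep(retry_interval)
```

The key point, matching the log above, is that a DNS failure at startup is handled the same way as any later loss of connectivity: the loop just keeps retrying on a fixed interval.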

I will go through the logs and get more info for you. Unfortunately the logs aren’t rolled by default, so I have a bit of digging to do :wink:

Oh bugger, the logs are in UTC :wink: I was looking in the wrong place.

Hmm, OK, the logs are telling me something different. After the workflow starts up, we see the following repeatedly,
until the agent was manually restarted. The agent was restarted by logging in remotely over the VPN connection from the modem, so the network was up at that point. Normally the timeouts (from Modbus) in the workflow show detailed errors on specific nodes, but this is different.

2018-08-09T21:04:57.000Z [verbose] Running workflows for trigger:
{
  "payload": {"time": "2018-08-09T21:04:57.000Z", "data": {}},
  "triggerId": "5b5e907e87f9330007ddff90-lh9fTB72Bg4AsX7tEwKNs",
  "triggerType": "timer"
}
2018-08-09T21:04:57.024Z [verbose] Completed running workflow(s):
{
  "5b5e907e87f9330007ddff90": {
    "errors": [
      {"nodeId": null, "error": {"name": "FlowTimeoutError", "message": "Timeout running workflow"}}
    ],
    "pathsCompleted": 0,
    "wallTime": 60021
  }
}
2018-08-09T21:05:00.519Z [info] Attempting to reconnect to mqtts://broker.losant.com...
2018-08-09T21:05:01.005Z [verbose] Completed running workflow(s):
{
  "5b5e907e87f9330007ddff90": {
    "errors": [
      {"nodeId": null, "error": {"name": "FlowTimeoutError", "message": "Timeout running workflow"}}
    ],
    "pathsCompleted": 0,
    "wallTime": 60002
  }
}

For instance, here is a log message with a node-level timeout from Modbus:

    2018-08-09T23:12:27.013Z [verbose] Completed running workflow(s):
    {
      "5b5e907e87f9330007ddff90": {
        "errors": [
          {"nodeId": null, "error": {"name": "FlowTimeoutError", "message": "Timeout running workflow"}}
        ],
        "pathsCompleted": 0,
        "wallTime": 60001
      }
    }
    2018-08-09T23:12:27.232Z [verbose] Sending Message
    {
      "topic": "losant/5b189f960c1c7d0008dbd0a8/state",
      "payload": "{\"data\":{\"errors\":[{\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Timed out\"},        
      {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"},
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"},            
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"},     
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}, 
     {\"type\":\"MODBUS_READ_ERROR\",\"message\":\"Port Not Open\"}],\"flow_lps\":[0],\"flow_total\": 
    [6],\"coil_stop\":[1],\"coil_cool\":[0],\"dosing_run\": 
    [0],\"calc_flow_lps\":0,\"calculated_total\":15},\"time\":1533856282000}"
    }
    2018-08-09T23:12:27.437Z [verbose] Sending Message
    {
     "topic": "losant/5b189f960c1c7d0008dbd0a8/fromAgent/debug",
      "payload": [
       {
          "time": "2018-08-09T23:12:26.436Z",
         "flowId": "5b5e907e87f9330007ddff90",
         "triggerUiIds": ["dlDR7CZXhr"],
   :

I’m having a bit of trouble following what you are saying - is the second block of logs a “good” output? With errors you are expecting? Or are these both bad?

Hi Michael.

That is correct.

The first is a log entry where the physical device is online (verified via ssh), but nothing is received on the cloud platform, and the empty FlowTimeoutError is the result.

The second log is a typical timeout due to an identifiable problem (MODBUS timeout) but data still gets through from the Agent.

Cheers

T

Hmm, the reason I was confused is that both log blocks seem to have an identical-looking FlowTimeoutError:

The first has:

2018-08-09T21:04:57.024Z [verbose] Completed running workflow(s):
{
  "5b5e907e87f9330007ddff90": {
    "errors": [
      {"nodeId": null, "error": {"name": "FlowTimeoutError", "message": "Timeout running workflow"}}
    ],
    "pathsCompleted": 0,
    "wallTime": 60021
  }
}

The second has:

2018-08-09T23:12:27.013Z [verbose] Completed running workflow(s):
{
  "5b5e907e87f9330007ddff90": {
    "errors": [
      {"nodeId": null, "error": {"name": "FlowTimeoutError", "message": "Timeout running workflow"}}
    ],
    "pathsCompleted": 0,
    "wallTime": 60001
  }
}

The second log block just has a Sending Message log entry afterward, with what looks like a Device State Node send (guessing the workflow managed to send something before it timed out). The first log block doesn’t have a Sending Message entry, probably because the agent isn’t connected to Losant at the time (as you have realized) - any messages it wants to send should be getting queued locally to send the next time it is connected.

I’m interested in what (hopefully) prints out a few lines after the
2018-08-09T21:05:00.519Z [info] Attempting to reconnect to mqtts://broker.losant.com...
message - the particular error keeping it from reconnecting and sending data. It should be a log message that contains the text
Unable to connect to mqtts://broker.losant.com. Any chance you have that?

On a totally different note: if you expect to have long-running workflows on your edge devices, there is an environment variable (that we need to document) that lets you set the amount of time a workflow is allowed to run before it times out. The variable is MAX_FLOW_RUN_TIME, and its value is the maximum number of milliseconds a flow is allowed to run.
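For example, since the agent runs in Docker, the variable would typically be passed on the container command line with `-e`. The image name and the other flags below are illustrative placeholders - adjust to match the `docker run` command you already use:

```shell
# Allow workflows up to 5 minutes (300,000 ms) before they are
# terminated with a FlowTimeoutError. Image name and flags are
# illustrative; merge -e MAX_FLOW_RUN_TIME into your existing command.
docker run -d --restart always \
  -e MAX_FLOW_RUN_TIME=300000 \
  losant/edge-agent
```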

Hi Michael, I will collect more entries for you around the first example.

I have all the logs :wink: It was doing my head in for a while. I am GMT+8, the device is GMT+10, and the time I was given was when our staff restarted the agent because they weren’t getting any data through. That was 6:40am (whilst I was driving, telling them how to restart the agent :wink:)

Then, looking through the 400MB log, I kept forgetting which timezone I was working from :wink:

Oh, some other information that might be useful - what are you running the Edge Agent on, and what versions of Docker and the Edge Agent are you using? We can then try to build a test in our office replicating your situation as closely as possible.

Hi Michael

Runtime info

Raspberry Pi 3
4.14.34-hypriotos-v7+ (HypriotOS v1.9.0, https://blog.hypriot.com/downloads/)
Losant Edge Agent 1.2.0

Cheers

T

Just had a different site go offline. Same runtime environment. However, it wasn’t actually offline (mosquitto and ssh were fine) - the link was just very slow.

On that device I am also running a Mosquitto broker, to give me an alternate view of connectivity.
It didn’t disconnect at all during that period. What I noticed was that debug messages stopped in the log for the period of the outage, while data messages were still being logged, and I was logged into the agent host the whole time.

No data was sent through for about 20 minutes, despite me being logged in via OpenVPN.

This workflow also uses Modbus, but there were no Modbus errors or workflow timeouts.

Same versions of agent, and OS.

It could be an OS issue, but everything else is running fine on these Pis.

Could you clarify what you mean by “debug messages stopped but data messages were still being logged”? What kinds of log messages stopped showing up in the log file? Did the agent log say that it had disconnected from Losant, and if so, what kind of error “unable to connect” messages printed out? Did it have log entries about “Sending Message”, but those messages never appeared in the Losant platform?

Hi Michael.

I am finding lots of different scenarios where the edge agent is either not online (disconnected), or online but messages are not getting to Losant.

For instance, this morning the Edge Agent was offline. The logs are shown below.

On this site, at the same time, I am running a Mosquitto bridge in parallel with our agent, and it was connected and sending data successfully. The link is slow and will drop out here and there for 30 seconds.

I know I am all over the place here with these statements, and not really helping you hone in on the actual problem.

But I am seeing the Losant Edge agent offline frequently on this site, and not recovering for some time.

I will send through a more detailed log extract shortly, once I find the specific offline event.

2018-08-13T00:03:02.022Z [verbose] Sending Message
{
  "topic": "losant/5b5fc0a3c20085000783f7a9/state",
  "payload": "{\"data\":{\"flow_lps\":1125.7086999022483,\"raw_lps\":464,\"total\":1431,\"online\":true},\"time\":1534118582000}"
}
2018-08-13T00:03:02.025Z [verbose] Completed running workflow(s):
{"5b600bb587f9330007ddffed": {"pathsCompleted": 1, "errors": [], "wallTime": 24}}
2018-08-13T00:03:07.002Z [verbose] Running workflows for trigger:
{
  "payload": {"time": "2018-08-13T00:03:07.000Z", "data": {}},
  "triggerId": "5b600bb587f9330007ddffed-WWEzpFAnJHumDKdYzF1MZ",
  "triggerType": "timer"
}
2018-08-13T00:03:07.028Z [verbose] Sending Message
{
  "topic": "losant/5b5fc0a3c20085000783f7a9/state",
  "payload": "{\"data\":{\"flow_lps\":1122.5806451612902,\"raw_lps\":462,\"total\":1435,\"online\":true},\"time\":1534118587000}"
}
2018-08-13T00:03:07.032Z [verbose] Completed running workflow(s):
{"5b600bb587f9330007ddffed": {"pathsCompleted": 1, "errors": [], "wallTime": 29}}
2018-08-13T00:03:12.000Z [verbose] Running workflows for trigger:
{
  "payload": {"time": "2018-08-13T00:03:12.000Z", "data": {}},
  "triggerId": "5b600bb587f9330007ddffed-WWEzpFAnJHumDKdYzF1MZ",
  "triggerType": "timer"
}
2018-08-13T00:03:12.020Z [verbose] Sending Message
{
  "topic": "losant/5b5fc0a3c20085000783f7a9/state",
  "payload": "{\"data\":{\"flow_lps\":1125.7086999022483,\"raw_lps\":464,\"total\":1439,\"online\":true},\"time\":1534118592000}"
}
2018-08-13T00:03:12.022Z [verbose] Completed running workflow(s):
{"5b600bb587f9330007ddffed": {"pathsCompleted": 1, "errors": [], "wallTime": 21}}

Hmm, for this particular device I have a cloud workflow logging disconnects. The edge agent was disconnected at 8:55 (and no data had been received a few minutes earlier - Device Inactive). However, the Edge Agent logs show no disconnection at all, just Sending Message entries as above. On restarting the agent, the agent is backfilling the missing messages.

So one clarification note about attempted message sends to Losant - the Sending Message log entry prints even if the agent is not currently connected to Losant. That log message will print out whenever a workflow hits a DeviceState or MQTT node, and the workflow creates an MQTT message to send. If the agent is connected to Losant, the message is immediately sent. If the agent is not connected to Losant, the message is queued to send when the agent does connect. I actually just updated the log messages in our code base to be a bit more informative, so it says queued when the agent is not connected, instead of sending.
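The queue-then-flush behavior described above can be sketched like this - a minimal illustration of the pattern, not the agent's actual implementation; the class and method names are made up for the example:

```python
from collections import deque


class OutboundQueue:
    """Sketch: publish immediately when connected, otherwise queue
    messages locally and flush them in order on reconnect."""

    def __init__(self):
        self.pending = deque()   # messages waiting for a connection
        self.sent = []           # messages handed to the transport
        self.connected = False

    def publish(self, topic, payload):
        msg = (topic, payload)
        if self.connected:
            self.sent.append(msg)      # would log "Sending Message"
        else:
            self.pending.append(msg)   # would log "Queued Message"

    def on_connect(self):
        self.connected = True
        while self.pending:            # backfill the backlog in order
            self.sent.append(self.pending.popleft())
```

This also matches the backfilling you observed after a restart: anything queued while disconnected goes out as soon as the connection is re-established.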

In addition, the messages are sent with a Quality of Service level of 1, so even if the agent thinks it is connected and sends the message, if the server does not acknowledge receipt, the message will be sent again.
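In MQTT terms, QoS 1 means "at least once" delivery: the sender keeps retransmitting a message until the receiver acknowledges it with a PUBACK, which is why a message can occasionally be delivered twice. A minimal sketch of that retry contract, where the hypothetical `transmit` callable stands in for "publish and wait for an acknowledgement":

```python
def send_qos1(msg, transmit, max_tries=5):
    """At-least-once delivery sketch: resend `msg` until `transmit`
    reports an acknowledgement (returns True), up to `max_tries`.
    Returns the number of attempts it took."""
    for attempt in range(1, max_tries + 1):
        if transmit(msg):
            return attempt
    raise RuntimeError("no acknowledgement after %d tries" % max_tries)
```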

So really, I think the main question is why the agent is not automatically reconnecting correctly - whether it is stuck in some sort of network state where it is unable to connect to the broker, or whether it is not actually trying at all. That’s why I’d like to see the Unable to connect to mqtts://broker.losant.com messages. They should be happening approximately every 15 seconds while the agent is not connected.

I did set up a Raspberry Pi 3 with Hypriot v1.9.0 locally in the office, and so far I have not seen any issues. I plan on messing with its network connection today, taking it on and offline a bunch and making sure it does reconnect as expected.

@Tim_Hoffman did you ever manage to find the Unable to connect messages in the logs?

Yes, I was able to find them. I am running two agents (mine and the edge agent) side by side at one site.
I find the edge agent drops out every hour (Mosquitto doesn’t seem to) and takes about 5-10 minutes before it comes back online, then backfills the retained data. (The site has a poor 4G connection - it’s on an island in the Indian Ocean, and it appears our link gets reset at the 55-minute mark nearly every hour.)

But in this situation we aren’t losing power.

I am not in a position to test startup with no internet connection at first. Too much on at the moment.

I need to set up a good simulation here in the office, and will then follow this up.

Thanks for looking into it.

We are continuing to investigate on our side as well with a test raspberry pi running hypriotos.

Glad you determined that it is in fact successfully reconnecting (even if it takes longer than it seems like it should). Did you determine which particular Unable to connect message gets printed? There are a couple of different ones (like host not found, network unreachable, etc…) - it would help narrow down the network scenarios we should try to reproduce in the office.