Losant Edge workflow timeout when one of many MODBUS devices goes offline; edge agent needs a restart to clear


#1

Hi

I have a situation where multiple similar Modbus devices are polled through one workflow, and a second set (of a different type) is polled in another.

If one device goes offline, I see workflow timeout errors, and this affects all edge workflows until I restart the agent.

Once the agent restarts, the workflows run successfully: they fail to connect to the offline unit but don’t time out.

I thought the timeout message and failed function were limited to Modbus activity, but device commands sent to the edge also appear to be received late (by about 2 minutes), and subsequent activities such as writing to Redis don’t work, so it seems like the network stack is hanging.

Also, the Workflowtimeout errors only seem to be logged in the agent log at the verbose level; they weren’t logged at info or error.

This is a bit of a problem.

Running current edge agent.


#2

Hi Tim,

I am trying to recreate the issues you seem to be seeing. First off, what type of Modbus device are you using? Also, I want to clarify the issues you are seeing, because it seems like there are quite a few:

  1. When one workflow errors, all of the workflows on the edge fail with the same error.
  2. Device Commands that are sent to your edge device take two minutes to arrive.
  3. Writing to Redis just does not work?

The first error in the list occurs when a Modbus device is offline, which somehow causes the workflow to throw a workflow timeout error. However, if one Modbus device cannot be connected to, it should not error the workflow; instead, you should see an error object in your results. The Modbus connection timeout is 30 seconds and the workflow times out after a minute, so if you are running your Modbus reads serially, the workflow timeout error could occur before the connection timeouts have all elapsed. That’s just my theory as to what is happening for the first error; let me know if I am way off. Please also send me screenshots of your workflow so I can better try to reproduce the issue. However, this doesn’t explain why one workflow would cause another to time out. Are both workflows accessing the same set of Modbus devices, or are they both throwing a flow timeout error?
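To make the timing argument concrete, here is a minimal sketch of the arithmetic. The 30-second connection timeout and 60-second workflow timeout are the figures quoted in this post; the function names are hypothetical, and the idea that every offline device costs one full connection timeout when reads run one after another is a simplifying assumption, not a description of the agent’s internals.

```python
CONNECT_TIMEOUT_S = 30.0   # Modbus connection timeout quoted in this post
WORKFLOW_TIMEOUT_S = 60.0  # workflow timeout quoted in this post


def worst_case_wait_s(offline_devices: int,
                      connect_timeout_s: float = CONNECT_TIMEOUT_S) -> float:
    """Total wait if each offline device is tried in turn and runs to its timeout.

    Assumption: reads are serial, so the waits add up rather than overlap.
    """
    return offline_devices * connect_timeout_s


def reaches_workflow_timeout(offline_devices: int,
                             connect_timeout_s: float = CONNECT_TIMEOUT_S,
                             workflow_timeout_s: float = WORKFLOW_TIMEOUT_S) -> bool:
    """True when the serial waits alone use up the whole workflow timeout."""
    return worst_case_wait_s(offline_devices, connect_timeout_s) >= workflow_timeout_s


for n in (1, 2, 3):
    print(f"{n} offline: {worst_case_wait_s(n):.0f}s wait, "
          f"reaches workflow timeout: {reaches_workflow_timeout(n)}")
```

Under these assumptions, one offline device leaves room for its error object to come back, while two offline devices read serially already use up the full minute.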

Can you give me additional information on when the Device Command and Redis write issues occur? Is this around the time you are seeing the issues with the Modbus devices, or is it happening consistently?

Thanks,
Erin


#3

Hi Erin,

I have 2 separate workflows reading 2 families of devices serially.

If a Modbus device is not connected to the network when the workflow starts, then we get a normal error object in the workflow.

If, however, the device is connected to the network and is being read successfully via MODBUS Read, and it is then powered off or physically disconnected, this causes the workflow timeout errors. In addition, the device that is still connected is no longer read.

We then start seeing the same timeout error on the other workflow, even though its devices are connected.

I included the REDIS example because it is also affected by the timeout error. We use a device command to set a value in REDIS (storage GET/SET is not visible across workflows, hence the use of REDIS). What I am seeing once these timeout errors start occurring is that the device command is received (late), then we see a timeout error in that workflow, and nothing is written to REDIS.

Hopefully this makes things a bit clearer.

Easiest way to replicate - poll two MODBUS devices every 5 seconds. Have the workflow running then physically disconnect one.

I am trying to work out how to restart the edge agent automatically in this scenario.

I think you should have a timeout parameter for MODBUS transactions. 30 seconds is too long. We typically use TCP/serial converters, and their timeouts are set to between 1 and 3 seconds. If the device is not present, a 30-second timeout is extreme. In my situation we typically poll every 3-5 seconds, which means a massive backlog of timer-triggered runs will develop during the first timeout.
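As a rough illustration of that backlog, the sketch below counts how many timer triggers fire while a single run is blocked. It assumes each trigger is queued rather than dropped, which this thread does not confirm, and `queued_triggers` is a hypothetical helper name.

```python
import math


def queued_triggers(blocked_s: float, poll_interval_s: float) -> int:
    """Number of timer triggers that fire while one run is blocked for blocked_s.

    Assumption: every trigger that fires during the wait is queued, not dropped.
    """
    return math.floor(blocked_s / poll_interval_s)


# 30-second timeout with the 3 and 5 second polling intervals mentioned above.
for interval in (3, 5):
    print(f"poll every {interval}s, 30s timeout: "
          f"{queued_triggers(30, interval)} runs queued")

# The same polling rate with a 2-second timeout, for comparison.
print(f"poll every 5s, 2s timeout: {queued_triggers(2, 5)} runs queued")
```

With a timeout shorter than the polling interval, no extra runs pile up behind the blocked one.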

Also, it’s odd that the normal, expected error occurs if the device is down before the workflow runs, but the timeout occurs only if the device goes offline once it’s been running.

Are you holding the TCP connection open between runs? (As an aside, if you are, this can be an issue for some devices when concurrent access from multiple masters is required; a few only allow a single socket connection at a time.)

Once we go live, I was going to have at least 4 devices being polled in a loop, so if 2 go offline suddenly, I will exceed the 60-second workflow timeout, which is not good.


#4

Hi Tim,

I was finally able to reproduce the issue you were seeing. We were being overzealous in only allowing the edge to run 5 nodes concurrently. We have updated that limit to be per running workflow instead of global. The global limit was what let one workflow cause another to time out: the queue was full, and workflows had to wait on other workflows. We have already published a new version of the losant-edge-agent, 1.2.4, with this change.
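The effect of a shared versus a per-workflow limit can be sketched with a semaphore. This is not the edge agent’s code, only an illustration of the general mechanism; the limit of 5 comes from this post, while `run_node`, `healthy_node_latency`, and the sleep durations are assumptions made up for the demo.

```python
import asyncio

NODE_LIMIT = 5  # concurrency limit mentioned in this post


async def run_node(limiter: asyncio.Semaphore, duration_s: float) -> None:
    """Run one node while holding a slot in the given limiter."""
    async with limiter:
        await asyncio.sleep(duration_s)


async def healthy_node_latency(shared: bool, stuck_s: float = 0.3) -> float:
    """Time one quick node in workflow B while workflow A has NODE_LIMIT stuck nodes."""
    limiter_a = asyncio.Semaphore(NODE_LIMIT)
    # Shared: both workflows draw from one pool. Otherwise B gets its own pool.
    limiter_b = limiter_a if shared else asyncio.Semaphore(NODE_LIMIT)
    loop = asyncio.get_running_loop()

    # Workflow A: five nodes that all hang, for example on an offline device.
    stuck = [asyncio.create_task(run_node(limiter_a, stuck_s))
             for _ in range(NODE_LIMIT)]
    await asyncio.sleep(0)  # let the stuck nodes take their slots first

    start = loop.time()
    await run_node(limiter_b, 0.01)  # workflow B: one quick node
    elapsed = loop.time() - start

    await asyncio.gather(*stuck)
    return elapsed


if __name__ == "__main__":
    print(f"global limit:       {asyncio.run(healthy_node_latency(True)):.2f}s")
    print(f"per-workflow limit: {asyncio.run(healthy_node_latency(False)):.2f}s")
```

With the shared limiter, the quick node in workflow B cannot start until one of workflow A’s stuck nodes releases its slot; with separate limiters it runs right away.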

Thanks for bringing this to our attention,
Erin


#5

That’s awesome news.

Once again you guys prove how good your response and service is.

Cheers

T


#6

I will be able to run tests on Monday.

One last point, I do think you should make the MODBUS timeout period settable to something less than 30 sec.

Cheers

T


#7

Hi

I went to test; however, version 1.2.4 is not coming up as a deployable option.

Thanks

Tim


#8

Hi Tim,

The target version for workflows did not change, because this edge change did not alter the available features for a workflow or any workflow nodes. Deploying a workflow whose target version is 1.2.3 after upgrading your edge agent to 1.2.4 will not undo this change; the upgrade will still be there. Let us know if you are seeing any issues with this.

Also, I filed a ticket to add a timeout field to both the Modbus Read and Modbus Write nodes. I’ll post back when it has gone live.

Thanks,
Erin


#9

Hi

I found another scenario that causes blocking timeouts. If the MODBUS protocol adaptor is connectable but the serial slave doesn’t respond, we get the same situation: after a minute or two we see workflow timeouts, which affect other reads in the same workflow loop.

It seems that the MODBUS read isn’t handling the lack of response and is causing a complete error in the workflow.

At the moment, it seems I would have to split each device into its own workflow to work around this.


#10

Are you holding a socket open in a workflow for each Modbus device?

I am finding that when an Engine (BBA) connected via MODBUS is in standby mode (auto, waiting to start), it powers down the ECU. We seem to get some readings from the workflow for a minute or so, but then get a timeout.
I can read explicitly via Python, so MODBUS is working, but it will open the connection if the socket is closed.

Restarting the agent means we get a reading.


#11

I had split the single looping workflow into 4 separate workflows in the hope that the Workflowtimeout wouldn’t affect the other workflows, but it does appear to, so the only way to clear the situation is to restart the agent.


#12

Tim,

We do not hold a connection open between Modbus nodes. Each Modbus node (read or write) in a workflow opens its own connection and then closes it once its instructions have completed. If you are still seeing errors across workflows, it sounds like you did not docker pull to get version 1.2.4 on the agent?
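For comparison, here is a minimal sketch of that open, read, close pattern with an explicit short timeout, written with only the Python standard library. It is not Losant’s implementation. The frame layout follows the Modbus TCP specification (MBAP header plus a Read Holding Registers request, function code 0x03); the function names, the fixed transaction id, and the 2-second default timeout are assumptions chosen for illustration.

```python
import socket
import struct


def build_read_request(transaction_id: int, unit_id: int, start: int, count: int) -> bytes:
    """Build a Modbus TCP Read Holding Registers (0x03) request frame."""
    pdu = struct.pack(">BHH", 0x03, start, count)
    # MBAP header: transaction id, protocol id (always 0), byte count of
    # everything after the length field (unit id + PDU), unit id.
    return struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id) + pdu


def parse_read_response(pdu: bytes):
    """Return register values, or an error dict for a Modbus exception response."""
    if pdu[0] & 0x80:  # high bit of the function code marks an exception
        return {"error": f"modbus exception code {pdu[1]}"}
    byte_count = pdu[1]
    return list(struct.unpack(f">{byte_count // 2}H", pdu[2:2 + byte_count]))


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed by peer")
        buf += chunk
    return buf


def read_holding_registers(host, port=502, unit_id=1, start=0, count=1, timeout_s=2.0):
    """Open a connection, do one read, close. Network failures become an error dict."""
    try:
        # The timeout covers the connect and each later send/recv on the socket.
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(build_read_request(1, unit_id, start, count))
            _tid, _proto, length, _uid = struct.unpack(">HHHB", _recv_exact(sock, 7))
            pdu = _recv_exact(sock, length - 1)
    except OSError as exc:  # covers timeouts, refused connections, no route to host
        return {"error": str(exc) or type(exc).__name__}
    return parse_read_response(pdu)
```

Called against an unreachable or silent device, for example `read_holding_registers("192.0.2.10", count=2)` with a placeholder address, each socket operation gives up after `timeout_s` and the caller gets an error dict back rather than an exception.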

Thanks,
Erin


#13

Hi

I did perform docker pull latest (twice).

How can I confirm this? The agent is not reporting 1.2.4.

What I found was that splitting the loop of 4 devices into 4 workflows performed even worse when the protocol converter was online but the RS485 device was disconnected.

So I had to go back to a single workflow with a loop.

I do find it odd that, if you don’t keep a socket connection open, we got the Workflowtimeout at all when a device went offline once the workflow was running, yet the problem didn’t occur if the workflow started while the device was offline (Error No Route to Host). I would have thought the initial connection failing on startup would be no different from a later connection attempt once running, which resulted in a WorkflowTimeout.

This new scenario, however, is different: the protocol converter is online but the device on RS485 goes offline, and this particular protocol converter isn’t returning MODBUS error 11 but is just timing out in 2 seconds (what we have set it to).

Unfortunately, we are trying to deploy live equipment, and it’s on the other side of the country.

We will try to set up a test build here in Perth, and you could even log into the device running the edge agent to see directly what is going on.


#14

Hi Tim,

You can check your agent version on your edge device page, under the Edge Compute tab.

The edge agent also prints out the version it is running on startup.
I’ll try to recreate your issues here with our Modbus device.

Thanks,
Erin