Architecture for a High Availability Edge Compute Environment

Nearly every solution that utilizes Losant’s Edge Compute functionality involves at least one gateway that is reading or receiving data from one or more local peripherals. When you begin to architect your solution, there’s an important question to consider: “what happens if that gateway goes down?”

In many cases, customers will use a single gateway because the risk and potential downtime are acceptable for their use case. If the gateway goes offline, the cloud will send an alert using the Device: Inactive Trigger and then a technician can replace it. Given the average lifespan of industrial gateways, this could introduce a few hours of downtime every few years.

If your solution requires high availability, and this amount of downtime is not acceptable, you’ll require a different architecture. This post will outline a primary/secondary failover model for gateways and peripherals.

At a minimum, you’ll require two gateways: a primary and a secondary. Each gateway runs the Losant Edge Agent and identical Edge Workflows.

For this example, I’m assuming peripherals are pushing data to gateways either through the Edge Agent’s MQTT Broker or through the Edge Agent’s Web Server.

Peripheral Configuration

Within Losant, peripherals can be tied to a single gateway or configured as floating, which means any gateway can report on its behalf. Since data could come from either the primary or the secondary gateway, your peripherals must be configured as floating.

In terms of the peripheral’s code or firmware, the simplest approach is for your peripherals to know the IP address or host name of both the primary and the secondary gateway. Host names provide more flexibility, but may require some additional coordination with the IT department to configure DNS appropriately. If you use IP addresses, I’d recommend obtaining static IPs for each gateway.

Using MQTT Between Peripheral and Gateway

When using MQTT, the peripheral can open and maintain connections to both gateways. When it comes time to report state, the peripheral first checks that it has an established connection to the primary. If it does not, it publishes its data to the secondary instead. As an additional layer of confidence, you can publish with QoS 1, which means the gateway’s broker replies with an acknowledgment (a PUBACK) indicating it received your message. This helps in cases where the connection is established but the gateway is not properly receiving data for unknown reasons. If the acknowledgment is not received, the peripheral can then attempt to publish the data to the secondary gateway.
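
As a rough illustration of that failover order, here is a minimal Python sketch. It is not tied to any particular MQTT library: the `Publisher` type and `publish_with_failover` function are hypothetical names, and each publisher is assumed to wrap your MQTT client's QoS 1 publish and return True only once the acknowledgment has arrived within whatever timeout you choose.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a publisher sends one message and returns True only if
# the gateway acknowledged it (for QoS 1, the PUBACK arrived before a timeout).
Publisher = Callable[[str, bytes], bool]


def publish_with_failover(
    gateways: Sequence[Tuple[str, Publisher]],
    topic: str,
    payload: bytes,
) -> Optional[str]:
    """Try each gateway in order; return the name of the first that acknowledged."""
    for name, publish in gateways:
        try:
            if publish(topic, payload):
                return name
        except OSError:
            # Connection dropped or timed out: fall through to the next gateway.
            continue
    # No gateway acknowledged the message; the caller may buffer and retry.
    return None
```

Passing the gateways as an ordered list (primary first, then secondary) keeps the preference explicit and lets the same function handle more than two gateways if you ever add one.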

Using HTTP/REST Between Peripheral and Gateway

When sending data to the Edge Agent’s web server, the edge workflow you deploy has the ability to reply with a status code back to the peripheral.

The peripheral should first attempt to POST the data to the primary’s web server. If the connection fails, times out, or your edge workflow returns a status code which indicates an error, the peripheral can then attempt to make the same POST request to the secondary.
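
The same idea for HTTP can be sketched with Python's standard library. This is only an illustration under a few assumptions: the function names are hypothetical, the payload is sent as JSON, any 2xx status is treated as success, and the URLs you pass in are whatever routes your own edge workflow exposes.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_post(url: str, body: bytes, timeout: float) -> int:
    """POST JSON bytes and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The edge workflow replied, but with an error status code.
        return error.code


def post_with_failover(
    urls: Sequence[str],
    payload: dict,
    timeout: float = 5.0,
    send: Callable[[str, bytes, float], int] = http_post,
) -> Optional[str]:
    """POST to each URL in order; return the first URL that replied with 2xx."""
    body = json.dumps(payload).encode("utf-8")
    for url in urls:
        try:
            status = send(url, body, timeout)
        except OSError:
            # Connection refused, DNS failure, or timeout: try the next gateway.
            continue
        if 200 <= status < 300:
            return url
    return None
```

A peripheral would call it with the primary's URL first and the secondary's second, for example `post_with_failover(["http://gateway-primary.example/state", "http://gateway-secondary.example/state"], {"temperature": 21.5})`, where both URLs and the payload are placeholders.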

Sending Data from a Gateway to the Losant Platform

What’s nice about this architecture is that the Losant Platform doesn’t care which gateway reports state on behalf of a peripheral. As long as your Edge Workflows are using the Device: State Node to report data, the cloud will receive, process, and record the data in the exact same way regardless of which gateway reported it.

If you have Application Workflows using the Device: State Trigger, you can use the relayId field to know the device ID of the gateway that reported this state payload. This provides an additional alerting opportunity. If you ever receive data reported by the secondary, you know something may be wrong with the primary.
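
The check itself is simple. In an Application Workflow you would normally express it with workflow nodes rather than code, but the logic can be sketched in Python as follows. The payload is assumed here to be a dictionary with a top-level relayId key, and the gateway ID is a made-up placeholder.

```python
from typing import Any, Mapping

# Hypothetical placeholder: substitute the device ID of your primary gateway.
PRIMARY_GATEWAY_ID = "primary-gateway-device-id"


def reported_by_non_primary(
    payload: Mapping[str, Any],
    primary_gateway_id: str = PRIMARY_GATEWAY_ID,
) -> bool:
    """Return True when a state payload was relayed by a gateway other than the primary."""
    relay_id = payload.get("relayId")
    return relay_id is not None and relay_id != primary_gateway_id
```

When this returns True, the workflow could raise an alert prompting someone to check on the primary gateway.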

How do you deal with read semantics? If the gateway is (also) pulling data (with Modbus: Read, for example), then when both edge gateways are online, won’t both of them pull and report the data?