WAN Resiliency

Configuration Options

WAN resiliency is one of the most important considerations for any critical enterprise network.
There are several configuration options available to achieve the final objective, depending on specific business requirements.

1. Use kernel default route metrics for failover

We will use the below topology for our discussion here, using one HSA-500L2 with two SIMs from two different providers.

This method uses the kernel routing table to perform failover. We simply need a lower route metric for SIM1 (wwan0) and a higher one for SIM2 (wwan1), so that HSA/UA has two default routes.

Image 1: CPE with 2 SIMs from 2 different providers

Since the default route for wwan0 has a lower metric (preferred), traffic is routed out through wwan0 by default. If SIM1 fails (e.g. connection loss or SIM card failure), the SIM1 default route is withdrawn, the SIM2 default route takes over, and traffic fails over to SIM2 immediately.

Solution:

Set a lower route-metric on the preferred link.
Navigate to ‘SD-Branch>Network>Wireless-WAN’ to configure the settings below.

CLI configuration for fail-over:

!
interface wwan0
 enable
 apn isp1apn
 route-metric 20
!
interface wwan1
 enable
 apn isp2apn
 route-metric 21
!

The default routes (shown with the command ‘# show ip route’) appear as below. wwan0 has the lower route metric and is the preferred next hop, so traffic primarily routes out through wwan0 (SIM1).

K>* 0.0.0.0/0 [0/20] via 10.64.189.185, wwan0, src 10.64.189.184, 02:21:25
K>* 0.0.0.0/0 [0/21] via 10.65.10.164, wwan1, src 10.65.10.165, 02:21:25

Once the kernel detects that the link is down, it quickly withdraws all routes associated with that link/interface. The second kernel route (using wwan1 as next hop) then takes over and traffic routes out through wwan1. The advantage is that failover is very fast. However, it is purely active/standby, i.e. traffic can only use one of the links at a time.
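The selection logic described above can be sketched in a few lines of Python. This is only an illustrative model, not the HSA/UA or kernel implementation: the `DefaultRoute` class and `select_default_route` function are hypothetical names, and the metrics are taken from the example configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DefaultRoute:
    interface: str
    metric: int
    link_up: bool = True  # a route is withdrawn when its link goes down


def select_default_route(routes: List[DefaultRoute]) -> Optional[DefaultRoute]:
    """Return the default route with the lowest metric among links that are up."""
    usable = [r for r in routes if r.link_up]
    return min(usable, key=lambda r: r.metric, default=None)


# Metrics from the example configuration: wwan0 = 20, wwan1 = 21.
routes = [DefaultRoute("wwan0", 20), DefaultRoute("wwan1", 21)]
primary = select_default_route(routes)   # wwan0 is preferred (lower metric)

routes[0].link_up = False                # SIM1 link fails, its route is withdrawn
backup = select_default_route(routes)    # wwan1 takes over
```

Only one route is ever selected at a time, which is why this approach is active/standby only.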

NOTE

By default, the HSA/UA system auto-assigns a route-metric to each interface at bootup, in order of the bootup sequence. In other words, the interface loaded first in the CLI (at the top) has a lower metric, e.g. wwan0 by default has a lower metric than wwan1. We simply need to insert the primary SIM into slot M2 (wwan0) and the secondary SIM into slot M1 (wwan1).

2. Use default routes with upstream host tracking

The kernel routing method in #1 has one major disadvantage. When we combine WAN/eth0 with a SIM and there is an upstream router on the eth0 connection, HSA/UA is unable to detect an upstream failure.

For example, if the link beyond the upstream router fails but the connection between the upstream router and HSA/UA is still up, the default kernel route (which uses the upstream router IP as next hop) remains in place, so no failover occurs.

To address such failure/failover scenarios, we can do upstream tracking to determine if the end-to-end connection is indeed up. The below topology illustrates the scenario.

Image 2: Default routes with upstream host tracking

Solutions:

  • Disable default kernel route for each link
  • Configure default route using CLI
    • Set higher administrative distance for the backup link (less preferred)
    • Set route tracking for the primary default route

CLI configuration:

!
interface eth0
 enable
 ip address dhcp nodefault
!
interface wwan0
 enable
 apn isp2apn
 ip address mobile nodefault
!
ip route 0.0.0.0/0 nexthop 172.16.1.1 track-host 8.8.8.8 15
ip route 0.0.0.0/0 nexthop wwan0 distance 200

In the above configuration:

  • We used the “nodefault” option so that HSA/UA doesn’t install a kernel default route for each link and uses the CLI-configured default routes instead.
  • 172.16.1.1 is the upstream router LAN IP address.
  • “track-host 8.8.8.8 15” pings 8.8.8.8 across the upstream links at 15s intervals, so it can detect an upstream link failure.

NOTE

HSA/UA pings 8.8.8.8 over the eth0 link every 15 seconds, but it only declares ping failure and withdraws this default route at the 2nd failed attempt, so the maximum failover time is double the configured interval (30s in this case).
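The timing in this note can be illustrated with a small Python model. It is a sketch under stated assumptions, not the HSA/UA implementation: `HostTracker` is a hypothetical name, the threshold of 2 failed attempts comes from the note above, and re-installing the route on the first successful ping is an assumption of this sketch.

```python
class HostTracker:
    """Models a tracked default route that is withdrawn after
    `fail_threshold` consecutive failed pings."""

    def __init__(self, interval_s: int, fail_threshold: int = 2):
        self.interval_s = interval_s
        self.fail_threshold = fail_threshold
        self.consecutive_failures = 0
        self.route_installed = True

    def record_probe(self, success: bool) -> bool:
        """Feed one ping result and return whether the route is installed."""
        if success:
            self.consecutive_failures = 0
            # Assumption: the route returns as soon as a ping succeeds.
            self.route_installed = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.fail_threshold:
                self.route_installed = False
        return self.route_installed

    def worst_case_failover_s(self) -> int:
        """Upper bound on detection time: interval x failed attempts."""
        return self.interval_s * self.fail_threshold


tracker = HostTracker(interval_s=15)
states = [tracker.record_probe(ok) for ok in (True, False, False)]
worst_case = tracker.worst_case_failover_s()
```

Here `states` ends with the route withdrawn only after the second failed ping, and `worst_case` evaluates to 30 seconds for the 15s interval used in the example.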

3. Use Multi-WAN Link Balancing

Multi-WAN link balancing is a more advanced traffic steering approach that provides link aggregation and failover between links. Refer to the MWAN link for details.

MWAN also uses the ping approach to detect upstream end-to-end link availability. It combines both routing metric and ping tracking to make failover decisions.

CLI configuration:

!
interface eth0
 description "ISP1 connection via fixed line"
 enable
 ip address dhcp
 mwan-group 10
  track 8.8.8.8 timer 5 5
  metric 1
  weight 1
!
interface wwan0
 description "ISP2 connection via LTE"
 enable
 mwan-group 10
  track 8.8.4.4 timer 10 10
  metric 2
  weight 1
!

In the above configuration:

  • eth0 (ISP1) is the primary link (with a lower metric) and wwan0 (ISP2) is the backup link with a higher metric.

NOTE

The weight setting is for link balancing when both links have the same metric (active/active), so it’s not in use for an active/standby scenario.

  • The “timer x y” setting controls how upstream link failure or availability is determined. X (in seconds) configures the interval (default 5s) between ping tests; Y configures the number of consecutive test attempts before the link is declared UP/DOWN.


Using the above configuration example,

  • If eth0 is disconnected (or the upstream router is powered off), the eth0 link goes down and HSA/UA immediately withdraws the default kernel route for eth0 and fails over to wwan0. Failover is fast in this situation, typically 2-3 seconds.
  • If the eth0 upstream link is down, i.e. eth0 is still connected but the link beyond it has failed, tracking to 8.8.8.8 fails. After a maximum of 25s (5s x 5 attempts), tracking declares eth0 unusable and fails over to wwan0, so failover is much slower.
  • Fallback to eth0 is determined by tracking confirmation (5s x 5 attempts), so fallback is typically much slower.
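The “timer x y” behaviour can be modelled roughly as follows. This is a simplified sketch, not the MWAN implementation: `LinkTracker` and `detection_time_s` are hypothetical names, and the sketch assumes the same count of consecutive results applies to both the UP and DOWN transitions, as the description of Y suggests.

```python
def detection_time_s(interval_s: int, attempts: int) -> int:
    """Worst-case time to declare a state change: X seconds x Y attempts."""
    return interval_s * attempts


class LinkTracker:
    """Flips link state only after `attempts` consecutive ping results
    that disagree with the current state."""

    def __init__(self, interval_s: int, attempts: int):
        self.interval_s = interval_s
        self.attempts = attempts
        self.up = True
        self.streak = 0  # consecutive results contradicting the current state

    def record_probe(self, success: bool) -> bool:
        if success == self.up:
            self.streak = 0          # result agrees with state, reset the count
        else:
            self.streak += 1
            if self.streak >= self.attempts:
                self.up = success    # declare the link UP or DOWN
                self.streak = 0
        return self.up


eth0 = LinkTracker(interval_s=5, attempts=5)        # "timer 5 5"
after_failures = [eth0.record_probe(False) for _ in range(5)]
after_recovery = [eth0.record_probe(True) for _ in range(5)]

eth0_detect = detection_time_s(5, 5)      # 25 seconds
wwan0_detect = detection_time_s(10, 10)   # 100 seconds for "timer 10 10"
```

In this model the link is declared down only on the fifth consecutive failed ping and declared up again only on the fifth consecutive success, which matches the 25s failover and fallback figures above.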

NOTE

* To speed up failover/fallback, you can set shorter intervals and fewer attempts. However, for slow and unreliable links (especially mobile/LTE links), setting these values too low is not recommended, as it may cause flapping (false failover/fallback).

* The biggest benefit of MWAN is its link balancing capability (same metric for both links, optionally with weights), which lets us balance traffic across both active links and aggregate the total upstream link capacity. If you just want a simple active/standby setup, option #1 or #2 is recommended instead.
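To illustrate how metric and weight interact, the sketch below computes the nominal share of traffic each link would receive. It is a conceptual model only: the `Link` class and `traffic_shares` function are hypothetical, and how MWAN actually distributes traffic (for example per flow or per packet) is not covered here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Link:
    name: str
    metric: int
    weight: int   # assumed to be a positive integer
    up: bool = True


def traffic_shares(links: List[Link]) -> Dict[str, float]:
    """Split traffic by weight across the usable links with the lowest metric."""
    usable = [l for l in links if l.up]
    if not usable:
        return {}
    best_metric = min(l.metric for l in usable)
    active = [l for l in usable if l.metric == best_metric]
    total_weight = sum(l.weight for l in active)
    return {l.name: l.weight / total_weight for l in active}


# Active/standby, as in the example configuration (metric 1 vs metric 2).
standby = traffic_shares([Link("eth0", 1, 1), Link("wwan0", 2, 1)])

# Active/active: same metric, illustrative weights of 3 and 1.
balanced = traffic_shares([Link("eth0", 1, 3), Link("wwan0", 1, 1)])

# eth0 tracked as down: everything moves to wwan0.
failed_over = traffic_shares([Link("eth0", 1, 3, up=False), Link("wwan0", 1, 1)])
```

With different metrics the weight has no effect because only the lowest-metric link is active; with equal metrics the illustrative weights of 3 and 1 give a 75/25 split.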