June 22nd, 2020
At approximately 10:00 AM Eastern, Cloudstar engineers noticed what appeared to be an intermittent connectivity interruption from Cogent Communications.
Two network engineers logged into the routing environment, while a third contacted Cogent Communications.
Upon further investigation, we discovered routing issues within the cluster 2 firewall environment. One firewall was passing inbound traffic only, while the other was passing no Internet traffic but could still pass Citrix data between clusters.
A remote restart was initiated on both devices while an on-site engineer was en route to the datacenter.
After the restart, the problem persisted. Both firewalls responded to GUI and CLI input and reported an operational status; however, asymmetric routing was still present on one firewall, and the other firewall still could not pass any Internet traffic.
Engineers then discovered FortiOS system corruption on the first firewall. This firewall was removed from the network, and the secondary firewall was manually configured as primary. Service was restored.
ROOT CAUSE: Cloudstar runs dual firewalls for redundancy and to prevent downtime. Redundancy is configured to be 100% automatic and hands-off. In this case, a partial operating system failure in the primary firewall prevented a clean failover to the backup firewall. Once engineers identified the problem, the solution was to remove the bad firewall from the high-availability firewall cluster, thereby restoring service.
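
The failure mode described above, where a firewall answers on its management interfaces but does not pass traffic correctly, can be illustrated with a small sketch. This is not FortiOS behavior or Cloudstar's actual configuration. The names FirewallHealth and choose_active are hypothetical and exist only for illustration. The sketch assumes each firewall can be probed separately for management responsiveness and for inbound and outbound traffic, and it bases the failover decision on the traffic probes alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirewallHealth:
    """Hypothetical probe results for one firewall (illustration only)."""
    name: str
    management_ok: bool  # GUI/CLI responds
    inbound_ok: bool     # inbound Internet traffic passes
    outbound_ok: bool    # outbound Internet traffic passes

    @property
    def data_plane_ok(self) -> bool:
        # A unit is healthy only if traffic flows in both directions.
        # management_ok is deliberately ignored: a unit can respond to
        # GUI and CLI input while its traffic path is broken.
        return self.inbound_ok and self.outbound_ok


def choose_active(primary: FirewallHealth,
                  secondary: FirewallHealth) -> Optional[str]:
    """Return the name of the unit that should be active, or None."""
    if primary.data_plane_ok:
        return primary.name
    if secondary.data_plane_ok:
        return secondary.name
    return None  # neither unit passes traffic; manual intervention needed


# Assumed scenario: the primary responds to management input but
# passes inbound traffic only, so the secondary is selected.
primary = FirewallHealth("fw-primary", management_ok=True,
                         inbound_ok=True, outbound_ok=False)
secondary = FirewallHealth("fw-secondary", management_ok=True,
                           inbound_ok=True, outbound_ok=True)
print(choose_active(primary, secondary))
```

The design point is that a management-only check would have reported both units as healthy in this scenario, while a check on actual traffic flow flags the primary.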