Outages? Learning from our mistakes to create a stronger network.

Not everything’s perfect. As much as I would like to think that I am, it has taken many months to come to the realisation that, sadly, I’m not. I’ll quietly dismiss myself now and sob gently in the corner of my office.

I do, however, like to think that I’m relatively transparent when it comes to the mistakes that I make. This is why I’m writing a short blog post about some mistakes Vale WISP (yours truly) has made and how we’re evolving to make a stronger and more reliable network. This post is on more of a technical level, so I’m sorry if it isn’t very accessible to those less savvy about this stuff, but let’s hope it makes an interesting read for those who are.

 

Over the past weekend we’ve had a variety of outages, some our fault and others not. I’ll list them in order.

Friday evening: Our upstream fibre connection went down due to a routing error on our ISP’s transit network.

Saturday morning: Water ingress due to a faulty cable on our backhaul to our main distribution mast.

Saturday afternoon: Interference from an unknown source was causing instability on the wireless backhaul from our office to our main mast.

Saturday midnight: A power outage for a couple of properties on our street caused our UPS to kick in, keeping the network afloat for approximately half an hour. As you’ll read later, a variety of issues followed.

It’s been an eventful weekend, to say the least, and we’re very fortunate to have customers who understood the situation we were in and were more than willing to be patient whilst we fixed these issues.

All of the issues we encountered could have been (relatively) easily mitigated. Here they are in more detail, along with the fixes we’re putting in place to prevent them from happening again.

Outage 1

On Friday night our upstream ISP was hit by a routing error which took our network down with it. We have plenty of monitoring software and servers inside our offices to check the health and status of our own network and notify us of any outages. What we hadn’t accounted for was our upstream provider going down: the monitoring notices, of course, but with no connection to the outside world it can’t tell me at home. It was a good customer in Ruthin who notified us of the outage, so we were able to get in touch with our suppliers very quickly and solve the issue. What are we doing to get around this?

  • We’ve installed a 4G router as a failover so our servers can send alerts even when the fibre connection isn’t working.
  • We’ve installed a server at my house (on a connection from a different supplier) which will notify me when it can’t reach the WISP network (there’s a rough sketch of that sort of external check just after this list).
  • We’re investigating a new route to bring internet into our office via an E-band wireless link. Of course it will be slightly slower than our main connection, but it will still provide decent connectivity for all our customers. It will most likely be served from a mast over in St Asaph.
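
On that second point, here’s roughly the sort of thing the box at my house will run. It’s a minimal sketch rather than our real script: the hostnames, ports, thresholds and the alert mechanism are placeholders, and in practice the notification would go out by text or email over that independent connection (or the new 4G failover).

```python
#!/usr/bin/env python3
"""Minimal out-of-band reachability check.

Meant to run on a machine *outside* the WISP network (the box at home, on a
different supplier's connection) and raise an alert when it can no longer
reach anything inside the network. Hostnames, ports, thresholds and the
alert mechanism are placeholders, not our real setup.
"""
import socket
import time

# Hypothetical management endpoints inside the WISP network.
PROBE_TARGETS = [("core-gw.example.net", 443), ("monitor.example.net", 22)]
CHECK_INTERVAL = 60          # seconds between rounds of checks
FAILURES_BEFORE_ALERT = 3    # require a few misses so one blip doesn't page me


def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def send_alert(message: str) -> None:
    """Stand-in for the real notification (text/email over the 4G failover)."""
    print(f"ALERT: {message}")


def main() -> None:
    consecutive_failures = 0
    while True:
        if any(reachable(host, port) for host, port in PROBE_TARGETS):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_ALERT:
                send_alert("WISP network unreachable from the outside")
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```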

 

Outage 2

On Saturday morning one of our main backhaul links (backhaul links carry the majority of data to a particular mast, where it is then distributed; every backhaul has its own dedicated dish or horn at each end) was experiencing packet loss and as a result was kicking some, but not all, customers offline. Whilst I initially struggled to find the source of this, I soon discovered that it was a corroded connector caused by water ingress. But how did water find its way into armoured CAT6a and down 40 metres of cabling into a switch? We swapped out the cable, verified everything was back online and went into diagnostics mode. As it happens, the faulty cable had a slight nick. Water was dripping down the outside of the cable, finding the nick and seeping into the armour jacket within the waterproof sheathing. It then trickled down the inside of the cable and into one of our main switches, where the connector corroded.

The connector; as you can see, it’s not pretty.

The majority of our larger backhauls have two ports, and we make sure to connect both of them and verify that one is set to fail over in the event that the main port fails. However, some equipment (the Mikrotik 60GHz kit in this case) only has one port, so we aren’t able to do this. This is why we always make sure to install and terminate a spare cable alongside the one in use: if a cable turns out to be faulty we can quickly and easily swap the connection over to maintain uptime whilst we remove the old cable and install a new one. To the right is a picture of the connector for your own amusement.
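
As an aside, spotting a link going the way this one did doesn’t take anything clever. Something along the lines of the sketch below, pointed at the management address of each backhaul radio, is enough to flag creeping packet loss before customers start dropping off. The address and threshold are made up, and it assumes a Linux box with the standard ping tool sitting on the network.

```python
#!/usr/bin/env python3
"""Rough packet-loss probe for a single backhaul radio.

Assumes a Linux host with the standard iputils `ping`; the target address
and loss threshold are illustrative, not our real management network.
"""
import re
import subprocess

TARGET = "10.0.10.2"     # hypothetical management IP of the far-end radio
COUNT = 50               # pings per sample
LOSS_THRESHOLD = 2.0     # percent loss worth investigating on a backhaul


def sample_loss(target: str, count: int) -> float:
    """Ping the target and return the packet-loss percentage ping reports."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", "-q", target],
        capture_output=True, text=True,
    )
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    if match is None:
        raise RuntimeError(f"could not parse ping output for {target}")
    return float(match.group(1))


if __name__ == "__main__":
    loss = sample_loss(TARGET, COUNT)
    status = "OK" if loss <= LOSS_THRESHOLD else "DEGRADED"
    print(f"{TARGET}: {loss:.1f}% loss ({status})")
```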

Whilst our network currently doesn’t allow for it (it’s too small and too widely spread), we will gradually start swapping the switches at our masts for routers and create a routed network using the industry-standard OSPF protocol. This will add redundancy to the system because we can easily triangulate the masts, meaning that if one link breaks the network will automatically find the next best route back to our offices and hence out to the internet. It will also make it far easier to install alternate-frequency links at each mast, which will help mitigate the sort of interference we experienced on Saturday afternoon.
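
For anyone curious what OSPF actually buys us: it’s essentially a shortest-path calculation over the link topology, recomputed whenever a link changes state. The toy sketch below fakes that idea using the networkx library on a made-up office-plus-three-masts topology; the mast names and link costs are purely illustrative and not our real network.

```python
#!/usr/bin/env python3
"""Toy illustration of what a link-state protocol like OSPF does:
recompute the best path when a link disappears.

Requires networkx (`pip install networkx`); mast names and costs are made up.
"""
import networkx as nx

# The office plus three masts, joined in a loop, with rough link costs.
G = nx.Graph()
G.add_edge("office", "mast-a", weight=10)
G.add_edge("mast-a", "mast-b", weight=10)
G.add_edge("mast-b", "mast-c", weight=10)
G.add_edge("mast-c", "office", weight=25)

print("normal route from mast-b to the office:",
      nx.shortest_path(G, "mast-b", "office", weight="weight"))

# Simulate the office<->mast-a backhaul failing.
G.remove_edge("office", "mast-a")
print("route after losing that link:",
      nx.shortest_path(G, "mast-b", "office", weight="weight"))
```

OSPF performs the equivalent recalculation automatically whenever a link goes up or down, which is exactly the behaviour we want at the masts.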

 

Outage 3

Saturday afternoon was exciting. I had just got back and changed after a soggy morning on my office roof swapping cables when interference hit one of our main masts. Sigh. Within a couple of minutes we realised what the issue was, and I got back into the Hilux to drive to the mast and change the frequencies. Up to the mast I go, this time on a very windy and equally soggy hill, with my Panasonic Toughbook, changing frequencies at this end and verifying the connection.

There’s not a lot I can do about these sorts of things: the interference was from a completely unknown source, and although the devices are capable of it, I don’t trust their “auto frequency” features.

As mentioned in response to outage 2, we are gradually moving to a routed network, and as a result a redundant dish will be installed to take over in the event of interference.

At present we use a lot of RF Elements kit for our access points (the devices which broadcast signal to our customers’ radios), but we don’t use them for our backhaul equipment. They have one product rightly dubbed the “Ultrahorn”. These are very impressive bits of kit which suffer negligible interference from behind or to the side, so in theory any interference would have to come from directly in front of the horn. This technology is tried and tested in ‘Merica, and when funding allows we will be upgrading our entire network to Ultrahorns. We currently have 16 backhauls, 12 of which are 5GHz. Ultrahorns come in at about £900 per link (once you include the ‘twistport’ adapters etc.), so this will be a costly upgrade, but it will definitely be worth it for peace of mind.
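
For a rough sense of scale, here are the back-of-the-envelope sums from the figures above (treating every backhaul as costing the same as a 5GHz Ultrahorn link is my own simplification, not a quote):

```python
# Back-of-the-envelope cost of the Ultrahorn upgrade, from the figures above.
COST_PER_LINK = 900   # approx. £ per link, including the twistport adapters
LINKS_5GHZ = 12       # the 5GHz backhauls that would be converted first
LINKS_TOTAL = 16      # every backhaul on the network (same per-link cost assumed)

print(f"5GHz links only: ~£{LINKS_5GHZ * COST_PER_LINK:,}")   # ~£10,800
print(f"whole network:   ~£{LINKS_TOTAL * COST_PER_LINK:,}")  # ~£14,400
```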

 

Outage 4

Having got back and finally sat down for dinner in front of my fire, I was notified that our office UPS had kicked in due to a power cut. We have 30 minutes of power before the UPS runs out, so I didn’t have long to get the generator into my Hilux and up to the office. I had just finished a wonderful glass of port, so had to enlist the help of my sister (who was home for the weekend) to drive for me. Sadly we missed the half-hour mark; mains power was off until 9AM, and our generator had run out of fuel by 6AM. Argh.

Our UPS isn’t good enough. I knew this from the start, but never thought power would be much of an issue. Oh, how wrong I was. We’re looking to invest in a much bigger system which will keep our entire office powered for approximately 24 hours before needing a generator. It will be a custom-designed system built around the new 3-phase supply we’re having installed at the offices. Essentially each phase is redundant, but there will also be both single-phase and 3-phase inputs for a generator to cover longer power cuts. There will be nearly three quarters of a tonne of batteries providing over 5,000Ah of capacity. Four connections will come out of this: two for our servers (fed from the two battery banks), one for semi-critical devices such as office lighting and computers, and one for non-critical devices such as water heaters, fridges, and other appliances.

This design allows our entire office to run separately from grid power, meaning we never stop even when faced with no mains for 16+ hours (effectively unlimited once we factor in the diesel generator). It also gives us conditioned power throughout the office (something our current UPS already does, but only for our server rack), which will further limit the damage caused by electrical surges. With this new lease of life I also intend to dish out cups of tea to those affected by outages; consider it a public service (are there grants available for this type of public service?). Joking aside, this is an important change to the system, so I’ll write more about the final design and its build.
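
For the curious, the 24-hour figure comes from some fairly simple sums. The sketch below shows the kind of estimate involved; the bank voltage, usable depth of discharge, inverter efficiency and average office load are my working assumptions for illustration, not the final design numbers.

```python
# Rough UPS runtime estimate from battery capacity and load.
# The 5,000Ah figure is from the post; everything else here is an assumption
# (nominal 12V bank, 50% usable depth of discharge, 90% inverter efficiency,
# ~1kW average office draw) rather than the final design numbers.
CAPACITY_AH = 5000
BANK_VOLTAGE = 12
DEPTH_OF_DISCHARGE = 0.5
INVERTER_EFFICIENCY = 0.9
AVERAGE_LOAD_W = 1000

usable_wh = CAPACITY_AH * BANK_VOLTAGE * DEPTH_OF_DISCHARGE * INVERTER_EFFICIENCY
runtime_hours = usable_wh / AVERAGE_LOAD_W
print(f"usable energy:     {usable_wh / 1000:.1f} kWh")   # 27.0 kWh
print(f"estimated runtime: {runtime_hours:.0f} hours")    # 27 hours
```

With those assumptions the bank works out at roughly 27kWh usable, which is where a runtime in the region of 24 hours comes from.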

 

Well, this maybe isn’t the short post I wanted it to be, but it was certainly in depth. I’m not sure whether people can comment on these? If you can, then please do so with suggestions or questions and I’m more than happy to discuss them with you. Failing this post’s commenting thingy, email me at Hamish@valewisp.com.

All the best and thank you for reading.

 

Hamish