Troubleshooting remote site networks
By Simon Lee, Vice President of Sales - Asia Pacific, Fluke Networks
Monday, 09 June, 2014
With industry's increasing use of common IT networking technologies such as Ethernet, Wi-Fi and WANs, there is a growing need to monitor and troubleshoot these technologies effectively in industrial settings.
Management and remote site employees at industrial plants expect the same level of network service as at the headquarters site. However, when network engineers have only limited resources to support remote site networks, the applications, services and performance at those sites are often not as robust as they should be.
Although server virtualisation, consolidation and the move towards web-delivered applications have business benefits, optimal productivity can still only be achieved when the same levels of services are available across the organisation. Unfortunately, even the best-planned deployment may leave remote offices and users vulnerable to performance degradation and availability issues.
This creates additional challenges for network engineers in maintaining remote site performance, availability, security and visibility. It also creates challenges for workers in industrial plants in delivering on time and on budget while maintaining safety. Effective troubleshooting helps avoid or reduce additional hardware expenses, the purchase of excess wide area network (WAN) capacity and unnecessary investment in outsourced troubleshooting.
In the headquarters environment, when remote users complain about poor performance or VoIP quality, network engineers must be able to determine the root cause of the problem and correct it quickly. Remote office network outages and slowdowns are far more difficult to solve because of distance, travel time and the need for tools that may not be available at the remote location. Organising the necessary tools and dispatching staff to remote locations for troubleshooting is both time-consuming and expensive, and the time spent contributes to delays or stoppages in critical work.
With the right information and tools, technicians can understand and resolve issues quickly and efficiently. With the appropriate level of visibility, they can even identify remote network degradations before they become significant problems at that site. This strategy means proactive action can be taken to eliminate congestion, latency and other problems that could affect remote users and interfere with operations. Additionally, the ability to resolve problems from the headquarters site avoids the need to dispatch staff, resulting in time and travel expense savings, increased network availability, more time for mission-critical projects and fewer delays.
Best practices for remote site troubleshooting: baselining, assessment and documentation
To gain efficiency later on, a proactive first step must be taken: establish a baseline of the existing remote site network, so that network engineers know what they are dealing with. The first task is comprehensive discovery and documentation of the remote site network. This entails recording not only what kind of equipment exists but also who the users are and how and where they are connected to the network.
Discovery must include information on hardware inventory, servers, access points, switch and router configurations, and network connection paths. Updated maps are an essential element of ‘knowing’ the remote site and are needed for reference when future problems emerge.
The next necessary step is to understand what normal traffic levels are at the remote site. This provides a reference to work from when determining abnormal activity and to compare against when validating problems in the future. Technicians must evaluate current network performance, including traffic patterns with protocol and application usage, bandwidth use, internet/WAN connectivity performance and potential network vulnerabilities.
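As a minimal illustration of the baselining arithmetic, the Python sketch below derives a 'normal' level and an alert threshold from periodic utilisation samples. The readings are invented; a real baseline would draw on days or weeks of data.

```python
# Minimal sketch: derive a simple traffic baseline from periodic
# utilisation samples (percent of link capacity) at a remote site.
from statistics import mean, stdev

def build_baseline(samples):
    """Return (baseline, threshold), with threshold = mean + 3 std devs."""
    mu = mean(samples)
    return mu, mu + 3 * stdev(samples)

# Hypothetical hourly WAN utilisation readings (percent)
readings = [12.0, 15.5, 11.2, 14.1, 13.8, 14.9, 12.7, 16.3]
baseline, alert_level = build_baseline(readings)
print(f"baseline {baseline:.1f}%, alert above {alert_level:.1f}%")
```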
Next steps: proactive, reactive and maintenance tasks
Proactive tasks
Once up-to-date network configuration diagrams are available and traffic levels and performance have been baselined, the next step is to automatically alert headquarters staff when overall traffic levels, or traffic on individual critical switch ports, exceed what are considered normal levels.
Many management tools (network management systems or NMSs) are capable of monitoring individual switch ports and WAN interface traffic and provide a method to determine when specific traffic thresholds have been exceeded on those interfaces, either by error rates or use rates. This will alert network engineers to potential network degradations before they become significant problems at that remote site. However, due to their primary purpose of providing long-term monitoring and trending, most management systems take samples that are too coarse for effective troubleshooting.
When trying to determine the presence of intermittent or 'spiky' bandwidth-hogging events, an analyser with granular sampling rates is essential for problem detection and isolation (the short calculation below illustrates why). Additionally, seemingly minor problems such as incorrect subnet masks, duplicate IP addresses and the like should also be reported. It is also necessary to monitor the protocols in use, which is especially important for the traffic traversing the WAN link.
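The toy calculation below shows how a 30-second burst at line rate all but disappears into a five-minute average yet is obvious at one-second granularity; the figures are illustrative only.

```python
# One utilisation sample per second over five minutes: a 30 s burst at
# 95% rides on a 5% background load.
samples = [5.0] * 270 + [95.0] * 30
five_minute_avg = sum(samples) / len(samples)
print(f"5-min average: {five_minute_avg:.1f}%  peak 1 s sample: {max(samples):.1f}%")
# -> 5-min average: 14.0%  peak 1 s sample: 95.0%
```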
Reactive tasks
When remote users complain of a slow network, network engineers must follow a consistent process and have access to the data needed to isolate the problem domain and prove who or what is at fault.
First step: Testing connectivity and response times
For most network engineers, the first step in troubleshooting a problem is to ping the remote site: either the machine of the complaining user, a local server or another reliably 'on' device, provided that ICMP (the layer 3 protocol used by ping) is not blocked. If ping has worked in the past but is not working now, then an examination of port status along the path is required. In the absence of 'down' ports or links, an unsuccessful ping means troubleshooting from the bottom of the stack and moving up. Unfortunately, physical connectivity issues may require staff to travel to the site for troubleshooting, but do not rely on ping alone to make that determination; ICMP may simply be blocked.
A successful ping at least confirms physical connectivity and gives an inexact estimate of network round-trip time. But ping is not a reliable way to measure packet loss and, being symmetrical in nature, provides no insight into asymmetrical link problems. Also, no user application uses ICMP, so whether the protocols used by a particular application can traverse the network, and the speed at which they do so, must be measured a different way, such as by opening a port.
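Ping is also easy to script from the HQ. The sketch below is a minimal example assuming a Unix-like host (where ping takes a -c count flag) and a hypothetical remote-site router address.

```python
# Minimal ping wrapper; a failed result may mean the site is down OR
# that ICMP is simply filtered along the path.
import subprocess

def ping(host, count=4):
    result = subprocess.run(["ping", "-c", str(count), host],
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout

reachable, output = ping("10.20.30.1")   # hypothetical remote-site router
print("reachable" if reachable else "no ICMP reply (down, or filtered?)")
```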
Initiating the SYN/SYN-ACK/ACK 'three-way handshake' on a TCP port provides a far more reliable test of connectivity up to layer 4. Better still, conducting and measuring an actual application transaction, rather than just a port connect (which validates network connectivity and network response times), provides a more reliable measure of application connectivity and response times. Certain tools can target a local or remote web server and execute and measure an HTTP GET command as a way of measuring the performance of a web-based application, for example.
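Both measurements can be sketched with Python's standard library alone; the web server address below is hypothetical.

```python
# Time a TCP three-way handshake, then a full HTTP GET transaction.
import socket
import time
import urllib.request

HOST, PORT = "10.20.30.15", 80            # hypothetical remote web server

start = time.perf_counter()
with socket.create_connection((HOST, PORT), timeout=5):
    connect_ms = (time.perf_counter() - start) * 1000   # handshake time

start = time.perf_counter()
with urllib.request.urlopen(f"http://{HOST}/", timeout=5) as resp:
    resp.read()                                          # full transaction
get_ms = (time.perf_counter() - start) * 1000

print(f"TCP connect: {connect_ms:.1f} ms, HTTP GET: {get_ms:.1f} ms")
```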
Second step: Examining network usage
It is very common for performance slowdowns to be caused by overuse of network bandwidth. While most LAN connections exceed the available WAN or internet bandwidth by some multiple, it is not impossible for a local LAN connection to become overloaded, particularly if configurations are not achieving maximum throughput. Many a network engineer has been surprised to find 10 Mbps half-duplex links in operation where 100 Mbps full duplex or gigabit was expected. SNMP or flow-based data can be examined to determine interface utilisation. Granular measurement can indicate when spikes of usage are occurring, with flow data providing evidence of who is doing what.
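The arithmetic behind counter-based utilisation is simple: two readings of an interface's octet counter, a polling interval and the link speed. The figures below are invented, and the sketch ignores counter wrap for brevity.

```python
# Percent utilisation from two SNMP ifInOctets samples taken
# interval_s seconds apart on a link of link_bps bits per second.
def utilisation_pct(octets_t0, octets_t1, interval_s, link_bps):
    bits = (octets_t1 - octets_t0) * 8     # octets -> bits transferred
    return 100.0 * bits / (interval_s * link_bps)

# Counter grew by 52,000,000 octets over 60 s on a 10 Mbps WAN link
print(f"{utilisation_pct(1_000_000, 53_000_000, 60, 10_000_000):.1f}% utilised")
# -> 69.3% utilised
```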
Third step: Testing network quality
A direct method for testing the available bandwidth is to conduct a performance test from the HQ to the remote site. Software agents are available that can be deployed on remote PCs and then targeted by an analyser at the HQ. 'Layering on' a stream of test traffic to/from the remote site provides instant insight into the quality of packet transmission, revealing issues with latency, loss and jitter that could be affecting application performance.
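The receiver side of such a stream test might look like the sketch below: sequence-numbered UDP packets arrive at an agent, which reports loss and a crude jitter figure (a real tool would compute RFC 3550-style interarrival jitter). The listening port is hypothetical.

```python
# Agent-side sketch: count sequence-numbered UDP probes and report
# loss plus a rough spread of inter-arrival gaps.
import socket
import struct
import time

def receive_stream(expected=1000, port=9000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    sock.settimeout(2.0)                    # give up 2 s after last packet
    seen, arrivals = set(), []
    try:
        while len(seen) < expected:
            data, _ = sock.recvfrom(2048)
            (seq,) = struct.unpack("!I", data[:4])   # 4-byte sequence number
            seen.add(seq)
            arrivals.append(time.perf_counter())
    except socket.timeout:
        pass
    loss_pct = 100.0 * (expected - len(seen)) / expected
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    jitter_ms = 1000 * (max(gaps) - min(gaps)) if gaps else 0.0
    return loss_pct, jitter_ms
```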
Fourth step: Packet analysis
Still operating from the HQ, the network engineer can place their analyser inline with the traffic feed from the remote site (either using an analyser capable of inline analysis, or via a SPAN port or a network tap). Keep in mind that hardware-based tools are essential for zero-packet-loss capture. The worst waste of an engineer's time is to capture only part of the traffic to/from the remote site and, at best, be left guessing or, worse, to mistakenly troubleshoot 'lost packets' when the loss was actually caused by the analyser itself.
With capture files of traffic to/from the remote site, the engineer can examine delta times between frames and distinguish between network transfer time and client response time, thereby validating whether there really is an issue with performance to the remote site or whether the issue is with the client or the HQ side.
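For example, once a capture file is in hand, inter-frame delta times can be pulled out with a few lines of Python. The sketch below assumes the scapy library and a hypothetical capture file name; large gaps immediately before a reply point to think time at one end rather than network transfer time.

```python
# Flag unusually large inter-frame gaps in a capture of remote-site
# traffic (scapy assumed: pip install scapy).
from scapy.all import rdpcap

packets = rdpcap("remote_site.pcap")          # hypothetical capture file
times = [float(p.time) for p in packets]
deltas = [b - a for a, b in zip(times, times[1:])]

for frame_no, delta in enumerate(deltas, start=2):
    if delta > 0.2:                           # gaps over 200 ms
        print(f"frame {frame_no}: {delta * 1000:.1f} ms since previous frame")
```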
Analyser at the remote site
Despite these best efforts, and as noted already, testing from the HQ can only go so far and only provides information from the point of view of the HQ site. At some point, measurements must be taken at the remote site, from the point of view of the affected users. While remote desktop (RDP) can be used to take control of a remote PC and conduct various command-line tests (such as ping or tracert), these have their limitations. The ideal scenario is to have a dedicated network analyser on site for local testing, but to control that analyser remotely from the HQ, eliminating the need to travel to the site.
Maintenance
During network maintenance times, ensure that the internet/WAN links to the remote sites are capable of supporting the allocated bandwidth and providing quality transmission of application traffic. To perform this task, a network performance test (NPT) should be run between an analyser at the remote site and a similar analyser at the headquarters site. The test needs to be performed at various traffic rates and different frame sizes to determine whether the WAN link can handle the traffic, to measure packet loss and, more importantly, to establish in which direction packets are being lost. If there are dropped packets, or the link will not support the advertised data rate, the analyser needs features to diagnose the source of the problem.
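The shape of such a test matrix is sketched below. run_stream_test is a hypothetical stand-in for whatever interface the deployed analyser/agent pair exposes; the stub simply lets the sweep run end to end.

```python
# RFC 2544-style sweep: each (frame size, rate) pair is tested in both
# directions so that one-way loss shows up.
FRAME_SIZES = [64, 512, 1518]      # bytes
RATES = [1, 2, 5, 10]              # Mbps, up to the provisioned rate

def run_stream_test(direction, frame_size, rate_mbps):
    """Hypothetical stub - replace with calls to your analyser pair."""
    return 0.0                     # percent packet loss

for size in FRAME_SIZES:
    for rate in RATES:
        for direction in ("hq->remote", "remote->hq"):
            loss = run_stream_test(direction, size, rate)
            print(f"{direction} {size} B @ {rate} Mbps: {loss:.2f}% loss")
```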
Testing for throughput and loss is only one dimension of network quality. Latency and jitter must be measured, and jitter must be measured asymmetrically if one is to understand its impact on streaming applications. Also, QoS must be tested by passing traffic at various QoS settings to ensure proper traffic prioritisation and prevent improper discarding or throttling of application traffic.
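On Unix-like systems, test traffic can be marked with a chosen DSCP value through a standard socket option, as in the minimal sketch below (the agent address is hypothetical). Comparing how marked and unmarked streams fare reveals whether prioritisation is honoured along the path.

```python
# Send a probe marked DSCP EF (46); DSCP occupies the top six bits of
# the IP TOS byte, hence the two-bit shift.
import socket

DSCP_EF = 46 << 2
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF)
sock.sendto(b"qos-probe", ("10.20.30.15", 9000))   # hypothetical agent
```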
The business benefit
Effective troubleshooting not only reduces time and travel expense but, done properly, can help avoid or reduce additional hardware expenses, the purchase of excess WAN capacity, unnecessary investment in outsourced troubleshooting and persistent problems that drain time and money from the entire organisation. Importantly, in industrial plants, effective troubleshooting eliminates delays and costly outages and improves user safety.