Is the Server Down? Tracking Down the Source of Network Problems

Most servers are attached to some sort of network and generally use the network to provide some sort of service. Many different problems can creep up on a network, so network troubleshooting skills become crucial for anyone responsible for servers or services on those servers. Linux provides a large set of network troubleshooting tools, and this chapter discusses a few common network problems along with how to use some of the tools available for Linux to track down the root cause.

Network troubleshooting skills are invaluable for every member of a DevOps team. It’s almost a given that software will communicate over the network in some way, and in many applications, network connectivity is absolutely vital for the software to function. When there is a problem with the network, everyone from the sysadmin, to the QA team, to the entire development staff will probably take notice. Whether your networking department is a separate group or not, when your entire DevOps team works together on diagnosing networking problems, you will get a better overall view of the problem. Your development team will give you the deep knowledge of how your software operates on the network; your QA team will explain how the application behaves under unusual circumstances and provide you with a backlog of networking bug history; and your sysadmin will provide you with an overall perspective of how networked applications work under Linux. Together you will be able to diagnose networking problems much faster than any team can individually.
Server A Can’t Talk to Server B

Probably the most common network troubleshooting scenario involves one server being unable to communicate with another server on the network. This section will use an example in which a server named dev1 can’t access the web service (port 80) on a second server named web1. Any number of different problems could cause this, so we’ll run step by step through tests you can perform to isolate the cause of the problem.

Normally when troubleshooting a problem like this, you might skip a few of these initial steps (such as checking the link), since tests further down the line will also rule them out. For instance, if you test and confirm that DNS works, you’ve proven that your host can communicate on the local network. For this example, though, we’ll walk through each intermediary step to illustrate how you might test each level.
Client or Server Problem

One quick test you can perform to narrow down the cause of your problem is to go to another host on the same network and try to access the server. In this example, you would find another server on the same network as dev1, such as dev2, and try to access web1. If dev2 also can’t access web1, then you know the problem is more likely on web1, or on the network between dev1, dev2, and web1. If dev2 can access web1, then you know the problem is more likely on dev1. To start, let’s assume that dev2 can access web1, so we will focus our troubleshooting on dev1.
Is It Plugged In?

The first troubleshooting steps to perform are on the client. You first want to verify that your client’s connection to the network is healthy. To do this you can use the ethtool program (installed via the ethtool package) to verify that your link is up (the Ethernet device is physically connected to the network). If you aren’t sure what interface you use, run the /sbin/ifconfig command to list all the available network interfaces and their settings. So if your Ethernet device was at eth0, you would run

$ sudo ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pg
        Wake-on: d
        Current message level: 0x000000ff (255)
        Link detected: yes

Here, on the final line, you can see that Link detected is set to yes, so dev1 is physically connected to the network. If this was set to no, you would need to physically inspect dev1’s network connection and make sure it was connected. Since it is physically connected, you can move on.
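If you need to repeat this check across several hosts, it can be scripted. The following is only a minimal sketch: link_state is a hypothetical helper name (it is not part of ethtool), and it assumes the output format shown above.

```shell
# link_state: hypothetical helper, not part of ethtool.
# Reads `ethtool <interface>` output on stdin and prints the value of the
# "Link detected" field ("yes" or "no"), or "unknown" if the field is absent.
link_state() {
    awk -F': *' '
        /Link detected:/ { print $2; found = 1 }
        END { if (!found) print "unknown" }
    '
}

# Example, run on the client (eth0 is the interface used in this chapter):
#   sudo ethtool eth0 | link_state
```

Anything other than yes means it is time to inspect the cable and the switch port.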

Note

ethtool has uses beyond simply checking for a link. It can also be used to diagnose and correct duplex issues. When a Linux server connects to a network, typically it autonegotiates with the network to see what speeds it can use and whether the network supports full duplex. The Speed and Duplex lines in the example ethtool output illustrate what a 100Mb/s, full duplex network should report. If you notice slow network speeds on a host, its speed and duplex settings are a good place to look. Run ethtool as in the previous example, and if you notice Duplex set to Half, then run

$ sudo ethtool -s eth0 autoneg off duplex full

Replace eth0 with your Ethernet device.
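To spot a half-duplex link without reading the full output, you could filter it. This is a sketch under the assumption that ethtool prints Speed and Duplex lines as in the earlier example; duplex_report is a hypothetical helper name.

```shell
# duplex_report: hypothetical helper, not part of ethtool.
# Reads `ethtool <interface>` output on stdin, prints the negotiated speed
# and duplex on one line, and returns a nonzero status when the duplex is
# Half so that a calling script can flag the host.
duplex_report() {
    awk -F': *' '
        /^[ \t]*Speed:/  { speed = $2 }
        /^[ \t]*Duplex:/ { duplex = $2 }
        END {
            printf "speed=%s duplex=%s\n", speed, duplex
            exit (duplex == "Half" ? 1 : 0)
        }
    '
}

# Example:
#   sudo ethtool eth0 | duplex_report || echo "eth0 is running half duplex"
```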
Is the Interface Up?

Once you have established that you are physically connected to the network, the next step is to confirm that the network interface is configured correctly on your host. The best way to check this is to run the ifconfig command with your interface as an argument. So to test eth0’s settings, you would run

$ sudo ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:17:42:1f:18:be
          inet addr:10.1.1.7  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::217:42ff:fe1f:18be/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:229 (229.0 B)  TX bytes:2178 (2.1 KB)
          Interrupt:10

Probably the most important line in this is the second line of output, which tells us our host has an IP address (10.1.1.7) and subnet mask (255.255.255.0) configured. Now, whether these are the correct settings for this host is something you will need to confirm. If the interface is not configured, try running sudo ifup eth0 and then run ifconfig again to see if the interface comes up. If the settings are wrong or the interface won’t come up, inspect /etc/network/interfaces on Debian-based systems or /etc/sysconfig/network-scripts/ifcfg- on Red Hat-based systems. It is in these files that you can correct any errors in the network settings. Now if the host gets its IP through DHCP, you will need to move your troubleshooting to the DHCP host to find out why you aren’t getting a lease.
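The address and mask can be pulled out of that second line with a short filter. This sketch assumes the older ifconfig output format shown above (inet addr: and Mask: fields); newer versions of ifconfig print these fields differently, so treat iface_ipv4 as a hypothetical helper to adapt.

```shell
# iface_ipv4: hypothetical helper.
# Reads `ifconfig <interface>` output (older "inet addr:" format) on stdin
# and prints "<address> <netmask>". Returns nonzero if no IPv4 address is
# configured on the interface.
iface_ipv4() {
    awk '
        /inet addr:/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^addr:/) addr = substr($i, 6)   # strip "addr:"
                if ($i ~ /^Mask:/) mask = substr($i, 6)   # strip "Mask:"
            }
        }
        END {
            if (addr == "") exit 1
            print addr, mask
        }
    '
}

# Example:
#   ifconfig eth0 | iface_ipv4 || echo "eth0 has no IPv4 address"
```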
Is It on the Local Network?

Once you see that the interface is up, the next step is to see if a default gateway has been set and whether you can access it. The route command will display your current routing table, including your default gateway:

$ sudo route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.1.0        0.0.0.0         255.255.255.0   U     0      0        0 eth0
0.0.0.0         10.1.1.1        0.0.0.0         UG    100    0        0 eth0

The line you are interested in is the last line, whose destination is 0.0.0.0; this is the default route (without -n, route labels it default). Here you can see that the host has a gateway of 10.1.1.1. Note that the -n option was used with route so it wouldn’t try to resolve any of these IP addresses into hostnames. For one thing, the command runs more quickly, but more important, you don’t want to cloud your troubleshooting with any potential DNS errors. If you don’t see a default gateway configured here, and the host you want to reach is on a different subnet (say, web1, which is on 10.1.2.5), that is the likely cause of your problem. To fix this, either be sure to set the gateway in /etc/network/interfaces on Debian-based systems or /etc/sysconfig/network-scripts/ifcfg- on Red Hat-based systems, or if you get your IP via DHCP, be sure it is set correctly on the DHCP server and then reset your interface with the following on Debian-based systems:

$ sudo service networking restart

The following would be used on Red Hat-based systems:

$ sudo service network restart

On a side note, it’s amazing that these distributions have to differ even on something this fundamental.
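If you want to feed the gateway address into later tests, you can pull it out of the routing table. The following is a sketch only; default_gateway is a hypothetical helper name, and it accepts the default route whether route prints its destination as 0.0.0.0 (with -n) or as default.

```shell
# default_gateway: hypothetical helper.
# Reads `route -n` output on stdin and prints the gateway of the first
# default route (destination 0.0.0.0 or "default", with the G flag set).
# Returns nonzero if no default gateway is configured.
default_gateway() {
    awk '
        ($1 == "0.0.0.0" || $1 == "default") && $4 ~ /G/ {
            print $2
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example:
#   gw=$(route -n | default_gateway) || echo "no default gateway set"
```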

Once you have identified the gateway, use the ping command to confirm that you can communicate with the gateway:

$ ping -c 5 10.1.1.1
PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=3.13 ms
64 bytes from 10.1.1.1: icmp_seq=2 ttl=64 time=1.43 ms
64 bytes from 10.1.1.1: icmp_seq=3 ttl=64 time=1.79 ms
64 bytes from 10.1.1.1: icmp_seq=5 ttl=64 time=1.50 ms
--- 10.1.1.1 ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4020ms
rtt min/avg/max/mdev = 1.436/1.966/3.132/0.686 ms

As you can see, we were able to successfully ping the gateway, which means that we can at least communicate with the 10.1.1.0 network. If you couldn’t ping the gateway, it could mean a few things. It could mean that your gateway is blocking ICMP packets. If so, tell your network administrator that blocking ICMP is an annoying practice with negligible security benefits and then try to ping another Linux host on the same subnet. If ICMP isn’t being blocked, then it’s possible that the switch port your host is connected to is set to the wrong VLAN, so you will need to further inspect that switch.
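The summary line is worth reading too: in the sample run, one of the five packets was lost. If you want a script to watch for that, a filter along these lines could extract the loss figure. It is a sketch that assumes the summary format shown above; ping_loss is a hypothetical helper name.

```shell
# ping_loss: hypothetical helper.
# Reads ping output on stdin and prints the packet loss percentage from the
# summary line as a bare number. Returns nonzero if no summary line is found.
ping_loss() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) {
                    sub(/%/, "", $i)   # drop the percent sign
                    print $i
                    found = 1
                }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example:
#   ping -c 5 10.1.1.1 | ping_loss
```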
Is DNS Working?

Once you have confirmed that you can speak to the gateway, the next thing to test is whether DNS functions. Both the nslookup and dig tools can be used to troubleshoot DNS issues, but since you need to perform only basic testing at this point, just use nslookup to see if you can resolve web1 into an IP:

$ nslookup web1
Server: 10.1.1.3
Address: 10.1.1.3#53
Name: web1.example.net
Address: 10.1.2.5

In this example DNS is working. The web1 host expands into web1.example.net and resolves to the address 10.1.2.5. Of course, make sure that this IP matches the IP that web1 is supposed to have! In this case, DNS works, so we can move on to the next section; however, there are also a number of ways DNS could fail.
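The IP comparison suggested above can also be scripted. This is a minimal sketch, assuming nslookup output laid out as in the example (a Name: line followed by the answer's Address: line); resolved_address is a hypothetical helper name.

```shell
# resolved_address: hypothetical helper.
# Reads nslookup output on stdin and prints the address(es) that follow the
# "Name:" line, skipping the name server's own "Address:" line at the top.
# Returns nonzero if the lookup produced no answer.
resolved_address() {
    awk '
        $1 == "Name:" { seen = 1; next }
        seen && $1 == "Address:" { print $2; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example: compare the answer with the IP web1 is supposed to have.
#   [ "$(nslookup web1 | resolved_address)" = "10.1.2.5" ] || echo "unexpected IP"
```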
No Name Server Configured or Inaccessible Name Server

If you see the following error, it could mean either that you have no name servers configured for your host or they are inaccessible:

$ nslookup web1
;; connection timed out; no servers could be reached

In either case you will need to inspect /etc/resolv.conf and see if any name servers are configured there. If you don’t see any IP addresses configured there, you will need to add a name server to the file. Otherwise, if you see something like the following, you need to start troubleshooting your connection with your name server, starting off with ping:

search example.net
nameserver 10.1.1.3

If you can’t ping the name server and its IP address is in the same subnet (in this case, 10.1.1.3 is within the subnet), the name server itself could be completely down. If you can’t ping the name server and its IP address is in a different subnet, then skip ahead to the Can I Route to the Remote Host? section, but only apply those troubleshooting steps to the name server’s IP. If you can ping the name server but it isn’t responding, skip ahead to the Is the Remote Port Open? section.
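To test every configured name server in turn, you could first list them from the file. A minimal sketch follows; list_nameservers is a hypothetical helper name, and the optional path argument exists only so the function can be pointed at a copy of the file.

```shell
# list_nameservers: hypothetical helper.
# Prints the IP address from each "nameserver" line of a resolv.conf-style
# file. $1 is the file to read and defaults to /etc/resolv.conf.
# Commented-out entries are ignored because their first field is "#".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Example: ping each configured name server twice.
#   for ns in $(list_nameservers); do ping -c 2 "$ns"; done
```

If the function prints nothing, no name servers are configured and you will need to add one.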
Missing Search Path or Name Server Problem

It is also possible that you will get the following error for your nslookup command:

$ nslookup web1
Server: 10.1.1.3
Address: 10.1.1.3#53
** server can't find web1: NXDOMAIN

Here you see that the server did respond, since it gave a response: server can’t find web1. This could mean two different things. One, it could mean that web1’s domain name is not in your DNS search path. This is set in /etc/resolv.conf in the line that begins with search. A good way to test this is to perform the same nslookup command, only use the fully qualified domain name (in this case, web1.example.net). If it does resolve, then either always use the fully qualified domain name, or if you want to be able to use just the hostname, add the domain name to the search path in /etc/resolv.conf.
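To see which fully qualified names the search path would produce for a short hostname, you could expand it yourself. This is a sketch only; search_candidates is a hypothetical helper name, and it reads just the search line, ignoring other resolver options.

```shell
# search_candidates: hypothetical helper.
# Prints one fully qualified candidate per domain on the "search" line of a
# resolv.conf-style file. $1 is the short hostname; $2 is the file to read
# and defaults to /etc/resolv.conf.
search_candidates() {
    awk -v host="$1" '
        $1 == "search" {
            for (i = 2; i <= NF; i++) print host "." $i
        }
    ' "${2:-/etc/resolv.conf}"
}

# Example: try each candidate name explicitly.
#   for fqdn in $(search_candidates web1); do nslookup "$fqdn"; done
```

If web1.example.net is not among the candidates printed, the search path is the place to start.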

If even the fully qualified domain name doesn’t resolve, then the problem is on the name server. The complete method for troubleshooting all DNS issues is covered in Chapter 6, but here are some basic pointers. If the name server is supposed to have that record, then that zone’s configuration needs to be examined. If it is a recursive name server, then you will have to test whether or not recursion is working on the name server by looking up some other domain. If you can look up other domains, then you must check if the problem is on the remote name server that does contain the zones.
Can I Route to the Remote Host?

After you have ruled out DNS issues and see that web1 is resolved into its IP 10.1.2.5, you must test whether you can route to the remote host. Assuming ICMP is enabled on your network, one quick test might be to ping web1. If you can ping the host, you know your packets are being routed there and you can move to the next section, Is the Remote Port Open? If you can’t ping web1, try to identify another host on that network and see if you can ping it. If you can, then it’s possible web1 is down or blocking your requests, so move to the next section. If you can’t ping any hosts on the remote network, packets aren’t being routed correctly. One of the best tools to test routing issues is traceroute. Once you provide traceroute with a host, it will test each hop between you and the host. For example, a successful traceroute between dev1 and web1 would look like this:

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 web1 (10.1.2.5) 8.039 ms 8.348 ms 8.643 ms

Here you can see that packets go from dev1 to its gateway (10.1.1.1), and then the next hop is web1. This means it’s likely that 10.1.1.1 is the gateway for both subnets. On your network you might see a slightly different output if there are more routers between you and your host. If you can’t ping web1, your output would look more like the following:

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 * * *
3 * * *

Once you start seeing asterisks in your output, you know that the problem is on your gateway. You will need to go to that router and investigate why it can’t route packets between the two networks. Instead you might see something more like

$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 10.1.1.1 (10.1.1.1) 3006.477 ms !H 3006.779 ms !H 3007.072 ms

In this case, the !H flags mean the gateway reported the host as unreachable after the probes timed out, so the host is likely down or inaccessible even from the same subnet. At this point, if you haven’t tried to access web1 from a machine on the same subnet as web1, try pings and other tests now.
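When a trace has many hops, it helps to find the last one that answered, since the problem usually sits just beyond it. The following sketch does that for output shaped like the examples above; last_responding_hop is a hypothetical helper name.

```shell
# last_responding_hop: hypothetical helper.
# Reads traceroute output on stdin and prints "<hop number> <host>" for the
# last hop that returned a reply (any hop line whose first reply field is
# not an asterisk). Returns nonzero if no hop answered.
last_responding_hop() {
    awk '
        $1 ~ /^[0-9]+$/ && $2 != "*" { hop = $1; host = $2 }
        END {
            if (hop == "") exit 1
            print hop, host
        }
    '
}

# Example:
#   traceroute 10.1.2.5 | last_responding_hop
```

For the trace that ends in asterisks, this prints 1 10.1.1.1, which points you at the gateway.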

Note

If you have one of those annoying networks that block ICMP, don’t worry, you can still troubleshoot routing issues. You just need to install the tcptraceroute package (sudo apt-get install tcptraceroute), then run the same commands as for traceroute, only substitute tcptraceroute for traceroute.
Is the Remote Port Open?

So you can route to the machine but you still can’t access the web server on port 80. The next test is to see whether the port is even open. There are a number of different ways to do this. For one, you could try telnet:

$ telnet 10.1.2.5 80
Trying 10.1.2.5...
telnet: Unable to connect to remote host: Connection refused

If you see Connection refused, then either the port is down (likely Apache isn’t running on the remote host or isn’t listening on that port) or the firewall is blocking your access. If telnet can connect, then, well, you don’t have a networking problem at all. If the web service isn’t working the way you suspected, you need to investigate your Apache configuration on web1. Troubleshooting web server issues is covered in Chapter 8.

Instead of telnet, I prefer to use nmap to test ports because it can often detect firewalls. If nmap isn’t installed, use your package manager to install the nmap package. To test web1, type the following:

$ nmap -p 80 10.1.2.5
Starting Nmap 4.62 ( https://nmap.org ) at 2009-02-05 18:49 PST
Interesting ports on web1 (10.1.2.5):
PORT STATE SERVICE
80/tcp filtered http

Aha! nmap is smart enough that it can often tell the difference between a closed port that is truly closed and a closed port behind a firewall. Normally when a port is actually down, nmap will report it as closed. Here it reported it as filtered. What this tells us is that some firewall is in the way and is dropping the packets to the floor. This means you need to investigate any firewall rules on the gateway (10.1.1.1) and on web1 itself to see if port 80 is being blocked.
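If you run this check from a script, the state column is the part you care about. The following is a sketch that assumes the PORT STATE SERVICE layout shown above; port_state is a hypothetical helper name, not an nmap feature.

```shell
# port_state: hypothetical helper.
# Reads `nmap -p <port> <host>` output on stdin and prints the STATE column
# (for example open, closed, or filtered) for the given TCP port.
# $1 is the port number. Returns nonzero if the port is not in the output.
port_state() {
    awk -v port="$1" '
        $1 == (port "/tcp") { print $2; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Example:
#   nmap -p 80 10.1.2.5 | port_state 80
```

A result of closed sends you to the service on web1, while filtered sends you to the firewalls in between.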
Test the Remote Host Locally

At this point, we have either been able to narrow the problem down to a network issue or we believe the problem is on the host itself. If we think the problem is on the host itself, we can do a few things to test whether port 80 is available.
Test for Listening Ports

One of the first things you should do on web1 is test whether port 80 is listening. The netstat -lnp command will list all ports that are listening along with the process that has the port open. You could just run that and parse through the output for anything that is listening on port 80, or you could use grep to show only things listening on port 80:

$ sudo netstat -lnp | grep :80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 919/apache

The first column tells you what protocol the port is using. The second and third columns are the receive and send queues (both are set to 0 here). The column you want to pay attention to is the fourth column, as it lists the local address on which the host is listening. Here the 0.0.0.0:80 tells us that the host is listening on all of its IPs for port 80 traffic. If Apache were listening only on web1’s Ethernet address, you would see 10.1.2.5:80 here.

The final column will tell you which process has the port open. Here you can see that Apache is running and listening. If you do not see this in your netstat output, you need to start your Apache server.
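One caveat with grep :80 is that it also matches ports such as 8080. If that gets in your way, a stricter filter can compare the port exactly. This is a sketch assuming the column layout shown above; listeners_on_port is a hypothetical helper name.

```shell
# listeners_on_port: hypothetical helper.
# Reads `netstat -lnp` output on stdin and prints the local address and the
# PID/program of every TCP socket listening on exactly the given port.
# $1 is the port number.
listeners_on_port() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            # The port is whatever follows the last colon of the local address.
            n = split($4, parts, ":")
            if (parts[n] == port) print $4, $7
        }
    '
}

# Example, run on web1:
#   sudo netstat -lnp | listeners_on_port 80
```

No output means nothing is listening on that port.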
Firewall Rules

If the process is running and listening on port 80, it’s possible that web1 has some sort of firewall in place. Use the iptables command to list all of your firewall rules. If your firewall is disabled, your output will look like this:

$ sudo /sbin/iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Notice that the default policy is set to ACCEPT. It’s possible, though, that your firewall is set to drop all packets by default, even if it doesn’t list any rules. If that is the case you will see output more like the following:

$ sudo /sbin/iptables -L
Chain INPUT (policy DROP)
target prot opt source destination

Chain FORWARD (policy DROP)
target prot opt source destination

Chain OUTPUT (policy DROP)
target prot opt source destination

On the other hand, if you had a firewall rule that blocked port 80, it might look like this:

$ sudo /sbin/iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
REJECT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:80 reject-with icmp-port-unreachable

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination

Clearly, in the latter case you would need to modify the firewall rules to allow port 80 traffic from the host.
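On a host with a long rule set, the two things to look for are the chain policies and any rule that drops or rejects your port. The following sketch picks both out of iptables -L -n output in the format shown above; chain_policies and blocking_rules are hypothetical helper names, and the match on dpt: is a simple text match that will not catch port ranges or multiport rules.

```shell
# chain_policies: hypothetical helper.
# Reads `iptables -L -n` output on stdin and prints "<chain> <policy>" for
# each built-in chain.
chain_policies() {
    awk '$1 == "Chain" && $3 == "(policy" {
        sub(/\)$/, "", $4)   # strip the closing parenthesis
        print $2, $4
    }'
}

# blocking_rules: hypothetical helper.
# Prints every DROP or REJECT rule that names the given destination port,
# prefixed with the chain it belongs to. $1 is the port number.
blocking_rules() {
    awk -v port="$1" '
        $1 == "Chain" { chain = $2 }
        ($1 == "DROP" || $1 == "REJECT") && $0 ~ ("dpt:" port "( |$)") {
            print chain ": " $0
        }
    '
}

# Example, run on web1:
#   sudo /sbin/iptables -L -n | blocking_rules 80
```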
