The adoption of the new Catalyst 9800 not only brought new ways to configure wireless controllers, but also several new mechanisms to troubleshoot them. It has some awesome troubleshooting features that help identify root causes faster, with less time and effort.
There are several new key troubleshooting differentiators compared to previous WLC models:
◉ Trace-on-failure: Summary of detected failures
◉ Always-on-tracing: Events continuously stored without having to enable debugging
◉ Radioactive-traces: More detailed debugging logs filtered per MAC or IP address
◉ Embedded Packet capture: Perform filtered packet captures in the device itself
◉ Archive logs: Collect stored logs from all processes
Let me present the different features through a real reported wireless problem, a typical “client connectivity issue”. I will show how to use them for root cause analysis while following a systematic approach.
Let’s start with a user reporting a wireless client connectivity issue. He was kind enough to provide the client MAC address and the timestamp of the problem, so the scope is already partially delimited.
The first feature I would use to troubleshoot is Trace-on-failure. The Catalyst 9800 keeps track of predefined failure conditions and shows the number of events for each one, with details about the failed events. This feature allows us to be proactive and detect issues that could be occurring in our network even when no client has reported them. Nothing is required for this feature to work: it runs continuously in the background without any debug command.
How to collect Trace-on-failure:
◉ show wireless stats trace-on-failure
Shows the different failure conditions detected and the number of events for each.
◉ show logging profile wireless start last 2 days trace-on-failure
Shows the failure conditions detected and details about each event. Example:
9800wlc# show logging profile wireless start last 2 days trace-on-failure
Load for five secs: 0%/0%; one minute: 1%; five minutes: 1%
Time source is NTP, 20:50:30.872 CEST Wed Aug 4 2021
Logging display requested on 2021/08/04 20:50:30 (CEST) for Hostname: [eWLC], Model: [C9800-CL-K9], Version: [17.03.03], SN: [9IKUJETLDLY], MD_SN: [9IKUJETLDLY]
Displaying logs from the last 2 days, 0 hours, 0 minutes, 0 seconds
executing cmd on chassis 1 ...
Large message of size [32273]. Tracelog will be suppressed.
Large message of size [32258]. Tracelog will be suppressed.
Time UUID Log
----------------------------------------------------------------------------------------------------
2021/08/04 06:32:45.763075 0x1000000e37c92 f018.985d.3d67 CLIENT_STAGE_TIMEOUT State = WEBAUTH_REQUIRED, WLAN profile = CWA-TEST2, Policy profile = flex_vlan4_cwa, AP name = ap3800i-r3-sw2-Gi1-0-35
Tip: To focus only on failures impacting our setup, we can filter the output to remove failure conditions with no events. We can also monitor the statistics to check which failures are increasing and at what pace. The following command can be used:
◉ show wireless stats trace-on-failure | ex : 0$
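The following minimal Python sketch makes the tip concrete. It reproduces offline what `| ex : 0$` does on the controller, and it adds a rate check between two snapshots of the counters. It assumes each counter line has the form `<condition> : <count>`, which is what the `: 0$` filter implies. The function names are hypothetical, and the snapshots would be text copied from two runs of `show wireless stats trace-on-failure`.

```python
import re

# Assumed line format "<failure condition> : <count>", as implied by `| ex : 0$`.
COUNTER_LINE = re.compile(r"^\s*(?P<name>.+?)\s*:\s*(?P<count>\d+)\s*$")


def parse_counters(output):
    """Turn trace-on-failure output into {condition: count}, skipping other lines."""
    counters = {}
    for line in output.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("count"))
    return counters


def non_zero(counters):
    """Offline equivalent of `| ex : 0$`: drop conditions with no events."""
    return {name: count for name, count in counters.items() if count > 0}


def growth(before, after, minutes):
    """Events per minute for each condition that increased between snapshots."""
    rates = {}
    for name, count in after.items():
        delta = count - before.get(name, 0)
        if delta > 0:
            rates[name] = delta / minutes
    return rates
```

To use it, take one snapshot, wait a known number of minutes, take another, and pass both to `growth()`. The result shows which conditions are still climbing and how fast.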
With these commands we can identify which failure events the controller has detected in the last few days. We can also check whether there is any event for the client MAC address and timestamp provided by the user.
If there is no event for the user-reported issue, or we need more details, I would use the next feature: Always-on-tracing.
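Checking the user's client against the trace-on-failure output can also be scripted. The sketch below parses lines shaped like the example output: date, time, UUID, MAC, failure name, then `key = value` details. It keeps only the events for one client within a window around the reported time. The 30-minute window and the function names are assumptions. The MAC is normalized to the dotted format the controller prints, and the reported time must be in the same timezone as the log timestamps.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Assumption: lines follow the example output, i.e.
# "<date> <time> <uuid> <mac> <failure> key = value, key = value, ..."


def to_dotted_mac(mac: str) -> str:
    """Normalize e.g. F0:18:98:5D:3D:67 to the dotted form f018.985d.3d67."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def parse_event(line: str) -> dict | None:
    """Split one trace-on-failure line into fields; return None for other lines."""
    parts = line.split(maxsplit=5)
    if len(parts) < 5:
        return None
    try:
        when = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S.%f")
    except ValueError:
        return None  # banner, column header or separator line
    details = {}
    if len(parts) == 6:
        for field in parts[5].split(", "):
            key, sep, value = field.partition(" = ")
            if sep:
                details[key.strip()] = value.strip()
    return {"time": when, "uuid": parts[2], "mac": parts[3],
            "failure": parts[4], "details": details}


def events_for_client(output: str, mac: str, reported: datetime,
                      window: timedelta = timedelta(minutes=30)) -> list[dict]:
    """Events for one client within +/- window (assumed 30 min) of the reported time."""
    target = to_dotted_mac(mac)
    matches = []
    for line in output.splitlines():
        event = parse_event(line)
        if event and event["mac"] == target and abs(event["time"] - reported) <= window:
            matches.append(event)
    return matches
```

Applied to the example output, the parsed `details` dictionary exposes fields such as `State`, `WLAN profile` and `AP name`. These fields are usually the first clues for the next troubleshooting step.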
The Catalyst 9800 continuously logs control plane events per process into a memory buffer and copies them to disk regularly. Each process's logs can span several days, even on a fully loaded controller.
This feature lets us check events that occurred in the past without any debugs enabled. This is very useful for several purposes:
◉ getting the context and actions that caused a client or AP disconnection
◉ checking client roaming patterns
◉ seeing the SSIDs the client had connected to
It is a huge advantage compared with previous platforms, where we had to enable the “debug client” command after the issue occurred and wait for the next occurrence.
Always-on-tracing can be used to check past events for clients, APs, or any wireless-related process. We can collect all events for the wireless profile or filter by a specific client or AP MAC address. By default, the command shows the last 10 minutes and displays the output in the terminal. However, we can specify a start/end time to select the period we want logs from, and we can store them in a file.
How to collect Always-on-tracing:
◉ show logging profile wireless
Shows the last 10 minutes of all wireless-related processes in the WLC terminal.
◉ show logging profile wireless start last 24 hours filter mac MAC-ADDRESS to-file bootflash:CLIENT_LOG.txt
Shows events for a specific client/AP MAC address in the last 24 hours and stores the results in a file.
Since we know the client MAC address and the timestamp of the issue, these commands let us collect logs for the corresponding point in the past. I always try to get logs starting some time before the issue, so I can see what the client was doing before the problem occurred.
The Catalyst 9800 has several logging detail levels. Always-on-tracing stores events at “info” level. If required, we can enable higher logging levels, such as notice, debugging, or even verbose, per process or for a group of processes. Higher levels generate more events and reduce the total period of time that can be logged for that process.
If we couldn’t identify the root cause with the previously collected data and need more in-depth information on all processes and actions, I would use the next feature: Radioactive-traces.
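Picking the `start last` value by hand is easy to get wrong when the issue happened many hours ago. This hypothetical helper rounds up the number of hours needed to reach back from now to a margin before the reported time, then builds the command shown above. The 30-minute margin and the default file name are assumptions, and both times must come from the controller's clock.

```python
import math
from datetime import datetime, timedelta


def hours_to_cover(now, reported, margin=timedelta(minutes=30)):
    """Smallest whole number of hours reaching back to `margin` before the issue."""
    start = reported - margin  # assumed margin: "some time before" the issue
    if start > now:
        raise ValueError("reported time is later than the controller clock")
    return max(1, math.ceil((now - start).total_seconds() / 3600))


def build_command(mac, now, reported, margin=timedelta(minutes=30),
                  filename="bootflash:CLIENT_LOG.txt"):
    """Build the Always-on-tracing command from the article for one client."""
    hours = hours_to_cover(now, reported, margin)
    return (f"show logging profile wireless start last {hours} hours "
            f"filter mac {mac} to-file {filename}")


if __name__ == "__main__":
    # Example values: controller clock 20:50, issue reported at 06:32 the same day.
    print(build_command("f018.985d.3d67",
                        now=datetime(2021, 8, 4, 20, 50),
                        reported=datetime(2021, 8, 4, 6, 32)))
```

With these example values, the start point is 06:02, 14 h 48 min before the current time. The helper therefore rounds up to `start last 15 hours`, leaving a little extra context at the beginning of the file.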
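The trade-off between logging level and retention can be pictured with a fixed-size buffer that drops its oldest entries. This is a simplified model, not the 9800's actual implementation, which also copies traces to disk. In this model, the span of history kept equals the capacity divided by the event rate, so more events per minute means a shorter window. The function name and numbers are illustrative only.

```python
from collections import deque


def retained_span(capacity, events_per_minute, minutes):
    """Simplified model (not the controller's implementation): minutes of history
    left in a fixed-size buffer after logging at a constant rate."""
    buffer = deque(maxlen=capacity)  # oldest entries are evicted first
    for minute in range(minutes):
        for _ in range(events_per_minute):
            buffer.append(minute)  # each entry records the minute it was logged
    if not buffer:
        return 0
    return buffer[-1] - buffer[0] + 1  # oldest to newest retained minute, inclusive
```

For example, `retained_span(1000, 10, 500)` returns 100, while `retained_span(1000, 50, 500)` returns 20. Five times the events per minute leaves one fifth of the history.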
This feature avoids the need to manually increase the logging level per process. Whenever a specified set of MAC or IP addresses traverses the system, it raises the logging level of the different processes involved. It returns the logging level to “info” once it is finished.
Radioactive-traces must be enabled before the issue occurs, and it requires waiting for the next occurrence to collect the data. It behaves like the old “debug client” command on legacy controllers. It is a “one-stop shop” for in-depth troubleshooting of multiple issues, such as client-related problems, APs, mobility, RADIUS, etc. It avoids the need to enable a different list of debug commands for each scenario. By default, it provides the “notice” logging level, but the keyword “internal” can be added to provide additional logging details intended for development troubleshooting.
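Conditional debugging can be compared to filters in Python's `logging` module. The baseline level is recorded for everyone, detailed records are kept only for the addresses under condition, and clearing the condition returns to the baseline. The sketch below is an analogy only. Python's DEBUG and INFO stand in for the 9800's raised and default levels, and the class names, logger name and `mac` record attribute are hypothetical.

```python
import logging


class ConditionFilter(logging.Filter):
    """Analogy only: INFO and above pass for everyone, DEBUG only for MACs under condition."""

    def __init__(self, macs):
        super().__init__()
        self.macs = set(macs)

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True  # baseline level: always recorded
        # Detailed records pass only if they carry a MAC under condition.
        return getattr(record, "mac", None) in self.macs

    def clear(self):
        """Stop the condition: only the baseline level is recorded again."""
        self.macs.clear()


class ListHandler(logging.Handler):
    """Collect records in memory so the effect of the filter is visible."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def make_logger(name, handler, condition):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let DEBUG records reach the handler's filter
    handler.addFilter(condition)
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `log.debug("...", extra={"mac": "aaaa.bbbb.cccc"})` for a client outside the condition set produces nothing. Similarly, a radioactive trace leaves other clients at the default level.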
How to collect Radioactive-traces:
◉ CLI Method 1:
show platform condition
clear platform condition all
debug platform condition feature wireless mac dead.beaf.dead
debug platform condition start
Reproduce the issue
debug platform condition stop
show logging profile wireless filter mac dead.beaf.dead to-file File.log
If more details are needed for engineering:
show logging profile wireless internal filter mac dead.beaf.dead to-file File.log
◉ CLI Method 2: A script doing the same steps as Method 1, automatically running traces for the next 30 minutes. The time is configurable.
debug wireless mac MAC@ [internal]
Reproduce the issue
no debug wireless mac MAC@ [internal]
It generates an ra_trace file in bootflash, named with the date and MAC address.
dir bootflash: | i ra_trace
◉ This can also be enabled through the GUI, in the Troubleshooting section.
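The Method 1 steps can also be kept as a reusable checklist. This sketch returns the Method 1 commands for a given MAC, split into what to run before and after reproducing the issue. It also picks `ra_trace` entries out of a saved `dir bootflash:` listing. It only builds and filters strings and does not connect to the controller. The function names are hypothetical, and the MAC match assumes the address appears in the file name either bare or with dots or colons.

```python
def radioactive_trace_commands(mac, log_file="File.log", internal=False):
    """Method 1 commands for one MAC, split around the 'reproduce the issue' step."""
    collect = "show logging profile wireless "
    if internal:
        collect += "internal "  # extra detail intended for engineering
    collect += f"filter mac {mac} to-file {log_file}"
    return {
        "before": [
            "show platform condition",
            "clear platform condition all",
            f"debug platform condition feature wireless mac {mac}",
            "debug platform condition start",
        ],
        # ... reproduce the issue between the two lists ...
        "after": [
            "debug platform condition stop",
            collect,
        ],
    }


def ra_trace_files(dir_output, mac=None):
    """Keep `dir bootflash:` lines naming an ra_trace file, optionally for one MAC.

    Assumption (hypothetical): the MAC appears in the file name either bare
    or with dots/colons, so separators are ignored when matching.
    """
    lines = [line for line in dir_output.splitlines() if "ra_trace" in line]
    if mac is None:
        return lines
    bare = "".join(ch for ch in mac.lower() if ch.isalnum())
    return [line for line in lines
            if bare in line.lower().replace(".", "").replace(":", "")]
```

The workflow is to run the `before` commands, reproduce the issue, then run the `after` commands. This gives the same sequence as Method 1. With `internal=True`, the collection command gains the `internal` keyword used for engineering.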