7.4.3.3. Troubleshooting the Fabric Manager
The Fabric Manager provides log messages for the following:
Events (NOTICE)
Information (INFO)
Warning (WARN)
Error (ERROR)
7.4.3.3.1. Fabric Manager Event Messages
The Fabric Manager logs significant fabric events in a standard machine-readable format. These event messages describe both the event itself and the fabric nodes that are causing it.
The format of these messages is as follows:
<prefix>;MSG:<msgType>|SM:<sm_node_desc>:port <sm_port_number>|COND:<condition>|NODE:<node_desc>:port <port_number>:<node_guid>|LINKEDTO:<linked_desc>:port <linked_port>:<linked_guid>|DETAIL:<details>
Where:
<prefix> - Includes the date and time of the event, along with either the slot number or the hostname and IP address of the Fabric Manager reporting the message.
<msgType> - Is one of the following values:
ERROR
WARNING
NOTICE
INFORMATION
<sm_node_desc> and <sm_port_number> - Indicate the node name and port number of the SM that is reporting the message. The port number is prefixed with the word 'port'. Any pipes (|) or colons (:) in the node description are converted to spaces in the log message.
<condition> - Is one of the conditions from the event SM Reporting Table that is detailed in the Event Descriptions. The condition text includes a unique identification number.
<node_desc>, <port_number>, and <node_guid> - Are the node description, port number, and node GUID of the port and node that are primarily responsible for the event. Any pipes (|) or colons (:) in the node description are converted to spaces in the log message.
<linked_desc>, <linked_port>, and <linked_guid> - Are optional fields describing the other end of the link. These fields and the 'LINKEDTO' keyword are shown only in applicable messages. Any pipes (|) or colons (:) in the node description are converted to spaces in the log message.
<details> - Is an optional free-form field detailing additional information useful in diagnosing the cause of the log message.
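Because the event format is fixed, it can be split into named fields with a single regular expression. The following Python sketch is one possible way to do that; it is not part of the Fabric Manager software. The names EVENT_RE and parse_event are hypothetical. The pattern assumes that LINKEDTO and DETAIL may be absent, that whitespace may follow the semicolon or a pipe, and that node descriptions contain no pipes or colons (as stated above).

```python
import re

# Hypothetical pattern built from the documented field layout.
# LINKEDTO and DETAIL are optional; DETAIL is free-form and taken to end of line.
EVENT_RE = re.compile(
    r"^(?P<prefix>.*?);\s*MSG:(?P<msg_type>[A-Z]+)"
    r"\|\s*SM:(?P<sm_node_desc>[^|:]*):port (?P<sm_port_number>\d+)"
    r"\|\s*COND:(?P<condition>[^|]*)"
    r"\|\s*NODE:(?P<node_desc>[^|:]*):port (?P<port_number>\d+):(?P<node_guid>[^|]*)"
    r"(?:\|\s*LINKEDTO:(?P<linked_desc>[^|:]*):port (?P<linked_port>\d+)"
    r":(?P<linked_guid>[^|]*))?"
    r"(?:\|\s*DETAIL:(?P<details>.*))?$"
)


def parse_event(line):
    """Return a dict of event fields, or None if the line is not an event message."""
    match = EVENT_RE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    "Apr 6 22:48:42 sample-host fm0_sm[21458]: sample-host; "
    "MSG:NOTICE|SM:sample-host:port 2|COND:#8 Fabric initialization error"
    "|NODE:sample-host2:port 1:0x0011750000ffd7af"
    "|LINKEDTO:Cornelis OPA Switch:port 18:0x00066a00d9000108"
    "|DETAIL:Failed to set portinfo for node"
)
event = parse_event(sample)
print(event["condition"], "->", event["details"])
```

Fields that do not appear in a message (for example, the LINKEDTO fields) are returned as None.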
Event Descriptions
The following sections describe the Fabric Manager event messages, their severity, an explanation, and possible causes for the event.
The Subnet Manager emits this message when it is the only running Subnet Manager on a given subnet.
Severity
Warning
Causes
No redundant SM exists on the subnet.
A user shut down a redundant SM, or disconnected or shut down the node on which the SM was running.
Action
If running redundant SMs on a fabric, verify the health of each host or switch running an SM.
The Master SM for the subnet detected that another SM has come online.
Severity
Notice
Causes
A user started a redundant SM on another host or switch.
A user just connected two separate subnets together.
Action
None
A new SuperNIC port, switch, inter-switch link, or Subnet Manager was detected by the master Subnet Manager.
Severity
Notice
Causes
User action
Action
None
A SuperNIC port, switch, inter-switch link, or Subnet Manager has disappeared from the fabric. This encompasses system shutdowns and loss of connectivity.
Severity
Notice
Action
The administrator should verify whether the components disappeared from the fabric due to user action. Nodes typically disappear from the fabric when they are rebooted or re-cabled, or when their Omni-Path Fabric stacks are stopped.
Subnet Manager transitioned into the 'master' state from one of the 'standby', 'discovering', or 'not active' states.
Severity
Notice
Action
The administrator should check the state of the machine (or chassis) that was providing the master SM service to determine if it has failed and needs to be replaced, or whether the state change occurred due to user action.
Example
Nov 28 17:45:25 sample-host fm0_sm[29326]: ;MSG:NOTICE|SM:sample-host.sample-domain.com:port 1|COND:#5 SM state to master|NODE:sample-host.sample-domain.com:port 1:0x0x00066a00a0000405|DETAIL:transition from DISCOVERING to MASTER
Subnet Manager transitioned from 'master' into 'standby' state.
Severity
Notice
Action
The administrator should validate that this was due to a modification in the CN5000 Omni-Path Fabric network configuration. If not, then this issue should be reported to customer support.
Example
Nov 29 12:15:28 sample-host fm0_sm[31247]: ;MSG:NOTICE|SM:sample-host.sample-domain.com:port 1|COND:#6 SM state to standby|NODE:sample-host.sample-domain.com:port 1:0x0x00066a00a0000405|DETAIL:transition from MASTER to STANDBY
The master Subnet Manager is shutting down.
Severity
Notice
Action
The administrator should check the state of the machine (or chassis) that was providing the master SM service to determine if it has failed, or whether the state change occurred due to user action.
Example
;MSG:NOTICE|SM:sample-host.sample-domain.com:port 1|COND:#7 SM shutdown|NODE:sample-host.sample-domain.com:port 1:0x0x00066a00a0000405|DETAIL:
Some form of error occurred during fabric initialization.
Severity
Notice
Explanation
Examples of possible errors include:
Link could not be activated in 4x mode.
Subnet Manager could not initialize a port or node with proper configuration.
Action
The administrator should perform the fabric troubleshooting procedure to isolate and repair the faulty component. The faulty component could be the SM platform itself (for example, its own SuperNIC) or a component in the CN5000 Omni-Path Fabric network.
Example
Apr 6 22:48:42 sample-host fm0_sm[21458]: sample-host; MSG:NOTICE|SM:sample-host:port 2|COND:#8 Fabric initialization error|NODE:sample-host2:port 1:0x0011750000ffd7af|LINKEDTO:Cornelis OPA Switch:port 18:0x00066a00d9000108|DETAIL:Failed to set portinfo for node
The SM received an asynchronous trap from a switch or end-port indicating a link integrity problem.
Severity
Notice
Action
The administrator should perform the fabric troubleshooting procedure to isolate and repair the faulty component. This is typically due to a bad cable, an incorrect cable being used for the signaling rate and cable length (for example, too small a wire gauge), or a hardware failure on one of the two SuperNIC ports.
The SM received an asynchronous trap from a switch or end-port indicating a management key violation.
Severity
Notice
Action
The administrator should validate that the software configuration has not changed, because this event is most likely caused by a configuration issue. However, it could also indicate a more serious issue such as a hacking attempt.
The Subnet Manager encountered an error at some time after fabric initialization.
Severity
Notice
Explanation
Examples of possible errors are:
The SM received an invalid request for information.
The SM could not perform the action requested by another fabric entity such as a request to create or join a multicast group with an unrealizable MTU or rate.
Action
The administrator should check whether other SM-related problems have occurred and perform the corrective actions for those items. If these exceptions persist, then customer support should be contacted.
A brief message describing the number of changes that the SM detected on its last subnet sweep. This message includes totals for the number of switches, SuperNICs, end-ports, total physical ports, and SMs that have appeared or disappeared from the fabric. This message is logged at the end of a subnet sweep only if the SM detected changes.
Severity
Notice
Action
As this is only a summary of events detected during a fabric sweep, the administrator should examine the logs for preceding messages that describe the fabric changes in detail.
Example
Apr 8 15:31:36 sample-host fm0_sm[21458]: sample-host; MSG:NOTICE|SM:sample-host:port 2|COND:#12 Fabric Summary|NODE:sample-host:port 2:0x00066a01a0000405|DETAIL:Change Summary: 1 SWs disappeared, 0 HFIs appeared, 1 end ports disappeared, 3 total ports disappeared, 0 SMs appeared
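The DETAIL field of the Fabric Summary example above follows a regular "count, item, appeared or disappeared" pattern, so the totals can be extracted for monitoring. The following Python sketch assumes that the field always starts with "Change Summary:" and that items are comma-separated as in the example; the names ITEM_RE and parse_change_summary are hypothetical.

```python
import re

# Matches items such as "1 SWs disappeared" or "3 total ports disappeared".
ITEM_RE = re.compile(r"(\d+) ([A-Za-z ]+?) (appeared|disappeared)")


def parse_change_summary(details):
    """Map (item, direction) to a count; return {} if this is not a change summary."""
    marker = "Change Summary:"
    if not details.startswith(marker):
        return {}
    counts = {}
    for count, item, direction in ITEM_RE.findall(details[len(marker):]):
        counts[(item, direction)] = int(count)
    return counts


summary = parse_change_summary(
    "Change Summary: 1 SWs disappeared, 0 HFIs appeared, "
    "1 end ports disappeared, 3 total ports disappeared, 0 SMs appeared"
)
for (item, direction), count in summary.items():
    if count:
        print(f"{count} {item} {direction}")
```

Nonzero counts point to the preceding log messages that describe each change in detail.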
Subnet Manager transitioned from standby into inactive state.
Severity
Notice
Action
The administrator should check for inconsistencies in XML configurations between the master SM and this SM.
Example
Nov 29 12:15:28 sample-host fm0_sm[31247]: ;MSG:NOTICE|SM:sample-host.sample-domain.com:port 1|COND:#13 SM state to inactive|NODE:sample-host.sample-domain.com:port 1:0x0x00066a00a0000405|DETAIL:transition from STANDBY to NOTACTIVE
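The state-change examples (conditions #5, #6, and #13) all carry a DETAIL field of the form "transition from <old> to <new>". The following Python sketch extracts the two states and returns a follow-up hint paraphrased from the Action text of those events. The names TRANSITION_RE, FOLLOW_UP, parse_transition, and follow_up are hypothetical, and the state names are taken only from the examples shown here.

```python
import re

TRANSITION_RE = re.compile(r"transition from (\w+) to (\w+)")

# Hints paraphrased from the Action text of the state-change events above.
FOLLOW_UP = {
    "MASTER": "check the machine that previously provided the master SM service",
    "STANDBY": "confirm that a fabric configuration change caused the transition",
    "NOTACTIVE": "compare the XML configuration with the master SM",
}


def parse_transition(details):
    """Return (old_state, new_state) from a DETAIL field, or None if absent."""
    match = TRANSITION_RE.search(details)
    return (match.group(1), match.group(2)) if match else None


def follow_up(details):
    """Return a follow-up hint for the new state, or None if it is not recognized."""
    states = parse_transition(details)
    return FOLLOW_UP.get(states[1]) if states else None


print(parse_transition("transition from STANDBY to NOTACTIVE"))
print(follow_up("transition from STANDBY to NOTACTIVE"))
```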
Deactivating the Standby Subnet Manager and Secondary Performance Manager due to inconsistent Subnet Manager XML configuration on Standby.
Severity
Warning
Action
If the condition persists, compare the XML configuration files between the master and standby SM for inconsistencies.
Example
Oct 22 12:49:06 shaggy fm0_sm[31032]: shaggy; MSG:WARNING|SM:shaggy:port 1|COND:#14 SM standby configuration inconsistency|NODE:i9k118:port 0:0x00066a00d8000118|DETAIL:Deactivating standby SM i9k118 : 0x00066a00d8000118 which has a SM configuration inconsistency with master! The secondary PM will also be deactivated.
Deactivating the Standby Subnet Manager and Secondary Performance Manager due to inconsistent Subnet Manager Virtual Fabrics XML configuration on Standby.
Severity
Warning
Action
If the condition persists, compare the XML configuration files between the master and standby SM for inconsistencies.
Example
Oct 22 12:23:41 shaggy fm0_sm[30778]: shaggy; MSG:WARNING|SM:shaggy:port 1|COND:#15 SM standby virtual fabric configuration inconsistency|NODE:i9k118:port 0:0x00066a00d8000118|DETAIL:Deactivating standby SM i9k118 : 0x00066a00d8000118 which has a Virtual Fabric configuration inconsistency with master! The secondary PM will also be deactivated.
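When comparing the XML configuration files of the master and standby SM, a plain text diff is a reasonable first step. The following Python sketch uses the standard difflib module; the function name diff_configs and the element names in the sample text are placeholders, not real Fabric Manager configuration tags. A textual diff may also flag differences (such as whitespace or comments) that the Fabric Manager itself does not treat as inconsistencies.

```python
import difflib


def diff_configs(master_text, standby_text):
    """Return unified-diff lines between two configuration texts ([] if identical)."""
    return list(
        difflib.unified_diff(
            master_text.splitlines(),
            standby_text.splitlines(),
            fromfile="master",
            tofile="standby",
            lineterm="",
        )
    )


# Placeholder content; in practice, read each host's configuration file instead.
master = "<Example>\n  <SettingA>1</SettingA>\n  <SettingB>2</SettingB>\n</Example>"
standby = "<Example>\n  <SettingA>1</SettingA>\n  <SettingB>3</SettingB>\n</Example>"

for line in diff_configs(master, standby):
    print(line)
```

An empty result means the two texts are identical line for line.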
Reserved
Deactivating the Secondary Performance Manager and Standby Subnet Manager due to inconsistent Performance Manager XML configuration on Secondary.
Severity
Warning
Action
If the condition persists, compare the XML configuration files between the primary and secondary PM for inconsistencies.
Example
Oct 22 12:51:42 shaggy fm0_sm[31173]: shaggy; MSG:WARNING|SM:shaggy:port 1|COND:#17 PM secondary configuration inconsistency|NODE:i9k118:port 0:0x00066a00d8000118|DETAIL:Attempting to deactivate secondary PM which has a configuration inconsistency with primary! The standby SM will also be deactivated.
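Because every condition includes a unique identification number, event messages can be tallied by condition to see which events dominate a log. The following Python sketch is one way to do that, assuming the COND field has the form "#<number> <text>" as in the examples above; the names COND_RE and tally_conditions are hypothetical.

```python
import re
from collections import Counter

# Captures the condition number and text, for example "#5 SM state to master".
COND_RE = re.compile(r"\|\s*COND:#(\d+) ([^|]*)")


def tally_conditions(lines):
    """Count event messages per (condition number, condition text)."""
    counts = Counter()
    for line in lines:
        match = COND_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2).strip())] += 1
    return counts


log_lines = [
    ";MSG:NOTICE|SM:sample-host:port 1|COND:#5 SM state to master"
    "|NODE:sample-host:port 1:0x00066a00a0000405|DETAIL:transition from DISCOVERING to MASTER",
    ";MSG:NOTICE|SM:sample-host:port 1|COND:#6 SM state to standby"
    "|NODE:sample-host:port 1:0x00066a00a0000405|DETAIL:transition from MASTER to STANDBY",
    ";MSG:NOTICE|SM:sample-host:port 1|COND:#5 SM state to master"
    "|NODE:sample-host:port 1:0x00066a00a0000405|DETAIL:transition from STANDBY to MASTER",
    "a line that is not an event message",
]
for (number, text), count in tally_conditions(log_lines).most_common():
    print(f"#{number} {text}: {count}")
```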
7.4.3.3.2. Other Log Messages
In addition to the Fabric Manager Event messages detailed in the previous section, the Fabric Manager software suite may emit other log messages that provide extra detail for use by technical personnel in troubleshooting fabric issues.
Log messages generally use the following format:
<prefix>: <severity>[<module>]: <component>: <function>: <message>
Where:
prefix - Includes the time and date, followed by the hostname and/or IP address of the Fabric Manager reporting the message, the instance name, and a Process ID (PID) number.
severity - One of the following: FATAL, ERROR, WARN, NOTIC, INFO, PROGR, VBOSE, DBG[1-4], ENTER, or EXIT.
module - Program module that generated the message. Typically the name of the sub-component or library that saw the event.
component - Name of the Fabric Manager process that owns the module.
function - Part of the sub-module where the event occurred. This is probably useful only to developers, but it might give insight into what the Fabric Manager is currently doing.
message - Free-form message text giving more details or explaining the event.
Note
Some of the listed components of the formatting may be omitted.
Example:
Jan 20 14:40:22 phgppriv36 fm0_sm[4082]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 0 SWs, 2 HFIs, 2 end ports, 2 total ports, 1 SM(s), 26 packets, 0 retries, 0.004 sec sweep
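The general log format can be split into its parts in a similar way. The following Python sketch assumes that fields are separated by a colon followed by a space, and that module, component, and function may each be omitted, as the note above allows. The names LOG_RE and parse_log_line are hypothetical. When fields are omitted and the message text itself contains a colon followed by a space, the split is ambiguous, so treat the result as a best guess.

```python
import re

# Severity keywords as listed in the field descriptions above.
LOG_RE = re.compile(
    r"^(?P<prefix>.*?): "
    r"(?P<severity>FATAL|ERROR|WARN|NOTIC|INFO|PROGR|VBOSE|DBG[1-4]|ENTER|EXIT)"
    r"(?:\[(?P<module>[^\]]*)\])?: "
    r"(?:(?P<component>[^:]+): )?"
    r"(?:(?P<function>[^:]+): )?"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match the format."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


record = parse_log_line(
    "Jan 20 14:40:22 phgppriv36 fm0_sm[4082]: PROGR[topology]: SM: topology_main: "
    "DISCOVERY CYCLE END. 0 SWs, 2 HFIs, 2 end ports, 2 total ports, 1 SM(s), "
    "26 packets, 0 retries, 0.004 sec sweep"
)
print(record["severity"], record["module"], record["component"], record["function"])
```

Omitted fields are returned as None.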
Information (INFO)
SM Area
Discovery
Meaning
The last full member of the group has left. The group is removed from the fabric.
Action
None
SM Area
Discovery
Meaning
SM has transitioned to STANDBY mode.
Action
None
SM Area
Discovery
Meaning
Discovery sweep has started.
Action
None
SM Area
Discovery
Meaning
Discovery sweep has ended.
Action
None
SM Area
Discovery
Meaning
Usually happens during the merging of two fabrics.
Action
None
SM Area
Administrator
Meaning
May be caused by simultaneous removal/insertion events in the fabric.
Action
Check the health of the requester and the connected port if the message persists.
SM Area
Administrator
Meaning
May be caused by simultaneous removal/insertion events in the fabric, or the destination has dropped from the fabric.
Action
None
SM Area
Administrator
Meaning
Group is cleaned out when the last FULL member leaves.
Action
None
Warning (WARN)
SM Area
SM to SM Communication
Meaning
Lost communication path to the other SM on node HFI1.
Action
Check the health of the node described in the message and the status of the SM node.
SM Area
SM to SM Communication
Meaning
Lost communication path to the other SM on node HFI1.
Action
Check the health of the node described in the message and the status of the SM node.
SM Area
SM to SM Communication
Meaning
The SM on node HFI1 is violating the protocol.
Action
If the condition persists, turn off the SM on node HFI1.
SM Area
SM to SM Communication
Meaning
SM on node HFI1 is violating the protocol specification.
Action
If the condition persists, turn off the SM on node HFI1.
SM Area
SM to SM Communication
Meaning
The SM on node HFI1 is incompatible or lost the communication path.
Action
Remove the incompatible SM from the fabric or check the health of the node HFI1.
SM Area
SM to SM Communication
Meaning
The SM on node HFI1 is violating the protocol specification.
Action
If the condition persists, turn off the SM on node HFI1.
SM Area
SM to SM Communication
Meaning
The SM on node HFI1 is violating the protocol specification.
Action
If the condition persists, turn off the SM on node HFI1.
SM Area
SM to SM Communication
Meaning
The SM on node HFI1 is violating the protocol specification.
Action
If the condition persists, turn off the SM on node HFI1.
SM Area
SM to SM Communication
Meaning
Lost communication path to the other SM on node HFI1.
Action
Check the health of the node described in the message and the status of the SM node.
SM Area
SM to SM Communication
Meaning
The SM on node HFI1 configuration does not match the master.
Action
Verify that the XML configuration between master and standby SM is consistent.
SM Area
SM to SM Communication
Meaning
The SM on node HFI1 is violating the protocol specification.
Action
If the condition persists, turn off the SM on node HFI1.
SM Area
SM to SM Communication
Meaning
Remote SM may have handed over to another SM on the fabric.
Action
None
SM Area
SM to SM Communication
Meaning
Lost the communication path to the other SM at lid 0x1.
Action
Check the health of the node described in the message and the status of the SM node.
SM Area
Discovery
Meaning
Multiple cable pulls or chassis removal/insertion event.
Action
Check links with high error count and reseat or replace cable. If the condition persists, capture the log information and call support.
SM Area
Discovery
Meaning
Lost communication path to node HFI1.
Action
Check the health of the node port described in the message.
SM Area
Discovery
Meaning
The node connected to port x of HFI1 is not responding.
Action
Check the health of the node connected to the port.
SM Area
Discovery
Meaning
The node connected to port x of HFI1 is not responding.
Action
Check the health of the node connected to the port.
SM Area
Discovery
Meaning
Switch node 1 is not responding.
Action
If the condition persists, check the health of the switch and capture health data if possible.
SM Area
Discovery
Meaning
Switch node 1 not responding.
Action
If the condition persists, check the health of the switch and capture health data if possible.
SM Area
Discovery
Meaning
Port x of HFI1 not responding.
Action
Check the health of node HFI1.
SM Area
Discovery
Meaning
The node may have been marked down if it did not respond to SMA queries.
Action
Check the health of the node connected to switch 1 port X.
SM Area
Discovery
Meaning
Another SM is configured with a different MKey.
Action
Stop one of the subnet managers and make the configuration consistent.
SM Area
Discovery
Meaning
May be caused by simultaneous removal/insertion events in the fabric. Persistence indicates that the node may be having problems.
Action
Check the health of node HFI1/SW1 if the fabric was idle or if the condition persists.
SM Area
Discovery
Meaning
May be caused by simultaneous removal/insertion events in the fabric. Persistence indicates the node may be having problems.
Action
Check the health of node HFI1/SW1 if the fabric was idle or if the condition persists.
SM Area
Discovery
Meaning
Caused by simultaneous removal/insertion events in the fabric.
Action
None
SM Area
Administrator
Meaning
Invalid data in the SA request.
Action
Check the health of the requester at lid 0x1.
SM Area
Administrator
Meaning
The source and destination do not share a partition with the given PKey.
Action
Configuration change required if they should have access.
SM Area
Administrator
Meaning
The requesting node and destination do not share a PKey.
Action
Configuration change required if they should have access.
SM Area
Administrator
Meaning
The source and destination do not share a path in a vFabric that contains limitations on MTU and rate.
Action
Configuration change required if the path is valid.
SM Area
Administrator
Meaning
A request was made for a given PKey, but the source of the query is not a member of the same partition.
Action
Configuration change may be necessary.
SM Area
Administrator
Meaning
A query request failed pairwise PKey checks.
Action
Configuration change may be necessary.
SM Area
Configuration
Meaning
A vFabric with an undefined PKey has been assigned a PKey.
Action
None
SM Area
Configuration
Meaning
PKey validation failed for a service record because the requesting node does not have a valid PKey.
Action
Configuration change required if request should be valid.
SM Area
Administrator
Meaning
May have duplicate data in the fabric.
Action
Check topology data for duplicate GUIDs.
SM Area
Administrator
Meaning
May be caused by simultaneous removal/insertion events in the fabric.
Action
Check SW1 for a bad port and the health of the destination node if the condition persists.
SM Area
Administrator
Meaning
May be caused by simultaneous removal/insertion events in the fabric.
Action
Check SW1 for a bad port and the health of the destination node if the condition persists.
SM Area
Administrator
Meaning
SM Multicast denial of service configured and the threshold has been reached. Bouncing the port in an attempt to clear the issue.
Action
If this occurs multiple times, check the health of the node HFI1.
SM Area
Administrator
Meaning
SM Multicast denial of service configured and the threshold has been reached, disabling the port.
Action
Check the health of the node HFI1.
Error (ERROR)
SM Area
SM to SM communication
Meaning
Lost communication path to the other SM on node GUID 0x00066a00d9000143.
Action
Check the health of the node HFI1 and the status of the SM node.
SM Area
Discovery
Meaning
topology_initialize: cannot get PortInfo; sleeping.
Action
Make sure the stack is running. Restart the SM node and stack.
SM Area
Discovery
Meaning
The node port of the SM is down.
Action
Be certain the host cable is connected to a switch (host SM only).
SM Area
Discovery
Meaning
SM cannot communicate with the stack.
Action
Make sure the stack is running. Restart the SM node and stack.
SM Area
Discovery
Meaning
The SM cannot communicate with the stack.
Action
Make sure the stack is running. Restart the SM node and stack.
SM Area
Discovery
Meaning
The SM cannot communicate with the stack.
Action
If the condition persists, restart the SM node.
SM Area
Discovery
Meaning
The SM cannot communicate with the stack.
Action
If the condition persists, restart the SM node.
SM Area
Discovery
Meaning
Duplicate Node GUID in the fabric.
Action
Using fabric tools, locate the device with the duplicate node GUID and remove it.
SM Area
Discovery
Meaning
Duplicate Port GUID in the fabric.
Action
Using fabric tools, locate the device with the duplicate port GUID and remove it.
SM Area
Discovery
Meaning
Duplicate Node GUID in the fabric.
Action
Using fabric tools, locate the device with the duplicate node GUID and remove it.
SM Area
Discovery
Meaning
Port x of HFI1 is not responding.
Action
Check the health of node HFI1.
SM Area
Discovery
Meaning
May be caused by simultaneous removal/insertion events in the fabric. Persistence indicates that the node may be having problems.
Action
Check the health of node HFI1/SW1 if the fabric was idle or in persistent condition.
SM Area
Discovery
Meaning
May be caused by simultaneous removal/insertion events in the fabric. Persistence indicates that the node may be having problems.
Action
Check the health of node HFI1/SW1 if the fabric was idle or in persistent condition.
SM Area
Discovery
Meaning
May be caused by simultaneous removal/insertion events in the fabric. Persistence indicates that the node may be having problems.
Action
Check the health of node HFI1/SW1 if the fabric was idle or in persistent condition.
SM Area
Administrator
Meaning
The response buffer is too large.
Action
Contact support.
SM Area
Administrator
Meaning
Possible data corruption.
Action
Contact support.
SM Area
Administrator
Meaning
May be caused by simultaneous removal/insertion events in the fabric.
Action
Check the health of the destination if the condition persists.
SM Area
Administrator
Meaning
May be caused by simultaneous removal/insertion events in the fabric.
Action
Check for the next three messages.
SM Area
Administrator
Meaning
The response buffer is too large.
Action
Contact support.
SM Area
Administrator
Meaning
Invalid data in the SA request.
Action
Check the health of the requester at LID 0x1.
SM Area
Administrator
Meaning
May be caused by simultaneous removal/insertion events in the fabric.
Action
Check the health of the destination LID 0x2.
SM Area
Administrator
Meaning
A path does not exist in the partition with the given PKey between the given source and destination.
Action
Check the configuration to determine if the path should exist in the given PKey. Check the health of the destination LID 0x2 if the configuration is valid.
SM Area
Administrator
Meaning
Group may have just been deleted or the requester is not a member of the group.
Action
None
SM Area
Administrator
Meaning
Invalid data in the SA request.
Action
Check the health of HFI1.
SM Area
Administrator
Meaning
The requester port data is not compatible with the group data.
Action
Create a group at the lowest common denominator or the host should join with a rate selector of “less than” rather than “exactly”.
SM Area
Administrator
Meaning
Node compute-0-24 has requested a port rate that is incompatible with the group rate.
Action
Check that the requester port is not running at 1X width or that the multicast group was not created with a rate greater than what some of the host ports can support.
SM Area
Administrator
Meaning
End node may be trying to join a group that does not exist.
Action
OpenIB and Sun stacks require that the broadcast group be pre-created by the SM.
SM Area
Administrator
Meaning
Specific bits must be set in a CREATE group request.
Action
The requester is violating the protocol specification.
SM Area
Administrator
Meaning
The PKey specified in the request was limited; it should be full.
Action
Check configuration.
SM Area
Administrator
Meaning
An attempt to create a multicast group failed due to validation failures.
Action
Check the configuration if the create request from this source should be valid.
SM Area
Administrator
Meaning
The requested MGID violates the protocol specification.
Action
The requester is violating the protocol specification.
SM Area
Administrator
Meaning
Creation of a Multicast Group requires FULL membership.
Action
The requester is violating the protocol specification.
SM Area
Administrator
Meaning
Creation of a Multicast Group requires FULL membership.
Action
The requester is violating the protocol specification.
SM Area
Administrator
Meaning
Specific bits must be set in a JOIN group request.
Action
The requester is violating the protocol specification.
SM Area
Administrator
Meaning
No resources.
Action
Delete some of the multicast groups or configure the SM to overload MLIDs during a group creation.
SM Area
Administrator
Meaning
No resources.
Action
Delete some of the multicast groups or configure the SM to overload the MLIDs during group creation.
SM Area
Administrator
Meaning
No resources.
Action
Delete some of the multicast groups or configure the SM to overload the MLIDs during group creation.
SM Area
Administrator
Meaning
JOIN of a group that does not exist.
Action
OpenIB and Sun stacks require that the broadcast group be pre-created by the SM.
SM Area
Administrator
Meaning
SM may have been set to create the default broadcast group with parameters not valid for the fabric.
Action
Reconfigure the default broadcast group with the proper parameters.
SM Area
Administrator
Meaning
SM may have been set to create the default broadcast group with parameters not valid for the fabric.
Action
Reconfigure the default broadcast group with the proper parameters.
SM Area
Administrator
Meaning
The SM may have been set to create a default broadcast group with parameters not valid for a fabric or a host has created the group at a RATE not supported by other hosts.
Action
Create the group at the lowest common denominator or the host should join with a rate selector of “less than” rather than “exactly”.
SM Area
Administrator
Meaning
The requester is using the wrong smkey.
Action
Make sure to use the same smkey as what is configured in the SM.
SM Area
Administrator
Meaning
The SA select is limited to Virtual Fabrics using the Default PKey 0x7fff.
Action
Configuration change needed for SA Select.
SM Area
Administrator
Meaning
The Multicast Group has an MGID configured that does not match any application that is part of this Virtual Fabric.
Action
Configuration change needed for multicast group creation.
SM Area
Administrator
Meaning
The MulticastGroup linked to this Virtual Fabric does not share a common PKey. The multicast group is disabled for this vFabric.
Action
Configuration change required if mcast group default creation is needed.
SM Area
Administrator
Meaning
An internal error occurred when attempting to refresh the SM PKeys.
Action
Contact customer support if the condition persists.
SM Area
Administrator
Meaning
Service record add failure due to a request with an invalid PKey.
Action
Configuration change required if request should be granted.
SM Area
Administrator
Meaning
Service record add failure due to a request with a PKey not shared by the requester.
Action
Configuration change required if request should be granted.
SM Area
Administrator
Meaning
Service record add failure due to a request with a PKey not shared by the requester.
Action
Configuration change required if request should be granted.