7.4.1.2. Identifying Suspect Cables

Cornelis Technical Documentation

Often, the first indication of possible signal integrity and link issues associated with suspect cables appears in the Fabric Manager log messages. You can observe fabric issues through several methods, including log messages in the syslog and monitoring tools.

7.4.1.2.1. Common PM Log Messages

This section provides some of the common PM log messages, what they mean, and what they could indicate.

General Integrity Threshold

Because the PM sweeps more frequently than the SM, it is often the first to log messages indicating signal integrity issues.

Format:

fm0_sm[<PID>]: WARN [PmEngine]: PM: Integrity of <VALUE> Exceeded Threshold of 100. <NODE> Guid <GUID> LID <LID> Port <#>  Neighbor: <NEIGHBOR_NODE> Guid <NEIGHBOR_GUID> LID <NEIGHBOR_LID> Port <#>
fm0_sm[<PID>]: WARN [PmEngine]: PM: <COUNTER_DATA>

Example:

fm0_sm[12345]: WARN [PmEngine]: PM: Integrity of 600 Exceeded Threshold of 100. compute001 hfi1_0 Guid 0x0011750101010101 LID 0x3 Port 1  Neighbor: edge003 Guid 0x00117501cafebeef LID 0x2 Port 2 
fm0_sm[12345]: WARN [PmEngine]: PM: LQI=1

Note

In each sweep, only the first Pm.ThresholdsExceededMsgLimit.integrity messages (default of 10) are logged as WARN. The remaining messages are logged as INFO, which most rsyslog configurations do not print by default.
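As an illustration, the integrity message can be split into its fields with a regular expression. The following is a minimal sketch, not part of the FM tooling: the pattern is derived only from the format shown above, and the names `INTEGRITY_RE` and `parse_integrity` are hypothetical. It assumes that node descriptions may contain spaces (as in `compute001 hfi1_0`) and that GUIDs and LIDs are printed in hexadecimal.

```python
import re

# Pattern derived from the documented message format (an assumption, not an official grammar).
INTEGRITY_RE = re.compile(
    r"PM: Integrity of (?P<value>\d+) Exceeded Threshold of (?P<threshold>\d+)\. "
    r"(?P<node>.+?) Guid (?P<guid>0x[0-9a-fA-F]+) LID (?P<lid>0x[0-9a-fA-F]+) "
    r"Port (?P<port>\d+)\s+"
    r"Neighbor: (?P<nbr_node>.+?) Guid (?P<nbr_guid>0x[0-9a-fA-F]+) "
    r"LID (?P<nbr_lid>0x[0-9a-fA-F]+) Port (?P<nbr_port>\d+)"
)


def parse_integrity(line):
    """Return both ends of the reported link as a dict, or None if the line does not match."""
    m = INTEGRITY_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Convert the numeric fields; GUIDs and LIDs stay as the hex strings seen in the log.
    for key in ("value", "threshold", "port", "nbr_port"):
        fields[key] = int(fields[key])
    return fields


SAMPLE = (
    "fm0_sm[12345]: WARN [PmEngine]: PM: Integrity of 600 Exceeded Threshold of 100. "
    "compute001 hfi1_0 Guid 0x0011750101010101 LID 0x3 Port 1  "
    "Neighbor: edge003 Guid 0x00117501cafebeef LID 0x2 Port 2 "
)

if __name__ == "__main__":
    print(parse_integrity(SAMPLE))
```

Keeping both the node and its neighbor in the result identifies the link, and therefore the cable, rather than just one endpoint.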

Port Bounce Integrity Threshold

If the PM cannot access a port, the neighbor often reports an integrity message, similar to the General Integrity Threshold message, indicating that the port/link is in the DOWN state (LQI=0).

When rebooting a node or during maintenance, these messages can be quite common. The PM often queries ports before the SM finalizes its topology. If the messages do not repeat after the reboot or maintenance, they can be ignored.

Query Failures

If the PM is unable to query the counters on a port, it reports an error. This error can be normal if the server or switch is rebooting at the time.

Occasionally a failure occurs when a port fails to respond because of an issue (such as signal integrity) along the request or response route.

Format:

fm0_sm[<PID>]: WARN [PmEngine]: PM: PmPrintFailPort: Unable to Get(PortStatus) <NODE> Guid <GUID> LID <LID> Port <#>

Example:

fm0_sm[12345]: WARN [PmAsyncRcv]: PM: PmPrintFailPort: Unable to Get(PortStatus) compute004 hfi1_0 Guid 0x0011750101010102 LID 0x1f Port 1

Note

In each sweep, only the first ten (Pm.SweepErrorsLogThreshold) Sweep Errors messages are logged as WARN. The remaining messages are logged as INFO.

PM Sweep Failures occur any time the PM fails to query the counters on a port during a sweep. At the end of any PM sweep that encountered errors, the PM provides a summary of the number of nodes and ports that it was unable to access as shown in the example below.

Format:

fm0_sm[<PID>]: WARN [PmEngine]: PM: PmSweepAllPortCounters: Unable to get <#> Ports on <#> Nodes

Example:

fm0_sm[12345]: WARN [PmEngine]: PM: PmSweepAllPortCounters: Unable to get 277 Ports on 60 Nodes
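To see which nodes account for the query failures, the per-port messages can be tallied and set beside the end-of-sweep summaries. The sketch below assumes only the two formats shown above; `FAIL_PORT_RE`, `SUMMARY_RE`, and `summarize_query_failures` are hypothetical names. Because only the first Pm.SweepErrorsLogThreshold messages in a sweep print as WARN, the per-node tally can be lower than the totals in the summary line when INFO messages are not captured.

```python
import re
from collections import Counter

# Patterns derived from the documented formats; the bracketed module tag is deliberately ignored.
FAIL_PORT_RE = re.compile(
    r"PmPrintFailPort: Unable to Get\(PortStatus\) "
    r"(?P<node>.+?) Guid (?P<guid>0x[0-9a-fA-F]+) LID (?P<lid>0x[0-9a-fA-F]+) "
    r"Port (?P<port>\d+)"
)
SUMMARY_RE = re.compile(
    r"PmSweepAllPortCounters: Unable to get (?P<ports>\d+) Ports on (?P<nodes>\d+) Nodes"
)


def summarize_query_failures(lines):
    """Return (failures per node, list of (ports, nodes) sweep summaries)."""
    per_node = Counter()
    summaries = []
    for line in lines:
        m = FAIL_PORT_RE.search(line)
        if m:
            per_node[m.group("node")] += 1
            continue
        m = SUMMARY_RE.search(line)
        if m:
            summaries.append((int(m.group("ports")), int(m.group("nodes"))))
    return per_node, summaries


SAMPLE_LINES = [
    "fm0_sm[12345]: WARN [PmAsyncRcv]: PM: PmPrintFailPort: Unable to Get(PortStatus) "
    "compute004 hfi1_0 Guid 0x0011750101010102 LID 0x1f Port 1",
    "fm0_sm[12345]: WARN [PmEngine]: PM: PmSweepAllPortCounters: "
    "Unable to get 277 Ports on 60 Nodes",
]

if __name__ == "__main__":
    per_node, summaries = summarize_query_failures(SAMPLE_LINES)
    print(per_node.most_common(5), summaries)
```

A node that stays at the top of the tally across many sweeps, while no reboot or maintenance is in progress, is a reasonable starting point for a closer look at its link.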

7.4.1.2.2. Common SM Log Messages

This section provides some of the common SM log messages, what they mean, and what they could indicate.

Node Appearance and Disappearance Messages

An Appearance Message indicates that a node has entered the fabric topology during the previous sweep. Conversely, a Disappearance Message indicates that a node has left the fabric topology during the previous sweep.

When a node leaves the fabric, the disappearance message is often accompanied by additional sweep failure messages.

The disappearance of a node in one sweep followed by an appearance of the same node in the next sweep usually indicates intermittent issues where the node fails to respond in time to the SM and is marked as no longer in the fabric topology until it can successfully respond.

Disappearance Format:

fm0_sm[<PID>]: <FM_NODE>; MSG:NOTICE|SM:<FM_NODE>:port <#>
|COND:#4 Disappearance from fabric
|NODE:<NODE>:port <#>:<GUID>
|LINKEDTO:<NEIGHBOR_NODE>:port <#>:<NEIGHBOR_GUID>
|DETAIL:Node type: <NODETYPE>

Disappearance Example:

fm0_sm[12345]: fm001; MSG:NOTICE|SM:fm001:port 1
|COND:#4 Disappearance from fabric
|NODE:compute002 hfi1:port 1:0x00117501deadbeef
|LINKEDTO:edge002:port 7:0x00117501deadcafe
|DETAIL:Node type: hfi

Appearance Format:

fm0_sm[<PID>]: <FM_NODE>; MSG:NOTICE|SM:<FM_NODE>:port <#>
|COND:#3 Appearance in fabric
|NODE:<NODE>:port <#>:<GUID>
|LINKEDTO:<NEIGHBOR_NODE>:port <#>:<NEIGHBOR_GUID>
|DETAIL:Node type: <NODETYPE>

Appearance Example:

fm0_sm[12345]: fm001; MSG:NOTICE|SM:fm001:port 1
|COND:#3 Appearance in fabric
|NODE:compute002 hfi1:port 1:0x00117501deadbeef
|LINKEDTO:edge002:port 7:0x00117501deadcafe
|DETAIL:Node type: hfi
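The disappearance-then-appearance pattern described above can be picked out of a log programmatically. The following is a rough sketch under stated assumptions: the pattern is built from the formats shown above, it accepts the message either on one line or wrapped before each `|` as printed here, and `NOTICE_RE` and `find_flapping_nodes` are hypothetical names. As a simplification, it pairs a disappearance with any later appearance of the same GUID and does not check that the two occurred in consecutive sweeps.

```python
import re

# Pattern derived from the documented NOTICE format (an assumption, not an official grammar).
NOTICE_RE = re.compile(
    r"\|COND:#(?P<code>\d+) (?P<cond>Appearance in fabric|Disappearance from fabric)\s*"
    r"\|NODE:(?P<node>.+?):port (?P<port>\d+):(?P<guid>0x[0-9a-fA-F]+)\s*"
    r"\|LINKEDTO:(?P<nbr_node>.+?):port (?P<nbr_port>\d+):(?P<nbr_guid>0x[0-9a-fA-F]+)"
)


def find_flapping_nodes(log_text):
    """Return (node, guid, neighbor, neighbor_port) for nodes that left and later rejoined."""
    disappeared = set()
    flapping = []
    for m in NOTICE_RE.finditer(log_text):
        guid = m.group("guid")
        if m.group("cond").startswith("Disappearance"):
            disappeared.add(guid)
        elif guid in disappeared:
            # An appearance of a node that was previously reported as gone.
            disappeared.discard(guid)
            flapping.append(
                (m.group("node"), guid, m.group("nbr_node"), int(m.group("nbr_port")))
            )
    return flapping


SAMPLE_LOG = """\
fm0_sm[12345]: fm001; MSG:NOTICE|SM:fm001:port 1
|COND:#4 Disappearance from fabric
|NODE:compute002 hfi1:port 1:0x00117501deadbeef
|LINKEDTO:edge002:port 7:0x00117501deadcafe
|DETAIL:Node type: hfi
fm0_sm[12345]: fm001; MSG:NOTICE|SM:fm001:port 1
|COND:#3 Appearance in fabric
|NODE:compute002 hfi1:port 1:0x00117501deadbeef
|LINKEDTO:edge002:port 7:0x00117501deadcafe
|DETAIL:Node type: hfi
"""

if __name__ == "__main__":
    print(find_flapping_nodes(SAMPLE_LOG))
```

The neighbor node and port are kept in the result because they name the other end of the link whose cable may be suspect.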

Discovery Failures

Discovery failure messages are often seen at the beginning of the SM sweep, when the FM runs a quick discovery process to build a topology of the fabric for later steps in the sweep.

The message shown in the example below is usually accompanied by additional error messages. The node indicated in the failure is often the upstream port, which is the port the FM can see on the link (usually the switch port on a SuperNIC-to-switch link).

Format:

fm0_sm[<PID>]: WARN [topology]: SM: topology_discovery: unable to setup port[<#>] of node <NAME>, nodeGuid <GUID>, ignoring port!

Example:

fm0_sm[12345]: WARN [topology]: SM: topology_discovery: unable to setup port[15] of node edge002, nodeGuid 0x00117501deadcafe, ignoring port!

Setup Node Failures

A Setup Node Failure is the most common failure to accompany the Discovery Failure message. The failure occurs when the SM sends the initial packet across the wire, does not receive a response packet, and times out. A status code of 7 is the common indicator for a timeout.

Format:

fm0_sm[<PID>]: WARN [topology]: SM: sm_setup_node: Get NodeInfo failed for nodeGuid <GUID> port <#>, via node <NEIGHBOR_NODE> nodeGuid <NEIGHBOR_GUID> port <#>; status=7

Example:

fm0_sm[12345]: WARN [topology]: SM: sm_setup_node: Get NodeInfo failed for nodeGuid 0x00117501deadbeef port 1, via node edge002 nodeGuid 0x00117501deadcafe port 7; status=7
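For illustration, the setup node failure can be parsed to recover both ends of the link and to flag the timeout case. This is a minimal sketch based only on the format shown above; `SETUP_NODE_RE` and `parse_setup_node_failure` are hypothetical names, and treating `status=7` as a timeout follows the description in this section rather than a full list of status codes.

```python
import re

# Pattern derived from the documented format (an assumption, not an official grammar).
SETUP_NODE_RE = re.compile(
    r"sm_setup_node: Get NodeInfo failed for nodeGuid (?P<guid>0x[0-9a-fA-F]+) "
    r"port (?P<port>\d+), "
    r"via node (?P<via_node>.+?) nodeGuid (?P<via_guid>0x[0-9a-fA-F]+) "
    r"port (?P<via_port>\d+); status=(?P<status>\d+)"
)

TIMEOUT_STATUS = 7  # per this section: status 7 commonly indicates a timeout


def parse_setup_node_failure(line):
    """Return the unreachable port, the port it was reached through, and the status."""
    m = SETUP_NODE_RE.search(line)
    if m is None:
        return None
    status = int(m.group("status"))
    return {
        "guid": m.group("guid"),
        "port": int(m.group("port")),
        "via_node": m.group("via_node"),
        "via_guid": m.group("via_guid"),
        "via_port": int(m.group("via_port")),
        "status": status,
        "timed_out": status == TIMEOUT_STATUS,
    }


SAMPLE = (
    "fm0_sm[12345]: WARN [topology]: SM: sm_setup_node: Get NodeInfo failed for "
    "nodeGuid 0x00117501deadbeef port 1, via node edge002 "
    "nodeGuid 0x00117501deadcafe port 7; status=7"
)

if __name__ == "__main__":
    print(parse_setup_node_failure(SAMPLE))
```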

Programming Failures

Programming Failures (such as SCVL_t/nt) can occur intermittently when a node fails to respond during one of the programming phases of the sweep.

In the example shown below, a failure occurred when attempting to program a port. The response could not be completed successfully, and the port was marked down, indicating that the port was not part of the fabric topology.

Format:

fm0_sm[<PID>]: WARN [topology]: SM: sm_initialize_Switch_SCVLMaps: Failed to set SCVL_t Map for node <NODE> nodeGuid <GUID> output port <#>

Example:

fm0_sm[12345]: WARN [topology]: SM: sm_initialize_Switch_SCVLMaps: Failed to set SCVL_t Map for node edge003 nodeGuid 0x00117501cafebeef output port 13
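Because programming failures can be intermittent, counting how often the same switch port reports them can help separate a one-off event from a recurring problem. The sketch below assumes the format shown above; `SCVL_FAIL_RE` and `count_scvl_failures` are hypothetical names, and matching map names other than `SCVL_t` (for example `SCVL_nt`) with the same wording is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the documented format; accepting other SCVL_* map names is an assumption.
SCVL_FAIL_RE = re.compile(
    r"sm_initialize_Switch_SCVLMaps: Failed to set (?P<map>SCVL_\w+) Map "
    r"for node (?P<node>.+?) nodeGuid (?P<guid>0x[0-9a-fA-F]+) "
    r"output port (?P<port>\d+)"
)


def count_scvl_failures(lines):
    """Count programming failures per (node, output port)."""
    counts = Counter()
    for line in lines:
        m = SCVL_FAIL_RE.search(line)
        if m:
            counts[(m.group("node"), int(m.group("port")))] += 1
    return counts


SAMPLE = (
    "fm0_sm[12345]: WARN [topology]: SM: sm_initialize_Switch_SCVLMaps: "
    "Failed to set SCVL_t Map for node edge003 nodeGuid 0x00117501cafebeef output port 13"
)

if __name__ == "__main__":
    # The same line twice stands in for a failure repeated across two sweeps.
    print(count_scvl_failures([SAMPLE, SAMPLE]).most_common())
```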
