Cornelis Technical Documentation

4.4.2. Changing Parameters and Impacts on System Operation and Performance

Parameters that may disrupt a live system should not be changed while applications are running. To minimize disruptions, Cornelis recommends the following steps.

  1. Stop all Fabric Managers.

  2. Move all host-to-switch links to Init, which stops all fabric applications, using either of these methods:

    • Reset all hosts or bounce their fabric links.

    • Reset all switches.

  3. Complete the change to the FM configuration.

  4. Restart all the Fabric Managers.
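
The ordering of the four steps above can be sketched as a small driver. This is a minimal illustration, not a Cornelis tool: the function name `apply_fm_config_change` and the four callables are hypothetical placeholders for whatever site-specific commands stop the FMs, move links to Init, edit the configuration, and restart the FMs.

```python
from typing import Callable, List


def apply_fm_config_change(
    stop_fms: Callable[[], None],
    move_links_to_init: Callable[[], None],
    edit_config: Callable[[], None],
    start_fms: Callable[[], None],
) -> List[str]:
    """Run the four recommended steps in order; return the steps completed.

    All four callables are hypothetical hooks supplied by the caller.
    """
    steps = [
        ("stop all Fabric Managers", stop_fms),
        ("move host-to-switch links to Init", move_links_to_init),
        ("complete the FM configuration change", edit_config),
        ("restart all Fabric Managers", start_fms),
    ]
    completed: List[str] = []
    for name, action in steps:
        # An exception raised here propagates, so later steps never run
        # on top of a failed earlier step.
        action()
        completed.append(name)
    return completed
```

Passing the steps in as callables keeps the ordering logic separate from the site-specific commands, so the sequence can be rehearsed with no-op functions before it is run against a real fabric.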

The following action may be disruptive and should be avoided while applications are running:

  • Increasing LMC, LMCE0
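
To see why an LMC change disturbs LID assignments, the sketch below assumes the InfiniBand-style convention in which each port owns 2^LMC consecutive LIDs and its base LID is aligned to a multiple of 2^LMC. The allocator `assign_base_lids` is purely illustrative and is not the FM's actual assignment algorithm.

```python
from typing import List


def lids_per_port(lmc: int) -> int:
    """Number of LIDs each port owns for a given LMC (assumed 2**LMC)."""
    return 1 << lmc


def assign_base_lids(num_ports: int, lmc: int, first_lid: int = 1) -> List[int]:
    """Illustrative allocator: hand out aligned base LIDs in port order.

    Hypothetical helper; a real FM may assign LIDs differently.
    """
    step = lids_per_port(lmc)
    # Round first_lid up to the next multiple of step so the low LMC bits
    # of every base LID are zero.
    base = -(-first_lid // step) * step
    return [base + i * step for i in range(num_ports)]


print(assign_base_lids(4, 0))  # [1, 2, 3, 4]
print(assign_base_lids(4, 1))  # [2, 4, 6, 8]
```

Under these assumptions, moving from LMC 0 to LMC 1 changes the base LID of every port in the example, which is consistent with treating an LMC increase as disruptive to running applications.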

Changes to the following parameters may disrupt some applications or impact application performance and must be carefully considered on a live fabric:

  • Decreasing LMC, LMCE0

    Some applications could continue to use LIDs whose assignments have changed and may fail or enter recovery and reconnection modes.

  • Changes to PathRecord settings (PathSelection)

    Applications that use PathRecords to select end-to-end addresses may ignore any resulting changes and continue to use the previous settings until they are restarted or re-fetch their PathRecords. The change can alter which paths applications use (applicable only when LMC != 0) and can therefore affect application performance.

  • Activating or deactivating pre-defined pre-enabled vFabrics

    This is intended to be permitted on a live fabric. However, applications attempting to use a deactivated vFabric will fail after the vFabric is deactivated. In general, such applications should be stopped prior to deactivating the given vFabric. Activation of a pre-defined pre-enabled vFabric is typically safe.

  • Changing device groups or applications associated with an active vFabric

    Applications or nodes whose assignments have changed may fail or continue executing with the old PathRecords and therefore not fully obey the change.

  • Changes to an active vFabric's bandwidth, timeouts, priority, or preemption/traffic flow optimization rank

    This will change arbitration and scheduling within the fabric. Application operation should not be impacted; however, the performance and scheduling of fabric bandwidth will change to the new settings in a timely manner. Applications that use PathRecords to compute end-to-end timeouts may ignore any resulting changes in their timeouts and continue to use the previous timeouts until they are restarted or re-fetch their PathRecords. Typically, changes to timeouts are small and have limited impact, but under extreme congestion they could cause unexpected application performance impacts.

  • Changes to FM VL buffer allocation parameters (MinSharedVLMem, DedicatedVLMemMulti, WireDepthOverride, ReplayDepthOverride)

    These parameters tune VL buffering and therefore affect application performance.

  • Changes to parameters that are activated only when a port is bounced

  • Changes to CongestionControl (CongestionControl)

  • Changes to routing algorithm (RoutingAlgorithm, SpineFirstRouting, ForceRebalance, FatTreeTopology, DGShortestPathTopology, AdaptiveRouting)

    These can result in rerouting of the fabric and may briefly disrupt traffic. Depending on whether the new choices suit the fabric better or worse than the previous ones, fabric and application performance can improve or degrade significantly.

  • ForceAttributeRewrite

    This can undo previous adaptive routing decisions in switches and result in a short-term change to application performance.

  • Changing FM security parameters (PreDefinedTopology, VL15CreditRate, SmaSpoofingCheck, McDosThreshold, McDosInterval, McDosAction, SmAppliance)

    In general, accurate changes should not impact existing applications. However, mistakes or tightening of security may disable nodes or applications in the cluster that were running prior to the change. Such changes may be intentional and desired if the goal is to tighten security and impact such applications or nodes.

  • Multicast

    Changes to the Multicast section of the FM parameters may affect the assignment of MLIDs to multicast groups, or the QoS parameters (Mtu, Rate) or security parameters (PKey) associated with a given multicast group. Typically, on FM restart, end nodes must rejoin multicast groups, and if these parameters change there can be minor disruptions of multicast traffic through lost packets. Most multicast applications will ride through these short-term disruptions without error.

  • Changes to fabric timeouts (TimerScalingEnable, SwitchLifetime, HoqLife, VLStallCount)

    On uncongested fabrics, minor changes often do not affect applications. Significant decreases in values may cause congestion mitigation mechanisms in switches to fire sooner, resulting in packet discards and changes to application performance. Significant increases may slow switch mitigation and allow congestion to propagate further and impact application performance.

  • Changes to Link Policies (HFILinkPolicy, ISLLinkPolicy)

    If all active links are currently "in policy" this will have no impact on existing applications. However, mistakes or tightening of link policies may disable nodes or links in the cluster that were running prior to the change. Such changes may be intentional and desired if the goal is to tighten link policies and impact such nodes or links.

  • Changes to SA timeouts (SaRespTime, NoReplyIfBusy)

    This affects how long applications wait for the SA and how the SA handles excessive requests when busy. For most applications the impact is limited, but in extreme situations application behavior and performance with regard to SA queries can change.

  • Changes to FM SubnetSize

    This alters the memory used by the FM for selected buffering and may impact FM performance. In general, there should be no impact on existing applications beyond changes to the FM's responsiveness to PA and SA requests.

  • QueryValidation

    Cornelis recommends leaving this parameter unchanged, because non-compliant OFA applications may stop working if it changes from 0 to 1.

  • LID

    This may change the SM's LID and briefly disrupt application queries to the SA.

Parameters that can safely be changed on a live system without impacting applications are the following:

  • PM/PA parameters (may impact active management and monitoring applications, but will not impact non-management applications running on the cluster).

    This includes adding or removing PM port groups and altering DeviceGroups that are used only as PmPortGroups.

  • FM controls on FM core dumps (CoreDumpLimit, CoreDumpDir)

  • Changes to FM failover and DB Sync parameters (Priority, ElevatedPriority, ConfigConsistencyCheckLevel, MasterPingInterval, MasterPingMaxFail, DbSyncInterval)

    However, such changes may result in a different FM becoming primary or the movement of inconsistent FMs to standby.

  • Changes to FM sweep time, SMA timeouts/retries, SM strategies (VL15FlowControlDisable, SmaBatchSize, MaxParallelReqs, SweepInterval, IgnoreTraps, MaxAttempts, RespTimeout, MinRespTimeout, SweepErrorsThreshold, SweepAbandonThreshold, TrapThreshold, TrapThresholdMinCount, NonRespTimeout, NonRespMaxCount, SwitchCascadeActivateEnable, NeighborNormalRetries)

    This mainly affects SM responsiveness to fabric changes and SM handling of non-responsive nodes, and typically has no application impact. However, if the adjustments leave nodes unable to respond within the expected time frames, the SM could drop some nodes from the fabric.

  • Changes to FM logging (LogLevel, LogMode, SyslogFacility, Debug, RmppDebug, *LogMask*, NodeAppearanceMsg, SmPerfDebug, SaPerfDebug, PortBounceLogLimit)
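
The three categories above lend themselves to a simple pre-change check. The sketch below is a hypothetical helper, not part of the FM: it encodes a representative subset of the parameter names listed in this section and reports the most severe category touched by a proposed set of changes. Treating unlisted parameters as needing review is an assumption made here for caution.

```python
from enum import IntEnum
from typing import Dict, Tuple


class Impact(IntEnum):
    SAFE = 0      # can be changed on a live system
    CONSIDER = 1  # may disrupt or slow some applications
    AVOID = 2     # avoid while applications are running


# Representative subsets of the names listed in this section.
SAFE_PARAMS = {
    "CoreDumpLimit", "CoreDumpDir", "Priority", "ElevatedPriority",
    "DbSyncInterval", "SweepInterval", "MaxAttempts", "RespTimeout",
    "LogLevel", "LogMode", "SyslogFacility", "PortBounceLogLimit",
}
CONSIDER_PARAMS = {
    "PathSelection", "MinSharedVLMem", "CongestionControl",
    "RoutingAlgorithm", "AdaptiveRouting", "ForceAttributeRewrite",
    "SmaSpoofingCheck", "HoqLife", "SwitchLifetime", "HFILinkPolicy",
    "ISLLinkPolicy", "SaRespTime", "SubnetSize", "QueryValidation", "LID",
}


def classify(name: str, old: object, new: object) -> Impact:
    """Classify one parameter change (hypothetical helper)."""
    if name in ("LMC", "LMCE0"):
        # Direction matters: increasing is in the "avoid" list,
        # decreasing is in the "carefully considered" list.
        return Impact.AVOID if int(new) > int(old) else Impact.CONSIDER
    if name in SAFE_PARAMS:
        return Impact.SAFE
    # Listed as CONSIDER, or unknown: assume it needs review.
    return Impact.CONSIDER


def worst_impact(changes: Dict[str, Tuple[object, object]]) -> Impact:
    """Return the most severe category among {name: (old, new)} changes."""
    return max(
        (classify(n, old, new) for n, (old, new) in changes.items()),
        default=Impact.SAFE,
    )
```

For example, `worst_impact({"LogLevel": (2, 3), "LMC": (0, 1)})` evaluates to `Impact.AVOID`, because the LMC increase dominates the otherwise safe logging change.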