Multiprotocol Label Switching Transport Profile Survivability Framework

Nokia Siemens Networks
3 Hanagar St. Neve Ne'eman B
Hod Hasharon 45241
Israel
nurit.sprecher@nsn.com

Old Dog Consulting
adrian@olddog.co.uk

Network survivability is the ability of a network to restore traffic delivery following
disruption or failure of network resources. Survivability is critical to the delivery of
guaranteed network services such as those subject to strict Service Level Agreements (SLAs)
that place maximum bounds on the length of time the service may be degraded or unavailable.

The Transport Profile of Multiprotocol Label Switching (MPLS-TP) is a packet transport
technology based on the MPLS data plane and re-using many aspects of the MPLS management
and control planes. This document provides a framework for the provision of survivability in an MPLS-TP network,
describing recovery elements, types, methods, and topological considerations. Survivability may
be supported by the control plane, the management plane, and by Operations, Administration and Maintenance (OAM)
functions to achieve data plane recovery. This document describes mechanisms for protecting MPLS-TP
Label Switched Paths (LSPs). Detailed consideration for the protection of pseudowires in MPLS-TP
networks is out of scope.

This document is a product of a joint Internet Engineering Task Force (IETF) / International
Telecommunications Union Telecommunications Standardization Sector (ITU-T) effort to include
an MPLS Transport Profile within the IETF MPLS and PWE3 architectures to support the capabilities
and functionalities of a packet transport network as defined by the ITU-T.

This Informational Internet-Draft is aimed at achieving IETF Consensus before publication as an
RFC and will be subject to an IETF Last Call.

[RFC Editor, please remove this note before publication as an RFC and insert the correct Streams
Boilerplate to indicate that the published RFC has IETF Consensus.]

Network survivability is the network's ability to restore traffic delivery following
a failure or degradation of traffic delivery caused by a network fault or an attack on
the network; it plays a critical role in the delivery of reliable services in transport
networks. Guaranteed services in the form of Service Level Agreements (SLAs) require a
resilient network that very rapidly detects facility or node failures, and immediately
starts to restore network operations in accordance with the terms of the SLA.

The MPLS Transport Profile (MPLS-TP) is described in and
. MPLS-TP is designed to be consistent with existing transport
network operations and management models, and to provide survivability mechanisms, such as
protection and restoration. The function provided is intended to be similar to, or better than, that
found in established transport networks, which set a high benchmark for reliability.

This document provides a framework for MPLS-TP-based survivability. It uses the recovery terminology
defined in which draws heavily on , and it refers to the
requirements specified in .

Various recovery schemes (for protection and restoration) and processes have been defined and
analyzed in and . These schemes can also be applied in MPLS-TP
networks to re-establish end-to-end traffic delivery within the agreed service level and to
recover from 'failed' or 'degraded' transport entities (links, nodes, or Label Switched
Paths (LSPs)). Such actions are normally initiated by the detection of a defect or performance
degradation, or by an external request (e.g., an operator request for manual control of protection
switching).

 makes a distinction between protection switching and restoration mechanisms.
Protection switching makes use of pre-assigned capacity between nodes, where the simplest scheme has
one dedicated protection entity for each working entity, while the most complex scheme has m protection
entities shared between n working entities (m:n). Protection switching may be either unidirectional or
bidirectional: unidirectional means that each direction of a bidirectional connection is protection
switched independently, while bidirectional means that both directions are switched at the same
time even if the fault applies to only one direction of the connection. Restoration uses any capacity
available between nodes and usually involves re-routing. The resources used for restoration may be
pre-planned and recovery priority may be used as a differentiation mechanism to determine which
services are recovered and which are not recovered or are sacrificed in order to achieve recovery of
other services. In general, protection actions are completed within time frames of tens of milliseconds,
while automated restoration actions are normally completed in periods ranging from hundreds of milliseconds
to a maximum of a few seconds.

The recovery schemes described in and evaluated in are
presented in the context of control plane-driven actions (such as the configuration of the protection
entities and functions, etc.). The presence of a distributed control plane in an MPLS-TP network is optional,
and the absence of such a control plane does not affect the ability to operate the network and to use MPLS-TP
forwarding, Operations, Administration and Maintenance (OAM), and survivability capabilities.

Thus, some of the MPLS-TP recovery mechanisms do not depend on a control plane and use MPLS-TP OAM mechanisms
or management actions to trigger protection switching across connections that were set up using management plane
configuration. These OAM mechanisms may be triggered by data plane events or by operator actions, and are based
on MPLS-TP OAM fault management functions. 'Fault management' in this context refers to failure detection,
localization, and notification (where the term 'failure' is used to represent both signal failure and
signal degradation). The term 'trigger' is used to indicate any event that may be used to cause an
implementation to consider taking protection action.

The principles of MPLS-TP protection switching operation are similar to those described in
as the protection mechanism is based on the ability to detect certain defects in the transport entities within
the recovery domain. In the context of this document, transport entities are nodes, links, LSP segments,
concatenated LSP segments, and whole LSPs. The protection switching controller does not care which monitoring
method is used, as long as it can be given information about the status of the transport entities within the
recovery domain (e.g., 'OK', signal failure, signal degradation, etc.).

The protection switching operation is basically a data-plane capability, and in the context of MPLS-TP it must
be possible to switch over independently of the way the network is configured and managed. All
the MPLS and GMPLS protection mechanisms are applicable in an MPLS-TP environment, and it should also be possible to
provision and manage the related protection entities and functions defined in MPLS and GMPLS using the management
plane.

In some protection switching schemes (such as bidirectional protection switching), it is necessary to coordinate
the protection state between the edges of the recovery domain. An MPLS-TP Protection State Coordination (PSC)
protocol may be used as an in-band (i.e., data plane-based) control protocol to align both ends of the protected
domain. Control plane-based mechanisms can also be used to synchronize the protection states between the edges of
the protection domain.

The MPLS-TP recovery mechanisms may be applied at various nested levels throughout the MPLS-TP network, as is
the case with the recovery schemes defined in and . An MPLS-TP LSP
may be subject to any or all of MPLS-TP link recovery, path segment recovery, or end-to-end recovery,
where:

o  MPLS-TP link recovery refers to the recovery of an individual link (and hence all or a subset of the LSPs
   routed over the link) between two MPLS-TP nodes.

o  Segment recovery refers to the recovery of an LSP segment (i.e., segment and concatenated segment in the
   language of ) between two nodes which are the boundary nodes of the segment.

o  End-to-end recovery refers to the recovery of an entire LSP from its ingress to its egress node.

Multiple recovery levels may be used concurrently by a single LSP for added resiliency.

Co-routed bidirectional MPLS-TP LSPs are defined such that both directions of the LSP follow the same route
through the network. In this case the directions are often required by the operator to fate-share (that is, if
one direction fails, both directions should cease to operate). This may also be the case for associated bidirectional
LSPs where the two directions of the LSP take different paths through the network. This causes a direct interaction
between the recovery levels affecting the directions of an LSP such that both directions of the LSP are switched to
a new MPLS-TP link, segment, or end-to-end path together.

The recovery scheme operating at the data plane level can function in a multi-domain environment; it can also
protect against a failure of a boundary node in the case of inter-domain operation.

MPLS-TP recovery schemes are intended to protect client traffic as it is sent across the MPLS-TP network. This
document introduces protection and restoration techniques in general terms and then describes how they may be applied
in the LSP layer and in the pseudowire layer to meet the requirements of the MPLS-TP recovery schemes .
A description of the MPLS-TP LSP and pseudowire layers can be found in .

This framework introduces the architecture of the MPLS-TP recovery domain and describes the recovery schemes in MPLS-TP
(based on the recovery types defined in ) as well as the principles of operation, recovery states,
recovery triggers, and information exchanges between the different elements that sustain the reference model. The
reference model is based on the MPLS-TP OAM reference model which is defined in .

The framework also describes the qualitative levels of the survivability functions that can be provided, such as
dedicated recovery, shared protection, restoration, etc. The level of recovery directly affects the service level
provided to the end-user in the event of a network failure. There is a correlation between the level of recovery
provided and the cost to the network.

The general description of the functional architecture is applicable for both LSPs and pseudowires (PWs).

This framework applies to general LSP recovery schemes, but also to schemes that are optimized for specific
topologies in order to handle protection switching in a cost-efficient manner. Recovery schemes for PWs are introduced
in Section 7, but the details are for further study and will be addressed in a separate document.

This document takes into account the need for co-ordination of protection switches at multiple layers. This allows
an operator to prevent races and allows the protection switching mechanism of one layer to fix a problem before switching
at another layer.

This framework also specifies the functions that must be supported by MPLS-TP to support the recovery mechanisms. MPLS-TP
introduces a tool kit to enable recovery in MPLS-TP-based networks and to ensure that affected traffic is recovered in the
event of a failure.

Generally, network operators aim to provide the fastest, most stable, and best protection mechanism at a reasonable
cost according to the requirements of the customers. The higher the levels of protection, the greater the number of resources
consumed and so the higher the likely cost both to the operator and to the customer. It is therefore expected that network
operators will offer a wide spectrum of service levels. MPLS-TP-based recovery offers the flexibility to select the recovery
mechanism, choose the granularity at which traffic is protected, and also choose the specific types of traffic that are to be
protected. With MPLS-TP-based recovery, it is possible to provide different levels of protection for different classes of
service, based on their service requirements.

This document is a product of a joint International Telecommunications Union Telecommunications
Standardization Sector (ITU-T) / IETF effort to include an MPLS Transport Profile within the IETF MPLS
and PWE3 architectures to support the capabilities and functionalities of a packet transport network as
defined by the ITU-T.

The terminology used in this document is consistent with that defined in . That RFC is, itself,
consistent with .

However, certain protection concepts (such as ring protection) are not discussed in , and
for those concepts, terminology in this document is drawn from .

Readers should refer to those documents for normative definitions. This document supplies brief summaries of some
terms for clarity and to aid the reader, but does not re-define terms.

In particular, note the distinction and definitions made in for the following three terms.

o  Protection: re-establishing end-to-end traffic using pre-allocated resources.
o  Restoration: re-establishing end-to-end traffic using resources allocated at the time of need. Sometimes
   referred to as "repair".
o  Recovery: a generic term covering both Protection and Restoration.

Important background information on survivability can be found in ,
, , , and .

In this document, the following additional terminology is applied:

o  Fault Management refers to the combination of failure detection, localization, and notification mechanisms.
o  Failure is used to indicate both signal failure and signal degradation events.
o  Trigger indicates any event that may be used to cause an implementation to consider taking protection action.
o  The acronym OAM is defined as Operations, Administration and Maintenance consistent with .

General terminology for MPLS-TP is found in and . Background
information on MPLS-TP can be found in .

MPLS-TP requirements are presented in and serve as a normative reference for the definition
of all MPLS-TP functions, including survivability. Survivability is presented in as a critical
factor in the delivery of reliable services, and the requirements for survivability are set out using the recovery terminology
defined in .

These requirements are summarized below. Reference numbers refer to the requirements as presented in .
Readers should refer to for the definitive list of requirements which is not replaced or superseded by
the list provided here.

o  Protection and restoration mechanisms must be provided (56).
o  Recovery techniques should be as similar as possible to those in existing transport networks (56A).
o  Point-to-point (P2P) and point-to-multipoint (P2MP) recovery techniques should be the same if possible (56B).
o  Recovery must be applicable to links, transport paths, segments, concatenated segments, and end-to-end LSPs and
   PWs (57).
o  Recovery objectives must be configurable to meet the SLA objectives of the services offered including rapid
   (sub-50ms) recovery, protection of all traffic on a path, and protection across multiple domains (58, 59).
o  The recovery mechanisms should be applicable to any topology (60). See also Section 3.4.
o  Recovery must be coordinated across network layers (61).
o  Recovery and reversion must not 'flap' (62).
o  Note that there is no requirement for support for extra traffic except in a ring where
   MPLS-TP must support the sharing of protection bandwidth in a ring by allowing best-effort traffic (108).
o  The restored and protected paths must be able to share resources (70).
o  Priorities must be available to control the order of restoration and to facilitate preemption during
   restoration (71, 72).
o  Reversion must be supported (73).
o  MPLS-TP data plane protection must operate without regard to payload content (63).
o  The following protection schemes must be supported:
   *  reversion (64).
   *  unidirectional and bidirectional 1+1 protection for P2P (65A, 65B).
   *  unidirectional 1+1 protection for P2MP (65C).
   *  bidirectional 1:n protection for P2P (67A).
   *  unidirectional 1:n protection for P2MP (67B).
o  It must be possible to share protection resources (66). This includes:
   *  1:n mesh recovery should be supported (68).
   *  sharing of resources between protection paths that will not be required to protect the same fault (69).
o  MPLS-TP recovery mechanisms may be optimized for specific topologies provided such optimizations interoperate
   with, and are as similar as possible to, standard techniques to provide end-to-end recovery (91, 100).
o  Ring topologies support must include:
   *  single ring (92)
   *  interconnected rings (93)
   *  connection of rings to arbitrary networks (99)
   *  logical and physical rings (101)
o  Traffic protection in rings must include:
   *  unidirectional and bidirectional P2P paths (94)
   *  unidirectional P2MP paths (95)
o  Ring recovery techniques:
   *  must default to bidirectional (102)
   *  must support reversion as the default behavior (103)
   *  must distinguish (to the operator) trigger mechanisms (104)
   *  should protect against multiple failures (106B)
   *  must support sharing of protection resources (109)
   *  must prevent recovery flapping (107)
o  Ring protection mechanism scaling must include:
   *  1+1 and 1:1 protection switching within 50 ms from the moment of fault detection in a network with
      a 16-node ring with less than 1200km of fiber (96)
   *  independence from the number of LSPs crossing the ring (97)
   *  good performance with increases in the number of transport paths, the number of nodes on the ring,
      and the number of ring interconnects (98)
o  It must be possible to disable protection mechanisms on selected links in a ring (105).
o  MPLS-TP recovery mechanisms in a ring must support prioritization of recovery actions arising from
   different commands or triggers and for different protected entities (106A).
o  Triggers must be supported from:
   *  lower network layers (74)
   *  MPLS-TP OAM (75)
   *  the management plane (76)
   *  the control plane (if present) (78)
o  It must be possible to distinguish trigger sources and to prioritize recovery action requests (77, 79).
o  Support is required for preplanning, pre-calculation, and pre-provisioning of recovery paths and groups
   of paths (80, 81, 82, 85).
o  External commands (controls) must allow the operator to effect, prevent, or test without effecting, any
   recovery operation (83, 84).
o  It must be possible to configure all aspects of recovery (86).
o  It must be possible to monitor all aspects of recovery (87, 88).
o  If a control plane is used, it must be possible to operate all aspects of recovery (89).
o  In-band OAM must support administrative control and protection state coordination (90).

This section presents an overview of the elements of the functional architecture for survivability within an
MPLS-TP network. The intention is to break the components out as separate items so that it can be seen how they
may be combined to provide different levels of recovery to meet the requirements set out in the previous section.

Survivability is achieved through specific actions taken to repair network resources or to redirect traffic onto
paths that avoid failures in the network. Those actions may be triggered automatically by the MPLS-TP
network nodes upon detection of a network failure, or may be under the direct control of an operator.
Automatic action may be enhanced by in-band (i.e., data-plane based) OAM mechanisms for fault management
and performance monitoring, or by in-band or out-of-band control plane signaling.

The survivability behavior of the network as a whole, and the reaction of each LSP when a fault is reported,
may be under operator control. That is, the operator may establish network-wide or local policies that determine
what actions will be taken when different failures are reported that affect different LSPs. At the same time,
when a service request is made to cause the establishment of one or more LSPs in the network, the operator (or
requesting application) may express a required or requested level of service, and this will be mapped to particular
survivability actions taken before and during LSP setup, after the failure of network resources, and upon recovery
of those resources.

It should be noted that it is unusual to present a user or customer with options directly related to recovery actions.
Instead, the user/customer enters into an SLA with the network provider, and the network operator maps the terms of the
SLA (for example for guaranteed delivery, availability, or reliability) onto recovery schemes within the network.

The operator can also be given manual control of survivability actions and events. For example, the operator may
perform the following actions:

o  inhibit survivability actions
o  enable or disable survivability functions
o  induce the simulation of a network fault
o  force a switchover from a working path to a recovery path

Forced switchover may be done for network optimization purposes with minimal disturbance of
services, such as when modifying protected or unprotected services, when replacing MPLS-TP network
nodes, etc. In some circumstances, a fault may be reported to the operator and the operator may
then select and initiate the appropriate recovery action.

Survivability actions may be directly triggered by network failures. That is, the device that detects the failure
(for example, detection of Loss of Light on an optical interface, a failure to receive an OAM Continuity message, or
reception of an OAM Alarm Report) may immediately perform a survivability action. Recall that the term "failure" is
used to represent both signal failure and signal degradation.

This behavior can be subject to management plane or control plane control, but does not require
any control, management or data plane message exchange to trigger the recovery action; the action is
directly triggered by data plane stimuli. Note, however, that coordination of recovery actions between the edges of the recovery domain may
require message exchanges for some qualitative levels of recovery or when performing a bidirectional recovery action.

OAM signaling refers to message exchanges that are in-band or closely coupled to the data channel. Such messages may
be used to detect and isolate faults or indicate a degradation in the operation of the network, but in this context we
are concerned with the use of these messages to control or trigger survivability actions.

OAM signaling may also be used to coordinate recovery actions within the protection domain.

Control plane signaling is responsible for setup, maintenance, and teardown of transport paths that are not under
management plane control. The control plane may also be used to detect, isolate, and communicate network failures pertaining
to peer relationships (neighbor-to-neighbor, or end-to-end). Thus, control plane signaling may initiate and coordinate
survivability actions.

The control plane can also be used to distribute topology and resource-availability information. In this way,
"graceful shutdown" of resources may be effected by withdrawing them, and this can be used as a stimulus to
survivability action in a similar way to the reporting or discovery of a fault as described in the previous sections.

This section describes the elements of recovery. These are the quantitative aspects of recovery; that is, the pieces of the
network for which recovery can be provided.

Note that the terminology in this section is consistent with . Where the terms differ from those in
a mapping is provided.

A span is a single hop between neighboring MPLS-TP nodes in the same network layer. A span is sometimes referred to as
a link although this may cause some confusion between the concept of a data link and a traffic engineering (TE) link. LSPs
traverse TE links between neighboring MPLS-TP nodes in the MPLS-TP network; however, a TE link may be provided by:

o  a single data link
o  a series of data links in a lower layer established as an LSP and presented to the upper layer as a single TE
   link
o  a set of parallel data links in the same layer presented either as a bundle of TE links, or a collection of data
   links that, together, provide a data link layer protection scheme.

Thus, span recovery may be provided by:

o  selecting a different TE link from a bundle
o  moving the TE link so that it is supported by a different data link between the same pair of neighbors
o  re-routing the LSP in the lower layer.

Moving the protected LSP to another TE link between the same pair of neighbors is a form of segment recovery and
is described in Section 4.2.2.

 refers to a span as a "link".

An LSP segment is one or more continuous hops on the path of the LSP. defines two terms. A
"segment" is a single hop on the path of an LSP, and a "concatenated segment" is more than one hop
on the path of an LSP. In the context of this document, a segment covers both of these concepts.

A PW segment refers to a Single Segment PW (SS-PW) or to a single segment of a multi-segment PW (MS-PW) that is
set up between two PE devices (i.e., T-PE and S-PE, S-PE and S-PE, or S-PE and T-PE). As indicated in Section 1, the recovery
of PWs and PW segments is out of scope of this document, but see Section 7.

LSP segment recovery involves redirecting traffic at one end of a segment of an LSP onto an alternate path to the other
end of the segment. According to the required level of recovery (described in Section 4.3), this redirection may be onto a
pre-established LSP segment, through re-routing of the protected segment, or by tunneling the protected LSP through a
"bypass" LSP. For details on recovery mechanisms, see Sections 4.4 and 4.5 below.

Note that protecting an LSP against the failure of a node requires the use of segment recovery, while a link could be
protected using span or segment recovery.

End-to-end recovery is a special case of segment recovery where the protected LSP segment is the whole of the LSP.
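
As a rough illustration, the sketch below treats an LSP as an ordered list of node names, so that the end-to-end case is simply the segment that spans the whole list. It also shows one way a candidate recovery path could be checked for links or transit nodes shared with the working path. The function names, the node names, and the assumption of at most one link between any pair of nodes are hypothetical and are not taken from any MPLS-TP specification.

```python
def segment(path, first, last):
    """Return the part of an LSP path between two of its nodes (inclusive)."""
    i, j = path.index(first), path.index(last)
    return path[i:j + 1]


def links(path):
    """Return the set of undirected links (node pairs) traversed by a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def is_link_diverse(working, recovery):
    """True if the two paths have no link in common."""
    return links(working).isdisjoint(links(recovery))


def is_node_diverse(working, recovery):
    """True if the paths share no transit node and no link.

    The ingress and egress nodes are common to both paths by construction,
    so only the intermediate nodes are compared.
    """
    shared_transit = set(working[1:-1]) & set(recovery[1:-1])
    return not shared_transit and is_link_diverse(working, recovery)


# Hypothetical working path from ingress A to egress Z.
working = ["A", "B", "C", "Z"]

# The segment from ingress to egress is the whole LSP (end-to-end case).
whole_lsp = segment(working, "A", "Z")

# One recovery path avoids every transit node; the other reuses node B.
fully_diverse = ["A", "D", "E", "Z"]
reuses_node_b = ["A", "D", "B", "E", "Z"]
```

Under these assumptions, `reuses_node_b` shares no link with the working path but does share node B, so a failure of B would affect both paths.
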
End-to-end recovery may be provided as link-diverse or node-diverse recovery where the recovery path shares no links
or no nodes with the working path. Note that node-diverse paths are necessarily link-diverse, and that full, end-to-end
node-diversity is required to guarantee recovery.

This section describes the qualitative levels of survivability function that can be provided. The level of recovery
offered has a direct effect on the service level provided to the end-user in the event of a network fault. This will be
observed as the amount of data lost when a network fault occurs, and the length of time to recover connectivity.

In general there is a correlation between the service level (i.e., the rapidity of recovery and reduction of data loss)
and the cost to the network; better service levels require pre-allocation of resources to the recovery paths, and those
resources cannot be used for other purposes if high quality recovery is required. Thus, 'cost' in this case
may be measured as the financial cost of providing resources for the recovery scheme, or the financial loss from dedicating
resources to the recovery scheme such that they cannot be used to draw new revenue.

Sections 6 and 7 of provide a full breakdown of protection and recovery schemes. This section
summarizes the qualitative levels available.

In dedicated protection, the resources for the recovery entity are pre-assigned for use only by the protected service. This
will clearly be the case in 1+1 protection, and may also be the case in 1:1 protection where extra traffic (see Section 4.3.3)
is not supported.

Note that in the use of protection tunnels (see Section 4.4.3) resources may also be dedicated to protecting a
specific service. In some cases (one-for-one protection) the whole of the bypass tunnel may be dedicated to provide
recovery for a specific LSP, but in other cases (such as facility backup) a subset of the resources of the bypass tunnel
may be pre-assigned for use to recover a specific service. However, as described in Section 4.4.3, the bypass tunnel approach
can also be used for shared protection (Section 4.3.2), to carry extra traffic (Section 4.3.3), or without reserving resources
to achieve best-effort recovery.

In shared protection, the resources for the recovery entities of several services are shared.
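
As a rough illustration of what sharing implies, the sketch below models a hypothetical group in which m protection entities are shared by several working paths (1:n when m is 1). The class, its methods, and the LSP names are assumptions made for this example and are not part of any MPLS-TP specification.

```python
from dataclasses import dataclass, field


@dataclass
class SharedProtectionGroup:
    """m protection entities shared by a set of working paths."""
    working: set                                 # names of the protected working paths
    m: int = 1                                   # number of shared protection entities
    in_use: dict = field(default_factory=dict)   # working path -> protection entity index

    def fail(self, path):
        """Report a failure on a working path; return True if it is recovered."""
        if path not in self.working:
            raise ValueError(f"{path} is not protected by this group")
        if path in self.in_use:
            return True  # already switched to a protection entity
        free = set(range(self.m)) - set(self.in_use.values())
        if not free:
            return False  # every shared protection entity is already taken
        self.in_use[path] = min(free)
        return True

    def repair(self, path):
        """Release the protection entity once the working path is back in use."""
        self.in_use.pop(path, None)


# Hypothetical 1:3 group: three working LSPs share one protection entity.
group = SharedProtectionGroup(working={"lsp-a", "lsp-b", "lsp-c"}, m=1)
first = group.fail("lsp-a")    # takes the only protection entity
second = group.fail("lsp-b")   # nothing left while lsp-a is still failed
group.repair("lsp-a")
third = group.fail("lsp-b")    # the entity is free again
```

In this model, a second simultaneous failure in a 1:n group is not recovered until the first working path has been repaired and has released the shared entity.
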
These may be shared as 1:n or m:n, and may be shared on individual links, on LSP segments, on PW
segments, or on end-to-end transport paths (LSP or PW). Note that there is no requirement for m:n
recovery in the list of MPLS-TP requirements documented in .

Where a bypass tunnel is used (Section 4.4.3), the tunnel might not have sufficient resources to simultaneously protect
all of the paths to which it offers protection, so that if they were all affected by network failures at the same time,
they would not all be recovered.

Shared protection is a trade-off between the dedication of expensive network resources to protection
that is not required most of the time, and the risk of unrecoverable services in the event of multiple
network failures. Rapid recovery can be achieved with dedicated protection, but is delayed by
message exchanges in the management, control, or data planes for shared protection. This means that
there is also a trade-off between rapid recovery and the reduction of network cost achieved by
sharing protection resources.

These trade-offs may be somewhat mitigated by:

o  adjusting the value of n in 1:n protection
o  using m:n protection for some value of m > 1
o  establishing new protection paths as each available protection path is put into use.

Network resources allocated for protection represent idle capacity during the time that recovery
is not actually required. These resources can be utilized by carrying other traffic referred to as
"extra traffic". Extra traffic is not supported on dedicated protection resources (Section 4.3.1)
by definition, but can be supported in shared protection (Section 4.3.2) and in tunnel protection
(Section 4.4.3).

When a network resource that is carrying extra traffic is required for recovery, the extra traffic
is disrupted; essentially, it is pre-empted by the recovery LSP. This may require additional
message exchanges in the management, control, or data planes, and that may mean that recovery could
be delayed. Thus the benefits of carrying extra traffic must be weighed against the disadvantage
of delayed recovery, additional network overhead, and the impact to the services the extra traffic
supports.

Note that in MPLS-TP support for extra traffic is not required except in ring topologies (Section 3 and
).

This section refers to LSP restoration and repair. Restoration for PWs is out of scope of this
document (but see Section 7).

Restoration represents the most cost-effective use of network resources as no resources are reserved for protection.
However, restoration requires computation of a new path and activation of a new LSP (through the management or control
plane). These steps can take much more time than is required for recovery using protection techniques.

Furthermore, there is no guarantee that restoration will be able to recover the service. It may be that all suitable
network resources are already in use for other LSPs so that no new path can be found. This problem can be partially mitigated
by the use of LSP setup priorities so that recovery LSPs can pre-empt existing LSPs of low priority.

Additionally, when a network failure occurs, multiple LSPs may be disrupted by the same event.
These LSPs may have been established by different Network Management Stations (NMSs) or signaled by
different head-end MPLS-TP nodes, and this means that multiple points in the network will be trying
to compute and establish recovery LSPs at the same time. This can lead to contention for resources
within the network, causing recovery failures; some recovery actions must then be retried,
resulting in even slower recovery times for some services.

Both hard and soft LSP restoration may be supported. In hard LSP restoration, the resources of the LSP are released before
the full establishment of the recovery LSP (i.e., break-before-make). In soft LSP restoration, the resources of the LSP are
released after the full establishment of an alternate LSP (i.e., make-before-break).Note that the restoration resources may be pre-calculated and even pre-signaled before the restoration action starts,
but not pre-allocated. This is known as pre-planned LSP restoration. The complete establishment/activation of the restoration
LSP occurs only when the restoration action starts. The pre-planning may happen periodically to have the most accurate
information about the available resources in the network.

After a service has been recovered so that traffic is flowing on the recovery LSP, the faulted
network resource may be repaired. The traffic can then be redirected back on to the original working
LSP (called "reversion"), or left where it is on the recovery LSP
("non-revertive" behavior).

It should be possible to specify the reversion behavior of each service, and this might even be
configured for each recovery instance.

In the non-revertive mode an additional operational option exists where protection roles are
switched so that the recovery LSP becomes the working LSP, and the previous working path (or the
resources used by the previous working path) is used for recovery in the event of a further fault.

In revertive mode it is important to prevent excessive swapping between working and recovery
paths in the case of an intermittent defect. This can be addressed by the use of a reversion delay
timer that controls the length of time to wait following the repair of the fault on the original
working path before performing reversion. It should be possible for an operator to configure this
timer per LSP, and a default value should be defined.

The purpose of this section is to describe in general (MPLS-TP non-specific) terms the mechanisms that can be used to
provide protection. As indicated above, while the functional architecture applies to both LSPs and PWs, the mechanism for
recovery described in this document refers to LSPs and LSP segments only. Recovery mechanisms for pseudowires and pseudowire
segments are for further study and will be described in a separate document (see also Section 6).

Link-level protection refers to two paradigms: (1) where the protection is provided in a lower network layer, and (2)
where the protection is provided by the MPLS-TP link layer.

Note that link-level protection mechanisms do not protect the nodes at each end of the entity (e.g., a link or span)
that is protected. End-to-end or segment protection should be used in conjunction with link-level protection to protect
against a failure of the edge nodes.

Link-level protection offers the following levels of protection:

o  Full protection, where a dedicated protection entity (e.g., a link or span) is pre-established to protect a working
   entity. When the working entity fails, the protected traffic is switched onto the protecting entity. In this scenario,
   all LSPs carried over the entity are recovered (in one protection operation) when there is a failure condition. This
   is referred to in as 'bulk recovery'.

o  Partial protection, where only a subset of the LSPs carried over a given entity is recovered when there is a failure
   condition. The decision as to which LSPs will be recovered and which will not depends on local policy.

When there is no failure on the working entity, the protection entity may transport extra traffic which may be preempted
when protection switching occurs.

As with recovery in layered networks, a protection mechanism at the lower layer needs to be coordinated with protection
actions at the upper layer in order to avoid race conditions. In general, this is arranged to allow protection actions to
be performed in the lower layer before any attempt is made to perform protection actions in the upper layer.

A protection mechanism may be provided at the MPLS-TP link layer (which connects two MPLS-TP nodes). Such a mechanism
can make use of the procedures defined in to set up in-band communication channels at the MPLS-TP
link level and use these channels to monitor the health of the MPLS-TP link and coordinate the protection states between
the ends of the MPLS-TP link.

The use of alternate paths and segments refers to the paradigm whereby protection is performed in the same network
layer as the protected LSP either for the entire end-to-end LSP or for a segment of the LSP. In this case, hierarchical
LSPs are not used – compare with Section 4.4.3.

Different levels of protection may be provided:

o  Dedicated protection, where a dedicated entity (e.g., LSP or LSP segment) is fully pre-established to protect a
   working entity (e.g., LSP or LSP segment). When there is a failure condition on the working entity, the traffic is
   switched onto the protection entity. Dedicated protection may be performed using 1:1 or 1+1 protection schemes. When
   the failure condition is eliminated, the traffic may revert to the working entity. This is subject to local
   configuration.

o  Shared protection, where one or more protection entities are pre-established to protect against a failure of one or
   more working entities (1:n or m:n). When the fault condition on the working entity is eliminated, the traffic should
   revert to the working entity in order to allow other related working entities to be protected by the shared
   protection resource.

A protection tunnel is a hierarchical LSP that is pre-provisioned in order to protect against a failure condition
along a network segment, which may affect one or more LSPs that transmit over the network segment.

When there is a failure condition in the network segment, one or more of the protected LSPs are switched over at the
ingress point of the network segment and transmitted over the protection tunnel. This is realized using label
stacking. Label mapping may be an option as well.

Different levels of protection may be provided:

o  Dedicated protection, where the protection tunnel has resource reservations sufficient to provide protection
   for all protected LSPs without service degradation.

o  Shared protection, where the protection tunnel has resources to protect some of the protected LSPs, but not all
   of them simultaneously.

As described in the requirements listed in Section 3 and detailed in , the recovery techniques
used may be optimized for different network topologies if the performance of those optimized mechanisms is significantly
better than the performance of the generic ones in the same topology.

It is required that such mechanisms interoperate with the mechanisms defined for arbitrary topologies to allow
end-to-end protection and to allow consistent protection techniques to be used across the whole network.

This section describes two different topologies and explains how recovery may be markedly different in those different
scenarios. It also introduces the concept of a recovery domain and shows how end-to-end survivability may be achieved
through a concatenation of recovery domains each providing some level of recovery in part of the network.

Linear protection provides a fast and simple protection switching mechanism and fits best in mesh networks. It can
protect against a failure that may happen on a node, a span, an LSP segment, or an end-to-end LSP. Linear protection
provides a clear indication of the protection status.

Linear protection operates in the context of a Protection Domain. A Protection Domain is a special case of a Recovery
Domain that applies to the protection function. A Protection Domain is composed of the following
architectural elements:

o  A set of endpoints which reside at the boundary of the Protection Domain. In the simple case of 1:n or 1+1 P2P
   protection, exactly two endpoints reside at the boundary of the Protection Domain. In each transmission direction one
   of the endpoints is referred to as a source and the other one is referred to as a sink. In the case of unidirectional
   P2MP protection, three or more endpoints reside at the boundary of the Protection Domain. One of the endpoints is
   referred to as source/root and the other ones are referred to as sinks/leaves.

o  A Protection Group which consists of a working (primary) path and one or more recovery (backup) paths which run
   between the endpoints of the Protection Domain. In order to guarantee protection in all situations, a dedicated recovery
   path should be pre-provisioned to protect against a failure of a working path (i.e., 1:1 or 1+1 protection schemes). Also,
   the working and the recovery paths should be disjoint, i.e., the physical routes of the working and the recovery paths
   should have complete physical diversity.

Note that if the resources of the protection path are less than those of the working path, the protection path may not have
sufficient resources to protect the traffic of the working path.

As mentioned in Section 4.3.2, the resources of the protection path may be shared as 1:n. In such a case, the protection
path might not have sufficient resources to simultaneously protect all of the working paths that may be affected by fault
conditions at the same time.

For P2P paths, both unidirectional and bidirectional protection switching are supported. In bidirectional protection switching,
in the event of failure, the recovery actions are taken in both directions (even when the fault is unidirectional). This requires
some level of synchronization of the recovery state between the endpoints of the protection domain.

In unidirectional protection switching, the recovery actions are taken only in the affected direction.

Revertive and non-revertive operations are provided as network operator options.

Linear protection supports the protection schemes described in the following sub-sections:

In the 1:1 scheme, a recovery path is allocated to protect against a failure or degradation in a working path.
As described above, in order to guarantee protection, the recovery entity should support the full capacity and
bandwidth, although it may be degraded relative to the normal working entity.

Figure 1 presents the 1:1 protection architecture. In normal conditions the data traffic is transmitted over the
working entity and the recovery entity is in an idle state. Normal conditions are defined when there is no failure
or degradation on the working entity and there are no administrative configurations or requests that cause traffic
to be transmitted over the recovery entity.

Upon a fault condition (failure or degradation) along the working entity or a specific administrative request,
the traffic is switched over to the recovery entity.

Note that in the non-revertive behavior (see section 4.3.5), data traffic can be transmitted over the recovery
entity also in normal conditions. This can happen after the condition(s) causing the switchover has/have been
cleared.

In each transmission direction, the source of the protection domain bridges the traffic into the appropriate
entity and the sink selects the traffic from the appropriate entity. The source and the sink need to coordinate the
protection states to ensure that the bridging and the selection are done to and from the same entity. For this purpose,
a signaling coordination protocol (either a data-plane in-band signaling protocol or a control-plane based signaling
protocol) is needed.

In bidirectional protection switching, both ends of the protection domain switch to the recovery entity (even when
the fault is unidirectional). This requires a protocol to synchronize the protection state between the two end
points of the Protection Domain.

When there is no failure, the resources of the idle entity may be used for lower-priority traffic. When protection
switching is performed, the lower-priority traffic may be pre-empted by the protected traffic.

In the general case of 1:n linear protection, one recovery entity is allocated to protect n working entities. The
Protection entity might not have sufficient resources to simultaneously protect all of the Working entities that may
be affected by fault conditions at the same time.

In case of failures along multiple working entities, priority should be set as to which entity is protected. The
protection states between the edges of the Protection Domain should be fully synchronized to ensure consistent behavior.
As explained above, revertive behavior is recommended when 1:n is supported.

In the 1+1 protection scheme, a fully dedicated recovery path is allocated.

As depicted in figure 2, data traffic is copied and fed at the source to both the working and the recovery entities.
The traffic on the working and the recovery entities is transmitted simultaneously to the sink of the Protection Domain,
where the selection between the working and recovery entities is made (based on some predetermined criteria).

Note that control traffic between the edges of the Protection Domain (such as OAM or a control protocol to synchronize
the protection state, etc.) may be transmitted on a different entity than the one used for the protected traffic. These
packets should not be discarded by the sink.

In 1+1 unidirectional protection switching there is no need to coordinate the recovery state between the protection
controllers at both ends of the protection domain. In 1+1 bidirectional protection switching, there is a need for a
protocol to coordinate the protection state between the edges of the Protection Domain.

In both protection schemes, when revertive operation is used, traffic is restored to the working entity after the
condition(s) causing the switchover has/have been cleared. To avoid frequent switching in case of intermittent failures
when the network is not stabilized, traffic is not switched back to the working entity before the Wait-to-Restore (WTR)
timer has expired.

Linear protection may be applied to protect a unidirectional P2MP entity using the 1+1 protection architecture. The source/root
MPLS-TP node bridges the user traffic to both the Working and Protection entities. Each sink/leaf MPLS-TP node selects
the traffic from one entity based on some predetermined criteria. Note that when there is a fault condition on one of
the branches of the P2MP path, some leaf MPLS-TP nodes may select the Working entity, while other leaf MPLS-TP nodes
may select traffic from the Protection entity.

In a 1:1 P2MP protection scheme, the source/root MPLS-TP node needs to identify the existence of a fault condition
on any of the branches of the network. This requires the sink/leaf MPLS-TP nodes to notify the source/root MPLS-TP node
of any fault condition. This also requires a return path from the sinks/leaves to the source/root MPLS-TP node.

When protection switching is triggered, the source/root MPLS-TP node selects the recovery transport path to transfer
the traffic.

Note that such a mechanism does not exist and its exact behavior is for further study.

The protection switching may be performed when:

o  A fault condition ('failed' or 'degraded') is declared on the working entity and is not declared
   on the recovery entity. Proactive in-band OAM CC&V (Continuity and Connectivity Verification) monitoring of both the
   working and the recovery entities may be used to enable the fast detection of a fault condition. For protection switching,
   it is common to send a CC&V message every 3.33 ms. In the absence of three consecutive CC&V messages (i.e., after
   roughly 10 ms), a fault condition is declared.
   In order to monitor the working and the recovery entities, an OAM Maintenance Entity should be defined for each of the
   entities. OAM indications of fault conditions should be provided to the edges of the Protection Domain which are
   responsible for the protection switching operation. Input from OAM performance monitoring indicating degradation in the
   working entity may also be used as a trigger for protection switching. In the case of degradation, switching to the
   recovery entity is needed only if the recovery entity can guarantee better conditions.

o  An indication is received from a lower layer server that there is a network failure.

o  An external operator command is received (e.g., 'Forced Switch', 'Manual Switch'). For details
   see Section 6.1.2.

o  A request to switch over is received from the far end. The far end may initiate this request, for example, when it gets
   an administrative request to switch over, or when bidirectional 1:1 protection switching is supported and there was a fault
   that could be detected only by the far end, etc.

As described above, in some cases an attempt should be made to coordinate the protection states between the end points of
the Protection Domain. Control messages should be exchanged between the edges of the Protection Domain to synchronize the
protection state of the edge nodes. The control messages can be delivered using an in-band data-plane driven control protocol
or a control plane based protocol.

In order to achieve 50 ms protection switching it is recommended to use an in-band data-plane driven signaling protocol to
coordinate the protection states. An in-band data-plane PSC (Protection State Coordination) protocol is defined in
for this purpose. This protocol is also used to detect mismatches between the
configuration provisioned at the ends of the Protection Domain.

As described below in section 6.5, GMPLS already defines procedures and message elements to synchronize the protection
states between the edges of the protection domain. These procedures and protocol messages are specified in ,
and . However, these messages lack the capability to synchronize the
revertive/non-revertive behavior and the consistency of configured timers at the edges of the Protection Domain (timers such as
Wait to Restore (WTR), Hold-off timer, etc.).

In order to implement data-plane based linear protection on LSP segments, there is a need to support the MPLS-TP
architectural element PST (Path Segment Tunnel). Maintenance operations (e.g., monitoring, protection or management)
involve the transmission of messages (e.g., OAM, Protection Path Synchronization, etc.) in the maintained domain. According
to the MPLS architecture which is defined in , such messages can be initiated and terminated at the
edges of a path where push and pop operations are enabled. As an exception, these messages may be terminated at an intermediate
node when the TTL value expires. In order to support the option to monitor, protect and manage a portion of an LSP, a new
architectural element, the Path Segment Tunnel (PST), is defined. A Path Segment Tunnel is an LSP which is basically defined and
used for the purposes of OAM monitoring, protection or management of LSP segments. PST makes use of the MPLS construct of
hierarchical nested LSP which is defined in .

For linear protection operation, PSTs should be defined over the working and recovery entities between the edges of a
Protection Domain. OAM and PSC messages can be initiated at the edge of the PST and sent to the peer edge of the PST. Note
that these messages are sent over G-ACH channels within the PST and use a two-label stack, the PST label at the bottom of
stack and the G-ACH label.

The end-to-end traffic of the LSP, including data-traffic and control traffic (OAM, PSC, management and signaling messages)
is tunneled within the PSTs by means of label stacking as defined in .

The mapping between an LSP and a PST can be 1:1 which is similar to the ITU-T Tandem Connection element which defines a sub
layer corresponding to a segment of a path. The mapping can also be 1:n to allow scalable protection of a set of LSP
segments traversing the portion of the network in which a Protection Domain is defined. Note that each of these LSPs can be
initiated or terminated at different endpoints in the network, but they all traverse the Protection Domain and share similar
constraints (such as requirements for QoS, terms of protection, etc.). In the case of 1:n mapping, PSTs can also be referred to
as TE links.

Note that in the context of segment recovery, the PSTs serve as the working and protection entities.

Several Service Providers have expressed a high level of interest in operating MPLS-TP in ring topologies and require
a high level of survivability function in these topologies.

Different criteria for optimization are considered in ring topologies, such as:

o  Simplification of the operation of the Ring in terms of the number of OAM Maintenance Entities that are needed
   to trigger the recovery actions, the number of elements of recovery, the number of management plane transactions
   during maintenance operations, etc.

o  Optimization of resource consumption around the ring, like the number of labels needed for the protection paths
   that cross the network, the total bandwidth needed in the ring to ensure the protection of the paths, etc.

 introduces a list of requirements on ring protection that cover the recovery mechanisms
needed to protect traffic in a single ring and traffic that traverses more than one ring. Note that configuration and
the operation of the recovery mechanisms in a ring must scale well with the number of transport paths, the number
of nodes, and the number of ring interconnects.

The requirements for ring protection are fully compatible with the generic requirements for recovery. The architecture and the mechanisms for ring protection are specified in separate documents. These mechanisms need
to be evaluated against the requirements specified in . The principles for the development of the
mechanisms should be:

o  Reuse existing procedures and mechanisms for recovery in ring topologies as long as their performance is as
   good as new potential mechanisms.

o  Ensure complete interoperability with the mechanisms defined for arbitrary topologies to allow end-to-end
   protection.

Protection and restoration are performed in the context of a recovery domain. A recovery domain is defined between
two or more recovery reference endpoints which are located at the edges of the recovery domain, and it bounds the element
on which recovery can be provided (as described in section 4.2 above). This element can be an end-to-end path, a portion
of a path or a span.

The case of an end-to-end path can be observed as a special case of a portion of a path, and the ingress and the
egress LERs serve as the recovery reference endpoints.

In the simple case of a P2P protected entity, exactly two endpoints reside at the boundary of the Protection Domain.
An LSP can enter the recovery domain through exactly one reference endpoint and exit the recovery domain through another reference endpoint.

In the case of unidirectional P2MP, three or more endpoints reside at the boundary of the Protection Domain. One of the
endpoints is referred to as source/root and the other ones are referred to as sinks/leaves. An LSP can enter the recovery
domain through the root point and exit the recovery domain through the leaf points.

The recovery mechanism should restore traffic interrupted by a facility (link or node) fault within the recovery
domain. Note that a single link may be part of several recovery domains. If two recovery domains have any links in common,
then one recovery domain must be contained within the other. This can be referred to as nested recovery domains. Recovery
domains must not otherwise overlap.

Note that the edges of a recovery domain are not protected and, unless contained in another recovery domain, they form
a single point of failure.
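
The nesting rule for recovery domains can be illustrated with a minimal sketch. It assumes, purely for illustration, that each recovery domain is modeled as a set of link identifiers; the names `relation`, `invalid_pairs`, and the example domains and link labels are hypothetical and are not defined by this framework.

```python
from itertools import combinations


def relation(a: frozenset, b: frozenset) -> str:
    """Classify how two recovery domains (sets of link IDs) relate."""
    if not (a & b):
        return "disjoint"   # no links in common: always acceptable
    if a <= b or b <= a:
        return "nested"     # one domain is contained within the other
    return "overlap"        # shared links without containment: not allowed


def invalid_pairs(domains: dict) -> list:
    """Return the pairs of domain names that share links without nesting."""
    return [
        (name1, name2)
        for (name1, links1), (name2, links2)
        in combinations(sorted(domains.items()), 2)
        if relation(frozenset(links1), frozenset(links2)) == "overlap"
    ]


# Hypothetical example: link IDs are arbitrary labels.
domains = {
    "end_to_end": {"A-B", "B-C", "C-D", "D-E"},
    "segment_1": {"B-C", "C-D"},           # nested inside end_to_end
    "segment_2": {"C-D", "D-E", "E-F"},    # partially overlaps both others
}
print(invalid_pairs(domains))
```

In this example `segment_1` is a valid nested domain, while `segment_2` is reported against both other domains because it shares links with them without being contained in, or containing, either.
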
A recovery group is defined within a recovery domain and it consists of a working (primary) entity and one or more
recovery (backup) entities which reside between the endpoints of the recovery domain. In order to guarantee protection
in all situations, a dedicated recovery entity should be pre-provisioned using disjoint resources in the recovery domain
in order to protect against a failure of a working entity.

The method used to monitor the health of the recovery element is unimportant, provided that the endpoints which are
responsible for the recovery action receive the information on its condition. The condition of the recovery element
may be 'OK', 'failed', or 'degraded'.

When the recovery operation is triggered by an OAM fault management (FM) or performance monitoring (PM) indication, an
OAM Maintenance Entity Group is defined for each of the working and protection entities.

The recovery entities and functions in a recovery domain can be provisioned using a management plane or a control plane.
A management plane may be used to configure the recovery domain by setting the reference points, the working and recovery
entities, and the recovery type (e.g., 1:1 bidirectional linear protection, ring protection, etc.). Additional parameters
associated with the recovery process may also be configured. For more details, see section 6.1.

When a control plane is used, the ingress LERs may communicate with the recovery reference points requesting protection or
restoration across a recovery domain. For details, see section 6.5.

Cases of multiple interconnections between distinct recovery domains actually just create a
hierarchical arrangement of recovery domains as a single top-level recovery domain is created
from the concatenation of two recovery domains that have multiple interconnections. In this case,
recovery actions may be taken both in the individual lower-level recovery domains to protect any
LSP segment that crosses the domain, and within the higher-level recovery domain to protect the
longer LSP segment that traverses the higher-level domain.

In multi-layer or multi-region networking, recovery may be performed at multiple layers or across cascaded recovery
domains.

The MPLS-TP recovery mechanism must ensure that the timing of recovery is coordinated in order to avoid races, and
to allow either the recovery mechanism of the server layer to fix the problem before recovery takes place at the MPLS-TP
layer, or to allow an upstream recovery domain to perform recovery before a downstream domain. In inter-connected rings,
for example, it may be preferable to allow the upstream ring to perform recovery before the downstream ring, in order
to ensure that recovery takes place in the ring in which the failure occurred.

A hold-off timer is required to coordinate the timing of recovery at multiple layers or across cascaded recovery domains.
Setting this configurable timer involves a trade-off between rapid recovery and the creation of a race condition where
multiple layers respond to the same fault, potentially allocating resources in an inefficient manner. Thus, the detection
of a failure condition in the MPLS-TP layer should not immediately trigger the recovery process if the hold-off timer is
set to a value other than zero. The hold-off timer should be started and, on expiry, the recovery element should be checked
to determine whether the failure condition still exists. If it does exist, the defect triggers the recovery operation.

The hold-off timer should be configurable.

In other configurations, where the lower layer does not have a restoration capability, or where it is not expected to
provide protection, the lower layer needs to trigger the higher layer to immediately perform recovery.

Reference should be made to that presents the near-term and practical requirements for network
survivability and hierarchy in current service provider environments.

Where a link in the MPLS-TP network is formed from connectivity (i.e., a packet or non-packet LSP) in a lower layer
network, that connectivity may itself be protected. For example, the LSP in the lower layer network may be provisioned
with 1+1 protection. In this case the link in the MPLS-TP network has an inherited level of protection.

An LSP in the MPLS-TP network may be provisioned with protection in the MPLS-TP network as already described, or it
may be provisioned to utilize only links that themselves have inherited protection.

By classifying the links in the MPLS-TP network according to the level of underlying protection that they have, it is
possible to compute an end-to-end path in the MPLS-TP network that uses only links with a specific or better level of
inherited protection. This means that the end-to-end MPLS-TP LSP can be protected at the level necessary to conform with
the SLA without the need to provide any additional protection in the MPLS-TP layer. This saves complexity and network
resources, and reduces issues of protection switching coordination.

Where the requisite level of inherited protection is not available along the whole path in the MPLS-TP network, it can
be "topped up" using protection in the MPLS-TP layer. Segment protection would be particularly suitable.

It should be noted, however, that inherited protection only applies to links. Nodes cannot be protected in this way.
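
The link classification described above can be sketched as a path computation that simply ignores links below a required level of inherited protection. This is a minimal illustration under stated assumptions: the level names and their ranking in `LEVELS`, the function `find_path`, and the example topology are all hypothetical, and the sketch considers links only, so it says nothing about node failures.

```python
from collections import deque

# Hypothetical ranking of inherited protection levels (higher is better).
LEVELS = {"unprotected": 0, "shared": 1, "dedicated_1_1": 2, "dedicated_1plus1": 3}


def find_path(links, src, dst, min_level):
    """Breadth-first search using only links at or above min_level.

    links: iterable of (node_a, node_b, level_name) bidirectional links.
    Returns a list of nodes from src to dst, or None if no such path exists.
    """
    adjacency = {}
    for a, b, level in links:
        if LEVELS[level] >= LEVELS[min_level]:
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)

    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology: the direct B-D link has no inherited protection.
links = [
    ("A", "B", "dedicated_1plus1"),
    ("B", "D", "unprotected"),
    ("B", "C", "dedicated_1_1"),
    ("C", "D", "dedicated_1plus1"),
]
print(find_path(links, "A", "D", "dedicated_1_1"))
print(find_path(links, "A", "D", "dedicated_1plus1"))
```

When the second call finds no path at the requested level, that is the situation in which protection in the MPLS-TP layer (for example, segment protection) would be needed to "top up" the inherited protection.
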
An operator will need to perform an analysis of the relative likelihood and consequences of node failure if this approach
is taken without providing any protection in the MPLS-TP layer to handle node failure.

When an MPLS-TP protection scheme is established, it is essential that the working and protection paths do not share
resources in the network. If this is not achieved, a single failure may affect both the working and the protection path
with the result that the traffic cannot be delivered – it was, in fact, not protected.

Note that this restriction does not apply to restoration as this takes place after the fault has arisen, meaning that
the point of failure can be avoided.

When planning a recovery scheme it is possible to select paths that use diverse links and nodes within the MPLS-TP network
using a topology map of the network. However, this does not guarantee that the paths are truly diverse. For example, two
separate links in an MPLS-TP network may be provided by two lambdas in the same optical fiber, or by two fibers that cross
the same bridge. And two completely separate MPLS-TP nodes might be situated in the same building with a shared power supply.

Thus, in order to achieve proper recovery planning, the MPLS-TP network must have an understanding of the groups of lower
layer resources that share a common risk of failure. From this, MPLS-TP shared risk groups can be constructed that show
which MPLS-TP resources share a common risk of failure. The working and protection paths can be planned to be not only
node and link diverse, but also to avoid using any resources from the same shared risk groups.

In a layered network a low-layer fault may be detected and reported by multiple layers and may
sometimes give rise to multiple fault reports from the same layer. For example, a failure of a data
link may be reported by the line cards in an MPLS-TP node, but it could also be detected and reported
by the MPLS-TP OAM.

Section 4.6 explains how it is important to coordinate the survivability actions configured and
operated in a multi-layer network to avoid over-equipping the survivability resources in the network,
and to ensure that recovery actions are taken only in one layer at a time.

Fault correlation is about understanding what single event has led to a set of fault reports so
that the recovery actions can be coordinated, and so that the fault logging system does not become
overloaded. Fault correlation depends on an understanding of resource usage at lower layers, shared
risk groups, and a wider view of how the layers are inter-related.

Fault correlation is most easily performed at the point of fault detection. For example, an MPLS-TP
node that receives a fault notification from the lower layer and detects a fault on an LSP in the
MPLS-TP layer can easily correlate these two events. Furthermore, the same node detecting multiple
faults on LSPs using the same faulted data link, can easily correlate these. Such a node may use
the correlation to perform group-based recovery actions, and can reduce the number of alarm events
that it raises to its management station.

Fault correlation may also be performed at a management station that receives fault reports from
different layers and different nodes in the network. This enables the management station to coordinate
management-originated recovery actions, and to present consolidated fault information to the user
and any automated management systems.

There is also a desire to correlate fault information detected and reported through OAM. This
function would enable a fault detected at a lower layer and reported at a transit node of an MPLS-TP
LSP to be correlated with an MPLS-TP layer fault detected at a Maintenance End Point (MEP) (for
example, the egress of the MPLS-TP LSP). Such correlation allows the coordination of recovery actions
taken at the MEP, but it requires that the lower layer fault information is propagated to the MEP,
which is most easily achieved by using a control plane, management plane, or OAM message.

The MPLS-TP network can be viewed as two sub-layers (the MPLS LSP layer and the PW layer). The
MPLS-TP network operates over data link connections and data link networks such that the MPLS-TP
links are provided by individual data links or by connections in a lower layer network. The MPLS
LSP layer is a mandatory part of the MPLS-TP network, and the PW layer is an optional addition to
support specific services.

MPLS-TP survivability provides recovery from failure of the links and nodes in the MPLS-TP network.
The link failures are typically caused by failures in the underlying data link connections and
networks, but this section is only concerned with recovery actions taken in the MPLS-TP network,
which must necessarily be to recover from the manifestation of any problem as a failure in the
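MPLS-TP network.

This section lists which recovery elements (see Section 1) are supported in each of the two layers
to recover from failures of nodes or links in the MPLS-TP network.

+--------------+-------------------------------+-------------------------------+
| Recovery     | MPLS LSP Layer                | PW Layer                      |
| Element      |                               |                               |
+--------------+-------------------------------+-------------------------------+
| Link         | MPLS LSP recovery can be used | The PW layer is not aware of  |
| Recovery     | to survive the failure of an  | the underlying network. This  |
|              | MPLS-TP link.                 | function is not supported.    |
+--------------+-------------------------------+-------------------------------+
| Segment/Span | An individual LSP segment can | For an SS-PW, segment         |
| Recovery     | be recovered to survive the   | recovery is the same as       |
|              | failure of an MPLS-TP link.   | end-to-end recovery. Segment  |
|              |                               | recovery for an MS-PW is for  |
|              |                               | future study, and this        |
|              |                               | function is now provided      |
|              |                               | using end-to-end recovery.    |
+--------------+-------------------------------+-------------------------------+
| Concatenated | A concatenated LSP segment    | Concatenated segment recovery |
| Segment      | can be recovered to survive   | (in an MS-PW) is for future   |
| Recovery     | the failure of an MPLS-TP     | study, and this function is   |
|              | link or node.                 | now provided using end-to-end |
|              |                               | recovery.                     |
+--------------+-------------------------------+-------------------------------+
| End-to-end   | An end-to-end LSP can be      | End-to-end PW recovery can be |
| Recovery     | recovered to survive any node | applied to survive any node   |
|              | or link failure, except for   | (including S-PE) or link      |
|              | the failure of the ingress or | failure, except for the       |
|              | egress node.                  | failure of the ingress or     |
|              |                               | egress T-PE.                  |
+--------------+-------------------------------+-------------------------------+
| Service      | The MPLS LSP layer is service | PW layer service recovery     |
| Recovery     | agnostic. This function is    | requires surviving faults in  |
|              | not supported.                | T-PEs or on ACs. This is      |
|              |                               | currently out of scope for    |
|              |                               | MPLS-TP.                      |
+--------------+-------------------------------+-------------------------------+

Section 6 provides a description of mechanisms for survivability of MPLS-TP LSPs. Section 7 provides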
a brief overview of mechanisms for survivability of MPLS-TP PWs.

This section describes the existing mechanisms available to provide protection of LSPs within MPLS-TP networks, and
highlights areas where new work is required. It is expected that, as new protocol extensions and techniques are developed,
this section will be updated to convert the statements of required work into references to those protocol extensions
and techniques.

As described above, a fundamental requirement of MPLS-TP is that recovery mechanisms should be capable of functioning in
the absence of a control plane. Recovery may be triggered by MPLS-TP OAM fault management functions or by external requests
(e.g., an operator request for manual control of protection switching).

The management plane may be used to configure the recovery domain by setting the reference end points (which
control the recovery actions), the working and the recovery entities, and the recovery type (e.g., 1:1 bidirectional
linear protection, ring protection, etc.).

Additional parameters associated with the recovery process (such as WTR and hold-off timers, revertive/non-revertive
operation, etc.) may also be configured.

In addition, the management plane may initiate manual control of the recovery function. A priority should be set
between fault conditions and operator requests.

Since provisioning the recovery domain involves the selection of a number of options, mismatches may occur at the
different reference points. The MPLS-TP OAM PSC (Protection State Coordination) which is specified in
may be used as an in-band (i.e., data plane-based) control protocol to coordinate
the protection states between the endpoints of the recovery domain and to check consistency of configured parameters (such
as timers, revertive/non-revertive behavior, etc.).

It should also be possible for the management plane to monitor the recovery status.

In order to implement the protection switching mechanisms, the following entities and information should be configured
and provisioned:

o  The endpoints of a recovery domain. As described above, these endpoints bound the element to which
recovery is applied.

o  The protection group which, depending on the required protection scheme, consists of a recovery entity and one or
more working entities. In 1:1 or 1+1 P2P protection, in order to guarantee protection, the paths of the working entity
and the recovery entity should have complete physical diversity.

o  As defined in Section 4.5.2, in order to implement data-plane based LSP segment recovery, there is a need to support
the MPLS-TP architectural element PST (Path Segment Tunnel), since related control messages (e.g., for OAM, Protection Path
Synchronization, etc.) can be initiated and terminated at the edges of a path where push and pop operations are enabled.
A PST is an end-to-end LSP which corresponds in this context to the recovery entities (working and protection) and makes use
of the MPLS construct of hierarchical nested LSP which is defined in . OAM and PSC messages can be
initiated at the edge of the PST and sent to the peer edge of the PST, over G-ACH. There is a need to configure the related
PSTs and map between the LSP segment(s) being protected and the PST. The mapping can be 1:1 or 1:N to allow scalable
protection of a set of LSP segments traversing the portion of the network in which a Protection Domain is defined.
Note that each of these LSPs can be initiated or terminated at different endpoints in the network, but they all traverse the
Protection Domain and share similar constraints (such as requirements for QoS, terms of protection, etc.).

o  The protection type (e.g., unidirectional 1:1, bidirectional 1+1, etc.) should be defined.

o  Revertive/non-revertive behavior should be configured.

o  Timers (such as WTR, hold-off timer, etc.) should be set.

The following external, manual commands may be provided for manual control of the protection switching operation. These
commands apply to a protection group and they are listed in descending order of priority:

o  Blocked protection action – a manual command to prevent data traffic from switching to the recovery entity.
This command actually disables the protection group.

o  Force protection action – a manual command that forces a switch of normal data traffic to the recovery entity.

o  Manual protection action – a manual command that forces a switch of data traffic to the recovery entity when
there is no failure in the working or the recovery entity.

o  Clear switching command – the operator may request that a previous administrative switch-over command (manual or
force switch) be cleared.

Fault detection is a fundamental part of recovery and survivability. In all schemes except for some forms of 1+1
protection, the necessary actions for recovery of traffic delivery rely on discovering that there is some kind of fault.

Faults may be detected in a number of ways depending on the traffic pattern and the underlying hardware. End-to-end
faults may be reported by the application or by knowledge of the application's data pattern, but this is an unusual
approach. There are two more common mechanisms for detecting faults in the MPLS-TP layer:

o  faults reported by the lower layers

o  faults detected by protocols within the MPLS-TP layer.

In an IP/MPLS network, the second of these may utilize control plane protocols (such as the routing protocols) to
detect a failure of adjacency between neighboring nodes. In an MPLS-TP network, there is no certainty that a control
plane will be present. Even if a control plane is present, it will be a GMPLS control plane
that makes a logical separation between control channels and data channels with the result that no conclusion about
the health of a data channel can be drawn from the failure of an associated control channel. MPLS-TP layer faults are,
therefore, only detected through the use of OAM protocols as described in Section 6.4.1.

Faults may, however, be reported by lower layers. These generally show up as interface failures or link failures
within the MPLS-TP network. For example, an underlying optical link may detect loss of light and report a failure
of the MPLS-TP link that uses it. Alternatively, an interface card failure may be reported to the MPLS-TP layer.

Such failures will only be reported after link level protection has been attempted (Section 4.6.1) and it is important
that any lower layer recovery actions are coordinated with the MPLS-TP recovery actions (Section 4.6).

Faults reported by lower layers are only visible at specific nodes within the MPLS-TP network (i.e., at the adjacent
end-points of the MPLS-TP link). This only allows recovery to be performed locally.

In order that recovery can be performed by nodes that are not immediately local to the fault, the fault must be
reported (Sections 6.4.3 and 6.5.4).

If an MPLS-TP node detects that there is a fault in an LSP (that is, not a network fault reported from a lower layer,
but a fault detected by examining the LSP) it can immediately perform a recovery action. However, unless the location
of the fault is known, the only practical options are:

o  perform end-to-end recovery

o  perform some other recovery as a speculative act.

Since speculative acts are not guaranteed to achieve the desired results and could be costly, and since end-to-end
recovery is a costly option, it is important to be able to isolate the fault.

Fault isolation may be achieved by dividing the network into protection domains. End-to-end protection is thereby
operated on LSP segments, depending on the domain in which the fault is discovered. This requires that the LSP can
be monitored at the domain edges.

Alternatively, a proactive mechanism of fault isolation through OAM (Section 6.4.2) or through the control plane
(Section 6.5.3) is required.

MPLS-TP provides a comprehensive set of OAM tools for fault management and performance monitoring at different nested
levels (end-to-end, a portion of a path (LSP or PW), and at the link level).

These tools support proactive and on-demand fault management (for fault detection and fault localization) and
performance monitoring (to measure the quality of the signals and detect degradation).

To support fast recovery, it is useful to use some of the proactive tools to detect fault conditions (e.g., link/node
failure or degradation) and trigger the recovery action.

The MPLS-TP OAM messages run in-band with the traffic and support unidirectional and bidirectional P2P paths as well
as P2MP paths.

As described in , MPLS-TP OAM operates in the context of a Maintenance Entity
which bounds the OAM responsibilities and represents the portion of a path between two points which is being monitored
and maintained, and in which OAM messages are exchanged. refers also to a
Maintenance Entity Group (MEG), which is a collection of one or more MEs that belong to the same transport path
(e.g., P2MP transport path) and that are maintained and monitored as a group.

An ME includes two MEPs (Maintenance Group End Points) which reside at the boundaries of an ME, and a set of zero or
more MIPs (Maintenance Group Intermediate Points) which reside within the Maintenance Entity along the path. A MEP is
capable of initiating and terminating OAM messages, and as such can only be located at the edges of a path where push and
pop operations are supported. In order to define an ME over a portion of a path, there is a need to support the MPLS-TP
architectural element PST (Path Segment Tunnel). PST is an end-to-end LSP which corresponds in this context to the ME
and makes use of the MPLS construct of hierarchical nested LSP which is defined in . OAM messages
can be initiated at the edge of the PST and sent to the peer edge of the PST, over G-ACH.

There is a need to configure the related PSTs and map between the LSP segment(s) being monitored and the PST. The mapping
can be 1:1 or 1:N to allow scalable operation. Note that each of these LSPs can be initiated or terminated at different
endpoints in the network and share similar constraints (such as requirements for QoS, terms of protection, etc.).

In the context of recovery where MPLS-TP OAM is supported, an OAM Maintenance Entity Group is defined for each of the
working and protection entities.

A MIP is capable of reacting to OAM messages.

MPLS-TP OAM tools may be used proactively to detect the following fault conditions between MEPs:

o  Loss of continuity and misconnectivity – the proactive Continuity Check (CC) function is used to detect
loss of continuity between two MEPs in an MEG. The proactive Connectivity Verification (CV) function allows a sink MEP to detect
a misconnectivity defect (e.g., mismerge or misconnection) with its peer source MEP when the received packet carries
an incorrect ME identifier. For protection switching, it is common to send a CC&V (Continuity & Connectivity
Verification) message every 3.33 ms. In the absence of three consecutive CC&V messages, Loss of Continuity is
declared and locally notified to the edge of the recovery domain to trigger a recovery action. In some cases, when
a slower recovery time is acceptable, it is also possible to lengthen the transmission interval.

o  Signal degradation – notification from the OAM performance monitoring indicating degradation in the working
entity may also be used as a trigger for protection switching. In the case of degradation, switching to the recovery
entity is needed only if the recovery entity can guarantee better conditions. Degradation can be measured by
proactively activating MPLS-TP OAM packet loss measurement or delay measurement.

A MEP can receive a Remote Defect Indication from its sink MEP and locally notify the endpoint of the
recovery domain of the fault condition to trigger the recovery action.

MPLS-TP provides OAM tools to isolate a fault and determine exactly where it has occurred. It is often the case
that fault detection only takes place at key points in the network (such as at LSP end points, or MEPs). This means that
the fault may be located anywhere within a segment of the LSP concerned. Finer granularity of information is needed to
implement optimal recovery actions or to diagnose the fault. On-demand tools like trace-route, loopback and on-demand
CC&V can be used to isolate a fault.

The information may be locally notified to the endpoint of the recovery domain to allow it to implement an optimal
recovery action. This may be useful in case of re-calculation of a recovery path.

The information should also be reported to network management for diagnostic purposes.

The endpoints of a recovery domain should be able to report to network management the fault conditions detected in
the recovery domain.

In addition, a node within a recovery domain detecting a fault condition should also be able to report the fault
condition to network management. Network management should be capable of correlating the fault reports and identifying
the source of the fault.

MPLS-TP OAM tools support a function where an intermediate node along a path can send an alarm report message to the MEP
indicating a fault condition in the server layer connecting it to its adjacent node. The purpose of this capability is
to allow a MEP to suppress alarms that may be generated as a result of the failure condition in the server layer.

As described above, in some cases (such as in bidirectional protection switching, etc.) there is a need to coordinate
the protection states between the edges of the recovery domain. defines
procedures and protocol messages and elements to support the PSC (Protection State Coordination) function.

The protocol is also used to signal administrative requests (e.g., manual switch, etc.) when these are provisioned
only at one edge of the recovery domain.

The protocol also allows the detection of mismatches between the configurations provisioned at the ends of the Protection Domain
(such as timers, revertive/non-revertive behavior).

The GMPLS control plane has been proposed as the control plane for MPLS-TP . Since GMPLS
was designed for use in transport networks, and has been implemented and deployed in many networks, it is not surprising
that it contains many features to support a high level of survivability function.

The signaling elements of the GMPLS control plane utilize extensions to the Resource Reservation Protocol (RSVP) as
documented in a series of documents commencing with and , but based on
and . The architecture for GMPLS is provided in ,
and gives a functional description of the protocol extensions needed to support GMPLS-based
recovery (i.e., protection and restoration).

A further control plane protocol called the Link Management Protocol (LMP) is part of the
GMPLS protocol family and can be used to coordinate fault isolation and reporting.

Clearly, the control plane techniques described here only apply where an MPLS-TP control plane is deployed and operated.
All mandatory survivability features must be enabled even in the absence of the control plane, but where the control plane
is present it may provide alternative mechanisms that may be desirable by virtue of their ease of automation or richer
feature-set.

The control plane is not able to detect data plane faults. However, it does provide mechanisms to detect control
plane faults and these can be used to deduce data plane faults where it is known that the control and data planes
are fate sharing. Although specifies that MPLS-TP must support an out-of-band control channel,
it does not insist that this is used exclusively. That means that there may be deployments where an in-band (or at
least in-fiber) control channel is used. In this case, the failure of the control channel can be used to infer a
failure of the data channel or at least to trigger an investigation of the health of the data channel.

Both RSVP and LMP provide a control channel "keep-alive" mechanism (called the Hello message in both
cases). Failure to receive a message in the configured/negotiated time period indicates a control plane failure.
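
The keep-alive rule above can be sketched as follows. This is a minimal illustration only: the HelloMonitor class, its method names, and the interval values are hypothetical and do not model the actual RSVP or LMP Hello formats or negotiation.

```python
from dataclasses import dataclass


@dataclass
class HelloMonitor:
    """Tracks Hello arrivals on one control channel (hypothetical helper)."""

    dead_interval: float      # seconds without a Hello before declaring failure
    last_hello: float = 0.0   # timestamp of the most recently received Hello

    def hello_received(self, now: float) -> None:
        # Record the arrival time of a Hello from the neighbor.
        self.last_hello = now

    def control_channel_failed(self, now: float) -> bool:
        # True once no Hello has been seen within the configured period.
        return (now - self.last_hello) > self.dead_interval


monitor = HelloMonitor(dead_interval=0.5)
monitor.hello_received(now=10.0)
print(monitor.control_channel_failed(now=10.3))  # False
print(monitor.control_channel_failed(now=10.6))  # True
```

Such a timeout indicates a control plane failure only; it says nothing about the data channel unless the control and data planes are known to be fate sharing.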
GMPLS routing protocols ( and ) also include keep-alive mechanisms designed
to detect routing adjacency failures and, although these keep-alive mechanisms tend to operate at a relatively low frequency
(on the order of seconds), it is still possible that the first indication of a control plane fault will be through the routing
protocol.

Note, however, that care must be taken to check that the failure is not caused by a problem with the control plane software or processor
component at the far end of a link.

Because of the various issues involved, it is not recommended that the control plane be relied upon as the primary
mechanism for fault detection in an MPLS-TP network.

The control plane may be used to initiate and coordinate testing of links, LSP segments, or whole LSPs. This is important
in some technologies where it is necessary to halt data transmission while testing, but may also be useful where testing
needs to be specifically enabled or configured.

LMP provides a control plane mechanism to test the continuity and connectivity (and naming) of individual links. A
single management operation is required to initiate the test at one end of the link, and LMP handles the coordination
with the other end of the link. The test mechanism for an MPLS packet link relies on the LMP Test message inserted into
the data stream at one end of the link and extracted at the other end of the link. This mechanism need not be disruptive
to data flowing on the link.

Note that a link in LMP may in fact be an LSP tunnel used to form a link in the MPLS-TP network.

GMPLS signaling (RSVP) offers two mechanisms that may also assist with testing for faults. First,
defines the Admin_Status object that allows an LSP to be set into "testing mode". The interpretation of this
mode is implementation specific and could be documented more precisely for MPLS-TP. The mode sets the whole LSP into a
state where it can be tested; this need not be disruptive to data traffic.

The second mechanism provided by GMPLS to support testing is described in . This protocol
extension supports the configuration (including enabling and disabling) of OAM mechanisms for a specific LSP.

Fault isolation is the process of determining exactly where a fault has occurred. It is often the case that fault detection
only takes place at key points in the network (such as at LSP end points, or MEPs). This means that the fault may be located
anywhere within a segment of the LSP concerned.

If segment or end-to-end protection is in use, this level of information is often sufficient to repair the LSP. However,
if a finer granularity of information is needed (either to implement optimal recovery actions or to diagnose the fault), it is
necessary to isolate the fault more closely.

LMP provides a cascaded test-and-propagate mechanism specifically designed for this purpose.

GMPLS signaling uses the Notify message to report faults. The Notify message can apply to a single LSP or can carry fault
information for a set of LSPs to improve the scalability of fault notification.

Since the Notify message is targeted at a specific node, it can be delivered rapidly without requiring hop-by-hop processing.
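
To illustrate how a single targeted notification can carry fault information for a set of LSPs, the sketch below groups faulted LSPs by the node that should be notified. The function name, the LSP identifiers, and the node names are hypothetical, and no actual RSVP-TE message encoding is implied.

```python
from collections import defaultdict


def build_notifications(faulted_lsps, notify_target):
    """Return {target_node: sorted LSP ids}, one entry per notification to send.

    faulted_lsps  -- iterable of identifiers of the LSPs affected by a fault
    notify_target -- mapping from LSP identifier to the node to be notified
    """
    grouped = defaultdict(list)
    for lsp in faulted_lsps:
        # LSPs sharing a target node are reported together in one message.
        grouped[notify_target[lsp]].append(lsp)
    return {node: sorted(lsps) for node, lsps in grouped.items()}


targets = {"lsp-1": "PE-A", "lsp-2": "PE-A", "lsp-3": "PE-B"}
print(build_notifications(["lsp-1", "lsp-2", "lsp-3"], targets))
# {'PE-A': ['lsp-1', 'lsp-2'], 'PE-B': ['lsp-3']}
```

Here three faulted LSPs result in two notifications rather than three, which is the scalability benefit described above.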
It can be targeted at LSP end-points, or at segment end-points (such as MEPs). The target points for Notify messages can be
manually configured within the network or may be signaled as the LSP is set up. This allows the process to be made consistent
with segment protection and the concept of Maintenance Entities.

GMPLS signaling also provides a slower mechanism for reporting individual LSP faults on a hop-by-hop basis using
the PathErr and ResvErr messages. provides a mechanism to coordinate alarms and other event or fault information through GMPLS
signaling. This mechanism is useful to understand the status of the resources used by an LSP and to help understand why an LSP
is not functioning, but it is not intended to replace other fault reporting mechanisms.

GMPLS routing protocols and are used to advertise link availability and
capabilities within a GMPLS-enabled network. Thus, the routing protocols can also provide indirect information about network
faults. That is, the protocol may stop advertising or withdraw the advertisement for a failed link, or may advertise that the
link is about to be shut down gracefully. This mechanism is, however, not normally considered to be fast enough to be used
as a trigger for protection switching.

Fault coordination is an important feature for certain protection mechanisms (such as
bidirectional 1:1 protection). The use of the GMPLS Notify message for this purpose is
described in ; however, specific message field values remain to be
defined for this operation.

A further piece of work is needed to allow control and configuration of reversion behavior
for end-to-end and segment protection, and the coordination of timer values.

It should not be forgotten that protection and recovery depend on the establishment of
suitable LSPs. The management plane may be used to set up these LSPs, but the control plane
may be used if it is present.

Several protocol extensions exist to make this process simpler: provides features in support of end-to-end protection
switching. describes how to establish a single, segment protected
LSP. Note that end-to-end protection is a sub case of segment protection and
can be used also to provide end-to-end protection. allows one LSP to be signaled with a request that its
path excludes specified resources (links, nodes, SRLGs). This allows a disjoint protection
path to be requested, or a recovery path to be set up avoiding failed resources.

Lastly, it should be noted that provides an overview of the
GMPLS techniques available to achieve protection in multi-domain environments.

The pseudowire is one of the clients of the MPLS LSP layer of MPLS-TP. It is viewed as a layer of the
MPLS-TP network. Pseudowires provide end-to-end connectivity over the MPLS-TP network and may be
comprised of a single pseudowire segment, or multiple segments "stitched" together to provide
end-to-end connectivity.

The pseudowire may, itself, require a level of protection in order to meet the guarantees or service
level of its SLA. This protection could be provided by the MPLS-TP LSPs that support the pseudowire, or could be a feature of the pseudowire layer itself.

As indicated above, the functional architecture described in this document applies to both LSPs and pseudowires. However, the
recovery mechanisms for pseudowires are for further study and will be defined in a separate document in the PWE3 working group.

MPLS-TP PWs are carried across the network inside MPLS-TP LSPs. Therefore, an obvious way to protect a PW is to protect
the LSP that carries it. Such protection can take any of the forms described in this document. The choice of recovery scheme
will depend on the speed of recovery necessary and the traffic loss that is acceptable for the SLA that the PW is providing.

If the PW is a multi-segment PW, then LSP recovery can only protect the PW on individual segments. That is, LSP recovery
cannot protect against a failure of a PW switching point (an S-PE), nor can it protect more than one segment at a time since
the LSP tunnel is terminated at each S-PE. In this respect, the LSP protection of a PW is very much like the link-level
protection offered to the MPLS-TP LSP layer by an underlying network layer (see Section 4.6).

Recovery in the PW layer can be provided simply by running separate PWs end-to-end. Other recovery
mechanisms in the PW layer, such as segment or concatenated segment recovery, or service-level
recovery involving survivability of T-PE or AC faults, are for future study in a separate document.

As with any recovery mechanism, it is important to coordinate between layers. This coordination is necessary to ensure
that recovery mechanisms are only actioned in one layer at a time (that is, the recovery of an underlying LSP needs to be
coordinated with the recovery of the PW itself), and to make sure that the working and protection PWs do not both use the
same MPLS resources within the network (for example, by running over the same LSP tunnel – compare with Section 4.6.2).

TBD

TBD

This informational document makes no requests for IANA action.

Thanks for useful comments and discussions to Italo Busi, David McWalter, Lou Berger, Yaacov Weingarten,
Stewart Bryant, and Dan Frost.

Resource ReserVation Protocol -- Version 1 Functional Specification
RSVP-TE: Extensions to RSVP for LSP Tunnels
Generalized Multi-Protocol Label Switching (GMPLS) Signaling Functional Description
Generalized Multi-Protocol Label Switching (GMPLS) Signaling Resource ReserVation Protocol-Traffic Engineering (RSVP-TE) Extensions
Generalized Multi-Protocol Label Switching (GMPLS) Architecture
IS-IS Extensions in Support of Generalized Multi-Protocol Label Switching (GMPLS)
The Link Management Protocol (LMP)
Recovery (Protection and Restoration) Terminology for Generalized Multi-Protocol Label Switching (GMPLS)
Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based Recovery Mechanisms (including Protection and Restoration)
Recovery (Protection and Restoration) Terminology for Generalized Multi-Protocol Label Switching (GMPLS)
GMPLS Segment Recovery
IS-IS Extensions in Support of Generalized Multi-Protocol Label Switching (GMPLS)
Joint Working Team (JWT) Report on MPLS Architectural Considerations for a Transport Profile
Generic Protection Switching – Linear trail and subnetwork protection
Types and Characteristics of SDH Network Protection Architectures
Requirements of an MPLS Transport Profile
MPLS Generic Associated Channel
A Framework for MPLS in Transport Networks
Requirements for OAM in MPLS Transport Networks
A Framework for MPLS in Transport Networks
Multiprotocol Label Switching Architecture
Network Hierarchy and Multilayer Survivability
Framework for Multi-Protocol Label Switching (MPLS)-based Recovery
Generalized Multiprotocol Label Switching (GMPLS) Recovery Functional Specification
GMPLS – Communication of Alarm Information
RSVP-TE Extensions in Support of End-to-End Generalized Multi-Protocol Label Switching (GMPLS) Recovery
Exclude Routes – Extension to Resource ReserVation Protocol-Traffic Engineering (RSVP-TE)
Analysis of Inter-Domain Label Switched Path (LSP) Recovery
MPLS-TP Linear Protection
OAM Configuration Framework and Requirements for GMPLS RSVP-TE
MPLS-TP Linear Protection
A Thesaurus for the Terminology used in Multiprotocol Label Switching Transport Profile (MPLS-TP) drafts/RFCs and ITU-T's Transport Network Recommendations