An Introduction to UPS Redundancy
As mission critical designers and engineers, we are often asked to explain the concept of redundancy. After all, a system’s redundancy can have a major effect on availability, reliability, maintainability, and total cost of ownership. Often, one of the first design decisions made relates to redundancy. In recent years the waters have been muddied with terms like “distributed redundancy”, “3 to make 2”, and “catcher systems”. In this article we will explain the different levels of redundancy and how they affect other system characteristics.
Before we get started, let’s define two terms that will be used throughout this article:
UPS Module or UPM - a UPM can be standalone (figure a) or parallel with other UPMs (figure b) for capacity or redundancy. When paralleled it is connected to a common output bus and shares controls with the parallel modules.
UPS System or UPS - a UPS is comprised of either a single UPM (figure a) or multiple, parallel UPMs (figure b). Figure (c) shows two UPS systems, each comprised of a single UPM. A UPS system does not share controls or output buses with other UPS systems.
Notice in figure (b), the UPMs are designated as A1 and A2, as if to say they are modules 1 and 2 of system ‘A’. Similarly, in figure (c), the two different UPS systems are designated as ‘A’ and ‘B’. While these designations are not universal, they are very common and will be used for the purposes of this article.
UPS designs have evolved as the importance of data centers has grown. In the beginning, the concern with UPS power was to protect the critical load from power outages, sags, swells, and other power anomalies. A single UPS system bridged the gap between the moment utility power failed and the moment a generator started and took over powering the UPS and its critical load. This was a non-redundant UPS design. It met the basic requirement of protecting the load from utility issues and little more. If the UPS failed, the load was dropped.
Relative to other levels of redundancy, this is designated as N+0, where N represents the capacity needed to power the critical load and ‘+0’ indicates there is no spare capacity beyond it. Figure (d) below shows a basic, non-redundant UPS system.
Module Redundancy, N+1
Figure (e) shows an N+1 redundant UPS system. In this arrangement a single UPS system, comprised of two parallel modules connected to a common system output, powers all the load. Each module is rated for N capacity, and the modules share the load. If the load remains at or below N, the system is redundant. The ‘+1’ indicates there is one more module than needed to power the load. However, if the load increases above N, the system will be non-redundant. If there were a third module, A3, installed, then the system would be N+2 redundant when the load was at or below N. The system capacity of these types of systems is governed by the common system output bus rating, not the sum of the module capacities. Site operators must manage the load and ensure the total load does not exceed the redundant capacity of ‘N.’ N+1 or N+2 redundancy is sometimes referred to as ‘module redundancy’, as opposed to ‘system redundancy’ described next.
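The module-redundancy rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a sizing tool: the function name `spare_modules`, its parameters, and the kW figures in the usage lines are our own assumptions, and it assumes identical modules that share the load equally.

```python
import math

def spare_modules(load_kw, module_kw, module_count, bus_kw=None):
    """Return the 'x' in N+x: how many modules can fail with the load still carried.

    A result of 0 means the system is non-redundant. A negative result means
    the installed modules cannot carry the load at all. The optional bus_kw is
    the common output bus rating, which caps the system capacity.
    """
    if bus_kw is not None and load_kw > bus_kw:
        raise ValueError("load exceeds the common output bus rating")
    # Modules required to carry the load (at least one module must be running).
    modules_needed = max(1, math.ceil(load_kw / module_kw))
    return module_count - modules_needed

# Hypothetical figures: two 1000 kW modules on a 1000 kW bus.
print(spare_modules(800, 1000, 2, bus_kw=1000))   # 1 -> N+1
print(spare_modules(800, 1000, 3, bus_kw=1000))   # 2 -> N+2 with a third module
print(spare_modules(1200, 1000, 2))               # 0 -> load above N, non-redundant
```

The bus check is kept separate from the module count because, as noted above, the bus rating and not the sum of the module ratings sets the system capacity.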
System Redundancy, N+N or 2N
Figure (f) shows a typical redundant design for UPS systems that power IT equipment with dual-corded power supplies. In this arrangement there are two UPS systems, each comprised of one module with N capacity. Under normal conditions, each system powers half the load. Should one UPS system fail, the load will automatically transfer to the other UPS system via the IT equipment’s dual-corded power supplies. Site operators must manage the load and ensure that the total load does not exceed the capacity ‘N’ of one system. Since UPS A and B are separate systems, they share neither common output buses nor common controls and operate independently of one another.
Redundant Systems with Redundant Modules, 2N+1
Higher levels of redundancy can be achieved by using two N+1 UPS systems as shown in figure (g). This is sometimes referred to as 2N+1. Each system is comprised of three 1MW modules. If the common bus is rated for 2MW, then each system would be considered N+1 redundant. If UPS system A should fail, UPS B will assume all 2MW and still have module redundancy. This system design can withstand a complete system failure and one module failure and still deliver full UPS capacity. This design is very costly and is mostly used by financial companies or the most mission critical installations.
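As a quick check of the worst case described above (one whole system lost, plus one module in the surviving system), the following minimal Python sketch uses the figures from the example: 1MW modules, three per system, and a 2MW bus. The function names are illustrative assumptions, and the model simply takes the smaller of the in-service module capacity and the bus rating.

```python
def system_capacity_kw(module_kw, modules_in_service, bus_kw):
    """Usable capacity of one UPS system, limited by its modules and by its bus."""
    return min(module_kw * modules_in_service, bus_kw)

def survives_worst_case(load_kw, module_kw, modules_per_system, bus_kw):
    """True if the full load is still carried after one system has failed
    and one module in the surviving system has also failed."""
    remaining_kw = system_capacity_kw(module_kw, modules_per_system - 1, bus_kw)
    return load_kw <= remaining_kw

# 2N+1 example: two systems of three 1000 kW modules, 2000 kW bus, 2000 kW load.
print(survives_worst_case(2000, 1000, 3, 2000))  # True

# Same load with only two modules per system: no module redundancy is left.
print(survives_worst_case(2000, 1000, 2, 2000))  # False
```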
Redundancy and UPS Utilization
In figure (g) the redundant capacity, N, is 2MW. The design uses six 1MW UPS modules and connects them in a way to provide 2MW of redundant capacity. Here is where it gets a little confusing. How do we distinguish between module utilization, system utilization and overall utilization? System A’s redundant capacity is 2MW. Normally, it will carry a load of 1MW, when both systems are operational. Therefore, the system utilization is 50%. The 50% of unused capacity is waiting in reserve to support System B should it fail. System A consists of three 1MW modules and the load is equally divided between them. Therefore, each module supports 333kW for a module utilization of 33%. What about the overall utilization? Well, if System A is 50% utilized, System B is 50% utilized and they back each other up, the overall utilization is 100%. In other words, no more load can be added to the redundant system even though there is more capacity on an individual system (A or B) and their individual modules. If more load were to be added the system would not meet its designed redundancy target.
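The three utilization figures can be reproduced with a short calculation. The sketch below is only a restatement of the arithmetic in the paragraph above, assuming the load splits evenly between the two systems and evenly across the modules within each system; the function and key names are our own.

```python
def utilization_2n_plus_1(total_load_kw, module_kw=1000,
                          modules_per_system=3, system_redundant_kw=2000):
    """Utilization figures for two identical N+1 systems sharing the load equally."""
    system_load_kw = total_load_kw / 2
    return {
        # Load on one system relative to its redundant (bus) capacity.
        "system": system_load_kw / system_redundant_kw,
        # Load on one module relative to the module rating.
        "module": system_load_kw / (modules_per_system * module_kw),
        # Total load relative to the redundant capacity N of the whole design.
        "overall": total_load_kw / system_redundant_kw,
    }

for name, value in utilization_2n_plus_1(2000).items():
    print(f"{name}: {value:.0%}")
# system: 50%
# module: 33%
# overall: 100%
```

Running the same function with a smaller total load shows how much headroom remains before the design stops meeting its redundancy target.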
The opposite can be said about systems that are underutilized. If each system were carrying only 25% of its redundant capacity, then a system could afford to have two modules fail and still be redundant. This is a common occurrence for new systems where IT loads are prescribed over time, and it is why we often include modular expansion capabilities in our designs.
This concept of utilization is very important because it has a direct impact on CapEx and OpEx for two reasons:
The higher the module utilization, the more efficiently the modules operate.
The most efficient use of capital comes from designs that have the highest module and system utilization.
Let’s see how this applies to the redundancy examples we have introduced so far. Table 1 shows that, across the four examples, higher levels of redundancy have lower utilization. That’s because more module capacity is waiting in reserve in case of failure. A non-redundant system (N+0) provides only enough capacity to match the load with nothing in reserve and therefore has the highest utilization, but at the risk of dropping the load. The 2N+1 system provides the most redundancy but the lowest module utilization. This added protection comes at a higher cost of installation and operation. Remember, the 2N+1 system requires 6MW of total capacity to provide 2MW of redundant capacity.
As the data center industry matured, designers and operators looked for more efficient ways to deliver highly redundant systems. The N+N system design has been the standard configuration since dual-corded power supplies were adopted. The design is simple, reliable and easy to manage. The downside is low utilization and stranded capacity. Since many data center builds never reach their full load potential, real utilization can be closer to 30%.
To address this issue and lower the total cost of ownership, distributed redundancy has gained favor in the industry. The concept is not new, but it differs from the classical N+N designs. In fact, our first distributed redundant design was installed and commissioned in 2007. The advantages of distributed redundancy include higher capacity utilization with system level redundancy. Maintaining system level redundancy means there are no single points of failure, unlike in non-redundant or N+1 systems.
Figure (h) shows the normal configuration of a 3 to make 2, distributed redundant system. Three systems are installed to deliver the capacity of two. Any one of the three systems can fail, and load will seamlessly transfer to the other two systems as shown in figure (i). With this arrangement, each system can be loaded to 2/3 capacity so that neither of the two remaining systems will exceed 100% of its capacity in a failure scenario. For distributed redundancy to work, downstream loads must be properly managed. If they are not, a failure of one UPS system may result in an overload of one of the two remaining systems. Successful load management requires proper design, monitoring and discipline to maintain distributed redundancy. This is slightly more complicated than a classical N+N system where it’s obvious how loads will transfer.
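To illustrate why load management matters, here is a minimal Python sketch that checks every single-system failure for overloads. It rests on assumptions that are ours, not part of the design description above: each downstream load is dual-corded to exactly two of the UPS systems, draws half its power from each in normal operation, and shifts entirely to the surviving source when one fails. The function names, the 900 kW system rating, and the load values are hypothetical.

```python
def system_loads(feeds, failed=None):
    """Return {system: kW} for a list of (source_1, source_2, load_kw) feeds.

    If `failed` names a system, each load fed by it moves fully to its other source.
    """
    loads = {}
    for first, second, kw in feeds:
        loads.setdefault(first, 0.0)
        loads.setdefault(second, 0.0)
        if failed == first:
            loads[second] += kw
        elif failed == second:
            loads[first] += kw
        else:
            loads[first] += kw / 2
            loads[second] += kw / 2
    loads.pop(failed, None)  # a failed system carries nothing
    return loads

def overloads_by_failure(feeds, capacity_kw):
    """Map each single-system failure to the surviving systems it would overload."""
    systems = sorted({s for first, second, _ in feeds for s in (first, second)})
    problems = {}
    for failed in systems:
        after = system_loads(feeds, failed)
        over = [s for s, kw in after.items() if kw > capacity_kw]
        if over:
            problems[failed] = over
    return problems

# Three 900 kW systems, 1800 kW of load spread evenly over the three pairs.
balanced = [("A", "B", 600), ("B", "C", 600), ("A", "C", 600)]
print(overloads_by_failure(balanced, 900))    # {}

# Same 1800 kW total, but concentrated on the A/B pair.
unbalanced = [("A", "B", 1000), ("B", "C", 400), ("A", "C", 400)]
print(overloads_by_failure(unbalanced, 900))  # {'A': ['B'], 'B': ['A']}
```

In the second case no system is overloaded in normal operation, yet losing A or B overloads the other, which is the scenario that monitoring and discipline are meant to prevent.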
The distributed redundant concept can be expanded to a 4 to make 3 arrangement shown in figure (j). This is a popular co-location design. In this arrangement, four systems are installed to deliver the capacity of three. Each system can be loaded up to 75% and when a system fails, its load divides between the three remaining systems as shown in figure (k).
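The 2/3 and 75% loading limits follow the same pattern: with k systems installed to deliver the capacity of k-1, each system can be loaded to (k-1)/k. The short Python sketch below shows that arithmetic, assuming identical systems and a failed system's load dividing evenly among the survivors; the function names are illustrative.

```python
from fractions import Fraction

def max_loading(systems_installed):
    """Highest per-system loading that still tolerates the loss of one system."""
    if systems_installed < 2:
        raise ValueError("distributed redundancy needs at least two systems")
    return Fraction(systems_installed - 1, systems_installed)

def loading_after_failure(systems_installed, loading):
    """Per-system loading once one system fails and its load divides evenly."""
    survivors = systems_installed - 1
    return loading + loading / survivors

for k in (3, 4):
    limit = max_loading(k)
    print(k, limit, loading_after_failure(k, limit))
# 3 2/3 1
# 4 3/4 1
```

Exact fractions are used so that the post-failure loading comes out as exactly 1 (100%) rather than a rounded floating-point value.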
Table 2 shows how the two distributed redundant designs compare to the four other types of redundancy examples previously discussed. With distributed redundancy, system level redundancy can be achieved with fewer modules and higher utilization which helps lower the total cost of ownership.
Theoretically, more than four systems can be combined to create distributed redundancy. This will further increase utilization, but we don’t recommend it. Going beyond 4 to make 3 systems introduces additional complexities and will ultimately compromise the system reliability for diminishing returns.
One final redundancy design worth mentioning is the catcher system shown in figure (l). In a catcher system there is one redundant system capable of backing up all the rest. The redundant system will be online and unloaded. When a system fails, all its load is transferred to the redundant system, usually through static transfer switches. The redundant system can be oversized to support more than one failed system at a time, but typically it is sized to match the other systems as shown in figure (l).
Redundancy comes in a variety of flavors. No one design fits all applications. The best choice depends on the project requirements. Determining factors include a data center’s operational and maintenance requirements, uptime targets, risk tolerance, and total cost of ownership. The general concepts presented here can be extrapolated out into greater detail to include cost and failure models but hopefully this overview helps provide a foundation for understanding the many options available for your next data center build.