
Question about power supply failures

Status
Not open for further replies.

jeremygb (Newbie level 3, joined Aug 18, 2014)
Not sure if this post is in the correct category.

I am a data center technician and I routinely deal with hot-swappable power supplies in servers. Here is what I see happen: the server's hardware monitoring processor sends out an alert for a failed power supply, the amber warning light on the power supply comes on, and the event log indicates a failed power supply. All a very normal sequence of events. One thing I am curious about is that after re-seating the power supply, it comes back on with the green status light and continues working. This only lasts anywhere from 2 to 8 hours before it fails again. Re-seating the power supply again yields the same behavior. This happens with almost every power supply failure I have encountered.

I get that, from a computer hardware perspective, the monitoring system detects a pre-defined failure condition and takes the appropriate action. Re-seating the power supply somehow clears this failure condition. Can someone tell me what is happening here from an electronics perspective that gives the power supply a temporary second wind after being re-seated?

Thanks for looking.

Jeremy
 

When you say "reseating" do you REALLY mean removing power from the power supply and then reapplying power? Removing power could clear the failure.
 

When you say "reseating" do you REALLY mean removing power from the power supply and then reapplying power? Removing power could clear the failure.

Reseating means removing the power supply from the power supply bay of the server, disconnecting it from the power backplane. In most cases this also means unplugging the power cord, as the mechanical release can't be operated with the cord in the way. These are legitimate power supply failures; I'm just wondering why the removal of power temporarily "fixes" them.

My very limited knowledge of electronics leads me to speculate it may be a capacitor that is not able to hold the correct voltage, and removing power allows the capacitor to drain and then recharge when plugged back in, only to drift back to its incorrect voltage. Hoping someone with the proper knowledge can clue me in if possible.
 

Some controllers have "features" such as latched overcurrent, overvoltage, and overtemperature protection. In high-reliability, hard-to-service applications (like aerospace, on station) these are often preferred to be self-resetting or absent, but in a consumer product exposed to the plaintiff's bar, safety first and foremost (as opposed to operational reliability in the moment) is the requirement.
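To make the latched-protection idea concrete, here is a minimal Python sketch of a hypothetical protection controller. The trip thresholds and class name are illustrative assumptions, not from any real controller datasheet; the point is only that a latched fault survives the fault condition going away and clears only on a power cycle, matching the re-seating behavior described.

```python
# Sketch of latched overcurrent/overvoltage protection (hypothetical values).
# A single excursion past a trip point sets a latch; the output stays off
# until input power is removed, even if the fault condition has passed.

class SupervisorIC:
    OVP_TRIP_V = 5.1   # hypothetical overvoltage trip point on the 5 V rail
    OCP_TRIP_A = 30.0  # hypothetical overcurrent trip point

    def __init__(self):
        self.fault_latched = False

    def sample(self, volts, amps):
        # Any single sample past a trip point sets the latch.
        if volts > self.OVP_TRIP_V or amps > self.OCP_TRIP_A:
            self.fault_latched = True
        return self.output_enabled()

    def output_enabled(self):
        # Output stays off while the latch is set; clearing the
        # overload alone does not restart the supply.
        return not self.fault_latched

    def power_cycle(self):
        # Removing input power (re-seating the supply) resets the latch.
        self.fault_latched = False

psu = SupervisorIC()
psu.sample(5.0, 10.0)           # normal operation
psu.sample(5.2, 10.0)           # brief overvoltage spike: latch sets
assert not psu.output_enabled()
psu.sample(5.0, 10.0)           # condition gone, output still off
assert not psu.output_enabled()
psu.power_cycle()               # re-seat: latch clears, supply restarts
assert psu.output_enabled()
```

A marginal component that occasionally nudges a rail past a trip point would produce exactly the "works for a few hours, then latches off again" pattern in the original post.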

I'd suggest you instrument up a failing supply with a data logger on the voltage, temperature, and current sense points. This ought to give you some idea of what's going wrong.
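The logging suggested above can be as simple as the following Python sketch: poll the sense points and append timestamped rows to a CSV for later analysis. `read_sense()` is a placeholder assumption; substitute whatever DAQ or multimeter API your instrumentation actually provides.

```python
# Hedged sketch of a sense-point data logger: timestamped CSV rows of
# voltage, current, and temperature readings. read_sense() is a stub.

import csv
import time

def read_sense():
    """Placeholder for real hardware reads (DAQ, USB multimeter, etc.)."""
    return {"volts_5v": 5.02, "amps": 12.4, "temp_c": 41.7}

def log_supply(path, samples, interval_s=1.0):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "volts_5v", "amps", "temp_c"])
        for _ in range(samples):
            s = read_sense()
            writer.writerow([time.time(), s["volts_5v"],
                             s["amps"], s["temp_c"]])
            time.sleep(interval_s)

log_supply("psu_log.csv", samples=3, interval_s=0.01)
```

Correlating the last few rows before each shutdown against the supply's documented trip thresholds is what turns the log into a diagnosis.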

Marginal components can lead to some of these protection features activating. But so can lousy cooling or load spikes on overcurrent protection, long-wire / inductive-appearing-load flyback effects on overvoltage clamps, and so on. You might like to swap positions between a never-fails and an always-fails unit, just to see whether this is internal or induced.
 

Decent power supplies contain an electronic "crowbar". If a power supply goes faulty it could easily put 12 V on a 5 V line, destroying entire processors, so a crowbar circuit monitors the output voltage very closely and will typically fire when the +5 V line touches 5.1 V. The crowbar action puts a short circuit across the power supply output. This does no harm, as the maximum current the power supply can deliver is also controlled. So what happens is the output voltage simply fails. Pulling the power supply out and pushing it back in resets the crowbar and the output voltage reappears.
The crowbar can be actuated by spikes on the mains or on the output lines pushing the output voltage to the trip level, or the trip level may be set too low.
Frank
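The crowbar behavior described above can be sketched as a small state model. This is an assumption-laden illustration, not a circuit simulation: a crowbar is typically an SCR, which keeps conducting once fired for as long as holding current flows, so only removing input power turns it off. The 5.1 V trip value comes from the post itself; real designs vary.

```python
# Rough model of SCR crowbar overvoltage protection: the SCR fires when
# the rail touches the trip point and then stays conducting (shorting the
# output) until input power is removed. Values are illustrative only.

OVP_TRIP_V = 5.1  # trip point quoted in the post; real designs vary

class Crowbar:
    def __init__(self):
        self.scr_conducting = False

    def rail_voltage(self, regulated_v):
        # Once fired, the SCR holds the rail near zero; the supply's own
        # current limit keeps the resulting short from causing damage.
        if regulated_v >= OVP_TRIP_V:
            self.scr_conducting = True
        return 0.0 if self.scr_conducting else regulated_v

    def power_cycle(self):
        # Pulling the supply removes the SCR's holding current,
        # so it turns off and normal regulation can resume.
        self.scr_conducting = False

cb = Crowbar()
assert cb.rail_voltage(5.0) == 5.0    # normal operation
assert cb.rail_voltage(5.15) == 0.0   # spike fires crowbar, rail collapses
assert cb.rail_voltage(5.0) == 0.0    # stays shorted at normal voltage too
cb.power_cycle()
assert cb.rail_voltage(5.0) == 5.0    # re-seat restores the output
```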
 
capacitor that is not able to hold the correct voltage and removing power allows the capacitor to drain

I believe it is something similar. As we know, a computer problem is not always solved by merely shutting down and starting up again, yet the same problem can be solved by removing all power for 30 seconds. This lets circuits fall back to 0 V.

The power supply failures may be due to one or more overheating components. That would account for variations in operating time until failure.
 

Decent power supplies contain an electronic "crowbar"... Pulling the power supply out and pushing it back in resets the crowbar and the output voltage appears.

This makes sense for what I am seeing. The servers are deployed in an enterprise-level data center, so the chances of power spikes from the infrastructure are extremely slim.

Thanks everyone for your input. I wasn't trying to determine whether there were failures or not, just wanting to know why removing power from them breathes a few hours of life back into them. The manufacturer's support explanation doesn't go any deeper than, "...if the event log says it's failed, then it is failed."
 

Servers often use redundant or N+1 supplies. Fault detection needs to be classified to be useful; otherwise you get transient unknown glitches, with transient fault conditions that are obviously frustrating.

Clearly there is an EMI problem somewhere, but where? You may need to escalate to engineering support at the manufacturer who built the supplies and ask for all causes of transient faults.

Some questions worth asking are:
Does the fault indicate shutdown only on an unrecoverable fault, or on all faults?

What are the precise criteria for faults?

Overtemp? Overcurrent? Overvoltage? Undervoltage?
Time durations? Thresholds?

Is there current sharing? Is there any condition where light loads cause it to be more susceptible to step load induced overshoot?

How long are the power distribution runs?

Is there distributed low ESR capacitance that triggers an undervoltage on hot insertion of other cards?

Does this failure coincide with any operator intervention?

Etc etc...

If you don't get answers at first, email and CC your buyers and their marketing group. I'd like to know the brand too, so I can avoid them like fleas.

But it could also be that your backplanes need new low-ESR caps installed.

We used to build servers, and we had to bring in the supplier to fix their design, which required a 10% pre-load on tandem current-shared PSUs to avoid a transient OVP trip.

These faults can easily be prevented with decent engineering support or a test engineer.
 
