What Really Happened During the 2003 Blackout?
[Note that this article is a transcript of the video embedded above.]
On August 14, 2003, a cascading failure of the power grid plunged more than 50 million people into darkness in the northeast US and Canada. It was the most significant power outage ever in North America, with an economic impact north of ten billion dollars. Calamities like this don’t happen in a bubble, and there were many human factors, political aspects, and organizational issues that contributed to the blackout. But, this is an engineering channel, and a bilateral task force of energy experts from the US and Canada produced this in-depth 240-page report on all of the technical causes of the event that I’ll try to summarize here. Even though this is kind of an older story, and many of the tough lessons have already been learned, it’s still a nice case study to explore a few of the more complicated and nuanced aspects of operating the electric grid, essentially one of the world’s largest machines. I’m Grady, and this is Practical Engineering. Today, we’re talking about the Northeast Blackout of 2003.
Nearly every aspect of modern society depends on a reliable supply of electricity, and maintaining this reliability is an enormous technical challenge. I have a whole series of videos on the basics of the power grid if you want to keep learning after this, but I’ll summarize a few things here. And just a note before we get too much further, when I say “the grid” in this video, I’m really talking about the Eastern Interconnection that serves the eastern two-thirds of the continental US plus most of eastern Canada.
There are two big considerations to keep in mind concerning the management of the power grid. One: supply and demand must be kept in balance in real-time. Storage of bulk electricity is nearly non-existent, so generation has to be ramped up or down to follow the changes in electricity demands. Two: In general, you can’t control the flow of electric current on the grid. It flows freely along all available paths, governed by relatively simple physical laws. When a power provider agrees to send electricity to a power buyer, it simply increases its generation while the buyer decreases their own production or increases their usage. This changes the flow of power along all the transmission lines that connect the two. Each change in generation and demand has effects on the entire system, some of which can be unanticipated.
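If you want to see that idea in numbers, here’s a tiny sketch (in Python, with made-up values) of a seller and a buyer connected by two parallel transmission paths. The 300 MW transaction splits itself between the paths in inverse proportion to their reactances: the physics decides the routing, not the contract.

```python
# Two parallel transmission paths between a seller and a buyer.
# Power divides between them in inverse proportion to reactance,
# regardless of who signed the contract. All numbers are illustrative.
x_direct = 0.10      # reactance of the short, direct path (per-unit)
x_roundabout = 0.30  # reactance of a longer path through neighboring systems
transfer_mw = 300.0  # the scheduled transaction

flow_direct = transfer_mw * x_roundabout / (x_direct + x_roundabout)
flow_roundabout = transfer_mw * x_direct / (x_direct + x_roundabout)
print(f"direct path carries     {flow_direct:.0f} MW")      # ~225 MW
print(f"roundabout path carries {flow_roundabout:.0f} MW")  # ~75 MW
```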
Finally, we should summarize how the grid is managed. Each individual grid is an interconnected network of power generators, transmission operators, retail energy providers, and consumers. All these separate entities need guidance and control to keep things running smoothly. Things have changed somewhat since 2003, but at the time, the North American Electric Reliability Council (or NERC) oversaw ten regional reliability councils that operated the grid to keep generation and demands in balance, monitored flows over transmission lines to keep them from overloading, prepared for emergencies, and made long-term plans to ensure that bulk power infrastructure would keep up with growth and changes across North America. In addition to the regional councils, there were smaller reliability coordinators that performed the day-to-day grid management and oversaw each control area within their boundaries.
August 14th was a warm summer day that started out fairly ordinarily in the northeastern US. However, even before any major outages began, conditions on the electric grid, especially in northern Ohio and eastern Michigan, were slowly degrading. Temperatures weren’t unusual, but they were high, leading to an increase in electrical demands from air conditioning. In addition, several generators in the area weren’t available due to forced outages. Again, not unusual. The Midwest Independent System Operator (or MISO), the area’s reliability coordinator, took all this into account in their forecasts and determined that the system was in the green and could be operated safely. But, three relatively innocuous events set the stage for what would follow that afternoon.
The first was a series of transmission line outages outside of MISO’s area. Reliability coordinators receive lots of real-time data about the voltages, frequencies, and phase angles at key locations on the grid. There’s a lot that raw data can tell you, but there’s also a lot it can’t. Measurements have errors and uncertainties, and they aren’t always perfectly synchronized with each other. So, grid managers often use a tool called a state estimator to process all the real-time measurements from instruments across the grid and convert them into the likely state of the electrical network at a single point in time, with all the voltages, current flows, and phase angles at each connection point. That state estimation is then used to feed displays and make important decisions about the grid.
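To make that a bit more concrete, here’s a minimal sketch of the math inside a state estimator: a weighted least-squares fit that finds the one set of bus angles that best explains a handful of noisy, redundant measurements on a tiny three-bus DC model. Real estimators work on full AC models with thousands of measurements; the susceptances and meter readings below are invented just to show the idea.

```python
# Minimal weighted-least-squares state estimation on a 3-bus DC model.
# Bus 1 is the reference (angle = 0); the unknowns are the angles at buses 2 and 3.
# All susceptances and measurements are made-up, illustrative numbers.
import numpy as np

b12, b13, b23 = 10.0, 10.0, 5.0   # line susceptances (per-unit)

# Redundant, slightly noisy measurements: MW flows on lines 1-2, 1-3, 2-3,
# plus the net injection at bus 2, along with each meter's assumed accuracy.
z = np.array([94.3, 86.1, -4.6, -99.2])
sigma = np.array([2.0, 2.0, 2.0, 3.0])

# Each measurement is a linear function of the unknown angles (theta2, theta3):
H = np.array([
    [-b12,        0.0 ],   # P12 = b12*(theta1 - theta2), theta1 = 0
    [ 0.0,       -b13 ],   # P13 = b13*(theta1 - theta3)
    [ b23,       -b23 ],   # P23 = b23*(theta2 - theta3)
    [ b12 + b23, -b23 ],   # net injection at bus 2 = P21 + P23
])

# Weighted least squares: trust accurate meters more than sloppy ones.
W = np.diag(1.0 / sigma**2)
theta_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ z)
residuals = z - H @ theta_hat

print("estimated bus angles:", theta_hat)
print("measurement residuals:", residuals)
```

If the residuals come back wildly larger than the meter accuracies, the estimator is effectively saying the data can’t be reconciled, which is roughly the situation MISO found itself in once its model was missing a tripped line it didn’t know about.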
But, on August 14th, MISO’s state estimator was having some problems. More specifically, it couldn’t converge on a solution. The state estimator was saying, “Sorry. All the data that you’re feeding me just isn't making sense. I can’t find a state that matches all the inputs.” And the reason it was saying this is that twice that day, a transmission line outside MISO’s area had tripped offline, and the state estimator didn’t have an automatic link to that information. Instead it had to be entered manually, and it took a bunch of phone calls and troubleshooting to realize this in both cases. So, starting around noon, MISO’s state estimator was effectively offline.
Here’s why that matters: The state estimator feeds into another tool called a Real-Time Contingency Analysis or RTCA that takes the estimated state and does a variety of “what ifs.” What would happen if this generator tripped? What would happen if this transmission line went offline? What would happen if the load increased over here? Contingency analysis is critical because you have to stay ahead of the game when operating the grid. NERC guidelines require that each control area manage its network to avoid cascading outages. That means you have to be okay, even during the most severe single contingency, for example, the loss of a single transmission line or generator unit. Things on the grid are always changing, and you don’t always know what the most severe contingency would be. So, the main way to ensure that you’re operating within the guidelines at any point in time is to run simulations of those contingencies to make sure the grid would survive. And MISO’s RTCA tool, which was usually run after every major change in grid conditions (sometimes several times per day), was offline on August 14th up until around 2 minutes before the start of the cascade. That means they couldn’t see their vulnerability to outages, and they couldn’t issue warnings to their control area operators, including FirstEnergy, the operator of a control area in northern Ohio including Toledo, Akron, and Cleveland.
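Here’s a toy version of that kind of contingency screen, using the same simplified DC power flow idea as above: solve for the flows in the base case, then knock each line out one at a time and check whether anything left over would exceed its limit. The three-bus network, its limits, and its loads are all made-up numbers; real RTCA tools do this on models with tens of thousands of buses.

```python
# A toy N-1 contingency screen on a 3-bus DC power-flow model.
# Bus 0 is the slack bus; buses 1 and 2 are load centers.
# Susceptances, limits, and loads are illustrative, made-up numbers.
import numpy as np

lines = [  # (from_bus, to_bus, susceptance [p.u.], limit [MW])
    (0, 1, 10.0, 120.0),
    (0, 2, 10.0, 120.0),
    (1, 2,  5.0,  80.0),
]
injections = {1: -100.0, 2: -80.0}   # negative = load (MW)

def dc_flows(active_lines):
    """Solve the reduced DC power flow B'*theta = P and return MW flow per line."""
    n = 3
    B = np.zeros((n - 1, n - 1))
    for f, t, b, _ in active_lines:
        for u in (f, t):
            if u != 0:
                B[u - 1, u - 1] += b
        if f != 0 and t != 0:
            B[f - 1, t - 1] -= b
            B[t - 1, f - 1] -= b
    P = np.array([injections.get(i, 0.0) for i in range(1, n)])
    theta = np.concatenate(([0.0], np.linalg.solve(B, P)))   # slack angle = 0
    return {(f, t): b * (theta[f] - theta[t]) for f, t, b, _ in active_lines}

print("base case flows:", dc_flows(lines))

# The "what ifs": take each line out of service and look for overloads.
for outaged in lines:
    remaining = [ln for ln in lines if ln is not outaged]
    for (f, t), mw in dc_flows(remaining).items():
        limit = next(lim for a, c, _, lim in remaining if (a, c) == (f, t))
        if abs(mw) > limit:
            print(f"losing line {outaged[:2]} overloads line {(f, t)}: "
                  f"{abs(mw):.0f} MW vs a {limit:.0f} MW limit")
```

In this toy network, two of the three single-line outages would produce overloads, which is exactly the kind of warning MISO’s tool couldn’t give that afternoon.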
That afternoon, FirstEnergy was struggling to maintain adequate voltage within their area. All those air conditioners use induction motors that spin a magnetic field using coils of wire inside. Inductive loads do a funny thing to the power on the grid. Some of the electricity used to create the magnetic field isn’t actually consumed, but just stored momentarily and then returned to the grid each time the current switches direction (that’s 120 times per second in the US). This causes the current to lag behind the voltage, reducing its ability to perform work. It also reduces the efficiency of all the conductors and equipment powering the grid because more electricity has to be supplied than is actually being used. This concept is kind of deep in the weeds of electrical engineering, but we normally simplify things by dividing bulk power into two parts: real power (measured in Watts) and reactive power (measured in var). On hot summer days, grid operators need more reactive power to balance the increased inductive loads on the system caused by millions of air conditioners running simultaneously.
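To put rough numbers on the distinction, here’s a back-of-the-envelope calculation (with illustrative values, not real grid data) of how an inductive load splits the power it draws into a real part and a reactive part:

```python
# Real vs. reactive power for an inductive load.
# The current lags the voltage by an angle phi, so only part of the
# volt-amps flowing do useful work. All values are illustrative.
import math

v_rms = 240.0         # volts
i_rms = 100.0         # amps
power_factor = 0.85   # cos(phi) for a heavily inductive load

phi = math.acos(power_factor)
s = v_rms * i_rms            # apparent power (VA): what the wires must carry
p = s * power_factor         # real power (W): what actually does work
q = s * math.sin(phi)        # reactive power (var): sloshes back and forth
print(f"apparent: {s/1e3:.1f} kVA, real: {p/1e3:.1f} kW, reactive: {q/1e3:.1f} kvar")
```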
Real power can travel long distances on transmission lines, but it’s not economical to import reactive power from far away because transmission lines have their own inductance that consumes the reactive power as it travels along them. With only a few running generators within the Cleveland area, FirstEnergy was importing a lot of real power from other areas to the south, but voltages were still getting low on their part of the grid because there wasn’t enough reactive power to go around. Capacitor banks are often used to help bring current and voltage back into sync, providing reactive power. However, at least four of FirstEnergy’s capacitor banks were out of service on the 14th. Another option is to over-excite the generators at nearby power plants so that they create more reactive power, and that’s just what FirstEnergy did.
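As a rough illustration of what those capacitor banks are for, here’s a standard power-factor-correction estimate (with invented numbers, not FirstEnergy’s actual figures) of how much reactive power a bank has to supply locally so that it doesn’t have to be imported over the transmission lines:

```python
# Rough sizing of a shunt capacitor bank for power-factor correction.
# Supplying this reactive power locally means it doesn't have to travel
# over the transmission system. Illustrative numbers only.
import math

p_mw = 600.0                 # real power serving the area (MW)
pf_now, pf_target = 0.90, 0.97

q_now = p_mw * math.tan(math.acos(pf_now))        # Mvar demanded at today's power factor
q_target = p_mw * math.tan(math.acos(pf_target))  # Mvar once corrected
print(f"the capacitor bank needs to supply about {q_now - q_target:.0f} Mvar")
```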
At the Eastlake coal-fired plant on Lake Erie, operators pushed the number 5 unit to its limit, trying to get as much reactive power as they could. Unfortunately, they pushed it a little too hard. At around 1:30 in the afternoon, its internal protection circuit tripped and the unit was kicked offline - the second key event preceding the blackout. Without this critical generator, the Cleveland area would have to import even more power from the rest of the grid, putting strain on transmission lines and giving operators less flexibility to keep voltage within reasonable levels.
Finally, at around 2:15, FirstEnergy’s control room started experiencing a series of computer failures. The first thing to go was the alarm system designed to notify operators when equipment had problems. This probably doesn’t need to be said, but alarms are important in grid operations. People in the control room don’t just sit and watch the voltage and current levels as they move up and down over the course of a day. Their entire workflow is based on alarms that show up as on-screen or printed notifications so they can respond. All the data was coming in, but the system designed to get an operator’s attention was stuck in an infinite loop. The FirstEnergy operators were essentially driving on a long country highway with their fuel gauge stuck on “full,” not realizing they were nearly out of gas. With MISO’s state estimator out of service, Eastlake 5 offline, and FirstEnergy’s control room computers failing, the grid in northern Ohio was operating on the bleeding edge of the reliability standards, leaving it vulnerable to further contingencies. And the afternoon was just getting started.
Transmission lines heat up as they carry more current due to resistive losses, and that is exacerbated on still, hot days when there’s no wind to cool them off. As they heat up, they expand in length and sag lower to the ground between each tower. At around 3:00, as the temperatures rose and the power demands of Cleveland did too, the Harding-Chamberlin transmission line (a key asset for importing power to the area) sagged into a tree limb, creating a short-circuit. The relays monitoring current on the line recognized the fault immediately and tripped it offline. Operators in the FirstEnergy control room had no idea it happened. They started getting phone calls from customers and power plants saying voltages were low, but they discounted the information because it couldn’t be corroborated on their end. By this time their IT staff knew about the computer issues, but they hadn’t communicated them to the operators, who had no clue their alarm system was down.
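To get a feel for how sensitive that is, here’s a rough calculation using the standard parabolic sag approximation and a generic thermal expansion coefficient. These are illustrative numbers, not the actual Harding-Chamberlin line data, but they show how a modest temperature rise drops a conductor noticeably closer to whatever is underneath it.

```python
# How conductor heating turns into sag between two towers.
# Uses the parabolic approximation: conductor length L ~ S + 8*D^2/(3*S),
# so sag D ~ sqrt(3*S*(L - S)/8). All values are illustrative.
import math

span = 300.0       # distance between towers (m)
sag_cool = 8.0     # sag at normal conductor temperature (m)
alpha = 1.9e-5     # approximate thermal expansion coefficient for ACSR (per deg C)
delta_t = 40.0     # extra conductor temperature from heavy current and no wind (deg C)

length_cool = span + 8 * sag_cool**2 / (3 * span)   # conductor length in the span
length_hot = length_cool * (1 + alpha * delta_t)
sag_hot = math.sqrt(3 * span * (length_hot - span) / 8)
print(f"sag grows from {sag_cool:.1f} m to about {sag_hot:.1f} m")  # ~9.5 m
```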
With the loss of Harding-Chamberlin, the remaining transmission lines into the Cleveland area took up the slack. The current on one line, the Hanna-Juniper, jumped from around 70% up to 88% of its rated capacity, and it was heating up. About half an hour after the first fault, the Hanna-Juniper line sagged into a tree, short circuited, and tripped offline as well. The FirstEnergy IT staff were troubleshooting the computer issues, but still hadn’t notified the control room operators. The staff at MISO, the reliability coordinator, still hampered by their state estimator problems, were also slow to recognize that these outages had occurred and what they meant for the rest of the system.
FirstEnergy operators were now getting phone call after phone call, asking about the situation while being figuratively in the dark. Call transcripts from that day tell a scary story.
“[The meter on the main transformer] is bouncing around pretty good. I’ve got it relay tripped up here…so I know something ain't right,” said one operator at a nearby nuclear power plant.
A little later he called back: “I’m still getting a lot of voltage spikes and swings on the generator… I don’t know how much longer we’re going to survive.”
A minute later he calls again: “It’s not looking good… We aint going to be here much longer and you’re going to have a bigger problem.”
An operator in the FirstEnergy control room replied: “Nothing seems to be updating on the computers. I think we’ve got something seriously sick.”
With two key transmission lines out of service, a major portion of the electricity powering the Cleveland area had to find a new path into the city. Some of it was pushed onto the less efficient 138 kV system, but much of it was being carried by the Star-South Canton line, which was now carrying more than its rated capacity. At 3:40, a short ten minutes after losing Hanna-Juniper, the Star-South Canton line tripped offline when it too sagged into a tree and short-circuited. It was actually the third time that day the line had tripped, but it was equipped with circuit breakers called reclosers that would re-energize the line automatically if the fault had cleared. But, the third time was the charm, and Star-South Canton tripped and locked out. Of course, FirstEnergy didn’t know about the first two trips because they didn’t see an alarm, and they didn’t know about this one either. They had started sending crews out to substations to get boots on the ground and try to get a handle on the situation, but at that point, it was too late.
With Star-South Canton offline, flows in the lower capacity 138 kV lines into Cleveland increased significantly. It didn’t take long before they too started tripping offline one after another. Over the next half hour, sixteen 138 kV transmission lines faulted, all from sagging low enough to contact something below the line. At this point, voltages had dropped low enough that some of the load in northern Ohio had been disconnected, but not all of it. The last remaining 345 kV line into Cleveland from the south came from the Sammis Power Plant. The sudden changes in current flow through the system now had this line operating at 120% of its rated capacity. Seeing such an abnormal and sudden rise in current, the relays on the Sammis-Star line assumed that a fault had occurred and tripped the last remaining major link to the Cleveland area offline at 4:05 PM, only an hour after the first incident. After that, the rest of the system unraveled.
With no remaining connections to the Cleveland area from the south, bulk power coursing through the grid tried to find a new path into this urban center.
First, overloads progressed northward into Michigan, tripping lines and further separating areas of the grid. Then the area was cut off to the east. With no way to reach Cleveland, Toledo, or Detroit from the south, west, or north, a massive power surge flowed east into Pennsylvania, New York, and then Ontario in a counter-clockwise path around Lake Erie, creating a major reversal of power flow in the grid. All along the way, relays meant to protect equipment from damage saw these unusual changes in power flows as faults and tripped transmission lines and generators offline.
Relays are sophisticated instruments that monitor the grid for faults and trigger circuit breakers when one is detected. Most relaying systems are built with levels of redundancy so that lines will still be isolated during a fault, even if one or more relays malfunction. One type of redundancy is remote backup, where separate relays have overlapping zones of protection. If the closest relay to the fault (called Zone 1) doesn’t trip, the next closest relay will see the fault in its Zone 2 and activate the breakers. Many relays have a Zone 3 that monitors even farther along the line.
When you have a limited set of information, it can be pretty hard to know whether a piece of equipment is experiencing a fault and should be disconnected from the grid to avoid further damage, or is just experiencing an unusual set of circumstances that protection engineers may not have anticipated. That’s especially true when the fault is far away from where you’re taking measurements. The vast majority of lines that went offline in the cascade were tripped by Zone 3 relays. That means the Zone 1 and 2 relays, for the most part, saw the changes in current and voltage on the lines and didn’t trip because those changes didn’t fall outside of what was considered normal. However, the Zone 3 relays - being less able to discriminate between faults and unusual but non-damaging conditions - shut them down. Once the dominoes started falling in the Ohio area, it took only about 3 minutes for a massive swath of transmission lines, generators, and transformers to trip offline. Everything happened so fast that operators had no opportunity to implement interventions that could have mitigated the cascade.
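Here’s a simplified sketch of how that can happen with a distance relay: the relay estimates an apparent impedance from the voltage and current it measures and picks up if that impedance falls inside one of its zones of reach. Under heavy load and depressed voltage, the apparent impedance shrinks and can creep into the wide Zone 3 reach even though there’s no fault at all. The line impedance, zone settings, and operating points below are made up for illustration.

```python
# Simplified distance-relay zone check. Each zone trips for apparent
# impedances below its reach; farther zones reach farther but wait longer.
# All settings and operating points are made-up, illustrative numbers.
line_impedance_ohms = 50.0
zones = {                       # name: (reach in ohms, time delay in seconds)
    "Zone 1": (0.8 * line_impedance_ohms, 0.0),
    "Zone 2": (1.2 * line_impedance_ohms, 0.4),
    "Zone 3": (2.5 * line_impedance_ohms, 1.5),
}

def apparent_impedance(v_kv_line_to_line, i_amps):
    """Very rough per-phase |Z| = V_phase / I as seen by the relay."""
    return (v_kv_line_to_line * 1000 / 3**0.5) / i_amps

def zones_picked_up(z_ohms):
    return [name for name, (reach, _) in zones.items() if z_ohms < reach]

# A normal day: 345 kV and ~600 A puts the apparent impedance far outside every zone.
print(zones_picked_up(apparent_impedance(345, 600)))    # []
# Cascade conditions: depressed voltage and a huge surge of load current.
print(zones_picked_up(apparent_impedance(300, 2500)))   # ['Zone 3']
```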
Eventually enough lines tripped that the outage area became an electrical island separated from the rest of the Eastern Interconnection. But, since generation wasn’t balanced with demands, the frequency of power within the island was completely unstable, and the whole area quickly collapsed. In addition to all of the transmission lines, at least 265 power plants with more than 508 generating units shut down. When it was all over, much of the northeastern United States and the Canadian province of Ontario were completely in the dark. Since there were very few actual faults during the cascade, reenergizing happened relatively quickly in most places. Large portions of the affected area had power back on before the end of the day. Only a few places in New York and Toronto took more than a day to have power restored, but still the impacts were tremendous. More than 50 million people were affected. Water systems lost pressure, forcing boil-water notices. Cell service was interrupted. All the traffic lights were down. It’s estimated that the blackout contributed to nearly 100 deaths.
Three trees and a computer bug caused a major part of North America to completely grind to a halt. If that’s not a good example of the complexity of the power grid, I don’t know what is. If you had asked anyone working in the power industry on August 13 whether the entire northeast US and Canada would suffer a catastrophic loss of service the next day, they would have said no way. People understood the fragility of the grid, and there were even experts sounding alarms about the impacts of deregulation and the vulnerability of transmission networks, but this was not some big storm. It wasn’t even a peak summer day. It was just a series of minor contingencies that all lined up just right to create a catastrophe.
Today’s power grid is quite different than it was in 2003. The bilateral report made 46 recommendations about how to improve operations and infrastructure to prevent a similar tragedy in the future, many of which have been implemented over the nearly 20 years since. But that doesn’t mean there aren’t challenges and fragilities in our power infrastructure today. Current trends include more extreme weather, changes in the energy portfolio as we move toward more variable sources of generation like wind and solar, growing electrical demands, and increasing communications between loads, generators, and grid controllers. Just a year ago, Texas saw a major outage related to extreme weather and the strong nexus between natural gas and electricity. I have a post on that event if you want to take a look after this. I think the 2003 blackout highlights the intricacy and interconnectedness of this critical resource we depend on, and I hope it helps you appreciate the engineering behind it. Thank you for reading, and let me know what you think.