January 2009

Preventing Cascade Failure in Networks

For some reason I was thinking about the cascade failure problem that I mentioned before my Holiday Hiatus. I discovered this problem had gone largely unsolved while attending ACSAC ’08, and it’s been in the back of my mind (and on my whiteboard) since then. I’ve looked online a bit to see whether any researchers are attacking the problem, and it seems that very few are. Most of the work is also around containment of the failure rather than prevention. Most of the solutions resemble burning swaths of trees to save the forest from a fire; worse, this requires global knowledge of the network, which I find distasteful. The systems deployed today also seem to be built rather naively: nodes simply reroute traffic on failure, without accounting for the possibility that in doing so they might over-traffic their neighbors and thus trigger a cascade. Textbooks also say surprisingly little about running networks near capacity (which is desirable for efficient utilization, but can drastically increase the probability of a cascade).

So, I want to introduce a few constraints and assumptions before I state my rough solution. Assuming the network can be viewed as a graph with nodes and edges, where some good (power, packets, traffic) flows along edges from one node to another, I’d prefer a solution that uses local knowledge (node capacity, and possibly neighbor capacities) instead of global knowledge. So I’ll assume that each node can keep track of how much stuff is flowing through it, and what the traffic patterns actually are; that is, how much stuff goes from edge 1 to edge 2.
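To make this concrete, here’s a minimal Python sketch of the per-node bookkeeping I have in mind. All the names and data structures here are mine, purely illustrative:

```python
from collections import defaultdict

class Node:
    """Tracks purely local traffic knowledge: how much flow moves from
    each inbound edge to each outbound edge through this node."""

    def __init__(self, capacity):
        self.capacity = capacity         # total units this node can carry
        self.flows = defaultdict(float)  # (in_edge, out_edge) -> units of flow

    def record(self, in_edge, out_edge, units):
        """Observe `units` of traffic entering on in_edge and leaving on out_edge."""
        self.flows[(in_edge, out_edge)] += units

    def total_load(self):
        """Total stuff currently flowing through this node."""
        return sum(self.flows.values())

    def load_between(self, in_edge, out_edge):
        """How much goes from edge 1 to edge 2, per the traffic-pattern assumption."""
        return self.flows[(in_edge, out_edge)]
```

Nothing here requires global knowledge: a node only sees its own edges and its own capacity.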

In computer networks it is also possible for a node to track the possible re-routings that certain packets can take and still arrive at their destination. (I don’t know whether this is true of power delivery in the electric grid, but I’m going to assume that something analogous still holds.) Each node will be responsible for measuring traffic capacity: that is, for each stream of traffic, it will keep track of which other edges could have carried that traffic instead. Each node will also observe a capacity constraint: it will discourage traffic that cannot be re-routed if the primary edge for that traffic suddenly fails. In this way we keep a reserve capacity in case of edge failure.
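A sketch of that admission check, under my assumptions above (the function name and dict-based edge bookkeeping are hypothetical, not from any real system): a new flow is accepted on its primary edge only if the secondary edges jointly hold enough headroom to absorb everything on the primary, should it fail.

```python
def admit(flow_units, primary, secondaries, used, capacity):
    """Reserve-capacity admission check (illustrative sketch).

    used, capacity: dicts mapping edge -> current load / max load.
    Accepts the flow only if (a) the primary edge has room, and
    (b) the secondaries could absorb ALL of the primary's traffic,
    including this new flow, were the primary to fail."""
    # Reserve available = combined remaining headroom on the secondary edges.
    reserve = sum(capacity[e] - used[e] for e in secondaries)
    if used[primary] + flow_units > reserve:
        return False  # traffic would not be re-routable; discourage it
    if used[primary] + flow_units > capacity[primary]:
        return False  # primary edge itself is full
    used[primary] += flow_units
    return True
```

Note that this deliberately refuses traffic well before the primary edge is physically full, which is exactly the efficiency cost I list below as something to characterize.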

When an edge fails, the node will re-route that edge’s current traffic in a load-balanced manner along the secondary edges available for that traffic. This will temporarily violate the reserve capacity constraint, but will not over-traffic other nodes. [Actually, I don’t think this is true unless the node can use capacity information from its neighbors.] In this way, the outage of a single edge cannot trigger a cascade failure, because enough reserve capacity is always available for the traffic that was allocated to the failed edge.
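The failover step above might be sketched like this (again, names and representation are mine; "load-balanced" here means splitting the failed edge’s traffic across the secondaries in proportion to each one’s remaining headroom):

```python
def reroute_on_failure(failed_edge, secondaries, used, capacity):
    """Spread the failed edge's load over the secondary edges,
    proportionally to each edge's remaining headroom.

    If the reserve constraint held before the failure, the failed load
    fits in the combined headroom, and since each edge receives at most
    its own headroom, no single edge is pushed over capacity."""
    load = used.pop(failed_edge)
    headroom = {e: capacity[e] - used[e] for e in secondaries}
    total = sum(headroom.values())
    assert load <= total, "reserve constraint was violated; cascade risk"
    for e, h in headroom.items():
        used[e] += load * h / total  # proportional share, bounded by h
    return used
```

The proportional split is one choice among several; splitting evenly, or by shortest remaining path, would also fit the scheme, but headroom-proportional splitting keeps the post-failure loads uniformly far from each edge’s limit.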

Still to be addressed:

  • Node failures, which can cause multiple simultaneous edge failures.
  • Characterize the efficiency: what must be held as reserve capacity, versus the theoretical max capacity of the naive systems.
  • Formalize the measurements taken by each node.
  • Investigate push vs. pull flows (the power grid is demand-based; computer networks are push-based).
  • Do nodes still need a means of capping their flow? (Can the power grid drop packets?) I’d prefer not to require this.
  • Specify what each node is to communicate to its neighbors.
  • Investigate the use of probability for multiple node failures (earthquake, software bug); failures are generally neither identically distributed nor independent.
