Engineering – eric the fruitbat

Documentation for Progress

erich — Sat, 24 Sep 2011 00:12:43 +0000

I’ve noticed in my work recently that documenting my work is one of the most reliable ways of making steady progress. I likely gathered the idea from the internet somewhere, or perhaps from the generous amounts of advice spewed forth from my postdoc. But I do remember, when I was looking up some stuff surrounding the scheme publishing language, skribilo I came across the Nonpareil Project. The author, Jeffrey H. Kingston has kept a nice summary log of all the detailed work that goes into such a project. I’d like to quote a bit, just to give the flavor of the summary:

22 March 2007. The new version of the type system is compiled and tested today. This is about 12 days after I started the revision. The next problem, causing a core dump today, is that range types get frozen but are left unresolved at the end of MatchFunction, so we can’t be sure about coercions and run-time types at that point; we have to defer those until we descend again, during code generation. So there is some re-thinking and rewriting to do there.

23 March 2007. Sorting out the relationship between coercions and subtype tests. This has led to a realization that subtypes have to be tested twice: once when manifesting, without asking for coercions because they are unavailable in the presence of range types, and once during code generation after range types have all been resolved, at which point an error could occur. Also privatised expr_rec further, hiding it from its own subtypes.

28 March 2007. Finally finished working on the relationship between coercions and subtype tests. There were two problems: when the upper constraint is a meet type, and when the lower constraint is a variable. Solutions to both have now been documented, implemented and tested. The next step is to insert subtype calls, and the resulting coercions, during code generation.

As you can see, it’s a pretty high-level summary of the coding concerns for that day. This log runs from 2002 to 2008, and documents the progress of the project. Since, I’ve now moved to WebKit in my Information Flow project, I’ve also started keeping a similar log. Recording each minor milestone of implementation work. I’m expecting this log to help me write my thesis, as it will give me remembrance of details that I might otherwise forget.

I’ve also experienced a secondary benefit from keeping this log. The desire to grow the log, compels me to make progress in my work. I’m motivated to do tasks, just so I can mark them as done in the log. Additionally, since the log is geared toward recording smaller changes, I can break big todo’s up into a series of loggable items. I no longer get flummoxed by thinking of the sum total of work that might go into the implementation. Instead, I’m motivated by smaller tasks, which are easily accomplished.

Finally, the very act of documenting the work, and putting my thoughts into words, has helped to concretize my ideas and concerns. Instead of worrying abstractly about nebulous feelings, I now have mold them into precise explanations. Doing this helps me to realize which aspects are ignorable, and what specific actions I can take to answer the open questions.

Finally, I have a technique that allows me to make steady, targeted progress.

Building Linguistic Structure

erich — Thu, 04 Jun 2009 07:54:10 +0000

Yesterday, I had an interesting thought. My advisor once made the cultural observation that many people in Computer Science invent their own language and then immediately write a self-hosting compiler. I agree that a compiler is quite a feat of engineering and serves as a nice test case to demonstrate that the language you’ve invented is powerful enough that it can handle real-world complexity. Unfortunately, this test fails in a few important ways.

First, It doesn’t actually show as much as you think it might. There is a very strong filter on failed languages. By using this test the author runs the risk of re-designing the language, specifically to insert constructs that help them build the compiler. Now, this isn’t necessarily a bad thing, except that compiler writing is now a fairly mature field. There are standard abstractions (esp. in the lexing and parsing) that a new language will probably not experiment with. So, the author will usually just build these existing and well-understood abstractions into the new language. Rather than encouraging language experimentation we get more of the same, but with different syntax.

Second, Not all useful languages even have their own compiler. I’m specifically thinking of the domain specific languages (DSL). Nobody would write an awk interpreter in awk; or a mail engine using sendmail (even if it is Turing Complete). These are languages designed to do a specific task, many of them are quite essential to their respective fields, but none of them are self-hosting. Nor should we expect them to be.

My argument here is that the cultural practice of writing a self-hosting compiler is a big distraction. New languages should be for experimenting with new linguistic constructs. We should be looking toward the DSLs, and incorporating their innovations into our more main-stream languages. Right now, we seem to be optimizing our languages for compiler construction.

I’d rather see our languages evolve in a different direction. I’m really eager to witness the birth of an AI. For this to happen though, we need languages for expressing patterns of thought, not patterns of bits. We need the ability to cohesively and flexibly assemble the stuff of thought. I’m thinking Society of Mind stuff here. We need languages that allow for statistical fuzziness, sloppy associativity, and the ability to construct metaphor.

The linguistic tools that we find useful for building compilers are not necessarily the same tools that will help us build a mind.

Revisiting Cascade

erich — Wed, 18 Mar 2009 00:27:35 +0000

I did an earlier post about finding a decentralized solution to the Cascade Failure problem. I was fortunate enough to explore this topic a bit more in depth by proposing it as my term project for the Distributed Systems class. It took between two and three weeks of work to write a paper and hack together a small simulator to test out some of my ideas.

Since this marks my first attempt at a paper, and I think the results came out rather weak, I’ll make some brief comments about the process. Because it was a class paper, I spent substantial space and time condensing the other papers on the topic. Studies on the topic are quite sparse, there were really only four such papers, all published later than 2000. So about half of my paper is devoted to repeating their models and observations, followed by an all-too-brief comparative analysis.

The next section introduces some arguments about why the decentralized approach might be doomed to failure. I actually set out with the complete opposite goal, I wanted to find a solution to the problem. But, after many shot-in-the-dark attempts at finding one, I went after an impossibility proof instead. It isn’t what I would really call a rigorous proof though, it’s much more intuitive and informal. But I hadn’t seen the argument made in any other papers, so if I claim a small contribution to the field, this would have to be it.

I actually did follow up the proof, with a section on the methodology for my simulation. I think I was much more detailed about the setup than any of the other researchers were about theirs. I tried, and failed, to repeat the observations of others; so I can’t claim any guarantees about the accuracy of my model and simulation. I tested several off-the-cuff first-shot attempts at what protocols might help alleviate or stall a cascade, with spectacularly poor results. Every protocol I tried was worse than doing nothing! In fact, far worse than doing nothing! Worse by a factor of 50–200! So after all that work, I just figured out some stuff that doesn’t solve the problem, and had to explain why.

I think the best part, though, is the suggestions for Future Work. I’ll probably always enjoy writing this part, because it’s nearly impossible for me to get through my work without thinking of other ways to do stuff, or more things to try out.

Basically, it takes forever to write a paper. And as far as the sciences go, the paper really is really only the tip of the iceberg in terms of effort. It summarizes and represents a substantial amount of effort in data, simulation, code, research, analysis, etc…

LaTeX gives pretty results, and is the de-facto standard for scientific publication, but it’s kludgy, hard to use, and has a high learning curve.
There’s no decent and easy graphing/charting programs. So, it’s best to invest time into learning one. I had to write python scripts to parse and reduce my data anyway, so I used matplotlib.
And there’s no such thing as a vector drawing package that beats drawing sketches on napkins. So conventional artists have nothing to fear. It took hours to find, learn to use, and finally produce, a silly little, nearly superfluous, 7-node graph.
Network/Graph visualization tools are also scarce, so I had to write my own, just to double check that the simulation behaved properly. I didn’t mention any of this kind of effort in the paper though.

Anyway, go ahead and have a look at my paper: Cascade Failure in Distributed Networks

Preventing Cascade Failure in Networks

erich — Fri, 09 Jan 2009 08:57:21 +0000

For some reason I was thinking about the cascade failure problem that I mentioned before my Holiday Hiatus. I discovered that this problem has gone largely unsolved when attending ACSAC ’08, and it’s been in the back of my mind (and on my whiteboard) since then. I’ve looked online a bit to see wether any researchers are attacking the problem, and it seems that very few are. Most of the work is also around containment of the failure rather than prevention. Most of the solutions resemble burning swaths of trees to save the forest from a fire; this disgustingly requires global knowledge of the network. Also, the systems today seem to me to be built rather naively: nodes very simply reroute traffic based on failure, without taking into account the possibility that in so doing they might overtraffic their neighbors, and thus trigger a cascade. There also seems to be a lack of information in textbooks about running networks near capacity (this is desired for efficient utilization of the network, but can drastically increase the probability of cascade).

So, I want to introduce a few constraints and assumptions before I state my rough solution. Assuming the network can be viewed as a graph with nodes and edges, where some goods (power, packets, traffic) is flowing along edges from one node to another, I’d prefer a solution that used local knowledge (node capacity, and possible neighbor capacities) instead of global knowledge. So I’ll assume that each node can keep track of how much stuff is flowing through it, and what the traffic patterns actually are; that is: how much stuff goes from edge 1 to edge 2.

In computer networks it is also possible for a node to track the possible re-routings that certain packets can take and still arrive at their destination. (I don’t know wether this is true of power delivery in the electric grid, but I’m going to assume that something analogous still holds.) Each node will be responsible for measuring traffic capacity: That is for each stream of traffic, it will keep track of what other edges could have had that traffic routed onto them. Each node will also observe a capacity constraint: It will discourage traffic that cannot be re-routed if the primary edge for that traffic suddenly fails. In this way we can keep a reserve capacity in case of edge failure.

When an edge fails, the node will re-route its current traffic in a load balanced manner along the secondary edges available for that traffic. This will temporarily violate the reserve capacity constraint, but not over-traffic other nodes. [Actually I don’t think this is true, unless the node can use capacity info from its neighbors] In this way, outage of a single edge cannot trigger a cascade failure, because enough reserve capacity is always available for the traffic that was allocated to the failed edge.

Node failures which can cause multiple simultaneous edge failures.
Characterize the efficiency in terms of what is kept as reserve capacity and theoretical max capacity in the naive systems
Formalize the measurements taken by each node
Investigate push vs pull flows (power grid is demand based, computer networks are push)
Do nodes still need a means of capping their flow? (can the power grid drop packets?) I’d prefer not to require this.
Specify what each node is to communicate to its neighbors
Investigate the use of probability for multiple node failures (earthquake, software bug). Failures are not generally identically distributed or independent

ACSAC 2008

erich — Fri, 19 Dec 2008 00:31:42 +0000

So, I went to ACSAC the other week, and sketched out some crazy notes:

It seems that nobody is really working on large scale distributed systems reliability. By this I mean things like the Internet and the Power Grid. Both of these are highly distributed systems, where nodes automatically route around local failures. Unfortunately, and probably due to locality of knowledge, cascade failure is still an unsolved issue. Most of the solutions require early detection, and prevent systemic failure by removing of key nodes before things begin to cascade. This is quite similar to bulldozing trees to save the forest. It’s a sub-optimal solution. It’s also extremely difficult to automate, as it requires more global knowledge and assumes reliable control system connectivity in order to direct shutdown. It would be much better if each node behaved in such a way as to dampen the ‘cascade wave’ while still having only localized knowledge.

In some cases, rebooting becomes an issue. The graphic on wikipedia shows this quite well, when the center node reboots and becomes the only link in a high-demand chain. In the real power grid not all stations are capable of coming online without a fairly large kick-start from the grid itself. That is, some of the emergency/peak generators require power to start up. In a total network outage, you might find yourself without power for a while, even though you have a generator. Also, the power grid is run as close to maximum capacity as possible (for efficiency reasons) even though this increases the likelihood that a local failure can trigger a cascade.

Also these lines, I’ve also scrawled that it would be nice if a network protocol could give early warning of DDOS or similar attacks. Is early warning of global events possible in a distributed system with very localized knowledge at each node? given the lack of central authority, who would receive the warning? neighbor nodes? Can the protocol be specified so as to dampen over-traffic attacks? What info must a node disclose about itself in order to prevent a cascade failure (econ, info theory)?

Whitfield Diffie also spoke at the conference. He was nice enough to give explicit Creative Commons permission to record and distribute his speech, if any of us felt so inclined. He gave an off-the-cuff history of cryptography and society, with a focus on how we modify our behavior and adjust our risk-cost tradeoffs as technology changes. He also mentioned that he prefers Availability to Integrity to Confidentiality; that is, He thinks the rest of the security world has priories reversed.

Some more chicken-scratch:
Is it possible to do a BCC encryption? So that Alice post a single message a public BBS that can be read by both Bod and Charles, but each thinks they are the only recipient. Can this be done (a) transparently (b) in a way that adjusts to addition and removal of members in the group. Secure IRC, IM chat.

The best way to secure computer systems is probably to look toward biology, and create an immune system. Learning, dynamic collection of agents that monitor and repair.

More Accurate Speedometer

erich — Mon, 16 Jun 2008 02:18:17 +0000

I’ve been sitting on this idea for at least a year, and I’m not really sure why I haven’t blogged about it yet. It turns out that car speedometers are not very accurate. In fact, they can be on the order of 10% wrong, with the error more prominent at higher speeds. Typically they work by counting the revolutions per unit time of the axle, and then some quick math involving π. The odometer is usually measured in a similar fashion. I think that this is a terrible way to measure speed and distance. A much better way would be to have a photo-detector scanning the surface of the wheel. The system would work on the same principles that enable optical mice to detect their motion. It would be more accurate because there’d be less variation in tire width, and error would not compound at higher speeds. The only real drawback is that mud or dirt might obscure the sensor (but then you can fall back on the existing methods).

Bounding Problution.

erich — Wed, 04 Jun 2008 05:07:21 +0000

Problutions (noun, pl.): the problems that are an inherent part of solution to a very difficult and complex problem. These problems arise from within the solution itself, and are extrinsic to the original problem.

Lately, I’ve been feeling a bit of ennui, despite preparing for a move to UCI. I’ve also bits from Minksy‘s Society of Mind, Kauffman‘s At Home in the Universe which got me thinking:

Some of the problems in life are complicated, sure. But, that doesn’t mean a large brain is worth developing. Bacteria are enormously successful, by volume, mass, sheer number, range of environments, pretty much any way you measure evolutionary success, they’ve outdone us humans by several orders of magnitude. Yet, they have no brains, no language, no society, no bureaucracy, no wars, etc. When faced with the problem of finding and exploiting an evolutionary strategy, simple solutions are very effective. And it really gets no simpler than a bacterium, and no more effective than horizontal gene exchange. Furthermore brains come at a very high cost: cranium is needed for protection, it consumes an immense amount of resources (in humans it accounts for about 1/3 of the basal metabolic rate), and cognition brings with it a host of other psychiatric problems. So I’m wonder if we aren’t over-optimized in some way. If we might be some evolutionary dead-end (requiring a genetic scientist to take over), that our history is one of evolution trying out bigger and bigger brains in a race with the Red Queen.

So, accursed with my big brain, I kept thinking:
There must be a general principle going on here. Life presents insoluble problems, evolution attempts solution by expanding computational hardware. I get the feeling that there must be a relationship between the computational complexity¹ of a problem and that of it’s solution. In general given a problem, we should be able to at least bound the computational resources that would be needed to solve it, a bound on the complexity of solution. I figure that this has got to be formalizable in a manner akin to the Minimum Description Length or Kolmogorov complexity. At any rate, it ought to be useful to compute these bounds, because then we can have both an estimate of how difficult it will be to engineer a solution, and an indicator of when we’ve over-invested down a dead-end.

1. I don’t mean to mention the official computational hierarchy here. I mean something more practical, complexity in the engineering sense.

The Dryer Protocol

erich — Wed, 19 Mar 2008 01:20:02 +0000

I’m becoming increasingly interested in systems engineering, but still don’t really know much about the subject. One thing that I read about recently (though I completely forgot where) was about how to choose a robust cooperation protocol, using the communal clothes dryer as an example.

In a system there’s often more that one way of doing things, or more that one way of ordering the things that need to be done. When it comes to drying your clothes, we have the two basic strategies:

EmptyLintFirst
The program can first empty the lint trap, and then dry the clothes.
EmptyLintLast
The program can first dry the clothes, then empty the lint trap and leave it clear for the next agent to use.

So, obviously, both of these strategies will work with themselves quite well. If everyone follows the EmptyLintLast protocol then the lint trap will always be clear when you begin to dry your clothes. So, how to choose between the two strategies? First we set up a cost function for when things go wrong, one that reflects the actual system cost. For this example, we’ll charge 1 minute for emptying the lint trap, if it was already clean, and 60 minutes for drying clothes with a dirty lint trap. Then we ask the all important question: In a mixed system what is the associated cost for each strategy?

EmptyLintFirst
Half the time the EmptyLintLast guy will have left a clean lint trap, so, being simple-minded folk, we clean it again at a penalty of 1 min. But the trap is always clean when we go to dry the clothes, so no penalty there.
Total cost: 0.5 minutes.
EmptyLintLast
Half the time the EmptyLintFirst will have left a dirty lint trap, so we end up with soggy clothes for a penalty of 60 minutes. But the trap is always dirty when we go to clean it, so no penalty there.
Total cost: 30 minutes.

Clearly the EmptyLintFirst is the more robust of the two strategies, not only does it work if everyone follows this strategy, but it also has a much lower cost in the mixed environment. Most of the time in software development, the setup and initialization cost (cleaning the lint trap) is much less than the actual computation cost (drying the clothes), so it clearly pays to always initialize your environment prior to using it, even if you think it’s unnecessary, it’s an incremental cost that buys you a more robust and reliable system. And since you can’t depend on that junior programmer (or yourself) to always clean up the environment after using it, you save yourself maintenance nightmares down the road.

So from Systems Engineering 101:

List all the alternative ways of performing the task.
Associate a cost for when each thing can go wrong.
Evaluate each of the alternatives in a mixed environment.
Choose the protocol with lowest cost.

If the cost analysis results in a tie, then adjust the cost function to arbitrate some differences.