eric the fruitbatBlog
Sounding out the Noosphere.

Ideas

Typestate and SSA

Posted by Eric Hennigan
On July 19th, 2010 at 17:07

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Ideas

Since Rust was announced, I looked into the underlying idea of Typestate[1]. The most striking aspect about this paper is its age. The authors consider an extension to ordinary type checking to also verify that the state of a variable is valid for the operation being performed. The most immediate benefit is that it helps to identify operations being performed on uninitialized variables. Of course, in today’s most popular OO languages, such checks are usually buried into the semantics of the language itself.

In reading the paper, one example made me immediately think of SSA:

Another mechanism which can be used to enhance program reliability is one based on static scope rules in block-structured languages of the Algol family. Static scope rules enable a compiler to detect references to variables which are made outside the lifetime of these variables. For example, in this program segment

declare
  A: integer;
begin
  A := 3;
  Print(2*A);
end;
Y := A

the final statement will be detected by the compiler as an illegal reference to variable A, thereby avoiding a potential reference to deallocated storage.

The insight follows from SSA’s separation of value from variable. Here we should immediately identify that the typestate of a variable is actually an attribute of its SSA value. So all the research that has gone into SSA should find immediate applicability here (esp. linear construction).

The idea of checking for a variables state as well as it’s type seems pretty valuable. For example, it would be nice if I could annotate function signatures that they expect initialized types to be passed in, so that I can receive a compiler error. It’s not really much more work for me to annotate than it is to assert( arg != null );. But I’d rather have the error occur much earlier, at compile-time rather than test/run-time.

The typestate idea also naturally fits with the OO-paradigm, when you think of each class being a state machine and methods providing transitions between states. Experimental languages with this feature already exist[2] and make the expression of object state much more explicit. A suitably extend type checker can handle the additional complexity, and typestate-aware SSA can handle any specialized optimizations that can be performed with the increased knowledge.


[1] Robert E Strom and Shaula Yemini. Typestate: A Programming Language Concept for Enhancing Software Reliability. IEEE Transactions on Software Engineering 12(1):157-171, 1986.

[2] Jonathan Aldrich, Joshua Sunshine, Darpan Saini, and Zachary Sparks. Typestate-Oriented Programming. In Proceedings of Onward!, 2009.

Quarter Update

Posted by Eric Hennigan
On December 12th, 2009 at 17:12

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Education, Ideas, Self

So the quarter finally ended, and I can get back to having a social life over the holidays!

The compiler that I wrote was, in my personal opinion, a pathetic failure. Though we were able to generate some x86_64 code that actually ran (given a bootstrapping program), it would crash on certain valid inputs. We were also unable to handle arrays properly; our calling convention was a nightmare (I now heartily recommend going with proven standards, rather than trying to come up with one yourself); and our register allocator was basically non-existent (it just spilled everything, assigning all ssa values to their own slot, and never re-using slots). On the positive side, it was able to do factorial, and towers of hanoi, so we nailed recursion!

The TA’ing gig was wonderful. I had great fun teaching the kids concepts, tricks, ways of thinking. It was my first time, so I occasionally gave away answers in my eagerness to explain, but overall I think I did a reasonable job. Some of the kids regularly came to my office hours, and thanked me for helping them pass the class (which felt really wonderful, I made a difference!) I ended up loving the TA work so much that I even let it take time from some of my other obligations (research, and my own classwork).

Given my experience, there’s one big thing that I’d change in the current system: I’d make discussion sections mandatory, and limit their size to about 10. This would make discussion an actual discussion, rather than just another lecture. I think the discussion sections should encourage the design aspects of the homework/projects. That is, we all gather round the table, and whiteboard a design. One of the things that I noticed in grading the programming assignments was that our students are sometimes really creative in their solutions (and not always in a good way). I think that we probably do a poor job of teaching design, because there isn’t really a quantitative scale that can be used to measure how good a design is. So, they need to practice for an hour each week, offer each other design options, constructively criticize by identifying holes in each other’s attempts (with guidance from the TA), until a satisfactory solution is converged upon. I don’t think our current system quite has the resources to do this though.

I still don’t get too much of a break though: I’d like to re-do the compiler, this time with a better design; I need to catch up on some neglected research; and I want to play around with Google’s new language Go.

Intellectual Land Grab

Posted by Eric Hennigan
On September 4th, 2009 at 01:09

Permalink | Trackback | Links In |

Comments (1) |
Posted in Ideas, Idiocracy, People, Philosophy, Politics, Punditry

The Libertarian think tank CATO recently published a small, trite piece that attempts to establish The Case against Literary (and Software) Patents. Being a Libertarian, I actually agree with the position; I just don’t think that this article fully explored the issue. Here, I seek to provide some links to more fundamental content.

It begins with the hypothetical existence of a ‘Literature patent’. I consider such an idea to be terrible at face value, and the article actually dismisses it as much. It would be ridiculous to expect every author to carefully comb over their work making sure that it doesn’t infringe on any registered plots or (worse!) plot devices (good buy holodeck!) Acquiring knowledge of registered patents would be prohibitive for a beginning author, they’d have to rely on publishers/editors. This significantly raises the cost of creating an innovative work. Not to mention the human effort the government must spend to maintain consistency in it’s patent database, and the legal costs and liabilities for the inevitable infringement.

The article then proceeds to demonstrate what happens in patentable areas. Immediately, there is a land grab on the ‘low hanging fruit’. During this process, established market leaders tend to benefit, because the have the resources (both funds and people) to make a large number of claims and file the required paperwork (economics of scale apply to paper shuffling too). Typically only a relatively few companies will be successful in this endeavor. The initial grab might also appear to be an ‘economic stimulus’, as it will show a remarkably steep and sudden interest in the field, resulting from the underlying similarity of the tragedy of the commons. After the market settles, a few incumbents then use their patent portfolios to threaten up-start competition. As the article points out, in the world of software patents, so much of the field is so obvious, that agents without an explicit interest in software will find themselves infringing as a normal course of their business but will be without their own patent portfolio and unable to make a bargaining counter-threat.

Another economic phenomenon that happens as a result of the patent system’s existence is economic stalemate. This actually happened with the sewing machine, as recorded by Adam Mossoff in A Stitch in Time: The Rise and Fall of the Sewing Machine Patent Thicket, which was blogged about at the Volokh Conspiracy. He recounts how the marketing and distribution of the sewing machine was actually encumbered by the patent system, because the machine required the combination of several innovations, and no single agent held all patents on the functionality. History also demonstrates the practice of ‘patent trolling’, whereby a company, which doesn’t actually produce anything, seeks to profit by legal threats of infringement and licensing agreements on its patent portfolio. The resulting stalemate was finally resolved through the explicit creation of a patent-holding company, whose sole function was to share the patents and resulting profits of all involved manufacturing firms.

So we can see that for areas where copyright is already established practice, the introduction of an extension of the patent system results in litigation and paperwork and encourages the preservation of an established regime of a few powerful companies working in loose collusion, both of which tend to outweigh any potential benefits to development and innovation

Now, I’d like to go out on a limb here, and reject the very concept of ‘Intellectual Property’. Richard Stallman has spoken out against its use, and continuously advertises the fact that it’s deceptive and misleading. At the root of the issue is that ideas and physical matter behave differently. That is, copying != stealing. The reasoning behind this position is fairly simple, when an idea is copied that does not deprive the original possessor from the idea. When you tell me about your theory of X, you don’t suddenly forget after telling me. In contrast, if you give me an apple, now you no longer have that apple. Ideas are part of a different realm of existence.

Finally, I’d like to point out the slippery slope, what happens if we go too far with this property idea: we might lose The Right to Read, or watch How creativity is being strangled by the law.

I hope that through these references, you can see where, how, and why I’ve developed my position on the patent issue; I’m firmly on the side of maximum freedom (and that includes the opening up of all media: open-music, open-software, open-hardware, open-design, open-architecture, open-video, open-government, open-literature, etc…)

Fully Homomorphic Encryption

Posted by Eric Hennigan
On June 20th, 2009 at 11:06

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Ideas

I didn’t know it at the time I posted my idea on secure computing, but others have been working on the idea for quite some time. It turns out that Craig Gentry recently published Fully homomorphic encryption using ideal lattices. For those, like myself, who are not well versed in the crypto terminology, Scott Aaronson provides a nice summary.

First, in what might or might not turn out to be the biggest cryptographic breakthrough in decades, Craig Gentry has proposed a fully homomorphic encryption scheme based on ideal lattices: that is, a scheme that lets you perform arbitrary computations on encrypted data without decrypting it. Currently, Gentry’s scheme is not known to be breakable even by quantum computers—despite a 2002 result of van Dam, Hallgren, and Ip, which said that if a fully homomorphic encryption scheme existed, then it could be broken by a quantum computer. (The catch? Van Dam et al.’s result applied to deterministic encryption schemes; Gentry’s is probabilistic.)

So this essentially fulfills my idea, that in order to take advantage of cloud services, it’s imperative that you be able to store and compute in an encrypted domain. Otherwise, your service provider might spy on you.

Building Linguistic Structure

Yesterday, I had an interesting thought. My advisor once made the cultural observation that many people in Computer Science invent their own language and then immediately write a self-hosting compiler. I agree that a compiler is quite a feat of engineering and serves as a nice test case to demonstrate that the language you’ve invented is powerful enough that it can handle real-world complexity. Unfortunately, this test fails in a few important ways.

First, It doesn’t actually show as much as you think it might. There is a very strong filter on failed languages. By using this test the author runs the risk of re-designing the language, specifically to insert constructs that help them build the compiler. Now, this isn’t necessarily a bad thing, except that compiler writing is now a fairly mature field. There are standard abstractions (esp. in the lexing and parsing) that a new language will probably not experiment with. So, the author will usually just build these existing and well-understood abstractions into the new language. Rather than encouraging language experimentation we get more of the same, but with different syntax.

Second, Not all useful languages even have their own compiler. I’m specifically thinking of the domain specific languages (DSL). Nobody would write an awk interpreter in awk; or a mail engine using sendmail (even if it is Turing Complete). These are languages designed to do a specific task, many of them are quite essential to their respective fields, but none of them are self-hosting. Nor should we expect them to be.

My argument here is that the cultural practice of writing a self-hosting compiler is a big distraction. New languages should be for experimenting with new linguistic constructs. We should be looking toward the DSLs, and incorporating their innovations into our more main-stream languages. Right now, we seem to be optimizing our languages for compiler construction.

I’d rather see our languages evolve in a different direction. I’m really eager to witness the birth of an AI. For this to happen though, we need languages for expressing patterns of thought, not patterns of bits. We need the ability to cohesively and flexibly assemble the stuff of thought. I’m thinking Society of Mind stuff here. We need languages that allow for statistical fuzziness, sloppy associativity, and the ability to construct metaphor.

The linguistic tools that we find useful for building compilers are not necessarily the same tools that will help us build a mind.

Decentralized Garbage Collection

Posted by Eric Hennigan
On May 20th, 2009 at 00:05

Permalink | Trackback | Links In |

Comments (2) |
Posted in Comp*, Ideas

When I visited Washington D.C. for ASPLOS/VEE, I was working on my paper about cascade failure, and had distributed systems on my mind. Some of the papers at the joint conference were about garbage collection. One of the more interestring had an approach called leak pruning, for solving logical leaks. After awhile, I had an interesting idea that merged a bit of these two fields.

You see, distributed systems and garbage collection are both about graph problems. In the case of distributed systems, it’s usually about minimizing traffic along the edges, or performing some kind of global computation by communication with neighbors. In garbage collection, we’re concerned about reference cycles and reachability. But what if you ask: Is it possible for an object (node in graph) to decide, for itself, when it has been disconnected from a root node? In answering this question I came up with a very simple approach. I’ve been sitting on it for awhile, but, just yesterday, the paper was typed up in a hurry and submitted to the Fun Ideas and Thoughts session of PLDI.

The primary idea is this: If root nodes are tagged with a high value, and heap nodes are tagged with a low value, and each object acts as a node performing a distributed computation, then the average value of any subgraph that becomes disconnected from the roots will continuously decay. When it falls below a threshold value, that whole cluster can safely be collected.

There are some implementation details though. It’s going to be easiest to process these updates in an event queue in a separate garbage collection thread, rather than as a real distributed computation. When updating a single node, if we add to the queue all the points-to nodes, then the queue will rapidly grow without bound. But, we actually don’t care too much about how quickly the average worthiness is computed, so we can easily get away with adding only those nodes that are not already present in the queue.

The second problem is that references are directional. So, if we rely only the queue to hold references to objects, then it is possible for a set of nodes to become stranded (referenced neither by the queue, nor by the program) For example, the deleted branch of a subtree. We’d initially process the root of the branch, and proceed down to the leaves, but since it has no cycles we will loose all references to it, because the processed nodes will not be re-entered into the queue. (the program asked for it to be deleted, and the garbage collector does not maintain a list of all allocation sites) But this problem only happens for acyclic structures, so it can be solved by reference counting. When we go to update a node, it is safe to collect it if its reference count is zero, regardless of its worthiness value.

The third problem is that the average worthiness value of the program-reachable cluster of nodes will go down with the number of objects held by the program. This might limit the program size, but I think that the natural range of an unsigned double should be enough to solve this problem. Root nodes can have the value MAX_DOUBLE, and even if the program holds 10^6 objects, the average is still very high.

A fourth problem is the decay rate of unreachable clusters. This will be a function of both the computed worthiness at the time of disconnect, and the number of nodes in the cluster. I will probably have to adjust the formula given in my paper so that it has a faster decay rate. Unfortunately, this and the third problem above are in conflict with each other. Having high valued root nodes allows programs with more objects, but also extends the decay time.

I don’t claim this approach to be efficient, but I do claim that it’s simple to implement, and should scale nicely. Also, it might be useful for process collection in a distributed system. I’ll know if the idea was too crazy when review committee results are released on May 29.

Automatic Thesaurus

Posted by Eric Hennigan
On May 7th, 2009 at 20:05

Permalink | Trackback | Links In |

Comments (4) |
Posted in Ideas, Language

Last week, I landed on another PhD worthy research project.

Given a very large corpus of sentences, such as a digitized version of the Library of Congress, or a less noisy version of the Internet, how can you automatically generate a Thesaurus?

At first I thought the problem should be fairly easy, but the more I thought about it the more difficult and daunting the task became. For example, as a first approach, we might assume that textual substitution would be a good proxy for identifying synonymous terms. That is if a couple terms really are synonymous, then they ought to be substitutable for each other in a sentence. With a large enough set of sentences, we should be able to identify such situations, and thereby bootstrap the building of the thesaurus. But there’s a small problem, provided by my good friend EvB:

The sky is blue.
The ocean is blue.

But sky is not the same as ocean. Sure they are similar. A poet could compose a nice metaphor of fish swimming through their sky above the bottom feeders. But this metaphorical relationship isn’t one that would necessarily make it into a human compile thesaurus. So, textual substitution can easily lead us astray.

Continuing with EvB’s particularly good example we can also identify another problem. Suppose that we incorporate a bit of natural language understanding, enough to pull out parts of speech. Then, the system would easily identify the equation of sky with blue, or ocean with blue. But neither of these statements is true either. Usually people take the example to mean not that sky and blue are the same thing, but that the sky belongs to the set of objects that have a property called color, the value of which is blue. So this understanding depends on what the definition of ‘is’ is (obviously not a simple affair). We also would like to avoid drawing a relationship between any of the pronouns and the rest of the language.

Next lets look at how people tend to write. Any good library is gonna be full of metaphor, simile, pun, allusion, word play, sound play, and other such highly nuanced expression. All of these things will trump any reasonably simple attempt at drawing a link between synonymous words. Political propaganda and polemic, will probably be particularly bad at equating terms that should probably be kept logically distinct. Furthermore, at least when I write, I’m reminded of other things during the process, things that are associated, but not necessarily synonymous. These remindings are an important part of the essay writing process, but will certainly throw noise into the digital library.

But if it’s so hard to make a mechanical system for identifying synonyms, then how do humans do it? Here I have a hypothesis: that similar words stimulate similar patterns in the brain. Thus when a human tries to think up synonyms it’s really the same as playing word association with a filter. First, the word stimulates the brain, bring up certain associations. These associations will be based on ‘brain distance’, a measure of the similarity of brain activity for certain words and thoughts. But some associations will be radically different from the synonyms that we’re looking for. For example, antonyms and non-sequiturs often come up in word association games. So a filter is applied to weed these out, and what’s left is passed through a dictionary/meaning check. Anything passing this process will be reported as a synonymous term.

So, in order to really generate a thesaurus, we do need AI (or at least an underlying cognitive model). When I first thought of the thesaurus problem, I was hoping that it was paired down enough, small and simple enough that it would be doable without all this complexity. We might have to reduce the problem further, make it looser. Say, build an association dictionary, rather than a thesaurus. An association dictionary might be possible, because it forgoes the understanding of meaning and similarity, it doesn’t have to question or measure why two words should be associated, only record that they are used similarly.

So, if you can automate the building of a thesaurus, you should get a PhD in Linguistics.

Preventing Cascade Failure in Networks

Posted by Eric Hennigan
On January 9th, 2009 at 01:01

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Engineering, Ideas

For some reason I was thinking about the cascade failure problem that I mentioned before my Holiday Hiatus. I discovered that this problem has gone largely unsolved when attending ACSAC ‘08, and it’s been in the back of my mind (and on my whiteboard) since then. I’ve looked online a bit to see wether any researchers are attacking the problem, and it seems that very few are. Most of the work is also around containment of the failure rather than prevention. Most of the solutions resemble burning swaths of trees to save the forest from a fire; this disgustingly requires global knowledge of the network. Also, the systems today seem to me to be built rather naively: nodes very simply reroute traffic based on failure, without taking into account the possibility that in so doing they might overtraffic their neighbors, and thus trigger a cascade. There also seems to be a lack of information in textbooks about running networks near capacity (this is desired for efficient utilization of the network, but can drastically increase the probability of cascade).

So, I want to introduce a few constraints and assumptions before I state my rough solution. Assuming the network can be viewed as a graph with nodes and edges, where some goods (power, packets, traffic) is flowing along edges from one node to another, I’d prefer a solution that used local knowledge (node capacity, and possible neighbor capacities) instead of global knowledge. So I’ll assume that each node can keep track of how much stuff is flowing through it, and what the traffic patterns actually are; that is: how much stuff goes from edge 1 to edge 2.

In computer networks it is also possible for a node to track the possible re-routings that certain packets can take and still arrive at their destination. (I don’t know wether this is true of power delivery in the electric grid, but I’m going to assume that something analogous still holds.) Each node will be responsible for measuring traffic capacity: That is for each stream of traffic, it will keep track of what other edges could have had that traffic routed onto them. Each node will also observe a capacity constraint: It will discourage traffic that cannot be re-routed if the primary edge for that traffic suddenly fails. In this way we can keep a reserve capacity in case of edge failure.

When an edge fails, the node will re-route its current traffic in a load balanced manner along the secondary edges available for that traffic. This will temporarily violate the reserve capacity constraint, but not over-traffic other nodes. [Actually I don't think this is true, unless the node can use capacity info from its neighbors] In this way, outage of a single edge cannot trigger a cascade failure, because enough reserve capacity is always available for the traffic that was allocated to the failed edge.

    Still to be addressed:

  • Node failures which can cause multiple simultaneous edge failures.
  • Characterize the efficiency in terms of what is kept as reserve capacity and theoretical max capacity in the naive systems
  • Formalize the measurements taken by each node
  • Investigate push vs pull flows (power grid is demand based, computer networks are push)
  • Do nodes still need a means of capping their flow? (can the power grid drop packets?) I’d prefer not to require this.
  • Specify what each node is to communicate to its neighbors
  • Investigate the use of probability for multiple node failures (earthquake, software bug). Failures are not generally identically distributed or independent

ACSAC 2008

Posted by Eric Hennigan
On December 18th, 2008 at 17:12

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Engineering, Ideas

So, I went to ACSAC the other week, and sketched out some crazy notes:

It seems that nobody is really working on large scale distributed systems reliability. By this I mean things like the Internet and the Power Grid. Both of these are highly distributed systems, where nodes automatically route around local failures. Unfortunately, and probably due to locality of knowledge, cascade failure is still an unsolved issue. Most of the solutions require early detection, and prevent systemic failure by removing of key nodes before things begin to cascade. This is quite similar to bulldozing trees to save the forest. It’s a sub-optimal solution. It’s also extremely difficult to automate, as it requires more global knowledge and assumes reliable control system connectivity in order to direct shutdown. It would be much better if each node behaved in such a way as to dampen the ‘cascade wave’ while still having only localized knowledge.

In some cases, rebooting becomes an issue. The graphic on wikipedia shows this quite well, when the center node reboots and becomes the only link in a high-demand chain. In the real power grid not all stations are capable of coming online without a fairly large kick-start from the grid itself. That is, some of the emergency/peak generators require power to start up. In a total network outage, you might find yourself without power for a while, even though you have a generator. Also, the power grid is run as close to maximum capacity as possible (for efficiency reasons) even though this increases the likelihood that a local failure can trigger a cascade.

Also these lines, I’ve also scrawled that it would be nice if a network protocol could give early warning of DDOS or similar attacks. Is early warning of global events possible in a distributed system with very localized knowledge at each node? given the lack of central authority, who would receive the warning? neighbor nodes? Can the protocol be specified so as to dampen over-traffic attacks? What info must a node disclose about itself in order to prevent a cascade failure (econ, info theory)?

Whitfield Diffie also spoke at the conference. He was nice enough to give explicit Creative Commons permission to record and distribute his speech, if any of us felt so inclined. He gave an off-the-cuff history of cryptography and society, with a focus on how we modify our behavior and adjust our risk-cost tradeoffs as technology changes. He also mentioned that he prefers Availability to Integrity to Confidentiality; that is, He thinks the rest of the security world has priories reversed.

Some more chicken-scratch:
Is it possible to do a BCC encryption? So that Alice post a single message a public BBS that can be read by both Bod and Charles, but each thinks they are the only recipient. Can this be done (a) transparently (b) in a way that adjusts to addition and removal of members in the group. Secure IRC, IM chat.

The best way to secure computer systems is probably to look toward biology, and create an immune system. Learning, dynamic collection of agents that monitor and repair.

P vs NP

Posted by Eric Hennigan
On December 4th, 2008 at 22:12

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Ideas, Math

We’ve reached the NP-completeness section of my Fundamental Algorithms class, and I’ve noticed something interesting about P and NP.

class P NP-complete
Euler tour Hamiltonian Cycle
2-SAT 3-SAT
Shortest path Longest simple path

We have problems from a variety of different fields, for which one instance of the problem is in class P and the other is NP-Complete. I conjecture that whatever property it is that separates one class from the other, it must show up in all such fields. So the underlying fundamental reason for separating P from NP must translate from graph theory to boolean logic theory. It would be nice to gather such example divisions and then try to formulate an isomorphisms between fields. I’d think that eventually an isomorphism will be found that sheds enough light on the P vs NP problem that it can finally be cracked.

What really got me interested in this problem was noticing that Hamiltonian Cycle would be very easy to solve if we could convert it into an Euler tour by constructing an interchange graph; where all the nodes become edges and all the edges become nodes. But interchange is not a well defined operation on graphs, indeed it isn’t even possible in some cases. Example: The graph of 5 nodes and 0 edges. The interchange would have 0 nodes and 5 edges, which isn’t a graph. So, I was wildly speculating that perhaps what separates P from NP are these impossible operations. But to shed some light on the situation, I’d first want to construct a reduction of Euler tour from 2-SAT which mirrors the reduction of Hamiltonian cycle from 3-SAT.