eric the fruitbatBlog
Sounding out the Noosphere.

Comp*

Fully Homomorphic Encryption

Posted by Eric Hennigan
On June 20th, 2009 at 11:06

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Ideas

I didn’t know it at the time I posted my idea on secure computing, but others have been working on the idea for quite some time. It turns out that Craig Gentry recently published Fully homomorphic encryption using ideal lattices. For those, like myself, who are not well versed in the crypto terminology, Scott Aaronson provides a nice summary.

First, in what might or might not turn out to be the biggest cryptographic breakthrough in decades, Craig Gentry has proposed a fully homomorphic encryption scheme based on ideal lattices: that is, a scheme that lets you perform arbitrary computations on encrypted data without decrypting it. Currently, Gentry’s scheme is not known to be breakable even by quantum computers—despite a 2002 result of van Dam, Hallgren, and Ip, which said that if a fully homomorphic encryption scheme existed, then it could be broken by a quantum computer. (The catch? Van Dam et al.’s result applied to deterministic encryption schemes; Gentry’s is probabilistic.)

So this essentially fulfills my idea, that in order to take advantage of cloud services, it’s imperative that you be able to store and compute in an encrypted domain. Otherwise, your service provider might spy on you.

Computer Language Comparison

Posted by Eric Hennigan
On May 31st, 2009 at 17:05

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Language

Guillaume Marceau has used data from the Computer Language Benchmark Game to provide a graphical comparison of many different languages.

If you drew the benchmark results on an XY chart you could name the four corners. The fast but verbose languages would cluster at the top left. Let’s call them system languages. The elegantly concise but sluggish languages would cluster at the bottom right. Let’s call them script languages. On the top right you would find the obsolete languages. That is, languages which have since been outclassed by newer languages, unless they offer some quirky attraction that is not captured by the data here. And finally, in the bottom left corner you would find probably nothing, since this is the space of the ideal language, the one which is at the same time fast and short and a joy to use.

Of course the C compilers do a very good job on performance, but seem to do average on verbosity (better than I expected). Haskell (ghc) does a surprisingly nice job. I wish that I’d thought to do this kind of visualization, it’s really pretty neat. The only improvement that I could think of, would be to do the performance axis on a logarithmic scale.

Decentralized Garbage Collection

Posted by Eric Hennigan
On May 20th, 2009 at 00:05

Permalink | Trackback | Links In |

Comments (1) |
Posted in Comp*, Ideas

When I visited Washington D.C. for ASPLOS/VEE, I was working on my paper about cascade failure, and had distributed systems on my mind. Some of the papers at the joint conference were about garbage collection. One of the more interestring had an approach called leak pruning, for solving logical leaks. After awhile, I had an interesting idea that merged a bit of these two fields.

You see, distributed systems and garbage collection are both about graph problems. In the case of distributed systems, it’s usually about minimizing traffic along the edges, or performing some kind of global computation by communication with neighbors. In garbage collection, we’re concerned about reference cycles and reachability. But what if you ask: Is it possible for an object (node in graph) to decide, for itself, when it has been disconnected from a root node? In answering this question I came up with a very simple approach. I’ve been sitting on it for awhile, but, just yesterday, the paper was typed up in a hurry and submitted to the Fun Ideas and Thoughts session of PLDI.

The primary idea is this: If root nodes are tagged with a high value, and heap nodes are tagged with a low value, and each object acts as a node performing a distributed computation, then the average value of any subgraph that becomes disconnected from the roots will continuously decay. When it falls below a threshold value, that whole cluster can safely be collected.

There are some implementation details though. It’s going to be easiest to process these updates in an event queue in a separate garbage collection thread, rather than as a real distributed computation. When updating a single node, if we add to the queue all the points-to nodes, then the queue will rapidly grow without bound. But, we actually don’t care too much about how quickly the average worthiness is computed, so we can easily get away with adding only those nodes that are not already present in the queue.

The second problem is that references are directional. So, if we rely only the queue to hold references to objects, then it is possible for a set of nodes to become stranded (referenced neither by the queue, nor by the program) For example, the deleted branch of a subtree. We’d initially process the root of the branch, and proceed down to the leaves, but since it has no cycles we will loose all references to it, because the processed nodes will not be re-entered into the queue. (the program asked for it to be deleted, and the garbage collector does not maintain a list of all allocation sites) But this problem only happens for acyclic structures, so it can be solved by reference counting. When we go to update a node, it is safe to collect it if its reference count is zero, regardless of its worthiness value.

The third problem is that the average worthiness value of the program-reachable cluster of nodes will go down with the number of objects held by the program. This might limit the program size, but I think that the natural range of an unsigned double should be enough to solve this problem. Root nodes can have the value MAX_DOUBLE, and even if the program holds 10^6 objects, the average is still very high.

A fourth problem is the decay rate of unreachable clusters. This will be a function of both the computed worthiness at the time of disconnect, and the number of nodes in the cluster. I will probably have to adjust the formula given in my paper so that it has a faster decay rate. Unfortunately, this and the third problem above are in conflict with each other. Having high valued root nodes allows programs with more objects, but also extends the decay time.

I don’t claim this approach to be efficient, but I do claim that it’s simple to implement, and should scale nicely. Also, it might be useful for process collection in a distributed system. I’ll know if the idea was too crazy when review committee results are released on May 29.

Philosophy of Computer Science: Naming

Posted by Eric Hennigan
On May 2nd, 2009 at 01:05

Permalink | Trackback | Links In |

Comments (1) |
Posted in Comp*, Language, Literature, Philosophy, Religion

For a very long time, western culture has had a strong undercurrent about naming. Conceptually, it starts with the recognition that the ability to name a thing gives you power over it. This is reflected in many deep and ancient cultural mythologies.

The creation story in the Bible begins with:

In the beginning,…the earth was a formless void…. Then God said: Let there be light. God called the light Day, and the darkness he called Night.

So God is able to create the Earth with only his Word, and give life to mankind with only his breath. This power is nearly transferred to Adam, when he is given the task of naming all the plants and animals. Only mankind is given this linguistic power.

Jewish mythology picks up on this issue with the story of the Golem.

In many tales the Golem is inscribed with magic or religious words that keep it animated. Writing one of the names of God on its forehead, a slip of paper in its mouth, or inscribed on its body, or writing the word Emet (אמת, “truth” in the Hebrew language) on its forehead are examples of such words. By erasing the first letter aleph in Emet to form Met (מת, “dead” in Hebrew, when the aleph letter א is cancelled) the golem could be deactivated.

Jewish culture continues this tradition with the Kabbalah’s search for the True Name of God. Other cultures also demonstrate this idea. In witchcraft, a demon is both summoned and controlled by speaking its name. In the Hindu tradition AUM is the sacred word that encompasses everything, and is the sole syllable upon which focus is kept during meditation. The idea is also reflected in more modern works, as clearly expressed in Ursula LeGuin’s A Wizard of Earthsea:

Ged sighed sometimes, but he did not complain. He saw that in this dusty and fathomless matter of learning the true name of each place, thing, and being, the power he wanted lay like a jewel at the bottom of a dry well. For magic consists in this, the true naming of a thing.

Or the so recently popular, Harry Potter, where Dumbledore advises Harry:

Call him Voldemort, Harry. Always use the proper name for things. Fear of a name increases fear of the thing itself.” (PS17)

But how does this relate to Computer Science? Being a very textual discipline, we have many conventions that relate to naming. In Computer Science, we have the ability to create virtual worlds, and thus we need systems of naming the objects within those worlds. At the Language level we see a focus on naming conventions:

  • Hungarian notation, in which variables have a prefix that describes their type, such as strName for a string, or pX for a pointer to X.
  • Fortran, which had an implicit typing scheme, where any names beginning with I, J, K, L, M, N were always integer and the rest were reals.
  • The Ruby on Rails framework, which has the ability to automatically map a model named “Person” to the “people” table in the database just by name inspection.

But naming actually turn out to be a much deeper issue than these linguistic examples show. In the Distributed Systems world, we have a large focus on naming, for a remote resource can only be accessed through its name, in what’s called name resolution. The easiest example to pick on here, is DNS, the system that allows a person to reference a remote computer by using an easy to remember domain (such as www.example.com) instead of a hard to remember physical address (such as 127.0.01). We can also identify a confluence of two separate concepts: The name of a machine can be used to locate it. This allows machines to operate with the previous cultural ‘power of naming’, knowing a machines name gives one access to that machine.

Since my research focuses on computer security, this duality between names and locations can be really critical. For example, there is a model for building secure software, called the object capabilities model, that not only identifies this power of naming, but actually explicitly states it as an axiom of the model:

  • Objects (actors) can interact only by sending messages to unforgeable addresses.
  • An object acquires knowledge of other objects in one of two ways:
    1. It is created with addresses that it receives from its creator
    2. It receives a message with an address to another object.

So, the security of the system is brought down to names. Communication and therefore power over other objects can only be obtained by learning their true names, which must be kept secret (unforgeable). For if a malicious object were able to easily guess the names of other objects in the system, it could quickly wreak havoc.

As such systems work their way into our daily lives, our personal names (read: personal identification) have also become much more important recently, as anyone that has been a victim of identity theft can attest. But this is an issue I won’t go into here. There are also other cultural impacts, for names change the way we think about each other.

Philosophy of Computer Science: Boundaries

Posted by Eric Hennigan
On April 17th, 2009 at 15:04

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Philosophy, Physics

One of the reasons that I’ve decided to spend my life studying Computer Science is its deep connections with philosophy. Mankind has been asking deep philosophical questions for a very long time, but we’ve been able to chip away at these questions only recently. I’ll review a few more of these connections later, but right now I’d like to focus on Boundaries.

When a program is running it does so within a ‘memory space’. Modern Operating Systems practice various forms of memory isolation between processes. This is done to prevent the memory manipulations performed by one program from affecting the state of any other programs. So the OS enforces a boundary between the memories of different running programs.

The OS also provides to each program an ‘environment’. The environment can be used for a number of things, and is typically used to pass settings implicitly (via key/value pairs), so that the program is able to behave slightly differently depending on the environment in which it is run. The OS also provides services to the program, so that it may (in a well-defined manner) communicate with the ‘outside’ world. It’s through these services that the program can access files, get keystrokes and other user input, talk to other computers using network protocols, etc.

The OS also maintains for each program the illusion that it is the only one running. As far as the program is concerned it is the only thing in the universe, and the universe (other programs, external devices, etc.) responds to communication in a well-defined manner. Any running program is the center of a virtual universe. It’s this virtuality that I’d like to explore.

Tron: Can a program break out of its universe? and if so, What’s out there?

Operating Systems didn’t always isolate programs as effectively as they do now. It used to be possible for one program to read, and even edit the memory space of a different program. Of course this almost always leads to erratic behavior for either or both programs, that ultimately winds up with a crash. But if a program could step outside of its universe, what would it see?

The memory of other programs. In the Tron link above, the light-cycles were able to break out of their battle arena, and wander around the system. Unfortunately, since they actually wrote back the light trail, the memory they wandered over would be corrupted. But suppose they only read? This would never harm the assumptions made by another program, so it wouldn’t be likely to lead to a system crash. And if assumptions about the size and address range of physical memory is known by the program, then it can be made to never trigger an out-of-bounds exception by the OS. This underlies the classic buffer vulnerability. Using this technique, a program can print out the memory contents of other programs (potentially revealing information that was supposed to be confidential).

So far I’ve been talking about a conceptual boundary that programmers deal with. What about the users of a system? do they ever notice the boundaries? Steve Yegge observes that in games like Mario Kart, there’s an Invisible Wall marking the boundary of game play. Outside the wall lies undefined territory.

When we reason about our own universe, we should adopt the same point of view that we have for programs. We can’t access ‘outside’ our universe. So lets ask some of these questions.

Question Program’s Answer Our Answer
What’s outside the universe? Memory in an undefined state It’s an ill-defined question. By definition it’s not matter, energy, or space; and we don’t have any way to reason about it.
What was there before the universe? There is no before. In order to exist, the OS had to allocate and overwrite what was in memory before the program starts. It’s an ill-defined question. Time began with the universe: there is no ‘before the beginning of time’.
Can we tell that we are in a virtualized environment? Sometimes, yes. There are ways to Detect System Emulators. Possibly, but I don’t know of any way. (Possibly experiments to detect non-conservation of information). Further research is needed.

That last question deserves more discussion:

  • It’s possible to go the other way! Instead of breaking out of the matrix, it’s possible to put the universe into one. Using the Blue Pill.
  • Writers of software for penetration and infiltration wish to avoid honeypots. Honeypots are cheaply created via Virtual Machines, but, as in the case of VMWare, introspection on the drivers can allow a program to tell that its underlying OS uses VMWare devices, and is therefore virtual.
  • The program can look for differences in behavior between real and virtual machines: CPU specific bugs, model specific registers, timing of system calls, relative performance of assembly instructions, memory alignment, cache performance, etc.

Linguistics and Computer Languages

Posted by Eric Hennigan
On April 12th, 2009 at 17:04

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Language, People

Of course, I would never think that I was the only one to have the idea of studying computer languages from a linguistics point of view. Well, I found an interesting character, by the name of Chris Barker that gave an interesting keynote at POPL in 2004. He’s mentioned in a recent LtU discussion about the “Influence of cognitive models on programming language design”. Unlike most linguists, that get branched off into anthropology and soft models of cognition, Barker really knows what he’s talking about when it comes to formal models. He even has an (interactive!) tutorial on lambda calculus.

Unfortunately, I wasn’t able to scare up any recording of his keynote, but the abstract is available.

Linguists seek to understand the semantics of expressions in human languages. Taking a computational point of view, there are many natural language expressions—operators in the wild, so to speak— that control evaluation in ways that are familiar from programming languages: just think of the natural-language counterparts of if, unless, while, etc. But in general, how well-behaved are control operators found in the wild? Can we always understand them in terms of familiar programming constructs, or do they go significantly beyond the expressive power of programming languages?

I’d love to take a whole class devoted to this kind of stuff!

P vs NP

Posted by Eric Hennigan
On December 4th, 2008 at 22:12

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Ideas, Math

We’ve reached the NP-completeness section of my Fundamental Algorithms class, and I’ve noticed something interesting about P and NP.

class P NP-complete
Euler tour Hamiltonian Cycle
2-SAT 3-SAT
Shortest path Longest simple path

We have problems from a variety of different fields, for which one instance of the problem is in class P and the other is NP-Complete. I conjecture that whatever property it is that separates one class from the other, it must show up in all such fields. So the underlying fundamental reason for separating P from NP must translate from graph theory to boolean logic theory. It would be nice to gather such example divisions and then try to formulate an isomorphisms between fields. I’d think that eventually an isomorphism will be found that sheds enough light on the P vs NP problem that it can finally be cracked.

What really got me interested in this problem was noticing that Hamiltonian Cycle would be very easy to solve if we could convert it into an Euler tour by constructing an interchange graph; where all the nodes become edges and all the edges become nodes. But interchange is not a well defined operation on graphs, indeed it isn’t even possible in some cases. Example: The graph of 5 nodes and 0 edges. The interchange would have 0 nodes and 5 edges, which isn’t a graph. So, I was wildly speculating that perhaps what separates P from NP are these impossible operations. But to shed some light on the situation, I’d first want to construct a reduction of Euler tour from 2-SAT which mirrors the reduction of Hamiltonian cycle from 3-SAT.

Probability Programming

Posted by Eric Hennigan
On November 15th, 2008 at 20:11

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Language, Math, People

Yesterday a very interesting speaker, Eric Hehner, gave a talk at the graduate seminar:

TITLE

A Probability Perspective

ABSTRACT

This talk could be called “probability meets programming”. It draws together four perspectives that contribute to a new understanding of probability and solving problems involving probability. The first is the Subjective Bayesian perspective that probability is affected by ones knowledge, and that it is updated as ones knowledge changes. The problem of assigning prior probabilities is mitigated by the Information Theory perspective, which equates probability with information. The main point of the talk is that the formal perspective (formalize, calculate, unformalize) is beneficial to solving probability problems. And finally, the programmer’s perspective provides us with a suitable formalism.

I found the talk extremely fascinating. He first compared measures of probability, entropy, and information, demonstrating that they were (in a sense) substitutable concepts (analogous with Energy and Mass, or Energy and Temperature).

b bits = 2b states = 2-b chance
log(s) bits = s states = 1/s states
-log(c) bits = 1/c states = c chance

He also waxed poetical about how we are often fooled about probabilities, so it pays considerably to mechanize our calculations regarding those probabilities. It helps even further if we have a formal language into which we can directly translate our real-life word problems, so that we don’t accidentally setup and then solve the wrong problem. That is, we can then move on from debating about interpretations of the problem, and into actual calculation.

  • If I have two children and one of them is a girl, what is the probability the other is also a girl? (ans: 1/3)
  • If I have two children and the older one is a girl, what is the probability that the younger is also a girl? (ans: 1/2)

He also talked quite a bit on the Bayesian approach to probability, and why it is much nicer than the frequentist approach, because it assumes much less about the world. There is no need for a prior, and your measured probabilities are updated naturally as your knowledge of the world changes.

From this he moved on to providing a proto-language approach to how one would setup and solve these sorts of problems. He co-opted existing computer language constructs to do this. First we notice that in statements like

IF cond THEN this ELSE that ENDIF

the cond is a boolean value. But there’s a priori no reason why we are prevented from interpreting it as a probability, a real value in the range [0,1].

IF 1/2 THEN print(”heads”) ELSE print(”tails”) ENDIF

He quite nicely demonstrated a calculus that gives you the ability to compute the result of such random decision trees. So, for example if you were faced with the Two Envelopes Problem how you could compute the value of a strategy expressed in his probability language.

I really liked the talk because of the way in which it drew upon existing fields and showed a very curious intersection of them. After the talk I asked if he could use this language to calculate an optimum strategy (he said no, he hadn’t done that, but it would be a good area of research) and if he had considered the addition of a switch-case statement (he had, but though he didn’t know how it would look in the language, he quite liked the idea of adding it).

The research paper, a Probability Perspective underlying his talk is available, as is his book, A Practicle Theory of Programming. He also mentioned that he has a grad student that has applied this probability programming to proof of quantum algorithms with much success, and that he’s yet to find a student to implement (write a compiler/interpreter for the language)

Function calling notation.

Posted by Eric Hennigan
On October 25th, 2008 at 15:10

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Language

I was reading Yegge’s rant Rhinos and Tigers, and he mentioned that:

So it’s kind of unfortunate when you have to use functions, because if you have to say, you know, HTMLElement.getChildren.whatever, it gets inverted with functions: whatever(getChildren(HTMLElement)). You have to call from the innermost one to the outermost… it’s “backwards”, right?

This doesn’t necessarily have to be true, we could institute a new functional calling convention where the arguments precede the function name:
((HTMLElement)getChildren)whatever

Let’s try this out a bit.
(5)factorial
("Hello World\n")print
(5, 3)add
(math.pi)math.cos

It’s apparently works nicely for certain functions: Yegge’s example and factorial; but not at all for most others. This could be because in ordinary discourse, Subject precedes Action precedes Object. I suppose you could get used to this notation, but I find it too backwards for too many things.

Heterogenous Lists

Posted by Eric Hennigan
On August 12th, 2008 at 16:08

Permalink | Trackback | Links In |

Leave a Comment |
Posted in Comp*, Ideas, Language

I’m used to C’s version of unions and structs. In C a union is simply a spot of memory into which various types of things can be stored. C doesn’t do all that much checking on the data types though. For example, if you construct a union of an 4-byte int and a 4-byte double. C will allow you to assign to the int and then later read from the double. Because these values share the same memory space, you will very likely get back garbage (the bytes of the int will be interpreted as if they were actually a double). C essentially allows you to subvert the type system in this fashion, leaving it up to the (unreliable) programmer to keep track of the type which is stored in that location, so that those bits are not ever used as if they represented a different type. C++ inherits this feature, making it especially dangerous, because the programmer might accidentally pull out the wrong class, and thus use those bits for a nonsensical dynamic method resolution.

What would be really nice is to have a union type, which the compiler is able to check for you. According to section 7.3 Records (Structures) and Variants (Unions) of Programming Language Pragmatics, Ada requires the use of a discriminant in its variants. Essentially, this means that the compiler is able to keep track of what type of value that part of memory represents. It also disallows converting of types. Once a value of type A is stored in that location, it will always be treated as type A, and never as a type B. This tag uses up extra memory, but does help provide type safety.

Also, in C, there is the array. Essentially a packed section of memory that contains many of the same type of object. It’s homogeneous. Only one type can be stored in it. Since I started using some of the more dynamic scripting languages, I’ve always found homogeneous data structures to be quite constraining. It’s just so inflexible to store a list of only one kind of thing. Many times I like to store a list (preserves order) of different kinds of things. I like heterogeneous lists. But how to implement them?

In C we can easily create a union of all the different types we might want to put in the list, and then just put them in like that. But wait! When we pull them out, we’ve now forgotten what type we used when we put them in. So we could create a struct with a tag (to tell us the type) and a union to store the data. Then every time we pull a value out of the array, we first look and the tag, and then use the appropriate data type. The compiler will be in the dark about most of this, and won’t catch certain mistakes (namely, pulling a value out as the wrong type, or omitting a type*)

In my dream language, I definitely want the compiler to type check my unions, such that I don’t accidentally try to pull out data with a different type than was used when I put it in. In this way I would be able to safely and easily create a heterogeneous list by simply creating a homogeneous list of unions. There will probably be a large run-time type-checking or tag-verifying code that would be unavoidable, but I’d like to make the analysis as static as possible for performance and safety reasons. I’m hoping that a typing system with powerful enough type-inference could help me out with that a bit.


* Actually, if the tag is an enum then C++ will verify that any switch-case exhaustively checks the enum. So omitting a type should raise a compiler warning.