eric the fruitbat Blog
Sounding out the Noosphere.

Language

Cultural and Linguistic Sexism

Posted by Eric Hennigan
On July 26th, 2009 at 16:07

Posted in Language, Politics

I was working on a paper today, and noticed something very peculiar about linguistic gender-neutrality. I know that we are all encouraged to use female characters in our examples to combat the inherent chauvinism of the English language. Despite following the recent gender politics over at Less Wrong (summarized by a post on The Nature of Offense) and hearing Douglas Hofstadter explore the topic in some of his work, I’m still not entirely convinced that enough women feel alienated when males are used in examples.

Nevertheless, the issue has been raised to the level of my awareness, and I’m now sensitive to it. So I find myself in a huge conundrum, as my example involves shopping. So, I’m a bigoted sexist no matter what gender I choose!

Damn, English is sucks!

P.S. Assuming that examples involving idiots and geniuses occur with equal frequency, what exactly stops people from using one gender as the canonical example for idiocy and the other gender for genius? Of course this association can be carried as far as one’s bigotry allows.

Building Linguistic Structure

Yesterday, I had an interesting thought. My advisor once made the cultural observation that many people in Computer Science invent their own language and then immediately write a self-hosting compiler. I agree that a compiler is quite a feat of engineering and serves as a nice test case to demonstrate that the language you’ve invented is powerful enough that it can handle real-world complexity. Unfortunately, this test fails in a few important ways.

First, it doesn’t actually show as much as you think it might. There is a very strong filter on failed languages. By using this test the author runs the risk of re-designing the language specifically to insert constructs that help them build the compiler. Now, this isn’t necessarily a bad thing, except that compiler writing is now a fairly mature field. There are standard abstractions (esp. in lexing and parsing) that a new language will probably not experiment with. So, the author will usually just build these existing and well-understood abstractions into the new language. Rather than encouraging language experimentation we get more of the same, but with different syntax.

Second, not all useful languages even have their own compiler. I’m specifically thinking of domain-specific languages (DSLs). Nobody would write an awk interpreter in awk, or a mail engine using sendmail (even if it is Turing complete). These are languages designed to do a specific task. Many of them are quite essential to their respective fields, but none of them are self-hosting. Nor should we expect them to be.

My argument here is that the cultural practice of writing a self-hosting compiler is a big distraction. New languages should be for experimenting with new linguistic constructs. We should be looking toward the DSLs, and incorporating their innovations into our more mainstream languages. Right now, we seem to be optimizing our languages for compiler construction.

I’d rather see our languages evolve in a different direction. I’m really eager to witness the birth of an AI. For this to happen though, we need languages for expressing patterns of thought, not patterns of bits. We need the ability to cohesively and flexibly assemble the stuff of thought. I’m thinking Society of Mind stuff here. We need languages that allow for statistical fuzziness, sloppy associativity, and the ability to construct metaphor.

The linguistic tools that we find useful for building compilers are not necessarily the same tools that will help us build a mind.

Computer Language Comparison

Posted by Eric Hennigan
On May 31st, 2009 at 17:05

Posted in Comp*, Language

Guillaume Marceau has used data from the Computer Language Benchmarks Game to provide a graphical comparison of many different languages.

If you drew the benchmark results on an XY chart you could name the four corners. The fast but verbose languages would cluster at the top left. Let’s call them system languages. The elegantly concise but sluggish languages would cluster at the bottom right. Let’s call them script languages. On the top right you would find the obsolete languages. That is, languages which have since been outclassed by newer languages, unless they offer some quirky attraction that is not captured by the data here. And finally, in the bottom left corner you would find probably nothing, since this is the space of the ideal language, the one which is at the same time fast and short and a joy to use.

Of course the C compilers do a very good job on performance, but seem to do average on verbosity (better than I expected). Haskell (ghc) does a surprisingly nice job. I wish that I’d thought to do this kind of visualization; it’s really pretty neat. The only improvement that I could think of would be to put the performance axis on a logarithmic scale.

Automatic Thesaurus

Posted by Eric Hennigan
On May 7th, 2009 at 20:05

Posted in Ideas, Language

Last week, I landed on another PhD-worthy research project.

Given a very large corpus of sentences, such as a digitized version of the Library of Congress, or a less noisy version of the Internet, how can you automatically generate a Thesaurus?

At first I thought the problem should be fairly easy, but the more I thought about it the more difficult and daunting the task became. For example, as a first approach, we might assume that textual substitution would be a good proxy for identifying synonymous terms. That is, if a couple of terms really are synonymous, then they ought to be substitutable for each other in a sentence. With a large enough set of sentences, we should be able to identify such situations, and thereby bootstrap the building of the thesaurus. But there’s a small problem, provided by my good friend EvB:

The sky is blue.
The ocean is blue.

But sky is not the same as ocean. Sure, they are similar. A poet could compose a nice metaphor of fish swimming through their sky above the bottom feeders. But this metaphorical relationship isn’t one that would necessarily make it into a human-compiled thesaurus. So, textual substitution can easily lead us astray.

Continuing with EvB’s particularly good example we can also identify another problem. Suppose that we incorporate a bit of natural language understanding, enough to pull out parts of speech. Then, the system would easily identify the equation of sky with blue, or ocean with blue. But neither of these statements is true either. Usually people take the example to mean not that sky and blue are the same thing, but that the sky belongs to the set of objects that have a property called color, the value of which is blue. So this understanding depends on what the definition of ‘is’ is (obviously not a simple affair). We also would like to avoid drawing a relationship between any of the pronouns and the rest of the language.

Next, let’s look at how people tend to write. Any good library is gonna be full of metaphor, simile, pun, allusion, word play, sound play, and other such highly nuanced expression. All of these things will trump any reasonably simple attempt at drawing a link between synonymous words. Political propaganda and polemic will probably be particularly bad about equating terms that should be kept logically distinct. Furthermore, at least when I write, I’m reminded of other things during the process, things that are associated, but not necessarily synonymous. These remindings are an important part of the essay writing process, but will certainly throw noise into the digital library.

But if it’s so hard to make a mechanical system for identifying synonyms, then how do humans do it? Here I have a hypothesis: that similar words stimulate similar patterns in the brain. Thus when a human tries to think up synonyms it’s really the same as playing word association with a filter. First, the word stimulates the brain, bringing up certain associations. These associations will be based on ‘brain distance’, a measure of the similarity of brain activity for certain words and thoughts. But some associations will be radically different from the synonyms that we’re looking for. For example, antonyms and non-sequiturs often come up in word association games. So a filter is applied to weed these out, and what’s left is passed through a dictionary/meaning check. Anything passing this process will be reported as a synonymous term.

So, in order to really generate a thesaurus, we do need AI (or at least an underlying cognitive model). When I first thought of the thesaurus problem, I was hoping that it was pared down enough, small and simple enough, that it would be doable without all this complexity. We might have to reduce the problem further, make it looser. Say, build an association dictionary, rather than a thesaurus. An association dictionary might be possible, because it forgoes the understanding of meaning and similarity. It doesn’t have to question or measure why two words should be associated, only record that they are used similarly.
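As a rough illustration of what such an association dictionary could look like, here is a minimal Python sketch. It assumes a toy three-sentence corpus and a deliberately crude notion of "used similarly": two words are associated when they fill the same slot in an otherwise identical sentence. The function name `association_dictionary` and the corpus are made up for the example, and a real system would need far looser matching.

```python
from collections import defaultdict
from itertools import combinations

def association_dictionary(sentences):
    """Map each word to the words seen filling the same slot in an identical sentence frame."""
    frames = defaultdict(set)  # frame (sentence with one word blanked out) -> words seen in the blank
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        for i, word in enumerate(words):
            frame = tuple(words[:i]) + ("_",) + tuple(words[i + 1:])
            frames[frame].add(word)
    associations = defaultdict(set)
    for fillers in frames.values():
        # Every pair of words sharing a frame gets recorded, with no judgment about why.
        for a, b in combinations(sorted(fillers), 2):
            associations[a].add(b)
            associations[b].add(a)
    return dict(associations)

# Hypothetical toy corpus built around EvB's example.
corpus = ["The sky is blue.", "The ocean is blue.", "The sky is azure."]
assoc = association_dictionary(corpus)
print(assoc["sky"])   # {'ocean'}
print(assoc["blue"])  # {'azure'}
```

Note that the sketch happily links sky with ocean, which is exactly the kind of pairing a thesaurus should reject but an association dictionary is allowed to keep.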

So, if you can automate the building of a thesaurus, you should get a PhD in Linguistics.

Philosophy of Computer Science: Naming

Posted by Eric Hennigan
On May 2nd, 2009 at 01:05

Posted in Comp*, Language, Literature, Philosophy, Religion

For a very long time, western culture has had a strong undercurrent about naming. Conceptually, it starts with the recognition that the ability to name a thing gives you power over it. This is reflected in many deep and ancient cultural mythologies.

The creation story in the Bible begins with:

In the beginning,…the earth was a formless void…. Then God said: Let there be light. God called the light Day, and the darkness he called Night.

So God is able to create the Earth with only his Word, and give life to mankind with only his breath. This power is nearly transferred to Adam, when he is given the task of naming all the plants and animals. Only mankind is given this linguistic power.

Jewish mythology picks up on this issue with the story of the Golem.

In many tales the Golem is inscribed with magic or religious words that keep it animated. Writing one of the names of God on its forehead, a slip of paper in its mouth, or inscribed on its body, or writing the word Emet (אמת, “truth” in the Hebrew language) on its forehead are examples of such words. By erasing the first letter aleph in Emet to form Met (מת, “dead” in Hebrew, when the aleph letter א is cancelled) the golem could be deactivated.

Jewish culture continues this tradition with the Kabbalah’s search for the True Name of God. Other cultures also demonstrate this idea. In witchcraft, a demon is both summoned and controlled by speaking its name. In the Hindu tradition AUM is the sacred word that encompasses everything, and is the sole syllable upon which focus is kept during meditation. The idea is also reflected in more modern works, as clearly expressed in Ursula Le Guin’s A Wizard of Earthsea:

Ged sighed sometimes, but he did not complain. He saw that in this dusty and fathomless matter of learning the true name of each place, thing, and being, the power he wanted lay like a jewel at the bottom of a dry well. For magic consists in this, the true naming of a thing.

Or the recently popular Harry Potter, where Dumbledore advises Harry:

“Call him Voldemort, Harry. Always use the proper name for things. Fear of a name increases fear of the thing itself.” (PS17)

But how does this relate to Computer Science? Being a very textual discipline, we have many conventions that relate to naming. In Computer Science, we have the ability to create virtual worlds, and thus we need systems of naming the objects within those worlds. At the Language level we see a focus on naming conventions:

  • Hungarian notation, in which variables have a prefix that describes their type, such as strName for a string, or pX for a pointer to X.
  • Fortran, which had an implicit typing scheme, where any names beginning with I, J, K, L, M, N were always integer and the rest were reals.
  • The Ruby on Rails framework, which has the ability to automatically map a model named “Person” to the “people” table in the database just by name inspection.

But naming actually turns out to be a much deeper issue than these linguistic examples show. In the Distributed Systems world, we have a large focus on naming, for a remote resource can only be accessed through its name, in what’s called name resolution. The easiest example to pick on here is DNS, the system that allows a person to reference a remote computer by using an easy to remember domain (such as www.example.com) instead of a hard to remember numeric address (such as 127.0.0.1). We can also identify a confluence of two separate concepts: the name of a machine can be used to locate it. This allows machines to operate with the previous cultural ‘power of naming’: knowing a machine’s name gives one access to that machine.

Since my research focuses on computer security, this duality between names and locations can be really critical. For example, there is a model for building secure software, called the object-capability model, that not only identifies this power of naming, but actually explicitly states it as an axiom of the model:

  • Objects (actors) can interact only by sending messages to unforgeable addresses.
  • An object acquires knowledge of other objects in one of two ways:
    1. It is created with addresses that it receives from its creator
    2. It receives a message with an address to another object.

So, the security of the system is brought down to names. Communication and therefore power over other objects can only be obtained by learning their true names, which must be kept secret (unforgeable). For if a malicious object were able to easily guess the names of other objects in the system, it could quickly wreak havoc.
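To make the "secret true name" framing concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the `Registry` class and its methods are invented for the example, and the unguessable names come from Python's standard `secrets` module. Real object-capability systems usually get unforgeability from the language's own object references rather than from secret strings, and plain Python does not enforce any of this.

```python
import secrets

class Registry:
    """Toy message router: an object is reachable only through an unguessable name."""

    def __init__(self):
        self._objects = {}

    def register(self, handler):
        # 128 random bits: impractical for a malicious object to guess.
        name = secrets.token_hex(16)
        self._objects[name] = handler
        return name  # only the creator learns the name, and may pass it on in messages

    def send(self, name, message):
        handler = self._objects.get(name)
        if handler is None:
            raise PermissionError("unknown name: no access")
        return handler(message)

registry = Registry()
log = []
logger_name = registry.register(log.append)  # the creator receives the address at creation
registry.send(logger_name, "hello")          # holding the name is what grants access
print(log)  # ['hello']
```

Anyone who has not been handed `logger_name` can only guess, and a wrong guess raises `PermissionError`.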

As such systems work their way into our daily lives, our personal names (read: personal identification) have also become much more important recently, as anyone that has been a victim of identity theft can attest. But this is an issue I won’t go into here. There are also other cultural impacts, for names change the way we think about each other.

Linguistics and Computer Languages

Posted by Eric Hennigan
On April 12th, 2009 at 17:04

Posted in Comp*, Language, People

Of course, I would never think that I was the only one to have the idea of studying computer languages from a linguistics point of view. Well, I found an interesting character by the name of Chris Barker, who gave an interesting keynote at POPL in 2004. He’s mentioned in a recent LtU discussion about the “Influence of cognitive models on programming language design”. Unlike most linguists, who get branched off into anthropology and soft models of cognition, Barker really knows what he’s talking about when it comes to formal models. He even has an (interactive!) tutorial on lambda calculus.

Unfortunately, I wasn’t able to scare up any recording of his keynote, but the abstract is available.

Linguists seek to understand the semantics of expressions in human languages. Taking a computational point of view, there are many natural language expressions—operators in the wild, so to speak— that control evaluation in ways that are familiar from programming languages: just think of the natural-language counterparts of if, unless, while, etc. But in general, how well-behaved are control operators found in the wild? Can we always understand them in terms of familiar programming constructs, or do they go significantly beyond the expressive power of programming languages?

I’d love to take a whole class devoted to this kind of stuff!

Neologisms

Posted by Eric Hennigan
On November 15th, 2008 at 21:11

Posted in Language, Religion, Self

Yesterday at the pub I was involved in a very extended (civil and remarkably productive) dialog about morality and society. We touched on many topics in the course of discussion, one of them being the difference between reason and faith. And I had to come up with a new word. Firstly, we agreed on the tautologies:

It is reasonable to have reason.
It is unreasonable to have faith.

The second one is true because of the definition of faith (having belief without reason). So, by analogous reasoning, I also stated:

It is faithable to have faith.
It is unfaithable to have reason.

I’ll let you decide what I mean by that.

Probability Programming

Posted by Eric Hennigan
On November 15th, 2008 at 20:11

Posted in Comp*, Language, Math, People

Yesterday a very interesting speaker, Eric Hehner, gave a talk at the graduate seminar:

TITLE

A Probability Perspective

ABSTRACT

This talk could be called “probability meets programming”. It draws together four perspectives that contribute to a new understanding of probability and solving problems involving probability. The first is the Subjective Bayesian perspective that probability is affected by one’s knowledge, and that it is updated as one’s knowledge changes. The problem of assigning prior probabilities is mitigated by the Information Theory perspective, which equates probability with information. The main point of the talk is that the formal perspective (formalize, calculate, unformalize) is beneficial to solving probability problems. And finally, the programmer’s perspective provides us with a suitable formalism.

I found the talk extremely fascinating. He first compared measures of probability, entropy, and information, demonstrating that they were (in a sense) substitutable concepts (analogous with Energy and Mass, or Energy and Temperature).

b bits = 2^b states = 2^(-b) chance
log(s) bits = s states = 1/s chance
-log(c) bits = 1/c states = c chance
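These three conversions are easy to check numerically. The following Python sketch assumes the logarithms are base 2 (which is what the first line implies), and the helper names are my own rather than anything from the talk.

```python
import math

def bits_from_states(s):
    """Number of bits needed to distinguish s equally likely states."""
    return math.log2(s)

def bits_from_chance(c):
    """Information, in bits, carried by an event whose chance is c."""
    return -math.log2(c)

b = 3
states = 2 ** b   # 3 bits distinguish 8 states
chance = 2 ** -b  # each of those states has a 1/8 chance

print(states, chance)            # 8 0.125
print(bits_from_states(states))  # 3.0
print(bits_from_chance(chance))  # 3.0
```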

He also waxed poetical about how we are often fooled about probabilities, so it pays considerably to mechanize our calculations regarding those probabilities. It helps even further if we have a formal language into which we can directly translate our real-life word problems, so that we don’t accidentally set up and then solve the wrong problem. That is, we can then move on from debating about interpretations of the problem, and into actual calculation.

  • If I have two children and one of them is a girl, what is the probability the other is also a girl? (ans: 1/3)
  • If I have two children and the older one is a girl, what is the probability that the younger is also a girl? (ans: 1/2)
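Both answers can be checked mechanically by enumerating the four equally likely families. This Python sketch is my own brute-force illustration, not the formalism from the talk, and it assumes each child is independently a girl or a boy with equal chance.

```python
from fractions import Fraction
from itertools import product

# Each family is (older, younger); the four combinations are equally likely.
families = list(product("GB", repeat=2))

def probability(event, given):
    """Conditional probability of `event` among the families satisfying `given`."""
    possible = [f for f in families if given(f)]
    favourable = [f for f in possible if event(f)]
    return Fraction(len(favourable), len(possible))

def both_girls(family):
    return family == ("G", "G")

one_is_a_girl = probability(both_girls, given=lambda f: "G" in f)
older_is_a_girl = probability(both_girls, given=lambda f: f[0] == "G")

print(one_is_a_girl)    # 1/3
print(older_is_a_girl)  # 1/2
```

The two questions differ only in the `given` condition, which is precisely the part that informal wording tends to blur.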

He also talked quite a bit on the Bayesian approach to probability, and why it is much nicer than the frequentist approach, because it assumes much less about the world. There is no need for a prior, and your measured probabilities are updated naturally as your knowledge of the world changes.

From this he moved on to providing a proto-language approach to how one would set up and solve these sorts of problems. He co-opted existing computer language constructs to do this. First we notice that in statements like

IF cond THEN this ELSE that ENDIF

the cond is a boolean value. But there’s a priori no reason why we are prevented from interpreting it as a probability, a real value in the range [0,1].

IF 1/2 THEN print("heads") ELSE print("tails") ENDIF

He quite nicely demonstrated a calculus that gives you the ability to compute the result of such random decision trees. So, for example, if you were faced with the Two Envelopes Problem, you could compute the value of a strategy expressed in his probability language.
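To give a flavour of what computing the result of such a decision tree might mean, here is a Python sketch. It is not Hehner's calculus, just one assumed interpretation: a probabilistic IF is treated as a weighted mixture of the outcome distributions of its two branches. The names `certainly` and `prob_if` are hypothetical.

```python
from fractions import Fraction

def certainly(outcome):
    """A distribution that yields `outcome` with probability 1."""
    return {outcome: Fraction(1)}

def prob_if(p, then_dist, else_dist):
    """IF p THEN then_dist ELSE else_dist, where p is a probability in [0, 1]."""
    result = {}
    for weight, dist in ((p, then_dist), (1 - p, else_dist)):
        for outcome, q in dist.items():
            result[outcome] = result.get(outcome, Fraction(0)) + weight * q
    return result

# IF 1/2 THEN print("heads") ELSE print("tails") ENDIF
coin = prob_if(Fraction(1, 2), certainly("heads"), certainly("tails"))

# A nested tree: report heads half the time, otherwise flip the coin above.
two_tries = prob_if(Fraction(1, 2), certainly("heads"), coin)

print(coin["heads"], coin["tails"])            # 1/2 1/2
print(two_tries["heads"], two_tries["tails"])  # 3/4 1/4
```

Because branches are themselves distributions, trees of any depth collapse into a single table of outcomes and their chances.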

I really liked the talk because of the way in which it drew upon existing fields and showed a very curious intersection of them. After the talk I asked if he could use this language to calculate an optimum strategy (he said no, he hadn’t done that, but it would be a good area of research) and if he had considered the addition of a switch-case statement (he had; he didn’t yet know how it would look in the language, but he quite liked the idea of adding it).

The research paper underlying his talk, A Probability Perspective, is available, as is his book, A Practical Theory of Programming. He also mentioned that he has a grad student who has applied this probability programming to proofs of quantum algorithms with much success, and that he has yet to find a student to implement it (that is, to write a compiler/interpreter for the language).

Function calling notation.

Posted by Eric Hennigan
On October 25th, 2008 at 15:10

Posted in Comp*, Language

I was reading Yegge’s rant Rhinos and Tigers, and he mentioned that:

So it’s kind of unfortunate when you have to use functions, because if you have to say, you know, HTMLElement.getChildren.whatever, it gets inverted with functions: whatever(getChildren(HTMLElement)). You have to call from the innermost one to the outermost… it’s “backwards”, right?

This doesn’t necessarily have to be true. We could institute a new function calling convention where the arguments precede the function name:
((HTMLElement)getChildren)whatever

Let’s try this out a bit.
(5)factorial
("Hello World\n")print
(5, 3)add
(math.pi)math.cos

It apparently works nicely for certain functions (Yegge’s example and factorial) but not at all for most others. This could be because in ordinary discourse, Subject precedes Action precedes Object. I suppose you could get used to this notation, but I find it too backwards for too many things.
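You can get a feel for arguments-first calling in an existing language without changing its grammar. The Python sketch below is just an approximation of the notation: the `Args` wrapper and its use of `>>` are my own invention, standing in for `(5, 3)add`.

```python
import math

class Args:
    """Hold the arguments first, then apply a function to them with `>>`."""

    def __init__(self, *values):
        self.values = values

    def __rshift__(self, function):
        # Wrap the result so that further functions can be chained left to right.
        return Args(function(*self.values))

    def result(self):
        return self.values[0]

def add(a, b):
    return a + b

five_factorial = (Args(5) >> math.factorial).result()        # (5)factorial
total = (Args(5, 3) >> add).result()                         # (5, 3)add
chained = (Args(5, 3) >> add >> math.factorial).result()     # ((5, 3)add)factorial

print(five_factorial, total, chained)  # 120 8 40320
```

The chained case reads in evaluation order, which is the property Yegge's method-call example has and nested function calls lack.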

Heterogeneous Lists

Posted by Eric Hennigan
On August 12th, 2008 at 16:08

Posted in Comp*, Ideas, Language

I’m used to C’s version of unions and structs. In C a union is simply a spot of memory into which various types of things can be stored. C doesn’t do all that much checking on the data types though. For example, if you construct a union of a 4-byte int and a 4-byte float, C will allow you to assign to the int and then later read from the float. Because these values share the same memory space, you will very likely get back garbage (the bytes of the int will be interpreted as if they were actually a float). C essentially allows you to subvert the type system in this fashion, leaving it up to the (unreliable) programmer to keep track of the type which is stored in that location, so that those bits are not ever used as if they represented a different type. C++ inherits this feature, making it especially dangerous, because the programmer might accidentally pull out the wrong class, and thus use those bits for a nonsensical dynamic method resolution.
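The effect is easy to reproduce from Python, whose standard `ctypes` module can declare a C-style union. This sketch assumes a 4-byte integer sharing storage with a 4-byte floating-point value; the class name `IntOrFloat` is made up for the example.

```python
import ctypes

class IntOrFloat(ctypes.Union):
    """C-style union: both fields occupy the same 4 bytes of memory."""
    _fields_ = [("i", ctypes.c_int32), ("f", ctypes.c_float)]

u = IntOrFloat()
u.i = 1            # store an integer...
garbage = u.f      # ...then read the very same bytes back as a float

print(ctypes.sizeof(IntOrFloat))  # 4
print(garbage)                    # 1.401298464324817e-45
```

Nothing complains at any point: the integer 1 simply comes back as the smallest positive single-precision value, because nobody is tracking which type was stored.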

What would be really nice is to have a union type, which the compiler is able to check for you. According to section 7.3 Records (Structures) and Variants (Unions) of Programming Language Pragmatics, Ada requires the use of a discriminant in its variants. Essentially, this means that the compiler is able to keep track of what type of value that part of memory represents. It also disallows converting of types. Once a value of type A is stored in that location, it will always be treated as type A, and never as a type B. This tag uses up extra memory, but does help provide type safety.

Also, in C, there is the array: essentially a packed section of memory that contains many objects of the same type. It’s homogeneous; only one type can be stored in it. Since I started using some of the more dynamic scripting languages, I’ve always found homogeneous data structures to be quite constraining. It’s just so inflexible to store a list of only one kind of thing. Many times I like to store a list (preserves order) of different kinds of things. I like heterogeneous lists. But how to implement them?

In C we can easily create a union of all the different types we might want to put in the list, and then just put them in like that. But wait! When we pull them out, we’ve now forgotten what type we used when we put them in. So we could create a struct with a tag (to tell us the type) and a union to store the data. Then every time we pull a value out of the array, we first look at the tag, and then use the appropriate data type. The compiler will be in the dark about most of this, and won’t catch certain mistakes (namely, pulling a value out as the wrong type, or omitting a type*).

In my dream language, I definitely want the compiler to type check my unions, such that I don’t accidentally try to pull out data with a different type than was used when I put it in. In this way I would be able to safely and easily create a heterogeneous list by simply creating a homogeneous list of unions. There will probably be a large amount of run-time type-checking or tag-verifying code that would be unavoidable, but I’d like to make the analysis as static as possible for performance and safety reasons. I’m hoping that a typing system with powerful enough type-inference could help me out with that a bit.
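For comparison, here is roughly what "a homogeneous list of unions" looks like in Python with type annotations. This is a sketch under assumptions: the item classes are hypothetical, each class serves as its own tag, and any static checking would have to come from an external type checker such as mypy, since Python itself only verifies the tag at run time.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class IntItem:
    value: int

@dataclass
class TextItem:
    value: str

# The "union": an Item is one of these classes, and the class is the tag.
Item = Union[IntItem, TextItem]

def describe(item: Item) -> str:
    # Check the tag first, then use the payload as the matching type.
    if isinstance(item, IntItem):
        return f"int {item.value + 1}"
    if isinstance(item, TextItem):
        return f"text {item.value.upper()}"
    raise TypeError(f"unexpected item: {item!r}")

# A homogeneous list of unions behaves as a heterogeneous list.
mixed: List[Item] = [IntItem(41), TextItem("hi"), IntItem(0)]
descriptions = [describe(item) for item in mixed]

print(descriptions)  # ['int 42', 'text HI', 'int 1']
```

The tag check is still run-time code, but because the tag and the payload type are the same thing, there is no way to read an `IntItem` as if it were a `TextItem`.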


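* Actually, if the tag is an enum, many C and C++ compilers can check that a switch-case over it is exhaustive (GCC’s -Wswitch warning, for example, fires when a switch with no default case skips an enumerator). So omitting a type should raise a compiler warning.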