February 28, 2005

Languages

Natural languages are easier to understand and read than they are to speak and write. Programming languages are easier to write than they are to read. Why is this?

My own rather unsatisfying answer is that programming languages have much simpler grammars and smaller vocabularies, making them easier to write, and that there must be some sort of trade-off that natural languages are taking advantage of that makes them easier to understand. The large vocabulary in particular might be hard to learn, but lead to easier comprehension because listeners can figure out some unknown vocabulary from context.

I can think of dozens of arguments against my guess, so feel free to post your own hypotheses. If you want to argue that the original question assumes too much, just consider this: why is it more attractive to write code from scratch than to maintain existing code?

How can we make programming languages easier to read?

February 27, 2005

Choosing at random 2

I mentioned earlier that there's a relatively well-known algorithm on the net for generating solar systems from scratch. The core of this algorithm, Accrete, simulates the depletion of a dust disk as proto-planets sweep up material in orbit around a star. The program begins with a dust disk starting at distance A from the star and ending at distance B. Then a location between A and B is selected at random and the amount of dust that would accrete onto a proto-planet orbiting there is calculated. Once this is done, the swept region CD is removed, dividing the disk into two dust bands, AC and DB.

Now a second proto-planet is created, sweeping up more dust and further subdividing the dust bands. This process continues until all the dust has been turned into planets.

At first glance, there's nothing wrong with this algorithm. It generates a solar system in negligible time and uses an algorithm sufficiently complex to generate interesting results. However, generating one solar system isn't enough. Like all simulations, this one has enough randomness in it that you need to run it thousands of times and compare the results if you want to find the interesting patterns. When you do this, you discover that it does not take a negligible amount of time to generate thousands of solar systems.

When I examined the standard implementation I discovered that a lot of time is wasted generating proto-planets in locations where the dust has already been swept. After subdividing AB into AC and DB, instead of picking a location in the remaining dust bands, the program picks locations between A and B until it finds one that isn't in a dust gap. By the time most of the planets are generated there are only a few tiny dust bands left, so most random insertions miss. I found that most runs need about 300 insertions to sweep all the dust, and some need 3000. The largest solar system I saw had 20 planets (i.e. 20 successful insertions).

This is a perfect example of a simulation written by a scientist (the astronomer S.H. Dole in this case, with later tweaks by, among others, Carl Sagan) that could have benefited a lot from the help of an experienced programmer. Keep in mind that this algorithm was published in 1969, and Sagan was working on it in 1977, when it cost him ~$1 of computer time per solar system (3 seconds).

I had just learned the trick I described in my last post, the one you use when you don't know the size of the set you're selecting from (the number of dust bands, in this case), so my first instinct was to try it out on this problem. I succeeded (how is left as an exercise for the reader) and sped up the generation of solar systems by a factor of 10 or so.

The smart thing to do, as I realized later, was to keep track of the total width of the remaining dust bands. The program was already calculating the bounds of the dust gaps being generated, so almost no extra overhead was introduced. This way, only one random number needed to be generated, instead of one for each dust band. The speed increase over my first approach was modest, but more importantly the code was simpler.
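The bookkeeping can be sketched like this (my own illustration, not the original program's code; the band list and names are assumptions):

```python
import random

def pick_in_bands(bands):
    """Pick a uniformly random position within a list of disjoint
    (start, end) dust bands, using a single random number spread
    across their total width."""
    total = sum(end - start for start, end in bands)
    r = random.uniform(0, total)
    for start, end in bands:
        width = end - start
        if r < width:
            return start + r   # landed inside this band
        r -= width
    return bands[-1][1]        # guard against float round-off

# Example: two remaining bands; every pick lands inside one of them,
# so no random insertions are ever wasted on swept gaps.
bands = [(0.3, 0.7), (2.0, 5.0)]
p = pick_in_bands(bands)
```

Because the bands' total width shrinks as the gaps grow, the single draw stays unbiased: each band is chosen in proportion to the dust it still holds.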

February 26, 2005

Choosing at random

Ok, somehow I drifted into posts about rather abstract concepts. Today I'll give you something more concrete to chew on, and tomorrow an extension of it.

When you want to pick an item at random from a set, it helps to know the size of the set first, so that you can generate a random number in the correct range. For example, if you want to pick a random line from a file (perhaps for a random quote generator) and you know the number of lines in the file, you can pick a random line number R and read in only the first R lines. Often, though, you don't know how many lines there are in the file.

When that happens, you can use the following algorithm:
sum = 0
while there's another line L:
    sum += 1
    pick a random number R from [0, sum)
    if R < 1 then P = L
When this is finished, all the lines in the file have been read in and P contains a line selected at random.

You can convince yourself this works by noting that P always gets set to the first line. Then, if there's a second line, P will get set to it half the time. If there's a third line, P will get set to it one third of the time, and otherwise will remain the first or second line with equal probability. In other words, you can prove that this works inductively: it works for N = 1 lines, and if it works for N = K lines then it also works for N = K + 1.
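In Python, the algorithm above looks like this (a minimal sketch; it works on any iterable of lines, not just a file):

```python
import random

def random_line(lines):
    """Pick one item uniformly at random in a single pass,
    without knowing the total count in advance.
    The nth item seen replaces the choice with probability 1/n."""
    chosen = None
    seen = 0
    for line in lines:
        seen += 1
        if random.random() * seen < 1:   # true with probability 1/seen
            chosen = line
    return chosen

# Usage: random_line(open("quotes.txt")) picks a uniformly random line.
```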

February 25, 2005

Balancing Simplicity and Complexity

I recently wasted a large amount of time designing a very complex algorithm when a simple one would have done the job. Learning when to simplify and when to complexify is key to making good simulations. Too simple, and the resulting simulation is useless. Too complex, and it's incomprehensible and irreproducible.

The same sense of balance is required when you're trying to write efficient code. Efficient code isn't necessarily more complex than inefficient code, but writing it is usually more complex and takes more time. Where some programmers prefer to quickly create inefficient code, others never finish creating efficient code.

Balance points, by their very nature, make it difficult for balanced objects to stay upright. The best we can do is attempt to recognize which side of the dividing line we're on and move in the right direction.

However, in this case, I think it's quite a bit more difficult to realize when you're making things too efficient or too complex than the reverse. This has big implications. Think about it.
A designer knows when he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.

- Antoine de Saint-Exupéry

February 24, 2005

Efficiency still matters

Writing efficient code is, arguably, the problem which spawned the entire field of computer science. It's a hard problem, obviously.

However, before you can start working on this problem, you must first solve the problem of writing code that works. Since one of the virtues of programmers is laziness, programmers sometimes fall into the trap of deciding that once the problem of writing correct code is solved, the programmer's work is done. Frequently Moore's Law is invoked as an excuse: "The processor speeds are so fast that the wasted time is negligible."

Part of this mindset can be explained when you realize that most programmers were trained on small problems, where this rationalization is in fact correct. There are many problems that the Ancient Masters optimized to death, and that student programmers must study, that can now be safely solved using brute force.

Practically all of simulation lies outside this domain.

February 23, 2005

Weibull distribution

Ok, I'm busy today, so I'll keep it really short:

You can generate numbers in a Weibull distribution by using the exponential distribution formula, then raising the result to some power.
pow(-a*ln(1-x), b)
You should be able to see from the link that these distributions can be pretty useful. The article parameterizes the distribution by the shape k; here b = 1/k.
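As a minimal sketch in Python, using the a and b parameters as above (for comparison, Python's standard library also provides random.weibullvariate, parameterized by scale and shape):

```python
import math
import random

def weibull(a, b):
    """Draw from an exponential distribution with mean a,
    then raise the result to the power b, as described above.
    With b = 1 this reduces to a plain exponential draw."""
    x = random.random()                  # uniform in [0, 1)
    return math.pow(-a * math.log(1 - x), b)
```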

February 22, 2005

Interesting Simulations

Today I want to bring to your attention some really neat simulations that, unfortunately, are no longer available. So, I turn to the Wayback Machine:

World Builder
Creates a map through simulated continental drift, then computes temperature, rainfall patterns, and rivers as influenced by the landforms. Surprisingly, there is no common algorithm out there for simple plate tectonics simulations; this is the only one I've found.

Nation Builder
Simulates the rise and fall of nations, given a map. Simulates some of the factors discussed in Guns, Germs, and Steel, among others. I've never seen another simulation like this one.

Galaxy Builder
Attempts to simulate the rise and fall of intelligent species in a small patch of stars. It's intended to be used for generating a background for a Sci-Fi RPG, I think, in the form of ruins of past civilizations and artefacts waiting to be discovered.

Nation Builder impressed me a lot. I've spent way too much time watching the animations on the example pages 1 2 3. (The animation on the 3rd page is missing from the archive, probably forever.) (Animation from the 1st page temporarily available here, since the Wayback Machine is being temperamental today.)

The World Builder is also quite impressive because, like the Nation Builder, it simulates something that nobody else does. There are lots of fractal terrain generators out there, but none that do plate tectonics. In contrast, there's at least one well known algorithm out there for building solar systems from scratch 1 2 3.

February 21, 2005

Picking points

My biggest problem with this blog is that every time I write something, I want to extend it into a long, rambling essay. I'll try to keep it a little shorter today.

We've all figured out how to pick a random point on the edge of a circle: pick an angle from [0, 2π), then use trig and the radius of the circle to figure out the x and y coordinates.

Picking a point inside the circle at random is harder than you might think: the obvious approach of combining a random angle with a random radius crowds the points toward the center. Similarly, it is not easy to pick a point from the surface of a sphere, nor from inside a sphere. MathWorld has good explanations of the problem and the solution for disk point picking and sphere point picking. Combining the two answers to solve the problem of picking a point inside a sphere (aka ball point picking) is an exercise I shall leave to the reader.

The common algorithm for picking points in a sphere is to pick points in the cube around the sphere, then discard those that are too distant from the center. This throws away just under half of the points generated, but may still be faster than the exact method, which relies on expensive functions like square roots and trigonometry. If I weren't trying to keep this short, I would now figure out which one is more efficient and present my results.
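Both approaches are easy to sketch in Python. (The "exact" method here uses the standard Gaussian-direction recipe, normalizing three normal draws for the direction and taking a cube root for the radius, which is equivalent to MathWorld's formulas; this is my own illustration.)

```python
import math
import random

def ball_point_rejection(radius=1.0):
    """Pick points in the bounding cube until one falls inside the ball."""
    while True:
        x, y, z = (random.uniform(-radius, radius) for _ in range(3))
        if x*x + y*y + z*z <= radius*radius:
            return (x, y, z)

def ball_point_exact(radius=1.0):
    """Uniform direction from three normalized Gaussians,
    radius scaled by the cube root of a uniform number."""
    x, y, z = (random.gauss(0, 1) for _ in range(3))
    norm = math.sqrt(x*x + y*y + z*z)
    r = radius * random.random() ** (1 / 3)
    return (x * r / norm, y * r / norm, z * r / norm)
```

The rejection method accepts a fraction equal to the ball-to-cube volume ratio, pi/6, or about 52%, which is where "just under half discarded" comes from.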

February 20, 2005

Time slice vs. critical events

The most common way to simulate something is to build a representation of the system being simulated, update the entire representation, and repeat. For example, in a gravity simulation the positions and velocities of objects at time t(n) can be used to figure out their positions and velocities at t(n+1). This is called the time slice strategy. The size of a time slice is often constant, but more advanced simulations may vary it. For example, gravity simulations sometimes use narrower time slices when two objects are making a close approach, because the standard algorithm applies the large close-approach acceleration to the objects for an entire time slice, when in reality the two objects would only have been that close for a very short time. Game programmers often vary the time slices in their simulation of the game world to match the frame rate of the display.
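As a toy sketch of the time slice strategy (my own illustration, not a production integrator): one planet around a star fixed at the origin, advanced by a simple forward-Euler update, with units chosen so GM = 1.

```python
import math

def step(pos, vel, dt, gm=1.0):
    """Advance one time slice: compute the inverse-square
    acceleration toward the origin, then update position and
    velocity using the values from the start of the slice."""
    x, y = pos
    r = math.hypot(x, y)
    ax, ay = -gm * x / r**3, -gm * y / r**3
    vx, vy = vel
    return (x + vx * dt, y + vy * dt), (vx + ax * dt, vy + ay * dt)

# A circular orbit at r = 1 with speed 1; with small slices the
# planet stays near r = 1, but Euler error grows as dt grows.
pos, vel = (1.0, 0.0), (0.0, 1.0)
for _ in range(1000):
    pos, vel = step(pos, vel, dt=0.001)
```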

Taking this further, some simulations don't need to know anything at all about the situation between certain times, because the situation at any point between those times is trivial to calculate. The classic example of this is a queue for service in front of a bank teller. If you want to simulate the length of the queue, all you need to know are the times at which customers join the queue, and the times at which the teller finishes serving a customer. Between those times, the length of the queue doesn't change (which is certainly trivial to calculate). This is called the critical event strategy.

In fact, to simulate the queue, you only need 4 variables: the length of the queue, the current time, the time until the next customer arrival, and the time until the next customer departure (Q, C, TA, and TD, respectively). Suppose TA < TD (a customer will arrive before one leaves):
C += TA
TD -= TA
Q += 1
TA = -A*ln(1 - rand())
That last line deserves some explanation. The time between arrivals usually follows an exponential distribution, and -A*ln(1 - rand()) will generate a number from an exponential distribution with average value A.
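Putting the arrival and departure cases together, a complete critical-event loop is only a few lines. This is my own sketch: it also has to handle an idle server, which the four-variable version above glosses over, and it assumes service times are exponential too, a common but not mandatory choice.

```python
import math
import random

def simulate_queue(a, d, end_time):
    """Critical-event simulation of a single queue.
    a = mean interarrival time, d = mean service time.
    Returns the maximum queue length observed."""
    expo = lambda mean: -mean * math.log(1 - random.random())
    q, c = 0, 0.0                       # queue length, clock
    ta, td = expo(a), float('inf')      # inf: no one being served
    max_q = 0
    while c < end_time:
        if ta < td:                     # next event is an arrival
            c += ta
            td -= ta
            q += 1
            if td == float('inf'):      # server was idle; start serving
                td = expo(d)
            ta = expo(a)
        else:                           # next event is a departure
            c += td
            ta -= td
            q -= 1
            td = expo(d) if q > 0 else float('inf')
        max_q = max(max_q, q)
    return max_q
```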

This system saves you a lot of computing time. Unfortunately, beginning programmers sometimes use the constant time slice method for everything. In this case, the novice might repeatedly do the following:
C += 1
TA -= 1
TD -= 1
if TA <= 0 then
    Q += 1
    TA = -A*ln(1 - rand())
This is a bad idea. In one case, on a school assignment, I saw a student use this method for a simulation of operating system file access; it took 20 minutes to run because the loop went through so many iterations where nothing happened. A critical-event version generated the same results in negligible time, and allowed fractional times on the clock too.

You can't use the critical event strategy to simulate everything, of course (gravity being the canonical counter-example). But when you can, it really speeds things up.

February 19, 2005

What do you do after "Hello, world"?

What small program do you often write to teach yourself a programming language? Also, what capabilities do you really need, beyond the basics such as if-then-else or while loops?

I find myself needing to do a lot of basic graphics for output. It's a lot harder to do graphics now than it was back in the days when every language had setPixel(x, y, color) and clearScreen() commands. To draw anything you first have to open a window and put something in it that you can draw on. If you want to animate anything you have to figure out how to do time delays. You often have to figure out where the mouse pointer is, so that the user can interact with the program.

One simple learning program draws an object and has it follow the mouse pointer. Here's a version in Perl:
use Tk;

$mw = MainWindow->new;
$c = $mw->Canvas()->pack(-fill => 'both',
                         -expand => '1');
$id = $c->createText(0, 0,
                     -text => 'Hello, world!');
$mx = $my = 100;
$c->CanvasBind('<Motion>' => \&drag);
$c->repeat(50, \&move);
MainLoop;

sub drag {
    $mx = $Tk::event->x;
    $my = $Tk::event->y;
}

sub move {
    ($x, $y) = $c->coords($id);
    $vx = ($mx - $x) / 10;
    $vy = ($my - $y) / 10;
    $c->move($id, $vx, $vy);
}
If I were really doing this, I'd use strict and warnings too.

Oddly enough, this program doesn't require if-then-else, loops, or even print.

February 18, 2005

Continuous vs. Discrete

Almost all modern programming languages provide a function that returns a random real number between 0 (inclusive) and 1 (exclusive). You can write that more compactly as "a real number in the range [0, 1)". Most programmers have figured out how to convert to a random integer by doing something like: int(rand() * 100) + 1, which gets you an integer in the range [1, 100]. Unfortunately, this is often followed by a table like the following:
  1 -  10  Blue
 11 -  30  Green
 31 -  70  Yellow
 71 - 100  Orange,
in which Blue, Green, Yellow, and Orange are selected with 10%, 20%, 40%, and 30% probability respectively. An example I've run across more than once is a random star generated by picking a stellar type from OBAFGKM, then looking up statistics like size, temperature, mass, and luminosity from a table. This is a poor way to pick a random star. Astronomy students and Sci-Fi RPGers take note!

Frequently the problem is that the programmer doesn't know offhand how to convert [0, 1) to a non-uniform distribution and instead constructs a lookup table. Knowing a few simple conversions to non-uniform distributions can save you a lot of typing, and make your program much more useful too.

One of the most commonly needed is the exponential distribution. The classic use for this is to generate an interarrival time (the time to the next arrival of a customer, or the time to the next radioactive decay, for example). -a*ln(1 - x), where x is a number from [0, 1), will give you an exponentially distributed number in [0, ∞) with average value a. The average star in a globular cluster has a mass 3.333 times that of the Sun, in an exponential distribution, before aging eliminates the large stars. From the mass you can calculate all the other characteristics of a main-sequence star.
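As a sketch in Python (the 3.333 solar-mass average is the figure quoted above; this is only the transformation, not a full stellar model):

```python
import math
import random

def star_mass(mean=3.333):
    """Draw a star mass (in solar masses) from an exponential
    distribution with the given mean, via -mean * ln(1 - x)."""
    return -mean * math.log(1 - random.random())

# Usage: one draw per star; large masses are rare but possible,
# small masses are common, with no lookup table needed.
m = star_mass()
```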

The best known distribution is probably the normal distribution. Wikipedia has a short description of how to generate normally distributed numbers using the Box-Muller transform. Interestingly, you can also generate an approximately normal number by adding together a large number of 0/1 coin tosses (or calls to rand) and rescaling the sum. This explains why the normal distribution pops up so often: things like your height are the result of the sum of a large number of random factors (both genetic and environmental).
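The coin-toss version is a direct consequence of the central limit theorem. A quick sketch (the rescaling to mean 0 and variance 1 is my own choice of normalization; a single toss has mean 1/2 and variance 1/4):

```python
import random

def approx_normal(n=48):
    """Sum n fair coin tosses, then shift and scale so the result
    has roughly mean 0 and variance 1. Larger n gives a closer
    approximation to the normal distribution."""
    total = sum(random.random() < 0.5 for _ in range(n))
    return (total - n / 2) / (n / 4) ** 0.5
```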

Any time you find yourself asking the user to pick a number from a list, or picking from a set of radio buttons labelled "Low, Average, High", you might be making the same mistake. Internally, these buttons are probably converted to preset numbers, like "0.2, 0.5, 0.8". You should be allowing the user to pick any number between 0 and 1 in this case.

February 17, 2005

Why "Automated Imagination"?

My secondary purpose for creating this blog is to let me write about simulations. (My primary purpose is to get myself to write more often.) When you program a simulation, you have automated your imagined model of the way something works. This is incredibly important, because suddenly it is no longer necessary to hold the entire model in your head at all times. You can work on one part of the model, and see how it interacts with your previous work, even if you couldn't possibly hold the entire model in your head at once. This makes it possible for non-genius scientists to work on things previously understood only by a select few. It also allows programmers to simulate things that scientists previously only imagined.

Simulation programming isn't a commonly acknowledged field of computer science, but it probably deserves to be. In order to write a simulation, a programmer has to grok what's being simulated. The result is that most simulations are written by physicists, chemists, biologists, geologists, economists, social scientists, and so on, often using only the most basic programming skills, rather than by experienced programmers. Since you can, in theory, compute anything using only the most basic programming skills, the simulations usually work correctly. But a lot of time and potential is wasted. I intend to write about a few of the tricks, tools, and concepts that have helped me automate my imagination.