 2005.06.20, 13:27
Here's a text that helped me grasp a little better why things that seem perfectly simple are actually very complicated. (More precisely, why real-world problems are so hard to solve. I can't say I fully understand the text, but... ;)
High-Dimensional Spaces Are Counterintuitive, Part One
A friend of mine over in Microsoft Research pointed out to me the other day that high-dimensional spaces are really counterintuitive. He'd just attended a lecture by the research guys who wrote this excellent paper and we were geeking out at a party about it. I found this paper quite eye-opening and I thought I might talk a bit about some of the stuff that's in here at a slightly less highfalutin level - the paper assumes a pretty deep understanding of high-dimensional spaces from the get-go.
It's hard to have a geometrical picture in your head of a more-than-three-dimensional space. I usually have to use one of two analogies. The first analogy I like goes like this: think of a line - one-dimensional. Imagine that you have a slider control that determines your position on that line, from, say, -1 to 1, left-to-right. That's pretty visually clear. Add another slider that determines your up-and-down position, and you've got a square. Each point on the square has a unique set of slider positions. Add another slider that determines your out-and-in position, and you've got a cube. Again, these are easy to picture. Every point in the cube has a unique combination of slider positions that gets you to that point.

Now think of a cube with a slider control below it that lets you slide from intense red on one end through dark red and to black on the other. Now you've got four axes you can move around - height, width, depth and redness. The top right front corner of the bright red cube is a certain "colour distance" from the top right front corner of the black cube. That this is not a spatial dimension isn't particularly relevant; we're just picturing a dimension as redness for convenience. Every time you want to add another dimension, add another slider - just make sure that whatever is sliding is completely independent of every other dimension. Once you've added green and blue sliders then you've got a six-dimensional hypercube. The "distance" between any two 6-d points is a function of how much you have to move how many sliders to get from one to the other.
That analogy gets across one of the key ideas of multidimensional spaces: that each dimension is simply another independent degree of freedom through which you can move. But this is a quite mathematical and not very geometric way of thinking about dimensionality, and I want to think about the geometry of these objects. Let's abandon this analogy. The second analogy is a little bit more geometrical. Think of a line, say two units long. Now associate every point on that line with another line, also two units long, "crossing" it at the new line's center. Clearly that's a filled-in square - after all, every point along one side of a square has a straight line coming out from it perpendicularly. In our slider analogy, one slider determines the point along the "main line", and the second determines how far to go along its associated line.
Think of another line, but this time, associate every point on it with a square. That's a solid cube.
Now think of yet another line. Associate every point on it with a cube, and you've got a 4-cube. At this point it gets hard to visualize, but just as a cube is an infinite number of equally-sized squares stuck together along a line, so is a 4-cube an infinite number of 3-cubes stuck together along a line. Similarly, a 5-cube is a line of 4-cubes, and so on.

Where things get weird is when you start to think about hyperspheres instead of hypercubes. Hyperspheres have some surprising properties that do not match our intuition, given that we only have experience with two- and three-dimensional spheres. (2-spheres are of course normally called "circles".) The definition of a hypersphere is pretty simple - like a 2-sphere or 3-sphere, a hypersphere is the collection of points that are all the same distance from a given center point. (But distance works strangely in higher dimensions, as we'll see in future episodes!)

It's hard enough to picture a hypercube; it's even harder to picture a hypersphere. The equivalent analogy for n-spheres requires us to think about size. Again, imagine a line two units long. Associate with each point on the line another line crossing at the middle. But this time, the associated lines are of different lengths. The lines associated with the end points are tiny, and the lines associated with the middle are longer. This describes a circular disk - for each point along the diameter of a circle you can draw a perpendicular line through the point extending to the boundaries of the disk on each side.
Now do the same thing again. Take a line, and associate each point on the line with a circle. If the circles are all the same size, you have a cylinder. But if they vary from small at the ends to big in the middle, you've got a sphere. Successive cross-sections of a sphere are all circles, but they start small and get big and then get small again. Now do the same thing again. Take a line and associate each point on the line with a sphere, small at the ends and big in the middle, and you've got a 4-sphere. Successive "cross-sections" of a 4-sphere are 3-spheres of varying size. Keep going to 5-, 6-, etc., spheres.

A circle of diameter 2 fits into a square of edge length 2, and a sphere of diameter 2 fits into a cube of edge length 2. Clearly an n-sphere of diameter two fits exactly into an n-cube of edge length two - the n-sphere "kisses" the center of each face of the n-cube. You can't make the n-cube smaller without the n-sphere poking out of it somewhere. But things start getting weird when you consider the volume of an n-sphere. Tomorrow we'll compare the volume of an n-sphere to the volume of an n-cube, and discover some surprising and counterintuitive things about where that volume is.
High-Dimensional Spaces Are Counterintuitive, Part Two
The volume of an n-cube of edge length s is easy to work out. A 2-cube has s^{2} units of area. A 3-cube has s^{3} units of volume. A 4-cube has s^{4} units of 4-volume, and so on - an n-cube has s^{n} units of n-volume. If the n-cube has an edge of s>1, say s=2, then clearly the n-volume dramatically increases as the dimensionality increases - each dimension adds a lot more "room" to the n-cube.
A 2-sphere (i.e., circle) is pretty close in area to the smallest 2-cube (i.e., square) that encloses it - sure, you lose some area at the four corners, but not a whole lot. Though the circle is far from the square at the four corners, it is very close to the square at the four sides. A circle has about 78% (π/4) of the area of its enclosing square. But a 3-sphere inside the smallest 3-cube that encloses it is far from eight corners and close to only six sides. A 3-sphere is about half (π/6) the volume of the 3-cube. As you go up in dimensions, you get more and more corners that are far from the n-sphere - there are 2^{n} corners and only 2n sides, so the comparative volume of the sphere goes down.
In fact, you don't even need to compare n-spheres to n-cubes - after you reach 5 dimensions, the n-volume of an n-sphere starts going down, not up, as dimensionality increases. With some pretty easy calculus you can show that the n-volume of an n-sphere of radius r is:
V_{1} = 2 r
V_{2} = π r^{2}
V_{n} = V_{n-2} x 2 π r^{2} / n

For any fixed radius this rapidly approaches zero as n gets big. Pick a big n, say 100. The volume of a 100-sphere is going to be (π r^{2}/50) x (π r^{2}/49) x (π r^{2}/48) ... (π r^{2}/1). Suppose r is 1 - then all of those terms except for the last few are going to be quite a bit smaller than 1. Every time you add more dimensions, the n-volume of the unit n-sphere gets smaller and smaller even as the n-volume of the smallest n-cube that encloses the unit n-sphere gets exponentially larger and larger!
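The recurrence above is easy to turn into code. Here's a minimal Python sketch (the function name is mine) that computes the n-volume of an n-sphere and shows the unit n-sphere's volume peaking around n = 5 and then collapsing toward zero:

```python
import math

def nsphere_volume(n, r=1.0):
    """n-volume of an n-sphere of radius r, via V_n = V_{n-2} * 2*pi*r^2 / n."""
    if n == 1:
        return 2 * r
    if n == 2:
        return math.pi * r ** 2
    return nsphere_volume(n - 2, r) * 2 * math.pi * r ** 2 / n

# Unit n-sphere volumes: they rise until n = 5, then shrink toward zero.
for n in (2, 3, 4, 5, 6, 10, 100):
    print(n, nsphere_volume(n))
```

For n = 100 the volume comes out on the order of 10^{-40}, while the smallest enclosing 100-cube (edge 2) has volume 2^{100}, roughly 1.3x10^{30}.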
Here's another weird fact about the volume of hypersolids. Consider two squares, one inside the other. How big does the small square have to be in order to have, say, 1% the area of the larger square? That's pretty easy. If the inner square has 10% the edge length of the outer square, then it has 1% of the area of the outer square.
What about nested 3-cubes? An inner 3-cube with edges 10% the length of the edge of the outer 3-cube would have only 0.1% of the volume - too small. Rather, it needs to have an edge about 21% of the edge of the outer 3-cube, because 0.21 x 0.21 x 0.21 is about 0.01.
What about a nested n-cube? In order to have 1% of the n-volume of the outer n-cube, the inner n-cube needs to have an edge of (0.01)^{1/n} of the outer n-cube's edge. For n=100, that's 0.955. Think about that for a moment. You've got two 100-cubes; one has edges 2 units long, the other has edges 1.91 units long. The larger n-cube contains ONE HUNDRED TIMES more volume.
Try to visualize the smaller n-cube being entirely inside the larger n-cube, the two n-cubes having the same central point. Now wrap your mind around the fact that the smaller n-cube is 1% of the volume of the larger. The conclusion is unavoidable: in high dimensions the vast majority of the volume of a solid is concentrated in a thin shell near its surface! Remember, there are 2^{100} corners in a 100-cube, and that makes for a lot of space to put stuff.
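The nested-cube arithmetic is a one-liner to check. A quick sketch (the function name is mine):

```python
def inner_edge_fraction(n, volume_fraction):
    """Edge an inner n-cube needs, as a fraction of the outer n-cube's edge,
    to hold the given fraction of the outer n-cube's volume."""
    return volume_fraction ** (1.0 / n)

# In 2-d, 1% of the area needs 10% of the edge;
# in 100-d, 1% of the volume needs about 95.5% of the edge.
print(inner_edge_fraction(2, 0.01))
print(inner_edge_fraction(100, 0.01))

# Flip it around: the shell outside a concentric cube with 95.5% of the
# edge holds about 99% of a 100-cube's volume.
print(1 - 0.955 ** 100)
```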
It's counterintuitive because the very idea of "near" is counterintuitive in higher dimensions. Every time you add another dimension, there's more room for points to be farther apart.
The distance between opposite corners of a square of edge 2 is 2√2. The distance between opposite corners of a 3cube is 2√3, quite a bit bigger. The distance between opposite corners of a 100cube is 2√100 = 20 units! There are a whole lot of dimensions to move through, and that adds distance.
We could make the same argument for an n-sphere and show that the vast majority of its (comparatively tiny) volume is also in a thin shell near the surface; I'm sure you can see how the argument would go, so I won't bother repeating myself.
Because distance is so much more "expensive" in higher dimensions, this helps explain why n-spheres have so much less volume than n-cubes. Consider a 100-cube of edge 2 centered on the origin enclosing a 100-sphere of diameter 2, also centered on the origin. The point (1,0,0,0,0,0,...,0) is on both the 100-cube and the sphere, and is 1 unit from the origin. The point (1,1,1,...,1) is on the 100-cube and is ten units away from the origin. But a 100-sphere by definition is the set of points equidistant from the origin, and distance is expensive in high dimensions. The nearest point on the 100-sphere to that corner is (0.1, 0.1, 0.1, ..., 0.1), 9 units away from the corner of the 100-cube. Now it's clear just how tiny the 100-sphere is compared to the 100-cube.
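Those distances are easy to verify numerically; a small sketch (the helper name is mine):

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

n = 100
origin = [0.0] * n
corner = [1.0] * n                   # a corner of the 100-cube of edge 2
nearest = [1.0 / math.sqrt(n)] * n   # nearest point on the unit 100-sphere to that corner

print(dist(origin, corner))   # 10: the corner is sqrt(100) units out
print(dist(corner, nearest))  # ~9: the sphere stops 9 units short of the corner
```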
OK, so far we've been considering n-cubes that entirely enclose n-spheres, i.e., an n-cube of edge length 2 that encloses a unit n-sphere, kissing the sphere at 2n points. But we know that this n-cube has ginormously more volume than the n-sphere it encloses, and that most of that volume is near the edges and corners. What if we abandon the constraint that the n-cube contain 100% of the n-sphere's volume? After all, there are only 200 points where the 100-sphere kisses the 100-cube, and that's not very many at all.
Suppose we want a 100-cube that contains 99.9% of the volume of the unit 100-sphere. We can cover virtually all of the volume of the 100-sphere with a 100-cube of edge 0.7 instead of 2. Sure, we're missing (1,0,0,0,...,0), but we're still hitting (0.1,0.1,0.1,...) with huge amounts of room to spare. Most of the volume inside the 100-sphere isn't near the 200 points with coordinates near the axes.
How much do we reduce the volume of the 100-cube by shrinking it from 2 on the edge to 0.7? We go from 2^{100} units of volume to 0.7^{100}, a factor of around 4x10^{45} times smaller! And yet we still enclose virtually all the volume of the 100-sphere. The corner of the smaller 100-cube at (0.35, 0.35, 0.35, ...) is now only 2.5 units away from (0.1, 0.1, ...) instead of 9 units away. This is a much better approximation of the unit 100-sphere. It's still hugely enormous compared to the unit 100-sphere in terms of sheer volume, but look at how much volume we save by approximating the 100-sphere as a small 100-cube!
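Here's a hedged Monte Carlo sketch of that claim (all names are mine): sample points uniformly from the unit 100-ball using the standard Gaussian-direction trick and count how many land inside the cube of edge 0.7. The vast majority do, even though that cube is a ~10^{45}-fold smaller box than the enclosing cube of edge 2.

```python
import math
import random

random.seed(1)

def random_point_in_ball(n):
    # A normalized Gaussian vector is uniform on the sphere's surface;
    # scaling the radius by U^(1/n) makes the point uniform in the ball.
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    r = random.random() ** (1.0 / n)
    return [r * v / norm for v in g]

n, trials = 100, 2000
inside = sum(
    all(abs(c) <= 0.35 for c in random_point_in_ball(n))
    for _ in range(trials)
)
print(inside / trials)   # most of the ball's volume is inside the small cube

print((2 / 0.7) ** 100)  # ~3.6e45: how much smaller the cube of edge 0.7 is
```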
Feeling dizzy yet? Next time we'll see what these facts imply:
- n-spheres are tiny compared to n-cubes
- hypersolids have most of their volume close to their surfaces
- you can enclose almost all the volume of an n-sphere with a small n-cube
High-Dimensional Spaces Are Counterintuitive, Part Three
Last time we noticed a few interesting facts:
- most of the volume of a high-dimensional object is very near its surface.
- small movements away from a point in many dimensions add up to a long total distance.
Suppose you have a bunch of points in space scattered randomly around some average. You would expect that if the points are more or less evenly distributed then the number of points in a given volume of the space would be proportional to the volume. (In fact, that’s probably how we’d define "more or less evenly distributed.")
Let's think of a low-dimensional example - points on a line. For example, suppose you've got a thousand 30-year-old women and you measure their heights. If the average is 150 cm, then you'd expect that the number of women that fall into the 2-cm range from 145-147 cm is roughly the same as from 150-152 cm. There will probably be more in the range closer to the mean, but you'd expect that the order of magnitude would be roughly the same. It's not like there's going to be 100 in one range and 10 in the other. I'm assuming here that we're considering regions close to the mean, so that the distribution is pretty even throughout.
Now suppose you take a 2cm range and a 1cm range, both close to the mean. You would similarly expect that the 1cm range would have roughly half as many points in it as the 2cm range, right? The number of points in a region is roughly proportional to the size of the region.
What if the distribution is 2dimensional? Suppose now you plot both the heights and hair lengths of these 1000 women. Again, there will be some average around which the points cluster, say (150 cm, 30 cm). Now imagine that you draw two circles of the same area close to the mean. Again, you'd expect that two circles of the same area would have roughly the same number of points inside them. The circle with its center closer to the mean probably has a few more, but you'd expect roughly the same number of points in each circular region.
Now what if you make one of those circles half the radius of the other? Then it will have only one quarter of the area, and therefore only on average one quarter as many points.
Now suppose that you have a bunch of points scattered around a 100-dimensional space, all clustered around a particular average point. Pick a hyperspherical region centered on the average, with radius large enough to contain most of the points.
We expect to find more points in regions of the space with more volume. We know that 99% of the volume of this n-sphere is concentrated in a thin shell near its surface. The conclusion is inescapable: points clustered around a mean in high dimensions are far more likely to be found in the high-volume narrow shell far from the mean, not in the low-volume region near the mean itself!
This again seems really counterintuitive at first, but it's not if you think about it for a while. In our example above, most of the women are of near-average height. And most of the women are of near-average hair length. But clearly the number of women that are near-average in height AND near-average in hair length is considerably smaller than either taken separately.
As we add more and more dimensions to the mix, the number of points where all measurements are near the mean gets very small indeed. Of the six billion people in the world, how many of them are within 1% of average height and average hair length and average age and average income and average cholesterol level and average IQ? Probably a very, very small number. Most everyone is slightly abnormal somehow! And in a high-dimensional space, many small deviations from the mean add up to a large distance from the mean. If you plotted 100 characteristics of 6 billion people in a 100-dimensional space, the number of people very near the middle of every axis would be quite small; there's hardly any volume there.
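You can watch this concentration happen with a quick simulation (a sketch with made-up parameters): draw 100-dimensional points whose coordinates are independent standard normals, so every single coordinate tends to sit near its mean of 0, and then look at how far each whole point lands from the mean.

```python
import math
import random

random.seed(2)

n, samples = 100, 2000
dists = [
    math.sqrt(sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)))
    for _ in range(samples)
]

# The distances bunch up tightly around sqrt(100) = 10;
# essentially nobody is close to the mean point itself.
print(min(dists), sum(dists) / samples, max(dists))
print(sum(d < 5 for d in dists))  # points "near" the mean
```

Every individual coordinate hovers near zero, yet no sampled point is anywhere near the all-zero mean: a hundred small deviations add up to a distance of about 10.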
What the heck does this have to do with computer science? Suppose you have all that personal data in a database, you have a dead guy in the morgue with no ID, and you want to find a match in your database, if there is one. What if you have 100 variables about a song and you want to identify whether a given scrap of music is in your database? There are lots of real-world search problems that have sparse, large data sets spread out over a high-dimensional space.
High-Dimensional Spaces Are Counterintuitive, Part Four
It's reasonably common to have a database containing a few thousand or million points in some high-dimensional space. If you have some "query point" in that space you might like to know whether there is a match in your database. If you're looking for an exact match then the problem is pretty easy - you just come up with a suitable hash algorithm that hashes points in your space and build a big old hash table. That's extremely fast lookup.

But what if you're not looking for an exact match, you're looking for a close match? Perhaps there is some error, either in the measurements that went into the database or in the measurement of the query point, that needs to be accounted for. Now the problem is "for a given query point, what is the closest point in the database to it?" That's not a problem that's amenable to hashing. (Well, you could use a Locality Sensitive Hash algorithm, but we'll rule that out later.)

What if the closest point to the query point is so far away from the query point that it's more likely that the query point simply has no match at all in the database? In that case we don't really want the closest point, because that's not really relevant. Essentially you can think of every point in the database as being surrounded by a "probability sphere". As you move farther away from a given point in the database, the probability that you have a match to that point gets smaller and smaller. Eventually it's small enough that the probability that the query point is "junk" - not a match to any point in the database at all - gets larger than the probability that the closest point is a match.

To sum up the story so far: we've got a query point, which may or may not correspond to a point in our database of points. We need a way to say "is this point junk? If not, what are some reasonably close points in the database that might be matches?"

Here's an idea for an algorithm: compute the distance between the query point and every point in the database.
Discard all points where the distance is outside of a certain tolerance. Geometrically, we're constructing a sphere of a certain radius around each point in the database. We check to see whether the query point is inside each sphere.
We know that the volume of an n-sphere is tiny. That's good. If we do get a match then we know that the match is probably in a very small volume and therefore likely to be correct. However, we also know that such a system is not tolerant of small errors in many dimensions - because distances grow so fast, a small deviation in many dimensions leads to a big distance. That means that we probably will have to construct a sphere with a fairly large radius in order to allow for measurement error in many dimensions.
But that's not really so bad. What's bad about this scheme is that if there are a million points in the database then we have to calculate one million Cartesian distances for every query. In a 100-dimensional space that means computing 100 differences, 100 multiplications, 100 additions and one comparison, a million times, just to do one query.
We could build some optimizations in - we could check at each addition whether the total distance computed so far exceeds the tolerance and automatically discard the point. Maybe we could cut it down to an average of 50 differences, multiplications and additions per point - but then we've just added in 50 comparison operations. No matter how you slice it, we're doing a lot of math.
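That early-exit idea looks something like this in Python (a sketch; the names are mine). Comparing squared distances also avoids the square root entirely:

```python
def within_tolerance(query, point, tol):
    """True if point is within tol of query, bailing out as soon as the
    running sum of squared differences exceeds tol^2."""
    limit = tol * tol
    total = 0.0
    for q, p in zip(query, point):
        d = q - p
        total += d * d
        if total > limit:
            return False  # early exit: skip the remaining dimensions
    return True

def sphere_search(query, database, tol):
    """Brute-force scan: return every database point within tol of query."""
    return [p for p in database if within_tolerance(query, p, tol)]

db = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
print(sphere_search([0.0, 0.0, 0.0], db, 1.0))  # only the first point matches
print(sphere_search([0.0, 0.0, 0.0], db, 6.0))  # both points match
```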
A hash table works by automatically discarding all but a tiny number of points, so that you just have to check a small number rather than the whole database. A Locality Sensitive Hash algorithm (which I might write about in another series) would work well here except that LSH algorithms have lousy performance if a large percentage of the queries are junk. Let's assume that there are going to be a lot of junk queries. Is there some way that we can more rapidly find valid points?
We learned earlier that 99.9% of the volume of an n-sphere is contained in an n-cube with the same center as the sphere and an edge length fairly close to the sphere's radius.
Hmm.
Determining whether a point is inside a cube takes a lot less math than determining whether a point is inside a sphere. With a cube you don't need to compute any distance. You just compare the point's position in each dimension against the cube's upper and lower boundaries in that dimension. For a 100-dimensional space, that's 100 to 200 comparisons per point, and as soon as even one of the dimensions is out of range, you can skip this cube and move on.
Maybe we can get it down to around a few dozen comparisons per point for a straightforward linear search. That's pretty good compared to the distance metric.
We know that a cube of a given side is much, much more voluminous than a sphere of similar radius. We don't really care about the 0.1% false negative rate caused by clipping off the bits of the sphere that are close to the axes. But what about the false positive rate of the huge volume of points that are inside the cube but not inside the sphere? These are easily dealt with: once we have winnowed the likely matches down to a small subset through the cheap cube method, we can do our more expensive spherical check on the small number of remaining candidates to eliminate false positives.
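A sketch of that two-phase filter (all names are mine): the cheap per-dimension cube test winnows the database, and the exact sphere test then removes the false positives the cube let through.

```python
def inside_cube(query, center, half_edge):
    # Cheap test: one comparison per dimension, and all() stops at the
    # first dimension that falls outside the cube.
    return all(abs(q - c) <= half_edge for q, c in zip(query, center))

def inside_sphere(query, center, radius):
    # Expensive test: full squared distance over every dimension.
    return sum((q - c) ** 2 for q, c in zip(query, center)) <= radius ** 2

def find_matches(query, database, radius, half_edge):
    # Phase 1: winnow candidates with the cube test.
    candidates = [p for p in database if inside_cube(query, p, half_edge)]
    # Phase 2: exact sphere check on the few survivors only.
    return [p for p in candidates if inside_sphere(query, p, radius)]

db = [[0.0, 0.0, 0.0], [0.65, 0.65, 0.65], [5.0, 5.0, 5.0]]
# Radius 1 with a cube of half-edge 0.7, as in the 100-sphere discussion.
# The second point passes the cube test but fails the sphere test:
print(find_matches([0.0, 0.0, 0.0], db, 1.0, 0.7))  # [[0.0, 0.0, 0.0]]
```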
By this point your bogosity sense should be tingling.
This armchair performance analysis is completely bogus. Yes, ~70 comparisons is cheaper than ~100 subtractions, multiplications and additions. Who cares? That's not the most expensive thing. Performance analysis always has to be about what the most expensive thing is.
Think about this database for a moment - a million records, each record a 100-dimensional point in some space. Let's suppose for the sake of argument that it's a space of 8-byte double-precision floats. That's 800 megabytes of data. Now, there are certainly database servers out there that can keep an 800 meg file in physical memory, but we're clearly pushing the envelope here. What if the database is large enough that there isn't enough physical memory to keep the whole thing in at once? At that point, the cost of iterating over all one million records becomes the cost of swapping big chunks of the database to disk and back.

Now consider what happens if there are multiple threads doing searches at the same time. Unless you can synchronize them somehow so that they all use the same bits of the database at the same time, you're just going to exacerbate the swapping problem as different threads access different parts of the record set. (And if you can synchronize them like that, why even have multiple threads in the first place?)

The fundamental problem with both the sphere and the cube methods isn't that the math per record is expensive; it's that they must consider every record in the database every time you do a query. Getting those records into memory is the expensive part, not the math. What we really need is some index that is small enough to be kept in memory and that quickly eliminates lots of the records from consideration in the first place. Once we're down to just a few candidates, they can be pulled into main memory and checked with as computationally intensive an algorithm as we like.
There's no obvious way to build an index of hyperspheres, but there might be things we can do with hypercubes. Stay tuned.
- 4 comments
 20.6.05 16:25 #

well, actually, all the world's "simplicity" is ONLY pure simplification down to the level of human understanding.
if something (anything) seems simple = it means that, in looking at it, we have discarded a big pile of properties which supposedly don't affect our particular goal. - Reply
 20.6.05 19:23 #

if you look at "part 3", there's the answer to why the 10/90% rule holds (like, 90-99% of everything is shit)
- Reply
 so much fun
 20.6.05 22:04 #

You're going to make me read this. By the way, last night you and Raitis K. made me listen to Gridlock. Without even realizing it yourselves.
- Reply