Thursday, May 23, 2013

probabilistic alpha beta search

Any good chess player can tell you the difference in the styles of someone like Mikhail Tal and Anatoly Karpov.  Tal was known as the magician, someone who sought to guide games into the sharpest possible lines where he could induce mistakes from his opponents, even while taking great risks of his own.  Karpov on the other hand, he was a python.  He'd safely accumulate small advantages, and slowly squeeze his opponent until they cracked.  There styles couldn't be more different, but there results were undeniable; both joined the exclusive club of World Chess Champions.  And yet, neither of these styles are well captured in Computer Chess, who's engines have become the cold assassins of the chess world.  I think capturing the essence of these stylistic differences is a factor that could be used to further improve the results of computer chess engines, and help tailor their response to a given candidates strengths and weaknesses, without compromising quality.  What is the missing factor?  Quantification of risk.

Figure 1: Game tree with mimimax highlighted in red, and prospective better path in green
What does it mean to include "risk" in a turn based two player zero sum game?  Consider Figure 1, which highlights the typical  minimax approach to game trees in red.  Each player has competing goals, Max is looking to choose a path that maximizes his score, thus picking the maximum nodes in the next level, while Min is trying to minimize the score on his turn.  Scores are bubbled up from the bottom.  We see from this perspective, Max's optimal choice would be to pick the left branch, because given optimal play from Min, the score of 1.2 produces a slightly superior position.  If the scores were perfect, indeed this analysis would be correct, and risk would be completely mitigated.

The reality in games where the search depth rarely allows one to reach terminal nodes is that we use a heuristic as a proxy for the true score.  When the frontier node is finally reached, we repeat the process of creating a tree to search, bubbling up a new score that serves as a better estimation of the actual value of the position at that node.  This means there is a variance in the heuristic that must be considered when choosing our path through the tree.  When we eventually reach a pre-frontier node, we will have reached the last responsible moment from which we can change our decision.  The score of pre-frontier nodes as viewed from the root must then be some combination that accounts for the likelihood that the proxy score for any frontier node has improved by the time we reach the pre-frontier level.

Consider node A vs node B.  If we were to reach node A and then discover that our heuristic evaluation overestimated the score of the child node valued 1.2, we have left ourselves without a viable alternative, as the variance in our heuristic score is unlikely to overcome the 6.2 point gap we predicted in the alternative branch.  If the true score of the best choice at node A was -1.0, we would have no recourse but to continue down that path.  Node B, on the other hand, provides two choices of similar value.  If we were to reach Node B and discover that our 1.1 option actually produced a score of -1.0, we would still have a viable alternative in the 1.0 node.  Minimax fails to capture this characteristic.

Now consider nodes C and D.  From the prospective of Max, we again recognize that Min's options are restricted at node C due to the large difference between the first and second best nodes.  It is not likely that the variance in the 4.0 node will overcome the gap in scores, which means that a miss-evaluation of the frontier node in Max's favor cannot be mitigated by a second choice from Min.  Node D however does provide that alternative, in the similar score of the 1.3 node.

Figure 2: Normal Distribution
What we are interested in is creating a probability distribution that models the difference between the quiescence score of a frontier node (such as the 1.2 valued node in Figure 1), and the score of that same node if it were fully evaluated one ply deeper than the root (such as the position of node D in Figure 1).  That is to say, we need to understand how much our heuristic evaluation will vary between the time when we first encounter a node in the search tree, and our last chance to change our mind.  A normal distribution (see Figure 2) may be an appropriate model for a good heuristic, where the x-axis represents the score, and the y-axis represents the probability.  This model can be fitted empirically ahead of time, sampling quiescence and max depth evaluation pairs.

Figure 3: Cumulative Distribution Function
Integration of the normal distribution leads to the cumulative distribution function, pictured in Figure 3.  This function can be used to describe the probability that a random draw from the normal distribution will be less than X.  For example, a random draw from the distribution represented by the green curve in Figure 3 will be less than -2.5 20% of the time, if we trace 0.2 on the y-axis to the intersection.  That means with 80% certainty, that distribution is greater than -2.5.

Figure 4: Three choices represented by three cumulative distribution functions
Now consider the situation in Figure 4, where our search has reached node A (a pre-frontier node), in the context of a much larger search.  Nodes B, C, and D are individually represented by the cumulative distribution functions F1, F2, and F3.  We can hedge against the risk of any single node by combining the three cumulative distribution functions to get a single distribution, one which represents the probability of the score of the best of these three alternatives as determined by a full depth search rooted at node A.
Equation 1: Recursively defined CDF accumulating all move alternatives

Equation 1 expresses a combined cumulative probability distribution recursively, where Fi represents CDF for the ith alternative move choice.  Here, we can say at any point that the probability that a random draw from a distribution is at least as great as X is equal to the probability that Fi is greater than X, plus the probability that Fi is less than X times the probability of the combination of the remaining alternatives is greater than X.  A concrete example can be again taken from Figure 3.  Consider the blue, red, and orange distributions, each with its mean at 0 (ie: F1(0) = F2(0) = F3(0) = 0.5), and assume those represent the probability distributions for Figure 4.  Each by itself has a 50% chance of scoring higher than 0.  Combining those into Equation 1 and expanding the probability around 0, we find that we have an 87.5 percent chance of scoring better than zero.  This completely upends our understanding of the score at node A in Figure 4.

We can incorporate this probabilistic model into the existing framework of Alpha Beta Search.  Our search algorithm is initialized with a confidence parameter denoting the risk we are willing to take in selection of our next move.  I would expect that requiring a high confidence (95%) would produce strategically sound games with little risk, as in the style of Karpov.  Requiring low confidence (5%) would result in riskier search paths, but higher rewards, as in the style of Tal.

At every terminal node, the Quiesce function instead returns the cumulative probability function mapping to the score produced by our heuristic evaluation.  Every non-terminal node produces a combined CDF using the methodology described by Equation 1.  Bounds are set by evaluating a CDF at the prescribed confidence level, giving us a fixed expected value to compare against as Beta.  Then, when a child is iterating through its move choices, it updates the probability that one of its subnodes will be greater than Beta.  If the probability is greater than our prescribed confidence, we can do a beta cutoff.  One of the nice consequences of this strategy is that as the variance of our heuristic evaluation drops to 0, the search converges back to a normal alpha beta search.

References to other related approaches:
B* Search
ProbCut Pruning
Conspiracy Search

Similar approaches in the literature:
An optimal game tree propagation for imperfect players

Sunday, May 5, 2013

2013 Google Code Jam Round 1B

Better than Round 1, but still missed by several hundred.  If I had been able to get a second problem complete, I think I could have advanced.  However, my attempt at the third problem turned out to be way too slow.

Problem Osmos:

Armin is playing Osmos, a physics-based puzzle game developed by Hemisphere Games. In this game, he plays a "mote", moving around and absorbing smaller motes.

A "mote" in English is a small particle. In this game, it's a thing that absorbs (or is absorbed by) other things! The game in this problem has a similar idea to Osmos, but does not assume you have played the game.

When Armin's mote absorbs a smaller mote, his mote becomes bigger by the smaller mote's size. Now that it's bigger, it might be able to absorb even more motes. For example: suppose Armin's mote has size 10, and there are other motes of sizes 9, 13 and 19. At the start, Armin's mote can only absorb the mote of size 9. When it absorbs that, it will have size 19. Then it can only absorb the mote of size 13. When it absorbs that, it'll have size 32. Now Armin's mote can absorb the last mote.

Note that Armin's mote can absorb another mote if and only if the other mote is smaller. If the other mote is the same size as his, his mote can't absorb it.

You are responsible for the program that creates motes for Armin to absorb. The program has already created some motes, of various sizes, and has created Armin's mote. Unfortunately, given his mote's size and the list of other motes, it's possible that there's no way for Armin's mote to absorb them all.

You want to fix that. There are two kinds of operations you can perform, in any order, any number of times: you can add a mote of any positive integer size to the game, or you can remove any one of the existing motes. What is the minimum number of times you can perform those operations in order to make it possible for Armin's mote to absorb every other mote?

For example, suppose Armin's mote is of size 10 and the other motes are of sizes [9, 20, 25, 100]. This game isn't currently solvable, but by adding a mote of size 3 and removing the mote of size 100, you can make it solvable in only 2 operations. The answer here is 2.

My Solution:

This is actually a pretty easy problem, both for the small and large sets.  The strategy in Osmos should always be just absorb the smallest remaining mote, if you can.  Once you can't absorb the smallest, you can't absorb any other ones, so it is up to us to either add a new mote which you can absorb, or remove all the remaining motes.  Since we have a vested interest in seeing you absorb the motes with as little help as possible, we want to give you the biggest mote you can currently absorb.  This lends itself well to a recursive solution, where you have your current size, and a sorted list of the remaining motes.

There is one edge case to concern ourselves with... if we are a mote of size 1, then it isn't possible for our mote to absorb anyone.  In this case, we have to return that we remove all remaining motes, as our strategy for comparison would go into an infinite loop.

Another concern is the runtime.  Fortunately, our recursion only needs to run down a single branch for each decision, since if we have to remove motes, we know we have to remove all of them.  With that in mind, we run in O(N) time with respect to the length of the mote list...  The other concern is when the next remaining mote >> our current mote.  But by picking the largest mote available to absorb, we nearly double in size with each iteration, which means we can cover the gap in O(log(M)) time, where M is with respect to the maximum size of a mote.  So our algorithm is something like O( N log(M) ).

Problem Falling Diamonds:

Diamonds are falling from the sky. People are now buying up locations where the diamonds can land, just to own a diamond if one does land there. You have been offered one such place, and want to know whether it is a good deal.
Diamonds are shaped like, you guessed it, diamonds: they are squares with vertices (X-1, Y), (X, Y+1), (X+1, Y) and (X, Y-1) for some X, Y which we call the center of the diamond. All the diamonds are always in the X-Y plane. X is the horizontal direction, Y is the vertical direction. The ground is at Y=0, and positive Y coordinates are above the ground.
The diamonds fall one at a time along the Y axis. This means that they start at (0, Y) with Y very large, and fall vertically down, until they hit either the ground or another diamond.
When a diamond hits the ground, it falls until it is buried into the ground up to its center, and then stops moving. This effectively means that all diamonds stop falling or sliding if their center reaches Y=0.
When a diamond hits another diamond, vertex to vertex, it can start sliding down, without turning, in one of the two possible directions: down and left, or down and right. If there is no diamond immediately blocking either of the sides, it slides left or right with equal probability. If there is a diamond blocking one of the sides, the falling diamond will slide to the other side until it is blocked by another diamond, or becomes buried in the ground. If there are diamonds blocking the paths to the left and to the right, the diamond just stops.

Consider the example in the picture. The first diamond hits the ground and stops when halfway buried, with its center at (0, 0). The second diamond may slide either to the left or to the right with equal probability. Here, it happened to go left. It stops buried in the ground next to the first diamond, at (-2, 0). The third diamond will also hit the first one. Then it will either randomly slide to the right and stop in the ground, or slide to the left, and stop between and above the two already-placed diamonds. It again happened to go left, so it stopped at (-1, 1). The fourth diamond has no choice: it will slide right, and stop in the ground at (2, 0).

My Solution:

So I didn't reach the problem until after the contest had ended, but I was I had.  There are a couple of insights that make this problem much easier to solve than the third problem.  First, you may notice that what we are making is a pyramid, one layer at a time.  Each layer of the pyramid (centered at 0,0) is completed before the next layer can begin, because we need the peak of the pyramid in place before a diamond can slide out to the next tier.

So how many diamonds are in a tier?  Well, we can inspect the sequence to find a pattern.

So if we sum all those layers together, we get...

Where D is the number of diamonds, and L is the layer of interest.  So given that that previous layer must be complete before we start our own layer, we can subtract away everything from the layer below our own, and are left with N-D diamonds.

Now I'd reform the question as follows... how many of the remaining diamonds would have to fall on the opposite side from ours for the layer not to complete up to our location?  Well, our location at height Y means that there would have to be at least Y diamonds falling on our side for us to get a diamond.  That means that if only Y-1 diamonds fell to our side, N-D-(Y-1) diamonds must fall to the other side.  If this number is greater than the other side can hold, then there is a 100% chance that our location will get a diamond.  If N-D < Y-1, then there is a 0% chance that our location will get a diamond.

A quick google search tells me...

The general formula for the probability of w wins out of n trials, where the probability of each win is p, is combin(n,w) × pw × (1-p)(n-w) = [n!/(w! × (n-w)!] × pw × (1-p)(n-w)"  

Since the probability p = 0.5 (its equally likely to fall left or right), and we want to consider all cases from Y to N-D, we can reformulate as follows...

So we've arrived at a formula for calculating the probability.  I haven't checked my numbers, since I didn't actually reach this part in competition, but I'm fairly certain that it will work.  Another optimization in implementing this summation is to recognize that as w gets larger, the added probability gets smaller.  We probably can stop the search early if the probability of the last iteration times the remaining iterations is less than the threshold google checks for... In cases where N-D is quite large relative to Y, this can be very useful.

Problem Garbled Email:

Gagan just got an email from her friend Jorge. The email contains important information, but unfortunately it was corrupted when it was sent: all of the spaces are missing, and after the removal of the spaces, some of the letters have been changed to other letters! All Gagan has now is a string S of lower-case characters.

You know that the email was originally made out of words from the dictionary described below. You also know the letters were changed after the spaces were removed, and that the difference between the indices of any two letter changes is not less than 5. So for example, the string "code jam" could have become "codejam", "dodejbm", "zodejan" or "cidejab", but not "kodezam" (because the distance between the indices of the "k" change and the "z" change is only 4).

What is the minimum number of letters that could have been changed?

My Solution:

This was the problem I actually attempted... and while I produced valid results for the small example test cases, it was far too slow to actually finish even the small dataset.  Funny, I work at an email company, and have a coworker named Jorge... I couldn't resist trying this problem instead of the second.  I'll present my ideas here, but would be interested in seeing what the correct approach is to this problem from someone else.  It reminded me of a problem I'd done on my blog a few months ago... a Morse Decoder... so I felt like I could reuse some of those ideas.  The wildcard matching needed seemed to make this problem much harder... so I went about setting up the dictionary in a tree, where each node of the tree contains a dictionary mapping the next character to the list of possible nodes, as well as a list of all sub nodes.  When inserting the dictionary into this tree, I broke it up into N trees, where each tree was of words of length n.  The tree class then had a function for returning the distance between itself and a given substring from the test case... that code looked like this:

Here, scoreSoFar is initialized to 0, and bestSoFar would be the length of the word (where we would have to change all letters to match something in the dictionary).  Then this code was used in a recursive search, where we'd map out all possible first word fits in the test case, and recursively search the remaining part, keeping the minimum score of that.  I tried caching as I went for reuse, but that didn't really help. This was just way too slow.  Interested in feedback on successful solutions.

So those are my thoughts on Round 1B of the code jam.  Last chance for me is 1C... if I can stay awake that late.