# The Bleeding Edge Machine

## Tuesday, April 11, 2017

### Call me maybe

I think it's super important to understand the limitations of the systems you work with, and the architecture of those systems. The best writing I've seen on this is the Jepsen series of blog posts. They are as good as it gets.

## Sunday, July 12, 2015

### Do Androids Dream of Electric Sheep?

Google's Deep Dream: counting sheep in the clouds

Over the last few weeks, Google open-sourcing its "Deep Dream" code has taken the internet by storm, with a stream of images ranging from transformations of photos into a beautiful impressionist style, to a nightmarish acid trip of dogs, spiders, and slugs, to seemingly conjuring believable images out of thin air. It has captured the imagination of the public, as we seem to be reaching the dawn of a new era of thinking machines that can understand what we say, look at things and know what they are, and even imagine things that don't exist.

A simple feed-forward neural network

But how does this work under the hood? The great advance in deep learning over the past decade has taken an old idea (neural networks) and extended it to extreme depths. A neural network is composed of layers of neurons: one side consists of input neurons (say, one input for each pixel in an image); some hidden layers transform those inputs into an intermediate representation of features and patterns discovered in the input; and a layer of output neurons represents a classification of the thing you are trying to recognize. Each connection between two neurons carries a multiplier "weight" that scales the contribution from the incoming connection when it reaches the next layer. Each neuron in the hidden layer sums the contributions from its incoming connections and applies a non-linear "activation" function that tends to produce values close to either 1 or 0. The weights are adjusted during a training period, where a series of images is presented to the network and the weights are tweaked every time the network produces an incorrect classification. Eventually, the network reaches a point where no further improvement can be found, and the training ends. This isn't always the best possible set of weights for the classification task, but it can be a good approximation.
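That sum-then-squash step can be sketched in a few lines of Go. This is a toy, not real training code: the network shape (3 inputs, 2 hidden neurons, 1 output) and the weights are made up purely for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid squashes a weighted sum into (0, 1) -- the "activation" step
// that pushes values toward either 1 or 0.
func sigmoid(x float64) float64 { return 1.0 / (1.0 + math.Exp(-x)) }

// layer applies one fully connected layer: each output neuron sums its
// weighted inputs plus a bias, then applies the activation function.
func layer(inputs []float64, weights [][]float64, biases []float64) []float64 {
	out := make([]float64, len(weights))
	for i, ws := range weights {
		sum := biases[i]
		for j, w := range ws {
			sum += w * inputs[j]
		}
		out[i] = sigmoid(sum)
	}
	return out
}

func main() {
	// A toy network: 3 "pixel" inputs -> 2 hidden neurons -> 1 output.
	input := []float64{1.0, 0.5, -1.0}
	hidden := layer(input, [][]float64{{0.5, -0.6, 0.1}, {0.8, 0.2, -0.4}}, []float64{0.0, 0.1})
	output := layer(hidden, [][]float64{{1.2, -0.7}}, []float64{0.05})
	fmt.Printf("hidden: %.3f\noutput: %.3f\n", hidden, output)
}
```

Stacking more calls to `layer` is all "depth" means here.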

In the simple example above there is a single hidden layer, but multiple hidden layers can exist, each creating more complex representations than the one before it. In the first hidden layers, features like edges and corners are discovered, forming the input for the next hidden layers. Later hidden layers discover individual features, such as eyes or noses. The final layer combines those discovered features to recognize whole faces, or whatever category you are trying to discover in your dataset. The layers are trained by a process called "backpropagation": conceptually, you can think of it as perturbing each individual weight to see whether the overall classification becomes more or less accurate, then bumping the weight in the direction that makes it more accurate (in practice, the chain rule lets us compute that direction for every weight analytically rather than by trial perturbation).
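The "nudge the weight downhill" idea can be shown on a single weight. This toy uses a perturbation to estimate the slope, matching the intuition above; the loss function and constants here are invented for the example.

```go
package main

import "fmt"

// loss is a stand-in for classification error as a function of one weight:
// squared error of a one-weight model predicting target 2.0 from input 1.5.
func loss(w float64) float64 {
	diff := w*1.5 - 2.0
	return diff * diff
}

// train nudges the weight downhill on the loss surface, estimating the
// slope by perturbing the weight in both directions (backpropagation
// computes this same slope analytically via the chain rule instead).
func train() float64 {
	w, step, eps := 0.0, 0.1, 1e-6
	for i := 0; i < 200; i++ {
		grad := (loss(w+eps) - loss(w-eps)) / (2 * eps)
		w -= step * grad
	}
	return w
}

func main() {
	fmt.Printf("learned weight: %.3f (ideal: %.3f)\n", train(), 2.0/1.5)
}
```

The weight converges to 2.0/1.5, the value that drives the error to zero.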

A problem occurs in training these networks with many hidden layers. Layers close to the output of the network have a dominant impact on the outcome of the training. Layers towards the input make diminishing contributions. So unless the hidden layers near the front of the network start out at good values, the network tends to converge to a local minimum that is not a good representation of the knowledge.

Autoencoder

About a decade ago, researchers developed a technique for greedily training each layer independently of the later layers in the network, called an autoencoder. Here, weights are chosen so that the original inputs can be reproduced from the hidden layer's activations, but where the hidden layer is smaller than the layer feeding into it... you can think of this as a compression of the original knowledge. This is important because the network is forced to capture only the most critical features of the original inputs, so each hidden neuron must capture a specific concept to be effective. Later, a similar effect was achieved by enforcing a sparseness constraint on the hidden layer, where only a small percentage of hidden nodes may fire for each input. This pre-training step can be used to pick a good starting point for backpropagation, and in the end we can have very deep neural networks.

Animals in the stars

One of the problems with neural networks is the opaqueness of what the weights feeding a hidden neuron are actually encoding. This is where "deep dreaming" comes into play. Each neuron in the network outputs numbers closer to 1 when it is excited by the input, and numbers closer to 0 otherwise. So the trick is to take the input layer of a network and tweak its values to maximize the excitement of the neuron we want to understand, while applying a constraint that keeps the pixels of the image within a normal spectrum of input. As this process is repeated over and over, new features emerge in the image. In the image above, the network sees animals in a nebula, and after repeated iterations, you see those faces too.
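The loop of "tweak the input to excite one neuron, clamp to a valid range, repeat" can be sketched with a toy neuron. Everything here is an assumption for illustration: a real network's neuron is learned, while this one is hard-coded to prefer the pattern {1, 0, 0, 1} over a four-pixel "image".

```go
package main

import (
	"fmt"
	"math"
)

// neuron is a stand-in for one trained unit: it fires hardest when the
// four-pixel "image" matches its (hand-coded, for this toy) preferred
// pattern {1, 0, 0, 1}.
func neuron(img []float64) float64 {
	target := []float64{1, 0, 0, 1}
	sum := 0.0
	for i, p := range img {
		sum -= (p - target[i]) * (p - target[i])
	}
	return 1.0 / (1.0 + math.Exp(-sum))
}

// dream starts from flat gray and repeatedly nudges each pixel in the
// direction that excites the neuron more, clamping pixels to [0, 1] so
// the result stays within a normal spectrum of input.
func dream() []float64 {
	img := []float64{0.5, 0.5, 0.5, 0.5}
	eps, step := 1e-4, 0.5
	for iter := 0; iter < 100; iter++ {
		for i := range img {
			orig := img[i]
			img[i] = orig + eps
			up := neuron(img)
			img[i] = orig - eps
			down := neuron(img)
			// move the pixel along the direction of rising excitement
			img[i] = orig + step*(up-down)/(2*eps)
			img[i] = math.Max(0, math.Min(1, img[i]))
		}
	}
	return img
}

func main() {
	fmt.Printf("dreamed image: %.2f\n", dream())
}
```

The flat gray input "dreams" its way toward the pattern the neuron is looking for, which is exactly how faces emerge from a nebula.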

Before

After

Of course, any curious software developer wants to play around with the code and try it for themselves. While Google has open sourced the code, there are a number of dependencies to install, and I ended up with conflicting versions during the install. The best solution I found was a Docker image someone provided to the community. The one conclusion I seem to have reached is that androids don't dream of electric sheep. They dream about dogs.

## Tuesday, October 28, 2014

### Applying Interfaces to external dependencies: Golang

Many object-oriented languages center on the interface as a communication contract between components. Engineering classes with this in mind typically enhances the maintainability and reusability of code by allowing dependencies to be injected, agnostic to the particular implementation. Good unit tests take advantage of this by mocking all dependencies at the interface level and injecting them into the unit under test. This becomes very important at the boundaries of a system, where complex interactions with external systems can make testing a nightmare.

As good a practice as this is, many libraries we find ourselves depending on don't implement an interface. Consequently, we have to decide whether to include those dependencies within the scope of our tests, or to extract an interface from the library, then create an adapter wrapper linking the original library to the new interface, and finally a separate mock, also implementing the interface. All this work to mock out the external dependency, and you would still have holes in your coverage in the adapter, unless you wanted to drag the external dependency into your tests again. Yuk!

Enter Go's implicit interfaces, a compile-time cousin of duck typing. More commonly seen in dynamic languages, the duck-typing mantra asserts that "if it quacks like a duck, it is a duck": the behavior implies the type, rather than the type being called out explicitly through inheritance. Any type that implements Quack() implicitly satisfies a Duck interface, even if the original developer never made that distinction.

So how does that impact our testing strategy? We simply create an interface whose signatures match the pieces of the dependency we actually use. Our tests can use mocks conforming to this interface, and we have neither the overhead of an adapter, nor holes in our test coverage, nor the complexity of scoping tests to include the dependency. It's a pretty nice use case for applying interfaces after the fact.
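A minimal sketch of the technique, using `*http.Client` from the standard library as the external dependency; the `Doer` interface name and the `fakeClient` test double are my own choices for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// Doer captures the one method of *http.Client that our code actually
// uses. *http.Client satisfies it implicitly -- no adapter required.
type Doer interface {
	Do(req *http.Request) (*http.Response, error)
}

// Fetch depends on the interface, so production code can inject
// http.DefaultClient while tests inject a hand-rolled fake.
func Fetch(c Doer, url string) (int, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return resp.StatusCode, nil
}

// fakeClient is the test double: same method set, no network.
type fakeClient struct{ status int }

func (f fakeClient) Do(req *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: f.status, Body: http.NoBody}, nil
}

func main() {
	code, _ := Fetch(fakeClient{status: 418}, "http://example.com/teapot")
	fmt.Println("status:", code)
}
```

Neither `net/http` nor `Fetch` ever mentions the other's concrete type; the interface sits between them, declared after the fact.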

## Thursday, January 30, 2014

### RabbitMQ

Our company has grown to the point where our service is so highly distributed, that node reliability is becoming a major pain point. You figure any given server in our datacenter fails maybe once every 1000 days... but now we have over 1000 servers, and node failures are becoming a daily problem.

Like any high-availability service, we had focused our efforts on failure strategies and fallback modes to ensure our customers' jobs are faithfully driven to completion in the event of a node failure. Our system scales horizontally, and redundancy is provided for new work coming into the system by passing jobs to other workers via the load balancer, even when an individual node is failing. However, because workers own their jobs, hardware failures like network outages and disk crashes make it very hard to recover in-flight jobs without assistance from operations. The timely completion of our customers' tasks makes these failures a race against the clock for our team, which means late-night pages and all-nighters. What this really indicates is that a more holistic view of our system is needed.

Enter message queuing services like RabbitMQ. The big win, from my view, is an approach that assures jobs are never located only on a single point of failure. As jobs flow from the queue to a worker, they live on both the queue AND the worker processing the job. If a node fails without acknowledging completion of a job it has in flight, RabbitMQ will redeliver the job to another node in the system. Rabbit queues are themselves replicated within a Rabbit cluster that is abstracted away from the producer, so a failure of a node in the queuing system silently fails over without additional action from either producers or consumers. Workflows become a series of handshakes: a producer holds on to a message until it hears acknowledgement from the queue, and a consumer acknowledges the queue only when it has finished processing a job and handed it to the next queue in the pipeline.

This frees our code base in a number of ways. First, it simplifies failure modes. Any local failure can be treated as unacked work, relying on the queue's dedication to retrying jobs on other workers to assure the message eventually makes it through. Second, it provides a consistent interface between workers in our workflow. Should we need to inject a new worker stage into the workflow, the previous and following stages can remain blissfully unaware of the change. Third, it simplifies how workers manage their own queues internally. Whereas before, potentially complicated structures were needed to manage queues locally, queue management is now outsourced, and we can focus our effort on our specific domain instead.
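The ack/redeliver handshake can be sketched with a toy in-memory queue. To be clear, this is a simulation of the broker's behavior, not the RabbitMQ client API: all type and method names here are invented for the example.

```go
package main

import "fmt"

// queue is an in-memory stand-in for a broker: delivered jobs stay on
// the unacked list until the consumer confirms completion, and can be
// requeued if the worker dies mid-job.
type queue struct {
	ready   []string
	unacked map[string]bool
}

func newQueue(jobs ...string) *queue {
	return &queue{ready: jobs, unacked: map[string]bool{}}
}

// deliver hands the next job to a consumer but keeps it on the unacked
// list -- the job now lives in both places at once.
func (q *queue) deliver() (string, bool) {
	if len(q.ready) == 0 {
		return "", false
	}
	job := q.ready[0]
	q.ready = q.ready[1:]
	q.unacked[job] = true
	return job, true
}

// ack removes a finished job for good.
func (q *queue) ack(job string) { delete(q.unacked, job) }

// nack simulates a dead worker: the broker requeues the unacked job so
// another worker will eventually pick it up.
func (q *queue) nack(job string) {
	delete(q.unacked, job)
	q.ready = append(q.ready, job)
}

func main() {
	q := newQueue("job-1", "job-2")
	j, _ := q.deliver()
	q.nack(j) // worker crashed mid-job; job-1 goes back on the queue
	for {
		j, ok := q.deliver()
		if !ok {
			break
		}
		fmt.Println("processed", j)
		q.ack(j)
	}
}
```

The crashed worker's job is not lost; it simply comes around again, which is the "unacked work will be retried" guarantee the paragraph above leans on.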

There are many other benefits, but between increasing our system's reliability against node failures and simplifying our code base, I think services like RabbitMQ present a clear win for developers of highly scalable systems.

## Monday, August 26, 2013

### Nokia 1020

So I've been really excited to get my new Nokia Lumia 1020... for all you shutterbugs out there, it has a lot of great features. But beware, the Nokia apps still have a lot of bugs, and Nokia doesn't seem to have great community support for feedback.

**To date:**

- Creative Studio consistently crashes on some pictures when I mix color pop with radial tilt blurring.
- Cinemagraph is the coolest app, but it isn't the most intuitive feature. Unfortunately, it has lost some great pictures for no apparent reason, which bums me out more than anything. Also, uploads to SkyDrive don't come through as animated GIFs; they are converted to still-frame JPEGs.
- Sometimes the Nokia software seems to cause the phone to hang.
- When photos are "saved" in the Nokia apps, a copy isn't placed in the camera roll, and there is no obvious way to navigate to the saved pictures directory. It often becomes easier to pull them off SkyDrive afterwards.

## Saturday, August 17, 2013

### A Heuristic for Ultimate Tic Tac Toe

If you haven't heard of ultimate tic-tac-toe, you have been missing out on all the rage in the tic-tac-toe world. OK, maybe there isn't that much of a tic-tac-toe world... but this variant of the game everyone played and mastered as a child adds additional levels of complexity, bringing the game back to the world of adults. One of my coworkers has been working on a website for the game, still in beta, found here... and after playing it a few times, I decided to try to write a bot that could beat me at the game.

The rules for ultimate tic-tac-toe are similar to the original game, but with an extra dimension. As you can see, each square of the large tic-tac-toe board has another tic-tac-toe board within it, and to take a square on the larger board, you must win the game within that square. The real wrinkle comes next...

Any move made in a little board dictates which little board is to be played next. In this example, the red X is in the top right corner of its own little board, which means the next move must be played in the top right corner of the big board.

Anyone who has done programming for turn-based, complete-information games knows that heuristic search using some form of alpha-beta pruning is the way to go, such as this example for regular tic-tac-toe. For simple games like that, all you need is a move generator for each position and a heuristic score that can be used to compare positions at the end of each search to tell you who is winning. In game trees you can search to full depth, you can just return something like 1 for X winning, -1 for O winning, and 0 for a cat's game. But ultimate tic-tac-toe has a maximum depth of 81 with a branching factor of up to 9, so the heuristic becomes the interesting part.

Some interesting observations emerge from looking at one mini-board in isolation. In regular tic-tac-toe there is always a back-and-forth of who plays next, but here one player could be sent to the same board many more times than their opponent. In fact, it is possible that only one player ever moves on a given board. Also, since the macro view of the game dominates the micro view, even when a board could clearly be won on the next move, that doesn't imply the next move should take the win. Good moves appear almost random from the perspective of the mini-board. Taking this to the extreme, from any starting position on a mini-board we can estimate the probability of a winner by simulating thousands of random continuations, dropping pieces from either side onto empty squares. Since each mini-board has nine squares, and each square holds an X, an O, or nothing, there are 3^9 possible positions to consider, which works out to just 19,683 positions. I wrote a program to pre-populate a table mapping every possible mini-board position to the probability that it would be won by X or O, given the randomness assumptions above.

| 52.3% | 50.9% | 49.3% |
| 45.2% | 62.7% | 59.9% |
| 11.1% | 90.1% | 50.3% |

Applied at game time, we can reduce the game to a 3x3 board of probabilities that each square of the big board will be won by a given player. You could use this to speculate on your chances of winning a given row or column by combining the probabilities... for instance, my chance of winning the center column is .509 * .627 * .901, or approximately 29%. However, the game is a race: while I may have better chances of winning the center column, that must be balanced against the chance that my opponent wins the left column. To adjust for this, I decided to take a probability board like the one above and again randomly simulate thousands of games from such positions, using each square's probability to decide who would win that square, and repeating until the game had ended. This gives a pseudo-probability of who would win the game from such a position. Of course, the space of such positions is far too large to cache, so I decided to use these probabilities to train a linear regression model.
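The naive line-probability calculation is just a product over three squares; here it is over the table above (the layout and `colProb` helper are mine, and, as noted, this ignores the race with the opponent):

```go
package main

import "fmt"

// colProb is the naive chance of sweeping column c of the big board:
// the product of its three per-square win probabilities.
func colProb(p [3][3]float64, c int) float64 {
	return p[0][c] * p[1][c] * p[2][c]
}

func main() {
	// Per-square win probabilities laid out as in the table above.
	p := [3][3]float64{
		{0.523, 0.509, 0.493},
		{0.452, 0.627, 0.599},
		{0.111, 0.901, 0.503},
	}
	for c := 0; c < 3; c++ {
		fmt.Printf("column %d: %.1f%%\n", c, 100*colProb(p, c))
	}
}
```

The center column comes out near 29%, while the left column, dragged down by the 11.1% square, is under 3% -- exactly the kind of asymmetry the simulation step is meant to weigh.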

So for those of you who don't remember, linear regression involves fitting sample inputs to sample outputs in the form A*X + B = Y, where Y is the value our model tries to predict. In our case, I want some combination of the probability board as our X, and the simulated probability of winning the game described above as our Y. Then we can solve the system of equations for a set of scalars A that combine an arbitrary set of inputs into a probability of winning. Rather than make X simply the probabilities from the board above, I decided to collect the combined probabilities of winning each row, column, and diagonal for each player, and use those instead. So our A*X + B = Y starts to look like...

One of the reasons I like using linear classifiers is that certain techniques make training them extremely easy. In this case, using the pseudo-inverse allows us to solve this equation for A very easily, even given a large set of sample points that don't form a square matrix. The result is a set of weights that can be multiplied by the input values and summed into a predicted score for the game.
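For a full-rank tall matrix, applying the pseudo-inverse is equivalent to solving the normal equations (AᵀA)x = Aᵀy, which a small Gaussian-elimination routine can handle. This is a from-scratch sketch for tiny systems (a real project would reach for a linear algebra library); the example data fitting y = 2x + 1 is invented.

```go
package main

import (
	"fmt"
	"math"
)

// solve runs Gaussian elimination with partial pivoting on M a = v.
func solve(M [][]float64, v []float64) []float64 {
	n := len(v)
	for i := 0; i < n; i++ {
		best := i
		for r := i + 1; r < n; r++ {
			if math.Abs(M[r][i]) > math.Abs(M[best][i]) {
				best = r
			}
		}
		M[i], M[best] = M[best], M[i]
		v[i], v[best] = v[best], v[i]
		for r := i + 1; r < n; r++ {
			f := M[r][i] / M[i][i]
			for c := i; c < n; c++ {
				M[r][c] -= f * M[i][c]
			}
			v[r] -= f * v[i]
		}
	}
	a := make([]float64, n)
	for i := n - 1; i >= 0; i-- {
		a[i] = v[i]
		for c := i + 1; c < n; c++ {
			a[i] -= M[i][c] * a[c]
		}
		a[i] /= M[i][i]
	}
	return a
}

// leastSquares fits A*x = y for a tall matrix A by solving the normal
// equations (A^T A) x = A^T y -- the same answer the pseudo-inverse gives.
func leastSquares(A [][]float64, y []float64) []float64 {
	m, n := len(A), len(A[0])
	AtA := make([][]float64, n)
	Aty := make([]float64, n)
	for i := 0; i < n; i++ {
		AtA[i] = make([]float64, n)
		for j := 0; j < n; j++ {
			for r := 0; r < m; r++ {
				AtA[i][j] += A[r][i] * A[r][j]
			}
		}
		for r := 0; r < m; r++ {
			Aty[i] += A[r][i] * y[r]
		}
	}
	return solve(AtA, Aty)
}

func main() {
	// Noisy samples of y = 2x + 1; the column of 1s learns the intercept B.
	A := [][]float64{{0, 1}, {1, 1}, {2, 1}, {3, 1}}
	y := []float64{1.1, 2.9, 5.2, 6.9}
	w := leastSquares(A, y)
	fmt.Printf("slope %.2f, intercept %.2f\n", w[0], w[1])
}
```

Four sample points, two unknowns: the system isn't square, but the fit still drops out in one shot -- no iterative training loop required.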

There was a bit of code to this work, which I posted on GitHub here. Feel free to use these ideas generically to create a heuristic score without having to understand the tactics and strategies of whatever game you are creating.

## Thursday, May 23, 2013

### Probabilistic Alpha-Beta Search

Any good chess player can tell you the difference between the styles of Mikhail Tal and Anatoly Karpov. Tal was known as the magician, someone who sought to guide games into the sharpest possible lines where he could induce mistakes from his opponents, even while taking great risks of his own. Karpov, on the other hand, was a python: he'd safely accumulate small advantages and slowly squeeze his opponents until they cracked. Their styles couldn't have been more different, but their results were undeniable; both joined the exclusive club of World Chess Champions. And yet neither of these styles is well captured in computer chess, whose engines have become the cold assassins of the chess world. I think capturing the essence of these stylistic differences could further improve the results of computer chess engines, and help tailor their play to a given opponent's strengths and weaknesses, without compromising quality. What is the missing factor? Quantification of risk.

What we are interested in is creating a probability distribution that models the difference between the quiescence score of a frontier node (such as the 1.2 valued node in Figure 1), and the score of that same node if it were fully evaluated one ply deeper than the root (such as the position of node D in Figure 1). That is to say, we need to understand how much our heuristic evaluation will vary between the time when we first encounter a node in the search tree, and our last chance to change our mind. A normal distribution (see Figure 2) may be an appropriate model for a good heuristic, where the x-axis represents the score, and the y-axis represents the probability. This model can be fitted empirically ahead of time, sampling quiescence and max depth evaluation pairs.

Integration of the normal distribution leads to the cumulative distribution function, pictured in Figure 3. This function describes the probability that a random draw from the normal distribution will be less than X. For example, a random draw from the distribution represented by the green curve in Figure 3 will be less than -2.5 20% of the time, if we trace 0.2 on the y-axis to the intersection. That means, with 80% certainty, a draw from that distribution is greater than -2.5.
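The normal CDF has a standard closed form via the error function. A quick sketch; the mean and standard deviation here (0 and 3) are my own choices, picked so the numbers roughly reproduce the 20%-below-minus-2.5 example:

```go
package main

import (
	"fmt"
	"math"
)

// cdf is the cumulative distribution function of a Normal(mean, sd):
// the probability that a random draw falls below x.
func cdf(x, mean, sd float64) float64 {
	return 0.5 * (1 + math.Erf((x-mean)/(sd*math.Sqrt2)))
}

func main() {
	// With mean 0 and sd 3, a draw lands below -2.5 about 20% of the
	// time -- i.e. it exceeds -2.5 with roughly 80% certainty.
	fmt.Printf("P(draw < -2.5) = %.2f\n", cdf(-2.5, 0, 3))
}
```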

Now consider the situation in Figure 4, where our search has reached node A (a pre-frontier node), in the context of a much larger search. Nodes B, C, and D are individually represented by the cumulative distribution functions F1, F2, and F3. We can hedge against the risk of any single node by combining the three cumulative distribution functions to get a single distribution, one which represents the probability of the score of the best of these three alternatives as determined by a full depth search rooted at node A.

Equation 1 expresses the combined cumulative probability distribution recursively, where Fi represents the CDF of the ith alternative move choice. At any point, the probability that a random draw from the combination is at least as great as X equals the probability that Fi is greater than X, plus the probability that Fi is less than X times the probability that the combination of the remaining alternatives is greater than X. A concrete example can be taken from Figure 3. Consider the blue, red, and orange distributions, each with its mean at 0 (i.e. F1(0) = F2(0) = F3(0) = 0.5), and assume these represent the probability distributions for Figure 4. Each by itself has a 50% chance of scoring higher than 0. Combining them via Equation 1 and expanding the probability around 0, we find an 87.5% chance of scoring better than zero. This completely upends our understanding of the score at node A in Figure 4.
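Expanding that recursion at a fixed X collapses to a simple closed form: the best alternative exceeds X unless every alternative falls below it. A sketch (the function name is mine):

```go
package main

import "fmt"

// combined returns the probability that the best of several alternatives
// exceeds x, given each alternative's CDF evaluated at x. Expanding the
// recursion of Equation 1 term by term gives 1 - F1(x)*F2(x)*...*Fn(x).
func combined(cdfsAtX []float64) float64 {
	pAllBelow := 1.0
	for _, F := range cdfsAtX {
		pAllBelow *= F
	}
	return 1 - pAllBelow
}

func main() {
	// Three alternatives, each a coin flip of beating 0 (F(0) = 0.5):
	fmt.Printf("P(best > 0) = %.3f\n", combined([]float64{0.5, 0.5, 0.5}))
	// prints P(best > 0) = 0.875
}
```

Three 50/50 alternatives combine into 1 - 0.5^3 = 87.5%, the number quoted above.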

We can incorporate this probabilistic model into the existing framework of Alpha Beta Search. Our search algorithm is initialized with a confidence parameter denoting the risk we are willing to take in selection of our next move. I would expect that requiring a high confidence (95%) would produce strategically sound games with little risk, as in the style of Karpov. Requiring low confidence (5%) would result in riskier search paths, but higher rewards, as in the style of Tal.

At every terminal node, the Quiesce function instead returns the cumulative probability function mapping to the score produced by our heuristic evaluation. Every non-terminal node produces a combined CDF using the methodology described by Equation 1. Bounds are set by evaluating a CDF at the prescribed confidence level, giving us a fixed expected value to compare against as Beta. Then, when a child is iterating through its move choices, it updates the probability that one of its subnodes will be greater than Beta. If the probability is greater than our prescribed confidence, we can do a beta cutoff. One of the nice consequences of this strategy is that as the variance of our heuristic evaluation drops to 0, the search converges back to a normal alpha beta search.

References to other related approaches:

- B* Search
- ProbCut Pruning
- Conspiracy Search

Similar approaches in the literature:

- An optimal game tree propagation for imperfect players

Figure 1: Game tree with the minimax path highlighted in red, and a prospective better path in green

What does it mean to include "risk" in a turn based two player zero sum game? Consider Figure 1, which highlights the typical minimax approach to game trees in red. Each player has competing goals, Max is looking to choose a path that maximizes his score, thus picking the maximum nodes in the next level, while Min is trying to minimize the score on his turn. Scores are bubbled up from the bottom. We see from this perspective, Max's optimal choice would be to pick the left branch, because given optimal play from Min, the score of 1.2 produces a slightly superior position. If the scores were perfect, indeed this analysis would be correct, and risk would be completely mitigated.

The reality in games where the search depth rarely allows one to reach terminal nodes is that we use a heuristic as a proxy for the true score. When the frontier node is finally reached, we repeat the process of creating a tree to search, bubbling up a new score that serves as a better estimation of the actual value of the position at that node. This means there is a variance in the heuristic that must be considered when choosing our path through the tree. When we eventually reach a pre-frontier node, we will have reached the last responsible moment from which we can change our decision. The score of pre-frontier nodes as viewed from the root must then be some combination that accounts for the likelihood that the proxy score for any frontier node has improved by the time we reach the pre-frontier level.

Consider node A vs node B. If we were to reach node A and then discover that our heuristic evaluation overestimated the score of the child node valued 1.2, we have left ourselves without a viable alternative, as the variance in our heuristic score is unlikely to overcome the 6.2 point gap we predicted in the alternative branch. If the true score of the best choice at node A was -1.0, we would have no recourse but to continue down that path. Node B, on the other hand, provides two choices of similar value. If we were to reach Node B and discover that our 1.1 option actually produced a score of -1.0, we would still have a viable alternative in the 1.0 node. Minimax fails to capture this characteristic.

Now consider nodes C and D. From the perspective of Max, we again recognize that Min's options are restricted at node C due to the large difference between the first- and second-best nodes. It is unlikely that the variance in the 4.0 node will overcome the gap in scores, which means a misevaluation of the frontier node in Max's favor cannot be mitigated by a second choice from Min. Node D, however, does provide that alternative, in the similar score of the 1.3 node.

Figure 2: Normal distribution

Figure 3: Cumulative distribution function

Figure 4: Three choices represented by three cumulative distribution functions

Equation 1: Recursively defined CDF accumulating all move alternatives

