Improving event detectors

September 21, 2008

Followup to: Levels of representation.

Consider a binary event detector that, say, detects the presence of tigers in a picture. This detector defines a certain set of pictures for which it activates. If this is a simple preliminary detector constructed from general description of tigers, it can be significantly improved. But what constitutes an improvement? A change in detector could make it a better tiger-detector, worse tiger-detector, or even turn it into a cloud-detector.

The original detector is a vague question, and the improved detector is an answer. The initial outline allows to identify the clusters of known events fitting the description and change the detector to includes these clusters but not stray peripheral events that come through the boundary, and also to plan for extrapolation of these clusters. Event detector is constructed to be applied in the future, so it needs to be able to recognize not only causal patterns that were never observed before, but also causal patterns that never existed but will appear in the future. In a way, learning is a self-improvement action that agent applies to its future self to process certain events better.

Predicting the subsequent events requires a probabilistic model of environment, and relying on few exemplars that a newly-fledged detector managed to observe is not sufficient. If a new detector is expressed in terms of few other existing detectors (right from the vague question stage), rather than as a classifier applied to the whole input domain, the problem becomes much simpler. Existing detectors already have a good idea about probability distribution of their states, so probability distribution for the new detector roughly derives from them, modulo dependency. The form of the new detector can then be adjusted, guided by this probability distribution and facts about dependencies observed from few exemplars.

Intermediate events allow tractable inference, their purpose is in implementing computational steps that follow the natural structure of modeled environment. An event detector needs to at least not be redundant (not repeat another available detector) and be probable enough to assume more than one state during some reasonable timeframe (if a detector isn’t ever expected to activate, there is no point in keeping it around). When detector supports multiple states, many of these states need to hold sufficient probability mass. After original form of a new event detector distributes the probability mass among its states, based on the knowledge about detectors from which it’s constructed, one of the pressures for the refinement of new detector is in shifting the boundaries of its states to even out (or, at least, keep within limits) the distribution of probability.

The form events tend to assume depends on many aspects of the cognitive algorithm, and only on crude level do they correspond to natural events of the environment. Within the joint detectors of the natural events, factoring into individual detectors may look rather unnatural, like set of floating-point numbers that have a “zero” at twentieth bit in memory. The high-level dynamic of events repeats the structure of environment, but the low-level dynamic of individual detectors may look differently. The low-level dynamic needs to be specifically designed to implement the required high-level dynamic.


Causation as robust indication

August 1, 2008

Followup to: The flow of reality, Semantics of the rules of thumb.

There are multiple senses in which words causation and causality are used, and many historically accumulated complications. Basically, causality is a relationship between two events, cause and effect, that establishes the effect as reliably following (from) the cause. Causal relations are naturally used for describing functional structure of environment. In addition to enumeration of events (“button was pressed; light turned on”), causal relations establish the internal structure that binds these events (“button was pressed, which caused light to turn on”). Functional structure allows to model the dynamics of environment in different contexts, acting as an algorithm that computes state of environment given different initial conditions. This way, causal relations not just describe the structure of a fixed scene, but generalize to other scenes.

The sense in which effect follows from cause reliably is the subject of many disputes. Restricting the notion of causality only to relations that hold deterministically makes it almost useless for modeling real environment. Thus, effect should be allowed to sometimes not follow the cause, which contributes to the problem of connection between correlation and causation. Two events may be highly correlated, yet not causally dependent, and otherwise, two events may be causally dependent, but without overwhelming correlation. Apples often lie on the ground near the apple tree, but they don’t cause the apple tree to appear nearby. Smoking causes cancer, but only sometimes.

The difference that is usually pointed out between statistical properties and causal properties appears when studied scene is changed. One of the central problems in statistics is regression, inference of the whole distribution from limited number of observations. It can consist in e.g. finding the values of parameters that give a distribution from parameterized set that fits the data best. The distribution that describes the data is considered to be fixed and the same for all data points. Even when data points are taken from a process that changes its state over time, distribution that describes development of the process is still specific (even though it’s uncertain). Statistical properties of a single distribution describing static conditions are contrasted to causal properties that describe the behavior of the scene when conditions change. Statistical properties change with conditions, but causal properties are fixed, and can be used to regenerate statistical properties of the scene from changed conditions.

Yet if we consider the development of environment as a whole, there are no external changes to be applied to the dynamics of environment. Everything that happens, happens within the environment, every change is interaction between configurations within the dynamics of reality. From this perspective, causal properties may be regarded as statistical hypotheses that hold for considered part of the environment both before and after the change is applied to it through interaction with other parts of the environment. Statistical properties of a fixed distribution thus apply only to a limited part of the dynamics of the scene when it isn’t changed, while causal properties are more general statistical properties that persevere through “external” changes.

Let’s focus on a specific class of causal properties that describe rules of thumb which indicate events given other events. This way, a causality relationship between cause and effect becomes a rule of thumb that provides evidence for effect given the cause. The feature that distinguishes causal rules from other rules of thumb is that they provide evidence for effect given the cause in many contexts, even in contexts that were changed through interaction with various causal patterns. Causal rules are rules of thumb that are general enough not to break down as the dynamics of environment unfolds. In this sense, causal rules are the only reliable kind of rules of thumb.


Semantics of the rules of thumb

July 25, 2008

Followup to: Levels of structure.

Rules of thumb work only in certain environment. Even under the constraints of laws of physics an unusual physical configuration may act contrary to most of them. Each rule applies only in a certain context, and together they can support a plausible model of environment only in a certain global context. When a rule is applied, context is only checked (for applicability) based on the information about events derived from surface properties. On the other hand, rules of thumb encode information about the environment that allows to perform inference just from surface properties, without knowing physical configurations accurately enough to perform the same inferences using laws of physics.

Rules of thumb are fundamentally hypotheses, probability distributions over physical configurations, that can gain high probability given information about distribution of configurations represented by other probable events (current context), which in turn lends higher probability to other events under the hypothesis. Starting only from physical laws leaves hypotheses representing the traces of different physical configurations in time equally likely. Singling out laws of thumb prioritizes certain kinds of configurations, those expected to be possible (more likely) in the actual state of the environment and at the same time useful for inference from the surface properties. The whole collection of rules of thumb provides an expressive language for assembling complex hypotheses about environment from reusable parts. The probabilities assigned to hypotheses can be represented implicitly, by supporting a set of events that pass a certain threshold of probability. Rules of thumb act as update rules on this set, removing the events that are inferred to be improbable given the rest of the events, and adding events inferred to be probable. Doubt everything, but one at a time.


Odds, evidence, and an intuitive form of Bayes’ theorem

June 23, 2008

Bayes’ theorem establishes how to update probabilities when given additional evidence. The standard form of Bayes’ theorem is as follows:

\displaystyle P(A|B)=\frac{P(A)P(B|A)}{P(B|A)P(A)+P(B|\neg A)P(\neg A)}

It shows how to update prior probability P(A) of event A, when given event B as evidence. Updated probability, P(A|B), reads “probability of A given B”. Conditional probability is defined as P(A|B)=\frac{P(A,B)}{P(B)} where P(A,B) is probability of both events A and B happening at the same time.

To get the updated probability P(A|B), we also need to know how strong B is as evidence for A. This is specified by conditional probabilities P(B|A) and P(B|\neg A), that is probability of obtaining evidence B when A is present (correct indication), and probability of receiving evidence B even though A is not present (wrong indication).

Bayes’ theorem follows straightforwardly from definition of conditional probability, and has sufficiently simple form, but obtaining intuitive sense of it is rather tricky. The intuitive understanding is useful for discussions of less technical situations, which can still be informed by technical knowledge from probability theory. The formula has all these probabilities mixed up, so the structure of probability updating is not intuitively clear.

One thing that is not explicitly expressed is that the result doesn’t depend on individual values of P(B|A) and P(B|\neg A), but only on their ratio, \frac{P(B|A)}{P(B|\neg A)}. If we rewrite the formula to express this fact, and rearrange other terms a little, we can get the following form:

\displaystyle \frac{P(A|B)}{P(\neg A|B)}=\frac{P(A)}{P(\neg A)}\cdot\frac{P(B|A)}{P(B|\neg A)}

The value \mathrm{Odds}(X)=\frac{P(X)}{1-P(X)} is called odds of event X. Odds is the ratio of probability of event happening to probability of it not happening, P(X): P(\neg X). Also, the strength of B as evidence for A is called Bayes factor, K(B|A)=\frac{P(B|A)}{P(B|\neg A)}. In these terms, Bayes’ theorem can be rewritten in the following simple form:

\displaystyle \mathrm{Odds}(A|B)=\mathrm{Odds}(A)\cdot K(B|A)

Both odds and Bayes factor show the ratios between alternative outcomes: presence to absence of event for odds, and correct indication to wrong indication for Bayes factor. In this form of Bayes’ theorem, it is intuitively clear how the odds for the presence of event A given an event B are calculated: we start with previously known odds for event A, and multiply it by odds of B being a correct indication of A. Observing B only shifts the odds of our event A, depending on how good an indicator B is for A.

For example, if we have a test B for illness A, and probability of testing positive for people with the illness, P(B|A), is 90%, while the probability of testing positive for people without illness, P(B|\neg A), is 1%, and the portion of population that has the illness, P(A), is 0.1%, then what is the probability that a person that tested positive has the illness, P(A|B)? It’s nowhere near 90%: the prior probability of having illness, 0.1%, was shifted by the evidence, and even though evidence is rather strong, prior probability is still too low, so the posterior probability is only 8.3%.

Let’s see how the same problem looks using the form of Bayes’ theorem based on odds. Prior odds of having the illness are \frac{0.1\%}{100\%-0.1\%}\approx 1:1000, Bayes factor is \frac{90\%}{1\%}=90:1, so the posterior odds are

\displaystyle \mathrm{Odds}(A|B)\approx(1:1000)\cdot(90:1)=9:100

So, the odds are about 1:10 against the conclusion that patient has the illness. Converting from odds back to probabilities, using the formula P(X)=\frac{\mathrm{Odds}(X)}{1+\mathrm{Odds}(X)}, we get the result that P(A|B) is about 8.3%. Note that we don’t always need to convert back to probabilities.

Odds range in (0,+\infty), with value of 1 showing even odds, so that “positive” odds range in (1,+\infty), while “negative” odds range in (0,1). A more symmetric range, and even simpler form of Bayes’ theorem, can be obtained by taking the logarithm of odds and Bayes factor. The logarithm of odds is called log odds or logit. The logarithm of Bayes factor is called weight of evidence. It is useful to take base 2 logarithms, so that the result can be measured in bits. These are not bits on a hard drive, but a mathematician’s bits, so the value in these bits can be fractional and even negative. Taking the logarithm of the both sides of odds-based Bayes’ theorem, we get this rule:

\displaystyle \log_2(\mathrm{Odds}(A|B))=\log_2(\mathrm{Odds}(A))+\log_2(K(B|A))

In this form, the probability of event is represented by evidence about event, measured in bits. Evidence ranges in (-\infty,+\infty), and probability of 50%, or odds of 1:1, corresponds to evidence of 0. Positive evidence corresponds to probability above 50% or odds above 1, and so on. Evidence provided by the indicator, \log_2(K(B|A)), is added to previously known evidence about the event. This also roughly shows how to chain the Bayes’ rule: you just add up all the evidence, each piece of evidence weighting either positively or negatively, and the resulting sum of all evidence shows the conclusion (although one should be very careful not to add the same piece of evidence multiple times, explicitly or implicitly through dependent events).

In conclusion, let’s see how our example works with log odds. Prior evidence for A is log_2(1:1000)\approx -10, the positive test gives additional evidence log_2(90:1)\approx 6.5, so the sum of evidence is -10+6.5=-3.5. This calculation immediately shows that the outcome A is unlikely even given B. The odds are given by 1:2^{3.5}\approx 1:11.

Also note that the negative result of the test would have given a different amount of evidence, namely

\displaystyle \log_2(\mathrm{Odds}(\neg B|A))=\log_2\left(\frac{100\%-90\%}{100\%-1\%}\right)
=\log_2(10:99)\approx -3.3

so that the total evidence for A after receiving the negative test would be -10-3.3=-13.3.