PointNet and SENet

I came across two network architectures that I think can be discussed together, as they share a similar core idea: extract and leverage some global information from the data with global pooling.

PointNet

The objective of PointNet is to classify each point of a point cloud and to detect the object described by the point cloud. Consequently, each point can belong to a different semantic class, but all points share the same object class.

One important property of a point cloud is that its points are not in any particular order. Therefore, it does not make much sense to train a convolutional layer or a fully connected layer to intermix the points. The simplest reasonable operation to combine all the information is pooling (max or average), which does not depend on the order of its inputs. For PointNet, each point is first processed individually and then a global feature is generated using max pooling. The object described by the point cloud is then classified from the global feature.
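
To make the order-invariance concrete, here is a minimal pure-Python sketch (not the actual PointNet code). The layer sizes, the random weights, and the names `point_mlp` and `global_feature` are all assumptions for illustration. It applies the same small layer to every point and then max-pools over the points.

```python
import random

def point_mlp(point, weights, bias):
    """Shared per-point layer (linear + ReLU): the same weights are used for every point."""
    return [max(0.0, sum(w * c for w, c in zip(row, point)) + b)
            for row, b in zip(weights, bias)]

def global_feature(points, weights, bias):
    """Max-pool each feature dimension over all points."""
    features = [point_mlp(p, weights, bias) for p in points]
    return [max(column) for column in zip(*features)]

rng = random.Random(0)
# Hypothetical sizes: 3-D input points mapped to 4 features.
weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
bias = [rng.uniform(-1, 1) for _ in range(4)]
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]

shuffled = cloud[:]
rng.shuffle(shuffled)

# The pooled feature is the same no matter how the points are ordered.
print(global_feature(cloud, weights, bias) == global_feature(shuffled, weights, bias))  # True
```

A fully connected layer over the concatenated points would not pass this check, which is the reason pooling is the natural choice here.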

The input transform and feature transform in the above figure are trainable and adapt to the input. They serve the purpose of aligning the point cloud before classification. For example, the T-Net in the input transform is elaborated as shown below.

To classify individual points, the global feature is combined with the local features generated earlier in the classification network. The combined feature of each point is individually processed into point features, which are then used to classify the semantic class of the point.

SENet

SENet was the winner of the classification task of the ImageNet competition in 2017. It reduced the top-5 error rate to 2.251% from the prior 2.991%. The key contribution of SENet is the introduction of the SE module as follows.

The basic idea is very simple. For a feature tensor with C channels, we want to adaptively adjust the contribution of each channel through training. Since there is no restriction on the order of the data inside a channel, the most natural way to summarize that information is pooling. In the SE module, this is done with global average pooling (squeezing). The “squeezed” data are then used to estimate the contribution of each channel (different levels of “excitation”). The computed weights are then used to scale the values of each channel. The SE module can be applied almost anywhere. For example, it can be combined with the Inception module to form the SE-Inception module, or with the ResNet module to form the SE-ResNet module, as shown below.
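
The squeeze, excitation, and scaling steps can be sketched in a few lines of pure Python. This is only an illustration under assumptions: the two small weight matrices `w1` and `w2`, the ReLU-then-sigmoid excitation, and the toy 2-channel input are my choices, and the function name `se_block` is hypothetical.

```python
import math

def se_block(feature_map, w1, w2):
    """feature_map: list of C channels, each a 2-D list (H x W)."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: small bottleneck layer with ReLU, then a sigmoid layer
    # that produces one weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Scale: multiply every value of a channel by that channel's weight.
    scaled = [[[v * s for v in row] for row in ch]
              for ch, s in zip(feature_map, scales)]
    return scaled, scales

feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0
    [[0.0, 0.0], [0.0, 0.0]],   # channel 1
]
w1 = [[0.5, -0.5]]     # hypothetical weights: C = 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]   # hypothetical weights: 1 hidden unit -> C = 2 channel weights
scaled, scales = se_block(feature_map, w1, w2)
print(scales)  # channel 0 gets a weight above 0.5, channel 1 below 0.5
```

In a real network the two weight matrices are learned together with the rest of the model, so the channel weights depend on the input itself.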

Markov Chain Monte Carlo (MCMC)

Motivation

While Monte Carlo methods have lots of potential applications, we will just name two here as a motivation:

  1. Data generation. The generated data can be used for simulation or for visualization purposes.
  2. Inference. Infer unknown variables from observations.

For the first application, the use of Monte Carlo is rather straightforward. We will assume the model parameters are known, and we can sample the variables sequentially with Gibbs sampling as described below. If all conditional distributions can be sampled directly (more explanation below), we can use regular Monte Carlo without the need for MCMC.

For the second application, there is an additional layer of complication as we typically would like to sample from the posterior distribution for efficiency. But often only the likelihood distributions are specified. In that case, we will need to sample from a proposal distribution and MCMC is needed.

Basic Idea of Monte Carlo Methods

Ultimately, almost all Monte Carlo Methods are doing nothing but estimating the expectation of some function f(x) with X \sim p(x) by

E_{X\sim p(x)}[f(X)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i),

where the approximation is justified by the law of large numbers for sufficiently large N.

For example, in the coin-flipping examples in Q1 of HW5, we can estimate the ultimate probability of head by conducting the Monte Carlo experiment and counting the number of heads. More precisely, denote by i_H(\cdot) the indicator function, where i_H(Y) is 1 when Y is head and 0 otherwise. Then

Pr(Y=H)=E_{Y \sim p(y)}[i_H(Y)] \approx \frac{1}{N} \sum_{i=1}^N i_H(y_i).

The key question here is how we collect the samples of Y, y_1,\cdots,y_N. In this coin-flipping setup, the Monte Carlo simulation is trivial. We can conduct the experiment exactly as in a real setup. That is, first draw a coin from the jar; this corresponds to sampling a binary random variable X with the specified probability of getting coin A or B. Then, based on the outcome of X (i.e., which coin was actually picked), we draw another binary random variable Y for each coin flip according to the probability of head of the chosen coin.
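
As a rough sketch of this experiment: the homework numbers are not given here, so the probabilities below (each coin picked with probability 0.5, coin A landing head with probability 0.9, coin B with probability 0.1) are assumed values, and the function names are hypothetical.

```python
import random

def flip_once(rng, p_pick_a=0.5, p_head_a=0.9, p_head_b=0.1):
    """Draw X (which coin) first, then Y (head or tail) given X."""
    coin_is_a = rng.random() < p_pick_a
    p_head = p_head_a if coin_is_a else p_head_b
    return rng.random() < p_head  # True means head

def estimate_p_head(n, seed=0):
    """Monte Carlo estimate of Pr(Y = H): the average of the indicator i_H."""
    rng = random.Random(seed)
    return sum(flip_once(rng) for _ in range(n)) / n

estimate = estimate_p_head(20_000)
print(estimate)  # should be close to 0.5 * 0.9 + 0.5 * 0.1 = 0.5
```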

In the coin-flipping problem, since we are just drawing binary random variables, we can sample the distribution directly without any issue. But for many continuous distributions, except for a few special cases, sampling them directly is not possible and more involved sampling techniques are needed, as described in the following.

This sequential sampling conditioned on previously specified random variables is generally known as Gibbs sampling. Strictly speaking, it is a form of MCMC, as we will show later on. But I think it is more appropriate NOT to consider this simple case as MCMC because there is no convergence issue here, which, as you will see, is generally not the case for MCMC.

Basic Sampling Methods

The key question left is how we can sample data from a distribution. For some standard distributions like the Gaussian distribution, we can draw samples directly with some well-developed packages. However, there is no direct way of drawing samples for many regular distributions, not to mention those that do not even have a standard form. There are ways to sample arbitrary distributions; we will mention a couple of simple ones here.

Rejection Sampling

Probably the simplest sampling method is rejection sampling. Consider drawing a sample X \sim p(x). Select any “sampleable” distribution q(x) such that c q(x) > p(x), \forall x, for some constant c. Now, instead of drawing samples from p(x), we draw from q(x). However, in order for the samples to appear to have the distribution p(x), we keep only a fraction \frac{p(x')}{c q(x')} of the draws whenever x' is drawn and discard the rest. To do this, we simply sample a Z from [0,1] uniformly after each x is sampled, and keep the current x only if Z<\frac{p(x)}{c q(x)}.

Note that in Q4 of HW5, we are essentially doing some form of rejection sampling. Draw the samples from the prior and the likelihood distributions, and then only accept samples that fit the observations. Rejection sampling is simple but it can be quite inefficient. For example, if the current estimates of variables used in sampling are far from the actual values, the sampled outcome most likely will not match well with the observations and we have to reject lots of samples. That’s why other sampling techniques are needed.
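
Here is a small sketch of rejection sampling on a toy problem of my own choosing (not the homework setup): the target p(x)=6x(1-x) on [0,1], a uniform proposal q(x)=1, and c=1.6 so that c q(x) > p(x) everywhere. On average a fraction 1/c of the draws is kept, which already hints at the inefficiency when c has to be large.

```python
import random

def target_pdf(x):
    """Beta(2, 2) density on [0, 1]; its maximum is 1.5 at x = 0.5."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(n, c=1.6, seed=0):
    rng = random.Random(seed)
    samples, proposed = [], 0
    while len(samples) < n:
        x = rng.random()   # draw from the proposal q: uniform on [0, 1], q(x) = 1
        z = rng.random()   # uniform Z on [0, 1]
        proposed += 1
        if z < target_pdf(x) / (c * 1.0):   # keep x only if Z < p(x) / (c q(x))
            samples.append(x)
    return samples, n / proposed

samples, acceptance_rate = rejection_sample(20_000)
mean = sum(samples) / len(samples)
print(mean, acceptance_rate)  # mean near 0.5, acceptance rate near 1/c = 0.625
```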

Importance Sampling

Rejection sampling is generally very inefficient, as a large number of samples can be discarded. If we only care about computing the expectation of some function, then we may not need to discard any sample but can adjust the weight of each sample instead. Note that

E_{X\sim p(x)}[f(X)]=\int f(x) p(x) dx = \int \left[f(x)\frac{p(x)}{q(x)}\right] q(x) dx= E_{X\sim q(x)}\left[f(X)\frac{p(X)}{q(X)}\right].

Thus, we can estimate the expectation by drawing samples from q(x) and compute weighted average of f(x) instead. And the corresponding weights are \frac{p(x)}{q(x)}.

Even though we can now utilize all samples, importance sampling has two concerns.

  1. Unlike rejection sampling, we do not directly draw samples with the desired distribution. So we cannot use those samples for anything other than computing expectations.
  2. The variance of the estimate can be highly sensitive to the choice of q(\cdot). In particular, since the variance of the estimate is \frac{1}{N}\mathrm{Var}_{X\sim q(x)}\left[ f(X)\frac{p(X)}{q(X)}\right], the variance will be especially large if a highly probable region of p(x) is not covered well by q(x), i.e., we have an extensive region where p(x) is large but q(x) \approx 0.
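
A minimal sketch of importance sampling, with everything in it chosen for illustration: the target p is a standard Gaussian, the proposal q is a wider Gaussian with standard deviation 2 (so q covers p well), and we estimate E[X^2], which is 1 under p.

```python
import math
import random

def normal_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n, proposal_sigma=2.0, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, proposal_sigma)                            # draw from q
        weight = normal_pdf(x, 1.0) / normal_pdf(x, proposal_sigma)   # p(x) / q(x)
        total += f(x) * weight                                        # no sample is discarded
    return total / n

estimate = importance_estimate(lambda x: x * x, 50_000)
print(estimate)  # E[X^2] under the standard Gaussian is 1
```

If the proposal were much narrower than the target instead, a few samples in the tails would carry huge weights and the estimate would become very noisy, which is the second concern above.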

Markov Chain Monte Carlo

Instead of sampling from p(x) directly, Markov chain Monte Carlo (MCMC) methods sample from a Markov chain (essentially a probabilistic state model). Probably the two most important properties of a Markov chain are irreducibility and aperiodicity. We say a Markov chain is irreducible if any state can be reached from any other state. And we say a Markov chain is periodic if there exist two states such that one state can reach the other only in a multiple of n steps with n>1. If a Markov chain is not periodic, we say it is aperiodic. Under the above two conditions (aperiodic and irreducible), regardless of the initial state, the distribution of states will converge to the steady-state probability distribution asymptotically as time goes to infinity.

Consider any two connected states with transition probabilities T(x|x') (from x' to x) and T(x'|x) (from x to x'). We say the detailed balance equations are satisfied if p(x)T(x'|x) = p(x')T(x|x'). Note that if detailed balance is satisfied among all states, p(x) has to be the steady-state probability of state x. This is because the probability “flux” going out from x to x' is exactly canceled by the “flux” coming in from x' to x when the chain reaches this equilibrium.
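
A tiny numerical check of this claim, using a made-up 3-state chain: the target probabilities `pi` and the way the transition matrix `T` is built (propose one of the other two states uniformly, then accept with probability min(1, pi[j]/pi[i])) are assumptions for illustration only. Here `T[i][j]` is the probability of moving from state i to state j.

```python
pi = [0.2, 0.3, 0.5]   # hypothetical target steady-state probabilities
n = len(pi)

# Build a transition matrix that satisfies detailed balance with respect to pi.
T = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            T[i][j] = 0.5 * min(1.0, pi[j] / pi[i])
    T[i][i] = 1.0 - sum(T[i])   # stay put with the remaining probability

def step(dist):
    """Advance the distribution over states by one step of the chain."""
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

# Detailed balance: the flux from i to j equals the flux from j to i.
balanced = all(abs(pi[i] * T[i][j] - pi[j] * T[j][i]) < 1e-12
               for i in range(n) for j in range(n))

# Start from state 0 with certainty and let the chain run.
dist = [1.0, 0.0, 0.0]
for _ in range(200):
    dist = step(dist)

print(balanced, [round(d, 6) for d in dist])  # True [0.2, 0.3, 0.5]
```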

With the discussion above, we see that one can sample X \sim p(x) if we can create a Markov chain with p(x) designed to be the steady-state probability of the chain (by making sure that detailed balance is satisfied). Note, however, that we have to let the chain run for a while before the probability distribution converges to the steady-state probability. Thus, samples drawn before the chain reaches equilibrium are discarded. This initial step is known as the “burn-in”. Moreover, since adjacent samples drawn from the chain (even after burn-in) are highly correlated, we may also keep only every several samples to reduce the correlation.

Metropolis-Hastings

The most well-known MCMC method is the Metropolis-Hastings algorithm. As in the discussion above, given the distribution p(x), we want to design a Markov chain such that p(x) satisfies detailed balance for all states x. Given the current state x, the immediate question is which state x' we should transit to. Assuming that x\in \mathbb{R}^N, a simple possibility is a random walk: perturb x with zero-mean Gaussian random noise. The problem is that the transition probability q(x'|x) and the “reverse” transition probability q(x|x') most likely will not satisfy the detailed balance equation p(x)q(x'|x)=p(x')q(x|x'). Without loss of generality, suppose p(x) q(x'|x) > p(x')q(x|x'). To “balance” the equation, we may reject some of the transitions to reduce the probability “flow” on the left-hand side. More precisely, let’s denote

A(x\rightarrow x')=\min\left(1,\frac{p(x')q(x|x')}{p(x)q(x'|x)}\right).

Now, we accept only a fraction A(x\rightarrow x') of the proposed transitions by drawing a uniform Z\in [0,1] and allowing the transition only if Z<A(x\rightarrow x'). A similar adjustment is applied to all transitions (including x' to x, for which A(x'\rightarrow x)=1 in the case considered here). This way, the probability flow from x to x' becomes

p(x) \underset{T(x'|x)}{\underbrace{q(x'|x) A(x\rightarrow x')}}= p(x) q(x'|x)\frac{p(x')q(x|x')}{p(x)q(x'|x)}=p(x')q(x|x')=p(x') \underset{T(x|x')}{\underbrace{q(x|x') A(x'\rightarrow x)}}

and so the detailed balance equation is satisfied.

When the transition probabilities from x to x' and from x' to x are equal (i.e., q(x|x')=q(x'|x)), note that A(x\rightarrow x') just simplifies to \min\left(1,\frac{p(x')}{p(x)}\right).
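
Below is a minimal random-walk Metropolis sketch for this symmetric case. The target (a Gaussian with mean 3 and unit variance, known only up to a constant), the step size, and the burn-in length are all assumed values picked for illustration.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of N(3, 1); the normalizing constant cancels in the ratio."""
    return -0.5 * (x - 3.0) ** 2

def metropolis(n_samples, step=1.0, burn_in=1_000, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for i in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step)   # symmetric random-walk proposal
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, p(x') / p(x)); otherwise stay at x.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        if i >= burn_in:                      # discard the burn-in samples
            samples.append(x)
    return samples

samples = metropolis(50_000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)  # should be near 3 and 1
```

Note that a rejected proposal still contributes a sample: the chain stays at the current state and that state is recorded again.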

Gibbs Sampling

When the state x can be split into multiple components, say p(x)=p(x_1,x_2,\cdots,x_n), we can transit one component at a time while keeping the rest fixed. For example, we can transit from x_1 to x'_1 with probability p(x'_1|x_2,\cdots,x_n). After updating one component, we can update another component in a similar manner until all components are updated. Then, the same procedure can be repeated starting with x_1. As we continue to update the state, the samples drawn will again converge to p(x) as the Markov chain reaches equilibrium. This sampling method is known as Gibbs sampling and can be shown to be a special case of Metropolis-Hastings sampling as follows.

Consider the step of transiting from x_1 to x'_1 while keeping the rest of the components fixed. Here the proposal is $q(x'|x)=p(x'_1|x_2,\cdots,x_n)$, so the ratio in $A(x_1 \rightarrow x'_1)$ as defined earlier is $\frac{p(x'_1,x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)}{p(x_1,x_2,\cdots,x_n)p(x'_1|x_2,\cdots,x_n)}=\frac{p(x_2,\cdots,x_n)p(x'_1|x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)}{p(x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)p(x'_1|x_2,\cdots,x_n)}=1$, and hence $A(x_1 \rightarrow x'_1)=\min(1,1)=1$. Thus, Gibbs sampling is really Metropolis-Hastings sampling with all transitions always accepted.
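
As a small illustration, here is a Gibbs sampler for a toy target of my own choosing: a zero-mean bivariate Gaussian with unit variances and correlation rho, for which each conditional is the Gaussian N(rho times the other component, 1 - rho^2). The value rho=0.8 and the burn-in length are assumptions.

```python
import math
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, burn_in=500, seed=0):
    rng = random.Random(seed)
    cond_std = math.sqrt(1.0 - rho * rho)
    x1, x2, samples = 0.0, 0.0, []
    for i in range(n_samples + burn_in):
        x1 = rng.gauss(rho * x2, cond_std)   # draw x1 from p(x1 | x2)
        x2 = rng.gauss(rho * x1, cond_std)   # draw x2 from p(x2 | x1), using the new x1
        if i >= burn_in:
            samples.append((x1, x2))         # every update is accepted
    return samples

samples = gibbs_bivariate_normal(20_000)
n = len(samples)
m1 = sum(a for a, _ in samples) / n
m2 = sum(b for _, b in samples) / n
cov = sum((a - m1) * (b - m2) for a, b in samples) / n
v1 = sum((a - m1) ** 2 for a, _ in samples) / n
v2 = sum((b - m2) ** 2 for _, b in samples) / n
correlation = cov / math.sqrt(v1 * v2)
print(correlation)  # should be close to rho = 0.8
```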

Hamiltonian Monte Carlo (HMC)

The trajectory sampled by the Metropolis-Hastings method is essentially a random walk, like that of a drunk man. But we have complete information about p(x) and we know what it looks like. Why don’t we leverage the “geometrical” information of p(x)?

Power of Physics

We definitely can. Recall the Boltzmann distribution p(x)=\exp(-E(x)) (ignoring temperature here) from statistical physics. We can model any distribution as a Boltzmann distribution with an appropriate energy function E(x)=-\log(p(x)). As expected, lower energy states are more likely to occur than higher energy states.

If we think of E(x) as the potential energy, a “particle” naturally moves towards lower energy states and the excess energy is converted to kinetic energy, as we learn in Newtonian mechanics. Let’s write the total energy as H(x,p)=PE(x)+KE(p), where the potential energy PE(x) is just E(x) here. (Sorry for the overloading of the symbol p here. p is commonly used to represent momentum in physics, so we stick to that convention. Please don’t confuse it with the p in p(x).) The kinetic energy is KE(p)=\frac{p^2}{2 m}, with p and m being the momentum and the mass, respectively. The total energy H(x,p), also known as the Hamiltonian, is conserved as x and p vary naturally in the phase space (x,p). Therefore, \frac{d H(x,p)}{d t}=\frac{\partial H(x,p)}{\partial x}\frac{d x}{dt}+\frac{\partial H(x,p)}{\partial p}\frac{d p}{d t}=0.

As we know from classical mechanics, \frac{\partial H}{\partial x}=-\frac{d p}{dt} and \frac{\partial H}{\partial p}=\frac{d x}{dt}. For example, let’s consider an object moving only vertically, with x being the height of the object. Then H(x,p)=mgx+\frac{p^2}{2 m}, where g is the gravitational acceleration. Then \frac{\partial H}{\partial x}=mg=-F=-\frac{d p}{dt} with F being the gravitational force, and \frac{\partial H}{\partial p}=\frac{p}{m}=\frac{dx}{dt} as desired. Moreover, the total energy H(x,p) is indeed conserved as x and p change, since \frac{d H(x,p)}{d t}=\frac{\partial H(x,p)}{\partial x}\frac{d x}{dt}+\frac{\partial H(x,p)}{\partial p}\frac{d p}{d t}=-\frac{dp}{dt}\frac{d x}{dt}+\frac{dx}{dt}\frac{d p}{d t}=0.

Back to Monte Carlo

Why are we talking about all this? Because we can draw samples following natural physical trajectories more efficiently than with random walks as in Metropolis-Hastings. Given the distribution of interest p(x), we again define the Hamiltonian H(x,p)=E(x)+KE(p). Now, instead of trying to draw samples of x, we draw samples of (x,p). And p(x,p) \propto \exp(-H(x,p))=\exp(-E(x))\exp(-KE(p))=p(x)\exp(-KE(p)). So if we marginalize out the momentum p, we get back p(x) as desired.

As we sample from the phase space (x,p), rather than random walking in the phase space, we can just follow the flow according to Hamiltonian mechanics as described earlier. For example, starting from (x,p), we may simulate a short time instance \Delta t to reach (x+\Delta x,p+\Delta p) with \Delta x = \frac{dx}{dt}\Delta t = \frac{\partial H}{\partial p}\Delta t=\frac{p}{m}\Delta t and \Delta p = \frac{dp}{dt}\Delta t= -\frac{\partial H}{\partial x}\Delta t=-\frac{\partial E}{\partial x}\Delta t. One can also update x first and then update p using the new x. Such staggered updates come in several forms and are commonly known as leapfrog integration. After applying multiple (L) leapfrog steps and reaching (x',p'), we decide whether to accept the sample as in Metropolis-Hastings by evaluating A((x,p)\rightarrow (x',p'))=\min\left(1,\frac{\exp(-H(x',p'))}{\exp(-H(x,p))}\right), as the transition probabilities are assumed to be the same for both forward and reverse directions. Moreover, if the leapfrog integration is perfect, H(x,p) is supposed to be the same as H(x',p') because of conservation of energy. So we have a high chance of accepting a sample.
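
A compact sketch of the whole HMC loop for a one-dimensional toy target. The assumptions here are mine: the target is a standard Gaussian (so E(x)=x^2/2), the mass is m=1, a fresh Gaussian momentum is drawn at the start of every iteration (the usual practice, although not spelled out above), and the step size and number of leapfrog steps are arbitrary.

```python
import math
import random

def energy(x):
    """E(x) = -log p(x) up to a constant, for a standard Gaussian target."""
    return 0.5 * x * x

def grad_energy(x):
    return x

def hmc(n_samples, eps=0.2, n_leapfrog=10, seed=0):
    rng = random.Random(seed)
    x, samples, accepted = 0.0, [], 0
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                      # fresh momentum, mass m = 1
        x_new, p_new = x, p
        p_new -= 0.5 * eps * grad_energy(x_new)      # half step in momentum
        for step in range(n_leapfrog):
            x_new += eps * p_new                     # full step in position
            if step < n_leapfrog - 1:
                p_new -= eps * grad_energy(x_new)    # full step in momentum
        p_new -= 0.5 * eps * grad_energy(x_new)      # final half step in momentum
        h_old = energy(x) + 0.5 * p * p
        h_new = energy(x_new) + 0.5 * p_new * p_new
        # Accept with probability min(1, exp(-H(x',p')) / exp(-H(x,p))).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x, accepted = x_new, accepted + 1
        samples.append(x)
    return samples, accepted / n_samples

samples, acceptance_rate = hmc(20_000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance, acceptance_rate)  # near 0 and 1, with a high acceptance rate
```

Because the leapfrog steps nearly conserve H, almost every proposal is accepted even though each one moves far from the current state.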

No-U-Turn Sampling (NUTS)

The main challenge of applying HMC is that the algorithm is quite sensitive to the number of leapfrog steps L and the step size \Delta t (more commonly known as \epsilon). The goal of NUTS is to select these two parameters automatically. The main idea is to simulate both forward and backward in time until the algorithm detects that it has reached a “boundary” and needs to make a U-turn. By detecting U-turns, the algorithm can adjust the parameters accordingly. The criterion is simply to check whether the momentum starts to turn around, i.e., (x^+-x^-)\cdot p^+ <0 or (x^+ -x^-)\cdot p^- <0, where (x^+,p^+) and (x^-,p^-) are the two extremes as we go forward and backward in time.
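
The stopping criterion itself is a one-liner. The sketch below only implements this check (not the full NUTS algorithm); the function name `is_u_turn` and the plain-list vectors are my own choices.

```python
def is_u_turn(x_plus, p_plus, x_minus, p_minus):
    """True if either end of the trajectory has started to move back towards the other end."""
    span = [a - b for a, b in zip(x_plus, x_minus)]      # x^+ - x^-

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return dot(span, p_plus) < 0 or dot(span, p_minus) < 0

# Both ends still moving along the span: keep simulating.
print(is_u_turn([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]))   # False
# The forward end now points back towards the backward end: stop.
print(is_u_turn([1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]))  # True
```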

Reference:

  1. https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo
  2. Mackay, Information Theory, Inference, and Learning Algorithms
  3. Pattern Recognition and Machine Learning by Bishop
  4. A Conceptual Introduction to Hamiltonian Monte Carlo by Michael Betancourt
  5. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo by Matthew D. Hoffman and Andrew Gelman

Bayesian networks, Markov random fields, and factor graphs

The three most common graphical models are Bayesian networks (aka directed acyclic graphs), Markov random fields (aka undirected graphs), and factor graphs.

Bayesian networks

A Bayesian network describes the dependency of variables, directly represented by the expansion of the joint distribution into conditional probabilities. Take a very simple example with three variables X, Y, and Z. If the joint distribution can be represented by p(x) p(y|x) p(z|y), then the corresponding Bayesian network is a three-node graph with X, Y, and Z as vertices, and two directed edges: one from X to Y and another one from Y to Z.

Note that by Bayes rule, the joint probability can be rewritten as p(z) p(y|z) p(x|y). So the following Bayesian network with directed edges one from Z to Y and another one from Y to X is identical to the earlier one.

Given X, Y, and Z, we may also have an empty graph with no edge at all. This corresponds to the case when p(x,y,z)=p(x)p(y)p(z) that all three variables are independent as shown below.

We may also have just one edge from X to Y. This corresponds to the case when p(x,y,z)=p(x)p(y|x) p(z), which describes the situation when Z is independent of X and Y as shown below.

Now, back to the cases with two edges, the ones with two directed edges flowing the same direction will just have the same structure as the example we have earlier. So let’s look at other interesting setups. First, let’s consider the case that both directed edges are coming out of the same variable. For example, we have two directed edges from Y to X and from Y to Z as shown below.

This corresponds to the joint distribution p(x,y,z)=p(y)p(x|y)p(z|y)=p(x)p(y|x)p(z|y). So this case is actually the same as the case we had earlier after all. Note, however, that this equivalence does not generalize when we have more variables. For example, say we also have another variable W and an edge from W to Y.

And if we flip the direction of the edge Y to X as below, W and X become independent, but the earlier representation does not imply that.

Now, let’s remove W and assume we only have three variables and two edges as described earlier. For all cases we have above (say edges from X to Y and Y to Z), the variables at the two ends X and Z are not independent. But they are independent given the variable Y in the middle. Be very careful since the situation will be totally flipped as we change the direction of the edge from Y to Z in the following.

The corresponding joint probability for the above graph is p(x,y,z)=p(x) p(y) p(z|x,y). X and Y are independent, as p(x,y)=\sum_z p(x,y,z)=\sum_z p(x)p(y)p(z|x,y)=p(x)p(y). On the other hand, we don’t have p(x,y|z)=p(x|z)p(y|z). And so X and Y are not conditionally independent when Z is observed. The classic example is when X and Y are two independent binary outcomes taking values in \{0,1\} and Z = X \oplus Y, where \oplus is the xor operation such that 1\oplus 1=0\oplus 0=0 and 0\oplus 1=1 \oplus 0=1. As the xor equation also implies X =Z \oplus Y, X is completely determined by Y when Z is given. Therefore, X can in no way be independent of Y given Z. Quite the opposite, X and Y are maximally correlated given Z.
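
The xor example is small enough to enumerate. The sketch below assumes X and Y are fair coins (probability 0.5 each, which the text does not state) and uses a hypothetical helper `prob` to sum the joint table.

```python
from itertools import product

# Joint table over (x, y, z) with X, Y fair and independent, and Z = X xor Y.
joint = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

def prob(**fixed):
    """Sum the joint over all outcomes consistent with the fixed values, e.g. prob(x=1, z=0)."""
    names = ("x", "y", "z")
    return sum(p for outcome, p in joint.items()
               if all(outcome[names.index(k)] == v for k, v in fixed.items()))

# Marginally, X and Y are independent: p(x,y) = p(x)p(y).
print(prob(x=1, y=1), prob(x=1) * prob(y=1))     # 0.25 0.25

# Given Z = 0, they are not: p(x,y|z) differs from p(x|z)p(y|z).
p_xy_given_z = prob(x=1, y=1, z=0) / prob(z=0)
p_x_given_z = prob(x=1, z=0) / prob(z=0)
p_y_given_z = prob(y=1, z=0) / prob(z=0)
print(p_xy_given_z, p_x_given_z * p_y_given_z)   # 0.5 0.25
```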

Let’s consider the case with three edges. Note that we can’t have edges forming a loop since Bayesian networks have to be an acyclic graph. A possibility can be having three directed edges one from X to Y, one from Y to Z, and the last one from X to Z as shown below.

In this case, the joint probability is p(x,y,z)=p(x)p(y|x)p(z|x,y). Note that this factorization follows directly from the chain rule of probability and holds for any joint distribution, so the graph itself does not assume any dependence/independence structure. The other Bayesian networks with three edges are all equivalent and do not imply any independence among the variables either.

As we see above, the dependency of variables can be derived from a Bayesian network, as the joint probability of the variables can be written down directly. Actually, an important function of all graphical models is to represent such dependency among variables. In terms of Bayesian networks, there is a fancy term known as d-separation, which basically refers to the fact that some variables are conditionally independent of others given some other variables, as can be observed from the Bayesian network. Rather than trying to memorize the rules of d-separation, I think it is much easier to understand the simple three-variable examples and realize under what situations variables become conditionally independent and under what situations they become conditionally dependent. And for two variables to be conditionally independent of one another, all potential paths of dependency have to be broken. For example, for the three-variable, three-edge example above (with edges X\rightarrow Y, Y\rightarrow Z, and X\rightarrow Z), observing Y breaks the dependency path X\rightarrow Y\rightarrow Z for X and Z. But they are not yet conditionally independent because of the other path X\rightarrow Z.

Let’s consider one last example with four variables X, Y, Z, and W, and three edges (X\rightarrow Y, Z\rightarrow Y, and Z\rightarrow W) below.

Note that as root nodes, X and Z are independent. And thus X and W are independent as well since W only depends on Z. On the other hand, if Y is observed, X and Z are no longer conditionally independent given Y. And thus X and W are not conditionally independent given Y.

Undirected graphs and factor graphs

Compared with Bayesian networks, dependency between variables in undirected graphs is much easier to understand. Two variables are independent if they are not connected, and two variables are conditionally independent given some other variables if they become disconnected once the vertices of the given variables are removed from the graph.

For example, consider a simple undirected graph above with three variables X, Y, and Z with two edges X - Y and Y-Z. Variables X and Z are then independent given Y.

By the Hammersley-Clifford theorem, the joint probability of any undirected graphical model can be represented as a product of factor functions of the form \prod_{i} \phi_i ({\bf x}_i). For example, the joint probability of the above graph can be rewritten as f_1(x,y)f_2(y,z) with f_1(x,y)=p(x|y) and f_2(y,z)=p(y)p(z|y). We can use a bipartite graph to represent the factor product above as follows.

  1. Represent each factor in the product with a vertex (typically known as factor node and display with a square by convention)
  2. Represent each random variable with a vertex (typically known as a variable node)
  3. Connect each factor node to all of its argument with an undirected edge

The graph constructed above is known as a factor graph. With the example described above, we have a factor graph as shown below.

The moralization of Bayesian networks

Consider again the simple Bayesian network with variables X, Y, and Z, and edges X\rightarrow Y and Z\rightarrow Y below.

What should be the corresponding undirected graph to represent this structure?

It is tempting to have the simple undirected graph before with edges X - Y and Y-Z below.

However, it does not correctly capture the dependency between X and Z. Recall that when Y is observed or given, X and Z are no longer independent. But the undirected graph above captures exactly the opposite, namely that X and Z are conditionally independent given Y. So, to ensure that this conditional independence is not artificially introduced into the model, what we need to do is add an additional edge X - Z as shown below.

This procedure is sometimes known as moralization. The name comes from the analogy that when a child is born out of wedlock, the parents should get married and be “moralized”.
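
The procedure is mechanical enough to write down in a few lines. In this sketch the graph representation (a dictionary mapping each node to its list of parents) and the function name `moralize` are my own choices.

```python
from itertools import combinations

def moralize(parents):
    """parents maps each node to the list of its parents in the Bayesian network."""
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            edges.add(frozenset((p, child)))   # keep every edge, dropping its direction
        for a, b in combinations(pars, 2):
            edges.add(frozenset((a, b)))       # "marry" every pair of parents
    return edges

# The network with edges X -> Y and Z -> Y.
collider = {"X": [], "Z": [], "Y": ["X", "Z"]}
print(sorted(tuple(sorted(e)) for e in moralize(collider)))
# [('X', 'Y'), ('X', 'Z'), ('Y', 'Z')]
```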

Undirected graphs cannot represent all Bayesian networks

Despite the additional edge X-Z, the resulting graph, a fully connected graph with three vertices X, Y, and Z still does not accurately describe the original Bayesian network. Namely, in the original network, X and Z are independent but the undirected graph does not capture that.

So if the model is not a perfect representation either way, why do we bother to moralize the parents anyway? Note that each edge we add reduces the (independence) assumptions behind the model. Solving a relaxed problem with fewer assumptions will not give us the wrong solution; it may just make the problem more difficult to solve. But adding an incorrect assumption that does not exist in the original problem can definitely lead to a very wrong result. So the moralization step is essential when we convert directed graphs to undirected graphs. Make sure you keep all parents in wedlock.

Bayesian networks cannot represent all undirected graphs

Then, does that mean that Bayesian networks are more powerful or have a higher representation power than undirected graphs? Not really, as there are also undirected graphs that cannot be captured by Bayesian networks.

Consider undirected graph with four variables X, Y, Z, and W and four edges X-Y, Y-W, Z-W, and Z-X as shown below.

Note that we have conditional independence of X and W given Y and Z. This conditional independence is captured only by one of the following three Bayesian networks.

       

Because of the moralization we discussed earlier, when we convert the above models into undirected graphs, they all require us to add an additional edge between parents Y and Z resulting in the undirected graph below.

Note that this undirected graph is definitely not the same as the earlier one since it cannot capture the conditional independence between Y and Z given W and X. In conclusion, no Bayesian network can exactly capture the structure of the square undirected graph above.

Tossing unknown biased coin

Assume that there are two biased coins. Coin A is heavily biased towards head, with the probability of head equal to 0.9. Coin B is heavily biased towards tail, with the probability of tail equal to 0.9. Now, we select one of the coins at random, each equally likely, and toss it twice. Let’s call the outcomes Y_1 and Y_2. Now, the question is, what is the mutual information between Y_1 and Y_2?

I put a question similar to the above in a midterm, and I didn’t expect it to stump the entire class.

All students thought that the mutual information I(Y_1;Y_2)=0 because the two outcomes are independent. When we toss a coin sequentially, the outcomes are supposed to be independent, right? Yes, but that is only when we know what coin we are tossing.

Think intuitively about the above example. Suppose we toss the coin not twice but ten times and get ten heads. What do we expect the outcome to be if we toss it one more time?

I think an intelligent guess would be another head. Given the ten heads we got earlier, there is a very high chance that the picked coin is Coin A. And so the next toss is very likely to be a head as well.

Now, the same argument holds when we go back to the original setup. When the first toss is head, the second toss is likely to be head as well. So Y_1 and Y_2 can in no way be independent.

So what is I(Y_1; Y_2)?

I(Y_1;Y_2)=H(Y_1)+H(Y_2)-H(Y_1,Y_2)=2H(Y_1)-H(Y_1,Y_2)

Let’s compute H(Y_1) and H(Y_1,Y_2).

Denote X as the coin that we pick.

Pr(Y_1=H)=Pr(X=A)0.9+Pr(X=B)0.1=0.5

So $H(Y_1)=H(0.5)=1$ bit.

Now, for H(Y_1,Y_2), note that

p_{Y_1,Y_2}(H,H)=Pr(X=A)(0.9\cdot 0.9) + Pr(X=B)(0.1\cdot 0.1)=0.41

p_{Y_1,Y_2}(T,T)=Pr(X=A)(0.1\cdot 0.1)+Pr(X=B)(0.9\cdot 0.9)=0.41

p_{Y_1,Y_2}(T,H)=p_{Y_1,Y_2}(H,T)=Pr(X=A)(0.1\cdot 0.9)+Pr(X=B)(0.9\cdot 0.1)=0.09

Therefore, H(Y_1,Y_2)=H([0.41,0.41,0.09,0.09])=1.68 bits.

So I(Y_1;Y_2)=H(Y_1)+H(Y_2)-H(Y_1,Y_2)=0.32 bits.
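
The numbers above are easy to double-check with a few lines of Python. The helper names `entropy` and `p_joint` are mine; the probabilities are the ones from the problem statement.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_coin = {"A": 0.5, "B": 0.5}   # which coin is picked
p_head = {"A": 0.9, "B": 0.1}   # probability of head for each coin

def p_joint(y1, y2):
    """p(y1, y2), obtained by summing over the unknown coin X."""
    total = 0.0
    for coin, pc in p_coin.items():
        p1 = p_head[coin] if y1 == "H" else 1 - p_head[coin]
        p2 = p_head[coin] if y2 == "H" else 1 - p_head[coin]
        total += pc * p1 * p2   # the two tosses are independent given the coin
    return total

joint = [p_joint(a, b) for a in "HT" for b in "HT"]
marginal = [p_joint("H", "H") + p_joint("H", "T"),
            p_joint("T", "H") + p_joint("T", "T")]
mutual_information = 2 * entropy(marginal) - entropy(joint)
print(round(entropy(joint), 2), round(mutual_information, 2))  # 1.68 0.32
```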

Note that this is an example where variables are conditionally independent but not independent. More precisely, we have Y_1 \not\bot Y_2 but Y_1 \bot Y_2|X.

Probability education trap

I wondered a little bit why none of my students could answer the above question. I blame a trap that is embedded in most elementary probability courses. We are always introduced to consecutive coin tossing or dice throwing examples in which each subsequent event is independent of the earlier events. In those examples, we always assume that the probabilities of all outcomes are known, but this assumption is never emphasized. As we see in the above example, even though each subsequent toss or throw is independent given the current coin or die, overall those events are not independent when the statistics behind the coin or die are not known.

Actually, this also makes some “pseudo-science” not really that unscientific after all. For example, we all tend to believe that the gender of a newborn is close to random and hence unpredictable. But what if there is some hidden variable that affects the gender of a newborn, and that factor has a strong bias towards one gender over the other? Then, it is probably not that unlikely for someone to have five consecutive daughters or six consecutive sons. Of course, I also tend to believe that the probability of a newborn being a boy or a girl is very close to one half. A less extreme example may occur at the casino. If we believe that the odds of each slot machine in a casino are not perfectly tuned (before the digital age, that was probably much more likely), then there is a chance that some machine has a higher average reward than another. Then, carefully selecting which machine to play is an essential strategy for a gambler to win and is not really superstition after all. Of course, this is just the multi-armed bandit problem.

Independence and conditional independence

Independence and conditional independence are among the most basic concepts in probability. Two random variables are independent, as the term suggests, if the outcome of one variable does not affect the outcome of the other. Mathematically, two variables X and Y are independent if

p(x,y) = p(x) p(y) for any outcome x of variable X and any outcome y of variable Y. We often indicate this independence by X \bot Y.

Let’s inspect this definition more carefully. Given Y=y, by the definition of conditional probability, the probability of X=x is

$Pr(X=x|Y=y)=\frac{p(x,y)}{p(y)}= \frac{p(x)p(y)}{p(y)}=p(x)$. Indeed the probability does not depend on the outcome of Y and so X and Y are independent.

Similarly, when we say X and Y are conditionally independent given Z, it means that once Z is known, X and Y become independent. Note that X and Y do not need to be independent to start with. Many students have the misconception that independence implies conditional independence, or vice versa. The fact is that neither implies the other. We can easily find examples where variables are independent but not conditionally independent, and vice versa. There are also examples where variables are both independent and conditionally independent, and of course cases where neither independence is satisfied.

Mathematically, we denote the conditional independence by X \bot Y|Z, and we have

p(x,y|z)=p(x|z)p(y|z) for any x, y, and z.

Note that the definition above implies that p(x|y,z)=\frac{p(x,y|z)}{p(y|z)}=p(x|z). Hence, if we are given z, the probability of X=x indeed does not depend on the outcome of Y. So it depicts well the conditional independence that we anticipated.

 

GPT3

OpenAI released its latest language model. I found the training compute comparison fascinating (~1000x BERT-base). The largest model has 175B parameters. And some said it cost $5M to train. While it is definitely impressive, I agree with Yannic that probably no “reasoning” is involved. It is very likely that the model “remembers” the data somehow and “recalls” the best matches from all the training data.

Ref: video, paper

Supervised contrastive learning

The idea of contrastive learning has been around for a while. It was introduced for unsupervised/semi-supervised learning. When we only have unlabelled data, we would like to train a representation that groups similar data together. The way to do that in contrastive learning is to perturb a target sample to generate positive samples. The perturbations can be translation, rotation, etc. We can then treat all other samples as negative samples. The goal is to introduce a contrastive loss that pushes negative samples away from the target sample and pulls the positive samples towards the target sample.

Even for the case of supervised learning, the contrastive learning step can be used as a pre-training step to train the entire network (excluding the last classification layer). After the pretraining, we can train the last layer with labelled data while keeping all the other layers fixed.

The main innovation in this work is that, rather than treating all other samples as negative samples, it treats data with the same label as positive samples as well.
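
To give a feel for what such a loss looks like, here is a rough pure-Python sketch of one common form of a supervised contrastive loss. It is not taken from the paper's code: the temperature value, the L2 normalization, the averaging over positives, and the toy 2-D embeddings are all assumptions for illustration.

```python
import math

def supcon_loss(embeddings, labels, temperature=0.1):
    """Average contrastive loss where samples sharing a label are treated as positives."""
    # L2-normalize each embedding so that dot products are cosine similarities.
    normed = []
    for z in embeddings:
        norm = math.sqrt(sum(v * v for v in z))
        normed.append([v / norm for v in z])
    n = len(normed)
    sim = [[sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)] for i in range(n)]
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue   # an anchor without any positive contributes nothing
        # Log of the sum over all other samples (positives and negatives).
        log_denominator = math.log(sum(math.exp(sim[i][a]) for a in range(n) if a != i))
        total += -sum(sim[i][p] - log_denominator for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supcon_loss(embeddings, [0, 0, 1, 1])     # labels agree with the clusters
loss_mismatched = supcon_loss(embeddings, [0, 1, 0, 1])   # labels cut across the clusters
print(loss_matching < loss_mismatched)  # True
```

The loss is small when samples with the same label sit close together and large when same-label samples are far apart, which is exactly the pull/push behaviour described above.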

Ref: paper, video

Copyright OU-Tulsa Lab of Image and Information Processing 2021