Tossing an unknown biased coin

Assume that there are two biased coins. Coin A is heavily biased towards heads, with the probability of heads equal to 0.9, and Coin B is heavily biased towards tails, with the probability of tails equal to 0.9. We select one of the coins uniformly at random and toss it twice. Let’s call the outcomes Y_1 and Y_2. Now, the question is: what is the mutual information between Y_1 and Y_2?

I put a similar question in a midterm, and I didn’t expect it to stump the entire class.

All students thought that the mutual information I(Y_1;Y_2)=0 because the two outcomes are independent. When we toss a coin repeatedly, the outcomes are supposed to be independent, right? Yes, but only when we know which coin we are tossing.

Think intuitively about the above example. Suppose that instead of tossing the coin twice, we toss it ten times and get ten heads. What do we expect the outcome to be if we toss it one more time?

I think an intelligent guess is another head. Given the ten heads we got earlier, there is a very high chance that the picked coin is Coin A, and so the next toss is very likely to be a head as well.
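To make this concrete, by Bayes’ rule the posterior probability that the picked coin is Coin A after seeing ten heads is \frac{0.5\cdot 0.9^{10}}{0.5\cdot 0.9^{10}+0.5\cdot 0.1^{10}}=\frac{0.9^{10}}{0.9^{10}+0.1^{10}}\approx 1-3\times 10^{-10}, i.e., essentially one.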

Now, the same argument holds when we go back to the original setup. When the first toss is a head, the second toss is likely to be a head as well. So Y_1 and Y_2 are by no means independent.

So what is I(Y_1; Y_2)?

I(Y_1;Y_2)=H(Y_1)+H(Y_2)-H(Y_1,Y_2)=2H(Y_1)-H(Y_1,Y_2)

Let’s compute H(Y_1) and H(Y_1,Y_2).

Denote X as the coin that we pick.

Pr(Y_1=H)=Pr(X=A)0.9+Pr(X=B)0.1=0.5

So H(Y_1)=H(0.5)=1 bit.

Now, for H(Y_1,Y_2), note that

p_{Y_1,Y_2}(H,H)=Pr(X=A)(0.9\cdot 0.9) + Pr(X=B)(0.1\cdot 0.1)=0.41

p_{Y_1,Y_2}(T,T)=Pr(X=A)(0.1\cdot 0.1)+Pr(X=B)(0.9\cdot 0.9)=0.41

p_{Y_1,Y_2}(T,H)=p_{Y_1,Y_2}(H,T)=Pr(X=A)(0.1\cdot 0.9)+Pr(X=B)(0.9\cdot 0.1)=0.09

Therefore, H(Y_1,Y_2)=H([0.41,0.41,0.09,0.09])=1.68 bits.

So I(Y_1;Y_2)=H(Y_1)+H(Y_2)-H(Y_1,Y_2)=0.32 bits.
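As a sanity check, here is a minimal numpy sketch that rebuilds the joint distribution of (Y_1, Y_2) and recomputes these quantities:

```python
# Sanity check of the numbers above: build the joint distribution of (Y_1, Y_2)
# by marginalizing out the coin X, then compute the entropies and I(Y_1;Y_2).
import numpy as np

p_head = {'A': 0.9, 'B': 0.1}              # Pr(head | coin)
p_coin = {'A': 0.5, 'B': 0.5}              # the coin is picked uniformly at random

joint = np.zeros((2, 2))                   # index 0 = head, 1 = tail
for c, pc in p_coin.items():
    p = np.array([p_head[c], 1 - p_head[c]])
    joint += pc * np.outer(p, p)           # the two tosses are i.i.d. given the coin

H = lambda q: -np.sum(q * np.log2(q))      # entropy in bits (no zero entries here)
H_y1 = H(joint.sum(axis=1))                # = 1 bit
H_joint = H(joint.flatten())               # ≈ 1.68 bits
print(H_y1, H_joint, 2 * H_y1 - H_joint)   # I(Y_1;Y_2) ≈ 0.32 bits
```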

Note that this is an example where the variables are conditionally independent but not independent. More precisely, we have Y_1 \not\bot Y_2 but Y_1 \bot Y_2|X.

Probability education trap

I wondered a little why none of my students could answer the above question. I blame a trap that is embedded in most elementary probability courses. We are always introduced to consecutive coin-tossing or dice-throwing examples in which each subsequent event is independent of the earlier ones. In those examples, we always assume that the probabilities of all outcomes are known, but this assumption is never emphasized. As we saw in the above example, even though each toss or throw is independent given the current coin or die, overall those events are not independent when the statistics behind the coin or die are not known.

Actually, this also makes some “pseudo-science” not really that unscientific after all. For example, we tend to believe that the gender of a newborn is close to random and hence unpredictable. But what if there is some hidden variable that affects the gender of a newborn, and that factor has a strong bias towards one gender over the other? Then it is probably not so unlikely for someone to have five consecutive daughters or six consecutive sons. Of course, I also tend to believe that the probability of a newborn being a boy or a girl is very close to one half. A less extreme example may occur at the casino. If we believe that the odds of each slot machine in a casino are not perfectly tuned (before the digital age, that was probably much more likely), then there is a chance that some machine has a higher average reward than another. Then, for a gambler, trying to select which machine to play is an essential strategy to win and not really superstition after all. Of course, this is just the multi-armed bandit problem.

Independence and conditional independence

Independence and conditional independence are among the most basic concepts in probability. Two random variables are independent, as the term suggests, if the outcome of one variable does not affect the outcome of the other. Mathematically, two variables X and Y are independent if

p(x,y) = p(x) p(y) for any outcome x of variable X and outcome y of variable Y. We often indicate this independence by X \bot Y.

Let’s inspect this definition more carefully. Given Y=y, by the definition of conditional probability, the probability of X=x is

Pr(X=x|Y=y)=\frac{p(x,y)}{p(y)}= \frac{p(x)p(y)}{p(y)}=p(x). Indeed the probability does not depend on the outcome of Y and so X and Y are independent.

Similarly, when we say X and Y are conditionally independent given Z, it means that knowing Z, X and Y become independent. Note that X and Y do not need to be independent to start with. Many students have the misconception that independence implies conditional independence, or vice versa. In fact, they are unrelated concepts: we can easily find examples where variables are independent but not conditionally independent and vice versa, examples where variables are both independent and conditionally independent, and of course cases where neither holds.

Mathematically, we denote this conditional independence by X \bot Y|Z, and we have

p(x,y|z)=p(x|z)p(y|z) for any x, y, and z.

Note that the definition above implies that p(x|y,z)=\frac{p(x,y|z)}{p(y|z)}=p(x|z). Hence, if we are given z, the probability of X=x does not depend on the outcome of Y. So it captures well the conditional independence that we anticipated.
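As a concrete example of the first case (independent but not conditionally independent), which goes in the opposite direction from the coin example above, take X and Y to be independent fair bits and Z = X XOR Y; given Z, knowing X determines Y exactly. A quick numerical check of both claims:

```python
# A standard example of variables that are independent but NOT conditionally
# independent: X and Y are independent fair bits and Z = X XOR Y.
import itertools
import numpy as np

# Enumerate the joint distribution p(x, y, z); only z = x ^ y has mass.
p = np.zeros((2, 2, 2))
for x, y in itertools.product([0, 1], repeat=2):
    p[x, y, x ^ y] = 0.25

# X and Y are independent: p(x, y) = p(x) p(y)
p_xy = p.sum(axis=2)
print(np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))))  # True

# But conditioned on Z = 0, X determines Y, so X and Y are not
# conditionally independent given Z.
p_xy_z0 = p[:, :, 0] / p[:, :, 0].sum()
px, py = p_xy_z0.sum(axis=1), p_xy_z0.sum(axis=0)
print(np.allclose(p_xy_z0, np.outer(px, py)))                           # False
```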

 

GPT-3

OpenAI released its latest language model, GPT-3. I found the training compute comparison fascinating (~1000x BERT-base). The largest model has 175B parameters, and some said it cost $5M to train. While it definitely is impressive, I agree with Yannic that probably no “reasoning” is involved. It is very likely that the model “remembers” the data somehow and “recalls” the best matches from the training data.

[Figure: GPT-3 training compute comparison]

 

Ref: video, paper

Supervised contrastive learning

The idea of contrastive learning has been around for a while; it was introduced for unsupervised/semi-supervised learning. When we only have unlabelled data, we would like to learn a representation that groups similar data together. The way to do that in contrastive learning is to apply perturbations to a target sample to generate positive samples. The perturbations can be translation, rotation, etc. All other samples are then treated as negative samples. The goal is to introduce a contrastive loss that pushes negative samples away from the target sample and pulls the positive samples towards it.

Even in the case of supervised learning, the contrastive learning step can be used as a pre-training step to train the entire network (excluding the last classification layer). After the pre-training, we can train the last layer with labelled data while keeping all the other layers fixed.

The main innovation in this work is that, rather than treating all other samples as negatives, it also treats samples with the same label as positives. A minimal sketch of such a loss is given below.
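Here is a minimal PyTorch sketch of a supervised contrastive loss in this spirit; the projection head, augmentation pairing, and other details of the paper are deliberately simplified.

```python
# A simplified supervised contrastive loss: positives are all other samples in
# the batch with the same label; everything else acts as a negative.
import torch
import torch.nn.functional as F

def sup_con_loss(features, labels, temperature=0.1):
    """features: [N, D] embeddings; labels: [N] integer class labels."""
    z = F.normalize(features, dim=1)                   # unit-norm embeddings
    sim = z @ z.T / temperature                        # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)             # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # positives = other samples in the batch that share the same label
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # average log-probability over each anchor's positives, then over anchors
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

# usage sketch: embeddings come from the encoder, before the classification layer
# loss = sup_con_loss(encoder(images), labels)
```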

Ref: paper, video

Radioactive data tracing through training

Motivation

  • Want to verify if someone has used your dataset for training

Idea

  • Introduce extra (synthetic) features to the data, e.g., a “cat” feature to cat images and a “dog” feature to dog images
  • Verify whether the dataset was used via hypothesis testing (a toy sketch of this idea follows below)
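To make the idea concrete, here is a toy, self-contained sketch of my reading of it (not the paper’s exact method): a fixed random “carrier” direction is added to the features of one class, and we then test whether a model trained on the marked data aligns with the carrier more strongly than chance would allow.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 2000                                # feature dimension, samples per class

x0 = rng.normal(0.0, 1.0, size=(n, d))          # class-0 features (stand-ins for CNN features)
x1 = rng.normal(0.5, 1.0, size=(n, d))          # class-1 features
u = rng.normal(size=d)
u /= np.linalg.norm(u)                          # secret unit-norm carrier direction
x1_marked = x1 + 4.0 * u                        # mark class 1 (strength exaggerated for clarity)

# "Someone" trains a linear classifier on the marked data; the class-mean
# difference serves here as a cheap proxy for the trained class-1 weight vector.
w = x1_marked.mean(axis=0) - x0.mean(axis=0)

# Hypothesis test: is |cos(w, carrier)| larger than for random directions?
cos = abs(w @ u) / np.linalg.norm(w)
V = rng.normal(size=(10000, d))                 # random directions for the null distribution
null = np.abs(V @ w) / (np.linalg.norm(w) * np.linalg.norm(V, axis=1))
print(f"cos = {cos:.3f}, empirical p-value = {np.mean(null >= cos):.4f}")
```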

Comments

  • It seems that the same classifier is used for both training and testing. If a different classifier is trained on the marked data, I am not sure the method will actually work
  • The “radioactive” images actually look very bad
  • The idea does not seem to be new, and the execution is quite doubtful. It reminds me of the watermarking techniques popular in the early 2000s, which never seemed to take off since they don’t really work.

Ref

Meta-Learning through Hebbian Plasticity in Random Networks

Motivation:

  • Most ML models become static once they are trained, or one has to retrain the model to keep it adapted to a new environment.

Main idea:

  • Instead of updating weights directly, update the rules (determined by the Hebbian coefficients) that update the weights

Some details:

  • Weights are updated by the Hebbian ABCD model, $latex \Delta w_{i,j} = \eta_w (A_w o_i o_j + B_w o_i + C_w o_j + D_w)$
  • Let $latex {\bf h}$ be the vectorized Hebbian coefficients. They are updated with an evolution-strategies step, $latex {\bf h}_{t+1} \leftarrow {\bf h}_t + \frac{\alpha}{n \sigma} \sum_{i=1}^n F_i \, \Delta {\bf h}_i$, where $latex \Delta {\bf h}_i \sim \mathcal{N}({\bf 0}, \sigma {\bf I})$ and $latex F_i$ is a fitness evaluation of $latex {\bf h}_t + \Delta {\bf h}_i$ (I am not completely certain how this fitness is evaluated). A small sketch of these two updates follows below.
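Below is a minimal numpy sketch of the two updates as I understand them; the shapes, learning rates, and fitness evaluation are my own simplifications, not the paper’s exact implementation. The fitness function would typically be the episodic return of an agent whose weights are updated by the Hebbian rule during the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

def hebbian_step(W, pre, post, coeff, eta=0.01):
    """ABCD Hebbian update: dw_ij = eta * (A*o_i*o_j + B*o_i + C*o_j + D)."""
    A, B, C, D = coeff                         # one coefficient per synapse (same shape as W)
    outer = np.outer(post, pre)                # o_i * o_j for every synapse
    return W + eta * (A * outer + B * pre[None, :] + C * post[:, None] + D)

def es_step(h, fitness_fn, n=50, sigma=0.1, alpha=0.05):
    """Evolution-strategies update of the vectorized Hebbian coefficients h."""
    deltas = sigma * rng.normal(size=(n, h.size))       # perturbations of h
    F = np.array([fitness_fn(h + d) for d in deltas])   # fitness of each perturbed h
    F = (F - F.mean()) / (F.std() + 1e-8)               # usual fitness normalization
    return h + alpha / (n * sigma) * F @ deltas
```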

Ref: video, paper, twitter posts

Reformer

Problem statement:

  • The Transformer is computationally expensive when we have too many keys and queries (we need to compute the dot product between each pair). It can be memory intensive as well, depending on the implementation.

Proposed solution:

  • Group keys and queries into “buckets” first and only compute dot products between keys and queries in the same bucket.
  • Bucketing can be implemented as a form of “locality-sensitive hashing” (LSH). Seriously, I always forget what LSH stands for. Could they have created even more obscure jargon? (To be fair, I think they definitely could.)
    • I think a simple example to understand LSH is the binary case. Assume two binary vectors are considered close according to the Hamming distance. Then, for, say, length-100 binary vectors, we can sort them into 4 bins just according to the values of the first two bits. Actually, this is just coset/syndrome coding with a “degenerate” parity-check matrix (one that cares about nothing but the first two bits). If we assume the bits are independent, this degenerate choice is okay. Otherwise, we can just use a random coset as in Slepian-Wolf coding.
    • For the Reformer, LSH is implemented as projections of the keys and queries onto random vectors; the signs of the resulting projections determine the bucket (see the sketch after this list). This is exactly the generalization of Slepian-Wolf coding to the continuous case.
  • To avoid storing intermediate results for backpropagation, they also make the layers reversible. This is basically the same as the lifting scheme in wavelet construction.
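Here is a minimal sketch of sign-based random-projection LSH bucketing as described above; the actual Reformer implementation differs in its details (e.g., how the hash is constructed and shared between queries and keys).

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(x, planes):
    """Hash each row of x (shape [n, d]) into one of 2**n_bits buckets."""
    bits = (x @ planes > 0).astype(int)              # sign pattern of the projections
    return bits @ (1 << np.arange(planes.shape[1]))  # pack the sign bits into a bucket id

# Queries and keys must be hashed with the same random planes to share buckets.
d, n_bits = 64, 4
planes = rng.normal(size=(d, n_bits))                # random projection vectors
queries, keys = rng.normal(size=(8, d)), rng.normal(size=(8, d))
print(lsh_buckets(queries, planes), lsh_buckets(keys, planes))

# Attention then only compares queries and keys that land in the same bucket.
```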

Ref: Video, code, paper
