Conditional probability: why P(X|Y) ~= P(X)?

I understand that, theoretically, P(X|Y) = P(X) means that X and Y are independent. When I coded it out, however, I did not get that exact equality. I want to know whether I did anything wrong and, if not, whether there is an explanation for this approximate equality.

Here is the code:

# Data
from numpy import random
obs = 100000
age = {}
purchase = {}
random.seed(0)
for i in range(1,8):
    age[i*10] = 0
    purchase[i*10] = 0
for _ in range(obs):
    rand_age = random.choice(list(age.keys()))
    age[rand_age] += 1
    rand_p = random.random()
    if rand_p > 0.5: # purchase probability is independent of age
        purchase[rand_age] += 1
total_purchase = sum(list(purchase.values()))

#P(E)
p_purchase =  total_purchase/obs

#P(E|F): Given age = 30, P(purchase): P(purchase|age = 30)
p_purchase_cond_30 = purchase[30]/age[30]

#print result
print('age:',age)
print('purchase:',purchase)
print('total purchase:',total_purchase)
print('P(E) = P(purchase) =',p_purchase)
print('P(E|F) = P(purchase | age = 30) ~= P(purchase) = P(E)\n',p_purchase,'~=',p_purchase_cond_30)   

Here are the results:

age: {10: 14146, 20: 14167, 30: 14278, 40: 14363, 50: 14278, 60: 14405, 70: 14363}
purchase: {10: 7056, 20: 7130, 30: 7191, 40: 7278, 50: 7126, 60: 7100, 70: 7191}
total purchase: 50072
P(E) = P(purchase) = 0.50072
P(E|F) = P(purchase | age = 30) ~= P(purchase) = P(E)
0.50072 ~= 0.5036419666619975

The approximate equality bugs me. Why aren't they exactly equal?


I can’t answer this because I don’t know how the pseudo-random number generators in Python are written, but I just wish to ask: is it possible to make them exactly the same number using random number generators in Python?
A related question: why doesn’t this plot a perfectly horizontal line on the histogram?

import random

import pandas as pd

# 10,000 draws from a uniform distribution on [1, 10]
tries = [random.uniform(1, 10) for _ in range(10000)]
pd.Series(tries).plot.hist()

Umm, will this approximation get closer to perfect equality as lim_{n\to\infty}…but my Jupyter notebook halts when n = 1,000,000,000 :upside_down_face:

Some interesting questions here. To get to unassailable and precise statements, more precision would be required on the questions themselves. Since Discourse isn’t really adequate for a back and forth, I’ll try to fill in some of the missing details in my assumptions.

It’s important to realize that empirical probability and actual probability are different things. The latter is a mathematical concept, the former is based on recorded data.

In this question, all calculated probabilities are actually empirical probabilities, while the usual laws that we use concern the theoretical probability. They are empirical probabilities because we are running an experiment and recording the data. It’s just a different way of getting data than something more connected to the real world, like extracting data from a database.
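A minimal sketch of that distinction, assuming NumPy's `default_rng` generator instead of the seeded `numpy.random` calls used in the question (the function name `empirical_gap` and the sample sizes are illustrative, not from the original post). The theoretical value of both P(purchase) and P(purchase | age = 30) is 0.5, while the recorded frequencies only tend to approach it as the number of observations grows. Drawing all samples at once also avoids the slow Python loop.

```python
import numpy as np

def empirical_gap(n, seed=0):
    """Return (P(E), P(E|F), |difference|) estimated from n simulated customers."""
    rng = np.random.default_rng(seed)
    ages = rng.choice(np.arange(10, 80, 10), size=n)  # uniform over 10, 20, ..., 70
    purchased = rng.random(n) > 0.5                    # drawn without looking at age
    is_30 = ages == 30
    p_e = purchased.mean()                 # empirical P(E)
    p_e_given_f = purchased[is_30].mean()  # empirical P(E|F)
    return p_e, p_e_given_f, abs(p_e - p_e_given_f)

for n in (10**3, 10**4, 10**5, 10**6):
    p_e, p_e_given_f, gap = empirical_gap(n)
    print(f"n={n:>8}: P(E)={p_e:.5f}  P(E|F)={p_e_given_f:.5f}  gap={gap:.5f}")
```

The gap is not guaranteed to shrink at every step, but it is typically on the order of 1/sqrt(n), so it gets small without ever being forced to be exactly zero.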

These observations alone are enough to answer these questions:

However, I think there’s value in exploring this further. Partially because the answer above does not help one train one’s own intuition.

One assumption in the question is that there’s independence between the events, specifically, that a customer’s age is independent of whether they purchased something. This assumption stems from, I think, the fact that the lines of code below do not depend on one another, that their sequential order is irrelevant here, and so on.

rand_age = random.choice(list(age.keys()))
rand_p = random.random()

This isn’t true! But not because of how the functions numpy.random.choice and random.random are implemented. It really doesn’t matter whether random.random uses numpy.random.choice in any way, because independence is a theoretical concept.

By definition, two events X and Y are independent if, and only if, P(X\cap Y) = P(X)P(Y). That’s it. Independence doesn’t care about lines of code, it doesn’t care about the real world or about films Nicolas Cage appeared in.
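For reference, this definition connects to the conditional form used in the original question as follows, assuming P(Y) > 0 so that the conditional probability is defined:

```latex
P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}
\quad\Longrightarrow\quad
\bigl( P(X \mid Y) = P(X) \iff P(X \cap Y) = P(X)\,P(Y) \bigr)
```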

The only thing that matters is the definition. Let’s investigate whether in the experiment the events E and F are independent, where:

  • E is the event that the customer purchased something; and
  • F is the event that the customer is 30 years old (or in the range 20-29, or whatever, it doesn’t really matter).

To do this, we will modify the code slightly in order to track the number of occurrences of E\cap F, i.e., the number of times a 30-year-old customer purchases something. I also transformed your code into a function named simul and made only minimal non-structural changes.

from numpy import random

def simul(obs):
    age = {}
    purchase = {}
    e_and_f = 0  # occurrences of "30 years old and purchased" (E & F)
    random.seed(0)
    for i in range(1, 8):
        age[i*10] = 0
        purchase[i*10] = 0
    for _ in range(obs):
        rand_age = random.choice(list(age.keys()))
        age[rand_age] += 1
        rand_p = random.random()
        if rand_p > 0.5:  # purchase probability is independent of age
            purchase[rand_age] += 1
            if rand_age == 30:
                e_and_f += 1
    total_purchase = sum(purchase.values())

    # P(E)
    p_purchase = total_purchase/obs

    # P(E|F): Given age = 30, P(purchase): P(purchase|age = 30)
    p_purchase_cond_30 = purchase[30]/age[30]

    # P(F)
    p_30 = age[30]/obs

    # P(E & F)
    p_purchase_30 = e_and_f/obs

    # print result
    print('The number of observations is: {}.'.format(obs))

    print('The age dictionary is: {}'.format(age),
          'The purchase dictionary is: {}'.format(purchase),
          'There were {} purchases.'.format(total_purchase),
          ('The empirical probability of the event '
           '"There is a purchase" (E) is {}.').format(p_purchase),
          ('The empirical probability of the event '
           '"The customer is 30 years old" (F) is {}.').format(p_30),
          ('The empirical probability of the event '
           '"There is a purchase", given that the customer\'s age is 30 (E|F)'
           ' is {}.').format(p_purchase_cond_30),
          ('The empirical probability of the event '
           '"The customer is 30 years old and there is a purchase" (E & F)'
           ' is {}.').format(p_purchase_30),
          'And finally, P(E)P(F) = {}.\n'.format(p_purchase*p_30),
          sep='\n'
         )

    return age, purchase, total_purchase, p_purchase, p_30, p_purchase_30

Running simul(100000) prints the following:

The number of observations is: 100000.
The age dictionary is: {10: 14146, 20: 14167, 30: 14278, 40: 14363, 50: 14278, 60: 14405, 70: 14363}
The purchase dictionary is: {10: 7056, 20: 7130, 30: 7191, 40: 7278, 50: 7126, 60: 7100, 70: 7191}
There were 50072 purchases.
The empirical probability of the event "There is a purchase" (E) is 0.50072.
The empirical probability of the event "The customer is 30 years old" (F) is 0.14278.
The empirical probability of the event "There is a purchase", given that the customer's age is 30 (E|F) is 0.5036419666619975.
The empirical probability of the event "The customer is 30 years old and there is a purchase" (E & F) is 0.07191.
And finally, P(E)P(F) = 0.0714928016.

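Note that the values in the last two lines are different: the empirical probabilities do not satisfy P(E\cap F) = P(E)P(F), so, as far as the recorded data is concerned, the events are not independent. As such, you can’t attain the exact equality P(E|F) = P(E).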
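The printed counts can be used to check this arithmetic and to gauge how large the gap is relative to ordinary sampling noise. The sketch below is an addition, not part of the original experiment: it assumes the theoretical purchase probability of 0.5 used in the simulation and the binomial standard error of a proportion, and the variable names are illustrative.

```python
import math

# Counts copied from the simul(100000) output above.
obs = 100000
age_30 = 14278
purchase_30 = 7191
total_purchase = 50072

p_e = total_purchase / obs          # empirical P(E)
p_f = age_30 / obs                  # empirical P(F)
p_e_and_f = purchase_30 / obs       # empirical P(E & F)
p_e_given_f = purchase_30 / age_30  # empirical P(E|F)

print("P(E & F)      =", p_e_and_f)
print("P(E)P(F)      =", p_e * p_f)
print("P(E|F) - P(E) =", p_e_given_f - p_e)

# Standard error of a proportion estimated from age_30 trials,
# assuming the theoretical purchase probability is 0.5.
std_err = math.sqrt(0.5 * 0.5 / age_30)
print("standard error of P(E|F):", std_err)
print("gap in standard errors:  ", (p_e_given_f - p_e) / std_err)
```

Under these assumptions the gap between P(E|F) and P(E) is less than one standard error of P(E|F), which is the size of deviation one would expect from chance alone.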

To finish, I assume you don’t necessarily mean to ask about random number generators in Python, but rather about generating “randomness” programmatically in general. The answer is actually trivial: absolutely not. For instance, in rolling a die, what if you only perform five experiments? You can’t get the same number of outcomes on each face, because there are six possible outcomes.
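The die example can be sketched as follows, assuming NumPy's `default_rng` as the source of randomness (the helper name `face_counts` is illustrative). With five rolls at least one face must have a count of zero, and even when the number of rolls is divisible by six, exactly equal counts are only one of many possible outcomes.

```python
import numpy as np

def face_counts(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and count each face (1..6)."""
    rng = np.random.default_rng(seed)
    rolls = rng.integers(1, 7, size=n_rolls)    # upper bound is exclusive
    return np.bincount(rolls, minlength=7)[1:]  # drop the unused 0 slot

for n in (5, 6, 6000):
    counts = face_counts(n).tolist()
    print(n, counts, "all equal:", len(set(counts)) == 1)
```

The same reasoning applies to the histogram question: each bin count is itself a random quantity, so a perfectly flat histogram is possible in principle but very unlikely.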


I am convinced by the difference between empirical and theoretical probability, but it raises another question for me: if independence is only imaginary and everything could be correlated, how should we draw conclusions or predict the number of drowned people in your example, which shows the obviously nonsensical correlation between Nicolas Cage films and drownings? I think it is a very practical question, especially in a casino. :slight_smile:

I’m not sure I understand the question, but I’ll try to answer it. I’m assuming that with “independence being imaginary” you’re referring to me having said that it is a theoretical concept that doesn’t care about the real world.

What I meant to do was to highlight exactly why we don’t get an equality in the starting question. It’s because theoretical concepts and laws are being mixed with the real world. In fact, there’s nothing wrong with what you did.

The Nicolas Cage example serves to highlight that theory and practice do not always match. Assuming the data does show that the correlation exists, this just serves to show that models aren’t perfect, and we know they aren’t. That spurious correlation exists, and what of it? What this tells you is that you can’t base all of your knowledge on correlations.

And let’s not get into the fact that correlation does not imply causation.


Two arguments for the usefulness of theoretical “imaginary” independence here:

It can guide decision making, in the sense that two random variables X and Y being statistically independent implies that their Pearson correlation coefficient is 0. That in turn implies you should not use either one to predict the other if you believe they have a linear relationship and want to use that linearity to interpolate, extrapolate, or predict future or past values.
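As a rough numerical illustration of this argument, the sketch below draws two samples independently and computes their sample Pearson coefficient with `numpy.corrcoef`. The distributions, sample size, and seed are arbitrary assumptions. The theoretical coefficient is 0, while the sample value is only close to 0, which mirrors the empirical versus theoretical distinction discussed earlier in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)         # X and Y are drawn separately,
y = rng.uniform(0, 1, size=n)  # so they are independent by construction

r = np.corrcoef(x, y)[0, 1]    # sample Pearson correlation coefficient
print(f"sample correlation: {r:.5f}")
```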

Conditional independence is the core of the Naive Bayes formula. It allows people to formalize in math notation the idea that the probability of an observation with multiple features can be broken down into the product of the probabilities of the individual features.
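In symbols, with a class label C and features x_1, ..., x_n (notation assumed here for illustration), the conditional independence assumption and the resulting Naive Bayes rule read:

```latex
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
\quad\Longrightarrow\quad
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
```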

Everything could be correlated, but not everything is correlated in an easily interpretable way, so people usually begin with linear correlation because that’s the most easily interpretable and the easiest to express in math. Being interpretable also helps in inspiring and designing experiments, where additive/linear relationships are the ones most easily studied by our puny human brains.

With machine learning, any correlation can be taken advantage of, including predicting the number of drowned people using Nicolas Cage films. My view is that what we judge as nonsense may actually be an unimaginably long chain of events that we do not understand, but that can still be used for prediction. So my point is that this “nonsense” can be used for prediction in the short term, keeping in mind that the relationship may disappear at any moment, since it may be based less on universal, unchanging physical laws and more on ephemeral human behaviour.

Here’s an interesting dichotomy; I’m for team Chomsky. https://towardsdatascience.com/predicting-vs-explaining-69b516f90796

This is not the first time I have encountered the philosophy of correlation and causality, so sorry if I am asking something stupid.

Certainly, the goal is the latter. However, in reality, we can only observe the former, by running some OLS regression or correlation matrix (maybe?). Keep in mind that the variable that carries the causality may not even be in the dataset (which is usually the case). How should we statistically or definitively distinguish between the two?

Even if it passes statistical tests such as 2SLS/Heckman/Hausman, that does not give any benchmark of causality. If so, does it still matter that correlation is not causality? Conclusions are drawn from the observable correlation anyway.