Probability & Statistics/correlation with two independent variables from different sources
QUESTION: Hi Clyde,
I have a question I was hoping you could help me with relating to correlation. The concept of correlation does not trouble me, but I am confused as to how correlation works when you have two variables taken from different sources and run a correlation on them. Depending on how you enter and plot the data you can get very different results.
Normally the data for two variables (e.g. height and weight) will be taken from a source (e.g. a class of students) and you can plot the variables (x=weight; y=height) for each student. This makes sense and I understand how the correlation can be drawn from this. However, there may be cases where you take two independent variables from different sources (i.e. not taken from a set of students; e.g. one set of students used for height and the other for weight). In this case you can not pair x and y for each case, you simply have two columns of data which you want to correlate.
Now when you have this data and input it a certain way you can run a correlation (use this for ease: http://ncalculators.com/statistics/correlation-coefficient-calculator.htm) and get a result. Then you can change the order of the data (moving the positions of the cells in a column - randomising it) and get a very different result. I presume this happens because the computation is pairing x and y and by reorganising this it is computing values for new pairs.
If doing such correlations this seems hugely problematic. Perhaps I am thinking of scenarios which are unlikely, and most of the time you should be correlating data from one source. But nevertheless I am confused as to how one would go about running correlations for two very different variables from different sources. It may be arbitrary, but there may be better examples that could benefit from a correlation. However, they would still suffer from the result which can be affected by the order of the data in the columns.
Any information on understanding this more would be much appreciated!
ANSWER: I'm not sure how to answer the question because I'm really not sure where to begin. It seems like you are asking whether something makes sense, but your question is about five times longer than it should be, so I'm having trouble making heads or tails of it. I will try to summarize, and in that case the answer is somewhat terse anyway.
I think you are trying to ask "How do you correlate two different dimensions from different data?" In other words, you can ask 100 people "what is your height?" and then 100 other people "what is your weight?" and collect two separate sets of data -- each with a different dimension (height or weight).
This is opposed to a normal survey, which would ask 100 people both questions, giving you one set of data with two dimensions.
The answer, though, is that if you collect the data the first way (two separate groups, one asked for height and the other for weight), you really can't do a correlation analysis -- or at least, it doesn't make much sense. The data are not related because none of the heights have anything to do with any of the weights -- they don't belong to the same data point.
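To make the point concrete, here is a minimal sketch of why the calculator's result moves when one column is reordered. The height and weight numbers are made up purely for illustration, and `pearson_r` is a hand-written helper (not taken from the calculator linked above) that computes the usual Pearson coefficient over whatever row-by-row pairs it is given.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists, paired row by row."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: two columns that did NOT come from the same people.
heights_cm = [150, 155, 160, 165, 170, 175, 180, 185]
weights_kg = [52, 58, 61, 66, 70, 77, 80, 88]

# Correlation for the columns exactly as they happen to be entered.
r_as_entered = pearson_r(heights_cm, weights_kg)
print(f"as entered: r = {r_as_entered:.3f}")

# Reorder only the weight column and recompute a few times.
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled_rs = []
for _ in range(5):
    shuffled = weights_kg[:]
    rng.shuffle(shuffled)
    shuffled_rs.append(pearson_r(heights_cm, shuffled))
    print(f"reordered:  r = {shuffled_rs[-1]:.3f}")
```

Each shuffle creates a new set of (height, weight) pairs, so each run of the formula is answering a different question. With unpaired data, none of those pairings is more "right" than any other.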
---------- FOLLOW-UP ----------
QUESTION: Hi Clyde,
Thanks for your response. Apologies for the long-winded question, I just wanted to put the detail in to explain everything, which you managed to do more succinctly. I was essentially asking about correlating two different dimensions from different data.
So, running a correlation analysis on such data is meaningless, and hence you can get many different outcomes because the data is not paired (in the way the height and weight of one person from a survey would be). So is there anything you can do with such data to show a relationship?
ANSWER: Frankly, no, you can't establish any such correlation without some kind of additional information. We have some assumptions about height and weight -- that they should somehow be correlated automatically, since a taller person with the same build as a shorter person is heavier most of the time. But that is assuming a correlation and then trying to prove it statistically somehow, which is not a valid line of reasoning. And these data could be anything. Income and number of freckles. Who knows if those are correlated? Who knows if it's positive or negative? You wouldn't be able to tell even if the data "looked" correlated after pairing up the dimensions, because to pair them at all you'd have to sort or reorder the data sets somehow, and the result would reflect the ordering you chose rather than anything in the data.
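As a rough illustration of that last point, the sketch below generates two columns that are independent by construction (the "income" and "freckles" figures are random numbers invented for the example, not real data) and then pairs them three ways. The helper `pearson_r` is again hand-written for the example.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists, paired row by row."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 100

# Two independently generated columns (invented numbers, no real link).
incomes = [rng.gauss(40000, 8000) for _ in range(n)]
freckles = [rng.gauss(30, 10) for _ in range(n)]

# Pairing 1: whatever order the rows happened to arrive in.
r_arbitrary = pearson_r(incomes, freckles)

# Pairing 2: sort both columns ascending, then pair row by row.
r_sorted = pearson_r(sorted(incomes), sorted(freckles))

# Pairing 3: sort one ascending and the other descending.
r_opposed = pearson_r(sorted(incomes), sorted(freckles, reverse=True))

print(f"arbitrary order: r = {r_arbitrary:.3f}")
print(f"both sorted:     r = {r_sorted:.3f}")
print(f"opposite sorts:  r = {r_opposed:.3f}")
```

Sorting both columns the same way pushes r toward +1 and sorting them in opposite directions pushes it toward -1, even though the columns were generated with no relationship at all. The "correlation" comes from the ordering that was imposed on the data.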