Sunday 28 February 2010 at 10:05 am
Correlation is a concept based upon the notion of co-variance. Two varying quantities are said to co-vary when, whenever one increases in value, the other does so too, and likewise when one decreases. Two correlated quantities vary together, so that they actually seem to interact.
In statistical terms, when the varying quantities are described as random variables X and Y, covariance has a specific meaning. Covariance is the expected value (denoted E[.] here) of the product of the deviations of each variable from its mean: Cov(X,Y) = E[(X - E(X)) (Y - E(Y))].
When both variables take values larger (or both smaller) than their respective means, the product of the deviations is positive. When one of them takes a value above its mean while the other takes a value below its mean, the product is negative.
Correlation is computed like the covariance, except that it is normalised so as to take values between -1 and 1, no matter which quantities are considered. It is defined as Cov(X,Y) / (std(X) std(Y)), where std(X) is the square root of the variance of X (its standard deviation), and is known as the Pearson product-moment correlation.
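As a quick illustration (not from the original post), here is a minimal Python/NumPy sketch that computes both quantities directly from the definitions above; the data and variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)   # y co-varies with x

# Covariance: expected product of the deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson correlation: covariance normalised by the standard deviations
corr_xy = cov_xy / (x.std() * y.std())

print(cov_xy, corr_xy)                 # corr_xy is close to +1
print(np.corrcoef(x, y)[0, 1])         # NumPy's built-in estimate agrees
```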
A correlation of +1 between two variables indicates perfect co-variation, while a value of -1 indicates perfect anti-variation: when one of the variables is increasing, the other is decreasing, and vice versa.
Applied to two binary variables, computing the correlation can be seen as computing the average value of X XOR Y, where XOR is the exclusive-OR binary operator.
As far as continuous variables are concerned, the correlation is directly linked to the (normalized) error made by a linear model when one of the variables is used to predict the other and vice versa.
In that context, a correlation of +1 indicates a perfect linear relationship between both variables, that is, one can be expressed as some factor times the other, plus some bias, as with, for instance, the Celsius and Fahrenheit temperature scales. A value of -1 refers to the same property, except that in that case the factor is negative. When two variables express a correlation of zero, it means that their relationship is not describable by a linear model. This can be either because there is no relation at all (the variables are independent) or because their relation is non-linear. For instance, the correlation between X and X^2 over a symmetrical interval (i.e. centred on zero) is zero, while X and X^2 are obviously not independent. The correlation between X and X^2 over a non-negative interval (an interval whose bounds are both non-negative) is not zero, but it is not +1 either; it takes an intermediate value.
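A small sketch of that last point, with example intervals of my own choosing:

```python
import numpy as np

x_sym = np.linspace(-1.0, 1.0, 1001)   # symmetric interval centred on zero
x_pos = np.linspace(0.0, 1.0, 1001)    # non-negative interval

print(np.corrcoef(x_sym, x_sym**2)[0, 1])  # ~0: no linear relation, yet clearly dependent
print(np.corrcoef(x_pos, x_pos**2)[0, 1])  # intermediate value (about 0.97), not +1
```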
To overcome that limitation, Spearman (C. Spearman, "The proof and measurement of association between two things", Amer. J. Psychol., 15 (1904), pp. 72–101) proposed another definition of the correlation, based not on the values of the variables, but on the ordering of the instances according to those variables. In practice, the Spearman correlation is computed as the Pearson correlation between the ranks of the instances when sorted according to each variable. When both orderings match perfectly, the correlation is +1, and it means that there is a perfect monotonic relationship between both variables: when one increases, the other one increases too, but not necessarily in a linear way. The Spearman correlation between X and X^2 over a non-negative interval is then exactly +1. Kendall's tau coefficient (M. Kendall (1948), Rank Correlation Methods, Charles Griffin & Company Limited) is computed slightly differently, but it expresses the same idea.
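The same example, this time using SciPy's spearmanr and kendalltau (a tooling choice of mine, not mentioned in the post):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.linspace(0.0, 1.0, 1001)        # non-negative interval
y = x**2                               # monotonic but non-linear in x

print(pearsonr(x, y)[0])               # < 1: the relation is not linear
print(spearmanr(x, y).correlation)     # 1: the rankings match perfectly
print(kendalltau(x, y).correlation)    # also 1: same idea, based on concordant pairs
```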
Correlation can also be computed between one variable and the same variable to which a delay is applied. It is then called auto-correlation (M.B. Priestley, Spectral Analysis and Time Series, London/New York: Academic Press, 1982). Rather than a single value, the auto-correlation produces a function of the delay. The local maxima of that function indicate potential periodicities.
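As a rough illustration, here is a sketch with a made-up noisy sine wave and a naive sample auto-correlation estimate (the estimator and all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
signal = np.sin(2 * np.pi * t / 50) + 0.3 * rng.normal(size=t.size)  # period of 50 samples

def autocorr(x, max_lag):
    """Sample auto-correlation of x for lags 0..max_lag."""
    x = x - x.mean()
    return np.array([np.corrcoef(x[: len(x) - lag], x[lag:])[0, 1]
                     for lag in range(max_lag + 1)])

ac = autocorr(signal, 120)
# Local maxima appear near multiples of the period (lags 50, 100, ...),
# while half-period lags (25, 75) give strongly negative values
for lag in (25, 50, 75, 100):
    print(lag, round(float(ac[lag]), 2))
```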
Correlation can also be defined between groups of variables. It is then called canonical correlation and is strongly linked to Principal Component Analysis (Kanti V. Mardia, J. T. Kent and J. M. Bibby (1979), Multivariate Analysis, Academic Press).
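For illustration only, here is a minimal sketch using scikit-learn's CCA (again a tooling choice of mine) on two synthetic groups of variables that share one latent factor:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([latent + rng.normal(scale=0.5, size=200),
                     rng.normal(size=200)])           # first group of variables
Y = np.column_stack([2 * latent + rng.normal(scale=0.5, size=200),
                     rng.normal(size=200)])           # second group, sharing the latent factor

cca = CCA(n_components=1).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
print(np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1])          # first canonical correlation, close to 1
```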
Finally, it is very important to distinguish correlation from causality. Two variables can be correlated while sharing absolutely no causal link, that is, neither is actually influencing the other. A typical example: if you compute the correlation between the number of crimes committed in a city per year and the number of churches in that city, you will most probably find a rather large value. This is simply because both are proportional to the number of inhabitants of the city.
Friday 29 January 2010 at 2:04 pm
In data mining and statistical data analysis, data need to be prepared before models can be built or algorithms can be used. In this context, preparing the data means transforming them prior to the analysis so as to ease the algorithm's job. Often, the rationale is to alter the data so that the hypotheses on which the algorithms are based are verified, while at the same time keeping their information content intact. One of the most basic transformations is normalisation.
What is normalisation?
The term normalisation is used in many contexts, with distinct but related meanings. Basically, normalising means transforming so as to render normal. When data are seen as vectors, normalising means transforming the vector so that it has unit norm. When data are thought of as random variables, normalising means transforming towards a normal distribution. When the data are already hypothesised to be normal, normalising means transforming to unit variance.
Let us consider data as a table where each row corresponds to an observation (a data element), and each column corresponds to a variable (an attribute of the data). Let us furthermore assume that each data element has a response value (target) associated with it (i.e. we focus on supervised learning).
Variable (column) normalisation
Why column normalisation? The simple answer is so that variables can be compared fairly in terms of information content with respect to the target variable. This issue is most important for algorithms and models that are based on some sort of distance, such as the Euclidean distance. As the Euclidean distance is computed as a sum of variable differences, its result greatly depends on the ranges of the variables. Should a variable have a dynamic range (or variance), say, 100 times larger than the others, then its value will mostly dictate the value of the distance, effectively ignoring the values of the other variables. Should those variables carry some importance, the distance would then be nearly useless in any algorithm.
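A tiny made-up example of that dominance effect:

```python
import numpy as np

# Two observations; the second variable has a range ~100 times larger than the first
a = np.array([0.2, 150.0])
b = np.array([0.9, 120.0])

dist = np.linalg.norm(a - b)
print(dist)                      # ~30.0: almost entirely driven by the second variable
print(abs(a[0] - b[0]))          # 0.7: the first variable barely contributes
```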
To avoid the latter situation, variables (columns) are normalised to the same dynamic range, with no units (they become dimensionless values). In practice, the way the normalisation is handled depends on the hypotheses made (or, as a matter of fact, on the personal experience of the practitioner).
Alternative 1: variables are supposed normally distributed with distinct means and variances.
In such a case, the idea is to centre all variables so that they have zero mean, and to divide them by their standard deviation so that they all have unit variance. The transformed variables are then what are called 'z-scores' in the statistical literature. They are expressed in 'numbers of standard deviations' of the original data. Most of the transformed values lie within the [-1, 1] interval.
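A minimal NumPy sketch of this column-wise z-score normalisation, on made-up data:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 260.0],
              [4.0, 320.0]])               # two variables with very different scales

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # column-wise z-scores
print(Z.mean(axis=0))                      # ~0 for each column
print(Z.std(axis=0))                       # 1 for each column
```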
Alternative 2: variables are supposed uniformly distributed with distinct ranges.
Then, the idea is to rescale all variables to the same minimum (e.g. zero) and maximum (e.g. one). The transformed values then of course lie in the interval [0, 1] and are expressed as percentages of the original range.
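The corresponding sketch for [0, 1] (min-max) scaling, on the same made-up data:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 260.0],
              [4.0, 320.0]])

# Column-wise rescaling to [0, 1] (assumes no constant column, which would divide by zero)
X01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X01.min(axis=0), X01.max(axis=0))    # 0 and 1 for each column
```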
Alternative 3: no hypothesis is assumed.
When no hypothesis is made, the solution is to replace each original value with its percentile in the original variable's distribution. The data are then squashed, in a non-linear way, between zero and one, based on the empirical cumulative distribution of each variable.
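One possible sketch of this percentile (empirical-CDF) transform, using SciPy's rankdata as one way to implement it:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([3.0, 1.0, 4.0, 1.5, 100.0])     # a variable with an outlier

# Each value is replaced by its rank, rescaled to (0, 1];
# the outlier no longer dominates the range
x_pct = rankdata(x) / len(x)
print(x_pct)                                   # [0.6 0.2 0.8 0.4 1. ]
```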
Element (row) normalisation
Why row normalisation? While column normalisation can be applied to any data table, row normalisation makes sense only when all variables are expressed in the same unit. This is often the case, for instance, with functional data, that is data that come from the discretisation of some function. Row normalisation makes sense in such a context when the measurements are prone to measurement bias, or when the information lies in relative measurements rather than in absolute ones. Then, the same normalisation procedures can be applied as for column normalisation. Often, the mean and the variance, or the maximum and the minimum, of each data element are added back as extra variables prior to the analysis.
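A minimal sketch of row normalisation for functional data, following the idea above (the data are made up):

```python
import numpy as np

# Functional data: each row is a discretised curve, measured in the same unit
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])

row_mean = X.mean(axis=1, keepdims=True)
row_std = X.std(axis=1, keepdims=True)
X_norm = (X - row_mean) / row_std             # each row now has zero mean, unit variance

# Keep the discarded scale information as extra variables, as suggested above
X_aug = np.hstack([X_norm, row_mean, row_std])
print(X_aug.shape)                            # (2, 6)
```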
Target normalisation
Why target normalisation? Because building a model between the data elements and their associated target is easier when the set of values to predict is rather compact. So when the distribution of the target variable is skewed, that is, there are many lower values and a few higher values (e.g. the distribution of income: income is non-negative, most people earn around the average, and a few people earn much more), it is preferable to transform the variable towards normality by taking its logarithm. The distribution then becomes more even.
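A small illustration with a synthetic, income-like target; the log-normal sample is only meant to mimic a skewed distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=0.8, size=10000)   # skewed, income-like target

log_income = np.log(income)                      # much more symmetric, easier to model
print(np.median(income), income.mean())          # mean pulled up by the long right tail
print(np.median(log_income), log_income.mean())  # nearly equal: the skew is gone

# After modelling log_income, predictions are mapped back with np.exp(...)
```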
Summary
Normalisation is a procedure followed to bring the data closer to the requirements of the algorithms, or at least to pre-process the data so as to ease the algorithm's job. Variables can be normalised (to zero mean and unit variance, or to the interval [0, 1]), data elements can be normalised (when all their attributes have the same 'units'), and the target variable can be normalised too (using a logarithmic transform). The choice of whether or not to normalise is of course left to the practitioner, but it can be advised, with virtually no risk, to always normalise variables to [0, 1] when their values are bounded, to zero mean and unit variance otherwise, and to log-transform the target whenever it is skewed.
Thursday 29 October 2009 at 08:59 am
Whenever a prediction model is built, its performance must be estimated so as to get an idea of how accurate the model is. The fact is that many different measures have been proposed and used inconsistently, sometimes making it difficult to compare models. I have put together a list of the most common ones, along with their definitions/equations, to serve as a handy reminder, in the spirit of cheat sheets.
You can download it from here. Do not hesitate to email me any comment you might have about it.
P.S. The link was working from the front page only; I have just corrected that (thanks Kevin and Alex).
Thursday 24 September 2009 at 3:30 pm
I regularly tweet about machine learning applications I find on the web and other AI-related web pages: http://twitter.com/damienfrancois
Monday 26 January 2009 at 1:13 pm
The time when artificial intelligence will make robots more intelligent than we are has not arrived yet. Artificial intelligence is, however, more than a dream for visionary scientists; it is a very active and broad research field from which many useful tools for solving problems have arisen.
The applications where artificial intelligence can help are broadly divided into three categories, which are detailed hereafter...
(more)
Monday 26 January 2009 at 1:10 pm
For people interested in learning about machine learning, here is a list of websites where you can find video lectures on topics related to machine learning.
The main references are www.videolectures.net, which has a section dedicated to videos from the PASCAL network, and AAAI.org.
On delicious, pskomoroch has compiled a huge list of videos in various domains, including machine learning. And of course, there's always Google Video.
Some blogs also link to videos. Free Science and video lectures online! has compiled a list of video lectures given at the Machine Learning Summer Schools of 2003, 2005 and 2006; most of those videos come from www.videolecture.com. Data Wrangling proposes a large list of Hidden Video Courses in Math, Science, and Engineering, and Business Intelligence, Data Mining & Machine Learning a list of Machine Learning OnLine Lectures. Olivier Bousquet also has a page dedicated to Machine Learning Videos. On Cgkt's Weblog, you can find Video Lectures on Probabilistic Graphical Models, as well as on LectureFox (free university lectures » mathematics).
Finally, note that Berkeley and MIT also publish online videos of lectures.
Feel free to comment and/or add more sources!
For less advanced lectures, here is a link to a video by Tom Mitchell, author of one of the founding reference books in machine learning. He is the head of the Machine Learning Department at Carnegie Mellon University. The video is aimed at people who do not know the field of machine learning and contains very little technical content. Interested readers may also consider reading his white paper introducing Machine Learning.