Data normalization for statistical analysis
29 01 10 - 14:04

In data mining and statistical data analysis, data need to be prepared before models can be built or algorithms can be run. In this context, preparing the data means transforming them prior to the analysis so as to ease the algorithm's job. Often, the rationale is to alter the data so that the hypotheses on which the algorithms are based are satisfied, while keeping their information content intact. One of the most basic transformations is normalisation.

What is normalisation?
The term normalisation is used in many contexts, with distinct but related meanings. Basically, normalising means transforming so as to render something 'normal'. When data are seen as vectors, normalising means transforming the vector so that it has unit norm. When data are thought of as random variables, normalising means transforming them towards a normal distribution. When the data are already hypothesised to be normal, normalising means transforming them to zero mean and unit variance.
Let us consider data as a table where each row corresponds to an observation (a data element) and each column corresponds to a variable (an attribute of the data). Let us furthermore assume that each data element has a response value (target) associated with it (i.e. we focus on supervised learning).
Variable (column) normalisation
Why column normalisation? The simple answer: so that variables can be compared fairly in terms of information content with respect to the target variable. The issue matters most for algorithms and models that are based on some sort of distance, such as the Euclidean distance. As the Euclidean distance is computed as a sum of squared variable differences, its value greatly depends on the ranges of the variables. Should a variable have a dynamic range (or variance) say 100 times larger than the others, then its values would mostly dictate the value of the distance, effectively ignoring the other variables. Should those other variables carry useful information, the distance would be nearly useless in any algorithm.
To avoid the latter situation, variables (columns) are normalised to the same 'dynamic range', with no units (they become dimensionless values). In practice, the way the normalisation is handled depends on the hypotheses made (or, as a matter of fact, on the personal experience of the practitioner).
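As a quick illustration of the scaling problem, here is a minimal sketch in Python/NumPy with made-up numbers: two observations described by an income (in euros) and an age (in years). The raw Euclidean distance is dictated almost entirely by the income difference; after dividing each variable by a rough estimate of its range, both variables contribute comparably.

    import numpy as np

    # Two observations: (income in euros, age in years) -- made-up numbers.
    a = np.array([50_000.0, 25.0])
    b = np.array([52_000.0, 60.0])

    # Raw Euclidean distance: dominated by the income difference (~2000).
    print(np.linalg.norm(a - b))

    # Divide each variable by a rough estimate of its range: both now matter.
    ranges = np.array([100_000.0, 100.0])
    print(np.linalg.norm(a / ranges - b / ranges))   # ~0.35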
Alternative 1: variables are assumed to be normally distributed, with distinct means and variances.
In that case, the idea is to centre all variables so they have zero mean, and to divide them by their standard deviation so that they all have unit variance. The transformed variables are then what the statistical literature calls 'z-scores': they are expressed in numbers of standard deviations of the original data. When the data are indeed roughly normal, about two thirds of the transformed values lie within the [-1, 1] interval (and about 95% within [-2, 2]).
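A minimal sketch of this z-score transform in Python/NumPy, on a made-up table where rows are observations and columns are variables:

    import numpy as np

    # Made-up data table: rows = observations, columns = variables.
    X = np.array([[170.0, 65.0],
                  [160.0, 80.0],
                  [180.0, 55.0],
                  [175.0, 70.0]])

    # Centre each column and divide by its standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    print(Z.mean(axis=0))   # ~[0, 0]
    print(Z.std(axis=0))    # [1, 1]

Whether to use the population or the sample standard deviation (ddof=1) is a matter of convention; for normalisation purposes the difference is usually negligible.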
Alternative 2: variables are assumed to be uniformly distributed, with distinct ranges.
Then the idea is to rescale all variables to the same minimum (e.g. zero) and maximum (e.g. one). The transformed values then of course lie in the interval [0, 1] and are expressed as fractions of the original range.
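The corresponding sketch for this min-max scaling (same kind of made-up table as above):

    import numpy as np

    X = np.array([[170.0, 65.0],
                  [160.0, 80.0],
                  [180.0, 55.0],
                  [175.0, 70.0]])

    # Map each column onto [0, 1] using its minimum and maximum.
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    X01 = (X - mins) / (maxs - mins)

Note that, for prediction tasks, the minima and maxima estimated on the training data should be reused as-is to rescale new data, even if the new values then fall slightly outside [0, 1].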
Alternative 3: no distributional hypothesis is made.
When no hypothesis is made, the solution is to replace each original value with its percentile in the distribution of its variable. The data are then squashed, in a non-linear way, between zero and one, according to the empirical cumulative distribution of each variable.
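A sketch of this percentile (empirical-CDF) transform, using simple ranking in NumPy (ties are broken arbitrarily here; an average-rank scheme is also common):

    import numpy as np

    X = np.array([[170.0, 65.0],
                  [160.0, 80.0],
                  [180.0, 55.0],
                  [175.0, 70.0]])

    # Rank each value within its column (0 = smallest), then rescale to (0, 1].
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0)
    P = (ranks + 1) / n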
Element (row) normalisation
Why row normalisation? While column normalisation can be applied to any data table, row normalisation makes sense only when all variables are expressed in the same unit. This is often the case, for instance, with functional data, that is, data that come from the discretisation of some function. Row normalisation makes sense in such a context when the measurements are prone to measurement bias, or when the information lies in relative rather than absolute measurements. The same normalisation procedures as for columns can then be applied. Often, the mean and the variance, or the maximum and the minimum, of each data element are added as extra variables prior to the analysis.
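A sketch of row normalisation on made-up functional data (each row is a curve sampled at the same five points, all in the same unit), keeping the row mean and standard deviation as extra variables so that the absolute level and spread are not lost:

    import numpy as np

    # Made-up functional data: each row is a discretised curve.
    X = np.array([[ 1.0,  2.0,  3.0,  4.0,  5.0],
                  [10.0, 20.0, 30.0, 40.0, 50.0],
                  [ 2.0,  2.5,  3.0,  3.5,  4.0]])

    row_means = X.mean(axis=1, keepdims=True)
    row_stds  = X.std(axis=1, keepdims=True)

    # Normalise each row, then append its mean and standard deviation.
    X_norm = (X - row_means) / row_stds
    X_aug  = np.hstack([X_norm, row_means, row_stds])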
Target normalisation
Why target normalisation? Because building a model between the data elements and their associated target is easier when the set of values to predict is rather compact. So when the distribution of the target variable is skewed, that is, when there are many lower values and a few much higher values (e.g. the distribution of income: income is non-negative, most people earn around the average, and a few people earn much more), it is preferable to bring the variable closer to a normal one by taking its logarithm. The distribution then becomes more symmetric.
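A sketch of the log transform on a made-up, skewed, non-negative target:

    import numpy as np

    # Made-up incomes: mostly moderate values, a few much larger ones.
    y = np.array([18_000.0, 22_000.0, 25_000.0, 27_000.0,
                  30_000.0, 35_000.0, 60_000.0, 250_000.0])

    # log1p handles zeros gracefully; plain np.log is fine when all values are > 0.
    y_log = np.log1p(y)

    # Predictions made on the log scale must be mapped back: np.expm1(y_log_pred).

Whether to use log or log1p depends only on whether zero values can occur in the target.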
Summary
Normalisation is a procedure followed to bring the data closer to the requirements of the algorithms, or at least to pre-process the data so as to ease the algorithm's job. Variables can be normalised (to zero mean and unit variance, or to the interval [0, 1]), data elements can be normalised (when all their attributes have the same 'units'), and the target variable can be normalised too (using a logarithmic transform). The choice of whether or not to normalise is of course left to the practitioner, but it is virtually risk-free to always normalise variables to [0, 1] when their values are bounded, to zero mean and unit variance otherwise, and to log-transform the target whenever it is skewed.