A univariate outlier is a data point with an extreme value on a single variable. A multivariate outlier is a combination of unusual scores on at least two variables. Both types of outliers can influence the outcome of statistical analyses. Outliers arise for four reasons. First, incorrect data entry can cause a dataset to contain extreme cases. Second, a failure to specify missing-value codes can lead the analysis to treat those codes as real data. Third, a case may not be a member of the population from which the sample was intended to be drawn. Finally, a case may belong to the intended population, but the distribution of the variable in that population may have more extreme values than a normal distribution.
In many parametric statistical analyses, univariate and multivariate outliers must be removed from the dataset. When screening continuous variables for univariate outliers, standardized values (z scores) can be used. If the statistical analysis to be performed does not contain a grouping variable, such as linear regression, canonical correlation, or SEM, among others, then the dataset should be assessed for outliers as a whole. If the analysis to be conducted does contain a grouping variable, such as MANOVA, ANOVA, ANCOVA, or logistic regression, among others, then the data should be assessed for outliers separately within each group. For continuous variables, univariate outliers can be defined as cases whose standardized scores exceed 3.29 in absolute value. However, caution must be taken with extremely large sample sizes, since a few cases beyond this threshold are expected by chance in such datasets. Once univariate outliers have been removed from a dataset, the data can be assessed for multivariate outliers, which can then be removed as well.
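As an illustration, the following is a minimal sketch of this screening procedure in Python, both for a dataset as a whole and within the levels of a grouping variable. The DataFrame and the column names `group` and `score` are hypothetical placeholders, and the extreme case is injected purely for demonstration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 100),   # hypothetical grouping variable
    "score": rng.normal(0, 1, 200),        # hypothetical continuous variable
})
df.loc[0, "score"] = 8.0                   # inject one extreme case

# No grouping variable (e.g., linear regression, canonical correlation, SEM):
# standardize the variable across the whole dataset.
z_whole = stats.zscore(df["score"], ddof=1)
print(df[np.abs(z_whole) > 3.29])

# Grouping variable present (e.g., ANOVA, MANOVA): standardize within groups.
z_by_group = df.groupby("group")["score"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=1)
)
print(df[z_by_group.abs() > 3.29])
```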
Multivariate outliers can be identified using Mahalanobis distance, which is the distance of a case from the centroid of the remaining cases, where the centroid is the point defined by the intersection of the means of all the variables being assessed. Each case is treated as a combination of scores on those variables, and a multivariate outlier lies an unusually large distance from the other cases. The distances are evaluated against the χ² distribution with degrees of freedom equal to the number of variables, using a criterion of p < .001.
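A minimal sketch of this calculation, assuming numpy and scipy are available; the data matrix `X` and the injected outlier are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))      # 200 cases measured on 3 variables
X[0] = [6.0, -6.0, 6.0]            # inject one multivariate outlier

centroid = X.mean(axis=0)          # intersection of the variable means
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - centroid
# Squared Mahalanobis distance of each case from the centroid.
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Critical value: chi-square at p < .001, df = number of variables.
critical = chi2.ppf(1 - 0.001, df=X.shape[1])   # about 16.27 for 3 variables
print(np.where(d2 > critical)[0])
```

Note that this sketch computes the centroid and covariance from all cases, including the candidate outlier; some implementations instead recompute them with each case excluded in turn.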
Multivariate outliers can also be identified using leverage, discrepancy, and influence. Leverage is related to Mahalanobis distance but is measured on a different scale, so the χ² distribution does not apply. A large leverage score indicates that a case is farther from the centroid, although it may still lie on the same regression line as the other cases. Discrepancy assesses the extent to which a case is in line with the other cases. Influence is determined by leverage and discrepancy together and assesses the change in the regression coefficients when a case is removed. Cases with influence values greater than 1.00 are likely to be outliers.
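As a sketch of these three diagnostics for an ordinary least squares regression, the following uses statsmodels; the data are hypothetical, and Cook's distance is used here as the influence statistic to which the 1.00 cutoff is commonly applied.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)
x[0], y[0] = 5.0, -10.0            # inject one high-leverage, discrepant case

model = sm.OLS(y, sm.add_constant(x)).fit()
diag = model.get_influence()

leverage = diag.hat_matrix_diag                  # leverage (hat values)
discrepancy = diag.resid_studentized_external    # studentized residuals
cooks_d = diag.cooks_distance[0]                 # influence (Cook's distance)

# Cases with influence greater than 1.00 are likely outliers.
print(np.where(cooks_d > 1.0)[0])
```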