Outliers Unveiled: A Practical Approach to Detection and Solutions
In this discussion, we delve into the intriguing realm of outliers — how to spot them, what signifies their presence, and the optimal strategies for dealing with these data mavericks. However, before venturing into the nuanced art of handling outlier-related challenges, we must acquaint ourselves with what outliers are.
Outlier
An outlier, or maverick, is an observation within a dataset whose pattern or value deviates from the rest of the observations. According to Kleinbaum et al. (2008), an outlier is a rare or unusual observation that appears at one of the extremes of the dataset. Extremes, in other words, are values that differ markedly from the majority of the other values in their group, such as being far too small or far too large.
Let’s illustrate with a simple example: imagine a classroom where students’ exam scores are as follows — 50, 54, 62, 50, 52, 59, 61, 63, 65, 10, 53, 63, 65, 50, 59, 62, 50, 51, 57, 60, 63, 65, 65, 53, 99. Among these 25 students, two scores stand out as extreme: 10 and 99, making them the outliers in this set of grades.
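We can check this intuition numerically. The sketch below applies one common rule of thumb, the 1.5 × IQR "fence" (the same rule boxplot whiskers use); the choice of this particular rule is an assumption for the demo, not something the example above prescribes.

```python
# Flag outliers in the classroom scores with the 1.5 * IQR fence rule.
import numpy as np

scores = np.array([50, 54, 62, 50, 52, 59, 61, 63, 65, 10, 53, 63, 65,
                   50, 59, 62, 50, 51, 57, 60, 63, 65, 65, 53, 99])

q1, q3 = np.percentile(scores, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(outliers.tolist())                   # the two extreme scores: [10, 99]
```

Here the quartiles are 51 and 63, so anything below 33 or above 81 is flagged, which singles out exactly the scores 10 and 99.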
The Emergence of Outliers
Outliers in a dataset can stem from various possibilities, including:
- Errors in the data entry procedure
- Mistakes in measurement or analysis
- Unknown factors that influence respondents’ perspectives, causing a deviation the researcher may not even be aware of
“Outliers are data points that are far from other data points”
Identifying Outliers
1. Scatter Plot
The advantage of this method lies in its ease of understanding: it presents the data visually without involving complex calculations. However, relying solely on a scatter plot to identify outliers is not recommended, since the decision of whether a point is an outlier rests largely on the researcher’s subjective judgment.
2. Boxplot
The boxplot is the second graphical method; it identifies outliers using the quartiles of the data, flagging points that fall more than 1.5 times the interquartile range beyond the first or third quartile.
3. Standardized Residual
As the name suggests, standardized residuals are residuals divided by an estimate of their standard deviation. Their primary advantage is independence from the units of measurement, since all residuals are placed on a common scale. If an observation’s residual is more than three times the residual standard deviation (that is, its standardized residual exceeds 3 in absolute value), that observation can be considered an outlier.
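The check above can be sketched on a simple linear regression. The data here are invented for the demonstration (y = 2x with one value pushed far off), and the simple "residual over residual standard deviation" form is used rather than the leverage-adjusted variant.

```python
# Standardized-residual outlier check on a toy simple linear regression.
import numpy as np

x = np.arange(30, dtype=float)
y = 2.0 * x
y[15] += 40.0                      # plant one outlier

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Residual standard deviation with n - 2 degrees of freedom
# (two estimated parameters: slope and intercept).
s = np.sqrt(np.sum(residuals**2) / (len(x) - 2))
standardized = residuals / s

flagged = np.where(np.abs(standardized) > 3)[0]
print(flagged.tolist())            # only the planted outlier: [15]
```

Note that a single large outlier also inflates s itself, which is why in small samples a genuine outlier can fail to cross the threshold of 3; with 30 observations, as here, the planted point stands out clearly.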
4. Cook's Distance
One outlier detection method is Cook's distance. It measures how much the estimated regression coefficients (the Beta parameters) change when a specific observation is omitted, and thus indicates how strongly an outlier influences the results (Rawlings et al., 1998). The procedure involves calculating the distance for each observation and displaying the values in a plot.
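A sketch of Cook's distance computed the way the text describes it: refit the regression with each observation deleted and measure how much all the fitted values shift. The toy data and the 4/n cutoff are assumptions for the demo (4/n is one common rule of thumb; another is flagging distances above 1).

```python
# Cook's distance via leave-one-out refitting on a toy regression.
import numpy as np

x = np.arange(30, dtype=float)
y = 2.0 * x
y[15] += 40.0                               # plant one outlier

X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
p = X.shape[1]                              # number of parameters
s2 = np.sum((y - fitted) ** 2) / (len(x) - p)

cooks = np.empty(len(x))
for i in range(len(x)):
    keep = np.arange(len(x)) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    # Squared shift of all fitted values when i is deleted, scaled by p * s^2.
    cooks[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * s2)

print(int(np.argmax(cooks)))                # the planted outlier, index 15
```

In practice libraries such as statsmodels compute the same quantity in closed form from residuals and leverages, but the explicit delete-and-refit loop mirrors the definition directly.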
5. DFFITS Method (Difference in Fitted Values)
DFFITS is used to determine the impact of a particular observation on the regression model in terms of its fitted value. Once the position of an outlier (the nth observation) has been identified, the temptation to remove it may arise, with the aim of normalizing the research data. However, this is strongly discouraged, because an outlier observation can carry important information about the dataset; if outlier data are removed anyway, the estimated equations and the resulting conclusions can change substantially. Another suggestion often given for handling outliers is to "transform the data," but this action frequently results in transformed data that still fail to meet the underlying assumptions.
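The DFFITS diagnostic can be sketched in the same leave-one-out style: compare each fitted value from the full model with the fitted value after deleting that observation, scaled as in the usual DFFITS formula. The toy data (y = 2x + sin(x), one planted outlier) and the 2·sqrt(p/n) cutoff are assumptions for the demo.

```python
# DFFITS via leave-one-out refitting on a toy regression.
import numpy as np

x = np.arange(30, dtype=float)
y = 2.0 * x + np.sin(x)                     # mild deterministic "noise"
y[15] += 40.0                               # plant one outlier

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
leverage = np.sum(X @ np.linalg.inv(X.T @ X) * X, axis=1)  # h_ii

dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    resid_i = y[keep] - X[keep] @ beta_i
    s_i = np.sqrt(np.sum(resid_i**2) / (n - 1 - p))   # deleted-residual s
    dffits[i] = (fitted[i] - X[i] @ beta_i) / (s_i * np.sqrt(leverage[i]))

threshold = 2 * np.sqrt(p / n)              # common rule of thumb
print(int(np.argmax(np.abs(dffits))))       # the planted outlier, index 15
```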
Data Transformation is like giving your data a makeover — it’s a strategic effort to transform the scale of raw measurements into a more captivating form, ensuring that your data not only meets but dazzles the underlying assumptions.
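A tiny illustration of what such a makeover does: a log transform pulls in a long right tail so that extreme values sit much closer to the rest of the data. The numbers here are invented for the demo.

```python
# Log transform compressing a long right tail.
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])
transformed = np.log10(data)

print(data.max() / data.min())               # raw spread: a factor of 1000
print(transformed.max() - transformed.min()) # on the log scale: only 3.0
```

As the article warns, compressing the scale this way does not guarantee the transformed data will satisfy normality or other assumptions; that still has to be checked.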
What’s the Solution for Data Outliers?
So, what should a researcher do to tackle this challenge? One approach is to swap our methods for techniques that are less sensitive to outliers, known as non-parametric methods.
Parametric Statistics is the statistical art that considers the type of distribution or data spread — whether the data follows a normal distribution or not. Parametric statistics are used to test hypotheses about measurable variables. In other words, data analyzed using parametric statistics must adhere to the normality assumption.
On the other hand, Non-parametric Statistics covers tests whose models do not impose conditions on the population parameters. Simply put, non-parametric statistics don’t demand measurement requirements as stringent as their parametric counterparts.
For instance, when we want to employ multiple or simple regression analysis but our data contain outliers, we can substitute robust regression. Other examples include replacing the parametric t-test with the non-parametric Mann-Whitney or Wilcoxon test, replacing the parametric F-test with the Kruskal-Wallis test, and so forth.
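The rank-based idea behind such substitutions can be sketched by computing the Mann-Whitney U statistic by hand. The two samples below are invented (and chosen tie-free, since this minimal sketch skips tie handling); a full test would compare U against critical values or a normal approximation, which scipy.stats.mannwhitneyu does in practice.

```python
# Mann-Whitney U statistic computed from ranks (no tie handling).
import numpy as np

a = np.array([50.0, 54.0, 62.0, 52.0, 59.0])   # group 1 (invented)
b = np.array([61.0, 63.0, 65.0, 10.0, 99.0])   # group 2 (invented)

combined = np.concatenate([a, b])
ranks = np.empty(len(combined))
ranks[np.argsort(combined)] = np.arange(1, len(combined) + 1)

n1, n2 = len(a), len(b)
r1 = ranks[:n1].sum()                  # rank sum of group 1
u1 = r1 - n1 * (n1 + 1) / 2            # U statistic for group 1
u2 = n1 * n2 - u1                      # U statistic for group 2

print(u1, u2)                          # U1 + U2 always equals n1 * n2
```

Because the statistic depends only on ranks, the extreme values 10 and 99 in group 2 contribute no more than "smallest" and "largest"; this is exactly why rank-based tests are resistant to outliers.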
That concludes the explanation regarding issues with data identified to contain outliers, hoping it proves beneficial and insightful. Thank you.
References
- Kleinbaum, D., Kupper, L., Nizam, A., & Muller, K. 2008. Applied Regression Analysis and Other Multivariable Methods. USA: Thomson.
- Rawlings, J. O., Pantula, S. G., & Dickey, D. A. 1998. Applied Regression Analysis: A Research Tool, Second Edition. New York: Springer-Verlag.
- Soemartini. 2007. Pencilan (Outlier). Bandung: UNPAD.