Distribution of Variables in Exploratory Data Analysis (EDA) – Univariate Analysis
Definition:
In Exploratory Data Analysis (EDA), univariate analysis is the study of a single variable to understand its distribution, central tendency, and spread. The main objective is to identify how the values are distributed: whether the distribution is symmetric, skewed, uniform, or multimodal.
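As a rough illustration of central tendency, spread, and symmetry for one variable, the sketch below computes a few summary statistics using only Python's standard library. The helper name `describe`, the moment-based skewness formula, and the sample values are assumptions made for this example, not part of the original notes.

```python
import statistics

def describe(values):
    """Return basic univariate summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    # Moment-based skewness: about 0 for symmetric data,
    # positive for a right tail, negative for a left tail.
    if stdev > 0:
        skew = sum((x - mean) ** 3 for x in values) / (n * stdev ** 3)
    else:
        skew = 0.0
    return {"n": n, "mean": mean, "median": median,
            "stdev": stdev, "skewness": skew}

# Made-up sample with one unusually large value
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 20]
summary = describe(sample)
print(summary)
```

In this sample the single large value pulls the mean above the median and gives a positive skewness, which hints at a right-skewed shape.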
Types of Data Distributions
- Normal Distribution (Gaussian Distribution):
A symmetric, bell-shaped curve where most data points cluster around the mean and the probability density falls off equally on both sides.
- Key characteristics: Mean = Median = Mode, symmetry, described by mean (µ) and standard deviation (σ).
- Example: Heights of adults and test scores are often approximately normal.
- Skewed Distribution:
- Left-Skewed (Negative Skew): Tail lies to the left; typically mean < median.
Example: Age at retirement.
- Right-Skewed (Positive Skew): Tail lies to the right; typically mean > median.
Example: Income distribution.
- Uniform Distribution:
All values are equally likely; distribution is flat.
- Example: Rolling a fair die (1–6 outcomes with equal probability).
- Bimodal Distribution:
Distribution with two distinct peaks or modes.
- Example: Test scores in a class with two groups of students (high scorers and low scorers).
- Multimodal Distribution:
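Distribution with more than two peaks. (In the broader sense, "multimodal" covers any distribution with more than one peak, so a bimodal distribution is a special case.)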
- Example: Heights in a mixed population of children, teenagers, and adults.
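The relationship between shape and the mean/median gap described above can be checked on simulated data. The following is a minimal sketch, assuming synthetic samples drawn with Python's `random` module; the chosen generators (Gaussian, exponential, mirrored exponential, uniform) and their parameters are illustrative choices only.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Synthetic samples standing in for the shapes discussed above
samples = {
    "normal": [rng.gauss(50, 10) for _ in range(n)],
    "right_skewed": [rng.expovariate(1.0) for _ in range(n)],
    "left_skewed": [10 - rng.expovariate(1.0) for _ in range(n)],
    "uniform": [rng.uniform(0, 1) for _ in range(n)],
}

gaps = {}
for name, values in samples.items():
    mean = statistics.fmean(values)
    median = statistics.median(values)
    gaps[name] = mean - median  # sign follows the direction of the tail
    print(f"{name:>12}: mean={mean:.3f} median={median:.3f} gap={gaps[name]:+.3f}")
```

For the symmetric samples the gap should be close to zero, while the skewed samples should show a clearly positive (right tail) or negative (left tail) gap. The mean/median comparison is a rule of thumb and says nothing about the number of peaks, so it cannot detect bimodal or multimodal shapes on its own.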
Measuring Distribution in Univariate Analysis
For Numerical Variables:
- Histogram: Graphical representation showing frequencies within ranges (bins). Helps identify shape (normal, skewed, uniform, etc.).
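- Boxplot (Box-and-Whisker Plot): Displays the five-number summary (minimum, Q1, median, Q3, maximum). Useful for showing spread and flagging outliers, which many plotting tools draw as individual points beyond whiskers limited to 1.5 × IQR from the quartiles. A median that is off-center within the box, or whiskers of unequal length, suggests skewness.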
- Density Plot: A smoothed version of a histogram (typically a kernel density estimate), showing the distribution as a continuous curve for clearer shape identification.
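The numbers behind a histogram and a boxplot can be computed directly before any plotting. The sketch below is one possible approach using only the standard library; the function names `histogram_counts`, `five_number_summary`, and `iqr_outliers` are hypothetical, the data is made up, and the 1.5 × IQR rule is a common convention rather than the only one. Quartile conventions also differ between tools, so other software may report slightly different Q1 and Q3 values.

```python
import statistics

def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for x in values:
        idx = min(int((x - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

# Made-up sample with one unusually large value
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 20]
print(histogram_counts(data, bins=3))
print(five_number_summary(data))
print(iqr_outliers(data))
```

With this sample, almost all values land in the first bin and the value 20 is flagged as an outlier, which matches the long right tail a histogram or boxplot of the same data would show.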
For Categorical Variables:
- Bar Chart: Represents frequency of categories with bars; height indicates count or frequency.
- Example: Survey responses for favorite fruit.
- Pie Chart: Represents proportions as circular slices; each slice shows percentage contribution of a category.
- Example: Distribution of students across different majors.
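For a categorical variable, both charts are built from the same two quantities: counts per category (bar heights) and proportions of the total (pie slices). A minimal sketch with `collections.Counter` follows; the survey responses are invented for illustration, and the text bars are only a stand-in for a real chart.

```python
from collections import Counter

# Made-up survey responses for "favorite fruit"
responses = ["apple", "banana", "apple", "mango", "banana", "apple"]

counts = Counter(responses)   # frequency of each category
total = sum(counts.values())
proportions = {category: count / total for category, count in counts.items()}

# Text-only bar chart, most frequent category first
for category, count in counts.most_common():
    bar = "#" * count
    print(f"{category:<8} {bar} {count} ({proportions[category]:.0%})")
```

The counts correspond to bar heights in a bar chart, and the proportions correspond to the slice sizes in a pie chart.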