Numerical summaries

Numerical summaries are statistical measures that describe the level (central tendency) and spread (dispersion) of a single variable (univariate data). Rather than relying on graphs, they condense the dataset into a few numbers that characterize its distribution.

1. Level – Measures of Central Tendency

These describe the typical or central value of the dataset.

Mean (Average):
$\text{Mean} = \frac{\text{Sum of values}}{\text{Number of values}}$
Example: For [10, 20, 30, 40, 50], mean = 30.
Median: Middle value when the data is ordered.
For odd $n$, it is the middle value; for even $n$, it is the average of the two middle values.
Example: For [10, 20, 30, 40, 50], median = 30. For [10, 20, 30, 40], median = (20 + 30) / 2 = 25.
Mode: Most frequently occurring value.
Example: In [2, 2, 3, 4, 4, 4, 5], mode = 4.

👉 These indicate the level (center) of the dataset.
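
As a quick check of the examples above, the three measures can be computed with Python's standard `statistics` module. This is only a minimal sketch: the variable names are illustrative, and the even-length list is a hypothetical input chosen to show the averaging rule for the median.

```python
import statistics

values = [10, 20, 30, 40, 50]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # middle value of the sorted data

# For an even number of values, the median averages the two middle values.
even_median = statistics.median([10, 20, 30, 40])

# Mode: the most frequently occurring value.
mode_value = statistics.mode([2, 2, 3, 4, 4, 4, 5])

print(mean_value, median_value, even_median, mode_value)
```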

2. Spread – Measures of Dispersion

These describe how much the data varies around the center.

Range:
$\text{Range} = \text{Max} - \text{Min}$
Example: For [10, 20, 30, 40, 50], range = 50 - 10 = 40.
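Variance: Average of squared deviations from the mean. The population variance ($\sigma^2$) divides the sum of squared deviations by $n$; the sample variance ($s^2$) divides by $n - 1$.
Standard Deviation: Square root of the variance; shows the typical distance from the mean. It is written σ for a population and s for a sample.
Example: For [10, 20, 30, 40, 50], the sample std. dev. is s ≈ 15.81 (variance 250), while the population std. dev. is σ ≈ 14.14 (variance 200).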
Interquartile Range (IQR):
$IQR = Q3 - Q1$
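Example: For [10, 20, 30, 40, 50], with quartiles found by linear interpolation, Q1 = 20, Q3 = 40 → IQR = 20. Other quartile conventions can give slightly different values.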

👉 These indicate the spread (variability) of the dataset.
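
The spread measures above can be reproduced with the standard `statistics` module, as sketched below. Two assumptions are worth flagging: the value 15.81 quoted above matches the sample standard deviation (divisor n - 1), not the population one, and the quartiles use the module's `"inclusive"` method, which is the convention that yields Q1 = 20 and Q3 = 40 for this data.

```python
import statistics

values = [10, 20, 30, 40, 50]

data_range = max(values) - min(values)

# Population versions divide the sum of squared deviations by n.
pop_variance = statistics.pvariance(values)
pop_std = statistics.pstdev(values)

# Sample versions divide by n - 1.
sample_variance = statistics.variance(values)
sample_std = statistics.stdev(values)

# Quartiles with the "inclusive" method (linear interpolation over the data).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, pop_variance, sample_variance)
print(round(pop_std, 2), round(sample_std, 2), iqr)
```

The default `method="exclusive"` would return different quartiles for such a small dataset, so it helps to state which convention a reported IQR uses.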

3. Percentiles and Quartiles

Percentiles: Divide the ordered data into 100 equal parts (e.g., the 90th percentile is the value below which 90% of the data lies).
Quartiles: Divide data into 4 equal parts.

Q1 = 25th percentile (25% point)
Q2 = 50th percentile (the median)
Q3 = 75th percentile (75% point)
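
To make the link between percentiles and quartiles concrete, here is a small sketch of a percentile function based on linear interpolation between neighbouring sorted values. The function name `percentile` and its interpolation rule are assumptions made for illustration; several percentile conventions exist, and libraries may default to a different one.

```python
import math

def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted data.
    rank = (p / 100) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

values = [10, 20, 30, 40, 50]
q1 = percentile(values, 25)   # first quartile
q2 = percentile(values, 50)   # second quartile (median)
q3 = percentile(values, 75)   # third quartile
p90 = percentile(values, 90)  # 90th percentile
print(q1, q2, q3, p90)
```

Under this convention the quartiles are simply the 25th, 50th, and 75th percentiles of the same data.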

4. Descriptive Statistics with Pandas
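
A minimal sketch of how these summaries might be obtained with pandas is shown below. The Series name `scores` and the data are hypothetical, reusing the values from the earlier examples; `describe()`, `mean()`, `median()`, `std()`, `quantile()`, and `mode()` are standard pandas `Series` methods.

```python
import pandas as pd

# Hypothetical example data, reusing the values from the earlier examples.
scores = pd.Series([10, 20, 30, 40, 50], name="scores")

# describe() reports count, mean, std, min, quartiles, and max in one call.
summary = scores.describe()
print(summary)

# Individual measures are also available as methods.
mean_value = scores.mean()
median_value = scores.median()
sample_std = scores.std()        # sample std (ddof=1) by default
pop_std = scores.std(ddof=0)     # population std
iqr = scores.quantile(0.75) - scores.quantile(0.25)

# mode() returns a Series because several values can tie for most frequent.
modes = pd.Series([2, 2, 3, 4, 4, 4, 5]).mode().tolist()
print(mean_value, median_value, round(sample_std, 2), round(pop_std, 2), iqr, modes)
```

Note that pandas reports the sample standard deviation by default, so pass `ddof=0` when the population version is wanted.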

Advantages in Univariate Analysis

Summarizes large data into meaningful numbers.
Shows both center (level) and variability (spread).
Helps in comparing datasets.
Forms the basis for advanced statistical and machine learning methods.