Aggregation

Definition of Aggregation in Pandas

Aggregation in Pandas is the process of applying one or more mathematical or statistical functions to a dataset (or groups within a dataset) in order to produce a summarized result. It reduces multiple values into a single value, such as computing sum, mean, minimum, maximum, variance, or count over columns or groups of data.

Purpose:

To summarize data with meaningful statistics.
To reduce complexity by replacing raw data with aggregate values.
To identify key patterns (e.g., highest sales, average income).

Types of Aggregation Functions:

Single aggregation: Applying one function (e.g., sum of sales).
Multiple aggregation: Applying multiple functions simultaneously (e.g., mean, min, max of salaries).
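
As a minimal sketch of the difference, assuming a small hypothetical salary table (the name salaries and its values are made up for illustration): a single aggregation yields one value, while agg() with a list of function names yields one result per function.

```python
import pandas as pd

# Hypothetical data, used only for illustration
salaries = pd.DataFrame({'salary': [30000, 45000, 52000, 61000]})

# Single aggregation: one function, one result
total = salaries['salary'].sum()

# Multiple aggregation: several functions applied in one call;
# the result is a Series indexed by function name
summary = salaries['salary'].agg(['mean', 'min', 'max'])

print(total)
print(summary)
```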

Common aggregation functions:

sum() – Computes the sum of column values
min() – Computes the minimum value of a column
max() – Computes the maximum value of a column
mean() – Computes the mean of a column
size – Returns the number of elements; on a DataFrame it is the size attribute (rows × columns), on a GroupBy it is the size() method (rows per group)
describe() – Generates descriptive statistics (count, mean, std, min, quartiles, max)
first() – Returns the first value of each group (a GroupBy method)
last() – Returns the last value of each group (a GroupBy method)
count() – Returns the number of non-null values in a column
std() – Computes the sample standard deviation
var() – Computes the sample variance
sem() – Computes the standard error of the mean

import pandas as pd

# Create a sample DataFrame of marks in four subjects
data = {
    'Maths': [90, 85, 78, 92, 88],
    'English': [75, 80, 65, 70, 85],
    'Science': [88, 92, 85, 95, 90],
    'History': [70, 60, 65, 75, 80]
}
df = pd.DataFrame(data)
print("Original Dataset:\n", df, "\n")

# Applying different aggregation functions
print("Sum of each column:\n", df.sum(), "\n")
print("Minimum of each column:\n", df.min(), "\n")
print("Maximum of each column:\n", df.max(), "\n")
print("Mean of each column:\n", df.mean(), "\n")
print("Size of DataFrame:\n", df.size, "\n")  # total elements (an attribute, not a method)
print("Descriptive Statistics:\n", df.describe(), "\n")

# first() and last() are GroupBy methods; on a plain DataFrame,
# select the first and last rows by position instead
print("First value of each column:\n", df.iloc[0], "\n")
print("Last value of each column:\n", df.iloc[-1], "\n")

print("Count of values in each column:\n", df.count(), "\n")
print("Standard Deviation of each column:\n", df.std(), "\n")
print("Variance of each column:\n", df.var(), "\n")
print("Standard Error of Mean of each column:\n", df.sem(), "\n")

Expected Output (example)


Original Dataset:
   Maths  English  Science  History
0     90       75       88       70
1     85       80       92       60
2     78       65       85       65
3     92       70       95       75
4     88       85       90       80

Sum of each column:
 Maths      433
English    375
Science    450
History    350

Minimum of each column:
 Maths      78
English    65
Science    85
History    60

Maximum of each column:
 Maths      92
English    85
Science    95
History    80

Mean of each column:
 Maths      86.6
English    75.0
Science    90.0
History    70.0

Size of DataFrame:
 20

Descriptive Statistics:
            Maths    English    Science    History
count   5.000000   5.000000   5.000000   5.000000
mean   86.600000  75.000000  90.000000  70.000000
std     5.458938   7.905694   3.807887   7.905694
min    78.000000  65.000000  85.000000  60.000000
25%    85.000000  70.000000  88.000000  65.000000
50%    88.000000  75.000000  90.000000  70.000000
75%    90.000000  80.000000  92.000000  75.000000
max    92.000000  85.000000  95.000000  80.000000

First value of each column:
 Maths      90
English    75
Science    88
History    70

Last value of each column:
 Maths      88
English    85
Science    90
History    80

Count of values in each column:
 Maths      5
English    5
Science    5
History    5

Standard Deviation of each column:
 Maths      5.458938
English    7.905694
Science    3.807887
History    7.905694

Variance of each column:
 Maths      29.8
English    62.5
Science    14.5
History    62.5

Standard Error of Mean of each column:
 Maths      2.441311
English    3.535534
Science    1.702939
History    3.535534
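
The three dispersion measures are related: the standard deviation is the square root of the variance, and the standard error of the mean is the standard deviation divided by the square root of the number of values. The sketch below checks this on the Maths marks, assuming the pandas defaults (sample statistics, ddof=1); the variable names are illustrative only.

```python
import math
import pandas as pd

# The Maths marks from the sample DataFrame
marks = pd.Series([90, 85, 78, 92, 88])

n = marks.count()        # number of non-null values
variance = marks.var()   # sample variance (divides by n - 1)
std_dev = marks.std()    # square root of the sample variance
std_err = marks.sem()    # standard deviation divided by sqrt(n)

# Verify the relationships between the three statistics
assert math.isclose(std_dev, math.sqrt(variance))
assert math.isclose(std_err, std_dev / math.sqrt(n))

print(variance, std_dev, std_err)
```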

Definition of Grouping in Pandas

Grouping in Pandas is a data manipulation process based on the split-apply-combine strategy. It involves:

Splitting the dataset into groups according to one or more keys,
Applying a function (such as aggregation, transformation, or filtering) to each group independently, and
Combining the results into a new data structure.

The groupby() function in Pandas is used to implement grouping operations.

Purpose:

To study patterns within subgroups.
To compare metrics (like mean, sum, count) across different categories.
To simplify large datasets into manageable groups.

Steps in Grouping:

Split – Divide data into groups based on column values.
Apply – Apply a function (e.g., mean, sum, count) on each group.
Combine – Merge results back into a summarized dataset.
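
The three steps can be sketched by hand to show what groupby() does in a single call. This is only an illustration under assumed data: the df_demo frame and its team and score columns are hypothetical, and the manual loop is not how Pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical data: 'team' is the grouping key, 'score' the value column
df_demo = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue', 'red'],
    'score': [10, 20, 30, 40, 80],
})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
pieces = {key: group for key, group in df_demo.groupby('team')}

# Apply: compute the mean score of each piece independently
means = {key: group['score'].mean() for key, group in pieces.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(means, name='score').rename_axis('team')

# The same three steps performed by Pandas in one expression
builtin = df_demo.groupby('team')['score'].mean()

print(manual)
print(builtin)
```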

Example 1 – Grouping by a single column:


df.groupby('Maths').first()

This groups the dataset by the values in the Maths column and returns the first entry of each group. Because every Maths value in the sample data is distinct, each group contains exactly one row.

Example 2 – Grouping by multiple columns:


df.groupby(['Maths', 'Science']).first()

Here, rows are grouped by Maths first and then, within each Maths group, by Science; the result is indexed by both keys.
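
Grouping is easier to see when the keys repeat. The following sketch assumes a hypothetical students table with Class and Section columns (neither appears in the sample data) and applies first() with one key and with two keys.

```python
import pandas as pd

# Hypothetical records; 'Class' and 'Section' are illustrative grouping keys
students = pd.DataFrame({
    'Class':   ['X', 'X', 'X', 'XI', 'XI'],
    'Section': ['A', 'A', 'B', 'A', 'A'],
    'Maths':   [90, 85, 78, 92, 88],
})

# One key: one row per distinct Class, holding the first values seen
by_class = students.groupby('Class').first()

# Two keys: one row per distinct (Class, Section) pair,
# indexed by a two-level MultiIndex
by_class_section = students.groupby(['Class', 'Section']).first()

print(by_class)
print(by_class_section)
```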

Aggregation with Grouping

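Once the data is grouped, we can apply aggregation functions to each group. In the examples below, A, B, and C are placeholder column names; they are not columns of the sample DataFrame above.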

Example 3 – Aggregating a group:


df.groupby('A').agg('min')

This computes the minimum of each column for every group in column A.

Example 4 – Multiple aggregations:


df.groupby('A').agg(['min', 'max'])

Applies both min and max to each group.

Example 5 – Column-specific aggregation:


df.groupby('A')['B'].agg(['min', 'max'])

Applies min and max only on column B.

Example 6 – Different aggregations per column:


df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})

Applies min and max on column B and sum on column C.
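
To make Examples 3 to 6 runnable, the sketch below builds a small hypothetical DataFrame with columns named A, B, and C (the values are invented for illustration) and applies each form of agg() to it.

```python
import pandas as pd

# Hypothetical data matching the placeholder names A, B and C
df_abc = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y'],
    'B': [1, 4, 2, 8],
    'C': [10, 20, 30, 40],
})

grouped = df_abc.groupby('A')

# Example 3: one function applied to every non-key column
mins = grouped.agg('min')

# Example 4: several functions; the columns become a (column, function) MultiIndex
min_max = grouped.agg(['min', 'max'])

# Example 5: several functions on column B only
b_only = grouped['B'].agg(['min', 'max'])

# Example 6: a different set of functions per column
mixed = grouped.agg({'B': ['min', 'max'], 'C': 'sum'})

print(mins)
print(min_max)
print(b_only)
print(mixed)
```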

Example 7 – Grouping and summing data:


df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': range(6)
})
print(df.groupby('key').sum())

Output:

     data
key
A       3
B       5
C       7

Refer to the group examples:

https://chatgpt.com/share/68cdac39-7ff8-8001-93ef-4ada45c69bcb

https://chatgpt.com/share/68cdaf03-4e98-800f-800f-87ebe00a740d

https://colab.research.google.com/drive/1k1KQIMViqBYN-iU1e--hgmHymy1ILYU9?usp=sharing