Aggregation

Definition of Aggregation in Pandas

Aggregation in Pandas is the process of applying one or more mathematical or statistical functions to a dataset (or groups within a dataset) in order to produce a summarized result. It reduces multiple values into a single value, such as computing sum, mean, minimum, maximum, variance, or count over columns or groups of data.

  • Purpose:

    • To summarize data with meaningful statistics.

    • To reduce complexity by replacing raw data with aggregate values.

    • To identify key patterns (e.g., highest sales, average income).

  • Types of Aggregation Functions:

    • Single aggregation: Applying one function (e.g., sum of sales).

    • Multiple aggregation: Applying multiple functions simultaneously (e.g., mean, min, max of salaries).

  • Common Aggregation Functions:

    • sum() – Computes the sum of column values

    • min() – Computes the minimum value of a column

    • max() – Computes the maximum value of a column

    • mean() – Computes the mean of a column

    • size() – Returns the number of rows in each group (GroupBy.size()); on a DataFrame, size is an attribute giving the total number of elements

    • describe() – Generates descriptive statistics

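    • first() – Returns the first non-null value of each column within a group (GroupBy.first()); on an ungrouped DataFrame, use df.iloc[0]

    • last() – Returns the last non-null value of each column within a group (GroupBy.last()); on an ungrouped DataFrame, use df.iloc[-1]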

    • count() – Returns the count of values in a column

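    • std() – Computes the sample standard deviation (ddof=1 by default)

    • var() – Computes the sample variance (ddof=1 by default)

    • sem() – Computes the standard error of the mean (standard deviation divided by the square root of the count)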
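
As a minimal sketch of the single versus multiple aggregation distinction listed above, the snippet below uses DataFrame.agg. The scores frame and its values are made up for illustration and are not part of the notes.

```python
import pandas as pd

# Hypothetical sample data, chosen only for this illustration
scores = pd.DataFrame({
    'Maths': [90, 85, 78],
    'English': [75, 80, 65],
})

# Single aggregation: one function gives one value per column (a Series)
totals = scores.agg('sum')

# Multiple aggregation: several functions at once
# (a DataFrame with one row per function)
summary = scores.agg(['mean', 'min', 'max'])

print(totals)
print(summary)
```

Passing a list to agg is what turns the result from a Series into a table with one row per function.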




    import pandas as pd

    # Create sample DataFrame
    data = {
        'Maths': [90, 85, 78, 92, 88],
        'English': [75, 80, 65, 70, 85],
        'Science': [88, 92, 85, 95, 90],
        'History': [70, 60, 65, 75, 80]
    }

    df = pd.DataFrame(data)
    print("Original Dataset:\n", df, "\n")

    # Applying different aggregation functions
    print("Sum of each column:\n", df.sum(), "\n")
    print("Minimum of each column:\n", df.min(), "\n")
    print("Maximum of each column:\n", df.max(), "\n")
    print("Mean of each column:\n", df.mean(), "\n")
    print("Size of DataFrame:\n", df.size, "\n")   # total elements
    print("Descriptive Statistics:\n", df.describe(), "\n")
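    # DataFrame.first()/last() expect a date offset and a DatetimeIndex,
    # so positional indexing is used to get the first and last rows instead
    print("First value of each column:\n", df.iloc[0], "\n")
    print("Last value of each column:\n", df.iloc[-1], "\n")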
    print("Count of values in each column:\n", df.count(), "\n")
    print("Standard Deviation of each column:\n", df.std(), "\n")
    print("Variance of each column:\n", df.var(), "\n")
    print("Standard Error of Mean of each column:\n", df.sem(), "\n")

    Expected Output (example)

    Original Dataset:
        Maths  English  Science  History
    0     90       75       88       70
    1     85       80       92       60
    2     78       65       85       65
    3     92       70       95       75
    4     88       85       90       80

    Sum of each column:
    Maths      433
    English    375
    Science    450
    History    350

    Minimum of each column:
    Maths      78
    English    65
    Science    85
    History    60

    Maximum of each column:
    Maths      92
    English    85
    Science    95
    History    80

    Mean of each column:
    Maths      86.6
    English    75.0
    Science    90.0
    History    70.0

    Size of DataFrame:
    20

    Descriptive Statistics:
               Maths    English    Science    History
    count   5.000000   5.000000   5.000000   5.000000
    mean   86.600000  75.000000  90.000000  70.000000
    std     5.458938   7.905694   3.807887   7.905694
    min    78.000000  65.000000  85.000000  60.000000
    25%    85.000000  70.000000  88.000000  65.000000
    50%    88.000000  75.000000  90.000000  70.000000
    75%    90.000000  80.000000  92.000000  75.000000
    max    92.000000  85.000000  95.000000  80.000000

    First value of each column:
    Maths      90
    English    75
    Science    88
    History    70

    Last value of each column:
    Maths      88
    English    85
    Science    90
    History    80

    Count of values in each column:
    Maths      5
    English    5
    Science    5
    History    5

    Standard Deviation of each column:
    Maths      5.458938
    English    7.905694
    Science    3.807887
    History    7.905694

    Variance of each column:
    Maths      29.8
    English    62.5
    Science    14.5
    History    62.5

    Standard Error of Mean of each column:
    Maths      2.441311
    English    3.535534
    Science    1.702939
    History    3.535534
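
The variance, standard deviation, and standard error reported by pandas are tied together by simple formulas. The following is a small check on the Maths column from the sample data, assuming pandas' default of ddof=1 (sample statistics); NumPy is used here only for the square root.

```python
import numpy as np
import pandas as pd

# Maths marks from the sample dataset above
maths = pd.Series([90, 85, 78, 92, 88], name='Maths')

n = maths.count()
variance = maths.var()   # sample variance (ddof=1)
std_dev = maths.std()    # square root of the sample variance
std_err = maths.sem()    # standard deviation divided by sqrt(n)

# The three statistics should agree with each other up to rounding
print(np.isclose(std_dev, np.sqrt(variance)))
print(np.isclose(std_err, std_dev / np.sqrt(n)))
```

Recomputing one column by hand like this is a quick way to catch a mistyped figure in a results table.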


    Definition of Grouping in Pandas

    Grouping in Pandas is a data manipulation process based on the split-apply-combine strategy. It involves:

    1. Splitting the dataset into groups according to one or more keys,

    2. Applying a function (such as aggregation, transformation, or filtering) to each group independently, and

    3. Combining the results into a new data structure.

    The groupby() function in Pandas is used to implement grouping operations.
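
The three kinds of per-group function mentioned above (aggregation, transformation, filtering) behave differently in what they return. A minimal sketch, using a made-up marks table whose column names and values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: two classes with three marks each
marks = pd.DataFrame({
    'class': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'marks': [60, 70, 80, 40, 50, 45],
})
grouped = marks.groupby('class')['marks']

# Aggregation: one value per group
class_mean = grouped.mean()

# Transformation: one value per original row
# (here, each mark's deviation from its own class mean)
deviation = marks['marks'] - grouped.transform('mean')

# Filtering: keep only the rows of groups that satisfy a condition
strong = marks.groupby('class').filter(lambda g: g['marks'].mean() >= 60)

print(class_mean)
print(deviation)
print(strong)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtering drops whole groups.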

    Purpose:

    • To study patterns within subgroups.

    • To compare metrics (like mean, sum, count) across different categories.

    • To simplify large datasets into manageable groups.

    Steps in Grouping:

    1. Split – Divide data into groups based on column values.

    2. Apply – Apply a function (e.g., mean, sum, count) on each group.

    3. Combine – Merge results back into a summarized dataset.
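
To make the three steps concrete, the sketch below performs split, apply, and combine by hand and then compares the result with a single groupby call. The sales table is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'amount': [100, 200, 150, 50],
})

# 1. Split: one sub-frame per distinct region
pieces = {r: sales[sales['region'] == r] for r in sales['region'].unique()}

# 2. Apply: reduce each piece to a single number
totals = {r: part['amount'].sum() for r, part in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(totals, name='amount').rename_axis('region')

# groupby performs all three steps in one call
direct = sales.groupby('region')['amount'].sum()

print(manual)
print(direct)
```

Both Series hold the same totals per region; groupby simply hides the bookkeeping.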


    Example 1 – Grouping by a single column:

    df.groupby('Maths').first()

    This groups the dataset by the values in the Maths column and returns the first row of each group. Every Maths value in the sample data is distinct, so each group here contains exactly one row.

    Example 2 – Grouping by multiple columns:

    df.groupby(['Maths', 'Science']).first()

    Here, each distinct (Maths, Science) pair forms one group, and the result is indexed by a MultiIndex built from the two keys.
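
Grouping is easier to see when keys repeat. The following sketch uses a hypothetical students table (the section, gender, and marks columns are assumptions, not part of the sample dataset) to show single-key and two-key grouping side by side.

```python
import pandas as pd

# Hypothetical data with repeated key values
students = pd.DataFrame({
    'section': ['A', 'A', 'B', 'B', 'A'],
    'gender':  ['F', 'M', 'F', 'F', 'M'],
    'marks':   [90, 85, 78, 92, 88],
})

# Single key: one group per section; first() keeps the first row of each
by_section = students.groupby('section').first()

# Two keys: one group per (section, gender) pair;
# the result is indexed by a MultiIndex of both keys
by_both = students.groupby(['section', 'gender']).first()

print(by_section)
print(by_both)
```

Only key combinations that actually occur in the data appear in the result, so there is no (B, M) row here.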


    Aggregation with Grouping

    Once the data is grouped, we can apply aggregation functions. The examples below assume a DataFrame df with a grouping column A and numeric columns B and C.

    Example 3 – Aggregating a group:

    df.groupby('A').agg('min')

    This computes the minimum of each column for every group in column A.

    Example 4 – Multiple aggregations:

    df.groupby('A').agg(['min', 'max'])

    Applies both min and max to each group.

    Example 5 – Column-specific aggregation:

    df.groupby('A')['B'].agg(['min', 'max'])

    Applies min and max only on column B.

    Example 6 – Different aggregations per column:

    df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})

    Applies min and max on column B and sum on column C.
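
Examples 3 to 6 can be run together on a small frame. The df_abc frame below is a hypothetical stand-in with the column names A, B, and C used in those examples; its values are made up.

```python
import pandas as pd

# Hypothetical frame with the column names used in Examples 3 to 6
df_abc = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y'],
    'B': [1, 4, 2, 8],
    'C': [10, 20, 30, 40],
})

# Example 3: minimum of every column per group
per_group_min = df_abc.groupby('A').agg('min')

# Example 4: min and max of every column per group
min_and_max = df_abc.groupby('A').agg(['min', 'max'])

# Example 5: min and max of column B only
b_only = df_abc.groupby('A')['B'].agg(['min', 'max'])

# Example 6: different functions for different columns
mixed = df_abc.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})

# With lists of functions, the result columns form a MultiIndex
print(mixed.columns.tolist())
print(mixed)
```

When the two-level column labels are inconvenient, selecting a single column before aggregating (as in Example 5) keeps the result flat.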

    Example 7 – Grouping and summing data:

    df = pd.DataFrame({
        'key': ['A', 'B', 'C', 'A', 'B', 'C'],
        'data': range(6)
    })
    print(df.groupby('key').sum())

    Output:

         data
    key
    A       3
    B       5
    C       7



    Refer to the group examples:

    https://chatgpt.com/share/68cdac39-7ff8-8001-93ef-4ada45c69bcb


    https://chatgpt.com/share/68cdaf03-4e98-800f-800f-87ebe00a740d


    https://colab.research.google.com/drive/1k1KQIMViqBYN-iU1e--hgmHymy1ILYU9?usp=sharing