Pandas GroupBy Defined With Examples

# Introduction

Pandas is likely one of the hottest Python libraries for knowledge evaluation. It offers you easy instruments for cleansing, reshaping, summarizing, and exploring structured knowledge. One of the crucial helpful options in pandas is GroupBy. It helps you reply questions that require grouping rows by a number of classes.

For instance, if you’re working with gross sales knowledge, you might wish to calculate whole income by area, common order worth by product class, or the variety of orders dealt with by every gross sales consultant. As a substitute of manually filtering every class one after the other, GroupBy enables you to carry out these calculations in a clear and environment friendly manner.

On this tutorial, we’ll stroll by means of sensible examples of utilizing Pandas GroupBy with a small gross sales dataset. I’m utilizing Deepnote because the coding atmosphere, so some outputs are proven as pocket book screenshots instantly underneath the code blocks.

# Making a Pattern Dataset

Earlier than utilizing GroupBy, we first create a small retail gross sales dataset with columns resembling order_id, area, class, sales_rep, items, unit_price, low cost, and order_date. We then convert the dictionary right into a pandas DataFrame and create two new columns: gross_sales and net_sales.

knowledge = {
    "order_id": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],
    "area": ["North", "South", "North", "West", "South", "West", "North", "South", "West", "North", "South", "West"],
    "class": ["Electronics", "Furniture", "Electronics", "Furniture", "Clothing", "Electronics",
                 "Clothing", "Furniture", "Clothing", "Furniture", "Electronics", "Clothing"],
    "sales_rep": ["Ayesha", "Bilal", "Ayesha", "Chen", "Bilal", "Chen",
                  "Ayesha", "Bilal", "Chen", "Ayesha", "Bilal", "Chen"],
    "items": [2, 1, 3, 2, 5, 4, 6, 2, 7, 1, 2, 8],
    "unit_price": [500, 800, 450, 700, 60, 550, 55, 850, 65, 750, 520, 70],
    "low cost": [0.05, 0.10, 0.00, 0.08, 0.00, 0.12, 0.05, 0.10, 0.00, 0.07, 0.03, 0.00],
    "order_date": pd.to_datetime([
        "2026-01-05", "2026-01-06", "2026-01-08", "2026-01-10",
        "2026-01-12", "2026-01-15", "2026-02-02", "2026-02-05",
        "2026-02-08", "2026-02-12", "2026-02-15", "2026-02-20"
    ])
}

df = pd.DataFrame(knowledge)

df["gross_sales"] = df["units"] * df["unit_price"]
df["net_sales"] = df["gross_sales"] * (1 - df["discount"])

df

The gross_sales column is calculated by multiplying items by unit_price, whereas net_sales adjusts that worth after making use of the low cost. This provides us a clear dataset that we will use for all GroupBy examples.

# Utilizing the Primary GroupBy Syntax

Essentially the most primary GroupBy operation follows a easy sample: choose a grouping column, choose the worth column, and apply an aggregation operate. On this instance, we group the information by area and calculate the overall net_sales for every area.

df.groupby("area")["net_sales"].sum()

The consequence exhibits that North, South, and West every have their very own whole gross sales worth. That is the best and most typical use case for GroupBy when summarizing knowledge.

area
North    3311.0
South    3558.8
West     4239.0
Identify: net_sales, dtype: float64

# Utilizing GroupBy With `as_index=False`

By default, pandas makes use of the grouped column because the index within the output. Whereas that is helpful in some instances, it’s usually simpler to work with a traditional DataFrame the place the grouped column stays an everyday column. That’s the place as_index=False is beneficial.

df.groupby("area", as_index=False)["net_sales"].sum()

On this instance, we once more calculate whole web gross sales by area, however the result’s returned as a clear DataFrame, which is less complicated to export, merge, or use in stories.

# Making use of A number of Aggregations on One Column

GroupBy is just not restricted to a single calculation. You may apply a number of aggregation features to the identical column utilizing agg().

On this instance, we calculate the sum, imply, minimal, most, and depend of net_sales for every area.

This provides us a fast statistical abstract of regional gross sales efficiency and helps us examine not solely whole income but in addition common order dimension and order quantity.

df.groupby("area")["net_sales"].agg(["sum", "mean", "min", "max", "count"])

# Utilizing Named Aggregations

Named aggregations make GroupBy outputs simpler to learn and use. As a substitute of returning generic column names like sum or imply, we outline our personal names resembling total_sales, average_order_value, total_units, and number_of_orders.

That is particularly useful when getting ready evaluation for dashboards, stories, or tutorials as a result of the output column names clearly clarify what every metric represents.

region_summary = (
    df.groupby("area", as_index=False)
      .agg(
          total_sales=("net_sales", "sum"),
          average_order_value=("net_sales", "imply"),
          total_units=("items", "sum"),
          number_of_orders=("order_id", "depend")
      )
)

region_summary

# Grouping by A number of Columns

It’s also possible to group knowledge by a couple of column. On this instance, we group by each area and class to calculate whole web gross sales for every product class inside every area.

This provides us a extra detailed view of the information in comparison with grouping by area alone. Multi-column grouping is beneficial whenever you wish to analyze efficiency throughout totally different dimensions, resembling area and product, division and worker, or month and buyer phase.

df.groupby(["region", "category"], as_index=False)["net_sales"].sum()

# Sorting GroupBy Outcomes

After grouping and aggregating knowledge, you usually wish to type the outcomes to seek out the best or lowest values.

On this instance, we calculate whole gross sales by product class after which type the ends in descending order.

This makes it simple to determine which class generated essentially the most income. Sorting grouped outcomes is a straightforward however highly effective step when turning uncooked summaries into helpful insights.

category_sales = (
    df.groupby("class", as_index=False)
      .agg(total_sales=("net_sales", "sum"))
      .sort_values("total_sales", ascending=False)
)

category_sales

# Understanding Depend vs Dimension

Pandas offers each depend() and dimension(), however they aren’t precisely the identical. The dimension() technique counts the overall variety of rows in every group, together with rows with lacking values. The depend() technique counts solely non-missing values in a specific column.

On this instance, we deliberately add a lacking worth to the sales_rep column. The output exhibits that dimension() nonetheless counts 4 rows for every area, whereas depend() returns three for North as a result of one sales_rep worth is lacking.

import numpy as np

df_missing = df.copy()
df_missing.loc[2, "sales_rep"] = np.nan

print("Utilizing dimension():")
show(df_missing.groupby("area").dimension())

print("Utilizing depend() on sales_rep:")
show(df_missing.groupby("area")["sales_rep"].depend())

Output:

Utilizing dimension():
area
North    4
South    4
West     4
dtype: int64

Utilizing depend() on sales_rep:
area
North    3
South    4
West     4
Identify: sales_rep, dtype: int64

# Utilizing `remodel()` for Group-Stage Options

The remodel() technique is beneficial whenever you wish to calculate a group-level worth and add it again to the unique DataFrame.

On this instance, we calculate whole gross sales for every area and retailer it in a brand new column known as region_total_sales.

We then calculate every order’s share of its area’s whole gross sales. In contrast to agg(), which reduces the information to 1 row per group, remodel() returns values aligned with the unique rows, making it very helpful for characteristic engineering.

df["region_total_sales"] = df.groupby("area")["net_sales"].remodel("sum")
df["order_share_of_region"] = df["net_sales"] / df["region_total_sales"]

df[["order_id", "region", "net_sales", "region_total_sales", "order_share_of_region"]]

# Filtering Teams With `filter()`

The filter() technique enables you to maintain or take away whole teams based mostly on a situation. On this instance, we maintain solely the areas the place whole web gross sales are higher than 3,000.

As a substitute of returning one abstract row per group, filter() returns the unique rows from the teams that meet the situation. That is helpful whenever you wish to take away low-performing teams or maintain solely teams that fulfill a enterprise rule.

high_sales_regions = df.groupby("area").filter(lambda group: group["net_sales"].sum() > 3000)

high_sales_regions

# Making use of Customized Logic With `apply()`

The apply() technique offers you extra flexibility as a result of it means that you can run customized logic on every group.

On this instance, we use apply() with nlargest() to seek out the highest order by web gross sales in every area. That is helpful when built-in aggregation features will not be sufficient on your evaluation.

Nevertheless, apply() may be slower than built-in strategies like sum(), imply(), agg(), and remodel(), so it’s best to make use of it solely whenever you want customized group-wise operations.

top_order_by_region = (
    df.groupby("area", group_keys=False)
      .apply(lambda group: group.nlargest(1, "net_sales"))
)

top_order_by_region

# Grouping by Dates

GroupBy can also be very helpful for time-based evaluation.

On this instance, we extract the month from the order_date column and group the information by month.

We then calculate whole gross sales and whole orders for every month. This strategy is useful when analyzing developments over time, resembling month-to-month gross sales, weekly person exercise, or yearly income development.

df["month"] = df["order_date"].dt.to_period("M").astype(str)

monthly_sales = (
    df.groupby("month", as_index=False)
      .agg(total_sales=("net_sales", "sum"), total_orders=("order_id", "depend"))
)

monthly_sales

# Grouping by Dates With `pd.Grouper`

pd.Grouper offers a cleaner solution to group time collection knowledge with out manually making a separate month column.

On this instance, we group the DataFrame by order_date utilizing a month-to-month frequency and calculate whole gross sales and whole orders.

That is particularly helpful when working with real-world datasets that comprise timestamps and also you wish to summarize knowledge by day, week, month, quarter, or yr.

monthly_sales_grouper = (
    df.groupby(pd.Grouper(key="order_date", freq="M"))
      .agg(total_sales=("net_sales", "sum"), total_orders=("order_id", "depend"))
      .reset_index()
)

monthly_sales_grouper

# Making a Pivot-Type Abstract With GroupBy

You may mix groupby() with unstack() to create a pivot-style abstract desk.

On this instance, we group the information by area and class, calculate whole web gross sales, after which reshape the consequence in order that classes develop into columns. This makes the output simpler to check throughout areas and classes. It’s a nice approach whenever you need a compact desk for reporting or fast evaluation.

region_category_table = (
    df.groupby(["region", "category"])["net_sales"]
      .sum()
      .unstack(fill_value=0)
)

region_category_table

# Conclusion

Pandas GroupBy is likely one of the strongest instruments for knowledge evaluation in Python. It helps you summarize knowledge, examine teams, create new options, filter outcomes, and apply customized calculations with out writing pointless handbook logic.

Whereas engaged on this tutorial, I noticed how a lot depth there may be in GroupBy. Even after working with knowledge for years, I discovered new and higher methods to unravel frequent issues. Options like pd.Grouper, customized aggregation features, and remodel() stood out as a result of they make many duties quicker, cleaner, and simpler to take care of.

That is additionally why understanding the native instruments issues. It’s tempting to depend on vibe coding or fast customized options, however these can usually produce slower, extra sophisticated code. When you realize what pandas already offers, you possibly can write options which are extra environment friendly, reusable, and sensible for real-world knowledge evaluation.

On this tutorial, we coated essentially the most helpful GroupBy operations, together with primary aggregation, named aggregation, multi-column grouping, sorting, depend() vs dimension(), remodel(), filter(), apply(), date grouping, and pivot-style summaries. When you perceive these patterns, you need to use GroupBy to reply many real-world knowledge evaluation questions rapidly and confidently.

Abid Ali Awan (@1abidaliawan) is an authorized knowledge scientist skilled who loves constructing machine studying fashions. At present, he’s specializing in content material creation and writing technical blogs on machine studying and knowledge science applied sciences. Abid holds a Grasp’s diploma in expertise administration and a bachelor’s diploma in telecommunication engineering. His imaginative and prescient is to construct an AI product utilizing a graph neural community for college students combating psychological sickness.

Pandas GroupBy Defined With Examples

# Introduction

# Making a Pattern Dataset

# Utilizing the Primary GroupBy Syntax

# Utilizing GroupBy With `as_index=False`

# Making use of A number of Aggregations on One Column

# Utilizing Named Aggregations

# Grouping by A number of Columns

# Sorting GroupBy Outcomes

# Understanding Depend vs Dimension

# Utilizing `remodel()` for Group-Stage Options

# Filtering Teams With `filter()`

# Making use of Customized Logic With `apply()`

# Grouping by Dates

# Grouping by Dates With `pd.Grouper`

# Making a Pivot-Type Abstract With GroupBy

# Conclusion

About The Author

Admin

Leave a reply Cancel reply

Recent Posts

Recent Comments

Contact Details

Quick Links

Pandas GroupBy Defined With Examples

# Introduction

# Making a Pattern Dataset

# Utilizing the Primary GroupBy Syntax

# Utilizing GroupBy With as_index=False

# Making use of A number of Aggregations on One Column

# Utilizing Named Aggregations

# Grouping by A number of Columns

# Sorting GroupBy Outcomes

# Understanding Depend vs Dimension

# Utilizing remodel() for Group-Stage Options

# Filtering Teams With filter()

# Making use of Customized Logic With apply()

# Grouping by Dates

# Grouping by Dates With pd.Grouper

# Making a Pivot-Type Abstract With GroupBy

# Conclusion

About The Author

Admin

Related Posts

How AI Brokers Will Rework Knowledge Science Work in 2026

Tips on how to Write to Recordsdata in Python: A Newbie’s Information

Planning for AI Induced Financial Volatility – AI Time Journal

Artificial Time Journal: 7 Important Steps for a Powerful Cloud Migration and Modernization

Leave a reply Cancel reply

Recent Posts

Recent Comments

# Utilizing GroupBy With `as_index=False`

# Utilizing `remodel()` for Group-Stage Options

# Filtering Teams With `filter()`

# Making use of Customized Logic With `apply()`

# Grouping by Dates With `pd.Grouper`