Data Analysis with Pandas in Python: Intermediate Guide
This tutorial assumes you already know basic Python and have used Pandas at least once. The goal is to move from “I can load a CSV” to confidently performing intermediate analysis: cleaning, reshaping, grouping, joining, time series handling, and building reproducible pipelines.
1) Setup: environment, versions, and display options
Install and verify versions
Use a virtual environment if possible.
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows PowerShell
python -m pip install --upgrade pip
python -m pip install pandas numpy pyarrow openpyxl matplotlib seaborn
Check versions:
python -c "import pandas as pd, numpy as np; print(pd.__version__); print(np.__version__)"
Recommended display settings
These settings help you inspect data without constantly truncating important information.
import pandas as pd
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)
pd.set_option("display.max_colwidth", 80)
pd.set_option("display.float_format", "{:,.4f}".format)
2) Loading data: CSV, Excel, Parquet, and common pitfalls
CSV with explicit dtypes and missing values
A common mistake is letting Pandas infer types incorrectly (IDs become integers, leading zeros lost; dates remain strings; mixed columns become object). Prefer explicit control.
import pandas as pd
df = pd.read_csv(
"sales.csv",
dtype={
"order_id": "string",
"customer_id": "string",
"country": "category",
},
parse_dates=["order_date"],
na_values=["", "NA", "N/A", "null", "None"],
keep_default_na=True,
)
Key ideas:
- dtype= prevents silent type changes.
- parse_dates= converts to datetime early, enabling time-series operations.
- na_values= standardizes missing values.
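The leading-zeros pitfall is easy to demonstrate. A minimal sketch with an in-memory CSV (the order_id values here are made up for illustration):

```python
import io
import pandas as pd

csv_text = "order_id,revenue\n00123,10.5\n00456,20.0\n"

inferred = pd.read_csv(io.StringIO(csv_text))
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"order_id": "string"})

print(inferred["order_id"].tolist())   # [123, 456]: leading zeros are gone
print(explicit["order_id"].tolist())   # ['00123', '00456']
```

Once the zeros are stripped by integer inference, no later cast can recover them, which is why the dtype belongs in the read call itself.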
Excel
df_xl = pd.read_excel(
"sales.xlsx",
sheet_name="Orders",
dtype={"order_id": "string"},
engine="openpyxl",
)
Parquet (fast, typed, compressed)
Parquet is often the best format for analytics pipelines.
df.to_parquet("sales.parquet", index=False)
df_parquet = pd.read_parquet("sales.parquet")
If you installed pyarrow, Parquet will be fast and preserve types well.
3) First inspection: shape, schema, and sanity checks
Basic inspection
df.shape
df.head(10)
df.tail(5)
df.sample(5, random_state=42)
df.columns
df.dtypes
df.info()
Summaries and missingness
df.describe(include="all")
df.isna().sum().sort_values(ascending=False)
df.nunique().sort_values(ascending=False)
Interpretation tips:
- If describe(include="all") shows unique and top, you likely have categorical/object columns.
- A column with many unique values might be an ID; a column with very few might be a category.
- High missingness may require imputation, dropping, or careful modeling decisions.
4) Core data types: why they matter
Strings vs objects
Pandas historically used object for text, but string dtype is usually better.
df["customer_id"] = df["customer_id"].astype("string")
Categories for repeated labels
category saves memory and can speed up groupby operations.
df["country"] = df["country"].astype("category")
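The memory claim is easy to check yourself. A quick comparison on a synthetic column with a few repeated labels:

```python
import pandas as pd

# A low-cardinality column: 3 distinct labels repeated many times
s_obj = pd.Series(["US", "DE", "FR"] * 100_000, dtype="object")
s_cat = s_obj.astype("category")

# category stores small integer codes plus one copy of each label
print(s_obj.memory_usage(deep=True))
print(s_cat.memory_usage(deep=True))
```

The larger the ratio of rows to distinct labels, the bigger the savings; for near-unique columns (like IDs), category can actually cost more.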
Datetimes and time zones
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["order_date_utc"] = df["order_date"].dt.tz_localize("UTC", nonexistent="shift_forward", ambiguous="NaT")
Use errors="coerce" to convert invalid strings into NaT (missing datetime), then inspect how many were invalid:
df["order_date"].isna().sum()
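A self-contained sketch of the coerce-then-inspect pattern, using made-up date strings:

```python
import pandas as pd

raw = pd.Series(["2025-01-15", "not a date", "2025-02-01"])
parsed = pd.to_datetime(raw, errors="coerce", format="%Y-%m-%d")

# The invalid string became NaT; count how many values failed to parse
print(int(parsed.isna().sum()))  # 1
```

If the invalid count is unexpectedly high, inspect the offending raw values (e.g., raw[parsed.isna()]) before deciding whether the format string or the data is at fault.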
5) Indexing and selecting data (beyond the basics)
.loc vs .iloc vs boolean masks
- .loc is label-based (slices include both endpoints).
- .iloc is position-based (end-exclusive, like Python slicing).
- Boolean masks are the workhorse for filtering.
# Boolean filtering
us = df[df["country"] == "US"]
# Multiple conditions: use parentheses and & / |
recent_us = df[(df["country"] == "US") & (df["order_date"] >= "2025-01-01")]
# Select columns
subset = df.loc[:, ["order_id", "order_date", "country", "revenue"]]
# Position-based selection
first_100 = df.iloc[:100, :5]
Avoid chained indexing
Chained indexing can trigger the confusing SettingWithCopyWarning, and the assignment may land on a temporary copy, silently leaving the original DataFrame unchanged.
Bad:
df[df["country"] == "US"]["revenue"] = 0
Good:
mask = df["country"] == "US"
df.loc[mask, "revenue"] = 0
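On a tiny synthetic frame, the mask-then-.loc pattern looks like this:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE"], "revenue": [100.0, 200.0]})

mask = df["country"] == "US"
df.loc[mask, "revenue"] = 0  # writes directly into df, no intermediate copy

print(df["revenue"].tolist())  # [0.0, 200.0]
```

Because .loc performs the selection and the assignment in a single operation, there is no intermediate object for the write to get lost in.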
6) Cleaning data: duplicates, missing values, and inconsistent text
Duplicates
Decide what “duplicate” means: entire rows, or duplicates by a subset of columns.
df.duplicated().sum()
df = df.drop_duplicates()
# Deduplicate by order_id, keeping the latest order_date
df = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")
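A sketch of the sort-then-dedupe pattern on made-up orders, showing which row survives:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": ["A1", "A1", "B2"],
    "order_date": pd.to_datetime(["2025-01-01", "2025-03-01", "2025-02-01"]),
})

# Sorting by date first means keep="last" retains the newest row per order_id
deduped = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")

print(len(deduped))  # 2
```

The sort is what gives keep="last" its meaning here; without it, "last" just means "last in the original row order".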
Missing values strategies
First, measure missingness:
missing_rate = df.isna().mean().sort_values(ascending=False)
missing_rate.head(10)
Then decide:
- Drop rows if missingness is small and random.
- Fill with a constant for categories (e.g., "Unknown").
- Use statistical imputation for numeric columns (the median is robust to outliers).
df["country"] = df["country"].cat.add_categories(["Unknown"]).fillna("Unknown")
median_revenue = df["revenue"].median()
df["revenue"] = df["revenue"].fillna(median_revenue)
Text normalization
Common issues: extra whitespace, inconsistent casing, hidden characters.
df["product_name"] = (
df["product_name"]
.astype("string")
.str.strip()
.str.replace(r"\s+", " ", regex=True)
)
df["country_code"] = df["country_code"].astype("string").str.upper().str.strip()
7) Creating and transforming columns (vectorized operations)
Use vectorized operations over .apply when possible
Vectorized:
df["revenue_per_item"] = df["revenue"] / df["quantity"]
Conditional logic with np.where:
import numpy as np
# A plain comparison already returns booleans; np.where earns its keep
# when the branches are non-boolean, e.g. np.where(cond, "large", "small")
df["is_large_order"] = df["quantity"] >= 10
Multiple conditions with np.select:
conditions = [
df["revenue"] < 100,
df["revenue"].between(100, 500, inclusive="both"),
df["revenue"] > 500
]
choices = ["low", "mid", "high"]
df["revenue_band"] = np.select(conditions, choices, default="unknown")
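A runnable version of the banding logic on made-up values, including a NaN to show how the default branch behaves:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [50.0, 250.0, 900.0, np.nan]})

conditions = [
    df["revenue"] < 100,
    df["revenue"].between(100, 500, inclusive="both"),
    df["revenue"] > 500,
]
df["revenue_band"] = np.select(conditions, ["low", "mid", "high"], default="unknown")

# NaN compares False against every condition, so it falls through to the default
print(df["revenue_band"].tolist())  # ['low', 'mid', 'high', 'unknown']
```

This fall-through behavior makes default= a useful trap for missing or out-of-range values rather than letting them silently join a band.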
Date parts and time-based features
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.to_period("M").astype("string")
df["order_weekday"] = df["order_date"].dt.day_name()
8) GroupBy: split-apply-combine at an intermediate level
Basic aggregations
by_country = (
df.groupby("country", observed=True)
.agg(
orders=("order_id", "nunique"),
customers=("customer_id", "nunique"),
revenue=("revenue", "sum"),
avg_order_value=("revenue", "mean"),
)
.sort_values("revenue", ascending=False)
)
by_country.head(10)
Notes:
- observed=True excludes unused categories when grouping on categoricals, which can reduce output size.
- Named aggregations (revenue=("revenue", "sum")) are clearer than the older dict syntax.
Multi-level groupby
by_country_month = (
df.groupby(["country", "order_month"], observed=True)
.agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
.reset_index()
)
by_country_month.head()
Transform vs aggregate
- agg reduces rows (one row per group).
- transform returns a value per original row (same length as input), enabling comparisons like "row revenue vs country average".
df["country_avg_revenue"] = df.groupby("country", observed=True)["revenue"].transform("mean")
df["revenue_vs_country_avg"] = df["revenue"] - df["country_avg_revenue"]
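The same-length property of transform is easiest to see on a tiny synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "US", "DE"], "revenue": [10.0, 30.0, 50.0]})

# transform broadcasts each group's mean back onto every row of that group
df["country_avg_revenue"] = df.groupby("country")["revenue"].transform("mean")
df["revenue_vs_country_avg"] = df["revenue"] - df["country_avg_revenue"]

# All 3 rows survive; agg("mean") would have collapsed them to one per country
print(df["country_avg_revenue"].tolist())  # [20.0, 20.0, 50.0]
```

Because the result aligns with the original index, it can be assigned straight back as a new column without any merge step.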
Filtering groups
Example: keep only countries with at least 100 orders.
country_counts = df.groupby("country", observed=True)["order_id"].nunique()
big_countries = country_counts[country_counts >= 100].index
df_big = df[df["country"].isin(big_countries)]
9) Pivot tables and reshaping: wide vs long
Pivot table (aggregated)
pivot = pd.pivot_table(
df,
index="order_month",
columns="country",
values="revenue",
aggfunc="sum",
fill_value=0,
)
pivot.head()
Stack/unstack and MultiIndex
If you have a MultiIndex result from groupby, you can reshape:
g = df.groupby(["order_month", "country"], observed=True)["revenue"].sum()
wide = g.unstack("country", fill_value=0)
long_again = wide.stack().reset_index(name="revenue")
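A self-contained round trip on made-up monthly revenue, showing how fill_value affects the shape on the way back:

```python
import pandas as pd

df = pd.DataFrame({
    "order_month": ["2025-01", "2025-01", "2025-02"],
    "country": ["US", "DE", "US"],
    "revenue": [100.0, 80.0, 120.0],
})

g = df.groupby(["order_month", "country"])["revenue"].sum()
wide = g.unstack("country", fill_value=0)            # one column per country
long_again = wide.stack().reset_index(name="revenue")

# fill_value=0 materialized the missing DE/2025-02 cell, so the long
# frame has 4 rows where the original grouped Series had only 3
print(wide.shape, len(long_again))  # (2, 2) 4
```

That extra zero row is often exactly what you want for plotting, but worth knowing about if you later aggregate the long frame again.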
Melt: convert wide to long
Useful for plotting libraries and tidy data workflows.
wide_df = pivot.reset_index()
long_df = wide_df.melt(id_vars="order_month", var_name="country", value_name="revenue")
long_df.head()
10) Joining and merging: combining datasets correctly
Assume you have:
- orders (order-level data)
- customers (customer attributes)
- products (product info)
Merge basics
orders = df # for illustration
customers = pd.read_csv("customers.csv", dtype={"customer_id": "string"})
products = pd.read_csv("products.csv", dtype={"product_id": "string"})
orders_customers = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
Important parameters:
- how="left" keeps all orders even if customer info is missing.
- validate="many_to_one" enforces the expected cardinality and catches data issues early (e.g., duplicate customer IDs).
Merge on different column names
orders_products = orders.merge(
products,
left_on="product_id",
right_on="id",
how="left",
validate="many_to_one",
)
Common merge problems and fixes
- Duplicate keys in the lookup table
If customers has repeated customer_id values, the merge becomes many-to-many and silently duplicates order rows.
Check:
customers["customer_id"].duplicated().sum()
Fix by deduplicating:
customers = customers.sort_values("updated_at").drop_duplicates("customer_id", keep="last")
- Type mismatch ("00123" vs 123)
Ensure both sides are the same dtype:
orders["customer_id"] = orders["customer_id"].astype("string")
customers["customer_id"] = customers["customer_id"].astype("string")
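The validate= safety net in action, on made-up frames where the lookup table has a duplicated key:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": ["1", "2"], "revenue": [10.0, 20.0]})
customers = pd.DataFrame({
    "customer_id": ["1", "1", "2"],   # "1" is duplicated: not many-to-one
    "segment": ["a", "b", "c"],
})

try:
    orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
except pd.errors.MergeError:
    print("lookup table has duplicate keys; merge rejected")
```

Without validate=, this merge would succeed and quietly produce two rows for customer "1", inflating downstream revenue sums.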
11) Sorting, ranking, and selecting top-N per group
Sorting with multiple keys
df_sorted = df.sort_values(["country", "revenue"], ascending=[True, False])
Ranking within groups
df["revenue_rank_in_country"] = (
df.groupby("country", observed=True)["revenue"]
.rank(method="dense", ascending=False)
)
Top 3 orders per country
top3 = (
df.sort_values(["country", "revenue"], ascending=[True, False])
.groupby("country", observed=True)
.head(3)
)
top3[["country", "order_id", "revenue"]]
12) Time series: resampling, rolling windows, and period comparisons
Time series analysis becomes easier when your date column is a datetime and (often) the index.
ts = df.set_index("order_date").sort_index()
Resample to monthly revenue
monthly = ts["revenue"].resample("MS").sum() # MS = month start frequency
monthly.head()
Rolling averages (e.g., 3-month moving average)
monthly_ma3 = monthly.rolling(window=3, min_periods=1).mean()
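The resample and rolling steps above can be sketched end to end on synthetic daily data (constant revenue, so the expected sums are easy to check):

```python
import numpy as np
import pandas as pd

# 90 days of flat daily revenue spanning January through March 2025
dates = pd.date_range("2025-01-01", periods=90, freq="D")
ts = pd.DataFrame({"revenue": np.full(len(dates), 100.0)}, index=dates)

monthly = ts["revenue"].resample("MS").sum()                  # one row per month start
monthly_ma3 = monthly.rolling(window=3, min_periods=1).mean()

# 90 daily rows collapse to 3 monthly rows (Jan, Feb, Mar)
print(len(monthly), len(monthly_ma3))  # 3 3
```

min_periods=1 is what keeps the first two months of the moving average from being NaN: they are averaged over however many observations exist so far.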
Year-over-year comparison
If you have multiple years:
monthly_df = monthly.to_frame("revenue")
monthly_df["revenue_yoy"] = monthly_df["revenue"].pct_change(12) # 12 months back
monthly_df.tail(15)
Grouped resampling (per country)
country_monthly = (
df.set_index("order_date")
.groupby("country", observed=True)["revenue"]
.resample("MS")
.sum()
.reset_index()
)
country_monthly.head()
13) Performance: memory, speed, and best practices
Prefer vectorization and built-in methods
- Use Series.str, Series.dt, Series.cat, np.where, and np.select.
- Avoid row-wise .apply(axis=1) unless necessary.
Reduce memory usage
Convert repeated strings to category, use nullable integer types, and downcast numerics.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce", downcast="integer")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce", downcast="float")
df["country"] = df["country"].astype("category")
Use Parquet for intermediate outputs
CSV is slow and loses types. Parquet is faster and preserves schema.
df.to_parquet("clean_sales.parquet", index=False)
df = pd.read_parquet("clean_sales.parquet")
Measure time for expensive operations
In a notebook, you can use:
# In Jupyter:
# %timeit df.groupby("country")["revenue"].sum()
In scripts:
import time
t0 = time.time()
_ = df.groupby("country", observed=True)["revenue"].sum()
print("seconds:", time.time() - t0)
14) Building a reproducible analysis pipeline
A practical approach is to write small, testable functions: load → clean → transform → summarize.
import pandas as pd
import numpy as np
def load_sales(path: str) -> pd.DataFrame:
return pd.read_csv(
path,
dtype={"order_id": "string", "customer_id": "string", "country": "category"},
parse_dates=["order_date"],
na_values=["", "NA", "N/A", "null", "None"],
)
def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
# Standardize text
df["country"] = df["country"].cat.add_categories(["Unknown"]).fillna("Unknown")
# Coerce numerics
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
# Drop impossible rows
df = df[df["quantity"].notna() & (df["quantity"] > 0)]
df = df[df["revenue"].notna() & (df["revenue"] >= 0)]
# Deduplicate by order_id
df = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")
return df
def add_features(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df["order_month"] = df["order_date"].dt.to_period("M").astype("string")
df["aov"] = df["revenue"] # if revenue is order-level; otherwise compute per order
df["revenue_band"] = np.select(
[df["revenue"] < 100, df["revenue"].between(100, 500, inclusive="both"), df["revenue"] > 500],
["low", "mid", "high"],
default="unknown",
)
return df
def summarize(df: pd.DataFrame) -> pd.DataFrame:
return (
df.groupby(["country", "order_month"], observed=True)
.agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
.reset_index()
.sort_values(["order_month", "revenue"], ascending=[True, False])
)
sales = load_sales("sales.csv")
sales = clean_sales(sales)
sales = add_features(sales)
report = summarize(sales)
report.head(20)
Why this structure works:
- Each function has a single responsibility.
- You can unit test each step (e.g., “cleaning removes negative revenue”).
- You can reuse the pipeline for new months of data.
15) Visualization (quick but real): Matplotlib/Seaborn with Pandas outputs
After summarizing, plot trends.
import matplotlib.pyplot as plt
import seaborn as sns
monthly_rev = (
sales.set_index("order_date")
.sort_index()["revenue"]
.resample("MS")
.sum()
.reset_index()
)
plt.figure(figsize=(10, 4))
sns.lineplot(data=monthly_rev, x="order_date", y="revenue")
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()
For country comparison:
country_monthly = (
sales.set_index("order_date")
.groupby("country", observed=True)["revenue"]
.resample("MS")
.sum()
.reset_index()
)
plt.figure(figsize=(12, 5))
sns.lineplot(data=country_monthly, x="order_date", y="revenue", hue="country")
plt.title("Monthly Revenue by Country")
plt.tight_layout()
plt.show()
16) Common intermediate mistakes (and how to avoid them)
- Silent type issues: specify dtype= and use pd.to_numeric / pd.to_datetime with errors="coerce".
- Chained assignment: use .loc[mask, col] = value and consider df = df.copy() before complex operations.
- Many-to-many merges: use validate= in merge to catch unexpected duplication.
- Grouping on high-cardinality columns: grouping by a near-unique column (like order_id) is rarely useful and can be slow.
- Assuming missing means zero: missing can mean "unknown", "not applicable", or "not recorded". Decide explicitly per column.
17) Practice exercise (with a realistic workflow)
Given a sales.csv with columns like:
order_id,order_date,customer_id,country,product_id,quantity,revenue
Try:
- Load with correct dtypes and parse dates.
- Clean: remove duplicates by order_id, remove rows where quantity <= 0, and standardize country codes.
- Create features: order_month and revenue_band.
- Produce a report: monthly revenue and order count by country.
- Save the cleaned dataset to Parquet and the report to CSV.
Skeleton:
import pandas as pd
sales = pd.read_csv(
"sales.csv",
dtype={"order_id": "string", "customer_id": "string", "product_id": "string", "country": "string"},
parse_dates=["order_date"],
)
sales["country"] = sales["country"].str.upper().str.strip()
sales["quantity"] = pd.to_numeric(sales["quantity"], errors="coerce")
sales["revenue"] = pd.to_numeric(sales["revenue"], errors="coerce")
sales = sales.sort_values("order_date").drop_duplicates("order_id", keep="last")
sales = sales[sales["quantity"].notna() & (sales["quantity"] > 0)]
sales = sales[sales["revenue"].notna() & (sales["revenue"] >= 0)]
sales["order_month"] = sales["order_date"].dt.to_period("M").astype("string")
report = (
sales.groupby(["country", "order_month"])
.agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
.reset_index()
.sort_values(["order_month", "revenue"], ascending=[True, False])
)
sales.to_parquet("sales_clean.parquet", index=False)
report.to_csv("country_month_report.csv", index=False)
18) Where to go next
To progress beyond intermediate Pandas, focus on:
- Testing and validation: assert schemas, row counts, uniqueness constraints.
- Advanced reshaping: complex MultiIndex operations, wide_to_long, custom aggregations.
- Scaling: chunked CSV reads, pyarrow datasets, and distributed tools (Dask, Polars, Spark) when data outgrows memory.
- Feature engineering: robust time-based features, cohort analysis, retention curves.
If you want, share a small sample of your dataset schema (column names + a few rows), and I can propose a tailored cleaning and analysis pipeline using the patterns shown above.