
Data Analysis with Pandas in Python: Intermediate Guide

Tags: pandas, python, data-analysis, data-cleaning, data-wrangling, groupby, merge-join, time-series, dataframes, intermediate-python


This tutorial assumes you already know basic Python and have used Pandas at least once. The goal is to move from “I can load a CSV” to confidently performing intermediate analysis: cleaning, reshaping, grouping, joining, time series handling, and building reproducible pipelines.


1) Setup: environment, versions, and display options

Install and verify versions

Use a virtual environment if possible.

python -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows PowerShell

python -m pip install --upgrade pip
python -m pip install pandas numpy pyarrow openpyxl matplotlib seaborn

Check versions:

python -c "import pandas as pd, numpy as np; print(pd.__version__); print(np.__version__)"

Display options

These settings help you inspect data without constantly truncating important information.

import pandas as pd

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)
pd.set_option("display.max_colwidth", 80)
pd.set_option("display.float_format", "{:,.4f}".format)

2) Loading data: CSV, Excel, Parquet, and common pitfalls

CSV with explicit dtypes and missing values

A common mistake is letting Pandas infer types incorrectly (IDs become integers, leading zeros lost; dates remain strings; mixed columns become object). Prefer explicit control.

import pandas as pd

df = pd.read_csv(
    "sales.csv",
    dtype={
        "order_id": "string",
        "customer_id": "string",
        "country": "category",
    },
    parse_dates=["order_date"],
    na_values=["", "NA", "N/A", "null", "None"],
    keep_default_na=True,
)

Key ideas:

- dtype= keeps IDs as strings, so leading zeros survive ("00123" stays "00123")
- parse_dates= converts date columns at load time instead of leaving them as strings
- na_values= tells Pandas which sentinel strings should become missing values

Excel

df_xl = pd.read_excel(
    "sales.xlsx",
    sheet_name="Orders",
    dtype={"order_id": "string"},
    engine="openpyxl",
)

Parquet (fast, typed, compressed)

Parquet is often the best format for analytics pipelines.

df.to_parquet("sales.parquet", index=False)
df_parquet = pd.read_parquet("sales.parquet")

If you installed pyarrow, Parquet will be fast and preserve types well.


3) First inspection: shape, schema, and sanity checks

Basic inspection

df.shape
df.head(10)
df.tail(5)
df.sample(5, random_state=42)
df.columns
df.dtypes
df.info()

Summaries and missingness

df.describe(include="all")
df.isna().sum().sort_values(ascending=False)
df.nunique().sort_values(ascending=False)

Interpretation tips:

- describe(include="all") mixes numeric and non-numeric summaries; expect NaN where a statistic does not apply
- A high count in isna().sum() flags columns that need an explicit cleaning decision before analysis
- nunique() near the row count suggests an identifier; a small value suggests a category candidate


4) Core data types: why they matter

Strings vs objects

Pandas historically used object for text, but the dedicated string dtype is usually better: it is explicitly text-only and represents missing values consistently as pd.NA.

df["customer_id"] = df["customer_id"].astype("string")

Categories for repeated labels

category saves memory and can speed up groupby operations.

df["country"] = df["country"].astype("category")
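To see the saving, compare the same labels stored as object versus category. A toy sketch (exact sizes vary by machine and Pandas version):

```python
import pandas as pd

# Toy data: a few country labels repeated many times.
labels = pd.Series(["US", "DE", "FR"] * 100_000)

obj_bytes = labels.astype("object").memory_usage(deep=True)
cat_bytes = labels.astype("category").memory_usage(deep=True)

# category stores one small integer code per row plus a tiny lookup table,
# so it should come out far smaller than per-row Python strings.
print(f"object: {obj_bytes:,} bytes  category: {cat_bytes:,} bytes")
```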

Datetimes and time zones

df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["order_date_utc"] = df["order_date"].dt.tz_localize("UTC", nonexistent="shift_forward", ambiguous="NaT")

Use errors="coerce" to convert invalid strings into NaT (missing datetime), then inspect how many were invalid:

df["order_date"].isna().sum()
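A tiny illustration of the coerce behaviour on toy data:

```python
import pandas as pd

# errors="coerce" turns unparseable strings into NaT instead of raising.
raw = pd.Series(["2025-01-15", "not a date", "2025-02-01"])
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.isna().sum())  # 1 — the invalid value became NaT
```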

5) Indexing and selecting data (beyond the basics)

.loc vs .iloc vs boolean masks

# Boolean filtering
us = df[df["country"] == "US"]

# Multiple conditions: use parentheses and & / |
recent_us = df[(df["country"] == "US") & (df["order_date"] >= "2025-01-01")]

# Select columns
subset = df.loc[:, ["order_id", "order_date", "country", "revenue"]]

# Position-based selection
first_100 = df.iloc[:100, :5]

Avoid chained indexing

Chained indexing can trigger the confusing "SettingWithCopyWarning" and sometimes silently fails to modify the original DataFrame.

Bad:

df[df["country"] == "US"]["revenue"] = 0

Good:

mask = df["country"] == "US"
df.loc[mask, "revenue"] = 0

6) Cleaning data: duplicates, missing values, and inconsistent text

Duplicates

Decide what “duplicate” means: entire rows, or duplicates by a subset of columns.

df.duplicated().sum()
df = df.drop_duplicates()

# Deduplicate by order_id, keeping the latest order_date
df = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")
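The sort-then-keep-last pattern is worth verifying on a toy frame before trusting it on real data:

```python
import pandas as pd

# Toy frame: order "A" appears twice; sorting by date first means
# keep="last" retains the most recent version of each order.
toy = pd.DataFrame({
    "order_id": ["A", "A", "B"],
    "order_date": pd.to_datetime(["2025-01-01", "2025-02-01", "2025-01-15"]),
})

deduped = toy.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")
print(deduped)
```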

Missing values strategies

First, measure missingness:

missing_rate = df.isna().mean().sort_values(ascending=False)
missing_rate.head(10)

Then decide, column by column. For example, treat missing countries as an explicit "Unknown" label:

df["country"] = df["country"].cat.add_categories(["Unknown"]).fillna("Unknown")

Or impute a numeric column with a robust statistic such as the median:

median_revenue = df["revenue"].median()
df["revenue"] = df["revenue"].fillna(median_revenue)

Text normalization

Common issues: extra whitespace, inconsistent casing, hidden characters.

df["product_name"] = (
    df["product_name"]
    .astype("string")
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

df["country_code"] = df["country_code"].astype("string").str.upper().str.strip()

7) Creating and transforming columns (vectorized operations)

Use vectorized operations over .apply when possible

Vectorized:

df["revenue_per_item"] = df["revenue"] / df["quantity"]

Conditional logic with np.where:

import numpy as np

# A plain True/False column needs no np.where; the comparison alone is enough:
df["is_large_order"] = df["quantity"] >= 10

# np.where earns its keep when the two outcomes are other values:
df["order_size"] = np.where(df["quantity"] >= 10, "large", "small")

Multiple conditions with np.select:

conditions = [
    df["revenue"] < 100,
    df["revenue"].between(100, 500, inclusive="both"),
    df["revenue"] > 500
]
choices = ["low", "mid", "high"]

df["revenue_band"] = np.select(conditions, choices, default="unknown")

Date parts and time-based features

df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.to_period("M").astype("string")
df["order_weekday"] = df["order_date"].dt.day_name()

8) GroupBy: split-apply-combine at an intermediate level

Basic aggregations

by_country = (
    df.groupby("country", observed=True)
      .agg(
          orders=("order_id", "nunique"),
          customers=("customer_id", "nunique"),
          revenue=("revenue", "sum"),
          avg_order_value=("revenue", "mean"),
      )
      .sort_values("revenue", ascending=False)
)

by_country.head(10)

Notes:

- Named aggregation (orders=("order_id", "nunique")) produces readable output columns in a single pass
- observed=True restricts the result to category combinations that actually occur, which matters when grouping on category dtypes

Multi-level groupby

by_country_month = (
    df.groupby(["country", "order_month"], observed=True)
      .agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
      .reset_index()
)

by_country_month.head()

Transform vs aggregate

df["country_avg_revenue"] = df.groupby("country", observed=True)["revenue"].transform("mean")
df["revenue_vs_country_avg"] = df["revenue"] - df["country_avg_revenue"]
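The difference is easiest to see on a toy frame: agg collapses each group to one row, while transform returns a result aligned with the original index:

```python
import pandas as pd

toy = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 3.0, 10.0]})

collapsed = toy.groupby("g")["x"].mean()             # length 2: one value per group
broadcast = toy.groupby("g")["x"].transform("mean")  # length 3: aligned with toy

print(len(collapsed), len(broadcast))
```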

Filtering groups

Example: keep only countries with at least 100 orders.

country_counts = df.groupby("country", observed=True)["order_id"].nunique()
big_countries = country_counts[country_counts >= 100].index

df_big = df[df["country"].isin(big_countries)]
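An equivalent one-step version uses GroupBy.filter; it is convenient, though the mask approach above often scales better because the lambda runs once per group in Python. A toy sketch:

```python
import pandas as pd

# Toy data: "US" has two orders, "DE" only one.
toy = pd.DataFrame({
    "country": ["US", "US", "DE"],
    "order_id": ["a", "b", "c"],
})

# Keep only rows belonging to countries with at least 2 distinct orders.
kept = toy.groupby("country").filter(lambda g: g["order_id"].nunique() >= 2)
print(kept)
```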

9) Pivot tables and reshaping: wide vs long

Pivot table (aggregated)

pivot = pd.pivot_table(
    df,
    index="order_month",
    columns="country",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

pivot.head()

Stack/unstack and MultiIndex

If you have a MultiIndex result from groupby, you can reshape:

g = df.groupby(["order_month", "country"], observed=True)["revenue"].sum()
wide = g.unstack("country", fill_value=0)
long_again = wide.stack().reset_index(name="revenue")

Melt: convert wide to long

Useful for plotting libraries and tidy data workflows.

wide_df = pivot.reset_index()
long_df = wide_df.melt(id_vars="order_month", var_name="country", value_name="revenue")
long_df.head()

10) Joining and merging: combining datasets correctly

Assume you have:

- orders: the sales data, with customer_id and product_id columns
- customers.csv: one row per customer_id
- products.csv: one row per product_id

Merge basics

orders = df  # for illustration
customers = pd.read_csv("customers.csv", dtype={"customer_id": "string"})
products = pd.read_csv("products.csv", dtype={"product_id": "string"})

orders_customers = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")

Important parameters:

- on= / left_on= / right_on=: which key columns to join on
- how=: "left" keeps every order even when the lookup table has no match
- validate=: e.g. "many_to_one" raises if the right side has duplicate keys, catching silent row duplication

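One diagnostic worth knowing: indicator=True adds a _merge column recording where each row found a match. A toy sketch with hypothetical tiny tables:

```python
import pandas as pd

orders_toy = pd.DataFrame({"customer_id": ["1", "2", "3"]})
customers_toy = pd.DataFrame({"customer_id": ["1", "2"], "name": ["Ann", "Bo"]})

# _merge is "both" for matched rows, "left_only" for orders with no customer record.
merged = orders_toy.merge(customers_toy, on="customer_id", how="left", indicator=True)
print(merged["_merge"].value_counts())
```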
Merge on different column names

orders_products = orders.merge(
    products,
    left_on="product_id",
    right_on="id",
    how="left",
    validate="many_to_one",
)

Common merge problems and fixes

  1. Duplicate keys in lookup table
    If customers has repeated customer_id, a many-to-one merge becomes many-to-many, duplicating rows.

Check:

customers["customer_id"].duplicated().sum()

Fix by deduplicating:

customers = customers.sort_values("updated_at").drop_duplicates("customer_id", keep="last")

  2. Type mismatch ("00123" vs 123)
    Ensure both sides are the same dtype:

orders["customer_id"] = orders["customer_id"].astype("string")
customers["customer_id"] = customers["customer_id"].astype("string")

11) Sorting, ranking, and selecting top-N per group

Sorting with multiple keys

df_sorted = df.sort_values(["country", "revenue"], ascending=[True, False])

Ranking within groups

df["revenue_rank_in_country"] = (
    df.groupby("country", observed=True)["revenue"]
      .rank(method="dense", ascending=False)
)
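method="dense" gives tied values the same rank with no gaps, which a toy frame makes concrete:

```python
import pandas as pd

# Two orders tie at 9.0 revenue: both get rank 1, and 5.0 gets rank 2 (no gap).
toy = pd.DataFrame({"country": ["US", "US", "US"], "revenue": [5.0, 9.0, 9.0]})

toy["rank"] = toy.groupby("country")["revenue"].rank(method="dense", ascending=False)
print(toy)
```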

Top 3 orders per country

top3 = (
    df.sort_values(["country", "revenue"], ascending=[True, False])
      .groupby("country", observed=True)
      .head(3)
)

top3[["country", "order_id", "revenue"]]

12) Time series: resampling, rolling windows, and period comparisons

Time series analysis becomes easier when your date column is a datetime and (often) the index.

ts = df.set_index("order_date").sort_index()

Resample to monthly revenue

monthly = ts["revenue"].resample("MS").sum()  # MS = month start frequency
monthly.head()
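On a toy series the mechanics are easy to verify: orders falling in the same calendar month are summed into a single month-start bucket:

```python
import pandas as pd

# Two January orders and one February order.
idx = pd.to_datetime(["2025-01-05", "2025-01-20", "2025-02-03"])
revenue = pd.Series([10.0, 5.0, 7.0], index=idx)

# resample("MS") buckets by month, labelled at the month start.
monthly_toy = revenue.resample("MS").sum()
print(monthly_toy)
```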

Rolling averages (e.g., 3-month moving average)

monthly_ma3 = monthly.rolling(window=3, min_periods=1).mean()

Year-over-year comparison

If you have multiple years:

monthly_df = monthly.to_frame("revenue")
monthly_df["revenue_yoy"] = monthly_df["revenue"].pct_change(12)  # 12 months back
monthly_df.tail(15)

Grouped resampling (per country)

country_monthly = (
    df.set_index("order_date")
      .groupby("country", observed=True)["revenue"]
      .resample("MS")
      .sum()
      .reset_index()
)

country_monthly.head()

13) Performance: memory, speed, and best practices

Prefer vectorization and built-in methods

Reduce memory usage

Convert repeated strings to category, use nullable integer types, and downcast numerics.

df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce", downcast="integer")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce", downcast="float")
df["country"] = df["country"].astype("category")

Use Parquet for intermediate outputs

CSV is slow and loses types. Parquet is faster and preserves schema.

df.to_parquet("clean_sales.parquet", index=False)
df = pd.read_parquet("clean_sales.parquet")

Measure time for expensive operations

In a notebook, you can use:

# In Jupyter:
# %timeit df.groupby("country")["revenue"].sum()

In scripts:

import time

t0 = time.time()
_ = df.groupby("country", observed=True)["revenue"].sum()
print("seconds:", time.time() - t0)

14) Building a reproducible analysis pipeline

A practical approach is to write small, testable functions: load → clean → transform → summarize.

import pandas as pd
import numpy as np

def load_sales(path: str) -> pd.DataFrame:
    return pd.read_csv(
        path,
        dtype={"order_id": "string", "customer_id": "string", "country": "category"},
        parse_dates=["order_date"],
        na_values=["", "NA", "N/A", "null", "None"],
    )

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Standardize text
    df["country"] = df["country"].cat.add_categories(["Unknown"]).fillna("Unknown")

    # Coerce numerics
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

    # Drop impossible rows
    df = df[df["quantity"].notna() & (df["quantity"] > 0)]
    df = df[df["revenue"].notna() & (df["revenue"] >= 0)]

    # Deduplicate by order_id
    df = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")

    return df

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["order_month"] = df["order_date"].dt.to_period("M").astype("string")
    df["aov"] = df["revenue"]  # if revenue is order-level; otherwise compute per order
    df["revenue_band"] = np.select(
        [df["revenue"] < 100, df["revenue"].between(100, 500, inclusive="both"), df["revenue"] > 500],
        ["low", "mid", "high"],
        default="unknown",
    )
    return df

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    return (
        df.groupby(["country", "order_month"], observed=True)
          .agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
          .reset_index()
          .sort_values(["order_month", "revenue"], ascending=[True, False])
    )

sales = load_sales("sales.csv")
sales = clean_sales(sales)
sales = add_features(sales)
report = summarize(sales)

report.head(20)

Why this structure works:

- Each function does one job, so it can be tested and debugged in isolation
- df.copy() inside each step avoids mutating the caller's DataFrame
- The load → clean → transform → summarize chain reruns end to end, making the analysis reproducible


15) Visualization (quick but real): Matplotlib/Seaborn with Pandas outputs

After summarizing, plot trends.

import matplotlib.pyplot as plt
import seaborn as sns

monthly_rev = (
    sales.set_index("order_date")
         .sort_index()["revenue"]
         .resample("MS")
         .sum()
         .reset_index()
)

plt.figure(figsize=(10, 4))
sns.lineplot(data=monthly_rev, x="order_date", y="revenue")
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()

For country comparison:

country_monthly = (
    sales.set_index("order_date")
         .groupby("country", observed=True)["revenue"]
         .resample("MS")
         .sum()
         .reset_index()
)

plt.figure(figsize=(12, 5))
sns.lineplot(data=country_monthly, x="order_date", y="revenue", hue="country")
plt.title("Monthly Revenue by Country")
plt.tight_layout()
plt.show()

16) Common intermediate mistakes (and how to avoid them)

  1. Silent type issues
    Fix by specifying dtype= and using pd.to_numeric / pd.to_datetime with errors="coerce".

  2. Chained assignment
    Use .loc[mask, col] = value and consider df = df.copy() before complex operations.

  3. Many-to-many merges
    Use validate= in merge to catch unexpected duplication.

  4. Grouping on high-cardinality columns
    Grouping by a near-unique column (like order_id) is rarely useful and can be slow.

  5. Assuming missing means zero
    Missing can mean “unknown”, “not applicable”, or “not recorded”. Decide explicitly per column.


17) Practice exercise (with a realistic workflow)

Given a sales.csv with columns like:

- order_id, customer_id, product_id, country
- order_date, quantity, revenue

Try:

  1. Load with correct dtypes and parse dates.
  2. Clean: remove duplicates by order_id, remove rows where quantity <= 0, and standardize country codes.
  3. Create features: order_month, revenue_band.
  4. Produce a report: monthly revenue and order count by country.
  5. Save the cleaned dataset to Parquet and the report to CSV.

Skeleton:

import pandas as pd

sales = pd.read_csv(
    "sales.csv",
    dtype={"order_id": "string", "customer_id": "string", "product_id": "string", "country": "string"},
    parse_dates=["order_date"],
)

sales["country"] = sales["country"].str.upper().str.strip()
sales["quantity"] = pd.to_numeric(sales["quantity"], errors="coerce")
sales["revenue"] = pd.to_numeric(sales["revenue"], errors="coerce")

sales = sales.sort_values("order_date").drop_duplicates("order_id", keep="last")
sales = sales[sales["quantity"].notna() & (sales["quantity"] > 0)]
sales = sales[sales["revenue"].notna() & (sales["revenue"] >= 0)]

sales["order_month"] = sales["order_date"].dt.to_period("M").astype("string")

report = (
    sales.groupby(["country", "order_month"])
        .agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
        .reset_index()
        .sort_values(["order_month", "revenue"], ascending=[True, False])
)

sales.to_parquet("sales_clean.parquet", index=False)
report.to_csv("country_month_report.csv", index=False)

18) Where to go next

To progress beyond intermediate Pandas, focus on:

- Method chaining with .pipe() and .assign() to keep pipelines readable
- Window operations beyond simple moving averages: expanding, ewm, and custom rolling aggregations
- MultiIndex mastery: cross-sections, slicing, and advanced reshaping
- Performance at scale: profiling hot spots and recognizing when a single in-memory DataFrame is no longer the right tool


If you want, share a small sample of your dataset schema (column names + a few rows), and I can propose a tailored cleaning and analysis pipeline using the patterns shown above.