Data Analysis with Pandas in Python: Intermediate Guide
This tutorial assumes you already know basic Python and have used Pandas at least once. The goal is to move from “I can load a CSV” to confidently performing intermediate analysis: cleaning, reshaping, grouping, joining, time series handling, and building reproducible pipelines.
1) Setup: environment, versions, and display options
Install and verify versions
Use a virtual environment if possible.
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows PowerShell
python -m pip install --upgrade pip
python -m pip install pandas numpy pyarrow openpyxl matplotlib seaborn
Check versions:
python -c "import pandas as pd, numpy as np; print(pd.__version__); print(np.__version__)"
Recommended display settings
These settings help you inspect data without constantly truncating important information.
import pandas as pd
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)
pd.set_option("display.max_colwidth", 80)
pd.set_option("display.float_format", "{:,.4f}".format)
2) Loading data: CSV, Excel, Parquet, and common pitfalls
CSV with explicit dtypes and missing values
A common mistake is letting Pandas infer types incorrectly (IDs become integers, leading zeros lost; dates remain strings; mixed columns become object). Prefer explicit control.
import pandas as pd
df = pd.read_csv(
"sales.csv",
dtype={
"order_id": "string",
"customer_id": "string",
"country": "category",
},
parse_dates=["order_date"],
na_values=["", "NA", "N/A", "null", "None"],
keep_default_na=True,
)
Key ideas:
- dtype= prevents silent type changes.
- parse_dates= converts to datetime early, enabling time-series operations.
- na_values= standardizes missing values.
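The leading-zeros pitfall is easy to demonstrate. A minimal sketch with an in-memory CSV (the order_id values here are made up for illustration):

```python
import io
import pandas as pd

csv_text = "order_id,revenue\n00123,10.5\n00456,20.0\n"

inferred = pd.read_csv(io.StringIO(csv_text))
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"order_id": "string"})

print(inferred["order_id"].tolist())   # [123, 456]: leading zeros are gone
print(explicit["order_id"].tolist())   # ['00123', '00456']
```

Once the zeros are stripped by integer inference, no later cast can recover them, which is why the dtype belongs in the read call itself.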
Excel
df_xl = pd.read_excel(
"sales.xlsx",
sheet_name="Orders",
dtype={"order_id": "string"},
engine="openpyxl",
)
Parquet (fast, typed, compressed)
Parquet is often the best format for analytics pipelines.
df.to_parquet("sales.parquet", index=False)
df_parquet = pd.read_parquet("sales.parquet")
If you installed pyarrow, Parquet will be fast and preserve types well.
3) First inspection: shape, schema, and sanity checks
Basic inspection
df.shape
df.head(10)
df.tail(5)
df.sample(5, random_state=42)
df.columns
df.dtypes
df.info()
Summaries and missingness
df.describe(include="all")
df.isna().sum().sort_values(ascending=False)
df.nunique().sort_values(ascending=False)
Interpretation tips:
- If describe(include="all") shows unique and top, you likely have categorical/object columns.
- A column with many unique values might be an ID; a column with very few might be a category.
- High missingness may require imputation, dropping, or careful modeling decisions.
4) Core data types: why they matter
Strings vs objects
Pandas historically used object for text, but string dtype is usually better.
df["customer_id"] = df["customer_id"].astype("string")
Categories for repeated labels
category saves memory and can speed up groupby operations.
df["country"] = df["country"].astype("category")
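The memory claim is easy to check yourself. A quick comparison on a synthetic column with a few repeated labels:

```python
import pandas as pd

# A low-cardinality column: 3 distinct labels repeated many times
s_obj = pd.Series(["US", "DE", "FR"] * 100_000, dtype="object")
s_cat = s_obj.astype("category")

# category stores small integer codes plus one copy of each label
print(s_obj.memory_usage(deep=True))
print(s_cat.memory_usage(deep=True))
```

The larger the ratio of rows to distinct labels, the bigger the savings; for near-unique columns (like IDs), category can actually cost more.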
Datetimes and time zones
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["order_date_utc"] = df["order_date"].dt.tz_localize("UTC", nonexistent="shift_forward", ambiguous="NaT")
Use errors="coerce" to convert invalid strings into NaT (missing datetime), then inspect how many were invalid:
df["order_date"].isna().sum()
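A self-contained sketch of the coerce-then-inspect pattern, using made-up date strings:

```python
import pandas as pd

raw = pd.Series(["2025-01-15", "not a date", "2025-02-01"])
parsed = pd.to_datetime(raw, errors="coerce", format="%Y-%m-%d")

# The invalid string became NaT; count how many values failed to parse
print(int(parsed.isna().sum()))  # 1
```

If the invalid count is unexpectedly high, inspect the offending raw values (e.g., raw[parsed.isna()]) before deciding whether the format string or the data is at fault.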
5) Indexing and selecting data (beyond the basics)
.loc vs .iloc vs boolean masks
- .loc is label-based (slices include both endpoints).
- .iloc is position-based (end-exclusive, like Python slicing).
- Boolean masks are the workhorse for filtering.
# Boolean filtering
us = df[df["country"] == "US"]
# Multiple conditions: use parentheses and & / |
recent_us = df[(df["country"] == "US") & (df["order_date"] >= "2025-01-01")]
# Select columns
subset = df.loc[:, ["order_id", "order_date", "country", "revenue"]]
# Position-based selection
first_100 = df.iloc[:100, :5]
Avoid chained indexing
Chained indexing can trigger the confusing SettingWithCopyWarning, and the assignment may land on a temporary copy, silently leaving the original DataFrame unchanged.
Bad:
df[df["country"] == "US"]["revenue"] = 0
Good:
mask = df["country"] == "US"
df.loc[mask, "revenue"] = 0
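On a tiny synthetic frame, the mask-then-.loc pattern looks like this:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "DE"], "revenue": [100.0, 200.0]})

mask = df["country"] == "US"
df.loc[mask, "revenue"] = 0  # writes directly into df, no intermediate copy

print(df["revenue"].tolist())  # [0.0, 200.0]
```

Because .loc performs the selection and the assignment in a single operation, there is no intermediate object for the write to get lost in.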
6) Cleaning data: duplicates, missing values, and inconsistent text
Duplicates
Decide what “duplicate” means: entire rows, or duplicates by a subset of columns.
df.duplicated().sum()
df = df.drop_duplicates()
# Deduplicate by order_id, keeping the latest order_date
df = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")
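A sketch of the sort-then-dedupe pattern on made-up orders, showing which row survives:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": ["A1", "A1", "B2"],
    "order_date": pd.to_datetime(["2025-01-01", "2025-03-01", "2025-02-01"]),
})

# Sorting by date first means keep="last" retains the newest row per order_id
deduped = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")

print(len(deduped))  # 2
```

The sort is what gives keep="last" its meaning here; without it, "last" just means "last in the original row order".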
Missing values strategies
First, measure missingness:
missing_rate = df.isna().mean().sort_values(ascending=False)
missing_rate.head(10)
Then decide:
- Drop rows if missingness is small and random.
- Fill with a constant for categories (e.g., "Unknown").
- Use statistical imputation for numeric columns (the median is robust to outliers).
df["country"] = df["country"].cat.add_categories(["Unknown"]).fillna("Unknown")
median_revenue = df["revenue"].median()
df["revenue"] = df["revenue"].fillna(median_revenue)
Text normalization
Common issues: extra whitespace, inconsistent casing, hidden characters.
df["product_name"] = (
df["product_name"]
.astype("string")
.str.strip()
.str.replace(r"\s+", " ", regex=True)
)
df["country_code"] = df["country_code"].astype("string").str.upper().str.strip()
7) Creating and transforming columns (vectorized operations)
Use vectorized operations over .apply when possible
Vectorized:
df["revenue_per_item"] = df["revenue"] / df["quantity"]
Conditional logic with np.where:
import numpy as np
# A plain comparison already returns booleans; np.where earns its keep
# when the branches are non-boolean, e.g. np.where(cond, "large", "small")
df["is_large_order"] = df["quantity"] >= 10
Multiple conditions with np.select:
conditions = [
df["revenue"] < 100,
df["revenue"].between(100, 500, inclusive="both"),
df["revenue"] > 500
]
choices = ["low", "mid", "high"]
df["revenue_band"] = np.select(conditions, choices, default="unknown")
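A runnable version of the banding logic on made-up values, including a NaN to show how the default branch behaves:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [50.0, 250.0, 900.0, np.nan]})

conditions = [
    df["revenue"] < 100,
    df["revenue"].between(100, 500, inclusive="both"),
    df["revenue"] > 500,
]
df["revenue_band"] = np.select(conditions, ["low", "mid", "high"], default="unknown")

# NaN compares False against every condition, so it falls through to the default
print(df["revenue_band"].tolist())  # ['low', 'mid', 'high', 'unknown']
```

This fall-through behavior makes default= a useful trap for missing or out-of-range values rather than letting them silently join a band.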
Date parts and time-based features
df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.to_period("M").astype("string")
df["order_weekday"] = df["order_date"].dt.day_name()
8) GroupBy: split-apply-combine at an intermediate level
Basic aggregations
by_country = (
df.groupby("country", observed=True)
.agg(
orders=("order_id", "nunique"),
customers=("customer_id", "nunique"),
revenue=("revenue", "sum"),
avg_order_value=("revenue", "mean"),
)
.sort_values("revenue", ascending=False)
)
by_country.head(10)
Notes:
- observed=True excludes unused categories when grouping on categoricals, which can reduce output size.
- Named aggregations (revenue=("revenue", "sum")) are clearer than the older dict syntax.
Multi-level groupby
by_country_month = (
df.groupby(["country", "order_month"], observed=True)
.agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
.reset_index()
)
by_country_month.head()
Transform vs aggregate
- agg reduces rows (one row per group).
- transform returns a value per original row (same length as input), enabling comparisons like "row revenue vs country average".
df["country_avg_revenue"] = df.groupby("country", observed=True)["revenue"].transform("mean")
df["revenue_vs_country_avg"] = df["revenue"] - df["country_avg_revenue"]
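The same-length property of transform is easiest to see on a tiny synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "US", "DE"], "revenue": [10.0, 30.0, 50.0]})

# transform broadcasts each group's mean back onto every row of that group
df["country_avg_revenue"] = df.groupby("country")["revenue"].transform("mean")
df["revenue_vs_country_avg"] = df["revenue"] - df["country_avg_revenue"]

# All 3 rows survive; agg("mean") would have collapsed them to one per country
print(df["country_avg_revenue"].tolist())  # [20.0, 20.0, 50.0]
```

Because the result aligns with the original index, it can be assigned straight back as a new column without any merge step.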
Filtering groups
Example: keep only countries with at least 100 orders.
country_counts = df.groupby("country", observed=True)["order_id"].nunique()
big_countries = country_counts[country_counts >= 100].index
df_big = df[df["country"].isin(big_countries)]
9) Pivot tables and reshaping: wide vs long
Pivot table (aggregated)
pivot = pd.pivot_table(
df,
index="order_month",
columns="country",
values="revenue",
aggfunc="sum",
fill_value=0,
)
pivot.head()
Stack/unstack and MultiIndex
If you have a MultiIndex result from groupby, you can reshape:
g = df.groupby(["order_month", "country"], observed=True)["revenue"].sum()
wide = g.unstack("country", fill_value=0)
long_again = wide.stack().reset_index(name="revenue")
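A self-contained round trip on made-up monthly revenue, showing how fill_value affects the shape on the way back:

```python
import pandas as pd

df = pd.DataFrame({
    "order_month": ["2025-01", "2025-01", "2025-02"],
    "country": ["US", "DE", "US"],
    "revenue": [100.0, 80.0, 120.0],
})

g = df.groupby(["order_month", "country"])["revenue"].sum()
wide = g.unstack("country", fill_value=0)            # one column per country
long_again = wide.stack().reset_index(name="revenue")

# fill_value=0 materialized the missing DE/2025-02 cell, so the long
# frame has 4 rows where the original grouped Series had only 3
print(wide.shape, len(long_again))  # (2, 2) 4
```

That extra zero row is often exactly what you want for plotting, but worth knowing about if you later aggregate the long frame again.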
Melt: convert wide to long
Useful for plotting libraries and tidy data workflows.
wide_df = pivot.reset_index()
long_df = wide_df.melt(id_vars="order_month", var_name="country", value_name="revenue")
long_df.head()
10) Joining and merging: combining datasets correctly
Assume you have:
- orders (order-level data)
- customers (customer attributes)
- products (product info)
Merge basics
orders = df # for illustration
customers = pd.read_csv("customers.csv", dtype={"customer_id": "string"})
products = pd.read_csv("products.csv", dtype={"product_id": "string"})
orders_customers = orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
Important parameters:
- how="left" keeps all orders even if customer info is missing.
- validate="many_to_one" enforces the expected cardinality and catches data issues early (e.g., duplicate customer IDs).
Merge on different column names
orders_products = orders.merge(
products,
left_on="product_id",
right_on="id",
how="left",
validate="many_to_one",
)
Common merge problems and fixes
- Duplicate keys in the lookup table
If customers has repeated customer_id values, the merge becomes many-to-many and silently duplicates order rows.
Check:
customers["customer_id"].duplicated().sum()
Fix by deduplicating:
customers = customers.sort_values("updated_at").drop_duplicates("customer_id", keep="last")
- Type mismatch ("00123" vs 123)
Ensure both sides are the same dtype:
orders["customer_id"] = orders["customer_id"].astype("string")
customers["customer_id"] = customers["customer_id"].astype("string")
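The validate= safety net in action, on made-up frames where the lookup table has a duplicated key:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": ["1", "2"], "revenue": [10.0, 20.0]})
customers = pd.DataFrame({
    "customer_id": ["1", "1", "2"],   # "1" is duplicated: not many-to-one
    "segment": ["a", "b", "c"],
})

try:
    orders.merge(customers, on="customer_id", how="left", validate="many_to_one")
except pd.errors.MergeError:
    print("lookup table has duplicate keys; merge rejected")
```

Without validate=, this merge would succeed and quietly produce two rows for customer "1", inflating downstream revenue sums.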
11) Sorting, ranking, and selecting top-N per group
Sorting with multiple keys
df_sorted = df.sort_values(["country", "revenue"], ascending=[True, False])
Ranking within groups
df["revenue_rank_in_country"] = (
df.groupby("country", observed=True)["revenue"]
.rank(method="dense", ascending=False)
)
Top 3 orders per country
top3 = (
df.sort_values(["country", "revenue"], ascending=[True, False])
.groupby("country", observed=True)
.head(3)
)
top3[["country", "order_id", "revenue"]]
12) Time series: resampling, rolling windows, and period comparisons
Time series analysis becomes easier when your date column is a datetime and (often) the index.
ts = df.set_index("order_date").sort_index()
Resample to monthly revenue
monthly = ts["revenue"].resample("MS").sum() # MS = month start frequency
monthly.head()
Rolling averages (e.g., 3-month moving average)
monthly_ma3 = monthly.rolling(window=3, min_periods=1).mean()
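The resample and rolling steps above can be sketched end to end on synthetic daily data (constant revenue, so the expected sums are easy to check):

```python
import numpy as np
import pandas as pd

# 90 days of flat daily revenue spanning January through March 2025
dates = pd.date_range("2025-01-01", periods=90, freq="D")
ts = pd.DataFrame({"revenue": np.full(len(dates), 100.0)}, index=dates)

monthly = ts["revenue"].resample("MS").sum()                  # one row per month start
monthly_ma3 = monthly.rolling(window=3, min_periods=1).mean()

# 90 daily rows collapse to 3 monthly rows (Jan, Feb, Mar)
print(len(monthly), len(monthly_ma3))  # 3 3
```

min_periods=1 is what keeps the first two months of the moving average from being NaN: they are averaged over however many observations exist so far.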
Year-over-year comparison
If you have multiple years:
monthly_df = monthly.to_frame("revenue")
monthly_df["revenue_yoy"] = monthly_df["revenue"].pct_change(12) # 12 months back
monthly_df.tail(15)
Grouped resampling (per country)
country_monthly = (
df.set_index("order_date")
.groupby("country", observed=True)["revenue"]
.resample("MS")
.sum()
.reset_index()
)
country_monthly.head()
13) Performance: memory, speed, and best practices
Prefer vectorization and built-in methods
- Use Series.str, Series.dt, Series.cat, np.where, and np.select.
- Avoid row-wise .apply(axis=1) unless necessary.
Reduce memory usage
Convert repeated strings to category, use nullable integer types, and downcast numerics.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce", downcast="integer")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce", downcast="float")
df["country"] = df["country"].astype("category")
Use Parquet for intermediate outputs
CSV is slow and loses types. Parquet is faster and preserves schema.
df.to_parquet("clean_sales.parquet", index=False)
df = pd.read_parquet("clean_sales.parquet")
Measure time for expensive operations
In a notebook, you can use:
# In Jupyter:
# %timeit df.groupby("country")["revenue"].sum()
In scripts:
import time
t0 = time.time()
_ = df.groupby("country", observed=True)["revenue"].sum()
print("seconds:", time.time() - t0)
14) Building a reproducible analysis pipeline
A practical approach is to write small, testable functions: load → clean → transform → summarize.
import pandas as pd
import numpy as np
def load_sales(path: str) -> pd.DataFrame:
return pd.read_csv(
path,
dtype={"order_id": "string", "customer_id": "string", "country": "category"},
parse_dates=["order_date"],
na_values=["", "NA", "N/A", "null", "None"],
)
def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
# Standardize text
df["country"] = df["country"].cat.add_categories(["Unknown"]).fillna("Unknown")
# Coerce numerics
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
# Drop impossible rows
df = df[df["quantity"].notna() & (df["quantity"] > 0)]
df = df[df["revenue"].notna() & (df["revenue"] >= 0)]
# Deduplicate by order_id
df = df.sort_values("order_date").drop_duplicates(subset=["order_id"], keep="last")
return df
def add_features(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
df["order_month"] = df["order_date"].dt.to_period("M").astype("string")
df["aov"] = df["revenue"] # if revenue is order-level; otherwise compute per order
df["revenue_band"] = np.select(
[df["revenue"] < 100, df["revenue"].between(100, 500, inclusive="both"), df["revenue"] > 500],
["low", "mid", "high"],
default="unknown",
)
return df
def summarize(df: pd.DataFrame) -> pd.DataFrame:
return (
df.groupby(["country", "order_month"], observed=True)
.agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
.reset_index()
.sort_values(["order_month", "revenue"], ascending=[True, False])
)
sales = load_sales("sales.csv")
sales = clean_sales(sales)
sales = add_features(sales)
report = summarize(sales)
report.head(20)
Why this structure works:
- Each function has a single responsibility.
- You can unit test each step (e.g., “cleaning removes negative revenue”).
- You can reuse the pipeline for new months of data.
15) Visualization (quick but real): Matplotlib/Seaborn with Pandas outputs
After summarizing, plot trends.
import matplotlib.pyplot as plt
import seaborn as sns
monthly_rev = (
sales.set_index("order_date")
.sort_index()["revenue"]
.resample("MS")
.sum()
.reset_index()
)
plt.figure(figsize=(10, 4))
sns.lineplot(data=monthly_rev, x="order_date", y="revenue")
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()
For country comparison:
country_monthly = (
sales.set_index("order_date")
.groupby("country", observed=True)["revenue"]
.resample("MS")
.sum()
.reset_index()
)
plt.figure(figsize=(12, 5))
sns.lineplot(data=country_monthly, x="order_date", y="revenue", hue="country")
plt.title("Monthly Revenue by Country")
plt.tight_layout()
plt.show()
16) Common intermediate mistakes (and how to avoid them)
- Silent type issues: specify dtype= and use pd.to_numeric / pd.to_datetime with errors="coerce".
- Chained assignment: use .loc[mask, col] = value and consider df = df.copy() before complex operations.
- Many-to-many merges: use validate= in merge to catch unexpected duplication.
- Grouping on high-cardinality columns: grouping by a near-unique column (like order_id) is rarely useful and can be slow.
- Assuming missing means zero: missing can mean "unknown", "not applicable", or "not recorded". Decide explicitly per column.
17) Practice exercise (with a realistic workflow)
Given a sales.csv with columns like:
order_id,order_date,customer_id,country,product_id,quantity,revenue
Try:
- Load with correct dtypes and parse dates.
- Clean: remove duplicates by order_id, remove rows where quantity <= 0, and standardize country codes.
- Create features: order_month and revenue_band.
- Produce a report: monthly revenue and order count by country.
- Save the cleaned dataset to Parquet and the report to CSV.
Skeleton:
import pandas as pd
sales = pd.read_csv(
"sales.csv",
dtype={"order_id": "string", "customer_id": "string", "product_id": "string", "country": "string"},
parse_dates=["order_date"],
)
sales["country"] = sales["country"].str.upper().str.strip()
sales["quantity"] = pd.to_numeric(sales["quantity"], errors="coerce")
sales["revenue"] = pd.to_numeric(sales["revenue"], errors="coerce")
sales = sales.sort_values("order_date").drop_duplicates("order_id", keep="last")
sales = sales[sales["quantity"].notna() & (sales["quantity"] > 0)]
sales = sales[sales["revenue"].notna() & (sales["revenue"] >= 0)]
sales["order_month"] = sales["order_date"].dt.to_period("M").astype("string")
report = (
sales.groupby(["country", "order_month"])
.agg(revenue=("revenue", "sum"), orders=("order_id", "nunique"))
.reset_index()
.sort_values(["order_month", "revenue"], ascending=[True, False])
)
sales.to_parquet("sales_clean.parquet", index=False)
report.to_csv("country_month_report.csv", index=False)
18) Where to go next
To progress beyond intermediate Pandas, focus on:
- Testing and validation: assert schemas, row counts, uniqueness constraints.
- Advanced reshaping: complex MultiIndex operations, wide_to_long, custom aggregations.
- Scaling: chunked CSV reads, pyarrow datasets, and distributed tools (Dask, Polars, Spark) when data outgrows memory.
- Feature engineering: robust time-based features, cohort analysis, retention curves.
If you want, share a small sample of your dataset schema (column names + a few rows), and I can propose a tailored cleaning and analysis pipeline using the patterns shown above.