Pandas Reference

About Pandas Reference

The Pandas Reference is a comprehensive cheat sheet for the Python pandas library, covering all essential operations for data analysis and manipulation. It includes complete Python examples for eight categories: DataFrame creation and inspection, Series operations, label-based and position-based indexing, DataFrame merging and concatenation, groupby aggregation, time series operations, I/O with CSV/Excel/JSON/SQL/Parquet, and missing value handling.

This reference is designed for data scientists, data engineers, and Python developers who work with tabular data. Whether you need a quick reminder for `df.loc[]` vs `df.iloc[]`, the correct syntax for a multi-column `agg()` call, or how to resample a time series to monthly frequency, this reference surfaces the answer in seconds without requiring you to search through the official documentation.

The reference is organized to match a typical data analysis workflow: load data with `pd.read_csv()` or `pd.read_sql()`, explore it with `df.info()` and `df.describe()`, clean it with `df.dropna()` and `df.fillna()`, transform it with `df.groupby()` and `df.pivot_table()`, and export it with `df.to_csv()` or `df.to_parquet()`. Each entry shows the exact method signature and a runnable code snippet.
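The workflow above can be sketched end to end. This is a minimal, self-contained example using an inline CSV string in place of a file on disk; the column names are hypothetical:

```python
import io

import pandas as pd

# Inline CSV standing in for a file on disk (note the missing sales value).
raw = io.StringIO(
    "date,city,sales\n"
    "2024-01-01,Seoul,100\n"
    "2024-01-02,Busan,\n"
    "2024-01-03,Seoul,120\n"
)

df = pd.read_csv(raw, parse_dates=["date"])  # load
df.info()                                    # explore dtypes and non-null counts
df["sales"] = df["sales"].fillna(0)          # clean: replace NaN with 0
summary = df.groupby("city")["sales"].sum()  # transform: total sales per city
csv_out = df.to_csv(index=False)             # export (here, to a string)
```

Each step mirrors one stage of the workflow: `read_csv` → `info` → `fillna` → `groupby` → `to_csv`.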

Key Features

  • DataFrame creation, inspection (info, describe, shape), and column renaming
  • Series value_counts, unique, apply, map, and str accessor for string operations
  • Label-based indexing with df.loc[] and position-based indexing with df.iloc[]
  • Boolean filtering, df.query() string expressions, and sort_values()
  • pd.merge() with inner/left/outer join types, pd.concat(), and df.join()
  • df.groupby() with agg() for multi-column aggregations and pivot_table()
  • Time series: pd.to_datetime(), resample(), rolling() moving averages, and dt accessor
  • Missing value detection with isna(), filling with fillna(), and interpolate()

Frequently Asked Questions

What is the difference between df.loc[] and df.iloc[]?

`df.loc[]` selects rows and columns by label — this includes index labels and column names. `df.iloc[]` selects by integer position, starting from 0. Use `loc` when your index has meaningful labels (like dates or IDs), and `iloc` for position-based slicing like `df.iloc[0:5, 0:2]`, which selects the first 5 rows and first 2 columns. Note that `loc` slices are inclusive of the end label, while `iloc` slices exclude the end position, matching standard Python slicing.
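A small sketch of the difference, using a toy DataFrame with string index labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

by_label = df.loc["y", "b"]    # row labeled "y", column "b"
by_pos = df.iloc[1, 1]         # second row, second column -> same cell
label_slice = df.loc["x":"y"]  # label slice: INCLUDES the end label "y"
pos_slice = df.iloc[0:2]       # position slice: EXCLUDES position 2
```

Both slices return two rows here, but for different reasons: `loc` keeps the end label, while `iloc` stops before the end position.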

How do I filter a DataFrame with multiple conditions?

Use boolean indexing with `&` for AND and `|` for OR: `df[(df["age"] > 20) & (df["city"] == "Seoul")]`. Wrap each condition in parentheses because of Python operator precedence. Alternatively, use `df.query("age > 20 and city == 'Seoul'")` for a more readable string-based syntax.
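Both forms can be compared side by side; the column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 19, 31], "city": ["Seoul", "Seoul", "Busan"]})

# Boolean indexing: each condition wrapped in parentheses, joined with &.
by_mask = df[(df["age"] > 20) & (df["city"] == "Seoul")]

# Equivalent string-expression form.
by_query = df.query("age > 20 and city == 'Seoul'")
```

Both return the same single row (age 25, Seoul); `query` trades a little speed for readability on complex conditions.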

What is the difference between pd.merge() and pd.concat()?

`pd.merge()` joins DataFrames on a key column, similar to SQL JOIN operations. It supports inner, left, right, and outer join types. `pd.concat()` stacks DataFrames either vertically (along rows, axis=0) or horizontally (along columns, axis=1) without matching on a key. Use merge for relational joins and concat for combining datasets of the same structure.
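A minimal contrast between the two, with hypothetical employee/department tables:

```python
import pandas as pd

emp = pd.DataFrame({"dept_id": [1, 2, 3], "name": ["Kim", "Lee", "Park"]})
dept = pd.DataFrame({"dept_id": [1, 2], "dept": ["Sales", "HR"]})

# Relational join on a key column.
inner = pd.merge(emp, dept, on="dept_id", how="inner")  # only matching keys
left = pd.merge(emp, dept, on="dept_id", how="left")    # keep all emp rows

# Stacking without key matching.
stacked = pd.concat([emp, emp], axis=0, ignore_index=True)
```

The left join keeps Park's row with a NaN in `dept`, which the inner join drops; `concat` simply doubles the rows.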

How do I perform multiple aggregations in a single groupby?

Use the named aggregation syntax with `agg()`: `df.groupby("dept").agg(avg=("salary", "mean"), cnt=("name", "count"))`. This creates a result DataFrame with columns named `avg` and `cnt`. You can mix any aggregation functions including "sum", "mean", "min", "max", "count", "std", or a custom lambda function.
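The named-aggregation call from the answer, runnable on a toy table:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["A", "A", "B"],
    "salary": [100, 200, 300],
    "name": ["Kim", "Lee", "Park"],
})

# Each keyword becomes an output column: (input column, aggregation function).
out = df.groupby("dept").agg(
    avg=("salary", "mean"),
    cnt=("name", "count"),
)
```

The result is indexed by `dept` with exactly the columns `avg` and `cnt`, which avoids the MultiIndex columns produced by the older dict-of-lists `agg` style.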

How do I resample a time series to a different frequency?

First set the datetime column as the index: `df = df.set_index("date")`. Then call `df.resample()` with a frequency alias followed by an aggregation, e.g. `df.resample("MS").mean()` for a month-start monthly mean, `"W"` for weekly, or `"D"` for daily. Note that in pandas 2.2+, the month-end alias is `"ME"` and the hourly alias is `"h"`; the older `"M"` and `"H"` aliases are deprecated. The `pd.date_range()` function generates a DatetimeIndex for creating complete date ranges with a specified frequency.
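A sketch of the pattern on synthetic daily data (using `"MS"`, the month-start alias, which works across pandas versions):

```python
import pandas as pd

# 60 consecutive daily observations starting 2024-01-01.
idx = pd.date_range("2024-01-01", periods=60, freq="D")
df = pd.DataFrame({"value": range(60)}, index=idx)

monthly = df.resample("MS").mean()  # one row per month (Jan, Feb)
weekly = df.resample("W").sum()     # weekly totals
```

January covers values 0–30 (mean 15.0) and February values 31–59 (mean 45.0), so `monthly` has exactly two rows.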

How do I handle missing values in pandas?

Check for missing values with `df.isna().sum()` to count NaN per column. Use `df.fillna(value)` to replace NaN with a constant, `df.fillna(df.mean())` to fill with column means, or `df.interpolate(method="linear")` for time series gap filling. Remove rows with missing values using `df.dropna(subset=["col1", "col2"])` to only drop rows where specific columns are NaN.
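The detection, filling, and dropping patterns together on a small numeric frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

counts = df.isna().sum()                  # NaN count per column
filled = df.fillna(df.mean())             # fill each column with its mean
interp = df.interpolate(method="linear")  # linear interpolation along the index
dropped = df.dropna(subset=["a"])         # drop rows where "a" is NaN
```

One caveat worth noting: by default `interpolate` only fills gaps between known values, so the leading NaN in column `b` is left untouched.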

How do I read and write Parquet files with pandas?

Use `df.to_parquet("data.parquet")` to write and `pd.read_parquet("data.parquet")` to read; both require a Parquet engine such as `pyarrow` or `fastparquet` to be installed. Parquet is a columnar binary format that is significantly faster to read and produces smaller files than CSV, especially for large datasets. It also preserves data types (including datetime and categorical) without needing to specify dtype on read.

What is the str accessor in pandas and when should I use it?

The `.str` accessor on a Series exposes vectorized string methods that apply to every element. Common methods include `str.lower()`, `str.upper()`, `str.contains("pattern")`, `str.split(" ")`, `str.strip()`, and `str.replace()`. Always use the `.str` accessor instead of applying Python string methods with `apply(lambda x: x.lower())` for much better performance on large Series.
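The common `.str` methods chained together on a small Series:

```python
import pandas as pd

s = pd.Series(["  Apple ", "banana", "Cherry"])

# Vectorized: each method applies to every element, and chains naturally.
clean = s.str.strip().str.lower()
has_an = s.str.contains("an")            # substring/regex match per element
parts = pd.Series(["a b", "c d"]).str.split(" ")  # list of tokens per element
```

Each call returns a new Series, so methods chain without an explicit loop or `apply`.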