From Pandas to Polars: Rust’s DataFrame Library

Why Polars in Rust?

  • Speed without the drama. Polars runs a multi‑threaded, vectorized engine over Arrow memory—so group‑bys, joins, window functions, and scans can saturate your CPU cores out of the box.
  • Bigger‑than‑RAM processing. Flip on streaming execution to process big datasets in batches, instead of face‑planting on memory.
  • Lazy execution that actually optimizes. Build a query plan now, execute later; Polars pushes projections/filters down to the file scan and reorders work for fewer passes over data.

If you’re building CLIs, ETL jobs, services, or anything where runtime, memory, or deployable binaries matter, Polars + Rust is a good fit. If you want a notebook‑first workflow with a giant ecosystem of plotting/ML tools, pandas still has the advantage (more on that later).


Install & project setup

Add Polars to your Cargo.toml. Features are opt-in, so enable the ones you use (lazy and csv here; parquet is optional but common):

[package]
name = "polars_intro"
version = "0.1.0"
edition = "2025"

[dependencies]
polars = { version = "0.50.0", features = ["lazy", "csv", "parquet"] }
planus = { version = "=1.1.1" }

Polars’ own install page shows how to enable features via cargo add polars -F lazy or by listing them in Cargo.toml.

Heads‑up: with some versions, turning on parquet may need an extra dependency (e.g., zlib-rs) due to resolver quirks. Pin your versions and check the related issue if you hit it.

The sample data used in this post: a customers.csv file (customer_id, name, email, city) and an orders.csv file (order_id, customer_id, amount).

Eager Polars: quick CSV read

Use the eager API when you just need a DataFrame now.

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Read a CSV eagerly into a DataFrame
    let df = CsvReadOptions::default()
        .with_has_header(true)
        .try_into_reader_with_file_path(Some("./src/customers.csv".into()))?
        .finish()?;

    println!("DataFrame contents:");
    println!("{df}");
    
    println!("\nDataFrame metadata:");
    println!("Shape: {:?}", df.shape());
    println!("Columns: {}", df.get_column_names().len());
    println!("Rows: {}", df.height());
    
    println!("\nColumn information:");
    for (name, dtype) in df.get_column_names().iter().zip(df.dtypes().iter()) {
        println!("  {}: {:?}", name, dtype);
    }
    
    println!("\nMemory usage: {} bytes", df.estimated_size());
    
    // Show first few rows
    println!("\nFirst 5 rows:");
    println!("{}", df.head(Some(5)));
    Ok(())
}

// Result ...
DataFrame metadata:
Shape: (10000, 4)
Columns: 4
Rows: 10000
...
Column information:
  customer_id: Int64
  name: String
  email: String
  city: String

Memory usage: 551142 bytes
...
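
Eager calls execute immediately and hand back new DataFrames. Here is a minimal sketch of selecting and sorting on the frame we just loaded (column names taken from the customers.csv above; exact method signatures can shift between Polars releases):

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = CsvReadOptions::default()
        .with_has_header(true)
        .try_into_reader_with_file_path(Some("./src/customers.csv".into()))?
        .finish()?;

    // Eager: runs right away and returns a new DataFrame
    let names_and_cities = df.select(["name", "city"])?;
    println!("{}", names_and_cities.head(Some(5)));

    // Sorting is eager too
    let sorted = df.sort(["name"], SortMultipleOptions::default())?;
    println!("{}", sorted.head(Some(5)));

    Ok(())
}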

Lazy Polars: scan, filter, group, join

For anything non‑trivial, prefer the lazy API. You “scan” files (no full read yet), build an expression graph, then collect() the result. This enables predicate/projection pushdown and better parallelism.

Build a top‑customers report

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Lazily scan the inputs (no data loaded yet)
    let orders = LazyCsvReader::new(PlPath::new("data/orders.csv"))
        .with_has_header(true)
        .finish()?; // -> LazyFrame

    let customers = LazyCsvReader::new(PlPath::new("data/customers.csv"))
        .with_has_header(true)
        .finish()?;

    // Derive metrics per customer: total spend + order count
    let top_customers = orders
        .group_by([col("customer_id")])
        .agg([
            col("amount").sum().alias("total_spend"),
            col("order_id").count().alias("orders"),
        ])
        // Join customer attributes
        .join(
            customers,
            vec![col("customer_id")],
            vec![col("customer_id")],
            JoinArgs::new(JoinType::Left),
        )
        // Sort by total_spend descending
        .sort_by_exprs(
            vec![col("total_spend")],
            SortMultipleOptions::new().with_order_descending(true),
        )
        .limit(10)
        .collect()?; // executes lazily-built plan

    println!("{top_customers}");
    Ok(())
}
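
Want proof that the pushdown actually happens? Ask the LazyFrame for its plan before collecting. A small sketch using explain(true) to print the optimized plan (the exact output format varies by Polars version):

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let orders = LazyCsvReader::new(PlPath::new("data/orders.csv"))
        .with_has_header(true)
        .finish()?;

    // Build a plan but don't execute it; render the optimized version instead.
    let plan = orders
        .filter(col("amount").gt(lit(100)))
        .select([col("customer_id"), col("amount")])
        .explain(true)?; // true = show the optimized plan

    // Expect the filter and the two-column projection to be pushed into the CSV scan
    println!("{plan}");
    Ok(())
}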

Window/conditional logic with when/then/otherwise

You’ll often need bucketing or conditional columns. Polars has a clear DSL for this:

use polars::prelude::*;

fn bucketize_orders(orders: LazyFrame) -> PolarsResult<DataFrame> {
    orders
        .with_column(
            when(col("amount").lt_eq(lit(50)))
                .then(lit("small"))
                .when(col("amount").lt_eq(lit(200)))
                .then(lit("medium"))
                .otherwise(lit("large"))
                .alias("size_bucket"),
        )
        .select([
            col("order_id"),
            col("customer_id"),
            col("amount"),
            col("size_bucket"),
        ])
        .collect()
}

fn main() {
    let orders = match LazyCsvReader::new(PlPath::new("data/orders.csv"))
        .with_has_header(true)
        .finish()
    {
        Ok(orders) => orders,
        Err(e) => {
            eprintln!("Error reading CSV: {}", e);
            return;
        }
    };

    match bucketize_orders(orders) {
        Ok(df) => println!("{:?}", df.head(Some(10))),
        Err(e) => eprintln!("Error: {}", e),
    }
}
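
The heading promises window logic too, so here is a hedged sketch of over(): it computes a per-customer aggregate and broadcasts it back to every row of that customer, instead of collapsing rows like group_by. Column names match the orders.csv used above; add_customer_averages and the derived column names are just illustrative.

use polars::prelude::*;

// Attach per-customer window aggregates to every order row.
// over([...]) turns the aggregation into a window: one value per partition,
// broadcast back to each row in that partition.
fn add_customer_averages(orders: LazyFrame) -> PolarsResult<DataFrame> {
    orders
        .with_columns([
            // Average order amount for this customer, repeated on each of their rows
            col("amount")
                .mean()
                .over([col("customer_id")])
                .alias("customer_avg"),
            // How this order compares to that average
            (col("amount") / col("amount").mean().over([col("customer_id")]))
                .alias("vs_customer_avg"),
        ])
        .collect()
}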

Streaming (bigger‑than‑RAM) execution

If your data doesn’t fit in memory, tell Polars to stream the query:

use std::time::Instant;

use polars::prelude::{col, lit, LazyCsvReader, LazyFileListReader, PlPath};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Record the start time
    let now = Instant::now();

    let lf = LazyCsvReader::new(PlPath::new("data/5m-rows.csv"))
        .with_has_header(true)
        .finish()?
        .filter(col("amount").gt(lit(5)))
        .group_by([col("customer_id")])
        .agg([col("amount").mean().alias("avg_amount")]);

    // Toggle streaming engine for the collection
    let result = lf.with_new_streaming(true).collect();

    // Record the end time and calculate the duration
    let elapsed = now.elapsed();
    
    // Print the result
    println!("{:?}", result?.head(Some(10)));
    
    // Print the execution time
    println!("Execution time: {:.2?}", elapsed);

    Ok(())
}

Write to Parquet

Parquet is a compact, columnar format that lines up well with Arrow in memory—which is why Polars reads/writes it quickly.

use std::fs::File;

use polars::{
    error::PolarsResult,
    frame::DataFrame,
    prelude::{
        LazyCsvReader, LazyFileListReader, ParquetWriter, PlPath, StatisticsOptions, col, lit,
    },
};

fn write_parquet(mut df: DataFrame) -> PolarsResult<()> {
    // Make sure the output directory exists before creating the file
    std::fs::create_dir_all("out")?;
    let mut file = File::create("out/sample.parquet")?;
    ParquetWriter::new(&mut file)
        .with_statistics(StatisticsOptions::default())
        .finish(&mut df)?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let lf = LazyCsvReader::new(PlPath::new("data/5m-rows.csv"))
        .with_has_header(true)
        .finish()?
        .filter(col("amount").gt(lit(5)))
        .group_by([col("customer_id")])
        .agg([col("amount").mean().alias("avg_amount")]);
    // Collect with the streaming engine, then persist the result as Parquet
    let df = lf.with_new_streaming(true).collect()?;
    write_parquet(df)?;
    Ok(())
}
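
To round-trip, the file we just wrote can be scanned back lazily, so Parquet reads also benefit from projection pushdown. A sketch assuming the out/sample.parquet produced above (whether scan_parquet takes a plain path or a PlPath depends on your Polars version):

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Lazily scan the Parquet file written above; only the selected columns get read.
    let df = LazyFrame::scan_parquet(
        PlPath::new("out/sample.parquet"),
        ScanArgsParquet::default(),
    )?
    .select([col("customer_id"), col("avg_amount")])
    .collect()?;

    println!("{}", df.head(Some(5)));
    Ok(())
}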

The same pipeline in pandas

import pandas as pd

orders = pd.read_csv("data/orders.csv")
customers = pd.read_csv("data/customers.csv")

top_customers = (
    orders.groupby("customer_id")
    .agg(total_spend=("amount", "sum"), orders=("order_id", "size"))
    .reset_index()
    .merge(customers, on="customer_id", how="left")
    .sort_values("total_spend", ascending=False)
    .head(10)
)

# bucketize
def bucket(x):
    if x <= 50: return "small"
    if x <= 200: return "medium"
    return "large"

orders["size_bucket"] = orders["amount"].apply(bucket)

For large files that don’t fit in memory, pandas recommends chunked reads (chunksize), but advanced operations (e.g., groupby) are not always straightforward to do chunk‑wise—at which point switching libraries (Dask, Modin, or Polars) is often the suggested path.

Polars vs pandas: the blunt comparison

Area by area, Polars (Rust) vs pandas (Python):

  • Engine. Polars: vectorized, multi‑threaded by default; Arrow columnar memory. pandas: mostly single‑threaded core; NumPy arrays by default (PyArrow optional).
  • Execution. Polars: lazy + eager; lazy enables pushdown and other optimizations. pandas: eager only (no native lazy); you can simulate pipelines but there is no global query optimizer.
  • Scale. Polars: streaming lets you process larger‑than‑RAM datasets from a single binary. pandas: chunking works for IO; many transformations aren’t trivial to do chunk‑wise.
  • Language fit. Polars: compiled Rust binaries; great for CLIs/services. pandas: best in notebooks/interactive analysis with a massive ecosystem.
  • Memory model. Polars: Arrow (zero‑copy interop, columnar). pandas: NumPy by default; can opt into PyArrow‑backed dtypes for some wins.
  • Syntax. Polars: expression DSL (col, when, over, etc.). pandas: imperative method chaining (df.groupby(...).agg(...).sort_values(...)).
  • Performance reality. Polars: usually faster on big, columnar‑friendly workloads; don’t expect miracles on tiny frames (call overhead and IO dominate). pandas: often snappy on small/medium data; can be slower as size/complexity grows.

For benchmark numbers (May 2025), see the Polars documentation.

Wrap‑up

  • Use eager Polars for small, one‑off tasks;
  • Use lazy Polars (and streaming) for serious pipelines;
  • Measure your real workload instead of trusting hype;
  • Don’t expect data‑viz/ML batteries from the Rust ecosystem the way you do in Python—pair Polars with the right tools for your target.
