
Pandas vs. Polars: A Complete Comparison of Syntax, Speed, and Memory



 

Introduction

 
If you’ve been working with data in Python, you’ve almost certainly used pandas. It’s been the go-to library for data manipulation for over a decade. But recently, Polars has been gaining serious traction. Polars promises to be faster, more memory-efficient, and more intuitive than pandas. But is it worth learning? And how different is it really?

In this article, we’ll compare pandas and Polars side by side: you’ll see performance benchmarks and learn the key syntax differences. By the end, you’ll be able to make an informed decision for your next data project.

You can find the code on GitHub.

 

Getting Started

 
Let’s get both libraries installed first:

pip install pandas polars

 

Note: This article uses pandas 2.2.2 and Polars 1.31.0.

For this comparison, we’ll also need a dataset that’s large enough to show real performance differences. We’ll generate it with Faker, so install that too if you don’t already have it:

pip install faker

Now we’re ready to start coding.

 

Measuring Speed By Reading Large CSV Files

 
Let’s start with one of the most common operations: reading a CSV file. We’ll create a dataset with 1 million rows to see real performance differences.

First, let’s generate our sample data:

import pandas as pd
from faker import Faker
import random

# Generate a large CSV file for testing
fake = Faker()
Faker.seed(42)
random.seed(42)

N = 1_000_000
data = {
    'user_id': range(N),
    'name': [fake.name() for _ in range(N)],
    'email': [fake.email() for _ in range(N)],
    'age': [random.randint(18, 80) for _ in range(N)],
    'salary': [random.randint(30000, 150000) for _ in range(N)],
    'department': [random.choice(['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'])
                   for _ in range(N)]
}

df_temp = pd.DataFrame(data)
df_temp.to_csv('large_dataset.csv', index=False)
print("✓ Generated large_dataset.csv with 1M rows")

 

This code creates a CSV file with realistic data. Now let’s compare reading speeds:

import pandas as pd
import polars as pl
import time

# pandas: Read CSV
start = time.time()
df_pandas = pd.read_csv('large_dataset.csv')
pandas_time = time.time() - start

# Polars: Read CSV
start = time.time()
df_polars = pl.read_csv('large_dataset.csv')
polars_time = time.time() - start

print(f"Pandas read time: {pandas_time:.2f} seconds")
print(f"Polars read time: {polars_time:.2f} seconds")
print(f"Polars is {pandas_time/polars_time:.1f}x faster")

 

Output when reading the sample CSV:

Pandas read time: 1.92 seconds
Polars read time: 0.23 seconds
Polars is 8.2x faster

 

Here’s what’s happening: we time how long each library takes to read the same CSV file and compute the speedup factor. While pandas uses a traditional single-threaded CSV reader, Polars automatically parallelizes the read across multiple CPU cores.

On most machines, Polars will be several times faster at reading CSVs, and the gap widens as files get larger.

 

Measuring Memory Usage During Operations

 
Speed isn’t the only consideration. Let’s see how much memory each library uses while performing the same filtering and grouping operations. Install psutil first if it isn’t already in your environment (pip install psutil):

import pandas as pd
import polars as pl
import psutil
import os
import gc # Import garbage collector for better memory release attempts

def get_memory_usage():
    """Get current process memory usage in MB"""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

# --- Test with Pandas ---
gc.collect()
initial_memory_pandas = get_memory_usage()

df_pandas = pd.read_csv('large_dataset.csv')
filtered_pandas = df_pandas[df_pandas['age'] > 30]
grouped_pandas = filtered_pandas.groupby('department')['salary'].mean()

pandas_memory = get_memory_usage() - initial_memory_pandas
print(f"Pandas memory delta: {pandas_memory:.1f} MB")

del df_pandas, filtered_pandas, grouped_pandas
gc.collect()

# --- Test with Polars (eager mode) ---
gc.collect()
initial_memory_polars = get_memory_usage()

df_polars = pl.read_csv('large_dataset.csv')
filtered_polars = df_polars.filter(pl.col('age') > 30)
grouped_polars = filtered_polars.group_by('department').agg(pl.col('salary').mean())

polars_memory = get_memory_usage() - initial_memory_polars
print(f"Polars memory delta: {polars_memory:.1f} MB")

del df_polars, filtered_polars, grouped_polars
gc.collect()

# --- Summary ---
if pandas_memory > 0 and polars_memory > 0:
  print(f"Memory savings (Polars vs Pandas): {(1 - polars_memory/pandas_memory) * 100:.1f}%")
elif pandas_memory == 0 and polars_memory > 0:
  print(f"Polars used {polars_memory:.1f} MB while Pandas used 0 MB.")
elif polars_memory == 0 and pandas_memory > 0:
  print(f"Polars used 0 MB while Pandas used {pandas_memory:.1f} MB.")
else:
  print("Cannot compute memory savings due to zero or negative memory usage delta in both frameworks.")

 

This code measures the memory footprint:

  1. We use the psutil library to track memory usage before and after operations
  2. Both libraries read the same file and perform filtering and grouping
  3. We calculate the difference in memory consumption

Sample output:

Pandas memory delta: 44.4 MB
Polars memory delta: 1.3 MB
Memory savings (Polars vs Pandas): 97.1%

 

The results above show the memory usage delta for both pandas and Polars when performing filtering and aggregation operations on the large_dataset.csv.

  • pandas memory delta: Indicates the memory consumed by pandas for the operations.
  • Polars memory delta: Indicates the memory consumed by Polars for the same operations.
  • Memory savings (Polars vs pandas): This metric provides a percentage of how much less memory Polars used compared to pandas.

It’s common for Polars to use less memory thanks to its columnar, Arrow-based storage and optimized execution engine. The savings vary by workload: 30% to 70% is typical, and simple filter-and-aggregate pipelines like this one can show even larger deltas.

 

Note: However, sequential memory measurements within the same Python process using psutil.Process(...).memory_info().rss can sometimes be misleading. Python’s memory allocator doesn’t always release memory back to the operating system immediately, so a ‘cleaned’ baseline for a subsequent test might still be influenced by prior operations. For the most accurate comparisons, tests should ideally be run in separate, isolated Python processes.
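One way to follow that advice is to run each measurement in its own interpreter via subprocess. This is a sketch under stated assumptions: the snippet being measured is a stand-in mini-benchmark, not the article's full pipeline, and psutil must be installed in the child environment:

```python
import subprocess
import sys
import textwrap

# The measured snippet runs in a fresh interpreter, so its memory
# baseline isn't polluted by allocations made earlier in this process.
snippet = textwrap.dedent("""
    import os
    import psutil
    proc = psutil.Process(os.getpid())
    before = proc.memory_info().rss
    import pandas as pd
    df = pd.DataFrame({"x": range(100_000)})
    after = proc.memory_info().rss
    print((after - before) / 1024 / 1024)
""")

result = subprocess.run(
    [sys.executable, "-c", snippet],
    capture_output=True, text=True, check=True,
)
print(f"Isolated pandas delta: {float(result.stdout):.1f} MB")
```

Running the pandas and Polars measurements as two separate child processes like this gives each one a clean baseline.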

 

Comparing Syntax For Basic Operations

 
Now let’s look at how syntax differs between the two libraries. We’ll cover the most common operations you’ll use.

 

// Selecting Columns

Let’s select a subset of columns. We’ll create a much smaller DataFrame for this (and subsequent examples).

import pandas as pd
import polars as pl

# Create sample data
data = {
    'name': ['Anna', 'Betty', 'Cathy'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
}

# Pandas approach
df_pandas = pd.DataFrame(data)
result_pandas = df_pandas[['name', 'salary']]

# Polars approach
df_polars = pl.DataFrame(data)
result_polars = df_polars.select(['name', 'salary'])
# Alternative: More expressive
result_polars_alt = df_polars.select([pl.col('name'), pl.col('salary')])

print("Pandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)

 

The key differences here:

  • pandas uses bracket notation: df[['col1', 'col2']]
  • Polars uses the .select() method
  • Polars also supports the more expressive pl.col() syntax, which becomes powerful for complex operations

Output:

Pandas result:
    name  salary
0   Anna   50000
1  Betty   60000
2  Cathy   70000

Polars result:
shape: (3, 2)
┌───────┬────────┐
│ name  ┆ salary │
│ ---   ┆ ---    │
│ str   ┆ i64    │
╞═══════╪════════╡
│ Anna  ┆ 50000  │
│ Betty ┆ 60000  │
│ Cathy ┆ 70000  │
└───────┴────────┘

 

Both produce the same output, but Polars’ syntax is more explicit about what you’re doing.

 

// Filtering Rows

Now let’s filter rows:

# pandas: Filter rows where age > 28
filtered_pandas = df_pandas[df_pandas['age'] > 28]

# Alternative Pandas syntax with query
filtered_pandas_alt = df_pandas.query('age > 28')

# Polars: Filter rows where age > 28
filtered_polars = df_polars.filter(pl.col('age') > 28)

print("Pandas filtered:")
print(filtered_pandas)
print("\nPolars filtered:")
print(filtered_polars)

 

Notice the differences:

  • In pandas, we use boolean indexing with bracket notation. You can also use the .query() method.
  • Polars uses the .filter() method with pl.col() expressions.
  • Polars’ syntax reads more like SQL: “filter where column age is greater than 28”.

Output:

Pandas filtered:
    name  age  salary
1  Betty   30   60000
2  Cathy   35   70000

Polars filtered:
shape: (2, 3)
┌───────┬─────┬────────┐
│ name  ┆ age ┆ salary │
│ ---   ┆ --- ┆ ---    │
│ str   ┆ i64 ┆ i64    │
╞═══════╪═════╪════════╡
│ Betty ┆ 30  ┆ 60000  │
│ Cathy ┆ 35  ┆ 70000  │
└───────┴─────┴────────┘

 

// Adding New Columns

Now let’s add new columns to the DataFrame:

# pandas: Add a new column
df_pandas['bonus'] = df_pandas['salary'] * 0.1
df_pandas['total_comp'] = df_pandas['salary'] + df_pandas['bonus']

# Polars: Add new columns
# (columns created in the same with_columns call can't reference each
# other, so total_comp is computed from salary directly)
df_polars = df_polars.with_columns([
    (pl.col('salary') * 0.1).alias('bonus'),
    (pl.col('salary') * 1.1).alias('total_comp')
])

print("Pandas with new columns:")
print(df_pandas)
print("\nPolars with new columns:")
print(df_polars)

 

Output:

Pandas with new columns:
    name  age  salary   bonus  total_comp
0   Anna   25   50000  5000.0     55000.0
1  Betty   30   60000  6000.0     66000.0
2  Cathy   35   70000  7000.0     77000.0

Polars with new columns:
shape: (3, 5)
┌───────┬─────┬────────┬────────┬────────────┐
│ name  ┆ age ┆ salary ┆ bonus  ┆ total_comp │
│ ---   ┆ --- ┆ ---    ┆ ---    ┆ ---        │
│ str   ┆ i64 ┆ i64    ┆ f64    ┆ f64        │
╞═══════╪═════╪════════╪════════╪════════════╡
│ Anna  ┆ 25  ┆ 50000  ┆ 5000.0 ┆ 55000.0    │
│ Betty ┆ 30  ┆ 60000  ┆ 6000.0 ┆ 66000.0    │
│ Cathy ┆ 35  ┆ 70000  ┆ 7000.0 ┆ 77000.0    │
└───────┴─────┴────────┴────────┴────────────┘

 

Here’s what is happening:

  • pandas uses direct column assignment, which modifies the DataFrame in place
  • Polars uses .with_columns() and returns a new DataFrame (immutable by default)
  • In Polars, you use .alias() to name the new column

The Polars approach promotes immutability and makes data transformations more readable.

 

Measuring Performance In Grouping And Aggregating

 
Let’s look at a more useful example: grouping data and calculating multiple aggregations. This code shows how we group data by department, calculate multiple statistics on different columns, and time both operations to see the performance difference:

# Load our large dataset
df_pandas = pd.read_csv('large_dataset.csv')
df_polars = pl.read_csv('large_dataset.csv')

# pandas: Group by department and calculate stats
import time

start = time.time()
result_pandas = df_pandas.groupby('department').agg({
    'salary': ['mean', 'median', 'std'],
    'age': 'mean'
}).reset_index()
result_pandas.columns = ['department', 'avg_salary', 'median_salary', 'std_salary', 'avg_age']
pandas_time = time.time() - start

# Polars: Same operation
start = time.time()
result_polars = df_polars.group_by('department').agg([
    pl.col('salary').mean().alias('avg_salary'),
    pl.col('salary').median().alias('median_salary'),
    pl.col('salary').std().alias('std_salary'),
    pl.col('age').mean().alias('avg_age')
])
polars_time = time.time() - start

print(f"Pandas time: {pandas_time:.3f}s")
print(f"Polars time: {polars_time:.3f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x")
print("\nPandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)

 

Output:


Pandas time: 0.126s
Polars time: 0.077s
Speedup: 1.6x

Pandas result:
    department    avg_salary  median_salary    std_salary    avg_age
0  Engineering  89954.929266        89919.0  34595.585863  48.953405
1      Finance  89898.829762        89817.0  34648.373383  49.006690
2           HR  90080.629637        90177.0  34692.117761  48.979005
3    Marketing  90071.721095        90154.0  34625.095386  49.085454
4        Sales  89980.433386        90065.5  34634.974505  49.003168

Polars result:
shape: (5, 5)
┌─────────────┬──────────────┬───────────────┬──────────────┬───────────┐
│ department  ┆ avg_salary   ┆ median_salary ┆ std_salary   ┆ avg_age   │
│ ---         ┆ ---          ┆ ---           ┆ ---          ┆ ---       │
│ str         ┆ f64          ┆ f64           ┆ f64          ┆ f64       │
╞═════════════╪══════════════╪═══════════════╪══════════════╪═══════════╡
│ HR          ┆ 90080.629637 ┆ 90177.0       ┆ 34692.117761 ┆ 48.979005 │
│ Sales       ┆ 89980.433386 ┆ 90065.5       ┆ 34634.974505 ┆ 49.003168 │
│ Engineering ┆ 89954.929266 ┆ 89919.0       ┆ 34595.585863 ┆ 48.953405 │
│ Marketing   ┆ 90071.721095 ┆ 90154.0       ┆ 34625.095386 ┆ 49.085454 │
│ Finance     ┆ 89898.829762 ┆ 89817.0       ┆ 34648.373383 ┆ 49.00669  │
└─────────────┴──────────────┴───────────────┴──────────────┴───────────┘

 

Breaking down the syntax:

  • pandas uses a dictionary to specify aggregations, which can be confusing with complex operations
  • Polars uses method chaining: each operation is clear and named

The Polars syntax is more verbose but also more readable. You can immediately see what statistics are being calculated.
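As an aside, pandas also supports named aggregation, which avoids the manual column-renaming step from the example above; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['HR', 'HR', 'Sales'],
    'salary': [50000, 60000, 70000],
})

# Named aggregation: each keyword argument becomes an output column,
# so no reset of df.columns is needed afterwards
result = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
).reset_index()

print(result)
```

This brings the pandas version closer to the readability of the Polars expression chain.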

 

Understanding Lazy Evaluation In Polars

 
Lazy evaluation is one of Polars’ most powerful features. Instead of executing your query immediately, Polars builds a plan for the entire operation and optimizes it before running.

Let’s see this in action:

import polars as pl

# Read in lazy mode
df_lazy = pl.scan_csv('large_dataset.csv')

# Build a complex query
result = (
    df_lazy
    .filter(pl.col('age') > 30)
    .filter(pl.col('salary') > 50000)
    .group_by('department')
    .agg([
        pl.col('salary').mean().alias('avg_salary'),
        pl.len().alias('employee_count')
    ])
    .filter(pl.col('employee_count') > 1000)
    .sort('avg_salary', descending=True)
)

# Nothing has been executed yet!
print("Query plan created, but not executed")

# Now execute the optimized query
import time
start = time.time()
result_df = result.collect()  # This runs the query
execution_time = time.time() - start

print(f"\nExecution time: {execution_time:.3f}s")
print(result_df)

 

Output:

Query plan created, but not executed

Execution time: 0.177s
shape: (5, 3)
┌─────────────┬───────────────┬────────────────┐
│ department  ┆ avg_salary    ┆ employee_count │
│ ---         ┆ ---           ┆ ---            │
│ str         ┆ f64           ┆ u32            │
╞═════════════╪═══════════════╪════════════════╡
│ HR          ┆ 100101.595816 ┆ 132212         │
│ Marketing   ┆ 100054.012365 ┆ 132470         │
│ Sales       ┆ 100041.01049  ┆ 132035         │
│ Finance     ┆ 99956.527217  ┆ 132143         │
│ Engineering ┆ 99946.725458  ┆ 132384         │
└─────────────┴───────────────┴────────────────┘

 

Here, scan_csv() doesn’t load the file immediately; it only records that it will need to read it. We chain multiple filters, groupings, and a sort, and Polars analyzes the entire query before touching the data. For example, it can push the age and salary filters down into the scan so rows are discarded as the file is read.

Only when we call .collect() does the actual computation happen. Because the whole pipeline is optimized as a unit, it typically runs faster than executing each step eagerly.

 

Wrapping Up

 
As we’ve seen, Polars is a serious option for data processing in Python: it’s faster, more memory-efficient, and has a clean, expressive API. That said, pandas isn’t going anywhere. It has over a decade of development, a massive ecosystem, and millions of users. For many projects, pandas is still the right choice.

Learn Polars if you work with large datasets or build data pipelines. The syntax differences aren’t huge, and the performance gains are real. But keep pandas in your toolkit for ecosystem compatibility and quick exploratory work.

Start by trying Polars on a side project or a data pipeline that’s running slowly. You’ll quickly get a feel for whether it’s right for your use case. Happy data wrangling!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.




