Loading Large CSV Files in Pandas by Using Chunks

Jul 21, 2024

Introduction

  • Learn how to load large CSV files into pandas by splitting them into chunks and processing them step by step.

Problems with Loading Large CSV Files

  • As a data scientist, you will often work with large datasets (multiple gigabytes in size).
  • Example: a 50GB CSV file on a machine with only 32GB of RAM.
  • Pandas loads a CSV into a DataFrame and keeps the entire DataFrame in RAM while you operate on it.
  • Issue: a dataset larger than the available RAM cannot be loaded in its entirety.

Solution: Loading Data in Chunks

  • Approach: Split the data into smaller chunks and process it step by step (see the sketch below).
  • This keeps RAM from filling up, because only one chunk is in memory at a time.
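  • As a rough sketch of the idea (assuming a file named huge_dataset.csv and a chunk size of 1,000 rows), passing chunksize to read_csv returns an iterator of DataFrames instead of one giant DataFrame:
    import pandas as pd

    # chunksize makes read_csv return an iterator of DataFrames,
    # so only one chunk of rows is held in memory at a time
    reader = pd.read_csv('huge_dataset.csv', chunksize=1000)

    for chunk in reader:
        print(chunk.shape)  # each chunk is an ordinary DataFrame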

Prerequisites

  • You need pandas installed (pip install pandas).
  • Example CSV file: huge_dataset.csv (~4.23GB).
  • Import pandas: import pandas as pd.
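  • If you don't have a multi-gigabyte CSV at hand, a small synthetic file is enough to follow along. This is a hypothetical helper (the file name and the column letters A-H are assumptions chosen to match the later examples):
    import numpy as np
    import pandas as pd

    # hypothetical helper: write a small sample CSV with eight numeric
    # columns so the later snippets have something to read
    rng = np.random.default_rng(42)
    sample = pd.DataFrame(rng.random((10_000, 8)), columns=list('ABCDEFGH'))
    sample.to_csv('huge_dataset.csv', index=False)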

Basic Method

  • Load only a few rows to avoid memory issues:
    df = pd.read_csv('huge_dataset.csv', nrows=100)
    
  • Skip initial rows if needed (see the note below about the header row):
    df = pd.read_csv('huge_dataset.csv', skiprows=500, nrows=100)
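
  • Note that an integer skiprows also skips the header row, so the original column names are lost (which is why a later example assigns names manually). To keep the header, you can instead skip a range of row numbers that starts after it; a small sketch, assuming the file has a header row:
    import pandas as pd

    # skip data rows 1-500 but keep the header row (row 0),
    # then read the next 100 rows
    df = pd.read_csv('huge_dataset.csv', skiprows=range(1, 501), nrows=100)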
    

Processing Data in Chunks

  • Objective: Work through the full dataset, but process it piece by piece.
  • Example operation: Calculate a metric using selected columns.

Code Example

  1. Assign column names to the DataFrame (handy if the header row was skipped or the file has none):
    columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
    df.columns = columns
    
  2. Compute a metric:
    metric = df['E'] / df['G']
    

Full Chunk Processing Loop

import pandas as pd

# accumulate the per-chunk results in a single Series
metric_results = pd.Series([], dtype='float64')
counter = 0
chunk_size = 1000  # rows per chunk; tune this to your available RAM

for chunk in pd.read_csv('huge_dataset.csv', chunksize=chunk_size):
    chunk.columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
    # compute the metric for this chunk and append it to the results
    metric_results = pd.concat([metric_results, chunk['E'] / chunk['G']])
    counter += 1
    if counter == 20:  # example: stop after processing 20 chunks
        break

print(metric_results)
  • Iterate through the chunks, compute the metric for each, and concatenate the results.
  • Memory usage stays low because only one chunk of rows is loaded at a time; the accumulated results do grow, though (see the sketch below for a flat-memory alternative).
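
Note that concatenating every per-chunk result still grows with the size of the file. If all you need is an aggregate (for example, the mean of the metric), a running accumulator keeps memory flat. This is a sketch of that alternative, using the same assumed column names and chunk size as above:

import pandas as pd

# alternative sketch: keep only running totals instead of
# concatenating the per-chunk results
total = 0.0
count = 0

for chunk in pd.read_csv('huge_dataset.csv', chunksize=1000):
    chunk.columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
    metric = chunk['E'] / chunk['G']
    total += metric.sum()
    count += len(metric)

print('Mean metric:', total / count)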

Conclusion

  • Process large datasets efficiently in pandas by reading and handling them in chunks.
  • This avoids memory errors and makes better use of limited resources.

Additional Notes

  • Adjust the chunk size to your requirements and the RAM you have available.
  • Handle exceptions and errors so the loop is robust (a sketch follows).
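
For example, wrapping the chunk loop in a try/except guards against a missing file or malformed rows; the exception types shown here are just common ones, adjust them to your situation:

import pandas as pd

# sketch: basic error handling around the chunk loop
try:
    for chunk in pd.read_csv('huge_dataset.csv', chunksize=1000):
        chunk.columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
        metric = chunk['E'] / chunk['G']  # example per-chunk work
except FileNotFoundError:
    print('huge_dataset.csv was not found')
except pd.errors.ParserError as err:
    print('Failed to parse a chunk:', err)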

Remember: Processing large datasets in smaller chunks is key to managing limited RAM while using pandas.