Loading Large CSV Files in Pandas by Using Chunks
Introduction
- Learn how to load large CSV files into pandas by splitting them into chunks and processing them step by step.
Problems with Loading Large CSV Files
- As a data scientist, dealing with large datasets (multiple gigabytes) is common.
- Example: 50GB CSV file with only 32GB RAM.
- Pandas loads a CSV into a DataFrame and keeps it in RAM for all operations.
- Issue: a large dataset cannot fit entirely into limited RAM, as the quick check below illustrates.
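- A minimal sketch of checking how much RAM a loaded sample occupies, using the example file huge_dataset.csv described in the prerequisites below (the 100,000-row sample size is only illustrative):
import pandas as pd
# Load only a small sample; reading the full file at once could exhaust RAM.
sample = pd.read_csv('huge_dataset.csv', nrows=100_000)
# Report the in-memory footprint of the sample, in megabytes.
mb = sample.memory_usage(deep=True).sum() / 1024 ** 2
print(f'Sample of 100,000 rows uses {mb:.1f} MB of RAM')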
Solution: Loading Data in Chunks
- Approach: split the data into smaller chunks and process them step by step.
- Helps avoid filling up RAM entirely; a minimal illustration of the mechanism follows.
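- A minimal sketch, again using huge_dataset.csv: passing chunksize to pd.read_csv returns an iterator that yields one DataFrame per chunk instead of loading the whole file at once:
import pandas as pd
# With chunksize set, read_csv returns an iterator of DataFrames
# rather than a single DataFrame holding the entire file.
reader = pd.read_csv('huge_dataset.csv', chunksize=1000)
for chunk in reader:
    print(chunk.shape)  # each chunk holds at most 1000 rows
    break               # stop after the first chunk in this illustration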
Prerequisites
- You need pandas installed (pip install pandas).
- Example CSV file: huge_dataset.csv (~4.23GB).
- Import pandas: import pandas as pd.
Basic Method
- Load only a few rows to avoid memory issues:
df = pd.read_csv('huge_dataset.csv', nrows=100)
- Skip the first 500 rows if needed (note that skiprows also skips the header line, so column names may need to be assigned manually, as shown later):
df = pd.read_csv('huge_dataset.csv', skiprows=500, nrows=100)
- Such a small sample is also handy for inspecting the file's structure before committing to a full load, as sketched below.
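- A minimal sketch of inspecting the schema from a small sample:
import pandas as pd
# Read a small sample to learn the structure of the file cheaply.
sample = pd.read_csv('huge_dataset.csv', nrows=100)
print(sample.shape)   # number of rows and columns in the sample
print(sample.dtypes)  # inferred data type of each column
print(sample.head())  # first few rows for a quick sanity check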
Processing Data in Chunks
- Objective: Load full dataset but process part by part.
- Example operation: Calculate a metric using selected columns.
Code Example
- Define column names (df here is the small sample loaded above):
columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
df.columns = columns
- Compute a metric from two of the columns:
metric = df['E'] / df['G']
- These steps are combined into a runnable check below and then applied to every chunk in the full loop that follows.
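- Put together as a minimal end-to-end check on the 100-row sample, assuming the file has exactly eight columns:
import pandas as pd
# Work on a small sample first to verify the computation cheaply.
df = pd.read_csv('huge_dataset.csv', nrows=100)
# Assign the column names used throughout this example.
df.columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
# Compute the metric: element-wise ratio of column E to column G.
metric = df['E'] / df['G']
print(metric.head())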
Full Chunk Processing Loop
import pandas as pd
chunk_size = 1000  # number of rows to read per chunk
columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
partial_results = []  # per-chunk results, concatenated once after the loop
counter = 0
for chunk in pd.read_csv('huge_dataset.csv', chunksize=chunk_size):
    chunk.columns = columns  # rename the chunk's columns
    partial_results.append(chunk['E'] / chunk['G'])  # compute the metric for this chunk
    counter += 1
    if counter == 20:  # example: stop after processing 20 chunks
        break
metric_results = pd.concat(partial_results)  # combine all per-chunk results
print(metric_results)
- Iterate through chunks, compute metric, and concatenate results.
- Limits memory usage by processing data in manageable pieces.
Conclusion
- Efficiently process large datasets in pandas by reading and processing them in chunks.
- Avoids memory errors and keeps RAM usage within the machine's limits.
Additional Notes
- Adjust the chunk size to match the available RAM and the per-chunk workload.
- Handle exceptions and errors for robustness; a possible approach is sketched below.
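- One possible way to add basic error handling to the chunked loop (not part of the original example; pd.to_numeric with errors='coerce' turns unparsable values into NaN):
import pandas as pd
columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
partial_results = []
for chunk in pd.read_csv('huge_dataset.csv', chunksize=1000):
    try:
        chunk.columns = columns
        # Coerce unparsable values to NaN instead of failing on a bad row.
        e = pd.to_numeric(chunk['E'], errors='coerce')
        g = pd.to_numeric(chunk['G'], errors='coerce')
        partial_results.append(e / g)
    except (KeyError, ValueError) as exc:
        # Skip a malformed chunk but keep processing the rest of the file.
        print(f'Skipping chunk: {exc}')
metric_results = pd.concat(partial_results) if partial_results else pd.Series(dtype='float64')
print(metric_results)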
Remember: Processing large datasets in smaller chunks is key to managing limited RAM while using pandas.