TensorFlow Input Pipeline Lecture Notes

Jul 25, 2024

Introduction

  • Importance of TensorFlow input pipeline for deep learning projects.
  • Benefits discussed in the video with practical coding examples.

Image Classification Example: Cats and Dogs

  • Problem Statement: Classify cat and dog images stored on the hard disk.
  • Data Representation: Images must be converted to numbers (NumPy arrays or Pandas DataFrames) before ML models can use them.
  • Scalability Issue: A training set of 10 million images cannot be loaded into RAM all at once.

Streaming Approach

  • Load images in batches instead of all at once.
  • Uses a special data structure that streams data from disk efficiently.
  • Key Class: tf.data.Dataset for building input pipelines.
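  • A minimal sketch of the streaming idea, using a toy numeric dataset in place of image files:
    import tensorflow as tf

    # A small dataset standing in for millions of images on disk.
    dataset = tf.data.Dataset.range(10)

    # Batches are materialized one at a time, so memory usage stays
    # bounded no matter how large the underlying dataset is.
    for batch in dataset.batch(4):
        print(batch.numpy())  # [0 1 2 3], then [4 5 6 7], then [8 9]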

TensorFlow Data APIs

  • The tf.data API provides the building blocks for loading, transforming, and handling data.
  • Data Transformation: Support for operations like filtering and scaling.

Filtering and Scaling

  • Filtering: Use a custom predicate function to filter out blurry images:
    • dataset = dataset.filter(filter_func)
  • Scaling: Pixel values range from 0 to 255; scale them into the 0 to 1 range with a lambda function:
    • lambda x: x / 255
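  • A minimal, self-contained sketch of both steps; the variance-based blur check is a hypothetical stand-in for a real blur detector:
    import tensorflow as tf

    # Four fake 64x64 RGB "images" with pixel values in [0, 256).
    images = tf.random.uniform((4, 64, 64, 3), maxval=256.0)
    dataset = tf.data.Dataset.from_tensor_slices(images)

    # Hypothetical blur check: very blurry images tend to have low
    # pixel variance, so keep only images above a threshold.
    def filter_func(image):
        return tf.math.reduce_variance(image) > 100.0

    dataset = dataset.filter(filter_func)     # drop "blurry" images
    dataset = dataset.map(lambda x: x / 255)  # scale pixels to [0, 1]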

Single Line Data Pipeline

  • A complete pipeline can be created in a single function call:
    dataset = (tf.data.Dataset.list_files('images/*.jpg')  # Extract: list files on disk
                .map(load_and_preprocess_image)            # Transform: load and decode
                .filter(filter_func)                       # Transform: drop unwanted images
                .map(scale_function)                       # Transform: scale pixel values
                .batch(batch_size))                        # Load: batch for training
    
  • ETL Process: the pipeline follows the Extract (list files), Transform (map/filter), Load (batch) pattern.

Benefits of TensorFlow Input Pipeline

  1. Handles large datasets easily via streaming (from disk or cloud).
  2. Applies various transformations necessary for training models.

Coding Session Highlights

  • Created a tf.data.Dataset from a Python list and filtered out negative sales values:
    import tensorflow as tf
    sales_data = [21, 22, -108, 31, -1, 32, 34, 31]  # negatives are invalid entries
    tf_dataset = tf.data.Dataset.from_tensor_slices(sales_data)
    filtered_data = tf_dataset.filter(lambda x: x > 0)
    
  • Map Function: Used to perform operations across the dataset like scaling and currency conversion.
  • Shuffling: Helps in randomizing the dataset for better training.
  • Batching: Important for distributing data across GPUs during training.
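  • Continuing the example above with map, shuffle, and batch (the conversion rate is an illustrative figure):
    converted = filtered_data.map(lambda x: x * 72)  # e.g. USD -> INR
    shuffled = converted.shuffle(buffer_size=4)      # randomize element order
    batched = shuffled.batch(2)                      # group into batches of 2

    for batch in batched:
        print(batch.numpy())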

Advanced Processing

  • Read images from directories, create datasets, shuffle, and split for training/testing.
  • Use tf.io.read_file and tf.image.decode_image for loading images.
  • Resize images for uniformity before passing to the model.
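  • A sketch of the loading step; the file pattern and 180x180 target size are assumptions, and decode_jpeg is used here (rather than the more general decode_image) because resize needs a tensor with a known rank:
    import tensorflow as tf

    def load_image(file_path):
        # Read raw bytes from disk and decode them into a uint8 tensor.
        raw = tf.io.read_file(file_path)
        image = tf.image.decode_jpeg(raw, channels=3)
        # Resize so every image has the same shape before batching.
        return tf.image.resize(image, [180, 180])

    dataset = tf.data.Dataset.list_files('images/*.jpg', shuffle=True)
    dataset = dataset.map(load_image)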

Handling Labels

  • Function to extract labels from file paths using string operations:
    • Split the path on the directory separator and take the portion that names the class.
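  • A minimal sketch, assuming the class name is the parent directory of each file (e.g. images/dog/1.jpg):
    import os
    import tensorflow as tf

    def get_label(file_path):
        # Split "images/dog/1.jpg" into ["images", "dog", "1.jpg"]
        # and take the second-to-last part as the class name.
        parts = tf.strings.split(file_path, os.path.sep)
        return parts[-2]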

Summary and Exercise

  • The session covered building an image input pipeline using maps, filters, and other transformations.
  • Exercise: Create a TensorFlow data pipeline for text reviews, filtering out blank reviews and processing them into x (reviews) and y (labels) format.
  • Practice and explore the tf.data API documentation.

Additional Resources

  • Recommended resources for further reference on the tf.data API.
