TensorFlow Input Pipeline Lecture Notes

Jul 25, 2024

Introduction

  • Importance of TensorFlow input pipeline for deep learning projects.
  • Benefits discussed in the video with practical coding examples.

Image Classification Example: Cats and Dogs

  • Problem Statement: Classify cat and dog images stored on the hard disk.
  • Data Representation: Images must be converted to numbers (NumPy arrays or Pandas DataFrames) before ML models can use them.
  • Scalability Issue: A training set of 10 million images cannot be loaded into RAM all at once.

Streaming Approach

  • Load images in batches instead of all at once.
  • Uses a special data structure that streams data from disk efficiently.
  • Key Class: tf.data.Dataset for building input pipelines.
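  • A minimal sketch of the streaming idea, using a toy numeric dataset in place of image files:
    import tensorflow as tf

    # A small dataset standing in for millions of images on disk.
    dataset = tf.data.Dataset.range(10)

    # Batches are materialized one at a time, so memory usage stays
    # bounded no matter how large the underlying dataset is.
    for batch in dataset.batch(4):
        print(batch.numpy())  # [0 1 2 3], then [4 5 6 7], then [8 9]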

TensorFlow Data APIs

  • The tf.data API provides the building blocks for loading, transforming, and handling data.
  • Data Transformation: Support for operations like filtering and scaling.

Filtering and Scaling

  • Filtering: Use a custom predicate function to filter out blurry images:
    • dataset = dataset.filter(filter_func)
  • Scaling: Pixel values range from 0 to 255; scale them into the 0 to 1 range with a lambda function:
    • lambda x: x / 255
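  • A minimal, self-contained sketch of both steps; the variance-based blur check is a hypothetical stand-in for a real blur detector:
    import tensorflow as tf

    # Four fake 64x64 RGB "images" with pixel values in [0, 256).
    images = tf.random.uniform((4, 64, 64, 3), maxval=256.0)
    dataset = tf.data.Dataset.from_tensor_slices(images)

    # Hypothetical blur check: very blurry images tend to have low
    # pixel variance, so keep only images above a threshold.
    def filter_func(image):
        return tf.math.reduce_variance(image) > 100.0

    dataset = dataset.filter(filter_func)     # drop "blurry" images
    dataset = dataset.map(lambda x: x / 255)  # scale pixels to [0, 1]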

Single Line Data Pipeline

  • A complete pipeline can be created in a single function call:
    dataset = (tf.data.Dataset.list_files('images/*.jpg')  # Extract: list files on disk
                .map(load_and_preprocess_image)            # Transform: load and decode
                .filter(filter_func)                       # Transform: drop unwanted images
                .map(scale_function)                       # Transform: scale pixel values
                .batch(batch_size))                        # Load: batch for training
    
  • ETL Process: the pipeline follows the Extract (list files), Transform (map/filter), Load (batch) pattern.

Benefits of TensorFlow Input Pipeline

  1. Handles large datasets easily via streaming (from disk or cloud).
  2. Applies various transformations necessary for training models.

Coding Session Highlights

  • Created a tf.data.Dataset from a Python list and filtered out negative sales values:
    import tensorflow as tf
    sales_data = [21, 22, -108, 31, -1, 32, 34, 31]  # negatives are invalid entries
    tf_dataset = tf.data.Dataset.from_tensor_slices(sales_data)
    filtered_data = tf_dataset.filter(lambda x: x > 0)
    
  • Map Function: Used to perform operations across the dataset like scaling and currency conversion.
  • Shuffling: Helps in randomizing the dataset for better training.
  • Batching: Important for distributing data across GPUs during training.
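  • Continuing the example above with map, shuffle, and batch (the conversion rate is an illustrative figure):
    converted = filtered_data.map(lambda x: x * 72)  # e.g. USD -> INR
    shuffled = converted.shuffle(buffer_size=4)      # randomize element order
    batched = shuffled.batch(2)                      # group into batches of 2

    for batch in batched:
        print(batch.numpy())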

Advanced Processing

  • Read images from directories, create datasets, shuffle, and split for training/testing.
  • Use tf.io.read_file and tf.image.decode_image for loading images.
  • Resize images for uniformity before passing to the model.
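  • A sketch of the loading step; the file pattern and 180x180 target size are assumptions, and decode_jpeg is used here (rather than the more general decode_image) because resize needs a tensor with a known rank:
    import tensorflow as tf

    def load_image(file_path):
        # Read raw bytes from disk and decode them into a uint8 tensor.
        raw = tf.io.read_file(file_path)
        image = tf.image.decode_jpeg(raw, channels=3)
        # Resize so every image has the same shape before batching.
        return tf.image.resize(image, [180, 180])

    dataset = tf.data.Dataset.list_files('images/*.jpg', shuffle=True)
    dataset = dataset.map(load_image)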

Handling Labels

  • Function to extract labels from file paths using string operations:
    • Split the path on the directory separator and take the portion that names the class.
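  • A minimal sketch, assuming the class name is the parent directory of each file (e.g. images/dog/1.jpg):
    import os
    import tensorflow as tf

    def get_label(file_path):
        # Split "images/dog/1.jpg" into ["images", "dog", "1.jpg"]
        # and take the second-to-last part as the class name.
        parts = tf.strings.split(file_path, os.path.sep)
        return parts[-2]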

Summary and Exercise

  • The session covered building an image input pipeline using maps, filters, and other transformations.
  • Exercise: Create a TensorFlow data pipeline for text reviews, filtering out blank reviews and processing them into x (reviews) and y (labels) format.
  • Practice and explore the tf.data API documentation.

Additional Resources

  • Recommended resources for further reference on the tf.data API.
