Tech Capture: Dataflow Job - Cloud Storage to BigQuery

Jul 11, 2024

Overview

  • Create a Dataflow job that loads a Cloud Storage file into a BigQuery table using a Dataflow template.
  • Use the Dataflow UI's overview page to access job templates and manage jobs.
  • Focus on a batch-processing template: CSV file on Cloud Storage to BigQuery.

Steps to Create a Dataflow Job

  1. Set Up the Job in the Dataflow UI

    • Open the Dataflow UI and navigate to the jobs section.
    • Select the appropriate batch template: "Text Files on Cloud Storage to BigQuery".
    • Name the job (e.g., "bq-load-demo") and select the region (e.g., us-central1).
  2. Prepare Required Files

    • UDF file: JavaScript file with a user-defined function that maps CSV columns to BigQuery fields.
    • JSON file: contains the BigQuery schema (column names and data types).
    • Data source file: CSV file with the source data.
  3. Create and Upload Files

    • Create the UDF in JavaScript to map columns. The template expects the function to return a JSON-serialized string. Example:
      /**
       * Maps one CSV line to a JSON string matching the BigQuery schema.
       */
      function transform(line) {
          var fields = line.split(',');
          return JSON.stringify({
              name: fields[1],
              city: fields[2],
              country: fields[4]
          });
      }
      
    • Define the BigQuery schema in JSON. The template expects the field list under a top-level "BigQuery Schema" key:
      {
          "BigQuery Schema": [
              {"name": "name", "type": "STRING"},
              {"name": "city", "type": "STRING"},
              {"name": "country", "type": "STRING"}
          ]
      }
      
    • Upload the files (UDF JavaScript, JSON schema, CSV data) to a Cloud Storage bucket.
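Before uploading, the UDF can be exercised locally with Node.js. This is a minimal sketch; the sample row and its column layout (id, name, city, state, country, so indices 1, 2, and 4 are kept) are assumptions for illustration:

```javascript
// Same UDF as above: maps one CSV line to a JSON string
// matching the BigQuery schema (name, city, country).
function transform(line) {
    var fields = line.split(',');
    return JSON.stringify({
        name: fields[1],
        city: fields[2],
        country: fields[4]
    });
}

// Feed one hypothetical sample row through the UDF,
// the way Dataflow invokes it once per input line.
var row = transform('42,Alice,Paris,Ile-de-France,France');
console.log(row);  // {"name":"Alice","city":"Paris","country":"France"}
```

Testing the mapping locally catches off-by-one column indices before a full Dataflow run.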
  4. Configure the Dataflow Job

    • Provide the Cloud Storage paths to the UDF JavaScript file and the JSON schema file.
    • Specify the name of the UDF function (e.g., transform).
    • Specify the BigQuery output table (e.g., project:dataset.table).
    • Provide the input path of the CSV file and a temporary directory for the BigQuery load.
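The UI form fields above correspond to template parameters, so the same job can be launched from the command line. A sketch assuming the classic GCS_Text_to_BigQuery template; the bucket, project, dataset, and table names are placeholders:

```shell
# Launch the "Text Files on Cloud Storage to BigQuery" classic template.
# MY_BUCKET, MY_PROJECT, my_dataset, and my_table are placeholders.
gcloud dataflow jobs run bq-load-demo \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates-us-central1/latest/GCS_Text_to_BigQuery \
  --parameters=\
javascriptTextTransformGcsPath=gs://MY_BUCKET/transform.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://MY_BUCKET/schema.json,\
inputFilePattern=gs://MY_BUCKET/data.csv,\
outputTable=MY_PROJECT:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://MY_BUCKET/tmp
```

Scripting the launch makes the job repeatable and easy to schedule, whereas the UI form must be refilled each run.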
  5. Run and Monitor Job

    • Execute the job and monitor its status (e.g., Starting, Running, Succeeded).
    • Observe the job's CPU usage and the creation of worker nodes.
    • Check the Dataflow progress stages for troubleshooting if necessary.
  6. Validate BigQuery Table

    • After the job completes, verify that the BigQuery table was created and the data loaded.
    • Ensure only the specified columns (name, city, country) are loaded.
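The validation step can also be done from the command line with the bq tool; a sketch, with the project, dataset, and table names as placeholders:

```shell
# Spot-check the loaded rows (placeholder table name).
bq query --use_legacy_sql=false \
  'SELECT name, city, country FROM `my_project.my_dataset.my_table` LIMIT 10'

# Confirm the table schema contains only the three mapped columns.
bq show --schema my_project:my_dataset.my_table
```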

Key Points

  • Multiple streaming templates are available for other data sources (CDC streams, Spanner to BigQuery, etc.).
  • The JSON schema defines the column names and data types in BigQuery.
  • The UDF maps CSV columns to the BigQuery schema.
  • Monitor job stages and logs for troubleshooting.
  • Dataflow handles scalable, serverless execution.

Upcoming Topics

  • Extraction of data from Cloud Spanner to Google Cloud Storage.
  • More complex transformation examples.