Tech Capture: Dataflow Job - Cloud Storage to BigQuery

Jul 11, 2024

Overview

  • Create a Dataflow job that loads a Cloud Storage file into a BigQuery table using a Dataflow template.
  • Use the Dataflow UI's overview page to access job templates and manage jobs.
  • Focus on a batch-processing template: CSV file on Cloud Storage to BigQuery.

Steps to Create a Dataflow Job

  1. Set Up the Job in the Dataflow UI

    • Open the Dataflow UI and navigate to the jobs section.
    • Select the appropriate batch template: "Text Files on Cloud Storage to BigQuery".
    • Name the job (e.g., "bq-load-demo") and select the region (e.g., us-central1).
  2. Prepare Required Files

    • UDF file: JavaScript file with a user-defined function that maps CSV columns to BigQuery fields.
    • JSON file: contains the BigQuery schema (column names and data types).
    • Data source file: CSV file with the source data.
  3. Create and Upload Files

    • Create the UDF in JavaScript to map columns. The template expects the function to return a JSON-serialized string. Example:
      /**
       * Maps one CSV line to a JSON string matching the BigQuery schema.
       */
      function transform(line) {
          var fields = line.split(',');
          return JSON.stringify({
              name: fields[1],
              city: fields[2],
              country: fields[4]
          });
      }
      
    • Define the BigQuery schema in JSON. The template expects the field list under a top-level "BigQuery Schema" key:
      {
          "BigQuery Schema": [
              {"name": "name", "type": "STRING"},
              {"name": "city", "type": "STRING"},
              {"name": "country", "type": "STRING"}
          ]
      }
      
    • Upload the files (UDF JavaScript, JSON schema, CSV data) to a Cloud Storage bucket.
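Before uploading, the UDF can be exercised locally with Node.js. This is a minimal sketch; the sample row and its column layout (id, name, city, state, country, so indices 1, 2, and 4 are kept) are assumptions for illustration:

```javascript
// Same UDF as above: maps one CSV line to a JSON string
// matching the BigQuery schema (name, city, country).
function transform(line) {
    var fields = line.split(',');
    return JSON.stringify({
        name: fields[1],
        city: fields[2],
        country: fields[4]
    });
}

// Feed one hypothetical sample row through the UDF,
// the way Dataflow invokes it once per input line.
var row = transform('42,Alice,Paris,Ile-de-France,France');
console.log(row);  // {"name":"Alice","city":"Paris","country":"France"}
```

Testing the mapping locally catches off-by-one column indices before a full Dataflow run.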
  4. Configure the Dataflow Job

    • Provide the Cloud Storage paths to the UDF JavaScript file and the JSON schema file.
    • Specify the name of the UDF function (e.g., transform).
    • Specify the BigQuery output table (e.g., project:dataset.table).
    • Provide the input path of the CSV file and a temporary directory for the BigQuery load.
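The UI form fields above correspond to template parameters, so the same job can be launched from the command line. A sketch assuming the classic GCS_Text_to_BigQuery template; the bucket, project, dataset, and table names are placeholders:

```shell
# Launch the "Text Files on Cloud Storage to BigQuery" classic template.
# MY_BUCKET, MY_PROJECT, my_dataset, and my_table are placeholders.
gcloud dataflow jobs run bq-load-demo \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates-us-central1/latest/GCS_Text_to_BigQuery \
  --parameters=\
javascriptTextTransformGcsPath=gs://MY_BUCKET/transform.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://MY_BUCKET/schema.json,\
inputFilePattern=gs://MY_BUCKET/data.csv,\
outputTable=MY_PROJECT:my_dataset.my_table,\
bigQueryLoadingTemporaryDirectory=gs://MY_BUCKET/tmp
```

Scripting the launch makes the job repeatable and easy to schedule, whereas the UI form must be refilled each run.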
  5. Run and Monitor Job

    • Execute the job and monitor its status (e.g., Starting, Running, Succeeded).
    • Observe the job's CPU usage and the creation of worker nodes.
    • Check the Dataflow progress stages for troubleshooting if necessary.
  6. Validate BigQuery Table

    • After the job completes, verify that the BigQuery table was created and the data loaded.
    • Ensure only the specified columns (name, city, country) are loaded.
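The validation step can also be done from the command line with the bq tool; a sketch, with the project, dataset, and table names as placeholders:

```shell
# Spot-check the loaded rows (placeholder table name).
bq query --use_legacy_sql=false \
  'SELECT name, city, country FROM `my_project.my_dataset.my_table` LIMIT 10'

# Confirm the table schema contains only the three mapped columns.
bq show --schema my_project:my_dataset.my_table
```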

Key Points

  • Multiple streaming templates are available for other data sources (CDC streams, Spanner to BigQuery, etc.).
  • The JSON schema defines the column names and data types in BigQuery.
  • The UDF maps CSV columns to the BigQuery schema.
  • Monitor job stages and logs for troubleshooting.
  • Dataflow handles scalable, serverless execution.

Upcoming Topics

  • Extraction of data from Cloud Spanner to Google Cloud Storage.
  • More complex transformation examples.