Stream Data Ingestion Pipeline in GCP from Pub/Sub to BigQuery using Dataflow

Sagar Patil
4 min read · Nov 12, 2022


Stream Data Ingestion using Dataflow

In this article, we will discuss how you can create a streaming data ingestion pipeline on Google Cloud Platform (GCP) from Pub/Sub to BigQuery using a Dataflow job template. Before moving forward, let's look at the components we will use.

Google Cloud Components:

  • Google Cloud Pub/Sub
  • Google BigQuery
  • Cloud Dataflow

Google Cloud Pub/Sub:

Pub/Sub is used for streaming analytics and data integration pipelines to ingest and distribute data. It allows services to communicate asynchronously.

Google BigQuery:

Google BigQuery is a cost-effective, serverless, multi-cloud enterprise data warehouse offered by Google Cloud.

Cloud Dataflow:

Dataflow is a fast, serverless, and cost-effective service that supports both stream and batch processing. It provides portability, since processing jobs are written with the open-source Apache Beam libraries, and it removes operational overhead from data engineering teams by automating infrastructure provisioning and cluster management. Users can set up pipelines in Dataflow to integrate and process large datasets. It ships with ready-to-use stream and batch processing templates, and we can also use custom templates for these jobs.
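
For context, the “Pub/Sub Topic to BigQuery” template we will use below is itself an Apache Beam pipeline. The following is a minimal, illustrative Beam sketch of the same idea, not the template's actual source; the project, topic, and table names are placeholders, and running it on Dataflow would also need runner and staging options.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder names; replace with your own project, topic, and table.
PROJECT = "my-project"
TOPIC = f"projects/{PROJECT}/topics/demo-topic"
TABLE = f"{PROJECT}:demo_dataset.demo_table"

# Streaming mode is required because Pub/Sub is an unbounded source.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Read raw message bytes from the Pub/Sub topic.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
        # Decode each message as a JSON object (one row per message).
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # Stream the rows into the BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            TABLE,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```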

Creating Pipeline

Now, let's start with creating a data ingestion pipeline.

Step 1: We’ll start by creating a Pub/Sub topic for publishing a message.

Creating Pub/Sub Topic
Pub/Sub Topic
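
The screenshots above show the console flow; the topic can also be created programmatically. Here is a minimal sketch using the google-cloud-pubsub client library, with placeholder project and topic names.

```python
from google.cloud import pubsub_v1

# Placeholder project and topic names.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "demo-topic")

# Create the topic that the Dataflow job will read from.
topic = publisher.create_topic(request={"name": topic_path})
print(f"Created topic: {topic.name}")
```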

Step 2: Next, create a dataset in BigQuery and a table inside this dataset. Specify the table schema according to your requirements.

Create BigQuery Dataset
Create a Table inside the BigQuery dataset
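
Similarly, the dataset and table can be created with the google-cloud-bigquery client library. This is a minimal sketch with a placeholder dataset, table, and two-column schema; use whatever schema your messages actually follow.

```python
from google.cloud import bigquery

# Placeholder project, dataset, table, and column names.
client = bigquery.Client(project="my-project")

# Create the dataset (no-op if it already exists).
client.create_dataset("demo_dataset", exists_ok=True)

# Define the table schema to match the JSON messages you will publish.
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("age", "INTEGER"),
]
table = bigquery.Table("my-project.demo_dataset.demo_table", schema=schema)
table = client.create_table(table, exists_ok=True)
print(f"Created table: {table.full_table_id}")
```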

Step 3: Now create a Dataflow job from a template and fill in the parameters listed below (a programmatic alternative is sketched after the list).

  • Give the job a name.
  • Select the regional endpoint (default: us-central1).
  • Select “Pub/Sub Topic to BigQuery” from the template menu.
  • Input Pub/Sub topic: Select the topic you created for this job.
  • BigQuery output table: The path of the BigQuery table. You will find this in the table details section in BigQuery.
  • Temporary bucket: A Cloud Storage location for temporary files. You can create a bucket and append a folder name to the path; the folder will be created automatically when the job runs.
  • Leave the remaining parameters at their defaults.
Create a temporary bucket
Fill the parameters
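
If you prefer launching the job outside the console, the same classic template can be started through the Dataflow API. The sketch below uses the google-api-python-client library and assumes the classic template path gs://dataflow-templates/latest/PubSub_to_BigQuery along with placeholder project, topic, table, and bucket names.

```python
from googleapiclient.discovery import build

# Placeholder project, region, topic, table, and bucket names.
PROJECT = "my-project"
REGION = "us-central1"

dataflow = build("dataflow", "v1b3")
request = dataflow.projects().locations().templates().launch(
    projectId=PROJECT,
    location=REGION,
    # Classic "Pub/Sub Topic to BigQuery" template hosted by Google.
    gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
    body={
        "jobName": "pubsub-to-bq-demo",
        "parameters": {
            "inputTopic": f"projects/{PROJECT}/topics/demo-topic",
            "outputTableSpec": f"{PROJECT}:demo_dataset.demo_table",
        },
        # Temporary files are staged under this Cloud Storage path.
        "environment": {"tempLocation": "gs://my-temp-bucket/temp"},
    },
)
response = request.execute()
print(f"Launched job: {response['job']['id']}")
```

Note that this template expects each Pub/Sub message to be a JSON object whose fields match the BigQuery table schema, which is why the test message later in this article is published as JSON.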

Step 4: Run the job.

Once the job starts running, a job graph is displayed as shown below.

Graph showing the execution flow.

Testing the Pipeline:

  • Go to the Pub/Sub topic you created.
  • Click the “Publish message” button and enter the data in JSON format as shown below, then publish it by clicking the “Publish” button.
Publishing a message in Pub/Sub
  • Go to the BigQuery console and open the query editor.
  • Run a SELECT query to display all columns and rows from the table, as shown below.
Querying the data in BigQuery
  • We can see the data published to Pub/Sub now landing in BigQuery.
  • Try publishing one more message and querying the table again, as shown below; a scripted version of this test is sketched after the screenshots.
Publishing a message in Pub/Sub
Querying data in BigQuery
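
The test above can also be scripted. This sketch publishes a JSON message and then queries the table; it assumes the placeholder names and two-column schema used in the earlier sketches, and streamed rows may take a few seconds to appear.

```python
import json

from google.cloud import bigquery, pubsub_v1

# Placeholder project, topic, and table names.
PROJECT = "my-project"

# Publish a JSON message whose fields match the table schema.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, "demo-topic")
message = {"name": "test-user", "age": 30}
future = publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
print(f"Published message ID: {future.result()}")

# Query the output table to confirm the row arrived.
client = bigquery.Client(project=PROJECT)
rows = client.query("SELECT * FROM `my-project.demo_dataset.demo_table`").result()
for row in rows:
    print(dict(row))
```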

Conclusion

In this way, we can create a streaming data ingestion pipeline from Pub/Sub to BigQuery using a Dataflow job template on Google Cloud Platform.
