Stream Data Ingestion Pipeline in GCP from PubSub to BigQuery using Dataflow | Sagar Patil
In this article, we will discuss how to create a streaming data ingestion pipeline on Google Cloud Platform (GCP) from Pub/Sub to BigQuery using the Dataflow job template. Before moving forward, let's look at the components we will use.
Google Cloud Components:
- Google Cloud Pub/Sub
- Google BigQuery
- Cloud Dataflow
Google Cloud Pub/Sub:
Pub/Sub is used for streaming analytics and data integration pipelines to ingest and distribute data. It allows services to communicate asynchronously.
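As a quick illustration of that asynchronous model, here is a minimal sketch with the Pub/Sub Python client library; the project, topic, and subscription names are placeholders and are assumed to already exist.
```python
# A minimal sketch of asynchronous messaging with the Pub/Sub Python client
# (pip install google-cloud-pubsub). Names are placeholders, and the topic
# and subscription are assumed to already exist.
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project_id = "my-project"
topic_id = "demo-topic"
subscription_id = "demo-sub"

# The publisher does not wait for any consumer: publish() returns a future
# immediately, and delivery happens whenever subscribers are ready.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, data=b"hello from the publisher")
print("Published message ID:", future.result())

# A subscriber, possibly a completely separate service, receives it later.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print("Received:", message.data)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull.result(timeout=10)  # listen briefly, then stop
    except TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()
```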
Google BigQuery:
Google BigQuery is a cost-effective, serverless, multi-cloud enterprise data warehouse offered by Google Cloud.
Cloud Dataflow:
Dataflow is a serverless, fast, and cost-effective service that supports both stream and batch processing. It provides portability for processing jobs written with the open-source Apache Beam libraries and removes operational overhead from data engineering teams by automating infrastructure provisioning and cluster management. Users can set up pipelines in Dataflow to integrate and process large datasets. It offers ready-to-use stream and batch processing templates, and you can also supply your own custom templates for these jobs.
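For context, the "Pub/Sub Topic to BigQuery" template we will use below is itself a Beam pipeline. A rough, simplified sketch of an equivalent streaming pipeline in the Apache Beam Python SDK might look like this; the project, topic, table, and bucket names are placeholders, and this is not the template's actual source.
```python
# A rough sketch of a streaming Pub/Sub-to-BigQuery pipeline in the Apache
# Beam Python SDK (pip install "apache-beam[gcp]"). All project, topic,
# table, and bucket names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                      # placeholder project ID
    region="us-central1",
    temp_location="gs://my-temp-bucket/temp",  # placeholder bucket
)

pipeline = beam.Pipeline(options=options)
(
    pipeline
    | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/demo-topic")
    | "ParseJson" >> beam.Map(lambda message: json.loads(message.decode("utf-8")))
    | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
        "my-project:demo_dataset.demo_table",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )
)
pipeline.run()  # submits the streaming job to Dataflow and returns
```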
Creating the Pipeline
Now, let's start with creating a data ingestion pipeline.
Step 1: We’ll start by creating a Pub/Sub topic for publishing a message.
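If you prefer doing this from code instead of the console, a minimal sketch with the Pub/Sub Python client looks like this; the project and topic names are placeholders.
```python
# A sketch of creating the topic with the Pub/Sub Python client
# (pip install google-cloud-pubsub). "my-project" and "demo-topic" are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "demo-topic")

topic = publisher.create_topic(request={"name": topic_path})
print("Created topic:", topic.name)
```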
Step 2: The next step is to create a dataset in BigQuery and a table inside it. Specify the table's schema according to your requirements.
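This step can also be scripted. Here is a sketch with the BigQuery Python client, using a placeholder project, dataset, table, and example schema; adjust the schema to match the JSON messages you plan to publish.
```python
# A sketch of creating the dataset and table with the BigQuery Python client
# (pip install google-cloud-bigquery). The project, dataset, table, and schema
# below are placeholder examples.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Create the dataset (no-op if it already exists)
client.create_dataset("demo_dataset", exists_ok=True)

# Define a schema matching the JSON messages you plan to publish
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("age", "INTEGER"),
]

table = bigquery.Table("my-project.demo_dataset.demo_table", schema=schema)
client.create_table(table, exists_ok=True)
```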
Step 3: Now create a Dataflow job from a template (a sketch for launching the same template programmatically follows this list).
- Give a name to the job
- Select the regional endpoint (default: us-central1)
- Select “Pub/Sub Topic to BigQuery” from the template menu.
- Input Pub/Sub topic: Select the topic you have created for this job.
- BigQuery output table: The path of the BigQuery table. You will find this in the table details section in BigQuery.
- Temporary bucket: Create a Cloud Storage bucket to hold temporary files. You can simply create a bucket and append a folder name to the path; the folder will be created automatically when the job runs.
- Leave the remaining parameters at their default values.
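The console form above is the simplest way to fill these in, but the same Google-provided template can also be launched programmatically. A sketch using the Google API Python client, reusing the placeholder names from the earlier snippets, might look like this:
```python
# A sketch of launching the public "Pub/Sub Topic to BigQuery" template via the
# Dataflow REST API (pip install google-api-python-client). All names are
# placeholders, and application-default credentials are assumed.
from googleapiclient.discovery import build

project = "my-project"
dataflow = build("dataflow", "v1b3")

response = (
    dataflow.projects()
    .locations()
    .templates()
    .launch(
        projectId=project,
        location="us-central1",  # regional endpoint
        gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
        body={
            "jobName": "pubsub-to-bq-demo",
            "parameters": {
                "inputTopic": f"projects/{project}/topics/demo-topic",
                "outputTableSpec": f"{project}:demo_dataset.demo_table",
            },
            "environment": {"tempLocation": "gs://my-temp-bucket/temp"},
        },
    )
    .execute()
)
print("Launch response:", response)
```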
Step 4: Run the job.
Once the job starts running, a job graph will be displayed as shown below.
Testing the Pipeline:
- Go to the Pub/Sub topic you created.
- Click the “Publish message” button, enter the data in JSON format as shown below, and publish it by clicking the “Publish” button.
- Go to the BigQuery console and open the query editor.
- Run a SELECT query to display all columns and rows of the table, as shown below.
- You can see the data published to Pub/Sub has landed in BigQuery.
- Try publishing one more message and querying the table again to see the new data. The same publish-and-verify check can also be scripted, as sketched below.
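Here is a sketch of that check using the Pub/Sub and BigQuery Python clients with the placeholder names from the earlier snippets; the JSON payload is an example and must match your table schema.
```python
# A sketch of testing the pipeline: publish a JSON message, wait briefly,
# then query the table. All names and the payload are placeholder examples.
import json
import time

from google.cloud import bigquery, pubsub_v1

project_id = "my-project"

# Publish a JSON message whose fields match the BigQuery table schema
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "demo-topic")
payload = {"name": "Alice", "age": 30}
publisher.publish(topic_path, data=json.dumps(payload).encode("utf-8")).result()

# Give the streaming job a little time to process and insert the row
time.sleep(60)

# Verify that the row arrived in BigQuery
client = bigquery.Client(project=project_id)
query = "SELECT * FROM `my-project.demo_dataset.demo_table`"
for row in client.query(query).result():
    print(dict(row))
```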
Conclusion
In this way, we can create a streaming data ingestion pipeline from Pub/Sub to BigQuery using a Dataflow job template on Google Cloud Platform.