Christianlauer


Build Data Lake Pipelines with Google Pub/Sub

How new Pub/Sub Cloud Storage subscriptions will ease Data Integration

Photo by Leon Ephraïm on Unsplash

When building a Data Lake or Data Lakehouse, for example as the basis for a Data Warehouse, Data Engineers on Google Cloud can use Pub/Sub to ease data integration.

Data Engineers often had to put together complex data pipelines to ingest streaming data from Pub/Sub into Cloud Storage. In order to land streaming data into Cloud Storage, they either needed to write their own custom Cloud Storage subscriber, or to leverage intermediate Dataflow jobs. Writing a custom Cloud Storage subscriber comes with time-consuming design, development and maintenance of the subscriber code[1][2].
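To give a feel for what such a custom subscriber involves, here is a minimal sketch of just the batching part: buffering message payloads and writing them out as one object once a size or age limit is reached. The class name `BatchingBuffer`, the `write_object` callback and the file naming scheme are assumptions made for illustration. A real subscriber would additionally need the Pub/Sub pull loop, acknowledgements only after a successful write, retries and monitoring.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BatchingBuffer:
    """Collects message payloads and flushes them as a single object."""

    # Hypothetical writer: receives an object name and the batched bytes.
    write_object: Callable[[str, bytes], None]
    max_bytes: int = 1024 * 1024      # flush once this many payload bytes are buffered
    max_seconds: float = 300.0        # or once the open batch is this old
    clock: Callable[[], float] = time.monotonic
    _chunks: List[bytes] = field(init=False, default_factory=list)
    _size: int = field(init=False, default=0)
    _opened_at: float = field(init=False, default=0.0)
    _file_index: int = field(init=False, default=0)

    def add(self, payload: bytes) -> None:
        """Buffer one payload and flush if a limit has been reached."""
        if not self._chunks:
            # First payload of a new batch: remember when the batch was opened.
            self._opened_at = self.clock()
        self._chunks.append(payload)
        self._size += len(payload)
        too_big = self._size >= self.max_bytes
        too_old = self.clock() - self._opened_at >= self.max_seconds
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Write the buffered payloads as one newline-delimited object."""
        if not self._chunks:
            return
        name = f"batch-{self._file_index:06d}.txt"
        self.write_object(name, b"\n".join(self._chunks))
        self._file_index += 1
        self._chunks = []
        self._size = 0
```

Note that the age limit is only checked when a new payload arrives, so a quiet topic would still need a timer that calls `flush()` periodically. That is exactly the kind of detail that makes hand-written subscribers costly to maintain.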

Google has therefore launched a new type of subscription: Pub/Sub Cloud Storage subscriptions, which write your raw data into Cloud Storage without any transformations in between. These subscriptions offer multiple benefits[2]:

  • Simplified data pipelines — Streamline the ingestion pipelines for your data lake by using Cloud Storage subscriptions, which removes the need for an intermediate process (i.e., a custom subscriber or Dataflow). Cloud Storage subscriptions are fully managed by Pub/Sub, thus reducing the additional maintenance and monitoring overhead that comes with intermediate processes.
  • Lower latency and costs — Cloud Storage subscriptions remove the additional costs and latency introduced by Dataflow or a custom subscriber. (If Pub/Sub data needs to be transformed before landing into Cloud Storage, we still recommend leveraging Dataflow.)
  • Increased flexibility — Cloud Storage file batching options provide flexibility on how to batch messages and then land that data in Cloud Storage.
Create a subscription in Pub/Sub
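
Besides the console, such a subscription can also be created from code. The following is a minimal sketch using the `google-cloud-pubsub` Python client, assuming a recent client version that ships `CloudStorageConfig`, an existing topic and bucket, and a Pub/Sub service account that is allowed to write to the bucket. The `StorageSinkSettings` class, the function name and all default values (prefix, suffix, batching limits) are illustrative assumptions, not recommendations from the sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageSinkSettings:
    """Illustrative settings for one Cloud Storage subscription."""

    bucket: str                          # bare bucket name, without gs://
    filename_prefix: str = "raw/"        # assumed example prefix
    filename_suffix: str = ".avro"       # assumed example suffix
    max_duration_seconds: int = 300      # close a file after this long
    max_bytes: int = 100 * 1024 * 1024   # or once it reaches this size

    def __post_init__(self) -> None:
        if self.bucket.startswith("gs://"):
            raise ValueError("pass the bare bucket name, not a gs:// URI")
        if self.max_duration_seconds <= 0 or self.max_bytes <= 0:
            raise ValueError("batching limits must be positive")


def create_storage_subscription(
    project_id: str,
    topic_id: str,
    subscription_id: str,
    settings: StorageSinkSettings,
) -> str:
    """Create a Cloud Storage subscription and return its resource name."""
    # Imported lazily so the settings class is usable without the client installed.
    from google.cloud import pubsub_v1
    from google.protobuf import duration_pb2

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    # File batching options: a file is finalized when either limit is hit.
    config = pubsub_v1.types.CloudStorageConfig(
        bucket=settings.bucket,
        filename_prefix=settings.filename_prefix,
        filename_suffix=settings.filename_suffix,
        avro_config=pubsub_v1.types.CloudStorageConfig.AvroConfig(write_metadata=True),
        max_duration=duration_pb2.Duration(seconds=settings.max_duration_seconds),
        max_bytes=settings.max_bytes,
    )

    with subscriber:
        subscription = subscriber.create_subscription(
            request={
                "name": subscription_path,
                "topic": topic_path,
                "cloud_storage_config": config,
            }
        )
    return subscription.name
```

A call could look like `create_storage_subscription("my-project", "events", "events-to-gcs", StorageSinkSettings(bucket="my-data-lake"))`, where all four names are placeholders. The two batching fields are where the flexibility mentioned above comes in: smaller limits land data sooner, larger limits produce fewer and bigger files.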

With these new Cloud Storage subscriptions for Pub/Sub, Google is making it easy to ingest streaming data into Cloud Storage. The move also makes sense: as with other cloud providers, easier data integration and approaches like Zero ETL are gaining momentum, and they can be an argument for companies to switch from legacy data platforms to modern SaaS-based Data Lakes, Data Warehouses or even Data Lakehouses.

Sources and Further Readings

[1] Google, Simplify data lake pipelines with new Pub/Sub Cloud Storage subscriptions (2023)

[2] Medium, Simplifying Streaming Data Ingestion with Pub/Sub Cloud Storage Subscriptions (2023)
