A guide for users who need to connect to a data store that isn’t supported by the Built-in I/O Transforms
This guide covers how to implement I/O transforms in the Beam model. Beam pipelines use these read and write transforms to import data for processing, and write data to a store.
Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc. is usually sufficient. Rarely, you will need the more specialized Source and Sink classes for specific features. There are changes coming soon (SplittableDoFn, BEAM-65) that will address many of these special cases.
As you work on your I/O Transform, be aware that the Beam community is excited to help those building new I/O Transforms and that there are many examples and helper classes.
Read transforms take data from outside of the Beam pipeline and produce
PCollections of data.
For data stores or file types where the data can be read in parallel, you can think of the process as a mini-pipeline. This often consists of two steps:

1. Splitting the data into parts to be read in parallel
2. Reading from each of those parts
Each of those steps will be a ParDo, with a GroupByKey in between. The GroupByKey is an implementation detail, but for most runners it allows the runner to use different numbers of workers for:

* Determining what the splits are
* Reading each of those splits
GroupByKey will also allow Dynamic Work Rebalancing to occur (on supported runners).
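The ParDo → GroupByKey → ParDo shape can be sketched in plain Python. This is not Beam SDK code; the function names and the `group_by_key` helper are hypothetical stand-ins for the Beam primitives, just to illustrate the data flow:

```python
# Plain-Python stand-in for the ParDo -> GroupByKey -> ParDo mini-pipeline.
# Not Beam SDK code; names here are illustrative only.
from collections import defaultdict

def split_pardo(source_desc):
    # First "ParDo": decide what the parallel splits are.
    # Here: split the range [0, size) into fixed-size shards, keyed by start.
    size, shard = source_desc["size"], source_desc["shard_size"]
    return [(i, (i, min(i + shard, size))) for i in range(0, size, shard)]

def group_by_key(pairs):
    # Stand-in for GroupByKey: the point where a runner can hand the
    # splits to a different number of workers.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def read_pardo(split):
    # Second "ParDo": read one split. Here we just materialize the range.
    start, end = split
    return list(range(start, end))

splits = split_pardo({"size": 10, "shard_size": 4})
records = [r for _, values in group_by_key(splits)
           for v in values for r in read_pardo(v)]
print(records)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```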
Here are some examples of read transform implementations that use the “reading as a mini-pipeline” model when data can be read in parallel:
Reading from a file glob:

* Get File Paths ParDo: As input, take in a file glob. Produce a PCollection of strings, each of which is a file path.
* Reading ParDo: Given the PCollection of file paths, read each one, producing a PCollection of records.
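The file-glob example above can be sketched in plain Python (no Beam SDK; each function stands in for one of the two ParDos, and the temp files exist only to make the example runnable):

```python
# Plain-Python sketch of the file-glob mini-pipeline. Not Beam SDK code.
import glob
import os
import tempfile

def get_file_paths(file_glob):
    """First "ParDo": take a file glob, produce file paths."""
    return sorted(glob.glob(file_glob))

def read_file(path):
    """Second "ParDo": take one file path, produce its records (lines)."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

# Demo: two small files matched by a glob.
tmp = tempfile.mkdtemp()
for name, rows in [("a.txt", ["r1", "r2"]), ("b.txt", ["r3"])]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write("\n".join(rows) + "\n")

paths = get_file_paths(os.path.join(tmp, "*.txt"))   # step 1: find the splits
records = [r for p in paths for r in read_file(p)]   # step 2: read each split
print(records)  # ['r1', 'r2', 'r3']
```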
Reading from a database:

* Determine Key Ranges ParDo: As input, receive connection information for the database and the key range to read from. Produce a PCollection of key ranges that can be read in parallel efficiently.
* Read Key Range ParDo: Given the PCollection of key ranges, read the key range, producing a PCollection of records.
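The key-range pattern can likewise be sketched without the Beam SDK. The in-memory `DATABASE` dict and both helper functions below are made-up stand-ins, just to show the split-then-read shape:

```python
# Plain-Python sketch of the key-range read pattern. Not Beam SDK code.
# DATABASE stands in for an external store keyed by integer keys.
DATABASE = {k: "record-%d" % k for k in range(20)}

def determine_key_ranges(start, end, num_ranges):
    # First "ParDo": split [start, end) into ranges readable in parallel.
    step = max(1, (end - start) // num_ranges)
    bounds = list(range(start, end, step)) + [end]
    return list(zip(bounds[:-1], bounds[1:]))

def read_key_range(key_range):
    # Second "ParDo": read every key that falls in one range.
    lo, hi = key_range
    return [DATABASE[k] for k in range(lo, hi) if k in DATABASE]

ranges = determine_key_ranges(0, 20, 4)
print(ranges)  # [(0, 5), (5, 10), (10, 15), (15, 20)]
records = [r for kr in ranges for r in read_key_range(kr)]
print(len(records))  # 20
```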
For data stores or files where reading cannot occur in parallel, reading is a simple task that can be accomplished with a single ParDo+GroupByKey. For example:
* Reading from a database query: The ParDo in this case would establish a connection to the database and read batches of records, producing a PCollection of those records.
* Reading from a file that must be read in sequence: The ParDo in this case would open the file and read in sequence, producing a PCollection of records from the file.
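For the non-parallel case, the whole read fits in one DoFn-like function. A minimal sketch, assuming a batch-oriented store (`FakeCursor` is a hypothetical stand-in for a real database cursor, not a Beam or database API):

```python
# Single-"ParDo" sequential read: connect once, pull batches until exhausted.
# FakeCursor is a made-up stand-in for a real database cursor.
class FakeCursor:
    def __init__(self, rows, batch_size):
        self._rows, self._batch, self._pos = rows, batch_size, 0

    def fetch_batch(self):
        batch = self._rows[self._pos:self._pos + self._batch]
        self._pos += len(batch)
        return batch

def read_all(cursor):
    # Body of the single ParDo: read batches of records in sequence,
    # emitting every record into the output.
    records = []
    while True:
        batch = cursor.fetch_batch()
        if not batch:
            return records
        records.extend(batch)

rows = ["row-%d" % i for i in range(7)]
out = read_all(FakeCursor(rows, batch_size=3))
print(out == rows)  # True
```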
The above discussion is in terms of
ParDos - this is because
Sources have proven to be tricky to implement. At this point in time, the recommendation is to use
Source only if
ParDo doesn’t meet your needs. A class derived from
FileBasedSource is often the best option when reading from files.
If you’re trying to decide whether to use
Source, feel free to email the Beam dev mailing list and we can discuss the specific pros and cons of your case.
In some cases implementing a
Source may be necessary or result in better performance.
* ParDos will not work for reading from unbounded sources - they do not support checkpointing or mechanisms like de-duping that have proven useful for streaming data sources.
* ParDos cannot provide hints to runners about their progress or the size of the data they are reading. Without a size estimate or progress reporting, the runner has no way to guess how large your read will be, so if it attempts to dynamically allocate workers, it has no clue how many workers your pipeline may need.
* ParDos do not support Dynamic Work Rebalancing - a feature used by some readers to improve the processing speed of jobs (though it may not be possible with your data source).
* ParDos do not receive desired_bundle_size as a hint from runners when performing initial splitting.
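To make the desired_bundle_size hint concrete, here is a sketch of the kind of initial splitting a Source can perform with it. This is plain Python, not an actual Beam API; the function name is invented for illustration:

```python
# Sketch of initial splitting driven by a runner's desired_bundle_size hint.
# A Source can honor the hint; a plain ParDo never receives it.
def split_into_bundles(total_bytes, desired_bundle_size):
    # Carve [0, total_bytes) into bundles of roughly the requested size.
    bundles = []
    start = 0
    while start < total_bytes:
        end = min(start + desired_bundle_size, total_bytes)
        bundles.append((start, end))
        start = end
    return bundles

print(split_into_bundles(100, 30))  # [(0, 30), (30, 60), (60, 90), (90, 100)]
```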
SplittableDoFn (BEAM-65) will mitigate many of these concerns.
Write transforms are responsible for taking the contents of a
PCollection and transferring that data outside of the Beam pipeline.
Write transforms can usually be implemented using a single
ParDo that writes the records received to the data store.
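A write transform of this shape can be sketched in plain Python (the `FakeSink` class is a made-up stand-in for a real data-store client; a real DoFn would hold a client connection and flush in its lifecycle methods):

```python
# Sketch of a write "ParDo": buffer records and flush them in batches.
# FakeSink is hypothetical, standing in for a real data-store client.
class FakeSink:
    def __init__(self):
        self.written = []

    def write_batch(self, batch):
        self.written.extend(batch)

def write_pardo(records, sink, batch_size=3):
    # Body of the single ParDo: batch up elements and push each batch out.
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= batch_size:
            sink.write_batch(buffer)
            buffer = []
    if buffer:  # flush the final partial batch
        sink.write_batch(buffer)

sink = FakeSink()
write_pardo(["a", "b", "c", "d"], sink)
print(sink.written)  # ['a', 'b', 'c', 'd']
```

Batching the writes, as above, is a common design choice because per-record round trips to an external store are usually the bottleneck.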
TODO: this section needs further explanation.
You are strongly discouraged from using the
Sink class unless you are creating a
FileBasedSink. Most of the time, a simple
ParDo is all that’s necessary. If you think you have a case that is only possible using a
Sink, please email the Beam dev mailing list.
This guide is still in progress. There is an open issue to finish the guide: BEAM-1025.