StreamingDataFrame Assignment Rules
This section aggregates some of the more nuanced rules behind StreamingDataFrame
operation assignments that are mentioned in various places.
Recommendation: Always Assign Operations!
You can safely assign any StreamingDataFrame
operation to a variable, regardless
of it being in-place or not.
# NOTE: this is an incomplete stub just to show usage
from quixstreams import Application
sdf = Application().dataframe()
sdf = sdf.apply()
sdf = sdf.update() # in-place with assigning!
sdf = sdf.apply().to_topic()
So whenever in doubt, simply assign operations to a variable, even if it won't be referenced afterward.
Once you grow more comfortable with how the various StreamingDataFrame
operations work,
feel free to skip assignments where applicable.
The only time you should not do assignments is very special edge cases with intermediate operation referencing.
Avoid: Intermediate Operation Referencing
Intermediate operation referencing corresponds to a specific set of StreamingDataFrame
operations that are assigned as a variable in one assignment step with the intention to
being used in future ones (often more than once), especially with branching.
It will NOT work as expected, so avoid it!
It applies only to a few specific use cases:
-
Using a previous
SDF.apply()
as a filter in a later step:- Recommended:
- Avoid:
-
Using column-based manipulations later (
SDF["col"]
references):- Recommended:
- Avoid:
Intermediate operation references will not raise exceptions, so be careful!
Assignment vs "in-place" Patterns
For StreamingDataFrames
, the general expected pattern is to assign it to a variable
and reassign it as operations are added:
# NOTE: this is an incomplete stub just to show usage
from quixstreams import Application
sdf = Application().dataframe()
sdf = sdf.apply() # reassign with new operation
sdf = sdf.apply().apply() # reassign with chaining
This is in contrast to an "in-place" (or "discard", "side effect") pattern, where the operation is not assigned:
# NOTE: this is an incomplete stub just to show usage
from quixstreams import Application
sdf = Application().dataframe()
sdf.print() # no assignment
sdf.update().print() # no assignment with chaining
Valid In-Place Operations
There are a few operations that are the exception: they are added to the "current branch" as expected without assignment:
.to_topic()
.print()
.update()
.drop()
that said, they can still be safely assigned and function as expected, because they
return the StreamingDataFrame
, which is why the recommended pattern is to
just assign everything.
Sink
The .sink()
operation is a special case: it is an in-place operation with no return,
so DO NOT assign it if you wish to branch from its `StreamingDataFrame afterward.
See sinks for details.
The Impact of Branching
With the introduction of branching, assignment is no longer technically required for the "assignment" operation to be added: it instead generates a terminal branch.
In general, it is still recommended to follow both the recommended assignment patterns and recommended branching patterns.
Before Quix Streams v3.0.0 (branching)
For those who never used anything <v3.0, ignore this!
Prior to Quix Streams 3.0, "reassignment" was a required pattern for practically all
but a few operations; if something like .apply()
was called without assigning it,
the function would be skipped entirely, like it never existed
(because it was never actually added!).