Stream fusion for streaming, without writing any code

Published: 2016-06-07T21:02Z
Tags: haskell, pipes
License: CC-BY
Source code for this post is available on github.

I recently came accross the streaming library. This library defines a type Stream (Of a) m b for computations that produce zero or more values of type a in a monad m, and eventually produce a value of type b. This stream type can be used for efficient IO without having to load whole files into memory. The streaming library touts bechmark results showing superior performance compared to other libraries like conduit, pipes and machines.

Looking at the datatype definition,

data Stream f m r = Step !(f (Stream f m r))
                  | Effect (m (Stream f m r))
                  | Return r

it struck me how similar this type is to what is used in the stream fusion framework. The main difference being that the streaming library allows for interleaved monadic actions, and of course the lack of decoupling of the state from the stream to allow for fusion. But the vector library actually also uses such a monadic stream fusion framework, to allow for writing into buffers and such. This is type is defined in the module Data.Vector.Fusion.Stream.Monadic.

data Stream m a = forall s. Stream (s -> m (Step s a)) s
data Step s a where
   Yield :: a -> s -> Step s a
   Skip  :: s -> Step s a
   Done  :: Step s a

So, why not try to use vector's stream type directly as a representation of streams? I added this type as an extra alternative to the benchmark, and without writing any more code, the results are pretty impressive:

The only function that could be improved is scanL. In vector this function is implemented in terms of prescan (scanL without the first element) and cons, which makes it pretty inefficient. So I made a specialized implementation.

And that's all. A simple streaming 'library' with state of the art performance, while writing hardly any new code. Now to be fair, there are some reasons why you wouldn't always want to use these fusing streams. In particular, the resulting code could get quite large, and without fusion they may not be the most efficient.


(optional, will not be revealed)
Name a function of type (a -> b) -> ([a] -> [b]):
Use > code for code blocks, @code@ for inline code. Some html is also allowed.