Overview¶
We’ll go over a quick example to see what Fuel is capable of.
Let’s start by creating some random data to act as features and targets. We’ll pretend that we have eight 2x2 grayscale images separated into four classes.
>>> import numpy
>>> seed = 1234
>>> rng = numpy.random.RandomState(seed)
>>> features = rng.randint(256, size=(8, 2, 2))
>>> targets = rng.randint(4, size=(8, 1))
Our goal is to use Fuel to interface with this data, iterate over it in various ways and apply transformations to it on the fly.
Division of labour¶
There are four basic tasks that Fuel needs to handle:
- Interface with the data, be it on disk or in memory.
- Decide which data points to visit, and in which order.
- Iterate over the selected data points.
- At each iteration step, apply some transformation to the selected data points.
Each of those four tasks is delegated to a particular class of objects, which we’ll be introducing in order.
Schematic overview of Fuel¶
For the more visual people, here’s a schematic view of how the different components of Fuel interact together. Dashed lines are optional.
[Figure: schematic of Fuel's components. A Dataset is an argument to a DataStream, which gets its data from it and returns a DataIterator. An IterationScheme can optionally be an argument to a DataStream, which gets a request iterator from it; the IterationScheme returns a RequestIterator, which can optionally be an argument to the DataIterator. The DataIterator gets data from the DataStream, and a DataStream can itself get data from another DataStream (a transformer).]
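In code, the schematic translates to something like the sketch below. Every piece is introduced in detail in the rest of this section; the preview_* names are purely illustrative.
>>> from fuel.datasets import IndexableDataset
>>> from fuel.schemes import SequentialScheme
>>> from fuel.streams import DataStream
>>> # A dataset and an iteration scheme are handed to a data stream,
>>> # whose epoch iterator plays the role of the data iterator above.
>>> preview_dataset = IndexableDataset(indexables={'features': features})
>>> preview_scheme = SequentialScheme(examples=8, batch_size=4)
>>> preview_stream = DataStream(dataset=preview_dataset,
...                             iteration_scheme=preview_scheme)
>>> for batch in preview_stream.get_epoch_iterator():
...     pass  # each batch is a tuple of sources; here just a features array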
Datasets: interfacing with data¶
In summary
Dataset
- Abstract class. Its subclasses are responsible for interfacing with your data.
- Constructor arguments:
  - sources: optional, use to restrict which data sources are returned on data requests.
  - axis_labels: optional, use to document axis semantics.
- Instance attributes:
  - sources: tuple of strings indicating which sources are provided by the dataset, and their ordering (which determines the return order of get_data()).
  - provides_sources: tuple of source names indicating what sources the dataset is able to provide.
  - axis_labels: dict mapping from source names to tuples of strings or None. Used to document the axis semantics of the dataset's sources.
  - num_examples: when implemented, represents the number of examples the dataset holds.
- Methods used to request data:
  - open(): returns a state object the dataset will interact with (e.g. a file handle), or None if it doesn't need to interact with anything.
  - get_data(): given the state object and an optional request argument, returns data.
  - close(): given the state object, properly closes it.
  - reset(): given the state object, properly closes it and returns a fresh one.
IterableDataset
- Allows interfacing with iterable objects.
- The state IterableDataset.open() returns is an iterator object.
- Its get_data() method doesn't accept requests.
- Can only iterate examplewise and sequentially.
- Constructor arguments:
  - iterables: a dict mapping from source names to their corresponding iterable objects. Use collections.OrderedDict instances if the source order is important to you.
IndexableDataset
- Allows interfacing with indexable objects.
- The state IndexableDataset.open() returns is None.
- Its get_data() method accepts requests.
- Allows random access.
- Constructor arguments:
  - indexables: a dict mapping from source names to their corresponding indexable objects. Use collections.OrderedDict instances if the source order is important to you.
The Dataset
class is responsible for interfacing with the data and
handling data access requests. Subclasses of Dataset
specialize in
certain types of data.
Datasets contain one or more sources of data, such as an array of images, a list of labels, a dictionary specifying an ontology, etc. Each source in a dataset is identified by a unique name.
All datasets have the following attributes:
- sources: tuple of source names indicating what the dataset will provide when queried for data.
- provides_sources: tuple of source names indicating what sources the dataset is able to provide.
- axis_labels: dict mapping each source name to a tuple of axis labels, or None. Not all source names need to appear in the axis labels dictionary.
Some datasets also have a num_examples
attribute telling how many examples
the dataset provides.
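To make the interface concrete, here is a minimal, hypothetical Dataset subclass that serves square numbers as a single source. It is only a sketch of the contract described above (the SquaresDataset name and its behaviour are ours, not something shipped with Fuel):
>>> from fuel.datasets import Dataset
>>> class SquaresDataset(Dataset):
...     """Toy dataset serving the squares 0, 1, 4, 9, ... as one source."""
...     provides_sources = ('squares',)
...     def __init__(self, how_many, **kwargs):
...         self.how_many = how_many
...         super(SquaresDataset, self).__init__(**kwargs)
...     @property
...     def num_examples(self):
...         return self.how_many
...     def get_data(self, state=None, request=None):
...         # We choose to interpret `request` as a list of indices.
...         return (numpy.array([i ** 2 for i in request]),)
>>> squares_dataset = SquaresDataset(how_many=8)
>>> print(squares_dataset.get_data(request=[0, 1, 2, 3]))
(array([0, 1, 4, 9]),)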
IterableDataset¶
The simplest Dataset
subclass is IterableDataset
, which
interfaces with iterable objects.
It is created by passing an iterables
dict
mapping source names to
their associated data and, optionally, an axis_labels
dict
mapping
source names to their corresponding tuple of axis labels.
>>> from collections import OrderedDict
>>> from fuel.datasets import IterableDataset
>>> dataset = IterableDataset(
... iterables=OrderedDict([('features', features), ('targets', targets)]),
... axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
... ('targets', ('batch', 'index'))]))
We can access the sources
, provides_sources
and axis_labels
attributes defined in all datasets, as well as num_examples
.
>>> print('Provided sources are {}.'.format(dataset.provides_sources))
Provided sources are ('features', 'targets').
>>> print('Sources are {}.'.format(dataset.sources))
Sources are ('features', 'targets').
>>> print('Axis labels are {}.'.format(dataset.axis_labels))
Axis labels are OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))]).
>>> print('Dataset contains {} examples.'.format(dataset.num_examples))
Dataset contains 8 examples.
Tip
The source order of an IterableDataset
instance depends on the key
order of iterables
, which is nondeterministic for regular dict
instances. We therefore recommend that you use
collections.OrderedDict
instances if the source order is important
to you.
Datasets themselves are stateless objects (as opposed to, say, an open file
handle, or an iterator object). In order to request data from the dataset, we
need to ask it to instantiate some stateful object with which it will interact.
This is done through the Dataset.open()
method:
>>> state = dataset.open()
>>> print(state.__class__.__name__)
imap
We can see that in IterableDataset
’s case the state is an iterator
(imap
) object. We can now visit the examples this dataset contains
using its get_data()
method.
>>> while True:
... try:
... print(dataset.get_data(state=state))
... except StopIteration:
... print('Iteration over')
... break
(array([[ 47, 211],
[ 38, 53]]), array([0]))
(array([[204, 116],
[152, 249]]), array([3]))
(array([[143, 177],
[ 23, 233]]), array([0]))
(array([[154, 30],
[171, 158]]), array([1]))
(array([[236, 124],
[ 26, 118]]), array([2]))
(array([[186, 120],
[112, 220]]), array([2]))
(array([[ 69, 80],
[201, 127]]), array([2]))
(array([[246, 254],
[175, 50]]), array([3]))
Iteration over
Eventually, the iterator is depleted and it raises a StopIteration
exception. We can iterate over the dataset again by requesting a fresh iterator
through the dataset’s reset()
method.
>>> state = dataset.reset(state=state)
>>> print(dataset.get_data(state=state))
(array([[ 47, 211],
[ 38, 53]]), array([0]))
When you’re done, don’t forget to call the dataset’s close()
method on
the state. This has the effect of cleanly closing the state (e.g. if the state
is an open file handle, close()
will close it).
>>> dataset.close(state=state)
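Since close() should run even if something goes wrong partway through an epoch, a simple pattern (ordinary Python, nothing specific to Fuel) is to guard the iteration with try/finally:
>>> state = dataset.open()
>>> try:
...     while True:
...         try:
...             dataset.get_data(state=state)
...         except StopIteration:
...             break
... finally:
...     dataset.close(state=state)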
IndexableDataset¶
The IterableDataset
implementation is pretty minimal. For instance, it
only lets you iterate sequentially and examplewise over your data.
If your data happens to be indexable (e.g. a list
, or a
numpy.ndarray
), then IndexableDataset
will let you do much
more.
We instantiate IndexableDataset
just like IterableDataset
.
>>> from fuel.datasets import IndexableDataset
>>> dataset = IndexableDataset(
... indexables=OrderedDict([('features', features), ('targets', targets)]),
... axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
... ('targets', ('batch', 'index'))]))
The main advantage of IndexableDataset
over IterableDataset
is that it allows random access to the data it contains. In order to do so, we
need to pass an additional request
argument to get_data()
in the form
of a list of indices.
>>> state = dataset.open()
>>> print('State is {}.'.format(state))
State is None.
>>> print(dataset.get_data(state=state, request=[0, 1]))
(array([[[ 47, 211],
[ 38, 53]],
<BLANKLINE>
[[204, 116],
[152, 249]]]), array([[0],
[3]]))
>>> dataset.close(state=state)
See how IndexableDataset
returns a None
state: this is because
there’s no actual state to maintain in this case.
Restricting sources¶
In some cases (e.g. unsupervised learning), you might want to use a subset of
the provided sources. This is achieved by passing a sources
argument to the
dataset constructor. Here’s an example:
>>> restricted_dataset = IndexableDataset(
... indexables=OrderedDict([('features', features), ('targets', targets)]),
... axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
... ('targets', ('batch', 'index'))]),
... sources=('features',))
>>> print(restricted_dataset.provides_sources)
('features', 'targets')
>>> print(restricted_dataset.sources)
('features',)
>>> state = restricted_dataset.open()
>>> print(restricted_dataset.get_data(state=state, request=[0, 1]))
(array([[[ 47, 211],
[ 38, 53]],
<BLANKLINE>
[[204, 116],
[152, 249]]]),)
>>> restricted_dataset.close(state=state)
You can see that in this case only the features are returned by
get_data()
.
Iteration schemes: which examples to visit¶
In summary
IterationScheme
- Abstract class. Its subclasses are responsible for deciding in which order examples are visited.
- Methods:
  - get_request_iterator(): returns an iterator object that returns requests. These requests can be fed to a dataset's get_data() method.
BatchScheme
- Abstract class. Its subclasses return batch requests.
- Commonly used subclasses are:
  - SequentialScheme: requests batches sequentially.
  - ShuffledScheme: requests batches in shuffled order.
IndexScheme
- Abstract class. Its subclasses return example requests.
- Commonly used subclasses are:
  - SequentialExampleScheme: requests examples sequentially.
  - ShuffledExampleScheme: requests examples in shuffled order.
Encapsulating and accessing our data is good, but if we’re to integrate it into
a training loop, we need to be able to iterate over the data. For that, we need
to decide which indices to request and in which order. This is accomplished
via an IterationScheme
subclass.
At its most basic level, an iteration scheme is responsible, through its
get_request_iterator()
method, for building an iterator that will return
requests. Here are some examples:
>>> from fuel.schemes import (SequentialScheme, ShuffledScheme,
... SequentialExampleScheme, ShuffledExampleScheme)
>>> schemes = [SequentialScheme(examples=8, batch_size=4),
... ShuffledScheme(examples=8, batch_size=4),
... SequentialExampleScheme(examples=8),
... ShuffledExampleScheme(examples=8)]
>>> for scheme in schemes:
... print(list(scheme.get_request_iterator()))
[[0, 1, 2, 3], [4, 5, 6, 7]]
[[7, 2, 1, 6], [0, 4, 3, 5]]
[0, 1, 2, 3, 4, 5, 6, 7]
[7, 2, 1, 6, 0, 4, 3, 5]
We can therefore use an iteration scheme to visit a dataset in some order.
>>> state = dataset.open()
>>> scheme = ShuffledScheme(examples=dataset.num_examples, batch_size=4)
>>> for request in scheme.get_request_iterator():
... data = dataset.get_data(state=state, request=request)
... print(data[0].shape, data[1].shape)
(4, 2, 2) (4, 1)
(4, 2, 2) (4, 1)
>>> dataset.close(state)
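If none of the built-in schemes suits your needs, the contract is small: subclass IterationScheme and implement get_request_iterator(). Here is a minimal, hypothetical sketch (the EveryOtherScheme name and its behaviour are ours, not part of Fuel); the requests it produces can be fed to a dataset's get_data() method just like those of the built-in example schemes.
>>> from fuel.schemes import IterationScheme
>>> class EveryOtherScheme(IterationScheme):
...     """Requests every other example, one index at a time."""
...     def __init__(self, examples):
...         self.examples = examples
...     def get_request_iterator(self):
...         return iter(range(0, self.examples, 2))
>>> print(list(EveryOtherScheme(examples=8).get_request_iterator()))
[0, 2, 4, 6]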
Note
Not all iteration schemes work with all datasets. For instance,
IterableDataset
doesn’t work with any iteration scheme,
since its get_data()
method doesn’t accept requests.
Data streams: automating the iteration process¶
In summary
AbstractDataStream
- Abstract class. Its subclasses are responsible for coordinating a dataset and an iteration scheme to iterate through the data.
- Methods for iterating:
  - get_epoch_iterator(): returns an iterator that returns examples or batches of examples.
- Constructor arguments:
  - iteration_scheme: IterationScheme instance, optional, use to specify the iteration order.
  - axis_labels: optional, use to document axis semantics.
DataStream
- The most common data stream.
- Constructor arguments:
  - dataset: Dataset instance, which dataset to iterate over.
Iteration schemes offer a more convenient way to visit the dataset than
accessing the data by hand, but we can do better: the act of getting a fresh
state from the dataset, getting a request iterator from the iteration scheme,
using both to access the data and closing the state is repetitive. To automate
this, we have data streams, which are subclasses of
AbstractDataStream
.
The most common AbstractDataStream
subclass is DataStream
. It
is instantiated with a dataset and an iteration scheme, and returns an epoch
iterator through its get_epoch_iterator()
method, which iterates over the
dataset in the order defined by the iteration scheme.
>>> from fuel.streams import DataStream
>>> data_stream = DataStream(dataset=dataset, iteration_scheme=scheme)
>>> for data in data_stream.get_epoch_iterator():
... print(data[0].shape, data[1].shape)
(4, 2, 2) (4, 1)
(4, 2, 2) (4, 1)
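Since the iteration_scheme argument is optional, a DataStream can also wrap a dataset that doesn't accept requests, such as the IterableDataset from earlier; in that case the stream simply iterates over the dataset examplewise. A small sketch (the iterable_dataset and example_stream names are purely illustrative):
>>> iterable_dataset = IterableDataset(
...     iterables=OrderedDict([('features', features), ('targets', targets)]))
>>> example_stream = DataStream(dataset=iterable_dataset)
>>> for example in example_stream.get_epoch_iterator():
...     pass  # each example is a (features, targets) pair for one example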
Transformers: apply some transformation on the fly¶
In summary
Transformer
- AbstractDataStream subclass, and itself abstract. Its subclasses take data stream(s) as input and produce a data stream as output that applies some transformation to the input stream(s).
- Transformers can be chained together to form complex data processing pipelines.
- Constructor arguments:
  - data_stream: AbstractDataStream instance, the input stream.
Some AbstractDataStream
subclasses take data streams as input. We call
them transformers, and they enable us to build complex data preprocessing
pipelines.
Transformers are subclasses of Transformer, which is itself an
AbstractDataStream subclass. Here are some commonly used ones:
- Flatten: flattens the input into a matrix (for batch input) or a vector (for examplewise input).
- ScaleAndShift: scales and shifts the input by scalar quantities.
- Cast: casts the input into some data type.
As an example, let's standardize the images we have by subtracting their mean and dividing by their standard deviation.
>>> from fuel.transformers import ScaleAndShift
>>> # Note: ScaleAndShift applies (batch * scale) + shift, as
>>> # opposed to (batch + shift) * scale.
>>> scale = 1.0 / features.std()
>>> shift = - scale * features.mean()
>>> standardized_stream = ScaleAndShift(data_stream=data_stream,
... scale=scale, shift=shift,
... which_sources=('features',))
The resulting data stream can be used to iterate over the dataset just like before, but this time features will be standardized on-the-fly.
>>> for batch in standardized_stream.get_epoch_iterator():
... print(batch)
(array([[[ 0.18530572, -1.54479571],
[ 0.42249705, 0.24111545]],
<BLANKLINE>
[[-1.30760439, 0.98059429],
[-1.43317627, -1.2238898 ]],
<BLANKLINE>
[[ 1.46892937, 1.58054882],
[ 0.47830677, -1.2657471 ]],
<BLANKLINE>
[[ 0.63178351, -0.28907693],
[-0.40069638, 1.10616617]]]), array([[1],
[0],
[3],
[2]]))
(array([[[ 1.32940506, -0.2332672 ],
[-1.60060544, -0.31698179]],
<BLANKLINE>
[[ 0.03182898, 0.50621164],
[-1.64246273, 1.28754777]],
<BLANKLINE>
[[ 0.88292727, -0.34488665],
[ 0.15740086, 1.51078666]],
<BLANKLINE>
[[-1.00065091, -0.84717417],
[ 0.84106998, -0.19140991]]]), array([[2],
[0],
[3],
[2]]))
Now, let’s imagine that for some reason (e.g. running Theano code on GPU) we
need features to have a data type of float32
. We can cast them on-the-fly
with a Cast
transformer.
>>> from fuel.transformers import Cast
>>> cast_standardized_stream = Cast(
... data_stream=standardized_stream,
... dtype='float32', which_sources=('features',))
As you can see, Fuel makes it easy to chain transformations to form a preprocessing pipeline. The complete pipeline now looks like this:
>>> data_stream = Cast(
... ScaleAndShift(
... DataStream(
... dataset=dataset, iteration_scheme=scheme),
... scale=scale, shift=shift, which_sources=('features',)),
... dtype='float32', which_sources=('features',))
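If a downstream model expected flat feature vectors instead of 2x2 images, the Flatten transformer mentioned above could be appended to the pipeline in exactly the same way. A sketch (the flat_data_stream name is purely illustrative); each features batch should then come out as a 4x4 matrix:
>>> from fuel.transformers import Flatten
>>> flat_data_stream = Flatten(data_stream, which_sources=('features',))
>>> for batch in flat_data_stream.get_epoch_iterator():
...     print(batch[0].shape)
(4, 4)
(4, 4)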
Going further¶
You now know enough to find your way around Fuel. Here are the next steps:
- Learn how to use built-in datasets.
- Learn how to import your own data in Fuel.
- Learn how to extend Fuel to suit your needs.