Merge pull request spotify#97 from spotify/lynn/more-faqs
[docs] Add more introductory FAQs
econchick authored Oct 13, 2020
2 parents cccb452 + 16039f9 commit 3372f77
Showing 12 changed files with 144 additions and 65 deletions.
9 changes: 9 additions & 0 deletions docs/src/faqs/dags_in_klio.rst
@@ -0,0 +1,9 @@
How are DAGs described within Klio?
===================================

A DAG (directed acyclic graph) of streaming Klio jobs is defined in a job's ``klio-job.yaml`` configuration file.
The :doc:`output </userguide/io/index>` of one job can be used as the input of another job.

Learn more about how DAGs are used in Klio :doc:`here </userguide/anatomy/graph>`.

Learn more about setting up a DAG of Klio jobs through configuration :doc:`here </userguide/config/job_config>`.
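Concretely, one edge of such a DAG is formed when a parent job's event output topic is configured as a child job's event input. A minimal sketch, assuming Pub/Sub events; the job, topic, and subscription names below are illustrative:

.. code-block:: yaml

   # klio-job.yaml of the parent job (names are illustrative)
   job_config:
     events:
       outputs:
         - type: pubsub
           topic: projects/my-project/topics/parent-output
   ---
   # klio-job.yaml of the child job, reading the parent's output topic
   job_config:
     events:
       inputs:
         - type: pubsub
           topic: projects/my-project/topics/parent-output
           subscription: projects/my-project/subscriptions/child-input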
6 changes: 6 additions & 0 deletions docs/src/faqs/file_io_handling.rst
@@ -0,0 +1,6 @@
Does Klio take care of downloading input/uploading output data files to/from workers?
=====================================================================================

Currently, Klio has some :doc:`basic utilities </reference/audio/api/io>` for downloading and uploading audio to and from memory.
We will certainly be building this out, but welcome contributions to this end as well.
Klio also provides :ref:`builtin transforms <pipeline-using-builtins>` to ensure media is not unnecessarily processed.
42 changes: 33 additions & 9 deletions docs/src/faqs/index.rst
@@ -11,13 +11,21 @@ FAQs
:hidden:

klio_at_spotify
what_klio_doesnt_do
other_oss
do_i_need_dataflow
performance
research_vs_prod_loads
klio_vs_scio
klio_vs_kubeflow
klio_vs_tensorflow_serving
relation_to_beam
native_beam
file_io_handling
consume_non_klio_messages
publish_kmsgs_from_non_klio_job
custom_proto_def
dags_in_klio
migrate_from_fnapi


@@ -28,27 +36,43 @@ General
:maxdepth: 1

klio_at_spotify
what_klio_doesnt_do
other_oss
do_i_need_dataflow
performance
research_vs_prod_loads


Klio vs ...
-----------

.. toctree::
:maxdepth: 1

klio_vs_scio
klio_vs_kubeflow
klio_vs_tensorflow_serving


Beam and Klio
-------------

.. toctree::
:maxdepth: 1

relation_to_beam
native_beam


Technical
---------

.. toctree::
:maxdepth: 1

file_io_handling
dags_in_klio
consume_non_klio_messages
publish_kmsgs_from_non_klio_job
custom_proto_def
migrate_from_fnapi
8 changes: 8 additions & 0 deletions docs/src/faqs/klio_vs_kubeflow.rst
@@ -0,0 +1,8 @@
How does Klio compare to Kubeflow?
==================================

`Kubeflow <https://www.kubeflow.org/docs/about/kubeflow/>`_ is a very powerful platform that uses `Kubernetes <https://kubernetes.io/>`_ under the hood to help construct workflows.
Kubeflow allows its users to process data and to use that data to experiment & train ML models.


Klio, on the other hand, takes complex algorithms, whether trained ML models or media processing algorithms, and enables their deployment within research or production pipelines, with a focus on optimizing heavy file I/O and its related resources.
4 changes: 2 additions & 2 deletions docs/src/faqs/klio_vs_scio.rst
@@ -1,4 +1,4 @@
How does Klio relate to Spotify Scio?
=====================================
How does Klio compare to Spotify Scio?
======================================

Both projects bring Apache Beam to new domains: `Scio <https://github.com/spotify/scio>`_ brings Beam pipelines to Scala, while Klio focuses Beam on analyzing, manipulating, and transforming large binary media (e.g. images, audio, video) where the content in its native form can’t really fit or be analyzed in a database in any meaningful way.
7 changes: 7 additions & 0 deletions docs/src/faqs/klio_vs_tensorflow_serving.rst
@@ -0,0 +1,7 @@
How does Klio compare to TensorFlow Serving?
============================================

`TensorFlow Serving <https://www.tensorflow.org/tfx/guide/serving>`_ enables creating a service around a TensorFlow-based ML model.
Although a streaming Klio job could be compared to serving a model behind a service, Klio is meant for media processing pipelines, not necessarily for serving a model.
Klio enables heavy file I/O for processing media, whether or not a model is used.
Klio is also agnostic to the type of ML model used (TensorFlow, PyTorch, scikit-learn, etc.).
5 changes: 5 additions & 0 deletions docs/src/faqs/native_beam.rst
@@ -0,0 +1,5 @@
Does Klio allow you to use native Beam components?
==================================================

Yes, definitely. Klio's design is meant to enhance Beam's primitives (`Pipeline <https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline>`_, `PCollections <https://beam.apache.org/documentation/programming-guide/#pcollections>`_, `Transforms <https://beam.apache.org/documentation/programming-guide/#transforms>`_, etc.).
Writing a Klio pipeline should feel very similar to writing a Beam pipeline.
8 changes: 8 additions & 0 deletions docs/src/faqs/other_oss.rst
@@ -0,0 +1,8 @@
Why not other open source frameworks?
=====================================

There are a number of well-developed, supported data processing frameworks available in the open.
At Spotify, we've standardized around `Apache Beam <https://beam.apache.org/>`_ with our sister open source framework, `Scio <https://spotify.github.io/scio/>`_.
We've found that Beam is a framework that engineers and researchers alike can pick up quickly to create `embarrassingly parallel <https://en.wikipedia.org/wiki/Embarrassingly_parallel>`_ pipelines.
But no solution yet existed to handle the resource and environment demands of processing media.

4 changes: 4 additions & 0 deletions docs/src/faqs/performance.rst
@@ -0,0 +1,4 @@
What's the performance of a Klio pipeline?
==========================================

As a simple test, we've `downsampled <https://en.wikipedia.org/wiki/Downsampling_(signal_processing)>`_ 10s of millions of songs in :violetemph:`6 days` using 600 `n1-standard-16 <https://cloud.google.com/compute/docs/machine-types#n1_machine_types>`_ machines (16 vCPUs, 60GB memory).
5 changes: 5 additions & 0 deletions docs/src/faqs/research_vs_prod_loads.rst
@@ -0,0 +1,5 @@
Can Klio be used for smaller loads for ongoing research, or just production loads?
==================================================================================

Klio is meant for processing media, regardless of the size of the collection of media files it processes.
It can be used on cloud infrastructure, or locally on one's computer.
3 changes: 3 additions & 0 deletions docs/src/spelling_wordlist.txt
@@ -6,6 +6,7 @@ GaugeDispatcher
Kleio
KlioMessage
KlioMessages
Kubeflow
Makefile
PCollection
PCollections
@@ -80,6 +81,7 @@ rst
run
runtime
schemas
scikit
spectrogram
spectrograms
spotify
@@ -98,5 +100,6 @@ unpickles
unserialization
userguide
utils
vCPUs
virtualenv
virtualenvs
108 changes: 54 additions & 54 deletions docs/src/userguide/config/job_config.rst
@@ -32,38 +32,38 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.events``
---------------------

Event inputs/outputs designate where to read/write KlioMessages.

The :doc:`KlioMessage <../pipeline/message>` contains a unique identifier of some sort that
refers to a unit of work (e.g. file IDs, track IDs, etc.). This unique identifier can then be
used to look up the binary data as configured in ``job_config.data`` for the job to process. A
job's events can therefore be seen as "triggers" of work needing to be done on particular
binary data.

Example:

.. code-block:: yaml

   name: my-cool-job
   pipeline_options:
     streaming: True
   job_config:
     events:
       inputs:
         - type: pubsub
           subscription: my-input-subscription
       outputs:
         - type: pubsub
           topic: my-output-topic

``job_config.events.inputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of input configurations that will be used to determine when and how to do work.

If more than one input is configured, please familiarize yourself with
:doc:`how multiple configured inputs <../pipeline/multiple_inputs>` are handled in Klio.


.. option:: job_config.events.inputs[].type STR
@@ -100,15 +100,15 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.events.outputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of output configurations that Klio will use to signify that work has been
completed.

.. warning::

   Currently, only one event output configuration is supported in Klio out of the box.

   If more than one output is required, set ``skip_klio_write`` of each output configuration
   to ``True``.
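A minimal sketch of this workaround, with hypothetical topic names; each output sets ``skip_klio_write``, leaving the pipeline itself responsible for writing both outputs:

.. code-block:: yaml

   job_config:
     events:
       outputs:
         - type: pubsub
           topic: my-first-output-topic
           skip_klio_write: True
         - type: pubsub
           topic: my-second-output-topic
           skip_klio_write: True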


.. option:: job_config.events.outputs[].type STR
@@ -140,29 +140,29 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.data``
-------------------

Data inputs/outputs refer to where the files are (typically GCS buckets) that ``KlioMessages``
generated by event inputs refer to.


``job_config.data.inputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of input configurations that Klio will use to look for data to be processed.

By default, Klio will drop a ``KlioMessage`` when input data for the corresponding element ID
does not exist. Set ``skip_klio_existence_check`` to ``True`` to implement different behavior.

.. note::

   Klio does not upload data automatically to the configured location. This must be done from
   within the pipeline.

.. warning::

   Currently, only one data input configuration is supported in Klio out of the box.

   If more than one input is required, set ``skip_klio_existence_check`` of each input
   configuration to ``True``.
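A minimal sketch of this workaround for data inputs, assuming GCS locations; the bucket paths, file suffixes, and field layout below are illustrative. Each input skips the built-in existence check, leaving any checks to the pipeline:

.. code-block:: yaml

   job_config:
     data:
       inputs:
         - type: gcs
           location: gs://my-bucket/first-inputs
           file_suffix: .wav
           skip_klio_existence_check: True
         - type: gcs
           location: gs://my-bucket/second-inputs
           file_suffix: .wav
           skip_klio_existence_check: True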


.. option:: job_config.data.inputs[].type STR
@@ -205,20 +205,20 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.data.outputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of output configurations that Klio will use to look for data that has already been
processed.

.. note::

   Klio does not upload data automatically to the configured location. This must be done from
   within the pipeline.

.. warning::

   Currently, only one data output configuration is supported in Klio out of the box.

   If more than one output is required, set ``skip_klio_existence_check`` of each output
   configuration to ``True``.



@@ -264,13 +264,13 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.metrics``
----------------------

With no additional configuration needed, metrics will be turned on and collected. The default
client depends on the runner:

| **DataflowRunner**: Stackdriver log-based metrics
| **DirectRunner**: Python standard library logging

See :doc:`documentation on metrics <../pipeline/metrics>` for information on how to emit metrics from a pipeline.
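As a sketch (the keys shown are illustrative of the shape, not an exhaustive reference), the default logger client can be tuned with a dictionary or disabled outright with a boolean:

.. code-block:: yaml

   job_config:
     metrics:
       # configure the default logger-based client ...
       logger:
         level: debug
       # ... or turn it off entirely:
       # logger: False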


.. option:: job_config.metrics.logger DICT | BOOL
