Merge pull request spotify#97 from spotify/lynn/more-faqs
[docs] Add more introductory FAQs
econchick authored Oct 13, 2020
2 parents cccb452 + 16039f9 commit 3372f77
Showing 12 changed files with 144 additions and 65 deletions.
9 changes: 9 additions & 0 deletions docs/src/faqs/dags_in_klio.rst
@@ -0,0 +1,9 @@
How are DAGs described within Klio?
===================================

A DAG (directed acyclic graph) of streaming Klio jobs is defined in a job's ``klio-job.yaml`` configuration file.
The :doc:`output </userguide/io/index>` of one job can be used as the input of another job.

Learn more about how DAGs are used in Klio :doc:`here </userguide/anatomy/graph>`.

Learn more about setting up a DAG of Klio jobs through configuration :doc:`here </userguide/config/job_config>`.
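Concretely, one edge of such a DAG is formed when a parent job's event output topic is configured as a child job's event input. A minimal sketch, assuming Pub/Sub events; the job, topic, and subscription names below are illustrative:

.. code-block:: yaml

   # klio-job.yaml of the parent job (names are illustrative)
   job_config:
     events:
       outputs:
         - type: pubsub
           topic: projects/my-project/topics/parent-output
   ---
   # klio-job.yaml of the child job, reading the parent's output topic
   job_config:
     events:
       inputs:
         - type: pubsub
           topic: projects/my-project/topics/parent-output
           subscription: projects/my-project/subscriptions/child-input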
6 changes: 6 additions & 0 deletions docs/src/faqs/file_io_handling.rst
@@ -0,0 +1,6 @@
Does Klio take care of downloading input/uploading output data files to/from workers?
=====================================================================================

Currently, Klio has some :doc:`basic utilities </reference/audio/api/io>` for downloading and uploading audio to and from memory.
We will certainly be building this out, but welcome contributions to this end as well.
Klio also provides :ref:`builtin transforms <pipeline-using-builtins>` to ensure media is not unnecessarily processed.
42 changes: 33 additions & 9 deletions docs/src/faqs/index.rst
@@ -11,13 +11,21 @@ FAQs
:hidden:

klio_at_spotify
what_klio_doesnt_do
other_oss
do_i_need_dataflow
performance
research_vs_prod_loads
klio_vs_scio
klio_vs_kubeflow
klio_vs_tensorflow_serving
relation_to_beam
native_beam
file_io_handling
consume_non_klio_messages
publish_kmsgs_from_non_klio_job
custom_proto_def
dags_in_klio
migrate_from_fnapi


@@ -28,27 +36,43 @@ General
:maxdepth: 1

klio_at_spotify
what_klio_doesnt_do
other_oss
do_i_need_dataflow
performance
research_vs_prod_loads


Klio vs ...
-----------

.. toctree::
:maxdepth: 1

klio_vs_scio
klio_vs_kubeflow
klio_vs_tensorflow_serving


Beam and Klio
-------------

.. toctree::
:maxdepth: 1

relation_to_beam
native_beam


Technical
---------

.. toctree::
:maxdepth: 1

file_io_handling
dags_in_klio
consume_non_klio_messages
publish_kmsgs_from_non_klio_job
custom_proto_def
migrate_from_fnapi
8 changes: 8 additions & 0 deletions docs/src/faqs/klio_vs_kubeflow.rst
@@ -0,0 +1,8 @@
How does Klio compare to Kubeflow?
==================================

`Kubeflow <https://www.kubeflow.org/docs/about/kubeflow/>`_ is a very powerful platform that uses `Kubernetes <https://kubernetes.io/>`_ under the hood to help construct workflows.
Kubeflow allows its users to process data and to use that data to experiment & train ML models.


Klio, on the other hand, takes complex algorithms, whether trained ML models or media processing algorithms, and enables their deployment within research or production pipelines, with a focus on optimizing heavy file I/O and its related resources.
4 changes: 2 additions & 2 deletions docs/src/faqs/klio_vs_scio.rst
@@ -1,4 +1,4 @@
How does Klio relate to Spotify Scio?
=====================================
How does Klio compare to Spotify Scio?
======================================

Both projects bring Apache Beam to new domains: `Scio <https://github.com/spotify/scio>`_ brings Beam pipelines to Scala, while Klio focuses Beam on analyzing, manipulating, and transforming large binary media (e.g. images, audio, video) where the content in its native form can’t really fit or be analyzed in a database in any meaningful way.
7 changes: 7 additions & 0 deletions docs/src/faqs/klio_vs_tensorflow_serving.rst
@@ -0,0 +1,7 @@
How does Klio compare to TensorFlow Serving?
============================================

`TensorFlow Serving <https://www.tensorflow.org/tfx/guide/serving>`_ enables creating a service around a TensorFlow-based ML model.
Although a streaming Klio job could be compared to serving a model behind a service, Klio is meant for media processing pipelines, not necessarily for serving a model.
Klio enables heavy file I/O for processing media, whether or not a model is used.
Klio is also agnostic to the type of ML model used (TensorFlow, PyTorch, scikit-learn, etc.).
5 changes: 5 additions & 0 deletions docs/src/faqs/native_beam.rst
@@ -0,0 +1,5 @@
Does Klio allow you to use native Beam components?
==================================================

Yes, definitely. Klio's design is meant to enhance Beam's primitives (`Pipeline <https://beam.apache.org/documentation/programming-guide/#creating-a-pipeline>`_, `PCollections <https://beam.apache.org/documentation/programming-guide/#pcollections>`_, `Transforms <https://beam.apache.org/documentation/programming-guide/#transforms>`_, etc.).
Writing a Klio pipeline should feel very similar to writing a Beam pipeline.
8 changes: 8 additions & 0 deletions docs/src/faqs/other_oss.rst
@@ -0,0 +1,8 @@
Why not other open source frameworks?
=====================================

There are a number of well-developed, supported data processing frameworks available in the open.
At Spotify, we've standardized around `Apache Beam <https://beam.apache.org/>`_ with our sister open source framework, `Scio <https://spotify.github.io/scio/>`_.
We've found that Beam is a framework that engineers and researchers alike can pick up quickly to create `embarrassingly parallel <https://en.wikipedia.org/wiki/Embarrassingly_parallel>`_ pipelines.
But no solution yet existed to handle the resource and environment demands of processing media.

4 changes: 4 additions & 0 deletions docs/src/faqs/performance.rst
@@ -0,0 +1,4 @@
What's the performance of a Klio pipeline?
==========================================

As a simple test, we've `downsampled <https://en.wikipedia.org/wiki/Downsampling_(signal_processing)>`_ 10s of millions of songs in :violetemph:`6 days` using 600 `n1-standard-16 <https://cloud.google.com/compute/docs/machine-types#n1_machine_types>`_ machines (16 vCPUs, 60GB memory).
5 changes: 5 additions & 0 deletions docs/src/faqs/research_vs_prod_loads.rst
@@ -0,0 +1,5 @@
Can Klio be used for smaller loads for ongoing research, or just production loads?
==================================================================================

Klio is meant for processing media, regardless of the size of the collection of media files it processes.
It can be used on cloud infrastructure, or locally on one's computer.
3 changes: 3 additions & 0 deletions docs/src/spelling_wordlist.txt
@@ -6,6 +6,7 @@ GaugeDispatcher
Kleio
KlioMessage
KlioMessages
Kubeflow
Makefile
PCollection
PCollections
@@ -80,6 +81,7 @@ rst
run
runtime
schemas
scikit
spectrogram
spectrograms
spotify
@@ -98,5 +100,6 @@ unpickles
unserialization
userguide
utils
vCPUs
virtualenv
virtualenvs
108 changes: 54 additions & 54 deletions docs/src/userguide/config/job_config.rst
@@ -32,38 +32,38 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.events``
---------------------

Event inputs/outputs designate where to read/write KlioMessages.

The :doc:`KlioMessage <../pipeline/message>` contains a unique identifier of some sort that
refers to a unit of work (e.g. file IDs, track IDs, etc.). This unique identifier can then be
used to look up the binary data as configured in ``job_config.data`` for the job to process. A
job's events can therefore be seen as "triggers" of work needing to be done on particular
binary data.

Example:

.. code-block:: yaml

   name: my-cool-job
   pipeline_options:
     streaming: True
   job_config:
     events:
       inputs:
         - type: pubsub
           subscription: my-input-subscription
       outputs:
         - type: pubsub
           topic: my-output-topic

``job_config.events.inputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of input configurations that will be used to determine when and how to do work.

If more than one input is configured, please familiarize yourself with
:doc:`how multiple configured inputs <../pipeline/multiple_inputs>` are handled in Klio.


.. option:: job_config.events.inputs[].type STR
@@ -100,15 +100,15 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.events.outputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of output configurations that Klio will use to signify that work has been
completed.

.. warning::

   Currently, only one event output configuration is supported in Klio out of the box.

   If more than one output is required, set ``skip_klio_write`` of each output configuration
   to ``True``.
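A minimal sketch of this workaround, with hypothetical topic names; each output sets ``skip_klio_write``, leaving the pipeline itself responsible for writing both outputs:

.. code-block:: yaml

   job_config:
     events:
       outputs:
         - type: pubsub
           topic: my-first-output-topic
           skip_klio_write: True
         - type: pubsub
           topic: my-second-output-topic
           skip_klio_write: True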


.. option:: job_config.events.outputs[].type STR
@@ -140,29 +140,29 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.data``
-------------------

Data inputs/outputs refer to where the files are (typically GCS buckets) that ``KlioMessages``
generated by event inputs refer to.


``job_config.data.inputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of input configurations that Klio will use to look for data to be processed.

By default, Klio will drop a ``KlioMessage`` when input data for the corresponding element ID
does not exist. Set ``skip_klio_existence_check`` to ``True`` to implement different behavior.

.. note::

   Klio does not upload data automatically to the configured location. This must be done from
   within the pipeline.

.. warning::

   Currently, only one data input configuration is supported in Klio out of the box.

   If more than one input is required, set ``skip_klio_existence_check`` of each input
   configuration to ``True``.
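A minimal sketch of this workaround for data inputs, assuming GCS locations; the bucket paths, file suffixes, and field layout below are illustrative. Each input skips the built-in existence check, leaving any checks to the pipeline:

.. code-block:: yaml

   job_config:
     data:
       inputs:
         - type: gcs
           location: gs://my-bucket/first-inputs
           file_suffix: .wav
           skip_klio_existence_check: True
         - type: gcs
           location: gs://my-bucket/second-inputs
           file_suffix: .wav
           skip_klio_existence_check: True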


.. option:: job_config.data.inputs[].type STR
@@ -205,20 +205,20 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.data.outputs[]``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A list of output configurations that Klio will use to look for data that has already been
processed.

.. note::

   Klio does not upload data automatically to the configured location. This must be done from
   within the pipeline.

.. warning::

   Currently, only one data output configuration is supported in Klio out of the box.

   If more than one output is required, set ``skip_klio_existence_check`` of each output
   configuration to ``True``.



@@ -264,13 +264,13 @@ Klio-specific and :ref:`user-specified custom <custom-conf>` job configuration.
``job_config.metrics``
----------------------

With no additional configuration needed, metrics will be turned on and collected. The default
client depends on the runner:

| **DataflowRunner**: Stackdriver log-based metrics
| **DirectRunner**: Python standard library logging

See :doc:`documentation on metrics <../pipeline/metrics>` for information on how to emit metrics from a pipeline.
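As a sketch (the keys shown are illustrative of the shape, not an exhaustive reference), the default logger client can be tuned with a dictionary or disabled outright with a boolean:

.. code-block:: yaml

   job_config:
     metrics:
       # configure the default logger-based client ...
       logger:
         level: debug
       # ... or turn it off entirely:
       # logger: False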


.. option:: job_config.metrics.logger DICT | BOOL
