feat: add support for aggregates and toxicity classification (georgia-tech-db#551)

Merging to resolve the emotion analysis model issue. We still need to take care of the other issues highlighted by @gaurav274.
jarulraj authored Jan 7, 2023
1 parent f88db49 commit 39183a4
Showing 29 changed files with 747 additions and 833 deletions.
7 changes: 3 additions & 4 deletions .circleci/config.yml
@@ -6,6 +6,7 @@ orbs:
workflows:
main:
jobs:
- Windows
- test:
name: "Linux - Python v3.7"
v: "3.7"
@@ -18,9 +19,8 @@ workflows:
- test:
name: "Linux - Python v3.10"
v: "3.10"
- Windows
#- test:
# name: "Python v3.11" # missing Torchvision
# name: "Linux - Python v3.11" # missing Torchvision
# v: "3.11"

jobs:
@@ -46,8 +46,7 @@ jobs:
command: |
pip install --upgrade pip
pip install evadb
# bash script/test/package.sh
bash script/test/package.sh
- run:
name: Install EVA package from GitHub repo with all dependencies
37 changes: 26 additions & 11 deletions README.md
@@ -23,16 +23,16 @@

EVA is a **database system tailored for video analytics** -- think PostgreSQL for videos. It supports a SQL-like language for querying videos like:

* examining the "emotion palette" of different actors
* finding gameplays that lead to a touchdown in a football game
* examining the movement of vehicles in a traffic video
* finding touchdowns in a football game

EVA comes with a wide range of commonly used computer vision models. It is written in Python, and it is licensed under the Apache license.

If you are wondering why you might need a video database system, start with the page on <a href="https://evadb.readthedocs.io/en/latest/source/overview/video.html#">Video Database Systems</a>. It describes how EVA lets users easily make use of deep learning models and how they can reduce the money spent on inference on large image or video datasets.
If you are wondering why you might need a video database system, start with the page on <a href="https://evadb.readthedocs.io/en/stable/source/overview/video.html#">Video Database Systems</a>. It describes how EVA lets users easily make use of deep learning models and how they can reduce the money spent on inference on large image or video datasets.

The <a href="https://evadb.readthedocs.io/en/latest/source/overview/installation.html">Getting Started</a> page shows how you can use EVA for different computer vision tasks: image classification, object detection, action recognition, and how you can easily extend EVA to support your custom deep learning model in the form of user-defined functions.
The <a href="https://evadb.readthedocs.io/en/stable/source/overview/installation.html">Getting Started</a> page shows how you can use EVA for different computer vision tasks: image classification, object detection, action recognition, and how you can easily extend EVA to support your custom deep learning model in the form of user-defined functions.

The <a href="https://evadb.readthedocs.io/en/latest/source/tutorials/index.html">User Guides</a> section contains Jupyter Notebooks that demonstrate how to use various features of EVA. Each notebook includes a link to Google Colab, where you can run the code by yourself.
The <a href="https://evadb.readthedocs.io/en/stable/source/tutorials/index.html">User Guides</a> section contains Jupyter Notebooks that demonstrate how to use various features of EVA. Each notebook includes a link to Google Colab, where you can run the code by yourself.

## Why EVA? ##

@@ -52,7 +52,7 @@ The <a href="https://evadb.readthedocs.io/en/latest/source/tutorials/index.html"
</details>

## Links
* [Documentation](https://evadb.readthedocs.io/en/latest/)
* [Documentation](https://evadb.readthedocs.io/)
* [Tutorials](https://github.com/georgia-tech-db/eva/blob/master/tutorials/03-emotion-analysis.ipynb)
* [Join Slack](https://join.slack.com/t/eva-db/shared_invite/zt-1i10zyddy-PlJ4iawLdurDv~aIAq90Dg)
* [Demo](https://ada-00.cc.gatech.edu/eva/playground)
@@ -124,22 +124,37 @@ IMPL 'eva/udfs/fastrcnn_object_detector.py';

## Illustrative EVA Applications

### :desert_island: Traffic Analysis Application using Object Detection Model
### 🔮 Traffic Analysis (Object Detection Model)
| Source Video | Query Result |
|---------------|--------------|
|<img alt="Source Video" src="https://github.com/georgia-tech-db/eva/releases/download/v0.1.0/traffic-input.webp" width="300"> |<img alt="Query Result" src="https://github.com/georgia-tech-db/eva/releases/download/v0.1.0/traffic-output.webp" width="300"> |

### :desert_island: MNIST Digit Recognition using Image Classification Model
### 🔮 MNIST Digit Recognition (Image Classification Model)
| Source Video | Query Result |
|---------------|--------------|
|<img alt="Source Video" src="https://github.com/georgia-tech-db/eva/releases/download/v0.1.0/mnist-input.webp" width="150"> |<img alt="Query Result" src="https://github.com/georgia-tech-db/eva/releases/download/v0.1.0/mnist-output.webp" width="150"> |

### :desert_island: Movie Analysis Application using Face Detection + Emotion Classification Models
### 🔮 Movie Analysis (Face Detection + Emotion Classification Models)

| Source Video | Query Result |
|---------------|--------------|
|<img alt="Source Video" src="https://github.com/georgia-tech-db/eva/releases/download/v0.1.0/gangubai-input.webp" width="400"> |<img alt="Query Result" src="https://github.com/georgia-tech-db/eva/releases/download/v0.1.0/gangubai-output.webp" width="400"> |

### 🔮 [License Plate Recognition](https://github.com/georgia-tech-db/eva-application-template) (Plate Detection + OCR Extraction Models)

| Source Image | Query Result |
|---------------|--------------|
|<img alt="Source Image" src="https://raw.githubusercontent.com/georgia-tech-db/eva-application-template/main/README_files/README_14_6.png" width="400"> |<img alt="Query Result" src="https://raw.githubusercontent.com/georgia-tech-db/eva-application-template/main/README_files/README_19_1.png" width="400"> |

### 🔮 [Meme Toxicity Classification](https://github.com/georgia-tech-db/toxicity-classification) (OCR Extraction + Toxicity Classification Models)

| Source Image | Query Result |
|---------------|--------------|
|<img alt="Source Image" src="https://raw.githubusercontent.com/georgia-tech-db/toxicity-classification/main/README_files/README_16_1.png" width="300"> |<img alt="Query Result" src="https://raw.githubusercontent.com/georgia-tech-db/toxicity-classification/main/README_files/README_16_2.png" width="300"> |




## Community

Join the EVA community on [Slack](https://join.slack.com/t/eva-db/shared_invite/zt-1i10zyddy-PlJ4iawLdurDv~aIAq90Dg) to ask questions and to share your ideas for improving EVA.
@@ -153,11 +168,11 @@ Join the EVA community on [Slack](https://join.slack.com/t/eva-db/shared_invite/
[![PyPI Version](https://img.shields.io/pypi/v/evadb.svg)](https://pypi.org/project/evadb)
[![CI Status](https://circleci.com/gh/georgia-tech-db/eva.svg?style=svg)](https://circleci.com/gh/georgia-tech-db/eva)
[![Coverage Status](https://coveralls.io/repos/github/georgia-tech-db/eva/badge.svg?branch=master)](https://coveralls.io/github/georgia-tech-db/eva?branch=master)
[![Documentation Status](https://readthedocs.org/projects/evadb/badge/?version=latest)](https://evadb.readthedocs.io/en/latest/index.html)
[![Documentation Status](https://readthedocs.org/projects/evadb/badge/?version=stable)](https://evadb.readthedocs.io/en/stable/index.html)

To file a bug or request a feature, please use GitHub issues. Pull requests are welcome.
For more information on installing from source and contributing to EVA, see our
[contributing guidelines](https://evadb.readthedocs.io/en/latest/source/contribute/index.html).
[contributing guidelines](https://evadb.readthedocs.io/en/stable/source/contribute/index.html).

## License
Copyright (c) 2018-2022 [Georgia Tech Database Group](http://db.cc.gatech.edu/)
Binary file added data/detoxify/meme1.jpg
Binary file added data/detoxify/meme2.jpg
1 change: 1 addition & 0 deletions eva/catalog/catalog_utils.py
@@ -38,6 +38,7 @@ def get_video_table_column_definitions() -> List[ColumnDefinition]:
ColumnDefinition(
"data", ColumnType.NDARRAY, NdArrayType.UINT8, (None, None, None)
),
ColumnDefinition("seconds", ColumnType.FLOAT, None, []),
]
return columns

17 changes: 11 additions & 6 deletions eva/expression/aggregation_expression.py
@@ -37,14 +37,14 @@ def __init__(
) # can also be a float

def evaluate(self, *args, **kwargs):
batch = self.get_child(0).evaluate(*args, **kwargs)
batch: Batch = self.get_child(0).evaluate(*args, **kwargs)
if self.etype == ExpressionType.AGGREGATION_FIRST:
batch = batch[0]
if self.etype == ExpressionType.AGGREGATION_LAST:
elif self.etype == ExpressionType.AGGREGATION_LAST:
batch = batch[-1]
if self.etype == ExpressionType.AGGREGATION_SEGMENT:
elif self.etype == ExpressionType.AGGREGATION_SEGMENT:
batch = Batch.stack(batch)
if self.etype == ExpressionType.AGGREGATION_SUM:
elif self.etype == ExpressionType.AGGREGATION_SUM:
batch.aggregate("sum")
elif self.etype == ExpressionType.AGGREGATION_COUNT:
batch.aggregate("count")
@@ -55,9 +55,14 @@ def evaluate(self, *args, **kwargs):
elif self.etype == ExpressionType.AGGREGATION_MAX:
batch.aggregate("max")
batch.reset_index()
# TODO ACTION:
# Add raise exception if data type doesn't match

column_name = self.etype.name
if column_name.find("AGGREGATION_") != -1:
# AGGREGATION_MAX -> MAX
updated_column_name = column_name.replace("AGGREGATION_", "")
batch.modify_column_alias(updated_column_name)

# TODO: Raise exception if data type doesn't match
return batch

def get_symbol(self) -> str:
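The alias step above turns the enum name into the output column label (AGGREGATION_SUM becomes SUM). A rough pandas-only sketch of that behavior outside EVA's Batch class, with an illustrative column name and values:

```python
import pandas as pd

# Illustrative frame with an aliased column, as a child expression might produce
frames = pd.DataFrame({"myvideo.id": [1, 2, 3, 4]})

# Mirror batch.aggregate("sum") followed by batch.reset_index()
aggregated = frames.agg(["sum"]).reset_index(drop=True)

# Mirror modify_column_alias(): AGGREGATION_SUM -> SUM, keeping the column suffix
alias = "AGGREGATION_SUM".replace("AGGREGATION_", "")
aggregated.columns = [f"{alias}.{str(c).split('.')[-1]}" for c in aggregated.columns]

print(aggregated)  # one row, column "SUM.id", value 10
```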
4 changes: 3 additions & 1 deletion eva/expression/comparison_expression.py
@@ -48,7 +48,9 @@ def evaluate(self, *args, **kwargs):
elif len(rbatch) == 1:
rbatch.repeat(len(lbatch))
else:
raise Exception("Left and Right batch does not have equal elements")
raise Exception(
f"Left and Right batch does not have equal elements: left: {len(lbatch)} right: {len(rbatch)}"
)

if self.etype == ExpressionType.COMPARE_EQUAL:
return Batch.from_eq(lbatch, rbatch)
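For context, the surrounding evaluate() broadcasts a single-row batch to the length of the other side before comparing; the change above only makes the mismatch error report both lengths. A rough pandas analogue of that broadcast (illustrative data, not the Batch API itself):

```python
import pandas as pd

lbatch = pd.DataFrame({"score": [0.9, 0.4, 0.7]})
rbatch = pd.DataFrame({"threshold": [0.5]})  # single row, repeated to match

if len(rbatch) == 1:
    rbatch = pd.concat([rbatch] * len(lbatch), ignore_index=True)
elif len(lbatch) != len(rbatch):
    raise Exception(
        f"left and right batches differ in length: "
        f"left: {len(lbatch)} right: {len(rbatch)}"
    )

print(lbatch["score"].values > rbatch["threshold"].values)  # [ True False  True]
```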
12 changes: 9 additions & 3 deletions eva/models/storage/batch.py
@@ -278,7 +278,9 @@ def merge_column_wise(cls, batches: List[Batch], auto_renaming=False) -> Batch:
if not len(batches):
return Batch()
frames = [batch.frames for batch in batches]
new_frames = pd.concat(frames, axis=1, copy=False, ignore_index=False)
new_frames = pd.concat(frames, axis=1, copy=False, ignore_index=False).fillna(
method="ffill"
)
if new_frames.columns.duplicated().any():
logger.warn("Duplicated column name detected {}".format(new_frames))
return Batch(new_frames)
@@ -427,9 +429,9 @@ def modify_column_alias(self, alias: Union[Alias, str]) -> None:
]
else:
for col_name in self.columns:
if "." in col_name:
if "." in str(col_name):
new_col_names.append(
"{}.{}".format(alias.alias_name, col_name.split(".")[1])
"{}.{}".format(alias.alias_name, str(col_name).split(".")[1])
)
else:
new_col_names.append("{}.{}".format(alias.alias_name, col_name))
@@ -446,3 +448,7 @@ def drop_column_alias(self) -> None:
new_col_names.append(col_name)

self._frames.columns = new_col_names

def rename(self, columns) -> None:
"Rename column names"
self._frames.rename(columns=columns, inplace=True)
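The forward-fill added to merge_column_wise pads shorter frames when result batches of unequal length are stitched together column-wise. A minimal pandas sketch of that behavior with illustrative data (the real method operates on Batch.frames):

```python
import pandas as pd

# e.g. per-frame labels next to a single aggregate value
left = pd.DataFrame({"label": ["car", "truck", "car"]})
right = pd.DataFrame({"count": [3]})

merged = pd.concat([left, right], axis=1, copy=False, ignore_index=False).fillna(
    method="ffill"
)
print(merged)
# The lone "count" value is forward-filled down the shorter column:
#    label  count
# 0    car    3.0
# 1  truck    3.0
# 2    car    3.0
```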
5 changes: 2 additions & 3 deletions eva/parser/eva.lark
@@ -242,9 +242,8 @@ function_call: udf_function ->udf_function_call

udf_function: simple_id "(" function_args ")" dotted_id?


aggregate_windowed_function: aggregate_function_name "(" (ALL | DISTINCT)? function_arg ")"
| COUNT "(" ("*" | ALL? function_arg) ")"
aggregate_windowed_function: aggregate_function_name "(" function_arg ")"
| COUNT "(" (STAR | function_arg) ")"


aggregate_function_name: AVG | MAX | MIN | SUM | FIRST | LAST | SEGMENT
39 changes: 26 additions & 13 deletions eva/parser/lark_visitor/_functions.py
@@ -13,11 +13,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from lark import Tree
from lark import Token, Tree

from eva.expression.abstract_expression import ExpressionType
from eva.expression.aggregation_expression import AggregationExpression
from eva.expression.function_expression import FunctionExpression
from eva.expression.tuple_value_expression import TupleValueExpression
from eva.parser.create_udf_statement import CreateUDFStatement
from eva.parser.drop_udf_statement import DropUDFStatement
from eva.utils.logging_manager import logger
@@ -114,7 +115,17 @@ def create_udf(self, tree):

def get_aggregate_function_type(self, agg_func_name):
agg_func_type = None
if agg_func_name == "FIRST":
if agg_func_name == "COUNT":
agg_func_type = ExpressionType.AGGREGATION_COUNT
elif agg_func_name == "MIN":
agg_func_type = ExpressionType.AGGREGATION_MIN
elif agg_func_name == "MAX":
agg_func_type = ExpressionType.AGGREGATION_MAX
elif agg_func_name == "SUM":
agg_func_type = ExpressionType.AGGREGATION_SUM
elif agg_func_name == "AVG":
agg_func_type = ExpressionType.AGGREGATION_AVG
elif agg_func_name == "FIRST":
agg_func_type = ExpressionType.AGGREGATION_FIRST
elif agg_func_name == "LAST":
agg_func_type = ExpressionType.AGGREGATION_LAST
@@ -125,22 +136,24 @@ def get_aggregate_function_type(self, agg_func_name):
return agg_func_type

def aggregate_windowed_function(self, tree):
agg_func_name = self.visit(tree.children[0]).value

agg_func_arg = None
assert agg_func_name in [
"MIN",
"MAX",
"AVG",
"SUM",
"COUNT",
"FIRST",
"LAST",
"SEGMENT",
]
agg_func_name = None

for child in tree.children:
if isinstance(child, Tree):
if child.data == "function_arg":
agg_func_arg = self.visit(child)
elif child.data == "aggregate_function_name":
agg_func_name = self.visit(child).value
elif isinstance(child, Token):
token = child.value
# Support for COUNT(*)
if token != "*":
agg_func_name = token
else:
agg_func_arg = TupleValueExpression(col_name="id")

agg_func_type = self.get_aggregate_function_type(agg_func_name)
agg_expr = AggregationExpression(agg_func_type, None, agg_func_arg)
return agg_expr
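Two things happen in this visitor: the aggregate name is mapped to an ExpressionType member, and the bare * token from COUNT(*) is rewritten into a count over the id column, so a query along the lines of SELECT COUNT(*) FROM MyVideo (MyVideo being an illustrative table) counts frames. A condensed, illustrative restatement of that mapping as a lookup table rather than an if/elif chain; the dict below is not part of the codebase, only the enum members are:

```python
from eva.expression.abstract_expression import ExpressionType

# Illustrative lookup table equivalent to get_aggregate_function_type()
AGGREGATE_TYPES = {
    "COUNT": ExpressionType.AGGREGATION_COUNT,
    "MIN": ExpressionType.AGGREGATION_MIN,
    "MAX": ExpressionType.AGGREGATION_MAX,
    "SUM": ExpressionType.AGGREGATION_SUM,
    "AVG": ExpressionType.AGGREGATION_AVG,
    "FIRST": ExpressionType.AGGREGATION_FIRST,
    "LAST": ExpressionType.AGGREGATION_LAST,
    "SEGMENT": ExpressionType.AGGREGATION_SEGMENT,
}


def get_aggregate_type(agg_func_name: str):
    # Returns None for unknown names, matching the fall-through behavior above
    return AGGREGATE_TYPES.get(agg_func_name)
```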
6 changes: 5 additions & 1 deletion eva/readers/opencv_reader.py
@@ -59,7 +59,11 @@ def _read(self) -> Iterator[Dict]:
_, frame = video.read()
frame_id = begin
while frame is not None and frame_id <= end:
yield {"id": frame_id, "data": frame}
yield {
"id": frame_id,
"data": frame,
"seconds": frame_id // video.get(cv2.CAP_PROP_FPS),
}
_, frame = video.read()
frame_id += 1
else:
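The reader now attaches a coarse per-frame seconds value, computed as the frame index integer-divided by the container's FPS. A self-contained sketch of the same computation with OpenCV (the file name and loop structure are illustrative, not the reader's full sampling logic):

```python
import cv2

video = cv2.VideoCapture("sample.mp4")  # hypothetical input path
fps = video.get(cv2.CAP_PROP_FPS)

frame_id = 0
ok, frame = video.read()
while ok:
    # Whole-second offset of this frame, as in the reader above
    record = {"id": frame_id, "data": frame, "seconds": frame_id // fps}
    # ... hand `record` to the storage layer ...
    ok, frame = video.read()
    frame_id += 1
video.release()
```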
2 changes: 1 addition & 1 deletion eva/udfs/emotion_detector.py
@@ -108,7 +108,7 @@ def setup(self, threshold=0.85):

# pull model from dropbox if not present
if not os.path.exists(model_path):
model_url = "https://www.dropbox.com/s/bqblykok62d28mn/emotion_detector.t7"
model_url = "https://www.dropbox.com/s/x0a8bz53apvmoc9/emotion_detector.t7"
subprocess.run(["wget", model_url, "--directory-prefix", output_directory])

# self.get_device() infers device from the loaded model, so not using it
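For context, this URL feeds the download-if-missing path in setup(). A self-contained sketch of that pattern with the corrected link (the cache directory below is illustrative; the real path comes from EVA's configuration):

```python
import os
import subprocess

output_directory = "models/emotion_detector"  # hypothetical cache directory
model_path = os.path.join(output_directory, "emotion_detector.t7")

# Fetch the pretrained weights only if they are not cached yet
if not os.path.exists(model_path):
    model_url = "https://www.dropbox.com/s/x0a8bz53apvmoc9/emotion_detector.t7"
    subprocess.run(["wget", model_url, "--directory-prefix", output_directory])
```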
49 changes: 49 additions & 0 deletions eva/udfs/ndarray/timestamp.py
@@ -0,0 +1,49 @@
# coding=utf-8
# Copyright 2018-2022 EVA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time

import pandas as pd

from eva.udfs.abstract.abstract_udf import AbstractUDF


class Timestamp(AbstractUDF):
@property
def name(self) -> str:
return "Timestamp"

def setup(self):
pass

def forward(self, inp: pd.DataFrame) -> pd.DataFrame:
"""
inp: DataFrame -> out: DataFrame
second timestamp
0 int 0 string
1 int 1 string
"""

# Sanity check
if len(inp.columns) != 1:
raise ValueError("input must only contain one column (seconds)")

seconds = pd.DataFrame(inp[inp.columns[0]])
timestamp_result = seconds.apply(lambda x: self.format_timestamp(x[0]), axis=1)
outcome = pd.DataFrame({"timestamp": timestamp_result.values})
return outcome

def format_timestamp(self, num_of_seconds):
timestamp = time.strftime("%H:%M:%S", time.gmtime(num_of_seconds))
return timestamp
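A quick way to exercise the new UDF is to call forward() directly on a one-column DataFrame of second offsets, bypassing the query engine (assuming the class can be instantiated on its own; the input values are illustrative):

```python
import pandas as pd

from eva.udfs.ndarray.timestamp import Timestamp

udf = Timestamp()
inp = pd.DataFrame({"seconds": [0, 61, 3725]})
print(udf.forward(inp))
# timestamp column: 00:00:00, 00:01:01, 01:02:05
```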
