add support for analysis of source code/scripted languages #1080

Draft
wants to merge 51 commits into
base: master
bbd3f70
Added initial capa control flow for scripts in C#.
adamstorek Jun 27, 2022
8173397
Implemented some further basic TreeSitter Extractor-related concepts …
adamstorek Jun 27, 2022
428f6bc
Modified mypy config file to ignore tree-sitter's missing exports.
adamstorek Jun 28, 2022
a6d7ba2
Implemented core tree sitter engine component with C# queries that se…
adamstorek Jun 28, 2022
80bf78b
Implemented script global extraction handlers (mostly wrapping existi…
adamstorek Jun 28, 2022
cf3dc7e
Reworked format parsing to align better with the rest of capa logic.
adamstorek Jun 28, 2022
9d7f575
Implemented a large part of the C# functionality; refactored the Tree…
adamstorek Jun 29, 2022
3d4b4ec
Added function-level feature extraction.
adamstorek Jun 30, 2022
eca7ead
Bug fixes and code refactoring of the Tree Sitter extractor.
adamstorek Jun 30, 2022
5fd953f
Added tree_sitter to requirements in setup.py.
adamstorek Jun 30, 2022
1f79db9
Added tests for TreeSitterExtractorEngine initialization, new object …
adamstorek Jul 1, 2022
a58bc0b
Added more TreeSitterExtractorEngine tests for pure C#.
adamstorek Jul 1, 2022
5ddb8ba
Added last remaining tests for the TreeSitterExtractorEngine class an…
adamstorek Jul 1, 2022
31e2fb9
Reverted yielding only non-empty strings in order to stay consistent …
adamstorek Jul 5, 2022
5bf3f18
Removing functions that should not be used in tree-sitter extractor (…
adamstorek Jul 5, 2022
a4529fc
Modifying extraction of global statements to omit local function decl…
adamstorek Jul 5, 2022
d5de9a1
Added script language feature to freeze.
adamstorek Jul 5, 2022
6c10458
Added test cases for TS Extractor.
adamstorek Jul 5, 2022
9bd9824
Refactored query bindings.
adamstorek Jul 6, 2022
2594849
Added support for template parsing.
adamstorek Jul 6, 2022
619ed94
Added support for HTML parsing.
adamstorek Jul 6, 2022
5e23802
Implemented the necessary modifications to support embedded templates…
adamstorek Jul 7, 2022
5d83e8d
Added more buildings to build; minor style improvement.
adamstorek Jul 7, 2022
9570523
Further refactored the Tree-sitter queries and fixed minor template e…
adamstorek Jul 7, 2022
7c5e6e3
Refactored extractor engine tests and began adding new template tests.
adamstorek Jul 7, 2022
1e0326a
Added new tests for embedded template testing and refactored a few al…
adamstorek Jul 8, 2022
ca1939f
Bug fixes in extractor and HTML Tree-sitter engine.
adamstorek Jul 8, 2022
d7ab2db
Fixed important namespace-parsing bugs.
adamstorek Jul 11, 2022
5cfbecc
Further improvement to namespace parsing, including default namespace…
adamstorek Jul 11, 2022
26cc1bc
Added more tests and a few minor bug fixes.
adamstorek Jul 11, 2022
2a9e76f
Added language-specific integer parsing.
adamstorek Jul 12, 2022
672ca71
Fixed an important bug in FileOffsetRangeAddress comparison method.
adamstorek Jul 12, 2022
ca426ca
Added more ASPX tests.
adamstorek Jul 12, 2022
fd80277
Fixed the capa control flow to fully support capa scripts.
adamstorek Jul 12, 2022
d0c4acb
Major changes: switching imports and function names to properties, st…
adamstorek Jul 18, 2022
ad31d83
Fixed property-extraction bugs.
adamstorek Jul 19, 2022
e52a9b3
Added few more test cases.
adamstorek Jul 19, 2022
b27713b
Minor style improvements.
adamstorek Jul 19, 2022
b2df2b0
Removed deprecated parse_integer.
adamstorek Jul 19, 2022
a0379a6
Added more tests; fixed integer parsing related bugs.
adamstorek Jul 19, 2022
eeecb63
Fixing address range bug; refactoring and cleanup.
adamstorek Jul 20, 2022
cebc5e1
Incorporated more tests.
adamstorek Jul 20, 2022
d7dcc94
Added support for Python.
adamstorek Jul 26, 2022
32dc5ff
Added more python test cases; fixed a number of python bugs; further …
adamstorek Jul 29, 2022
5e85a6e
Implemented namespace aliasing; further refactored the codebase.
adamstorek Aug 2, 2022
614900f
Refactored/simplified parts of the codebase to improve readability; a…
adamstorek Aug 3, 2022
bb08181
Implemented script language auto-detection.
adamstorek Aug 3, 2022
1fd9d4a
Removed a spurious import.
adamstorek Aug 3, 2022
7ba978f
Added more test cases; moved script language feature to global featur…
adamstorek Aug 5, 2022
25cf09b
Introduced auto-detection to template-script parsing, builtins namesp…
adamstorek Aug 10, 2022
e693573
Attempted to implement the class extraction as specified last Friday …
adamstorek Aug 12, 2022
Added test cases for TS Extractor.
adamstorek committed Jul 19, 2022
commit 6c10458784c20d6bff0653ca8f7cf13f9bdf61df
3 changes: 1 addition & 2 deletions capa/features/extractors/script.py
@@ -24,7 +24,6 @@ def extract_format() -> Iterator[Tuple[Feature, Address]]:
 
 
 def get_language_from_ext(path: str):
-    _, ext = os.path.splitext(path)
-    if ext == ".cs":
+    if path.endswith((".cs", ".cs_")):
         return LANG_CS
     raise ValueError(f"{path} has an unrecognized or an unsupported extension.")
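Review note: the old check compared the result of `os.path.splitext` against `".cs"`, so a sample named with a trailing underscore (extension `.cs_`) was rejected. The new check passes a tuple of suffixes to `str.endswith`. A minimal standalone sketch of that behavior follows; the value assigned to `LANG_CS` here is a placeholder assumption, since the real constant is not shown in this diff.

```python
# Sketch only: the value of LANG_CS is a placeholder, not necessarily capa's.
LANG_CS = "c#"


def get_language_from_ext(path: str) -> str:
    # str.endswith accepts a tuple of suffixes and is True if any one matches.
    # The match is case-sensitive and applies to the raw end of the path.
    if path.endswith((".cs", ".cs_")):
        return LANG_CS
    raise ValueError(f"{path} has an unrecognized or an unsupported extension.")


if __name__ == "__main__":
    for candidate in ("sample.cs", "f397cb.cs_"):
        print(candidate, "->", get_language_from_ext(candidate))
```

Because the comparison is case-sensitive, a path ending in `.CS` would still raise `ValueError` under this check.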
49 changes: 46 additions & 3 deletions tests/fixtures.py
@@ -13,7 +13,7 @@
 import itertools
 import contextlib
 import collections
-from typing import Set, Dict
+from typing import Set, Dict, Union
 from functools import lru_cache
 
 import pytest
@@ -38,6 +38,7 @@
     Feature,
 )
 from capa.features.address import Address
+from capa.features.extractors.ts.extractor import TreeSitterFeatureExtractor
 from capa.features.extractors.base_extractor import BBHandle, InsnHandle, FunctionHandle
 from capa.features.extractors.dnfile.extractor import DnfileFeatureExtractor
 
@@ -177,6 +178,13 @@ def get_ts_extractor_engine(language, path):
     return capa.features.extractors.ts.engine.TreeSitterExtractorEngine(language, path)
 
 
+@lru_cache(maxsize=1)
+def get_ts_extractor(path):
+    import capa.features.extractors.ts.extractor
+
+    return capa.features.extractors.ts.extractor.TreeSitterFeatureExtractor(path)
+
+
 def extract_global_features(extractor):
     features = collections.defaultdict(set)
     for feature, va in extractor.extract_global_features():
@@ -359,9 +367,13 @@ def sample(request):
     return resolve_sample(request.param)
 
 
-def get_function(extractor, fva: int) -> FunctionHandle:
+def get_function(extractor, fva: Union[int, tuple]) -> FunctionHandle:
+    if isinstance(fva, tuple) and not isinstance(extractor, TreeSitterFeatureExtractor):
+        raise ValueError("invalid fva format")
     for fh in extractor.get_functions():
-        if isinstance(extractor, DnfileFeatureExtractor):
+        if isinstance(extractor, TreeSitterFeatureExtractor):
+            addr = (fh.inner.start_byte, fh.inner.end_byte)
+        elif isinstance(extractor, DnfileFeatureExtractor):
             addr = fh.inner.offset
         else:
             addr = fh.address
@@ -475,6 +487,37 @@ def scope(request):
     return resolve_scope(request.param)
 
 
+def resolve_scope_ts(scope):
+    if scope == "global":
+        inner_fn = lambda extractor: extract_global_features(extractor)
+    elif scope == "file":
+
+        def inner_fn(extractor):
+            features = extract_file_features(extractor)
+            for k, vs in extract_global_features(extractor).items():
+                features[k].update(vs)
+            return features
+
+    elif scope.startswith("function"):
+        # like `function=(155, 192)`
+        def inner_fn(extractor):
+            fh = get_function(extractor, eval(scope.partition("=")[2]))
+            features = extract_function_features(extractor, fh)
+            for k, vs in extract_global_features(extractor).items():
+                features[k].update(vs)
+            return features
+
+    else:
+        raise ValueError("unexpected scope fixture")
+    inner_fn.__name__ = scope
+    return inner_fn
+
+
+@pytest.fixture
+def scope_ts(request):
+    return resolve_scope_ts(request.param)
+
+
 def make_test_id(values):
     return "-".join(map(str, values))
 
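Review note: `resolve_scope_ts` turns a scope id such as `function=(0x38,0x16c)` into a `(start_byte, end_byte)` tuple by calling `eval` on the text after the `=`. A possible alternative, sketched here under the assumption that scope ids only ever carry a two-integer tuple, is `ast.literal_eval`, which accepts hex literals and tuples but rejects arbitrary expressions. The helper name `parse_function_scope` is hypothetical and not part of this PR.

```python
import ast
from typing import Tuple


def parse_function_scope(scope: str) -> Tuple[int, int]:
    """Parse a scope id like `function=(0x38,0x16c)` into (start_byte, end_byte)."""
    prefix, sep, value = scope.partition("=")
    if prefix != "function" or not sep:
        raise ValueError(f"unexpected scope fixture: {scope}")
    # literal_eval evaluates Python literals only (ints, tuples, strings, ...),
    # so a function call or attribute access in the id raises ValueError.
    parsed = ast.literal_eval(value)
    is_pair = isinstance(parsed, tuple) and len(parsed) == 2
    if not is_pair or not all(isinstance(item, int) for item in parsed):
        raise ValueError(f"expected a (start, end) tuple, got: {value}")
    return parsed


if __name__ == "__main__":
    print(parse_function_scope("function=(0x38,0x16c)"))
```

The returned tuple has the same shape as the `(fh.inner.start_byte, fh.inner.end_byte)` address that `get_function` builds for `TreeSitterFeatureExtractor`, so it could be passed as `fva` unchanged.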
32 changes: 25 additions & 7 deletions tests/test_ts.py
@@ -1,10 +1,13 @@
 from typing import List, Tuple
 
 import pytest
+import fixtures
 from fixtures import *
 from tree_sitter import Node, Tree
 
-from capa.features.address import FileOffsetRangeAddress
+from capa.features.file import Import
+from capa.features.common import OS, OS_ANY, ARCH_ANY, FORMAT_SCRIPT, Arch, Format, String, Namespace, ScriptLanguage
+from capa.features.address import NO_ADDRESS, FileOffsetRangeAddress
 from capa.features.extractors.script import LANG_CS
 from capa.features.extractors.ts.query import QueryBinding
 from capa.features.extractors.ts.engine import TreeSitterExtractorEngine
@@ -26,10 +29,6 @@ def do_test_range(engine: TreeSitterExtractorEngine, node: Node, expected_range:
     assert engine.get_range(node).startswith(expected_range) if startswith else engine.get_range(node) == expected_range
 
 
-def do_test_id_range(engine: TreeSitterExtractorEngine, node: Node, expected_id_range: str, startswith: bool = False):
-    do_test_range(engine, engine.get_object_id(node), expected_id_range, startswith)
-
-
 def do_test_range_address(engine: TreeSitterExtractorEngine, node: Node):
     assert isinstance(engine.get_address(node), FileOffsetRangeAddress)
     addr = engine.get_address(node)
@@ -235,8 +234,6 @@ def do_test_ts_engine_function_names_parsing(
             "global statements": [
                 'string stdout = "";',
                 'string stderr = "";',
-                "void die() {",
-                "void Page_Load(object sender, System.EventArgs e) {",
             ],
             "all import names": ["System.Diagnostics.ProcessStartInfo", "System.Diagnostics.Process"],
             "all function names": [],
@@ -259,3 +256,24 @@ def test_ts_engine(request: pytest.FixtureRequest, engine_str: str, expected_dic
     do_test_ts_engine_global_statements_parsing(engine, expected_dict["global statements"])
     do_test_ts_engine_namespaces_parsing(engine, expected_dict["namespaces"])
     do_test_ts_engine_default_range_address(engine)
+
+
+FEATURE_PRESENCE_TESTS_SCRIPTS = sorted(
+    [
+        ("cs_f397cb", "global", Arch(ARCH_ANY), True),
+        ("cs_f397cb", "global", OS(OS_ANY), True),
+        ("cs_f397cb", "file", Format(FORMAT_SCRIPT), True),
+        ("cs_f397cb", "file", ScriptLanguage(LANG_CS), True),
+        ("cs_f397cb", "file", Namespace("System"), True),
+        ("cs_f397cb", "file", String(""), True),
+        ("cs_f397cb", "function=(0x38,0x16c)", String("Not Found"), True),
+        ("cs_f397cb", "function=(0x16e,0x7ce)", String("127.0.0.1"), True),
+        ("cs_f397cb", "function=(0x16e,0x7ce)", Import("System.Diagnostics.ProcessStartInfo"), True),
+        ("cs_f397cb", "function=(0x16e,0x7ce)", Import("System.Diagnostics.Process"), True),
+    ]
+)
+
+
+@parametrize("sample, scope_ts, feature, expected", FEATURE_PRESENCE_TESTS_SCRIPTS, indirect=["sample", "scope_ts"])
+def test_ts_extractor(sample, scope_ts, feature, expected):
+    fixtures.do_test_feature_presence(fixtures.get_ts_extractor, sample, scope_ts, feature, expected)