add support for analysis of source code/scripted languages #1080

Draft
wants to merge 51 commits into
base: master
bbd3f70
Added initial capa control flow for scripts in C#.
adamstorek Jun 27, 2022
8173397
Implemented some further basic TreeSitter Extractor-related concepts …
adamstorek Jun 27, 2022
428f6bc
Modified mypy config file to ignore tree-sitter's missing exports.
adamstorek Jun 28, 2022
a6d7ba2
Implemented core tree sitter engine component with C# queries that se…
adamstorek Jun 28, 2022
80bf78b
Implemented script global extraction handlers (mostly wrapping existi…
adamstorek Jun 28, 2022
cf3dc7e
Reworked format parsing to align better with the rest of capa logic.
adamstorek Jun 28, 2022
9d7f575
Implemented a large part of the C# functionality; refactored the Tree…
adamstorek Jun 29, 2022
3d4b4ec
Added function-level feature extraction.
adamstorek Jun 30, 2022
eca7ead
Bug fixes and code refactoring of the Tree Sitter extractor.
adamstorek Jun 30, 2022
5fd953f
Added tree_sitter to requirements in setup.py.
adamstorek Jun 30, 2022
1f79db9
Added tests for TreeSitterExtractorEngine initialization, new object …
adamstorek Jul 1, 2022
a58bc0b
Added more TreeSitterExtractorEngine tests for pure C#.
adamstorek Jul 1, 2022
5ddb8ba
Added last remaining tests for the TreeSitterExtractorEngine class an…
adamstorek Jul 1, 2022
31e2fb9
Reverted yielding only non-empty strings in order to stay consistent …
adamstorek Jul 5, 2022
5bf3f18
Removing functions that should not be used in tree-sitter extractor (…
adamstorek Jul 5, 2022
a4529fc
Modifying extraction of global statements to omit local function decl…
adamstorek Jul 5, 2022
d5de9a1
Added script language feature to freeze.
adamstorek Jul 5, 2022
6c10458
Added test cases for TS Extractor.
adamstorek Jul 5, 2022
9bd9824
Refactored query bindings.
adamstorek Jul 6, 2022
2594849
Added support for template parsing.
adamstorek Jul 6, 2022
619ed94
Added support for HTML parsing.
adamstorek Jul 6, 2022
5e23802
Implemented the necessary modifications to support embedded templates…
adamstorek Jul 7, 2022
5d83e8d
Added more buildings to build; minor style improvement.
adamstorek Jul 7, 2022
9570523
Further refactored the Tree-sitter queries and fixed minor template e…
adamstorek Jul 7, 2022
7c5e6e3
Refactored extractor engine tests and began adding new template tests.
adamstorek Jul 7, 2022
1e0326a
Added new tests for embedded template testing and refactored a few al…
adamstorek Jul 8, 2022
ca1939f
Bug fixes in extractor and HTML Tree-sitter engine.
adamstorek Jul 8, 2022
d7ab2db
Fixed important namespace-parsing bugs.
adamstorek Jul 11, 2022
5cfbecc
Further improvement to namespace parsing, including default namespace…
adamstorek Jul 11, 2022
26cc1bc
Added more tests and a few minor bug fixes.
adamstorek Jul 11, 2022
2a9e76f
Added language-specific integer parsing.
adamstorek Jul 12, 2022
672ca71
Fixed an important bug in FileOffsetRangeAddress comparison method.
adamstorek Jul 12, 2022
ca426ca
Added more ASPX tests.
adamstorek Jul 12, 2022
fd80277
Fixed the capa control flow to fully support capa scripts.
adamstorek Jul 12, 2022
d0c4acb
Major changes: switching imports and function names to properties, st…
adamstorek Jul 18, 2022
ad31d83
Fixed property-extraction bugs.
adamstorek Jul 19, 2022
e52a9b3
Added few more test cases.
adamstorek Jul 19, 2022
b27713b
Minor style improvements.
adamstorek Jul 19, 2022
b2df2b0
Removed deprecated parse_integer.
adamstorek Jul 19, 2022
a0379a6
Added more tests; fixed integer parsing related bugs.
adamstorek Jul 19, 2022
eeecb63
Fixing address range bug; refactoring and cleanup.
adamstorek Jul 20, 2022
cebc5e1
Incorporated more tests.
adamstorek Jul 20, 2022
d7dcc94
Added support for Python.
adamstorek Jul 26, 2022
32dc5ff
Added more python test cases; fixed a number of python bugs; further …
adamstorek Jul 29, 2022
5e85a6e
Implemented namespace aliasing; further refactored the codebase.
adamstorek Aug 2, 2022
614900f
Refactored/simplified parts of the codebase to improve readability; a…
adamstorek Aug 3, 2022
bb08181
Implemented script language auto-detection.
adamstorek Aug 3, 2022
1fd9d4a
Removed a spurious import.
adamstorek Aug 3, 2022
7ba978f
Added more test cases; moved script language feature to global featur…
adamstorek Aug 5, 2022
25cf09b
Introduced auto-detection to template-script parsing, builtins namesp…
adamstorek Aug 10, 2022
e693573
Attempted to implement the class extraction as specified last Friday …
adamstorek Aug 12, 2022
Reworked format parsing to align better with the rest of capa logic.
adamstorek committed Jul 19, 2022
commit cf3dc7e0c91073d645663bb3b94d36f74f827247
2 changes: 1 addition & 1 deletion capa/features/common.py
@@ -420,7 +420,7 @@ def __init__(self, value: str, description=None):
 FORMAT_SC32 = "sc32"
 FORMAT_SC64 = "sc64"
 FORMAT_FREEZE = "freeze"
-FORMAT_CS = "script_cs"
+FORMAT_SCRIPT = "script"
 FORMAT_UNKNOWN = "unknown"

10 changes: 6 additions & 4 deletions capa/features/extractors/script.py
@@ -1,6 +1,7 @@
+import os
 from typing import Tuple, Iterator

-from capa.features.common import OS, OS_ANY, ARCH_ANY, FORMAT_CS, Arch, Feature, ScriptLanguage
+from capa.features.common import OS, OS_ANY, ARCH_ANY, Arch, Feature, ScriptLanguage
 from capa.features.address import NO_ADDRESS, Address, FileOffsetRangeAddress

 LANG_CS = "c_sharp"
@@ -18,7 +19,8 @@ def extract_os() -> Iterator[Tuple[Feature, Address]]:
     yield OS(OS_ANY), NO_ADDRESS


-def get_language_from_format(format_: str) -> str:
-    if format_ == FORMAT_CS:
+def get_language_from_ext(path: str):
+    _, ext = os.path.splitext(path)
+    if ext == ".cs":
         return LANG_CS
-    return "unknown"
+    raise ValueError(f"{path} has an unrecognized or an unsupported extension.")
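The extension-based lookup introduced in script.py can be tried in isolation. The following is a minimal sketch, not the PR's code: the `EXT_TO_LANGUAGE` table is a hypothetical generalization of the single `if ext == ".cs"` check, and only the `LANG_CS` value and the error message are taken from the diff.

```python
import os

LANG_CS = "c_sharp"

# Hypothetical lookup table (an assumption for this sketch); the commit
# itself handles only ".cs" with a single if-statement.
EXT_TO_LANGUAGE = {".cs": LANG_CS}


def get_language_from_ext(path: str) -> str:
    """Return the script language implied by the file extension of `path`."""
    _, ext = os.path.splitext(path)
    try:
        return EXT_TO_LANGUAGE[ext]
    except KeyError:
        # Same message as in the diff; raising makes unsupported input fail
        # loudly instead of returning the string "unknown".
        raise ValueError(f"{path} has an unrecognized or an unsupported extension.") from None
```

With this shape, `get_language_from_ext("sample.cs")` yields `"c_sharp"`, while a path such as `sample.txt` raises `ValueError`. Supporting another language would then presumably only need a new table entry.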
4 changes: 2 additions & 2 deletions capa/features/extractors/ts/extractor.py
@@ -8,10 +8,10 @@


 class TreeSitterFeatureExtractor(FeatureExtractor):
-    def __init__(self, path: str, format_: str):
+    def __init__(self, path: str):
         super().__init__()
         self.path = path
-        self.language = capa.features.extractors.script.get_language_from_format(format_)
+        self.language = capa.features.extractors.script.get_language_from_ext(path)
         with open(self.path, "rb") as f:
             self.buf = f.read()
         self.engine = capa.features.extractors.ts.engine.TreeSitterExtractorEngine(self.language)
12 changes: 4 additions & 8 deletions capa/helpers.py
@@ -10,16 +10,12 @@
 from typing import NoReturn

 from capa.exceptions import UnsupportedFormatError
-from capa.features.common import FORMAT_CS, FORMAT_SC32, FORMAT_SC64, FORMAT_UNKNOWN
+from capa.features.common import FORMAT_SC32, FORMAT_SC64, FORMAT_SCRIPT, FORMAT_UNKNOWN

 EXTENSIONS_SHELLCODE_32 = ("sc32", "raw32")
 EXTENSIONS_SHELLCODE_64 = ("sc64", "raw64")
-<<<<<<< HEAD
 EXTENSIONS_ELF = "elf_"
-=======
-EXTENSION_CS = "cs"
-
->>>>>>> Added initial capa control flow for scripts in C#.
+EXTENSIONS_SUPPORTED_SCRIPTS = "cs"

 logger = logging.getLogger("capa")

@@ -56,8 +52,8 @@ def get_format_from_extension(sample: str) -> str:
         return FORMAT_SC32
     elif sample.endswith(EXTENSIONS_SHELLCODE_64):
         return FORMAT_SC64
-    elif sample.endswith(EXTENSIONS_SUPPORTED_SCRIPTS):
+    elif sample.endswith(EXTENSIONS_SUPPORTED_SCRIPTS):
+        return FORMAT_SCRIPT
     return FORMAT_UNKNOWN

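To see how the suffix checks in helpers.py behave after this change, here is a standalone sketch of `get_format_from_extension`. The constants are copied from the diff; the first `if` branch lies outside the visible hunk and is assumed to mirror the 64-bit branch.

```python
FORMAT_SC32 = "sc32"
FORMAT_SC64 = "sc64"
FORMAT_SCRIPT = "script"
FORMAT_UNKNOWN = "unknown"

EXTENSIONS_SHELLCODE_32 = ("sc32", "raw32")
EXTENSIONS_SHELLCODE_64 = ("sc64", "raw64")
EXTENSIONS_SUPPORTED_SCRIPTS = "cs"


def get_format_from_extension(sample: str) -> str:
    # str.endswith accepts either a single string or a tuple of strings.
    # Assumption: this first branch is not shown in the hunk.
    if sample.endswith(EXTENSIONS_SHELLCODE_32):
        return FORMAT_SC32
    elif sample.endswith(EXTENSIONS_SHELLCODE_64):
        return FORMAT_SC64
    elif sample.endswith(EXTENSIONS_SUPPORTED_SCRIPTS):
        return FORMAT_SCRIPT
    return FORMAT_UNKNOWN
```

One thing the sketch makes visible: the suffixes are compared without a leading dot, so a name such as `topics` also ends with `"cs"` and would be classified as a script. Whether that matters depends on how callers use the result; a stricter variant could compare against `".cs"` instead.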
11 changes: 6 additions & 5 deletions capa/main.py
@@ -65,6 +65,7 @@
     FORMAT_SC64,
     FORMAT_DOTNET,
     FORMAT_FREEZE,
+    FORMAT_SCRIPT,
 )
 from capa.features.address import NO_ADDRESS
 from capa.features.extractors.base_extractor import BBHandle, InsnHandle, FunctionHandle, FeatureExtractor
@@ -345,11 +346,11 @@ def has_file_limitation(rules: RuleSet, capabilities: MatchResults, is_standalon
     return False


-def is_supported_script(format_: str):
+def is_script_format(format_: str):
     """
     If the script format was recognized, then it is supported.
     """
-    return format_.startswith("script")
+    return format_ == FORMAT_SCRIPT


 def is_supported_format(sample: str) -> bool:
@@ -521,10 +522,10 @@ def get_extractor(
       UnsupportedArchError
       UnsupportedOSError
     """
-    if is_supported_script(format_):
+    if format_ == FORMAT_SCRIPT:
         import capa.features.extractors.ts.extractor

-        return capa.features.extractors.ts.extractor.TreeSitterFeatureExtractor(path, format_)
+        return capa.features.extractors.ts.extractor.TreeSitterFeatureExtractor(path)

     if format_ not in (FORMAT_SC32, FORMAT_SC64):
         if not is_supported_format(path):
@@ -705,7 +706,7 @@ def collect_metadata(

     format_ = get_format(sample_path)

-    if is_supported_script(format_):
+    if format_ == FORMAT_SCRIPT:
         arch = get_script_arch()
         os_ = get_script_os()
     else:
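Taken together, the main.py changes mean that the format check only decides whether a file is a script, and the extractor resolves the language from the path itself. A rough sketch of that dispatch follows. It is not the PR's implementation: `StubScriptExtractor` is a hypothetical stand-in for `TreeSitterFeatureExtractor`, the simplified `get_extractor` signature is an assumption, and no file is read or parsed, so the sketch runs without capa or tree-sitter installed.

```python
import os

FORMAT_SCRIPT = "script"
LANG_CS = "c_sharp"


def get_language_from_ext(path: str) -> str:
    _, ext = os.path.splitext(path)
    if ext == ".cs":
        return LANG_CS
    raise ValueError(f"{path} has an unrecognized or an unsupported extension.")


def is_script_format(format_: str) -> bool:
    # Exact match: the per-language "script_cs" format no longer exists.
    return format_ == FORMAT_SCRIPT


class StubScriptExtractor:
    """Hypothetical stand-in for TreeSitterFeatureExtractor (records metadata only)."""

    def __init__(self, path: str):
        self.path = path
        # The language is derived from the path, not passed in via the format.
        self.language = get_language_from_ext(path)


def get_extractor(path: str, format_: str) -> StubScriptExtractor:
    if is_script_format(format_):
        return StubScriptExtractor(path)
    # Binary formats are out of scope for this sketch.
    raise NotImplementedError(f"format {format_!r} is not handled here")
```

For example, `get_extractor("sample.cs", FORMAT_SCRIPT).language` evaluates to `"c_sharp"`, and the old `"script_cs"` value is no longer treated as a script format.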