Datasets

codesectools.datasets.core

Initializes the core dataset module.

Modules:

Name	Description
`dataset`	Defines the core abstract classes and data structures for datasets.

dataset

Defines the core abstract classes and data structures for datasets.

This module provides the foundational components for creating and managing datasets used for benchmarking SAST tools. It includes abstract base classes for different types of datasets (e.g., file-based, Git repository-based) and data classes to hold benchmark results.

Dataset

Dataset(lang: str | None = None)

Bases: ABC

Abstract base class for all datasets.

Defines the common interface that all dataset types must implement.

Attributes:

Name	Type	Description
`name`	`str`	The name of the dataset.
`supported_languages`	`list[str]`	A list of programming languages supported by the dataset.
`license`	`str`	The license under which the dataset is distributed.
`license_url`	`str`	A URL to the full text of the license.

Initialize the Dataset instance.

Set up paths and load the dataset if a language is specified.

Parameters:

Name	Type	Description	Default
`lang`	`str \| None`	The programming language of the dataset to load. Must be one of the supported languages for the dataset class.	`None`

Methods:

Name	Description
`is_cached`	Check if the dataset has been downloaded and is cached locally.
`prompt_license_agreement`	Display the dataset's license and prompt the user for agreement.
`download_files`	Download the raw dataset files.
`download_dataset`	Handle the full dataset download process, including license prompt and caching.
`load_dataset`	Load the dataset into memory.
`list_dataset_full_names`	List all available language-specific versions of this dataset.

name `instance-attribute`

name: str

supported_languages `instance-attribute`

supported_languages: list[str]

license `instance-attribute`

license: str

license_url `instance-attribute`

license_url: str

directory `instance-attribute`

directory = USER_CACHE_DIR / self.name

lang `instance-attribute`

lang = lang

full_name `instance-attribute`

full_name = f'{self.name}_{self.lang}'

files `property`

files: list

Get the list of dataset files, loading them if necessary.

is_cached `classmethod`

is_cached() -> bool

Check if the dataset has been downloaded and is cached locally.

Returns:

Type	Description
`bool`	True if the dataset is cached, False otherwise.

prompt_license_agreement

prompt_license_agreement() -> None

Display the dataset's license and prompt the user for agreement.

download_files `abstractmethod`

download_files(test: bool = False) -> None

Download the raw dataset files.

This method must be implemented by subclasses to define how the raw files for the dataset are obtained.

Parameters:

Name	Type	Description	Default
`test`	`bool`	If True, download a smaller subset of the dataset for testing.	`False`

download_dataset

download_dataset(test: bool = False) -> None

Handle the full dataset download process, including license prompt and caching.

This method orchestrates the download by first prompting for license agreement, then calling the download_files method, and finally creating a .complete file to mark the dataset as cached.

Parameters:

Name	Type	Description	Default
`test`	`bool`	If True, download a smaller subset of the dataset for testing.	`False`
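
The orchestration described for `download_dataset` can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: `SketchDataset`, the `agreed` argument that replaces the interactive prompt, the placeholder file names, and the idea that `is_cached` simply checks for the `.complete` marker are all hypothetical.

```python
import tempfile
from pathlib import Path


class SketchDataset:
    """Hypothetical stand-in for a Dataset subclass (not the real class)."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.downloaded: list[str] = []

    def is_cached(self) -> bool:
        # Assumption: the dataset counts as cached once the marker file exists.
        return (self.directory / ".complete").is_file()

    def prompt_license_agreement(self, agreed: bool) -> None:
        # The documented method prompts the user; here the answer is passed in.
        if not agreed:
            raise PermissionError("license not accepted")

    def download_files(self, test: bool = False) -> None:
        # Assumed behaviour: write one placeholder file instead of downloading.
        self.directory.mkdir(parents=True, exist_ok=True)
        name = "subset.txt" if test else "full.txt"
        (self.directory / name).write_text("placeholder")
        self.downloaded.append(name)

    def download_dataset(self, agreed: bool, test: bool = False) -> None:
        self.prompt_license_agreement(agreed)   # 1. license prompt
        self.download_files(test=test)          # 2. raw download
        (self.directory / ".complete").touch()  # 3. mark the dataset as cached


cache_dir = Path(tempfile.mkdtemp()) / "SketchDataset"
dataset = SketchDataset(cache_dir)
cached_before = dataset.is_cached()
dataset.download_dataset(agreed=True, test=True)
cached_after = dataset.is_cached()
```

Writing the marker last means an interrupted download leaves no `.complete` file, so the next run would not mistake a partial download for a cached dataset.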

load_dataset `abstractmethod`

load_dataset() -> list[File]

Load the dataset into memory.

This method must be implemented by subclasses to define how the dataset's contents are loaded.

Returns:

Type	Description
`list[File]`	A list of `File` objects representing the dataset.
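
A subclass has to provide both abstract methods, `download_files` and `load_dataset`, before it can be instantiated. The sketch below shows that contract with a hypothetical stand-in base class (`SketchBase`) rather than the real `Dataset`; `ToyDataset` and the returned placeholder list are assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SketchBase(ABC):
    """Hypothetical stand-in for Dataset, reduced to its abstract methods."""

    name: str
    supported_languages: list[str]

    def __init__(self, lang: str | None = None) -> None:
        self.lang = lang
        self.full_name = f"{self.name}_{self.lang}"

    @abstractmethod
    def download_files(self, test: bool = False) -> None: ...

    @abstractmethod
    def load_dataset(self) -> list: ...


class ToyDataset(SketchBase):
    name = "Toy"
    supported_languages = ["python"]

    def download_files(self, test: bool = False) -> None:
        pass  # nothing to fetch in this sketch

    def load_dataset(self) -> list:
        # The documented method returns File objects; strings stand in here.
        return ["example.py"]


toy = ToyDataset(lang="python")
```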

list_dataset_full_names `classmethod`

list_dataset_full_names() -> list[str]

List all available language-specific versions of this dataset.

Returns:

Type	Description
`list[str]`	A sorted list of strings, where each string is the dataset name suffixed with a supported language (e.g., "MyDataset_java").
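
One way to produce such a list is shown below. The class is hypothetical and only mirrors the documented `name` and `supported_languages` attributes; the actual implementation may differ.

```python
class SketchDataset:
    """Hypothetical class; only the two class attributes mirror the docs."""

    name = "MyDataset"
    supported_languages = ["python", "java", "c"]

    @classmethod
    def list_dataset_full_names(cls) -> list[str]:
        # Suffix the dataset name with each supported language, then sort.
        return sorted(f"{cls.name}_{lang}" for lang in cls.supported_languages)


full_names = SketchDataset.list_dataset_full_names()
```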

PrebuiltDatasetMixin

Provide functionality for datasets that require a build step.

Attributes:

Name	Type	Description
`build_command`	`str`	The command required to build the dataset.
`prebuilt_expected`	`tuple[Path, str]`	A tuple containing the path and glob pattern to find the built artifacts.
`artifacts_arg`	`str`	The argument to pass to the SAST tool command template.

Methods:

Name	Description
`is_built`	Check if the dataset has been built.
`list_prebuilt_files`	List the pre-built artifact files.

build_command `instance-attribute`

build_command: str

prebuilt_expected `instance-attribute`

prebuilt_expected: tuple[Path, str]

artifacts_arg `instance-attribute`

artifacts_arg: str

is_built

is_built() -> bool

Check if the dataset has been built.

list_prebuilt_files

list_prebuilt_files() -> list[Path]

List the pre-built artifact files.
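
Given that `prebuilt_expected` is documented as a path plus a glob pattern, the two methods could plausibly work as sketched here. The free functions, the rule that one matching artifact is enough to count as built, and the `**/*.class` pattern are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def list_prebuilt_files(prebuilt_expected: tuple[Path, str]) -> list[Path]:
    """Return the artifacts matching the (directory, glob pattern) pair."""
    base_dir, pattern = prebuilt_expected
    return sorted(base_dir.glob(pattern))


def is_built(prebuilt_expected: tuple[Path, str]) -> bool:
    """Assumption: the dataset is built when at least one artifact exists."""
    return len(list_prebuilt_files(prebuilt_expected)) > 0


build_dir = Path(tempfile.mkdtemp())
expected = (build_dir, "**/*.class")
built_before = is_built(expected)

# Simulate a build that produced one artifact next to its source file.
(build_dir / "pkg").mkdir()
(build_dir / "pkg" / "Main.class").write_bytes(b"")
(build_dir / "pkg" / "Main.java").write_text("class Main {}")
artifacts = list_prebuilt_files(expected)
```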

DatasetUnit

Base class for a single unit within a dataset.

Serves as a marker class for items like File or GitRepo.

BenchmarkData

Base class for storing data resulting from a benchmark.

Serves as a marker class for data holders like FileDatasetData or GitRepoDatasetData.

File

File(
    filepath: Path,
    content: str | bytes,
    cwes: list[CWE],
    has_vuln: bool,
)

Bases: DatasetUnit

Represent a single file in a dataset.

Attributes:

Name	Type	Description
`filepath`	`Path`	The relative path to the file.
`content`	`bytes`	The byte content of the file.
`cwes`	`list[CWE]`	A list of CWEs associated with the file.
`has_vuln`	`bool`	True if the file contains a real vulnerability; False if it is a test case intended to check for false positives.

Initialize a File instance.

Parameters:

Name	Type	Description	Default
`filepath`	`Path`	The relative path of the file.	required
`content`	`str \| bytes`	The content of the file, as a string or bytes. It will be converted to bytes if provided as a string.	required
`cwes`	`list[CWE]`	A list of CWEs associated with the file.	required
`has_vuln`	`bool`	True if the file contains a real vulnerability; False if it is a test case intended to check for false positives.	required

Methods:

Name	Description
`__repr__`	Return a developer-friendly string representation of the File.
`__eq__`	Compare this File with another object for equality based on filepath.
`save`	Save the file's content to a specified directory.

filepath `instance-attribute`

filepath = filepath

filename `instance-attribute`

filename = self.filepath.name

content `instance-attribute`

content = content

cwes `instance-attribute`

cwes = cwes

has_vuln `instance-attribute`

has_vuln = has_vuln

__repr__

__repr__() -> str

Return a developer-friendly string representation of the File.

Returns:

Type	Description
`str`	A string showing the class name, filepath, and CWE IDs.

__eq__

__eq__(other: str | Path | Self) -> bool

Compare this File with another object for equality based on filepath.

Parameters:

Name	Type	Description	Default
`other`	`str \| Path \| Self`	The object to compare with. Can be a string/Path (filepath) or another File instance.	required

Returns:

Type	Description
`bool`	True if the filepaths are equal, False otherwise.
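
Filepath-based equality makes it possible to look up a file in a list by its path. The sketch below uses a hypothetical `SketchFile` class; whether the real `File` compares the full relative path or only the file name, and whether it defines `__hash__`, are not stated here and are assumed.

```python
from __future__ import annotations

from pathlib import Path


class SketchFile:
    """Hypothetical stand-in for File, keeping only what equality needs."""

    def __init__(self, filepath: Path, content: str | bytes) -> None:
        self.filepath = filepath
        # Strings are encoded so that `content` is always bytes.
        self.content = content.encode() if isinstance(content, str) else content

    def __eq__(self, other: object) -> bool:
        if isinstance(other, SketchFile):
            return self.filepath == other.filepath
        if isinstance(other, (str, Path)):
            return self.filepath == Path(other)
        return NotImplemented

    # Defining __eq__ removes the default hash, so restore one based on filepath.
    def __hash__(self) -> int:
        return hash(self.filepath)


files = [
    SketchFile(Path("src/a.py"), "print(1)"),
    SketchFile(Path("src/b.py"), b"x"),
]
found = "src/b.py" in files  # membership tests rely on __eq__
```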

save

save(dir: Path) -> None

Save the file's content to a specified directory.

Parameters:

Name	Type	Description	Default
`dir`	`Path`	The path to the directory where the file should be saved.	required

FileDataset

FileDataset(lang: str)

Bases: Dataset

Abstract base class for datasets composed of individual files.

Initialize a FileDataset instance.

Parameters:

Name	Type	Description	Default
`lang`	`str`	The programming language of the dataset to load.	required

Methods:

Name	Description
`validate`	Validate a SAST analysis result against the ground truth of the dataset.

validate

validate(
    analysis_result: AnalysisResult,
) -> FileDatasetData

Validate a SAST analysis result against the ground truth of the dataset.

Compares the defects found by a SAST tool with the known vulnerabilities in the dataset files to categorize them as true positives, false positives, and false negatives, counting each unique (file, CWE) pair only once.

Parameters:

Name	Type	Description	Default
`analysis_result`	`AnalysisResult`	The result from a SAST tool analysis.	required

Returns:

Type	Description
`FileDatasetData`	A `FileDatasetData` object containing the validation metrics.
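
The categorization by unique (file, CWE) pairs can be expressed with set operations. This is a simplified sketch, not the library's code: `SketchDefect`, the dictionary-based ground truth, and the integer CWE IDs are hypothetical stand-ins for `Defect`, `File`, and `CWE`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchDefect:
    """Hypothetical defect record: the file it was reported in and a CWE ID."""

    filename: str
    cwe_id: int


def validate(ground_truth: dict[str, set[int]], defects: list[SketchDefect]):
    """Split reported defects into TP, FP, and FN sets of (file, CWE) pairs."""
    expected = {(name, cwe) for name, cwes in ground_truth.items() for cwe in cwes}
    reported = {(d.filename, d.cwe_id) for d in defects}  # duplicates collapse
    true_positives = sorted(reported & expected)
    false_positives = sorted(reported - expected)
    false_negatives = sorted(expected - reported)
    return true_positives, false_positives, false_negatives


# Ground truth: vulnerable files mapped to the CWE IDs they contain.
ground_truth = {"a.c": {79, 89}, "b.c": {22}}
defects = [
    SketchDefect("a.c", 79),
    SketchDefect("a.c", 79),   # same pair reported twice, counted once
    SketchDefect("a.c", 120),  # CWE not in the ground truth for a.c
    SketchDefect("c.c", 22),   # file not in the ground truth
]
tp, fp, fn = validate(ground_truth, defects)
defect_number = len(defects)  # raw count, before de-duplication
```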

PrebuiltFileDataset

PrebuiltFileDataset(lang: str)

Bases: PrebuiltDatasetMixin, FileDataset

Represent a file-based dataset that requires a build step.

FileDatasetData

FileDatasetData(
    dataset: FileDataset,
    tp_defects: list[Defect],
    fp_defects: list[Defect],
    fn_defects: list[tuple[str, CWE]],
    cwes_list: list[CWE],
    tp_cwes: list[CWE],
    fp_cwes: list[CWE],
    fn_cwes: list[CWE],
    file_number: int,
    defect_number: int,
    unique_correct_number: int,
)

Bases: BenchmarkData

Store the results of validating an analysis against a FileDataset.

The counts for true positives, false positives, and false negatives are based on unique (file, CWE) pairs.

Attributes:

Name	Type	Description
`dataset`	`FileDataset`	The dataset used for the benchmark.
`tp_defects`	`list[Defect]`	A list of unique, correctly identified defects (True Positives).
`fp_defects`	`list[Defect]`	A list of unique, incorrectly identified defects (False Positives).
`fn_defects`	`list[tuple[str, CWE]]`	A list of unique vulnerabilities that were not found (False Negatives).
`cwes_list`	`list[CWE]`	All CWEs present in the dataset's ground truth (may contain duplicates if a CWE appears in multiple files).
`tp_cwes`	`list[CWE]`	List of CWEs from True Positive findings.
`fp_cwes`	`list[CWE]`	List of CWEs from False Positive findings.
`fn_cwes`	`list[CWE]`	List of CWEs from False Negative findings (missed vulnerabilities).
`file_number`	`int`	Total number of files in the dataset.
`defect_number`	`int`	Total number of defects reported by the tool (before de-duplication).
`unique_correct_number`	`int`	Number of files with at least one correctly identified defect.

Initialize a FileDatasetData instance.

Parameters:

Name	Type	Description	Default
`dataset`	`FileDataset`	The dataset used for the benchmark.	required
`tp_defects`	`list[Defect]`	A list of unique, correctly identified defects.	required
`fp_defects`	`list[Defect]`	A list of unique, incorrectly identified defects.	required
`fn_defects`	`list[tuple[str, CWE]]`	A list of unique vulnerabilities that were not found.	required
`cwes_list`	`list[CWE]`	A list of all ground-truth CWEs in the dataset.	required
`tp_cwes`	`list[CWE]`	A list of CWEs from True Positive findings.	required
`fp_cwes`	`list[CWE]`	A list of CWEs from False Positive findings.	required
`fn_cwes`	`list[CWE]`	A list of CWEs from missed vulnerabilities.	required
`file_number`	`int`	The total number of files in the dataset.	required
`defect_number`	`int`	The total number of defects found by the analysis (before de-duplication).	required
`unique_correct_number`	`int`	The number of files with at least one correctly identified vulnerability.	required

dataset `instance-attribute`

dataset = dataset

tp_defects `instance-attribute`

tp_defects = tp_defects

fp_defects `instance-attribute`

fp_defects = fp_defects

fn_defects `instance-attribute`

fn_defects = fn_defects

cwes_list `instance-attribute`

cwes_list = cwes_list

tp_cwes `instance-attribute`

tp_cwes = tp_cwes

fp_cwes `instance-attribute`

fp_cwes = fp_cwes

fn_cwes `instance-attribute`

fn_cwes = fn_cwes

file_number `instance-attribute`

file_number = file_number

defect_number `instance-attribute`

defect_number = defect_number

unique_correct_number `instance-attribute`

unique_correct_number = unique_correct_number

GitRepo

GitRepo(
    name: str,
    url: str,
    commit: str,
    size: int,
    cwes: list[CWE],
    files: list[str],
    has_vuln: bool,
)

Bases: DatasetUnit

Represent a single Git repository in a dataset.

Attributes:

Name	Type	Description
`name`	`str`	A unique name for the repository, often a CVE ID.
`url`	`str`	The URL to clone the Git repository.
`commit`	`str`	The specific commit hash to check out.
`size`	`int`	The size of the repository in bytes.
`cwes`	`list[CWE]`	A list of CWEs associated with the repository.
`files`	`list[str]`	A list of filenames known to be vulnerable in this commit.
`has_vuln`	`bool`	True if the repository actually contains a vulnerability.

Initialize a GitRepo instance.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name/identifier for the repository.	required
`url`	`str`	The clone URL of the repository.	required
`commit`	`str`	The commit hash to analyze.	required
`size`	`int`	The size of the repository in bytes.	required
`cwes`	`list[CWE]`	A list of CWEs associated with the repository.	required
`files`	`list[str]`	A list of vulnerable files in the specified commit.	required
`has_vuln`	`bool`	True if the repository actually contains a vulnerability.	required

Methods:

Name	Description
`__repr__`	Return a developer-friendly string representation of the GitRepo.
`__eq__`	Compare this GitRepo with another object for equality based on name.
`save`	Clone the repository and check out the specific commit.

name `instance-attribute`

name = name

url `instance-attribute`

url = url

commit `instance-attribute`

commit = commit

size `instance-attribute`

size = size

cwes `instance-attribute`

cwes = cwes

files `instance-attribute`

files = files

has_vuln `instance-attribute`

has_vuln = has_vuln

__repr__

__repr__() -> str

Return a developer-friendly string representation of the GitRepo.

Returns:

Type	Description
`str`	A string showing the repo's name, URL, commit, size, CWEs, and files.

__eq__

__eq__(other: str | Self) -> bool

Compare this GitRepo with another object for equality based on name.

Parameters:

Name	Type	Description	Default
`other`	`str \| Self`	The object to compare with. Can be a string (repo name) or another GitRepo instance.	required

Returns:

Type	Description
`bool`	True if the names are equal, False otherwise.

save

save(dir: Path) -> None

Clone the repository and check out the specific commit.

Parameters:

Name	Type	Description	Default
`dir`	`Path`	The path to the directory where the repository should be cloned.	required
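
Cloning and checking out a pinned commit amounts to two git invocations. The sketch below only builds the command lines so that it can run without network access; the `clone_commands` helper, the example URL, and the commit value are hypothetical, and the real method may use a Git library instead of the command-line client.

```python
from pathlib import Path


def clone_commands(url: str, commit: str, target_dir: Path) -> list[list[str]]:
    """Return the git invocations for cloning `url` and checking out `commit`."""
    return [
        ["git", "clone", url, str(target_dir)],
        ["git", "-C", str(target_dir), "checkout", commit],
    ]


commands = clone_commands(
    "https://example.com/org/repo.git",  # placeholder URL
    "0123abc",                           # placeholder commit hash
    Path("/tmp/repo"),
)
# To execute them: subprocess.run(cmd, check=True) for each cmd in commands.
```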

GitRepoDataset

GitRepoDataset(lang: str)

Bases: Dataset

Abstract base class for datasets composed of Git repositories.

Attributes:

Name	Type	Description
`directory`	`Path`	The directory path for the dataset.
`lang`	`str`	The programming language of the dataset.
`full_name`	`str`	The full name of the dataset, including the language.
`repos`	`list[GitRepo]`	A list of `GitRepo` objects loaded from the dataset.
`max_repo_size`	`int`	The maximum repository size to consider for analysis.

Initialize a GitRepoDataset instance.

Parameters:

Name	Type	Description	Default
`lang`	`str`	The programming language of the dataset to load.	required

Methods:

Name	Description
`validate`	Validate SAST analysis results against the ground truth of the dataset.

repos `instance-attribute`

repos: list[GitRepo] = self.files

max_repo_size `instance-attribute`

max_repo_size: int
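
A size limit like `max_repo_size` would typically be applied as a filter before analysis. The sketch below assumes the limit is inclusive and expressed in bytes, matching the documented `size` attribute; `SketchRepo` and the repository names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SketchRepo:
    """Hypothetical stand-in for GitRepo with only the fields used here."""

    name: str
    size: int  # repository size in bytes


def repos_within_limit(repos: list[SketchRepo], max_repo_size: int) -> list[SketchRepo]:
    # Assumption: a repository exactly at the limit is still analyzed.
    return [repo for repo in repos if repo.size <= max_repo_size]


repos = [
    SketchRepo("small-repo", 10_000),
    SketchRepo("limit-repo", 50_000),
    SketchRepo("huge-repo", 900_000),
]
selected = repos_within_limit(repos, max_repo_size=50_000)
```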

validate

validate(
    analysis_results: list[AnalysisResult],
) -> GitRepoDatasetData

Validate SAST analysis results against the ground truth of the dataset.

Compare the defects found by a SAST tool for each repository with the known vulnerabilities (CWEs and file locations) in the dataset to categorize them as true positives, false positives, and false negatives. Each unique (file, CWE) pair is counted once per repository.

Parameters:

Name	Type	Description	Default
`analysis_results`	`list[AnalysisResult]`	A list of analysis results, one for each repository.	required

Returns:

Type	Description
`GitRepoDatasetData`	A `GitRepoDatasetData` object containing the validation metrics.
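
Per-repository validation can be sketched as the same set comparison applied to each repository in turn. Everything below is an assumption made for illustration: the dictionary keys (`name`, `tp`, `fp`, `fn`), the pairing of every known vulnerable file with every CWE of the repository, and the tuple-based defect format are not taken from the library.

```python
def validate_repo(
    repo_files: list[str],
    repo_cwes: list[int],
    defects: list[tuple[str, int]],
) -> dict:
    """Compare one repository's reported (file, CWE) pairs with its ground truth."""
    expected = {(f, cwe) for f in repo_files for cwe in repo_cwes}
    reported = set(defects)  # each pair is counted once per repository
    return {
        "tp": sorted(reported & expected),
        "fp": sorted(reported - expected),
        "fn": sorted(expected - reported),
    }


# Ground truth per repository: (vulnerable files, CWE IDs).
repos = {
    "repo-a": (["src/db.py"], [89]),
    "repo-b": (["app/view.py"], [79]),
}
# One list of reported (file, CWE) pairs per repository.
results = {
    "repo-a": [("src/db.py", 89), ("src/db.py", 89)],
    "repo-b": [("app/util.py", 79)],
}

validated_repos = [
    {"name": name, **validate_repo(files, cwes, results[name])}
    for name, (files, cwes) in repos.items()
]
defect_numbers = sum(len(found) for found in results.values())
```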

GitRepoDatasetData

GitRepoDatasetData(
    dataset: GitRepoDataset,
    validated_repos: list[dict],
    total_repo_number: int,
    defect_numbers: int,
)

Bases: BenchmarkData

Store the results of validating an analysis against a GitRepoDataset.

Attributes:

Name	Type	Description
`dataset`	`GitRepoDataset`	The dataset used for the benchmark.
`validated_repos`	`list[dict]`	A list of dictionaries, each containing the validation results for a single repository.
`total_repo_number`	`int`	The total number of repositories in the dataset.
`defect_numbers`	`int`	The total number of defects found across all repos.

Initialize a GitRepoDatasetData instance.

Parameters:

Name	Type	Description	Default
`dataset`	`GitRepoDataset`	The dataset used for the benchmark.	required
`validated_repos`	`list[dict]`	A list of validation results per repository.	required
`total_repo_number`	`int`	The total number of repositories in the dataset.	required
`defect_numbers`	`int`	The total number of defects found by the analysis.	required

dataset `instance-attribute`

dataset = dataset

validated_repos `instance-attribute`

validated_repos = validated_repos

total_repo_number `instance-attribute`

total_repo_number = total_repo_number

defect_numbers `instance-attribute`

defect_numbers = defect_numbers
