Datasets

codesectools.datasets.core

Initializes the core dataset module.

Modules:

Name Description
dataset

Defines the core abstract classes and data structures for datasets.

dataset

Defines the core abstract classes and data structures for datasets.

This module provides the foundational components for creating and managing datasets used for benchmarking SAST tools. It includes abstract base classes for different types of datasets (e.g., file-based, Git repository-based) and data classes to hold benchmark results.

Dataset

Dataset(lang: str | None = None)

Bases: ABC

Abstract base class for all datasets.

Defines the common interface that all dataset types must implement.

Attributes:

Name Type Description
name str

The name of the dataset.

supported_languages list[str]

A list of programming languages supported by the dataset.

Initialize the Dataset instance.

Set up paths and load the dataset if a language is specified.

Parameters:

Name Type Description Default
lang
str | None

The programming language of the dataset to load. Must be one of the supported languages for the dataset class.

None

Methods:

Name Description
is_cached

Check if the dataset has been downloaded and is cached locally.

prompt_license_agreement

Display the dataset's license and prompt the user for agreement.

download_files

Download the raw dataset files.

download_dataset

Handle the full dataset download process, including license prompt and caching.

load_dataset

Load the dataset into memory.

list_dataset_full_names

List all available language-specific versions of this dataset.

name instance-attribute
name: str
supported_languages instance-attribute
supported_languages: list[str]
license instance-attribute
license: str
license_url instance-attribute
license_url: str
directory instance-attribute
directory = USER_CACHE_DIR / self.name
lang instance-attribute
lang = lang
full_name instance-attribute
full_name = f'{self.name}_{self.lang}'
files instance-attribute
files: list[File] = self.load_dataset()
is_cached classmethod
is_cached() -> bool

Check if the dataset has been downloaded and is cached locally.

Returns:

Type Description
bool

True if the dataset is cached, False otherwise.

prompt_license_agreement
prompt_license_agreement() -> None

Display the dataset's license and prompt the user for agreement.

download_files abstractmethod
download_files() -> None

Download the raw dataset files.

download_dataset
download_dataset() -> None

Handle the full dataset download process, including license prompt and caching.
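
The description above mentions a license prompt and caching but does not spell out the order of the steps. The sketch below shows one plausible ordering, assuming the cache is checked first and the license prompt precedes the download. The function name `download_dataset_sketch` and its callable parameters are hypothetical and exist only to keep the example self-contained.

```python
from __future__ import annotations

from typing import Callable


def download_dataset_sketch(
    is_cached: Callable[[], bool],
    prompt_license: Callable[[], None],
    download_files: Callable[[], None],
) -> bool:
    """Return True if a download happened, False if the cache was reused.

    Assumption: this ordering is illustrative, not the library's code.
    """
    if is_cached():
        return False  # nothing to do, the dataset is already on disk
    prompt_license()  # ask for agreement before fetching anything
    download_files()
    return True


# Record which steps run, using plain lists instead of real I/O.
steps: list[str] = []
downloaded = download_dataset_sketch(
    is_cached=lambda: False,
    prompt_license=lambda: steps.append("license"),
    download_files=lambda: steps.append("download"),
)
```

Passing the steps in as callables keeps the flow testable without network access or user input.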

load_dataset abstractmethod
load_dataset() -> list[File]

Load the dataset into memory.

This method must be implemented by subclasses to define how the dataset's contents are loaded.

Returns:

Type Description
list[File]

A list of File objects representing the dataset.
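
To illustrate the interface described above, here is a minimal sketch of an abstract base class with the two abstract methods and one concrete subclass. `SketchDataset` and `ToyDataset` are hypothetical stand-ins; the real `Dataset` class also handles caching, licensing, and the cache directory, which are omitted here.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SketchDataset(ABC):
    """Hypothetical stand-in for Dataset; not the library's implementation."""

    name: str
    supported_languages: list[str]

    def __init__(self, lang: str | None = None) -> None:
        self.lang = lang
        self.full_name = f"{self.name}_{self.lang}"
        # Assumption: units are only loaded when a language is given.
        self.files = self.load_dataset() if lang is not None else []

    @abstractmethod
    def download_files(self) -> None:
        """Fetch the raw dataset files."""

    @abstractmethod
    def load_dataset(self) -> list:
        """Return the units that make up the dataset."""


class ToyDataset(SketchDataset):
    name = "ToyDataset"
    supported_languages = ["java", "python"]

    def download_files(self) -> None:
        pass  # nothing to download in this sketch

    def load_dataset(self) -> list:
        # Plain strings stand in for File objects here.
        return [f"example.{self.lang}"]


toy = ToyDataset("java")
```

Because both methods are marked abstract, instantiating `SketchDataset` directly raises `TypeError`; only subclasses that implement them can be created.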

list_dataset_full_names classmethod
list_dataset_full_names() -> list[str]

List all available language-specific versions of this dataset.

Returns:

Type Description
list[str]

A sorted list of strings, where each string is the dataset name suffixed with a supported language (e.g., "MyDataset_java").
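
The naming scheme can be sketched as follows. `list_full_names` is a hypothetical free function written for illustration; it assumes the same `name_lang` pattern shown for `full_name` above.

```python
from __future__ import annotations


def list_full_names(name: str, supported_languages: list[str]) -> list[str]:
    """Hypothetical helper: join a dataset name with each supported language."""
    return sorted(f"{name}_{lang}" for lang in supported_languages)


names = list_full_names("MyDataset", ["python", "java"])
print(names)  # ['MyDataset_java', 'MyDataset_python']
```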

DatasetUnit

Base class for a single unit within a dataset.

Serves as a marker class for items like File or GitRepo.

BenchmarkData

Base class for storing data resulting from a benchmark.

Serves as a marker class for data holders like FileDatasetData or GitRepoDatasetData.

File

Bases: DatasetUnit

Represent a single file in a dataset.

Attributes:

Name Type Description
filename str

The name of the file.

content bytes

The byte content of the file.

cwes list[CWE]

A list of CWEs associated with the file.

has_vuln bool

True if the file contains a real vulnerability, False if it is a false-positive test case.

Initialize a File instance.

Parameters:

Name Type Description Default
filename
str

The name of the file.

required
content
str | bytes

The content of the file, as a string or bytes. It will be converted to bytes if provided as a string.

required
cwes
list[CWE]

A list of CWEs associated with the file.

required
has_vuln
bool

True if the file contains a real vulnerability, False if it is a false-positive test case.

required

Methods:

Name Description
__repr__

Return a developer-friendly string representation of the File.

__eq__

Compare this File with another object for equality based on filename.

save

Save the file's content to a specified directory.

filename instance-attribute
filename = filename
content instance-attribute
content = content
cwes instance-attribute
cwes = cwes
has_vuln instance-attribute
has_vuln = has_vuln
__repr__
__repr__() -> str

Return a developer-friendly string representation of the File.

Returns:

Type Description
str

A string showing the class name, filename, and CWE IDs.

__eq__
__eq__(other: str | Self) -> bool

Compare this File with another object for equality based on filename.

Parameters:

Name Type Description Default
other
str | Self

The object to compare with. Can be a string (filename) or another File instance.

required

Returns:

Type Description
bool

True if the filenames are equal, False otherwise.

save
save(dir: Path) -> None

Save the file's content to a specified directory.

Parameters:

Name Type Description Default
dir
Path

The path to the directory where the file should be saved.

required
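
The behaviour described for `File` (string content converted to bytes, equality based on filename, saving to a directory) can be sketched as below. `SketchFile` is a hypothetical stand-in that omits `cwes` and `has_vuln`; the UTF-8 encoding and the exact save path are assumptions.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


class SketchFile:
    """Hypothetical stand-in for File; not the library's implementation."""

    def __init__(self, filename: str, content: str | bytes) -> None:
        self.filename = filename
        # Strings are encoded so that content is always stored as bytes.
        # Assumption: UTF-8 is used for the conversion.
        self.content = content.encode("utf-8") if isinstance(content, str) else content

    def __eq__(self, other: object) -> bool:
        # Equality depends only on the filename, so a plain string also matches.
        if isinstance(other, str):
            return self.filename == other
        if isinstance(other, SketchFile):
            return self.filename == other.filename
        return NotImplemented

    def save(self, dir: Path) -> None:
        # Assumption: the file is written directly under `dir`.
        (dir / self.filename).write_bytes(self.content)


files = [SketchFile("Login.java", "class Login {}"), SketchFile("Util.java", b"")]
found = "Login.java" in files  # membership test relies on __eq__

with tempfile.TemporaryDirectory() as tmp:
    files[0].save(Path(tmp))
    saved = (Path(tmp) / "Login.java").read_bytes()
```

Comparing against a bare string makes it possible to look up a unit by filename with `in` or `list.index` without constructing a second object.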

FileDataset

FileDataset(lang: str)

Bases: Dataset

Abstract base class for datasets composed of individual files.

Attributes:

Name Type Description
directory Path

The directory path for the dataset.

lang str

The programming language of the dataset.

full_name str

The full name of the dataset, including the language.

files list[File]

A list of File objects loaded from the dataset.

Initialize a FileDataset instance.

Parameters:

Name Type Description Default
lang
str

The programming language of the dataset to load.

required

Methods:

Name Description
validate

Validate a SAST analysis result against the ground truth of the dataset.

validate

Validate a SAST analysis result against the ground truth of the dataset.

Compare the defects found by a SAST tool with the known vulnerabilities in the dataset files to categorize them as true positives, false positives, and false negatives. Each unique (file, CWE) pair is counted only once.

Parameters:

Name Type Description Default
analysis_result
AnalysisResult

The result from a SAST tool analysis.

required

Returns:

Type Description
FileDatasetData

A FileDatasetData object containing the validation metrics.
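
The categorization into unique (file, CWE) pairs can be sketched with set operations. This is a simplified illustration, not the library's algorithm: `classify` is a hypothetical function, CWEs are plain integers, and the rule that findings in non-vulnerable files count as false positives is an assumption.

```python
from __future__ import annotations

Pair = tuple[str, int]  # (filename, CWE id)


def classify(
    ground_truth: dict[str, tuple[set[int], bool]],
    reported: list[Pair],
) -> tuple[set[Pair], set[Pair], set[Pair]]:
    """Split reported pairs into true positives, false positives, false negatives.

    ground_truth maps filename -> (CWE ids, has_vuln).
    """
    expected = {
        (name, cwe)
        for name, (cwes, has_vuln) in ground_truth.items()
        if has_vuln  # assumption: only really vulnerable files are expected
        for cwe in cwes
    }
    found = set(reported)  # de-duplicate repeated findings
    return found & expected, found - expected, expected - found


ground_truth = {
    "A.java": ({79}, True),
    "B.java": ({89}, True),
    "C.java": ({79}, False),  # false-positive test case
}
reported = [("A.java", 79), ("A.java", 79), ("C.java", 79)]
tp, fp, fn = classify(ground_truth, reported)
```

Here the duplicate finding in `A.java` counts once as a true positive, the finding in `C.java` is a false positive, and the unreported pair for `B.java` is a false negative.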

FileDatasetData

Bases: BenchmarkData

Store the results of validating an analysis against a FileDataset.

The counts for true positives, false positives, and false negatives are based on unique (file, CWE) pairs.

Attributes:

Name Type Description
dataset FileDataset

The dataset used for the benchmark.

tp_defects list[Defect]

A list of unique, correctly identified defects (True Positives).

fp_defects list[Defect]

A list of unique, incorrectly identified defects (False Positives).

fn_defects list[tuple[str, CWE]]

A list of unique vulnerabilities that were not found (False Negatives).

cwes_list list[CWE]

All CWEs present in the dataset's ground truth (may contain duplicates if a CWE appears in multiple files).

tp_cwes list[CWE]

List of CWEs from True Positive findings.

fp_cwes list[CWE]

List of CWEs from False Positive findings.

fn_cwes list[CWE]

List of CWEs from False Negative findings (missed vulnerabilities).

file_number int

Total number of files in the dataset.

defect_number int

Total number of defects reported by the tool (before de-duplication).

unique_correct_number int

Number of files with at least one correctly identified defect.

Initialize a FileDatasetData instance.

Parameters:

Name Type Description Default
dataset
FileDataset

The dataset used for the benchmark.

required
tp_defects
list[Defect]

A list of unique, correctly identified defects.

required
fp_defects
list[Defect]

A list of unique, incorrectly identified defects.

required
fn_defects
list[tuple[str, CWE]]

A list of unique vulnerabilities that were not found.

required
cwes_list
list[CWE]

A list of all ground-truth CWEs in the dataset.

required
tp_cwes
list[CWE]

A list of CWEs from True Positive findings.

required
fp_cwes
list[CWE]

A list of CWEs from False Positive findings.

required
fn_cwes
list[CWE]

A list of CWEs from missed vulnerabilities.

required
file_number
int

The total number of files in the dataset.

required
defect_number
int

The total number of defects found by the analysis (before de-duplication).

required
unique_correct_number
int

The number of files with at least one correctly identified vulnerability.

required
dataset instance-attribute
dataset = dataset
tp_defects instance-attribute
tp_defects = tp_defects
fp_defects instance-attribute
fp_defects = fp_defects
fn_defects instance-attribute
fn_defects = fn_defects
cwes_list instance-attribute
cwes_list = cwes_list
tp_cwes instance-attribute
tp_cwes = tp_cwes
fp_cwes instance-attribute
fp_cwes = fp_cwes
fn_cwes instance-attribute
fn_cwes = fn_cwes
file_number instance-attribute
file_number = file_number
defect_number instance-attribute
defect_number = defect_number
unique_correct_number instance-attribute
unique_correct_number = unique_correct_number

GitRepo

GitRepo(
    name: str,
    url: str,
    commit: str,
    size: int,
    cwes: list[CWE],
    files: list[str],
    has_vuln: bool,
)

Bases: DatasetUnit

Represent a single Git repository in a dataset.

Attributes:

Name Type Description
name str

A unique name for the repository, often a CVE ID.

url str

The URL to clone the Git repository.

commit str

The specific commit hash to check out.

size int

The size of the repository in bytes.

cwes list[CWE]

A list of CWEs associated with the repository.

files list[str]

A list of filenames known to be vulnerable in this commit.

has_vuln bool

True if the repository actually contains a vulnerability.

Initialize a GitRepo instance.

Parameters:

Name Type Description Default
name
str

The name/identifier for the repository.

required
url
str

The clone URL of the repository.

required
commit
str

The commit hash to analyze.

required
size
int

The size of the repository in bytes.

required
cwes
list[CWE]

A list of CWEs associated with the repository.

required
files
list[str]

A list of vulnerable files in the specified commit.

required
has_vuln
bool

True if the repository actually contains a vulnerability.

required

Methods:

Name Description
__repr__

Return a developer-friendly string representation of the GitRepo.

__eq__

Compare this GitRepo with another object for equality based on name.

save

Clone the repository and check out the specific commit.

name instance-attribute
name = name
url instance-attribute
url = url
commit instance-attribute
commit = commit
size instance-attribute
size = size
cwes instance-attribute
cwes = cwes
files instance-attribute
files = files
has_vuln instance-attribute
has_vuln = has_vuln
__repr__
__repr__() -> str

Return a developer-friendly string representation of the GitRepo.

Returns:

Type Description
str

A string showing the repo's name, URL, commit, size, CWEs, and files.

__eq__
__eq__(other: str | Self) -> bool

Compare this GitRepo with another object for equality based on name.

Parameters:

Name Type Description Default
other
str | Self

The object to compare with. Can be a string (repo name) or another GitRepo instance.

required

Returns:

Type Description
bool

True if the names are equal, False otherwise.

save
save(dir: Path) -> None

Clone the repository and check out the specific commit.

Parameters:

Name Type Description Default
dir
Path

The path to the directory where the repository should be cloned.

required
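
Cloning a repository and checking out a fixed commit corresponds to two Git operations. The sketch below only builds the command lines for the `git` CLI and does not run them. `build_git_commands` is a hypothetical helper, and the library may well use a Git binding rather than the CLI.

```python
from __future__ import annotations

from pathlib import Path


def build_git_commands(url: str, commit: str, dest: Path) -> list[list[str]]:
    """Hypothetical helper: argv lists that clone `url` and check out `commit`."""
    return [
        ["git", "clone", url, str(dest)],
        ["git", "-C", str(dest), "checkout", commit],
    ]


# The URL and commit below are placeholders, not real dataset entries.
commands = build_git_commands("https://example.com/repo.git", "abc123", Path("repo"))
for argv in commands:
    print(" ".join(argv))
    # To execute for real: subprocess.run(argv, check=True)
```

Pinning the checkout to a commit hash keeps the analyzed code identical across benchmark runs, even if the upstream branch moves.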

GitRepoDataset

GitRepoDataset(lang: str)

Bases: Dataset

Abstract base class for datasets composed of Git repositories.

Attributes:

Name Type Description
directory Path

The directory path for the dataset.

lang str

The programming language of the dataset.

full_name str

The full name of the dataset, including the language.

repos list[GitRepo]

A list of GitRepo objects loaded from the dataset.

max_repo_size int

The maximum repository size to consider for analysis.

Initialize a GitRepoDataset instance.

Parameters:

Name Type Description Default
lang
str

The programming language of the dataset to load.

required

Methods:

Name Description
validate

Validate SAST analysis results against the ground truth of the dataset.

repos instance-attribute
repos: list[GitRepo] = self.files
max_repo_size instance-attribute
max_repo_size: int
validate

Validate SAST analysis results against the ground truth of the dataset.

Compare the defects found by a SAST tool for each repository with the known vulnerabilities (CWEs and file locations) in the dataset to categorize them as true positives, false positives, and false negatives. Each unique (file, CWE) pair is counted once per repository.

Parameters:

Name Type Description Default
analysis_results
list[AnalysisResult]

A list of analysis results, one for each repository.

required

Returns:

Type Description
GitRepoDatasetData

A GitRepoDatasetData object containing the validation metrics.
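
Per-repository validation can be sketched in the same set-based way, producing one dictionary per repository. The function `validate_repos`, the dictionary keys, and the choice to count raw findings (duplicates included) toward the defect total are all assumptions made for illustration.

```python
from __future__ import annotations

Pair = tuple[str, int]  # (file path, CWE id)


def validate_repos(
    repos: dict[str, set[Pair]],
    results: dict[str, list[Pair]],
) -> tuple[list[dict], int]:
    """repos maps repo name -> expected pairs; results maps repo name -> findings."""
    validated = []
    defect_total = 0
    for name, expected in repos.items():
        reported = results.get(name, [])
        defect_total += len(reported)  # raw count, duplicates included
        found = set(reported)  # each pair counts once per repository
        validated.append(
            {
                "name": name,
                "tp": sorted(found & expected),
                "fp": sorted(found - expected),
                "fn": sorted(expected - found),
            }
        )
    return validated, defect_total


repos = {"CVE-A": {("src/a.c", 787)}, "CVE-B": {("lib/b.c", 416)}}
results = {"CVE-A": [("src/a.c", 787), ("src/a.c", 787), ("src/x.c", 20)]}
validated, total = validate_repos(repos, results)
```

A repository with no analysis result, such as `CVE-B` here, contributes only false negatives.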

GitRepoDatasetData

Bases: BenchmarkData

Store the results of validating an analysis against a GitRepoDataset.

Attributes:

Name Type Description
dataset GitRepoDataset

The dataset used for the benchmark.

validated_repos list[dict]

A list of dictionaries, each containing the validation results for a single repository.

total_repo_number int

The total number of repositories in the dataset.

defect_numbers int

The total number of defects found across all repos.

Initialize a GitRepoDatasetData instance.

Parameters:

Name Type Description Default
dataset
GitRepoDataset

The dataset used for the benchmark.

required
validated_repos
list[dict]

A list of validation results per repository.

required
total_repo_number
int

The total number of repositories in the dataset.

required
defect_numbers
int

The total number of defects found by the analysis.

required
dataset instance-attribute
dataset = dataset
validated_repos instance-attribute
validated_repos = validated_repos
total_repo_number instance-attribute
total_repo_number = total_repo_number
defect_numbers instance-attribute
defect_numbers = defect_numbers