Datasets
codesectools.datasets.core
Initializes the core dataset module.
Modules:
Name | Description |
---|---|
dataset | Defines the core abstract classes and data structures for datasets. |
dataset
Defines the core abstract classes and data structures for datasets.
This module provides the foundational components for creating and managing datasets used for benchmarking SAST tools. It includes abstract base classes for different types of datasets (e.g., file-based, Git repository-based) and data classes to hold benchmark results.
Dataset
Bases: ABC
Abstract base class for all datasets.
Defines the common interface that all dataset types must implement.
Attributes:
Name | Type | Description |
---|---|---|
name | str | The name of the dataset. |
supported_languages | list[str] | A list of programming languages supported by the dataset. |
Initialize the Dataset instance.
Set up paths and load the dataset if a language is specified.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
| | str \| None | The programming language of the dataset to load. Must be one of the supported languages for the dataset class. | None |
Methods:
Name | Description |
---|---|
is_cached | Check if the dataset has been downloaded and is cached locally. |
prompt_license_agreement | Display the dataset's license and prompt the user for agreement. |
download_files | Download the raw dataset files. |
download_dataset | Handle the full dataset download process, including license prompt and caching. |
load_dataset | Load the dataset into memory. |
list_dataset_full_names | List all available language-specific versions of this dataset. |
is_cached
classmethod
is_cached() -> bool
Check if the dataset has been downloaded and is cached locally.
Returns:
Type | Description |
---|---|
bool | True if the dataset is cached, False otherwise. |
prompt_license_agreement
Display the dataset's license and prompt the user for agreement.
download_dataset
Handle the full dataset download process, including license prompt and caching.
load_dataset
abstractmethod
list_dataset_full_names
classmethod
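The interface above can be sketched as a small abstract base class. This is a minimal illustration, not the library's implementation: the parameter name `lang`, the `cache_root` location, the "directory exists" cache check, the `items` attribute, and the `<name>_<language>` naming pattern are all assumptions made for the example, and `ToyDataset` is a hypothetical subclass.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Dataset(ABC):
    """Illustrative stand-in for the abstract dataset interface."""

    name: str = ""
    supported_languages: list[str] = []
    # Hypothetical cache root; the real cache location is not documented here.
    cache_root: Path = Path(tempfile.gettempdir()) / "dataset_sketch_cache"

    def __init__(self, lang: str | None = None) -> None:
        self.lang = lang
        # Load only when a language is given, as the description states.
        if lang is not None:
            if lang not in self.supported_languages:
                raise ValueError(f"{lang!r} is not supported by {self.name}")
            self.items = self.load_dataset()

    @classmethod
    def is_cached(cls) -> bool:
        # Assumption: a dataset counts as cached when its directory exists.
        return (cls.cache_root / cls.name).is_dir()

    @abstractmethod
    def load_dataset(self) -> list:
        """Load the dataset into memory."""

    @classmethod
    def list_dataset_full_names(cls) -> list[str]:
        # Assumption: full names follow a "<name>_<language>" pattern.
        return [f"{cls.name}_{lang}" for lang in cls.supported_languages]


class ToyDataset(Dataset):
    """Hypothetical concrete dataset used only for this example."""

    name = "Toy"
    supported_languages = ["java", "python"]

    def load_dataset(self) -> list:
        return [f"example.{self.lang}"]


toy = ToyDataset("java")
full_names = ToyDataset.list_dataset_full_names()
```

Because `load_dataset` is abstract, instantiating `Dataset` directly raises `TypeError`; only subclasses that implement it can be created.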
DatasetUnit
Base class for a single unit within a dataset.
Serves as a marker class for items like File or GitRepo.
BenchmarkData
Base class for storing data resulting from a benchmark.
Serves as a marker class for data holders like FileDatasetData or GitRepoDatasetData.
File
Bases: DatasetUnit
Represent a single file in a dataset.
Attributes:
Name | Type | Description |
---|---|---|
filename | str | The name of the file. |
content | bytes | The byte content of the file. |
cwes | list[CWE] | A list of CWEs associated with the file. |
has_vuln | bool | True if the vulnerability is real, False if it's intended to be a false positive test case. |
Initialize a File instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
| | str | The name of the file. | required |
| | str \| bytes | The content of the file, as a string or bytes. It will be converted to bytes if provided as a string. | required |
| | list[CWE] | A list of CWEs associated with the file. | required |
| | bool | True if the vulnerability is real, False if it's intended to be a false positive test case. | required |
Methods:
Name | Description |
---|---|
__repr__ | Return a developer-friendly string representation of the File. |
__eq__ | Compare this File with another object for equality based on filename. |
save | Save the file's content to a specified directory. |
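The behaviour described for File (string-to-bytes conversion, filename-based equality, saving to a directory) might look roughly like the sketch below. The constructor parameter names are assumed to mirror the attribute names, CWEs are represented as plain integers for brevity, and returning the written path from `save` is a convenience added for the example; none of these details are confirmed by this page.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


class File:
    """Illustrative stand-in for a single dataset file."""

    def __init__(
        self, filename: str, content: str | bytes, cwes: list, has_vuln: bool
    ) -> None:
        self.filename = filename
        # Strings are converted to bytes so content is always stored as bytes.
        self.content = content.encode() if isinstance(content, str) else content
        self.cwes = cwes
        self.has_vuln = has_vuln

    def __repr__(self) -> str:
        return (
            f"File(filename={self.filename!r}, cwes={self.cwes!r}, "
            f"has_vuln={self.has_vuln!r})"
        )

    def __eq__(self, other: object) -> bool:
        # Equality is based on the filename only.
        if isinstance(other, File):
            return self.filename == other.filename
        return NotImplemented

    def __hash__(self) -> int:
        # Keep hashing consistent with filename-based equality.
        return hash(self.filename)

    def save(self, directory: Path) -> Path:
        # Assumption: the target directory is created if it is missing.
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / self.filename
        target.write_bytes(self.content)
        return target


with tempfile.TemporaryDirectory() as tmp:
    sample = File("Example.java", "class Example {}", cwes=[79], has_vuln=True)
    saved_bytes = sample.save(Path(tmp)).read_bytes()
```

Two File objects with the same filename compare equal even when their content differs, which is what makes filename-based lookups in a dataset straightforward.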
FileDataset
Bases: Dataset
Abstract base class for datasets composed of individual files.
Attributes:
Name | Type | Description |
---|---|---|
directory | Path | The directory path for the dataset. |
lang | str | The programming language of the dataset. |
full_name | str | The full name of the dataset, including the language. |
files | list[File] | A list of |
Initialize a FileDataset instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
| | str | The programming language of the dataset to load. | required |
Methods:
Name | Description |
---|---|
validate | Validate a SAST analysis result against the ground truth of the dataset. |
validate
validate(
analysis_result: AnalysisResult,
) -> FileDatasetData
Validate a SAST analysis result against the ground truth of the dataset.
Compares the defects found by a SAST tool with the known vulnerabilities in the dataset files to categorize them as true positives, false positives, and false negatives, counting each unique (file, CWE) pair only once.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analysis_result | AnalysisResult | The result from a SAST tool analysis. | required |
Returns:
Type | Description |
---|---|
FileDatasetData | A |
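The classification described for validate can be illustrated with set operations on (file, CWE) pairs. This is a sketch of the general idea only: the data shapes, the `classify` helper, and the rule that any reported pair absent from the ground truth counts as a false positive are assumptions, and CWEs are shown as plain integers.

```python
from __future__ import annotations

# Hypothetical ground truth: filename -> (CWE IDs, has_vuln flag).
ground_truth = {
    "a.java": ({79}, True),
    "b.java": ({89}, True),
    "c.java": ({79}, False),  # false positive test case
}

# Hypothetical tool findings as (filename, CWE ID); duplicates are possible.
reported = [("a.java", 79), ("a.java", 79), ("c.java", 79), ("a.java", 22)]


def classify(
    truth: dict[str, tuple[set[int], bool]],
    findings: list[tuple[str, int]],
) -> tuple[set, set, set]:
    """Split findings into TP, FP, and FN sets of unique (file, CWE) pairs."""
    expected = {
        (name, cwe)
        for name, (cwes, has_vuln) in truth.items()
        if has_vuln
        for cwe in cwes
    }
    found = set(findings)  # each (file, CWE) pair is counted once
    true_positives = found & expected
    false_positives = found - expected
    false_negatives = expected - found
    return true_positives, false_positives, false_negatives


tp, fp, fn = classify(ground_truth, reported)
defect_number = len(reported)  # total before de-duplication
```

In this example the duplicate report on `a.java` contributes one true positive, while `defect_number` still counts all four raw findings.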
FileDatasetData
FileDatasetData(
dataset: FileDataset,
tp_defects: list[Defect],
fp_defects: list[Defect],
fn_defects: list[tuple[str, CWE]],
cwes_list: list[CWE],
tp_cwes: list[CWE],
fp_cwes: list[CWE],
fn_cwes: list[CWE],
file_number: int,
defect_number: int,
unique_correct_number: int,
)
Bases: BenchmarkData
Store the results of validating an analysis against a FileDataset.
The counts for true positives, false positives, and false negatives are based on unique (file, CWE) pairs.
Attributes:
Name | Type | Description |
---|---|---|
dataset | FileDataset | The dataset used for the benchmark. |
tp_defects | list[Defect] | A list of unique, correctly identified defects (True Positives). |
fp_defects | list[Defect] | A list of unique, incorrectly identified defects (False Positives). |
fn_defects | list[tuple[str, CWE]] | A list of unique vulnerabilities that were not found (False Negatives). |
cwes_list | list[CWE] | All CWEs present in the dataset's ground truth (may contain duplicates if a CWE appears in multiple files). |
tp_cwes | list[CWE] | List of CWEs from True Positive findings. |
fp_cwes | list[CWE] | List of CWEs from False Positive findings. |
fn_cwes | list[CWE] | List of CWEs from False Negative findings (missed vulnerabilities). |
file_number | int | Total number of files in the dataset. |
defect_number | int | Total number of defects reported by the tool (before de-duplication). |
unique_correct_number | int | Number of files with at least one correctly identified defect. |
Initialize a FileDatasetData instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | FileDataset | The dataset used for the benchmark. | required |
tp_defects | list[Defect] | A list of unique, correctly identified defects. | required |
fp_defects | list[Defect] | A list of unique, incorrectly identified defects. | required |
fn_defects | list[tuple[str, CWE]] | A list of unique vulnerabilities that were not found. | required |
cwes_list | list[CWE] | A list of all ground-truth CWEs in the dataset. | required |
tp_cwes | list[CWE] | A list of CWEs from True Positive findings. | required |
fp_cwes | list[CWE] | A list of CWEs from False Positive findings. | required |
fn_cwes | list[CWE] | A list of CWEs from missed vulnerabilities. | required |
file_number | int | The total number of files in the dataset. | required |
defect_number | int | The total number of defects found by the analysis (before de-duplication). | required |
unique_correct_number | int | The number of files with at least one correctly identified vulnerability. | required |
GitRepo
GitRepo(
name: str,
url: str,
commit: str,
size: int,
cwes: list[CWE],
files: list[str],
has_vuln: bool,
)
Bases: DatasetUnit
Represent a single Git repository in a dataset.
Attributes:
Name | Type | Description |
---|---|---|
name | str | A unique name for the repository, often a CVE ID. |
url | str | The URL to clone the Git repository. |
commit | str | The specific commit hash to check out. |
size | int | The size of the repository in bytes. |
cwes | list[CWE] | A list of CWEs associated with the repository. |
files | list[str] | A list of filenames known to be vulnerable in this commit. |
has_vuln | bool | True if the repository actually contains a vulnerability. |
Initialize a GitRepo instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name/identifier for the repository. | required |
url | str | The clone URL of the repository. | required |
commit | str | The commit hash to analyze. | required |
size | int | The size of the repository in bytes. | required |
cwes | list[CWE] | A list of CWEs associated with the repository. | required |
files | list[str] | A list of vulnerable files in the specified commit. | required |
has_vuln | bool | True if the repository actually contains a vulnerability. | required |
Methods:
Name | Description |
---|---|
__repr__ | Return a developer-friendly string representation of the GitRepo. |
__eq__ | Compare this GitRepo with another object for equality based on name. |
save | Clone the repository and check out the specific commit. |
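One plausible way to implement the clone-and-checkout step behind save is to shell out to the `git` command line, as sketched below. How the library actually performs the clone is not stated on this page; the helper names `build_checkout_commands` and `save_repo`, the example URL, and the target path are hypothetical.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_checkout_commands(url: str, commit: str, target: Path) -> list[list[str]]:
    """Return the git invocations needed to pin a clone to one commit."""
    return [
        ["git", "clone", "--quiet", url, str(target)],
        ["git", "-C", str(target), "checkout", "--quiet", commit],
    ]


def save_repo(url: str, commit: str, target: Path) -> None:
    """Clone the repository, then check out the requested commit."""
    for command in build_checkout_commands(url, commit, target):
        # check=True raises CalledProcessError if git exits with an error.
        subprocess.run(command, check=True)


# Building the commands has no side effects; save_repo() would run them.
commands = build_checkout_commands(
    "https://example.com/project.git", "abc123", Path("repos") / "project"
)
```

Separating command construction from execution keeps the example inspectable without network access.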
GitRepoDataset
Bases: Dataset
Abstract base class for datasets composed of Git repositories.
Attributes:
Name | Type | Description |
---|---|---|
directory | Path | The directory path for the dataset. |
lang | str | The programming language of the dataset. |
full_name | str | The full name of the dataset, including the language. |
repos | list[GitRepo] | A list of |
max_repo_size | int | The maximum repository size to consider for analysis. |
Initialize a GitRepoDataset instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
| | str | The programming language of the dataset to load. | required |
Methods:
Name | Description |
---|---|
validate | Validate SAST analysis results against the ground truth of the dataset. |
validate
validate(
analysis_results: list[AnalysisResult],
) -> GitRepoDatasetData
Validate SAST analysis results against the ground truth of the dataset.
Compare the defects found by a SAST tool for each repository with the known vulnerabilities (CWEs and file locations) in the dataset to categorize them as true positives, false positives, and false negatives. Each unique (file, CWE) pair is counted once per repository.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
analysis_results | list[AnalysisResult] | A list of analysis results, one for each repository. | required |
Returns:
Type | Description |
---|---|
GitRepoDatasetData | A |
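Per-repository validation can be sketched the same way, with one set comparison per repository. The dictionary keys (`tp`, `fp`, `fn`), the `validate_repo` helper, the sample data, and the choice to pair every known vulnerable file with every CWE of the repository are assumptions for illustration; the actual structure of the per-repository result dictionaries is not documented on this page.

```python
from __future__ import annotations

# Hypothetical ground truth per repository: vulnerable files and CWE IDs.
repos = {
    "repo-a": {"files": ["src/app.py"], "cwes": [79]},
    "repo-b": {"files": ["lib/db.py"], "cwes": [89]},
}

# Hypothetical findings per repository as (filename, CWE ID) pairs.
findings = {
    "repo-a": [("src/app.py", 79), ("src/app.py", 79), ("src/util.py", 22)],
    "repo-b": [],
}


def validate_repo(
    files: list[str], cwes: list[int], reported: list[tuple[str, int]]
) -> dict[str, list[tuple[str, int]]]:
    """Classify one repository's findings by unique (file, CWE) pair."""
    # Assumption: every vulnerable file is expected to exhibit every CWE.
    expected = {(name, cwe) for name in files for cwe in cwes}
    found = set(reported)  # each pair is counted once per repository
    return {
        "tp": sorted(found & expected),
        "fp": sorted(found - expected),
        "fn": sorted(expected - found),
    }


validated_repos = [
    {"repo": name, **validate_repo(truth["files"], truth["cwes"], findings[name])}
    for name, truth in repos.items()
]
total_repo_number = len(repos)
defect_numbers = sum(len(reported) for reported in findings.values())
```

Here `repo-a` yields one true positive and one false positive despite three raw findings, and `repo-b` yields a single false negative.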
GitRepoDatasetData
GitRepoDatasetData(
dataset: GitRepoDataset,
validated_repos: list[dict],
total_repo_number: int,
defect_numbers: int,
)
Bases: BenchmarkData
Store the results of validating an analysis against a GitRepoDataset.
Attributes:
Name | Type | Description |
---|---|---|
dataset | GitRepoDataset | The dataset used for the benchmark. |
validated_repos | list[dict] | A list of dictionaries, each containing the validation results for a single repository. |
total_repo_number | int | The total number of repositories in the dataset. |
defect_numbers | int | The total number of defects found across all repos. |
Initialize a GitRepoDatasetData instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | GitRepoDataset | The dataset used for the benchmark. | required |
validated_repos | list[dict] | A list of validation results per repository. | required |
total_repo_number | int | The total number of repositories in the dataset. | required |
defect_numbers | int | The total number of defects found by the analysis. | required |