CVEfixes
CVEfixes is a comprehensive vulnerability dataset that is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). The goal is to support data-driven security research based on source code and source code metrics related to fixes for CVEs in the NVD by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level.
Type: GitRepo
Supported version: v1.0.8
Disclaimer
This project provides wrappers and scripts to integrate with CVEfixes, but does not include the tool itself.
Therefore, you are responsible for reviewing and complying with the product's license and terms of use.
Requirements
- None
Dataset content
CVEfixes_java.csv
Included and pre-extracted from the original dataset available at Zenodo.
Large Original Dataset
The full CVEfixes dataset is a SQLite database of approximately 50 GB. To keep it manageable for SAST tools, we provide a pre-processed version that contains only the minimal data needed for specific languages.
We extracted the following columns to create language-specific CSV files (e.g., CVEfixes_java.csv):
- cve_id
- cwe_ids
- repo_url
- parents (vulnerable commit)
- filenames
We also added repo_size for each repository, fetched using the GitHub API.
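As a minimal sketch of how a consumer might read one of these files: the extraction script joins multiple values with `;`, so cwe_ids and filenames can be split back into lists. The sample row below is a made-up placeholder, not a real dataset entry, and `parse_rows` is a hypothetical helper name. The parents value is passed through unchanged, since its exact format depends on what the database stores.

```python
import csv
import io

# Placeholder data in the column layout listed above; values are invented.
SAMPLE = (
    "cve_id,cwe_ids,repo_url,parents,filenames,repo_size\n"
    "CVE-0000-0001,CWE-79;CWE-89,https://github.com/example/project,"
    "0123abcd,src/A.java;src/B.java,1234000\n"
)


def parse_rows(handle):
    """Yield one dict per CSV row, splitting the ';'-separated columns."""
    for row in csv.DictReader(handle):
        row["cwe_ids"] = row["cwe_ids"].split(";")
        row["filenames"] = row["filenames"].split(";")
        yield row


rows = list(parse_rows(io.StringIO(SAMPLE)))
print(rows[0]["cwe_ids"])    # ['CWE-79', 'CWE-89']
print(rows[0]["filenames"])  # ['src/A.java', 'src/B.java']
```

To read a real file, pass `open("CVEfixes_java.csv", newline="", encoding="utf-8")` instead of the in-memory sample.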
Extract a minimal version for a specific programming language
Requirements:
- Download CVEfixes.db from Zenodo.
- Modify the script variables LANG, LANG_EXT, and TOKEN to match your needs.
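Because the script in this section compares LANG against the lowercased repository.repo_language column, it can help to check which language values the database actually contains before choosing one. The following is a sketch only: `language_counts` is a hypothetical helper, and the demo runs against a tiny in-memory stand-in table rather than the real CVEfixes.db.

```python
import sqlite3


def language_counts(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Return (lowercased language, repository count) pairs, most common first."""
    return conn.execute(
        """
        SELECT LOWER(repo_language), COUNT(*)
        FROM repository
        WHERE repo_language IS NOT NULL
        GROUP BY LOWER(repo_language)
        ORDER BY COUNT(*) DESC, LOWER(repo_language)
        """
    ).fetchall()


# Stand-in for CVEfixes.db with only the two columns this query needs.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE repository (repo_url TEXT, repo_language TEXT)")
demo.executemany(
    "INSERT INTO repository VALUES (?, ?)",
    [("u1", "Java"), ("u2", "java"), ("u3", "C"), ("u4", None)],
)
counts = language_counts(demo)
print(counts)  # [('java', 2), ('c', 1)]
demo.close()
```

Against the real database, replace the stand-in with `sqlite3.connect("CVEfixes.db")`.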
GitHub API Token:
A GitHub Personal Access Token (PAT) is required to query repository sizes due to API rate limits.
- Generate a new token at github.com/settings/tokens/new with the public_repo scope.
- Paste the token into the TOKEN variable in the script.
- Remember to delete the token after use.
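As an alternative to pasting the token into the file, one option is to read it from an environment variable so it never lands in version control. This sketch assumes a variable named `GITHUB_TOKEN`, which is a name chosen here for illustration and not something the script requires. It also shows how the script maps a repository URL to its GitHub API endpoint; `auth_headers` and `repo_api_url` are hypothetical helper names.

```python
import os


def auth_headers(token: str) -> dict[str, str]:
    """Build the Authorization header that the script sends with each request."""
    return {"Authorization": f"Bearer {token}"}


def repo_api_url(repo_url: str) -> str:
    """Map a github.com repository URL to the matching API endpoint."""
    return repo_url.replace("github.com", "api.github.com/repos", 1)


# Assumption: the token was exported as GITHUB_TOKEN before running.
TOKEN = os.environ.get("GITHUB_TOKEN", "")

print(repo_api_url("https://github.com/example/project"))
# https://api.github.com/repos/example/project
```

The JSON returned by that endpoint includes a size field, which is what the script reads to fill repo_size.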
```python
import csv
import sqlite3
import sys

import requests

# Set these before running.
LANG = ""  # lowercase language name, e.g. "java"
LANG_EXT = []  # file extensions without the leading dot, e.g. ["java"]
TOKEN = ""  # GitHub Personal Access Token


def get_size(repo_url: str) -> int | None:
    """Get the size of a GitHub repository."""
    headers = {"Authorization": f"Bearer {TOKEN}"}
    r = requests.get(
        repo_url.replace("github.com", "api.github.com/repos"), headers=headers
    )
    # The API reports the size in kilobytes.
    size_kb = r.json().get("size", None)
    if size_kb:
        return size_kb * 1000
    return None


# Abort early if GitHub rejects the token.
headers = {"Authorization": f"Bearer {TOKEN}"}
r = requests.get("https://api.github.com", headers=headers)
if r.status_code == 401:
    print(r.json())
    sys.exit(1)

conn = sqlite3.connect("CVEfixes.db")
cursor = conn.cursor()
query = f"""
SELECT
    cve.cve_id,
    REPLACE(GROUP_CONCAT(DISTINCT cwe.cwe_id), ',', ';') AS cwe_ids,
    REPLACE(GROUP_CONCAT(DISTINCT cwe.description), ',', ';') AS cwe_descriptions,
    repository.repo_url,
    commits.parents,
    REPLACE(GROUP_CONCAT(DISTINCT file_change.filename), ',', ';') AS filenames
FROM cve
JOIN fixes ON fixes.cve_id = cve.cve_id
JOIN commits ON commits.hash = fixes.hash AND commits.repo_url = fixes.repo_url
JOIN file_change ON file_change.hash = commits.hash
JOIN repository ON repository.repo_url = commits.repo_url
JOIN cwe_classification ON cwe_classification.cve_id = cve.cve_id
JOIN cwe ON cwe.cwe_id = cwe_classification.cwe_id
WHERE LOWER(repository.repo_language) = '{LANG}'
  AND cwe.cwe_id GLOB 'CWE-[0-9]*'"""

# Keep files that match any of the extensions. Chaining the patterns with
# AND would match nothing as soon as more than one extension is listed.
if LANG_EXT:
    ext_filter = " OR ".join(
        f"file_change.filename GLOB '*.{ext}'" for ext in LANG_EXT
    )
    query += f"""
  AND ({ext_filter})"""

# Skip test files. The wildcards are needed because GLOB matches the whole
# filename, so a bare '[Tt]est' pattern would exclude almost nothing.
query += """
  AND file_change.filename NOT GLOB '*[Tt]est*'
GROUP BY cve.cve_id;"""
cursor.execute(query)
rows = cursor.fetchall()

with open(f"CVEfixes_{LANG}.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    headers = [desc[0] for desc in cursor.description] + ["repo_size"]
    writer.writerow(headers)

    for row in rows:
        repo_url = row[3]
        size = get_size(repo_url)
        writer.writerow(list(row) + [size])

conn.close()
```