Project Overview
The CVE Forecast project is an automated system that predicts the number of Common Vulnerabilities and Exposures (CVEs) that will be published in the upcoming months. By leveraging a suite of time series forecasting models, it provides insight into future trends in vulnerability disclosures.
Evaluation Metrics
Models are evaluated using multiple metrics to ensure reliable forecasts (a computation sketch follows this list):
- MAPE (Mean Absolute Percentage Error): Primary metric for model selection
- MASE (Mean Absolute Scaled Error): Scale-independent measure of forecast accuracy
- RMSSE (Root Mean Squared Scaled Error): Good for comparing across different time series
- MAE (Mean Absolute Error): Easy to interpret in the original units
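The sketch below computes all four metrics with NumPy from their standard definitions; the `evaluate_forecast` helper and its signature are illustrative, not the project's actual API. MASE and RMSSE scale the forecast error by the in-sample error of a naive (seasonal) reference forecast:

```python
import numpy as np

def evaluate_forecast(actual, forecast, train, m=1):
    """Compute MAPE, MAE, MASE, and RMSSE for a single forecast.

    actual, forecast : arrays over the test horizon
    train            : in-sample training series used to scale MASE/RMSSE
    m                : seasonal period of the naive reference (1 = non-seasonal)
    """
    actual, forecast, train = map(np.asarray, (actual, forecast, train))
    err = actual - forecast

    # One-step errors of the naive (seasonal) forecast on the training series.
    naive_err = train[m:] - train[:-m]

    return {
        "MAPE": 100.0 * np.mean(np.abs(err / actual)),
        "MAE": np.mean(np.abs(err)),
        "MASE": np.mean(np.abs(err)) / np.mean(np.abs(naive_err)),
        "RMSSE": np.sqrt(np.mean(err**2) / np.mean(naive_err**2)),
    }
```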
System Architecture
The project is built with a modular architecture, separating concerns into distinct Python modules. This design enhances maintainability, scalability, and testability.
Core Modules
- main.py: The main entry point that orchestrates the entire forecasting workflow (see the sketch after this list).
- config.py: Centralized configuration for models, parameters, and constants.
- data_processor.py: Handles the parsing and processing of raw CVE data.
- analysis.py: Contains the core logic for model training, evaluation, and forecasting.
- file_io.py: Manages all file input/output operations, including the generation of the final `data.json`.
- utils.py: Provides utility functions, such as logging setup.
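The module names below match the repository layout described above, but the function names (`load_config`, `build_monthly_series`, and so on) are hypothetical; this is only a sketch of how `main.py` might wire the pieces together:

```python
# main.py -- hypothetical orchestration sketch; module names follow the
# repository layout, but all function names here are illustrative only.
from config import load_config                # hypothetical config loader
from data_processor import build_monthly_series
from analysis import train_evaluate_and_forecast
from file_io import write_data_json
from utils import setup_logging

def main() -> None:
    setup_logging()
    cfg = load_config("config.json")
    series = build_monthly_series(cfg)                  # raw CVE JSON -> monthly counts
    results = train_evaluate_and_forecast(series, cfg)  # fit models, rank by MAPE, forecast
    write_data_json(results, cfg)                       # emit data.json for the dashboard

if __name__ == "__main__":
    main()
```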
Data Processing Pipeline
The data processing pipeline is designed to be efficient and robust, handling large volumes of CVE data.
- Data Ingestion: The system clones the official cvelistV5 repository from GitHub, which contains all CVE data in JSON format.
- Data Parsing: Each JSON file is parsed to extract the `publishedDate`. The system is designed to handle malformed JSON files gracefully (see the sketch after this list).
- Time Series Aggregation: The extracted publication dates are aggregated into a monthly time series, counting the number of CVEs published each month.
- Data Validation: The time series data is validated to ensure completeness and correctness before being passed to the forecasting models.
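A minimal sketch of the parsing and aggregation steps, assuming the cvelistV5 record layout; the JSON path `cveMetadata.datePublished` and the `build_monthly_series` helper are assumptions, not the project's actual code:

```python
import json
from pathlib import Path
import pandas as pd

def build_monthly_series(repo_root: str) -> pd.Series:
    """Parse cvelistV5 JSON records and aggregate them into monthly CVE counts."""
    dates = []
    for path in Path(repo_root).rglob("CVE-*.json"):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip malformed files gracefully
        # The JSON path to the publication date is an assumption here.
        published = record.get("cveMetadata", {}).get("datePublished")
        if published:
            dates.append(published)

    stamps = pd.to_datetime(pd.Series(dates), utc=True, errors="coerce").dropna()
    monthly = stamps.dt.tz_localize(None).dt.to_period("M").value_counts().sort_index()
    # Fill months with zero disclosures so the series is gap-free.
    full_range = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
    return monthly.reindex(full_range, fill_value=0)
```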
Forecasting Models
The project employs a diverse set of over 25 time series forecasting models from the Darts library. This allows for a comprehensive analysis and the selection of the best-performing model for the given data.
Model Categories
Note: Models are automatically selected based on performance and stability. Deep learning models are available but disabled by default for CPU-only environments (see the roster sketch after this list).
- Statistical Models: ARIMA, ExponentialSmoothing, Prophet, Theta, etc.
- Machine Learning Models: LinearRegression, RandomForest, LightGBM, XGBoost.
- Deep Learning Models: NBEATS, NHiTS, TCN, Transformer, TiDE.
- Baseline Models: NaiveMean, NaiveSeasonal, NaiveDrift.
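An illustrative roster built from model classes that do exist in Darts; the dictionary layout and the `ENABLE_DEEP_LEARNING` flag are assumptions, not the project's actual `config.py`:

```python
# Illustrative model roster; the class names are real Darts models, but the
# dictionary layout and ENABLE_DEEP_LEARNING flag are assumptions.
from darts.models import (
    ARIMA, ExponentialSmoothing, Prophet, Theta,
    LinearRegressionModel, RandomForest, LightGBMModel, XGBModel,
    NaiveMean, NaiveSeasonal, NaiveDrift,
)

ENABLE_DEEP_LEARNING = False  # deep models stay off by default on CPU-only runners

MODELS = {
    "statistical": [ARIMA, ExponentialSmoothing, Prophet, Theta],
    "machine_learning": [LinearRegressionModel, RandomForest, LightGBMModel, XGBModel],
    "baseline": [NaiveMean, NaiveSeasonal, NaiveDrift],
}

if ENABLE_DEEP_LEARNING:
    # These classes require Darts' torch extras to be installed.
    from darts.models import NBEATSModel, NHiTSModel, TCNModel, TransformerModel, TiDEModel
    MODELS["deep_learning"] = [NBEATSModel, NHiTSModel, TCNModel, TransformerModel, TiDEModel]
```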
Model Evaluation
All models are evaluated using a rigorous and automated process defined within the system's configuration:
- Centralized Configuration: Key parameters for the evaluation, such as the train/test split ratio and forecast horizon, are managed in `config.json`.
- Optimized Hyperparameters: Model hyperparameters are systematically managed within the forecasting engine, leveraging the best-known configuration for each model type.
- Performance Metrics: Models are ranked by Mean Absolute Percentage Error (MAPE), with other metrics like MAE, RMSE, and MASE also calculated for a comprehensive assessment.
- Performance History: Each model run, including its metrics and hyperparameters, is logged to `performance_history.json` (path defined in `config.json`) to track performance and improvements over time (see the sketch after this list).
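A minimal sketch of the ranking and history-logging steps, assuming each model run yields a metrics dictionary like the one computed earlier; the layout of `performance_history.json` and the `rank_and_log` helper are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

def rank_and_log(results: dict, history_path: str = "performance_history.json") -> str:
    """Rank models by MAPE and append this run to the performance history.

    results maps model name -> {"MAPE": ..., "MAE": ..., "RMSE": ..., "MASE": ...}.
    The history file layout is an illustrative assumption.
    """
    ranked = sorted(results, key=lambda name: results[name]["MAPE"])
    best = ranked[0]

    try:
        with open(history_path) as fh:
            history = json.load(fh)
    except FileNotFoundError:
        history = []  # first run: start a fresh history

    history.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "best_model": best,
        "metrics": results,
    })
    with open(history_path, "w") as fh:
        json.dump(history, fh, indent=2)
    return best
```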
Change Log
v.05 - Adolfo Suárez Madrid-Barajas 🇪🇸
- Fixed a critical bug that prevented the cumulative graph from rendering due to an incorrect data structure in `data.json`.
- Restored frontend compatibility by correcting the data generation logic, ensuring all charts now load correctly.
v.04 ORD ✈️ MAD
- Enhanced model stability with improved error handling.
- Added input validation and scaling for better numerical stability.
- Optimized for CPU-only environments.
- Implemented dynamic forecast period calculation.
- Improved model selection based on MAPE scores.
Deployment and Automation
The CVE Forecast dashboard is deployed as a static website, with the data being updated daily through a fully automated CI/CD pipeline using GitHub Actions.
GitHub Actions Workflow
- Scheduled Trigger: The workflow is triggered daily at midnight UTC.
- Data Fetching: The latest CVE data is downloaded.
- Forecasting: The entire forecasting pipeline is executed, generating new predictions.
- Data Commit: The updated `data.json` is committed back to the repository.
- Deployment: The static site is automatically deployed, making the new forecasts available to users.