Project Overview
The CVE Forecast project is an automated system that predicts the number of Common Vulnerabilities and Exposures (CVEs) that will be published in the upcoming months. By leveraging a suite of time series forecasting models, it provides insight into future trends in vulnerability disclosures.
Evaluation Metrics
Models are evaluated using multiple metrics to ensure reliable forecasts (a computation sketch follows this list):
- MAPE (Mean Absolute Percentage Error): Primary metric for model selection
- MASE (Mean Absolute Scaled Error): Scale-independent measure of forecast accuracy
- RMSSE (Root Mean Squared Scaled Error): Good for comparing across different time series
- MAE (Mean Absolute Error): Easy to interpret in the original units
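The sketch below computes all four metrics with NumPy from their standard definitions; the `evaluate_forecast` helper and its signature are illustrative, not the project's actual API. MASE and RMSSE scale the forecast error by the in-sample error of a naive (seasonal) reference forecast:

```python
import numpy as np

def evaluate_forecast(actual, forecast, train, m=1):
    """Compute MAPE, MAE, MASE, and RMSSE for a single forecast.

    actual, forecast : arrays over the test horizon
    train            : in-sample training series used to scale MASE/RMSSE
    m                : seasonal period of the naive reference (1 = non-seasonal)
    """
    actual, forecast, train = map(np.asarray, (actual, forecast, train))
    err = actual - forecast

    # One-step errors of the naive (seasonal) forecast on the training series.
    naive_err = train[m:] - train[:-m]

    return {
        "MAPE": 100.0 * np.mean(np.abs(err / actual)),
        "MAE": np.mean(np.abs(err)),
        "MASE": np.mean(np.abs(err)) / np.mean(np.abs(naive_err)),
        "RMSSE": np.sqrt(np.mean(err**2) / np.mean(naive_err**2)),
    }
```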
System Architecture
The project is built with a modular architecture, separating concerns into distinct Python modules. This design enhances maintainability, scalability, and testability.
Core Modules
- main.py: The main entry point that orchestrates the entire forecasting workflow (see the sketch after this list).
- config.py: Centralized configuration for models, parameters, and constants.
- data_processor.py: Handles the parsing and processing of raw CVE data.
- analysis.py: Contains the core logic for model training, evaluation, and forecasting.
- file_io.py: Manages all file input/output operations, including the generation of the final `data.json`.
- utils.py: Provides utility functions, such as logging setup.
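The module names below match the repository layout described above, but the function names (`load_config`, `build_monthly_series`, and so on) are hypothetical; this is only a sketch of how `main.py` might wire the pieces together:

```python
# main.py -- hypothetical orchestration sketch; module names follow the
# repository layout, but all function names here are illustrative only.
from config import load_config                # hypothetical config loader
from data_processor import build_monthly_series
from analysis import train_evaluate_and_forecast
from file_io import write_data_json
from utils import setup_logging

def main() -> None:
    setup_logging()
    cfg = load_config("config.json")
    series = build_monthly_series(cfg)                  # raw CVE JSON -> monthly counts
    results = train_evaluate_and_forecast(series, cfg)  # fit models, rank by MAPE, forecast
    write_data_json(results, cfg)                       # emit data.json for the dashboard

if __name__ == "__main__":
    main()
```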
Data Processing Pipeline
The data processing pipeline is designed to be efficient and robust, handling large volumes of CVE data.
- Data Ingestion: The system clones the official cvelistV5 repository from GitHub, which contains all CVE data in JSON format.
- Data Parsing: Each JSON file is parsed to extract the `publishedDate`. The system is designed to handle malformed JSON files gracefully (see the sketch after this list).
- Time Series Aggregation: The extracted publication dates are aggregated into a monthly time series, counting the number of CVEs published each month.
- Data Validation: The time series data is validated to ensure completeness and correctness before being passed to the forecasting models.
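A minimal sketch of the parsing and aggregation steps, assuming the cvelistV5 record layout; the JSON path `cveMetadata.datePublished` and the `build_monthly_series` helper are assumptions, not the project's actual code:

```python
import json
from pathlib import Path
import pandas as pd

def build_monthly_series(repo_root: str) -> pd.Series:
    """Parse cvelistV5 JSON records and aggregate them into monthly CVE counts."""
    dates = []
    for path in Path(repo_root).rglob("CVE-*.json"):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip malformed files gracefully
        # The JSON path to the publication date is an assumption here.
        published = record.get("cveMetadata", {}).get("datePublished")
        if published:
            dates.append(published)

    stamps = pd.to_datetime(pd.Series(dates), utc=True, errors="coerce").dropna()
    monthly = stamps.dt.tz_localize(None).dt.to_period("M").value_counts().sort_index()
    # Fill months with zero disclosures so the series is gap-free.
    full_range = pd.period_range(monthly.index.min(), monthly.index.max(), freq="M")
    return monthly.reindex(full_range, fill_value=0)
```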
Forecasting Models
The project employs a diverse set of over 25 time series forecasting models from the Darts library. This allows for a comprehensive analysis and the selection of the best-performing model for the given data.
Model Categories
Note: Models are automatically selected based on performance and stability. Deep learning models are available but disabled by default for CPU-only environments (see the roster sketch after this list).
- Statistical Models: ARIMA, ExponentialSmoothing, Prophet, Theta, etc.
- Machine Learning Models: LinearRegression, RandomForest, LightGBM, XGBoost.
- Deep Learning Models: NBEATS, NHiTS, TCN, Transformer, TiDE.
- Baseline Models: NaiveMean, NaiveSeasonal, NaiveDrift.
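An illustrative roster built from model classes that do exist in Darts; the dictionary layout and the `ENABLE_DEEP_LEARNING` flag are assumptions, not the project's actual `config.py`:

```python
# Illustrative model roster; the class names are real Darts models, but the
# dictionary layout and ENABLE_DEEP_LEARNING flag are assumptions.
from darts.models import (
    ARIMA, ExponentialSmoothing, Prophet, Theta,
    LinearRegressionModel, RandomForest, LightGBMModel, XGBModel,
    NaiveMean, NaiveSeasonal, NaiveDrift,
)

ENABLE_DEEP_LEARNING = False  # deep models stay off by default on CPU-only runners

MODELS = {
    "statistical": [ARIMA, ExponentialSmoothing, Prophet, Theta],
    "machine_learning": [LinearRegressionModel, RandomForest, LightGBMModel, XGBModel],
    "baseline": [NaiveMean, NaiveSeasonal, NaiveDrift],
}

if ENABLE_DEEP_LEARNING:
    # These classes require Darts' torch extras to be installed.
    from darts.models import NBEATSModel, NHiTSModel, TCNModel, TransformerModel, TiDEModel
    MODELS["deep_learning"] = [NBEATSModel, NHiTSModel, TCNModel, TransformerModel, TiDEModel]
```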
Model Evaluation
All models are evaluated using a rigorous and automated process defined within the system's configuration:
- Centralized Configuration: Key parameters for the evaluation, such as the train/test split ratio and forecast horizon, are managed in `config.json`.
- Optimized Hyperparameters: Model hyperparameters are systematically managed within the forecasting engine, leveraging the best-known configuration for each model type.
- Performance Metrics: Models are ranked by Mean Absolute Percentage Error (MAPE), with other metrics like MAE, RMSE, and MASE also calculated for a comprehensive assessment.
- Performance History: Each model run, including its metrics and hyperparameters, is logged to `performance_history.json` (path defined in `config.json`) to track performance and improvements over time (see the sketch after this list).
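A minimal sketch of the ranking and history-logging steps, assuming each model run yields a metrics dictionary like the one computed earlier; the layout of `performance_history.json` and the `rank_and_log` helper are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

def rank_and_log(results: dict, history_path: str = "performance_history.json") -> str:
    """Rank models by MAPE and append this run to the performance history.

    results maps model name -> {"MAPE": ..., "MAE": ..., "RMSE": ..., "MASE": ...}.
    The history file layout is an illustrative assumption.
    """
    ranked = sorted(results, key=lambda name: results[name]["MAPE"])
    best = ranked[0]

    try:
        with open(history_path) as fh:
            history = json.load(fh)
    except FileNotFoundError:
        history = []  # first run: start a fresh history

    history.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "best_model": best,
        "metrics": results,
    })
    with open(history_path, "w") as fh:
        json.dump(history, fh, indent=2)
    return best
```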
Change Log
v.05 - Adolfo Suárez Madrid-Barajas 🇪🇸
- Fixed a critical bug that prevented the cumulative graph from rendering due to an incorrect data structure in `data.json`.
- Restored frontend compatibility by correcting the data generation logic, ensuring all charts now load correctly.
v.04 ORD ✈️ MAD
- Enhanced model stability with improved error handling.
- Added input validation and scaling for better numerical stability.
- Optimized for CPU-only environments.
- Implemented dynamic forecast period calculation.
- Improved model selection based on MAPE scores.
Deployment and Automation
The CVE Forecast dashboard is deployed as a static website, with the data being updated daily through a fully automated CI/CD pipeline using GitHub Actions.
GitHub Actions Workflow
- Scheduled Trigger: The workflow is triggered daily at midnight UTC.
- Data Fetching: The latest CVE data is downloaded.
- Forecasting: The entire forecasting pipeline is executed, generating new predictions.
- Data Commit: The updated `data.json` is committed back to the repository.
- Deployment: The static site is automatically deployed, making the new forecasts available to users.