The Difference Boosting Neural Network (DBNN), originally published in Intelligent Data Analysis, 4 (2000) 463-473, IOS Press, is a simple yet effective Bayesian network that imposes conditional independence on the joint probability of multiple features for classification. This implementation extends the original work with modern GPU optimization and adaptive learning capabilities.
- Histogram Model: Uses non-parametric density estimation with configurable bin sizes
- Gaussian Model: Uses a multivariate normal distribution for feature-pair modelling (see the sketch below)
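For intuition, here is a minimal sketch of how a Gaussian model can score a single feature pair for one class; the toy data, the class-conditional fit, and the use of `scipy.stats` are illustrative assumptions, not the package's actual internals:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy data: values of one feature pair for samples of a single class
pair = np.array([[1.0, 2.1], [0.9, 1.8], [1.2, 2.4], [1.1, 2.0]])

# Fit a bivariate normal to this class's feature pair
mean = pair.mean(axis=0)
cov = np.cov(pair, rowvar=False)

# Class-conditional likelihood of a new sample under this pairwise model
x = np.array([1.05, 2.0])
likelihood = multivariate_normal(mean=mean, cov=cov).pdf(x)
print(f"p(x | class) for this feature pair: {likelihood:.4f}")
```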
Based on "What is there in a training sample?" (2009 World Congress on Nature & Biologically Inspired Computing), the implementation includes adaptive sample selection with configurable parameters (a sketch of the selection rule follows the list):
- `active_learning_tolerance`: Controls sample selection based on probability margins
  - Range: 1.0 to 99.0 (higher means more samples selected)
  - Default: 3.0 (samples within 3% of the maximum probability)
  - Example: 99.0 selects samples within 99% of the maximum probability
- `cardinality_threshold_percentile`: Controls the feature complexity threshold
  - Range: 1 to 100 (lower means more samples selected)
  - Default: 95 (95th percentile)
  - Example: 75 selects samples below the 75th percentile of feature cardinality
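As a rough sketch of what the tolerance means in practice, the snippet below selects samples whose top two class probabilities are close; the function name and the exact margin rule are assumptions for illustration, not the package's code:

```python
import numpy as np

def select_ambiguous_samples(class_probs, tolerance=3.0):
    """Select samples whose runner-up class probability lies within
    `tolerance` percent of the winning class probability."""
    sorted_probs = np.sort(class_probs, axis=1)[:, ::-1]  # descending per row
    margin = sorted_probs[:, 0] - sorted_probs[:, 1]      # top-1 minus top-2
    return np.where(margin <= (tolerance / 100.0) * sorted_probs[:, 0])[0]

probs = np.array([[0.52, 0.48], [0.95, 0.05], [0.70, 0.30]])
print(select_ambiguous_samples(probs, tolerance=10.0))  # -> [0]
```

A higher tolerance widens the margin band, pulling more borderline samples into training, which matches the "higher means more samples selected" behaviour above.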
```jsonc
  /* Model and computation settings */
  "modelType": "Histogram",              // Model type: "Histogram" or "Gaussian"
  "compute_device": "auto",              // "auto", "cuda", or "cpu"
  "use_interactive_kbd": false,          // Enable keyboard interaction
  "debug_enabled": true,                 // Enable detailed debug logging

  /* Training data management */
  "Save_training_epochs": true,          // Save data for each epoch
  "training_save_path": "training_data"  // Path for saving training data
},
"execution_flags": {
  "train": true,                         // Enable training
  "train_only": false,                   // Only perform training
  "predict": true,                       // Enable prediction
  "gen_samples": false,                  // Generate sample datasets
  "fresh_start": false,                  // Start fresh training
  "use_previous_model": true             // Use previously trained model if available
}
}
```
```jsonc
/* Likelihood computation settings */
"likelihood_config": {
  "feature_group_size": 2,                 // Size of feature groups (usually 2)
  "max_combinations": 1000,                // Maximum feature combinations
  "bin_sizes": [20]                        // Histogram bin sizes; can also vary per feature, e.g. [20, 33, 64]
},

/* Active learning parameters */
"active_learning": {
  "tolerance": 1.0,                        // Learning tolerance
  "cardinality_threshold_percentile": 95,  // Percentile for cardinality threshold
  "strong_margin_threshold": 0.3,          // Threshold for strong failures
  "marginal_margin_threshold": 0.1,        // Threshold for marginal failures
  "min_divergence": 0.1                    // Minimum divergence between samples
},

/* Training parameters specific to this dataset */
"training_params": {
  "Save_training_epochs": true,            // Save epoch-specific data
  "training_save_path": "training_data/dataset_name"  // Dataset-specific save path
},

/* Model selection */
"modelType": "Histogram"                   // "Histogram" or "Gaussian"
}
```
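Because these config files carry `//` and `/* */` comments, they are not strict JSON; below is a minimal loader sketch that strips comments before parsing (the file name and the comment-stripping approach are assumptions, not the package's own loader):

```python
import json
import re

def load_config(path):
    """Load a commented JSON config by stripping comments first.
    Note: this naive regex would also strip '//' inside string values."""
    with open(path) as f:
        text = f.read()
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)  # /* block */ comments
    text = re.sub(r"//[^\n]*", "", text)                    # // line comments
    return json.loads(text)

config = load_config("your_dataset.conf")  # hypothetical file name
print(config["active_learning"]["tolerance"])
```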
- Add `#` before feature names in the config file to exclude them (see the fragment below)
- Automatic filtering of high-cardinality features
- `cardinality_tolerance`: -1 preserves exact precision; a positive number rounds values to that many decimal places
- `random_seed`: -1 enables data shuffling; a positive number ensures reproducible splits
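A hypothetical config fragment illustrating these options; the key names here, including `column_names`, are illustrative assumptions, not verbatim from the package:

```jsonc
"column_names": ["age", "income", "#customer_id"],  // '#' excludes customer_id
"cardinality_tolerance": 4,   // round feature values to 4 decimal places
"random_seed": 42             // fixed seed for reproducible splits
```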
- Automatic device selection with `compute_device: "auto"` (sketched after this list)
- Batch processing for memory efficiency
- Parallel likelihood computation
- Optimized tensor operations
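A minimal sketch of what `compute_device: "auto"` amounts to in PyTorch (the helper name is an illustrative assumption):

```python
import torch

def resolve_device(setting="auto"):
    """Map the compute_device config value to a torch.device."""
    if setting == "auto":
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.device(setting)

device = resolve_device("auto")
print(f"Computing on: {device}")
```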
- Saves and loads model weights
- Preserves categorical encoders
- Maintains model state between sessions
- Supports continued training with `use_previous_model` (a persistence sketch follows)
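A hedged sketch of the kind of persistence this involves; the file names and the choice of `torch.save` plus `pickle` are assumptions about the approach, not the exact on-disk format:

```python
import pickle
import torch

def save_model_state(weights, encoders, prefix="model"):
    """Persist model weights and categorical encoders together."""
    torch.save(weights, f"{prefix}_weights.pt")
    with open(f"{prefix}_encoders.pkl", "wb") as f:
        pickle.dump(encoders, f)

def load_model_state(prefix="model"):
    weights = torch.load(f"{prefix}_weights.pt")
    with open(f"{prefix}_encoders.pkl", "rb") as f:
        encoders = pickle.load(f)
    return weights, encoders
```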
- Confusion matrices with colour coding (see the example after this list)
- Training progress plots
- Probability distribution visualizations
- Detailed classification reports
- Confidence metrics for predictions
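For example, a colour-coded confusion matrix can be produced with scikit-learn and Matplotlib; this is a generic sketch, not the package's own plotting code:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = [0, 1, 1, 0, 1, 0]  # toy labels
y_pred = [0, 1, 0, 0, 1, 1]  # toy predictions

# Colour-coded confusion matrix from true vs. predicted labels
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, cmap="Blues")
plt.title("Confusion matrix")
plt.show()
```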
- Press 'q' or 'Q' to skip to the next training phase (requires X11 on Linux)
- Early stopping based on error rates (sketched below)
- Adaptive sample selection with configurable thresholds
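A minimal sketch of error-rate-based early stopping; the patience logic and names are assumptions for illustration:

```python
def should_stop(error_history, patience=5, min_delta=0.01):
    """Stop when the error rate has not improved by at least
    min_delta over the last `patience` epochs."""
    if len(error_history) <= patience:
        return False
    recent_best = min(error_history[-patience:])
    earlier_best = min(error_history[:-patience])
    return earlier_best - recent_best < min_delta

errors = [0.30, 0.22, 0.18, 0.17, 0.169, 0.169, 0.168, 0.168, 0.168]
print(should_stop(errors))  # True: the error rate has plateaued
```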
Basic training and prediction:

```python
model = GPUDBNN(dataset_name='your_dataset')
results = model.fit_predict(batch_size=32)
```

Adaptive training from a fresh start:

```python
model = GPUDBNN(dataset_name='your_dataset', fresh=True)
history = model.adaptive_fit_predict(max_rounds=10)
```

Use space2csv.py to convert space-separated files to CSV:

```bash
python space2csv.py input_file.txt output_file.csv
```

- PyTorch
- NumPy
- Pandas
- Scikit-learn
- Matplotlib
- CUDA (optional for GPU acceleration)
- Use GPU acceleration for large datasets
- Adjust batch size based on available memory
- Configure bin sizes based on data distribution
- Tune active learning parameters for optimal sample selection
- Automatic fallback to CPU if GPU unavailable
- Robust handling of missing values
- Graceful degradation for large datasets
- Comprehensive error reporting
Please report any issues or contribute improvements through the project repository.
Note: This implementation extends the original DBNN with modern optimizations and additional features while maintaining its core principles of simplicity and effectiveness.