Performance Analysis and Optimization
=====================================

This document outlines the performance characteristics of the Inhabit model, identifies current bottlenecks, and provides a roadmap for future optimizations based on a comprehensive code analysis.

Overview of Performance Characteristics
---------------------------------------

The Inhabit model processes large-scale longitudinal survey data (SOEP) and projects it over multiple decades. The computational complexity is primarily driven by:

1. **High Dimensionality**: The inhabit matrix cross-tabulates multiple household and dwelling dimensions, leading to a large number of unique combinations.
2. **Iterative Processes**: The allocation algorithm runs multiple iterations per year to match searching households to available dwellings.
3. **Recursive Calibration**: The Iterative Proportional Fitting (IPF) algorithm requires multiple passes to converge on census targets.
4. **Temporal Depth**: Simulations often span 20 to 40 years, magnifying any per-year inefficiencies.

Identified Bottlenecks
----------------------

Based on profiling and code review, the following areas have been identified as the primary sources of overhead.

Iterative Row Processing
~~~~~~~~~~~~~~~~~~~~~~~~

Several core functions use ``pandas.DataFrame.iterrows()`` or similar row-wise iteration patterns. This is significantly slower than vectorized operations because each row is converted to a ``Series`` object and the loop body runs in the Python interpreter.

* **Impact**: High. Specifically visible in the allocation loop and custom disaggregation functions.
* **Recommendation**: Replace row-wise logic with vectorized NumPy operations or grouped aggregations where possible.

Repeated Mask Creation
~~~~~~~~~~~~~~~~~~~~~~

The model frequently creates boolean masks (e.g., ``df[df['col'] == value]``) inside nested loops to filter data for specific combinations.

* **Impact**: Medium-High. Each mask scans the entire DataFrame, so repeating the filter inside nested loops is computationally expensive.
* **Recommendation**: Use multi-indexing or pre-filter DataFrames into dictionaries of sub-groups.

Redundant I/O Operations
~~~~~~~~~~~~~~~~~~~~~~~~

Reading and writing CSV/Excel files within the simulation loop adds significant latency due to disk I/O overhead.

* **Impact**: Medium. Particularly noticeable when saving intermediate results for every year.
* **Recommendation**: Implement batch I/O or only save results at the end of the simulation.

Frequent DataFrame Copying
~~~~~~~~~~~~~~~~~~~~~~~~~~

Functions often use ``df.copy()`` to avoid side effects. While safe, excessive copying of large DataFrames consumes memory and adds processing time.

* **Impact**: Low-Medium.
* **Recommendation**: Use in-place operations (``inplace=True``) where state management allows, or pass references more strategically. Note that ``inplace=True`` does not guarantee that pandas avoids an internal copy, so measure before relying on it.

Optimization Roadmap
--------------------

A staged approach is recommended to improve model performance without compromising the integrity of the results.

Stage 1: Vectorization (Short Term)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Refactor ``inhabit_matrix.py`` disaggregation functions to use vectorized mapping instead of row-wise processing.
* Update ``allocation.py`` handlers to operate on entire columns of attributes rather than individual rows.

Stage 2: Algorithmic Efficiency (Medium Term)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **IPF Optimization**: The census calibration already implements "Factor Reuse," which provides a ~100x speedup for multi-year runs. Ensure this pattern is used across all calibration steps.
* **Caching**: Implement memoization for preference matrices that do not change between simulation years.

Stage 3: Parallelization (Long Term)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Multiprocessing**: Use Python's ``concurrent.futures`` or ``multiprocessing`` to run independent regional simulations (e.g., Rural vs. Urban) or independent scenarios in parallel.
* **Dask/Polars**: For extremely large datasets, consider migrating core data processing from pandas to Polars or Dask to take advantage of multi-core hardware and lazy evaluation.

Memory Management
-----------------

The model can be memory-intensive when running long-term projections. To minimize the memory footprint:

1. **Downcasting**: Cast numeric columns to the smallest type that preserves the required precision (e.g., ``float32`` instead of ``float64``).
2. **Categorical Data**: Convert string-based dimensions (like building type or region) to the pandas ``category`` dtype.
3. **Garbage Collection**: Explicitly delete large temporary DataFrames within the simulation loop to free up RAM.

Benchmarking
------------

Developers are encouraged to use the ``@misc.timer_func`` decorator on core functions to track execution time during development. Any significant refactor should be accompanied by a benchmark comparison against the baseline performance.
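
A baseline comparison of this kind might look like the following minimal sketch. It is not taken from the Inhabit code base: the ``timer_func`` defined here is a hypothetical stand-in for ``misc.timer_func`` (whose actual behavior is not shown in this document), and the column names and the weighted-sum workload are invented for illustration. It times a row-wise ``iterrows()`` implementation against a vectorized one and checks that both produce the same result.

```python
import time
from functools import wraps

import numpy as np
import pandas as pd


def timer_func(func):
    """Hypothetical stand-in for misc.timer_func: stores the last wall-clock time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    wrapper.last_elapsed = None
    return wrapper


@timer_func
def weighted_total_rowwise(df):
    """Baseline: accumulate the weighted sum one row at a time."""
    total = 0.0
    for _, row in df.iterrows():
        total += row["households"] * row["weight"]
    return total


@timer_func
def weighted_total_vectorized(df):
    """Refactor: the same weighted sum as one column-wise expression."""
    return float((df["households"] * df["weight"]).sum())


# Synthetic data; the column names are illustrative assumptions.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "households": rng.integers(1, 500, size=20_000),
    "weight": rng.random(20_000),
})

baseline = weighted_total_rowwise(df)
refactored = weighted_total_vectorized(df)

# A refactor only counts if it reproduces the baseline result.
assert np.isclose(baseline, refactored)

print(f"row-wise:   {weighted_total_rowwise.last_elapsed:.4f} s")
print(f"vectorized: {weighted_total_vectorized.last_elapsed:.4f} s")
```

Checking result equality before comparing timings keeps a faster but incorrect refactor from being mistaken for an improvement. Timings from a single run are noisy, so repeating the measurement a few times on realistic data sizes gives a more reliable comparison.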