Code Analysis and API Reference
This section provides an in-depth look at the codebase, detailing the functionality of each module and its components. We leverage Sphinx’s autodoc and autosummary features to automatically generate documentation directly from the Python source code.
Module: Inhabit Matrix Core
The core of the Inhabit model, responsible for creating and projecting the inhabit matrix.
Combine dwelling and household data into the inhabit matrix.
This module creates a cross-tabulated inhabit matrix by matching household demographics to dwelling characteristics. It processes SOEP (Socio-Economic Panel) survey data to model housing occupancy patterns over time.
To run this script in full, you need access to the SOEP Core panel data. Apply for data access here: https://www.diw.de/en/diw_01.c.601584.en/data_access.html
Without access to the SOEP Core panel data, you can still run the script using the pre-computed data in data/evidence.
- inhabit_matrix.add_nans(inhabit_v: DataFrame, df_hh: DataFrame, nans_v: DataFrame, ip: Dict[str, Any]) DataFrame
Add NaN handling to inhabit by upweighting records and including missing.
Adjusts inhabit weights to account for households removed during data cleaning due to missing values. Applies household filters to NaN records, then upweights remaining inhabit records proportionally to restore population totals. Handles edge case where dwelling configurations were unavailable in certain years.
- Args:
inhabit_v: Inhabit dataframe with household-dwelling combinations and weights
df_hh: Original household dataframe before filtering
nans_v: Dataframe containing households removed due to NaN values
ip: Configuration dictionary with household filter parameters
- Returns:
pd.DataFrame: Modified inhabit with upweighted population to account for all households (including those removed as NaN)
- Examples:
>>> inhabit_adj = add_nans(inhabit, df_hh, nans, ip)
>>> inhabit_adj['weights'].sum() > inhabit['weights'].sum()
True
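The proportional upweighting that add_nans describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the helper name upweight_for_removed is hypothetical, and the scalar removed_weight stands in for the summed weight of the NaN records that remain after the household filters are applied.

```python
import pandas as pd


def upweight_for_removed(inhabit: pd.DataFrame, removed_weight: float,
                         col_weights: str = "weights") -> pd.DataFrame:
    """Scale weights so that their sum also covers the removed households."""
    kept = inhabit[col_weights].sum()
    if kept == 0:
        raise ValueError("cannot upweight an inhabit table with zero total weight")
    # One common factor keeps the relative distribution unchanged.
    factor = (kept + removed_weight) / kept
    out = inhabit.copy()
    out[col_weights] = out[col_weights] * factor
    return out


# Hypothetical data: 100 weight units kept, 25 removed during cleaning.
inhabit = pd.DataFrame({"weights": [60.0, 30.0, 10.0]})
adjusted = upweight_for_removed(inhabit, removed_weight=25.0)
```

Because every record is multiplied by the same factor, the shares between household-dwelling combinations stay as they were while the total is restored.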
- inhabit_matrix.adjust_inhabit(ip: Dict[str, Any], inhabit: DataFrame, dwell_stock: DataFrame, hh_stock: DataFrame) DataFrame
Synchronize inhabit matrix with household and dwelling stocks.
Adjusts the inhabit matrix to align with updated household and dwelling stock totals by proportionally scaling search and stay percentages. Handles edge cases where no searchers exist in current matrix but some are present in the stock (leftover searchers) by adding them to the most probable dwelling configurations.
- Args:
- ip: Configuration dictionary including:
cols_hh: Household grouping column names
cols_dwell: Dwelling grouping column names
to_group_col: Full group column names for multi-index
mean_ls_hh: Path to household mean living space file
inhabit: Indexed dataframe with search, stay, weights, and mean_ls
dwell_stock: Indexed dataframe with vacated dwellings by config
hh_stock: Indexed dataframe with search, stay, and mean_ls by config
- Returns:
pd.DataFrame: Updated inhabit matrix with normalized search/stay values matching stock totals, filled with 0 for NaN values.
- Raises:
AssertionError: If final inhabit totals don’t match stock totals within numerical tolerance (indicates inconsistent logic)
- Examples:
>>> inhabit_adj = adjust_inhabit(ip, inhabit, dwell_stock, hh_stock)
>>> inhabit_adj['weights'].sum() == hh_stock['inhabits'].sum()
True
- inhabit_matrix.adjust_stocks(ip: Dict[str, Any], inhabit: DataFrame, dwell_stock: DataFrame, hh_stock: DataFrame, year: int) Tuple[DataFrame, DataFrame, DataFrame, float, float]
Update stocks with external data and synchronize with inhabit matrix.
Applies external dwelling stock changes (reductions, characteristic shifts) and household movements based on empirical data and model parameters. Tracks the magnitude of changes for model calibration. Updates both stocks and the inhabit matrix to maintain consistency across datasets.
- Args:
ip: Configuration dictionary with stock adjustment parameters
inhabit: Indexed dataframe with household-dwelling combinations
dwell_stock: Indexed dataframe containing dwelling inventory
hh_stock: Indexed dataframe containing household inventory
year: Current year for applying year-specific adjustments
- Returns:
- Tuple containing:
dwell_stock: Updated indexed dwelling stock dataframe
hh_stock: Updated indexed household stock dataframe
inhabit: Updated indexed inhabit matrix
diff_vac: Change in dwelling vacancies during adjustment
diff_mov: Change in household movers during adjustment
- Raises:
AssertionError: If dwelling or household balance equations fail
- Examples:
>>> ds, hs, inh, dv, dm = adjust_stocks(ip, inh, ds, hs, 2015)
>>> dv >= 0  # vacancy changes tracked
True
- inhabit_matrix.create_helping_csvs(ip: Dict[str, Any]) None
Load SOEP-csvs and attach all variables needed to household/dwelling/wum.
Loads raw SOEP survey data and creates preprocessed CSV files for each year containing all necessary variables for inhabit matrix creation. Uses concurrent processing for efficiency. Writes separate CSV files for household and dwelling data by year, avoiding re-processing raw data on every run.
- Args:
- ip: Configuration dictionary with:
empirical_start_year: First year to process
empirical_end_year: Last year to process
composita_hh_path: Template path for household CSVs
composita_dwelling_path: Template path for dwelling CSVs
new_soep: Flag to force reprocessing of raw SOEP data
- Returns:
None. Creates CSV files on disk at paths specified in ip.
- Raises:
FileNotFoundError: If raw SOEP data files are not accessible
IOError: If CSV files cannot be written
- Examples:
>>> create_helping_csvs(ip)
# CSV files created for each year in empirical_start/end_year range
- inhabit_matrix.create_hh_mask(hh_stock: DataFrame, ip: Dict[str, Any], hh_key: Tuple[Any, ...]) Series
Create boolean mask for filtering household stock by characteristics.
Generates a mask that selects rows in household stock matching the specified household configuration tuple. Handles missing columns gracefully.
- Args:
hh_stock: Dataframe containing household groupings and characteristics
- ip: Configuration dictionary containing:
cols_hh: List of household column names in correct order
- hh_key: Tuple of household characteristic values to match, in the order corresponding to cols_hh
- Returns:
pd.Series: Boolean mask where True indicates matching rows
- Examples:
>>> mask = create_hh_mask(hh_stock, ip, (1, 2, 3, 4, 5))
>>> filtered = hh_stock[mask]
>>> len(filtered) <= 1  # Should match 0 or 1 rows
True
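A mask of this kind can be built by combining one equality test per column. The sketch below is an assumption-laden illustration: build_hh_mask, the column names, and the sample data are hypothetical, and skipping a column that is absent from the stock is only one plausible reading of "handles missing columns gracefully".

```python
import pandas as pd


def build_hh_mask(hh_stock: pd.DataFrame, cols_hh: list, hh_key: tuple) -> pd.Series:
    """Return True for rows whose values in cols_hh equal hh_key, position by position."""
    mask = pd.Series(True, index=hh_stock.index)
    for col, value in zip(cols_hh, hh_key):
        if col not in hh_stock.columns:
            # Assumption: a column missing from the stock places no constraint.
            continue
        mask &= hh_stock[col] == value
    return mask


hh_stock = pd.DataFrame({
    "hh_size": ["1", "2", "2"],
    "income_quintile": ["q1", "q3", "q5"],
    "search": [10.0, 4.0, 7.0],
})
# "age_group" is not a column of hh_stock, so it is ignored.
mask = build_hh_mask(hh_stock, ["hh_size", "income_quintile", "age_group"],
                     ("2", "q3", "old"))
filtered = hh_stock[mask]
```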
- inhabit_matrix.dwelling_disagg(ip: Dict[str, Any], df: DataFrame, year: int) Tuple[DataFrame, List[Dict[str, Any]]]
Create dataframe from dwelling data to be included in inhabit matrix.
Processes dwelling data by applying temporal filters, extracting structural characteristics, and preparing data for combination with household demographics. Handles special cases for missing dwelling ownership data.
- Args:
ip: Configuration dictionary containing user inputs and processing functions. Must include ‘dwell’ (dict of dwelling processing functions) and column name specifications.
df: Raw dwelling dataframe containing dwelling characteristics, structural attributes, and geographic information from SOEP.
year: Current year of observation for filtering dwelling stock.
- Returns:
- Tuple containing:
df: Processed dwelling dataframe with selected columns (includes living_space and structural characteristics)
all_dimensions: List of dimension dictionaries from dwelling characteristics processing functions
- Raises:
KeyError: When required configuration keys are missing from ip
ValueError: When dataframe is empty after filtering
- Examples:
>>> df_dwell, dims = dwelling_disagg(ip, df, 2015)
>>> 'living_space' in df_dwell.columns
True
- inhabit_matrix.get_full_inhabits(inhabit_v: DataFrame, inhabit_move_v: DataFrame, year: int, all_dims_dwell: List[Dict[str, Any]], all_dims_hh: List[Dict[str, Any]], ip: Dict[str, Any]) DataFrame
Expand sparse inhabit matrix to full dimensional space with aggregations.
Creates multiple fully-dimensioned versions of the sparse inhabit matrix:
- Weighted: All household-dwelling combinations with population weights
- Absolute: Count of unique households (unweighted) per combination
- Move: Count of searchers from move_v dataframe
Calculates mean living space per combination as total_living_space / count. Saves results to disk and integrates searcher/stayer information.
- Args:
inhabit_v: Sparse inhabit matrix with household-dwelling combinations and weights
inhabit_move_v: Sparse matrix of households planning to move
year: Current year for output file naming
all_dims_dwell: List of dwelling dimension dictionaries
all_dims_hh: List of household dimension dictionaries
- ip: Configuration dictionary with:
to_group_col: Grouping columns for aggregation
col_weights: Population weight column name
inhabit_evidence_path: Output file path template
- Returns:
pd.DataFrame: Full dimensional inhabit with weights, search, stay, and mean_ls columns, indexed by grouping columns
- Raises:
FileNotFoundError: If output directory doesn’t exist
IOError: If file cannot be written to disk
- Examples:
>>> inhabit_full = get_full_inhabits(inhabit_sparse, inhabit_move, 2015, dims_d, dims_h, ip)
>>> 'mean_ls' in inhabit_full.columns
True
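Expanding a sparse table to the full dimensional space amounts to reindexing onto the Cartesian product of all dimension values. The following sketch shows the idea with pandas; the dimension values and column names are hypothetical stand-ins for what all_dims_hh, all_dims_dwell, and ip would supply, and file output is left out.

```python
import pandas as pd

# Hypothetical dimensions: one household and one dwelling characteristic.
dims = {"hh_size": ["1", "2"], "rooms": ["2", "3", "4"]}

sparse = pd.DataFrame({
    "hh_size": ["1", "1", "2"],
    "rooms": ["2", "2", "4"],
    "weights": [10.0, 5.0, 20.0],
    "living_space": [50.0, 70.0, 110.0],
})

# Aggregate the observed combinations: weighted total, space total, record count.
grouped = sparse.groupby(list(dims)).agg(
    weights=("weights", "sum"),
    total_ls=("living_space", "sum"),
    n_obs=("living_space", "count"),
)

# Reindex onto every possible combination; unobserved ones get zeros.
full_index = pd.MultiIndex.from_product(list(dims.values()), names=list(dims))
full = grouped.reindex(full_index, fill_value=0)

# Mean living space as total_living_space / count; 0 / 0 yields NaN for empty cells.
full["mean_ls"] = full["total_ls"] / full["n_obs"]
```

Combinations without observations keep a weight of 0 and a missing mean_ls, which is the kind of gap that update_mean_ls_inh is documented to fill later.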
- inhabit_matrix.get_hh_matches(inhabit: DataFrame, row: Series) DataFrame
Get matching households in inhabit for a given household segment.
Retrieves all household-dwelling combinations where households match a reference segment and have non-zero weights. Used to identify which dwelling configurations are occupied by a specific household type.
- Args:
inhabit: Sparse matrix with household-dwelling combinations
row: Household stock row with characteristics to match
- Returns:
pd.DataFrame: Subset of inhabit containing matching households with positive weights/occupancy
- Examples:
>>> matches = get_hh_matches(inhabit, hh_row)
>>> (matches['weights'] > 0).all()
True
- inhabit_matrix.household_disagg(ip: Dict[str, Any], df: DataFrame, year: int) Tuple[DataFrame, List[Dict[str, Any]], DataFrame]
Create dataframe from household data to be included in inhabit matrix.
Processes household data by applying filters, extracting demographic characteristics, and preparing the data for matrix combination with dwelling characteristics. Tracks households that will move during the year for separate processing.
- Args:
ip: Configuration dictionary containing user inputs and processing functions. Must include ‘col_oldest’, ‘hh’ (dict of household processing functions), ‘move’, and ‘col_weights’ keys for column names.
df: Raw household dataframe from SOEP survey data containing household IDs, demographic variables, and survey weights.
year: Current year of observation for filtering and processing.
- Returns:
- Tuple containing:
df: Processed household dataframe with selected columns ready for inhabit matrix creation
all_dimensions: List of dimension dictionaries from household characteristics processing functions
nans: Dataframe containing rows with NaN values that were removed during cleaning, preserving household IDs for tracking
- Raises:
KeyError: When required configuration keys are missing from ip
ValueError: When dataframe is empty after filtering
- Examples:
>>> df_hh, dims, nans = household_disagg(ip, df, 2015)
>>> len(df_hh) > 0
True
- inhabit_matrix.inh_projection(ip: Dict[str, Any], year: int) None
Run yearly projection loop for housing market simulation.
Main simulation loop that processes housing market changes year-by-year. For each year: loads empirical inhabit data, applies census calibration, calculates household and dwelling stock changes from external models, and then allocates households to dwellings through the allocation module. Tracks living space metrics and handles scenario inputs for sensitivity analysis.
- Args:
- ip: Configuration dictionary with:
inhabit_evidence_path: Template path to empirical inhabit CSVs
inhabit_fac_evidence_path: Template path to calibrated inhabit CSVs
alloc_inh_created_path: Template path to save allocated inhabit
mean_ls_inh: Path to mean living space lookup
model_start_year, target_year: Year range for projection
scenario_start_year: Year when scenario inputs take effect
col_weights: Column name for population weights
debug: Boolean flag for debug output
year: Starting year for projection (typically model_start_year)
- Returns:
None. Saves inhabitance matrices and stock files to disk during projection loop. Updates debug/analysis logs if enabled.
- Raises:
FileNotFoundError: If required data files don’t exist
AssertionError: If validation checks fail on intermediate results
- Examples:
>>> inh_projection(ip, 2015)
# Projection runs from 2015 to target_year, saving yearly results
- inhabit_matrix.main(*args: Any, **flags: Dict[str, Any]) None
Combine household and dwelling data into inhabit matrix.
Entry point for the inhabit model. Orchestrates the complete workflow: loads configuration, checks for required evidence files, creates helper CSVs from raw SOEP if needed, runs model to create inhabit matrices, and finally triggers forward projection.
- Args:
*args: Positional arguments (unused, for future extension)
**flags: Keyword arguments passed to inputs.load_inputs() for configuration overrides (e.g., debug=True, new_soep=True)
- Returns:
None. Orchestrates file I/O and model execution.
- Raises:
FileNotFoundError: If SOEP data or required configuration files are missing
ValueError: If configuration parameters are invalid
- Examples:
>>> main(debug=True)
# Model runs with debug output enabled
- inhabit_matrix.run_model(ip: Dict[str, Any]) None
Main model execution orchestrating inhabit matrix creation.
Coordinates the complete workflow: saves configuration parameters, loads household and dwelling data for required years, calculates missing inhabit matrices for empirical period, and then runs projection from model_start_year forward.
- Args:
- ip: Configuration dictionary with:
empirical_start_year, empirical_end_year: Years to create inhabit for
model_start_year: Year to start forward projection
inhabit_evidence_path: Output path template for inhabit matrices
composita_hh_path, composita_dwelling_path: Input data paths
new_evidence: Flag to force recalculation of inhabit matrices
- Returns:
None. Orchestrates file I/O and calls yearly_inhabit and inh_projection.
- Examples:
>>> run_model(ip)
# Inhabit matrices created for empirical years, projection runs forward
- inhabit_matrix.save_update_mean_ls(inhabit: DataFrame, ip: Dict[str, Any], year: int) None
Save and update weighted mean living space across stocks and inhabit.
Calculates mean living space (living area per person/dwelling) from the current inhabit matrix and merges with historical data to maintain weighted averages. Tracks which years had sufficient data for each household and dwelling configuration to support model calibration.
- Args:
inhabit: Indexed dataframe containing weights, stay, search, and mean_ls columns for all household-dwelling combinations.
- ip: Configuration dictionary with file paths for mean_ls storage:
mean_ls_hh: Path to household mean living space CSV
mean_ls_dwell: Path to dwelling mean living space CSV
mean_ls_inh: Path to inhabit matrix mean living space CSV
cols_hh, cols_dwell, to_group_col: Column names for merging
year: Current year to tag mean_ls records for tracking data currency.
- Returns:
None. Saves/updates CSVs at paths specified in ip dictionary.
- Raises:
FileNotFoundError: If CSV files are expected but not found during merge
IOError: If CSV files cannot be written to specified paths
- Examples:
>>> save_update_mean_ls(inhabit, ip, 2015)
# CSV files updated with new mean_ls values for 2015
- inhabit_matrix.set_occupation(dwell_row: Tuple[Any, ...], new_occ: float, dwell_stock: DataFrame, ip: Dict[str, Any]) Tuple[DataFrame, float]
Set dwelling occupation value and calculate leftover occupancy.
Attempts to place new occupants into a dwelling configuration. If capacity exists (dwells > current_occupied + new_occ), occupancy is increased normally. If capacity is insufficient, a negative leftover is returned, representing unplaced occupants for reallocation.
- Args:
dwell_row: Tuple of dwelling characteristics identifying the specific dwelling configuration to update
new_occ: Number of additional occupants to place (positive value)
dwell_stock: Indexed dataframe with dwelling capacity and occupancy
ip: Configuration dictionary with dwelling column names
- Returns:
- Tuple containing:
dwell_stock: Updated dataframe with modified occupied/vacated
leftovers: Excess occupancy if insufficient capacity (positive means space available, negative means excess demand)
- Examples:
>>> ds, leftovers = set_occupation(dwell_row, 100, ds, ip)
>>> leftovers <= 0  # No excess occupants if successful
True
- inhabit_matrix.update_mean_ls_inh(ip: Dict[str, Any]) int
Update mean living space in inhabit matrix from household/dwelling stocks.
Fills in missing mean living space values in the inhabit matrix by averaging household and dwelling mean living space estimates. Uses stored stock CSV files that were previously calculated with empirical weighting.
- Args:
- ip: Configuration dictionary with:
mean_ls_inh: Path to inhabit mean_ls CSV to update
mean_ls_hh: Path to household mean_ls CSV
mean_ls_dwell: Path to dwelling mean_ls CSV
- Returns:
int: Number of updated entries (0 if no updates needed, indicating all matrix entries already have mean_ls values)
- Raises:
FileNotFoundError: If CSV files don’t exist
IOError: If CSV files cannot be read or written
- Examples:
>>> updated = update_mean_ls_inh(ip)
>>> updated >= 0
True
- inhabit_matrix.yearly_inhabit(ip: Dict[str, Any], df_household: DataFrame, df_dwelling: DataFrame, year: int, calc_mor: bool) None
Process single year’s household and dwelling data into inhabit matrix.
Main orchestration function for creating the annual inhabit matrix. Uses parallel processing to simultaneously extract household and dwelling characteristics, then merges them into a sparse matrix indexed by household-dwelling combination. Applies region filter to ensure geographic consistency and calculates weighting adjustments for missing data.
- Args:
ip: Configuration dictionary with processing functions and paths
df_household: Raw household dataframe for current year from SOEP
df_dwelling: Raw dwelling dataframe for current year from SOEP
year: Current year for processing
calc_mor: Boolean flag to trigger move-out rate calculations and projection if True (typically only for model_start_year)
- Returns:
None. Saves inhabit matrix and mean_ls statistics to disk. Triggers inh_projection() if calc_mor is True.
- Raises:
ValueError: If merge produces empty dataframe (no household-dwelling matches)
IOError: If output files cannot be written
- Examples:
>>> yearly_inhabit(ip, df_hh, df_dwell, 2015, calc_mor=False)
# 2015 inhabit matrix created and saved to disk
Module: Allocation Logic
This module handles the complex logic of allocating households to dwellings based on various preferences and constraints.
Create a matrix of move-in preferences per household category.
The matrix has the same dimensions as the inhabit matrix. Each row represents one household category, and each column represents one dwelling category. The cells in each row sum to 1. A cell value is the probability that a household with the given configuration (the row) wants to move into the dwelling type of the respective column.
A variant for generating the preference matrix must be chosen in the input sheet “inputs.csv”.
Currently available variant:
- “current_quintile”: the preferences of the household category are equal to the distribution of households in the inhabit matrix in the same household category (including the same quintile)
Variants in the future:
- “quintile_above”: the preference of the household category in quintile qx is equal to the distribution of households in the inhabit matrix of the same household category but in the above quintile (q{x+1})
- “highest_quintile”: the preference of the household category in quintile x is equal to the distribution of households in the inhabit matrix of the same household category but in the highest quintile (q5)
- “avg_highest_current_quintile”: the preference of the household category in quintile qx is equal to the average of a) the distribution of hh in the inhabit matrix of the same hh category and b) the distribution of hh in the inhabit matrix in the same hh category but in the highest quintile ((qx+q5)/2)
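The “current_quintile” variant boils down to normalizing the inhabit weights within each household category so that every row sums to 1. A minimal sketch, assuming the inhabit data is held as a plain dictionary keyed by (household category, dwelling category); the function name and the sample keys are hypothetical:

```python
from collections import defaultdict


def current_quintile_preferences(inhabit: dict) -> dict:
    """Turn weights keyed by (household, dwelling) into per-household probabilities."""
    totals = defaultdict(float)
    for (hh, _dwell), weight in inhabit.items():
        totals[hh] += weight
    # Household categories without any weight are dropped rather than divided by zero.
    return {
        (hh, dwell): weight / totals[hh]
        for (hh, dwell), weight in inhabit.items()
        if totals[hh] > 0
    }


inhabit = {
    (("2", "q3"), ("MFH", "3")): 30.0,
    (("2", "q3"), ("SFH", "4")): 10.0,
    (("1", "q1"), ("MFH", "2")): 5.0,
}
prefs = current_quintile_preferences(inhabit)
```

Here the two-person q3 households prefer the three-room MFH dwelling with probability 0.75 and the four-room SFH dwelling with probability 0.25, mirroring where they currently live.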
- scripts.allocation.get_all_cases(ip: Dict[str, Any]) Dict[str, Dict[str, Dict[str, Any]]]
Define all dwelling attribute modification cases and their handlers.
Creates a configuration dictionary specifying how to modify different dwelling attributes (condition, rooms, ownership, building_type) during the allocation search process. Each attribute has handlers for different modification strategies.
- Args:
ip: Input parameters dictionary (not currently used but kept for consistency).
- Returns:
Nested dictionary structure:
- First level: attribute name (e.g., “rooms”, “ownership”)
- Second level: modification strategy (e.g., “other”, “default”, “oscillate_larger”)
- Third level: handler function and arguments for that strategy
“handler”: function to call for modification
“arg1”, “arg2”: arguments to pass to handler (may be placeholders)
- Notes:
“other” strategy uses other_handler to toggle between two values
“default” strategy uses standard_handler to set a specific value
Oscillate strategies use special logic in get_args_and_handler
Some arg values are placeholders (-1, -2) filled in later
- scripts.allocation.get_alloc_dwell_order(ip: Dict[str, Any], needed_dwellings: Tuple[str, ...], ds, hh_configuration: Tuple[str, ...], alloc_limiter: int) Tuple[Any, bool]
Find available dwelling by searching through dwelling attribute modifications.
Searches for an available dwelling for a household by systematically modifying the preferred dwelling configuration. Follows a priority order specified in the input parameters, trying different combinations of dwelling attributes until a match with available space is found.
- Args:
ip: Input parameters dictionary. Must contain keys like ‘alloc_dwell_prio_1_feature’, ‘alloc_dwell_prio_1_feature_order’, etc., as well as ‘max_rooms’ and ‘dwelling_ownership_short’.
needed_dwellings: Preferred dwelling configuration tuple.
ds: Dwelling stock dictionary mapping dwelling tuple to stock info. Each entry has ‘vacated’ (available units) and ‘occupied’ counts.
hh_configuration: Household configuration tuple (size, type, quintile, etc.).
alloc_limiter: Maximum allowed underoccupation (rooms - household_size).
- Returns:
Tuple of (dwelling_configuration, use_large_uo):
- dwelling_configuration: Found dwelling tuple, or -1 if none found
- use_large_uo: Boolean indicating if large underoccupation was needed
- Notes:
Checks underoccupation limits before accepting a dwelling
Tries many combinations by modifying features in priority order
Caches underoccupation calculations for performance
Returns -1 if no suitable dwelling found (household becomes homeless)
May allow large underoccupation as last resort to avoid homelessness
- scripts.allocation.get_args_and_handler(ip: Dict[str, Any], all_cases: Dict[str, Dict[str, Dict[str, Any]]], dwell_var: str, dwell_order: str, changed_dwellings: Tuple[str, ...], params: Dict[str, Any]) Tuple[Callable, Any, Any, Dict[str, Any]]
Get handler function and arguments for modifying a dwelling attribute.
Retrieves the appropriate handler function and its arguments based on the dwelling variable being modified and the modification strategy. Handles special cases for rooms (oscillation) and ownership (multiple values).
- Args:
ip: Input parameters dictionary. May need ‘dwelling_ownership_short’ key.
all_cases: Dictionary from get_all_cases() containing all handler configs.
dwell_var: Name of dwelling attribute to modify (e.g., “rooms”, “ownership”).
dwell_order: Modification strategy (e.g., “other”, “oscillate_larger”).
changed_dwellings: Current dwelling configuration tuple.
params: Parameters dict tracking iteration state. Modified in place. Relevant keys: ‘oscillator’, ‘ownership’, ‘room_all_args’, ‘osci_i’, ‘own_i’.
- Returns:
Tuple of (handler_function, arg1, arg2, updated_params). Handler function can be called as handler(arg1, arg2, dwelling_list).
- Notes:
Handles placeholder values (-1, -2) by replacing with actual values
Sets flags in params dict for special iteration modes (oscillator, ownership)
For rooms: generates oscillation list on first call, reuses on subsequent calls
- scripts.allocation.get_move_in_want(pref_v: DataFrame, ip: Dict[str, Any], hh_stock: DataFrame, inhabit_v: DataFrame, alloc_rate_it: List[int]) Tuple[DataFrame, DataFrame, DataFrame]
Calculate household move-in desires based on preferences and search rates.
Combines household preferences with household search behavior to determine how many households of each type want to move into each dwelling type. Splits the total demand across multiple allocation iterations.
- Args:
pref_v: Preference vector with ‘weights_pref’ column (probabilities 0-1).
ip: Input parameters dictionary. Required keys: ‘cols_dwell’, ‘cols_hh’, ‘col_weights’, ‘to_group_col’.
hh_stock: Household stock DataFrame with ‘search’ column (number searching).
inhabit_v: Inhabit vector with ‘weights’ and ‘mean_ls’ columns.
alloc_rate_it: List of percentages for each iteration (e.g., [20, 20, 20, 20, 20]).
- Returns:
Tuple of (move_in_want_v, move_in_want_v_save, move_in_v):
- move_in_want_v: Move-in wants split by iteration, ordered by priority
- move_in_want_v_save: Copy for saving/analysis
- move_in_v: Empty DataFrame to accumulate actual allocations
- Notes:
Multiplies preferences by search counts to get absolute demand
Splits demand across iterations based on alloc_rate_it percentages
Rounds to whole numbers and ensures sum equals original total
Filters out zero-demand entries for efficiency
Orders final output according to allocation priority settings
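Splitting a whole-number demand across iterations while keeping the sum intact requires some rule for the rounding remainders. The sketch below uses the largest-remainder method, which is an assumption chosen for illustration; the documentation does not state which rounding rule get_move_in_want applies, and split_demand is a hypothetical helper.

```python
def split_demand(total: int, rates: list) -> list:
    """Split an integer demand by percentage rates so the parts sum to total."""
    if sum(rates) != 100:
        raise ValueError("rates must sum to 100")
    exact = [total * r / 100 for r in rates]
    parts = [int(x) for x in exact]  # round down first
    remainder = total - sum(parts)
    # Hand the leftover units to the iterations with the largest fractional parts;
    # ties go to earlier iterations because sorted() is stable.
    order = sorted(range(len(rates)), key=lambda i: exact[i] - parts[i], reverse=True)
    for i in order[:remainder]:
        parts[i] += 1
    return parts


# 13 searching households spread over five equal iterations.
parts = split_demand(13, [20, 20, 20, 20, 20])
```

With this rule, 13 households over five 20% iterations become 3, 3, 3, 2, and 2, so no demand is lost to rounding.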
- scripts.allocation.get_osciallating_vals(changed_dwellings: Tuple[str, ...], dwell_order: str, ip: Dict[str, Any]) List[List[str]]
Generate room size combinations in oscillating order for allocation search.
Creates a list of [current_room, target_room] pairs that alternate between larger and smaller room sizes. This is used when searching for available dwellings: if the preferred room size is unavailable, try slightly larger, then slightly smaller, then even larger, etc.
- Args:
changed_dwellings: Tuple of dwelling attributes, must include a room size.
dwell_order: Order strategy - either ‘oscillate_larger’ (prefer up first) or ‘oscillate_lower’ (prefer down first).
ip: Input parameters dictionary. Required key: ‘max_rooms’.
- Returns:
List of [source_room, target_room] pairs in oscillating order. For example, if current room is “3” and oscillate_larger: [[“3”, “4”], [“3”, “2”], [“3”, “5”], [“3”, “1”], [“3”, “6”], …]
- Example:
>>> ip = {'max_rooms': 7}
>>> get_osciallating_vals(("urban", "SFH", "3"), "oscillate_larger", ip)
[['3', '4'], ['3', '2'], ['3', '5'], ['3', '1'], ['3', '6'], ['3', '7+']]
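The oscillating order can be produced by stepping outward from the current room count and alternating direction at each step. The sketch below reproduces the output of the example above under two assumptions that the documentation does not confirm: the room label is passed in directly instead of being read from the dwelling tuple, and the largest bucket is labelled with a trailing "+". The function name oscillating_room_pairs is hypothetical.

```python
def oscillating_room_pairs(current: str, max_rooms: int, larger_first: bool = True) -> list:
    """List [current, target] pairs, moving one room further away at each step."""
    cur = int(current.rstrip("+"))
    pairs = []
    for step in range(1, max_rooms):
        up, down = cur + step, cur - step
        for cand in ((up, down) if larger_first else (down, up)):
            if 1 <= cand <= max_rooms:  # skip sizes outside the valid range
                label = f"{cand}+" if cand == max_rooms else str(cand)
                pairs.append([current, label])
    return pairs


pairs = oscillating_room_pairs("3", max_rooms=7)
```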
- scripts.allocation.get_quintile_limits(df: DataFrame, alloc_pref_limit: str) Dict[str, float]
Calculate quintile-specific limits for preference allocation.
Determines what fraction of each income quintile should receive special preference treatment (e.g., anti-underoccupation preferences). Can be configured to apply different percentages to different quintiles.
- Args:
df: DataFrame with ‘income_quintile’ index level and ‘weights_pref’ column.
alloc_pref_limit: Either “default” (100% for all quintiles) or a string specifying custom limits (e.g., “q5: 80, q4: 60, q3: 40, q2: 20, q1: 20”).
- Returns:
Dictionary mapping quintile names (e.g., “q5”, “q4”) to their limit values. Limits represent the total preference weight allocated to that quintile.
- Example:
>>> df = pd.DataFrame({'weights_pref': [100, 80, 60, 40, 20]},
...                   index=pd.Index(['q5', 'q4', 'q3', 'q2', 'q1'],
...                                  name='income_quintile'))
>>> limits = get_quintile_limits(df, "q5: 100, q4: 75, q3: 50, q2: 25, q1: 25")
>>> limits['q4']  # Returns 80 * 0.75 = 60.0
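Parsing the limit string and applying it to per-quintile totals might look like the sketch below. It works on a plain dictionary of totals instead of a DataFrame, and parse_pref_limits is a hypothetical helper, so treat it as an illustration of the arithmetic in the example above and not as the function's actual code.

```python
def parse_pref_limits(spec: str, quintiles: list) -> dict:
    """Return the share (0-1) per quintile; 'default' means 100% everywhere."""
    if spec.strip() == "default":
        return {q: 1.0 for q in quintiles}
    shares = {}
    for item in spec.split(","):
        name, _, pct = item.partition(":")
        shares[name.strip()] = float(pct) / 100
    missing = [q for q in quintiles if q not in shares]
    if missing:
        raise ValueError(f"no limit given for {missing}")
    return shares


# Total preference weight per quintile, matching the example above.
totals = {"q5": 100.0, "q4": 80.0, "q3": 60.0, "q2": 40.0, "q1": 20.0}
shares = parse_pref_limits("q5: 100, q4: 75, q3: 50, q2: 25, q1: 25", list(totals))
limits = {q: totals[q] * shares[q] for q in totals}
```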
- scripts.allocation.get_regtype(df: DataFrame, hh_dwell: str = 'dwell') DataFrame
Filter DataFrame to a specific region type (urban or rural).
Extracts rows corresponding to a specific region type and resets the index to the specified grouping columns.
- Args:
df: DataFrame with region type information.
regtyp: Region type to filter for (e.g., “urban” or “rural”).
hh_dwell: Suffix for the region_type column name (default “dwell”). Column name will be f"region_type_{hh_dwell}".
- Returns:
Filtered DataFrame indexed by to_group columns, containing only rows matching the specified region type.
- Example:
>>> cols = ["hh_size", "rooms", "building_type"]
>>> df_urban = get_regtype(df, "urban", cols, "dwell")
- scripts.allocation.get_underoccupation(ip: Dict[str, Any], needed_dwellings: Tuple[str, ...], hh_configuration: Tuple[str, ...]) int
Calculate underoccupation level for a household-dwelling match.
Underoccupation is defined as the number of rooms minus the household size. Positive values indicate underoccupation (more rooms than people), negative values indicate overcrowding (more people than rooms).
- Args:
ip: Input parameters dictionary. Required keys: ‘max_rooms’, ‘max_hh_size’.
needed_dwellings: Dwelling configuration tuple containing room size.
hh_configuration: Household configuration tuple containing household size.
- Returns:
Integer representing underoccupation level (rooms - household_size). For example: 5 rooms - 3 people = 2 (underoccupied by 2 rooms).
- Example:
>>> ip = {'max_rooms': 7, 'max_hh_size': 5}
>>> get_underoccupation(ip, ("urban", "SFH", "4"), ("3", "single"))
1  # 4 rooms - 3 people = 1
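The underlying arithmetic is a plain difference between two labels. The sketch below takes the room and household-size labels directly instead of extracting them from the configuration tuples, and it treats an open-ended label such as "7+" as its lower bound; both choices are assumptions. The lru_cache decorator illustrates one way such repeated lookups could be cached.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def underoccupation(rooms_label: str, hh_size_label: str) -> int:
    """Rooms minus household size; open-ended labels count as their lower bound."""
    rooms = int(rooms_label.rstrip("+"))
    hh_size = int(hh_size_label.rstrip("+"))
    return rooms - hh_size


under = underoccupation("4", "3")      # more rooms than people
crowded = underoccupation("2", "5+")   # more people than rooms
```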
- scripts.allocation.other_handler(option1: str, option2: str, changed_dwellings: List[str]) List[str]
Toggle between two options in a dwelling configuration list.
Swaps occurrences of option1 with option2 or vice versa, depending on which is currently present in the list. Used for toggling between binary dwelling attributes (e.g., “renovated” <-> “not renovated”).
- Args:
option1: First option value (e.g., “renovated”).
option2: Second option value (e.g., “not renovated”).
changed_dwellings: List of dwelling attributes to modify.
- Returns:
New list with the appropriate option toggled.
- Example:
>>> other_handler("SFH", "MFH", ["urban", "SFH", "3", "q3"])
['urban', 'MFH', '3', 'q3']
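A toggle of this kind can be written as a two-way lookup applied to each element. This is a sketch of the behaviour shown in the example, with the hypothetical name toggle_option; the actual handler may differ in how it treats lists that contain neither or both options.

```python
def toggle_option(option1: str, option2: str, config: list) -> list:
    """Return a new list with option1 and option2 swapped wherever they occur."""
    swap = {option1: option2, option2: option1}
    # Values that are neither option pass through unchanged.
    return [swap.get(value, value) for value in config]


original = ["urban", "SFH", "3", "q3"]
toggled = toggle_option("SFH", "MFH", original)
```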
- scripts.allocation.pref_current_quintile(inhabit_v: DataFrame, ip: Dict[str, Any], alloc_pref_limit: str) DataFrame
Create preference matrix from inhabit vector that focuses on the same quintile.
The whole household-configuration stays the same. The people want to move to the same places as they already live. This creates a preference matrix where each household category’s preferences match the current distribution of households in that same category.
- Args:
inhabit_v: Inhabit vector containing household weights and configurations. Must have a ‘weights’ column and be indexed by household/dwelling categories.
ip: Input parameters dictionary containing configuration settings. Required keys: ‘cols_dwell’, ‘col_weights’.
alloc_pref_limit: Allocation preference limit setting (not used in this variant but kept for consistency with other preference functions).
- Returns:
DataFrame containing preference weights for each household-dwelling combination. Index is the full household and dwelling configuration, with a single column ‘weights_pref’ containing the preference probability (0-1).
- Example:
>>> inhabit = pd.DataFrame({'weights': [100, 50, 25]},
...                        index=pd.MultiIndex.from_tuples([...]))
>>> ip = {'cols_dwell': ['rooms', 'building_type'], 'col_weights': 'weights'}
>>> prefs = pref_current_quintile(inhabit, ip, 'default')
- scripts.allocation.pref_no_underoccupation(inhabit_v: DataFrame, ip: Dict[str, Any], alloc_pref_limit: str) DataFrame
Calculate preference matrix that discourages underoccupation.
This function modifies “current_quintile” preferences so that households prefer dwellings with a maximum number of rooms of hh_size+1. Preferences for larger dwellings (which would lead to underoccupation) are redistributed to appropriately sized dwellings. The intensity of this preference is controlled by quintile-specific limits.
- Args:
inhabit_v: Inhabit vector containing household weights and configurations.
ip: Input parameters dictionary. Required keys: ‘cols_dwell’, ‘col_weights’, ‘max_hh_size’, ‘max_rooms’, ‘to_group_col’.
alloc_pref_limit: String specifying per-quintile limits (e.g., “q5: 100, q4: 75”). Controls what fraction of each quintile gets anti-underoccupation preferences.
- Returns:
DataFrame with preference weights redistributed to discourage underoccupation. Total preference weight is preserved (sums to same value as input).
- Notes:
For hh_size N, preferences for rooms > N+1 are moved to rooms = N+1.
Quintile limits determine what fraction of each quintile uses this preference.
Also handles overoccupation preferences (rooms < hh_size-1).
Final output merges modified and unmodified preferences based on limits.
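The redistribution rule in the Notes (preferences for rooms > N+1 move to rooms = N+1) can be sketched for a single household type as follows. This is a simplified illustration that ignores the quintile limits and the overoccupation handling; cap_room_preferences is a hypothetical name.

```python
from typing import Dict


def cap_room_preferences(prefs: Dict[int, float], hh_size: int) -> Dict[int, float]:
    """Move preference mass for rooms > hh_size + 1 onto rooms == hh_size + 1."""
    cap = hh_size + 1
    capped: Dict[int, float] = {}
    for rooms, share in prefs.items():
        target = min(rooms, cap)  # larger dwellings collapse onto the cap
        capped[target] = capped.get(target, 0.0) + share
    return capped


room_prefs = {2: 0.25, 3: 0.25, 4: 0.25, 5: 0.25}
capped_prefs = cap_room_preferences(room_prefs, hh_size=2)
```

The total preference weight is unchanged because mass is only moved between room counts, never dropped.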
- scripts.allocation.pref_q4_aspiration(inhabit_v: DataFrame, ip: Dict[str, Any], alloc_pref_limit: str) DataFrame
Calculate preference matrix for ‘quintile 4 aspiration’ setting.
This is a hybrid approach: all quintiles except q4 use “current_quintile” preferences, while q4 households adopt preferences from q5 (quintile above). This models a scenario where upper-middle class (q4) aspires to upper class (q5) living conditions.
- Args:
inhabit_v: Inhabit vector containing household weights and configurations.
ip: Input parameters dictionary. Required keys: ‘cols_dwell’, ‘col_weights’.
alloc_pref_limit: Allocation preference limit setting.
- Returns:
DataFrame with preference weights where only q4 households adopt q5 preferences, all others keep current quintile preferences.
- Notes:
Similar to pref_quintile_above but only affects quintile 4.
All q4 household preferences are replaced (no fractional adoption).
Avoids edge cases where distributions would be invalid.
- scripts.allocation.pref_quintile_above(inhabit_v: DataFrame, ip: Dict[str, Any], alloc_pref_limit: str) DataFrame
Calculate preference matrix for ‘quintile above’ setting with aspiration logic.
This function creates preferences where households in quintile qX aspire to live like households in quintile q(X+1). For a fraction of households in each quintile (except q5), their preferences are copied from the quintile above them. Additionally, 20% of q5 households are given shifted room preferences (larger rooms).
- Args:
inhabit_v: Inhabit vector containing household weights and configurations.
ip: Input parameters dictionary. Required keys: ‘cols_dwell’, ‘col_weights’, ‘quintile_fraction’, ‘max_rooms’.
alloc_pref_limit: Allocation preference limit setting.
- Returns:
DataFrame with preference weights where lower quintiles partially adopt preferences from higher quintiles, normalized to sum to 1.0 per household type.
- Notes:
Processes quintiles 1-4 (q5 is the reference quintile).
Only copies preferences for a fraction of households (specified by quintile_fraction).
Includes special logic to shift 20% of q5 households to prefer larger rooms.
Skips copying when current quintile has all households (100%) and above has none (0%).
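One way to read "a fraction of households copy preferences from the quintile above" is as a convex blend of the two preference distributions. The sketch below shows that interpretation only; it is an assumption about the arithmetic, not the confirmed implementation, and it leaves out the q5 room shift and the skip rule.

```python
from typing import Dict


def blend_with_above(
    own: Dict[str, float], above: Dict[str, float], fraction: float
) -> Dict[str, float]:
    """Mix a quintile's preferences with those of the quintile above it."""
    keys = set(own) | set(above)
    return {
        key: (1.0 - fraction) * own.get(key, 0.0) + fraction * above.get(key, 0.0)
        for key in keys
    }


q3_prefs = {"SFH": 1.0, "MFH": 0.0}
q4_prefs = {"SFH": 0.5, "MFH": 0.5}
blended = blend_with_above(q3_prefs, q4_prefs, fraction=0.5)
```

If both inputs sum to 1, the blend also sums to 1, so no separate normalization step is needed in this simplified case.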
- scripts.allocation.standard_handler(check_value: str, check_list: List[str], changed_dwellings: List[str]) List[str]
Replace values from check_list with check_value in dwelling configuration.
Used for changing dwelling attributes to a specific value. Replaces any element in changed_dwellings that appears in check_list with check_value.
- Args:
check_value: Value to insert (e.g., “4” for 4 rooms).
check_list: List of values to replace (e.g., [“3”, “5”] for other room sizes).
changed_dwellings: List of dwelling attributes to modify.
- Returns:
New list with matching values replaced by check_value.
- Example:
>>> standard_handler("4", ["3", "5"], ["urban", "SFH", "3", "q3"])
['urban', 'SFH', '4', 'q3']
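A minimal sketch of the replacement behaviour described above. replace_values is a hypothetical stand-in for standard_handler and only mirrors the documented behaviour.

```python
from typing import List


def replace_values(check_value: str, check_list: List[str], attributes: List[str]) -> List[str]:
    """Return a copy of attributes with every check_list member set to check_value."""
    to_replace = set(check_list)
    return [check_value if value in to_replace else value for value in attributes]


changed = replace_values("4", ["3", "5"], ["urban", "SFH", "3", "q3"])
unchanged = replace_values("4", ["3", "5"], ["urban", "SFH", "4", "q3"])
```

A list that already holds the target value comes back unchanged, so the operation is idempotent.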
Module: Census Calibration
Handles the process of calibrating simulation data to match German Census 2022 targets using Iterative Proportional Fitting (IPF).
Census Calibration Module for Housing Data.
This module provides comprehensive calibration functionality for aligning housing and household data with census statistics. It implements Iterative Proportional Fitting (IPF) algorithms to adjust survey weights to match known population totals across multiple dimensions (building type, ownership, rooms, condition).
The calibration process handles vacancy adjustments and supports multi-year calibration with factor reuse for computational efficiency.
- Typical usage example:
from scripts import census_calibration
from scripts import misc
ip = misc.load_input_parameters()
inh = load_inhabit_data()
census_calibration.calibrate(ip, inh)
- scripts.census_calibration.apply_calibration_factors(df: DataFrame, factor_lookup: Dict[Tuple, float], ip: Dict[str, Any], use: str) DataFrame
Apply pre-computed calibration factors to a DataFrame.
This function provides fast calibration by applying pre-calculated factors instead of running the full IPF algorithm. It’s used for years other than the base year.
The function applies the total factor to all weights, then rescales if necessary to ensure the total matches exactly.
- Args:
df: DataFrame to calibrate
factor_lookup: Dictionary of calibration factors from base year, must include ‘total’ key
ip: Input parameters dictionary (for logging)
use: Data type indicator:
‘evidence’: Inhabit evidence data
‘hh_stock’: Household stock data
‘housing_model’: Housing model dwelling stock
- Returns:
Calibrated DataFrame with adjusted weight columns.
- Raises:
SystemExit: If rescaling fails to achieve correct total
- Note:
The function applies a uniform ‘total’ factor rather than individual group factors. If sum changes unexpectedly, it applies a rescaling correction.
- scripts.census_calibration.calibrate_dataset(df: DataFrame, ip: Dict[str, Any], weight_col: str = 'weights') DataFrame
Main calibration method using Iterative Proportional Fitting (IPF).
This is the core calibration function that:
1. Loads census target totals
2. Prepares data as contingency table
3. Runs IPF algorithm separately for rural and urban regions
4. Converts results back to original DataFrame format
The IPF algorithm adjusts survey weights to match known population totals across multiple dimensions while preserving the internal structure of the data.
- Args:
- df: DataFrame to calibrate with columns:
region_type_dwell: ‘rural’ or ‘urban’
building_type: ‘MFH’ or ‘SFH’
ownership: ‘private owner’ or ‘private tenant’
rooms: ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7+’
condition: ‘renovated’ or ‘not renovated’
weight_col: Column name containing weights to calibrate
- ip: Input parameters dictionary containing:
census_no_vac_path: Path to census without vacancies (for inhabit)
census_no_vac_model_path: Path to census for modeling (for dwells)
weight_col: Name of column containing weights (‘weights’ or ‘dwells’)
- Returns:
Calibrated DataFrame with adjusted weight_col values matching census totals.
- Note:
Uses census_no_vac_path for weight_col=’weights’ (inhabit data)
Uses census_no_vac_model_path for weight_col=’dwells’ (dwelling stock)
- scripts.census_calibration.calibrate_ds_hm(dwell_stock: DataFrame, year: int, ip: Dict[str, Any], factor_lookup: Optional[Dict[Tuple, float]] = None) Dict[Tuple, float]
Calibrate dwelling stock from housing model to census totals.
This function calibrates dwelling stock data (as opposed to inhabit data) using the same IPF methodology. It only operates in “calculate” mode since dwelling stock factors are typically calculated once for the base year.
- Args:
- dwell_stock: DataFrame with dwelling stock data including:
building_type: ‘MFH’ or ‘SFH’
condition: ‘renovated’ or ‘not renovated’
region_type_dwell: ‘rural’ or ‘urban’
dwells: Number of dwellings
year: Year to calibrate (must be base year)
ip: Input parameters dictionary containing:
empirical_end_year: Base year for calibration
debug: Flag for debug output
factor_lookup: Must be None (this function only calculates factors)
- Returns:
Dictionary of calibration factors mapping group keys to float factors.
- Raises:
ValueError: If year is not base year or factor_lookup is not None
- scripts.census_calibration.calibrate_inhabit(year: int, ip: Dict[str, Any], factor_lookup: Optional[Dict[Tuple, float]] = None) Dict[Tuple, float]
Calibrate inhabit (household-dwelling) data to Census 2022 targets.
This function implements two modes of operation:
1. Calculate mode (factor_lookup=None, year=base_year):
Runs full IPF calibration
Calculates and returns calibration factors
2. Apply mode (factor_lookup provided):
Applies pre-calculated factors to data
Maintains computational efficiency for multiple years
The function also adjusts ‘search’ and ‘stay’ columns proportionally to maintain their relationship to total weights.
- Args:
year: Year of inhabit data to calibrate
ip: Input parameters dictionary containing:
inhabit_evidence_path: Path template for inhabit CSV files
empirical_end_year: Base year for factor calculation
to_group_col: Columns for grouping households
debug: Flag for debug output
analysis_evidence: Path for debug output
- factor_lookup: Optional pre-calculated calibration factors.
If None and year==base_year, factors will be calculated.
- Returns:
Dictionary of calibration factors mapping group keys to float factors. Only returned when calculating factors (base year, factor_lookup=None).
- Raises:
FileNotFoundError: If inhabit file for specified year doesn’t exist
ValueError: If factor calculation requested for wrong year
- Side Effects:
Saves calibrated inhabit data to factorized evidence path
Writes debug information if enabled
- scripts.census_calibration.calibrate_with_factor_reuse(years: List[int], ip: Dict[str, Any]) Dict[Tuple, float]
Calibrate multiple years using factor reuse strategy.
This is the main workflow function that implements an efficient multi-year calibration strategy:
1. Calculate calibration factors from the most recent (base) year
2. Save these factors for future use
3. Apply the same factors to all other years
This approach assumes that the relationship between survey data and census totals is relatively stable across years, allowing factor reuse.
- Args:
- years: List of years to calibrate. The last year is used as base year
for calculating factors. Example: [2020, 2021, 2022]
- ip: Input parameters dictionary containing:
cal_factors_file: Path template for factor files
cal_factors_evidence: Path template for evidence factor files
- Returns:
Dictionary of calibration factors calculated from base year. Keys are tuples of group identifiers, values are float factors.
- Side Effects:
Saves calibration factors to disk
Creates calibrated inhabit files for all years
- Example:
If years = [2020, 2021, 2022]:
- Calculates factors using 2022 data
- Applies these factors to 2020, 2021, and 2022
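The factor reuse workflow can be sketched in a few lines. In this simplified illustration the per-group factor is just target divided by observed weight in the base year, standing in for the full IPF run; reuse_factors and all numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def reuse_factors(
    years: List[int], data: Dict[int, Weights], targets: Weights
) -> Tuple[Weights, Dict[int, Weights]]:
    """Derive factors from the last year in `years` and apply them to every year."""
    base = data[years[-1]]
    # Simplification: one factor per group instead of a full IPF calibration.
    factors = {group: targets[group] / weight for group, weight in base.items() if weight > 0}
    calibrated = {
        year: {group: weight * factors.get(group, 1.0) for group, weight in data[year].items()}
        for year in years
    }
    return factors, calibrated


survey = {
    2021: {"MFH": 90.0, "SFH": 180.0},
    2022: {"MFH": 100.0, "SFH": 200.0},
}
census_targets = {"MFH": 110.0, "SFH": 190.0}
year_factors, calibrated_years = reuse_factors([2021, 2022], survey, census_targets)
```

Only the base year is guaranteed to hit the targets; earlier years inherit the same correction under the stability assumption stated above.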
- scripts.census_calibration.census_no_vacated(ip: Dict[str, Any]) None
Create census datasets with vacancy adjustments.
Processes raw census data to create two versions:
1. Complete vacancy removal (for historical backward compatibility)
2. Partial vacancy removal (excluding long-term vacancies, for modeling)
Vacancies are removed proportionally across building types, ownership types, and room counts based on their share of total dwellings.
- Args:
- ip: Input parameters dictionary containing:
census_file_path: Path to processed census file
census_dwellings_path: Path to raw census Excel file
census_rural_urban_path: Path to rural/urban classification
census_no_vac_path: Output path for census without all vacancies
census_no_vac_model_path: Output path for census for modeling
- Returns:
None. Creates two census CSV files with vacancy adjustments.
- Side Effects:
Creates or loads processed census file
Creates two vacancy-adjusted census files
- Note:
LEERSTAND_PART is currently set to 0 (all vacancy removal disabled). Uncomment calculation to enable partial vacancy removal.
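Proportional vacancy removal, as described for this function, can be illustrated with a single dimension. The sketch assumes each category loses vacancies in proportion to its share of all dwellings; remove_vacancies and the counts are illustrative.

```python
from typing import Dict


def remove_vacancies(counts: Dict[str, float], vacancies: float) -> Dict[str, float]:
    """Subtract vacancies from each category according to its share of the total."""
    total = sum(counts.values())
    if total == 0:
        return dict(counts)
    return {
        category: count - vacancies * count / total
        for category, count in counts.items()
    }


dwellings = {"SFH": 600.0, "MFH": 400.0}
occupied = remove_vacancies(dwellings, vacancies=50.0)
```

The shares between categories stay the same after removal; only the overall total shrinks by the number of vacancies.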
- scripts.census_calibration.convert_back_to_dataframe(calibrated_contingency: Series, original_df: DataFrame, dwellsweights: str, ip: Dict[str, Any]) DataFrame
Convert calibrated contingency table back to original DataFrame format.
After IPF calibration, we have a contingency table (aggregated by groups). This function distributes the calibrated totals back to individual rows in the original DataFrame, preserving the relative proportions within each group.
The process:
1. For each row in original data, find its group in contingency table
2. Calculate row’s proportion within its group
3. Assign calibrated_total × proportion to the row
- Args:
- calibrated_contingency: Series with calibrated weights indexed by
(region, building_type, ownership, condition, rooms)
original_df: Original DataFrame with individual records
dwellsweights: Name of weight column (‘weights’ or ‘dwells’)
- Returns:
DataFrame with same structure as original_df but with calibrated weights.
- Example:
If a group had original total 100 split as [60, 40] and calibrated total is 120, the result will be [72, 48] (same 60:40 ratio).
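The proportional redistribution in this example can be reproduced with a short helper for a single group. distribute is a hypothetical name, and the zero-total fallback is an assumption about how an empty group should behave.

```python
from typing import List


def distribute(weights: List[float], calibrated_total: float) -> List[float]:
    """Split calibrated_total across rows in proportion to their original weights."""
    group_total = sum(weights)
    if group_total == 0:
        # Assumption: an empty group receives no weight.
        return [0.0 for _ in weights]
    return [weight * calibrated_total / group_total for weight in weights]


rows = distribute([60.0, 40.0], calibrated_total=120.0)
```

The 60:40 ratio of the original rows is preserved while the group total moves from 100 to 120.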
- scripts.census_calibration.get_dwell_stock_factors(ip: Dict[str, Any], year: int, dwell_stock: DataFrame) Dict[Tuple, float]
Calculate or load calibration factors for dwelling stock data.
This function manages the calibration factor workflow for dwelling stock:
- Checks if factors already exist for the base year
- If not, calculates new factors using IPF calibration
- Saves factors for future reuse
- Args:
- ip: Input parameters dictionary containing:
empirical_end_year: Base year for calibration
cal_factors_file: Path template for factor files
dwelling_stock: Source of dwelling stock data (must be ‘housing_model’)
debug: Flag for debug output
analysis_generated: Path for debug output
year: Year to calibrate
dwell_stock: DataFrame with dwelling stock data including:
building_type: ‘MFH’ or ‘SFH’
condition: ‘renovated’ or ‘not renovated’
dwells: Number of dwellings
- Returns:
Dictionary mapping group keys to calibration factors. Keys are tuples of (building_type, condition). Also includes ‘total’ key with overall scaling factor.
- Raises:
SystemExit: If dwelling_stock type is not ‘housing_model’
- scripts.census_calibration.get_row_wise_factor(calibrated_df: DataFrame, original_df: DataFrame, ip: Dict[str, Any], year: int, use: str, use_weights: bool = True) Dict[Tuple, float]
Calculate calibration factors from original and calibrated data.
Creates a lookup dictionary mapping group identifiers to calibration factors. These factors can be applied to other years’ data, avoiding expensive re-calibration.
Factor = Calibrated_Weight / Original_Weight for each group
- Args:
calibrated_df: DataFrame with calibrated weights
original_df: DataFrame with original weights
ip: Input parameters dictionary containing:
to_group_col: Columns for grouping households (evidence/household_stock)
cols_hh: Columns for household grouping
cols_dwell: Columns for dwelling grouping
year: Year being calibrated (for logging)
use: Data type indicator:
‘evidence’: Inhabit evidence data
‘household_stock’: Household stock data
‘housing_model’: Dwelling stock from housing model
use_weights: Whether to use ‘weights’ (True) or ‘inhabits’ (False) column
- Returns:
Dictionary mapping group keys to calibration factors. Keys are tuples of group identifiers. Also includes ‘total’ key with overall scaling factor.
- Example:
{
(‘MFH’, ‘private owner’, …): 1.05,
(‘SFH’, ‘private tenant’, …): 0.98,
‘total’: 1.02
}
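A minimal sketch of how such a lookup could be built from aggregated group weights, following Factor = Calibrated_Weight / Original_Weight. The function name and the two-element group keys are illustrative; the real keys depend on the grouping columns in ip.

```python
import math
from typing import Dict, Hashable


def group_factors(
    original: Dict[Hashable, float], calibrated: Dict[Hashable, float]
) -> Dict[Hashable, float]:
    """Compute calibrated/original per group plus an overall 'total' factor."""
    factors: Dict[Hashable, float] = {
        group: calibrated[group] / weight
        for group, weight in original.items()
        if weight > 0  # groups with zero original weight have no defined factor
    }
    factors["total"] = sum(calibrated.values()) / sum(original.values())
    return factors


before = {("MFH", "private owner"): 100.0, ("SFH", "private tenant"): 200.0}
after = {("MFH", "private owner"): 105.0, ("SFH", "private tenant"): 196.0}
lookup = group_factors(before, after)
```

Skipping zero-weight groups avoids a division by zero; how the real function treats such groups is not stated here.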
- scripts.census_calibration.inject_condition_to_census(inh: DataFrame, base_year: int, ip: Dict[str, Any], census_file: str) None
Inject building condition totals into census data.
This function performs two key operations:
1. Calibrates inhabit weights to match census building type totals
2. Adds building condition (‘renovated’/‘not renovated’) totals to census
The census data lacks information about building condition, but this is needed for calibration. This function uses the inhabit data (already calibrated to building types) to estimate condition totals.
- Args:
- inh: Inhabit DataFrame with columns:
region_type_dwell: ‘rural’ or ‘urban’
building_type: ‘MFH’ or ‘SFH’
condition: ‘renovated’ or ‘not renovated’
weights: Survey weights
base_year: Year for calibration (currently not used but kept for API consistency)
ip: Input parameters dictionary
census_file: Path to census CSV file to be modified
- Returns:
None. Modifies census_file in place by adding ‘renovated’ and ‘not renovated’ columns.
- Side Effects:
Modifies inhabit weights to match census building type totals
Updates census CSV file with condition columns
- scripts.census_calibration.ipf_calibrate_region(region_data: Series, region: str, targets: Dict[str, Dict[str, float]], disaggs: Dict[str, List[str]]) Series
Apply Iterative Proportional Fitting (IPF) calibration for a region.
IPF is an iterative algorithm that adjusts cell values in a multi-dimensional table to match known marginal totals. The algorithm:
Takes initial survey weights in cells
Adjusts weights to match target totals for each dimension
Repeats until convergence (weights stable) or max iterations reached
The order of adjustments: building_type → ownership → condition → rooms
- Args:
- region_data: Series of weights indexed by (building_type, ownership,
condition, rooms) for a single region
region: Region name (‘rural’ or ‘urban’) for logging
targets: Dictionary of target totals for each dimension:
{
‘building_type’: {‘MFH’: 1000, ‘SFH’: 2000},
‘ownership’: {‘private owner’: 1500, ‘private tenant’: 1500},
‘condition’: {‘renovated’: 800, ‘not renovated’: 2200},
‘rooms’: {‘1’: 100, ‘2’: 200, …}
}
disaggs: Dictionary defining valid categories for each dimension:
{
‘building_type’: [‘MFH’, ‘SFH’],
‘ownership’: [‘private owner’, ‘private tenant’],
…
}
- Returns:
Calibrated Series with same structure as region_data but adjusted values that match target marginal totals.
- Note:
Convergence tolerance: 1e-6 (max change in any cell)
Max iterations: 50
Adjustments are multiplicative: new_value = old_value × (target/current)
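The IPF loop described here can be sketched in plain Python on a small two-dimensional table. The sketch assumes the table is a dictionary keyed by category tuples; ipf_sketch and the toy numbers are illustrative, while the tolerance and iteration cap follow the Note.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, ...]


def ipf_sketch(
    cells: Dict[Cell, float],
    dims: List[str],
    targets: Dict[str, Dict[str, float]],
    tol: float = 1e-6,
    max_iter: int = 50,
) -> Dict[Cell, float]:
    """Scale cells until each dimension's marginal totals match the targets."""
    current = dict(cells)
    for _ in range(max_iter):
        previous = dict(current)
        for axis, dim in enumerate(dims):
            # Current marginal total for every category of this dimension.
            totals: Dict[str, float] = {}
            for key, value in current.items():
                totals[key[axis]] = totals.get(key[axis], 0.0) + value
            # Multiplicative update: new = old * (target / current).
            for key in current:
                if totals[key[axis]] > 0:
                    current[key] *= targets[dim][key[axis]] / totals[key[axis]]
        # Stop once no cell moved by more than the tolerance.
        if max(abs(current[k] - previous[k]) for k in current) < tol:
            break
    return current


table = {
    ("MFH", "owner"): 10.0,
    ("MFH", "tenant"): 30.0,
    ("SFH", "owner"): 40.0,
    ("SFH", "tenant"): 20.0,
}
marginals = {
    "building_type": {"MFH": 50.0, "SFH": 50.0},
    "ownership": {"owner": 60.0, "tenant": 40.0},
}
fitted = ipf_sketch(table, ["building_type", "ownership"], marginals)
```

The targets for every dimension must sum to the same grand total, otherwise the procedure cannot satisfy all marginals at once.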
- scripts.census_calibration.load_calibration_factors(ip: Dict[str, Any], factors_file: str) Optional[Dict[Tuple, float]]
Load previously saved calibration factors from pickle file.
- Args:
ip: Input parameters dictionary (for logging)
factors_file: Path to pickle file containing factor dictionary
- Returns:
Dictionary of calibration factors if file exists, None otherwise. Keys are tuples of group identifiers, values are float factors.
- scripts.census_calibration.load_census_2022(ip: Dict[str, Any], census_path: str) DataFrame
Load and process raw census 2022 data from Excel files.
Processes census data by:
1. Loading dwelling counts from census Excel file
2. Merging with rural/urban classification
3. Categorizing dwellings by building type (SFH/MFH)
4. Categorizing by ownership (private owner/tenant)
5. Including room counts (1-7+)
6. Including vacancy statistics
7. Aggregating by rural/urban region type
- Args:
- ip: Input parameters dictionary containing:
census_dwellings_path: Path to census dwelling data Excel file
census_rural_urban_path: Path to rural/urban classification Excel
census_path: Output path for processed census CSV file
- Returns:
- DataFrame indexed by region_type (‘rural’/’urban’) with columns:
SFH: Single family home count
MFH: Multi family home count
private owner: Private owner count
private tenant: Private tenant count
RAUMANZAHL__01 through RAUMANZAHL__07: Room counts
LEERSTAND_INSGESAMT: Total vacancies
LEERSTAND_DAUER: Long-term vacancies (aggregated)
- Side Effects:
Saves processed census data to census_path
Sets pandas option ‘future.no_silent_downcasting’
- Note:
SFH includes buildings with 1-2 apartments
MFH includes buildings with 3+ apartments
Private tenants include various rental arrangements (items 3-8)
Special census values (’–’ and ‘.’) are replaced with 0
- scripts.census_calibration.prepare_data_for_calibration(df: DataFrame, ip: Dict[str, Any], weight_col: str = 'weights') Series
Prepare data for calibration by creating multi-dimensional contingency table.
Converts the detailed DataFrame into a contingency table (pivot table) that aggregates weights by all relevant dimensions. This format is required for the IPF algorithm.
- Args:
- df: DataFrame with individual records and columns:
region_type_dwell: ‘rural’ or ‘urban’
building_type: ‘MFH’ or ‘SFH’
ownership: ‘private owner’ or ‘private tenant’
condition: ‘renovated’ or ‘not renovated’
rooms: ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7+’
weight_col: Weights to aggregate
weight_col: Name of column containing weights to sum
- Returns:
Series indexed by (region_type_dwell, building_type, ownership, condition, rooms) with aggregated weights as values.
- Example:
Input: 1000 rows with weights
Output: up to 112 cells (2 regions × 2 building types × 2 ownership types × 2 conditions × 7 room counts; some cells may be empty)
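The aggregation into a contingency table can be sketched without pandas, assuming each record is a dictionary with the five dimension columns listed in Args. The function name and the sample records are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

DIMENSIONS = ["region_type_dwell", "building_type", "ownership", "condition", "rooms"]


def contingency(rows: List[dict], weight_col: str = "weights") -> Dict[Tuple[str, ...], float]:
    """Sum weight_col over all rows that share the same combination of dimensions."""
    table: Dict[Tuple[str, ...], float] = defaultdict(float)
    for row in rows:
        key = tuple(row[dim] for dim in DIMENSIONS)
        table[key] += row[weight_col]
    return dict(table)


records = [
    {"region_type_dwell": "urban", "building_type": "MFH", "ownership": "private tenant",
     "condition": "renovated", "rooms": "3", "weights": 120.0},
    {"region_type_dwell": "urban", "building_type": "MFH", "ownership": "private tenant",
     "condition": "renovated", "rooms": "3", "weights": 80.0},
    {"region_type_dwell": "rural", "building_type": "SFH", "ownership": "private owner",
     "condition": "not renovated", "rooms": "7+", "weights": 50.0},
]
cells = contingency(records)
```

Combinations that never occur in the data simply do not appear as keys, which is why the table may contain fewer cells than the full cross product.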
- scripts.census_calibration.save_calibrated_data(calibrated_df: DataFrame, ip: Dict[str, Any], year: int, dataset_type: str) None
Save calibrated data to CSV file.
- Args:
calibrated_df: DataFrame with calibrated weights to save
ip: Input parameters dictionary containing:
inhabit_fac_evidence_path: Path template for calibrated inhabit data
hh_stock_fac_evi_path: Path template for calibrated household stock
year: Year of the data
dataset_type: Type of dataset to save:
‘evidence_weights’: Calibrated inhabit data
‘hh_stock’: Calibrated household stock data
- Returns:
None. Saves data to CSV file.
- Raises:
SystemExit: If dataset_type is not recognized
- Side Effects:
Creates CSV file with calibrated data (rounded to integers)
Prints save confirmation message
- scripts.census_calibration.save_calibration_factors(factor_lookup: Dict[Tuple, float], ip: Dict[str, Any], factors_file: str) str
Save calibration factors to pickle file for future reuse.
- Args:
factor_lookup: Dictionary mapping group keys to calibration factors
ip: Input parameters dictionary (currently unused but kept for consistency)
factors_file: Path where pickle file should be saved
- Returns:
Path to saved factors file
- Side Effects:
Creates pickle file containing factor_lookup dictionary
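The save and load behaviour of the two pickle helpers above can be sketched as a round trip. The functions below are simplified stand-ins without the ip argument, and the file name is hypothetical.

```python
import os
import pickle
import tempfile
from typing import Dict, Hashable, Optional

Factors = Dict[Hashable, float]


def save_factors(factors: Factors, path: str) -> str:
    """Write the factor dictionary to a pickle file and return its path."""
    with open(path, "wb") as handle:
        pickle.dump(factors, handle)
    return path


def load_factors(path: str) -> Optional[Factors]:
    """Return the stored factors, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    factors_path = os.path.join(tmp_dir, "factors.pkl")
    missing = load_factors(factors_path)
    save_factors({("MFH", "renovated"): 1.05, "total": 1.02}, factors_path)
    restored = load_factors(factors_path)
```

Tuple keys survive pickling unchanged, which a JSON file would not allow without extra encoding. Pickle files should only be loaded from trusted sources.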
Module: Dwelling Stock Management
Manages and calibrates the dwelling stock, ensuring consistency with census data.
Dwelling Stock Management Module
This module provides functions for managing and calibrating dwelling stock data, including census calibration adjustments and dwelling distribution calculations.
- scripts.dwelling_stock.apply_dwell_factor(ip: Dict[str, Any], year: int, dwell_stock: DataFrame) DataFrame
Apply census calibration factors to dwelling stock data.
This function calibrates dwelling stock data to match census totals through an iterative rescaling process. It applies a total dwelling factor and then iteratively adjusts the dwelling counts until convergence is achieved.
- Args:
- ip: Input parameters dictionary containing configuration settings.
Must include ‘cols_dwell’ key with column names for dwelling indexing.
year: The year for which to apply census calibration factors.
dwell_stock: DataFrame containing dwelling stock data with columns:
dwells: number of dwellings
vacated: number of vacated dwellings
occupied: number of occupied dwellings
Additional columns may be present and will be preserved.
- Returns:
DataFrame with calibrated dwelling stock data. Adds ‘census_dwells’ column containing the calibrated dwelling counts. The ‘vacated’ and ‘occupied’ columns are also rescaled to match the calibration.
- Raises:
- AssertionError: If the iterative rescaling fails to converge after
10 iterations or if final totals don’t match within numerical tolerance.
SystemExit: If rescaling doesn’t converge after 10 iterations (exits with code 1).
- Notes:
Uses an iterative rescaling approach with a maximum of 10 iterations
Convergence is checked using np.isclose() for numerical stability
The function preserves the original ‘dwells’ column as ‘census_dwells’
Commented code shows an old per-index factor application approach
Module: Filters
Provides data filtering and preprocessing functions for SOEP dataframes.
Define filters for SOEP (Socio-Economic Panel) dataframes.
This module provides filtering functions for processing SOEP survey data, including age-based filtering, interview quality filtering, year selection, and household-level aggregation.
- scripts.filters.age_filter(df: DataFrame, ip: Dict[str, Any]) DataFrame
Filter dataframe by age range and add survey age column.
Creates a new column ‘sage’ (survey age) calculated as the difference between survey year and birth year, then filters rows to keep only those within the specified age range (inclusive).
- Args:
- df: DataFrame containing SOEP survey data with columns ‘syear’
(survey year) and ‘gebjahr’ (birth year).
- ip: Dictionary of input parameters containing:
‘min_age’ (int): Minimum age threshold (inclusive).
‘max_age’ (int): Maximum age threshold (inclusive).
- Returns:
DataFrame filtered to include only rows where min_age <= sage <= max_age, with an additional ‘sage’ column representing age at time of survey.
- Example:
>>> ip = {'min_age': 18, 'max_age': 65}
>>> filtered_df = age_filter(df, ip)
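The age computation and the inclusive range check can be sketched on plain dictionaries instead of a DataFrame. age_filter_rows is a hypothetical helper; the column names syear, gebjahr and sage follow the description above.

```python
from typing import Dict, List


def age_filter_rows(rows: List[Dict[str, int]], min_age: int, max_age: int) -> List[Dict[str, int]]:
    """Add 'sage' (survey year minus birth year) and keep rows within the range."""
    kept = []
    for row in rows:
        sage = row["syear"] - row["gebjahr"]
        if min_age <= sage <= max_age:  # both bounds are inclusive
            kept.append({**row, "sage": sage})
    return kept


people = [
    {"pid": 1, "syear": 2019, "gebjahr": 2005},  # age 14, dropped
    {"pid": 2, "syear": 2019, "gebjahr": 2001},  # age 18, kept (lower bound)
    {"pid": 3, "syear": 2019, "gebjahr": 1980},  # age 39, kept
    {"pid": 4, "syear": 2019, "gebjahr": 1950},  # age 69, dropped
]
adults = age_filter_rows(people, min_age=18, max_age=65)
```

Because only birth year is used, sage can differ by one from a person's exact age on the interview date.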
- scripts.filters.filter_df(df: DataFrame, ip: Dict[str, Any], set_year: int) Tuple[DataFrame, Dict[str, int]]
Apply a comprehensive set of filters to SOEP survey data.
This is the main filtering function that applies multiple filters in sequence:
1. Filter for successful household interviews (netto_filter)
2. Filter by age range (age_filter)
3. Aggregate household data and mark oldest members (households_filter)
4. Remove rows with NaN values (misc.clean_nan)
5. Validate that the resulting dataframe is not empty (misc.check_empty)
Note: The year_filter is currently commented out but available if needed.
- Args:
df: DataFrame containing raw SOEP survey data.
ip: Dictionary of input parameters containing:
‘min_age’ (int): Minimum age threshold.
‘max_age’ (int): Maximum age threshold.
‘col_oldest’ (str): Name of column to mark oldest household member.
- set_year: The survey year to process (currently not actively used
as year_filter is commented out).
- Returns:
- A tuple containing:
Filtered and cleaned DataFrame ready for analysis.
Dictionary mapping column names to count of NaN values removed.
- Raises:
May raise exceptions from misc.check_empty if dataframe is empty after filtering.
- Example:
>>> ip = {'min_age': 18, 'max_age': 65, 'col_oldest': 'is_oldest'}
>>> filtered_df, nan_counts = filter_df(df, ip, set_year=2019)
- scripts.filters.households_filter(df: DataFrame, ip: Dict[str, Any]) DataFrame
Aggregate household-level data and identify oldest household member.
This function performs two main operations:
1. For each household (identified by ‘hid’ and ‘syear’), keeps the first non-NaN value for each variable. Since household-level variables are the same for all members, this effectively creates one aggregated row per household.
2. Adds a boolean column to mark the oldest person in each household (person with minimum birth year).
The function preserves all rows from the original dataframe but adds a marker column indicating which person in each household is the oldest.
- Args:
- df: DataFrame containing SOEP survey data with columns ‘hid’
(household ID), ‘syear’ (survey year), ‘gebjahr’ (birth year), and ‘pid’ (person ID).
- ip: Dictionary of input parameters containing:
‘col_oldest’ (str): Name of the column to add that marks the oldest person in each household.
- Returns:
DataFrame with an additional boolean column (name specified in ip[‘col_oldest’]) where True indicates the oldest person in the household, False otherwise.
- Example:
>>> ip = {'col_oldest': 'is_oldest'} >>> df_with_oldest = households_filter(df, ip)
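Marking the oldest household member can be sketched on plain dictionaries, assuming the oldest person is the one with the minimum gebjahr within each (hid, syear) pair. mark_oldest is a hypothetical helper; in this sketch, members who share the minimum birth year are all marked.

```python
from typing import Dict, List, Tuple


def mark_oldest(rows: List[dict], col_oldest: str = "is_oldest") -> List[dict]:
    """Return copies of rows with a boolean flag for the oldest household member."""
    earliest: Dict[Tuple[int, int], int] = {}
    for row in rows:
        key = (row["hid"], row["syear"])
        if key not in earliest or row["gebjahr"] < earliest[key]:
            earliest[key] = row["gebjahr"]
    return [
        {**row, col_oldest: row["gebjahr"] == earliest[(row["hid"], row["syear"])]}
        for row in rows
    ]


members = [
    {"hid": 1, "syear": 2019, "pid": 11, "gebjahr": 1960},
    {"hid": 1, "syear": 2019, "pid": 12, "gebjahr": 1990},
    {"hid": 2, "syear": 2019, "pid": 21, "gebjahr": 1985},
]
flagged = mark_oldest(members)
```

All input rows are preserved; only the marker column is added, which matches the behaviour described above.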
- scripts.filters.netto_filter(df: DataFrame) DataFrame
Filter to keep only successfully conducted household interviews.
The ‘hnetto’ variable in SOEP indicates the interview status, where a value of 1 represents a successfully conducted interview. This function filters the dataframe to retain only successful interviews.
- Args:
- df: DataFrame containing SOEP survey data with ‘hnetto’ column
(household interview status).
- Returns:
DataFrame containing only rows where hnetto == 1 (successful interviews).
- Example:
>>> successful_interviews = netto_filter(df)
- scripts.filters.year_filter(df: DataFrame, set_year: int = 2019) DataFrame
Filter dataframe to include only a specific survey year.
Selects rows from the dataframe that correspond to interviews conducted in the specified year.
- Args:
- df: DataFrame containing SOEP survey data with ‘syear’ column
(survey year).
set_year: The specific year to filter for. Defaults to 2019.
- Returns:
DataFrame containing only rows from the specified survey year.
- Example:
>>> df_2019 = year_filter(df, set_year=2019)
>>> df_2020 = year_filter(df, set_year=2020)
Module: Household Stock Management
Manages household stock data and performs calibration against empirical data.
Load household stock data and insert it into inhabit matrix.
This module provides functions to process household stock data by loading it from Excel files, balancing it against existing inhabit data, and calibrating it using factors. The main workflow involves identifying households with too many or too few inhabitants and adjusting the dwelling stock accordingly.
- scripts.household_stock.get_household_stock(ip: Dict[str, Any], inhabit: Any, year: int, hh_stock: DataFrame) DataFrame
Process household stock data and balance it against existing inhabit data.
This function performs the core household stock balancing logic. It:
1. Loads statistical household weights for the given year
2. Compares statistical weights against current inhabitants
3. Identifies imbalances (too many or too few inhabitants)
4. Adjusts search/stay/inhabit counts to match statistical weights
5. Handles “ignored” households (those with no current inhabitants)
- Args:
- ip: Input parameters dictionary containing configuration keys including:
household_stock_path: Path to household stock Excel file
cols_hh: List of household column names for merging
mean_ls_hh: Path to CSV with mean living space evidence data
empirical_end_year: End year of empirical period (for commented experiment)
inhabit: Inhabit matrix (currently unused parameter - passed but not referenced)
year: The year for which to process household stock
hh_stock: DataFrame containing current household stock with columns:
inhabits: Current number of inhabitants
search: Number of households searching for dwellings
stay: Number of households staying in current dwellings
mean_ls: Mean living space per household
- Returns:
- pd.DataFrame: Updated household stock with additional columns:
stat_weights: Statistical household weights from source data
+too_many_-too_less: Difference between stat weights and current inhabitants (positive = too many, negative = too few)
ignored_hhs: Households with zero current inhabitants
search_plus: Additional searchers due to overpopulation
stay_minus: Reduction in stayers due to underpopulation
stay_neg: Negative values when stay reduction exceeds available stayers
stay_untouched: Original stay values before adjustment
mean_ls_evi: Mean living space from evidence (for ignored households)
ignored_ls_year: Year marker for ignored households
inhabits: Updated to match statistical weights
- Note:
There is commented experimental code for the empirical_end_year that would recalculate search/stay based on percentages. This is currently disabled.
- scripts.household_stock.process_household_projection_data(ip)
Process BBSR household projection data to prepare it for household stock modeling.
This function reads household projection data and ROR classification data from Excel files, processes and cleans the data, and prepares it for use in household stock models.
Parameters:
ip
Returns:
- tuple
A tuple containing two DataFrames:
processed_household_data: The cleaned and processed household projection data
processed_ror_data: The cleaned and processed ROR classification data
Example:
>>> household_data, ror_data = process_household_projection_data(ip)
Module: Input Configuration
Central module for loading and managing all configuration parameters and file paths.
Input configuration loader for the inhabit housing model.
This module provides functionality to load and configure all input parameters for the inhabit model from both CSV/Excel files and code-defined defaults. It handles path resolution, scenario configuration, and combines multiple configuration sources into a unified parameter dictionary.
- scripts.inputs.load_inputs(**flags: Any) Dict[str, Any]
Load and configure all input parameters for the inhabit model.
This function loads configuration from an Excel file and combines it with code-defined defaults, creating a comprehensive parameter dictionary used throughout the model. It handles:
- Loading user parameters from Excel
- Setting code-defined defaults for household and dwelling attributes
- Configuring scenario-specific parameters
- Resolving all file and directory paths
- Setting up output paths and folder structure
- Args:
- **flags: Variable keyword arguments that override default parameters.
- Common flags include:
scenario_name (str): Name of scenario to run (e.g., ‘default’, ‘MoR_incr’, ‘Nopreferred_uo’, ‘No_alloc_uo_2’, ‘Split_sfh_6’, ‘RED_UO’)
target_year (int or str): Target year for projection or ‘last_available_data’
dwelling_stock (str): Source of dwelling stock data (e.g., ‘housing_model’, ‘inhabit’)
housing_model_results (str): Which housing model results to use (e.g., ‘slow_decrease_ls’, ‘slow_increase_ls’, ‘sme’, ‘soe’, ‘eff’, ‘reduce’, ‘test’)
- Returns:
- Dict[str, Any]: Comprehensive dictionary containing all model parameters, including:
File paths for inputs and outputs
Household classification functions and parameters
Dwelling classification functions and parameters
Scenario-specific settings
Column names for data processing
Calibration and evidence paths
- Example:
>>> # Load with default scenario
>>> params = load_inputs(scenario_name='default', target_year=2030)
>>>
>>> # Load with custom scenario and settings
>>> params = load_inputs(
...     scenario_name='RED_UO',
...     target_year=2050,
...     housing_model_results='slow_increase_ls'
... )
>>>
>>> # Access configuration values
>>> output_path = params['output']
>>> hh_types_func = params['hh']['hh_type']
- Notes:
The function creates directories as needed for outputs and data
Paths are resolved relative to the project root unless they are given as absolute paths
Scenario configurations override defaults for specific parameters
All functions referenced (e.g., household.household_types) must exist in their respective modules
Module: Miscellaneous Utilities
A collection of essential utility functions used across the project for path management, data cleaning, and debugging.
Define several global variables, paths and basic functions.
This module contains utility functions for the Inhabit housing allocation model. It provides functions for data manipulation, path handling, validation, and various preprocessing operations for household and dwelling data.
- scripts.misc.add_mean_ls(ip: Dict[str, Any], inhabit: DataFrame, mean_ls_inh=None) DataFrame
Add mean living space (ls) data to the inhabit dataframe from multiple sources.
This function merges living space data from three hierarchical sources:
1. Individual inhabitant level (mean_ls_inh)
2. Household level (mean_ls_hh)
3. Dwelling level (mean_ls_dwell)
The function uses a cascading fallback strategy: if mean_ls is not available at the most granular level, it falls back to the next level in the hierarchy. Finally, it calculates total_living_space by multiplying weights by mean_ls.
- Args:
- ip: Dictionary containing input parameters with keys:
‘mean_ls_inh’: Path to inhabitant-level mean living space CSV
‘mean_ls_hh’: Path to household-level mean living space CSV
‘mean_ls_dwell’: Path to dwelling-level mean living space CSV
‘to_group_col’: Column name(s) for inhabitant grouping
‘cols_hh’: Column name(s) for household grouping
‘cols_dwell’: Column name(s) for dwelling grouping
inhabit: DataFrame containing inhabitant data with ‘weights’ and ‘mean_ls’ columns.
- Returns:
Updated inhabit DataFrame with merged mean_ls data and calculated total_living_space column.
- Notes:
The cascading logic prioritizes more granular data over aggregated data. Only positive weights and positive mean_ls values are considered.
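The cascading fallback can be sketched with pandas as follows. This is only an illustration, not the project's implementation: it assumes the three candidate values have already been merged onto each row under the column names `mean_ls_inh`, `mean_ls_hh`, and `mean_ls_dwell`, and the helper name `cascade_mean_ls` is hypothetical.

```python
import numpy as np
import pandas as pd


def cascade_mean_ls(inhabit: pd.DataFrame) -> pd.DataFrame:
    """Pick mean_ls from the most granular level that has a positive value."""
    out = inhabit.copy()
    # Assumed column names, ordered from most to least granular.
    levels = ["mean_ls_inh", "mean_ls_hh", "mean_ls_dwell"]
    # Non-positive and missing values become NaN so they fall through.
    candidates = out[levels].where(out[levels] > 0)
    # Back-filling along the columns moves the first valid value to column 0.
    out["mean_ls"] = candidates.bfill(axis=1).iloc[:, 0]
    out["total_living_space"] = out["weights"] * out["mean_ls"]
    return out


demo = pd.DataFrame(
    {
        "weights": [10.0, 5.0, 2.0],
        "mean_ls_inh": [80.0, np.nan, np.nan],
        "mean_ls_hh": [75.0, 60.0, np.nan],
        "mean_ls_dwell": [70.0, 65.0, 50.0],
    }
)
result = cascade_mean_ls(demo)
```

In the demo, the first row keeps its inhabitant-level value, the second falls back to the household level, and the third to the dwelling level.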
- scripts.misc.check_absolute_path(path: str) str
Adjust file path based on current working directory.
Checks if the current working directory ends with ‘inhabit’. If yes, returns the path unchanged. If not, prepends ‘../’ to the path to navigate up one directory level.
- Args:
path: Relative file path to check and potentially adjust.
- Returns:
Adjusted path string. Either the original path or ‘../’ + path.
- Example:
>>> # If cwd is /home/user/inhabit
>>> check_absolute_path('data/file.csv')
'data/file.csv'
>>> # If cwd is /home/user/inhabit/scripts
>>> check_absolute_path('data/file.csv')
'../data/file.csv'
- Notes:
This function helps maintain path compatibility when scripts are run from different directories within the project structure.
WARNING: This function may be orphaned or deprecated. Consider using absolute paths or pathlib for more robust path handling.
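One possible pathlib-based alternative, in the spirit of the warning above, is sketched here. The helper `resolve_from_root` is hypothetical and not part of `scripts.misc`; it assumes the project root is the nearest ancestor directory named `inhabit`.

```python
from pathlib import Path


def resolve_from_root(relative_path: str, root_name: str = "inhabit") -> Path:
    """Resolve a path against the nearest ancestor directory named root_name."""
    cwd = Path.cwd().resolve()
    # Check the working directory itself, then each parent, nearest first.
    for candidate in (cwd, *cwd.parents):
        if candidate.name == root_name:
            return candidate / relative_path
    # Assumption: with no matching ancestor, fall back to the working directory.
    return cwd / relative_path


data_file = resolve_from_root("data/file.csv")
```

Unlike prepending `../`, this returns an absolute path and works from any depth below the project root.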
- scripts.misc.check_empty(df: DataFrame, source: str) None
Check if a DataFrame is empty and raise an error if it is.
This validation function ensures that critical DataFrames are not empty before proceeding with calculations. It’s used as a safeguard to prevent downstream errors from propagating.
- Args:
df: pandas DataFrame to check.
source: String describing the source/name of the DataFrame (used in error message).
- Raises:
ValueError: If the DataFrame is empty, with a descriptive message indicating which source DataFrame is empty.
- Example:
>>> df = pd.DataFrame()
>>> check_empty(df, "household_data")
ValueError: The household_data-df is empty. No further calculations possible.
- Notes:
This is a guard function that should be called before performing operations that require non-empty DataFrames.
- scripts.misc.clean_nan(df: DataFrame, on_cols: Optional[List[str]] = None) Tuple[DataFrame, DataFrame]
Remove rows containing NaN values in specified columns from DataFrame.
This function first replaces negative values (error codes) with NaN, then removes rows that contain NaN in the specified columns. It returns both the cleaned DataFrame and the removed rows for inspection.
- Args:
df: pandas DataFrame to be cleaned.
on_cols: List of column names to check for NaN values. If None (default), all columns are considered.
- Returns:
Tuple of (cleaned_df, removed_df):
cleaned_df: DataFrame with NaN-containing rows removed
removed_df: DataFrame containing only the removed rows
- Example:
>>> df = pd.DataFrame({'a': [1, -1, 3], 'b': [4, 5, -8]})
>>> clean_df, nan_df = clean_nan(df, on_cols=['a'])
>>> # clean_df will not contain the row with -1 in column 'a'
- Notes:
Values from -8 to -1 are treated as error codes and replaced with NaN. Such negative codes are common in survey data, where they mark responses such as “Don’t know”, “No answer”, or “Not applicable”.
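The two-step cleaning described above might look roughly like the sketch below. It is an assumption-laden illustration, not the actual `clean_nan` code: the name `clean_nan_sketch` is hypothetical, and it assumes the code replacement is applied to all numeric columns while the row filter only looks at `on_cols`.

```python
import pandas as pd


def clean_nan_sketch(df, on_cols=None):
    """Replace survey error codes with NaN, then split rows by missingness."""
    cols = list(df.columns) if on_cols is None else list(on_cols)
    cleaned = df.copy()
    numeric = cleaned.select_dtypes(include="number").columns
    # Values in the closed range [-8, -1] are treated as error codes.
    is_code = cleaned[numeric].ge(-8) & cleaned[numeric].le(-1)
    cleaned[numeric] = cleaned[numeric].mask(is_code)
    # Rows with NaN in any checked column are returned separately.
    has_nan = cleaned[cols].isna().any(axis=1)
    return cleaned[~has_nan], cleaned[has_nan]


df = pd.DataFrame({"a": [1, -1, 3], "b": [4, 5, -8]})
clean_df, nan_df = clean_nan_sketch(df, on_cols=["a"])
```

Returning the removed rows alongside the cleaned ones makes it possible to account for them later, for example when upweighting the remaining records.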
- scripts.misc.create_stocks(inhabit: DataFrame, ip: Dict[str, Any]) Tuple[DataFrame, DataFrame]
Create aggregated household and dwelling stock dataframes.
This function aggregates the granular inhabit data into two stock summaries:
1. Household stock: aggregated by household characteristics
2. Dwelling stock: aggregated by dwelling characteristics
Both stocks include weighted mean living space calculations.
- Args:
- inhabit: DataFrame containing inhabit data with columns:
weights: Weighting factor for each record
search: Number of households searching for dwellings
stay: Number of households staying in current dwelling
mean_ls: Mean living space per inhabitant/household
- ip: Dictionary containing input parameters with keys:
‘cols_hh’: List of household grouping columns
‘cols_dwell’: List of dwelling grouping columns
- Returns:
Tuple containing (dwell_stock, hh_stock):
dwell_stock: DataFrame with dwelling statistics aggregated by cols_dwell
hh_stock: DataFrame with household statistics aggregated by cols_hh
- Notes:
Both stock dataframes are rounded to 0 decimal places and NaN values are filled with 0.
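A weighted mean living space per group can be computed as in the following sketch. The helper `aggregate_stock` and the demo column `hh_size` are hypothetical; the real function may aggregate additional columns.

```python
import pandas as pd


def aggregate_stock(inhabit: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """Sum counts per group and derive a weight-averaged mean_ls."""
    tmp = inhabit.copy()
    # Total living space per record, so it can be summed within groups.
    tmp["ls_total"] = tmp["weights"] * tmp["mean_ls"]
    stock = tmp.groupby(group_cols)[["weights", "search", "stay", "ls_total"]].sum()
    stock["mean_ls"] = stock["ls_total"] / stock["weights"]
    # Mirror the documented post-processing: round to 0 decimals, fill NaN with 0.
    return stock.drop(columns="ls_total").round(0).fillna(0)


inhabit = pd.DataFrame(
    {
        "hh_size": ["1", "1", "2"],
        "weights": [10.0, 30.0, 20.0],
        "search": [1.0, 3.0, 2.0],
        "stay": [9.0, 27.0, 18.0],
        "mean_ls": [40.0, 60.0, 90.0],
    }
)
hh_stock = aggregate_stock(inhabit, ["hh_size"])
```

Calling the same helper with the dwelling columns instead of the household columns would give the dwelling-side summary.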
- scripts.misc.debug_messages(message: str, ip: Dict[str, Any]) None
Append debugging messages to a cumulative debug string if debug flag is enabled.
This function accumulates debug messages in ip[‘debug_message’] when the debug flag is True. This allows collecting all debug information for later output or logging.
- Args:
message: Debug message string to append.
- ip: Dictionary containing input parameters with keys:
‘debug’: Boolean flag indicating whether to store debug messages
‘debug_message’: String that accumulates all debug messages
- Returns:
None. Modifies ip[‘debug_message’] in place if debug flag is True.
- Example:
>>> ip = {'debug': True, 'debug_message': ''}
>>> debug_messages("Step 1 complete", ip)
>>> debug_messages("Step 2 complete", ip)
>>> print(ip['debug_message'])
Step 1 complete
Step 2 complete
- Notes:
Each message is appended with a newline character. The ip dictionary is modified in place.
- scripts.misc.get_age_classes(ip: Dict[str, Any]) Dict[int, str]
Generate age class mappings based on age limiters in input parameters.
This function creates a dictionary that maps individual ages to age class labels (e.g., “0-17”, “18-64”, “65-99”). It supports both single-threshold and multi-threshold configurations.
- Args:
- ip: Dictionary containing input parameters with keys:
- ‘age_limiter’: Age threshold(s) as string. Either:
Single value: “65” (creates two classes: <65 and >=65)
Multiple values: “18, 65” (creates classes for each range)
‘min_age’: Minimum age in the dataset (e.g., 0)
‘max_age’: Maximum age in the dataset (e.g., 99)
- Returns:
Dictionary mapping each integer age to its age class label string.
- Example:
>>> # Single threshold
>>> ip = {'age_limiter': '65', 'min_age': 0, 'max_age': 99}
>>> get_age_classes(ip)
{0: '<65', 1: '<65', ..., 64: '<65', 65: '>=65', ..., 99: '>=65'}
>>> # Multiple thresholds
>>> ip = {'age_limiter': '18, 65', 'min_age': 0, 'max_age': 99}
>>> get_age_classes(ip)
{0: '0-17', ..., 17: '0-17', 18: '18-64', ..., 64: '18-64', 65: '65-99', ...}
- Notes:
Class labels use inclusive ranges (e.g., “18-64” includes both 18 and 64)
For multiple thresholds, separate values with “, ” (comma-space)
The last class extends to max_age inclusive
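The mapping rules above can be reproduced with a few lines of plain Python. This is a sketch under the documented conventions (string thresholds separated by commas, inclusive range labels); the function name `age_classes_sketch` is hypothetical and the real implementation may differ in detail.

```python
from typing import Any, Dict


def age_classes_sketch(ip: Dict[str, Any]) -> Dict[int, str]:
    """Map every age between min_age and max_age to a class label."""
    limits = [int(part) for part in str(ip["age_limiter"]).split(",")]
    lo, hi = int(ip["min_age"]), int(ip["max_age"])
    if len(limits) == 1:
        # Single threshold: two classes, below and at-or-above the limit.
        limit = limits[0]
        return {
            age: (f"<{limit}" if age < limit else f">={limit}")
            for age in range(lo, hi + 1)
        }
    # Multiple thresholds: half-open ranges, labelled with inclusive bounds.
    bounds = [lo, *limits, hi + 1]
    mapping: Dict[int, str] = {}
    for start, stop in zip(bounds[:-1], bounds[1:]):
        label = f"{start}-{stop - 1}"
        for age in range(start, stop):
            mapping[age] = label
    return mapping


single = age_classes_sketch({"age_limiter": "65", "min_age": 0, "max_age": 99})
multi = age_classes_sketch({"age_limiter": "18, 65", "min_age": 0, "max_age": 99})
```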
- scripts.misc.get_full_inhabit(all_dimensions: List[Dict[str, Any]], ip: Dict[str, Any], df_all_v: DataFrame, index_var: List[str]) DataFrame
Create a full inhabit matrix by cross-joining all dimension combinations.
This function generates a complete inhabit matrix by:
1. Creating a Cartesian product of all dimension combinations
2. Performing a right join with the existing data
3. Filtering to ensure region_type consistency
4. Validating that total weights are preserved
- Args:
- all_dimensions: List of dictionaries, each representing a dimension
(e.g., age groups, household types, dwelling types).
- ip: Dictionary containing input parameters with key:
‘to_group_col’: Column name(s) for grouping.
df_all_v: DataFrame containing existing inhabit data with ‘weights’ column.
index_var: List of column names to use as the index.
- Returns:
Complete inhabit DataFrame with all dimension combinations, indexed by index_var.
- Raises:
AssertionError: If total weights before and after operations don’t match.
- Notes:
Removes duplicate rows before merging
Ensures region_type_hh matches region_type_dwell
The m:1 validation ensures many-to-one merge relationship
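The Cartesian product, right join, and weight check can be sketched with pandas as below. The dimension names and values are made up for illustration, a single dictionary stands in for the documented list of dictionaries, filling missing combinations with zero weight is an assumption, and the region_type filter is omitted.

```python
import pandas as pd

# Hypothetical dimensions and observed combinations.
dimensions = {"hh_size": ["1", "2"], "rooms": ["1", "2", "3"]}
observed = pd.DataFrame(
    {"hh_size": ["1", "2"], "rooms": ["1", "3"], "weights": [10.0, 5.0]}
)

# 1. Cartesian product of all dimension values.
full_index = pd.MultiIndex.from_product(
    list(dimensions.values()), names=list(dimensions)
)
grid = full_index.to_frame(index=False)

# 2. Right join keeps every grid row; validate="m:1" requires the
#    key combinations on the right-hand side (the grid) to be unique.
full = observed.merge(grid, how="right", on=list(dimensions), validate="m:1")
full["weights"] = full["weights"].fillna(0.0)

# 3. Total weights must be unchanged by the expansion.
assert abs(full["weights"].sum() - observed["weights"].sum()) < 1e-9

full = full.set_index(list(dimensions))
```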
- scripts.misc.get_negative_dict() Dict[int, float]
Create a dictionary mapping the integer codes -8 through 0 to NaN.
This function generates a dictionary with keys from -8 to 0 (inclusive), all mapped to np.nan. This is used for replacing error codes or missing data indicators in survey data with proper NaN values.
- Returns:
Dictionary with integer keys from -8 to 0 mapped to np.nan.
- Example:
>>> get_negative_dict()
{-8: nan, -7: nan, -6: nan, -5: nan, -4: nan, -3: nan, -2: nan, -1: nan, 0: nan}
- Notes:
Negative values in survey data often indicate, for example:
Don’t know (-1, -2)
No answer (-3)
Not applicable (-8)
WARNING: This function appears to be orphaned and may not be actively used.
- scripts.misc.get_save_path(folder1: str, folder2: str) str
Create and return a save path by combining two folder names.
This function joins two folder paths, adjusts for the current working directory using check_absolute_path(), and creates the directory structure if it doesn’t exist.
- Args:
folder1: First folder path component (e.g., ‘output’).
folder2: Second folder path component (e.g., ‘results’).
- Returns:
Complete path string to the created directory.
- Example:
>>> get_save_path('output', 'results')  # Directory is created if it doesn't exist
'output/results'
- Notes:
Creates all intermediate directories as needed (like mkdir -p)
Uses check_absolute_path() for directory-aware path adjustment
- scripts.misc.get_split(ip: Dict[str, Any], split_large_dwell: str) Dict[str, List[int]]
Calculate how to split large dwellings with many rooms into smaller units.
This function takes a configuration string (e.g., “6_into_2”) and generates a mapping that shows how to split dwellings with 6+ rooms into 2 smaller dwellings. The split aims to distribute rooms as evenly as possible.
- Args:
- ip: Dictionary containing input parameters with keys:
‘max_rooms’: Maximum number of rooms to consider (int)
- split_large_dwell: String in format “{rooms}_into_{dwellings}”
(e.g., “6_into_2” means split dwellings with 6+ rooms into 2 units)
- Returns:
Dictionary mapping room counts to lists of split dwelling sizes. Example: {“6”: [3, 3], “7”: [3, 4], “7+”: [3, 4]}
- Example:
>>> get_split({'max_rooms': 7}, "6_into_2")
{'6': [3, 3], '7': [3, 4], '7+': [3, 4]}
- Notes:
Remainder rooms are added to the last dwelling unit
The max_rooms entry gets a “+” suffix (e.g., “7+”)
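The splitting rule from the notes (even base size, remainder added to the last unit) can be written as a short sketch. The helpers `parse_split` and `split_rooms` are hypothetical, and the “+” suffix handling for the max_rooms entry is left out.

```python
from typing import List, Tuple


def parse_split(spec: str) -> Tuple[int, int]:
    """Parse a string such as '6_into_2' into (minimum rooms, number of units)."""
    rooms, units = spec.split("_into_")
    return int(rooms), int(units)


def split_rooms(rooms: int, units: int) -> List[int]:
    """Distribute rooms over units; leftover rooms go to the last unit."""
    base, remainder = divmod(rooms, units)
    sizes = [base] * units
    sizes[-1] += remainder
    return sizes


min_rooms, units = parse_split("6_into_2")
# Room counts from the threshold up to an assumed max_rooms of 7.
splits = {rooms: split_rooms(rooms, units) for rooms in range(min_rooms, 8)}
```

With these inputs, `splits` maps 6 rooms to `[3, 3]` and 7 rooms to `[3, 4]`.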
- scripts.misc.load_inhabit_moving(inhabit_or_move: str, year: int, ip: Dict[str, Any], evidence_folder: str, load_abs: bool = False, alt_path: Optional[str] = None, use_weights: bool = True) Tuple[DataFrame, Optional[DataFrame]]
Load inhabit or move_out data from CSV files to DataFrames.
This function loads household allocation data from CSV files, handling both weighted and absolute count versions. It supports loading from different folder structures depending on whether the data is pre-created or evidence-based.
- Args:
- inhabit_or_move: String identifier, either “inhabit” or “move_out”
(or variants like “inhabit_created”).
year: Year for which to load the data.
- ip: Dictionary containing input parameters with keys:
‘to_group_col’: List of grouping columns
‘col_weights’: Column name for weights
‘col_move_per_pop’: Column name for movement per population
‘alloc_output_path’: Path to allocation output (if created data)
evidence_folder: Base folder path for evidence data.
load_abs: Whether to also load absolute count data (default: False).
alt_path: Alternative path override (default: None).
use_weights: Whether to use weights column or move_per_pop (default: True).
- Returns:
Tuple of (df_grouped, df_absolute):
df_grouped: Main DataFrame with grouped data, indexed by to_group_col
df_absolute: Absolute count DataFrame (or None if load_abs=False)
- Example:
>>> df, df_abs = load_inhabit_moving("inhabit", 2020, ip, "data/evidence")
- Notes:
Converts ‘rooms’ column to string type
Always includes ‘mean_ls’ column
Path construction differs for “_created” variants
- scripts.misc.order_move_in_want(move_in_want_v: DataFrame, ip: Dict[str, Any]) DataFrame
Sort households by allocation priority based on configurable attributes.
This function orders searching households according to a priority system defined in the input parameters. The order determines which households get first choice of available dwellings. Supports sorting by up to 4 different household attributes (e.g., income, age, household type, size).
- Args:
- move_in_want_v: DataFrame containing households wanting to move,
with columns for household and dwelling characteristics.
- ip: Dictionary containing input parameters with keys:
‘alloc_hh_prio_1_feature’: First priority attribute name
‘alloc_hh_prio_1_feature_order’: First priority sort order
‘alloc_hh_prio_2_feature’: Second priority attribute name
‘alloc_hh_prio_2_feature_order’: Second priority sort order
‘alloc_hh_prio_3_feature’: Third priority attribute name
‘alloc_hh_prio_3_feature_order’: Third priority sort order
‘alloc_hh_prio_4_feature’: Fourth priority attribute name
‘alloc_hh_prio_4_feature_order’: Fourth priority sort order
‘household_classification’: Classification type (e.g., “children”)
‘cols_hh’: List of household grouping columns
‘cols_dwell’: List of dwelling grouping columns
- Returns:
Sorted DataFrame indexed by household and dwelling columns, with households ordered by allocation priority.
- Example:
Priority configuration:
Priority 1: income_quintile, lowest_first
Priority 2: age, oldest_first
Priority 3: hh_size, smallest_first
Priority 4: hh_type, specific order
- Notes:
Uses mergesort algorithm for stable sorting
Special handling for hh_type with custom ordering (mapped to letters)
Supports both specific type ordering and children-based classification
The function temporarily converts hh_type values to sortable codes
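A stable multi-key ordering of this kind can be built by sorting on one key at a time, from the lowest priority to the highest, with a stable algorithm. The sketch below uses made-up data and a hypothetical two-level priority list; it does not reproduce the hh_type letter mapping.

```python
import pandas as pd

households = pd.DataFrame(
    {
        "hh_id": ["a", "b", "c", "d"],
        "income_quintile": [2, 1, 1, 2],
        "age": [30, 70, 40, 65],
    }
)

# (column, ascending) pairs, highest priority first (hypothetical configuration):
# lowest income first, then oldest first.
priorities = [("income_quintile", True), ("age", False)]

ordered = households
# Sorting by the least important key first and relying on mergesort's
# stability preserves earlier orderings among ties in later keys.
for column, ascending in reversed(priorities):
    ordered = ordered.sort_values(column, ascending=ascending, kind="mergesort")

priority_order = ordered["hh_id"].tolist()
```

Here `priority_order` is `['b', 'c', 'd', 'a']`: both households in the lowest quintile come first, and within each quintile the older household precedes the younger one.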
- scripts.misc.printd(str: str, ip: Dict[str, Any]) None
Print debugging information if debug flag is enabled.
This is a conditional print function that only outputs messages when the debug flag is set to True in the input parameters.
- Args:
str: String message to print.
- ip: Dictionary containing input parameters with key:
‘debug’: Boolean flag indicating whether to print debug messages.
- Returns:
None. Prints to stdout if debug flag is True.
- Example:
>>> ip = {'debug': True}
>>> printd("Debug message", ip)
Debug message
>>> ip = {'debug': False}
>>> printd("Debug message", ip)  # No output
- Notes:
This function provides a simple way to add debug output that can be toggled on/off without modifying code.
- scripts.misc.save_params_to_file(ip: Dict[str, Any]) None
Save input parameters dictionary to a formatted text file.
This function serializes the input parameters dictionary to a human-readable file using Python’s pprint format with proper indentation.
- Args:
- ip: Dictionary containing input parameters, must include key:
‘input_parameters’: Path to output file for saving parameters
- Returns:
None. Writes to file specified in ip[‘input_parameters’].
- Notes:
Uses pprint.pformat with 4-space indentation
Preserves dictionary order (sort_dicts=False)
Overwrites existing file if present
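The serialization described in the notes corresponds to a call like the one below. The wrapper `save_params_sketch` and the demo dictionary are hypothetical; only the `pprint.pformat` arguments are taken from the notes above.

```python
import pprint
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_params_sketch(ip: Dict[str, Any]) -> None:
    """Write ip as pretty-printed text to the path stored in ip['input_parameters']."""
    # sort_dicts=False keeps insertion order (available since Python 3.8).
    text = pprint.pformat(ip, indent=4, sort_dicts=False)
    # write_text overwrites an existing file.
    Path(ip["input_parameters"]).write_text(text, encoding="utf-8")


target = Path(tempfile.mkdtemp()) / "input_parameters.txt"
params = {
    "scenario_name": "default",
    "target_year": 2030,
    "input_parameters": str(target),
}
save_params_sketch(params)
saved = target.read_text(encoding="utf-8")
```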
- scripts.misc.stock_assertions(dwell_stock: DataFrame, hh_stock: DataFrame) None
Validate consistency and integrity of household and dwelling stock data.
This function performs multiple assertion checks to ensure:
1. Household sums are internally consistent (inhabits = search + stay)
2. Dwelling sums are internally consistent (dwells = vacated + occupied)
3. All values are non-negative
- Args:
- dwell_stock: DataFrame containing dwelling stock with columns:
dwells: Total number of dwellings
vacated: Number of vacated dwellings
occupied: Number of occupied dwellings
- hh_stock: DataFrame containing household stock with columns:
inhabits: Total number of households
search: Number of searching households
stay: Number of staying households
- Raises:
- AssertionError: If any validation check fails. Prints detailed
diagnostic information before raising.
- Notes:
Uses np.isclose() for float comparisons to handle rounding errors
Contains commented-out assertions for additional validations
Provides detailed debugging output when assertions fail
- scripts.misc.timer_func(func: Callable) Callable
Decorator that measures and prints the execution time of a function.
This decorator wraps a function to automatically print:
1. A message when the function starts executing
2. The execution time when the function completes
- Args:
func: Function to be timed.
- Returns:
Wrapped function that includes timing functionality.
- Example:
>>> @timer_func
... def slow_function():
...     time.sleep(2)
...     return "done"
>>> slow_function()
Calling function slow_function.
slow_function executed in 2.00s
'done'
- Notes:
Uses time.time() for measurement
Preserves the original function’s return value
Prints to stdout (not suitable for silent/logging-only execution)
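A timing decorator with this behaviour could be sketched as follows. It is not the project's code: the name `timer_sketch` is hypothetical, `functools.wraps` is added so the wrapped function keeps its name and docstring, and `time.perf_counter()` is used here in place of the documented `time.time()`.

```python
import functools
import time
from typing import Any, Callable


def timer_sketch(func: Callable) -> Callable:
    """Print a start message and the elapsed time around each call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        print(f"Calling function {func.__name__}.")
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} executed in {elapsed:.2f}s")
        # The original return value is passed through unchanged.
        return result

    return wrapper


@timer_sketch
def add(a: int, b: int) -> int:
    return a + b


total = add(2, 3)
```

Replacing the `print` calls with a logger would address the stdout limitation mentioned in the notes.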
Module: Move Out Rate Calculation
Calculates and predicts household move-out rates, incorporating regression and scenario factors.
Evaluate the inhabit matrices and move out matrices created by inhabit_matrix.py.
This module provides functionality to calculate and predict move-out rates (MOR) for different household types based on historical data. It uses linear regression to extrapolate future move-out rates and applies various factors to adjust the rates based on scenario-specific parameters.
- scripts.move_out_rate.apply_factors(mor: DataFrame, ip: Dict[str, Any], year: int, factors1: DataFrame, factors2: DataFrame) DataFrame
Apply adjustment factors to move-out rates for specific demographic segments.
This function modifies move-out rates based on scenario-specific factors that can target particular household characteristics (e.g., specific room counts, household sizes). Factors are applied multiplicatively to the base move-out rates.
- Args:
- mor: DataFrame containing move-out rates with demographic columns and
move_per_pop values. Will be reset and have its index modified.
- ip: Dictionary containing configuration parameters including:
‘col_move_per_pop’: Name of the move-out rate column
‘max_rooms’: Maximum room count before aggregation
‘max_hh_size’: Maximum household size before aggregation
year: The year to extract factors for from the factors dataframes.
factors1: First set of adjustment factors with columns for ‘column name’, ‘parameter’, and year-specific values.
factors2: Second set of adjustment factors with same structure as factors1.
- Returns:
Modified mor DataFrame with factors applied to the move_per_pop column. The DataFrame will have its index reset (index becomes a regular column).
- Note:
Factors of 1.0 result in no change (optimization to skip processing)
Special handling for max values: rooms and hh_size get ‘+’ suffix
Factors are applied to rows matching both column name and parameter value
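The multiplicative application of factors to matching rows can be illustrated with a small sketch. The data are made up, only one factor table is used, and the “+” suffix handling for maximum values is omitted.

```python
import pandas as pd

# Hypothetical move-out rates and one factor table in the documented layout.
mor = pd.DataFrame(
    {"rooms": ["1", "2", "3+"], "move_per_pop": [0.10, 0.08, 0.05]}
)
factors = pd.DataFrame(
    {"column name": ["rooms", "rooms"], "parameter": ["3+", "1"], "2030": [1.2, 1.0]}
)

year = 2030
adjusted = mor.copy()
for _, row in factors.iterrows():
    factor = row[str(year)]
    if factor == 1.0:
        # A factor of 1.0 changes nothing, so the row is skipped.
        continue
    # Select the rows whose value in the named column equals the parameter.
    mask = adjusted[row["column name"]].astype(str) == str(row["parameter"])
    adjusted.loc[mask, "move_per_pop"] *= factor
```

Only the rate for `3+` rooms changes (0.05 × 1.2); the other rows keep their base rates.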
- scripts.move_out_rate.create_mor(ip: Dict[str, Any], mor: DataFrame, year: int, all_disaggs: Dict[str, List[str]], mors_factors: Dict[str, float], mors_disaggs: Dict[str, DataFrame], keys: List[str], factors1_def: DataFrame, factors2_def: DataFrame, factors1_scen: DataFrame, factors2_scen: DataFrame) None
Create move-out rate matrix for a specific year and save to file.
This function generates a complete move-out rate matrix by combining regression predictions from multiple disaggregation dimensions (age, building type, household size, etc.) using weighted averaging. It then applies scenario-specific factors and saves the result to a CSV file.
- Args:
- ip: Configuration dictionary containing:
‘col_move_per_pop’: Name of move-out rate column
‘min_mor_SFH’: Minimum move-out rate for single-family homes
‘min_mor_MFH’: Minimum move-out rate for multi-family homes
‘to_group_col’: List of columns defining the matrix structure
‘scenario_start_year’: Year when scenario-specific factors begin
‘mor’: Path template for saving move-out rate files
- mor: Empty DataFrame with the structure to fill (columns for all disaggregation
categories plus move_per_pop).
year: The year to generate move-out rates for.
- all_disaggs: Dictionary mapping disaggregation names to their possible values
(e.g., {‘age’: [‘<45’, ‘>=45’], ‘building_type’: [‘SFH’, ‘MFH’]}).
- mors_factors: Dictionary mapping each disaggregation to its weight in the
weighted average (weights sum to 1.0).
- mors_disaggs: Dictionary mapping each disaggregation to its regression results
DataFrame (contains predicted rates for all years).
keys: List of disaggregation names in consistent order.
factors1_def: Default adjustment factors (set 1) for years before scenario start.
factors2_def: Default adjustment factors (set 2) for years before scenario start.
factors1_scen: Scenario adjustment factors (set 1) for years after scenario start.
factors2_scen: Scenario adjustment factors (set 2) for years after scenario start.
- Returns:
None. Saves the computed move-out rate matrix to a CSV file.
- Raises:
SystemExit: If any move-out rate falls below the minimum threshold after calculation.
- Note:
Move-out rates are calculated as a weighted sum of predictions from each disaggregation dimension
Minimum rates are enforced for SFH and MFH
Different factors are applied before/after the scenario start year
- scripts.move_out_rate.interpolate(year_1: str, year_2: str, factors: DataFrame) DataFrame
Interpolate factor values between two years on a yearly basis.
Creates columns for all years between year_1 and year_2, then linearly interpolates the factor values for intermediate years based on the values at year_1 and year_2.
- Args:
year_1: Starting year as a string (column name in factors dataframe).
year_2: Ending year as a string (column name in factors dataframe).
factors: DataFrame containing factors for each column and year for margins. Must have columns for year_1 and year_2.
- Returns:
Updated factors DataFrame with interpolated values for all years between year_1 and year_2.
- Example:
If year_1='2010', year_2='2015', and the factor value changes from 1.0 to 1.5, the intermediate years 2011 to 2014 are filled with the values 1.1, 1.2, 1.3, and 1.4.
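The yearly interpolation can be sketched as follows, assuming year columns are named by their string year as described in the arguments. The helper name `interpolate_years` and the demo table are hypothetical.

```python
import pandas as pd


def interpolate_years(year_1: str, year_2: str, factors: pd.DataFrame) -> pd.DataFrame:
    """Add one column per intermediate year with linearly interpolated factors."""
    out = factors.copy()
    start, end = int(year_1), int(year_2)
    span = end - start
    for year in range(start + 1, end):
        # Fraction of the distance travelled from year_1 towards year_2.
        share = (year - start) / span
        out[str(year)] = out[year_1] + share * (out[year_2] - out[year_1])
    return out


factors = pd.DataFrame(
    {"column name": ["rooms"], "parameter": ["3"], "2010": [1.0], "2015": [1.5]}
)
interpolated = interpolate_years("2010", "2015", factors)
```

Note that this sketch appends the new year columns at the end of the frame rather than keeping them in chronological order.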
- scripts.move_out_rate.linear_regression(x_regression: ndarray, y_regression: List[float], x_new: ndarray) ndarray
Apply linear regression to predict future values.
Fits a linear model of the form f(x) = b₀ + b₁x to historical data and uses it to predict values for new x values.
- Args:
x_regression: Array of historical x values (e.g., years) with shape (n, 1).
y_regression: List of historical y values (e.g., move-out rates) with length n.
x_new: Array of x values to predict for with shape (m, 1).
- Returns:
Array of predicted y values with shape (m,).
- Example:
If x_regression = [[2010], [2011], [2012]] and y_regression = [0.1, 0.12, 0.14], the function fits a line and can predict values for x_new = [[2013], [2014]].
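The example above can be reproduced with a least-squares line fit. This sketch uses `numpy.polyfit` as a stand-in; the documentation does not say which library the module relies on, and the name `linear_regression_sketch` is hypothetical.

```python
import numpy as np


def linear_regression_sketch(x_regression, y_regression, x_new):
    """Fit f(x) = b0 + b1 * x by least squares and evaluate it at x_new."""
    # polyfit returns coefficients from the highest degree down: slope, intercept.
    slope, intercept = np.polyfit(np.ravel(x_regression), y_regression, deg=1)
    return intercept + slope * np.ravel(x_new)


x_hist = np.array([[2010], [2011], [2012]])
y_hist = [0.1, 0.12, 0.14]
x_future = np.array([[2013], [2014]])
predicted = linear_regression_sketch(x_hist, y_hist, x_future)
```

Because the three historical points lie exactly on a line with slope 0.02, the predictions for 2013 and 2014 are approximately 0.16 and 0.18.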
- scripts.move_out_rate.load_factors(ip: Dict[str, Any], scen_def: str) Tuple[DataFrame, DataFrame]
Load factors from move_out_rate_input.xlsx file and apply yearly interpolation.
Reads two sets of factors from an Excel file and interpolates values for years that are missing between specified years. This allows smooth transitions in factor values over time.
- Args:
- ip: Dictionary containing user input from xlsx and program-added variables.
Must contain ‘mor_modification’ key with path template.
scen_def: Scenario definition string used to format the input path.
- Returns:
Tuple of two DataFrames (factors1, factors2), each containing:
‘column name’: The column to apply the factor to
‘parameter’: The specific parameter value to filter
Year columns: Factor values for each year (interpolated)
- Raises:
FileNotFoundError: If the Excel file specified in ip[‘mor_modification’] doesn’t exist.
KeyError: If required columns are missing from the Excel file.
Module: SOEP Data Loader
Handles the loading and preprocessing of SOEP survey data.
Load SOEP (Socio-Economic Panel) data.
This module provides utilities for loading, merging, and processing SOEP survey data. It handles household, dwelling, and individual person data from the German SOEP dataset.
Note: The functions csv_gen_households() and csv_gen_dwellings() appear to be orphaned and may not be actively used in the current codebase. They are retained here for potential future use or migration purposes.
- scripts.soep_loader.attachVariable(soep_path: str, add_vars: List[str], merge_from: str, merge_to: DataFrame, match_keys: List[str], new_names: Optional[List[str]] = None) DataFrame
Merge variables from a SOEP CSV file into an existing DataFrame.
This function loads specific variables from a SOEP dataset file and merges them into an existing DataFrame using specified key columns. It performs a left join, preserving all rows from merge_to and adding matched data from merge_from.
- Args:
soep_path: Path to the SOEP folder or data directory containing CSV files.
add_vars: List of variable (column) names to merge into the target DataFrame.
merge_from: Name of the source dataset file (without .csv extension) to merge from.
merge_to: Base DataFrame to which variables will be added.
match_keys: List of column names to use as keys for joining datasets. Common keys include ‘hid’ (household ID), ‘pid’ (person ID), and ‘syear’ (survey year).
- new_names: Optional list of new names for the variables being added.
If provided, must match the length of add_vars. Variables are renamed before merging. Defaults to None (keeps original names).
- Returns:
Updated DataFrame with merged variables from the source dataset.
- Raises:
FileNotFoundError: If the source CSV file does not exist.
KeyError: If match_keys are not present in both DataFrames.
- Warning:
This function should only be run once per variable to avoid duplicate columns. Running it multiple times for the same variable may cause unexpected behavior.
- Example:
>>> df = pd.DataFrame({'hid': [1, 2], 'syear': [2020, 2020]})
>>> df = attachVariable(
...     soep_path='/data/soep',
...     add_vars=['income'],
...     merge_from='hgen',
...     merge_to=df,
...     match_keys=['hid', 'syear'],
...     new_names=['household_income']
... )
- scripts.soep_loader.movers(df: DataFrame, ip: Dict[str, Any]) DataFrame
Add a residential move indicator variable to person-level records.
This function merges pre-computed “will_move” indicators into a DataFrame, marking individuals who are expected to move based on historical data. Missing values (persons without move data) are filled with 0 (won’t move).
- Args:
- df: Base DataFrame to which move indicators will be added.
Must contain ‘pid’ and ‘syear’ columns for matching.
- ip: Configuration dictionary containing paths and column names. Expected keys:
‘soep_composita_path’: Path to derived/composite datasets
- ‘move’: Dict with key:
‘col_will_move’: Column name for will-move indicator
- Returns:
- DataFrame with added move indicator column. The column contains:
1: Person will move (based on historical move data)
0: Person will not move (either confirmed or assumed)
- Note:
This function assumes calc_will_move() has been run previously to create the will_move dataset. It matches on pid (person ID) not hid (household ID).
Module: Household Classification
Transforms raw SOEP household data into standardized categories.
Define functions to count households by income, household type, age and household size.
This module provides functions for processing household demographic data from SOEP (Socio-Economic Panel) surveys. The functions handle encoding transformations, classification of households, and calculation of income quintiles.
- Functions defined in this script:
household_types: Replace encoding of household types with custom encoding
size_class: Assign household size class to every person
age_class: Assign age classes to every person
income_quintiles: Calculate weighted income quintiles
- Disaggregation of households in inhabit matrix:
Region –> Income Quintile –> household type –> age –> number of people
Functions are applied in “inhabit_matrix.py”.
- scripts.household.age_class(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Assign age classes to every person.
This function categorizes individuals into age groups based on configuration parameters. Age classes can be defined with custom boundaries via the input parameters.
- Args:
df: DataFrame containing person-level data with ‘sage’ (survey age) column.
ip: Dictionary of input parameters used by the get_age_classes function to determine age class boundaries.
col_name: Name of the output column for age class assignments.
- Returns:
- A tuple containing:
df: Modified DataFrame with age class column renamed from ‘sage’.
all_dimensions: Dictionary mapping column name to tuple of age classes.
- Notes:
The function renames ‘sage’ column to col_name
Age class boundaries are determined by misc.get_age_classes(ip)
Negative values (-8 to 0) are replaced with NaN
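A rough sketch of the binning step, assuming the class boundaries arrive as a list of bin edges plus labels. The edges, labels, and the helper name assign_age_class are illustrative assumptions; in the module the boundaries come from misc.get_age_classes(ip).

```python
import numpy as np
import pandas as pd

def assign_age_class(df, bin_edges, labels, col_name):
    """Rename 'sage' to col_name and bin ages into labelled classes (hypothetical helper)."""
    out = df.rename(columns={"sage": col_name})
    ages = out[col_name].where(out[col_name] > 0)  # codes -8..0 become NaN
    # Left-closed bins: [0, 18), [18, 65), [65, inf)
    out[col_name] = pd.cut(ages, bins=bin_edges, labels=labels, right=False)
    return out, {col_name: tuple(labels)}

people = pd.DataFrame({"sage": [-1, 17, 30, 70]})
edges = [0, 18, 65, np.inf]          # assumed boundaries, for illustration only
labels = ["0-17", "18-64", "65+"]    # assumed labels
classed, dims = assign_age_class(people, edges, labels, "age_class")
```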
- scripts.household.household_types(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Replace encoding of household types with custom encoding.
This function transforms household type encodings from SOEP data into simplified categories. It supports two classification schemes: ‘children’ (households with/without children) and ‘types’ (detailed household composition).
- Original SOEP encoding:
1: 1-Person-HH (single)
2: Couple Without Children
3: Single Parent
4: Couple With Children LE 16
5: Couple With Children GT 16
6: Couple With Children LE And GT 16
7: Multiple Generation-HH
8: Other Combination
- Args:
df: DataFrame containing household data with household type column.
ip: Dictionary of input parameters containing ‘household_classification’ key which can be ‘children’ or ‘types’.
col_name: Name of the column containing household type information.
- Returns:
- A tuple containing:
df: Modified DataFrame with replaced household type encodings.
all_dimensions: Dictionary mapping column name to tuple of dimension values.
- Notes:
In ‘children’ mode: merges all types into ‘with_children’ or ‘without_children’
In ‘types’ mode: creates 5 categories (single, single_parent, couple_no_children, couple_parent, other)
Negative values (-8 to 0) are replaced with NaN
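For the ‘types’ mode, the recoding could look roughly like the dictionary lookup below. The assignment of the eight SOEP codes to the five categories is an assumption made for illustration; the mapping actually used in household.py may group codes differently.

```python
import pandas as pd

# Assumed grouping of SOEP codes into the five 'types' categories.
TYPES_MAP = {
    1: "single",
    2: "couple_no_children",
    3: "single_parent",
    4: "couple_parent",
    5: "couple_parent",
    6: "couple_parent",
    7: "other",
    8: "other",
}

codes = pd.Series([1, 3, 5, 8, -1])
# Series.map leaves codes without a dictionary entry (such as -8..0) as NaN.
recoded = codes.map(TYPES_MAP)
```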
- scripts.household.income_quintiles(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Calculate weighted income quintiles for DataFrame and add them in new column.
This function divides households into five equal-sized groups (quintiles) based on weighted household income. The weighting accounts for survey sampling weights to create representative quintile distributions.
- Quintile definitions:
q1: The quintile with least income (bottom 20%)
q2: Second quintile (20-40%)
q3: Third quintile (40-60%)
q4: Fourth quintile (60-80%)
q5: The quintile with highest income (top 20%)
- Args:
- df: DataFrame containing household data with ‘hghinc’ (household income)
and ‘weights’ columns.
ip: Dictionary of input parameters (used for validation).
col_name: Name of the output column for income quintile assignments.
- Returns:
- A tuple containing:
df: Modified DataFrame with income quintile column added.
all_dimensions: Dictionary mapping column name to tuple of quintiles (ordered from highest to lowest: q5, q4, q3, q2, q1).
- Raises:
ValueError: If the input DataFrame is empty (via misc.check_empty).
- Notes:
Uses weighted quantile cutting to ensure each quintile represents 20% of the weighted population
Households are sorted by income before quintile assignment
The function uses mergesort for stable sorting
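One way to picture the weighted cut is via the cumulative share of survey weights, as in this sketch. The helper name weighted_quintiles and the tie-breaking rule at exact 20% boundaries (the boundary household falls into the lower quintile) are assumptions, not a description of the module's code.

```python
import numpy as np
import pandas as pd

def weighted_quintiles(df, col_name, income_col="hghinc", weight_col="weights"):
    """Assign q1..q5 by cumulative weight share (illustrative sketch)."""
    # Stable sort by income, mirroring the mergesort note above.
    ordered = df.sort_values(income_col, kind="mergesort")
    cum_w = ordered[weight_col].cumsum()
    total_w = ordered[weight_col].sum()
    # Cumulative share in (0, 20%] -> 1, (20%, 40%] -> 2, and so on.
    rank = np.ceil(5 * cum_w / total_w).clip(1, 5).astype(int)
    out = df.copy()
    out[col_name] = "q" + rank.astype(str)  # aligned back on the original index
    return out, {col_name: ("q5", "q4", "q3", "q2", "q1")}

households = pd.DataFrame(
    {"hghinc": [500, 100, 300, 200, 400], "weights": [1.0, 1.0, 1.0, 1.0, 1.0]}
)
result, dims = weighted_quintiles(households, "income_q")
```

With unequal weights, a quintile may contain more or fewer rows than another while still covering about 20% of the weighted population.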
- scripts.household.size_class(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Assign household size class to every person.
This function categorizes households by size, with a configurable maximum size class. Households larger than the maximum are grouped into a “max+” category.
- Args:
df: DataFrame containing household data with household size information.
ip: Dictionary of input parameters containing ‘max_hh_size’ key which specifies the maximum household size before grouping.
col_name: Name of the column containing household size information.
- Returns:
- A tuple containing:
df: Modified DataFrame with household size class assignments.
all_dimensions: Dictionary mapping column name to tuple of size classes.
- Example:
If max_hh_size=4, size classes will be: ‘1’, ‘2’, ‘3’, ‘4+’. A household with 6 people would be classified as ‘4+’.
- Notes:
Negative values (-8 to 0) are replaced with NaN
Households with size >= max_hh_size are grouped into ‘max_hh_size+’
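The labelling rule from the example and notes above, written out as a small sketch. The function name size_label is hypothetical.

```python
import numpy as np
import pandas as pd

def size_label(size, max_hh_size):
    """Return the size class label for one household size (hypothetical helper)."""
    if pd.isna(size) or size <= 0:
        return np.nan  # SOEP missing codes (-8 to 0)
    if size >= max_hh_size:
        return f"{max_hh_size}+"
    return str(int(size))

sizes = pd.Series([1, 3, 4, 6, -1])
classes = sizes.map(lambda s: size_label(s, 4))
# Dimension tuple for max_hh_size=4: ('1', '2', '3', '4+')
dims = tuple(str(i) for i in range(1, 4)) + ("4+",)
```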
Module: Dwelling Classification
Transforms raw SOEP dwelling data into standardized categories.
Load dwelling information from SOEP (Socio-Economic Panel).
This module provides functions for processing dwelling characteristics data from SOEP surveys. The functions handle encoding transformations for building types, ownership status, house conditions, number of rooms, and region types.
- Disaggregation hierarchy:
House type (SFH, MFH, …) –> Restoration status –> House Owner –> Number of Rooms
- Functions defined in this script:
building_type: Translate building type codes to standardized names
owner_type: Translate ownership status from integers to strings
house_condition: Translate house condition/renovation status
room_num: Limit and categorize number of rooms
region_type: Assign region type (urban/rural) to households
- scripts.dwelling.building_type(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Return standardized building type names derived from the SOEP TABULA encoding.
This function transforms building type codes from SOEP data into simplified categories. The main distinction is between Single Family Houses (SFH) and Multi-Family Houses (MFH).
- Original SOEP encoding:
1: Farm House
2: 1-2 Family House
3: 1-2 Family Rowhouse
4: Apartment in 3-4 Unit Building
5: Apartment in 5-8 Unit Building
6: Apartment Building with 9+ dwellings
7: High-rise Building (Hochhaus)
8: Other building type
9: Not specified (mapped to NaN)
- Simplified categories:
SFH (Single Family House): codes 1, 2, 3 (1-2 dwellings)
MFH (Multi Family House): codes 4, 5, 6, 7, 8 (3+ dwellings)
- Args:
df: DataFrame containing dwelling data with building type column.
ip: Dictionary of input parameters (not currently used but maintained for consistency with other functions).
col_name: Name of the column containing building type information.
- Returns:
- A tuple containing:
df: Modified DataFrame with standardized building type encodings.
all_dimensions: Dictionary mapping column name to tuple of dimension values (‘SFH’, ‘MFH’).
- Notes:
Negative values (-8 to 0) are replaced with NaN
Code 9 is explicitly mapped to NaN (unknown building type)
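The classification functions in this module share one contract: they take (df, ip, col_name) and return the modified DataFrame together with a dictionary of dimension values. A minimal sketch of that contract for the SFH/MFH mapping listed above; the function name building_type_sketch is hypothetical and the body is not the module's code.

```python
import pandas as pd

# Codes 1-3 -> SFH, 4-8 -> MFH, as listed above; code 9 and codes -8..0 stay unmapped.
BUILDING_MAP = {1: "SFH", 2: "SFH", 3: "SFH",
                4: "MFH", 5: "MFH", 6: "MFH", 7: "MFH", 8: "MFH"}

def building_type_sketch(df, ip, col_name):
    """Recode building type codes and return (df, all_dimensions)."""
    out = df.copy()
    out[col_name] = out[col_name].map(BUILDING_MAP)  # unmapped codes become NaN
    return out, {col_name: ("SFH", "MFH")}

dwellings = pd.DataFrame({"building_type": [2, 6, 9, -1]})
recoded_bt, dims_bt = building_type_sketch(dwellings, ip={}, col_name="building_type")
```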
- scripts.dwelling.house_condition(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Translate house condition/renovation status from integers to strings.
This function simplifies SOEP house condition codes into a binary classification of renovation status, suitable for matching with TABULA building typology data.
- Original SOEP encoding (hgcondit):
1: In a good condition –> renovated (TABULA: ambitiously sanitized)
2: Some renovations needed –> not renovated (TABULA: sanitized)
3: Full renovations needed –> not renovated (TABULA: not sanitized)
4: Dilapidated –> not renovated (TABULA: not sanitized)
- Simplified categories:
renovated: code 1 (good condition)
not renovated: codes 2, 3, 4 (any renovation needs)
- Args:
df: DataFrame containing dwelling data with ‘hgcondit’ column.
ip: Dictionary of input parameters (not currently used but maintained for consistency with other functions).
col_name: Name of the output column for house condition.
- Returns:
- A tuple containing:
df: Modified DataFrame with ‘hgcondit’ renamed and values replaced.
all_dimensions: Dictionary mapping column name to tuple of condition categories (‘renovated’, ‘not renovated’).
- Notes:
Negative values (-8 to 0) are replaced with NaN
The binary classification aligns with TABULA building typology standards
- scripts.dwelling.owner_type(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Translate dwelling ownership status information from integers to strings.
This function processes ownership status from two SOEP data sources and combines them based on configuration. It supports both simplified (owner/tenant) and detailed (including non-profit dwellings) classifications.
- Data sources:
- hlf0013_h: Dwelling ownership type (lower data availability)
1: Communal Dwelling
2: Co-Operative Apartment
3: Company Apartment
4: Private Owner
5: Do Not Know
6: Private Company
7: Non Profit Organization (Church, Foundation, etc.)
8: NaN
- hgowner: Ownership/tenancy status
1: Owner
2: Main tenant
3: Sub-tenant
4: Tenant
5: Living in a home (Heim) or shared accommodation
- Args:
df: DataFrame containing dwelling data with ‘hlf0013_h’ and/or ‘hgowner’ columns.
ip: Dictionary of input parameters containing ‘dwelling_ownership_short’ key which determines classification detail (‘true’ for simplified).
col_name: Name of the output column for ownership status.
- Returns:
- A tuple containing:
df: Modified DataFrame with ownership status column.
all_dimensions: Dictionary mapping column name to tuple of ownership categories (either 2 or 3 categories depending on configuration).
- Notes:
In ‘short’ mode: only ‘private owner’ and ‘private tenant’
In ‘detailed’ mode: adds ‘non profit dwelling’ category
Negative values and code 5 in hlf0013_h are mapped to NaN
The function combines data from both sources, using hlf0013_h for non-profit classification and hgowner for private dwelling details
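How the two sources might be combined is sketched below. Which hlf0013_h codes count as non-profit, and how hgowner codes 2 to 5 are grouped, are assumptions made purely for illustration; the documentation above does not spell them out.

```python
import pandas as pd

# Assumption: communal, co-operative and non-profit organisation dwellings are "non profit".
NON_PROFIT_CODES = {1, 2, 7}
# Assumption: all tenant variants collapse to 'private tenant'; code 5 stays unmapped.
TENURE_MAP = {1: "private owner", 2: "private tenant",
              3: "private tenant", 4: "private tenant"}

def ownership_sketch(df, short):
    """Return an ownership label per row (illustrative only)."""
    status = df["hgowner"].map(TENURE_MAP)  # unmapped or negative codes become NaN
    if not short:
        is_non_profit = df["hlf0013_h"].isin(NON_PROFIT_CODES)
        status = status.mask(is_non_profit, "non profit dwelling")
    return status

dwellings = pd.DataFrame({"hgowner": [1, 2, 2, -1], "hlf0013_h": [4, 1, -2, 5]})
short_labels = ownership_sketch(dwellings, short=True)
detailed_labels = ownership_sketch(dwellings, short=False)
```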
- scripts.dwelling.region_type(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Assign region type (urban/rural) to household identifiers.
This function translates region type codes into descriptive labels, classifying households as either urban or rural based on SOEP regional classification data.
- Original encoding:
1: Urban
2: Rural
- Args:
df: DataFrame containing household data with region type column.
ip: Dictionary of input parameters (not currently used but maintained for consistency with other functions).
- col_name: Name of the output column for region type. The function expects
the input column to be derived from the first two parts of col_name joined with underscore (e.g., ‘region_type_hh’ -> looks for ‘region_type’).
- Returns:
- A tuple containing:
df: Modified DataFrame with region type column renamed and values replaced.
all_dimensions: Dictionary mapping column name to tuple of region types (‘urban’, ‘rural’).
- Notes:
Negative values (-8 to 0) are replaced with NaN
The function extracts the base column name from col_name by taking the first two underscore-separated components
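The column-name derivation described in the notes can be illustrated like this. The helper names base_column and region_type_sketch are hypothetical, and the sketch only mirrors the documented behaviour.

```python
import pandas as pd

def base_column(col_name):
    """Join the first two underscore-separated parts, e.g. 'region_type_hh' -> 'region_type'."""
    return "_".join(col_name.split("_")[:2])

def region_type_sketch(df, col_name):
    src = base_column(col_name)
    out = df.rename(columns={src: col_name})
    # Codes other than 1 and 2 (including -8..0) become NaN.
    out[col_name] = out[col_name].map({1: "urban", 2: "rural"})
    return out, {col_name: ("urban", "rural")}

households = pd.DataFrame({"region_type": [1, 2, -1]})
regions, dims = region_type_sketch(households, "region_type_hh")
```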
- scripts.dwelling.room_num(df: DataFrame, ip: Dict[str, any], col_name: str) Tuple[DataFrame, Dict[str, Tuple[str, ...]]]
Limit and categorize the number of rooms in dwelling.
This function groups dwelling room counts into discrete classes, with a configurable maximum. Dwellings with more rooms than the maximum are grouped into a “max+” category. Very large room counts (16+) are treated as invalid and mapped to NaN.
- Args:
df: DataFrame containing dwelling data with ‘hgroom’ column.
ip: Dictionary of input parameters containing ‘max_rooms’ key which specifies the maximum room count before grouping (e.g., 4 creates classes ‘1’, ‘2’, ‘3’, ‘4+’).
col_name: Name of the output column for room number classes.
- Returns:
- A tuple containing:
df: Modified DataFrame with ‘hgroom’ renamed and values replaced with room class labels.
all_dimensions: Dictionary mapping column name to tuple of room number classes.
- Example:
If max_rooms=4, room classes will be: ‘1’, ‘2’, ‘3’, ‘4+’. A dwelling with 6 rooms would be classified as ‘4+’. A dwelling with 20 rooms would be mapped to NaN (likely data error).
- Notes:
Negative values (-8 to 0) are replaced with NaN
Room counts >= 16 are treated as invalid and mapped to NaN
Individual classes exist for 1 to (max_rooms - 1)
All counts from max_rooms to 15 are grouped into ‘max_rooms+’
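The room classes can be reproduced with interval binning, as in this sketch. It assumes integer room counts and 2 <= max_rooms < 15; the helper name room_class is hypothetical and pd.cut is simply one convenient way to express the rule, not necessarily what the module uses.

```python
import pandas as pd

def room_class(rooms, max_rooms):
    """Bin room counts into '1' .. '<max_rooms>+'; values <= 0 or >= 16 become NaN."""
    # Right-closed bins: (0, 1], (1, 2], ..., (max_rooms - 1, 15]
    edges = list(range(0, max_rooms)) + [15]
    labels = [str(i) for i in range(1, max_rooms)] + [f"{max_rooms}+"]
    return pd.cut(rooms, bins=edges, labels=labels), tuple(labels)

rooms = pd.Series([1, 3, 6, 15, 20, -1])
classes, dims = room_class(rooms, max_rooms=4)
```

Values outside the outermost bin edges fall out of every interval, which is what maps both the missing codes and the implausibly large counts to NaN.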
Module: Dwelling Stock Interface
Acts as an interface layer between Inhabit evidence data and the housing model for dwelling stock projections.
- scripts.dwell_stock_interface.create_dwell_stock(ip, new_dwell_stock, year, abs_extra_sfh)
Create a dwell stock for inhabit projections based on the housing model:
- load the last year of evidence from inhabit
- based on its dwelling-marginal distribution and its mean_ls, merge it with the housing model outputs and calculate dwells for all inhabit dimensions
- save it
- scripts.dwell_stock_interface.load_inhabit(ip)
Loads Inhabit from last year of evidence and returns it.
Module: Load Housing Model Data
Interfaces with external housing model results to create dwelling stock data.
Load all housing model related information. Housing model is used here for dwelling stock creation. Housing model base code can be checked here: https://gitlab.com/energy-sufficiency-model/housing-model
- scripts.load_housing_model_ds.get_calibrated_ds_housing_model(dwell_stock: DataFrame, ip: Dict[str, Any], year: int, dwell_factor: DataFrame) DataFrame
Creates a calibrated dwelling stock DataFrame by integrating housing model results.
This function combines dwelling stock data with housing model living space data. It applies calibration factors and distributes the housing model’s dwelling counts based on inhabit’s weights and building types/conditions.
- Args:
- dwell_stock (pd.DataFrame): The initial dwelling stock DataFrame, potentially
with multi-index, containing ‘building_type’, ‘condition’, ‘dwells’, and ‘mean_ls’.
- ip (Dict[str, Any]): Dictionary of input parameters, including keys for
housing model scenario splits and column definitions.
year (int): The target year for which to generate the dwelling stock.
dwell_factor (pd.DataFrame): DataFrame containing calibration factors, likely from census data.
- Returns:
- pd.DataFrame: A DataFrame representing the calibrated dwelling stock for the
given year, including ‘bs_dwells’, ‘dwells’, ‘vacated’, and ‘occupied’ columns.
- Raises:
- FileNotFoundError: If an unexpected error occurs during the distribution
of dwelling counts, indicating a potential logic flaw.
- scripts.load_housing_model_ds.get_ds_housing_model(dwell_stock: DataFrame, ip: Dict[str, Any], year: int, abs_extra_sfh: float) DataFrame
Generates dwelling stock data based on housing model living space and inhabit weights.
This function distributes living space from the housing model across inhabit’s dwelling stock weights for the current year. It handles scenarios involving splitting large dwellings and may incorporate additional single-family house (SFH) dwellings based on input parameters.
- Args:
- dwell_stock (pd.DataFrame): The initial dwelling stock DataFrame, potentially
with multi-index, containing ‘building_type’, ‘condition’, and ‘mean_ls’.
- ip (Dict[str, Any]): Dictionary of input parameters, including keys for
housing model scenarios, splitting large dwellings, and column definitions.
year (int): The target year for which to generate the dwelling stock.
abs_extra_sfh (float): An absolute number of extra single-family houses to potentially add, used in specific splitting scenarios.
- Returns:
- pd.DataFrame: A DataFrame representing the dwelling stock for the given year,
adjusted by the housing model and scenario parameters. It includes ‘dwells’, ‘vacated’, and ‘occupied’ columns.
- Raises:
- ValueError: If an unexpected error occurs during dwelling distribution,
indicating a potential logic flaw or duplicate index mapping.
- scripts.load_housing_model_ds.load_housing_results(ip: Dict[str, Any], year: int) DataFrame
Loads housing results from an Excel file for a given year.
This function reads dwelling stock data, specifically living space, from a specified Excel file. It handles year-specific data, remapping building types, and aggregating results.
- Args:
- ip (Dict[str, Any]): A dictionary containing input parameters, including
‘housing_model_results_path’.
- year (int): The target year for which to load housing model results.
If the year is before 2020, it uses 2020 data as a reference due to calibration issues with earlier years.
- Returns:
- pd.DataFrame: A DataFrame containing aggregated dwelling types and their
corresponding living space for the specified year.
- Raises:
FileNotFoundError: If the specified housing model results file does not exist.
- scripts.load_housing_model_ds.load_param(ip: Dict[str, Any]) Dict[int, float]
Loads and interpolates scenario parameters (living space per capita) over years.
Reads living space per capita data from an Excel file based on specified key years and interpolates these values linearly for all years between the key years.
- Args:
- ip (Dict[str, Any]): A dictionary containing input parameters, including:
‘housing_model_scen_path’: Path to the scenarios Excel file.
‘housing_model_results’: Sheet name in the Excel file.
‘key_years’: List of years defining the scenario points (e.g., [2020, 2030, …]).
- Returns:
- Dict[int, float]: A dictionary mapping each year (from 2019 onwards) to its
interpolated living space per capita value.
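The linear interpolation between key years can be sketched with np.interp. The key-year values below are made up for illustration, the helper name interpolate_years is hypothetical, and the sketch starts at the first key year rather than reproducing the module's handling of 2019 or its Excel loading.

```python
import numpy as np

def interpolate_years(key_years, key_values):
    """Linearly interpolate a value for every year between the first and last key year."""
    years = np.arange(key_years[0], key_years[-1] + 1)
    values = np.interp(years, key_years, key_values)
    return {int(y): float(v) for y, v in zip(years, values)}

# Hypothetical living space per capita (m2) at two key years.
params = interpolate_years([2020, 2030], [47.0, 45.0])
```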
- scripts.load_housing_model_ds.prepare_bs_housing(dwell_stock: DataFrame, ip: Dict[str, Any], year: int, hm_type: str = None) DataFrame
Prepares base housing model data by merging it with inhabit dwelling stock data.
This function loads housing model results, calculates weighted average living space from the inhabit data, and merges them. It also handles scenario-specific logic for freezing mean living space from a ‘scenario_start_year’.
- Args:
- dwell_stock (pd.DataFrame): The inhabit dwelling stock DataFrame, containing
‘building_type’, ‘dwells’, and ‘mean_ls’.
- ip (Dict[str, Any]): Dictionary of input parameters, including paths to
scenario files and the scenario start year.
year (int): The target year for which to prepare the data.
hm_type (str, optional): Type of housing model results. Not directly used in the current implementation but kept for signature compatibility. Defaults to None.
- Returns:
- pd.DataFrame: A DataFrame containing merged housing model living space
and inhabit dwelling stock information, including calculated ‘bs_dwells’ (housing model dwellings) and ‘inhabit_dwells’.
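The weighted average living space mentioned above could be computed per building type roughly as follows, weighting mean_ls by the dwelling count. The grouping key and the helper name weighted_mean_ls are assumptions; the actual function may aggregate over additional dimensions.

```python
import pandas as pd

def weighted_mean_ls(dwell_stock):
    """Dwelling-weighted mean living space per building type (illustrative sketch)."""
    tmp = dwell_stock.assign(ls_total=dwell_stock["mean_ls"] * dwell_stock["dwells"])
    sums = tmp.groupby("building_type")[["ls_total", "dwells"]].sum()
    return sums["ls_total"] / sums["dwells"]

stock = pd.DataFrame({
    "building_type": ["SFH", "SFH", "MFH"],
    "dwells": [2, 1, 4],
    "mean_ls": [100.0, 130.0, 60.0],
})
avg_ls = weighted_mean_ls(stock)
```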
Module: Household Stock Matching (ML)
Matches BBSR household projection data to full inhabit household vector categories using machine learning.
Module: Allocation Calibration and Evaluation
Evaluates the dwelling allocation process by comparing modeled matrices against empirical data.
This script evaluates the yearly results of the dwelling allocation process. It compares the inhabit matrices of the allocation output with the respective empirical inhabit matrix.
As a metric, the root-mean-square error (RMSE) is calculated for the whole matrix.
- scripts.alloc_calibration.inhabit_metrics(inhabit_1_v: DataFrame, inhabit_2_v: DataFrame) Tuple[float, float]
Calculates Root Mean Squared Error (RMSE) and normalized RMSE between two inhabit matrices.
- Args:
inhabit_1_v: The first inhabit DataFrame. This is typically the modeled inhabit data.
inhabit_2_v: The second inhabit DataFrame. This is typically the empirical inhabit data.
- Returns:
- A tuple containing:
rmse: The root-mean-squared error between the two DataFrames.
nrmse: The normalized root-mean-squared error, scaled by the sum of the empirical inhabit data.
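A compact sketch of the two metrics as described: RMSE over all cells, and NRMSE as RMSE divided by the sum of the empirical data. The helper name rmse_nrmse is hypothetical, and both inputs are assumed to be already aligned on the same index and columns, which the real function may handle differently.

```python
import numpy as np
import pandas as pd

def rmse_nrmse(modeled, empirical):
    """Return (rmse, nrmse) for two aligned, numeric DataFrames."""
    emp = empirical.to_numpy(dtype=float)
    diff = modeled.to_numpy(dtype=float) - emp
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    nrmse = rmse / float(emp.sum())
    return rmse, nrmse

modeled = pd.DataFrame([[10, 20], [30, 40]])
empirical = pd.DataFrame([[12, 18], [30, 40]])
rmse, nrmse = rmse_nrmse(modeled, empirical)
```

Dividing by the empirical total makes the error comparable across years with different population sizes.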
- scripts.alloc_calibration.main(ip: Dict[str, Any]) None
Main function to perform dwelling allocation calibration by comparing modeled and empirical inhabit data.
It iterates through predefined sets of ‘original’ and ‘created’ inhabit types, loads data for each year within the model’s simulation period, merges and fills missing values, calculates RMSE and normalized RMSE using inhabit_metrics, and saves the results to CSV files.
- Args:
- ip: A dictionary containing input parameters, including model start year,
empirical end year, output directory, and the column name for grouping.
Module: Analysis Charts
Generates charts for analyzing model outputs.
- scripts.analysis_charts.main(ip, used_scens)
Module: Dwelling Stock Charts
Generates charts related to dwelling stock data.
Module: Evaluation Plots
Generates various plots for evaluating model performance and results.
- scripts.evaluation_plots.aggregation_plots(ip, default_output_folder)
- scripts.evaluation_plots.arange_room_nos_list(max_rooms)
- scripts.evaluation_plots.check_allocation(ip)
- scripts.evaluation_plots.check_folder(folder_name)
- scripts.evaluation_plots.load_inhabit_matrix_from_csv(file, ip, empirical=False)
- scripts.evaluation_plots.plot_from_various_files(ip, folder_name, years_empirical, years_model, years)
- scripts.evaluation_plots.single_plots(large_movers_df, base_folder, folder_name, years_empirical, years_model, years)
- scripts.evaluation_plots.single_plots_dis(large_movers_df, base_folder, folder_name, dis, years_empirical, years_model, years)
- scripts.evaluation_plots.summed_plots(large_movers_df, base_folder, folder_name, years_empirical, years_model, years)
- scripts.evaluation_plots.summed_plots_dis(large_movers_df, base_folder, folder_name, dis, years_empirical, years_model, years)
Module: Living Space Charts
Generates charts related to living space data.
Module: Occupation Charts
Generates charts related to occupation data.
Create charts to evaluate all inhabit tables over time.
- scripts.occupation_charts.create_charts(ip, plot_year, ms_year, t_year, s_year, incr, decr)
- scripts.occupation_charts.get_chart_vars(ip, ms_year, t_year, s_year, incr_decr, all_avg_disaggs, all_disaggs)
- scripts.occupation_charts.get_inh_all(inh_path, ms_year, t_year, s_year)
- scripts.occupation_charts.plot_all_scenarios(dfs, s_year, ms_year, t_year, save_path, all_avg_disaggs, incr_decr)
- scripts.occupation_charts.plot_all_yrs(df_all, avg_disagg, disagg, start_year, model_start_year, end_year, save_path, inh_scen, scen)
Plots the average value of “avg_disagg” by “disagg” for all years.
Params:
- df: weighted DataFrame with columns year, “avg_disagg”, “disagg”
- avg_disagg: str, main avg_disagg (e.g. “underoccupation”)
- disagg: list of str, disaggregation avg_disagg, e.g. [“income_group”]
- start_year: int
- end_year: int
- save_path: str
- multilevel: bool, if True: multiple groupby avg_disagg can be passed as a list of strings; if only one str is passed in the list, leave out the multilevel parameter (defaults to False)
- share: bool, if True: title is called “Share of underoccupation => 1” resp. “>=2”
Return: Plots a line graph with
- x-axis: years
- y-axis: main variable (weighted)
- lines for each disaggregation (groups of groupby variable)
- scripts.occupation_charts.plot_line(df_incr, df_decr, df_d, avg_disagg, disagg, plot_year, save_path, scen, incr_decr, year)
Return: Plots a line graph with
- x-axis: years
- y-axis: main variable (weighted)
- lines
- scripts.occupation_charts.plot_mor(ip, incr_decr)
- scripts.occupation_charts.underoccup(ip, inh_all, disagg, start_year, end_year, save_path, average_disagg, plot_year)
Calculate the underoccupation over the years for different disaggregations
- scripts.occupation_charts.underoccup_col(ip, df, average_disagg)
Module: Check Values Labels
Utility module for checking and managing value labels.