Predicting Airline Customer Satisfaction: Drivers, Personas, and Actionable Uplift

Airline CSAT Jeff Hopp
2025-11-15

01 - Introduction

Customer satisfaction is a leading indicator of revenue, cost, and loyalty in commercial aviation. When travelers feel the journey is smooth and supported—from booking to boarding to in-flight experience—they are more likely to repurchase, recommend, and tolerate inevitable disruptions (weather, ATC, crew timing). Conversely, pain at digital touchpoints, long lines, or in-cabin discomfort compounds quickly and shows up as churn, refunds, and operational load on support.

This analysis treats satisfaction not as a vanity metric but as a predictable, diagnosable outcome. By combining structured survey ratings with operational/context features, we can:

Predict satisfaction reliably
Explain what moves it
Prioritize concrete actions by segment—so improvements are targeted, measurable, and defensible.

Project Overview

This project uses a public airline passenger satisfaction dataset (Kaggle) representing an undisclosed carrier’s post-flight survey results.

Data source & license: Airline Customer Satisfaction dataset (Kaggle). Used under stated terms. Accessed: 2025-09-01.

Each record includes Likert-scale ratings (1–5) across digital, airport, and in-flight touchpoints; passenger profile fields (e.g., travel type/class); and simple operational attributes (e.g., delays). The working goal is twofold:

Predict whether a passenger reports being “satisfied.”
Explain which factors most strongly drive satisfaction overall and by segment, so the airline can act with confidence.

Approach (end-to-end)

EDA & Data Hygiene: Validate schema, handle missing/invalid values (e.g., convert 0 → NaN for ratings), and lock fixed scales so comparisons are meaningful.
Modeling: Build a preprocessing pipeline (encode categoricals, scale numerics), train baseline and tuned models, and calibrate probabilities for decision support.
Driver Analysis: Quantify feature influence and rank top negative deltas that differentiate dissatisfied travelers from the cohort.
Personas (Clustering): Segment travelers into behaviorally distinct groups; produce persona cards highlighting the largest gaps vs. cohort means.
What-if / Uplift: Simulate targeted improvements to top pain ratings and estimate the probability lift in satisfaction overall.

Deliverables include a single HTML report (this notebook), persona cards, key charts (heatmaps/radars), and a concise recommendations section tied to measurable levers.

Business Objective

Objective: Enable the airline to raise customer satisfaction efficiently by predicting outcomes, pinpointing the most impactful drivers, and prioritizing actions for the segments who benefit most.

What success looks like

A calibrated model that predicts satisfaction with reliable, interpretable probabilities suitable for targeting and forecasting.
Clear, ranked drivers of dissatisfaction (global and by persona) that link directly to actionable levers (e.g., online support, boarding, Wi-Fi, seat comfort).
Persona-level guidance showing where divergence is largest, so resources can be focused on the most impactful drivers.
Uplift scenarios that translate proposed improvements (e.g., “raise online support to ≥3.5”) into expected probability gains, supporting ROI trade-offs.

Constraints & scope

This is survey-based and retrospective; causality is not guaranteed.
Results guide prioritization and testing (e.g., targeted pilots, A/Bs), not blanket policy changes without validation.
Metrics and visuals are standardized (1–5 scales; fixed color limits) to keep comparisons fair across cohorts and time.

02 - EDA & Data Hygiene

Profile the dataset, handle missing/invalid values, and prepare features for reliable modeling. This includes schema checks, missingness review, and standardizing 1-5 rating scales.

Setup & Imports

Import the Python stack and initialize the working environment.

Code

# === Core ===
import numpy as np
import pandas as pd

# === Plotting & formatting ===
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import textwrap  # label wrapping for plots
import seaborn as sns

# === Notebook display ===
from IPython.display import Markdown, display, HTML

# === Sparse utilities ===
from scipy import sparse

# === Data splitting ===
from sklearn.model_selection import train_test_split

# === Preprocessing ===
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# === Models ===
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# === Evaluation & diagnostics ===
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    precision_recall_curve, average_precision_score,
    confusion_matrix, brier_score_loss,
    silhouette_score, calinski_harabasz_score, davies_bouldin_score
)

# === Probability calibration ===
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# === Clustering & similarity ===
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

# === Pipeline utilities/config ===
from sklearn import set_config
from sklearn.base import clone
from sklearn.utils.validation import check_is_fitted

Load Data

Load the airline_customer_satisfaction.csv file into a raw DataFrame.

Code

# Load data
pd.set_option('display.max_columns',200)
RAW_PATH='../data/airline_customer_satisfaction.csv'
df=pd.read_csv(RAW_PATH)
df.head()

	satisfaction	Customer Type	Age	Type of Travel	Class	Flight Distance	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding	Departure Delay in Minutes	Arrival Delay in Minutes
0	satisfied	Loyal Customer	65	Personal Travel	Eco	265	2	2	4	2	3	3	0	3	5	3	2	0	0.0
1	satisfied	Loyal Customer	47	Personal Travel	Business	2464	3	0	2	2	3	4	4	4	2	3	2	310	305.0
2	satisfied	Loyal Customer	15	Personal Travel	Eco	2138	3	2	0	2	2	3	3	4	4	4	2	0	0.0
3	satisfied	Loyal Customer	60	Personal Travel	Eco	623	3	3	4	3	1	1	0	1	4	1	3	0	0.0
4	satisfied	Loyal Customer	70	Personal Travel	Eco	354	3	4	3	4	2	2	0	2	4	2	5	0	0.0

Examine Raw Data

Columns, non-null counts, datatypes, missingness

Validate raw schema and quantify missingness per column before cleaning.

Code

# Build a per-column summary table:
# - dtype: pandas dtype inferred from the raw data
# - non_null / nulls: counts of present vs missing values
# - null_pct: percent missing (formatted to two decimals with a % sign)
summary = (
    df.dtypes.to_frame("dtype")
        .assign(
            non_null=df.notna().sum(),
            nulls=df.isna().sum(),
            null_pct=df.isna().mean()*100
        )
        .reset_index(names="column")
)
display(summary.style.format({"null_pct": "{:.2f}%"}))

	column	dtype	non_null	nulls	null_pct
0	satisfaction	object	129880	0	0.00%
1	Customer Type	object	129880	0	0.00%
2	Age	int64	129880	0	0.00%
3	Type of Travel	object	129880	0	0.00%
4	Class	object	129880	0	0.00%
5	Flight Distance	int64	129880	0	0.00%
6	Seat comfort	int64	129880	0	0.00%
7	Departure/Arrival time convenient	int64	129880	0	0.00%
8	Food and drink	int64	129880	0	0.00%
9	Gate location	int64	129880	0	0.00%
10	Inflight wifi service	int64	129880	0	0.00%
11	Inflight entertainment	int64	129880	0	0.00%
12	Online support	int64	129880	0	0.00%
13	Ease of Online booking	int64	129880	0	0.00%
14	On-board service	int64	129880	0	0.00%
15	Leg room service	int64	129880	0	0.00%
16	Baggage handling	int64	129880	0	0.00%
17	Checkin service	int64	129880	0	0.00%
18	Cleanliness	int64	129880	0	0.00%
19	Online boarding	int64	129880	0	0.00%
20	Departure Delay in Minutes	int64	129880	0	0.00%
21	Arrival Delay in Minutes	float64	129487	393	0.30%

Ratings: min/max; “0” values identified

Isolate rating columns, display min/max values, identify counts/percentage of “0” values

Code

# Isolate rating columns, display min/max values, and identify counts/percentage of "0" values
RATING_COLS=[
    'Seat comfort', 'Departure/Arrival time convenient', 'Food and drink', 'Gate location', 
    'Inflight wifi service', 'Inflight entertainment', 'Online support', 'Ease of Online booking', 
    'On-board service', 'Leg room service', 'Baggage handling', 'Checkin service', 
    'Cleanliness', 'Online boarding'
]
mins=df[RATING_COLS].min(numeric_only=True)
maxs=df[RATING_COLS].max(numeric_only=True)
display(pd.DataFrame({'min':mins,'max':maxs}))
zero_mask=(df[RATING_COLS]==0)
zero_pct=(zero_mask.mean()*100).round(3).rename('pct_zero')
zero_cnt=zero_mask.sum().rename('count_zero')
display(pd.concat([zero_cnt,zero_pct],axis=1).sort_values('pct_zero',ascending=False))

	min	max
Seat comfort	0	5
Departure/Arrival time convenient	0	5
Food and drink	0	5
Gate location	0	5
Inflight wifi service	0	5
Inflight entertainment	0	5
Online support	0	5
Ease of Online booking	0	5
On-board service	0	5
Leg room service	0	5
Baggage handling	1	5
Checkin service	0	5
Cleanliness	0	5
Online boarding	0	5

	count_zero	pct_zero
Departure/Arrival time convenient	6664	5.131
Food and drink	5945	4.577
Seat comfort	4797	3.693
Inflight entertainment	2978	2.293
Leg room service	444	0.342
Inflight wifi service	132	0.102
Ease of Online booking	18	0.014
Online boarding	14	0.011
Cleanliness	5	0.004
On-board service	5	0.004
Gate location	2	0.002
Online support	1	0.001
Checkin service	1	0.001
Baggage handling	0	0.000

Persist cleaned ratings (0 → NaN)

Create a cleaned working copy, flag original zeros, and convert rating 0 values to missing (NaN) for consistent 1–5 scales.

Code

# Save off cleaned copy of raw DataFrame (keep it untouched for reference)
dfc=df.copy()

# For each rating:
# - Add a binary flag column marking rows that had a 0 in the raw data (audit trail).
# - Convert 0 → NaN so all ratings use a true 1–5 scale downstream.
for c in RATING_COLS:
    dfc[f"{c}__na"]=(dfc[c]==0).astype("int64")
dfc[RATING_COLS] = dfc[RATING_COLS].replace(0, np.nan)

# Quick side-by-side sanity check (first 10 rows)
display(Markdown("#### Raw DataFrame (first 10 rows)"))
display(df.head(10))
display(Markdown("#### Cleaned DataFrame (first 10 rows)"))
display(dfc.head(10))
display(Markdown("*Note: When using NumPy, `int64` can't represent `NaN`, and upcasts the column to `float64`.  This is why ratings now show as decimals in cleaned dataframe.*"))
display(Markdown(f"**Raw DF shape:** `{df.shape}`  |  **Cleaned DF shape:** `{dfc.shape}`"))

# Persist cleaned file for reuse outside of this notebook
CLEAN_PATH='../data/airline_customer_satisfaction_clean.csv'
dfc.to_csv(CLEAN_PATH,index=False)
display(Markdown(f"**Saved:** `{CLEAN_PATH}`"))

Raw DataFrame (first 10 rows)

	satisfaction	Customer Type	Age	Type of Travel	Class	Flight Distance	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding	Departure Delay in Minutes	Arrival Delay in Minutes
0	satisfied	Loyal Customer	65	Personal Travel	Eco	265	2	2	4	2	3	3	0	3	5	3	2	0	0.0
1	satisfied	Loyal Customer	47	Personal Travel	Business	2464	3	0	2	2	3	4	4	4	2	3	2	310	305.0
2	satisfied	Loyal Customer	15	Personal Travel	Eco	2138	3	2	0	2	2	3	3	4	4	4	2	0	0.0
3	satisfied	Loyal Customer	60	Personal Travel	Eco	623	3	3	4	3	1	1	0	1	4	1	3	0	0.0
4	satisfied	Loyal Customer	70	Personal Travel	Eco	354	3	4	3	4	2	2	0	2	4	2	5	0	0.0
5	satisfied	Loyal Customer	30	Personal Travel	Eco	1894	3	2	0	2	2	5	4	5	5	4	2	0	0.0
6	satisfied	Loyal Customer	66	Personal Travel	Eco	227	3	2	5	5	5	5	0	5	5	5	3	17	15.0
7	satisfied	Loyal Customer	10	Personal Travel	Eco	1812	3	2	0	2	2	3	3	4	5	4	2	0	0.0
8	satisfied	Loyal Customer	56	Personal Travel	Business	73	3	5	3	5	4	4	0	1	5	4	4	0	0.0
9	satisfied	Loyal Customer	22	Personal Travel	Eco	1556	3	2	0	2	2	2	4	5	3	4	2	30	26.0

Cleaned DataFrame (first 10 rows)

	satisfaction	Customer Type	Age	Type of Travel	Class	Flight Distance	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding	Departure Delay in Minutes	Arrival Delay in Minutes	Seat comfort__na	Departure/Arrival time convenient__na	Food and drink__na	Inflight wifi service__na	Inflight entertainment__na	Leg room service__na
0	satisfied	Loyal Customer	65	Personal Travel	Eco	265	NaN	NaN	NaN	2.0	2.0	4.0	2.0	3.0	3.0	NaN	3	5.0	3.0	2.0	0	0.0	1	1	1	0	0	1
1	satisfied	Loyal Customer	47	Personal Travel	Business	2464	NaN	NaN	NaN	3.0	NaN	2.0	2.0	3.0	4.0	4.0	4	2.0	3.0	2.0	310	305.0	1	1	1	1	0	0
2	satisfied	Loyal Customer	15	Personal Travel	Eco	2138	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	3.0	3.0	4	4.0	4.0	2.0	0	0.0	1	1	1	0	1	0
3	satisfied	Loyal Customer	60	Personal Travel	Eco	623	NaN	NaN	NaN	3.0	3.0	4.0	3.0	1.0	1.0	NaN	1	4.0	1.0	3.0	0	0.0	1	1	1	0	0	1
4	satisfied	Loyal Customer	70	Personal Travel	Eco	354	NaN	NaN	NaN	3.0	4.0	3.0	4.0	2.0	2.0	NaN	2	4.0	2.0	5.0	0	0.0	1	1	1	0	0	1
5	satisfied	Loyal Customer	30	Personal Travel	Eco	1894	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	5.0	4.0	5	5.0	4.0	2.0	0	0.0	1	1	1	0	1	0
6	satisfied	Loyal Customer	66	Personal Travel	Eco	227	NaN	NaN	NaN	3.0	2.0	5.0	5.0	5.0	5.0	NaN	5	5.0	5.0	3.0	17	15.0	1	1	1	0	0	1
7	satisfied	Loyal Customer	10	Personal Travel	Eco	1812	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	3.0	3.0	4	5.0	4.0	2.0	0	0.0	1	1	1	0	1	0
8	satisfied	Loyal Customer	56	Personal Travel	Business	73	NaN	NaN	NaN	3.0	5.0	3.0	5.0	4.0	4.0	NaN	1	5.0	4.0	4.0	0	0.0	1	1	1	0	0	1
9	satisfied	Loyal Customer	22	Personal Travel	Eco	1556	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	2.0	4.0	5	3.0	4.0	2.0	30	26.0	1	1	1	0	1	0

Note: When using NumPy, int64 can’t represent NaN, and upcasts the column to float64. This is why ratings now show as decimals in cleaned dataframe.

Raw DF shape: (129880, 22) | Cleaned DF shape: (129880, 36)

Saved: ../data/airline_customer_satisfaction_clean.csv

Examine Cleaned Data

Ratings ranges, columns, non-null counts, datatypes (cleaned)

Validate post-cleaning schema and missingness per column (ratings now 1–5 with NaNs, often upcast to float).

Code

# Examine cleaned data, non-null counts, datatypes
c_mins = dfc[RATING_COLS].min(numeric_only=True)
c_maxs = dfc[RATING_COLS].max(numeric_only=True)
summary = pd.DataFrame({'min':c_mins,'max':c_maxs}).round(0).astype('int64')
display(summary)

# Build a per-column summary table (cleaned):
# - dtype: inferred dtype after cleaning (ratings may be float due to NaNs)
# - non_null / nulls: counts of present vs missing values
# - null_pct: percent missing (formatted to two decimals with a % sign)
summary_clean = (
    dfc.dtypes.to_frame("dtype")
       .assign(
           non_null = dfc.notna().sum(),
           nulls    = dfc.isna().sum(),
           null_pct = dfc.isna().mean() * 100
       )
       .reset_index(names="column")
)

display(
    summary_clean
      .style
      .format({"null_pct": "{:.2f}%"})
)

	min	max
Seat comfort	1	5
Departure/Arrival time convenient	1	5
Food and drink	1	5
Gate location	1	5
Inflight wifi service	1	5
Inflight entertainment	1	5
Online support	1	5
Ease of Online booking	1	5
On-board service	1	5
Leg room service	1	5
Baggage handling	1	5
Checkin service	1	5
Cleanliness	1	5
Online boarding	1	5

	column	dtype	non_null	nulls	null_pct
0	satisfaction	object	129880	0	0.00%
1	Customer Type	object	129880	0	0.00%
2	Age	int64	129880	0	0.00%
3	Type of Travel	object	129880	0	0.00%
4	Class	object	129880	0	0.00%
5	Flight Distance	int64	129880	0	0.00%
6	Seat comfort	float64	125083	4797	3.69%
7	Departure/Arrival time convenient	float64	123216	6664	5.13%
8	Food and drink	float64	123935	5945	4.58%
9	Gate location	float64	129878	2	0.00%
10	Inflight wifi service	float64	129748	132	0.10%
11	Inflight entertainment	float64	126902	2978	2.29%
12	Online support	float64	129879	1	0.00%
13	Ease of Online booking	float64	129862	18	0.01%
14	On-board service	float64	129875	5	0.00%
15	Leg room service	float64	129436	444	0.34%
16	Baggage handling	int64	129880	0	0.00%
17	Checkin service	float64	129879	1	0.00%
18	Cleanliness	float64	129875	5	0.00%
19	Online boarding	float64	129866	14	0.01%
20	Departure Delay in Minutes	int64	129880	0	0.00%
21	Arrival Delay in Minutes	float64	129487	393	0.30%
22	Seat comfort__na	int64	129880	0	0.00%
23	Departure/Arrival time convenient__na	int64	129880	0	0.00%
24	Food and drink__na	int64	129880	0	0.00%
25	Gate location__na	int64	129880	0	0.00%
26	Inflight wifi service__na	int64	129880	0	0.00%
27	Inflight entertainment__na	int64	129880	0	0.00%
28	Online support__na	int64	129880	0	0.00%
29	Ease of Online booking__na	int64	129880	0	0.00%
30	On-board service__na	int64	129880	0	0.00%
31	Leg room service__na	int64	129880	0	0.00%
32	Baggage handling__na	int64	129880	0	0.00%
33	Checkin service__na	int64	129880	0	0.00%
34	Cleanliness__na	int64	129880	0	0.00%
35	Online boarding__na	int64	129880	0	0.00%

Numeric ranges (cleaned)

Check min/max for true numeric columns to spot bad values or outliers.

Code

# Show min/max for numeric columns only (not ratings cols)
NUM_COLS=['Age', 'Flight Distance', 'Departure Delay in Minutes', 'Arrival Delay in Minutes']
# dfc[NUM_COLS].agg(['min','max']).transpose()

num_ranges = dfc[NUM_COLS].agg(["min", "max"]).T  # rows = columns, cols = min/max
display(num_ranges.style.format("{:.0f}").hide(axis="index", subset=[]))

	min	max
Age	7	85
Flight Distance	50	6951
Departure Delay in Minutes	0	1592
Arrival Delay in Minutes	0	1584

Categoricals (cleaned)

Confirm each categorical’s distinct values and inspect the most common levels (including missing) to catch bad or unexpected categories.

Code

# Columns to audit (cleaned)
CAT_COLS = ["Customer Type", "Type of Travel", "Class"]

# 1) Cardinality overview (number of distinct values per column, incl. NaN)
display(Markdown("#### Cardinality by column"))
cardinality = (
    dfc[CAT_COLS]
      .nunique(dropna=False)
      .sort_values(ascending=False)
      .rename("unique_values")
      .to_frame()
)
display(cardinality)   # index shows the categorical column names

# 2) Per-column breakdown
#    - Show all levels if small (≤10), else top 10 with a note
for c in CAT_COLS:
    nuniq = dfc[c].nunique(dropna=False)
    display(Markdown(f"#### {c}"))

    vc = (
        dfc[c]
          .value_counts(dropna=False)    # include NaN in counts
          .rename("count")
          .to_frame()
          .reset_index()
          .rename(columns={"index": c})
    )

    out = vc if nuniq <= 10 else vc.head(10)
    display(out.style.hide(axis="index"))

    if nuniq > 10:
        display(Markdown(f"*…and **{nuniq - 10}** more levels not shown.*"))

Cardinality by column

	unique_values
Class	3
Customer Type	2
Type of Travel	2

Customer Type

Customer Type	count
Loyal Customer	106100
disloyal Customer	23780

Type of Travel

Type of Travel	count
Business travel	89693
Personal Travel	40187

Class

Class	count
Business	62160
Eco	58309
Eco Plus	9411

03 - Modeling

Construct and asess a reporducible modeling pipeline - preprocessing, training, calibration, and validation.

Target & Feature Groups

Define the target column and group features by type (ordinal ratings, categoricals, numerics). This keeps preprocessing explicit and readable.

Code

from IPython.display import Markdown, display

def bullets(xs): 
    return "\n".join(f"- `{x}`" for x in xs)

# Set target
TARGET = 'satisfaction'

# Set feature groups
# RATING_COLS, CAT_COLS, NUM_COLS - previously declared

# Explicit order for the ordinal encoder. Since 0 was turned into NaN, the valid set is 1..5.
ORDINAL_SCALE = [1, 2, 3, 4, 5]

md = f"""
### Modeling inputs — quick reference
- **Target:** `{TARGET}`

**Ordinal ratings ({len(RATING_COLS)}):**

{bullets(RATING_COLS)}

**Categoricals ({len(CAT_COLS)}):**

{bullets(CAT_COLS)}

**Numeric ({len(NUM_COLS)}):**

{bullets(NUM_COLS)}

**Ordinal scale:** {", ".join(map(str, ORDINAL_SCALE))}
"""
display(Markdown(md))

Modeling inputs — quick reference

Target: satisfaction

Ordinal ratings (14):

Seat comfort
Departure/Arrival time convenient
Food and drink
Gate location
Inflight wifi service
Inflight entertainment
Online support
Ease of Online booking
On-board service
Leg room service
Baggage handling
Checkin service
Cleanliness
Online boarding

Categoricals (3):

Customer Type
Type of Travel
Class

Numeric (4):

Age
Flight Distance
Departure Delay in Minutes
Arrival Delay in Minutes

Ordinal scale: 1, 2, 3, 4, 5

Encode target

Map the text label (satisfied/dissatisfied) to a binary target (1/0) for modeling.

Code

# Map string labels to binary: satisfied -> 1, everything else -> 0
y = (dfc[TARGET] == 'satisfied').astype(int)

# Feature matrix: everything except the target
X = dfc.drop(columns=[TARGET])

Shapes & random sample

Code

# Check shapes (cleaned df)
display(Markdown(f"**Cleaned DF shape:** `{dfc.shape}` | **X shape:** `{X.shape}` | **y length:** `{len(y)}`"))

# Small random sample (helps spot weird rows)
display(dfc.sample(5, random_state=42))

Cleaned DF shape: (129880, 36) | X shape: (129880, 35) | y length: 129880

	satisfaction	Customer Type	Age	Type of Travel	Class	Flight Distance	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding	Departure Delay in Minutes	Arrival Delay in Minutes	Seat comfort__na	Food and drink__na
103044	satisfied	Loyal Customer	59	Business travel	Business	1470	4.0	4.0	4.0	4.0	5.0	4.0	4.0	4.0	4.0	4.0	4	5.0	4.0	3.0	7	0.0	0	0
43282	dissatisfied	disloyal Customer	22	Business travel	Eco	1771	1.0	1.0	1.0	4.0	4.0	1.0	5.0	4.0	3.0	4.0	3	1.0	4.0	4.0	0	0.0	0	0
65543	satisfied	Loyal Customer	55	Business travel	Business	3657	NaN	5.0	NaN	2.0	4.0	5.0	4.0	4.0	4.0	4.0	4	3.0	4.0	3.0	12	8.0	1	1
65083	satisfied	Loyal Customer	41	Business travel	Business	1796	NaN	4.0	NaN	1.0	2.0	4.0	5.0	3.0	3.0	3.0	3	5.0	3.0	3.0	0	0.0	1	1
76496	dissatisfied	Loyal Customer	42	Business travel	Eco	1709	2.0	3.0	3.0	3.0	2.0	2.0	2.0	2.0	4.0	4.0	4	1.0	3.0	2.0	0	0.0	0	0

Target distribution and balance

Quantify class balance to inform weighting, thresholds, and evaluation choices.

Code

# # Target distribution (counts + %)
# vc = y.value_counts()
# target_table = pd.DataFrame({
#     "count": vc,
#     "percent": (vc / vc.sum() * 100).round(2)
# }).rename(index={1: "satisfied", 0: "dissatisfied"})
# display(target_table)

# # Simple bar chart of target balance
# ax = vc.rename(index={1: "satisfied", 0: "dissatisfied"}).plot(kind="bar")
# ax.set_title("Target balance")
# ax.set_xlabel("class")
# ax.set_ylabel("count")
# plt.show()

# Counts by class (rename for readability)
vc = y.value_counts().rename(index={1: "satisfied", 0: "dissatisfied"})

# Table: make "class" a real column to avoid the awkward header gap; format percent
target_table = (
    vc.to_frame("count")
      .assign(percent=lambda d: d["count"] / d["count"].sum())
      .reset_index(names="class")
)

display(
    target_table
      .style
      .hide(axis="index")
      .format({"percent": "{:.2%}"})
)

# Simple bar chart of target balance
order = ["satisfied", "dissatisfied"]
ax = vc.reindex(order).plot(kind="bar")
ax.set_title("Target balance")
ax.set_xlabel("class")
ax.set_ylabel("count")
ax.tick_params(axis="x", labelrotation=0)   # Make x labels horizontal
plt.tight_layout()
plt.show()

class	count	percent
satisfied	71087	54.73%
dissatisfied	58793	45.27%

Feature-group samples

Observe sample sets of each feature group.

Code

display(Markdown("#### Ratings (first 5 rows):"))
display(dfc[RATING_COLS].head(5))
display(Markdown("#### Categoricals (first 5 rows):"))
display(dfc[CAT_COLS].head(5))
display(Markdown("#### Numerics (first 5 rows):"))
display(dfc[NUM_COLS].head(5))

Ratings (first 5 rows):

	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	NaN	NaN	NaN	2.0	2.0	4.0	2.0	3.0	3.0	NaN	3	5.0	3.0	2.0
1	NaN	NaN	NaN	3.0	NaN	2.0	2.0	3.0	4.0	4.0	4	2.0	3.0	2.0
2	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	3.0	3.0	4	4.0	4.0	2.0
3	NaN	NaN	NaN	3.0	3.0	4.0	3.0	1.0	1.0	NaN	1	4.0	1.0	3.0
4	NaN	NaN	NaN	3.0	4.0	3.0	4.0	2.0	2.0	NaN	2	4.0	2.0	5.0

Categoricals (first 5 rows):

	Customer Type	Type of Travel	Class
0	Loyal Customer	Personal Travel	Eco
1	Loyal Customer	Personal Travel	Business
2	Loyal Customer	Personal Travel	Eco
3	Loyal Customer	Personal Travel	Eco
4	Loyal Customer	Personal Travel	Eco

Numerics (first 5 rows):

	Age	Flight Distance	Departure Delay in Minutes	Arrival Delay in Minutes
0	65	265	0	0.0
1	47	2464	310	305.0
2	15	2138	0	0.0
3	60	623	0	0.0
4	70	354	0	0.0

Train/Valid Split (Holdout)

Randomly split into train and validation sets with stratify=y so the class ratio is similar in both. Print sizes and positive-rate checks to verify the split.

Code

# === Holdout split (train/valid). Stratify keeps class balance similar in each split. ===
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
display(Markdown(f"**Train shape:** `{X_train.shape}` | **Valid shape:** `{X_valid.shape}`"))

# Sanity check on class balance
display(Markdown(f"**Overall positive rate:** `{y.mean():.3f}`"))
display(Markdown(f"**Train positive rate:** `{y_train.mean():.3f}`"))
display(Markdown(f"**Valid positive rate:** `{y_valid.mean():.3f}`"))

Train shape: (103904, 35) | Valid shape: (25976, 35)

Overall positive rate: 0.547

Train positive rate: 0.547

Valid positive rate: 0.547

Missingness by split

Before building the preprocessor, quantify NaNs in numeric and rating columns per split to prioritize imputation strategies.

Code

# Missingness by split (pre-imputation)
# --- Numeric columns ---
train_num = (
    (X_train[NUM_COLS].isna().mean())
    .rename("missing_%")
    .reset_index(name="column")
)
valid_num = (
    (X_valid[NUM_COLS].isna().mean())
    .rename("missing_%")
    .reset_index(name="column")
)

display(Markdown("#### Train missing % (numeric)"))
display(
    train_num
    .style
    .hide(axis="index")
    .hide(axis="columns")
    .format({"missing_%": "{:.2%}"})
)

display(Markdown("#### Valid missing % (numeric)"))
display(
    valid_num
    .style
    .hide(axis="index")
    .hide(axis="columns")
    .format({"missing_%": "{:.2%}"})
)

# --- Ratings (top 5) ---
train_rat_top5 = (
    (X_train[RATING_COLS].isna().mean())
    .sort_values(ascending=False).head(5)
    .rename("missing_%")
    .reset_index(name="rating")
)

valid_rat_top5 = (
    (X_valid[RATING_COLS].isna().mean())
    .sort_values(ascending=False).head(5)
    .rename("missing_%")
    .reset_index(name="rating")
)

display(Markdown("#### Train missing % (ratings) — top 5"))
display(
    train_rat_top5
    .style
    .hide(axis="index")
    .hide(axis="columns")
    .format({"missing_%": "{:.2%}"})
)

display(Markdown("#### Valid missing % (ratings) — top 5"))
display(
    valid_rat_top5
    .style
    .hide(axis="index")
    .hide(axis="columns")
    .format({"missing_%": "{:.2%}"})
)

Train missing % (numeric)

Age	0.000000
Flight Distance	0.000000
Departure Delay in Minutes	0.000000
Arrival Delay in Minutes	0.003157

Valid missing % (numeric)

Age	0.000000
Flight Distance	0.000000
Departure Delay in Minutes	0.000000
Arrival Delay in Minutes	0.002502

Train missing % (ratings) — top 5

Departure/Arrival time convenient	0.050980
Food and drink	0.045552
Seat comfort	0.036765
Inflight entertainment	0.022810
Leg room service	0.003455

Valid missing % (ratings) — top 5

Departure/Arrival time convenient	0.052626
Food and drink	0.046658
Seat comfort	0.037612
Inflight entertainment	0.023406
Leg room service	0.003272

Preprocessing Pipeline

Build a ColumnTransformer that imputes + encodes ratings and categoricals and scales numerics. This object will be fitted only on the training split to avoid leakage.

Code

# Preprocessor: fit ONLY on TRAIN to avoid leakage
# - Ratings: impute most frequent (common for Likert items), then OrdinalEncode to preserve order
# - Categoricals: impute most frequent, then OneHot
# - Numerics: median impute (robust to skew/outliers), then standardize
preprocessor = ColumnTransformer([
    ('ratings', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('ord', OrdinalEncoder(
            categories=[ORDINAL_SCALE] * len(RATING_COLS),   # enforce known order 1<2<3<4<5
            handle_unknown='use_encoded_value', unknown_value=-1  # unseen values -> -1
        )),
    ]), RATING_COLS),

    ('cats', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(handle_unknown='ignore')),     # ignore unseen categories at validation time
    ]), CAT_COLS),

    ('nums', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), NUM_COLS),
])

# Visualize the preprocessor structure
set_config(display='diagram')
display(preprocessor)

Sample outputs from cloned preprocessor

This can help to confirm preprocessing graph behaves as intended (column selection, imputers, encoders, scaler, feature counts/names) without accidental mutation of the real pipeline. Once review of viz_pre is complete, can proceed with actual preprocessor.

Code

# Clone the preprocessor (to avoid disturbing the "real" preprocessor)
viz_pre = clone(preprocessor)

# Make OHE dense just for this preview clone (needed for pandas output)
viz_pre.set_params(**{"cats__ohe__sparse_output": False})
viz_pre.set_output(transform='pandas')

# Fit on TRAIN ONLY (no leakage)
viz_pre.fit(X_train)

# Preview a few transformed rows with column names
Xtr_preview = viz_pre.transform(X_train.head(5))
display(Xtr_preview.head())

# # Totals by block
# ohe = viz_pre.named_transformers_["cats"].named_steps["ohe"]
# num_ohe = sum(len(c) for c in ohe.categories_)
# print(f"TOTAL features: {len(viz_pre.get_feature_names_out())}")
# print(f"  • Ratings (ordinal): {len(RATING_COLS)}")
# print(f"  • Categoricals (one-hot dummies): {num_ohe}")
# print(f"  • Numerics (scaled): {len(NUM_COLS)}")

# # Per-categorical breakdown
# for col, cats in zip(CAT_COLS, ohe.categories_):
#     print(f"  - {col}: {len(cats)} levels -> {list(cats)}")

# Totals by block
ohe = viz_pre.named_transformers_["cats"].named_steps["ohe"]
num_ohe = sum(len(c) for c in ohe.categories_)
total_feats = len(viz_pre.get_feature_names_out())

md = f"""
**Total features:** `{total_feats}`

  - Ratings (ordinal): `{len(RATING_COLS)}`
  - Categoricals (one-hot dummies): `{num_ohe}`
  - Numerics (scaled): `{len(NUM_COLS)}`

**Categorical breakdown**

"""

# Per-categorical breakdown as nested bullets
for col, cats in zip(CAT_COLS, ohe.categories_):
    levels = ", ".join(f"`{c}`" for c in cats)
    md += f"- **{col}**: {len(cats)} levels — {levels}\n"

display(Markdown(md))

# Unseen categories in VALID (will be ignored by OHE due to handle_unknown='ignore')
for col, cats in zip(CAT_COLS, ohe.categories_):
    val_only = set(X_valid[col].dropna().unique()) - set(cats)
    if val_only:
        print(f"⚠️ Unseen in TRAIN, present in VALID for {col}: {sorted(val_only)}")

	ratings__Seat comfort	ratings__Departure/Arrival time convenient	ratings__Food and drink	ratings__Gate location	ratings__Inflight wifi service	ratings__Inflight entertainment	ratings__Online support	ratings__Ease of Online booking	ratings__On-board service	ratings__Leg room service	ratings__Baggage handling	ratings__Checkin service	ratings__Cleanliness	ratings__Online boarding	cats__Customer Type_Loyal Customer	cats__Customer Type_disloyal Customer	cats__Type of Travel_Business travel	cats__Class_Business	cats__Class_Eco	cats__Class_Eco Plus	nums__Age	nums__Flight Distance	nums__Departure Delay in Minutes	nums__Arrival Delay in Minutes
43835	0.0	1.0	1.0	3.0	3.0	1.0	3.0	3.0	0.0	4.0	2.0	1.0	2.0	3.0	0.0	1.0	1.0	0.0	1.0	0.0	-1.349031	0.434696	-0.383006	-0.387873
46917	1.0	1.0	1.0	2.0	0.0	1.0	4.0	0.0	3.0	3.0	2.0	2.0	2.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0	-0.423487	1.075794	0.293040	0.385187
82684	2.0	4.0	4.0	4.0	2.0	2.0	2.0	2.0	0.0	4.0	2.0	0.0	3.0	2.0	1.0	0.0	1.0	0.0	1.0	0.0	0.369837	-0.099390	0.163031	0.050194
79570	2.0	3.0	3.0	3.0	2.0	2.0	3.0	1.0	1.0	2.0	2.0	3.0	1.0	1.0	1.0	0.0	1.0	1.0	0.0	0.0	-0.159045	-1.687056	-0.383006	-0.387873
61071	3.0	4.0	3.0	3.0	2.0	3.0	0.0	2.0	0.0	4.0	2.0	2.0	1.0	2.0	0.0	1.0	1.0	0.0	0.0	1.0	-0.952369	-0.860146	-0.383006	-0.387873

Total features: 25

Ratings (ordinal): 14
Categoricals (one-hot dummies): 7
Numerics (scaled): 4

Categorical breakdown

Customer Type: 2 levels — Loyal Customer, disloyal Customer
Type of Travel: 2 levels — Business travel, Personal Travel
Class: 3 levels — Business, Eco, Eco Plus

Fit on TRAIN, transform TRAIN/VALID

This trains the preprocessor pipeline on X_train (learn imputation values, encoders, scalers), then applies those learned parameters on both splits to produce model-ready matrices (Xtr and Xva).

Code

# Fit transforms on TRAIN only, then transform both TRAIN and VALID
preprocessor.fit(X_train)
Xtr = preprocessor.transform(X_train)   # transformed train matrix
Xva = preprocessor.transform(X_valid)   # transformed valid matrix

Transformed shapes & dtypes

Verify the post-preprocessing matrices have the expected dimensions and storage type (dense vs sparse) before modeling.

Code

# Show the size and Python type for each split to confirm:
# - row counts match the original splits
# - column count equals total engineered features
# - storage is as expected (e.g., scipy.sparse.csr_matrix vs numpy.ndarray)
display(Markdown(f"**Xtr shape, dtype:** `{Xtr.shape}` `{type(Xtr)}`"))
display(Markdown(f"**Xva shape, dtype:** `{Xva.shape}` `{type(Xva)}`"))

Xtr shape, dtype: (103904, 25) <class 'numpy.ndarray'>

Xva shape, dtype: (25976, 25) <class 'numpy.ndarray'>

NaNs/inf check

Confirm no missing or infinite values remain after preprocessing—many models require finite values only.

Code

# Helper: detect NaN/Inf in either dense or sparse matrices
# - For sparse inputs, only non-zero entries live in `.data`; zeros are implicit and cannot be NaN/Inf.
# - For dense inputs, check the full array.
def has_nan_or_inf(X):
    if sparse.issparse(X):
        data = X.data
        return np.isnan(data).any() or np.isinf(data).any()
    return np.isnan(X).any() or np.isinf(X).any()

# Expect both to be False after proper imputation/scaling
display(Markdown(f"**NaN/Inf in Xtr:** `{has_nan_or_inf(Xtr)}`"))
display(Markdown(f"**NaN/Inf in Xva:** `{has_nan_or_inf(Xva)}`"))

NaN/Inf in Xtr: False

NaN/Inf in Xva: False

Zero-variance features (TRAIN)

Identify constant (zero-variance) features in TRAIN that carry no predictive signal.

Code

# Compute per-feature variance on TRAIN to flag dead (constant) columns.
# - For sparse matrices, compute var via E[x^2] - (E[x])^2 using only stored values;
#   zeros are implicit, so we avoid densifying.
# - For dense arrays, use the standard variance.
# Note: use a small epsilon instead of equality to 0 to avoid FP precision traps.
def col_variance(X):
    if sparse.issparse(X):
        mean = np.asarray(X.mean(axis=0)).ravel()
        mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()
        return mean_sq - mean**2
    return X.var(axis=0)

var = col_variance(Xtr)
eps = 1e-12  # numeric tolerance
zero_var = int((var <= eps).sum())
display(Markdown(f"**Zero-variance features in TRAIN:** `{zero_var}`"))

Zero-variance features in TRAIN: 0

Numeric scaling (z-scores, TRAIN)

Should see max absolute mean near 0 and std near 1 (tiny diviations are fine).

Code

# Verify numeric features are z-scored on TRAIN:
# - Locate columns produced by the numeric pipeline (prefix "nums__").
# - Extract only those cols from Xtr (handle sparse vs dense).
# - Check means ≈ 0 and stds ≈ 1 (population std, ddof=0, matching StandardScaler).

feat_names = preprocessor.get_feature_names_out()
nums_idx = [i for i, n in enumerate(feat_names) if n.startswith("nums__")]

Xtr_nums = Xtr[:, nums_idx].toarray() if sparse.issparse(Xtr) else Xtr[:, nums_idx]

m = Xtr_nums.mean(axis=0)
s = Xtr_nums.std(axis=0, ddof=0)  # population std like StandardScaler
display(Markdown(f"**Numeric z-score means — max |mean|:** `{float(np.abs(m).max())}`"))
display(Markdown(f"**Numeric z-score stds  — min/max:** `{float(s.min())}` / `{float(s.max())}`"))

Numeric z-score means — max |mean|: 1.414876155189066e-16

Numeric z-score stds — min/max: 0.9999999999999998 / 1.0

Unseen categories in VALID

Detect categorical values that appear only in VALID (not in TRAIN); these won’t have learned encodings and can silently drop to zeros with OneHotEncoder(handle_unknown=“ignore”), shifting feature distributions.

Code

# OneHotEncoder learns category levels from TRAIN only.
# If VALID has a level not seen in TRAIN, that column encodes to all-zeros (ignored),
#   which can change feature meaning or reduce model signal for those rows.
# This check lists any such "unseen-in-TRAIN" levels and how often they occur.


ohe = preprocessor.named_transformers_["cats"].named_steps["ohe"]
unseen = {}

for col, cats in zip(CAT_COLS, ohe.categories_):
    train_levels = set(cats)                               # learned from TRAIN
    val_levels   = set(X_valid[col].dropna().unique())     # observed in VALID
    missing_in_train = sorted(val_levels - train_levels)   # VALID-only levels

    if missing_in_train:
        counts = (
            X_valid.loc[~X_valid[col].isin(train_levels), col]
            .value_counts(dropna=False)
            .to_dict()
        )
        unseen[col] = {"levels": missing_in_train, "counts": counts}

# Display results in Markdown (no plain prints)
if unseen:
    lines = ["**⚠️ Unseen categorical levels found in VALID (not seen in TRAIN):**"]
    for col, info in unseen.items():
        levels_str = ", ".join(f"`{lv}`" for lv in info["levels"])
        lines.append(f"- **{col}**: {levels_str} &nbsp; *(counts: {info['counts']})*")
    display(Markdown("\n".join(lines)))
else:
    display(Markdown("**✅ All categorical values in VALID were seen in TRAIN (no unseen levels).**"))

✅ All categorical values in VALID were seen in TRAIN (no unseen levels).

Feature matrix density

This shows the fraction of entries in feature matrix that are non-zero. Helpful to anticipate model/memory behavior.

Code

# Compute fraction of non-zero entries (sparse-aware).
# For sparse matrices, use .nnz (stored non-zeros) over total cells.
# For dense arrays, count non-zeros directly.
def density(X):
    if sparse.issparse(X):
        return X.nnz / (X.shape[0] * X.shape[1])
    return np.count_nonzero(X) / (X.shape[0] * X.shape[1])

display(Markdown(f"**Density - TRAIN:** `{density(Xtr):.2%}`  |  **VALID:** `{density(Xva):.2%}`"))

Density - TRAIN: 77.52% | VALID: 77.54%

Ratings encoding range

Verify the ordinal-encoded ratings are in the expected 0..4 band (with unknown_value=-1 for unseen/invalid), which implies inputs were 1–5 and imputation worked.

Code

# Identify columns produced by the ratings ordinal encoder (prefix "ratings__"),
# then compute min/max over those columns for TRAIN and VALID.
# Expected:
#   - Encoded range: 0..4 (for original 1..5 ratings)
#   - If you see -1: an unknown/invalid category slipped through (e.g., NaN or out-of-range)

ratings_idx = [i for i, n in enumerate(feat_names) if n.startswith("ratings__")]

def min_max_cols(X, idxs):
    sub = X[:, idxs].toarray() if sparse.issparse(X) else X[:, idxs]
    return float(sub.min()), float(sub.max())

rmin_tr, rmax_tr = min_max_cols(Xtr, ratings_idx)
rmin_va, rmax_va = min_max_cols(Xva, ratings_idx)
display(Markdown(f"**Ratings encoded range - TRAIN:** `[{rmin_tr:.0f}, {rmax_tr:.0f}]` | **VALID:** `[{rmin_va:.0f}, {rmax_va:.0f}]`"))
if rmin_tr < 0 or rmin_va < 0:
    display(Markdown("**⚠️ Found values < 0 in ratings encoding (unknown_value=-1). Check rating inputs are 1..5 and imputation logic.**"))

Ratings encoded range - TRAIN: [0, 4] | VALID: [0, 4]

Model Comparison

Instantiate two complementary baselines: Logistic Regression (linear, interpretable) and Random Forest (non-linear, interaction-friendly) with class weighting for imbalance. Show a compact hyperparameter summary of the defined models.

Code

# Models to compare:
# - LogisticRegression: linear, interpretable baseline; class_weight='balanced' mitigates label imbalance.
#   (Defaults like solver='lbfgs' work with L2 penalty; we keep defaults here.)
# - RandomForestClassifier: non-linear, captures interactions; class_weight='balanced_subsample' reweights per tree;
#   n_jobs=-1 uses all cores; random_state ensures reproducibility.
models = {
    "logreg": LogisticRegression(max_iter=2000, class_weight="balanced"),
    "rf": RandomForestClassifier(
        n_estimators=400,
        class_weight="balanced_subsample",
        n_jobs=-1,
        random_state=42,
    ),
}

# Summarize the instantiated models and key hyperparameters (no training yet).
display(Markdown(f"**Defined models ({len(models)}):** `{list(models.keys())}`"))

def key_params(est):
    params = est.get_params(deep=False)
    keep = [
        "solver","penalty","C","class_weight",
        "n_estimators","max_depth","min_samples_leaf",
        "max_features","random_state","n_jobs"
    ]
    return {k: params[k] for k in keep if k in params}

summary = (
    pd.DataFrame(
        [{"model": name, "estimator": est.__class__.__name__, **key_params(est)}
         for name, est in models.items()]
    )
    .fillna("")
    # reorder columns for readability
    .loc[:, ["model", "estimator", "solver", "penalty", "C",
             "n_estimators", "max_depth", "min_samples_leaf",
             "max_features", "class_weight", "random_state", "n_jobs"]]
)

display(summary.style.hide(axis="index"))

Defined models (2): ['logreg', 'rf']

model	estimator	solver	penalty	C	n_estimators	max_depth	min_samples_leaf	max_features	class_weight	random_state	n_jobs
logreg	LogisticRegression	lbfgs	l2	1.000000					balanced
rf	RandomForestClassifier				400.000000		1.000000	sqrt	balanced_subsample	42.000000	-1.000000

Pipeline diagrams

Provides visual confirmation of “preprocessor -> classifier” diagram for each model. Note: These diagrams use a clone of the preprocessor and are not fitted. It’s just to visualize the structure.

Code

# Pretty HTML diagrams for estimators/pipelines
set_config(display='diagram') # Persists for the rest of the session/notebook

for name, est in models.items():
    display(Markdown(f"#### Pipeline: `{name}`"))
    # pipe_vis = Pipeline([('pre', clone(preprocessor)), ('clf', est)])
        
    # Clone the preprocessor to avoid mutating the fitted one later (this is a view-only diagram)
    pipe_vis = Pipeline([
        ("pre", clone(preprocessor)),  # unfitted clone: safe, fast, structural
        ("clf", est)                   # estimator instance (also unfitted here)
    ])

    # Render block diagram9structure only; no fitting performed)
    display(pipe_vis)

Pipeline: `logreg`

Pipeline: `rf`

Evaluation & Thresholding

Fit models and identify the best one for downstream analysis.

Fit on TRAIN, evaluate on VALID

Fit each model on the training matrix and score on the validation set. Plot Precision–Recall (PR) curves and report Average Precision (AP), which is more informative than ROC-AUC when classes are imbalanced.

Code

# Plot Precision-Recall (PR) curve and report Average Precision (AP)
plt.figure()

ap_table = []    # (model_name, AP on validation)
probs = {}       # VALID-set predicted probabilities per model

for name, clf in models.items():
    # Fit on TRAIN
    clf.fit(Xtr, y_train)

    # Probs for the positive class on VALID
    prob = clf.predict_proba(Xva)[:, 1]
    probs[name] = prob

    # PR curve + AP
    precision, recall, _ = precision_recall_curve(y_valid, prob)
    ap = average_precision_score(y_valid, prob)
    ap_table.append((name, ap))

    # Plot PR curve
    plt.plot(recall, precision, label=f"{name} (AP={ap:.3f})")

# No-skill baseline = positive rate in VALID (random guesser precision)
prevalence = y_valid.mean()
plt.hlines(prevalence, xmin=0, xmax=1, linestyles='dashed',
           label=f'baseline (pos rate={prevalence:.3f})')

# Cosmetics
plt.xlim(0, 1); plt.ylim(0, 1)
plt.xlabel("Recall"); plt.ylabel("Precision")
plt.title("Holdout PR Curves (validation) + baseline")
plt.legend(); plt.show()

# AP summary table (higher is better)
pd.DataFrame(ap_table, columns=['model','avg_precision']).sort_values('avg_precision', ascending=False)

	model	avg_precision
1	rf	0.993665
0	logreg	0.940099

Interpreting results:

The curve plots precision (y) vs recall (x) as you vary the decision threshold.
Curves closer to top right are better (rf).
Average Precision (AP): A single-number summary of the entire PR curve (higher is better).
Baseline: The positive rate (prevelance) is the no-skill baseline for precision; AP meaningfully above this indictes real lift.
Model Choice: Random Forest (rf) produces a stronger ranking of satisfied vs dissatisfied customers on the houdout set.
Separation is strong: APs this high mean the models can rank most “satisfied” cases ahead of “dissatisfied” cases very reliably.
Threshold is a policy knob: The PR Curve is threshold-agnostic. You’ll still choose a decision threshold later to trade off preceision (fewer false alarms) vs recall (catch more dissatisfied cases).

On a stratified 20% validation holdout, Random Forest achieved AP ≈ 0.994, outperforming Logistic Regression (AP ≈ 0.940), indicating substantially better ranking of customer satisfaction. Because features include post-flight service ratings, the task reflects a driver analysis framing, so high AP is expected.

Calibration curves

Calibration checks whether predicted probabilities match reality (e.g., among cases scored ~0.7, ≈70% should be positive).

Code

# Calibration curves (both models)
plt.figure()
for name, prob in probs.items():
    frac_pos, mean_pred = calibration_curve(y_valid, prob, n_bins=10, strategy="quantile")
    display(Markdown(f"**{name} non-empty bins:** `{len(mean_pred)}`"))
    plt.plot(mean_pred, frac_pos, marker="o", label=name)
plt.plot([0,1],[0,1],"--", color="gray", linewidth=1)
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.title("Calibration curves (validation)")
plt.legend(); plt.show()

logreg non-empty bins: 10

rf non-empty bins: 8

Interpreting results:

Closer to the diagonal = better. A curve hugging y=x line means well-calbrated probabilities
Below the line = over-confident. Model predicts too high; true positive rate is lower than the score.
Above the line = under-confident. Model predicts too low; true positive rate is higher than the score.
Each point represents a bin (grouping validation rows by predicted score).
- Logistic Regression typically produces smoother, more continuous spread of probabilities (so all 10 quanitile bins tend to be non-empty).
- Random Forests often yields concentrated/extreme or tied probabilities; some quantile cutpoints collapse, creating empty bins.
- Note: This only affects plot point count only - not AP.
On this holdout, Logistic Regression appears better calibrated (hugs diagonal more consistently)
- RF can separate classes better (higher AP) while raw probabilities are a bit over/under-confident compared to LogReg

Even though LogReg’s probabilities are better calibrated, Random Forest separates classes much better (higher AP ≈ ranking quality), and we can easily calibrate RF’s probabilities afterward—so RF is the stronger overall choice for this task.

Predicted probability distribution

This overlays score histograms for each model to show how sharply each separates cases.

Code

# Predicted probability distribution (VALID)
# - Compare model sharpness by overlaying calibrated class-1 probability histograms.
# - Common bin edges [0,1] keep scales identical across models.

# probs: dict[str, np.ndarray] where each value is p(y=1) on VALID
# e.g., probs = {"logreg": p_lr, "rf": p_rf_cal}
plt.figure()
bin_edges = np.linspace(0, 1, 21)  # 20 equal-width bins from 0 to 1

for name, prob in probs.items():
    prob = np.asarray(prob).ravel()              # ensure 1D
    plt.hist(prob, bins=bin_edges, histtype="step",
             density=True, label=name)

# (Optional) show your decision threshold
# thresh = 0.5
# plt.axvline(thresh, linestyle="--")

plt.xlabel("Predicted probability (p[y=1] on VALID)")
plt.ylabel("Density")
plt.title("Predicted probability distribution (validation)")
plt.grid(alpha=0.1)
plt.legend()
plt.tight_layout()
plt.show()

Interpreting results:

Bimodal is usually good. Peaks near 0 and 1 mean strong separation. Thresholds are easier to choose.
Mass in the middle indicates weaker separation; precision will fall quickly as you push recall higher.
More spread toward the extremes typically correlates with higher AP (better ranking), assuming similar prevalence and no severe callibration issues.

Score histograms show how sharply a model separates cases. Here, RF puts more probability mass near 0 and 1, consistent with its higher AP. It ranks cases more sharply even if its probabilities are less well calibrated than LogReg.

Evaluation metrics table

Compare discrimination (AP) with probability quality (Brier, Expected Calibration Error (ECE).

Code

# Compact metrics table: AP (from earlier), Brier score, ECE
def expected_calibration_error(y_true, y_prob, n_bins=10):
    # Simple ECE: weighted average of |bin_accuracy - bin_confidence|
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, bins) - 1
    ece = 0.0
    total = len(y_true)
    for b in range(n_bins):
        mask = idx == b
        if not np.any(mask):
            continue
        conf = y_prob[mask].mean()
        acc = y_true[mask].mean()
        w = mask.mean()
        ece += w * abs(acc - conf)
    return ece

# Start from your AP table, then add Brier and ECE
ap_df = pd.DataFrame(ap_table, columns=['model','avg_precision']).set_index('model')

rows = []
for name, prob in probs.items():
    rows.append({
        "model": name,
        "avg_precision": ap_df.loc[name, "avg_precision"],
        "brier": brier_score_loss(y_valid, prob),           # lower is better
        "ece_10bin": expected_calibration_error(y_valid.values, prob, n_bins=10)  # lower is better
    })
cmp_df = pd.DataFrame(rows).sort_values("avg_precision", ascending=False)
display(cmp_df.style.hide(axis="index"))

model	avg_precision	brier	ece_10bin
rf	0.993665	0.036888	0.031158
logreg	0.940099	0.109240	0.021468

Interpreting results:

Average Precision (AP): Overall ranking quality from the PR curve (area under PR). Higher is better → RF wins.
Brier score: Mean squared error of probabilities (captures calibration + sharpness). Lower is better → RF wins.
Expected Calibration Error (ECE): Average gap between predicted confidence and observed positive rate across bins; Lower is better (closer to perfectly calibrated).
- In this run, LogReg has lower ECE which matches the calibration plot and its curve hugs the diagonal more closely than RF.

Random Forest (rf) is best overall choice: It delivers stronger ranking (higher AP) and lower overall probability error (Brier). We can then calibrate RF’s probabilities afterward to close the ECE gap — keeping its ranking strength while improving probability reliability.

Random Forest probability calibration (5-fold CV)

Improve probability reliability by fitting an isotonic calibrator on TRAIN (CV=5), then evaluate on VALID; ranking (AP) should stay similar while Brier/ECE improve.

Code

# Base RF (matching params used)
rf_base = RandomForestClassifier(
    n_estimators=400,
    class_weight='balanced_subsample',
    n_jobs=-1,
    random_state=42
)

# Isotonic calibration with 5-fold CV on TRAIN only (leakage-safe)
rf_cal = CalibratedClassifierCV(estimator=rf_base, method='isotonic', cv=5)  # <-- changed here
rf_cal.fit(Xtr, y_train)

# Compare RF before vs after calibration on VALID
prob_rf_uncal = probs['rf']                      # from step 10
prob_rf_cal   = rf_cal.predict_proba(Xva)[:, 1]  # calibrated probs

# If you don't already have ECE defined:
def expected_calibration_error(y_true, y_prob, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, bins) - 1
    ece = 0.0
    for b in range(n_bins):
        m = idx == b
        if not np.any(m):
            continue
        ece += m.mean() * abs(y_prob[m].mean() - y_true[m].mean())
    return ece

ap_uncal    = average_precision_score(y_valid, prob_rf_uncal)
ap_cal      = average_precision_score(y_valid, prob_rf_cal)
brier_uncal = brier_score_loss(y_valid, prob_rf_uncal)
brier_cal   = brier_score_loss(y_valid, prob_rf_cal)
ece_uncal   = expected_calibration_error(y_valid.values, prob_rf_uncal, n_bins=10)
ece_cal     = expected_calibration_error(y_valid.values, prob_rf_cal,   n_bins=10)


display(Markdown(f"**RF AP:** `{ap_uncal:.3f}` → `{ap_cal:.3f}`  (ranking should be similar)"))
display(Markdown(f"**RF Brier:** `{brier_uncal:.4f}` → `{brier_cal:.4f}`  (↓ is better)"))
display(Markdown(f"**RF ECE:** `{ece_uncal:.4f}` → `{ece_cal:.4f}`  (↓ is better)"))

# Plot: RF calibration before vs after
plt.figure()
for label, p in {"RF (uncalibrated)": prob_rf_uncal, "RF (calibrated)": prob_rf_cal}.items():
    frac_pos, mean_pred = calibration_curve(y_valid, p, n_bins=10, strategy="quantile")
    plt.plot(mean_pred, frac_pos, marker="o", label=label)
plt.plot([0,1],[0,1],"--", color="gray", linewidth=1)
plt.xlabel("Mean predicted probability"); plt.ylabel("Fraction of positives")
plt.title("RF calibration: before vs after (validation)")
plt.legend(); plt.show()

# Use calibrated probabilities for thresholding in Step 12
best_name = "rf_cal"
best_prob = prob_rf_cal

RF AP: 0.994 → 0.994 (ranking should be similar)

RF Brier: 0.0369 → 0.0348 (↓ is better)

RF ECE: 0.0312 → 0.0026 (↓ is better)

Interpreting results:

What changed: Calibration learns a monotonic mapping to RF’s scores via CV on TRAIN
- Ranking (AP) stays about the same while probability reliability improves.
- Any tiny AP drift is expected but usually negligible with isotonic.
Visual: On validation, the RF curve shifts closer to the diagonal.
Metrics: Brier and ECE decreased (better calibration).

Completing calibration keeps the RF model’s stronger ranking and gives better-behaved probabilities for thesholding or probability-based actions.

Decision threshold sweep

Sweep the decision threshold to visualize precision/recall/F1 across thresholds and select the operating point that maximizes F1 on VALID.

Code

# Decision threshold sweep
# Inputs expected:
# - best_prob: array-like of calibrated p(y=1) on VALID (e.g., from best model)
# - y_valid  : ground-truth labels (0/1) for VALID

# Threshold grid (0.05..0.95, step 0.05)
ts = np.linspace(0.05, 0.95, 19)

precisions, recalls, f1s = [], [], []
for t in ts:
    yhat = (best_prob >= t).astype(int)
    precisions.append(precision_score(y_valid, yhat, zero_division=0))
    recalls.append(recall_score(y_valid, yhat, zero_division=0))
    f1s.append(f1_score(y_valid, yhat, zero_division=0))

# Pick the F1-maximizing threshold (first best if ties)
best_idx = int(np.nanargmax(f1s))
best_t = float(ts[best_idx])

# Compact summary (Markdown + table)
display(Markdown(
    f"**Chosen threshold (max F1):** `{best_t:.2f}`  |  "
    f"**Precision:** `{precisions[best_idx]:.3f}`  |  "
    f"**Recall:** `{recalls[best_idx]:.3f}`  |  "
    f"**F1:** `{f1s[best_idx]:.3f}`"
))

# summary_tbl = pd.DataFrame({
#     "threshold": ts,
#     "precision": precisions,
#     "recall": recalls,
#     "f1": f1s
# })
# display(summary_tbl.style.hide(axis="index").format({
#     "threshold": "{:.2f}", "precision": "{:.3f}", "recall": "{:.3f}", "f1": "{:.3f}"
# }))

# Plot curves
plt.figure()
plt.plot(ts, precisions, label="Precision")
plt.plot(ts, recalls,   label="Recall")
plt.plot(ts, f1s,       label="F1")
plt.axvline(best_t, linestyle="--", label=f"chosen = {best_t:.2f}")
plt.ylim(0, 1); plt.xlim(ts.min(), ts.max())
plt.xlabel("Threshold"); plt.ylabel("Score"); plt.title("Threshold sweep (validation)")
plt.grid(alpha=0.2)
plt.legend()
plt.tight_layout()
plt.show()

Chosen threshold (max F1): 0.50 | Precision: 0.967 | Recall: 0.942 | F1: 0.955

Interpreting results:

The dashed line marks the best discrete threshold among 19 sampled cutoffs (here 0.50).
F1 peak ≠ where P=R: F1 (harmoic mean) is often highest where precision and recall are both strong and reasonably balanced, but not necessarily at their intersection.
- The visible interesection of the interpolated curves can lie between sampled thresholds.
- This explains why the chosen threshold may not exactly align with the crossing point.

This threshold sweep picks the operating threshold that maximizes F1 over the sampled grid (here 0.50), yielding a practical balance of precision and recall. Any visual offset comes from plotting smooth lines between discrete evaluation points.

Confusion matrix

This shows how predictions break down by actual class at the selected threshold; diagonals are correct, off-diagonals are errors.

Code

# Confusion matrix
# Ensure your y encoding is: dissatisfied -> 0, satisfied -> 1
labels = ["dissatisfied", "satisfied"]

y_pred = (best_prob >= best_t).astype(int)

# Counts and row-normalized percentages
cm_counts = confusion_matrix(y_valid, y_pred, labels=[0, 1])                 # [[TN,FP],[FN,TP]]
cm_perc   = confusion_matrix(y_valid, y_pred, labels=[0, 1], normalize="true")

# Annotate with "count\n(percent)"
annot = (np.array([f"{c:,}\n({p:.1%})" for c, p in zip(cm_counts.ravel(), cm_perc.ravel())])
         .reshape(cm_counts.shape))

plt.figure(figsize=(6, 6))
sns.heatmap(cm_perc, annot=annot, fmt="", vmin=0, vmax=1, cmap="Blues",
            xticklabels=labels, yticklabels=labels)
plt.title(f"Confusion Matrix @ threshold={best_t:.2f}")
plt.xlabel("Predicted"); plt.ylabel("Actual")
plt.tight_layout(); plt.show()

Interpreting results:

Classes: dissatisfied = 0, satisfied = 1
Cells:
- Top-left = true dissatisfied, predicted dissatisfied (correct).
- Top-right = true dissatisfied, predicted satisfied (these are the “missed dissatisfied” cases).
- Bottom-left = true satisfied, predicted dissatisfied (false alarms).
- Bottom-right = true satisfied, predicted satisfied (correct).
Precision/Recall link (for the positive class = satisfied):
- Precision = bottom-right / (right column total).
- Recall = bottom-right / (bottom row total).
Threshold intuition:
- Raise the threshold to predict “satisfied” less often → fewer false alarms (bottom-left ↓) but more missed satisfied (bottom-right ↓).
- Lower the threshold to predict “satisfied” more often → higher recall for satisfied (bottom-right ↑) but more false alarms (bottom-left ↑).

At the chosen operating point (from the F1 sweep), this matrix makes the trade-offs explicit: it balances how many satisfied customers we correctly recognize against how many dissatisfied customers slip through as “satisfied.” If your business goal changes (e.g., “never miss a dissatisfied customer”), adjust the threshold accordingly and re-read this matrix to verify the new trade-off.

04 - Personas (Clustering) & Diagnostics

It’s now time to build personas, which are coherent groups (clusters) of passengers who share similar service-rating patterns. This helps to discover who tends to rate what, then link each group to its satisfaction rate so we can propose targeted actions.

Feature Prep & Standardization

Use the 14 post-flight rating features, median-impute missing values, and z-score standardize so K-Means treats all factors fairly (Euclidean distance sensitive to scale).

Code

# Prepare features (ratings only) and standardize
# Continue using TARGET, y and RATING_COLS already declared

# Ratings only; median-impute
X_rat = dfc[RATING_COLS].copy().fillna(dfc[RATING_COLS].median(numeric_only=True))

# Standardize (store the scaler for later inverse_transform of centroids)
scaler = StandardScaler()
X_rat_s = scaler.fit_transform(X_rat)

# -- Sanity Checks --
# Check data shapes
display(Markdown("#### Data shapes:"))
display(Markdown(f"**X_rat:** `{X_rat.shape}`"))
display(Markdown(f"**X_rat_s (standardized):** `{X_rat_s.shape}`"))

# Ensure no missing values remain
assert not np.isnan(X_rat_s).any(), "Unexpected NaNs after standardization"

# Columnwise mean=0, std≈1 (tolerate small float noise)
xs_df = pd.DataFrame(X_rat_s, columns=RATING_COLS)
std_summary = pd.DataFrame({
    "mean": xs_df.mean().round(3),
    "std":  xs_df.std(ddof=0).round(3)  # population std to match scaler convention
})
display(std_summary)

Data shapes:

X_rat: (129880, 14)

X_rat_s (standardized): (129880, 14)

	mean	std
Seat comfort	-0.0	1.0
Departure/Arrival time convenient	-0.0	1.0
Food and drink	0.0	1.0
Gate location	0.0	1.0
Inflight wifi service	-0.0	1.0
Inflight entertainment	0.0	1.0
Online support	-0.0	1.0
Ease of Online booking	-0.0	1.0
On-board service	0.0	1.0
Leg room service	0.0	1.0
Baggage handling	-0.0	1.0
Checkin service	-0.0	1.0
Cleanliness	-0.0	1.0
Online boarding	0.0	1.0

Interpreting results:

Why standardize: K-Means uses Euclidean distance; z-scoring keeps all rating factors on equal footing.
Model vs. reporting space: We cluster on Xs (standardized) but report personas in 1–5 by inverse-transforming centroids.
Keep the scaler: Retaining the fitted scaler lets us invert centroids back to original units later.
Sanity checks: Means ≈0 and std ≈1 confirm the transform; no NaNs confirmed so clustering won’t error out.
About “−0.0”: That’s floating-point rounding of tiny negatives (numerically zero).

K selection overview

Choose how many personas (K) to learn by balancing objective separation metrics with interpretability and actionability.

Elbow & silhouette scans

Sweep K=2–8 and plot elbow (inertia) and silhouette to shortlist sensible K values.

Code

# Elbow + Silhouette scan
# Sweep K=2–8 with MiniBatchKMeans, collect inertia and silhouette (sampled for speed),
# then plot both and show a small reference table.

# --- grid to try + holders for metrics ---
k_grid   = range(2, 9)
inertias = []
sils     = []

for k in k_grid:
    # MiniBatchKMeans: fast on large n, robust with n_init>1
    km = MiniBatchKMeans(
        n_clusters=k, n_init=20, batch_size=2048, max_iter=150, random_state=42
    )
    labels = km.fit_predict(X_rat_s)         # fit on standardized ratings (X_rat_s)
    inertias.append(km.inertia_)             # within-cluster SSE (lower is better)
    # Silhouette on a sample (10k) to keep O(n); higher is better (separation/compactness)
    sils.append(silhouette_score(X_rat_s, labels, sample_size=10_000, random_state=42))

# --- plotting (side-by-side) ---
# Scale inertia to engineering units so the y-axis is readable (e.g., ×10^6)
p = int(np.floor(np.log10(max(inertias)) / 3) * 3)
p = max(0, min(p, 9))    # clamp to a sensible range
scale = 10 ** p
unit  = {0: "", 3: "×10³", 6: "×10⁶", 9: "×10⁹"}.get(p, f"×10^{p}")

fig, ax = plt.subplots(1, 2, figsize=(10, 4), constrained_layout=True)

# Elbow: look for the bend where marginal gains diminish
ax[0].plot(list(k_grid), np.array(inertias) / scale, marker="o")
ax[0].set_title("Elbow — Inertia (↓ better)")
ax[0].set_xlabel("Number of clusters (K)")
ax[0].set_ylabel(f"Inertia ({unit})")

# Silhouette: higher is better; helps pick among elbow candidates
ax[1].plot(list(k_grid), sils, marker="o")
ax[1].set_title("Silhouette (↑ better) — sampled 10k")
ax[1].set_xlabel("Number of clusters (K)")
ax[1].set_ylabel("Silhouette")

# a touch more horizontal breathing room between plots
fig.set_constrained_layout_pads(wspace=0.15)

plt.show()

# --- small reference table ---
k_metrics = pd.DataFrame({"K": list(k_grid), "Inertia": inertias, "Silhouette": sils})
display(k_metrics.style.hide(axis="index"))

K	Inertia	Silhouette
2	1459276.405169	0.195938
3	1281288.316555	0.161604
4	1176424.766475	0.155512
5	1111903.382490	0.145347
6	1097321.491932	0.095158
7	1026440.110634	0.114746
8	999085.499274	0.110523

Interpreting results:

Elbow: Inertia drops steeply then flattens around K≈5. Shortlist K=4–5.
Silhouette: Peaks at K=2 (highest separation). Keep K=3 in play as runner-up.
Next: Evalue K (candidates = [2,3,4,5]) to pick the smallest K that’s clearly distinct and actionable.

Compare K candidates (metrics)

For the shortlisted K values, compute Silhouette, Calinski–Harabasz, Davies–Bouldin, satisfaction gap, min cluster %, and centroid similarity, then preview persona summaries to balance separation with actionability.

Code

# Helper: fit & summarize personas for a given K
def fit_personas_for_k(X_rat, X_rat_s, y, rating_cols, k, sample_size=10_000, random_state=42, scaler=None):
    """
    Fit MiniBatchKMeans for K clusters and return:
      - labels, centers in original 1–5 scale, drivers (Δ vs global mean),
      - counts, sat_rate, compact summary,
      - metrics: silhouette↑, CH↑, DB↓, sat_gap, min_cluster_pct, max_centroid_cos_sim.

    X_rat  : DataFrame of ratings (1–5), imputed
    X_rat_s : ndarray standardized from X (the array used for clustering)
    y  : 0/1 Series aligned to X_rat
    scaler: Optional StandardScaler fitted on X_rat (if None, will fit inside)
    """
    km = MiniBatchKMeans(n_clusters=k, n_init=20, batch_size=2048, max_iter=200, random_state=random_state)
    labels = km.fit_predict(X_rat_s)

    # --- separation/compactness metrics ---
    sil = silhouette_score(X_rat_s, labels, sample_size=min(sample_size, len(X_rat_s)), random_state=random_state)  # ↑
    ch  = calinski_harabasz_score(X_rat_s, labels)                                                             # ↑
    db  = davies_bouldin_score(X_rat_s, labels)                                                                # ↓
    counts = pd.Series(labels).value_counts().sort_index()

    # --- centroids: z-space -> original 1–5 scale (readable for stakeholders) ---
    if scaler is None:
        scaler = StandardScaler().fit(X_rat)  # OK if X_rat unchanged; passing the external scaler is safest
    centers_z    = km.cluster_centers_
    centers_orig = pd.DataFrame(
        scaler.inverse_transform(centers_z),
        index=range(k), columns=rating_cols
    ).round(2)

    # --- per-cluster satisfaction rate ---
    sat_rate = pd.Series(y.values).groupby(labels).mean().rename("sat_rate").round(3)

    # --- drivers vs global mean (what’s unusually high/low for each persona) ---
    drivers = (centers_orig - X_rat.mean()).round(2)

    # --- redundancy check (are two personas near-duplicates?) ---
    cos_sim = cosine_similarity(centers_z)
    max_sim = float(pd.DataFrame(cos_sim).where(~np.eye(k, dtype=bool)).max().max()) if k > 1 else 1.0

    # --- compact persona table for quick review ---
    summary = (
        centers_orig.join([sat_rate, counts.rename("n")])
        .loc[:, ["n", "sat_rate"] + rating_cols]
        .sort_values("sat_rate", ascending=False)
    )

    # --- extra selection signals ---
    sat_gap = float(sat_rate.max() - sat_rate.min())
    min_pct = float((counts.min() / counts.sum()) * 100)

    return {
        "k": k,
        "labels": labels,
        "centers_orig": centers_orig,
        "drivers": drivers,
        "counts": counts.rename("n"),
        "sat_rate": sat_rate,
        "summary": summary,
        "metrics": {
            "silhouette": sil,
            "calinski_harabasz": ch,
            "davies_bouldin": db,
            "sat_gap": sat_gap,
            "min_cluster_pct": min_pct,
            "max_centroid_cos_sim": max_sim
        }
    }

# K comparison driver
candidates = [2, 3, 4, 5]   # shortlist from 2a
results, rows = {}, []

for k in candidates:
    r = fit_personas_for_k(X_rat, X_rat_s, y, RATING_COLS, k, scaler=scaler)  # pass the same scaler used for X_rat_s
    results[k] = r
    rows.append({
        "K": k,
        "Silhouette": r["metrics"]["silhouette"],
        "Calinski–Harabasz": r["metrics"]["calinski_harabasz"],
        "Davies–Bouldin": r["metrics"]["davies_bouldin"],
        "Sat gap": r["metrics"]["sat_gap"],
        "Min cluster %": r["metrics"]["min_cluster_pct"],
        "Max centroid cos sim": r["metrics"]["max_centroid_cos_sim"]
    })

# Nicely formatted comparison table (sorted by Silhouette)
display(Markdown("**K Comparison Table (sorted by Silhouette)**"))
cmp = pd.DataFrame(rows).sort_values("Silhouette", ascending=False)
# with pd.option_context('display.float_format', '{:,.3f}'.format):
#     display(cmp)
display(cmp.style.hide(axis="index").format({
    "silhouette":"{:.3f}",
    "calinski_harabasz":"{:.1f}",
    "davies_bouldin":"{:.3f}",
    "sat_gap":"{:.3f}",
    "min_cluster_pct":"{:.1%}",
    "max_centroid_cos_sim":"{:.3f}",
}))
display(Markdown("<br>"))

# Persona summaries per K
for k in candidates:
    display(Markdown(f"**Persona Summary (K={k}))**"))
    # display(results[k]["summary"])
    summ_res = results[k]["summary"]
    summ_res = summ_res.sort_index()
    summ_res = summ_res.rename_axis("P").reset_index()
    display(
        summ_res.style
            .hide(axis="index")
            .format({"n": "{:,.0f}", "sat_rate": "{:.1%}", **{c: "{:.1f}" for c in RATING_COLS}})
    )
    display(Markdown("<br>"))

K Comparison Table (sorted by Silhouette)

K	Silhouette	Calinski–Harabasz	Davies–Bouldin	Sat gap	Min cluster %	Max centroid cos sim
2	0.195938	31987.865724	1.926666	0.524000	39.464891	-0.999985
3	0.161604	27256.286121	1.850533	0.595000	29.513397	-0.220235
4	0.155512	23787.627194	1.864007	0.613000	21.348168	0.011048
5	0.145347	20667.639308	1.933001	0.763000	15.258700	0.494366

Persona Summary (K=2))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	51,257	23.0%	2.5	3.1	2.8	3.0	2.4	2.8	2.5	2.2	2.8	2.9	3.1	2.9	3.1	2.3
1	78,623	75.4%	3.2	3.2	3.1	3.0	3.8	3.9	4.2	4.3	3.9	3.9	4.1	3.6	4.1	4.0

Persona Summary (K=3))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	38,332	79.3%	4.3	4.3	4.2	4.0	3.6	4.1	4.0	4.1	4.0	3.9	4.1	3.6	4.1	3.8
1	44,933	19.8%	2.5	3.1	2.8	3.0	2.3	2.7	2.4	2.1	2.7	2.9	3.0	2.8	3.0	2.2
2	46,615	68.3%	2.4	2.3	2.2	2.2	3.9	3.7	4.2	4.3	3.8	3.8	4.0	3.6	4.0	4.1

Persona Summary (K=4))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	36,224	79.1%	4.2	4.3	4.2	4.0	3.7	4.1	4.0	4.1	3.9	3.9	4.1	3.6	4.1	3.8
1	32,978	17.8%	2.5	3.3	2.9	3.1	1.9	2.6	2.0	1.9	2.9	3.0	3.3	2.9	3.4	1.8
2	32,951	78.9%	2.2	2.3	2.2	2.1	3.7	3.9	4.2	4.4	4.3	4.2	4.5	3.9	4.5	4.0
3	27,727	38.1%	2.6	2.4	2.3	2.5	3.9	3.3	3.9	3.4	2.5	2.7	2.7	2.9	2.7	3.9

Persona Summary (K=5))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	19,818	8.1%	1.9	2.7	2.3	2.8	1.8	2.1	2.0	1.8	2.6	2.8	3.0	2.7	3.0	1.8
1	28,750	84.4%	4.3	4.3	4.2	4.1	3.9	4.2	4.3	4.3	4.0	4.0	4.2	3.8	4.2	4.1
2	32,623	79.3%	2.1	2.2	2.1	2.0	3.7	3.8	4.2	4.4	4.3	4.2	4.4	3.8	4.4	4.0
3	21,489	39.7%	3.5	3.9	3.7	3.4	2.3	3.3	2.3	2.6	3.6	3.5	3.9	3.3	3.9	2.2
4	27,200	39.8%	2.8	2.7	2.7	2.8	3.9	3.4	3.9	3.4	2.4	2.6	2.6	2.8	2.6	3.9

Interpreting K metrics summary:

Silhouette (↑): Peaks at K=2 (≈0.196) and declines thereafter.
Calinski–Harabasz (↑): Also favors K=2 (≈31,988), consistent with silhouette.
Davies–Bouldin (↓): Best at K=3 (≈1.851), indicating slightly tighter/clearer centroids by this index.
Satisfaction gap (↑): Increases with K (≈0.524 → 0.763) as clusters isolate extremes — useful for targeting, but risks over-segmentation.
Min cluster %: Healthy across K (≥ ~15%), so no tiny “junk” clusters.
Max centroid cosine similarity: Low/negative (good). For K=2, ≈−1.0 means the two centroids are almost opposites in z-space—strong separation, not redundancy.

What this suggests:

K=2: Cleanest global split (two opposite profiles), simplest story and action plan.
K=3: Reasonable compromise—decent silhouette, best DB, likely three interpretable personas.
K=5: More granularity (larger sat gap) but weaker separation (silhouette/CH) and more personas to maintain.

Final choice:

Use K=2: Maximizes Silhouette and CH, yields distinct, opposite profiles with balanced sizes, and keeps the narrative/action plan simple. Nuance is later recovered via the Deep Dive Within Persona 0 (Dissatisfied) section.

Select final K

Lock the final K (based on metrics + interpretability), assign persona labels to every row, and cache key artifacts for downstream use.

Code

# --- Finalize persona count (set after reviewing 2a/2b) ---
K_FINAL = 2          # ← your selected K
labels  = results[K_FINAL]["labels"]

# Persist persona labels onto the full dataset
dfc["persona"] = labels

# Keep handy: summaries/artifacts for displays and exports
centers_orig = results[K_FINAL]["centers_orig"]   # centroids in 1–5 rating scale
drivers      = results[K_FINAL]["drivers"]        # Δ vs global mean (drivers per persona)
sat_rate     = results[K_FINAL]["sat_rate"]       # satisfaction rate per persona
sizes        = results[K_FINAL]["counts"]         # counts per persona

# Optional export of artifacts
centers_orig.to_csv("../artifacts/persona_centers_k2.csv", index_label="persona")
drivers.to_csv("../artifacts/persona_drivers_k2.csv", index_label="persona")

# (Optional) quick sanity print
display(Markdown(f"**K_FINAL = `{K_FINAL}`**"))
# print(f"counts:\n{sizes.to_string()}")


# sizes is a Series indexed by persona → make a tidy 2-column table
sizes_tbl = (
    sizes
        .rename_axis("persona")
        .reset_index(name="counts")
        .astype({"persona": int, "counts": int})
)

display(
    sizes_tbl
        .style.hide(axis="index")
        .format({"counts": "{:,.0f}"})
)

K_FINAL = 2

persona	counts
0	51,257
1	78,623

Persona Profiles (Final K)

Profile each persona by size, satisfaction rate, mean ratings, and key drivers; then visualize differences to name segments and propose actions.

Persona summary (N, sat_rate, mean ratings)

A compact table showing persona size, satisfaction rate, and mean ratings per factor for quick “who is happier and why” at a glance.

Code

# Persona summary (counts + sat_rate + mean ratings)
# Build a compact, presentation-friendly table with:
# - n (cluster size), sat_rate (% satisfied), and mean ratings (1–5) per factor
# - subtle heatmap on ratings and a readable bar for sat_rate

# Assemble the tidy summary table (few rows -> fast to render)
summary_k = (
    centers_orig.join([sat_rate, sizes])                   # add sat_rate and counts to centroids
    .loc[:, ["n", "sat_rate"] + RATING_COLS]               # column order: n, sat, then ratings
    # .sort_values("sat_rate", ascending=False)              # show happiest persona first
    .sort_index()
    .rename_axis("P").reset_index()
)

# Friendly number formats
# ratings -> 1 decimal, sat_rate -> %, n -> thousands with commas
fmt = {**{c: "{:.1f}" for c in RATING_COLS}, "sat_rate": "{:.1%}", "n": "{:,.0f}"}

# Style helper: auto-contrast text over the sat_rate bar
def _sat_text_contrast(col: pd.Series):
    # Flip sat_rate text to white + bold when high so it stays readable over the bar fill
    return ["color:white; font-weight:bold;" if v >= 0.70 else "color:black;" for v in col]

# Style: gradient on ratings, bar on sat_rate, caption, formats
styled = (
    summary_k.style
    .hide(axis="index")
    .format(fmt)
    .background_gradient(subset=RATING_COLS, cmap="Blues", vmin=1, vmax=5)
    .background_gradient(subset=["sat_rate"], cmap="Blues", vmin=0, vmax=1)  # <- replace .bar(...)
    .apply(_sat_text_contrast, subset=["sat_rate"])
)

display(Markdown("#### Persona profiles: mean ratings (1–5), size (n), and satisfaction rate"))
display(styled)

Persona profiles: mean ratings (1–5), size (n), and satisfaction rate

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	51,257	23.0%	2.5	3.1	2.8	3.0	2.4	2.8	2.5	2.2	2.8	2.9	3.1	2.9	3.1	2.3
1	78,623	75.4%	3.2	3.2	3.1	3.0	3.8	3.9	4.2	4.3	3.9	3.9	4.1	3.6	4.1	4.0

Interpreting results:

Which persona is happier? Compare sat_rate; Persona 1 is most satisfied.
Spot pain vs strengths: Ratings near 1-2 flag pain points; 4-5 are strengths.
Cross-check with drivers: Validate takeaways against top drivers deltas and heatmap in the next sections.
Prioritize fixes: A large n with a low sat_rate is the biggest improvement opportunity.

Persona cards (top drivers)

For each persona, list the lowest vs average (pain points) and highest vs average (strengths) factors to name segments and propose targeted fixes.

Code

# Persona cards with visible feature names ("Factor")
# Shows top lows/highs per persona with Factor, mean rating, and Δ vs avg.

def persona_cards(
    centers_orig: pd.DataFrame,   # centroids in 1–5 scale (rows=personas, cols=ratings)
    drivers: pd.DataFrame,        # Δ vs global mean (same shape)
    sizes: pd.Series,             # counts per persona
    sat_rate: pd.Series,          # satisfaction rate per persona
    top: int = 4,                 # # of lows/highs to show
    delta_sig: float = 0.30       # bold |Δ| ≥ this
):
    # Symmetric color scale for deltas so colors are comparable across personas
    max_abs_delta = float(np.nanmax(np.abs(drivers.values))) if len(drivers) else 0.0
    max_abs_delta = max(max_abs_delta, delta_sig)

    html_blocks = []
    for c in centers_orig.index:
        # Build a tidy table WITH Factor as a visible column
        card = pd.DataFrame({
            "Factor": centers_orig.columns,
            "mean rating": centers_orig.loc[c].values,
            "Δ vs avg":    drivers.loc[c].values
        }).sort_values("Δ vs avg")  # lows first

        # Split into lows/highs
        lows  = card.head(top).copy()
        highs = card.tail(top).iloc[::-1].copy()  # biggest positives first

        # Styling helpers
        ratings_style = dict(subset=["mean rating"], cmap="Blues",  vmin=1, vmax=5)
        deltas_style  = dict(subset=["Δ vs avg"],    cmap="RdBu_r", vmin=-max_abs_delta, vmax= max_abs_delta)

        def emphasize_deltas(col: pd.Series):
            return ["font-weight:bold;" if abs(v) >= delta_sig else "" for v in col]

        def style_card(dfc):
            st = (dfc.style
                  .format({"mean rating": "{:.1f}", "Δ vs avg": "{:+.2f}"})
                  .background_gradient(**ratings_style)
                  .background_gradient(**deltas_style)
                  .apply(emphasize_deltas, subset=["Δ vs avg"]))
            # Keep the numeric index hidden; "Factor" is now the visible label column
            try:
                st = st.hide(axis="index")
            except Exception:
                pass
            return st

        lows_html  = style_card(lows).to_html()
        highs_html = style_card(highs).to_html()

        # Default persona title (no renaming)
        title = (
            f"<h4 style='margin:4px 0'>Persona {int(c)}: "
            f"n={int(sizes.loc[c]):,}, sat_rate={sat_rate.loc[c]:.1%}</h4>"
        )

        block = f"""
        <div style="flex:1;min-width:320px;max-width:560px;border:1px solid #e5e7eb;border-radius:12px;
                    padding:12px;margin:6px;background:#fafafa;box-shadow:0 1px 2px rgba(0,0,0,0.04)">
          {title}
          <div style="display:flex;gap:12px;align-items:flex-start;flex-wrap:wrap">
            <div style="flex:1;min-width:280px">
              <strong>Lowest vs avg (pain points)</strong>
              {lows_html}
            </div>
            <div style="flex:1;min-width:280px">
              <strong>Highest vs avg (strengths)</strong>
              {highs_html}
            </div>
          </div>
        </div>
        """
        html_blocks.append(block)

    grid = "<div style='display:flex;flex-wrap:wrap;gap:12px'>" + "".join(html_blocks) + "</div>"
    display(HTML(grid))

# Run the cards
persona_cards(centers_orig, drivers, sizes, sat_rate, top=4, delta_sig=0.30)

Persona 0: n=51,257, sat_rate=23.0%

Lowest vs avg (pain points)

Factor	mean rating	Δ vs avg
Ease of Online booking	2.2	-1.28
Online boarding	2.3	-1.07
Online support	2.5	-1.04
Inflight wifi service	2.4	-0.90

Highest vs avg (strengths)

Factor	mean rating	Δ vs avg
Gate location	3.0	-0.03
Departure/Arrival time convenient	3.1	-0.07
Food and drink	2.8	-0.16
Seat comfort	2.5	-0.43

Persona 1: n=78,623, sat_rate=75.4%

Lowest vs avg (pain points)

Factor	mean rating	Δ vs avg
Gate location	3.0	+0.02
Departure/Arrival time convenient	3.2	+0.05
Food and drink	3.1	+0.10
Seat comfort	3.2	+0.26

Highest vs avg (strengths)

Factor	mean rating	Δ vs avg
Ease of Online booking	4.3	+0.79
Online boarding	4.0	+0.67
Online support	4.2	+0.65
Inflight wifi service	3.8	+0.57

Interpreting results:

Reading cards: Card for each persona shows top drivers ranked by delta from average global rating (Δ vs avg). Bold Δ marks meaningful gaps (≥ ~0.30).
Persona 0 - Digital Assistance Needed: Show lowest sat and experience the most pain with online booking, online boarding, online support and inflight wifi.
Persona 1 - Self-Service Pros: Show higher sat and rate these same factors the highest (mirroring Persona 0 pain points).

Centroid heatmap

Quick view of mean ratings by persona.

Code

data = centers_orig.copy()
fig, ax = plt.subplots(figsize=(max(8, 0.8*len(RATING_COLS)), 1.1*data.shape[0] + 1.0), constrained_layout=True)
sns.heatmap(data, annot=True, fmt=".1f", cmap="Blues", cbar_kws={"shrink":0.6}, ax=ax)

wrapped = ['\n'.join(textwrap.wrap(c, width=14)) for c in RATING_COLS]
ax.set_xticklabels(wrapped, rotation=45, ha="right")
ax.set_xlabel("Rating factor"); ax.set_ylabel("Persona")
ax.set_title("Centroid profiles (original 1–5 scale)")
plt.show()

Drivers ranked by Δ (Persona 1 - Persona 0)

Identify what most separates the two personas by ranking features on the absolute difference in mean ratings.

Code

# Rank factors by mean rating deltas (Persona 1 − Persona 0)
# centers_orig: rows = personas (0, 1); columns = RATING_COLS (means on 1–5 scale)

# Signed deltas: positive → Persona 1 rates higher than Persona 0
delta = centers_orig.loc[1, RATING_COLS] - centers_orig.loc[0, RATING_COLS]

# Order features by discriminativeness: sort by |Δ| (largest first → biggest gaps on top)
order = delta.abs().sort_values(ascending=True).index

# Wrap long labels for readability in a horizontal bar chart
labs = ['\n'.join(textwrap.wrap(c, 22)) for c in order]

# Plot: horizontal bars; annotate each bar with its signed Δ
plt.figure(figsize=(8, max(4, 0.35 * len(order))))
vals = delta[order].values
plt.barh(labs, vals)

for y, v in enumerate(vals):
    # place label just outside the bar; keep sign in the text
    offset = 0.03 if v >= 0 else -0.03
    ha = "left" if v >= 0 else "right"
    plt.text(v + offset, y, f"{v:+.2f}", va="center", ha=ha, fontsize=9)

# Zero reference line (keeps plot robust if any future Δ < 0)
plt.axvline(0, color="gray", linewidth=1)

# Right-side padding so labels don’t clip
xmax = float(vals.max())
plt.xlim(0, xmax * 1.1)

plt.title("Persona 1 − Persona 0 (mean rating deltas)")
plt.xlabel("Δ rating (1–5 scale)")
plt.tight_layout()
plt.show()

Interpreting results:

Largest |Δ| = biggest separator: Bars farthest from 0 are the key differentiators between personas.
Sign tells direction: Positive = Persona 1 rates that factor higher; negative = Persona 0 rates it higher (but all are positive here).

Max-spread drivers (Top N)

Order features by between-persona spread (max–min), then plot a small heatmap of the top discriminating factors (defined in topN) to visualize profile contrast.

Code

# Show max-spread ranking + heatmap (top-N)
# Order features by between-persona spread (max–min) and plot a compact heatmap of the most discriminating factors.

# Rank features by spread across personas (in original 1–5 scale)
spread = centers_orig.max(axis=0) - centers_orig.min(axis=0)   # range per feature
topN   = 8                                                     # adjust as you like (e.g., 8–12 works well)
cols   = spread.sort_values(ascending=False).head(topN).index  # most discriminating first

# Order personas by sat_rate (happiest on top)
row_order = centers_orig.index
if 'sat_rate' in locals() and isinstance(sat_rate, pd.Series):
    row_order = sat_rate.sort_values(ascending=False).index

# Prepare data + wrapped labels for readability
data   = centers_orig.loc[row_order, cols]
xlabels = ['\n'.join(textwrap.wrap(c, 16)) for c in cols]

# Heatmap: fix color scale to 1–5 so colors are comparable across runs
fig_w = max(8, 0.8 * len(cols))            # width grows with #columns
fig_h = 1.1 * data.shape[0] + 1.2          # height scales with #personas
fig, ax = plt.subplots(figsize=(fig_w, fig_h), constrained_layout=True)

sns.heatmap(
    data,
    annot=True, fmt=".1f",
    cmap="Blues", vmin=1, vmax=5,          # fixed scale = fair comparisons
    linewidths=0.5, linecolor="white",
    cbar_kws={"shrink": 0.6, "label": "Mean rating (1–5)"},
    ax=ax
)

ax.set_xticklabels(xlabels, rotation=45, ha="right")
ax.set_xlabel("Rating factor"); ax.set_ylabel("Persona")
ax.set_title(f"Centroid profiles — top {topN} features by spread (max–min)")
plt.show()

Interpreting results:

Left-most columns are the most discriminating (larger max-min across personas).
Darker cells = higher ratings (strengths); lighter cells flag pain points.
Offers sanity check for top drivers identified in previous outputs.

Composition by profile variables

Stacked percent bars showing how each persona is distributed across key categorical attributes—revealing which groups are over- or under-represented.

Code

CATS = ["Customer Type","Type of Travel","Class"]

for c in CATS:
    # % composition of each persona by category value
    ct = pd.crosstab(dfc["persona"], dfc[c], normalize="index").sort_index()

    ax = ct.plot(kind="bar", stacked=True, figsize=(7,3))
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
    ax.set_xlabel("Persona"); ax.set_ylabel("% of persona")
    ax.set_title(f"Persona composition by {c}")
    ax.legend(title=c, bbox_to_anchor=(1.02, 1), loc="upper left", frameon=False)
    plt.xticks(rotation=0)
    plt.tight_layout(); plt.show()

Interpreting results:

Tallest colored segments show which subgroups dominate each persona.
Persona 0 (low CSAT) skews toward ECO (economy class), and might serve as a focused pilot target.

Mean flight delays by persona

Grouped bars of departure and arrival delays per persona (mean or median); check whether ops pain aligns with the low-CSAT group.

Code

cols = ["Departure Delay in Minutes", "Arrival Delay in Minutes"]
delay_means = dfc.groupby("persona")[cols].mean().round(1).sort_index()

# Make fig/ax first so we can use constrained layout and place legend outside
fig, ax = plt.subplots(figsize=(6.5, 3.5), constrained_layout=True)
delay_means.plot(kind="bar", ax=ax, width=0.75)

ax.set_xlabel("Persona")
ax.set_ylabel("Minutes (avg)")
ax.set_title("Average flight delays by persona")
ax.legend(title="", bbox_to_anchor=(1.02, 1), loc="upper left", frameon=False)
plt.xticks(rotation=0)
plt.show()

Interpreting results:

Slightly higher bars for Persona 0 suggest ops pain correlates with the low-sat group

Deep Dive Within Persona 0 (Dissatisfied)

In order to better understand top drivers of dissatisfaction to focus actions where they matter most.

Subset & re-prep (dissatisfied cohort)

Isolate the dissatisfied cohort, rebuild rating features with cohort-level imputation, and standardize for within-cohort clustering/visuals.

Code

# Filter to dissatisfied cohort and re-prep (robust to restarts)

# Ensure persona labels exist on dfc (handles notebook restarts gracefully)
if "persona" not in dfc.columns:
    if "results" in globals() and "K_FINAL" in globals() and (K_FINAL in results):
        dfc["persona"] = results[K_FINAL]["labels"]
    else:
        # Fallback: recompute labels using earlier standardized ratings matrix (X_rat_s)
        # NOTE: This is only for continuity if labels are missing; it may not match your final chosen K.
        K_FALLBACK = 2 if "K_FINAL" not in globals() else K_FINAL
        km = MiniBatchKMeans(
            n_clusters=K_FALLBACK, n_init=20, batch_size=2048,
            max_iter=200, random_state=42
        )
        dfc["persona"] = km.fit_predict(X_rat_s)  # relies on X_rat_s defined earlier
        display(Markdown(f"[note] Recomputed persona labels with K=`{K_FALLBACK}` (fallback)."))

# Binary target aligned to dfc
y = (dfc[TARGET] == "satisfied").astype(int)

# Identify the low-CSAT persona id (robust if label ids flip)
sat_rate_by_persona = dfc.groupby("persona")[TARGET].apply(lambda s: (s == "satisfied").mean())
dissatisfied_id = int(sat_rate_by_persona.idxmin())

# Slice to dissatisfied cohort
mask0 = (dfc["persona"] == dissatisfied_id)
dfc0  = dfc.loc[mask0].copy()
y0    = y.loc[mask0].reset_index(drop=True)

# Rebuild rating features within this subset (median-impute inside the cohort)
X_rat0 = (
    dfc0[RATING_COLS]
      .apply(pd.to_numeric, errors="coerce")                                # coerce any stray non-numerics
      .fillna(dfc0[RATING_COLS].median(numeric_only=True))                  # cohort-level median impute
)[RATING_COLS]                                                               # keep column order explicit

# Standardize within subset (fresh scaler for Stage 2)
scaler0   = StandardScaler()
X_rat_s0  = scaler0.fit_transform(X_rat0)

# --- Quick sanity checks ---
display(Markdown("#### Data shapes"))
display(Markdown(f"**X_rat0:** `{X_rat0.shape}`"))
display(Markdown(f"**X_rat_s0 (standardized):** `{np.asarray(X_rat_s0).shape}`"))

# No NaNs after standardization
assert not np.isnan(X_rat_s0).any(), "Unexpected NaNs after standardization."

# Columnwise mean≈0, std≈1 (population std to match StandardScaler)
x_rat_s0_df = pd.DataFrame(X_rat_s0, columns=RATING_COLS)
std0_summary = pd.DataFrame({
    "mean": x_rat_s0_df.mean().round(3),
    "std":  x_rat_s0_df.std(ddof=0).round(3),
})
display(std0_summary)

Data shapes

X_rat0: (51257, 14)

X_rat_s0 (standardized): (51257, 14)

	mean	std
Seat comfort	-0.0	1.0
Departure/Arrival time convenient	0.0	1.0
Food and drink	0.0	1.0
Gate location	0.0	1.0
Inflight wifi service	0.0	1.0
Inflight entertainment	0.0	1.0
Online support	-0.0	1.0
Ease of Online booking	-0.0	1.0
On-board service	-0.0	1.0
Leg room service	0.0	1.0
Baggage handling	0.0	1.0
Checkin service	-0.0	1.0
Cleanliness	0.0	1.0
Online boarding	-0.0	1.0

Interpreting results: (same checks as before, but inside the dissatisfied cohort)

Subset confirmed: Persona 0 has 51,257 rows, a substantial slice worth clustering.
No leakage: We re-imputed and re-standardized within this cohort (fresh StandardScaler), not using global stats.
Shapes OK: X0 and Xs0 have matching shapes; will cluster on Xs0.
Data hygiene: Xs0 has no NaNs; columnwise means ≈ 0 and std ≈ 1 (like Step 1), so distances are comparable across factors.

K selection (inside dissatisfied cohort)

Sweep K = 2–5 and compare cluster quality + business separability; pick the smallest K that’s clearly actionable.

Code

# K selection inside Person 0
candidates0 = [2,3,4,5]
res0, rows0 = {}, []
for k in candidates0:
    r = fit_personas_for_k(X_rat0, X_rat_s0, y0, RATING_COLS, k)
    res0[k] = r
    rows0.append({
        "k": k,
        "silhouette": r["metrics"]["silhouette"],
        "calinski_harabasz": r["metrics"]["calinski_harabasz"],
        "davies_bouldin": r["metrics"]["davies_bouldin"],
        "sat_gap": r["metrics"]["sat_gap"],
        "min_cluster_pct": r["metrics"]["min_cluster_pct"],
        "max_centroid_cos_sim": r["metrics"]["max_centroid_cos_sim"],
    })

# compact metrics table
cmp0 = pd.DataFrame(rows0).sort_values("k")
display(Markdown("#### K Comparison Table (sorted by K)"))
display(cmp0.style.hide(axis="index").format({
    "silhouette":"{:.3f}",
    "calinski_harabasz":"{:.1f}",
    "davies_bouldin":"{:.3f}",
    "sat_gap":"{:.3f}",
    "min_cluster_pct":"{:.1%}",
    "max_centroid_cos_sim":"{:.3f}",
}))
display(Markdown("<br>"))
display(Markdown("#### Persona Summaries for each K selection - sorted by sat_rate (ascending)"))

# Persona summaries per K
for k in candidates0:
    display(Markdown("<br>"))
    display(Markdown(f"**Persona Summary (K={k}))**"))
    summ = res0[k]["summary"]
    summ = summ.sort_index()
    summ = summ.rename_axis("P").reset_index()
    display(
        summ.style
            .hide(axis="index")
            .format({"n": "{:,.0f}", "sat_rate": "{:.1%}", **{c: "{:.1f}" for c in RATING_COLS}})
    )

K Comparison Table (sorted by K)

k	silhouette	calinski_harabasz	davies_bouldin	sat_gap	min_cluster_pct	max_centroid_cos_sim
2	0.152	9877.3	2.218	0.042	4707.5%	-1.000
3	0.121	7638.0	2.226	0.211	2188.4%	-0.094
4	0.115	6647.2	2.263	0.194	1551.2%	0.259
5	0.116	6442.5	2.002	0.273	1559.4%	0.215

Persona Summaries for each K selection - sorted by sat_rate (ascending)

Persona Summary (K=2))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	27,128	21.0%	2.4	3.0	2.6	2.8	1.8	2.4	1.9	2.0	3.2	3.3	3.7	3.0	3.7	1.8
1	24,129	25.2%	2.6	3.1	3.1	3.1	3.1	3.2	3.3	2.5	2.2	2.5	2.4	2.7	2.4	3.0

Persona Summary (K=3))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	18,570	28.0%	2.5	3.0	3.0	2.9	3.2	3.3	3.5	2.4	2.0	2.3	2.1	2.8	2.1	3.1
1	21,470	12.9%	2.0	2.8	2.3	2.9	2.2	2.0	2.2	2.4	3.0	3.2	3.5	2.7	3.5	2.1
2	11,217	34.0%	3.5	3.9	3.7	3.1	1.5	3.4	1.6	1.6	3.5	3.3	3.9	3.5	3.9	1.5

Persona Summary (K=4))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	13,731	31.7%	3.5	4.0	3.9	3.5	1.9	3.3	2.0	2.1	3.3	3.3	3.7	3.1	3.7	1.8
1	14,628	27.0%	2.6	2.5	2.5	2.6	3.4	3.1	3.4	2.9	2.4	2.6	2.6	2.8	2.6	3.4
2	7,951	20.7%	2.1	3.3	3.3	3.2	2.5	3.0	2.9	1.6	1.7	2.0	1.7	2.6	1.7	2.3
3	14,947	12.3%	1.8	2.7	1.9	2.7	1.8	1.8	1.9	2.0	3.2	3.3	3.7	2.9	3.7	1.8

Persona Summary (K=5))

P	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	12,269	20.2%	2.3	2.1	2.0	2.3	3.2	2.6	3.2	3.1	2.6	2.8	2.9	2.6	2.9	3.1
1	12,111	8.5%	1.6	2.6	2.0	2.8	1.6	1.6	1.8	1.8	2.9	3.1	3.4	2.8	3.4	1.6
2	7,993	35.8%	2.3	3.4	3.3	3.3	3.1	3.6	3.6	1.7	1.6	2.0	1.6	3.0	1.6	3.0
3	8,240	33.1%	3.2	3.9	3.3	2.9	1.5	3.2	1.6	1.5	3.8	3.5	4.2	3.9	4.2	1.5
4	10,644	25.3%	3.3	3.8	4.0	3.9	2.4	3.1	2.5	2.6	2.7	3.0	3.2	2.4	3.1	2.3

Interpreting results:

Silhouette (↑) & Calinski–Harabasz (↑): Both peak at K=2, indicating the clearest separation and compactness within the dissatisfied cohort.
Davies–Bouldin (↓): Best at K=5, but DB alone shouldn’t drive the choice when Silhouette/CH drop and interpretability gets harder.
Satisfaction gap (↑): Grows with K (≈0.04 → 0.27). That’s expected as you isolate extremes—but can also mean over-segmentation.
Min cluster %: Stays healthy (≥ ~15%) across your Ks, so no obvious “junk” slivers.
Max centroid cosine similarity: Low/negative across Ks; at K=2 ~ −1.0, centroids are near opposites—a clean split.

What this suggests:

K=2: Strongest by Silhouette/CH and gives a simple story.
K=3: Consider only if the persona summaries reveal a new, stable theme (e.g., “digital pain” vs “comfort pain” vs “mixed”), with decent size (≥ ~20%) and low redundancy (cosine sim ≪ 0.9).
K=5: Avoid using unless stakeholders explicitly want micro-segments—it trades separation/clarity for granularity.

Final choice:

Pick K=2: This maximizes separation (Silhouette/CH), yields opposite sub-personas with balanced sizes, and keeps the action plan clear.
Best to default to the smallest K that’s clearly distinct and actionable.

Sub-personas (inside dissatisfied cohort) — finalize & summarize

Fix K for the dissatisfied cohort, attach persona0_sub labels back to dfc, and summarize size, CSAT, and mean ratings (1–5).

Code

# Finalize sub-personas + summary (dissatisfied cohort)

# Lock in K and attach labels back to df
K0_FINAL = 2  # set after reviewing cmp0 + summaries
assert K0_FINAL in res0, f"K0_FINAL={K0_FINAL} not found in res0"

# labels for dissatisfied rows only (index aligns to dfc0.index)
labels0 = (
    pd.Series(res0[K0_FINAL]["labels"], index=dfc0.index, name="persona0_sub")
    .astype(int)
)
dfc.loc[labels0.index, "persona0_sub"] = labels0

# Artifacts within the dissatisfied cohort
centers_orig0 = res0[K0_FINAL]["centers_orig"]   # mean ratings (1–5)
drivers0      = res0[K0_FINAL]["drivers"]        # Δ vs cohort mean
sizes0        = res0[K0_FINAL]["counts"]         # counts per sub-persona
sat_rate0     = res0[K0_FINAL]["sat_rate"]       # CSAT per sub-persona


# Compact summary (n, sat_rate, then mean ratings) - lowest CSAT first (problem-first view)
summary0 = (
    centers_orig0
    .join([sat_rate0, sizes0.rename("n")])
    .loc[:, ["n", "sat_rate"] + RATING_COLS]
    .sort_values("sat_rate", ascending=True)
    .rename_axis("P0_sub")  # make index label explicit
)

# Styling
fmt0 = {**{c: "{:.1f}" for c in RATING_COLS}, "sat_rate": "{:.1%}", "n": "{:,.0f}"}
styled0 = (
    summary0.style
    .format(fmt0)
    .background_gradient(subset=RATING_COLS, cmap="Blues", vmin=1, vmax=5)
    .background_gradient(subset=["sat_rate"], cmap="Blues", vmin=0, vmax=0.5)
)

display(Markdown("#### Dissatisfied cohort — sub-persona profiles: mean ratings (1–5), size (n), CSAT"))
display(styled0)

Dissatisfied cohort — sub-persona profiles: mean ratings (1–5), size (n), CSAT

	n	sat_rate	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
P0_sub
0	27,128	21.0%	2.4	3.0	2.6	2.8	1.8	2.4	1.9	2.0	3.2	3.3	3.7	3.0	3.7	1.8
1	24,129	25.2%	2.6	3.1	3.1	3.1	3.1	3.2	3.3	2.5	2.2	2.5	2.4	2.7	2.4	3.0

Interpreting results:

Scope: Everything here is within the dissatisfied cohort - drivers0 are deltas vs the cohort mean.
Prioritize: Start with lowest sat_rate role and biggest negative gaps. Use n to weigh impact.
Consistency check: Patterns should echo K choice for dissatisfied cohort in sub-persona cards below.

Sub-persona cards

For each sub-persona, surface top pains (largest negative Δ vs cohort mean) and top strengths (largest positive Δ) to propose targeted fixes.

Code

# Persona cards (dissatisfied cohort)
persona_cards(
    centers_orig0,   # means in 1–5
    drivers0,        # Δ vs cohort mean (negatives = pains)
    sizes0,          # n per sub-persona
    sat_rate0,       # CSAT per sub-persona
    top=4,
    delta_sig=0.30   # highlight |Δ| ≥ 0.30 as “meaningful” (tweak as you like)
)

Persona 0: n=27,128, sat_rate=21.0%

Lowest vs avg (pain points)

Factor	mean rating	Δ vs avg
Online support	1.9	-0.62
Inflight wifi service	1.8	-0.59
Online boarding	1.8	-0.56
Inflight entertainment	2.4	-0.37

Highest vs avg (strengths)

Factor	mean rating	Δ vs avg
Cleanliness	3.7	+0.57
Baggage handling	3.7	+0.57
On-board service	3.2	+0.47
Leg room service	3.3	+0.36

Persona 1: n=24,129, sat_rate=25.2%

Lowest vs avg (pain points)

Factor	mean rating	Δ vs avg
Cleanliness	2.4	-0.68
Baggage handling	2.4	-0.67
On-board service	2.2	-0.56
Leg room service	2.5	-0.44

Highest vs avg (strengths)

Factor	mean rating	Δ vs avg
Online support	3.3	+0.76
Inflight wifi service	3.1	+0.70
Online boarding	3.0	+0.68
Inflight entertainment	3.2	+0.44

Interpreting results:

Reading cards: Each card ranks top drivers by Δ vs the cohort mean (not global). Bold deltas (≥ ~0.30) flag meaningful gaps to take action.
Sub-persona 0 — “Digital Friction”: Consistent lows on Online booking/boarding, Online support, Inflight Wi-Fi; prioritize pre-/at-journey self-service fixes (simplify flows, clearer error help, off-wifi fallback).
Sub-persona 1 — “Cabin & Service Critics”: Lows on On-board service, Baggage handling, Cleanliness; target in-cabin/service playbooks (turn-time checklists, crew coaching, baggage SLAs, cleaning QA).
Same CSAT band, different routes: The small CSAT gap means “who first” is less important than tailoring fixes to each sub-persona’s drivers.

Radar: lowest-driver comparison

Overlay a simple radar chart on the lowest-rated drivers for each sub-persona (Δ vs cohort mean), to see at a glance where each group struggles most.

Code

# Radar comparison of lowest factors (dissatisfied cohort)

def plot_stage2_radar_lows(centers_orig0: pd.DataFrame,
                           drivers0: pd.DataFrame,
                           k_per_persona: int = 4,
                           include_cohort_mean: bool = True,
                           wrap: int = 18,
                           title: str = "Radar of lowest factors by sub-persona"):
    """
    centers_orig0 : mean ratings (1–5) per sub-persona (rows) × factors (cols)
    drivers0      : Δ vs *cohort* mean per sub-persona × factors (negatives = pains)
    k_per_persona : take this many worst (most negative Δ) factors per sub-persona
    include_cohort_mean : overlay the cohort mean as dashed reference
    wrap          : wrap width for radial axis labels
    """

    # Select axes: union of each sub-persona's k worst (most negative Δ) factors
    persona_ids = list(centers_orig0.index)
    lows = []
    for pid in persona_ids:
        worst = drivers0.loc[pid].sort_values().head(k_per_persona).index.tolist()
        lows.extend(worst)
    cols = pd.Index(lows).unique().tolist()  # axes = union, no duplicates

    if len(cols) < 3:
        print("Not enough distinct low factors to draw a radar (need ≥3).")
        return

    # Radar geometry (close the polygon by repeating the first point)
    angles = np.linspace(0, 2*np.pi, len(cols), endpoint=False)
    angles = np.concatenate([angles, angles[:1]])

    # Figure / Axes (polar)
    fig = plt.figure(figsize=(7.6, 7.6))
    ax  = plt.subplot(111, polar=True)
    ax.set_title(title, pad=20)

    # Plot each sub-persona on the same axes (ratings in 1–5)
    for pid in persona_ids:
        vals = centers_orig0.loc[pid, cols].values.astype(float)
        vals = np.concatenate([vals, vals[:1]])  # close polygon
        (line,) = ax.plot(angles, vals, linewidth=2, label=f"Sub-persona {pid}")
        ax.fill(angles, vals, alpha=0.10)  # same color, light fill

    # Optional cohort mean as dashed reference 
    if include_cohort_mean:
        cohort_mean = centers_orig0[cols].mean(axis=0).values.astype(float)
        cohort_mean = np.concatenate([cohort_mean, cohort_mean[:1]])
        ax.plot(angles, cohort_mean, linestyle="--", linewidth=1.5, label="Cohort mean")

    # Tidy axes: keep true 1–5 scale so colors/lengths are interpretable
    ax.set_ylim(1, 5)
    ax.set_yticks([1, 2, 3, 4, 5])
    ax.set_yticklabels(["1", "2", "3", "4", "5"])
    ax.set_rlabel_position(22)

    # Wrap long labels for readability
    labels_wrapped = ['\n'.join(textwrap.wrap(c, width=wrap)) for c in cols]
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels_wrapped)

    # Legend outside and layout
    ax.legend(bbox_to_anchor=(1.15, 1.0), loc="upper left", frameon=False)
    plt.tight_layout()
    plt.show()

# Usage: overlay the sub-personas on the union of their worst k factors
plot_stage2_radar_lows(centers_orig0, drivers0, k_per_persona=4, include_cohort_mean=True)

Interpreting results:

Axes = the union of each sub-persona’s worst drivers (most negative Δ vs the dissatisfied cohort’s mean).
Smaller radius on an axis = lower average rating (worse experience) for that sub-persona.
Reference line: The dashed cohort mean shows how each group sits relative to a “typical” dissatisfied traveler.
Divergence tells the story: Where the polygons separate, the sub-personas need different fixes
- Sub-persona 0 — “Digital Friction”: Tightest around center for Online boarding, Inflight wifi service, Online support, Inflight entertainment.
- Sub-persona 1 - “Cabin & Service Critics”: Slightly looser around center for On-board service, Cleanliness, Baggage handling, Leg room service.

05 - Uplift & What-ifs

Estimate the change in satisfaction probability if top pain-point ratings for the dissatisfied cohort are improved.

Scenario Setup

Aim the what-if at the right cohort and choose the specific ratings (“levers”) raised in the simulation.

Target group & levers (for overall dissatisfied cohort)

Define the dissatisfied cohort and select the 2–4 ratings to improve (your actionable levers).

Code

# Inputs expected from earlier cells
# dfc         : your working DataFrame (features + TARGET + persona labels)
# TARGET      : e.g., 'satisfaction'
# RATING_COLS : list of 1..5 ordinal rating columns
# (optional) mask0 : boolean mask for the dissatisfied cohort (if you already built it)

# --- Cohort target (mask0) and levers ---

def get_dissatisfied_mask(dfc: pd.DataFrame, target: str, persona_col: str = "persona"):
    if persona_col in dfc.columns:
        sat_rate_by_p = dfc.groupby(persona_col)[target].apply(lambda s: (s=="satisfied").mean())
        diss_id = int(sat_rate_by_p.idxmin())
        return (dfc[persona_col] == diss_id), diss_id
    return (dfc[target] != "satisfied"), None

if "mask0" not in globals():
    mask0, dissatisfied_id = get_dissatisfied_mask(dfc, TARGET)
else:
    dissatisfied_id = None

LEVERS = ["Online support", "Ease of Online booking", "Inflight wifi service"]
NEW_MIN, CAP = 3.5, 5  # threshold rule defaults (can change later)

display(Markdown(f"**Rows in dissatisfied cohort:** `{mask0.sum():,}` of `{len(dfc):,}` total surveyed"))
if dissatisfied_id is not None:
    display(Markdown(f"**Dissatisfied persona id:** `{dissatisfied_id}`"))

display(Markdown(f"**Levers (to improve):** " + ", ".join(f"`{c}`" for c in LEVERS)))
display(Markdown(f"**Improvement rule:** Raise each lever to at least `{NEW_MIN}` (cap `{CAP}`)"))

Rows in dissatisfied cohort: 51,257 of 129,880 total surveyed

Levers (to improve): Online support, Ease of Online booking, Inflight wifi service

Improvement rule: Raise each lever to at least 3.5 (cap 5)

Counterfactual build (“improved ratings”) & uplift scoring

Clone the dissatisfied cohort, raise selected rating levers to at least NEW_MIN, capped at 5, transform with the same preprocessor, and score the calibrated model to compare before vs after probabilities and compute uplift.

Code

# --- scoring using your fitted objects (reuse your earlier function if defined) ---
def score_proba(dfX: pd.DataFrame):
    if "pipeline" in globals():
        check_is_fitted(pipeline)
        return pipeline.predict_proba(dfX)[:, 1]
    if "preprocessor" in globals() and "rf_cal" in globals():
        check_is_fitted(preprocessor)
        X_ = preprocessor.transform(dfX)
        return rf_cal.predict_proba(X_)[:, 1]
    raise ValueError("Need `pipeline` or (`preprocessor` + `rf_cal`).")

# --- ensure features set ---
FEATURES = []
for name in ["NUM_COLS", "CAT_COLS", "RATING_COLS"]:
    if name in globals():
        FEATURES.extend(globals()[name])
if not FEATURES:
    raise ValueError("Define NUM_COLS, CAT_COLS, RATING_COLS earlier.")
missing = set(FEATURES) - set(dfc.columns)
if missing:
    raise KeyError(f"Missing from dfc: {sorted(missing)}")

# --- core helper: sweep one lever over several thresholds ---
def sweep_single_lever_thresholds(mask, lever: str, thresholds=(3.0, 3.5, 4.0, 4.5), cap=5):
    """
    For rows in `mask`, raise `lever` to >= each threshold (cap at 5), score before/after,
    and return a summary table for that lever.
    Other columns are untouched; NaNs are preserved.
    """
    assert lever in RATING_COLS, f"{lever} must be in RATING_COLS"

    base_df = dfc.loc[mask, FEATURES].copy()
    p_base  = score_proba(base_df)

    out = []
    v0 = pd.to_numeric(base_df[lever], errors="coerce")

    for thr in thresholds:
        df_cf = base_df.copy()
        v = pd.to_numeric(df_cf[lever], errors="coerce")
        bump_mask = v.notna() & (v < thr)
        v.loc[bump_mask] = thr
        df_cf[lever] = v.clip(upper=cap)

        p_cf = score_proba(df_cf)
        upl  = p_cf - p_base

        out.append({
            "lever": lever,
            "threshold": float(thr),
            "n": int(mask.sum()),
            "rows_changed_%": float((v0.notna() & (v0 < thr)).mean()),
            "p_base_mean": float(np.mean(p_base)),
            "p_cf_mean":   float(np.mean(p_cf)),
            "uplift_mean": float(np.mean(upl)),
            "uplift_p25":  float(np.percentile(upl, 25)),
            "uplift_p50":  float(np.percentile(upl, 50)),
            "uplift_p75":  float(np.percentile(upl, 75)),
            "share_positive": float((upl > 0).mean()),
        })

    df = pd.DataFrame(out).sort_values("threshold").reset_index(drop=True)
    return df

#######

LEVER_LIST = [
    "Online support",
    "Ease of Online booking",
    "Inflight wifi service",
]

for lever in LEVER_LIST:
    res = sweep_single_lever_thresholds(mask0, lever, thresholds=(3.0, 3.5, 4.0, 4.5), cap=5)
    display(Markdown(f"#### What-if: raise **{lever}** to ≥ threshold (cap 5)"))
    display(
        res.style.hide(axis="index").format({
            "threshold": "{:.1f}",
            "n": "{:,}",
            "rows_changed_%": "{:.1%}",
            "p_base_mean": "{:.2%}",
            "p_cf_mean":   "{:.2%}",
            "uplift_mean": "{:+.2%}",
            "uplift_p25":  "{:+.2%}",
            "uplift_p50":  "{:+.2%}",
            "uplift_p75":  "{:+.2%}",
            "share_positive": "{:.2%}",
        })
    )

What-if: raise Online support to ≥ threshold (cap 5)

lever	threshold	n	rows_changed_%	p_base_mean	p_cf_mean	uplift_mean	uplift_p25	uplift_p50	uplift_p75	share_positive
Online support	3.0	51,257	51.7%	22.84%	22.75%	-0.09%	+0.00%	+0.00%	+0.07%	32.65%
Online support	3.5	51,257	79.5%	22.84%	23.83%	+0.99%	+0.00%	+0.00%	+0.23%	38.58%
Online support	4.0	51,257	79.5%	22.84%	23.14%	+0.30%	+0.00%	+0.18%	+1.43%	61.00%
Online support	4.5	51,257	95.2%	22.84%	24.09%	+1.25%	+0.00%	+0.00%	+0.69%	48.52%

What-if: raise Ease of Online booking to ≥ threshold (cap 5)

lever	threshold	n	rows_changed_%	p_base_mean	p_cf_mean	uplift_mean	uplift_p25	uplift_p50	uplift_p75	share_positive
Ease of Online booking	3.0	51,257	62.6%	22.84%	23.63%	+0.79%	+0.00%	+0.00%	+1.29%	48.70%
Ease of Online booking	3.5	51,257	89.7%	22.84%	22.38%	-0.46%	+0.00%	+0.00%	+0.53%	45.21%
Ease of Online booking	4.0	51,257	89.7%	22.84%	27.68%	+4.84%	+0.00%	+2.88%	+9.61%	70.82%
Ease of Online booking	4.5	51,257	98.9%	22.84%	21.89%	-0.95%	+0.00%	+0.02%	+0.75%	50.51%

What-if: raise Inflight wifi service to ≥ threshold (cap 5)

lever	threshold	n	rows_changed_%	p_base_mean	p_cf_mean	uplift_mean	uplift_p25	uplift_p50	uplift_p75	share_positive
Inflight wifi service	3.0	51,257	58.1%	22.84%	23.16%	+0.32%	+0.00%	+0.00%	+0.63%	43.93%
Inflight wifi service	3.5	51,257	83.8%	22.84%	22.47%	-0.37%	+0.00%	+0.00%	+0.38%	41.59%
Inflight wifi service	4.0	51,257	83.8%	22.84%	28.04%	+5.20%	+0.00%	+3.04%	+9.03%	65.76%
Inflight wifi service	4.5	51,257	94.6%	22.84%	22.14%	-0.70%	+0.00%	+0.00%	+0.60%	48.19%

Interpreting results:

Each table isolates uplift scoring scenarios for each top driver, defined below:

uplift_mean = average change in predicted p(satisfied) (percentage points).
rows_changed_% = share of the cohort that actually gets edited by the rule.
share_positive = % of rows with uplift > 0 (tells you how broad the benefit is).
p50 / p75 = typical and upper-quartile gains (when the mean is pulled by tails).

Online support

Best mean uplift: ≥ 4.5 → +1.25 pp, rows_changed = 95.2%, share_positive = 48.5%, p75 = +0.69 pp.
Runner-up: ≥ 3.5 → +0.99 pp, rows_changed = 79.5%, share_positive = 38.6%, p75 = +0.23 pp.
Note: ≥ 4.0 has more people benefiting (share_positive = 61.0%) but a smaller mean (+0.30 pp). That says benefits are wider but shallower at 4.0; at 4.5, gains are deeper (bigger per-row jumps), lifting the mean.
Pick: If you care about headline impact, go ≥ 4.5. If you prefer breadth, ≥ 4.0 has the widest “helped” share, but with a modest mean.

Ease of Online booking

Stand-out: ≥ 4.0 → +4.84 pp, rows_changed = 89.7%, share_positive = 70.8%, p50 = +2.88 pp, p75 = +9.61 pp.
Lower pushes: ≥ 3.0 → +0.79 pp (nice but small).
Caveats: ≥ 3.5 and ≥ 4.5 show negative means, so “half-measures” (3.5) and “over-push” (4.5) backfire on average.
Pick: ≥ 4.0 is the clear sweet spot—big mean, broad benefit, and strong upper-quartile gains.

Inflight wifi service

Stand-out: ≥ 4.0 → +5.20 pp, rows_changed = 83.8%, share_positive = 65.8%, p50 = +3.04 pp, p75 = +9.03 pp.
Lower push: ≥ 3.0 → +0.32 pp (minor).
Caveats: ≥ 3.5 and ≥ 4.5 are negative on average.
Pick: ≥ 4.0 is again the sweet spot.

Recommendation (single-lever story)

Online support: aim for ≥ 4.5 (deep lift) or ≥ 4.0 (broader but shallower).
Online booking: ≥ 4.0.
Inflight wifi: ≥ 4.0.

Simulate combined scenario (optimized uplift for all 3 drivers)

Apply the optimized lever thresholds (≥ 4.0) to the dissatisfied cohort, score before vs after with the same pipeline, and summarize uplift.

Code

# --- CONFIG: combined thresholds (cap at 5) ---
THRESHOLDS = {
    "Online support":        4.0,
    "Ease of Online booking": 4.0,
    "Inflight wifi service":  4.0,
}
CAP = 5

# --- Safety: ensure columns exist ---
missing = set(THRESHOLDS.keys()) - set(dfc.columns)
assert not missing, f"Missing columns for thresholds: {missing}"

# --- Helper: apply per-column thresholds, preserving NaNs ---
def improve_multi_threshold(df: pd.DataFrame, thresholds: dict, cap=5):
    df2 = df.copy()
    for col, thr in thresholds.items():
        v = pd.to_numeric(df2[col], errors="coerce")
        bump = v.notna() & (v < thr)
        v.loc[bump] = thr
        df2[col] = v.clip(upper=cap)
    return df2

# --- Cohort slice, score before/after ---
base_df = dfc.loc[mask0, FEATURES].copy()
p_base  = score_proba(base_df)

cf_df   = improve_multi_threshold(base_df, THRESHOLDS, cap=CAP)
p_cf    = score_proba(cf_df)
uplift  = p_cf - p_base

# --- How many rows got changed per lever (as % of the cohort) ---
rows_changed = {}
for col, thr in THRESHOLDS.items():
    v0 = pd.to_numeric(base_df[col], errors="coerce")
    rows_changed[col] = float((v0.notna() & (v0 < thr)).mean())

# --- Summary table ---
summary_combined = pd.DataFrame([{
    "n": int(mask0.sum()),
    "p_base_mean": float(p_base.mean()),
    "p_cf_mean":   float(p_cf.mean()),
    "uplift_mean": float(uplift.mean()),
    "uplift_p25":  float(np.percentile(uplift, 25)),
    "uplift_p50":  float(np.percentile(uplift, 50)),
    "uplift_p75":  float(np.percentile(uplift, 75)),
    "share_positive": float((uplift > 0).mean()),
    **{f"rows_changed_{k}_%": rows_changed[k] for k in THRESHOLDS}
}])

display(Markdown("#### Combined scenario — summary"))
display(
    summary_combined
      .style.hide(axis="index")
      .format({
          "n": "{:,}",
          "p_base_mean": "{:.2%}",
          "p_cf_mean":   "{:.2%}",
          "uplift_mean": "{:+.2%}",
          "uplift_p25":  "{:+.2%}",
          "uplift_p50":  "{:+.2%}",
          "uplift_p75":  "{:+.2%}",
          "share_positive": "{:.2%}",
          **{f"rows_changed_{k}_%": "{:.1%}" for k in THRESHOLDS}
      })
)

# --- Optional visuals: uplift histogram + before/after violin ---
# Uplift histogram
plt.figure()
plt.hist(uplift, bins=30, density=True, histtype="step")
plt.xlabel("uplift = p_cf - p_base")
plt.ylabel("density")
plt.title("Uplift distribution (combined scenario)")
plt.show()

# Before/after violin (quick shape check)
plt.figure()
plt.violinplot([p_base, p_cf], showmeans=True, showextrema=False)
plt.xticks([1, 2], ["before", "after"])
plt.ylabel("predicted p(satisfied)")
plt.title("Before vs After (combined scenario)")
plt.show()

Combined scenario — summary

n	p_base_mean	p_cf_mean	uplift_mean	uplift_p25	uplift_p50	uplift_p75	share_positive	rows_changed_Online support_%	rows_changed_Ease of Online booking_%	rows_changed_Inflight wifi service_%
51,257	22.84%	32.45%	+9.61%	+0.03%	+7.36%	+18.53%	75.25%	79.5%	89.7%	83.8%

Interpreting results:

Raising Online support, Online booking, and Inflight wifi to ≥ 4.0 for the dissatisfied cohort lifts predicted CSAT by +9.61 percentage points on average (22.84% → 32.45%).

Bringing these three pain points up to “good” (4.0) yields a broad, material improvement in predicted satisfaction across the dissatisfied cohort.

uplift_mean (+9.61 pp): Average increase in predicted probability of “satisfied” after the 3-lever change. This is a large cohort-level effect.
share_positive (75.25%): About 3 in 4 customers in this cohort see a positive change. That’s broad coverage, not just a niche win.
Percentiles (distribution of individual uplift):
- p25 = +0.03 pp: At least 25% of rows see ~no change (some cases aren’t sensitive to these levers).
- p50 = +7.36 pp: The typical row gains ~+7.4 pp—strong median effect.
- p75 = +18.53 pp: A quarter of the cohort gains ~+18.5 pp or more—big upside for a meaningful slice.
  
  The median and 75th percentile gains show this isn’t just averaging a few big winners—typical customers benefit meaningfully, and a sizable group benefits a lot.
rows_changed_% (per lever):
- Online support 79.5%, Online booking 89.7%, Wifi 83.8% of the cohort had ratings below 4.0 and non-missing, so they were actually bumped by the scenario. (Everyone else already had ≥ 4.0 or NaN and was left as-is.)

Histogram: The uplift distribution is right-skewed with a small left tail → many small/medium gains and a healthy set of large positives, with few negatives pulling the mean down. That shape is exactly as desired.

Violin:

Median/mean up: The horizontal line (median) and the dot (mean, since showmeans=True) should sit higher in the after violin → typical predicted CSAT increased.
Density shift: The widest part of the after violin should be higher on the y-axis, showing more rows concentrated at higher probabilities.
Overlap check: Some overlap near the lower range is normal; less overlap and a clear upward shift indicate a broad lift, not just a few outliers.
Spread/tails: A longer upper tail (after) signals more big winners.

06 - Summary & Recommendations

Summary of Analysis

Reliable prediction: A calibrated Random Forest (benchmarked against Logistic Regression) delivered well-calibrated probabilities of satisfaction on a held-out validation split, enabling thresholding and what-if scoring rather than binary guesses.
Clean feature engineering: A consistent preprocessing pipeline (impute → encode/scale) ensured stable, reproducible features across numeric, categorical, and 1–5 ordinal ratings; health checks (NaNs/inf, zero-variance, encoding ranges) passed.
Drivers of (dis)satisfaction: Ratings around Online support, Ease of Online booking, and Inflight wifi service emerged as the most actionable pain points for the dissatisfied cohort (low means and clear model sensitivity).
Segment structure: Persona clustering showed meaningful heterogeneity (different “worst drivers”), explaining why uniform levers sometimes helped some travelers while leaving others unchanged.
Counterfactual uplift (cohort level): Single-lever sweeps revealed non-linear thresholds:
- Online support: broader benefit at ≥4.0; deeper per-row gains at ≥4.5.
- Online booking: sweet spot at ≥4.0 (3.5 and 4.5 underperformed).
- Inflight wifi: sweet spot at ≥4.0 (3.5 and 4.5 underperformed).
Combined scenario (dissatisfied cohort): Setting all three levers to ≥4.0 yielded a +9.61 pp average lift (22.84% → 32.45%), with 75.25% of rows improving and strong distributional gains (median +7.36 pp, p75 +18.53 pp).
Interpretability & communication: Violin and histogram views confirmed a broad right-shift (higher typical probabilities) with a positive tail (large wins) and minimal negatives.

Recommendations

Adopt calibrated predictions operationally: Use the calibrated RF probabilities to both monitor baseline CSAT risk and quantify expected lift from proposed service changes before rollout.
Set service targets at the effective thresholds:
- Online support: target ≥4.0 across the cohort; consider ≥4.5 where feasible for deeper gains (e.g., priority queues or concierge chat for high-value segments).
- Ease of Online booking: target ≥4.0 via UX clarity, fewer steps, resilient payment/seat maps.
- Inflight wifi: target ≥4.0 via bandwidth QoS, captive-portal speed, reliability KPIs.
Sequence actions for ROI: Start where rows_changed_% is high and uplift_mean is strong (Booking ≥4.0, Wifi ≥4.0), then layer Online support improvements (≥4.0 baseline, pilot ≥4.5 in select routes/segments).
Pilot, then scale: Run focused A/B pilots on the dissatisfied cohort with these thresholds, tracking realized CSAT deltas vs model-predicted uplift; expand to adjacent segments that behaved similarly in personas/composition charts.
Guardrails & QA: Track share_positive, median uplift, and cap-hits (%) per lever during rollouts; if a left tail emerges, adjust thresholds or narrow targeting.
Institutionalize feedback: Bake these three levers into an operational scorecard; re-train quarterly to capture seasonal and operational changes (new aircraft, route mix, staffing).

Conclusion

This analysis translated raw survey data into actionable thresholds and quantified impact, moving beyond “what correlates” to “what should we raise, and by how much, to move CSAT?” By combining calibrated prediction, driver sensitivity, and counterfactual uplift, the airline gets a clear, testable playbook: raise Online support, Online booking, and Inflight wifi to the 4.0 quality band, expect a broad, material lift in satisfaction, and tune further by segment to maximize ROI. The result is a pragmatic blueprint for prioritized service improvements that are measurable, communicable, and ready for pilot-to-scale execution.

	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	NaN	NaN	NaN	2.0	2.0	4.0	2.0	3.0	3.0	NaN	3	5.0	3.0	2.0
1	NaN	NaN	NaN	3.0	NaN	2.0	2.0	3.0	4.0	4.0	4	2.0	3.0	2.0
2	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	3.0	3.0	4	4.0	4.0	2.0
3	NaN	NaN	NaN	3.0	3.0	4.0	3.0	1.0	1.0	NaN	1	4.0	1.0	3.0
4	NaN	NaN	NaN	3.0	4.0	3.0	4.0	2.0	2.0	NaN	2	4.0	2.0	5.0

	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	NaN	NaN	NaN	2.0	2.0	4.0	2.0	3.0	3.0	NaN	3	5.0	3.0	2.0
1	NaN	NaN	NaN	3.0	NaN	2.0	2.0	3.0	4.0	4.0	4	2.0	3.0	2.0
2	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	3.0	3.0	4	4.0	4.0	2.0
3	NaN	NaN	NaN	3.0	3.0	4.0	3.0	1.0	1.0	NaN	1	4.0	1.0	3.0
4	NaN	NaN	NaN	3.0	4.0	3.0	4.0	2.0	2.0	NaN	2	4.0	2.0	5.0

01 - Introduction

Project Overview

Approach (end-to-end)

Business Objective

What success looks like

Constraints & scope

02 - EDA & Data Hygiene

Setup & Imports

Load Data

Examine Raw Data

Columns, non-null counts, datatypes, missingness

Ratings: min/max; “0” values identified

Persist cleaned ratings (0 → NaN)

Raw DataFrame (first 10 rows)

Cleaned DataFrame (first 10 rows)

Examine Cleaned Data

Ratings ranges, columns, non-null counts, datatypes (cleaned)

Numeric ranges (cleaned)

Categoricals (cleaned)

Cardinality by column

Customer Type

Type of Travel

Class

03 - Modeling

Target & Feature Groups

Modeling inputs — quick reference

Encode target

Shapes & random sample

Target distribution and balance

Feature-group samples

Ratings (first 5 rows):

Categoricals (first 5 rows):

Numerics (first 5 rows):

Train/Valid Split (Holdout)

Missingness by split

Train missing % (numeric)

Valid missing % (numeric)

Train missing % (ratings) — top 5

Valid missing % (ratings) — top 5

Preprocessing Pipeline

Sample outputs from cloned preprocessor

Fit on TRAIN, transform TRAIN/VALID

Transformed shapes & dtypes

NaNs/inf check

Zero-variance features (TRAIN)

Numeric scaling (z-scores, TRAIN)

Unseen categories in VALID

Feature matrix density

Ratings encoding range

Model Comparison

Pipeline diagrams

Pipeline: logreg

Pipeline: rf

Evaluation & Thresholding

Fit on TRAIN, evaluate on VALID

Interpreting results:

Calibration curves

Interpreting results:

Predicted probability distribution

Interpreting results:

Evaluation metrics table

Interpreting results:

Random Forest probability calibration (5-fold CV)

Interpreting results:

Decision threshold sweep

Interpreting results:

Confusion matrix

Interpreting results:

04 - Personas (Clustering) & Diagnostics

Feature Prep & Standardization

Data shapes:

Interpreting results:

K selection overview

Elbow & silhouette scans

Interpreting results:

Compare K candidates (metrics)

Interpreting K metrics summary:

What this suggests:

Final choice:

Select final K

Pipeline: `logreg`

Pipeline: `rf`

	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding
0	NaN	NaN	NaN	2.0	2.0	4.0	2.0	3.0	3.0	NaN	3	5.0	3.0	2.0
1	NaN	NaN	NaN	3.0	NaN	2.0	2.0	3.0	4.0	4.0	4	2.0	3.0	2.0
2	NaN	NaN	NaN	3.0	2.0	NaN	2.0	2.0	3.0	3.0	4	4.0	4.0	2.0
3	NaN	NaN	NaN	3.0	3.0	4.0	3.0	1.0	1.0	NaN	1	4.0	1.0	3.0
4	NaN	NaN	NaN	3.0	4.0	3.0	4.0	2.0	2.0	NaN	2	4.0	2.0	5.0