Embedding
Visual embedding workflow
1. Balance map
People turn to semantic balance analysis using embeddings when they need to understand whether their dataset is fair, representative, and structurally complete: not just in terms of raw counts, but in terms of meaning. Traditional distributions can show how many samples fall into each category, but only embeddings reveal deeper patterns: whether certain concepts dominate, whether clusters are missing or underrepresented, whether two groups that "look balanced" numerically are actually very different semantically. This is crucial in ML, robotics, recommendation systems, audio/vision datasets, and any scenario where meaningful coverage matters more than labels alone.
BalanceMap in SmooSense makes this effortless: it visualizes the embedding space as a bubble plot, computes relative ratios, and colors each bubble by its level of imbalance.
1.1 Ratio-based color encoding for balance
Color is determined not by raw counts but by relative balance across the groups of a breakdown (e.g., training/validation/test splits), because those groups inherently differ in size. The image below shows the distribution of fold. If bubbles were colored by raw counts, the largest group would dominate and you would only see information from the training fold.
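For each bubble, we compute the ratio of samples of that bubble within its breakdown group, that is, the number of the group's samples that fall in the bubble divided by the total number of samples in that group.
We then compare these ratios across groups to decide the bubble's color.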
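The two steps can be sketched in a few lines of Python. This is only an illustration, not SmooSense's actual implementation: the function names bubble_ratios and group_share are hypothetical, and the comparison shown (one group's ratio as a fraction of the sum of ratios across all groups) is an assumed way of comparing ratios, chosen because it is independent of group size.

```python
from collections import Counter


def bubble_ratios(bubbles, groups):
    """Return {bubble: {group: ratio}}, ratio = n(bubble, group) / n(group)."""
    group_sizes = Counter(groups)               # total samples per group
    pair_counts = Counter(zip(bubbles, groups))  # samples per (bubble, group)
    return {
        bubble: {
            group: pair_counts[(bubble, group)] / size
            for group, size in group_sizes.items()
        }
        for bubble in set(bubbles)
    }


def group_share(ratios_for_bubble, group):
    """Assumed comparison: this group's ratio over the sum of all groups' ratios."""
    total = sum(ratios_for_bubble.values())
    return ratios_for_bubble[group] / total if total else 0.0


# Toy data: 8 training samples and 2 test samples spread over bubbles A, B, C.
bubbles = ["A", "A", "A", "A", "B", "B", "C", "C", "A", "B"]
groups = ["train"] * 8 + ["test"] * 2

ratios = bubble_ratios(bubbles, groups)
for name in sorted(ratios):
    print(name, ratios[name], round(group_share(ratios[name], "train"), 3))
```

In this toy data, bubble A holds four training samples and one test sample, yet both ratios are 0.5, so it counts as balanced. Bubble C contains training samples only, so its training share is 1.0, which corresponds to the kind of single-fold cluster the map is meant to expose.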


1.2 Try it yourself
Zoom in and drag around: you can easily find a blue cluster where all the data is in the training fold, with no test or validation samples at all.
2. More
Full embedding features (search, clustering, etc.) are coming.