cbrkit.sim
Similarity measures for different data types with aggregation and utility functions.
CBRkit provides built-in similarity measures for standard data types as well as
utilities for combining, caching, and transforming them.
All similarity functions follow the signature sim = f(x, y) or the batch
variant sims = f([(x1, y1), ...]).
Built-in measures are provided through generator functions that return a
configured similarity function.
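The relationship between the pairwise and the batch signature can be sketched with a small stand-alone wrapper. This is an illustrative simplification of the batching CBRkit performs internally, not its actual implementation:

```python
from collections.abc import Sequence
from typing import Callable

def batchify(
    sim: Callable[[str, str], float],
) -> Callable[[Sequence[tuple[str, str]]], list[float]]:
    """Lift a pairwise function sim(x, y) to the batch form sims([(x1, y1), ...])."""
    def batch_sim(pairs: Sequence[tuple[str, str]]) -> list[float]:
        return [sim(x, y) for x, y in pairs]
    return batch_sim

# A pairwise equality measure and its batch variant:
equality = lambda x, y: 1.0 if x == y else 0.0
batch_equality = batchify(equality)
print(batch_equality([("a", "a"), ("a", "b")]))  # [1.0, 0.0]
```

CBRkit accepts either form wherever a similarity function is expected and normalizes to the batch form internally.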
Submodules:
- cbrkit.sim.numbers: Numeric similarity (linear, exponential, threshold, sigmoid, step).
- cbrkit.sim.strings: String similarity (Levenshtein, Jaro, Jaro-Winkler, spaCy, NLTK).
- cbrkit.sim.collections: Collection and sequence similarity (Jaccard, Dice, etc.).
- cbrkit.sim.generic: Generic similarity (equality, static, tables).
- cbrkit.sim.embed: Embedding-based similarity with caching (Sentence Transformers, OpenAI).
- cbrkit.sim.graphs: Graph similarity algorithms (A*, VF2, greedy, LAP, etc.).
- cbrkit.sim.taxonomy: Taxonomy-based similarity (Wu-Palmer and others).
- cbrkit.sim.pooling: Pooling functions for aggregating multiple values.
Top-Level Functions:
- attribute_value: Computes similarity for attribute-value data by applying per-attribute measures and aggregating them into a global score.
- aggregator: Creates an aggregation function that combines multiple local similarity scores into a single global score using a pooling strategy.
- combine: Combines multiple similarity functions and aggregates results.
- cache: Wraps a similarity function with result caching.
- transpose / transpose_value: Transforms inputs before passing them to a similarity function.
- table / dynamic_table / type_table / attribute_table: Lookup-based similarity dispatching.
Example:
>>> sim_func = attribute_value(
...     attributes={
...         "price": numbers.linear(max=100000),
...         "color": generic.equality(),
...     },
...     aggregator=aggregator(pooling="mean"),
... )
"""Similarity measures for different data types with aggregation and utility functions.

CBRkit provides built-in similarity measures for standard data types as well as
utilities for combining, caching, and transforming them.
All similarity functions follow the signature `sim = f(x, y)` or the batch
variant `sims = f([(x1, y1), ...])`.
Built-in measures are provided through generator functions that return a
configured similarity function.

Submodules:
- `cbrkit.sim.numbers`: Numeric similarity (linear, exponential, threshold, sigmoid, step).
- `cbrkit.sim.strings`: String similarity (Levenshtein, Jaro, Jaro-Winkler, spaCy, NLTK).
- `cbrkit.sim.collections`: Collection and sequence similarity (Jaccard, Dice, etc.).
- `cbrkit.sim.generic`: Generic similarity (equality, static, tables).
- `cbrkit.sim.embed`: Embedding-based similarity with caching (Sentence Transformers, OpenAI).
- `cbrkit.sim.graphs`: Graph similarity algorithms (A*, VF2, greedy, LAP, etc.).
- `cbrkit.sim.taxonomy`: Taxonomy-based similarity (Wu-Palmer and others).
- `cbrkit.sim.pooling`: Pooling functions for aggregating multiple values.

Top-Level Functions:
- `attribute_value`: Computes similarity for attribute-value data by applying
  per-attribute measures and aggregating them into a global score.
- `aggregator`: Creates an aggregation function that combines multiple local
  similarity scores into a single global score using a pooling strategy.
- `combine`: Combines multiple similarity functions and aggregates results.
- `cache`: Wraps a similarity function with result caching.
- `transpose` / `transpose_value`: Transforms inputs before passing them
  to a similarity function.
- `table` / `dynamic_table` / `type_table` / `attribute_table`: Lookup-based
  similarity dispatching.

Example:
    >>> sim_func = attribute_value(
    ...     attributes={
    ...         "price": numbers.linear(max=100000),
    ...         "color": generic.equality(),
    ...     },
    ...     aggregator=aggregator(pooling="mean"),
    ... )
"""

from . import collections, embed, generic, graphs, numbers, pooling, strings, taxonomy
from .aggregator import aggregator
from .pooling import PoolingName
from .attribute_value import AttributeValueSim, attribute_value
from .wrappers import (
    attribute_table,
    cache,
    combine,
    dynamic_table,
    table,
    transpose,
    transpose_value,
    type_table,
)

__all__ = [
    "transpose",
    "transpose_value",
    "cache",
    "combine",
    "table",
    "dynamic_table",
    "type_table",
    "attribute_table",
    "collections",
    "generic",
    "numbers",
    "strings",
    "attribute_value",
    "graphs",
    "embed",
    "taxonomy",
    "pooling",
    "aggregator",
    "PoolingName",
    "AttributeValueSim",
]
@dataclass(slots=True)
class transpose[V1, V2, S: Float](BatchSimFunc[V1, S]):
    """Transforms a similarity function from one type to another.

    Args:
        similarity_func: The similarity function to be used on the converted values.
        conversion_func: A function that converts the input values from one type to another.

    Examples:
        >>> from cbrkit.sim.generic import equality
        >>> sim = transpose(
        ...     similarity_func=equality(),
        ...     conversion_func=lambda x: x.lower(),
        ... )
        >>> sim([("A", "a"), ("b", "B")])
        [1.0, 1.0]
    """

    similarity_func: BatchSimFunc[V2, S]
    conversion_func: ConversionFunc[V1, V2]

    def __init__(
        self,
        similarity_func: AnySimFunc[V2, S],
        conversion_func: ConversionFunc[V1, V2],
    ):
        self.similarity_func = batchify_sim(similarity_func)
        self.conversion_func = conversion_func

    @override
    def __call__(self, batches: Sequence[tuple[V1, V1]]) -> Sequence[S]:
        return self.similarity_func(
            [(self.conversion_func(x), self.conversion_func(y)) for x, y in batches]
        )
Transforms a similarity function from one type to another.
Arguments:
- similarity_func: The similarity function to be used on the converted values.
- conversion_func: A function that converts the input values from one type to another.
Examples:
>>> from cbrkit.sim.generic import equality
>>> sim = transpose(
...     similarity_func=equality(),
...     conversion_func=lambda x: x.lower(),
... )
>>> sim([("A", "a"), ("b", "B")])
[1.0, 1.0]
def transpose_value[V, S: Float](
    func: AnySimFunc[V, S],
) -> BatchSimFunc[StructuredValue[V], S]:
    """Create a transposed similarity function that extracts values before comparing."""
    return transpose(func, get_value)
Create a transposed similarity function that extracts values before comparing.
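Conceptually, this is `transpose` with a conversion function that unwraps a structured value before comparing. A self-contained sketch of the idea, using a stand-in dataclass instead of cbrkit's own `StructuredValue` and a simplified closure-based `transpose`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Wrapped:
    value: str  # illustrative stand-in for cbrkit's StructuredValue

def transpose(similarity_func, conversion_func):
    # convert both inputs, then compare the converted values (simplified)
    return lambda pairs: similarity_func(
        [(conversion_func(x), conversion_func(y)) for x, y in pairs]
    )

batch_equality = lambda pairs: [1.0 if x == y else 0.0 for x, y in pairs]
# transpose_value(equality) in spirit: extract .value, then compare
sim = transpose(batch_equality, lambda w: w.value)
print(sim([(Wrapped("a"), Wrapped("a")), (Wrapped("a"), Wrapped("b"))]))  # [1.0, 0.0]
```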
@dataclass(slots=True)
class cache[V, U, S: Float](BatchSimFunc[V, S]):
    """Caches similarity results to avoid redundant computations.

    Args:
        similarity_func: The similarity function to cache.
        conversion_func: Optional function to convert values to cache keys.

    Examples:
        >>> from cbrkit.sim.generic import equality
        >>> sim = cache(equality())
        >>> len(sim.store)
        0
        >>> sim([("a", "a"), ("b", "b")])
        [1.0, 1.0]
        >>> len(sim.store)
        2
        >>> sim([("a", "a")])
        [1.0]
        >>> len(sim.store)
        2
    """

    similarity_func: BatchSimFunc[V, S]
    conversion_func: ConversionFunc[V, U] | None
    store: MutableMapping[tuple[U, U], S] = field(repr=False)

    def __init__(
        self,
        similarity_func: AnySimFunc[V, S],
        conversion_func: ConversionFunc[V, U] | None = None,
    ):
        self.similarity_func = batchify_sim(similarity_func)
        self.conversion_func = conversion_func
        self.store = {}

    @override
    def __call__(self, batches: Sequence[tuple[V, V]]) -> SimSeq[S]:
        transformed_batches = (
            [(self.conversion_func(x), self.conversion_func(y)) for x, y in batches]
            if self.conversion_func is not None
            else cast(list[tuple[U, U]], batches)
        )
        uncached_indexes = [
            idx
            for idx, pair in enumerate(transformed_batches)
            if pair not in self.store
        ]

        uncached_sims = self.similarity_func([batches[idx] for idx in uncached_indexes])
        self.store.update(
            {
                transformed_batches[idx]: sim
                for idx, sim in zip(uncached_indexes, uncached_sims, strict=True)
            }
        )

        return [self.store[pair] for pair in transformed_batches]
Caches similarity results to avoid redundant computations.
Arguments:
- similarity_func: The similarity function to cache.
- conversion_func: Optional function to convert values to cache keys.
Examples:
>>> from cbrkit.sim.generic import equality
>>> sim = cache(equality())
>>> len(sim.store)
0
>>> sim([("a", "a"), ("b", "b")])
[1.0, 1.0]
>>> len(sim.store)
2
>>> sim([("a", "a")])
[1.0]
>>> len(sim.store)
2
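As the source above shows, `conversion_func` determines only the cache key; the wrapped function still receives the original values. A simplified self-contained sketch of this behavior (a closure, not cbrkit's actual class):

```python
def cached(similarity_func, conversion_func=None):
    """Cache batch similarity results, optionally keyed by converted values."""
    store = {}

    def call(pairs):
        keys = [
            (conversion_func(x), conversion_func(y)) if conversion_func else (x, y)
            for x, y in pairs
        ]
        # compute only the pairs whose key is not cached yet
        missing = [i for i, k in enumerate(keys) if k not in store]
        for i, sim in zip(missing, similarity_func([pairs[i] for i in missing])):
            store[keys[i]] = sim
        return [store[k] for k in keys]

    call.store = store
    return call

case_insensitive = lambda pairs: [1.0 if x.lower() == y.lower() else 0.0 for x, y in pairs]
sim = cached(case_insensitive, conversion_func=str.lower)
print(sim([("A", "a")]))  # computed once, stored under key ("a", "a"): [1.0]
print(sim([("a", "A")]))  # same key, served from the cache: [1.0]
print(len(sim.store))     # 1
```

Normalizing keys this way lets distinct-but-equivalent inputs share one cache entry; the similarity function itself must of course agree with that notion of equivalence.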
@dataclass(slots=True)
class combine[V, S: Float](BatchSimFunc[V, float]):
    """Combines multiple similarity functions into one.

    Args:
        sim_funcs: A list of similarity functions to be combined.
        aggregator: A function to aggregate the results from the similarity functions.

    Returns:
        A similarity function that combines the results from multiple similarity functions.

    Examples:
        >>> from cbrkit.sim.generic import equality, static
        >>> sim = combine([equality(), static(0.5)])
        >>> sim([("a", "a"), ("a", "b")])
        [0.75, 0.25]
    """

    sim_funcs: InitVar[Sequence[AnySimFunc[V, S]] | Mapping[str, AnySimFunc[V, S]]]
    aggregator: AggregatorFunc[str, S] = default_aggregator
    batch_sim_funcs: Sequence[BatchSimFunc[V, S]] | Mapping[str, BatchSimFunc[V, S]] = (
        field(init=False, repr=False)
    )

    def __post_init__(
        self, sim_funcs: Sequence[AnySimFunc[V, S]] | Mapping[str, AnySimFunc[V, S]]
    ):
        if isinstance(sim_funcs, Mapping):
            funcs_map = cast(Mapping[str, AnySimFunc[V, S]], sim_funcs)
            self.batch_sim_funcs = {
                key: batchify_sim(func) for key, func in funcs_map.items()
            }
        elif isinstance(sim_funcs, Sequence):
            self.batch_sim_funcs = [batchify_sim(func) for func in sim_funcs]
        else:
            raise ValueError(f"Invalid sim_funcs type: {type(sim_funcs)}")

    @override
    def __call__(self, batches: Sequence[tuple[V, V]]) -> Sequence[float]:
        if isinstance(self.batch_sim_funcs, Mapping):
            funcs_map = cast(Mapping[str, BatchSimFunc[V, S]], self.batch_sim_funcs)
            func_results = {
                func_key: func(batches) for func_key, func in funcs_map.items()
            }

            return [
                self.aggregator(
                    {
                        func_key: batch_results[batch_idx]
                        for func_key, batch_results in func_results.items()
                    }
                )
                for batch_idx in range(len(batches))
            ]

        elif isinstance(self.batch_sim_funcs, Sequence):
            func_results = [func(batches) for func in self.batch_sim_funcs]

            return [
                self.aggregator(
                    [batch_results[batch_idx] for batch_results in func_results]
                )
                for batch_idx in range(len(batches))
            ]

        raise ValueError(f"Invalid batch_sim_funcs type: {type(self.batch_sim_funcs)}")
Combines multiple similarity functions into one.
Arguments:
- sim_funcs: A list of similarity functions to be combined.
- aggregator: A function to aggregate the results from the similarity functions.
Returns:
A similarity function that combines the results from multiple similarity functions.
Examples:
>>> from cbrkit.sim.generic import equality, static
>>> sim = combine([equality(), static(0.5)])
>>> sim([("a", "a"), ("a", "b")])
[0.75, 0.25]
@dataclass(slots=True)
class dynamic_table[K, U, V, S: Float](BatchSimFunc[U | V, S], HasMetadata):
    """Looks up similarity functions from a table.

    Args:
        entries: A mapping from key pairs (a, b), or single keys, to similarity functions
        symmetric: If True, the table is assumed to be symmetric, i.e. sim(a, b) = sim(b, a)
        default: Default similarity value for pairs not in the table
        key_getter: A function that extracts the key for lookup from the input values

    Examples:
        >>> from cbrkit.helpers import identity
        >>> from cbrkit.sim.generic import static
        >>> sim = dynamic_table(
        ...     {
        ...         ("a", "b"): static(0.5),
        ...         ("b", "c"): static(0.7)
        ...     },
        ...     symmetric=True,
        ...     default=static(0.0),
        ...     key_getter=identity,
        ... )
        >>> sim([("b", "a"), ("a", "c")])
        [0.5, 0.0]
    """

    symmetric: bool
    default: BatchSimFunc[U, S] | None
    key_getter: Callable[[Any], K]
    table: dict[tuple[K, K], BatchSimFunc[V, S]]

    @property
    @override
    def metadata(self) -> JsonDict:
        """Return metadata describing the dynamic table configuration."""
        return {
            "symmetric": self.symmetric,
            "default": get_metadata(self.default),
            "key_getter": get_metadata(self.key_getter),
            "table": [
                {
                    "x": str(k[0]),
                    "y": str(k[1]),
                    "value": get_metadata(v),
                }
                for k, v in self.table.items()
            ],
        }

    def __init__(
        self,
        entries: Mapping[tuple[K, K], AnySimFunc[..., S]]
        | Mapping[K, AnySimFunc[..., S]],
        key_getter: Callable[[Any], K],
        default: AnySimFunc[U, S] | S | None = None,
        symmetric: bool = True,
    ):
        self.symmetric = symmetric
        self.key_getter = key_getter
        self.table = {}

        if default is None:
            self.default = None
        elif isinstance(default, Callable):
            self.default = batchify_sim(cast(AnySimFunc[U, S], default))
        else:
            self.default = batchify_sim(static(cast(S, default)))

        for key, val in entries.items():
            func = batchify_sim(val)

            if isinstance(key, tuple):
                x, y = cast(tuple[K, K], key)
            else:
                x = y = cast(K, key)

            self.table[(x, y)] = func

            if self.symmetric and x != y:
                self.table[(y, x)] = func

    @override
    def __call__(self, batches: Sequence[tuple[U | V, U | V]]) -> SimSeq[S]:
        # group the batches by key to avoid redundant computations
        idx_map: defaultdict[tuple[K, K] | None, list[int]] = defaultdict(list)

        for idx, (x, y) in enumerate(batches):
            key = (self.key_getter(x), self.key_getter(y))

            if key in self.table:
                idx_map[key].append(idx)
            else:
                idx_map[None].append(idx)

        # now we compute the similarities
        results: dict[int, S] = {}

        for key, idxs in idx_map.items():
            sim_func = cast(
                BatchSimFunc[U | V, S] | None,
                self.table.get(key) if key is not None else self.default,
            )

            if sim_func is None:
                missing_entries = [batches[idx] for idx in idxs]
                missing_keys = {
                    (self.key_getter(x), self.key_getter(y)) for x, y in missing_entries
                }

                raise ValueError(f"Pairs {missing_keys} not in the table")

            sims = sim_func([batches[idx] for idx in idxs])

            for idx, sim in zip(idxs, sims, strict=True):
                results[idx] = sim

        return [results[idx] for idx in range(len(batches))]
Looks up similarity functions from a table.
Arguments:
- entries: A mapping from key pairs (a, b), or single keys, to similarity functions.
- symmetric: If True, the table is assumed to be symmetric, i.e. sim(a, b) = sim(b, a).
- default: Default similarity value for pairs not in the table.
- key_getter: A function that extracts the key for lookup from the input values.
Examples:
>>> from cbrkit.helpers import identity
>>> from cbrkit.sim.generic import static
>>> sim = dynamic_table(
...     {
...         ("a", "b"): static(0.5),
...         ("b", "c"): static(0.7)
...     },
...     symmetric=True,
...     default=static(0.0),
...     key_getter=identity,
... )
>>> sim([("b", "a"), ("a", "c")])
[0.5, 0.0]
    def __init__(
        self,
        entries: Mapping[tuple[K, K], AnySimFunc[..., S]]
        | Mapping[K, AnySimFunc[..., S]],
        key_getter: Callable[[Any], K],
        default: AnySimFunc[U, S] | S | None = None,
        symmetric: bool = True,
    ):
        self.symmetric = symmetric
        self.key_getter = key_getter
        self.table = {}

        if default is None:
            self.default = None
        elif isinstance(default, Callable):
            self.default = batchify_sim(cast(AnySimFunc[U, S], default))
        else:
            self.default = batchify_sim(static(cast(S, default)))

        for key, val in entries.items():
            func = batchify_sim(val)

            if isinstance(key, tuple):
                x, y = cast(tuple[K, K], key)
            else:
                x = y = cast(K, key)

            self.table[(x, y)] = func

            if self.symmetric and x != y:
                self.table[(y, x)] = func
def type_table[U, V, S: Float](
    entries: Mapping[type[V], AnySimFunc[..., S]],
    default: AnySimFunc[U, S] | S | None = None,
) -> BatchSimFunc[U | V, S]:
    """Create a dynamic table that dispatches similarity functions by value type."""
    return dynamic_table(
        entries=cast(Mapping[type, AnySimFunc[..., S]], entries),
        key_getter=type,
        default=default,
        symmetric=False,
    )
Create a dynamic table that dispatches similarity functions by value type.
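Because the key getter is `type`, each pair is routed to the function registered for its value type. A self-contained sketch of the dispatch idea, simplified from the table above (it keys on the first element only, whereas cbrkit's table keys on both values; the measures here are illustrative):

```python
def type_table(entries, default=None):
    """Dispatch each pair to a similarity function chosen by value type (simplified)."""
    def call(pairs):
        out = []
        for x, y in pairs:
            func = entries.get(type(x), default)
            if func is None:
                raise ValueError(f"No entry for type {type(x)}")
            out.append(func(x, y))
        return out
    return call

sim = type_table({
    int: lambda x, y: 1.0 - min(abs(x - y) / 10, 1.0),  # numeric closeness
    str: lambda x, y: 1.0 if x == y else 0.0,           # exact match
})
print(sim([(3, 5), ("a", "a")]))  # [0.8, 1.0]
```

This is useful for heterogeneous case bases where a single attribute may hold values of different types.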
def attribute_table[K, U, S: Float](
    entries: Mapping[K, AnySimFunc[..., S]],
    attribute: str,
    default: AnySimFunc[U, S] | S | None = None,
    value_getter: Callable[[Any, str], K] = getitem_or_getattr,
) -> BatchSimFunc[Any, S]:
    """Create a dynamic table that dispatches similarity functions by attribute value."""
    key_getter = attribute_table_key_getter(value_getter, attribute)

    return dynamic_table(
        entries=entries,
        key_getter=key_getter,
        default=default,
        symmetric=False,
    )
Create a dynamic table that dispatches similarity functions by attribute value.
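Here the lookup key is the value of a chosen attribute rather than the type. A self-contained sketch of the dispatch idea (simplified to dictionary cases and keying on the first element only; the per-category measures are purely illustrative):

```python
def attribute_table(entries, attribute, default=None):
    """Dispatch each pair by the attribute value of the first case (simplified)."""
    def call(pairs):
        return [entries.get(x[attribute], default)(x, y) for x, y in pairs]
    return call

# hypothetical per-category price tolerances
sim = attribute_table(
    {
        "car": lambda x, y: 1.0 - min(abs(x["price"] - y["price"]) / 50000, 1.0),
        "bike": lambda x, y: 1.0 - min(abs(x["price"] - y["price"]) / 1000, 1.0),
    },
    attribute="category",
)
a = {"category": "car", "price": 20000}
b = {"category": "car", "price": 30000}
print(sim([(a, b)]))  # [0.8]
```

In cbrkit, `value_getter` defaults to `getitem_or_getattr`, so both mappings and objects with attributes are supported.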
@dataclass(slots=True, frozen=True)
class attribute_value[V, S: Float](BatchSimFunc[V, AttributeValueSim[S]]):
    """Similarity function that computes the attribute value similarity between two cases.

    Args:
        attributes: A mapping of attribute names to the similarity functions to be used for those attributes.
        aggregator: A function that aggregates the local similarity scores for each attribute into a single global similarity.
        value_getter: A function that retrieves the value of an attribute from a case.
        default: The default similarity score to use when an error occurs during the computation of a similarity score.
            For example, if a case does not have an attribute that is required for the similarity computation.

    Examples:
        >>> equality = lambda x, y: 1.0 if x == y else 0.0
        >>> sim = attribute_value({
        ...     "name": equality,
        ...     "age": equality,
        ... })
        >>> scores = sim([
        ...     ({"name": "John", "age": 25}, {"name": "John", "age": 30}),
        ...     ({"name": "Jane", "age": 30}, {"name": "John", "age": 30}),
        ... ])
        >>> scores[0]
        AttributeValueSim(value=0.5, attributes={'name': 1.0, 'age': 0.0})
        >>> scores[1]
        AttributeValueSim(value=0.5, attributes={'name': 0.0, 'age': 1.0})
    """

    attributes: Mapping[str, AnySimFunc[Any, S]]
    aggregator: AggregatorFunc[str, S] = default_aggregator
    value_getter: Callable[[Any, str], Any] = getitem_or_getattr
    default: S | None = None

    @override
    def __call__(self, batches: Sequence[tuple[V, V]]) -> SimSeq[AttributeValueSim[S]]:
        if len(batches) == 0:
            return []

        local_sims: list[dict[str, S]] = [dict() for _ in range(len(batches))]

        for attr_name in self.attributes:
            logger.debug(f"Processing attribute {attr_name}")

            try:
                nonempty_pairs: dict[int, tuple[Any, Any]] = {}

                for idx, (x, y) in enumerate(batches):
                    x_val = self.value_getter(x, attr_name)
                    y_val = self.value_getter(y, attr_name)

                    if x_val is None or y_val is None:
                        if self.default is None:
                            raise ValueError(
                                f"Attribute '{attr_name}' has None value at index {idx}"
                            )
                        local_sims[idx][attr_name] = self.default
                    else:
                        nonempty_pairs[idx] = (x_val, y_val)

                if nonempty_pairs:
                    sim_func = batchify_sim(self.attributes[attr_name])

                    for i, sim in zip(
                        nonempty_pairs, sim_func(list(nonempty_pairs.values()))
                    ):
                        local_sims[i][attr_name] = sim

            except Exception as e:
                if self.default is not None:
                    for idx in range(len(batches)):
                        local_sims[idx][attr_name] = self.default
                else:
                    raise e

        return [AttributeValueSim(self.aggregator(sims), sims) for sims in local_sims]
Similarity function that computes the attribute value similarity between two cases.
Arguments:
- attributes: A mapping of attribute names to the similarity functions to be used for those attributes.
- aggregator: A function that aggregates the local similarity scores for each attribute into a single global similarity.
- value_getter: A function that retrieves the value of an attribute from a case.
- default: The default similarity score to use when an error occurs during the computation of a similarity score. For example, if a case does not have an attribute that is required for the similarity computation.
Examples:
>>> equality = lambda x, y: 1.0 if x == y else 0.0
>>> sim = attribute_value({
...     "name": equality,
...     "age": equality,
... })
>>> scores = sim([
...     ({"name": "John", "age": 25}, {"name": "John", "age": 30}),
...     ({"name": "Jane", "age": 30}, {"name": "John", "age": 30}),
... ])
>>> scores[0]
AttributeValueSim(value=0.5, attributes={'name': 1.0, 'age': 0.0})
>>> scores[1]
AttributeValueSim(value=0.5, attributes={'name': 0.0, 'age': 1.0})
@dataclass(slots=True, frozen=True)
class aggregator[K](AggregatorFunc[K, Float]):
    """
    Aggregates local similarities to a global similarity using the specified pooling function.

    Args:
        pooling: The pooling function to use. It can be either a string representing the name of the pooling function or a custom pooling function (see `cbrkit.typing.PoolingFunc`).
        pooling_weights: The weights to apply to the similarities during pooling. It can be a sequence or a mapping. If None, every local similarity is weighted equally.
        default_pooling_weight: The default weight to use if a similarity key is not found in the pooling_weights mapping.

    Examples:
        >>> agg = aggregator("mean")
        >>> agg([0.5, 0.75, 1.0])
        0.75
        >>> agg = aggregator("mean", {1: 1, 2: 1, 3: 0})
        >>> agg({1: 1, 2: 1, 3: 1})
        1.0
        >>> agg = aggregator("mean", {1: 1, 2: 1, 3: 2})
        >>> agg({1: 1, 2: 1, 3: 1})
        1.0
    """

    pooling: PoolingName | PoolingFunc[float] = "mean"
    pooling_weights: SimMap[K, float] | SimSeq[float] | None = None
    default_pooling_weight: float = 1.0

    @override
    def __call__(self, similarities: SimMap[K, Float] | SimSeq[Float]) -> float:
        pooling_func = (
            pooling_funcs[cast(PoolingName, self.pooling)]
            if isinstance(self.pooling, str)
            else self.pooling
        )
        assert (self.pooling_weights is None) or (
            type(similarities) is type(self.pooling_weights)  # noqa: E721
        )

        pooling_factor = 1.0
        sims: Sequence[float]  # noqa: F821

        if isinstance(similarities, Mapping) and isinstance(
            self.pooling_weights, Mapping
        ):
            sim_map = cast(SimMap[K, Float], similarities)
            weight_map = cast(SimMap[K, float], self.pooling_weights)
            sims = [
                unpack_float(sim) * weight_map.get(key, self.default_pooling_weight)
                for key, sim in sim_map.items()
            ]
            pooling_factor = len(sim_map) / sum(
                weight_map.get(key, self.default_pooling_weight)
                for key in sim_map.keys()
            )
        elif isinstance(similarities, Sequence) and isinstance(
            self.pooling_weights, Sequence
        ):
            sims = [
                unpack_float(s) * w
                for s, w in zip(similarities, self.pooling_weights, strict=True)
            ]
            pooling_factor = len(similarities) / sum(self.pooling_weights)
        elif isinstance(similarities, Sequence) and self.pooling_weights is None:
            sim_seq = cast(SimSeq[Float], similarities)
            sims = [unpack_float(s) for s in sim_seq]
        elif isinstance(similarities, Mapping) and self.pooling_weights is None:
            sim_map = cast(SimMap[K, Float], similarities)
            sims = [unpack_float(s) for s in sim_map.values()]
        else:
            raise NotImplementedError()

        return pooling_func(sims) * pooling_factor
Aggregates local similarities to a global similarity using the specified pooling function.
Arguments:
- pooling: The pooling function to use. It can be either a string representing the name of the pooling function or a custom pooling function (see cbrkit.typing.PoolingFunc).
- pooling_weights: The weights to apply to the similarities during pooling. It can be a sequence or a mapping. If None, every local similarity is weighted equally.
- default_pooling_weight: The default weight to use if a similarity key is not found in the pooling_weights mapping.
Examples:
>>> agg = aggregator("mean")
>>> agg([0.5, 0.75, 1.0])
0.75
>>> agg = aggregator("mean", {1: 1, 2: 1, 3: 0})
>>> agg({1: 1, 2: 1, 3: 1})
1.0
>>> agg = aggregator("mean", {1: 1, 2: 1, 3: 2})
>>> agg({1: 1, 2: 1, 3: 1})
1.0
@dataclass(slots=True, frozen=True)
class AttributeValueSim[S: Float](StructuredValue[float]):
    """Result of an attribute-value similarity computation with per-attribute scores."""

    attributes: Mapping[str, S]
Result of an attribute-value similarity computation with per-attribute scores.