cbrkit.sim

Similarity measures for different data types with aggregation and utility functions.

CBRkit provides built-in similarity measures for standard data types as well as utilities for combining, caching, and transforming them. All similarity functions follow the signature sim = f(x, y) or the batch variant sims = f([(x1, y1), ...]). Built-in measures are provided through generator functions that return a configured similarity function.
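The relationship between the two signatures can be illustrated with a minimal, standard-library-only sketch. The helper name to_batch is hypothetical and is not part of CBRkit; it only shows how a pairwise function f(x, y) could be lifted into the batch form f([(x1, y1), ...]).

```python
from collections.abc import Callable, Sequence


def equality(x: object, y: object) -> float:
    """Pairwise form: sim = f(x, y)."""
    return 1.0 if x == y else 0.0


# Hypothetical helper for illustration only; not a CBRkit function.
def to_batch(
    pair_func: Callable[[object, object], float],
) -> Callable[[Sequence[tuple[object, object]]], list[float]]:
    """Lift a pairwise function into the batch form: sims = f([(x1, y1), ...])."""

    def batch_func(pairs: Sequence[tuple[object, object]]) -> list[float]:
        return [pair_func(x, y) for x, y in pairs]

    return batch_func


batch_equality = to_batch(equality)

single = equality("red", "red")
batch = batch_equality([("red", "red"), ("red", "blue")])
print(single, batch)
```

The batch form returns one score per input pair, in the same order as the pairs.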

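Submodules:

  • cbrkit.sim.numbers: Numeric similarity (linear, exponential, threshold, sigmoid, step).
  • cbrkit.sim.strings: String similarity (Levenshtein, Jaro, Jaro-Winkler, spaCy, NLTK).
  • cbrkit.sim.collections: Collection and sequence similarity (Jaccard, Dice, etc.).
  • cbrkit.sim.generic: Generic similarity (equality, static, tables).
  • cbrkit.sim.embed: Embedding-based similarity with caching (Sentence Transformers, OpenAI).
  • cbrkit.sim.graphs: Graph similarity algorithms (A*, VF2, greedy, LAP, etc.).
  • cbrkit.sim.taxonomy: Taxonomy-based similarity (Wu-Palmer and others).
  • cbrkit.sim.pooling: Pooling functions for aggregating multiple values.
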
Top-Level Functions:

  • attribute_value: Computes similarity for attribute-value data by applying per-attribute measures and aggregating them into a global score.
  • aggregator: Creates an aggregation function that combines multiple local similarity scores into a single global score using a pooling strategy.
  • combine: Combines multiple similarity functions and aggregates results.
  • cache: Wraps a similarity function with result caching.
  • transpose / transpose_value: Transforms inputs before passing them to a similarity function.
  • table / dynamic_table / type_table / attribute_table: Lookup-based similarity dispatching.
Example:
>>> sim_func = attribute_value(
...     attributes={
...         "price": numbers.linear(max=100000),
...         "color": generic.equality(),
...     },
...     aggregator=aggregator(pooling="mean"),
... )
"""Similarity measures for different data types with aggregation and utility functions.

CBRkit provides built-in similarity measures for standard data types as well as
utilities for combining, caching, and transforming them.
All similarity functions follow the signature `sim = f(x, y)` or the batch
variant `sims = f([(x1, y1), ...])`.
Built-in measures are provided through generator functions that return a
configured similarity function.

Submodules:
- `cbrkit.sim.numbers`: Numeric similarity (linear, exponential, threshold, sigmoid, step).
- `cbrkit.sim.strings`: String similarity (Levenshtein, Jaro, Jaro-Winkler, spaCy, NLTK).
- `cbrkit.sim.collections`: Collection and sequence similarity (Jaccard, Dice, etc.).
- `cbrkit.sim.generic`: Generic similarity (equality, static, tables).
- `cbrkit.sim.embed`: Embedding-based similarity with caching (Sentence Transformers, OpenAI).
- `cbrkit.sim.graphs`: Graph similarity algorithms (A*, VF2, greedy, LAP, etc.).
- `cbrkit.sim.taxonomy`: Taxonomy-based similarity (Wu-Palmer and others).
- `cbrkit.sim.pooling`: Pooling functions for aggregating multiple values.

Top-Level Functions:
- `attribute_value`: Computes similarity for attribute-value data by applying
  per-attribute measures and aggregating them into a global score.
- `aggregator`: Creates an aggregation function that combines multiple local
  similarity scores into a single global score using a pooling strategy.
- `combine`: Combines multiple similarity functions and aggregates results.
- `cache`: Wraps a similarity function with result caching.
- `transpose` / `transpose_value`: Transforms inputs before passing them
  to a similarity function.
- `table` / `dynamic_table` / `type_table` / `attribute_table`: Lookup-based
  similarity dispatching.

Example:
    >>> sim_func = attribute_value(
    ...     attributes={
    ...         "price": numbers.linear(max=100000),
    ...         "color": generic.equality(),
    ...     },
    ...     aggregator=aggregator(pooling="mean"),
    ... )
"""

from . import collections, embed, generic, graphs, numbers, pooling, strings, taxonomy
from .aggregator import aggregator
from .pooling import PoolingName
from .attribute_value import AttributeValueSim, attribute_value
from .wrappers import (
    attribute_table,
    cache,
    combine,
    dynamic_table,
    table,
    transpose,
    transpose_value,
    type_table,
)

__all__ = [
    "transpose",
    "transpose_value",
    "cache",
    "combine",
    "table",
    "dynamic_table",
    "type_table",
    "attribute_table",
    "collections",
    "generic",
    "numbers",
    "strings",
    "attribute_value",
    "graphs",
    "embed",
    "taxonomy",
    "pooling",
    "aggregator",
    "PoolingName",
    "AttributeValueSim",
]
@dataclass(slots=True)
class transpose(cbrkit.typing.BatchSimFunc[V1, S], typing.Generic[V1, V2, S]):
@dataclass(slots=True)
class transpose[V1, V2, S: Float](BatchSimFunc[V1, S]):
    """Transforms a similarity function from one type to another.

    Args:
        similarity_func: The similarity function to be used on the converted values.
        conversion_func: A function that converts the input values from one type to another.

    Examples:
        >>> from cbrkit.sim.generic import equality
        >>> sim = transpose(
        ...     similarity_func=equality(),
        ...     conversion_func=lambda x: x.lower(),
        ... )
        >>> sim([("A", "a"), ("b", "B")])
        [1.0, 1.0]
    """

    similarity_func: BatchSimFunc[V2, S]
    conversion_func: ConversionFunc[V1, V2]

    def __init__(
        self,
        similarity_func: AnySimFunc[V2, S],
        conversion_func: ConversionFunc[V1, V2],
    ):
        self.similarity_func = batchify_sim(similarity_func)
        self.conversion_func = conversion_func

    @override
    def __call__(self, batches: Sequence[tuple[V1, V1]]) -> Sequence[S]:
        return self.similarity_func(
            [(self.conversion_func(x), self.conversion_func(y)) for x, y in batches]
        )

Transforms a similarity function from one type to another.

Arguments:
  • similarity_func: The similarity function to be used on the converted values.
  • conversion_func: A function that converts the input values from one type to another.
Examples:
>>> from cbrkit.sim.generic import equality
>>> sim = transpose(
...     similarity_func=equality(),
...     conversion_func=lambda x: x.lower(),
... )
>>> sim([("A", "a"), ("b", "B")])
[1.0, 1.0]
transpose( similarity_func: AnySimFunc[V2, S], conversion_func: cbrkit.typing.ConversionFunc[V1, V2])
similarity_func: cbrkit.typing.BatchSimFunc[V2, S]
conversion_func: cbrkit.typing.ConversionFunc[V1, V2]
def transpose_value( func: AnySimFunc[V, S]) -> cbrkit.typing.BatchSimFunc[cbrkit.typing.StructuredValue[V], S]:
def transpose_value[V, S: Float](
    func: AnySimFunc[V, S],
) -> BatchSimFunc[StructuredValue[V], S]:
    """Create a transposed similarity function that extracts values before comparing."""
    return transpose(func, get_value)

Create a transposed similarity function that extracts values before comparing.
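The conversion step shared by transpose and transpose_value can be sketched without CBRkit. The names transpose_sketch and Labeled below are hypothetical, and the second usage assumes that extracting a value means reading a value attribute from a structured object; the library's own get_value helper may behave differently.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass


def batch_equality(pairs: Sequence[tuple[object, object]]) -> list[float]:
    """Batch similarity: 1.0 for equal values, 0.0 otherwise."""
    return [1.0 if x == y else 0.0 for x, y in pairs]


# Hypothetical stand-in for the wrapper documented above.
def transpose_sketch(similarity_func: Callable, conversion_func: Callable) -> Callable:
    """Convert both sides of every pair, then delegate to similarity_func."""

    def wrapped(pairs):
        converted = [(conversion_func(x), conversion_func(y)) for x, y in pairs]
        return similarity_func(converted)

    return wrapped


@dataclass(frozen=True)
class Labeled:
    """Hypothetical structured value carrying a payload plus extra data."""

    value: str
    source: str


# Compare strings case-insensitively.
case_insensitive = transpose_sketch(batch_equality, str.lower)
# Compare structured objects by their payload only (assumed behavior).
by_value = transpose_sketch(batch_equality, lambda item: item.value)

text_scores = case_insensitive([("A", "a"), ("b", "C")])
value_scores = by_value([(Labeled("x", "db"), Labeled("x", "user"))])
print(text_scores, value_scores)
```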

@dataclass(slots=True)
class cache(cbrkit.typing.BatchSimFunc[V, S], typing.Generic[V, U, S]):
@dataclass(slots=True)
class cache[V, U, S: Float](BatchSimFunc[V, S]):
    """Caches similarity results to avoid redundant computations.

    Args:
        similarity_func: The similarity function to cache.
        conversion_func: Optional function to convert values to cache keys.

    Examples:
        >>> from cbrkit.sim.generic import equality
        >>> sim = cache(equality())
        >>> len(sim.store)
        0
        >>> sim([("a", "a"), ("b", "b")])
        [1.0, 1.0]
        >>> len(sim.store)
        2
        >>> sim([("a", "a")])
        [1.0]
        >>> len(sim.store)
        2
    """

    similarity_func: BatchSimFunc[V, S]
    conversion_func: ConversionFunc[V, U] | None
    store: MutableMapping[tuple[U, U], S] = field(repr=False)

    def __init__(
        self,
        similarity_func: AnySimFunc[V, S],
        conversion_func: ConversionFunc[V, U] | None = None,
    ):
        self.similarity_func = batchify_sim(similarity_func)
        self.conversion_func = conversion_func
        self.store = {}

    @override
    def __call__(self, batches: Sequence[tuple[V, V]]) -> SimSeq[S]:
        transformed_batches = (
            [(self.conversion_func(x), self.conversion_func(y)) for x, y in batches]
            if self.conversion_func is not None
            else cast(list[tuple[U, U]], batches)
        )
        uncached_indexes = [
            idx
            for idx, pair in enumerate(transformed_batches)
            if pair not in self.store
        ]

        uncached_sims = self.similarity_func([batches[idx] for idx in uncached_indexes])
        self.store.update(
            {
                transformed_batches[idx]: sim
                for idx, sim in zip(uncached_indexes, uncached_sims, strict=True)
            }
        )

        return [self.store[pair] for pair in transformed_batches]

Caches similarity results to avoid redundant computations.

Arguments:
  • similarity_func: The similarity function to cache.
  • conversion_func: Optional function to convert values to cache keys.
Examples:
>>> from cbrkit.sim.generic import equality
>>> sim = cache(equality())
>>> len(sim.store)
0
>>> sim([("a", "a"), ("b", "b")])
[1.0, 1.0]
>>> len(sim.store)
2
>>> sim([("a", "a")])
[1.0]
>>> len(sim.store)
2
cache( similarity_func: AnySimFunc[V, S], conversion_func: Optional[cbrkit.typing.ConversionFunc[V, U]] = None)
similarity_func: cbrkit.typing.BatchSimFunc[V, S]
conversion_func: Optional[cbrkit.typing.ConversionFunc[V, U]]
store: MutableMapping[tuple[U, U], S]
@dataclass(slots=True)
class combine(cbrkit.typing.BatchSimFunc[V, float], typing.Generic[V, S]):
@dataclass(slots=True)
class combine[V, S: Float](BatchSimFunc[V, float]):
    """Combines multiple similarity functions into one.

    Args:
        sim_funcs: A sequence or a name-keyed mapping of similarity functions to be combined.
        aggregator: A function to aggregate the results from the similarity functions.

    Returns:
        A similarity function that combines the results from multiple similarity functions.

    Examples:
        >>> from cbrkit.sim.generic import equality, static
        >>> sim = combine([equality(), static(0.5)])
        >>> sim([("a", "a"), ("a", "b")])
        [0.75, 0.25]
    """

    sim_funcs: InitVar[Sequence[AnySimFunc[V, S]] | Mapping[str, AnySimFunc[V, S]]]
    aggregator: AggregatorFunc[str, S] = default_aggregator
    batch_sim_funcs: Sequence[BatchSimFunc[V, S]] | Mapping[str, BatchSimFunc[V, S]] = (
        field(init=False, repr=False)
    )

    def __post_init__(
        self, sim_funcs: Sequence[AnySimFunc[V, S]] | Mapping[str, AnySimFunc[V, S]]
    ):
        if isinstance(sim_funcs, Mapping):
            funcs_map = cast(Mapping[str, AnySimFunc[V, S]], sim_funcs)
            self.batch_sim_funcs = {
                key: batchify_sim(func) for key, func in funcs_map.items()
            }
        elif isinstance(sim_funcs, Sequence):
            self.batch_sim_funcs = [batchify_sim(func) for func in sim_funcs]
        else:
            raise ValueError(f"Invalid sim_funcs type: {type(sim_funcs)}")

    @override
    def __call__(self, batches: Sequence[tuple[V, V]]) -> Sequence[float]:
        if isinstance(self.batch_sim_funcs, Mapping):
            funcs_map = cast(Mapping[str, BatchSimFunc[V, S]], self.batch_sim_funcs)
            func_results = {
                func_key: func(batches) for func_key, func in funcs_map.items()
            }

            return [
                self.aggregator(
                    {
                        func_key: batch_results[batch_idx]
                        for func_key, batch_results in func_results.items()
                    }
                )
                for batch_idx in range(len(batches))
            ]

        elif isinstance(self.batch_sim_funcs, Sequence):
            func_results = [func(batches) for func in self.batch_sim_funcs]

            return [
                self.aggregator(
                    [batch_results[batch_idx] for batch_results in func_results]
                )
                for batch_idx in range(len(batches))
            ]

        raise ValueError(f"Invalid batch_sim_funcs type: {type(self.batch_sim_funcs)}")

Combines multiple similarity functions into one.

Arguments:
  • sim_funcs: A sequence or a name-keyed mapping of similarity functions to be combined.
  • aggregator: A function to aggregate the results from the similarity functions.
Returns:

A similarity function that combines the results from multiple similarity functions.

Examples:
>>> from cbrkit.sim.generic import equality, static
>>> sim = combine([equality(), static(0.5)])
>>> sim([("a", "a"), ("a", "b")])
[0.75, 0.25]
combine( sim_funcs: dataclasses.InitVar[Sequence[AnySimFunc[V, S]] | Mapping[str, AnySimFunc[V, S]]], aggregator: cbrkit.typing.AggregatorFunc[str, S] = aggregator(pooling='mean', pooling_weights=None, default_pooling_weight=1.0))
sim_funcs: dataclasses.InitVar[Sequence[AnySimFunc[V, S]] | Mapping[str, AnySimFunc[V, S]]]
aggregator: cbrkit.typing.AggregatorFunc[str, S]
batch_sim_funcs: Sequence[cbrkit.typing.BatchSimFunc[V, S]] | Mapping[str, cbrkit.typing.BatchSimFunc[V, S]]
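The documented result [0.75, 0.25] follows from running every function on the whole batch and then aggregating per pair. A minimal sketch of that sequence-based path, using only the standard library, might look as follows; combine_sketch and the two toy similarity functions are hypothetical stand-ins, with statistics.fmean playing the role of the default mean aggregator.

```python
from collections.abc import Callable, Sequence
from statistics import fmean


def batch_equality(pairs: Sequence[tuple[object, object]]) -> list[float]:
    return [1.0 if x == y else 0.0 for x, y in pairs]


def static_half(pairs: Sequence[tuple[object, object]]) -> list[float]:
    """Always returns 0.5, similar in spirit to a static similarity."""
    return [0.5 for _ in pairs]


# Hypothetical stand-in for the combination step documented above.
def combine_sketch(sim_funcs: Sequence[Callable], aggregate: Callable = fmean) -> Callable:
    def combined(pairs):
        # One result list per function, each covering the whole batch.
        per_func = [func(pairs) for func in sim_funcs]
        # Aggregate column-wise: all scores that belong to the same pair.
        return [
            aggregate([results[idx] for results in per_func])
            for idx in range(len(pairs))
        ]

    return combined


pairs = [("a", "a"), ("a", "b")]
mean_scores = combine_sketch([batch_equality, static_half])(pairs)
max_scores = combine_sketch([batch_equality, static_half], aggregate=max)(pairs)
print(mean_scores, max_scores)
```

Swapping the aggregation function changes how the individual scores are merged without touching the similarity functions themselves.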
table = <class 'dynamic_table'>
@dataclass(slots=True)
class dynamic_table(cbrkit.typing.BatchSimFunc[typing.Union[U, V], S], cbrkit.typing.HasMetadata, typing.Generic[K, U, V, S]):
@dataclass(slots=True)
class dynamic_table[K, U, V, S: Float](BatchSimFunc[U | V, S], HasMetadata):
    """Dispatches to similarity functions stored in a lookup table.

    Args:
        entries: Mapping from a key pair (a, b), or from a single key a (treated as (a, a)), to the similarity function used for that pair
        symmetric: If True, the table is assumed to be symmetric, i.e. sim(a, b) = sim(b, a)
        default: Default similarity function or value for pairs not in the table; if None, a missing pair raises a ValueError
        key_getter: A function that extracts the key for lookup from the input values

    Examples:
        >>> from cbrkit.helpers import identity
        >>> from cbrkit.sim.generic import static
        >>> sim = dynamic_table(
        ...     {
        ...         ("a", "b"): static(0.5),
        ...         ("b", "c"): static(0.7)
        ...     },
        ...     symmetric=True,
        ...     default=static(0.0),
        ...     key_getter=identity,
        ... )
        >>> sim([("b", "a"), ("a", "c")])
        [0.5, 0.0]
    """

    symmetric: bool
    default: BatchSimFunc[U, S] | None
    key_getter: Callable[[Any], K]
    table: dict[tuple[K, K], BatchSimFunc[V, S]]

    @property
    @override
    def metadata(self) -> JsonDict:
        """Return metadata describing the dynamic table configuration."""
        return {
            "symmetric": self.symmetric,
            "default": get_metadata(self.default),
            "key_getter": get_metadata(self.key_getter),
            "table": [
                {
                    "x": str(k[0]),
                    "y": str(k[1]),
                    "value": get_metadata(v),
                }
                for k, v in self.table.items()
            ],
        }

    def __init__(
        self,
        entries: Mapping[tuple[K, K], AnySimFunc[..., S]]
        | Mapping[K, AnySimFunc[..., S]],
        key_getter: Callable[[Any], K],
        default: AnySimFunc[U, S] | S | None = None,
        symmetric: bool = True,
    ):
        self.symmetric = symmetric
        self.key_getter = key_getter
        self.table = {}

        if default is None:
            self.default = None
        elif isinstance(default, Callable):
            self.default = batchify_sim(cast(AnySimFunc[U, S], default))
        else:
            self.default = batchify_sim(static(cast(S, default)))

        for key, val in entries.items():
            func = batchify_sim(val)

            if isinstance(key, tuple):
                x, y = cast(tuple[K, K], key)
            else:
                x = y = cast(K, key)

            self.table[(x, y)] = func

            if self.symmetric and x != y:
                self.table[(y, x)] = func

    @override
    def __call__(self, batches: Sequence[tuple[U | V, U | V]]) -> SimSeq[S]:
        # first we group the batches by key to avoid redundant computations
        idx_map: defaultdict[tuple[K, K] | None, list[int]] = defaultdict(list)

        for idx, (x, y) in enumerate(batches):
            key = (self.key_getter(x), self.key_getter(y))

            if key in self.table:
                idx_map[key].append(idx)
            else:
                idx_map[None].append(idx)

        # now we compute the similarities
        results: dict[int, S] = {}

        for key, idxs in idx_map.items():
            sim_func = cast(
                BatchSimFunc[U | V, S] | None,
                self.table.get(key) if key is not None else self.default,
            )

            if sim_func is None:
                missing_entries = [batches[idx] for idx in idxs]
                missing_keys = {
                    (self.key_getter(x), self.key_getter(y)) for x, y in missing_entries
                }

                raise ValueError(f"Pairs {missing_keys} not in the table")

            sims = sim_func([batches[idx] for idx in idxs])

            for idx, sim in zip(idxs, sims, strict=True):
                results[idx] = sim

        return [results[idx] for idx in range(len(batches))]

Dispatches to similarity functions stored in a lookup table.

Arguments:
  • entries: Mapping from a key pair (a, b), or from a single key a (treated as (a, a)), to the similarity function used for that pair
  • symmetric: If True, the table is assumed to be symmetric, i.e. sim(a, b) = sim(b, a)
  • default: Default similarity function or value for pairs not in the table; if None, a missing pair raises a ValueError
  • key_getter: A function that extracts the key for lookup from the input values
Examples:
>>> from cbrkit.helpers import identity
>>> from cbrkit.sim.generic import static
>>> sim = dynamic_table(
...     {
...         ("a", "b"): static(0.5),
...         ("b", "c"): static(0.7)
...     },
...     symmetric=True,
...     default=static(0.0),
...     key_getter=identity,
... )
>>> sim([("b", "a"), ("a", "c")])
[0.5, 0.0]
dynamic_table( entries: Mapping[tuple[K, K], AnySimFunc[..., S]] | Mapping[K, AnySimFunc[..., S]], key_getter: Callable[[typing.Any], K], default: Union[AnySimFunc[U, S], S, NoneType] = None, symmetric: bool = True)
symmetric: bool
default: Optional[cbrkit.typing.BatchSimFunc[U, S]]
key_getter: Callable[[typing.Any], K]
table: dict[tuple[K, K], cbrkit.typing.BatchSimFunc[V, S]]
def type_table( entries: Mapping[type[V], AnySimFunc[..., S]], default: Union[AnySimFunc[U, S], S, NoneType] = None) -> cbrkit.typing.BatchSimFunc[typing.Union[U, V], S]:
def type_table[U, V, S: Float](
    entries: Mapping[type[V], AnySimFunc[..., S]],
    default: AnySimFunc[U, S] | S | None = None,
) -> BatchSimFunc[U | V, S]:
    """Create a dynamic table that dispatches similarity functions by value type."""
    return dynamic_table(
        entries=cast(Mapping[type, AnySimFunc[..., S]], entries),
        key_getter=type,
        default=default,
        symmetric=False,
    )

Create a dynamic table that dispatches similarity functions by value type.

def attribute_table( entries: Mapping[K, AnySimFunc[..., S]], attribute: str, default: Union[AnySimFunc[U, S], S, NoneType] = None, value_getter: Callable[[typing.Any, str], K] = <function getitem_or_getattr>) -> cbrkit.typing.BatchSimFunc[typing.Any, S]:
def attribute_table[K, U, S: Float](
    entries: Mapping[K, AnySimFunc[..., S]],
    attribute: str,
    default: AnySimFunc[U, S] | S | None = None,
    value_getter: Callable[[Any, str], K] = getitem_or_getattr,
) -> BatchSimFunc[Any, S]:
    """Create a dynamic table that dispatches similarity functions by attribute value."""
    key_getter = attribute_table_key_getter(value_getter, attribute)

    return dynamic_table(
        entries=entries,
        key_getter=key_getter,
        default=default,
        symmetric=False,
    )

Create a dynamic table that dispatches similarity functions by attribute value.

@dataclass(slots=True, frozen=True)
class attribute_value(cbrkit.typing.BatchSimFunc[V, cbrkit.sim.attribute_value.AttributeValueSim[S]], typing.Generic[V, S]):
@dataclass(slots=True, frozen=True)
class attribute_value[V, S: Float](BatchSimFunc[V, AttributeValueSim[S]]):
    """Similarity function that computes the attribute value similarity between two cases.

    Args:
        attributes: A mapping of attribute names to the similarity functions to be used for those attributes.
        aggregator: A function that aggregates the local similarity scores for each attribute into a single global similarity.
        value_getter: A function that retrieves the value of an attribute from a case.
        default: The default similarity score to use when an error occurs during the computation of a similarity score.
            For example, if a case does not have an attribute that is required for the similarity computation.

    Examples:
        >>> equality = lambda x, y: 1.0 if x == y else 0.0
        >>> sim = attribute_value({
        ...     "name": equality,
        ...     "age": equality,
        ... })
        >>> scores = sim([
        ...     ({"name": "John", "age": 25}, {"name": "John", "age": 30}),
        ...     ({"name": "Jane", "age": 30}, {"name": "John", "age": 30}),
        ... ])
        >>> scores[0]
        AttributeValueSim(value=0.5, attributes={'name': 1.0, 'age': 0.0})
        >>> scores[1]
        AttributeValueSim(value=0.5, attributes={'name': 0.0, 'age': 1.0})
    """

    attributes: Mapping[str, AnySimFunc[Any, S]]
    aggregator: AggregatorFunc[str, S] = default_aggregator
    value_getter: Callable[[Any, str], Any] = getitem_or_getattr
    default: S | None = None

    @override
    def __call__(self, batches: Sequence[tuple[V, V]]) -> SimSeq[AttributeValueSim[S]]:
        if len(batches) == 0:
            return []

        local_sims: list[dict[str, S]] = [dict() for _ in range(len(batches))]

        for attr_name in self.attributes:
            logger.debug(f"Processing attribute {attr_name}")

            try:
                nonempty_pairs: dict[int, tuple[Any, Any]] = {}

                for idx, (x, y) in enumerate(batches):
                    x_val = self.value_getter(x, attr_name)
                    y_val = self.value_getter(y, attr_name)

                    if x_val is None or y_val is None:
                        if self.default is None:
                            raise ValueError(
                                f"Attribute '{attr_name}' has None value at index {idx}"
                            )
                        local_sims[idx][attr_name] = self.default
                    else:
                        nonempty_pairs[idx] = (x_val, y_val)

                if nonempty_pairs:
                    sim_func = batchify_sim(self.attributes[attr_name])

                    for i, sim in zip(
                        nonempty_pairs, sim_func(list(nonempty_pairs.values()))
                    ):
                        local_sims[i][attr_name] = sim

            except Exception as e:
                if self.default is not None:
                    for idx in range(len(batches)):
                        local_sims[idx][attr_name] = self.default
                else:
                    raise e

        return [AttributeValueSim(self.aggregator(sims), sims) for sims in local_sims]

Similarity function that computes the attribute value similarity between two cases.

Arguments:
  • attributes: A mapping of attribute names to the similarity functions to be used for those attributes.
  • aggregator: A function that aggregates the local similarity scores for each attribute into a single global similarity.
  • value_getter: A function that retrieves the value of an attribute from a case.
  • default: The default similarity score to use when an error occurs during the computation of a similarity score. For example, if a case does not have an attribute that is required for the similarity computation.
Examples:
>>> equality = lambda x, y: 1.0 if x == y else 0.0
>>> sim = attribute_value({
...     "name": equality,
...     "age": equality,
... })
>>> scores = sim([
...     ({"name": "John", "age": 25}, {"name": "John", "age": 30}),
...     ({"name": "Jane", "age": 30}, {"name": "John", "age": 30}),
... ])
>>> scores[0]
AttributeValueSim(value=0.5, attributes={'name': 1.0, 'age': 0.0})
>>> scores[1]
AttributeValueSim(value=0.5, attributes={'name': 0.0, 'age': 1.0})
attribute_value( attributes: Mapping[str, AnySimFunc[typing.Any, S]], aggregator: cbrkit.typing.AggregatorFunc[str, S] = aggregator(pooling='mean', pooling_weights=None, default_pooling_weight=1.0), value_getter: Callable[[typing.Any, str], typing.Any] = <function getitem_or_getattr>, default: Optional[S] = None)
attributes: Mapping[str, AnySimFunc[typing.Any, S]]
aggregator: cbrkit.typing.AggregatorFunc[str, S]
value_getter: Callable[[typing.Any, str], typing.Any]
default: Optional[S]
@dataclass(slots=True, frozen=True)
class aggregator(cbrkit.typing.AggregatorFunc[K, Float], typing.Generic[K]):
@dataclass(slots=True, frozen=True)
class aggregator[K](AggregatorFunc[K, Float]):
    """
    Aggregates local similarities to a global similarity using the specified pooling function.

    Args:
        pooling: The pooling function to use. It can be either a string representing the name of the pooling function or a custom pooling function (see `cbrkit.typing.PoolingFunc`).
        pooling_weights: The weights to apply to the similarities during pooling. It can be a sequence or a mapping. If None, every local similarity is weighted equally.
        default_pooling_weight: The default weight to use if a similarity key is not found in the pooling_weights mapping.

    Examples:
        >>> agg = aggregator("mean")
        >>> agg([0.5, 0.75, 1.0])
        0.75
        >>> agg = aggregator("mean", {1: 1, 2: 1, 3: 0})
        >>> agg({1: 1, 2: 1, 3: 1})
        1.0
        >>> agg = aggregator("mean", {1: 1, 2: 1, 3: 2})
        >>> agg({1: 1, 2: 1, 3: 1})
        1.0
    """

    pooling: PoolingName | PoolingFunc[float] = "mean"
    pooling_weights: SimMap[K, float] | SimSeq[float] | None = None
    default_pooling_weight: float = 1.0

    @override
    def __call__(self, similarities: SimMap[K, Float] | SimSeq[Float]) -> float:
        pooling_func = (
            pooling_funcs[cast(PoolingName, self.pooling)]
            if isinstance(self.pooling, str)
            else self.pooling
        )
        assert (self.pooling_weights is None) or (
            type(similarities) is type(self.pooling_weights)  # noqa: E721
        )

        pooling_factor = 1.0
        sims: Sequence[float]  # noqa: F821

        if isinstance(similarities, Mapping) and isinstance(
            self.pooling_weights, Mapping
        ):
            sim_map = cast(SimMap[K, Float], similarities)
            weight_map = cast(SimMap[K, float], self.pooling_weights)
            sims = [
                unpack_float(sim) * weight_map.get(key, self.default_pooling_weight)
                for key, sim in sim_map.items()
            ]
            pooling_factor = len(sim_map) / sum(
                weight_map.get(key, self.default_pooling_weight)
                for key in sim_map.keys()
            )
        elif isinstance(similarities, Sequence) and isinstance(
            self.pooling_weights, Sequence
        ):
            sims = [
                unpack_float(s) * w
                for s, w in zip(similarities, self.pooling_weights, strict=True)
            ]
            pooling_factor = len(similarities) / sum(self.pooling_weights)
        elif isinstance(similarities, Sequence) and self.pooling_weights is None:
            sim_seq = cast(SimSeq[Float], similarities)
            sims = [unpack_float(s) for s in sim_seq]
        elif isinstance(similarities, Mapping) and self.pooling_weights is None:
            sim_map = cast(SimMap[K, Float], similarities)
            sims = [unpack_float(s) for s in sim_map.values()]
        else:
            raise NotImplementedError()

        return pooling_func(sims) * pooling_factor

Aggregates local similarities to a global similarity using the specified pooling function.

Arguments:
  • pooling: The pooling function to use. It can be either a string representing the name of the pooling function or a custom pooling function (see cbrkit.typing.PoolingFunc).
  • pooling_weights: The weights to apply to the similarities during pooling. It can be a sequence or a mapping. If None, every local similarity is weighted equally.
  • default_pooling_weight: The default weight to use if a similarity key is not found in the pooling_weights mapping.
Examples:
>>> agg = aggregator("mean")
>>> agg([0.5, 0.75, 1.0])
0.75
>>> agg = aggregator("mean", {1: 1, 2: 1, 3: 0})
>>> agg({1: 1, 2: 1, 3: 1})
1.0
>>> agg = aggregator("mean", {1: 1, 2: 1, 3: 2})
>>> agg({1: 1, 2: 1, 3: 1})
1.0
aggregator( pooling: Union[PoolingName, cbrkit.typing.PoolingFunc[float]] = 'mean', pooling_weights: SimMap[K, float] | SimSeq[float] | None = None, default_pooling_weight: float = 1.0)
pooling: Union[PoolingName, cbrkit.typing.PoolingFunc[float]]
pooling_weights: SimMap[K, float] | SimSeq[float] | None
default_pooling_weight: float
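The weighting in the source above scales each similarity by its weight, pools the scaled values, and then multiplies by the number of similarities divided by the sum of weights. With mean pooling this reduces to the usual weighted mean. The following standard-library sketch of the mapping case uses the hypothetical name weighted_pool and example attribute names chosen only for illustration.

```python
from collections.abc import Callable, Mapping
from statistics import fmean


# Hypothetical stand-in for the mapping branch of the aggregation logic.
def weighted_pool(
    similarities: Mapping[str, float],
    weights: Mapping[str, float],
    default_weight: float = 1.0,
    pooling: Callable = fmean,
) -> float:
    # Scale every local similarity by its weight (or the default weight).
    scaled = [
        sim * weights.get(key, default_weight) for key, sim in similarities.items()
    ]
    # Rescale so that mean pooling yields sum(w_i * s_i) / sum(w_i).
    factor = len(similarities) / sum(
        weights.get(key, default_weight) for key in similarities
    )
    return pooling(scaled) * factor


local_sims = {"price": 0.5, "color": 1.0}
# "color" is absent from the weights and therefore gets the default weight 1.0.
score = weighted_pool(local_sims, {"price": 3.0})
expected = (0.5 * 3.0 + 1.0 * 1.0) / (3.0 + 1.0)
print(score, expected)
```

Note that the rescaling factor is applied regardless of the pooling function, so the weighted-mean interpretation holds for mean pooling specifically.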
type PoolingName = Literal['mean', 'fmean', 'geometric_mean', 'harmonic_mean', 'median', 'median_low', 'median_high', 'mode', 'min', 'max', 'sum']
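Most of these names coincide with functions in Python's statistics module, and the rest with the builtins min, max, and sum. Assuming the names carry those meanings (the mapping below is an illustration, not the contents of cbrkit.sim.pooling), the effect of each pooling strategy on the same local scores can be compared like this:

```python
import statistics

# Assumed correspondence between pooling names and standard-library functions.
pooling_by_name = {
    "mean": statistics.mean,
    "fmean": statistics.fmean,
    "geometric_mean": statistics.geometric_mean,
    "harmonic_mean": statistics.harmonic_mean,
    "median": statistics.median,
    "median_low": statistics.median_low,
    "median_high": statistics.median_high,
    "mode": statistics.mode,
    "min": min,
    "max": max,
    "sum": sum,
}

local_scores = [0.25, 0.5, 1.0]
pooled = {name: func(local_scores) for name, func in pooling_by_name.items()}

for name, value in pooled.items():
    print(f"{name}: {value:.3f}")
```

Strict strategies such as min let a single poor attribute dominate the global score, while max rewards the best-matching attribute.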
@dataclass(slots=True, frozen=True)
class AttributeValueSim(cbrkit.typing.StructuredValue[float], typing.Generic[S]):
@dataclass(slots=True, frozen=True)
class AttributeValueSim[S: Float](StructuredValue[float]):
    """Result of an attribute-value similarity computation with per-attribute scores."""

    attributes: Mapping[str, S]

Result of an attribute-value similarity computation with per-attribute scores.

AttributeValueSim(value: T, attributes: Mapping[str, S])
attributes: Mapping[str, S]