Multimodal Modules
We include strong baselines for multimodal fusion, as well as implementations of state-of-the-art papers.
Fusers
This module contains the implementation of basic fusion algorithms and fusion pipelines.
Unless otherwise stated, the fusers support an arbitrary number of input modalities and are geared towards sequential inputs.
A fusion pipeline generally consists of three stages (see the sketch after this list):
- Pre-fuse processing: Apply common operations to all input modalities (e.g. project them to a common dimension).
- Fuser: Fuse all modality representations into a single vector (e.g. concatenate all modality features using CatFuser).
- Timesteps Pooling: Aggregate the fused features over all timesteps into a single vector (e.g. sum all timesteps with SumPooler).
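The sketch below ties the three stages together using the ProjectFuseAggregate pipeline documented further down. The batch size, sequence length and feature sizes are illustrative assumptions, not values fixed by the API.

```python
import torch
from slp.modules.fuse import ProjectFuseAggregate

# Three MOSEI-like modality sequences: text (300), audio (74), visual (35) features
txt = torch.rand(8, 20, 300)    # [B, L, D_text]
au = torch.rand(8, 20, 74)      # [B, L, D_audio]
vi = torch.rand(8, 20, 35)      # [B, L, D_visual]
lengths = torch.full((8,), 20)  # unpadded sequence lengths

# Pre-fuse: project to 100 features, Fuser: concatenate, Pooler: sum over timesteps
pipeline = ProjectFuseAggregate(
    modality_sizes=[300, 74, 35],
    projection_size=100,
    projection_type="linear",
    fusion_method="cat",
    timesteps_pooling_method="sum",
)
fused = pipeline(txt, au, vi, lengths=lengths)  # [B, pipeline.out_size]
```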
SUPPORTED_FUSERS: Mapping[str, Type[slp.modules.fuse.BaseFuser]]
Currently implemented fusers
SUPPORTED_POOLERS: Mapping[str, Type[slp.modules.fuse.BaseTimestepsPooler]]
Supported poolers
AttentionFuser
__init__(self, feature_size, n_modalities, use_all_trimodal=False, residual=True, dropout=0.1, **kwargs)
special
Fuse all combinations of three modalities using attention modules.
If the input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
where f is a TwowayAttention module, g is an Attention module, and terms in [] are optional.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 3) | required |
| use_all_trimodal | bool | Use all optional trimodal combinations | False |
| residual | bool | Use residual connection in TwowayAttention. Defaults to True | True |
| dropout | float | Dropout probability | 0.1 |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
use_all_trimodal: bool = False,
residual: bool = True,
dropout: float = 0.1,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module using bilinear fusion
If input modalities are a, t, v, then the output is
Where f is TwowayAttention and g is Attention modules and values with [] are optional
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
use_all_trimodal (bool): Use all optional trimodal combinations
residual (bool): Use residual connection in TwowayAttention. Defaults to True
dropout (float): Dropout probability
"""
kwargs["dropout"] = dropout
kwargs["residual"] = residual
super(AttentionFuser, self).__init__(
feature_size,
n_modalities,
use_all_trimodal=use_all_trimodal,
**kwargs,
)
fuse(self, *mods, lengths=None)
Perform attention fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 3 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform attention fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D]
"""
txt, au, vi = mods
ta, at = self.ta(txt, au)
va, av = self.va(vi, au)
tv, vt = self.tv(txt, vi)
va = va + av
tv = vt + tv
ta = ta + at
tav, _ = self.tav(txt, queries=va)
out_list = [txt, au, vi, ta, tv, va, tav]
if self.use_all_trimodal:
vat, _ = self.vat(vi, queries=ta)
atv, _ = self.atv(au, queries=tv)
out_list = out_list + [vat, atv]
# B x L x 7*D or B x L x 9*D
fused = torch.cat(out_list, dim=-1)
return fused
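A rough usage sketch follows; the shapes mirror the docstring above, and the modality order (text, audio, visual) follows the unpacking in fuse. Sizes are illustrative assumptions.

```python
import torch
from slp.modules.fuse import AttentionFuser

fuser = AttentionFuser(feature_size=100, n_modalities=3, use_all_trimodal=False)

txt = torch.rand(8, 20, 100)  # [B, L, D]
au = torch.rand(8, 20, 100)   # [B, L, D]
vi = torch.rand(8, 20, 100)   # [B, L, D]

# Seven crossmodal terms are concatenated: t, a, v, f(t,a), f(v,a), f(t,v), g(t, f(v,a))
fused = fuser(txt, au, vi)    # [B, L, 7 * 100]
# With use_all_trimodal=True the two optional terms are appended -> [B, L, 9 * 100]
```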
BaseFuser
out_size: int
property
readonly
Output feature size.
Each fuser specifies its output feature size
__init__(self, feature_size, n_modalities, **extra_kwargs)
special
Base fuser class.
Our fusion methods are separated into direct and combinatorial. An example of direct fusion is concatenation, where the feature vectors of N modalities are concatenated into a fused vector. When performing combinatorial fusion, all crossmodal relations are examined (e.g. text -> audio, text -> visual, audio -> visual etc.). In the current implementation, combinatorial fusion is implemented for 3 input modalities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Assume all modality representations have the same feature_size | required |
| n_modalities | int | Number of input modalities | required |
| **extra_kwargs | dict | Extra keyword arguments to maintain interoperability of children classes | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
**extra_kwargs,
):
"""Base fuser class.
Our fusion methods are separated in direct and combinatorial.
An example for direct fusion is concatenation, where feature vectors of N modalities
are concatenated into a fused vector.
When performing combinatorial fusion all crossmodal relations are examined (e.g. text -> audio,
text -> visual, audio -> visual etc.)
In the current implementation, combinatorial fusion is implemented for 3 input modalities
Args:
feature_size (int): Assume all modality representations have the same feature_size
n_modalities (int): Number of input modalities
**extra_kwargs (dict): Extra keyword arguments to maintain interoperability of children
classes
"""
super(BaseFuser, self).__init__()
self.feature_size = feature_size
self.n_modalities = n_modalities
forward(self, *mods, lengths=None)
Fuse the modality representations
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor [B, L, self.out_size] |
Source code in slp/modules/fuse.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Fuse the modality representations
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, L, self.out_size]
"""
fused = self.fuse(*mods, lengths=lengths)
return fused
fuse(self, *mods, lengths=None)
Abstract method to fuse the modality representations
Children classes should implement this method
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor |
Source code in slp/modules/fuse.py
@abstractmethod
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Abstract method to fuse the modality representations
Children classes should implement this method
Args:
*mods: List of modality tensors
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor
"""
pass
BaseFusionPipeline
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, *args, **kwargs)
special
Base class for a fusion pipeline
Inherit this class to implement a fusion pipeline
Source code in slp/modules/fuse.py
def __init__(self, *args, **kwargs):
"""Base class for a fusion pipeline
Inherit this class to implement a fusion pipeline
"""
super(BaseFusionPipeline, self).__init__()
BaseTimestepsPooler
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, batch_first=True, **kwargs)
special
Abstract base class for Timesteps Poolers
Timesteps Poolers aggregate the features for different timesteps
Given a tensor with dimensions [BatchSize, Length, Dim] they return an aggregated tensor with dimensions [BatchSize, Dim]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Feature dimension | required |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| **kwargs | | Variable keyword arguments for subclasses | {} |
Source code in slp/modules/fuse.py
def __init__(self, feature_size: int, batch_first: bool = True, **kwargs):
"""Abstract base class for Timesteps Poolers
Timesteps Poolers aggregate the features for different timesteps
Given a tensor with dimensions [BatchSize, Length, Dim]
they return an aggregated tensor with dimensions [BatchSize, Dim]
Args:
feature_size (int): Feature dimension
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
**kwargs: Variable keyword arguments for subclasses
"""
super(BaseTimestepsPooler, self).__init__()
self.pooling_dim = 0 if not batch_first else 1
self.feature_size = feature_size
forward(self, x, lengths=None)
Pool features of input tensor across timesteps
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | [B, L, D] Input sequence | required |
| lengths | Optional[torch.Tensor] | Optional unpadded sequence lengths for input tensor | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: [B, D] Output aggregated features across timesteps |
Source code in slp/modules/fuse.py
def forward(
self, x: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Pool features of input tensor across timesteps
Args:
x (torch.Tensor): [B, L, D] Input sequence
lengths (Optional[torch.Tensor]): Optional unpadded sequence lengths for input tensor
Returns:
torch.Tensor: [B, D] Output aggregated features across timesteps
"""
if x.ndim == 2:
return x
if x.ndim != 3:
raise ValueError("Expected 3 dimensional tensor [B, L, D] or [L, B, D]")
return self._pool(x, lengths=lengths)
BilinearFuser
__init__(self, feature_size, n_modalities, use_all_trimodal=False, **kwargs)
special
Fuse all combinations of three modalities using bilinear fusion.
If the input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
where f and g are nn.Bilinear modules and terms in [] are optional.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 3) | required |
| use_all_trimodal | bool | Use all optional trimodal combinations | False |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
use_all_trimodal: bool = False,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module using bilinear fusion
If input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
Where f and g are the nn.Bilinear function and values with [] are optional
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
use_all_trimodal (bool): Use all optional trimodal combinations
"""
super(BilinearFuser, self).__init__(
feature_size,
n_modalities,
use_all_trimodal=use_all_trimodal,
**kwargs,
)
fuse(self, *mods, lengths=None)
Perform bilinear fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 3 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform bilinear fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D]
"""
txt, au, vi = mods
ta = self.ta(txt, au)
va = self.va(vi, au)
tv = self.tv(txt, vi)
tav = self.tav(txt, va)
out_list = [txt, au, vi, ta, tv, va, tav]
if self.use_all_trimodal:
vat = self.vat(vi, ta)
atv = self.atv(au, tv)
out_list = out_list + [vat, atv]
# B x L x 7*D or B x L x 9*D
fused = torch.cat(out_list, dim=-1)
return fused
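A usage sketch, analogous to AttentionFuser but with nn.Bilinear interaction modules; the feature size and shapes are illustrative assumptions.

```python
import torch
from slp.modules.fuse import BilinearFuser

fuser = BilinearFuser(feature_size=64, n_modalities=3, use_all_trimodal=True)
txt, au, vi = (torch.rand(4, 10, 64) for _ in range(3))  # [B, L, D] each
fused = fuser(txt, au, vi)  # [B, L, 9 * 64] since all trimodal terms are used
```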
BimodalAttentionFuser
fuse(self, *mods, lengths=None)
Perform attention fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 2 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 3*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform attention fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 3*D]
"""
x, y = mods
xy, yx = self.xy(x, y)
xy = xy + yx
# B x L x 3*D
fused = torch.cat([x, y, xy], dim=-1)
return fused
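A minimal two-modality sketch; shapes are assumptions, and the constructor signature is assumed to mirror the other fusers (feature_size, n_modalities), as make_fuser below suggests.

```python
import torch
from slp.modules.fuse import BimodalAttentionFuser

fuser = BimodalAttentionFuser(feature_size=64, n_modalities=2)
x = torch.rand(4, 10, 64)  # [B, L, D]
y = torch.rand(4, 10, 64)  # [B, L, D]
fused = fuser(x, y)        # [B, L, 3 * 64]: x || y || symmetric cross-attention term
```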
BimodalBilinearFuser
fuse(self, *mods, lengths=None)
Perform bilinear fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 2 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 3*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform bilinear fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 3*D]
"""
x, y = mods
xy = self.xy(x, y)
# B x L x 3*D
fused = torch.cat([x, y, xy], dim=-1)
return fused
BimodalCombinatorialFuser
out_size: int
property
readonly
Fused vector feature dimension
Returns:
| Type | Description |
|---|---|
| int | int: 3 * feature_size |
__init__(self, feature_size, n_modalities, **kwargs)
special
Fuse two modalities using a base module.
If the input modalities are x, y, then the output is o = x || y || f(x, y),
where f is a network module (e.g. attention).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 2) | required |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module
If input modalities are x, y, then the output is
o = x || y || f(x, y)
Where f is a network module (e.g. attention)
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
"""
super(BimodalCombinatorialFuser, self).__init__(
feature_size, n_modalities, **kwargs
)
self._check_n_modalities(n=2)
self.xy = self._bimodal_fusion_module(feature_size, **kwargs)
CatFuser
Fuse by concatenating modality representations
o = m1 || m2 || m3 ...
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Assume all modality representations have the same feature_size | required |
| n_modalities | int | Number of input modalities | required |
| **extra_kwargs | dict | Extra keyword arguments to maintain interoperability of children classes | required |
out_size: int
property
readonly
d_out = n_modalities * d_in
Returns:
| Type | Description |
|---|---|
| int | int: output feature size |
fuse(self, *mods, lengths=None)
Concatenate input tensors into a single tensor
Examples:
fuser = CatFuser(5, 2)
x = torch.rand(16, 6, 5)  # (B, L, D)
y = torch.rand(16, 6, 5)  # (B, L, D)
out = fuser(x, y)  # (B, L, 2 * D)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable number of input tensors | () |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Concatenated input tensors |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Concatenate input tensors into a single tensor
Example:
fuser = CatFuser(5, 2)
x = torch.rand(16, 6, 5) # (B, L, D)
y = torch.rand(16, 6, 5) # (B, L, D)
out = fuser(x, y) # (B, L, 2 * D)
Args:
*mods: Variable number of input tensors
Returns:
torch.Tensor: Concatenated input tensors
"""
return torch.cat(mods, dim=-1)
Conv1dProjection
__init__(self, modality_sizes, projection_size, kernel_size=1, padding=0, bias=False)
special
Project features for N modalities using 1D convolutions
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of number of features for each modality. E.g. for MOSEI: [300, 74, 35] | required |
| projection_size | int | Output number of features for each modality | required |
| kernel_size | int | Convolution kernel size | 1 |
| padding | int | Convolution amount of padding | 0 |
| bias | bool | Use bias in convolutional layers | False |
Source code in slp/modules/fuse.py
def __init__(
self,
modality_sizes: List[int],
projection_size: int,
kernel_size: int = 1,
padding: int = 0,
bias: bool = False,
):
"""Project features for N modalities using 1D convolutions
Args:
modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
[300, 74, 35]
projection_size (int): Output number of features for each modality
kernel_size (int): Convolution kernel size
padding (int): Convolution amount of padding
bias (bool): Use bias in convolutional layers
"""
super(Conv1dProjection, self).__init__()
self.p = nn.ModuleList(
[
nn.Conv1d(
sz,
projection_size,
kernel_size=kernel_size,
padding=padding,
bias=bias,
)
for sz in modality_sizes
]
)
forward(self, *mods)
Project modality representations to a given number of features using Conv1d layers
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
c_proj = Conv1dProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = c_proj(text, audio, visual)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensors list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length projected tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Project modality representations to a given number of features using Conv1d layers
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
c_proj = Conv1dProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = c_proj(text, audio, visual)
Args:
*mods: Variable length tensors list
Returns:
List[torch.Tensor]: Variable length projected tensors list
"""
mods_o: List[torch.Tensor] = [
self.p[i](m.transpose(1, 2)).transpose(1, 2) for i, m in enumerate(mods)
]
return mods_o
FuseAggregateTimesteps
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, n_modalities, output_size=None, fusion_method='cat', timesteps_pooling_method='sum', batch_first=True, **fuser_kwargs)
special
Fuse input feature sequences and aggregate across timesteps
Fuser -> TimestepsPooler
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | The input modality representations dimension | required |
| n_modalities | int | Number of input modalities | required |
| output_size | Optional[int] | Required output size. If not provided, output_size = fuser.out_size | None |
| fusion_method | str | Select which fuser to use [cat\|sum\|attention\|bilinear] | 'cat' |
| timesteps_pooling_method | str | TimestepsPooler method [cat\|sum\|rnn] | 'sum' |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| **fuser_kwargs | dict | Extra keyword arguments to instantiate fuser | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
output_size: Optional[int] = None,
fusion_method: str = "cat",
timesteps_pooling_method: str = "sum",
batch_first: bool = True,
**fuser_kwargs,
):
"""Fuse input feature sequences and aggregate across timesteps
Fuser -> TimestepsPooler
Args:
feature_size (int): The input modality representations dimension
n_modalities (int): Number of input modalities
output_size (Optional[int]): Required output size. If not provided,
output_size = fuser.out_size
fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
**fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
"""
super(FuseAggregateTimesteps, self).__init__(
feature_size, n_modalities, fusion_method=fusion_method
)
self.fuser = make_fuser(
fusion_method, feature_size, n_modalities, **fuser_kwargs
)
output_size = ( # bidirectional rnn. fused_size / 2 results to fused_size outputs
output_size if output_size is not None else self.fuser.out_size // 2
)
self.timesteps_pooler = TimestepsPooler(
self.fuser.out_size,
hidden_size=output_size,
mode=timesteps_pooling_method,
batch_first=batch_first,
)
forward(self, *mods, lengths=None)
Fuse the modality representations and aggregate across timesteps
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor [B, self.out_size] |
Source code in slp/modules/fuse.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Fuse the modality representations and aggregate across timesteps
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, self.out_size]
"""
fused = self.fuser(*mods, lengths=lengths)
out: torch.Tensor = self.timesteps_pooler(fused, lengths=lengths)
return out
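A sketch of the Fuser -> TimestepsPooler chain; all sizes below are illustrative assumptions.

```python
import torch
from slp.modules.fuse import FuseAggregateTimesteps

# Three already-projected modality sequences with a common feature size of 100
mods = [torch.rand(8, 20, 100) for _ in range(3)]
lengths = torch.full((8,), 20)

fa = FuseAggregateTimesteps(
    feature_size=100,
    n_modalities=3,
    fusion_method="cat",             # fused sequence: [B, L, 3 * 100]
    timesteps_pooling_method="rnn",  # RnnPooler aggregates over L
)
out = fa(*mods, lengths=lengths)     # [B, fa.out_size]
```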
LinearProjection
__init__(self, modality_sizes, projection_size, bias=True)
special
Project features for N modalities using feedforward layers
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of number of features for each modality. E.g. for MOSEI: [300, 74, 35] | required |
| projection_size | int | Output number of features for each modality | required |
| bias | bool | Use bias in feedforward layers | True |
Source code in slp/modules/fuse.py
def __init__(
self, modality_sizes: List[int], projection_size: int, bias: bool = True
):
"""Project features for N modalities using feedforward layers
Args:
modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
[300, 74, 35]
bias (bool): Use bias in feedforward layers
"""
super(LinearProjection, self).__init__()
self.p = nn.ModuleList(
[nn.Linear(sz, projection_size, bias=bias) for sz in modality_sizes]
)
forward(self, *mods)
Project modality representations to a given number of features using Linear layers
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
l_proj = LinearProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensor list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length projected tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Project modality representations to a given number of features using Linear layers
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
l_proj = LinearProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Args:
*mods: Variable length tensor list
Returns:
List[torch.Tensor]: Variable length projected tensors list
"""
mods_o: List[torch.Tensor] = [self.p[i](m) for i, m in enumerate(mods)]
return mods_o
ModalityProjection
__init__(self, modality_sizes, projection_size, kernel_size=1, padding=0, bias=True, mode=None)
special
Adapter module to project features for N modalities using 1D convolutions or feedforward
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of number of features for each modality. E.g. for MOSEI: [300, 74, 35] | required |
| projection_size | int | Output number of features for each modality | required |
| kernel_size | int | Convolution kernel size. Used when mode=="conv" | 1 |
| padding | int | Convolution amount of padding. Used when mode=="conv" | 0 |
| bias | bool | Use bias | True |
| mode | Optional[str] | Projection method. linear -> LinearProjection, conv\|conv1d\|convolutional -> Conv1dProjection | None |
Source code in slp/modules/fuse.py
def __init__(
self,
modality_sizes: List[int],
projection_size: int,
kernel_size: int = 1,
padding: int = 0,
bias: bool = True,
mode: Optional[str] = None,
):
"""Adapter module to project features for N modalities using 1D convolutions or feedforward
Args:
modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
[300, 74, 35]
projection_size (int): Output number of features for each modality
kernel_size (int): Convolution kernel size. Used when mode=="conv"
padding (int): Convolution amount of padding. Used when mode=="conv"
bias (bool): Use bias
mode (Optional[str]): Projection method.
linear -> LinearProjection
conv|conv1d|convolutional -> Conv1dProjection
"""
super(ModalityProjection, self).__init__()
if mode is None:
self.p: Optional[Union[LinearProjection, Conv1dProjection]] = None
elif mode == "linear":
self.p = LinearProjection(modality_sizes, projection_size, bias=bias)
elif mode == "conv" or mode == "conv1d" or mode == "convolutional":
self.p = Conv1dProjection(
modality_sizes,
projection_size,
kernel_size=kernel_size,
padding=padding,
bias=bias,
)
else:
raise ValueError(
"Supported mode=[linear|conv|conv1d|convolutional]."
"conv, conv1d and convolutional are equivalent."
)
forward(self, *mods)
Project modality representations to a given number of features
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
l_proj = ModalityProjection([300, 74, 35], 100, mode="linear")
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 300)
audio_p: (B, L, 74)
visual_p: (B, L, 35)
l_proj = ModalityProjection([300, 74, 35], 100, mode=None)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensor list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length projected tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Project modality representations to a given number of features
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
l_proj = ModalityProjection([300, 74, 35], 100, mode="linear")
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 300)
# audio_p: (B, L, 74)
# visual_p: (B, L, 35)
l_proj = ModalityProjection([300, 74, 35], 100, mode=None)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Args:
*mods: Variable length tensor list
Returns:
List[torch.Tensor]: Variable length projected tensors list
"""
if self.p is None:
return list(mods)
mods_o: List[torch.Tensor] = self.p(*mods)
return mods_o
ModalityWeights
__init__(self, feature_size)
special
Multiply each modality features with a learnable weight
i: modality index
learnable_weight[i] = softmax(Linear(modality_features[i]))
output_modality[i] = learnable_weight[i] * modality_features[i]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | All modalities are assumed to be projected into a space with the same number of features. | required |
Source code in slp/modules/fuse.py
def __init__(self, feature_size: int):
"""Multiply each modality features with a learnable weight
i: modality index
learnable_weight[i] = softmax(Linear(modality_features[i]))
output_modality[i] = learnable_weight * modality_features[i]
Args:
feature_size (int): All modalities are assumed to be projected into a space with the same
number of features.
"""
super(ModalityWeights, self).__init__()
self.mod_w = nn.Linear(feature_size, 1)
forward(self, *mods)
Use learnable weights to multiply modality features
Examples:
Inputs:
text: (B, L, 100)
audio: (B, L, 100)
visual: (B, L, 100)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
mw = ModalityWeights(100)
text_w, audio_w, visual_w = mw(text, audio, visual)
The operation is summarized as:
w_x = softmax(W * x + b)
w_y = softmax(W * y + b)
x_out = w_x * x
y_out = w_y * y
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensor list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length reweighted tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Use learnable weights to multiply modality features
Example:
# Inputs:
# text: (B, L, 100)
# audio: (B, L, 100)
# visual: (B, L, 100)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
mw = ModalityWeights(100)
text_w, audio_w, visual_w = mw(text, audio, visual)
The operation is summarized as:
w_x = softmax(W * x + b)
w_y = softmax(W * y + b)
x_out = w_x * x
y_out = w_y * y
Args:
*mods: Variable length tensor list
Returns:
List[torch.Tensor]: Variable length reweighted tensors list
"""
weight = self.mod_w(torch.cat([x.unsqueeze(1) for x in mods], dim=1))
weight = F.softmax(weight, dim=1)
mods_o: List[torch.Tensor] = [m * weight[:, i, ...] for i, m in enumerate(mods)]
return mods_o
ProjectFuseAggregate
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, modality_sizes, projection_size, projection_type=None, fusion_method='cat', timesteps_pooling_method='sum', modality_weights=False, batch_first=True, **fuser_kwargs)
special
Project input feature sequences, fuse and aggregate across timesteps
ModalityProjection -> Optional[ModalityWeights] -> Fuser -> TimestepsPooler
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of input modality representations dimensions | required |
| projection_size | int | Project all modalities to have this feature size | required |
| projection_type | Optional[str] | Optional projection method [linear\|conv] | None |
| fusion_method | str | Select which fuser to use [cat\|sum\|attention\|bilinear] | 'cat' |
| timesteps_pooling_method | str | TimestepsPooler method [cat\|sum\|rnn] | 'sum' |
| modality_weights | bool | Multiply projected modality representations with learnable weights. Default value is False. | False |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| **fuser_kwargs | dict | Extra keyword arguments to instantiate fuser | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
modality_sizes: List[int],
projection_size: int,
projection_type: Optional[str] = None,
fusion_method="cat",
timesteps_pooling_method="sum",
modality_weights: bool = False,
batch_first: bool = True,
**fuser_kwargs,
):
"""Project input feature sequences, fuse and aggregate across timesteps
ModalityProjection -> Optional[ModalityWeights] -> Fuser -> TimestepsPooler
Args:
modality_sizes (List[int]): List of input modality representations dimensions
projection_size (int): Project all modalities to have this feature size
projection_type (Optional[str]): Optional projection method [linear|conv]
fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
modality_weights (bool): Multiply projected modality representations with learnable
weights. Default value is False.
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
**fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
"""
super(ProjectFuseAggregate, self).__init__()
n_modalities = len(modality_sizes)
self.projection = None
self.modality_weights = None
if projection_type is not None:
self.projection = ModalityProjection(
modality_sizes, projection_size, mode=projection_type
)
if modality_weights:
self.modality_weights = ModalityWeights(projection_size)
fuser_kwargs["output_size"] = projection_size
fuser_kwargs["fusion_method"] = fusion_method
fuser_kwargs["timesteps_pooling_method"] = timesteps_pooling_method
fuser_kwargs["batch_first"] = batch_first
if "n_modalities" in fuser_kwargs:
del fuser_kwargs["n_modalities"]
if "projection_size" in fuser_kwargs:
del fuser_kwargs["projection_size"]
self.fuse_aggregate = FuseAggregateTimesteps(
projection_size,
n_modalities,
**fuser_kwargs,
)
forward(self, *mods, lengths=None)
Project modality representations to a common dimension, fuse and aggregate across timesteps
Optionally use modality weights
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor [B, self.out_size] |
Source code in slp/modules/fuse.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Project modality representations to a common dimension, fuse and aggregate across timesteps
Optionally use modality weights
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, self.out_size]
"""
if self.projection is not None:
mods = self.projection(*mods)
if self.modality_weights is not None:
mods = self.modality_weights(*mods)
fused: torch.Tensor = self.fuse_aggregate(*mods, lengths=lengths)
return fused
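Another configuration sketch, this time exercising convolutional projection, learnable modality weights and attention fusion. All sizes and the forwarded keyword arguments are assumptions for illustration.

```python
import torch
from slp.modules.fuse import ProjectFuseAggregate

pfa = ProjectFuseAggregate(
    modality_sizes=[300, 74, 35],
    projection_size=100,
    projection_type="conv",           # Conv1dProjection
    modality_weights=True,            # learnable per-modality scaling after projection
    fusion_method="attention",        # AttentionFuser on the projected sequences
    timesteps_pooling_method="rnn",
    residual=True,                    # forwarded to the attention fuser via **fuser_kwargs
)
txt, au, vi = torch.rand(8, 20, 300), torch.rand(8, 20, 74), torch.rand(8, 20, 35)
out = pfa(txt, au, vi, lengths=torch.full((8,), 20))  # [B, pfa.out_size]
```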
RnnPooler
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, hidden_size=None, batch_first=True, bidirectional=True, merge_bi='cat', attention=True, **kwargs)
special
Aggregate features of the input tensor using an AttentiveRNN
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Feature dimension | required |
| hidden_size | Optional[int] | Hidden dimension | None |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| bidirectional | bool | Use bidirectional RNN. Defaults to True | True |
| merge_bi | str | How bidirectional states are merged. Defaults to "cat" | 'cat' |
| attention | bool | Use attention for the RNN output states | True |
| **kwargs | | Variable keyword arguments | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
hidden_size: Optional[int] = None,
batch_first: bool = True,
bidirectional: bool = True,
merge_bi: str = "cat",
attention: bool = True,
**kwargs,
):
"""Aggregate features of the input tensor using an AttentiveRNN
Args:
feature_size (int): Feature dimension
hidden_size (int): Hidden dimension
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
bidirectional (bool): Use bidirectional RNN. Defaults to True
merge_bi (str): How bidirectional states are merged. Defaults to "cat"
attention (bool): Use attention for the RNN output states
**kwargs: Variable keyword arguments
"""
super(RnnPooler, self).__init__(feature_size, batch_first=batch_first, **kwargs)
self.hidden_size = hidden_size if hidden_size is not None else feature_size
self.rnn = AttentiveRNN(
feature_size,
hidden_size=self.hidden_size,
batch_first=batch_first,
bidirectional=bidirectional,
merge_bi=merge_bi,
attention=attention,
return_hidden=False, # We want to aggregate all hidden states.
)
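A pooling sketch; the hidden size and input shapes are illustrative assumptions.

```python
import torch
from slp.modules.fuse import RnnPooler

pooler = RnnPooler(feature_size=300, hidden_size=100)
x = torch.rand(8, 20, 300)           # [B, L, D] fused sequence
lengths = torch.full((8,), 20)
pooled = pooler(x, lengths=lengths)  # [B, pooler.out_size]
```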
SumFuser
Fuse by adding modality representations
o = m1 + m2 + m3 ...
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Assume all modality representations have the same feature_size | required |
| n_modalities | int | Number of input modalities | required |
| **extra_kwargs | dict | Extra keyword arguments to maintain interoperability of children classes | required |
out_size: int
property
readonly
d_out = d_in
Returns:
| Type | Description |
|---|---|
| int | int: output feature size |
fuse(self, *mods, lengths=None)
Sum input tensors into a single tensor
Examples:
fuser = SumFuser(5, 2)
x = torch.rand(16, 6, 5)  # (B, L, D)
y = torch.rand(16, 6, 5)  # (B, L, D)
out = fuser(x, y)  # (B, L, D)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable number of input tensors | () |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Summed input tensors |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Sum input tensors into a single tensor
Example:
fuser = SumFuser(5, 2)
x = torch.rand(16, 6, 5) # (B, L, D)
y = torch.rand(16, 6, 5) # (B, L, D)
out = fuser(x, y) # (B, L, D)
Args:
*mods: Variable number of input tensors
Returns:
torch.Tensor: Summed input tensors
"""
return torch.cat([m.unsqueeze(-1) for m in mods], dim=-1).sum(-1)
TimestepsPooler
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, mode='sum', batch_first=True, **kwargs)
special
Aggregate features from all timesteps into a single representation.
Four methods are supported:
sum: Sum features from all timesteps
mean: Average features from all timesteps
max: Max pool features from all timesteps
rnn: Use the output of an attentive RNN
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | The number of features for the input fused representations | required |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| mode | str | The timestep pooling method. sum: Sum hidden states, mean: Average hidden states, max: Max pool features from all hidden states, rnn: Use the output of an Attentive RNN | 'sum' |
Source code in slp/modules/fuse.py
def __init__(
self, feature_size: int, mode: str = "sum", batch_first=True, **kwargs
):
"""Aggregate features from all timesteps into a single representation.
Four methods supported:
sum: Sum features from all timesteps
mean: Average features from all timesteps
max: Max pool features from all timesteps
rnn: Use the output from an attentive RNN
Args:
feature_size (int): The number of features for the input fused representations
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
mode (str): The timestep pooling method
sum: Sum hidden states
mean: Average hidden states
max: Max pool features from all hidden states
rnn: Use the output of an Attentive RNN
"""
super(TimestepsPooler, self).__init__(
feature_size, batch_first=batch_first, **kwargs
)
assert (
mode is None or mode in SUPPORTED_POOLERS
), f"Unsupported timestep pooling method. Available methods: {SUPPORTED_POOLERS.keys()}"
self.pooler = None
if mode is not None:
self.pooler = SUPPORTED_POOLERS[mode](
feature_size, batch_first=batch_first, **kwargs
)
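A sketch comparing two pooling modes; shapes and hidden size are assumptions.

```python
import torch
from slp.modules.fuse import TimestepsPooler

x = torch.rand(8, 20, 300)  # [B, L, D]
lengths = torch.full((8,), 20)

sum_pool = TimestepsPooler(300, mode="sum")
rnn_pool = TimestepsPooler(300, mode="rnn", hidden_size=150)

a = sum_pool(x, lengths=lengths)  # [B, 300], timesteps summed
b = rnn_pool(x, lengths=lengths)  # [B, rnn_pool.out_size], attentive RNN over timesteps
```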
TrimodalCombinatorialFuser
out_size: int
property
readonly
Fused vector feature dimension
Returns:
| Type | Description |
|---|---|
| int | int: 7 * feature_size if use_all_trimodal==False else 9 * feature_size |
__init__(self, feature_size, n_modalities, use_all_trimodal=False, **kwargs)
special
Fuse all combinations of three modalities using a base module.
If the input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
where f and g are network modules (e.g. attention) and terms in [] are optional.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 3) | required |
| use_all_trimodal | bool | Use all optional trimodal combinations | False |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
use_all_trimodal: bool = False,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module
If input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
Where f and g network modules (e.g. attention) and values with [] are optional
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
use_all_trimodal (bool): Use all optional trimodal combinations
"""
super(TrimodalCombinatorialFuser, self).__init__(
feature_size, n_modalities, **kwargs
)
self._check_n_modalities(n=3)
self.use_all_trimodal = use_all_trimodal
self.ta = self._bimodal_fusion_module(feature_size, **kwargs)
self.va = self._bimodal_fusion_module(feature_size, **kwargs)
self.tv = self._bimodal_fusion_module(feature_size, **kwargs)
self.tav = self._trimodal_fusion_module(feature_size, **kwargs)
if use_all_trimodal:
self.vat = self._trimodal_fusion_module(feature_size, **kwargs)
self.atv = self._trimodal_fusion_module(feature_size, **kwargs)
make_fuser(fusion_method, feature_size, n_modalities, **kwargs)
Helper function to instantiate a fuser given a string fusion_method parameter
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fusion_method | str | One of the supported fusion methods [cat\|add\|bilinear\|attention] | required |
| feature_size | int | The input modality representations dimension | required |
| n_modalities | int | Number of input modalities | required |
| **kwargs | | Variable keyword arguments to pass to the instantiated fuser | {} |
Source code in slp/modules/fuse.py
def make_fuser(fusion_method: str, feature_size: int, n_modalities: int, **kwargs):
"""Helper function to instantiate a fuser given a string fusion_method parameter
Args:
fusion_method (str): One of the supported fusion methods [cat|add|bilinear|attention]
feature_size (int): The input modality representations dimension
n_modalities (int): Number of input modalities
**kwargs: Variable keyword arguments to pass to the instantiated fuser
"""
if fusion_method not in SUPPORTED_FUSERS.keys():
raise NotImplementedError(
f"The supported fusers are {SUPPORTED_FUSERS.keys()}. You provided {fusion_method}"
)
if fusion_method == "bilinear":
if n_modalities == 2:
return BimodalBilinearFuser(feature_size, n_modalities, **kwargs)
elif n_modalities == 3:
return BilinearFuser(feature_size, n_modalities, **kwargs)
else:
raise ValueError("bilinear implemented for 2 or 3 modalities")
if fusion_method == "attention":
if n_modalities == 2:
return BimodalAttentionFuser(feature_size, n_modalities, **kwargs)
elif n_modalities == 3:
return AttentionFuser(feature_size, n_modalities, **kwargs)
else:
raise ValueError("attention implemented for 2 or 3 modalities")
return SUPPORTED_FUSERS[fusion_method](feature_size, n_modalities, **kwargs)
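A dispatch sketch showing how the string argument selects the fuser class; the feature size and modality counts are illustrative.

```python
from slp.modules.fuse import make_fuser

cat_fuser = make_fuser("cat", feature_size=100, n_modalities=3)        # CatFuser
att_fuser = make_fuser("attention", feature_size=100, n_modalities=3)  # AttentionFuser
bi_fuser = make_fuser("bilinear", feature_size=100, n_modalities=2)    # BimodalBilinearFuser
```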
Multimodal encoders
These modules implement mid and late fusion. A multimodal encoder generally consists of:
- N Unimodal encoders (e.g. RNNs), where N is the number of input modalities
- A fusion pipeline
We furthermore implement multimodal classifiers, which consist of a multimodal encoder followed by an nn.Linear layer.
A special mention goes to our MultimodalBaseline. This baseline consists of RNN encoders followed by an attention fuser and an RNN timesteps pooler, and is tuned on CMU-MOSEI. The default configuration is provided through static methods and achieves strong performance.
AudioEncoder
Alias for Unimodal Encoder
AudioTextClassifier
forward(self, mod_dict, lengths)
Encode and fuse the text and audio modalities, then classify the fused representation.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
AudioVisualClassifier
forward(self, mod_dict, lengths)
Encode and fuse the visual and audio modalities, then classify the fused representation.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["visual"], mod_dict["audio"]]
fused = self.enc(*mods, lengths=lengths["visual"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
BaseEncoder
out_size: int
property
readonly
An encoder returns its output size
Returns:
| Type | Description |
|---|---|
| int | int: The output feature size of the encoder |
__init__(self, *args, **kwargs)
special
Base class implementing a multimodal encoder
A BaseEncoder child encodes and fuses the modality features and returns representations ready to be provided to a classification layer
Source code in slp/modules/multimodal.py
def __init__(self, *args, **kwargs):
"""Base class implementing a multimodal encoder
A BaseEncoder child encodes and fuses the modality features
and returns representations ready to be provided to a classification layer
"""
super(BaseEncoder, self).__init__()
self.args = args
self.kwargs = kwargs
self.clf = None
forward(self, *mods, lengths=None)
Encode + fuse
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable input modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | The unpadded tensor lengths. Defaults to None. | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: The fused tensor [B, D] |
Source code in slp/modules/multimodal.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Encode + fuse
Args:
*mods (torch.Tensor): Variable input modality tensors [B, L, D]
lengths (Optional[torch.Tensor], optional): The unpadded tensor lengths. Defaults to None.
Returns:
torch.Tensor: The fused tensor [B, D]
"""
encoded: List[torch.Tensor] = self._encode(*mods, lengths=lengths)
fused = self._fuse(*encoded, lengths=lengths)
return fused
BimodalEncoder
out_size: int
property
readonly
Output feature size
Returns:
| Type | Description |
|---|---|
| int | int: Output feature size |
__init__(self, encoder1_args, encoder2_args, fuser_args, **kwargs)
special
Two modality encoder
Encode + Fuse two input modalities
Example encoder_args:
{
    "input_size": 35,
    "hidden_size": 100,
    "layers": 1,
    "bidirectional": True,
    "dropout": 0.2,
    "rnn_type": "lstm",
    "attention": True,
}
Example fuser_args:
{
    "n_modalities": 3,
    "dropout": 0.2,
    "output_size": 100,
    "hidden_size": 100,
    "fusion_method": "cat",
    "timesteps_pooling_method": "rnn",
}
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encoder1_args | Dict[str, Any] | Configuration for first encoder | required |
| encoder2_args | Dict[str, Any] | Configuration for second encoder | required |
| fuser_args | Dict[str, Any] | Configuration for fuser | required |
Source code in slp/modules/multimodal.py
def __init__(
self,
encoder1_args: Dict[str, Any],
encoder2_args: Dict[str, Any],
fuser_args: Dict[str, Any],
**kwargs,
):
"""Two modality encoder
Encode + Fuse two input modalities
Example encoder_args:
{
"input_size": 35,
"hidden_size": 100,
"layers": 1,
"bidirectional": True,
"dropout": 0.2,
"rnn_type": "lstm",
"attention": True,
}
Example fuser_args:
{
"n_modalities": 3,
"dropout": 0.2,
"output_size": 100,
"hidden_size": 100,
"fusion_method": "cat",
"timesteps_pooling_method": "rnn",
}
Args:
encoder1_args (Dict[str, Any]): Configuration for first encoder
encoder2_args (Dict[str, Any]): Configuration for second encoder
fuser_args (Dict[str, Any]): Configuration for fuser
"""
super(BimodalEncoder, self).__init__(
encoder1_args,
encoder2_args,
fuser_args,
**kwargs,
)
self.input_projection = None
if "input_projection" in fuser_args and fuser_args["input_projection"]:
self.input_projection = ModalityProjection(
[encoder1_args["input_size"], encoder2_args["input_size"]],
fuser_args["hidden_size"],
mode=fuser_args["input_projection"],
)
encoder1_args["return_hidden"] = True
encoder2_args["return_hidden"] = True
self.encoder1 = UnimodalEncoder(**encoder1_args)
self.encoder2 = UnimodalEncoder(**encoder2_args)
self.fuse = self._make_fusion_pipeline(
[self.encoder1.out_size, self.encoder2.out_size], **fuser_args
)
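A construction sketch based on the example configuration dictionaries above. The concrete sizes, the text/visual pairing and the n_modalities value are illustrative assumptions.

```python
import torch
from slp.modules.multimodal import BimodalEncoder

text_args = {
    "input_size": 300, "hidden_size": 100, "layers": 1, "bidirectional": True,
    "dropout": 0.2, "rnn_type": "lstm", "attention": True,
}
visual_args = dict(text_args, input_size=35)  # same config, different input size
fuser_args = {
    "n_modalities": 2, "dropout": 0.2, "output_size": 100, "hidden_size": 100,
    "fusion_method": "cat", "timesteps_pooling_method": "rnn",
}

enc = BimodalEncoder(text_args, visual_args, fuser_args)
txt, vi = torch.rand(8, 20, 300), torch.rand(8, 20, 35)
fused = enc(txt, vi, lengths=torch.full((8,), 20))  # [B, enc.out_size]
```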
GloveEncoder
Alias for Unimodal Encoder
MOSEIClassifier
__init__(self, encoder, num_classes, dropout=0.2)
special
Encode and classify multimodal inputs
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encoder | BaseEncoder | The encoder module | required |
| num_classes | int | The number of target classes | required |
| dropout | float | Dropout probability | 0.2 |
Source code in slp/modules/multimodal.py
def __init__(self, encoder: BaseEncoder, num_classes: int, dropout: float = 0.2):
"""Encode and classify multimodal inputs
Args:
encoder (BaseEncoder): The encoder module
num_classes (int): The number of target classes
dropout (float): Dropout probability
"""
super(MOSEIClassifier, self).__init__()
self.enc = encoder
self.drop = nn.Dropout(p=dropout)
self.clf = nn.Linear(self.enc.out_size, num_classes)
MultimodalBaseline
__init__(self, text_size=300, audio_size=74, visual_size=35, hidden_size=100, dropout=0.2, encoder_layers=1, bidirectional=True, merge_bi='sum', rnn_type='lstm', encoder_attention=True, fuser_residual=True, use_all_trimodal=False)
special
Multimodal baseline architecture
This baseline consists of three unimodal RNNs followed by an Attention Fuser and an RNN timestep pooler. The default configuration is tuned for good performance on MOSEI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_size | int | Text input size. Defaults to 300. | 300 |
| audio_size | int | Audio input size. Defaults to 74. | 74 |
| visual_size | int | Visual input size. Defaults to 35. | 35 |
| hidden_size | int | Hidden dimension. Defaults to 100. | 100 |
| dropout | float | Dropout rate. Defaults to 0.2. | 0.2 |
| encoder_layers | float | Number of encoder layers. Defaults to 1. | 1 |
| bidirectional | bool | Use bidirectional RNNs. Defaults to True. | True |
| merge_bi | str | Bidirectional merging method in the encoders. Defaults to "sum". | 'sum' |
| rnn_type | str | RNN type [lstm\|gru]. Defaults to "lstm". | 'lstm' |
| encoder_attention | bool | Use attention in the encoder RNNs. Defaults to True. | True |
| fuser_residual | bool | Use vilbert-like residual in the attention fuser. Defaults to True. | True |
| use_all_trimodal | bool | Use all trimodal interactions for the Attention fuser. Defaults to False. | False |
Source code in slp/modules/multimodal.py
def __init__(
self,
text_size: int = 300,
audio_size: int = 74,
visual_size: int = 35,
hidden_size: int = 100,
dropout: float = 0.2,
encoder_layers: float = 1,
bidirectional: bool = True,
merge_bi: str = "sum",
rnn_type: str = "lstm",
encoder_attention: bool = True,
fuser_residual: bool = True,
use_all_trimodal: bool = False,
):
"""Multimodal baseline architecture
This baseline consists of three unimodal RNNs followed by an Attention Fuser and an RNN timestep pooler.
The default configuration is tuned for good performance on MOSEI.
Args:
text_size (int, optional): Text input size. Defaults to 300.
audio_size (int, optional): Audio input size. Defaults to 74.
visual_size (int, optional): Visual input size. Defaults to 35.
hidden_size (int, optional): Hidden dimension. Defaults to 100.
dropout (float, optional): Dropout rate. Defaults to 0.2.
encoder_layers (float, optional): Number of encoder layers. Defaults to 1.
bidirectional (bool, optional): Use bidirectional RNNs. Defaults to True.
merge_bi (str, optional): Bidirectional merging method in the encoders. Defaults to "sum".
rnn_type (str, optional): RNN type [lstm|gru]. Defaults to "lstm".
encoder_attention (bool, optional): Use attention in the encoder RNNs. Defaults to True.
fuser_residual (bool, optional): Use vilbert like residual in the attention fuser. Defaults to True.
use_all_trimodal (bool, optional): Use all trimodal interactions for the Attention fuser. Defaults to False.
"""
cfg = {
"hidden_size": hidden_size,
"dropout": dropout,
"layers": encoder_layers,
"attention": encoder_attention,
"bidirectional": bidirectional,
"rnn_type": rnn_type,
"merge_bi": merge_bi,
}
text_cfg = MultimodalBaseline.encoder_cfg(text_size, **cfg)
audio_cfg = MultimodalBaseline.encoder_cfg(audio_size, **cfg)
visual_cfg = MultimodalBaseline.encoder_cfg(visual_size, **cfg)
fuser_cfg = MultimodalBaseline.fuser_cfg(
hidden_size=hidden_size,
dropout=dropout,
residual=fuser_residual,
use_all_trimodal=use_all_trimodal,
)
super(MultimodalBaseline, self).__init__(
text_cfg, audio_cfg, visual_cfg, fuser_cfg
)
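A usage sketch with the default (MOSEI-tuned) configuration; the input sizes follow the defaults above, and the batch and sequence sizes are illustrative.

```python
import torch
from slp.modules.multimodal import MOSEIClassifier, MultimodalBaseline

encoder = MultimodalBaseline()  # text=300, audio=74, visual=35, hidden=100

txt = torch.rand(8, 20, 300)
au = torch.rand(8, 20, 74)
vi = torch.rand(8, 20, 35)
lengths = torch.full((8,), 20)

fused = encoder(txt, au, vi, lengths=lengths)  # [B, encoder.out_size]
clf = MOSEIClassifier(encoder, num_classes=1)  # attach a linear head on top of the encoder
```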
encoder_cfg(input_size, **cfg)
staticmethod
Static method to create the encoder configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_size | int | Input modality size | required |
| **cfg | | Optional keyword arguments | {} |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Dict[str, Any]: The encoder configuration |
Source code in slp/modules/multimodal.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
"""Static method to create the encoder configuration
The default configuration is provided here
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
input_size (int): Input modality size
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The encoder configuration
"""
return {
"input_size": input_size,
"hidden_size": cfg.get("hidden_size", 100),
"layers": cfg.get("layers", 1),
"bidirectional": cfg.get("bidirectional", True),
"dropout": cfg.get("dropout", 0.2),
"rnn_type": cfg.get("rnn_type", "lstm"),
"attention": cfg.get("attention", True),
"merge_bi": cfg.get("merge_bi", "sum"),
}
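Example (illustrative; not part of the library source). Default values can be overridden through the keyword arguments; the override values below are arbitrary:

from slp.modules.multimodal import MultimodalBaseline

# Unspecified keys fall back to the MOSEI-tuned defaults shown above
text_cfg = MultimodalBaseline.encoder_cfg(300, hidden_size=128, rnn_type="gru")
# text_cfg["hidden_size"] == 128, text_cfg["rnn_type"] == "gru", text_cfg["layers"] == 1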
fuser_cfg(**cfg)
staticmethod
Static method to create the fuser configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The fuser configuration |
Source code in slp/modules/multimodal.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
"""Static method to create the fuser configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The fuser configuration
"""
return {
"n_modalities": 3,
"dropout": cfg.get("dropout", 0.2),
"output_size": cfg.get("hidden_size", 100),
"hidden_size": cfg.get("hidden_size", 100),
"fusion_method": "attention",
"timesteps_pooling_method": "rnn",
"residual": cfg.get("residual", True),
"use_all_trimodal": cfg.get("use_all_trimodal", True),
}
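Example (illustrative; not part of the library source). A minimal sketch of instantiating the baseline, assuming it exposes the forward(*mods, lengths=...) interface documented for MMLatch below and returns a fused [B, hidden_size] tensor:

import torch

from slp.modules.multimodal import MultimodalBaseline

model = MultimodalBaseline(text_size=300, audio_size=74, visual_size=35, hidden_size=100)

# Dummy batch: 8 sequences of length 20 for each modality, [B, L, D]
text = torch.randn(8, 20, 300)
audio = torch.randn(8, 20, 74)
visual = torch.randn(8, 20, 35)
lengths = torch.full((8,), 20, dtype=torch.long)  # unpadded sequence lengths

fused = model(text, audio, visual, lengths=lengths)  # expected shape [8, 100]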
MultimodalBaselineClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
TrimodalClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
TrimodalEncoder
out_size: int
property
readonly
Output feature size
Returns:
Type | Description |
---|---|
int |
int: Output feature size |
__init__(self, encoder1_args, encoder2_args, encoder3_args, fuser_args, **kwargs)
special
Three modality encoder
Encode + Fuse three input modalities
Example encoder_args: { "input_size": 35, "hidden_size": 100, "layers": 1, "bidirectional": True, "dropout": 0.2, "rnn_type": "lstm", "attention": True, }
Example fuser_args: { "n_modalities": 3, "dropout": 0.2, "output_size": 100, "hidden_size": 100, "fusion_method": "cat", "timesteps_pooling_method": "rnn", }
Parameters:
Name | Type | Description | Default |
---|---|---|---|
encoder1_args |
Dict[str, Any] |
Configuration for first encoder |
required |
encoder2_args |
Dict[str, Any] |
Configuration for second encoder |
required |
encoder3_args |
Dict[str, Any] |
Configuration for third encoder |
required |
fuser_args |
Dict[str, Any] |
Configuration for fuser |
required |
Source code in slp/modules/multimodal.py
def __init__(
self,
encoder1_args: Dict[str, Any],
encoder2_args: Dict[str, Any],
encoder3_args: Dict[str, Any],
fuser_args: Dict[str, Any],
**kwargs,
):
"""Two modality encoder
Encode + Fuse three input modalities
Example encoder_args:
{
"input_size": 35,
"hidden_size": 100,
"layers": 1,
"bidirectional": True,
"dropout": 0.2,
"rnn_type": "lstm",
"attention": True,
}
Example fuser_args:
{
"n_modalities": 3,
"dropout": 0.2,
"output_size": 100,
"hidden_size": 100,
"fusion_method": "cat",
"timesteps_pooling_method": "rnn",
}
Args:
encoder1_args (Dict[str, Any]): Configuration for first encoder
encoder2_args (Dict[str, Any]): Configuration for second encoder
encoder3_args (Dict[str, Any]): Configuration for third encoder
fuser_args (Dict[str, Any]): Configuration for fuser
"""
super(TrimodalEncoder, self).__init__(
encoder1_args,
encoder2_args,
encoder3_args,
fuser_args,
**kwargs,
)
self.input_projection = None
if "input_projection" in fuser_args and fuser_args["input_projection"]:
self.input_projection = ModalityProjection(
[encoder1_args["input_size"], encoder2_args["input_size"]],
fuser_args["hidden_size"],
mode=fuser_args["input_projection"],
)
self.encoder1 = UnimodalEncoder(**encoder1_args)
self.encoder2 = UnimodalEncoder(**encoder2_args)
self.encoder3 = UnimodalEncoder(**encoder3_args)
# encoder3_args["input_size"], encoder3_args["hidden_size"], **encoder3_args
self.fuse = self._make_fusion_pipeline(
[self.encoder1.out_size, self.encoder2.out_size, self.encoder3.out_size],
**fuser_args,
)
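Example (illustrative; not part of the library source). Construction following the example encoder_args / fuser_args above, assuming the encoder follows the forward(*mods, lengths=...) convention used throughout this page:

import torch

from slp.modules.multimodal import TrimodalEncoder

def enc_args(size):
    # Per-modality encoder configuration (same keys as the example above)
    return {
        "input_size": size,
        "hidden_size": 100,
        "layers": 1,
        "bidirectional": True,
        "dropout": 0.2,
        "rnn_type": "lstm",
        "attention": True,
    }

fuser_args = {
    "n_modalities": 3,
    "dropout": 0.2,
    "output_size": 100,
    "hidden_size": 100,
    "fusion_method": "cat",
    "timesteps_pooling_method": "rnn",
}

encoder = TrimodalEncoder(enc_args(300), enc_args(74), enc_args(35), fuser_args)

text, audio, visual = torch.randn(4, 10, 300), torch.randn(4, 10, 74), torch.randn(4, 10, 35)
lengths = torch.full((4,), 10, dtype=torch.long)
fused = encoder(text, audio, visual, lengths=lengths)  # [4, encoder.out_size]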
UnimodalClassifier
__init__(self, input_size, hidden_size, num_classes, layers=1, bidirectional=True, dropout=0.2, rnn_type='lstm', attention=True, **kwargs)
special
Encode and classify unimodal inputs
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
The input modality feature size |
required |
hidden_size |
int |
Hidden size for RNN |
required |
num_classes |
int |
The number of target classes |
required |
layers |
int |
Number of RNN layers |
1 |
bidirectional |
bool |
Use biRNN |
True |
dropout |
float |
Dropout probability |
0.2 |
rnn_type |
str |
[lstm|gru] |
'lstm' |
attention |
bool |
Use attention on hidden states |
True |
Source code in slp/modules/multimodal.py
def __init__(
self,
input_size: int,
hidden_size: int,
num_classes: int,
layers: int = 1,
bidirectional: bool = True,
dropout: float = 0.2,
rnn_type: str = "lstm",
attention: bool = True,
**kwargs,
):
"""Encode and classify unimodal inputs
Args:
input_size (int): The input modality feature size
hidden_size (int): Hidden size for RNN
num_classes (int): The number of target classes
layers (int): Number of RNN layers
bidirectional (bool): Use biRNN
dropout (float): Dropout probability
rnn_type (str): [lstm|gru]
attention (bool): Use attention on hidden states
"""
enc = UnimodalEncoder(
input_size,
hidden_size,
layers=layers,
bidirectional=bidirectional,
dropout=dropout,
rnn_type=rnn_type,
attention=attention,
aggregate_encoded=True,
)
super(UnimodalClassifier, self).__init__(enc, num_classes)
forward(self, x, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, x: torch.Tensor, lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
fused = self.enc(x, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
UnimodalEncoder
out_size: int
property
readonly
Output feature size
Returns:
Type | Description |
---|---|
int |
int: Output feature size |
__init__(self, input_size, hidden_size, layers=1, bidirectional=True, dropout=0.2, rnn_type='lstm', attention=True, merge_bi='sum', aggregate_encoded=False, **kwargs)
special
Single modality encoder
Encode a single modality using an Attentive RNN
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
Input feature size |
required |
hidden_size |
int |
RNN hidden size |
required |
layers |
int |
Number of RNN layers. Defaults to 1. |
1 |
bidirectional |
bool |
Use bidirectional RNN. Defaults to True. |
True |
dropout |
float |
Dropout probability. Defaults to 0.2. |
0.2 |
rnn_type |
str |
lstm or gru. Defaults to "lstm". |
'lstm' |
attention |
bool |
Use attention over hidden states. Defaults to True. |
True |
merge_bi |
str |
How to merge hidden states [sum|cat]. Defaults to sum. |
'sum' |
aggregate_encoded |
bool |
Aggregate hidden states. Defaults to False. |
False |
Source code in slp/modules/multimodal.py
def __init__(
self,
input_size: int,
hidden_size: int,
layers: int = 1,
bidirectional: bool = True,
dropout: float = 0.2,
rnn_type: str = "lstm",
attention: bool = True,
merge_bi: str = "sum",
aggregate_encoded: bool = False,
**kwargs,
):
"""Single modality encoder
Encode a single modality using an Attentive RNN
Args:
input_size (int): Input feature size
hidden_size (int): RNN hidden size
layers (int, optional): Number of RNN layers. Defaults to 1.
bidirectional (bool, optional): Use bidirectional RNN. Defaults to True.
dropout (float, optional): Dropout probability. Defaults to 0.2.
rnn_type (str, optional): lstm or gru. Defaults to "lstm".
attention (bool, optional): Use attention over hidden states. Defaults to True.
merge_bi (str, optional): How to merge hidden states [sum|cat]. Defaults to sum.
aggregate_encoded (bool, optional): Aggregate hidden states. Defaults to False.
"""
super(UnimodalEncoder, self).__init__(
input_size,
hidden_size,
layers=layers,
bidirectional=bidirectional,
dropout=dropout,
rnn_type=rnn_type,
attention=attention,
**kwargs,
)
self.aggregate_encoded = aggregate_encoded
self.encoder = AttentiveRNN(
input_size,
hidden_size,
batch_first=True,
layers=layers,
merge_bi=merge_bi,
bidirectional=bidirectional,
dropout=dropout,
rnn_type=rnn_type,
packed_sequence=True,
attention=attention,
return_hidden=True,
)
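Example (illustrative; not part of the library source). A sketch of encoding a single modality, assuming the forward mirrors the underlying AttentiveRNN and takes the padded sequence together with its unpadded lengths:

import torch

from slp.modules.multimodal import UnimodalEncoder

encoder = UnimodalEncoder(74, 100, bidirectional=True, attention=True)

audio = torch.randn(4, 25, 74)            # [B, L, D]
lengths = torch.tensor([25, 20, 18, 10])  # unpadded length of each sample
encoded = encoder(audio, lengths)         # hidden states, or pooled states if aggregate_encoded=True
print(encoder.out_size)                   # feature size of the encoder output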
VisualEncoder
Alias for Unimodal Encoder
VisualTextClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
M3
HardMultimodalDropout
__init__(self, p=0.5, n_modalities=3, p_mod=None)
special
MMDrop initial implementation
For each sample in a batch drop one of the modalities with probability p
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p |
float |
drop probability |
0.5 |
n_modalities |
int |
number of modalities |
3 |
p_mod |
Optional[List[float]] |
Drop probabilities for each modality |
None |
Source code in slp/modules/mmdrop.py
def __init__(
self, p: float = 0.5, n_modalities: int = 3, p_mod: Optional[List[float]] = None
):
"""MMDrop initial implementation
For each sample in a batch drop one of the modalities with probability p
Args:
p (float): drop probability
n_modalities (int): number of modalities
p_mod (Optional[List[float]]): Drop probabilities for each modality
"""
super(HardMultimodalDropout, self).__init__()
self.p = p
self.n_modalities = n_modalities
self.p_mod = [1.0 / n_modalities for _ in range(n_modalities)]
if p_mod is not None:
self.p_mod = p_mod
forward(self, *mods)
Naive mmdrop forward
Iterate over batch and randomly choose modality to drop
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods |
varargs torch.Tensor |
[B, L, D_m] Modality representations |
() |
Returns:
Type | Description |
---|---|
(List[torch.Tensor]) |
The modality representations. Some of them are dropped |
Source code in slp/modules/mmdrop.py
def forward(self, *mods):
"""Naive mmdrop forward
Iterate over batch and randomly choose modality to drop
Args:
mods (varargs torch.Tensor): [B, L, D_m] Modality representations
Returns:
(List[torch.Tensor]): The modality representations. Some of them are dropped
"""
mods = list(mods)
# List of [B, L, D]
if self.training:
if random.random() < self.p:
# Drop different modality for each sample in batch
for batch in range(mods[0].size(0)):
m = random.choices(
list(range(self.n_modalities)), weights=self.p_mod, k=1
)[0]
# m = random.randint(0, self.n_modalities - 1)
mask = torch.ones_like(mods[m])
mask[batch] = 0.0
mods[m] = mods[m] * mask
if self.p > 0:
for m in range(len(mods)):
keep_prob = 1 - (self.p / self.n_modalities)
mods[m] = mods[m] * (1 / keep_prob)
return mods
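Example (illustrative; not part of slp/modules/mmdrop.py). During train() one modality may be zeroed per sample and all modalities are rescaled by 1 / keep_prob; outside training no modality is dropped:

import torch

from slp.modules.mmdrop import HardMultimodalDropout

mmdrop = HardMultimodalDropout(p=0.5, n_modalities=3)
t, a, v = torch.ones(2, 4, 8), torch.ones(2, 4, 8), torch.ones(2, 4, 8)

mmdrop.train()
t_d, a_d, v_d = mmdrop(t, a, v)  # some samples of one modality may be all zeros,
                                 # the rest are scaled by 1 / (1 - p / n_modalities)

mmdrop.eval()
t_e, a_e, v_e = mmdrop(t, a, v)  # no modalities are dropped at inference time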
MultimodalDropout
__init__(self, p=0.5, n_modalities=3, p_mod=None, mode='hard')
special
mmdrop wrapper class
Drop p * 100 % of features of a specific modality over batch
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p |
float |
drop probability |
0.5 |
n_modalities |
int |
number of modalities |
3 |
p_mod |
Optional[List[float]] |
Drop probabilities for each modality |
None |
mode |
str |
Hard or soft mmdrop |
'hard' |
Source code in slp/modules/mmdrop.py
def __init__(
self,
p: float = 0.5,
n_modalities: int = 3,
p_mod: Optional[List[float]] = None,
mode: str = "hard",
):
"""mmdrop wrapper class
Drop p * 100 % of features of a specific modality over batch
Args:
p (float): drop probability
n_modalities (int): number of modalities
p_mod (Optional[List[float]]): Drop probabilities for each modality
mode (str): Hard or soft mmdrop
"""
super(MultimodalDropout, self).__init__()
assert mode in [
"hard",
"soft",
], "Allowed mode for MultimodalDropout ['hard' | 'soft']"
if mode == "hard":
self.mmdrop = HardMultimodalDropout(
p=p, n_modalities=n_modalities, p_mod=p_mod
)
else:
self.mmdrop = SoftMultimodalDropout( # type: ignore
p=p, n_modalities=n_modalities, p_mod=p_mod
)
forward(self, *mods)
mmdrop wrapper forward
Perform hard or soft mmdrop
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods |
varargs torch.Tensor |
[B, L, D_m] Modality representations |
() |
Returns:
Type | Description |
---|---|
(List[torch.Tensor]) |
The modality representations. Some of them are dropped |
Source code in slp/modules/mmdrop.py
def forward(self, *mods):
"""mmdrop wrapper forward
Perform hard or soft mmdrop
Args:
mods (varargs torch.Tensor): [B, L, D_m] Modality representations
Returns:
(List[torch.Tensor]): The modality representations. Some of them are dropped
"""
return self.mmdrop(*mods)
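Example (illustrative; not part of slp/modules/mmdrop.py). Selecting between the two variants through the wrapper:

from slp.modules.mmdrop import MultimodalDropout

hard_drop = MultimodalDropout(p=0.3, n_modalities=3, mode="hard")
soft_drop = MultimodalDropout(p=0.3, n_modalities=3, mode="soft")

# Both are used identically inside a model's forward pass, e.g.
# text, audio, visual = hard_drop(text, audio, visual)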
SoftMultimodalDropout
__init__(self, p=0.5, n_modalities=3, p_mod=None)
special
Soft mmdrop implementation
Drop p * 100 % of features of a specific modality over batch
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p |
float |
drop probability |
0.5 |
n_modalities |
int |
number of modalities |
3 |
p_mod |
Optional[List[float]] |
Drop probabilities for each modality |
None |
Source code in slp/modules/mmdrop.py
def __init__(
self, p: float = 0.5, n_modalities: int = 3, p_mod: Optional[List[float]] = None
):
"""Soft mmdrop implementation
Drop p * 100 % of features of a specific modality over batch
Args:
p (float): drop probability
n_modalities (int): number of modalities
p_mod (Optional[List[float]]): Drop probabilities for each modality
"""
super(SoftMultimodalDropout, self).__init__()
self.p = p # p_drop
self.n_modalities = n_modalities
self.p_mod = [1.0 / n_modalities for _ in range(n_modalities)]
if p_mod is not None:
self.p_mod = p_mod
forward(self, *mods)
Soft mmdrop forward
Sample a binomial mask to mask a random modality in this batch
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods |
varargs torch.Tensor |
[B, L, D_m] Modality representations |
() |
Returns:
Type | Description |
---|---|
(List[torch.Tensor]) |
The modality representations. Some of them are dropped |
Source code in slp/modules/mmdrop.py
def forward(self, *mods):
"""Soft mmdrop forward
Sample a binomial mask to mask a random modality in this batch
Args:
mods (varargs torch.Tensor): [B, L, D_m] Modality representations
Returns:
(List[torch.Tensor]): The modality representations. Some of them are dropped
"""
mods = list(mods)
if self.training:
# m = random.randint(0, self.n_modalities - 1)
m = random.choices(list(range(self.n_modalities)), weights=self.p_mod, k=1)[
0
]
binomial = torch.distributions.binomial.Binomial(probs=1 - self.p)
mods[m] = mods[m] * binomial.sample(mods[m].size()).to(mods[m].device)
for m in range(self.n_modalities):
mods[m] = mods[m] * (1.0 / (1 - self.p / self.n_modalities))
return mods
M3
encoder_cfg(input_size, **cfg)
staticmethod
Static method to create the encoder configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
Input modality size |
required |
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The encoder configuration |
Source code in slp/modules/m3.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
"""Static method to create the encoder configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
input_size (int): Input modality size
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The encoder configuration
"""
return {
"input_size": input_size,
"hidden_size": cfg.get("hidden_size", 100),
"layers": cfg.get("layers", 1),
"bidirectional": cfg.get("bidirectional", True),
"dropout": cfg.get("dropout", 0.2),
"rnn_type": cfg.get("rnn_type", "lstm"),
"attention": cfg.get("attention", True),
}
fuser_cfg(**cfg)
staticmethod
Static method to create the fuser configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The fuser configuration |
Source code in slp/modules/m3.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
"""Static method to create the fuser configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The fuser configuration
"""
return {
"n_modalities": 3,
"dropout": cfg.get("dropout", 0.2),
"output_size": cfg.get("hidden_size", 100),
"hidden_size": cfg.get("hidden_size", 100),
"fusion_method": "attention",
"timesteps_pooling_method": "rnn",
"residual": cfg.get("residual", True),
"use_all_trimodal": cfg.get("use_all_trimodal", True),
"mmdrop_prob": 0.2,
"mmdrop_individual_mod_prob": None,
"mmdrop_algorithm": "hard",
}
M3Classifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/m3.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
M3FuseAggregate
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
Type | Description |
---|---|
int |
int: The feature dimension of the output tensor |
__init__(self, feature_size, n_modalities, output_size=None, fusion_method='cat', timesteps_pooling_method='sum', batch_first=True, mmdrop_prob=0.2, mmdrop_individual_mod_prob=None, mmdrop_algorithm='hard', **fuser_kwargs)
special
Apply MultimodalDropout, fuse the input feature sequences, and aggregate across timesteps
MultimodalDropout -> Fuser -> TimestepsPooler
Parameters:
Name | Type | Description | Default |
---|---|---|---|
feature_size |
int |
The input modality representations dimension |
required |
n_modalities |
int |
Number of input modalities |
required |
output_size |
Optional[int] |
Required output size. If not provided, output_size = fuser.out_size |
None |
fusion_method |
str |
Select which fuser to use [cat|sum|attention|bilinear] |
'cat' |
timesteps_pooling_method |
str |
TimestepsPooler method [cat|sum|rnn] |
'sum' |
batch_first |
bool |
Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing. |
True |
mmdrop_prob |
float |
The probability for multimodal dropout. Defaults to 0.2 |
0.2 |
mmdrop_individual_mod_prob |
Optional[List[float]] |
Drop probabilities for each modality for multimodal dropout. If None all modalities are dropped with equal probability |
None |
mmdrop_algorithm |
str |
Choose multimodal dropout algorithm [hard|soft]. Defaults to hard |
'hard' |
**fuser_kwargs |
dict |
Extra keyword arguments to instantiate fuser |
{} |
Source code in slp/modules/m3.py
def __init__(
self,
feature_size: int,
n_modalities: int,
output_size: Optional[int] = None,
fusion_method: str = "cat",
timesteps_pooling_method: str = "sum",
batch_first: bool = True,
mmdrop_prob: float = 0.2,
mmdrop_individual_mod_prob: Optional[List[float]] = None,
mmdrop_algorithm: str = "hard",
**fuser_kwargs,
):
"""MultimodalDropout, Fuse input feature sequences and aggregate across timesteps
MultimodalDropout -> Fuser -> TimestepsPooler
Args:
feature_size (int): The input modality representations dimension
n_modalities (int): Number of input modalities
output_size (Optional[int]): Required output size. If not provided,
output_size = fuser.out_size
fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
batch_first (bool): Input tensors are in batch-first configuration. Leave this as True
unless you know what you are doing
mmdrop_prob (float): The probability for multimodal dropout. Defaults to 0.2
mmdrop_individual_mod_prob (Optional[List[float]]): Drop probabilities for each modality
for multimodal dropout. If None all modalities are dropped with equal probability
mmdrop_algorithm (str): Choose multimodal dropout algorithm [hard|soft]. Defaults to hard
**fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
"""
super(M3FuseAggregate, self).__init__()
self.m3 = MultimodalDropout(
p=mmdrop_prob,
n_modalities=n_modalities,
p_mod=mmdrop_individual_mod_prob,
mode=mmdrop_algorithm,
)
fuser_kwargs["output_size"] = output_size
fuser_kwargs["fusion_method"] = fusion_method
fuser_kwargs["timesteps_pooling_method"] = timesteps_pooling_method
fuser_kwargs["batch_first"] = batch_first
if "n_modalities" in fuser_kwargs:
fuser_kwargs.pop("n_modalities") # Avoid multiple arguments
if "projection_size" in fuser_kwargs:
fuser_kwargs.pop("projection_size") # Avoid multiple arguments
self.fuse_aggregate = FuseAggregateTimesteps(
feature_size,
n_modalities,
**fuser_kwargs,
)
forward(self, *mods, *, lengths=None)
Fuse the modality representations and aggregate across timesteps
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*mods |
Tensor |
List of modality tensors [B, L, D] |
() |
lengths |
Optional[torch.Tensor] |
Lengths of each modality |
None |
Returns:
Type | Description |
---|---|
Tensor |
torch.Tensor: Fused tensor [B, self.out_size] |
Source code in slp/modules/m3.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Fuse the modality representations and aggregate across timesteps
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, self.out_size]
"""
mods_masked: List[torch.Tensor] = self.m3(*mods)
fused: torch.Tensor = self.fuse_aggregate(*mods_masked, lengths=lengths)
return fused
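Example (illustrative; not part of slp/modules/m3.py). A usage sketch, assuming all modalities have already been projected to a common feature_size:

import torch

from slp.modules.m3 import M3FuseAggregate

fuse = M3FuseAggregate(
    feature_size=100,
    n_modalities=3,
    fusion_method="attention",
    timesteps_pooling_method="rnn",
    mmdrop_prob=0.2,
    mmdrop_algorithm="hard",
)

mods = [torch.randn(4, 12, 100) for _ in range(3)]  # three modalities, [B, L, feature_size]
lengths = torch.full((4,), 12, dtype=torch.long)
out = fuse(*mods, lengths=lengths)  # [4, fuse.out_size]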
Multimodal Feedback
BaseFeedbackUnit
__init__(self, top_size, target_size, n_top_modalities, **kwargs)
special
Base class for feedback unit
Feedback units are responsible for projecting top-level crossmodal representations to bottom-level features and applying the top-down masks
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
n_top_modalities |
int |
Number of modalities to use for feedback |
required |
Source code in slp/modules/feedback.py
def __init__(
self, top_size: int, target_size: int, n_top_modalities: int, **kwargs
):
"""Base class for feedback unit
Feedback units are responsible for projecting top-level crossmodal
representations to bottom-level features and applying the top-down masks
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
n_top_modalities (int): Number of modalities to use for feedback
"""
super(BaseFeedbackUnit, self).__init__()
self.n_ = n_top_modalities
self.mask_layers = nn.ModuleList(
[
self.make_mask_layer(top_size, target_size, **kwargs)
for _ in range(self.n_)
]
)
forward(self, x_bottom, *mods_top, *, lengths=None)
Apply the top-down masks to the input feature vector
x = x * top_down_mask
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_bottom |
Tensor |
Bottom-level features [B, L, target_size] |
required |
*mods_top |
Tensor |
Top-level modality representations |
() |
lengths |
Optional[torch.Tensor] |
Original unpadded tensor lengths. Defaults to None. |
None |
Returns:
Type | Description |
---|---|
Tensor |
torch.Tensor: Masked low level feature tensor [B, L, target_size] |
Source code in slp/modules/feedback.py
def forward(
self,
x_bottom: torch.Tensor,
*mods_top: torch.Tensor,
lengths: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""Apply the top-down masks to the input feature vector
x = x * top_down_mask
Args:
x_bottom (torch.Tensor): Bottom-level features [B, L, target_size]
*mods_top (torch.Tensor): Top-level modality representations
lengths (Optional[torch.Tensor], optional): Original unpadded tensor lengths. Defaults to None.
Returns:
torch.Tensor: Masked low level feature tensor [B, L, target_size]
"""
mask = self._get_feedback_mask(*mods_top, lengths=lengths)
x_bottom = x_bottom * mask
return x_bottom
make_mask_layer(self, top_size, target_size, **kwargs)
Abstract method to instantiate the layer to use for top-down feedback
To be implemented by subclasses
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
Module |
nn.Module: The instantiated feedback layer |
Source code in slp/modules/feedback.py
@abc.abstractmethod
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
"""Abstract method to instantiate the layer to use for top-down feedback
To be implemented by subclasses
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: The instantiated feedback layer
"""
pass
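Example (illustrative; LinearFeedbackUnit is a hypothetical subclass, not part of slp/modules/feedback.py). Subclasses only provide the top-down projection layer; the base class builds one such layer per feedback modality and applies the resulting masks:

import torch.nn as nn

from slp.modules.feedback import BaseFeedbackUnit

class LinearFeedbackUnit(BaseFeedbackUnit):
    """Hypothetical feedback unit using a single linear top-down projection."""

    def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
        # Project each top-level representation down to the bottom-level feature size
        return nn.Linear(top_size, target_size)

unit = LinearFeedbackUnit(top_size=100, target_size=74, n_top_modalities=2)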
BoomFeedbackUnit
make_mask_layer(self, top_size, target_size, **kwargs)
Use a boom module for top-down projection
A boom module is a two-layer MLP where the inner projection size is much larger than the input and output size (similar to the position-wise feedforward layer in Transformers).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
nn.Module |
slp.modules.feedforward.TwoLayer instance |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size, target_size, **kwargs):
"""Use an boom module for top-down projection
A boom module is a two-layer MLP where the inner projection size is
much larger than the input and output size. (similar to Position feedforward in transformers)
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: slp.modules.feedforward.TwoLayer instance
"""
return TwoLayer(
top_size,
2 * top_size,
target_size,
activation=kwargs.get("activation", "gelu"),
dropout=kwargs.get("dropout", 0.2),
)
DownUpFeedbackUnit
make_mask_layer(self, top_size, target_size, **kwargs)
Use a down-up module for top-down projection
A down-up module is a two-layer MLP where the inner projection size is much smaller than the input and output size (similar to adapters).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
nn.Module |
slp.modules.feedforward.TwoLayer instance |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size, target_size, **kwargs):
"""Use an down-up module for top-down projection
A down-up module is a two-layer MLP where the inner projection size is
much smaller than the input and output size. (Similar to adapyers)
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: slp.modules.feedforward.TwoLayer instance
"""
return TwoLayer(
top_size,
top_size // 5,
target_size,
activation=kwargs.get("activation", "gelu"),
dropout=kwargs.get("dropout", 0.2),
)
Feedback
__init__(self, top_size, bottom_modality_sizes, use_self=False, mask_type='rnn', **kwargs)
special
Feedback module
Given a list of low-level features and top-level representations for n modalities:
- Create top-down masks for each modality
- Apply top-down masks to the low level features
- Return masked low-level features
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size for top-level representations (Common across modalities) |
required |
bottom_modality_sizes |
List[int] |
List of feature sizes for each low-level modality feature |
required |
use_self |
bool |
Include the self modality when creating the top-down mask. Defaults to False. |
False |
mask_type |
str |
Which feedback unit to use [rnn|gated|boom|downup]. Defaults to "rnn". |
'rnn' |
Source code in slp/modules/feedback.py
def __init__(
self,
top_size: int,
bottom_modality_sizes: List[int],
use_self: bool = False,
mask_type: str = "rnn",
**kwargs,
):
"""Feedback module
Given a list of low-level features and top-level representations for n modalities:
* Create top-down masks for each modality
* Apply top-down masks to the low level features
* Return masked low-level features
Args:
top_size (int): Feature size for top-level representations (Common across modalities)
bottom_modality_sizes (List[int]): List of feature sizes for each low-level modality feature
use_self (bool, optional): Include the self modality when creating the top-down mask. Defaults to False.
mask_type (str, optional): Which feedback unit to use [rnn|gated|boom|downup]. Defaults to "rnn".
"""
super(Feedback, self).__init__()
n_top_modalities = len(bottom_modality_sizes)
self.use_self = use_self
if not use_self:
n_top_modalities = n_top_modalities - 1
self.feedback_units = nn.ModuleList(
[
_make_feedback_unit(
top_size,
bottom_modality_sizes[i],
n_top_modalities,
mask_type=mask_type,
**kwargs,
)
for i in range(len(bottom_modality_sizes))
]
)
forward(self, mods_bottom, mods_top, lengths=None)
Create and apply the top-down masks to mods_bottom
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods_bottom |
List[torch.Tensor] |
Low-level features for each modality |
required |
mods_top |
List[torch.Tensor] |
High-level representations for each modality |
required |
lengths |
Optional[torch.Tensor] |
Original unpadded sequence lengths. Defaults to None. |
None |
Returns:
Type | Description |
---|---|
List[torch.Tensor] |
List[torch.Tensor]: Masked low level features for each modality |
Source code in slp/modules/feedback.py
def forward(
self,
mods_bottom: List[torch.Tensor],
mods_top: List[torch.Tensor],
lengths: Optional[torch.Tensor] = None,
) -> List[torch.Tensor]:
"""Create and apply the top-down masks to mods_bottom
Args:
mods_bottom (List[torch.Tensor]): Low-level features for each modality
mods_top (List[torch.Tensor]): High-level representations for each modality
lengths (Optional[torch.Tensor], optional): Original unpadded sequence lengths. Defaults to None.
Returns:
List[torch.Tensor]: Masked low level features for each modality
"""
out = []
for i, bm in enumerate(mods_bottom):
top = mods_top if self.use_self else mods_top[:i] + mods_top[i + 1 :]
masked = self.feedback_units[i](bm, *top, lengths=lengths)
out.append(masked)
return out
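Example (illustrative; not part of slp/modules/feedback.py). Applying feedback to three modalities with low-level features of different sizes and top-level representations of a common size:

import torch

from slp.modules.feedback import Feedback

feedback = Feedback(
    top_size=100,
    bottom_modality_sizes=[300, 74, 35],
    use_self=False,
    mask_type="rnn",
)

B, L = 4, 15
mods_bottom = [torch.randn(B, L, 300), torch.randn(B, L, 74), torch.randn(B, L, 35)]
mods_top = [torch.randn(B, L, 100) for _ in range(3)]  # top-level crossmodal representations
lengths = torch.full((B,), L, dtype=torch.long)

masked_bottom = feedback(mods_bottom, mods_top, lengths=lengths)  # same shapes as mods_bottom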
GatedFeedbackUnit
Apply feedback mask using simple gating mechanism
make_mask_layer(self, top_size, target_size, **kwargs)
Use a simple nn.Linear layer for top-down projection
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
Module |
nn.Module: nn.Linear instance with dropout |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
"""Use a simple nn.Linear layer for top-down projection
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: nn.Linear instance with dropout
"""
return nn.Sequential(
nn.Linear(top_size, target_size),
nn.Dropout(p=kwargs.get("dropout", 0.2)),
)
RNNFeedbackUnit
Apply feedback mask using top-down RNN layers
make_mask_layer(self, top_size, target_size, **kwargs)
Use an RNN for top-down projection
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
Module |
nn.Module: slp.modules.rnn.AttentiveRNN instance |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
"""Use an RNN for top-down projection
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: slp.modules.rnn.AttentiveRNN instance
"""
return AttentiveRNN(
top_size,
hidden_size=target_size,
attention=kwargs.get("attention", False),
dropout=kwargs.get("dropout", 0.2),
return_hidden=True,
bidirectional=kwargs.get("bidirectional", False),
merge_bi="sum",
rnn_type=kwargs.get("rnn_type", "lstm"),
)
MMLatch
__init__(self, text_size=300, audio_size=74, visual_size=35, hidden_size=100, dropout=0.2, encoder_layers=1, bidirectional=True, merge_bi='sum', rnn_type='lstm', encoder_attention=True, fuser_residual=True, use_all_trimodal=False, feedback=True, use_self_feedback=False, feedback_algorithm='rnn')
special
MMLatch implementation
Multimodal baseline + feedback
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text_size |
int |
Text input size. Defaults to 300. |
300 |
audio_size |
int |
Audio input size. Defaults to 74. |
74 |
visual_size |
int |
Visual input size. Defaults to 35. |
35 |
hidden_size |
int |
Hidden dimension. Defaults to 100. |
100 |
dropout |
float |
Dropout rate. Defaults to 0.2. |
0.2 |
encoder_layers |
int |
Number of encoder layers. Defaults to 1. |
1 |
bidirectional |
bool |
Use bidirectional RNNs. Defaults to True. |
True |
merge_bi |
str |
Bidirectional merging method in the encoders. Defaults to "sum". |
'sum' |
rnn_type |
str |
RNN type [lstm|gru]. Defaults to "lstm". |
'lstm' |
encoder_attention |
bool |
Use attention in the encoder RNNs. Defaults to True. |
True |
fuser_residual |
bool |
Use a ViLBERT-like residual in the attention fuser. Defaults to True. |
True |
use_all_trimodal |
bool |
Use all trimodal interactions for the Attention fuser. Defaults to False. |
False |
feedback |
bool |
Use top-down feedback. Defaults to True. |
True |
use_self_feedback |
bool |
If False, use only crossmodal features for top-down feedback. If True, also use the self modality. Defaults to False. |
False |
feedback_algorithm |
str |
Feedback module [rnn|boom|gated|downup]. Defaults to "rnn". |
'rnn' |
Source code in slp/modules/mmlatch.py
def __init__(
self,
text_size: int = 300,
audio_size: int = 74,
visual_size: int = 35,
hidden_size: int = 100,
dropout: float = 0.2,
encoder_layers: int = 1,
bidirectional: bool = True,
merge_bi: str = "sum",
rnn_type: str = "lstm",
encoder_attention: bool = True,
fuser_residual: bool = True,
use_all_trimodal: bool = False,
feedback: bool = True,
use_self_feedback: bool = False,
feedback_algorithm: str = "rnn",
):
"""MMLatch implementation
Multimodal baseline + feedback
Args:
text_size (int, optional): Text input size. Defaults to 300.
audio_size (int, optional): Audio input size. Defaults to 74.
visual_size (int, optional): Visual input size. Defaults to 35.
hidden_size (int, optional): Hidden dimension. Defaults to 100.
dropout (float, optional): Dropout rate. Defaults to 0.2.
encoder_layers (int, optional): Number of encoder layers. Defaults to 1.
bidirectional (bool, optional): Use bidirectional RNNs. Defaults to True.
merge_bi (str, optional): Bidirectional merging method in the encoders. Defaults to "sum".
rnn_type (str, optional): RNN type [lstm|gru]. Defaults to "lstm".
encoder_attention (bool, optional): Use attention in the encoder RNNs. Defaults to True.
fuser_residual (bool, optional): Use a ViLBERT-like residual in the attention fuser. Defaults to True.
use_all_trimodal (bool, optional): Use all trimodal interactions for the Attention fuser. Defaults to False.
feedback (bool, optional): Use top-down feedback. Defaults to True.
use_self_feedback (bool, optional): If False, use only crossmodal features for top-down feedback. If True, also use the self modality. Defaults to False.
feedback_algorithm (str, optional): Feedback module [rnn|boom|gated|downup]. Defaults to "rnn".
"""
super(MMLatch, self).__init__(
text_size=text_size,
audio_size=audio_size,
visual_size=visual_size,
hidden_size=hidden_size,
dropout=dropout,
encoder_layers=encoder_layers,
bidirectional=bidirectional,
merge_bi=merge_bi,
rnn_type=rnn_type,
encoder_attention=encoder_attention,
fuser_residual=fuser_residual,
use_all_trimodal=use_all_trimodal,
)
self.feedback = None
if feedback:
self.feedback = Feedback(
hidden_size,
[text_size, audio_size, visual_size],
use_self=use_self_feedback,
mask_type=feedback_algorithm,
)
encoder_cfg(input_size, **cfg)
staticmethod
Static method to create the encoder configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
Input modality size |
required |
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The encoder configuration |
Source code in slp/modules/mmlatch.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
"""Static method to create the encoder configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
input_size (int): Input modality size
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The encoder configuration
"""
return {
"input_size": input_size,
"hidden_size": cfg.get("hidden_size", 100),
"layers": cfg.get("layers", 1),
"bidirectional": cfg.get("bidirectional", True),
"dropout": cfg.get("dropout", 0.2),
"rnn_type": cfg.get("rnn_type", "lstm"),
"attention": cfg.get("attention", True),
}
forward(self, *mods, *, lengths=None)
Encode + fuse
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*mods |
Tensor |
Variable input modality tensors [B, L, D] |
() |
lengths |
Optional[torch.Tensor] |
The unpadded tensor lengths. Defaults to None. |
None |
Returns:
Type | Description |
---|---|
Tensor |
torch.Tensor: The fused tensor [B, D] |
Source code in slp/modules/mmlatch.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
encoded: List[torch.Tensor] = self._encode(*mods, lengths=lengths)
if self.feedback is not None:
mods_feedback: List[torch.Tensor] = self.feedback(
mods, encoded, lengths=lengths
)
encoded = self._encode(*mods_feedback, lengths=lengths)
fused = self._fuse(*encoded, lengths=lengths)
return fused
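Example (illustrative; not part of slp/modules/mmlatch.py). An end-to-end sketch of the encoder; the MMLatchClassifier documented below combines this encoder with dropout and a classification head:

import torch

from slp.modules.mmlatch import MMLatch

model = MMLatch(
    text_size=300,
    audio_size=74,
    visual_size=35,
    hidden_size=100,
    feedback=True,
    feedback_algorithm="rnn",
)

text = torch.randn(8, 20, 300)
audio = torch.randn(8, 20, 74)
visual = torch.randn(8, 20, 35)
lengths = torch.full((8,), 20, dtype=torch.long)

fused = model(text, audio, visual, lengths=lengths)  # fused representation, [8, D]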
fuser_cfg(**cfg)
staticmethod
Static method to create the fuser configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The fuser configuration |
Source code in slp/modules/mmlatch.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
"""Static method to create the fuser configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The fuser configuration
"""
return {
"n_modalities": 3,
"dropout": cfg.get("dropout", 0.2),
"output_size": cfg.get("hidden_size", 100),
"hidden_size": cfg.get("hidden_size", 100),
"fusion_method": "attention",
"timesteps_pooling_method": "rnn",
"residual": cfg.get("residual", True),
"use_all_trimodal": cfg.get("use_all_trimodal", True),
}
MMLatchClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/mmlatch.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out