Multimodal Modules

We include strong baselines for multimodal fusion, along with implementations of state-of-the-art papers.

Fusers

This module contains the implementation of basic fusion algorithms and fusion pipelines.

Unless otherwise stated, the fusers accept an arbitrary number of input modalities and are geared towards sequential inputs.

A fusion pipeline generally consists of three stages:

  • Pre-fuse processing: perform operations common to all input modalities (e.g. project them to a common dimension).
  • Fuser: fuse all modality representations into a single vector (e.g. concatenate all modality features with CatFuser).
  • Timesteps pooling: aggregate the fused features of all timesteps into a single vector (e.g. sum all timesteps with SumPooler).
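A minimal end-to-end sketch of such a pipeline, using the ProjectFuseAggregate helper documented below (the batch size, sequence length and MOSEI-like feature sizes are illustrative assumptions):

import torch
from slp.modules.fuse import ProjectFuseAggregate

text = torch.rand(8, 20, 300)   # (B, L, D_text)
audio = torch.rand(8, 20, 74)   # (B, L, D_audio)
visual = torch.rand(8, 20, 35)  # (B, L, D_visual)

pipeline = ProjectFuseAggregate(
    modality_sizes=[300, 74, 35],    # pre-fuse: project each modality ...
    projection_size=100,             # ... to a common feature size
    projection_type="linear",
    fusion_method="cat",             # fuser: concatenate the projected features
    timesteps_pooling_method="sum",  # timesteps pooling: sum over timesteps
)
fused = pipeline(text, audio, visual)  # (B, pipeline.out_size)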

SUPPORTED_FUSERS: Mapping[str, Type[slp.modules.fuse.BaseFuser]]

Currently implemented fusers

SUPPORTED_POOLERS: Mapping[str, Type[slp.modules.fuse.BaseTimestepsPooler]]

Supported poolers

AttentionFuser

__init__(self, feature_size, n_modalities, use_all_trimodal=False, residual=True, dropout=0.1, **kwargs) special

Fuse all combinations of three modalities using attention modules

If the input modalities are a, t, v, then the output is

o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [ g(a, f(t,v)) ]

where f is a TwowayAttention module, g is an Attention module, and the terms in [] are optional

Parameters:

Name Type Description Default
feature_size int

Number of feature dimensions

required
n_modalities int

Number of input modalities (should be 3)

required
use_all_trimodal bool

Use all optional trimodal combinations

False
residual bool

Use residual connection in TwowayAttention. Defaults to True

True
dropout float

Dropout probability

0.1
Source code in slp/modules/fuse.py
def __init__(
    self,
    feature_size: int,
    n_modalities: int,
    use_all_trimodal: bool = False,
    residual: bool = True,
    dropout: float = 0.1,
    **kwargs,
):
    """Fuse all combinations of three modalities using a base module using bilinear fusion

    If input modalities are a, t, v, then the output is

    Where f is TwowayAttention and g is Attention modules and values with [] are optional
    o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]

    Args:
        feature_size (int): Number of feature dimensions
        n_modalities (int): Number of input modalities (should be 3)
        use_all_trimodal (bool): Use all optional trimodal combinations
        residual (bool): Use residual connection in TwowayAttention. Defaults to True
        dropout (float): Dropout probability
    """
    kwargs["dropout"] = dropout
    kwargs["residual"] = residual
    super(AttentionFuser, self).__init__(
        feature_size,
        n_modalities,
        use_all_trimodal=use_all_trimodal,
        **kwargs,
    )

fuse(self, *mods, lengths=None)

Perform attention fusion on input modalities

Parameters:

Name Type Description Default
*mods Tensor

Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]

()
lengths Optional[torch.Tensor]

Unpadded tensors lengths

None

Returns:

Type Description
Tensor

torch.Tensor: fused output vector [B, L, 7D] or [B, L, 9D]

Source code in slp/modules/fuse.py
def fuse(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Perform attention fusion on input modalities

    Args:
        *mods: Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]
        lengths (Optional[torch.Tensor]): Unpadded tensors lengths

    Returns:
        torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D]

    """
    txt, au, vi = mods
    ta, at = self.ta(txt, au)
    va, av = self.va(vi, au)
    tv, vt = self.tv(txt, vi)

    va = va + av
    tv = vt + tv
    ta = ta + at

    tav, _ = self.tav(txt, queries=va)

    out_list = [txt, au, vi, ta, tv, va, tav]

    if self.use_all_trimodal:
        vat, _ = self.vat(vi, queries=ta)
        atv, _ = self.atv(au, queries=tv)

        out_list = out_list + [vat, atv]

    # B x L x 7*D or B x L x 9*D
    fused = torch.cat(out_list, dim=-1)

    return fused
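A brief usage sketch (the feature size, batch size and sequence length are illustrative assumptions):

import torch
from slp.modules.fuse import AttentionFuser

fuser = AttentionFuser(feature_size=100, n_modalities=3)
txt, au, vi = (torch.rand(8, 20, 100) for _ in range(3))  # (B, L, D) each
out = fuser(txt, au, vi)  # (B, L, 7 * D); (B, L, 9 * D) if use_all_trimodal=True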

BaseFuser

out_size: int property readonly

Output feature size.

Each fuser specifies its output feature size

__init__(self, feature_size, n_modalities, **extra_kwargs) special

Base fuser class.

Our fusion methods are separated into direct and combinatorial. An example of direct fusion is concatenation, where the feature vectors of N modalities are concatenated into a fused vector. When performing combinatorial fusion all crossmodal relations are examined (e.g. text -> audio, text -> visual, audio -> visual etc.). In the current implementation, combinatorial fusion is implemented for 3 input modalities.

Parameters:

Name Type Description Default
feature_size int

Assume all modality representations have the same feature_size

required
n_modalities int

Number of input modalities

required
**extra_kwargs dict

Extra keyword arguments to maintain interoperability of child classes

{}
Source code in slp/modules/fuse.py
def __init__(
    self,
    feature_size: int,
    n_modalities: int,
    **extra_kwargs,
):
    """Base fuser class.

    Our fusion methods are separated into direct and combinatorial.
    An example of direct fusion is concatenation, where the feature vectors of N modalities
    are concatenated into a fused vector.
    When performing combinatorial fusion all crossmodal relations are examined (e.g. text -> audio,
    text -> visual, audio -> visual etc.)
    In the current implementation, combinatorial fusion is implemented for 3 input modalities

    Args:
        feature_size (int): Assume all modality representations have the same feature_size
        n_modalities (int): Number of input modalities
        **extra_kwargs (dict): Extra keyword arguments to maintain interoperability of children
            classes
    """
    super(BaseFuser, self).__init__()
    self.feature_size = feature_size
    self.n_modalities = n_modalities

forward(self, *mods, lengths=None)

Fuse the modality representations

Parameters:

Name Type Description Default
*mods Tensor

List of modality tensors [B, L, D]

()
lengths Optional[torch.Tensor]

Lengths of each modality

None

Returns:

Type Description
Tensor

torch.Tensor: Fused tensor [B, L, self.out_size]

Source code in slp/modules/fuse.py
def forward(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Fuse the modality representations

    Args:
        *mods: List of modality tensors [B, L, D]
        lengths (Optional[Tensor]): Lengths of each modality

    Returns:
        torch.Tensor: Fused tensor [B, L, self.out_size]
    """
    fused = self.fuse(*mods, lengths=lengths)

    return fused

fuse(self, *mods, lengths=None)

Abstract method to fuse the modality representations

Children classes should implement this method

Parameters:

Name Type Description Default
*mods Tensor

List of modality tensors

()
lengths Optional[torch.Tensor]

Lengths of each modality

None

Returns:

Type Description
Tensor

torch.Tensor: Fused tensor

Source code in slp/modules/fuse.py
@abstractmethod
def fuse(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Abstract method to fuse the modality representations

    Children classes should implement this method

    Args:
        *mods: List of modality tensors
        lengths (Optional[Tensor]): Lengths of each modality

    Returns:
        torch.Tensor: Fused tensor
    """
    pass
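As a sketch of this extension point, a child class only needs to implement fuse and out_size. ProdFuser below is a hypothetical example, not part of slp; it assumes all modalities share the same feature size:

import torch
from typing import Optional
from slp.modules.fuse import BaseFuser

class ProdFuser(BaseFuser):
    """Hypothetical fuser: elementwise product of the modality representations."""

    @property
    def out_size(self) -> int:
        return self.feature_size  # elementwise product keeps the feature size

    def fuse(self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None) -> torch.Tensor:
        fused = mods[0]
        for m in mods[1:]:
            fused = fused * m
        return fused

fuser = ProdFuser(feature_size=100, n_modalities=2)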

BaseFusionPipeline

out_size: int property readonly

Define the feature size of the returned tensor

Returns:

Type Description
int

int: The feature dimension of the output tensor

__init__(self, *args, **kwargs) special

Base class for a fusion pipeline

Inherit this class to implement a fusion pipeline

Source code in slp/modules/fuse.py
def __init__(self, *args, **kwargs):
    """Base class for a fusion pipeline

    Inherit this class to implement a fusion pipeline

    """
    super(BaseFusionPipeline, self).__init__()

BaseTimestepsPooler

out_size: int property readonly

Define the feature size of the returned tensor

Returns:

Type Description
int

int: The feature dimension of the output tensor

__init__(self, feature_size, batch_first=True, **kwargs) special

Abstract base class for Timesteps Poolers

Timesteps Poolers aggregate the features for different timesteps

Given a tensor with dimensions [BatchSize, Length, Dim] they return an aggregated tensor with dimensions [BatchSize, Dim]

Parameters:

Name Type Description Default
feature_size int

Feature dimension

required
batch_first bool

Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing

True
**kwargs

Variable keyword arguments for subclasses

{}
Source code in slp/modules/fuse.py
def __init__(self, feature_size: int, batch_first: bool = True, **kwargs):
    """Abstract base class for Timesteps Poolers

    Timesteps Poolers aggregate the features for different timesteps

    Given a tensor with dimensions [BatchSize, Length, Dim]
    they return an aggregated tensor with dimensions [BatchSize, Dim]


    Args:
        feature_size (int): Feature dimension
        batch_first (bool): Input tensors are in batch first configuration. Leave this as true
            except if you know what you are doing
        **kwargs: Variable keyword arguments for subclasses
    """
    super(BaseTimestepsPooler, self).__init__()
    self.pooling_dim = 0 if not batch_first else 1
    self.feature_size = feature_size

forward(self, x, lengths=None)

Pool features of input tensor across timesteps

Parameters:

Name Type Description Default
x Tensor

[B, L, D] Input sequence

required
lengths Optional[torch.Tensor]

Optional unpadded sequence lengths for input tensor

None

Returns:

Type Description
Tensor

torch.Tensor: [B, D] Output aggregated features across timesteps

Source code in slp/modules/fuse.py
def forward(
    self, x: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Pool features of input tensor across timesteps

    Args:
        x (torch.Tensor): [B, L, D] Input sequence
        lengths (Optional[torch.Tensor]): Optional unpadded sequence lengths for input tensor

    Returns:
        torch.Tensor: [B, D] Output aggregated features across timesteps
    """

    if x.ndim == 2:
        return x

    if x.ndim != 3:
        raise ValueError("Expected 3 dimensional tensor [B, L, D] or [L, B, D]")

    return self._pool(x, lengths=lengths)

BilinearFuser

__init__(self, feature_size, n_modalities, use_all_trimodal=False, **kwargs) special

Fuse all combinations of three modalities using bilinear fusion

If the input modalities are a, t, v, then the output is

o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [ g(a, f(t,v)) ]

where f and g are nn.Bilinear modules and the terms in [] are optional

Parameters:

Name Type Description Default
feature_size int

Number of feature dimensions

required
n_modalities int

Number of input modalities (should be 3)

required
use_all_trimodal bool

Use all optional trimodal combinations

False
Source code in slp/modules/fuse.py
def __init__(
    self,
    feature_size: int,
    n_modalities: int,
    use_all_trimodal: bool = False,
    **kwargs,
):
    """Fuse all combinations of three modalities using a base module using bilinear fusion

    If input modalities are a, t, v, then the output is
    o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]

    Where f and g are the nn.Bilinear function and values with [] are optional

    Args:
        feature_size (int): Number of feature dimensions
        n_modalities (int): Number of input modalities (should be 3)
        use_all_trimodal (bool): Use all optional trimodal combinations
    """
    super(BilinearFuser, self).__init__(
        feature_size,
        n_modalities,
        use_all_trimodal=use_all_trimodal,
        **kwargs,
    )

fuse(self, *mods, lengths=None)

Perform bilinear fusion on input modalities

Parameters:

Name Type Description Default
*mods Tensor

Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]

()
lengths Optional[torch.Tensor]

Unpadded tensors lengths

None

Returns:

Type Description
Tensor

torch.Tensor: fused output vector [B, L, 7D] or [B, L, 9D]

Source code in slp/modules/fuse.py
def fuse(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Perform bilinear fusion on input modalities

    Args:
        *mods: Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]
        lengths (Optional[torch.Tensor]): Unpadded tensors lengths

    Returns:
        torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D]

    """
    txt, au, vi = mods
    ta = self.ta(txt, au)
    va = self.va(vi, au)
    tv = self.tv(txt, vi)

    tav = self.tav(txt, va)

    out_list = [txt, au, vi, ta, tv, va, tav]

    if self.use_all_trimodal:
        vat = self.vat(vi, ta)
        atv = self.atv(au, tv)

        out_list = out_list + [vat, atv]

    # B x L x 7*D or B x L x 9*D
    fused = torch.cat(out_list, dim=-1)

    return fused
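A usage sketch analogous to AttentionFuser (sizes are illustrative assumptions):

import torch
from slp.modules.fuse import BilinearFuser

fuser = BilinearFuser(feature_size=100, n_modalities=3)
txt, au, vi = (torch.rand(8, 20, 100) for _ in range(3))  # (B, L, D) each
out = fuser(txt, au, vi)  # (B, L, 7 * D)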

BimodalAttentionFuser

fuse(self, *mods, lengths=None)

Perform attention fusion on input modalities

Parameters:

Name Type Description Default
*mods Tensor

Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]

()
lengths Optional[torch.Tensor]

Unpadded tensors lengths

None

Returns:

Type Description
Tensor

torch.Tensor: fused output vector [B, L, 3*D]

Source code in slp/modules/fuse.py
def fuse(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Perform attention fusion on input modalities

    Args:
        *mods: Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]
        lengths (Optional[torch.Tensor]): Unpadded tensors lengths

    Returns:
        torch.Tensor: fused output vector [B, L, 3*D]

    """
    x, y = mods
    xy, yx = self.xy(x, y)
    xy = xy + yx
    # B x L x 3*D
    fused = torch.cat([x, y, xy], dim=-1)

    return fused

BimodalBilinearFuser

fuse(self, *mods, lengths=None)

Perform bilinear fusion on input modalities

Parameters:

Name Type Description Default
*mods Tensor

Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]

()
lengths Optional[torch.Tensor]

Unpadded tensors lengths

None

Returns:

Type Description
Tensor

torch.Tensor: fused output vector [B, L, 3*D]

Source code in slp/modules/fuse.py
def fuse(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Perform bilinear fusion on input modalities

    Args:
        *mods: Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]
        lengths (Optional[torch.Tensor]): Unpadded tensors lengths

    Returns:
        torch.Tensor: fused output vector [B, L, 3*D]

    """
    x, y = mods
    xy = self.xy(x, y)

    # B x L x 3*D
    fused = torch.cat([x, y, xy], dim=-1)

    return fused

BimodalCombinatorialFuser

out_size: int property readonly

Fused vector feature dimension

Returns:

Type Description
int

int: 3 * feature_size

__init__(self, feature_size, n_modalities, **kwargs) special

Fuse two input modalities using a base module

If input modalities are x, y, then the output is o = x || y || f(x, y)

Where f is a network module (e.g. attention)

Parameters:

Name Type Description Default
feature_size int

Number of feature dimensions

required
n_modalities int

Number of input modalities (should be 2)

required
Source code in slp/modules/fuse.py
def __init__(
    self,
    feature_size: int,
    n_modalities: int,
    **kwargs,
):
    """Fuse all combinations of three modalities using a base module

    If input modalities are x, y, then the output is
    o = x || y || f(x, y)

    Where f is a network module (e.g. attention)

    Args:
        feature_size (int): Number of feature dimensions
        n_modalities (int): Number of input modalities (should be 2)
    """
    super(BimodalCombinatorialFuser, self).__init__(
        feature_size, n_modalities, **kwargs
    )
    self._check_n_modalities(n=2)
    self.xy = self._bimodal_fusion_module(feature_size, **kwargs)

CatFuser

Fuse by concatenating modality representations

o = m1 || m2 || m3 ...

Parameters:

Name Type Description Default
feature_size int

Assume all modality representations have the same feature_size

required
n_modalities int

Number of input modalities

required
**extra_kwargs dict

Extra keyword arguments to maintain interoperability of child classes

required

out_size: int property readonly

d_out = n_modalities * d_in

Returns:

Type Description
int

int: output feature size

fuse(self, *mods, lengths=None)

Concatenate input tensors into a single tensor

Examples:

    fuser = CatFuser(5, 2)
    x = torch.rand(16, 6, 5)  # (B, L, D)
    y = torch.rand(16, 6, 5)  # (B, L, D)
    out = fuser(x, y)  # (B, L, 2 * D)

Parameters:

Name Type Description Default
*mods Tensor

Variable number of input tensors

()

Returns:

Type Description
Tensor

torch.Tensor: Concatenated input tensors

Source code in slp/modules/fuse.py
def fuse(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Concatenate input tensors into a single tensor

    Example:
        fuser = CatFuser(5, 2)
        x = torch.rand(16, 6, 5)  # (B, L, D)
        y = torch.rand(16, 6, 5)  # (B, L, D)
        out = fuser(x, y)  # (B, L, 2 * D)

    Args:
        *mods: Variable number of input tensors

    Returns:
        torch.Tensor: Concatenated input tensors

    """

    return torch.cat(mods, dim=-1)

Conv1dProjection

__init__(self, modality_sizes, projection_size, kernel_size=1, padding=0, bias=False) special

Project features for N modalities using 1D convolutions

Parameters:

Name Type Description Default
modality_sizes List[int]

List of number of features for each modality. E.g. for MOSEI: [300, 74, 35]

required
projection_size int

Output number of features for each modality

required
kernel_size int

Convolution kernel size

1
padding int

Amount of convolution padding

0
bias bool

Use bias in convolutional layers

False
Source code in slp/modules/fuse.py
def __init__(
    self,
    modality_sizes: List[int],
    projection_size: int,
    kernel_size: int = 1,
    padding: int = 0,
    bias: bool = False,
):
    """Project features for N modalities using 1D convolutions

    Args:
        modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
            [300, 74, 35]
        projection_size (int): Output number of features for each modality
        kernel_size (int): Convolution kernel size
        padding (int): Amount of convolution padding
        bias (bool): Use bias in convolutional layers
    """
    super(Conv1dProjection, self).__init__()
    self.p = nn.ModuleList(
        [
            nn.Conv1d(
                sz,
                projection_size,
                kernel_size=kernel_size,
                padding=padding,
                bias=bias,
            )
            for sz in modality_sizes
        ]
    )

forward(self, *mods)

Project modality representations to a given number of features using Conv1d layers

Examples:

    # Inputs:
    #    text: (B, L, 300)
    #    audio: (B, L, 74)
    #    visual: (B, L, 35)
    # Outputs:
    #    text_p: (B, L, 100)
    #    audio_p: (B, L, 100)
    #    visual_p: (B, L, 100)
    c_proj = Conv1dProjection([300, 74, 35], 100)
    text_p, audio_p, visual_p = c_proj(text, audio, visual)

Parameters:

Name Type Description Default
*mods Tensor

Variable length tensors list

()

Returns:

Type Description
List[torch.Tensor]

List[torch.Tensor]: Variable length projected tensors list

Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
    """Project modality representations to a given number of features using Conv1d layers
    Example:
        # Inputs:
        #    text: (B, L, 300)
        #    audio: (B, L, 74)
        #    visual: (B, L, 35)
        # Outputs:
        #    text_p: (B, L, 100)
        #    audio_p: (B, L, 100)
        #    visual_p: (B, L, 100)
        c_proj = Conv1dProjection([300, 74, 35], 100)
        text_p, audio_p, visual_p = c_proj(text, audio, visual)

    Args:
        *mods: Variable length tensors list

    Returns:
        List[torch.Tensor]: Variable length projected tensors list
    """
    mods_o: List[torch.Tensor] = [
        self.p[i](m.transpose(1, 2)).transpose(1, 2) for i, m in enumerate(mods)
    ]

    return mods_o

FuseAggregateTimesteps

out_size: int property readonly

Define the feature size of the returned tensor

Returns:

Type Description
int

int: The feature dimension of the output tensor

__init__(self, feature_size, n_modalities, output_size=None, fusion_method='cat', timesteps_pooling_method='sum', batch_first=True, **fuser_kwargs) special

Fuse input feature sequences and aggregate across timesteps

Fuser -> TimestepsPooler

Parameters:

Name Type Description Default
feature_size int

The input modality representations dimension

required
n_modalities int

Number of input modalities

required
output_size Optional[int]

Required output size. If not provided, output_size = fuser.out_size

None
fusion_method str

Select which fuser to use [cat|sum|attention|bilinear]

'cat'
timesteps_pooling_method str

TimestepsPooler method [cat|sum|rnn]

'sum'
batch_first bool

Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing

True
**fuser_kwargs dict

Extra keyword arguments to instantiate fuser

{}
Source code in slp/modules/fuse.py
def __init__(
    self,
    feature_size: int,
    n_modalities: int,
    output_size: Optional[int] = None,
    fusion_method: str = "cat",
    timesteps_pooling_method: str = "sum",
    batch_first: bool = True,
    **fuser_kwargs,
):
    """Fuse input feature sequences and aggregate across timesteps

    Fuser -> TimestepsPooler

    Args:
        feature_size (int): The input modality representations dimension
        n_modalities (int): Number of input modalities
        output_size (Optional[int]): Required output size. If not provided,
            output_size = fuser.out_size
        fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
        timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
        batch_first (bool): Input tensors are in batch first configuration. Leave this as true
            except if you know what you are doing
        **fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
    """

    super(FuseAggregateTimesteps, self).__init__(
        feature_size, n_modalities, fusion_method=fusion_method
    )
    self.fuser = make_fuser(
        fusion_method, feature_size, n_modalities, **fuser_kwargs
    )
    output_size = (  # bidirectional RNN: hidden size fused_size / 2 yields fused_size outputs
        output_size if output_size is not None else self.fuser.out_size // 2
    )
    self.timesteps_pooler = TimestepsPooler(
        self.fuser.out_size,
        hidden_size=output_size,
        mode=timesteps_pooling_method,
        batch_first=batch_first,
    )

forward(self, *mods, lengths=None)

Fuse the modality representations and aggregate across timesteps

Parameters:

Name Type Description Default
*mods Tensor

List of modality tensors [B, L, D]

()
lengths Optional[torch.Tensor]

Lengths of each modality

None

Returns:

Type Description
Tensor

torch.Tensor: Fused tensor [B, self.out_size]

Source code in slp/modules/fuse.py
def forward(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Fuse the modality representations and aggregate across timesteps

    Args:
        *mods: List of modality tensors [B, L, D]
        lengths (Optional[Tensor]): Lengths of each modality

    Returns:
        torch.Tensor: Fused tensor [B, self.out_size]
    """
    fused = self.fuser(*mods, lengths=lengths)
    out: torch.Tensor = self.timesteps_pooler(fused, lengths=lengths)

    return out
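A usage sketch (sizes and methods are illustrative assumptions):

import torch
from slp.modules.fuse import FuseAggregateTimesteps

fuse_agg = FuseAggregateTimesteps(
    feature_size=100,
    n_modalities=3,
    fusion_method="cat",             # Fuser
    timesteps_pooling_method="sum",  # TimestepsPooler
)
mods = [torch.rand(8, 20, 100) for _ in range(3)]  # (B, L, D) each
out = fuse_agg(*mods)  # (B, fuse_agg.out_size)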

LinearProjection

__init__(self, modality_sizes, projection_size, bias=True) special

Project features for N modalities using feedforward layers

Parameters:

Name Type Description Default
modality_sizes List[int]

List of number of features for each modality. E.g. for MOSEI: [300, 74, 35]

required
projection_size int

Output number of features for each modality

required
bias bool

Use bias in feedforward layers

True
Source code in slp/modules/fuse.py
def __init__(
    self, modality_sizes: List[int], projection_size: int, bias: bool = True
):
    """Project features for N modalities using feedforward layers

    Args:
        modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
            [300, 74, 35]
        projection_size (int): Output number of features for each modality
        bias (bool): Use bias in feedforward layers
    """
    super(LinearProjection, self).__init__()
    self.p = nn.ModuleList(
        [nn.Linear(sz, projection_size, bias=bias) for sz in modality_sizes]
    )

forward(self, *mods)

Project modality representations to a given number of features using Linear layers

Examples:

    # Inputs:
    #    text: (B, L, 300)
    #    audio: (B, L, 74)
    #    visual: (B, L, 35)
    # Outputs:
    #    text_p: (B, L, 100)
    #    audio_p: (B, L, 100)
    #    visual_p: (B, L, 100)
    l_proj = LinearProjection([300, 74, 35], 100)
    text_p, audio_p, visual_p = l_proj(text, audio, visual)

Parameters:

Name Type Description Default
*mods Tensor

Variable length tensor list

()

Returns:

Type Description
List[torch.Tensor]

List[torch.Tensor]: Variable length projected tensors list

Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
    """Project modality representations to a given number of features using Linear layers
    Example:
        # Inputs:
        #    text: (B, L, 300)
        #    audio: (B, L, 74)
        #    visual: (B, L, 35)
        # Outputs:
        #    text_p: (B, L, 100)
        #    audio_p: (B, L, 100)
        #    visual_p: (B, L, 100)
        l_proj = LinearProjection([300, 74, 35], 100)
        text_p, audio_p, visual_p = l_proj(text, audio, visual)

    Args:
        *mods: Variable length tensor list

    Returns:
        List[torch.Tensor]: Variable length projected tensors list
    """
    mods_o: List[torch.Tensor] = [self.p[i](m) for i, m in enumerate(mods)]

    return mods_o

ModalityProjection

__init__(self, modality_sizes, projection_size, kernel_size=1, padding=0, bias=True, mode=None) special

Adapter module to project features for N modalities using 1D convolutions or feedforward

Parameters:

Name Type Description Default
modality_sizes List[int]

List of number of features for each modality. E.g. for MOSEI: [300, 74, 35]

required
projection_size int

Output number of features for each modality

required
kernel_size int

Convolution kernel size. Used when mode=="conv"

1
padding int

Amount of convolution padding. Used when mode=="conv"

0
bias bool

Use bias

True
mode Optional[str]

Projection method.
    linear -> LinearProjection
    conv|conv1d|convolutional -> Conv1dProjection

None
Source code in slp/modules/fuse.py
def __init__(
    self,
    modality_sizes: List[int],
    projection_size: int,
    kernel_size: int = 1,
    padding: int = 0,
    bias: bool = True,
    mode: Optional[str] = None,
):
    """Adapter module to project features for N modalities using 1D convolutions or feedforward

    Args:
        modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
            [300, 74, 35]
        projection_size (int): Output number of features for each modality
        kernel_size (int): Convolution kernel size. Used when mode=="conv"
        padding (int): Amount of convolution padding. Used when mode=="conv"
        bias (bool): Use bias
        mode (Optional[str]): Projection method.
            linear -> LinearProjection
            conv|conv1d|convolutional -> Conv1dProjection
    """
    super(ModalityProjection, self).__init__()

    if mode is None:
        self.p: Optional[Union[LinearProjection, Conv1dProjection]] = None
    elif mode == "linear":
        self.p = LinearProjection(modality_sizes, projection_size, bias=bias)
    elif mode == "conv" or mode == "conv1d" or mode == "convolutional":
        self.p = Conv1dProjection(
            modality_sizes,
            projection_size,
            kernel_size=kernel_size,
            padding=padding,
            bias=bias,
        )
    else:
        raise ValueError(
            "Supported mode=[linear|conv|conv1d|convolutional]."
            "conv, conv1d and convolutional are equivalent."
        )

forward(self, *mods)

Project modality representations to a given number of features

Examples:

    # Inputs:
    #    text: (B, L, 300)
    #    audio: (B, L, 74)
    #    visual: (B, L, 35)
    # Outputs:
    #    text_p: (B, L, 100)
    #    audio_p: (B, L, 100)
    #    visual_p: (B, L, 100)
    l_proj = ModalityProjection([300, 74, 35], 100, mode="linear")
    text_p, audio_p, visual_p = l_proj(text, audio, visual)

Examples:

    # Inputs:
    #    text: (B, L, 300)
    #    audio: (B, L, 74)
    #    visual: (B, L, 35)
    # Outputs:
    #    text_p: (B, L, 300)
    #    audio_p: (B, L, 74)
    #    visual_p: (B, L, 35)
    l_proj = ModalityProjection([300, 74, 35], 100, mode=None)
    text_p, audio_p, visual_p = l_proj(text, audio, visual)

Parameters:

Name Type Description Default
*mods Tensor

Variable length tensor list

()

Returns:

Type Description
List[torch.Tensor]

List[torch.Tensor]: Variable length projected tensors list

Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
    """Project modality representations to a given number of features
    Example:
        # Inputs:
        #    text: (B, L, 300)
        #    audio: (B, L, 74)
        #    visual: (B, L, 35)
        # Outputs:
        #    text_p: (B, L, 100)
        #    audio_p: (B, L, 100)
        #    visual_p: (B, L, 100)
        l_proj = ModalityProjection([300, 74, 35], 100, mode="linear")
        text_p, audio_p, visual_p = l_proj(text, audio, visual)

    Example:
        # Inputs:
        #    text: (B, L, 300)
        #    audio: (B, L, 74)
        #    visual: (B, L, 35)
        # Outputs:
        #    text_p: (B, L, 300)
        #    audio_p: (B, L, 74)
        #    visual_p: (B, L, 35)
        l_proj = ModalityProjection([300, 74, 35], 100, mode=None)
        text_p, audio_p, visual_p = l_proj(text, audio, visual)


    Args:
        *mods: Variable length tensor list

    Returns:
        List[torch.Tensor]: Variable length projected tensors list
    """

    if self.p is None:
        return list(mods)
    mods_o: List[torch.Tensor] = self.p(*mods)

    return mods_o

ModalityWeights

__init__(self, feature_size) special

Multiply each modality features with a learnable weight

i: modality index
learnable_weight[i] = softmax(Linear(modality_features[i]))
output_modality[i] = learnable_weight[i] * modality_features[i]

Parameters:

Name Type Description Default
feature_size int

All modalities are assumed to be projected into a space with the same number of features.

required
Source code in slp/modules/fuse.py
def __init__(self, feature_size: int):
    """Multiply each modality features with a learnable weight

    i: modality index
    learnable_weight[i] = softmax(Linear(modality_features[i]))
    output_modality[i] = learnable_weight * modality_features[i]

    Args:
        feature_size (int): All modalities are assumed to be projected into a space with the same
            number of features.

    """
    super(ModalityWeights, self).__init__()

    self.mod_w = nn.Linear(feature_size, 1)

forward(self, *mods)

Use learnable weights to multiply modality features

Examples:

    # Inputs:
    #    text: (B, L, 100)
    #    audio: (B, L, 100)
    #    visual: (B, L, 100)
    # Outputs:
    #    text_p: (B, L, 100)
    #    audio_p: (B, L, 100)
    #    visual_p: (B, L, 100)
    mw = ModalityWeights(100)
    text_w, audio_w, visual_w = mw(text, audio, visual)

The operation is summarized as:

    w_x = softmax(W * x + b)
    w_y = softmax(W * y + b)
    x_out = w_x * x
    y_out = w_y * y

Parameters:

Name Type Description Default
*mods Tensor

Variable length tensor list

()

Returns:

Type Description
List[torch.Tensor]

List[torch.Tensor]: Variable length reweighted tensors list

Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
    """Use learnable weights to multiply modality features

    Example:
        # Inputs:
        #    text: (B, L, 100)
        #    audio: (B, L, 100)
        #    visual: (B, L, 100)
        # Outputs:
        #    text_p: (B, L, 100)
        #    audio_p: (B, L, 100)
        #    visual_p: (B, L, 100)
        mw = ModalityWeights(100)
        text_w, audio_w, visual_w = mw(text, audio, visual)

    The operation is summarized as:

    w_x = softmax(W * x + b)
    w_y = softmax(W * y + b)
    x_out = w_x * x
    y_out = w_y * y

    Args:
        *mods: Variable length tensor list

    Returns:
        List[torch.Tensor]: Variable length reweighted tensors list
    """
    weight = self.mod_w(torch.cat([x.unsqueeze(1) for x in mods], dim=1))
    weight = F.softmax(weight, dim=1)
    mods_o: List[torch.Tensor] = [m * weight[:, i, ...] for i, m in enumerate(mods)]

    return mods_o

ProjectFuseAggregate

out_size: int property readonly

Define the feature size of the returned tensor

Returns:

Type Description
int

int: The feature dimension of the output tensor

__init__(self, modality_sizes, projection_size, projection_type=None, fusion_method='cat', timesteps_pooling_method='sum', modality_weights=False, batch_first=True, **fuser_kwargs) special

Project input feature sequences, fuse and aggregate across timesteps

ModalityProjection -> Optional[ModalityWeights] -> Fuser -> TimestepsPooler

Parameters:

Name Type Description Default
modality_sizes List[int]

List of input modality representations dimensions

required
projection_size int

Project all modalities to have this feature size

required
projection_type Optional[str]

Optional projection method [linear|conv]

None
fusion_method str

Select which fuser to use [cat|sum|attention|bilinear]

'cat'
timesteps_pooling_method str

TimestepsPooler method [cat|sum|rnn]

'sum'
modality_weights bool

Multiply projected modality representations with learnable weights. Default value is False.

False
batch_first bool

Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing

True
**fuser_kwargs dict

Extra keyword arguments to instantiate fuser

{}
Source code in slp/modules/fuse.py
def __init__(
    self,
    modality_sizes: List[int],
    projection_size: int,
    projection_type: Optional[str] = None,
    fusion_method="cat",
    timesteps_pooling_method="sum",
    modality_weights: bool = False,
    batch_first: bool = True,
    **fuser_kwargs,
):
    """Project input feature sequences, fuse and aggregate across timesteps

    ModalityProjection -> Optional[ModalityWeights] -> Fuser -> TimestepsPooler

    Args:
        modality_sizes (List[int]): List of input modality representations dimensions
        projection_size (int): Project all modalities to have this feature size
        projection_type (Optional[str]): Optional projection method [linear|conv]
        fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
        timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
        modality_weights (bool): Multiply projected modality representations with learnable
            weights. Default value is False.
        batch_first (bool): Input tensors are in batch first configuration. Leave this as true
            except if you know what you are doing
        **fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
    """
    super(ProjectFuseAggregate, self).__init__()
    n_modalities = len(modality_sizes)

    self.projection = None
    self.modality_weights = None

    if projection_type is not None:
        self.projection = ModalityProjection(
            modality_sizes, projection_size, mode=projection_type
        )

        if modality_weights:
            self.modality_weights = ModalityWeights(projection_size)

    fuser_kwargs["output_size"] = projection_size
    fuser_kwargs["fusion_method"] = fusion_method
    fuser_kwargs["timesteps_pooling_method"] = timesteps_pooling_method
    fuser_kwargs["batch_first"] = batch_first

    if "n_modalities" in fuser_kwargs:
        del fuser_kwargs["n_modalities"]

    if "projection_size" in fuser_kwargs:
        del fuser_kwargs["projection_size"]

    self.fuse_aggregate = FuseAggregateTimesteps(
        projection_size,
        n_modalities,
        **fuser_kwargs,
    )

forward(self, *mods, lengths=None)

Project modality representations to a common dimension, fuse and aggregate across timesteps

Optionally use modality weights

Parameters:

Name Type Description Default
*mods Tensor

List of modality tensors [B, L, D]

()
lengths Optional[torch.Tensor]

Lengths of each modality

None

Returns:

Type Description
Tensor

torch.Tensor: Fused tensor [B, self.out_size]

Source code in slp/modules/fuse.py
def forward(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Project modality representations to a common dimension, fuse and aggregate across timesteps

    Optionally use modality weights

    Args:
        *mods: List of modality tensors [B, L, D]
        lengths (Optional[Tensor]): Lengths of each modality

    Returns:
        torch.Tensor: Fused tensor [B, self.out_size]
    """

    if self.projection is not None:
        mods = self.projection(*mods)

    if self.modality_weights is not None:
        mods = self.modality_weights(*mods)
    fused: torch.Tensor = self.fuse_aggregate(*mods, lengths=lengths)

    return fused

RnnPooler

out_size: int property readonly

Define the feature size of the returned tensor

Returns:

Type Description
int

int: The feature dimension of the output tensor

__init__(self, feature_size, hidden_size=None, batch_first=True, bidirectional=True, merge_bi='cat', attention=True, **kwargs) special

Aggregate features of the input tensor using an AttentiveRNN

Parameters:

Name Type Description Default
feature_size int

Feature dimension

required
hidden_size Optional[int]

Hidden dimension

None
batch_first bool

Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing

True
bidirectional bool

Use bidirectional RNN. Defaults to True

True
merge_bi str

How bidirectional states are merged. Defaults to "cat"

'cat'
attention bool

Use attention for the RNN output states

True
**kwargs

Variable keyword arguments

{}
Source code in slp/modules/fuse.py
def __init__(
    self,
    feature_size: int,
    hidden_size: Optional[int] = None,
    batch_first: bool = True,
    bidirectional: bool = True,
    merge_bi: str = "cat",
    attention: bool = True,
    **kwargs,
):
    """Aggregate features of the input tensor using an AttentiveRNN

    Args:
        feature_size (int): Feature dimension
        hidden_size (int): Hidden dimension
        batch_first (bool): Input tensors are in batch first configuration. Leave this as true
            except if you know what you are doing
        bidirectional (bool): Use bidirectional RNN. Defaults to True
        merge_bi (str): How bidirectional states are merged. Defaults to "cat"
        attention (bool): Use attention for the RNN output states
        **kwargs: Variable keyword arguments
    """
    super(RnnPooler, self).__init__(feature_size, batch_first=batch_first, **kwargs)
    self.hidden_size = hidden_size if hidden_size is not None else feature_size
    self.rnn = AttentiveRNN(
        feature_size,
        hidden_size=self.hidden_size,
        batch_first=batch_first,
        bidirectional=bidirectional,
        merge_bi=merge_bi,
        attention=attention,
        return_hidden=False,  # We want to aggregate all hidden states.
    )

SumFuser

Fuse by adding modality representations

o = m1 + m2 + m3 ...

Parameters:

Name Type Description Default
feature_size int

Assume all modality representations have the same feature_size

required
n_modalities int

Number of input modalities

required
**extra_kwargs dict

Extra keyword arguments to maintain interoperability of child classes

required

out_size: int property readonly

d_out = d_in

Returns:

Type Description
int

int: output feature size

fuse(self, *mods, lengths=None)

Sum input tensors into a single tensor

Examples:

    fuser = SumFuser(5, 2)
    x = torch.rand(16, 6, 5)  # (B, L, D)
    y = torch.rand(16, 6, 5)  # (B, L, D)
    out = fuser(x, y)  # (B, L, D)

Parameters:

Name Type Description Default
*mods Tensor

Variable number of input tensors

()

Returns:

Type Description
Tensor

torch.Tensor: Summed input tensors

Source code in slp/modules/fuse.py
def fuse(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Sum input tensors into a single tensor

    Example:
        fuser = SumFuser(5, 2)
        x = torch.rand(16, 6, 5)  # (B, L, D)
        y = torch.rand(16, 6, 5)  # (B, L, D)
        out = fuser(x, y)  # (B, L, D)

    Args:
        *mods: Variable number of input tensors

    Returns:
        torch.Tensor: Summed input tensors

    """

    return torch.cat([m.unsqueeze(-1) for m in mods], dim=-1).sum(-1)

TimestepsPooler

out_size: int property readonly

Define the feature size of the returned tensor

Returns:

Type Description
int

int: The feature dimension of the output tensor

__init__(self, feature_size, mode='sum', batch_first=True, **kwargs) special

Aggregate features from all timesteps into a single representation.

Four methods are supported:
    sum: Sum features from all timesteps
    mean: Average features from all timesteps
    max: Max pool features from all timesteps
    rnn: Use the output from an attentive RNN

Parameters:

Name Type Description Default
feature_size int

The number of features for the input fused representations

required
batch_first bool

Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing

True
mode str

The timestep pooling method
    sum: Sum hidden states
    mean: Average hidden states
    max: Max pool features from all hidden states
    rnn: Use the output of an Attentive RNN

'sum'
Source code in slp/modules/fuse.py
def __init__(
    self, feature_size: int, mode: str = "sum", batch_first=True, **kwargs
):
    """Aggregate features from all timesteps into a single representation.

    Four methods are supported:
        sum: Sum features from all timesteps
        mean: Average features from all timesteps
        max: Max pool features from all timesteps
        rnn: Use the output from an attentive RNN

    Args:
        feature_size (int): The number of features for the input fused representations
        batch_first (bool): Input tensors are in batch first configuration. Leave this as true
            except if you know what you are doing
        mode (str): The timestep pooling method
            sum: Sum hidden states
            mean: Average hidden states
            max: Max pool features from all hidden states
            rnn: Use the output of an Attentive RNN
    """
    super(TimestepsPooler, self).__init__(
        feature_size, batch_first=batch_first, **kwargs
    )
    assert (
        mode is None or mode in SUPPORTED_POOLERS
    ), f"Unsupported timestep pooling method. Available methods: {SUPPORTED_POOLERS.keys()}"

    self.pooler = None

    if mode is not None:
        self.pooler = SUPPORTED_POOLERS[mode](
            feature_size, batch_first=batch_first, **kwargs
        )
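A short usage sketch (sizes are illustrative assumptions):

import torch
from slp.modules.fuse import TimestepsPooler

pooler = TimestepsPooler(feature_size=100, mode="sum")
x = torch.rand(8, 20, 100)  # (B, L, D)
pooled = pooler(x)          # (B, D): features summed across the L timesteps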

TrimodalCombinatorialFuser

out_size: int property readonly

Fused vector feature dimension

Returns:

Type Description
int

int: 7 * feature_size if use_all_trimodal==False else 9*feature_size

__init__(self, feature_size, n_modalities, use_all_trimodal=False, **kwargs) special

Fuse all combinations of three modalities using a base module

If input modalities are a, t, v, then the output is o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]

where f and g are network modules (e.g. attention) and the terms in [] are optional

Parameters:

Name Type Description Default
feature_size int

Number of feature dimensions

required
n_modalities int

Number of input modalities (should be 3)

required
use_all_trimodal bool

Use all optional trimodal combinations

False
Source code in slp/modules/fuse.py
def __init__(
    self,
    feature_size: int,
    n_modalities: int,
    use_all_trimodal: bool = False,
    **kwargs,
):
    """Fuse all combinations of three modalities using a base module

    If input modalities are a, t, v, then the output is
    o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]

    where f and g are network modules (e.g. attention) and the terms in [] are optional

    Args:
        feature_size (int): Number of feature dimensions
        n_modalities (int): Number of input modalities (should be 3)
        use_all_trimodal (bool): Use all optional trimodal combinations
    """
    super(TrimodalCombinatorialFuser, self).__init__(
        feature_size, n_modalities, **kwargs
    )
    self._check_n_modalities(n=3)
    self.use_all_trimodal = use_all_trimodal

    self.ta = self._bimodal_fusion_module(feature_size, **kwargs)
    self.va = self._bimodal_fusion_module(feature_size, **kwargs)
    self.tv = self._bimodal_fusion_module(feature_size, **kwargs)

    self.tav = self._trimodal_fusion_module(feature_size, **kwargs)

    if use_all_trimodal:
        self.vat = self._trimodal_fusion_module(feature_size, **kwargs)
        self.atv = self._trimodal_fusion_module(feature_size, **kwargs)

make_fuser(fusion_method, feature_size, n_modalities, **kwargs)

Helper function to instantiate a fuser given a string fusion_method parameter

Parameters:

Name Type Description Default
fusion_method str

One of the supported fusion methods [cat|add|bilinear|attention]

required
feature_size int

The input modality representations dimension

required
n_modalities int

Number of input modalities

required
**kwargs

Variable keyword arguments to pass to the instantiated fuser

{}
Source code in slp/modules/fuse.py
def make_fuser(fusion_method: str, feature_size: int, n_modalities: int, **kwargs):
    """Helper function to instantiate a fuser given a string fusion_method parameter

    Args:
        fusion_method (str): One of the supported fusion methods [cat|add|bilinear|attention]
        feature_size (int): The input modality representations dimension
        n_modalities (int): Number of input modalities
        **kwargs: Variable keyword arguments to pass to the instantiated fuser
    """

    if fusion_method not in SUPPORTED_FUSERS.keys():
        raise NotImplementedError(
            f"The supported fusers are {SUPPORTED_FUSERS.keys()}. You provided {fusion_method}"
        )

    if fusion_method == "bilinear":
        if n_modalities == 2:
            return BimodalBilinearFuser(feature_size, n_modalities, **kwargs)
        elif n_modalities == 3:
            return BilinearFuser(feature_size, n_modalities, **kwargs)
        else:
            raise ValueError("bilinear implemented for 2 or 3 modalities")

    if fusion_method == "attention":
        if n_modalities == 2:
            return BimodalAttentionFuser(feature_size, n_modalities, **kwargs)
        elif n_modalities == 3:
            return AttentionFuser(feature_size, n_modalities, **kwargs)
        else:
            raise ValueError("attention implemented for 2 or 3 modalities")

    return SUPPORTED_FUSERS[fusion_method](feature_size, n_modalities, **kwargs)
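A dispatch sketch (the feature size and modality counts are illustrative assumptions):

from slp.modules.fuse import make_fuser

att_fuser = make_fuser("attention", feature_size=100, n_modalities=3)  # -> AttentionFuser
cat_fuser = make_fuser("cat", feature_size=100, n_modalities=2)        # -> SUPPORTED_FUSERS["cat"]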

Multimodal encoders

These modules implement mid- and late-fusion. A multimodal encoder generally consists of:

  • N Unimodal encoders (e.g. RNNs), where N is the number of input modalities
  • A fusion pipeline

We furthermore implement Multimodal classifiers, which consist of a multimodal encoder followed by an nn.Linear layer.

MultimodalBaseline deserves special mention. It consists of RNN encoders followed by an attention fuser and an RNN timesteps pooler, and is tuned on CMU-MOSEI. The default configuration is provided through static methods and achieves strong performance. A usage sketch is shown below.
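A hedged usage sketch of the baseline encoder; the modality order (text, audio, visual) and the single lengths tensor follow the constructor and BaseEncoder.forward documented below, and the batch and sequence sizes are illustrative assumptions:

import torch
from slp.modules.multimodal import MultimodalBaseline

enc = MultimodalBaseline()      # MOSEI-like defaults: 300/74/35 input features, hidden size 100
text = torch.rand(8, 20, 300)   # (B, L, D_text)
audio = torch.rand(8, 20, 74)   # (B, L, D_audio)
visual = torch.rand(8, 20, 35)  # (B, L, D_visual)
lengths = torch.full((8,), 20)  # unpadded sequence lengths
fused = enc(text, audio, visual, lengths=lengths)  # (B, enc.out_size)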

AudioEncoder

Alias for Unimodal Encoder

AudioTextClassifier

forward(self, mod_dict, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note: Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/multimodal.py
def forward(
    self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    mods = [mod_dict["text"], mod_dict["audio"]]
    fused = self.enc(*mods, lengths=lengths["text"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out

AudioVisualClassifier

forward(self, mod_dict, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note: Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/multimodal.py
def forward(
    self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    mods = [mod_dict["visual"], mod_dict["audio"]]
    fused = self.enc(*mods, lengths=lengths["visual"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out

BaseEncoder

out_size: int property readonly

An encoder returns its output size

Returns:

Type Description
int

int: The output feature size of the encoder

__init__(self, *args, **kwargs) special

Base class implementing a multimodal encoder

A BaseEncoder child encodes and fuses the modality features and returns representations ready to be provided to a classification layer

Source code in slp/modules/multimodal.py
def __init__(self, *args, **kwargs):
    """Base class implementing a multimodal encoder

    A BaseEncoder child encodes and fuses the modality  features
    and returns representations ready to be provided to a classification layer
    """
    super(BaseEncoder, self).__init__()
    self.args = args
    self.kwargs = kwargs
    self.clf = None

forward(self, *mods, lengths=None)

Encode + fuse

Parameters:

Name Type Description Default
*mods Tensor

Variable input modality tensors [B, L, D]

()
lengths Optional[torch.Tensor]

The unpadded tensor lengths. Defaults to None.

None

Returns:

Type Description
Tensor

torch.Tensor: The fused tensor [B, D]

Source code in slp/modules/multimodal.py
def forward(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Encode + fuse

    Args:
        *mods (torch.Tensor): Variable input modality tensors [B, L, D]
        lengths (Optional[torch.Tensor], optional): The unpadded tensor lengths. Defaults to None.

    Returns:
        torch.Tensor: The fused tensor [B, D]
    """
    encoded: List[torch.Tensor] = self._encode(*mods, lengths=lengths)
    fused = self._fuse(*encoded, lengths=lengths)

    return fused

BimodalEncoder

out_size: int property readonly

Output feature size

Returns:

Type Description
int

int: Output feature size

__init__(self, encoder1_args, encoder2_args, fuser_args, **kwargs) special

Two modality encoder

Encode + Fuse two input modalities

Example encoder_args:

    {
        "input_size": 35,
        "hidden_size": 100,
        "layers": 1,
        "bidirectional": True,
        "dropout": 0.2,
        "rnn_type": "lstm",
        "attention": True,
    }

Example fuser_args:

    {
        "n_modalities": 3,
        "dropout": 0.2,
        "output_size": 100,
        "hidden_size": 100,
        "fusion_method": "cat",
        "timesteps_pooling_method": "rnn",
    }

Parameters:

Name Type Description Default
encoder1_args Dict[str, Any]

Configuration for first encoder

required
encoder2_args Dict[str, Any]

Configuration for second encoder

required
fuser_args Dict[str, Any]

Configuration for fuser

required
Source code in slp/modules/multimodal.py
def __init__(
    self,
    encoder1_args: Dict[str, Any],
    encoder2_args: Dict[str, Any],
    fuser_args: Dict[str, Any],
    **kwargs,
):
    """Two modality encoder

    Encode + Fuse two input modalities

    Example encoder_args:
        {
            "input_size": 35,
            "hidden_size": 100,
            "layers": 1,
            "bidirectional": True,
            "dropout": 0.2,
            "rnn_type": "lstm",
            "attention": True,
        }

    Example fuser_args:
        {
            "n_modalities": 3,
            "dropout": 0.2,
            "output_size": 100,
            "hidden_size": 100,
            "fusion_method": "cat",
            "timesteps_pooling_method": "rnn",
        }

    Args:
        encoder1_args (Dict[str, Any]): Configuration for first encoder
        encoder2_args (Dict[str, Any]): Configuration for second encoder
        fuser_args (Dict[str, Any]): Configuration for fuser
    """
    super(BimodalEncoder, self).__init__(
        encoder1_args,
        encoder2_args,
        fuser_args,
        **kwargs,
    )
    self.input_projection = None

    if "input_projection" in fuser_args and fuser_args["input_projection"]:
        self.input_projection = ModalityProjection(
            [encoder1_args["input_size"], encoder2_args["input_size"]],
            fuser_args["hidden_size"],
            mode=fuser_args["input_projection"],
        )

    encoder1_args["return_hidden"] = True
    encoder2_args["return_hidden"] = True

    self.encoder1 = UnimodalEncoder(**encoder1_args)

    self.encoder2 = UnimodalEncoder(**encoder2_args)

    self.fuse = self._make_fusion_pipeline(
        [self.encoder1.out_size, self.encoder2.out_size], **fuser_args
    )

GloveEncoder

Alias for Unimodal Encoder

MOSEIClassifier

__init__(self, encoder, num_classes, dropout=0.2) special

Encode and classify multimodal inputs

Parameters:

Name Type Description Default
encoder BaseEncoder

The encoder module

required
num_classes int

The number of target classes

required
dropout float

Dropout probability

0.2
Source code in slp/modules/multimodal.py
def __init__(self, encoder: BaseEncoder, num_classes: int, dropout: float = 0.2):
    """Encode and classify multimodal inputs

    Args:
        encoder (BaseEncoder): The encoder module
        num_classes (int): The number of target classes
        dropout (float): Dropout probability

    """
    super(MOSEIClassifier, self).__init__()
    self.enc = encoder
    self.drop = nn.Dropout(p=dropout)
    self.clf = nn.Linear(self.enc.out_size, num_classes)
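A construction sketch, combining the classifier with the baseline encoder documented below (num_classes=1 is an illustrative assumption, e.g. for regression-style MOSEI sentiment):

from slp.modules.multimodal import MOSEIClassifier, MultimodalBaseline

enc = MultimodalBaseline()
model = MOSEIClassifier(enc, num_classes=1)  # nn.Linear(enc.out_size, 1) on top of the fused vector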

MultimodalBaseline

__init__(self, text_size=300, audio_size=74, visual_size=35, hidden_size=100, dropout=0.2, encoder_layers=1, bidirectional=True, merge_bi='sum', rnn_type='lstm', encoder_attention=True, fuser_residual=True, use_all_trimodal=False) special

Multimodal baseline architecture

This baseline is composed of three unimodal RNNs followed by an Attention Fuser and an RNN timestep pooler. The default configuration is tuned for good performance on MOSEI.

Parameters:

Name Type Description Default
text_size int

Text input size. Defaults to 300.

300
audio_size int

Audio input size. Defaults to 74.

74
visual_size int

Visual input size. Defaults to 35.

35
hidden_size int

Hidden dimension. Defaults to 100.

100
dropout float

Dropout rate. Defaults to 0.2.

0.2
encoder_layers float

Number of encoder layers. Defaults to 1.

1
bidirectional bool

Use bidirectional RNNs. Defaults to True.

True
merge_bi str

Bidirectional merging method in the encoders. Defaults to "sum".

'sum'
rnn_type str

RNN type [lstm|gru]. Defaults to "lstm".

'lstm'
encoder_attention bool

Use attention in the encoder RNNs. Defaults to True.

True
fuser_residual bool

Use a ViLBERT-like residual in the attention fuser. Defaults to True.

True
use_all_trimodal bool

Use all trimodal interactions for the Attention fuser. Defaults to False.

False
Source code in slp/modules/multimodal.py
def __init__(
    self,
    text_size: int = 300,
    audio_size: int = 74,
    visual_size: int = 35,
    hidden_size: int = 100,
    dropout: float = 0.2,
    encoder_layers: float = 1,
    bidirectional: bool = True,
    merge_bi: str = "sum",
    rnn_type: str = "lstm",
    encoder_attention: bool = True,
    fuser_residual: bool = True,
    use_all_trimodal: bool = False,
):
    """Multimodal baseline architecture

    This baseline is composed of three unimodal RNNs followed by an Attention Fuser and an RNN timestep pooler.
    The default configuration is tuned for good performance on MOSEI.

    Args:
        text_size (int, optional): Text input size. Defaults to 300.
        audio_size (int, optional): Audio input size. Defaults to 74.
        visual_size (int, optional): Visual input size. Defaults to 35.
        hidden_size (int, optional): Hidden dimension. Defaults to 100.
        dropout (float, optional): Dropout rate. Defaults to 0.2.
        encoder_layers (float, optional): Number of encoder layers. Defaults to 1.
        bidirectional (bool, optional): Use bidirectional RNNs. Defaults to True.
        merge_bi (str, optional): Bidirectional merging method in the encoders. Defaults to "sum".
        rnn_type (str, optional): RNN type [lstm|gru]. Defaults to "lstm".
        encoder_attention (bool, optional): Use attention in the encoder RNNs. Defaults to True.
        fuser_residual (bool, optional): Use a ViLBERT-like residual in the attention fuser. Defaults to True
        use_all_trimodal (bool, optional): Use all trimodal interactions for the Attention fuser. Defaults to False.
    """
    cfg = {
        "hidden_size": hidden_size,
        "dropout": dropout,
        "layers": encoder_layers,
        "attention": encoder_attention,
        "bidirectional": bidirectional,
        "rnn_type": rnn_type,
        "merge_bi": merge_bi,
    }

    text_cfg = MultimodalBaseline.encoder_cfg(text_size, **cfg)
    audio_cfg = MultimodalBaseline.encoder_cfg(audio_size, **cfg)
    visual_cfg = MultimodalBaseline.encoder_cfg(visual_size, **cfg)
    fuser_cfg = MultimodalBaseline.fuser_cfg(
        hidden_size=hidden_size,
        dropout=dropout,
        residual=fuser_residual,
        use_all_trimodal=use_all_trimodal,
    )

    super(MultimodalBaseline, self).__init__(
        text_cfg, audio_cfg, visual_cfg, fuser_cfg
    )
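
A usage sketch with dummy MOSEI-sized inputs; the import path, tensor shapes, and the forward call (which follows the *mods, lengths=... pattern shown for MMLatch further down) are assumptions:

import torch

from slp.modules.multimodal import MultimodalBaseline

model = MultimodalBaseline(hidden_size=100, use_all_trimodal=True)

batch_size, seq_len = 8, 20
text = torch.randn(batch_size, seq_len, 300)    # 300-d text features
audio = torch.randn(batch_size, seq_len, 74)    # 74-d audio features
visual = torch.randn(batch_size, seq_len, 35)   # 35-d visual features
lengths = torch.full((batch_size,), seq_len, dtype=torch.long)

fused = model(text, audio, visual, lengths=lengths)  # expected shape: [batch_size, hidden_size]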

encoder_cfg(input_size, **cfg) staticmethod

Static method to create the encoder configuration

The default configuration is provided here. It corresponds to the official paper implementation and is tuned for CMU MOSEI.

Parameters:

Name Type Description Default
input_size int

Input modality size

required
**cfg

Optional keyword arguments

{}

Returns:

Type Description
Dict[str, Any]

Dict[str, Any]: The encoder configuration

Source code in slp/modules/multimodal.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
    """Static method to create the encoder configuration

    The default configuration is provided here
    This configuration corresponds to the official paper implementation
    and is tuned for CMU MOSEI.

    Args:
        input_size (int): Input modality size
        **cfg: Optional keyword arguments

    Returns:
        Dict[str, Any]: The encoder configuration
    """
    return {
        "input_size": input_size,
        "hidden_size": cfg.get("hidden_size", 100),
        "layers": cfg.get("layers", 1),
        "bidirectional": cfg.get("bidirectional", True),
        "dropout": cfg.get("dropout", 0.2),
        "rnn_type": cfg.get("rnn_type", "lstm"),
        "attention": cfg.get("attention", True),
        "merge_bi": cfg.get("merge_bi", "sum"),
    }

fuser_cfg(**cfg) staticmethod

Static method to create the fuser configuration

The default configuration is provided here. It corresponds to the official paper implementation and is tuned for CMU MOSEI.

Parameters:

Name Type Description Default
**cfg

Optional keyword arguments

{}

Returns:

Type Description
Dict[str, Any]

Dict[str, Any]: The fuser configuration

Source code in slp/modules/multimodal.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
    """Static method to create the fuser configuration

    The default configuration is provided here
    This configuration corresponds to the official paper implementation
    and is tuned for CMU MOSEI.

    Args:
        **cfg: Optional keyword arguments

    Returns:
        Dict[str, Any]: The fuser configuration
    """
    return {
        "n_modalities": 3,
        "dropout": cfg.get("dropout", 0.2),
        "output_size": cfg.get("hidden_size", 100),
        "hidden_size": cfg.get("hidden_size", 100),
        "fusion_method": "attention",
        "timesteps_pooling_method": "rnn",
        "residual": cfg.get("residual", True),
        "use_all_trimodal": cfg.get("use_all_trimodal", True),
    }
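
A short sketch combining the two static configuration builders (import path assumed as above); any keyword not passed falls back to the documented MOSEI-tuned default:

from slp.modules.multimodal import MultimodalBaseline

text_cfg = MultimodalBaseline.encoder_cfg(300, hidden_size=128, rnn_type="gru")
fuser_cfg = MultimodalBaseline.fuser_cfg(hidden_size=128, residual=False)

assert text_cfg["hidden_size"] == 128              # overridden
assert text_cfg["layers"] == 1                     # documented default
assert fuser_cfg["fusion_method"] == "attention"   # fixed by fuser_cfg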

MultimodalBaselineClassifier

forward(self, mod_dict, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

.. note:: Although the recipe for forward pass needs to be defined within this function, one should call the :class:Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/multimodal.py
def forward(
    self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
    fused = self.enc(*mods, lengths=lengths["text"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out
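
An illustrative batch layout for this forward pass; constructing the classifier via MOSEIClassifier and the encoder's call signature are assumptions, while the dict keys and the use of lengths["text"] follow the forward body shown above:

import torch

from slp.modules.multimodal import MOSEIClassifier, MultimodalBaseline

clf = MOSEIClassifier(MultimodalBaseline(), num_classes=1)

batch_size, seq_len = 4, 25
mod_dict = {
    "text": torch.randn(batch_size, seq_len, 300),
    "audio": torch.randn(batch_size, seq_len, 74),
    "visual": torch.randn(batch_size, seq_len, 35),
}
lengths = {"text": torch.full((batch_size,), seq_len, dtype=torch.long)}

# Mirror the forward body shown above
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = clf.enc(*mods, lengths=lengths["text"])
out = clf.clf(clf.drop(fused))  # [batch_size, num_classes]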

TrimodalClassifier

forward(self, mod_dict, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

.. note:: Although the recipe for forward pass needs to be defined within this function, one should call the :class:Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/multimodal.py
def forward(
    self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
    fused = self.enc(*mods, lengths=lengths["text"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out

TrimodalEncoder

out_size: int property readonly

Output feature size

Returns:

Type Description
int

int: Output feature size

__init__(self, encoder1_args, encoder2_args, encoder3_args, fuser_args, **kwargs) special

Three modality encoder

Encode + Fuse three input modalities

Example encoder_args: { "input_size": 35, "hidden_size": 100, "layers": 1, "bidirectional": True, "dropout": 0.2, "rnn_type": "lstm", "attention": True, }

Example fuser_args: { "n_modalities": 3, "dropout": 0.2, "output_size": 100, "hidden_size": 100, "fusion_method": "cat", "timesteps_pooling_method": "rnn", }

Parameters:

Name Type Description Default
encoder1_args Dict[str, Any]

Configuration for first encoder

required
encoder2_args Dict[str, Any]

Configuration for second encoder

required
encoder3_args Dict[str, Any]

Configuration for third encoder

required
fuser_args Dict[str, Any]

Configuration for fuser

required
Source code in slp/modules/multimodal.py
def __init__(
    self,
    encoder1_args: Dict[str, Any],
    encoder2_args: Dict[str, Any],
    encoder3_args: Dict[str, Any],
    fuser_args: Dict[str, Any],
    **kwargs,
):
    """Two modality encoder

    Encode + Fuse three input modalities

    Example encoder_args:
        {
            "input_size": 35,
            "hidden_size": 100,
            "layers": 1,
            "bidirectional": True,
            "dropout": 0.2,
            "rnn_type": "lstm",
            "attention": True,
        }

    Example fuser_args:
        {
            "n_modalities": 3,
            "dropout": 0.2,
            "output_size": 100,
            "hidden_size": 100,
            "fusion_method": "cat",
            "timesteps_pooling_method": "rnn",
        }

    Args:
        encoder1_args (Dict[str, Any]): Configuration for first encoder
        encoder2_args (Dict[str, Any]): Configuration for second encoder
        encoder3_args (Dict[str, Any]): Configuration for third encoder
        fuser_args (Dict[str, Any]): Configuration for fuser
    """
    super(TrimodalEncoder, self).__init__(
        encoder1_args,
        encoder2_args,
        encoder3_args,
        fuser_args,
        **kwargs,
    )
    self.input_projection = None

    if "input_projection" in fuser_args and fuser_args["input_projection"]:
        self.input_projection = ModalityProjection(
            [
                encoder1_args["input_size"],
                encoder2_args["input_size"],
                encoder3_args["input_size"],
            ],
            fuser_args["hidden_size"],
            mode=fuser_args["input_projection"],
        )

    self.encoder1 = UnimodalEncoder(**encoder1_args)

    self.encoder2 = UnimodalEncoder(**encoder2_args)

    self.encoder3 = UnimodalEncoder(**encoder3_args)
    # encoder3_args["input_size"], encoder3_args["hidden_size"], **encoder3_args

    self.fuse = self._make_fusion_pipeline(
        [self.encoder1.out_size, self.encoder2.out_size, self.encoder3.out_size],
        **fuser_args,
    )

UnimodalClassifier

__init__(self, input_size, hidden_size, num_classes, layers=1, bidirectional=True, dropout=0.2, rnn_type='lstm', attention=True, **kwargs) special

Encode and classify unimodal inputs

Parameters:

Name Type Description Default
input_size int

The input modality feature size

required
hidden_size int

Hidden size for RNN

required
num_classes int

The number of target classes

required
layers int

Number of RNN layers

1
bidirectional bool

Use biRNN

True
dropout float

Dropout probability

0.2
rnn_type str

[lstm|gru]

'lstm'
attention bool

Use attention on hidden states

True
Source code in slp/modules/multimodal.py
def __init__(
    self,
    input_size: int,
    hidden_size: int,
    num_classes: int,
    layers: int = 1,
    bidirectional: bool = True,
    dropout: float = 0.2,
    rnn_type: str = "lstm",
    attention: bool = True,
    **kwargs,
):
    """Encode and classify unimodal inputs

    Args:
        input_size (int): The input modality feature size
        hidden_size (int): Hidden size for RNN
        num_classes (int): The number of target classes
        layers (int): Number of RNN layers
        bidirectional (bool): Use biRNN
        dropout (float): Dropout probability
        rnn_type (str): [lstm|gru]
        attention (bool): Use attention on hidden states

    """
    enc = UnimodalEncoder(
        input_size,
        hidden_size,
        layers=layers,
        bidirectional=bidirectional,
        dropout=dropout,
        rnn_type=rnn_type,
        attention=attention,
        aggregate_encoded=True,
    )
    super(UnimodalClassifier, self).__init__(enc, num_classes)

forward(self, x, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

.. note:: Although the recipe for forward pass needs to be defined within this function, one should call the :class:Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/multimodal.py
def forward(
    self, x: torch.Tensor, lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    fused = self.enc(x, lengths=lengths["text"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out

UnimodalEncoder

out_size: int property readonly

Output feature size

Returns:

Type Description
int

int: Output feature size

__init__(self, input_size, hidden_size, layers=1, bidirectional=True, dropout=0.2, rnn_type='lstm', attention=True, merge_bi='sum', aggregate_encoded=False, **kwargs) special

Single modality encoder

Encode a single modality using an Attentive RNN

Parameters:

Name Type Description Default
input_size int

Input feature size

required
hidden_size int

RNN hidden size

required
layers int

Number of RNN layers. Defaults to 1.

1
bidirectional bool

Use bidirectional RNN. Defaults to True.

True
dropout float

Dropout probability. Defaults to 0.2.

0.2
rnn_type str

lstm or gru. Defaults to "lstm".

'lstm'
attention bool

Use attention over hidden states. Defaults to True.

True
merge_bi str

How to merge hidden states [sum|cat]. Defaults to sum.

'sum'
aggregate_encoded bool

Aggregate hidden states. Defaults to False.

False
Source code in slp/modules/multimodal.py
def __init__(
    self,
    input_size: int,
    hidden_size: int,
    layers: int = 1,
    bidirectional: bool = True,
    dropout: float = 0.2,
    rnn_type: str = "lstm",
    attention: bool = True,
    merge_bi: str = "sum",
    aggregate_encoded: bool = False,
    **kwargs,
):
    """Single modality encoder

    Encode a single modality using an Attentive RNN

    Args:
        input_size (int): Input feature size
        hidden_size (int): RNN hidden size
        layers (int, optional): Number of RNN layers. Defaults to 1.
        bidirectional (bool, optional): Use bidirectional RNN. Defaults to True.
        dropout (float, optional): Dropout probability. Defaults to 0.2.
        rnn_type (str, optional): lstm or gru. Defaults to "lstm".
        attention (bool, optional): Use attention over hidden states. Defaults to True.
        merge_bi (str, optional): How to merge hidden states [sum|cat]. Defaults to sum.
        aggregate_encoded (bool, optional): Aggregate hidden states. Defaults to False.
    """
    super(UnimodalEncoder, self).__init__(
        input_size,
        hidden_size,
        layers=layers,
        bidirectional=bidirectional,
        dropout=dropout,
        rnn_type=rnn_type,
        attention=attention,
        **kwargs,
    )
    self.aggregate_encoded = aggregate_encoded
    self.encoder = AttentiveRNN(
        input_size,
        hidden_size,
        batch_first=True,
        layers=layers,
        merge_bi=merge_bi,
        bidirectional=bidirectional,
        dropout=dropout,
        rnn_type=rnn_type,
        packed_sequence=True,
        attention=attention,
        return_hidden=True,
    )
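
A small sketch for encoding a single modality; the forward call mirrors how UnimodalClassifier.forward invokes its encoder (import path, shapes, and the call signature are assumptions):

import torch

from slp.modules.multimodal import UnimodalEncoder

enc = UnimodalEncoder(74, 100, bidirectional=True, attention=True, aggregate_encoded=True)

x = torch.randn(8, 20, 74)                         # [B, L, input_size]
lengths = torch.full((8,), 20, dtype=torch.long)
encoded = enc(x, lengths=lengths)                  # aggregated features when aggregate_encoded=True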

VisualEncoder

Alias for Unimodal Encoder

VisualTextClassifier

forward(self, mod_dict, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

.. note:: Although the recipe for forward pass needs to be defined within this function, one should call the :class:Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/multimodal.py
def forward(
    self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    mods = [mod_dict["text"], mod_dict["visual"]]
    fused = self.enc(*mods, lengths=lengths["text"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out

M3

HardMultimodalDropout

__init__(self, p=0.5, n_modalities=3, p_mod=None) special

MMDrop initial implementation

For each sample in a batch, drop one of the modalities with probability p

Parameters:

Name Type Description Default
p float

drop probability

0.5
n_modalities int

number of modalities

3
p_mod Optional[List[float]]

Drop probabilities for each modality

None
Source code in slp/modules/mmdrop.py
def __init__(
    self, p: float = 0.5, n_modalities: int = 3, p_mod: Optional[List[float]] = None
):
    """MMDrop initial implementation

    For each sample in a batch drop one of the modalities with probability p

    Args:
        p (float): drop probability
        n_modalities (int): number of modalities
        p_mod (Optional[List[float]]): Drop probabilities for each modality
    """
    super(HardMultimodalDropout, self).__init__()
    self.p = p
    self.n_modalities = n_modalities

    self.p_mod = [1.0 / n_modalities for _ in range(n_modalities)]

    if p_mod is not None:
        self.p_mod = p_mod

forward(self, *mods)

Naive mmdrop forward

Iterate over batch and randomly choose modality to drop

Parameters:

Name Type Description Default
mods varargs torch.Tensor

[B, L, D_m] Modality representations

()

Returns:

Type Description
(List[torch.Tensor])

The modality representations. Some of them are dropped

Source code in slp/modules/mmdrop.py
def forward(self, *mods):
    """Naive mmdrop forward

    Iterate over batch and randomly choose modality to drop

    Args:
        mods (varargs torch.Tensor): [B, L, D_m] Modality representations

    Returns:
        (List[torch.Tensor]): The modality representations. Some of them are dropped
    """
    mods = list(mods)

    # List of [B, L, D]

    if self.training:
        if random.random() < self.p:
            # Drop different modality for each sample in batch

            for batch in range(mods[0].size(0)):
                m = random.choices(
                    list(range(self.n_modalities)), weights=self.p_mod, k=1
                )[0]

                # m = random.randint(0, self.n_modalities - 1)
                mask = torch.ones_like(mods[m])
                mask[batch] = 0.0
                mods[m] = mods[m] * mask

        if self.p > 0:
            for m in range(len(mods)):
                keep_prob = 1 - (self.p / self.n_modalities)
                mods[m] = mods[m] * (1 / keep_prob)

    return mods
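
A quick sketch of hard multimodal dropout on three dummy modality sequences (import path from the source annotation above; shapes illustrative):

import torch

from slp.modules.mmdrop import HardMultimodalDropout

mmdrop = HardMultimodalDropout(p=0.5, n_modalities=3)
mmdrop.train()  # masks are only applied in training mode

text, audio, visual = (torch.randn(8, 20, 100) for _ in range(3))
out_text, out_audio, out_visual = mmdrop(text, audio, visual)
# With probability p, each sample in the batch has one randomly chosen modality zeroed out;
# during training all modalities are rescaled by 1 / (1 - p / n_modalities) to keep expectations stable.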

MultimodalDropout

__init__(self, p=0.5, n_modalities=3, p_mod=None, mode='hard') special

mmdrop wrapper class

Drop p * 100% of the features of a specific modality over the batch

Parameters:

Name Type Description Default
p float

drop probability

0.5
n_modalities int

number of modalities

3
p_mod Optional[List[float]]

Drop probabilities for each modality

None
mode str

Hard or soft mmdrop

'hard'
Source code in slp/modules/mmdrop.py
def __init__(
    self,
    p: float = 0.5,
    n_modalities: int = 3,
    p_mod: Optional[List[float]] = None,
    mode: str = "hard",
):
    """mmdrop wrapper class

    Drop p * 100 % of features of a specific modality over batch

    Args:
        p (float): drop probability
        n_modalities (int): number of modalities
        p_mod (Optional[List[float]]): Drop probabilities for each modality
        mode (str): Hard or soft mmdrop
    """
    super(MultimodalDropout, self).__init__()

    assert mode in [
        "hard",
        "soft",
    ], "Allowed mode for MultimodalDropout ['hard' | 'soft']"

    if mode == "hard":
        self.mmdrop = HardMultimodalDropout(
            p=p, n_modalities=n_modalities, p_mod=p_mod
        )
    else:
        self.mmdrop = SoftMultimodalDropout(  # type: ignore
            p=p, n_modalities=n_modalities, p_mod=p_mod
        )

forward(self, *mods)

mmdrop wrapper forward

Perform hard or soft mmdrop

Parameters:

Name Type Description Default
mods varargs torch.Tensor

[B, L, D_m] Modality representations

()

Returns:

Type Description
(List[torch.Tensor])

The modality representations. Some of them are dropped

Source code in slp/modules/mmdrop.py
def forward(self, *mods):
    """mmdrop wrapper forward

    Perform hard or soft mmdrop

    Args:
        mods (varargs torch.Tensor): [B, L, D_m] Modality representations

    Returns:
        (List[torch.Tensor]): The modality representations. Some of them are dropped

    """
    return self.mmdrop(*mods)
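
The wrapper simply dispatches to the hard or soft variant; a construction sketch (import path assumed as above):

from slp.modules.mmdrop import MultimodalDropout

hard_mmdrop = MultimodalDropout(p=0.5, n_modalities=3, mode="hard")
soft_mmdrop = MultimodalDropout(p=0.3, n_modalities=3, p_mod=[0.5, 0.25, 0.25], mode="soft")
# Both are called like the hard variant above: masked = hard_mmdrop(text, audio, visual)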

SoftMultimodalDropout

__init__(self, p=0.5, n_modalities=3, p_mod=None) special

Soft mmdrop implementation

Drop p * 100% of the features of a specific modality over the batch

Parameters:

Name Type Description Default
p float

drop probability

0.5
n_modalities int

number of modalities

3
p_mod Optional[List[float]]

Drop probabilities for each modality

None
Source code in slp/modules/mmdrop.py
def __init__(
    self, p: float = 0.5, n_modalities: int = 3, p_mod: Optional[List[float]] = None
):
    """Soft mmdrop implementation

    Drop p * 100 % of features of a specific modality over batch

    Args:
        p (float): drop probability
        n_modalities (int): number of modalities
        p_mod (Optional[List[float]]): Drop probabilities for each modality
    """
    super(SoftMultimodalDropout, self).__init__()
    self.p = p  # p_drop
    self.n_modalities = n_modalities

    self.p_mod = [1.0 / n_modalities for _ in range(n_modalities)]

    if p_mod is not None:
        self.p_mod = p_mod

forward(self, *mods)

Soft mmdrop forward

Sample a binomial mask to mask a random modality in this batch

Parameters:

Name Type Description Default
mods varargs torch.Tensor

[B, L, D_m] Modality representations

()

Returns:

Type Description
(List[torch.Tensor])

The modality representations. Some of them are dropped

Source code in slp/modules/mmdrop.py
def forward(self, *mods):
    """Soft mmdrop forward

    Sample a binomial mask to mask a random modality in this batch

    Args:
        mods (varargs torch.Tensor): [B, L, D_m] Modality representations

    Returns:
        (List[torch.Tensor]): The modality representations. Some of them are dropped
    """
    mods = list(mods)

    if self.training:
        # m = random.randint(0, self.n_modalities - 1)
        m = random.choices(list(range(self.n_modalities)), weights=self.p_mod, k=1)[
            0
        ]

        binomial = torch.distributions.binomial.Binomial(probs=1 - self.p)
        mods[m] = mods[m] * binomial.sample(mods[m].size()).to(mods[m].device)

        for m in range(self.n_modalities):
            mods[m] = mods[m] * (1.0 / (1 - self.p / self.n_modalities))

    return mods

M3

encoder_cfg(input_size, **cfg) staticmethod

Static method to create the encoder configuration

The default configuration is provided here. It corresponds to the official paper implementation and is tuned for CMU MOSEI.

Parameters:

Name Type Description Default
input_size int

Input modality size

required
**cfg

Optional keyword arguments

{}

Returns:

Type Description
Dict[str, Any]

Dict[str, Any]: The encoder configuration

Source code in slp/modules/m3.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
    """Static method to create the encoder configuration

    The default configuration is provided here
    This configuration corresponds to the official paper implementation
    and is tuned for CMU MOSEI.

    Args:
        input_size (int): Input modality size
        **cfg: Optional keyword arguments

    Returns:
        Dict[str, Any]: The encoder configuration
    """
    return {
        "input_size": input_size,
        "hidden_size": cfg.get("hidden_size", 100),
        "layers": cfg.get("layers", 1),
        "bidirectional": cfg.get("bidirectional", True),
        "dropout": cfg.get("dropout", 0.2),
        "rnn_type": cfg.get("rnn_type", "lstm"),
        "attention": cfg.get("attention", True),
    }

fuser_cfg(**cfg) staticmethod

Static method to create the fuser configuration

The default configuration is provided here. It corresponds to the official paper implementation and is tuned for CMU MOSEI.

Parameters:

Name Type Description Default
**cfg

Optional keyword arguments

{}

Returns:

Type Description
Dict[str, Any]

Dict[str, Any]: The fuser configuration

Source code in slp/modules/m3.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
    """Static method to create the fuser configuration

    The default configuration is provided here
    This configuration corresponds to the official paper implementation
    and is tuned for CMU MOSEI.

    Args:
        **cfg: Optional keyword arguments

    Returns:
        Dict[str, Any]: The fuser configuration
    """
    return {
        "n_modalities": 3,
        "dropout": cfg.get("dropout", 0.2),
        "output_size": cfg.get("hidden_size", 100),
        "hidden_size": cfg.get("hidden_size", 100),
        "fusion_method": "attention",
        "timesteps_pooling_method": "rnn",
        "residual": cfg.get("residual", True),
        "use_all_trimodal": cfg.get("use_all_trimodal", True),
        "mmdrop_prob": 0.2,
        "mmdrop_individual_mod_prob": None,
        "mmdrop_algorithm": "hard",
    }

M3Classifier

forward(self, mod_dict, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

.. note:: Although the recipe for forward pass needs to be defined within this function, one should call the :class:Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/m3.py
def forward(
    self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
    fused = self.enc(*mods, lengths=lengths["text"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out

M3FuseAggregate

out_size: int property readonly

Define the feature size of the returned tensor

Returns:

Type Description
int

int: The feature dimension of the output tensor

__init__(self, feature_size, n_modalities, output_size=None, fusion_method='cat', timesteps_pooling_method='sum', batch_first=True, mmdrop_prob=0.2, mmdrop_individual_mod_prob=None, mmdrop_algorithm='hard', **fuser_kwargs) special

Apply MultimodalDropout, fuse the input feature sequences, and aggregate across timesteps

MultimodalDropout -> Fuser -> TimestepsPooler

Parameters:

Name Type Description Default
feature_size int

The input modality representations dimension

required
n_modalities int

Number of input modalities

required
output_size Optional[int]

Required output size. If not provided, output_size = fuser.out_size

None
fusion_method str

Select which fuser to use [cat|sum|attention|bilinear]

'cat'
timesteps_pooling_method str

TimestepsPooler method [cat|sum|rnn]

'sum'
batch_first bool

Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing

True
mmdrop_prob float

The probability for multimodal dropout. Defaults to 0.2

0.2
mmdrop_individual_mod_prob Optional[List[float]]

Drop probabilities for each modality for multimodal dropout. If None all modalities are dropped with equal probability

None
mmdrop_algorithm str

Choose multimodal dropout algorithm [hard|soft]. Defaults to hard

'hard'
**fuser_kwargs dict

Extra keyword arguments to instantiate fuser

{}
Source code in slp/modules/m3.py
def __init__(
    self,
    feature_size: int,
    n_modalities: int,
    output_size: Optional[int] = None,
    fusion_method: str = "cat",
    timesteps_pooling_method: str = "sum",
    batch_first: bool = True,
    mmdrop_prob: float = 0.2,
    mmdrop_individual_mod_prob: Optional[List[float]] = None,
    mmdrop_algorithm: str = "hard",
    **fuser_kwargs,
):
    """MultimodalDropout, Fuse input feature sequences and aggregate across timesteps

    MultimodalDropout -> Fuser -> TimestepsPooler

    Args:
        feature_size (int): The input modality representations dimension
        n_modalities (int): Number of input modalities
        output_size (Optional[int]): Required output size. If not provided,
            output_size = fuser.out_size
        fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
        timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
        batch_first (bool): Input tensors are in batch first configuration. Leave this as true
            except if you know what you are doing
        mmdrop_prob (float): The probability for multimodal dropout. Defaults to 0.2
        mmdrop_individual_mod_prob (Optional[List[float]]): Drop probabilities for each modality
            for multimodal dropout. If None all modalities are dropped with equal probability
        mmdrop_algorithm (str): Choose multimodal dropout algorithm [hard|soft]. Defaults to hard
        **fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
    """
    super(M3FuseAggregate, self).__init__()

    self.m3 = MultimodalDropout(
        p=mmdrop_prob,
        n_modalities=n_modalities,
        p_mod=mmdrop_individual_mod_prob,
        mode=mmdrop_algorithm,
    )

    fuser_kwargs["output_size"] = output_size
    fuser_kwargs["fusion_method"] = fusion_method
    fuser_kwargs["timesteps_pooling_method"] = timesteps_pooling_method
    fuser_kwargs["batch_first"] = batch_first

    if "n_modalities" in fuser_kwargs:
        fuser_kwargs.pop("n_modalities")  # Avoid multiple arguments

    if "projection_size" in fuser_kwargs:
        fuser_kwargs.pop("projection_size")  # Avoid multiple arguments

    self.fuse_aggregate = FuseAggregateTimesteps(
        feature_size,
        n_modalities,
        **fuser_kwargs,
    )

forward(self, *mods, *, lengths=None)

Fuse the modality representations and aggregate across timesteps

Parameters:

Name Type Description Default
*mods Tensor

List of modality tensors [B, L, D]

()
lengths Optional[torch.Tensor]

Lengths of each modality

None

Returns:

Type Description
Tensor

torch.Tensor: Fused tensor [B, self.out_size]

Source code in slp/modules/m3.py
def forward(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Fuse the modality representations and aggregate across timesteps

    Args:
        *mods: List of modality tensors [B, L, D]
        lengths (Optional[Tensor]): Lengths of each modality

    Returns:
        torch.Tensor: Fused tensor [B, self.out_size]
    """
    mods_masked: List[torch.Tensor] = self.m3(*mods)
    fused: torch.Tensor = self.fuse_aggregate(*mods_masked, lengths=lengths)

    return fused
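
A sketch of the MultimodalDropout -> Fuser -> TimestepsPooler pipeline on dummy inputs (import path and shapes are assumptions; the attention fuser settings follow the documented MOSEI defaults):

import torch

from slp.modules.m3 import M3FuseAggregate

fuse = M3FuseAggregate(
    feature_size=100,
    n_modalities=3,
    fusion_method="attention",
    timesteps_pooling_method="rnn",
    mmdrop_prob=0.2,
    mmdrop_algorithm="hard",
)

mods = [torch.randn(8, 20, 100) for _ in range(3)]
lengths = torch.full((8,), 20, dtype=torch.long)
fused = fuse(*mods, lengths=lengths)  # [8, fuse.out_size]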

Multimodal Feedback

BaseFeedbackUnit

__init__(self, top_size, target_size, n_top_modalities, **kwargs) special

Base class for feedback unit

Feedback units are responsible for projecting top-level crossmodal representations to bottom-level features and applying the top-down masks

Parameters:

Name Type Description Default
top_size int

Feature size of the top-level representations

required
target_size int

Feature size of the bottom-level features

required
n_top_modalities int

Number of modalities to use for feedback

required
Source code in slp/modules/feedback.py
def __init__(
    self, top_size: int, target_size: int, n_top_modalities: int, **kwargs
):
    """Base class for feedback unit

    Feedback units are responsible for projecting top-level crossmodal
    representations to bottom-level features and applying the top-down masks

    Args:
        top_size (int): Feature size of the top-level representations
        target_size (int): Feature size of the bottom-level features
        n_top_modalities (int): Number of modalities to use for feedback
    """
    super(BaseFeedbackUnit, self).__init__()
    self.n_ = n_top_modalities

    self.mask_layers = nn.ModuleList(
        [
            self.make_mask_layer(top_size, target_size, **kwargs)
            for _ in range(self.n_)
        ]
    )

forward(self, x_bottom, *mods_top, *, lengths=None)

Apply the top-down masks to the input feature vector

x = x * top_down_mask

Parameters:

Name Type Description Default
x_bottom Tensor

Bottom-level features [B, L, target_size]

required
*mods_top Tensor

Top-level modality representations

()
lengths Optional[torch.Tensor]

Original unpadded tensor lengths. Defaults to None.

None

Returns:

Type Description
Tensor

torch.Tensor: Masked low level feature tensor [B, L, target_size]

Source code in slp/modules/feedback.py
def forward(
    self,
    x_bottom: torch.Tensor,
    *mods_top: torch.Tensor,
    lengths: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """Apply the top-down masks to the input feature vector

    x = x * top_down_mask

    Args:
        x_bottom (torch.Tensor): Bottom-level features [B, L, target_size]
        *mods_top (torch.Tensor): Top-level modality representations
        lengths (Optional[torch.Tensor], optional): Original unpadded tensor lengths. Defaults to None.

    Returns:
        torch.Tensor: Masked low level feature tensor [B, L, target_size]
    """
    mask = self._get_feedback_mask(*mods_top, lengths=lengths)
    x_bottom = x_bottom * mask

    return x_bottom

make_mask_layer(self, top_size, target_size, **kwargs)

Abstract method to instantiate the layer to use for top-down feedback

To be implemented by subclasses

Parameters:

Name Type Description Default
top_size int

Feature size of the top-level representations

required
target_size int

Feature size of the bottom-level features

required
**kwargs

extra configuration for the feedback layer

{}

Returns:

Type Description
Module

nn.Module: The instantiated feedback layer

Source code in slp/modules/feedback.py
@abc.abstractmethod
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
    """Abstract method to instantiate the layer to use for top-down feedback

    To be implemented by subclasses

    Args:
        top_size (int): Feature size of the top-level representations
        target_size (int): Feature size of the bottom-level features
        **kwargs: extra configuration for the feedback layer

    Returns:
        nn.Module: The instantiated feedback layer
    """
    pass

BoomFeedbackUnit

make_mask_layer(self, top_size, target_size, **kwargs)

Use a boom module for top-down projection

A boom module is a two-layer MLP where the inner projection size is much larger than the input and output sizes (similar to the position-wise feedforward in transformers).

Parameters:

Name Type Description Default
top_size int

Feature size of the top-level representations

required
target_size int

Feature size of the bottom-level features

required
**kwargs

extra configuration for the feedback layer

{}

Returns:

Type Description
nn.Module

slp.modules.feedforward.TwoLayer instance

Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size, target_size, **kwargs):
    """Use an boom module for top-down projection

    A boom module is a two-layer MLP where the inner projection size is
    much larger than the input and output sizes (similar to the position-wise feedforward in transformers)

    Args:
        top_size (int): Feature size of the top-level representations
        target_size (int): Feature size of the bottom-level features
        **kwargs: extra configuration for the feedback layer

    Returns:
        nn.Module: slp.modules.feedforward.TwoLayer instance
    """
    return TwoLayer(
        top_size,
        2 * top_size,
        target_size,
        activation=kwargs.get("activation", "gelu"),
        dropout=kwargs.get("dropout", 0.2),
    )

DownUpFeedbackUnit

make_mask_layer(self, top_size, target_size, **kwargs)

Use a down-up module for top-down projection

A down-up module is a two-layer MLP where the inner projection size is much smaller than the input and output sizes (similar to adapters).

Parameters:

Name Type Description Default
top_size int

Feature size of the top-level representations

required
target_size int

Feature size of the bottom-level features

required
**kwargs

extra configuration for the feedback layer

{}

Returns:

Type Description
nn.Module

slp.modules.feedforward.TwoLayer instance

Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size, target_size, **kwargs):
    """Use an down-up module for top-down projection

    A down-up module is a two-layer MLP where the inner projection size is
    much smaller than the input and output sizes (similar to adapters)

    Args:
        top_size (int): Feature size of the top-level representations
        target_size (int): Feature size of the bottom-level features
        **kwargs: extra configuration for the feedback layer

    Returns:
        nn.Module: slp.modules.feedforward.TwoLayer instance
    """
    return TwoLayer(
        top_size,
        top_size // 5,
        target_size,
        activation=kwargs.get("activation", "gelu"),
        dropout=kwargs.get("dropout", 0.2),
    )

Feedback

__init__(self, top_size, bottom_modality_sizes, use_self=False, mask_type='rnn', **kwargs) special

Feedback module

Given a list of low-level features and top-level representations for n modalities:

  • Create top-down masks for each modality
  • Apply top-down masks to the low level features
  • Return masked low-level features

Parameters:

Name Type Description Default
top_size int

Feature size for top-level representations (Common across modalities)

required
bottom_modality_sizes List[int]

List of feature sizes for each low-level modality feature

required
use_self bool

Include the self modality when creating the top-down mask. Defaults to False.

False
mask_type str

Which feedback unit to use [rnn|gated|boom|downup]. Defaults to "rnn".

'rnn'
Source code in slp/modules/feedback.py
def __init__(
    self,
    top_size: int,
    bottom_modality_sizes: List[int],
    use_self: bool = False,
    mask_type: str = "rnn",
    **kwargs,
):
    """Feedback module

    Given a list of low-level features and top-level representations for n modalities:

    * Create top-down masks for each modality
    * Apply top-down masks to the low level features
    * Return masked low-level features

    Args:
        top_size (int): Feature size for top-level representations (Common across modalities)
        bottom_modality_sizes (List[int]): List of feature sizes for each low-level modality feature
        use_self (bool, optional): Include the self modality when creating the top-down mask. Defaults to False.
        mask_type (str, optional): Which feedback unit to use [rnn|gated|boom|downup]. Defaults to "rnn".
    """
    super(Feedback, self).__init__()

    n_top_modalities = len(bottom_modality_sizes)
    self.use_self = use_self

    if not use_self:
        n_top_modalities = n_top_modalities - 1

    self.feedback_units = nn.ModuleList(
        [
            _make_feedback_unit(
                top_size,
                bottom_modality_sizes[i],
                n_top_modalities,
                mask_type=mask_type,
                **kwargs,
            )
            for i in range(len(bottom_modality_sizes))
        ]
    )

forward(self, mods_bottom, mods_top, lengths=None)

Create and apply the top-down masks to mods_bottom

Parameters:

Name Type Description Default
mods_bottom List[torch.Tensor]

Low-level features for each modality

required
mods_top List[torch.Tensor]

High-level representations for each modality

required
lengths Optional[torch.Tensor]

Original unpadded sequence lengths. Defaults to None.

None

Returns:

Type Description
List[torch.Tensor]

List[torch.Tensor]: Masked low level features for each modality

Source code in slp/modules/feedback.py
def forward(
    self,
    mods_bottom: List[torch.Tensor],
    mods_top: List[torch.Tensor],
    lengths: Optional[torch.Tensor] = None,
) -> List[torch.Tensor]:
    """Create and apply the top-down masks to mods_bottom

    Args:
        mods_bottom (List[torch.Tensor]): Low-level features for each modality
        mods_top (List[torch.Tensor]): High-level representations for each modality
        lengths (Optional[torch.Tensor], optional): Original unpadded sequence lengths. Defaults to None.

    Returns:
        List[torch.Tensor]: Masked low level features for each modality
    """
    out = []

    for i, bm in enumerate(mods_bottom):
        top = mods_top if self.use_self else mods_top[:i] + mods_top[i + 1 :]
        masked = self.feedback_units[i](bm, *top, lengths=lengths)
        out.append(masked)

    return out
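
A sketch of top-down feedback over three modalities (import path from the source annotation; feature sizes illustrative). With use_self=False, each bottom-level stream is masked using the other modalities' top-level representations:

import torch

from slp.modules.feedback import Feedback

feedback = Feedback(
    top_size=100,
    bottom_modality_sizes=[300, 74, 35],
    use_self=False,
    mask_type="rnn",
)

batch_size, seq_len = 8, 20
mods_bottom = [torch.randn(batch_size, seq_len, d) for d in (300, 74, 35)]
mods_top = [torch.randn(batch_size, seq_len, 100) for _ in range(3)]
lengths = torch.full((batch_size,), seq_len, dtype=torch.long)

masked_bottom = feedback(mods_bottom, mods_top, lengths=lengths)  # list of three masked tensors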

GatedFeedbackUnit

Apply the feedback mask using a simple gating mechanism

\[x_{bottom} = x_{bottom} \cdot \frac{1}{2} \left[\sigma(W_1 y_{top}) + \sigma(W_2 z_{top})\right]\]

make_mask_layer(self, top_size, target_size, **kwargs)

Use a simple nn.Linear layer for top-down projection

Parameters:

Name Type Description Default
top_size int

Feature size of the top-level representations

required
target_size int

Feature size of the bottom-level features

required
**kwargs

extra configuration for the feedback layer

{}

Returns:

Type Description
Module

nn.Module: nn.Linear instance with dropout

Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
    """Use a simple nn.Linear layer for top-down projection


    Args:
        top_size (int): Feature size of the top-level representations
        target_size (int): Feature size of the bottom-level features
        **kwargs: extra configuration for the feedback layer

    Returns:
        nn.Module: nn.Linear instance with dropout
    """
    return nn.Sequential(
        nn.Linear(top_size, target_size),
        nn.Dropout(p=kwargs.get("dropout", 0.2)),
    )

RNNFeedbackUnit

Apply the feedback mask using top-down RNN layers

\[x_{bottom} = x_{bottom} \cdot \frac{1}{2} \left[\sigma(\mathrm{RNN}(y_{top})) + \sigma(\mathrm{RNN}(z_{top}))\right]\]

make_mask_layer(self, top_size, target_size, **kwargs)

Use an RNN for top-down projection

Parameters:

Name Type Description Default
top_size int

Feature size of the top-level representations

required
target_size int

Feature size of the bottom-level features

required
**kwargs

extra configuration for the feedback layer

{}

Returns:

Type Description
Module

nn.Module: slp.modules.rnn.AttentiveRNN instance

Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
    """Use an RNN for top-down projection


    Args:
        top_size (int): Feature size of the top-level representations
        target_size (int): Feature size of the bottom-level features
        **kwargs: extra configuration for the feedback layer

    Returns:
        nn.Module: slp.modules.rnn.AttentiveRNN instance
    """
    return AttentiveRNN(
        top_size,
        hidden_size=target_size,
        attention=kwargs.get("attention", False),
        dropout=kwargs.get("dropout", 0.2),
        return_hidden=True,
        bidirectional=kwargs.get("bidirectional", False),
        merge_bi="sum",
        rnn_type=kwargs.get("rnn_type", "lstm"),
    )

MMLatch

__init__(self, text_size=300, audio_size=74, visual_size=35, hidden_size=100, dropout=0.2, encoder_layers=1, bidirectional=True, merge_bi='sum', rnn_type='lstm', encoder_attention=True, fuser_residual=True, use_all_trimodal=False, feedback=True, use_self_feedback=False, feedback_algorithm='rnn') special

MMLatch implementation

Multimodal baseline + feedback

Parameters:

Name Type Description Default
text_size int

Text input size. Defaults to 300.

300
audio_size int

Audio input size. Defaults to 74.

74
visual_size int

Visual input size. Defaults to 35.

35
hidden_size int

Hidden dimension. Defaults to 100.

100
dropout float

Dropout rate. Defaults to 0.2.

0.2
encoder_layers float

Number of encoder layers. Defaults to 1.

1
bidirectional bool

Use bidirectional RNNs. Defaults to True.

True
merge_bi str

Bidirectional merging method in the encoders. Defaults to "sum".

'sum'
rnn_type str

RNN type [lstm|gru]. Defaults to "lstm".

'lstm'
encoder_attention bool

Use attention in the encoder RNNs. Defaults to True.

True
fuser_residual bool

Use a ViLBERT-like residual in the attention fuser. Defaults to True.

True
use_all_trimodal bool

Use all trimodal interactions for the Attention fuser. Defaults to False.

False
feedback bool

Use top-down feedback. Defaults to True.

True
use_self_feedback bool

If False, use only crossmodal features for top-down feedback; if True, also include the self modality. Defaults to False.

False
feedback_algorithm str

Feedback module [rnn|boom|gated|downup]. Defaults to "rnn".

'rnn'
Source code in slp/modules/mmlatch.py
def __init__(
    self,
    text_size: int = 300,
    audio_size: int = 74,
    visual_size: int = 35,
    hidden_size: int = 100,
    dropout: float = 0.2,
    encoder_layers: float = 1,
    bidirectional: bool = True,
    merge_bi: str = "sum",
    rnn_type: str = "lstm",
    encoder_attention: bool = True,
    fuser_residual: bool = True,
    use_all_trimodal: bool = False,
    feedback: bool = True,
    use_self_feedback: bool = False,
    feedback_algorithm: str = "rnn",
):
    """MMLatch implementation

    Multimodal baseline + feedback

    Args:
        text_size (int, optional): Text input size. Defaults to 300.
        audio_size (int, optional): Audio input size. Defaults to 74.
        visual_size (int, optional): Visual input size. Defaults to 35.
        hidden_size (int, optional): Hidden dimension. Defaults to 100.
        dropout (float, optional): Dropout rate. Defaults to 0.2.
        encoder_layers (float, optional): Number of encoder layers. Defaults to 1.
        bidirectional (bool, optional): Use bidirectional RNNs. Defaults to True.
        merge_bi (str, optional): Bidirectional merging method in the encoders. Defaults to "sum".
        rnn_type (str, optional): RNN type [lstm|gru]. Defaults to "lstm".
        encoder_attention (bool, optional): Use attention in the encoder RNNs. Defaults to True.
        fuser_residual (bool, optional): Use a ViLBERT-like residual in the attention fuser. Defaults to True.
        use_all_trimodal (bool, optional): Use all trimodal interactions for the Attention fuser. Defaults to False.
        feedback (bool, optional): Use top-down feedback. Defaults to True.
        use_self_feedback (bool, optional): If False, use only crossmodal features for top-down feedback; if True, also include the self modality. Defaults to False.
        feedback_algorithm (str, optional): Feedback module [rnn|boom|gated|downup]. Defaults to "rnn".
    """
    super(MMLatch, self).__init__(
        text_size=text_size,
        audio_size=audio_size,
        visual_size=visual_size,
        hidden_size=hidden_size,
        dropout=dropout,
        encoder_layers=encoder_layers,
        bidirectional=bidirectional,
        merge_bi=merge_bi,
        rnn_type=rnn_type,
        encoder_attention=encoder_attention,
        fuser_residual=fuser_residual,
        use_all_trimodal=use_all_trimodal,
    )

    self.feedback = None

    if feedback:
        self.feedback = Feedback(
            hidden_size,
            [text_size, audio_size, visual_size],
            use_self=use_self_feedback,
            mask_type=feedback_algorithm,
        )

encoder_cfg(input_size, **cfg) staticmethod

Static method to create the encoder configuration

The default configuration is provided here. It corresponds to the official paper implementation and is tuned for CMU MOSEI.

Parameters:

Name Type Description Default
input_size int

Input modality size

required
**cfg

Optional keyword arguments

{}

Returns:

Type Description
Dict[str, Any]

Dict[str, Any]: The encoder configuration

Source code in slp/modules/mmlatch.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
    """Static method to create the encoder configuration

    The default configuration is provided here
    This configuration corresponds to the official paper implementation
    and is tuned for CMU MOSEI.

    Args:
        input_size (int): Input modality size
        **cfg: Optional keyword arguments

    Returns:
        Dict[str, Any]: The encoder configuration
    """
    return {
        "input_size": input_size,
        "hidden_size": cfg.get("hidden_size", 100),
        "layers": cfg.get("layers", 1),
        "bidirectional": cfg.get("bidirectional", True),
        "dropout": cfg.get("dropout", 0.2),
        "rnn_type": cfg.get("rnn_type", "lstm"),
        "attention": cfg.get("attention", True),
    }

forward(self, *mods, *, lengths=None)

Encode + fuse

Parameters:

Name Type Description Default
*mods Tensor

Variable input modality tensors [B, L, D]

()
lengths Optional[torch.Tensor]

The unpadded tensor lengths. Defaults to None.

None

Returns:

Type Description
Tensor

torch.Tensor: The fused tensor [B, D]

Source code in slp/modules/mmlatch.py
def forward(
    self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
    encoded: List[torch.Tensor] = self._encode(*mods, lengths=lengths)

    if self.feedback is not None:
        mods_feedback: List[torch.Tensor] = self.feedback(
            mods, encoded, lengths=lengths
        )
        encoded = self._encode(*mods_feedback, lengths=lengths)

    fused = self._fuse(*encoded, lengths=lengths)

    return fused
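
An end-to-end sketch of MMLatch on dummy MOSEI-sized inputs (import path and shapes are assumptions; the forward signature is the one shown above):

import torch

from slp.modules.mmlatch import MMLatch

model = MMLatch(feedback=True, feedback_algorithm="rnn")

batch_size, seq_len = 8, 20
text = torch.randn(batch_size, seq_len, 300)
audio = torch.randn(batch_size, seq_len, 74)
visual = torch.randn(batch_size, seq_len, 35)
lengths = torch.full((batch_size,), seq_len, dtype=torch.long)

fused = model(text, audio, visual, lengths=lengths)  # [batch_size, hidden_size]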

fuser_cfg(**cfg) staticmethod

Static method to create the fuser configuration

The default configuration is provided here. It corresponds to the official paper implementation and is tuned for CMU MOSEI.

Parameters:

Name Type Description Default
**cfg

Optional keyword arguments

{}

Returns:

Type Description
Dict[str, Any]

Dict[str, Any]: The fuser configuration

Source code in slp/modules/mmlatch.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
    """Static method to create the fuser configuration

    The default configuration is provided here
    This configuration corresponds to the official paper implementation
    and is tuned for CMU MOSEI.

    Args:
        **cfg: Optional keyword arguments

    Returns:
        Dict[str, Any]: The fuser configuration
    """
    return {
        "n_modalities": 3,
        "dropout": cfg.get("dropout", 0.2),
        "output_size": cfg.get("hidden_size", 100),
        "hidden_size": cfg.get("hidden_size", 100),
        "fusion_method": "attention",
        "timesteps_pooling_method": "rnn",
        "residual": cfg.get("residual", True),
        "use_all_trimodal": cfg.get("use_all_trimodal", True),
    }

MMLatchClassifier

forward(self, mod_dict, lengths)

Defines the computation performed at every call.

Should be overridden by all subclasses.

.. note:: Although the recipe for forward pass needs to be defined within this function, one should call the :class:Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Source code in slp/modules/mmlatch.py
def forward(
    self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
    mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
    fused = self.enc(*mods, lengths=lengths["text"])
    fused = self.drop(fused)
    out: torch.Tensor = self.clf(fused)

    return out