Multimodal Modules
We include strong baselines for multimodal fusion, as well as implementations of state-of-the-art papers.
Fusers
This module contains the implementation of basic fusion algorithms and fusion pipelines.
Unless otherwise stated, the fusers support an arbitrary number of input modalities and are geared towards sequential inputs.
A fusion pipeline generally consists of three stages (see the sketch after this list):
- Pre-fuse processing: Apply common operations to all input modalities (e.g. project them to a common dimension).
- Fuser: Fuse all modality representations into a single vector (e.g. concatenate all modality features using CatFuser).
- Timesteps Pooling: Aggregate the fused features over all timesteps into a single vector (e.g. sum all timesteps with SumPooler).
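The sketch below ties the three stages together using the ProjectFuseAggregate pipeline documented further down. The batch size, sequence length and feature sizes are illustrative assumptions, not values fixed by the API.

```python
import torch
from slp.modules.fuse import ProjectFuseAggregate

# Three MOSEI-like modality sequences: text (300), audio (74), visual (35) features
txt = torch.rand(8, 20, 300)    # [B, L, D_text]
au = torch.rand(8, 20, 74)      # [B, L, D_audio]
vi = torch.rand(8, 20, 35)      # [B, L, D_visual]
lengths = torch.full((8,), 20)  # unpadded sequence lengths

# Pre-fuse: project to 100 features, Fuser: concatenate, Pooler: sum over timesteps
pipeline = ProjectFuseAggregate(
    modality_sizes=[300, 74, 35],
    projection_size=100,
    projection_type="linear",
    fusion_method="cat",
    timesteps_pooling_method="sum",
)
fused = pipeline(txt, au, vi, lengths=lengths)  # [B, pipeline.out_size]
```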
SUPPORTED_FUSERS: Mapping[str, Type[slp.modules.fuse.BaseFuser]]
Currently implemented fusers
SUPPORTED_POOLERS: Mapping[str, Type[slp.modules.fuse.BaseTimestepsPooler]]
Supported poolers
AttentionFuser
__init__(self, feature_size, n_modalities, use_all_trimodal=False, residual=True, dropout=0.1, **kwargs)
special
Fuse all combinations of three modalities using attention modules.
If the input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
where f is a TwowayAttention module, g is an Attention module, and terms in [] are optional.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 3) | required |
| use_all_trimodal | bool | Use all optional trimodal combinations | False |
| residual | bool | Use residual connection in TwowayAttention. Defaults to True | True |
| dropout | float | Dropout probability | 0.1 |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
use_all_trimodal: bool = False,
residual: bool = True,
dropout: float = 0.1,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module using bilinear fusion
If input modalities are a, t, v, then the output is
Where f is TwowayAttention and g is Attention modules and values with [] are optional
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
use_all_trimodal (bool): Use all optional trimodal combinations
residual (bool): Use residual connection in TwowayAttention. Defaults to True
dropout (float): Dropout probability
"""
kwargs["dropout"] = dropout
kwargs["residual"] = residual
super(AttentionFuser, self).__init__(
feature_size,
n_modalities,
use_all_trimodal=use_all_trimodal,
**kwargs,
)
fuse(self, *mods, lengths=None)
Perform attention fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 3 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform attention fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D]
"""
txt, au, vi = mods
ta, at = self.ta(txt, au)
va, av = self.va(vi, au)
tv, vt = self.tv(txt, vi)
va = va + av
tv = vt + tv
ta = ta + at
tav, _ = self.tav(txt, queries=va)
out_list = [txt, au, vi, ta, tv, va, tav]
if self.use_all_trimodal:
vat, _ = self.vat(vi, queries=ta)
atv, _ = self.atv(au, queries=tv)
out_list = out_list + [vat, atv]
# B x L x 7*D or B x L x 9*D
fused = torch.cat(out_list, dim=-1)
return fused
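A rough usage sketch follows; the shapes mirror the docstring above, and the modality order (text, audio, visual) follows the unpacking in fuse. Sizes are illustrative assumptions.

```python
import torch
from slp.modules.fuse import AttentionFuser

fuser = AttentionFuser(feature_size=100, n_modalities=3, use_all_trimodal=False)

txt = torch.rand(8, 20, 100)  # [B, L, D]
au = torch.rand(8, 20, 100)   # [B, L, D]
vi = torch.rand(8, 20, 100)   # [B, L, D]

# Seven crossmodal terms are concatenated: t, a, v, f(t,a), f(v,a), f(t,v), g(t, f(v,a))
fused = fuser(txt, au, vi)    # [B, L, 7 * 100]
# With use_all_trimodal=True the two optional terms are appended -> [B, L, 9 * 100]
```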
BaseFuser
out_size: int
property
readonly
Output feature size.
Each fuser specifies its output feature size
__init__(self, feature_size, n_modalities, **extra_kwargs)
special
Base fuser class.
Our fusion methods are separated into direct and combinatorial. An example of direct fusion is concatenation, where the feature vectors of N modalities are concatenated into a fused vector. When performing combinatorial fusion, all crossmodal relations are examined (e.g. text -> audio, text -> visual, audio -> visual etc.). In the current implementation, combinatorial fusion is implemented for 3 input modalities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Assume all modality representations have the same feature_size | required |
| n_modalities | int | Number of input modalities | required |
| **extra_kwargs | dict | Extra keyword arguments to maintain interoperability of children classes | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
**extra_kwargs,
):
"""Base fuser class.
Our fusion methods are separated in direct and combinatorial.
An example for direct fusion is concatenation, where feature vectors of N modalities
are concatenated into a fused vector.
When performing combinatorial fusion all crossmodal relations are examined (e.g. text -> audio,
text -> visual, audio -> visual etc.)
In the current implementation, combinatorial fusion is implemented for 3 input modalities
Args:
feature_size (int): Assume all modality representations have the same feature_size
n_modalities (int): Number of input modalities
**extra_kwargs (dict): Extra keyword arguments to maintain interoperability of children
classes
"""
super(BaseFuser, self).__init__()
self.feature_size = feature_size
self.n_modalities = n_modalities
forward(self, *mods, lengths=None)
Fuse the modality representations
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor [B, L, self.out_size] |
Source code in slp/modules/fuse.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Fuse the modality representations
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, L, self.out_size]
"""
fused = self.fuse(*mods, lengths=lengths)
return fused
fuse(self, *mods, lengths=None)
Abstract method to fuse the modality representations
Children classes should implement this method
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor |
Source code in slp/modules/fuse.py
@abstractmethod
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Abstract method to fuse the modality representations
Children classes should implement this method
Args:
*mods: List of modality tensors
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor
"""
pass
BaseFusionPipeline
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, *args, **kwargs)
special
Base class for a fusion pipeline
Inherit this class to implement a fusion pipeline
Source code in slp/modules/fuse.py
def __init__(self, *args, **kwargs):
"""Base class for a fusion pipeline
Inherit this class to implement a fusion pipeline
"""
super(BaseFusionPipeline, self).__init__()
BaseTimestepsPooler
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, batch_first=True, **kwargs)
special
Abstract base class for Timesteps Poolers
Timesteps Poolers aggregate the features for different timesteps
Given a tensor with dimensions [BatchSize, Length, Dim] they return an aggregated tensor with dimensions [BatchSize, Dim]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Feature dimension | required |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| **kwargs | | Variable keyword arguments for subclasses | {} |
Source code in slp/modules/fuse.py
def __init__(self, feature_size: int, batch_first: bool = True, **kwargs):
"""Abstract base class for Timesteps Poolers
Timesteps Poolers aggregate the features for different timesteps
Given a tensor with dimensions [BatchSize, Length, Dim]
they return an aggregated tensor with dimensions [BatchSize, Dim]
Args:
feature_size (int): Feature dimension
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
**kwargs: Variable keyword arguments for subclasses
"""
super(BaseTimestepsPooler, self).__init__()
self.pooling_dim = 0 if not batch_first else 1
self.feature_size = feature_size
forward(self, x, lengths=None)
Pool features of input tensor across timesteps
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | [B, L, D] Input sequence | required |
| lengths | Optional[torch.Tensor] | Optional unpadded sequence lengths for input tensor | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: [B, D] Output aggregated features across timesteps |
Source code in slp/modules/fuse.py
def forward(
self, x: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Pool features of input tensor across timesteps
Args:
x (torch.Tensor): [B, L, D] Input sequence
lengths (Optional[torch.Tensor]): Optional unpadded sequence lengths for input tensor
Returns:
torch.Tensor: [B, D] Output aggregated features across timesteps
"""
if x.ndim == 2:
return x
if x.ndim != 3:
raise ValueError("Expected 3 dimensional tensor [B, L, D] or [L, B, D]")
return self._pool(x, lengths=lengths)
BilinearFuser
__init__(self, feature_size, n_modalities, use_all_trimodal=False, **kwargs)
special
Fuse all combinations of three modalities using bilinear fusion.
If the input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
where f and g are nn.Bilinear modules and terms in [] are optional.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 3) | required |
| use_all_trimodal | bool | Use all optional trimodal combinations | False |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
use_all_trimodal: bool = False,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module using bilinear fusion
If input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
Where f and g are the nn.Bilinear function and values with [] are optional
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
use_all_trimodal (bool): Use all optional trimodal combinations
"""
super(BilinearFuser, self).__init__(
feature_size,
n_modalities,
use_all_trimodal=use_all_trimodal,
**kwargs,
)
fuse(self, *mods, lengths=None)
Perform bilinear fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 3 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform bilinear fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 3 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 7*D] or [B, L, 9*D]
"""
txt, au, vi = mods
ta = self.ta(txt, au)
va = self.va(vi, au)
tv = self.tv(txt, vi)
tav = self.tav(txt, va)
out_list = [txt, au, vi, ta, tv, va, tav]
if self.use_all_trimodal:
vat = self.vat(vi, ta)
atv = self.atv(au, tv)
out_list = out_list + [vat, atv]
# B x L x 7*D or B x L x 9*D
fused = torch.cat(out_list, dim=-1)
return fused
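A usage sketch, analogous to AttentionFuser but with nn.Bilinear interaction modules; the feature size and shapes are illustrative assumptions.

```python
import torch
from slp.modules.fuse import BilinearFuser

fuser = BilinearFuser(feature_size=64, n_modalities=3, use_all_trimodal=True)
txt, au, vi = (torch.rand(4, 10, 64) for _ in range(3))  # [B, L, D] each
fused = fuser(txt, au, vi)  # [B, L, 9 * 64] since all trimodal terms are used
```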
BimodalAttentionFuser
fuse(self, *mods, lengths=None)
Perform attention fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 2 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 3*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform attention fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 3*D]
"""
x, y = mods
xy, yx = self.xy(x, y)
xy = xy + yx
# B x L x 3*D
fused = torch.cat([x, y, xy], dim=-1)
return fused
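A minimal two-modality sketch; shapes are assumptions, and the constructor signature is assumed to mirror the other fusers (feature_size, n_modalities), as make_fuser below suggests.

```python
import torch
from slp.modules.fuse import BimodalAttentionFuser

fuser = BimodalAttentionFuser(feature_size=64, n_modalities=2)
x = torch.rand(4, 10, 64)  # [B, L, D]
y = torch.rand(4, 10, 64)  # [B, L, D]
fused = fuser(x, y)        # [B, L, 3 * 64]: x || y || symmetric cross-attention term
```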
BimodalBilinearFuser
fuse(self, *mods, lengths=None)
Perform bilinear fusion on input modalities
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Input tensors to fuse. This module accepts 2 input modalities. [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Unpadded tensors lengths | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: fused output vector [B, L, 3*D] |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Perform bilinear fusion on input modalities
Args:
*mods: Input tensors to fuse. This module accepts 2 input modalities. [B, L, D]
lengths (Optional[torch.Tensor]): Unpadded tensors lengths
Returns:
torch.Tensor: fused output vector [B, L, 3*D]
"""
x, y = mods
xy = self.xy(x, y)
# B x L x 3*D
fused = torch.cat([x, y, xy], dim=-1)
return fused
BimodalCombinatorialFuser
out_size: int
property
readonly
Fused vector feature dimension
Returns:
| Type | Description |
|---|---|
| int | int: 3 * feature_size |
__init__(self, feature_size, n_modalities, **kwargs)
special
Fuse two modalities using a base module.
If the input modalities are x, y, then the output is o = x || y || f(x, y),
where f is a network module (e.g. attention).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 2) | required |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module
If input modalities are x, y, then the output is
o = x || y || f(x, y)
Where f is a network module (e.g. attention)
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
"""
super(BimodalCombinatorialFuser, self).__init__(
feature_size, n_modalities, **kwargs
)
self._check_n_modalities(n=2)
self.xy = self._bimodal_fusion_module(feature_size, **kwargs)
CatFuser
Fuse by concatenating modality representations
o = m1 || m2 || m3 ...
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Assume all modality representations have the same feature_size | required |
| n_modalities | int | Number of input modalities | required |
| **extra_kwargs | dict | Extra keyword arguments to maintain interoperability of children classes | required |
out_size: int
property
readonly
d_out = n_modalities * d_in
Returns:
| Type | Description |
|---|---|
| int | int: output feature size |
fuse(self, *mods, lengths=None)
Concatenate input tensors into a single tensor
Examples:
fuser = CatFuser(5, 2)
x = torch.rand(16, 6, 5)  # (B, L, D)
y = torch.rand(16, 6, 5)  # (B, L, D)
out = fuser(x, y)  # (B, L, 2 * D)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable number of input tensors | () |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Concatenated input tensors |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Concatenate input tensors into a single tensor
Example:
fuser = CatFuser(5, 2)
x = torch.rand(16, 6, 5) # (B, L, D)
y = torch.rand(16, 6, 5) # (B, L, D)
out = fuser(x, y) # (B, L, 2 * D)
Args:
*mods: Variable number of input tensors
Returns:
torch.Tensor: Concatenated input tensors
"""
return torch.cat(mods, dim=-1)
Conv1dProjection
__init__(self, modality_sizes, projection_size, kernel_size=1, padding=0, bias=False)
special
Project features for N modalities using 1D convolutions
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of number of features for each modality. E.g. for MOSEI: [300, 74, 35] | required |
| projection_size | int | Output number of features for each modality | required |
| kernel_size | int | Convolution kernel size | 1 |
| padding | int | Convolution amount of padding | 0 |
| bias | bool | Use bias in convolutional layers | False |
Source code in slp/modules/fuse.py
def __init__(
self,
modality_sizes: List[int],
projection_size: int,
kernel_size: int = 1,
padding: int = 0,
bias: bool = False,
):
"""Project features for N modalities using 1D convolutions
Args:
modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
[300, 74, 35]
projection_size (int): Output number of features for each modality
kernel_size (int): Convolution kernel size
padding (int): Convolution amount of padding
bias (bool): Use bias in convolutional layers
"""
super(Conv1dProjection, self).__init__()
self.p = nn.ModuleList(
[
nn.Conv1d(
sz,
projection_size,
kernel_size=kernel_size,
padding=padding,
bias=bias,
)
for sz in modality_sizes
]
)
forward(self, *mods)
Project modality representations to a given number of features using Conv1d layers
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
c_proj = Conv1dProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = c_proj(text, audio, visual)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensors list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length projected tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Project modality representations to a given number of features using Conv1d layers
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
c_proj = Conv1dProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = c_proj(text, audio, visual)
Args:
*mods: Variable length tensors list
Returns:
List[torch.Tensor]: Variable length projected tensors list
"""
mods_o: List[torch.Tensor] = [
self.p[i](m.transpose(1, 2)).transpose(1, 2) for i, m in enumerate(mods)
]
return mods_o
FuseAggregateTimesteps
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, n_modalities, output_size=None, fusion_method='cat', timesteps_pooling_method='sum', batch_first=True, **fuser_kwargs)
special
Fuse input feature sequences and aggregate across timesteps
Fuser -> TimestepsPooler
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | The input modality representations dimension | required |
| n_modalities | int | Number of input modalities | required |
| output_size | Optional[int] | Required output size. If not provided, output_size = fuser.out_size | None |
| fusion_method | str | Select which fuser to use [cat\|sum\|attention\|bilinear] | 'cat' |
| timesteps_pooling_method | str | TimestepsPooler method [cat\|sum\|rnn] | 'sum' |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| **fuser_kwargs | dict | Extra keyword arguments to instantiate fuser | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
output_size: Optional[int] = None,
fusion_method: str = "cat",
timesteps_pooling_method: str = "sum",
batch_first: bool = True,
**fuser_kwargs,
):
"""Fuse input feature sequences and aggregate across timesteps
Fuser -> TimestepsPooler
Args:
feature_size (int): The input modality representations dimension
n_modalities (int): Number of input modalities
output_size (Optional[int]): Required output size. If not provided,
output_size = fuser.out_size
fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
**fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
"""
super(FuseAggregateTimesteps, self).__init__(
feature_size, n_modalities, fusion_method=fusion_method
)
self.fuser = make_fuser(
fusion_method, feature_size, n_modalities, **fuser_kwargs
)
output_size = ( # bidirectional rnn. fused_size / 2 results to fused_size outputs
output_size if output_size is not None else self.fuser.out_size // 2
)
self.timesteps_pooler = TimestepsPooler(
self.fuser.out_size,
hidden_size=output_size,
mode=timesteps_pooling_method,
batch_first=batch_first,
)
forward(self, *mods, lengths=None)
Fuse the modality representations and aggregate across timesteps
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor [B, self.out_size] |
Source code in slp/modules/fuse.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Fuse the modality representations and aggregate across timesteps
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, self.out_size]
"""
fused = self.fuser(*mods, lengths=lengths)
out: torch.Tensor = self.timesteps_pooler(fused, lengths=lengths)
return out
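A sketch of the Fuser -> TimestepsPooler chain; all sizes below are illustrative assumptions.

```python
import torch
from slp.modules.fuse import FuseAggregateTimesteps

# Three already-projected modality sequences with a common feature size of 100
mods = [torch.rand(8, 20, 100) for _ in range(3)]
lengths = torch.full((8,), 20)

fa = FuseAggregateTimesteps(
    feature_size=100,
    n_modalities=3,
    fusion_method="cat",             # fused sequence: [B, L, 3 * 100]
    timesteps_pooling_method="rnn",  # RnnPooler aggregates over L
)
out = fa(*mods, lengths=lengths)     # [B, fa.out_size]
```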
LinearProjection
__init__(self, modality_sizes, projection_size, bias=True)
special
Project features for N modalities using feedforward layers
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of number of features for each modality. E.g. for MOSEI: [300, 74, 35] | required |
| projection_size | int | Output number of features for each modality | required |
| bias | bool | Use bias in feedforward layers | True |
Source code in slp/modules/fuse.py
def __init__(
self, modality_sizes: List[int], projection_size: int, bias: bool = True
):
"""Project features for N modalities using feedforward layers
Args:
modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
[300, 74, 35]
bias (bool): Use bias in feedforward layers
"""
super(LinearProjection, self).__init__()
self.p = nn.ModuleList(
[nn.Linear(sz, projection_size, bias=bias) for sz in modality_sizes]
)
forward(self, *mods)
Project modality representations to a given number of features using Linear layers
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
l_proj = LinearProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensor list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length projected tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Project modality representations to a given number of features using Linear layers
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
l_proj = LinearProjection([300, 74, 35], 100)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Args:
*mods: Variable length tensor list
Returns:
List[torch.Tensor]: Variable length projected tensors list
"""
mods_o: List[torch.Tensor] = [self.p[i](m) for i, m in enumerate(mods)]
return mods_o
ModalityProjection
__init__(self, modality_sizes, projection_size, kernel_size=1, padding=0, bias=True, mode=None)
special
Adapter module to project features for N modalities using 1D convolutions or feedforward
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of number of features for each modality. E.g. for MOSEI: [300, 74, 35] | required |
| projection_size | int | Output number of features for each modality | required |
| kernel_size | int | Convolution kernel size. Used when mode=="conv" | 1 |
| padding | int | Convolution amount of padding. Used when mode=="conv" | 0 |
| bias | bool | Use bias | True |
| mode | Optional[str] | Projection method. linear -> LinearProjection, conv\|conv1d\|convolutional -> Conv1dProjection | None |
Source code in slp/modules/fuse.py
def __init__(
self,
modality_sizes: List[int],
projection_size: int,
kernel_size: int = 1,
padding: int = 0,
bias: bool = True,
mode: Optional[str] = None,
):
"""Adapter module to project features for N modalities using 1D convolutions or feedforward
Args:
modality_sizes (List[int]): List of number of features for each modality. E.g. for MOSEI:
[300, 74, 35]
projection_size (int): Output number of features for each modality
kernel_size (int): Convolution kernel size. Used when mode=="conv"
padding (int): Convolution amount of padding. Used when mode=="conv"
bias (bool): Use bias
mode (Optional[str]): Projection method.
linear -> LinearProjection
conv|conv1d|convolutional -> Conv1dProjection
"""
super(ModalityProjection, self).__init__()
if mode is None:
self.p: Optional[Union[LinearProjection, Conv1dProjection]] = None
elif mode == "linear":
self.p = LinearProjection(modality_sizes, projection_size, bias=bias)
elif mode == "conv" or mode == "conv1d" or mode == "convolutional":
self.p = Conv1dProjection(
modality_sizes,
projection_size,
kernel_size=kernel_size,
padding=padding,
bias=bias,
)
else:
raise ValueError(
"Supported mode=[linear|conv|conv1d|convolutional]."
"conv, conv1d and convolutional are equivalent."
)
forward(self, *mods)
Project modality representations to a given number of features
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
l_proj = ModalityProjection([300, 74, 35], 100, mode="linear")
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Examples:
Inputs:
text: (B, L, 300)
audio: (B, L, 74)
visual: (B, L, 35)
Outputs:
text_p: (B, L, 300)
audio_p: (B, L, 74)
visual_p: (B, L, 35)
l_proj = ModalityProjection([300, 74, 35], 100, mode=None)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensor list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length projected tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Project modality representations to a given number of features
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
l_proj = ModalityProjection([300, 74, 35], 100, mode="linear")
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Example:
# Inputs:
# text: (B, L, 300)
# audio: (B, L, 74)
# visual: (B, L, 35)
# Outputs:
# text_p: (B, L, 300)
# audio_p: (B, L, 74)
# visual_p: (B, L, 35)
l_proj = ModalityProjection([300, 74, 35], 100, mode=None)
text_p, audio_p, visual_p = l_proj(text, audio, visual)
Args:
*mods: Variable length tensor list
Returns:
List[torch.Tensor]: Variable length projected tensors list
"""
if self.p is None:
return list(mods)
mods_o: List[torch.Tensor] = self.p(*mods)
return mods_o
ModalityWeights
__init__(self, feature_size)
special
Multiply each modality features with a learnable weight
i: modality index
learnable_weight[i] = softmax(Linear(modality_features[i]))
output_modality[i] = learnable_weight[i] * modality_features[i]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | All modalities are assumed to be projected into a space with the same number of features. | required |
Source code in slp/modules/fuse.py
def __init__(self, feature_size: int):
"""Multiply each modality features with a learnable weight
i: modality index
learnable_weight[i] = softmax(Linear(modality_features[i]))
output_modality[i] = learnable_weight * modality_features[i]
Args:
feature_size (int): All modalities are assumed to be projected into a space with the same
number of features.
"""
super(ModalityWeights, self).__init__()
self.mod_w = nn.Linear(feature_size, 1)
forward(self, *mods)
Use learnable weights to multiply modality features
Examples:
Inputs:
text: (B, L, 100)
audio: (B, L, 100)
visual: (B, L, 100)
Outputs:
text_p: (B, L, 100)
audio_p: (B, L, 100)
visual_p: (B, L, 100)
mw = ModalityWeights(100)
text_w, audio_w, visual_w = mw(text, audio, visual)
The operation is summarized as:
w_x = softmax(W * x + b)
w_y = softmax(W * y + b)
x_out = w_x * x
y_out = w_y * y
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable length tensor list | () |
Returns:
| Type | Description |
|---|---|
| List[torch.Tensor] | List[torch.Tensor]: Variable length reweighted tensors list |
Source code in slp/modules/fuse.py
def forward(self, *mods: torch.Tensor) -> List[torch.Tensor]:
"""Use learnable weights to multiply modality features
Example:
# Inputs:
# text: (B, L, 100)
# audio: (B, L, 100)
# visual: (B, L, 100)
# Outputs:
# text_p: (B, L, 100)
# audio_p: (B, L, 100)
# visual_p: (B, L, 100)
mw = ModalityWeights(100)
text_w, audio_w, visual_w = mw(text, audio, visual)
The operation is summarized as:
w_x = softmax(W * x + b)
w_y = softmax(W * y + b)
x_out = w_x * x
y_out = w_y * y
Args:
*mods: Variable length tensor list
Returns:
List[torch.Tensor]: Variable length reweighted tensors list
"""
weight = self.mod_w(torch.cat([x.unsqueeze(1) for x in mods], dim=1))
weight = F.softmax(weight, dim=1)
mods_o: List[torch.Tensor] = [m * weight[:, i, ...] for i, m in enumerate(mods)]
return mods_o
ProjectFuseAggregate
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, modality_sizes, projection_size, projection_type=None, fusion_method='cat', timesteps_pooling_method='sum', modality_weights=False, batch_first=True, **fuser_kwargs)
special
Project input feature sequences, fuse and aggregate across timesteps
ModalityProjection -> Optional[ModalityWeights] -> Fuser -> TimestepsPooler
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| modality_sizes | List[int] | List of input modality representations dimensions | required |
| projection_size | int | Project all modalities to have this feature size | required |
| projection_type | Optional[str] | Optional projection method [linear\|conv] | None |
| fusion_method | str | Select which fuser to use [cat\|sum\|attention\|bilinear] | 'cat' |
| timesteps_pooling_method | str | TimestepsPooler method [cat\|sum\|rnn] | 'sum' |
| modality_weights | bool | Multiply projected modality representations with learnable weights. Default value is False. | False |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| **fuser_kwargs | dict | Extra keyword arguments to instantiate fuser | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
modality_sizes: List[int],
projection_size: int,
projection_type: Optional[str] = None,
fusion_method="cat",
timesteps_pooling_method="sum",
modality_weights: bool = False,
batch_first: bool = True,
**fuser_kwargs,
):
"""Project input feature sequences, fuse and aggregate across timesteps
ModalityProjection -> Optional[ModalityWeights] -> Fuser -> TimestepsPooler
Args:
modality_sizes (List[int]): List of input modality representations dimensions
projection_size (int): Project all modalities to have this feature size
projection_type (Optional[str]): Optional projection method [linear|conv]
fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
modality_weights (bool): Multiply projected modality representations with learnable
weights. Default value is False.
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
**fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
"""
super(ProjectFuseAggregate, self).__init__()
n_modalities = len(modality_sizes)
self.projection = None
self.modality_weights = None
if projection_type is not None:
self.projection = ModalityProjection(
modality_sizes, projection_size, mode=projection_type
)
if modality_weights:
self.modality_weights = ModalityWeights(projection_size)
fuser_kwargs["output_size"] = projection_size
fuser_kwargs["fusion_method"] = fusion_method
fuser_kwargs["timesteps_pooling_method"] = timesteps_pooling_method
fuser_kwargs["batch_first"] = batch_first
if "n_modalities" in fuser_kwargs:
del fuser_kwargs["n_modalities"]
if "projection_size" in fuser_kwargs:
del fuser_kwargs["projection_size"]
self.fuse_aggregate = FuseAggregateTimesteps(
projection_size,
n_modalities,
**fuser_kwargs,
)
forward(self, *mods, lengths=None)
Project modality representations to a common dimension, fuse and aggregate across timesteps
Optionally use modality weights
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | List of modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | Lengths of each modality | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Fused tensor [B, self.out_size] |
Source code in slp/modules/fuse.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Project modality representations to a common dimension, fuse and aggregate across timesteps
Optionally use modality weights
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, self.out_size]
"""
if self.projection is not None:
mods = self.projection(*mods)
if self.modality_weights is not None:
mods = self.modality_weights(*mods)
fused: torch.Tensor = self.fuse_aggregate(*mods, lengths=lengths)
return fused
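Another configuration sketch, this time exercising convolutional projection, learnable modality weights and attention fusion. All sizes and the forwarded keyword arguments are assumptions for illustration.

```python
import torch
from slp.modules.fuse import ProjectFuseAggregate

pfa = ProjectFuseAggregate(
    modality_sizes=[300, 74, 35],
    projection_size=100,
    projection_type="conv",           # Conv1dProjection
    modality_weights=True,            # learnable per-modality scaling after projection
    fusion_method="attention",        # AttentionFuser on the projected sequences
    timesteps_pooling_method="rnn",
    residual=True,                    # forwarded to the attention fuser via **fuser_kwargs
)
txt, au, vi = torch.rand(8, 20, 300), torch.rand(8, 20, 74), torch.rand(8, 20, 35)
out = pfa(txt, au, vi, lengths=torch.full((8,), 20))  # [B, pfa.out_size]
```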
RnnPooler
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, hidden_size=None, batch_first=True, bidirectional=True, merge_bi='cat', attention=True, **kwargs)
special
Aggregate features of the input tensor using an AttentiveRNN
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Feature dimension | required |
| hidden_size | Optional[int] | Hidden dimension | None |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| bidirectional | bool | Use bidirectional RNN. Defaults to True | True |
| merge_bi | str | How bidirectional states are merged. Defaults to "cat" | 'cat' |
| attention | bool | Use attention for the RNN output states | True |
| **kwargs | | Variable keyword arguments | {} |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
hidden_size: Optional[int] = None,
batch_first: bool = True,
bidirectional: bool = True,
merge_bi: str = "cat",
attention: bool = True,
**kwargs,
):
"""Aggregate features of the input tensor using an AttentiveRNN
Args:
feature_size (int): Feature dimension
hidden_size (int): Hidden dimension
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
bidirectional (bool): Use bidirectional RNN. Defaults to True
merge_bi (str): How bidirectional states are merged. Defaults to "cat"
attention (bool): Use attention for the RNN output states
**kwargs: Variable keyword arguments
"""
super(RnnPooler, self).__init__(feature_size, batch_first=batch_first, **kwargs)
self.hidden_size = hidden_size if hidden_size is not None else feature_size
self.rnn = AttentiveRNN(
feature_size,
hidden_size=self.hidden_size,
batch_first=batch_first,
bidirectional=bidirectional,
merge_bi=merge_bi,
attention=attention,
return_hidden=False, # We want to aggregate all hidden states.
)
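A pooling sketch; the hidden size and input shapes are illustrative assumptions.

```python
import torch
from slp.modules.fuse import RnnPooler

pooler = RnnPooler(feature_size=300, hidden_size=100)
x = torch.rand(8, 20, 300)           # [B, L, D] fused sequence
lengths = torch.full((8,), 20)
pooled = pooler(x, lengths=lengths)  # [B, pooler.out_size]
```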
SumFuser
Fuse by adding modality representations
o = m1 + m2 + m3 ...
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Assume all modality representations have the same feature_size | required |
| n_modalities | int | Number of input modalities | required |
| **extra_kwargs | dict | Extra keyword arguments to maintain interoperability of children classes | required |
out_size: int
property
readonly
d_out = d_in
Returns:
| Type | Description |
|---|---|
| int | int: output feature size |
fuse(self, *mods, lengths=None)
Sum input tensors into a single tensor
Examples:
fuser = SumFuser(5, 2)
x = torch.rand(16, 6, 5)  # (B, L, D)
y = torch.rand(16, 6, 5)  # (B, L, D)
out = fuser(x, y)  # (B, L, D)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable number of input tensors | () |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: Summed input tensors |
Source code in slp/modules/fuse.py
def fuse(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Sum input tensors into a single tensor
Example:
fuser = SumFuser(5, 2)
x = torch.rand(16, 6, 5) # (B, L, D)
y = torch.rand(16, 6, 5) # (B, L, D)
out = fuser(x, y) # (B, L, D)
Args:
*mods: Variable number of input tensors
Returns:
torch.Tensor: Summed input tensors
"""
return torch.cat([m.unsqueeze(-1) for m in mods], dim=-1).sum(-1)
TimestepsPooler
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
| Type | Description |
|---|---|
| int | int: The feature dimension of the output tensor |
__init__(self, feature_size, mode='sum', batch_first=True, **kwargs)
special
Aggregate features from all timesteps into a single representation.
Four methods are supported:
sum: Sum features from all timesteps
mean: Average features from all timesteps
max: Max pool features from all timesteps
rnn: Use the output of an attentive RNN
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | The number of features for the input fused representations | required |
| batch_first | bool | Input tensors are in batch first configuration. Leave this as true except if you know what you are doing | True |
| mode | str | The timestep pooling method. sum: Sum hidden states, mean: Average hidden states, max: Max pool features from all hidden states, rnn: Use the output of an Attentive RNN | 'sum' |
Source code in slp/modules/fuse.py
def __init__(
self, feature_size: int, mode: str = "sum", batch_first=True, **kwargs
):
"""Aggregate features from all timesteps into a single representation.
Four methods supported:
sum: Sum features from all timesteps
mean: Average features from all timesteps
max: Max pool features from all timesteps
rnn: Use the output from an attentive RNN
Args:
feature_size (int): The number of features for the input fused representations
batch_first (bool): Input tensors are in batch first configuration. Leave this as true
except if you know what you are doing
mode (str): The timestep pooling method
sum: Sum hidden states
mean: Average hidden states
max: Max pool features from all hidden states
rnn: Use the output of an Attentive RNN
"""
super(TimestepsPooler, self).__init__(
feature_size, batch_first=batch_first, **kwargs
)
assert (
mode is None or mode in SUPPORTED_POOLERS
), f"Unsupported timestep pooling method. Available methods: {SUPPORTED_POOLERS.keys()}"
self.pooler = None
if mode is not None:
self.pooler = SUPPORTED_POOLERS[mode](
feature_size, batch_first=batch_first, **kwargs
)
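A sketch comparing two pooling modes; shapes and hidden size are assumptions.

```python
import torch
from slp.modules.fuse import TimestepsPooler

x = torch.rand(8, 20, 300)  # [B, L, D]
lengths = torch.full((8,), 20)

sum_pool = TimestepsPooler(300, mode="sum")
rnn_pool = TimestepsPooler(300, mode="rnn", hidden_size=150)

a = sum_pool(x, lengths=lengths)  # [B, 300], timesteps summed
b = rnn_pool(x, lengths=lengths)  # [B, rnn_pool.out_size], attentive RNN over timesteps
```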
TrimodalCombinatorialFuser
out_size: int
property
readonly
Fused vector feature dimension
Returns:
| Type | Description |
|---|---|
| int | int: 7 * feature_size if use_all_trimodal==False else 9 * feature_size |
__init__(self, feature_size, n_modalities, use_all_trimodal=False, **kwargs)
special
Fuse all combinations of three modalities using a base module.
If the input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
where f and g are network modules (e.g. attention) and terms in [] are optional.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_size | int | Number of feature dimensions | required |
| n_modalities | int | Number of input modalities (should be 3) | required |
| use_all_trimodal | bool | Use all optional trimodal combinations | False |
Source code in slp/modules/fuse.py
def __init__(
self,
feature_size: int,
n_modalities: int,
use_all_trimodal: bool = False,
**kwargs,
):
"""Fuse all combinations of three modalities using a base module
If input modalities are a, t, v, then the output is
o = t || a || v || f(t, a) || f(v, a) || f(t, v) || g(t, f(v, a)) || [ g(v, f(t,a)) ] || [g(a, f(t,v))]
Where f and g network modules (e.g. attention) and values with [] are optional
Args:
feature_size (int): Number of feature dimensions
n_modalities (int): Number of input modalities (should be 3)
use_all_trimodal (bool): Use all optional trimodal combinations
"""
super(TrimodalCombinatorialFuser, self).__init__(
feature_size, n_modalities, **kwargs
)
self._check_n_modalities(n=3)
self.use_all_trimodal = use_all_trimodal
self.ta = self._bimodal_fusion_module(feature_size, **kwargs)
self.va = self._bimodal_fusion_module(feature_size, **kwargs)
self.tv = self._bimodal_fusion_module(feature_size, **kwargs)
self.tav = self._trimodal_fusion_module(feature_size, **kwargs)
if use_all_trimodal:
self.vat = self._trimodal_fusion_module(feature_size, **kwargs)
self.atv = self._trimodal_fusion_module(feature_size, **kwargs)
make_fuser(fusion_method, feature_size, n_modalities, **kwargs)
Helper function to instantiate a fuser given a string fusion_method parameter
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fusion_method | str | One of the supported fusion methods [cat\|add\|bilinear\|attention] | required |
| feature_size | int | The input modality representations dimension | required |
| n_modalities | int | Number of input modalities | required |
| **kwargs | | Variable keyword arguments to pass to the instantiated fuser | {} |
Source code in slp/modules/fuse.py
def make_fuser(fusion_method: str, feature_size: int, n_modalities: int, **kwargs):
"""Helper function to instantiate a fuser given a string fusion_method parameter
Args:
fusion_method (str): One of the supported fusion methods [cat|add|bilinear|attention]
feature_size (int): The input modality representations dimension
n_modalities (int): Number of input modalities
**kwargs: Variable keyword arguments to pass to the instantiated fuser
"""
if fusion_method not in SUPPORTED_FUSERS.keys():
raise NotImplementedError(
f"The supported fusers are {SUPPORTED_FUSERS.keys()}. You provided {fusion_method}"
)
if fusion_method == "bilinear":
if n_modalities == 2:
return BimodalBilinearFuser(feature_size, n_modalities, **kwargs)
elif n_modalities == 3:
return BilinearFuser(feature_size, n_modalities, **kwargs)
else:
raise ValueError("bilinear implemented for 2 or 3 modalities")
if fusion_method == "attention":
if n_modalities == 2:
return BimodalAttentionFuser(feature_size, n_modalities, **kwargs)
elif n_modalities == 3:
return AttentionFuser(feature_size, n_modalities, **kwargs)
else:
raise ValueError("attention implemented for 2 or 3 modalities")
return SUPPORTED_FUSERS[fusion_method](feature_size, n_modalities, **kwargs)
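A dispatch sketch showing how the string argument selects the fuser class; the feature size and modality counts are illustrative.

```python
from slp.modules.fuse import make_fuser

cat_fuser = make_fuser("cat", feature_size=100, n_modalities=3)        # CatFuser
att_fuser = make_fuser("attention", feature_size=100, n_modalities=3)  # AttentionFuser
bi_fuser = make_fuser("bilinear", feature_size=100, n_modalities=2)    # BimodalBilinearFuser
```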
Multimodal encoders
These modules implement mid and late fusion. A multimodal encoder generally consists of:
- N Unimodal encoders (e.g. RNNs), where N is the number of input modalities
- A fusion pipeline
We furthermore implement multimodal classifiers, which consist of a multimodal encoder followed by an nn.Linear layer.
A special mention goes to our MultimodalBaseline. This baseline consists of RNN encoders followed by an attention fuser and an RNN timesteps pooler, and is tuned on CMU-MOSEI. The default configuration is provided through static methods and achieves strong performance.
AudioEncoder
Alias for Unimodal Encoder
AudioTextClassifier
forward(self, mod_dict, lengths)
Encode and fuse the text and audio modalities, then classify the fused representation.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
AudioVisualClassifier
forward(self, mod_dict, lengths)
Encode and fuse the visual and audio modalities, then classify the fused representation.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["visual"], mod_dict["audio"]]
fused = self.enc(*mods, lengths=lengths["visual"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
BaseEncoder
out_size: int
property
readonly
An encoder returns its output size
Returns:
| Type | Description |
|---|---|
| int | int: The output feature size of the encoder |
__init__(self, *args, **kwargs)
special
Base class implementing a multimodal encoder
A BaseEncoder child encodes and fuses the modality features and returns representations ready to be provided to a classification layer
Source code in slp/modules/multimodal.py
def __init__(self, *args, **kwargs):
"""Base class implementing a multimodal encoder
A BaseEncoder child encodes and fuses the modality features
and returns representations ready to be provided to a classification layer
"""
super(BaseEncoder, self).__init__()
self.args = args
self.kwargs = kwargs
self.clf = None
forward(self, *mods, lengths=None)
Encode + fuse
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *mods | Tensor | Variable input modality tensors [B, L, D] | () |
| lengths | Optional[torch.Tensor] | The unpadded tensor lengths. Defaults to None. | None |
Returns:
| Type | Description |
|---|---|
| Tensor | torch.Tensor: The fused tensor [B, D] |
Source code in slp/modules/multimodal.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Encode + fuse
Args:
*mods (torch.Tensor): Variable input modality tensors [B, L, D]
lengths (Optional[torch.Tensor], optional): The unpadded tensor lengths. Defaults to None.
Returns:
torch.Tensor: The fused tensor [B, D]
"""
encoded: List[torch.Tensor] = self._encode(*mods, lengths=lengths)
fused = self._fuse(*encoded, lengths=lengths)
return fused
BimodalEncoder
out_size: int
property
readonly
Output feature size
Returns:
| Type | Description |
|---|---|
| int | int: Output feature size |
__init__(self, encoder1_args, encoder2_args, fuser_args, **kwargs)
special
Two modality encoder
Encode + Fuse two input modalities
Example encoder_args:
{
    "input_size": 35,
    "hidden_size": 100,
    "layers": 1,
    "bidirectional": True,
    "dropout": 0.2,
    "rnn_type": "lstm",
    "attention": True,
}
Example fuser_args:
{
    "n_modalities": 3,
    "dropout": 0.2,
    "output_size": 100,
    "hidden_size": 100,
    "fusion_method": "cat",
    "timesteps_pooling_method": "rnn",
}
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encoder1_args | Dict[str, Any] | Configuration for first encoder | required |
| encoder2_args | Dict[str, Any] | Configuration for second encoder | required |
| fuser_args | Dict[str, Any] | Configuration for fuser | required |
Source code in slp/modules/multimodal.py
def __init__(
self,
encoder1_args: Dict[str, Any],
encoder2_args: Dict[str, Any],
fuser_args: Dict[str, Any],
**kwargs,
):
"""Two modality encoder
Encode + Fuse two input modalities
Example encoder_args:
{
"input_size": 35,
"hidden_size": 100,
"layers": 1,
"bidirectional": True,
"dropout": 0.2,
"rnn_type": "lstm",
"attention": True,
}
Example fuser_args:
{
"n_modalities": 3,
"dropout": 0.2,
"output_size": 100,
"hidden_size": 100,
"fusion_method": "cat",
"timesteps_pooling_method": "rnn",
}
Args:
encoder1_args (Dict[str, Any]): Configuration for first encoder
encoder2_args (Dict[str, Any]): Configuration for second encoder
fuser_args (Dict[str, Any]): Configuration for fuser
"""
super(BimodalEncoder, self).__init__(
encoder1_args,
encoder2_args,
fuser_args,
**kwargs,
)
self.input_projection = None
if "input_projection" in fuser_args and fuser_args["input_projection"]:
self.input_projection = ModalityProjection(
[encoder1_args["input_size"], encoder2_args["input_size"]],
fuser_args["hidden_size"],
mode=fuser_args["input_projection"],
)
encoder1_args["return_hidden"] = True
encoder2_args["return_hidden"] = True
self.encoder1 = UnimodalEncoder(**encoder1_args)
self.encoder2 = UnimodalEncoder(**encoder2_args)
self.fuse = self._make_fusion_pipeline(
[self.encoder1.out_size, self.encoder2.out_size], **fuser_args
)
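A construction sketch based on the example configuration dictionaries above. The concrete sizes, the text/visual pairing and the n_modalities value are illustrative assumptions.

```python
import torch
from slp.modules.multimodal import BimodalEncoder

text_args = {
    "input_size": 300, "hidden_size": 100, "layers": 1, "bidirectional": True,
    "dropout": 0.2, "rnn_type": "lstm", "attention": True,
}
visual_args = dict(text_args, input_size=35)  # same config, different input size
fuser_args = {
    "n_modalities": 2, "dropout": 0.2, "output_size": 100, "hidden_size": 100,
    "fusion_method": "cat", "timesteps_pooling_method": "rnn",
}

enc = BimodalEncoder(text_args, visual_args, fuser_args)
txt, vi = torch.rand(8, 20, 300), torch.rand(8, 20, 35)
fused = enc(txt, vi, lengths=torch.full((8,), 20))  # [B, enc.out_size]
```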
GloveEncoder
Alias for Unimodal Encoder
MOSEIClassifier
__init__(self, encoder, num_classes, dropout=0.2)
special
Encode and classify multimodal inputs
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| encoder | BaseEncoder | The encoder module | required |
| num_classes | int | The number of target classes | required |
| dropout | float | Dropout probability | 0.2 |
Source code in slp/modules/multimodal.py
def __init__(self, encoder: BaseEncoder, num_classes: int, dropout: float = 0.2):
"""Encode and classify multimodal inputs
Args:
encoder (BaseEncoder): The encoder module
num_classes (int): The number of target classes
dropout (float): Dropout probability
"""
super(MOSEIClassifier, self).__init__()
self.enc = encoder
self.drop = nn.Dropout(p=dropout)
self.clf = nn.Linear(self.enc.out_size, num_classes)
MultimodalBaseline
__init__(self, text_size=300, audio_size=74, visual_size=35, hidden_size=100, dropout=0.2, encoder_layers=1, bidirectional=True, merge_bi='sum', rnn_type='lstm', encoder_attention=True, fuser_residual=True, use_all_trimodal=False)
special
Multimodal baseline architecture
This baseline consists of three unimodal RNNs followed by an Attention Fuser and an RNN timestep pooler. The default configuration is tuned for good performance on MOSEI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_size | int | Text input size. Defaults to 300. | 300 |
| audio_size | int | Audio input size. Defaults to 74. | 74 |
| visual_size | int | Visual input size. Defaults to 35. | 35 |
| hidden_size | int | Hidden dimension. Defaults to 100. | 100 |
| dropout | float | Dropout rate. Defaults to 0.2. | 0.2 |
| encoder_layers | float | Number of encoder layers. Defaults to 1. | 1 |
| bidirectional | bool | Use bidirectional RNNs. Defaults to True. | True |
| merge_bi | str | Bidirectional merging method in the encoders. Defaults to "sum". | 'sum' |
| rnn_type | str | RNN type [lstm\|gru]. Defaults to "lstm". | 'lstm' |
| encoder_attention | bool | Use attention in the encoder RNNs. Defaults to True. | True |
| fuser_residual | bool | Use vilbert-like residual in the attention fuser. Defaults to True. | True |
| use_all_trimodal | bool | Use all trimodal interactions for the Attention fuser. Defaults to False. | False |
Source code in slp/modules/multimodal.py
def __init__(
self,
text_size: int = 300,
audio_size: int = 74,
visual_size: int = 35,
hidden_size: int = 100,
dropout: float = 0.2,
encoder_layers: float = 1,
bidirectional: bool = True,
merge_bi: str = "sum",
rnn_type: str = "lstm",
encoder_attention: bool = True,
fuser_residual: bool = True,
use_all_trimodal: bool = False,
):
"""Multimodal baseline architecture
This baseline consists of three unimodal RNNs followed by an Attention Fuser and an RNN timestep pooler.
The default configuration is tuned for good performance on MOSEI.
Args:
text_size (int, optional): Text input size. Defaults to 300.
audio_size (int, optional): Audio input size. Defaults to 74.
visual_size (int, optional): Visual input size. Defaults to 35.
hidden_size (int, optional): Hidden dimension. Defaults to 100.
dropout (float, optional): Dropout rate. Defaults to 0.2.
encoder_layers (float, optional): Number of encoder layers. Defaults to 1.
bidirectional (bool, optional): Use bidirectional RNNs. Defaults to True.
merge_bi (str, optional): Bidirectional merging method in the encoders. Defaults to "sum".
rnn_type (str, optional): RNN type [lstm|gru]. Defaults to "lstm".
encoder_attention (bool, optional): Use attention in the encoder RNNs. Defaults to True.
fuser_residual (bool, optional): Use vilbert like residual in the attention fuser. Defaults to True.
use_all_trimodal (bool, optional): Use all trimodal interactions for the Attention fuser. Defaults to False.
"""
cfg = {
"hidden_size": hidden_size,
"dropout": dropout,
"layers": encoder_layers,
"attention": encoder_attention,
"bidirectional": bidirectional,
"rnn_type": rnn_type,
"merge_bi": merge_bi,
}
text_cfg = MultimodalBaseline.encoder_cfg(text_size, **cfg)
audio_cfg = MultimodalBaseline.encoder_cfg(audio_size, **cfg)
visual_cfg = MultimodalBaseline.encoder_cfg(visual_size, **cfg)
fuser_cfg = MultimodalBaseline.fuser_cfg(
hidden_size=hidden_size,
dropout=dropout,
residual=fuser_residual,
use_all_trimodal=use_all_trimodal,
)
super(MultimodalBaseline, self).__init__(
text_cfg, audio_cfg, visual_cfg, fuser_cfg
)
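A usage sketch with the default (MOSEI-tuned) configuration; the input sizes follow the defaults above, and the batch and sequence sizes are illustrative.

```python
import torch
from slp.modules.multimodal import MOSEIClassifier, MultimodalBaseline

encoder = MultimodalBaseline()  # text=300, audio=74, visual=35, hidden=100

txt = torch.rand(8, 20, 300)
au = torch.rand(8, 20, 74)
vi = torch.rand(8, 20, 35)
lengths = torch.full((8,), 20)

fused = encoder(txt, au, vi, lengths=lengths)  # [B, encoder.out_size]
clf = MOSEIClassifier(encoder, num_classes=1)  # attach a linear head on top of the encoder
```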
encoder_cfg(input_size, **cfg)
staticmethod
Static method to create the encoder configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_size | int | Input modality size | required |
| **cfg | | Optional keyword arguments | {} |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Dict[str, Any]: The encoder configuration |
Source code in slp/modules/multimodal.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
"""Static method to create the encoder configuration
The default configuration is provided here
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
input_size (int): Input modality size
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The encoder configuration
"""
return {
"input_size": input_size,
"hidden_size": cfg.get("hidden_size", 100),
"layers": cfg.get("layers", 1),
"bidirectional": cfg.get("bidirectional", True),
"dropout": cfg.get("dropout", 0.2),
"rnn_type": cfg.get("rnn_type", "lstm"),
"attention": cfg.get("attention", True),
"merge_bi": cfg.get("merge_bi", "sum"),
}
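Example (illustrative; not part of the library source). Default values can be overridden through the keyword arguments; the override values below are arbitrary:

from slp.modules.multimodal import MultimodalBaseline

# Unspecified keys fall back to the MOSEI-tuned defaults shown above
text_cfg = MultimodalBaseline.encoder_cfg(300, hidden_size=128, rnn_type="gru")
# text_cfg["hidden_size"] == 128, text_cfg["rnn_type"] == "gru", text_cfg["layers"] == 1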
fuser_cfg(**cfg)
staticmethod
Static method to create the fuser configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The fuser configuration |
Source code in slp/modules/multimodal.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
"""Static method to create the fuser configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The fuser configuration
"""
return {
"n_modalities": 3,
"dropout": cfg.get("dropout", 0.2),
"output_size": cfg.get("hidden_size", 100),
"hidden_size": cfg.get("hidden_size", 100),
"fusion_method": "attention",
"timesteps_pooling_method": "rnn",
"residual": cfg.get("residual", True),
"use_all_trimodal": cfg.get("use_all_trimodal", True),
}
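Example (illustrative; not part of the library source). A minimal sketch of instantiating the baseline, assuming it exposes the forward(*mods, lengths=...) interface documented for MMLatch below and returns a fused [B, hidden_size] tensor:

import torch

from slp.modules.multimodal import MultimodalBaseline

model = MultimodalBaseline(text_size=300, audio_size=74, visual_size=35, hidden_size=100)

# Dummy batch: 8 sequences of length 20 for each modality, [B, L, D]
text = torch.randn(8, 20, 300)
audio = torch.randn(8, 20, 74)
visual = torch.randn(8, 20, 35)
lengths = torch.full((8,), 20, dtype=torch.long)  # unpadded sequence lengths

fused = model(text, audio, visual, lengths=lengths)  # expected shape [8, 100]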
MultimodalBaselineClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
TrimodalClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
TrimodalEncoder
out_size: int
property
readonly
Output feature size
Returns:
Type | Description |
---|---|
int |
int: Output feature size |
__init__(self, encoder1_args, encoder2_args, encoder3_args, fuser_args, **kwargs)
special
Three modality encoder
Encode + Fuse three input modalities
Example encoder_args: { "input_size": 35, "hidden_size": 100, "layers": 1, "bidirectional": True, "dropout": 0.2, "rnn_type": "lstm", "attention": True, }
Example fuser_args: { "n_modalities": 3, "dropout": 0.2, "output_size": 100, "hidden_size": 100, "fusion_method": "cat", "timesteps_pooling_method": "rnn", }
Parameters:
Name | Type | Description | Default |
---|---|---|---|
encoder1_args |
Dict[str, Any] |
Configuration for first encoder |
required |
encoder2_args |
Dict[str, Any] |
Configuration for second encoder |
required |
encoder3_args |
Dict[str, Any] |
Configuration for third encoder |
required |
fuser_args |
Dict[str, Any] |
Configuration for fuser |
required |
Source code in slp/modules/multimodal.py
def __init__(
self,
encoder1_args: Dict[str, Any],
encoder2_args: Dict[str, Any],
encoder3_args: Dict[str, Any],
fuser_args: Dict[str, Any],
**kwargs,
):
"""Two modality encoder
Encode + Fuse three input modalities
Example encoder_args:
{
"input_size": 35,
"hidden_size": 100,
"layers": 1,
"bidirectional": True,
"dropout": 0.2,
"rnn_type": "lstm",
"attention": True,
}
Example fuser_args:
{
"n_modalities": 3,
"dropout": 0.2,
"output_size": 100,
"hidden_size": 100,
"fusion_method": "cat",
"timesteps_pooling_method": "rnn",
}
Args:
encoder1_args (Dict[str, Any]): Configuration for first encoder
encoder2_args (Dict[str, Any]): Configuration for second encoder
encoder3_args (Dict[str, Any]): Configuration for third encoder
fuser_args (Dict[str, Any]): Configuration for fuser
"""
super(TrimodalEncoder, self).__init__(
encoder1_args,
encoder2_args,
encoder3_args,
fuser_args,
**kwargs,
)
self.input_projection = None
if "input_projection" in fuser_args and fuser_args["input_projection"]:
self.input_projection = ModalityProjection(
[encoder1_args["input_size"], encoder2_args["input_size"]],
fuser_args["hidden_size"],
mode=fuser_args["input_projection"],
)
self.encoder1 = UnimodalEncoder(**encoder1_args)
self.encoder2 = UnimodalEncoder(**encoder2_args)
self.encoder3 = UnimodalEncoder(**encoder3_args)
# encoder3_args["input_size"], encoder3_args["hidden_size"], **encoder3_args
self.fuse = self._make_fusion_pipeline(
[self.encoder1.out_size, self.encoder2.out_size, self.encoder3.out_size],
**fuser_args,
)
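Example (illustrative; not part of the library source). Construction following the example encoder_args / fuser_args above, assuming the encoder follows the forward(*mods, lengths=...) convention used throughout this page:

import torch

from slp.modules.multimodal import TrimodalEncoder

def enc_args(size):
    # Per-modality encoder configuration (same keys as the example above)
    return {
        "input_size": size,
        "hidden_size": 100,
        "layers": 1,
        "bidirectional": True,
        "dropout": 0.2,
        "rnn_type": "lstm",
        "attention": True,
    }

fuser_args = {
    "n_modalities": 3,
    "dropout": 0.2,
    "output_size": 100,
    "hidden_size": 100,
    "fusion_method": "cat",
    "timesteps_pooling_method": "rnn",
}

encoder = TrimodalEncoder(enc_args(300), enc_args(74), enc_args(35), fuser_args)

text, audio, visual = torch.randn(4, 10, 300), torch.randn(4, 10, 74), torch.randn(4, 10, 35)
lengths = torch.full((4,), 10, dtype=torch.long)
fused = encoder(text, audio, visual, lengths=lengths)  # [4, encoder.out_size]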
UnimodalClassifier
__init__(self, input_size, hidden_size, num_classes, layers=1, bidirectional=True, dropout=0.2, rnn_type='lstm', attention=True, **kwargs)
special
Encode and classify unimodal inputs
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
The input modality feature size |
required |
hidden_size |
int |
Hidden size for RNN |
required |
num_classes |
int |
The number of target classes |
required |
layers |
int |
Number of RNN layers |
1 |
bidirectional |
bool |
Use biRNN |
True |
dropout |
float |
Dropout probability |
0.2 |
rnn_type |
str |
[lstm|gru] |
'lstm' |
attention |
bool |
Use attention on hidden states |
True |
Source code in slp/modules/multimodal.py
def __init__(
self,
input_size: int,
hidden_size: int,
num_classes: int,
layers: int = 1,
bidirectional: bool = True,
dropout: float = 0.2,
rnn_type: str = "lstm",
attention: bool = True,
**kwargs,
):
"""Encode and classify unimodal inputs
Args:
input_size (int): The input modality feature size
hidden_size (int): Hidden size for RNN
num_classes (int): The number of target classes
layers (int): Number of RNN layers
bidirectional (bool): Use biRNN
dropout (float): Dropout probability
rnn_type (str): [lstm|gru]
attention (bool): Use attention on hidden states
"""
enc = UnimodalEncoder(
input_size,
hidden_size,
layers=layers,
bidirectional=bidirectional,
dropout=dropout,
rnn_type=rnn_type,
attention=attention,
aggregate_encoded=True,
)
super(UnimodalClassifier, self).__init__(enc, num_classes)
forward(self, x, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, x: torch.Tensor, lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
fused = self.enc(x, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
UnimodalEncoder
out_size: int
property
readonly
Output feature size
Returns:
Type | Description |
---|---|
int |
int: Output feature size |
__init__(self, input_size, hidden_size, layers=1, bidirectional=True, dropout=0.2, rnn_type='lstm', attention=True, merge_bi='sum', aggregate_encoded=False, **kwargs)
special
Single modality encoder
Encode a single modality using an Attentive RNN
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
Input feature size |
required |
hidden_size |
int |
RNN hidden size |
required |
layers |
int |
Number of RNN layers. Defaults to 1. |
1 |
bidirectional |
bool |
Use bidirectional RNN. Defaults to True. |
True |
dropout |
float |
Dropout probability. Defaults to 0.2. |
0.2 |
rnn_type |
str |
lstm or gru. Defaults to "lstm". |
'lstm' |
attention |
bool |
Use attention over hidden states. Defaults to True. |
True |
merge_bi |
str |
How to merge hidden states [sum|cat]. Defaults to sum. |
'sum' |
aggregate_encoded |
bool |
Aggregate hidden states. Defaults to False. |
False |
Source code in slp/modules/multimodal.py
def __init__(
self,
input_size: int,
hidden_size: int,
layers: int = 1,
bidirectional: bool = True,
dropout: float = 0.2,
rnn_type: str = "lstm",
attention: bool = True,
merge_bi: str = "sum",
aggregate_encoded: bool = False,
**kwargs,
):
"""Single modality encoder
Encode a single modality using an Attentive RNN
Args:
input_size (int): Input feature size
hidden_size (int): RNN hidden size
layers (int, optional): Number of RNN layers. Defaults to 1.
bidirectional (bool, optional): Use bidirectional RNN. Defaults to True.
dropout (float, optional): Dropout probability. Defaults to 0.2.
rnn_type (str, optional): lstm or gru. Defaults to "lstm".
attention (bool, optional): Use attention over hidden states. Defaults to True.
merge_bi (str, optional): How to merge hidden states [sum|cat]. Defaults to sum.
aggregate_encoded (bool, optional): Aggregate hidden states. Defaults to False.
"""
super(UnimodalEncoder, self).__init__(
input_size,
hidden_size,
layers=layers,
bidirectional=bidirectional,
dropout=dropout,
rnn_type=rnn_type,
attention=attention,
**kwargs,
)
self.aggregate_encoded = aggregate_encoded
self.encoder = AttentiveRNN(
input_size,
hidden_size,
batch_first=True,
layers=layers,
merge_bi=merge_bi,
bidirectional=bidirectional,
dropout=dropout,
rnn_type=rnn_type,
packed_sequence=True,
attention=attention,
return_hidden=True,
)
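Example (illustrative; not part of the library source). A sketch of encoding a single modality, assuming the forward mirrors the underlying AttentiveRNN and takes the padded sequence together with its unpadded lengths:

import torch

from slp.modules.multimodal import UnimodalEncoder

encoder = UnimodalEncoder(74, 100, bidirectional=True, attention=True)

audio = torch.randn(4, 25, 74)            # [B, L, D]
lengths = torch.tensor([25, 20, 18, 10])  # unpadded length of each sample
encoded = encoder(audio, lengths)         # hidden states, or pooled states if aggregate_encoded=True
print(encoder.out_size)                   # feature size of the encoder output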
VisualEncoder
Alias for Unimodal Encoder
VisualTextClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/multimodal.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
M3
HardMultimodalDropout
__init__(self, p=0.5, n_modalities=3, p_mod=None)
special
MMDrop initial implementation
For each sample in a batch drop one of the modalities with probability p
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p |
float |
drop probability |
0.5 |
n_modalities |
int |
number of modalities |
3 |
p_mod |
Optional[List[float]] |
Drop probabilities for each modality |
None |
Source code in slp/modules/mmdrop.py
def __init__(
self, p: float = 0.5, n_modalities: int = 3, p_mod: Optional[List[float]] = None
):
"""MMDrop initial implementation
For each sample in a batch drop one of the modalities with probability p
Args:
p (float): drop probability
n_modalities (int): number of modalities
p_mod (Optional[List[float]]): Drop probabilities for each modality
"""
super(HardMultimodalDropout, self).__init__()
self.p = p
self.n_modalities = n_modalities
self.p_mod = [1.0 / n_modalities for _ in range(n_modalities)]
if p_mod is not None:
self.p_mod = p_mod
forward(self, *mods)
Naive mmdrop forward
Iterate over batch and randomly choose modality to drop
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods |
varargs torch.Tensor |
[B, L, D_m] Modality representations |
() |
Returns:
Type | Description |
---|---|
(List[torch.Tensor]) |
The modality representations. Some of them are dropped |
Source code in slp/modules/mmdrop.py
def forward(self, *mods):
"""Naive mmdrop forward
Iterate over batch and randomly choose modality to drop
Args:
mods (varargs torch.Tensor): [B, L, D_m] Modality representations
Returns:
(List[torch.Tensor]): The modality representations. Some of them are dropped
"""
mods = list(mods)
# List of [B, L, D]
if self.training:
if random.random() < self.p:
# Drop different modality for each sample in batch
for batch in range(mods[0].size(0)):
m = random.choices(
list(range(self.n_modalities)), weights=self.p_mod, k=1
)[0]
# m = random.randint(0, self.n_modalities - 1)
mask = torch.ones_like(mods[m])
mask[batch] = 0.0
mods[m] = mods[m] * mask
if self.p > 0:
for m in range(len(mods)):
keep_prob = 1 - (self.p / self.n_modalities)
mods[m] = mods[m] * (1 / keep_prob)
return mods
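Example (illustrative; not part of slp/modules/mmdrop.py). During train() one modality may be zeroed per sample and all modalities are rescaled by 1 / keep_prob; outside training no modality is dropped:

import torch

from slp.modules.mmdrop import HardMultimodalDropout

mmdrop = HardMultimodalDropout(p=0.5, n_modalities=3)
t, a, v = torch.ones(2, 4, 8), torch.ones(2, 4, 8), torch.ones(2, 4, 8)

mmdrop.train()
t_d, a_d, v_d = mmdrop(t, a, v)  # some samples of one modality may be all zeros,
                                 # the rest are scaled by 1 / (1 - p / n_modalities)

mmdrop.eval()
t_e, a_e, v_e = mmdrop(t, a, v)  # no modalities are dropped at inference time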
MultimodalDropout
__init__(self, p=0.5, n_modalities=3, p_mod=None, mode='hard')
special
mmdrop wrapper class
Drop p * 100 % of features of a specific modality over batch
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p |
float |
drop probability |
0.5 |
n_modalities |
int |
number of modalities |
3 |
p_mod |
Optional[List[float]] |
Drop probabilities for each modality |
None |
mode |
str |
Hard or soft mmdrop |
'hard' |
Source code in slp/modules/mmdrop.py
def __init__(
self,
p: float = 0.5,
n_modalities: int = 3,
p_mod: Optional[List[float]] = None,
mode: str = "hard",
):
"""mmdrop wrapper class
Drop p * 100 % of features of a specific modality over batch
Args:
p (float): drop probability
n_modalities (int): number of modalities
p_mod (Optional[List[float]]): Drop probabilities for each modality
mode (str): Hard or soft mmdrop
"""
super(MultimodalDropout, self).__init__()
assert mode in [
"hard",
"soft",
], "Allowed mode for MultimodalDropout ['hard' | 'soft']"
if mode == "hard":
self.mmdrop = HardMultimodalDropout(
p=p, n_modalities=n_modalities, p_mod=p_mod
)
else:
self.mmdrop = SoftMultimodalDropout( # type: ignore
p=p, n_modalities=n_modalities, p_mod=p_mod
)
forward(self, *mods)
mmdrop wrapper forward
Perform hard or soft mmdrop
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods |
varargs torch.Tensor |
[B, L, D_m] Modality representations |
() |
Returns:
Type | Description |
---|---|
(List[torch.Tensor]) |
The modality representations. Some of them are dropped |
Source code in slp/modules/mmdrop.py
def forward(self, *mods):
"""mmdrop wrapper forward
Perform hard or soft mmdrop
Args:
mods (varargs torch.Tensor): [B, L, D_m] Modality representations
Returns:
(List[torch.Tensor]): The modality representations. Some of them are dropped
"""
return self.mmdrop(*mods)
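Example (illustrative; not part of slp/modules/mmdrop.py). Selecting between the two variants through the wrapper:

from slp.modules.mmdrop import MultimodalDropout

hard_drop = MultimodalDropout(p=0.3, n_modalities=3, mode="hard")
soft_drop = MultimodalDropout(p=0.3, n_modalities=3, mode="soft")

# Both are used identically inside a model's forward pass, e.g.
# text, audio, visual = hard_drop(text, audio, visual)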
SoftMultimodalDropout
__init__(self, p=0.5, n_modalities=3, p_mod=None)
special
Soft mmdrop implementation
Drop p * 100 % of features of a specific modality over batch
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p |
float |
drop probability |
0.5 |
n_modalities |
int |
number of modalities |
3 |
p_mod |
Optional[List[float]] |
Drop probabilities for each modality |
None |
Source code in slp/modules/mmdrop.py
def __init__(
self, p: float = 0.5, n_modalities: int = 3, p_mod: Optional[List[float]] = None
):
"""Soft mmdrop implementation
Drop p * 100 % of features of a specific modality over batch
Args:
p (float): drop probability
n_modalities (int): number of modalities
p_mod (Optional[List[float]]): Drop probabilities for each modality
"""
super(SoftMultimodalDropout, self).__init__()
self.p = p # p_drop
self.n_modalities = n_modalities
self.p_mod = [1.0 / n_modalities for _ in range(n_modalities)]
if p_mod is not None:
self.p_mod = p_mod
forward(self, *mods)
Soft mmdrop forward
Sample a binomial mask to mask a random modality in this batch
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods |
varargs torch.Tensor |
[B, L, D_m] Modality representations |
() |
Returns:
Type | Description |
---|---|
(List[torch.Tensor]) |
The modality representations. Some of them are dropped |
Source code in slp/modules/mmdrop.py
def forward(self, *mods):
"""Soft mmdrop forward
Sample a binomial mask to mask a random modality in this batch
Args:
mods (varargs torch.Tensor): [B, L, D_m] Modality representations
Returns:
(List[torch.Tensor]): The modality representations. Some of them are dropped
"""
mods = list(mods)
if self.training:
# m = random.randint(0, self.n_modalities - 1)
m = random.choices(list(range(self.n_modalities)), weights=self.p_mod, k=1)[
0
]
binomial = torch.distributions.binomial.Binomial(probs=1 - self.p)
mods[m] = mods[m] * binomial.sample(mods[m].size()).to(mods[m].device)
for m in range(self.n_modalities):
mods[m] = mods[m] * (1.0 / (1 - self.p / self.n_modalities))
return mods
M3
encoder_cfg(input_size, **cfg)
staticmethod
Static method to create the encoder configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
Input modality size |
required |
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The encoder configuration |
Source code in slp/modules/m3.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
"""Static method to create the encoder configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
input_size (int): Input modality size
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The encoder configuration
"""
return {
"input_size": input_size,
"hidden_size": cfg.get("hidden_size", 100),
"layers": cfg.get("layers", 1),
"bidirectional": cfg.get("bidirectional", True),
"dropout": cfg.get("dropout", 0.2),
"rnn_type": cfg.get("rnn_type", "lstm"),
"attention": cfg.get("attention", True),
}
fuser_cfg(**cfg)
staticmethod
Static method to create the fuser configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The fuser configuration |
Source code in slp/modules/m3.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
"""Static method to create the fuser configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The fuser configuration
"""
return {
"n_modalities": 3,
"dropout": cfg.get("dropout", 0.2),
"output_size": cfg.get("hidden_size", 100),
"hidden_size": cfg.get("hidden_size", 100),
"fusion_method": "attention",
"timesteps_pooling_method": "rnn",
"residual": cfg.get("residual", True),
"use_all_trimodal": cfg.get("use_all_trimodal", True),
"mmdrop_prob": 0.2,
"mmdrop_individual_mod_prob": None,
"mmdrop_algorithm": "hard",
}
M3Classifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/m3.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out
M3FuseAggregate
out_size: int
property
readonly
Define the feature size of the returned tensor
Returns:
Type | Description |
---|---|
int |
int: The feature dimension of the output tensor |
__init__(self, feature_size, n_modalities, output_size=None, fusion_method='cat', timesteps_pooling_method='sum', batch_first=True, mmdrop_prob=0.2, mmdrop_individual_mod_prob=None, mmdrop_algorithm='hard', **fuser_kwargs)
special
Apply MultimodalDropout, fuse the input feature sequences, and aggregate across timesteps
MultimodalDropout -> Fuser -> TimestepsPooler
Parameters:
Name | Type | Description | Default |
---|---|---|---|
feature_size |
int |
The input modality representations dimension |
required |
n_modalities |
int |
Number of input modalities |
required |
output_size |
Optional[int] |
Required output size. If not provided, output_size = fuser.out_size |
None |
fusion_method |
str |
Select which fuser to use [cat|sum|attention|bilinear] |
'cat' |
timesteps_pooling_method |
str |
TimestepsPooler method [cat|sum|rnn] |
'sum' |
batch_first |
bool |
Input tensors are in batch-first configuration. Leave this as True unless you know what you are doing. |
True |
mmdrop_prob |
float |
The probability for multimodal dropout. Defaults to 0.2 |
0.2 |
mmdrop_individual_mod_prob |
Optional[List[float]] |
Drop probabilities for each modality for multimodal dropout. If None all modalities are dropped with equal probability |
None |
mmdrop_algorithm |
str |
Choose multimodal dropout algorithm [hard|soft]. Defaults to hard |
'hard' |
**fuser_kwargs |
dict |
Extra keyword arguments to instantiate fuser |
{} |
Source code in slp/modules/m3.py
def __init__(
self,
feature_size: int,
n_modalities: int,
output_size: Optional[int] = None,
fusion_method: str = "cat",
timesteps_pooling_method: str = "sum",
batch_first: bool = True,
mmdrop_prob: float = 0.2,
mmdrop_individual_mod_prob: Optional[List[float]] = None,
mmdrop_algorithm: str = "hard",
**fuser_kwargs,
):
"""MultimodalDropout, Fuse input feature sequences and aggregate across timesteps
MultimodalDropout -> Fuser -> TimestepsPooler
Args:
feature_size (int): The input modality representations dimension
n_modalities (int): Number of input modalities
output_size (Optional[int]): Required output size. If not provided,
output_size = fuser.out_size
fusion_method (str): Select which fuser to use [cat|sum|attention|bilinear]
timesteps_pooling_method (str): TimestepsPooler method [cat|sum|rnn]
batch_first (bool): Input tensors are in batch-first configuration. Leave this as True
unless you know what you are doing
mmdrop_prob (float): The probability for multimodal dropout. Defaults to 0.2
mmdrop_individual_mod_prob (Optional[List[float]]): Drop probabilities for each modality
for multimodal dropout. If None all modalities are dropped with equal probability
mmdrop_algorithm (str): Choose multimodal dropout algorithm [hard|soft]. Defaults to hard
**fuser_kwargs (dict): Extra keyword arguments to instantiate fuser
"""
super(M3FuseAggregate, self).__init__()
self.m3 = MultimodalDropout(
p=mmdrop_prob,
n_modalities=n_modalities,
p_mod=mmdrop_individual_mod_prob,
mode=mmdrop_algorithm,
)
fuser_kwargs["output_size"] = output_size
fuser_kwargs["fusion_method"] = fusion_method
fuser_kwargs["timesteps_pooling_method"] = timesteps_pooling_method
fuser_kwargs["batch_first"] = batch_first
if "n_modalities" in fuser_kwargs:
fuser_kwargs.pop("n_modalities") # Avoid multiple arguments
if "projection_size" in fuser_kwargs:
fuser_kwargs.pop("projection_size") # Avoid multiple arguments
self.fuse_aggregate = FuseAggregateTimesteps(
feature_size,
n_modalities,
**fuser_kwargs,
)
forward(self, *mods, *, lengths=None)
Fuse the modality representations and aggregate across timesteps
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*mods |
Tensor |
List of modality tensors [B, L, D] |
() |
lengths |
Optional[torch.Tensor] |
Lengths of each modality |
None |
Returns:
Type | Description |
---|---|
Tensor |
torch.Tensor: Fused tensor [B, self.out_size] |
Source code in slp/modules/m3.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""Fuse the modality representations and aggregate across timesteps
Args:
*mods: List of modality tensors [B, L, D]
lengths (Optional[Tensor]): Lengths of each modality
Returns:
torch.Tensor: Fused tensor [B, self.out_size]
"""
mods_masked: List[torch.Tensor] = self.m3(*mods)
fused: torch.Tensor = self.fuse_aggregate(*mods_masked, lengths=lengths)
return fused
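Example (illustrative; not part of slp/modules/m3.py). A usage sketch, assuming all modalities have already been projected to a common feature_size:

import torch

from slp.modules.m3 import M3FuseAggregate

fuse = M3FuseAggregate(
    feature_size=100,
    n_modalities=3,
    fusion_method="attention",
    timesteps_pooling_method="rnn",
    mmdrop_prob=0.2,
    mmdrop_algorithm="hard",
)

mods = [torch.randn(4, 12, 100) for _ in range(3)]  # three modalities, [B, L, feature_size]
lengths = torch.full((4,), 12, dtype=torch.long)
out = fuse(*mods, lengths=lengths)  # [4, fuse.out_size]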
Multimodal Feedback
BaseFeedbackUnit
__init__(self, top_size, target_size, n_top_modalities, **kwargs)
special
Base class for feedback unit
Feedback units are responsible for projecting top-level crossmodal representations to bottom-level features and applying the top-down masks
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
n_top_modalities |
int |
Number of modalities to use for feedback |
required |
Source code in slp/modules/feedback.py
def __init__(
self, top_size: int, target_size: int, n_top_modalities: int, **kwargs
):
"""Base class for feedback unit
Feedback units are responsible for projecting top-level crossmodal
representations to bottom-level features and applying the top-down masks
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
n_top_modalities (int): Number of modalities to use for feedback
"""
super(BaseFeedbackUnit, self).__init__()
self.n_ = n_top_modalities
self.mask_layers = nn.ModuleList(
[
self.make_mask_layer(top_size, target_size, **kwargs)
for _ in range(self.n_)
]
)
forward(self, x_bottom, *mods_top, *, lengths=None)
Apply the top-down masks to the input feature vector
x = x * top_down_mask
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x_bottom |
Tensor |
Bottom-level features [B, L, target_size] |
required |
*mods_top |
Tensor |
Top-level modality representations |
() |
lengths |
Optional[torch.Tensor] |
Original unpadded tensor lengths. Defaults to None. |
None |
Returns:
Type | Description |
---|---|
Tensor |
torch.Tensor: Masked low level feature tensor [B, L, target_size] |
Source code in slp/modules/feedback.py
def forward(
self,
x_bottom: torch.Tensor,
*mods_top: torch.Tensor,
lengths: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""Apply the top-down masks to the input feature vector
x = x * top_down_mask
Args:
x_bottom (torch.Tensor): Bottom-level features [B, L, target_size]
*mods_top (torch.Tensor): Top-level modality representations
lengths (Optional[torch.Tensor], optional): Original unpadded tensor lengths. Defaults to None.
Returns:
torch.Tensor: Masked low level feature tensor [B, L, target_size]
"""
mask = self._get_feedback_mask(*mods_top, lengths=lengths)
x_bottom = x_bottom * mask
return x_bottom
make_mask_layer(self, top_size, target_size, **kwargs)
Abstract method to instantiate the layer to use for top-down feedback
To be implemented by subclasses
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
Module |
nn.Module: The instantiated feedback layer |
Source code in slp/modules/feedback.py
@abc.abstractmethod
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
"""Abstract method to instantiate the layer to use for top-down feedback
To be implemented by subclasses
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: The instantiated feedback layer
"""
pass
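Example (illustrative; LinearFeedbackUnit is a hypothetical subclass, not part of slp/modules/feedback.py). Subclasses only provide the top-down projection layer; the base class builds one such layer per feedback modality and applies the resulting masks:

import torch.nn as nn

from slp.modules.feedback import BaseFeedbackUnit

class LinearFeedbackUnit(BaseFeedbackUnit):
    """Hypothetical feedback unit using a single linear top-down projection."""

    def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
        # Project each top-level representation down to the bottom-level feature size
        return nn.Linear(top_size, target_size)

unit = LinearFeedbackUnit(top_size=100, target_size=74, n_top_modalities=2)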
BoomFeedbackUnit
make_mask_layer(self, top_size, target_size, **kwargs)
Use a boom module for top-down projection
A boom module is a two-layer MLP where the inner projection size is much larger than the input and output size (similar to the position-wise feedforward layer in Transformers).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
nn.Module |
slp.modules.feedforward.TwoLayer instance |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size, target_size, **kwargs):
"""Use an boom module for top-down projection
A boom module is a two-layer MLP where the inner projection size is
much larger than the input and output size. (similar to Position feedforward in transformers)
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: slp.modules.feedforward.TwoLayer instance
"""
return TwoLayer(
top_size,
2 * top_size,
target_size,
activation=kwargs.get("activation", "gelu"),
dropout=kwargs.get("dropout", 0.2),
)
DownUpFeedbackUnit
make_mask_layer(self, top_size, target_size, **kwargs)
Use a down-up module for top-down projection
A down-up module is a two-layer MLP where the inner projection size is much smaller than the input and output size (similar to adapters).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
nn.Module |
slp.modules.feedforward.TwoLayer instance |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size, target_size, **kwargs):
"""Use an down-up module for top-down projection
A down-up module is a two-layer MLP where the inner projection size is
much smaller than the input and output size. (Similar to adapyers)
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: slp.modules.feedforward.TwoLayer instance
"""
return TwoLayer(
top_size,
top_size // 5,
target_size,
activation=kwargs.get("activation", "gelu"),
dropout=kwargs.get("dropout", 0.2),
)
Feedback
__init__(self, top_size, bottom_modality_sizes, use_self=False, mask_type='rnn', **kwargs)
special
Feedback module
Given a list of low-level features and top-level representations for n modalities:
- Create top-down masks for each modality
- Apply top-down masks to the low level features
- Return masked low-level features
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size for top-level representations (Common across modalities) |
required |
bottom_modality_sizes |
List[int] |
List of feature sizes for each low-level modality feature |
required |
use_self |
bool |
Include the self modality when creating the top-down mask. Defaults to False. |
False |
mask_type |
str |
Which feedback unit to use [rnn|gated|boom|downup]. Defaults to "rnn". |
'rnn' |
Source code in slp/modules/feedback.py
def __init__(
self,
top_size: int,
bottom_modality_sizes: List[int],
use_self: bool = False,
mask_type: str = "rnn",
**kwargs,
):
"""Feedback module
Given a list of low-level features and top-level representations for n modalities:
* Create top-down masks for each modality
* Apply top-down masks to the low level features
* Return masked low-level features
Args:
top_size (int): Feature size for top-level representations (Common across modalities)
bottom_modality_sizes (List[int]): List of feature sizes for each low-level modality feature
use_self (bool, optional): Include the self modality when creating the top-down mask. Defaults to False.
mask_type (str, optional): Which feedback unit to use [rnn|gated|boom|downup]. Defaults to "rnn".
"""
super(Feedback, self).__init__()
n_top_modalities = len(bottom_modality_sizes)
self.use_self = use_self
if not use_self:
n_top_modalities = n_top_modalities - 1
self.feedback_units = nn.ModuleList(
[
_make_feedback_unit(
top_size,
bottom_modality_sizes[i],
n_top_modalities,
mask_type=mask_type,
**kwargs,
)
for i in range(len(bottom_modality_sizes))
]
)
forward(self, mods_bottom, mods_top, lengths=None)
Create and apply the top-down masks to mods_bottom
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mods_bottom |
List[torch.Tensor] |
Low-level features for each modality |
required |
mods_top |
List[torch.Tensor] |
High-level representations for each modality |
required |
lengths |
Optional[torch.Tensor] |
Original unpadded sequence lengths. Defaults to None. |
None |
Returns:
Type | Description |
---|---|
List[torch.Tensor] |
List[torch.Tensor]: Masked low level features for each modality |
Source code in slp/modules/feedback.py
def forward(
self,
mods_bottom: List[torch.Tensor],
mods_top: List[torch.Tensor],
lengths: Optional[torch.Tensor] = None,
) -> List[torch.Tensor]:
"""Create and apply the top-down masks to mods_bottom
Args:
mods_bottom (List[torch.Tensor]): Low-level features for each modality
mods_top (List[torch.Tensor]): High-level representations for each modality
lengths (Optional[torch.Tensor], optional): Original unpadded sequence lengths. Defaults to None.
Returns:
List[torch.Tensor]: Masked low level features for each modality
"""
out = []
for i, bm in enumerate(mods_bottom):
top = mods_top if self.use_self else mods_top[:i] + mods_top[i + 1 :]
masked = self.feedback_units[i](bm, *top, lengths=lengths)
out.append(masked)
return out
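Example (illustrative; not part of slp/modules/feedback.py). Applying feedback to three modalities with low-level features of different sizes and top-level representations of a common size:

import torch

from slp.modules.feedback import Feedback

feedback = Feedback(
    top_size=100,
    bottom_modality_sizes=[300, 74, 35],
    use_self=False,
    mask_type="rnn",
)

B, L = 4, 15
mods_bottom = [torch.randn(B, L, 300), torch.randn(B, L, 74), torch.randn(B, L, 35)]
mods_top = [torch.randn(B, L, 100) for _ in range(3)]  # top-level crossmodal representations
lengths = torch.full((B,), L, dtype=torch.long)

masked_bottom = feedback(mods_bottom, mods_top, lengths=lengths)  # same shapes as mods_bottom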
GatedFeedbackUnit
Apply feedback mask using simple gating mechanism
make_mask_layer(self, top_size, target_size, **kwargs)
Use a simple nn.Linear layer for top-down projection
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
Module |
nn.Module: nn.Linear instance with dropout |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
"""Use a simple nn.Linear layer for top-down projection
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: nn.Linear instance with dropout
"""
return nn.Sequential(
nn.Linear(top_size, target_size),
nn.Dropout(p=kwargs.get("dropout", 0.2)),
)
RNNFeedbackUnit
Apply feedback mask using top-down RNN layers
make_mask_layer(self, top_size, target_size, **kwargs)
Use an RNN for top-down projection
Parameters:
Name | Type | Description | Default |
---|---|---|---|
top_size |
int |
Feature size of the top-level representations |
required |
target_size |
int |
Feature size of the bottom-level features |
required |
**kwargs |
|
extra configuration for the feedback layer |
{} |
Returns:
Type | Description |
---|---|
Module |
nn.Module: slp.modules.rnn.AttentiveRNN instance |
Source code in slp/modules/feedback.py
def make_mask_layer(self, top_size: int, target_size: int, **kwargs) -> nn.Module:
"""Use an RNN for top-down projection
Args:
top_size (int): Feature size of the top-level representations
target_size (int): Feature size of the bottom-level features
**kwargs: extra configuration for the feedback layer
Returns:
nn.Module: slp.modules.rnn.AttentiveRNN instance
"""
return AttentiveRNN(
top_size,
hidden_size=target_size,
attention=kwargs.get("attention", False),
dropout=kwargs.get("dropout", 0.2),
return_hidden=True,
bidirectional=kwargs.get("bidirectional", False),
merge_bi="sum",
rnn_type=kwargs.get("rnn_type", "lstm"),
)
MMLatch
__init__(self, text_size=300, audio_size=74, visual_size=35, hidden_size=100, dropout=0.2, encoder_layers=1, bidirectional=True, merge_bi='sum', rnn_type='lstm', encoder_attention=True, fuser_residual=True, use_all_trimodal=False, feedback=True, use_self_feedback=False, feedback_algorithm='rnn')
special
MMLatch implementation
Multimodal baseline + feedback
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text_size |
int |
Text input size. Defaults to 300. |
300 |
audio_size |
int |
Audio input size. Defaults to 74. |
74 |
visual_size |
int |
Visual input size. Defaults to 35. |
35 |
hidden_size |
int |
Hidden dimension. Defaults to 100. |
100 |
dropout |
float |
Dropout rate. Defaults to 0.2. |
0.2 |
encoder_layers |
int |
Number of encoder layers. Defaults to 1. |
1 |
bidirectional |
bool |
Use bidirectional RNNs. Defaults to True. |
True |
merge_bi |
str |
Bidirectional merging method in the encoders. Defaults to "sum". |
'sum' |
rnn_type |
str |
RNN type [lstm|gru]. Defaults to "lstm". |
'lstm' |
encoder_attention |
bool |
Use attention in the encoder RNNs. Defaults to True. |
True |
fuser_residual |
bool |
Use a ViLBERT-like residual in the attention fuser. Defaults to True. |
True |
use_all_trimodal |
bool |
Use all trimodal interactions for the Attention fuser. Defaults to False. |
False |
feedback |
bool |
Use top-down feedback. Defaults to True. |
True |
use_self_feedback |
bool |
If False, use only crossmodal features for top-down feedback. If True, also use the self modality. Defaults to False. |
False |
feedback_algorithm |
str |
Feedback module [rnn|boom|gated|downup]. Defaults to "rnn". |
'rnn' |
Source code in slp/modules/mmlatch.py
def __init__(
self,
text_size: int = 300,
audio_size: int = 74,
visual_size: int = 35,
hidden_size: int = 100,
dropout: float = 0.2,
encoder_layers: int = 1,
bidirectional: bool = True,
merge_bi: str = "sum",
rnn_type: str = "lstm",
encoder_attention: bool = True,
fuser_residual: bool = True,
use_all_trimodal: bool = False,
feedback: bool = True,
use_self_feedback: bool = False,
feedback_algorithm: str = "rnn",
):
"""MMLatch implementation
Multimodal baseline + feedback
Args:
text_size (int, optional): Text input size. Defaults to 300.
audio_size (int, optional): Audio input size. Defaults to 74.
visual_size (int, optional): Visual input size. Defaults to 35.
hidden_size (int, optional): Hidden dimension. Defaults to 100.
dropout (float, optional): Dropout rate. Defaults to 0.2.
encoder_layers (int, optional): Number of encoder layers. Defaults to 1.
bidirectional (bool, optional): Use bidirectional RNNs. Defaults to True.
merge_bi (str, optional): Bidirectional merging method in the encoders. Defaults to "sum".
rnn_type (str, optional): RNN type [lstm|gru]. Defaults to "lstm".
encoder_attention (bool, optional): Use attention in the encoder RNNs. Defaults to True.
fuser_residual (bool, optional): Use a ViLBERT-like residual in the attention fuser. Defaults to True.
use_all_trimodal (bool, optional): Use all trimodal interactions for the Attention fuser. Defaults to False.
feedback (bool, optional): Use top-down feedback. Defaults to True.
use_self_feedback (bool, optional): If False, use only crossmodal features for top-down feedback. If True, also use the self modality. Defaults to False.
feedback_algorithm (str, optional): Feedback module [rnn|boom|gated|downup]. Defaults to "rnn".
"""
super(MMLatch, self).__init__(
text_size=text_size,
audio_size=audio_size,
visual_size=visual_size,
hidden_size=hidden_size,
dropout=dropout,
encoder_layers=encoder_layers,
bidirectional=bidirectional,
merge_bi=merge_bi,
rnn_type=rnn_type,
encoder_attention=encoder_attention,
fuser_residual=fuser_residual,
use_all_trimodal=use_all_trimodal,
)
self.feedback = None
if feedback:
self.feedback = Feedback(
hidden_size,
[text_size, audio_size, visual_size],
use_self=use_self_feedback,
mask_type=feedback_algorithm,
)
encoder_cfg(input_size, **cfg)
staticmethod
Static method to create the encoder configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_size |
int |
Input modality size |
required |
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The encoder configuration |
Source code in slp/modules/mmlatch.py
@staticmethod
def encoder_cfg(input_size: int, **cfg) -> Dict[str, Any]:
"""Static method to create the encoder configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
input_size (int): Input modality size
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The encoder configuration
"""
return {
"input_size": input_size,
"hidden_size": cfg.get("hidden_size", 100),
"layers": cfg.get("layers", 1),
"bidirectional": cfg.get("bidirectional", True),
"dropout": cfg.get("dropout", 0.2),
"rnn_type": cfg.get("rnn_type", "lstm"),
"attention": cfg.get("attention", True),
}
forward(self, *mods, *, lengths=None)
Encode + fuse
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*mods |
Tensor |
Variable input modality tensors [B, L, D] |
() |
lengths |
Optional[torch.Tensor] |
The unpadded tensor lengths. Defaults to None. |
None |
Returns:
Type | Description |
---|---|
Tensor |
torch.Tensor: The fused tensor [B, D] |
Source code in slp/modules/mmlatch.py
def forward(
self, *mods: torch.Tensor, lengths: Optional[torch.Tensor] = None
) -> torch.Tensor:
encoded: List[torch.Tensor] = self._encode(*mods, lengths=lengths)
if self.feedback is not None:
mods_feedback: List[torch.Tensor] = self.feedback(
mods, encoded, lengths=lengths
)
encoded = self._encode(*mods_feedback, lengths=lengths)
fused = self._fuse(*encoded, lengths=lengths)
return fused
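Example (illustrative; not part of slp/modules/mmlatch.py). An end-to-end sketch of the encoder; the MMLatchClassifier documented below combines this encoder with dropout and a classification head:

import torch

from slp.modules.mmlatch import MMLatch

model = MMLatch(
    text_size=300,
    audio_size=74,
    visual_size=35,
    hidden_size=100,
    feedback=True,
    feedback_algorithm="rnn",
)

text = torch.randn(8, 20, 300)
audio = torch.randn(8, 20, 74)
visual = torch.randn(8, 20, 35)
lengths = torch.full((8,), 20, dtype=torch.long)

fused = model(text, audio, visual, lengths=lengths)  # fused representation, [8, D]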
fuser_cfg(**cfg)
staticmethod
Static method to create the fuser configuration
The default configuration is provided here. This configuration corresponds to the official paper implementation and is tuned for CMU MOSEI.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**cfg |
|
Optional keyword arguments |
{} |
Returns:
Type | Description |
---|---|
Dict[str, Any] |
Dict[str, Any]: The fuser configuration |
Source code in slp/modules/mmlatch.py
@staticmethod
def fuser_cfg(**cfg) -> Dict[str, Any]:
"""Static method to create the fuser configuration
The default configuration is provided here.
This configuration corresponds to the official paper implementation
and is tuned for CMU MOSEI.
Args:
**cfg: Optional keyword arguments
Returns:
Dict[str, Any]: The fuser configuration
"""
return {
"n_modalities": 3,
"dropout": cfg.get("dropout", 0.2),
"output_size": cfg.get("hidden_size", 100),
"hidden_size": cfg.get("hidden_size", 100),
"fusion_method": "attention",
"timesteps_pooling_method": "rnn",
"residual": cfg.get("residual", True),
"use_all_trimodal": cfg.get("use_all_trimodal", True),
}
MMLatchClassifier
forward(self, mod_dict, lengths)
Defines the computation performed at every call.
Should be overridden by all subclasses.
.. note::
Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
Source code in slp/modules/mmlatch.py
def forward(
self, mod_dict: Dict[str, torch.Tensor], lengths: Dict[str, torch.Tensor]
) -> torch.Tensor:
mods = [mod_dict["text"], mod_dict["audio"], mod_dict["visual"]]
fused = self.enc(*mods, lengths=lengths["text"])
fused = self.drop(fused)
out: torch.Tensor = self.clf(fused)
return out