Data manipulation
We provide out-of-the-box support for preprocessing NLP corpora, along with helpers for working with datasets the PyTorch way.
MultimodalSequenceClassificationCollator
__call__(self, batch)
special
Call collate function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch | List[Dict[str, torch.Tensor]] | Batch of samples. It expects a list of dictionaries from modalities to torch tensors | required |
Returns:
Type | Description |
---|---|
Tuple[Dict[str, torch.Tensor], torch.Tensor, Dict[str, torch.Tensor]] | Tuple of (dict of batched modality tensors, labels, dict of modality sequence lengths) |
Source code in slp/data/collators.py
def __call__(
self, batch: List[Dict[str, torch.Tensor]]
) -> Tuple[Dict[str, torch.Tensor], torch.Tensor, Dict[str, torch.Tensor]]:
"""Call collate function
Args:
batch (List[Dict[str, torch.Tensor]]): Batch of samples.
It expects a list of dictionaries from modalities to torch tensors
Returns:
Tuple[Dict[str, torch.Tensor], torch.Tensor, Dict[str, torch.Tensor]]: tuple of
(dict batched modality tensors, labels, dict of modality sequence lengths)
"""
inputs = {}
lengths = {}
for m in self.modalities:
seq = self.extract_sequence(batch, m)
lengths[m] = torch.tensor([s.size(0) for s in seq], device=self.device)
if self.max_length > 0:
lengths[m] = torch.clamp(lengths[m], min=0, max=self.max_length)
inputs[m] = pad_sequence(
seq,
batch_first=True,
padding_value=self.pad_indx,
max_length=self.max_length,
).to(self.device)
targets: List[Label] = [b[self.label_key] for b in batch]
# Pad and convert to tensor
ttargets: torch.Tensor = mktensor(
targets, device=self.device, dtype=self.label_dtype
)
return inputs, ttargets.to(self.device), lengths
__init__(self, pad_indx=0, modalities={'audio', 'visual', 'text'}, label_key='label', max_length=-1, label_dtype=torch.float32, device='cpu')
special
Collate function for sequence classification tasks
- Perform padding
- Calculate sequence lengths
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pad_indx | int | Pad token index. Defaults to 0. | 0 |
modalities | Set | Which modalities are included in the batch dict | {'audio', 'visual', 'text'} |
max_length | int | Pad sequences to a fixed maximum length | -1 |
label_key | str | String to access the label in the batch dict | 'label' |
device | str | Device of returned tensors. Leave this as "cpu"; the LightningModule will handle the conversion. | 'cpu' |
Examples:
>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=MultimodalSequenceClassificationCollator())
Source code in slp/data/collators.py
def __init__(
self,
pad_indx=0,
modalities={"visual", "text", "audio"},
label_key="label",
max_length=-1,
label_dtype=torch.float,
device="cpu",
):
"""Collate function for sequence classification tasks
* Perform padding
* Calculate sequence lengths
Args:
pad_indx (int): Pad token index. Defaults to 0.
modalities (Set): Which modalities are included in the batch dict
max_length (int): Pad sequences to a fixed maximum length
label_key (str): String to access the label in the batch dict
device (str): device of returned tensors. Leave this as "cpu".
The LightningModule will handle the Conversion.
Examples:
>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=MultimodalSequenceClassificationCollator())
"""
self.pad_indx = pad_indx
self.device = device
self.max_length = max_length
self.label_key = label_key
self.modalities = modalities
self.label_dtype = label_dtype
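A minimal usage sketch, calling the collator directly on a toy batch (the feature dimensions and dummy tensors below are illustrative assumptions, not part of the library):
>>> import torch
>>> from slp.data.collators import MultimodalSequenceClassificationCollator
>>> collate_fn = MultimodalSequenceClassificationCollator(modalities={"text", "audio"})
>>> batch = [
...     {"text": torch.rand(5, 300), "audio": torch.rand(7, 74), "label": 1},
...     {"text": torch.rand(3, 300), "audio": torch.rand(4, 74), "label": 0},
... ]
>>> inputs, labels, lengths = collate_fn(batch)  # padded modality tensors, label tensor, per-modality lengths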
Seq2SeqCollator
__call__(self, batch)
special
Call collate function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch | List[Tuple[torch.Tensor, torch.Tensor]] | Batch of samples. It expects a list of tuples (source, target). Each source and target is a sequence of features or ids. | required |
Returns:
Type | Description |
---|---|
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor] | Tuple of batched tensors (inputs, labels, lengths_inputs, lengths_targets) |
Source code in slp/data/collators.py
def __call__(
self, batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
"""Call collate function
Args:
batch (List[Tuple[torch.Tensor, torch.Tensor]]): Batch of samples.
It expects a list of tuples (source, target)
Each source and target are a sequences of features or ids.
Returns:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: Returns tuple of batched tensors
(inputs, labels, lengths_inputs, lengths_targets)
"""
inputs: List[torch.Tensor] = [b[0] for b in batch]
targets: List[torch.Tensor] = [b[1] for b in batch]
lengths_inputs = torch.tensor([s.size(0) for s in inputs], device=self.device)
lengths_targets = torch.tensor([s.size(0) for s in targets], device=self.device)
if self.max_length > 0:
lengths_inputs = torch.clamp(lengths_inputs, min=0, max=self.max_length)
lengths_targets = torch.clamp(lengths_targets, min=0, max=self.max_length)
inputs_padded: torch.Tensor = pad_sequence(
inputs,
batch_first=True,
padding_value=self.pad_indx,
max_length=self.max_length,
).to(self.device)
targets_padded: torch.Tensor = pad_sequence(
targets,
batch_first=True,
padding_value=self.pad_indx,
max_length=self.max_length,
).to(self.device)
return inputs_padded, targets_padded, lengths_inputs, lengths_targets
__init__(self, pad_indx=0, max_length=-1, device='cpu')
special
Collate function for seq2seq tasks
- Perform padding
- Calculate sequence lengths
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pad_indx | int | Pad token index. Defaults to 0. | 0 |
max_length | int | Pad sequences to a fixed maximum length | -1 |
device | str | Device of returned tensors. Leave this as "cpu"; the LightningModule will handle the conversion. | 'cpu' |
Examples:
>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=Seq2SeqCollator())
Source code in slp/data/collators.py
def __init__(self, pad_indx=0, max_length=-1, device="cpu"):
"""Collate function for seq2seq tasks
* Perform padding
* Calculate sequence lengths
Args:
pad_indx (int): Pad token index. Defaults to 0.
max_length (int): Pad sequences to a fixed maximum length
device (str): device of returned tensors. Leave this as "cpu".
The LightningModule will handle the Conversion.
Examples:
>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=Seq2SeqCollator())
"""
self.pad_indx = pad_indx
self.max_length = max_length
self.device = device
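A minimal usage sketch on a toy batch of (source, target) id sequences (the dummy tensors are illustrative assumptions):
>>> import torch
>>> from slp.data.collators import Seq2SeqCollator
>>> collate_fn = Seq2SeqCollator(pad_indx=0)
>>> batch = [
...     (torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6])),
...     (torch.tensor([7, 8]), torch.tensor([9, 10, 11])),
... ]
>>> sources, targets, source_lengths, target_lengths = collate_fn(batch)  # padded sources / targets and their lengths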
SequenceClassificationCollator
__call__(self, batch)
special
Call collate function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch | List[Tuple[torch.Tensor, Union[numpy.ndarray, torch.Tensor, List[~T], int]]] | Batch of samples. It expects a list of tuples (inputs, label). | required |
Returns:
Type | Description |
---|---|
Tuple[torch.Tensor, torch.Tensor, torch.Tensor] | Tuple of batched tensors (inputs, labels, lengths) |
Source code in slp/data/collators.py
def __call__(
self, batch: List[Tuple[torch.Tensor, Label]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""Call collate function
Args:
batch (List[Tuple[torch.Tensor, slp.util.types.Label]]): Batch of samples.
It expects a list of tuples (inputs, label).
Returns:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Returns tuple of batched tensors (inputs, labels, lengths)
"""
inputs: List[torch.Tensor] = [b[0] for b in batch]
targets: List[Label] = [b[1] for b in batch]
# targets: List[torch.tensor] = map(list, zip(*batch))
lengths = torch.tensor([s.size(0) for s in inputs], device=self.device)
if self.max_length > 0:
lengths = torch.clamp(lengths, min=0, max=self.max_length)
# Pad and convert to tensor
inputs_padded: torch.Tensor = pad_sequence(
inputs,
batch_first=True,
padding_value=self.pad_indx,
max_length=self.max_length,
).to(self.device)
ttargets: torch.Tensor = mktensor(targets, device=self.device, dtype=torch.long)
return inputs_padded, ttargets.to(self.device), lengths
__init__(self, pad_indx=0, max_length=-1, device='cpu')
special
Collate function for sequence classification tasks
- Perform padding
- Calculate sequence lengths
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pad_indx | int | Pad token index. Defaults to 0. | 0 |
max_length | int | Pad sequences to a fixed maximum length | -1 |
device | str | Device of returned tensors. Leave this as "cpu"; the LightningModule will handle the conversion. | 'cpu' |
Examples:
>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=SequenceClassificationCollator())
Source code in slp/data/collators.py
def __init__(self, pad_indx=0, max_length=-1, device="cpu"):
"""Collate function for sequence classification tasks
* Perform padding
* Calculate sequence lengths
Args:
pad_indx (int): Pad token index. Defaults to 0.
max_length (int): Pad sequences to a fixed maximum length
device (str): device of returned tensors. Leave this as "cpu".
The LightningModule will handle the Conversion.
Examples:
>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=SequenceClassificationCollator())
"""
self.pad_indx = pad_indx
self.device = device
self.max_length = max_length
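A minimal usage sketch, both calling the collator directly and plugging it into a DataLoader (the toy tensors and my_dataset are illustrative assumptions):
>>> import torch
>>> from torch.utils.data import DataLoader
>>> from slp.data.collators import SequenceClassificationCollator
>>> collate_fn = SequenceClassificationCollator(pad_indx=0)
>>> inputs, labels, lengths = collate_fn([(torch.tensor([1, 2, 3]), 0), (torch.tensor([4, 5]), 1)])
>>> dataloader = DataLoader(my_dataset, batch_size=32, collate_fn=collate_fn)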
EmbeddingsLoader
__init__(self, embeddings_file, dim, vocab=None, extra_tokens=None)
special
Load word embeddings in text format
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embeddings_file | str | File where embeddings are stored (e.g. glove.6B.50d.txt) | required |
dim | int | Dimensionality of embeddings | required |
vocab | Optional[Dict[str, int]] | Load only embeddings in vocab. Defaults to None. | None |
extra_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Create random embeddings for these special tokens. Defaults to None. | None |
Source code in slp/data/corpus.py
def __init__(
self,
embeddings_file: str,
dim: int,
vocab: Optional[Dict[str, int]] = None,
extra_tokens: Optional[SPECIAL_TOKENS] = None,
) -> None:
"""Load word embeddings in text format
Args:
embeddings_file (str): File where embeddings are stored (e.g. glove.6B.50d.txt)
dim (int): Dimensionality of embeddings
vocab (Optional[Dict[str, int]]): Load only embeddings in vocab. Defaults to None.
extra_tokens (Optional[slp.config.nlp.SPECIAL_TOKENS]): Create random embeddings for these special tokens.
Defaults to None.
"""
self.embeddings_file = embeddings_file
self.vocab = vocab
self.cache_ = self._get_cache_name()
self.dim_ = dim
self.extra_tokens = extra_tokens
__repr__(self)
special
String representation of class
Source code in slp/data/corpus.py
def __repr__(self):
"""String representation of class"""
return f"{self.__class__.__name__}({self.embeddings_file}, {self.dim_})"
augment_embeddings(self, word2idx, idx2word, embeddings, token, emb=None)
Create a random embedding for a special token and append it to the embeddings array
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word2idx | Dict[str, int] | Current word2idx map | required |
idx2word | Dict[int, str] | Current idx2word map | required |
embeddings | List[numpy.ndarray] | Embeddings array as list of embeddings | required |
token | str | The special token (e.g. [PAD]) | required |
emb | Optional[numpy.ndarray] | Optional value for the embedding to be appended. Defaults to None, where a random embedding is created. | None |
Returns:
Type | Description |
---|---|
Tuple[Dict[str, int], Dict[int, str], List[numpy.ndarray]] | (word2idx, idx2word, embeddings) tuple |
Source code in slp/data/corpus.py
def augment_embeddings(
self,
word2idx: Dict[str, int],
idx2word: Dict[int, str],
embeddings: List[np.ndarray],
token: str,
emb: Optional[np.ndarray] = None,
) -> Tuple[Dict[str, int], Dict[int, str], List[np.ndarray]]:
"""Create a random embedding for a special token and append it to the embeddings array
Args:
word2idx (Dict[str, int]): Current word2idx map
idx2word (Dict[int, str]): Current idx2word map
embeddings (List[np.ndarray]): Embeddings array as list of embeddings
token (str): The special token (e.g. [PAD])
emb (Optional[np.ndarray]): Optional value for the embedding to be appended.
Defaults to None, where a random embedding is created.
Returns:
Tuple[Dict[str, int], Dict[int, str], List[np.ndarray]]: (word2idx, idx2word, embeddings) tuple
"""
word2idx[token] = len(embeddings)
idx2word[len(embeddings)] = token
if emb is None:
emb = np.random.uniform(low=-0.05, high=0.05, size=self.dim_)
embeddings.append(emb)
return word2idx, idx2word, embeddings
in_accepted_vocab(self, word)
Check if word exists in given vocabulary
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word | str | Word from the embeddings file | required |
Returns:
Type | Description |
---|---|
bool | True if the word exists in the given vocabulary |
Source code in slp/data/corpus.py
def in_accepted_vocab(self, word: str) -> bool:
"""Check if word exists in given vocabulary
Args:
word (str): word from embeddings file
Returns:
bool: Word exists
"""
return True if self.vocab is None else word in self.vocab
load(self)
Read the word vectors from a text file
- Read embeddings
- Filter with given vocabulary
- Augment with special tokens
Returns:
Type | Description |
---|---|
Tuple[Dict[str, int], Dict[int, str], numpy.ndarray] | types.Embeddings: (word2idx, idx2word, embeddings) tuple |
Source code in slp/data/corpus.py
@system.timethis(method=True)
def load(self) -> types.Embeddings:
"""Read the word vectors from a text file
* Read embeddings
* Filter with given vocabulary
* Augment with special tokens
Returns:
types.Embeddings: (word2idx, idx2word, embeddings) tuple
"""
# in order to avoid this time consuming operation, cache the results
try:
cache = self._load_cache()
logger.info("Loaded word embeddings from cache.")
return cache
except OSError:
logger.warning(f"Didn't find embeddings cache file {self.embeddings_file}")
logger.warning("Loading embeddings from file.")
# create the necessary dictionaries and the word embeddings matrix
if not os.path.exists(self.embeddings_file):
logger.critical(f"{self.embeddings_file} not found!")
raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), self.embeddings_file)
logger.info(f"Indexing file {self.embeddings_file} ...")
# create the 2D array, which will be used for initializing
# the Embedding layer of a NN.
# We reserve the first row (idx=0), as the word embedding,
# which will be used for zero padding (word with id = 0).
if self.extra_tokens is not None:
word2idx, idx2word, embeddings = self.augment_embeddings(
{},
{},
[],
self.extra_tokens.PAD.value, # type: ignore
emb=np.zeros(self.dim_),
)
for token in self.extra_tokens: # type: ignore
logger.debug(f"Adding token {token.value} to embeddings matrix")
if token == self.extra_tokens.PAD:
continue
word2idx, idx2word, embeddings = self.augment_embeddings(
word2idx, idx2word, embeddings, token.value
)
else:
word2idx, idx2word, embeddings = self.augment_embeddings(
{}, {}, [], "[PAD]", emb=np.zeros(self.dim_)
)
# read file, line by line
with open(self.embeddings_file, "r") as f:
num_lines = sum(1 for line in f)
with open(self.embeddings_file, "r") as f:
index = len(embeddings)
for line in tqdm(
f, total=num_lines, desc="Loading word embeddings...", leave=False
):
# skip the first row if it is a header
if len(line.split()) < self.dim_:
continue
values = line.rstrip().split(" ")
word = values[0]
if word in word2idx:
continue
if not self.in_accepted_vocab(word):
continue
vector = np.asarray(values[1:], dtype=np.float32)
idx2word[index] = word
word2idx[word] = index
embeddings.append(vector)
index += 1
logger.info(f"Loaded {len(embeddings)} word vectors.")
embeddings_out = np.array(embeddings, dtype="float32")
# write the data to a cache file
self._dump_cache((word2idx, idx2word, embeddings_out))
return word2idx, idx2word, embeddings_out
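A minimal loading sketch (the GloVe file path is an illustrative assumption; any embeddings file in GloVe text format with matching dimensionality works):
>>> from slp.data.corpus import EmbeddingsLoader
>>> loader = EmbeddingsLoader("./cache/glove.6B.50d.txt", 50)
>>> word2idx, idx2word, embeddings = loader.load()  # embeddings is a (vocab_size, 50) numpy array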
HfCorpus
embeddings: None
property
readonly
Unused. Defined for compatibility
frequencies: Dict[str, int]
property
readonly
Retrieve wordpiece occurrence counts
Returns:
Type | Description |
---|---|
Dict[str, int] | Wordpiece occurrence counts |
idx2word: None
property
readonly
Unused. Defined for compatibility
indices: List[List[int]]
property
readonly
Retrieve corpus as token indices
Returns:
Type | Description |
---|---|
List[List[int]] | Token indices for corpus |
raw: List[str]
property
readonly
Retrieve raw corpus
Returns:
Type | Description |
---|---|
List[str] | Raw corpus |
tokenized: List[List[str]]
property
readonly
Retrieve tokenized corpus
Returns:
Type | Description |
---|---|
List[List[str]] | Tokenized corpus |
vocab: Set[str]
property
readonly
Retrieve set of words in vocabulary
Returns:
Type | Description |
---|---|
Set[str] | Set of words in vocabulary |
vocab_size: int
property
readonly
Retrieve vocabulary size
Returns:
Type | Description |
---|---|
int | Vocabulary size |
word2idx: None
property
readonly
Unused. Defined for compatibility
__getitem__(self, idx)
special
Get ith element in corpus as token indices
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index in corpus | required |
Returns:
Type | Description |
---|---|
List[int] | List of token indices for sentence |
Source code in slp/data/corpus.py
def __getitem__(self, idx) -> List[int]:
"""Get ith element in corpus as token indices
Args:
idx (List[int]): index in corpus
Returns:
List[int]: List of token indices for sentence
"""
out: List[int] = (
self.corpus_indices_[idx]
if self.max_length <= 0
else self.corpus_indices_[idx][: self.max_length]
)
return out
__init__(self, corpus, lower=True, tokenizer_model='bert-base-uncased', add_special_tokens=True, special_tokens=<enum 'SPECIAL_TOKENS'>, max_length=-1, **kwargs)
special
Process a corpus using hugging face tokenizers
Select one of hugging face tokenizers and process corpus
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus | List[str] | List of sentences | required |
lower | bool | Convert strings to lower case. Defaults to True. | True |
tokenizer_model | str | Hugging face model to use. Defaults to "bert-base-uncased". | 'bert-base-uncased' |
add_special_tokens | bool | Add special tokens in sentence during tokenization. Defaults to True. | True |
special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens to include in the vocabulary. Defaults to slp.config.nlp.SPECIAL_TOKENS. | SPECIAL_TOKENS |
max_length | int | Crop sequences above this length. Defaults to -1, where sequences are left unaltered. | -1 |
Source code in slp/data/corpus.py
def __init__(
self,
corpus: List[str],
lower: bool = True,
tokenizer_model: str = "bert-base-uncased",
add_special_tokens: bool = True,
special_tokens: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS, # type: ignore
max_length: int = -1,
**kwargs,
):
"""Process a corpus using hugging face tokenizers
Select one of hugging face tokenizers and process corpus
Args:
corpus (List[str]): List of sentences
lower (bool): Convert strings to lower case. Defaults to True.
tokenizer_model (str): Hugging face model to use. Defaults to "bert-base-uncased".
add_special_tokens (bool): Add special tokens in sentence during tokenization. Defaults to True.
special_tokens (Optional[SPECIAL_TOKENS]): Special tokens to include in the vocabulary.
Defaults to slp.config.nlp.SPECIAL_TOKENS.
max_length (int): Crop sequences above this length. Defaults to -1 where sequences are left unaltered.
"""
self.corpus_ = corpus
self.max_length = max_length
logger.info(
f"Tokenizing corpus using hugging face tokenizer from {tokenizer_model}"
)
self.tokenizer = HuggingFaceTokenizer(
lower=lower, model=tokenizer_model, add_special_tokens=add_special_tokens
)
self.corpus_indices_ = [
self.tokenizer(s)
for s in tqdm(
self.corpus_, desc="Converting tokens to indices...", leave=False
)
]
self.tokenized_corpus_ = [
self.tokenizer.detokenize(s)
for s in tqdm(
self.corpus_indices_,
desc="Mapping indices to tokens...",
leave=False,
)
]
self.vocab_ = create_vocab(
self.tokenized_corpus_,
vocab_size=-1,
special_tokens=special_tokens,
)
__len__(self)
special
Number of samples in corpus
Returns:
Type | Description |
---|---|
int | Corpus length |
Source code in slp/data/corpus.py
def __len__(self) -> int:
"""Number of samples in corpus
Returns:
int: Corpus length
"""
return len(self.corpus_indices_)
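A minimal usage sketch (the toy sentences are illustrative; downloading the pretrained tokenizer requires network access the first time):
>>> from slp.data.corpus import HfCorpus
>>> corpus = HfCorpus(["The big brown fox.", "Jumps over the lazy dog."])
>>> corpus.indices[0]    # wordpiece ids for the first sentence
>>> corpus.tokenized[0]  # wordpieces for the first sentence
>>> len(corpus)
2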
TokenizedCorpus
embeddings: None
property
readonly
Unused. Kept for compatibility
frequencies: Dict[str, int]
property
readonly
Retrieve wordpiece occurrence counts
Returns:
Type | Description |
---|---|
Dict[str, int] | Wordpiece occurrence counts |
idx2word: Dict[int, str]
property
readonly
Retrieve idx2word mapping
Returns:
Type | Description |
---|---|
Dict[int, str] | idx2word mapping |
indices: Union[List[int], List[List[int]]]
property
readonly
Retrieve corpus as token indices
Returns:
Type | Description |
---|---|
Union[List[int], List[List[int]]] | Token indices for corpus |
raw: Union[List[str], List[List[str]]]
property
readonly
Retrieve raw corpus
Returns:
Type | Description |
---|---|
Union[List[str], List[List[str]]] | Raw corpus |
tokenized: Union[List[str], List[List[str]]]
property
readonly
Retrieve tokenized corpus
Returns:
Type | Description |
---|---|
Union[List[str], List[List[str]]] | Tokenized corpus |
vocab: Set[str]
property
readonly
Retrieve set of words in vocabulary
Returns:
Type | Description |
---|---|
Set[str] | Set of words in vocabulary |
vocab_size: int
property
readonly
Retrieve vocabulary size
Returns:
Type | Description |
---|---|
int | Vocabulary size |
word2idx: Dict[str, int]
property
readonly
Retrieve word2idx mapping
Returns:
Type | Description |
---|---|
Dict[str, int] | word2idx mapping |
__getitem__(self, idx)
special
Get ith element in corpus as token indices
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index in corpus | required |
Returns:
Type | Description |
---|---|
List[int] | List of token indices for sentence |
Source code in slp/data/corpus.py
def __getitem__(self, idx) -> List[int]:
"""Get ith element in corpus as token indices
Args:
idx (List[int]): index in corpus
Returns:
List[int]: List of token indices for sentence
"""
out: List[int] = (
self.corpus_indices_[idx]
if self.max_length <= 0
else self.corpus_indices_[idx][: self.max_length]
)
return out
__init__(self, corpus, word2idx=None, special_tokens=<enum 'SPECIAL_TOKENS'>, max_length=-1, **kwargs)
special
Wrap a corpus that's already tokenized
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus | Union[List[str], List[List[str]]] | List of tokens or list of lists of tokens | required |
word2idx | Dict[str, int] | Token to index mapping. Defaults to None. | None |
special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | SPECIAL_TOKENS |
Source code in slp/data/corpus.py
def __init__(
self,
corpus: Union[List[str], List[List[str]]],
word2idx: Dict[str, int] = None,
special_tokens: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS, # type: ignore
max_length: int = -1,
**kwargs,
):
"""Wrap a corpus that's already tokenized
Args:
corpus (Union[List[str], List[List[str]]]): List of tokens or List of lists of tokens
word2idx (Dict[str, int], optional): Token to index mapping. Defaults to None.
special_tokens (Optional[SPECIAL_TOKENS], optional): Special Tokens. Defaults to SPECIAL_TOKENS.
"""
self.corpus_ = corpus
self.tokenized_corpus_ = corpus
self.max_length = max_length
self.vocab_ = create_vocab(
self.tokenized_corpus_,
vocab_size=-1,
special_tokens=special_tokens,
)
if word2idx is not None:
logger.info("Converting tokens to ids using word2idx.")
self.word2idx_ = word2idx
else:
logger.info(
"No word2idx provided. Will convert tokens to ids using an iterative counter."
)
self.word2idx_ = dict(zip(self.vocab_.keys(), itertools.count()))
self.idx2word_ = {v: k for k, v in self.word2idx_.items()}
self.to_token_ids = ToTokenIds(
self.word2idx_,
specials=SPECIAL_TOKENS, # type: ignore
)
if isinstance(self.tokenized_corpus_[0], list):
self.corpus_indices_ = [
self.to_token_ids(s)
for s in tqdm(
self.tokenized_corpus_,
desc="Converting tokens to token ids...",
leave=False,
)
]
else:
self.corpus_indices_ = self.to_token_ids(self.tokenized_corpus_) # type: ignore
__len__(self)
special
Number of samples in corpus
Returns:
Type | Description |
---|---|
int | Corpus length |
Source code in slp/data/corpus.py
def __len__(self) -> int:
"""Number of samples in corpus
Returns:
int: Corpus length
"""
return len(self.corpus_indices_)
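A minimal usage sketch on an already tokenized toy corpus (the example tokens are illustrative assumptions):
>>> from slp.data.corpus import TokenizedCorpus
>>> corpus = TokenizedCorpus([["the", "big", "dog"], ["the", "small", "cat"]])
>>> corpus.word2idx  # token -> id mapping built from the corpus vocabulary
>>> corpus[0]        # first sentence as token ids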
WordCorpus
embeddings: ndarray
property
readonly
Retrieve embeddings array
Returns:
Type | Description |
---|---|
ndarray | Array of pretrained word embeddings |
frequencies: Dict[str, int]
property
readonly
Retrieve word occurrence counts
Returns:
Type | Description |
---|---|
Dict[str, int] | Word occurrence counts |
idx2word: Dict[int, str]
property
readonly
Retrieve idx2word mapping
Returns:
Type | Description |
---|---|
Dict[int, str] | idx2word mapping |
indices: List[List[int]]
property
readonly
Retrieve corpus as token indices
Returns:
Type | Description |
---|---|
List[List[int]] | Token indices for corpus |
raw: List[str]
property
readonly
Retrieve raw corpus
Returns:
Type | Description |
---|---|
List[str] | Raw corpus |
tokenized: List[List[str]]
property
readonly
Retrieve tokenized corpus
Returns:
Type | Description |
---|---|
List[List[str]] | Tokenized corpus |
vocab: Set[str]
property
readonly
Retrieve set of words in vocabulary
Returns:
Type | Description |
---|---|
Set[str] | Set of words in vocabulary |
vocab_size: int
property
readonly
Retrieve vocabulary size for corpus
Returns:
Type | Description |
---|---|
int | Vocabulary size |
word2idx: Dict[str, int]
property
readonly
Retrieve word2idx mapping
Returns:
Type | Description |
---|---|
Dict[str, int] | word2idx mapping |
__getitem__(self, idx)
special
Get ith element in corpus as token indices
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index in corpus | required |
Returns:
Type | Description |
---|---|
List[int] | List of token indices for sentence |
Source code in slp/data/corpus.py
def __getitem__(self, idx) -> List[int]:
"""Get ith element in corpus as token indices
Args:
idx (List[int]): index in corpus
Returns:
List[int]: List of token indices for sentence
"""
out: List[int] = (
self.corpus_indices_[idx]
if self.max_length <= 0
else self.corpus_indices_[idx][: self.max_length]
)
return out
__init__(self, corpus, limit_vocab_size=30000, word2idx=None, idx2word=None, embeddings=None, embeddings_file=None, embeddings_dim=300, lower=True, special_tokens=<enum 'SPECIAL_TOKENS'>, prepend_bos=False, append_eos=False, lang='en_core_web_md', max_length=-1, **kwargs)
special
Load corpus embeddings, tokenize into words using spacy and convert to ids
This class processes a raw corpus. It handles:
- Tokenization into words (spacy)
- Loading of pretrained word embeddings
- Calculation of word frequencies / corpus statistics
- Conversion to token ids
You can either:
- Pass an embeddings file to load pretrained embeddings and create the word2idx mapping
- Pass an already loaded embeddings array and word2idx. This is useful for the dev / test splits, where we want to reuse the train split embeddings / word2idx.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus | List[str] | Corpus as a list of sentences | required |
limit_vocab_size | int | Upper bound for number of most frequent tokens to keep. Defaults to 30000. | 30000 |
word2idx | Optional[Dict[str, int]] | Mapping of words to indices. Defaults to None. | None |
idx2word | Optional[Dict[int, str]] | Mapping of indices to words. Defaults to None. | None |
embeddings | Optional[numpy.ndarray] | Embeddings array. Defaults to None. | None |
embeddings_file | Optional[str] | Embeddings file to read. Defaults to None. | None |
embeddings_dim | int | Dimension of embeddings. Defaults to 300. | 300 |
lower | bool | Convert strings to lower case. Defaults to True. | True |
special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens to include in the vocabulary. Defaults to slp.config.nlp.SPECIAL_TOKENS. | SPECIAL_TOKENS |
prepend_bos | bool | Prepend Beginning of Sequence token for seq2seq tasks. Defaults to False. | False |
append_eos | bool | Append End of Sequence token for seq2seq tasks. Defaults to False. | False |
lang | str | Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md". | 'en_core_web_md' |
max_length | int | Crop sequences above this length. Defaults to -1, where sequences are left unaltered. | -1 |
Source code in slp/data/corpus.py
def __init__(
self,
corpus: List[str],
limit_vocab_size: int = 30000,
word2idx: Optional[Dict[str, int]] = None,
idx2word: Optional[Dict[int, str]] = None,
embeddings: Optional[np.ndarray] = None,
embeddings_file: Optional[str] = None,
embeddings_dim: int = 300,
lower: bool = True,
special_tokens: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS, # type: ignore
prepend_bos: bool = False,
append_eos: bool = False,
lang: str = "en_core_web_md",
max_length: int = -1,
**kwargs,
):
"""Load corpus embeddings, tokenize in words using spacy and convert to ids
This class handles the handling of a raw corpus. It handles:
* Tokenization into words (spacy)
* Loading of pretrained word embedding
* Calculation of word frequencies / corpus statistics
* Conversion to token ids
You can pass either:
* Pass an embeddings file to load pretrained embeddings and create the word2idx mapping
* Pass already loaded embeddings array and word2idx. This is useful for the dev / test splits
where we want to pass the train split embeddings / word2idx.
Args:
corpus (List[List[str]]): Corpus as a list of sentences
limit_vocab_size (int): Upper bound for number of most frequent tokens to keep. Defaults to 30000.
word2idx (Optional[Dict[str, int]]): Mapping of word to indices. Defaults to None.
idx2word (Optional[Dict[int, str]]): Mapping of indices to words. Defaults to None.
embeddings (Optional[np.ndarray]): Embeddings array. Defaults to None.
embeddings_file (Optional[str]): Embeddings file to read. Defaults to None.
embeddings_dim (int): Dimension of embeddings. Defaults to 300.
lower (bool): Convert strings to lower case. Defaults to True.
special_tokens (Optional[SPECIAL_TOKENS]): Special tokens to include in the vocabulary.
Defaults to slp.config.nlp.SPECIAL_TOKENS.
prepend_bos (bool): Prepend Beginning of Sequence token for seq2seq tasks. Defaults to False.
append_eos (bool): Append End of Sequence token for seq2seq tasks. Defaults to False.
lang (str): Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md".
max_length (int): Crop sequences above this length. Defaults to -1 where sequences are left unaltered.
"""
# FIXME: Extract super class to avoid repetition
self.corpus_ = corpus
self.max_length = max_length
self.tokenizer = SpacyTokenizer(
lower=lower,
prepend_bos=prepend_bos,
append_eos=append_eos,
specials=special_tokens,
lang=lang,
)
logger.info(f"Tokenizing corpus using spacy {lang}")
self.tokenized_corpus_ = [
self.tokenizer(s)
for s in tqdm(self.corpus_, desc="Tokenizing corpus...", leave=False)
]
self.vocab_ = create_vocab(
self.tokenized_corpus_,
vocab_size=limit_vocab_size if word2idx is None else -1,
special_tokens=special_tokens,
)
self.word2idx_, self.idx2word_, self.embeddings_ = None, None, None
# self.corpus_indices_ = self.tokenized_corpus_
if word2idx is not None:
logger.info("Word2idx was already provided. Going to used it.")
if embeddings_file is not None and word2idx is None:
logger.info(
f"Going to load {len(self.vocab_)} embeddings from {embeddings_file}"
)
loader = EmbeddingsLoader(
embeddings_file,
embeddings_dim,
vocab=self.vocab_,
extra_tokens=special_tokens,
)
word2idx, idx2word, embeddings = loader.load()
if embeddings is not None:
self.embeddings_ = embeddings
if idx2word is not None:
self.idx2word_ = idx2word
if word2idx is not None:
self.word2idx_ = word2idx
logger.info("Converting tokens to ids using word2idx.")
self.to_token_ids = ToTokenIds(
self.word2idx_,
specials=SPECIAL_TOKENS, # type: ignore
)
self.corpus_indices_ = [
self.to_token_ids(s)
for s in tqdm(
self.tokenized_corpus_,
desc="Converting tokens to token ids...",
leave=False,
)
]
logger.info("Filtering corpus vocabulary.")
updated_vocab = {}
for k, v in self.vocab_.items():
if k in self.word2idx_:
updated_vocab[k] = v
logger.info(
f"Out of {len(self.vocab_)} tokens {len(self.vocab_) - len(updated_vocab)} were not found in the pretrained embeddings."
)
self.vocab_ = updated_vocab
__len__(self)
special
Number of samples in corpus
Returns:
Type | Description |
---|---|
int | Corpus length |
Source code in slp/data/corpus.py
def __len__(self) -> int:
"""Number of samples in corpus
Returns:
int: Corpus length
"""
return len(self.corpus_indices_)
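A minimal usage sketch (the embeddings file path is an illustrative assumption; the en_core_web_sm spacy model must be installed):
>>> from slp.data.corpus import WordCorpus
>>> corpus = WordCorpus(
...     ["The big brown fox jumps over the lazy dog"],
...     embeddings_file="./cache/glove.6B.50d.txt",
...     embeddings_dim=50,
...     lang="en_core_web_sm",
... )
>>> corpus.embeddings.shape  # (vocab_size, 50) pretrained embeddings matrix
>>> corpus[0]                # first sentence as token ids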
create_vocab(corpus, vocab_size=-1, special_tokens=None)
Create the vocabulary based on tokenized input corpus
- Injects special tokens in the vocabulary
- Calculates the occurrence count for each token
- Limits vocabulary to vocab_size most common tokens
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus | Union[List[str], List[List[str]]] | The tokenized corpus as a list of sentences or a list of tokenized sentences | required |
vocab_size | int | Limit vocabulary to vocab_size most common tokens. Defaults to -1, which keeps all tokens. | -1 |
special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens to include in the vocabulary. Defaults to None. | None |
Returns:
Type | Description |
---|---|
Dict[str, int] | Dictionary of all accepted tokens and their corresponding occurrence counts |
Examples:
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"])
{'far': 2, 'away': 1, 'galaxy': 1, 'a': 1, 'in': 1}
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3)
{'far': 2, 'a': 1, 'in': 1}
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3, special_tokens=slp.config.nlp.SPECIAL_TOKENS)
{'[PAD]': 0, '[MASK]': 0, '[UNK]': 0, '[BOS]': 0, '[EOS]': 0, '[CLS]': 0, '[SEP]': 0, 'far': 2, 'a': 1, 'in': 1}
Source code in slp/data/corpus.py
def create_vocab(
corpus: Union[List[str], List[List[str]]],
vocab_size: int = -1,
special_tokens: Optional[SPECIAL_TOKENS] = None,
) -> Dict[str, int]:
"""Create the vocabulary based on tokenized input corpus
* Injects special tokens in the vocabulary
* Calculates the occurence count for each token
* Limits vocabulary to vocab_size most common tokens
Args:
corpus (Union[List[str], List[List[str]]]): The tokenized corpus as a list of sentences or a list of tokenized sentences
vocab_size (int): [description]. Limit vocabulary to vocab_size most common tokens.
Defaults to -1 which keeps all tokens.
special_tokens Optional[SPECIAL_TOKENS]: Special tokens to include in the vocabulary. Defaults to None.
Returns:
Dict[str, int]: Dictionary of all accepted tokens and their corresponding occurence counts
Examples:
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"])
{'far': 2, 'away': 1, 'galaxy': 1, 'a': 1, 'in': 1}
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3)
{'far': 2, 'a': 1, 'in': 1}
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3, special_tokens=slp.config.nlp.SPECIAL_TOKENS)
{'[PAD]': 0, '[MASK]': 0, '[UNK]': 0, '[BOS]': 0, '[EOS]': 0, '[CLS]': 0, '[SEP]': 0, 'far': 2, 'a': 1, 'in': 1}
"""
if isinstance(corpus[0], list):
corpus = list(itertools.chain.from_iterable(corpus))
freq = Counter(corpus)
if special_tokens is None:
extra_tokens = []
else:
extra_tokens = special_tokens.to_list()
if vocab_size < 0:
vocab_size = len(freq)
take = min(vocab_size, len(freq))
logger.info(f"Keeping {vocab_size} most common tokens out of {len(freq)}")
def take0(x: Tuple[Any, Any]) -> Any:
"""Take first tuple element"""
return x[0]
common_words = list(map(take0, freq.most_common(take)))
common_words = list(set(common_words) - set(extra_tokens))
words = extra_tokens + common_words
if len(words) > vocab_size:
words = words[: vocab_size + len(extra_tokens)]
def token_freq(t):
"""Token frequeny"""
return 0 if t in extra_tokens else freq[t]
vocab = dict(zip(words, map(token_freq, words)))
logger.info(f"Vocabulary created with {len(vocab)} tokens.")
logger.info(f"The 10 most common tokens are:\n{freq.most_common(10)}")
return vocab
CorpusDataset
__getitem__(self, idx)
special
Get a processed sentence and its label from the corpus
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Sample index in the corpus | required |
Returns:
Type | Description |
---|---|
Tuple[torch.Tensor, torch.Tensor] | (processed sentence, label) |
Source code in slp/data/datasets.py
def __getitem__(self, idx):
"""Get a source and target token from the corpus
Args:
idx (int): Token position
Returns:
Tuple[torch.Tensor, torch.Tensor]: (processed sentence, label)
"""
text, target = self.corpus[idx], self.labels[idx]
if self.label_encoder is not None:
target = self.label_encoder.transform([target])[0]
for t in self.transforms:
text = t(text)
return text, target
__init__(self, corpus, labels)
special
Labeled corpus dataset
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus | WordCorpus, HfCorpus, etc. | Input corpus | required |
labels | List[Any] | Labels for examples | required |
Source code in slp/data/datasets.py
def __init__(self, corpus, labels):
"""Labeled corpus dataset
Args:
corpus (WordCorpus, HfCorpus etc..): Input corpus
labels (List[Any]): Labels for examples
"""
self.corpus = corpus
self.labels = labels
assert len(self.labels) == len(self.corpus), "Incompatible labels and corpus"
self.transforms = []
self.label_encoder = None
if isinstance(self.labels[0], str):
self.label_encoder = LabelEncoder().fit(self.labels)
__len__(self)
special
Length of corpus
Returns:
Type | Description |
---|---|
int | Corpus length |
Source code in slp/data/datasets.py
def __len__(self):
"""Length of corpus
Returns:
int: Corpus Length
"""
return len(self.corpus)
map(self, t)
Append a transform to self.transforms, in order to be applied to the data
Parameters:
Name | Type | Description | Default |
---|---|---|---|
t | Callable[[str], Any] | Transform of input token | required |
Returns:
Type | Description |
---|---|
CorpusDataset | self |
Source code in slp/data/datasets.py
def map(self, t):
"""Append a transform to self.transforms, in order to be applied to the data
Args:
t (Callable[[str], Any]): Transform of input token
Returns:
CorpusDataset: self
"""
self.transforms.append(t)
return self
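A minimal usage sketch combining a corpus, labels and a transform (the toy corpus and labels are illustrative assumptions):
>>> from slp.data.corpus import TokenizedCorpus
>>> from slp.data.datasets import CorpusDataset
>>> from slp.data.transforms import ToTensor
>>> corpus = TokenizedCorpus([["good", "movie"], ["bad", "movie"]])
>>> dataset = CorpusDataset(corpus, ["positive", "negative"]).map(ToTensor())
>>> text, label = dataset[0]  # tensor of token ids, encoded label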
CorpusLMDataset
__getitem__(self, idx)
special
Get a source and target token from the corpus
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Token position | required |
Returns:
Type | Description |
---|---|
Tuple[torch.Tensor, torch.Tensor] | source=corpus[idx], target=corpus[idx+1] |
Source code in slp/data/datasets.py
def __getitem__(self, idx):
"""Get a source and target token from the corpus
Args:
idx (int): Token position
Returns:
Tuple[torch.Tensor, torch.Tensor]: source=coprus[idx], target=corpus[idx+1]
"""
src, tgt = self.source[idx], self.target[idx]
for t in self.transforms:
src = t(src)
tgt = t(tgt)
return src, tgt
__init__(self, corpus)
special
Wraps a tokenized dataset which is provided as a list of tokens
Targets = source shifted one token to the left (next token prediction)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus | List[str] or WordCorpus | List of tokens | required |
Source code in slp/data/datasets.py
def __init__(self, corpus):
"""Wraps a tokenized dataset which is provided as a list of tokens
Targets = source shifted one token to the left (next token prediction)
Args:
corpus (List[str] or WordCorpus): List of tokens
"""
self.source = corpus[:-1]
self.target = corpus[1:]
self.transforms = []
__len__(self)
special
Length of corpus
Returns:
Type | Description |
---|---|
int | Corpus length |
Source code in slp/data/datasets.py
def __len__(self):
"""Length of corpus
Returns:
int: Corpus Length
"""
return int(len(self.source))
map(self, t)
Append a transform to self.transforms, in order to be applied to the data
Parameters:
Name | Type | Description | Default |
---|---|---|---|
t | Callable[[str], Any] | Transform of input token | required |
Returns:
Type | Description |
---|---|
CorpusLMDataset | self |
Source code in slp/data/datasets.py
def map(self, t):
"""Append a transform to self.transforms, in order to be applied to the data
Args:
t (Callable[[str], Any]): Transform of input token
Returns:
CorpusLMDataset: self
"""
self.transforms.append(t)
return self
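A minimal usage sketch for next-token prediction (the toy token list is an illustrative assumption; in practice you would also map ToTokenIds and ToTensor transforms afterwards):
>>> from slp.data.datasets import CorpusLMDataset
>>> dataset = CorpusLMDataset(["in", "a", "galaxy", "far", "far", "away"])
>>> dataset[0]  # source="in", target="a"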
HuggingFaceTokenizer
__call__(self, x)
special
Call to tokenize function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | str | Input string | required |
Returns:
Type | Description |
---|---|
List[int] | List of token ids |
Source code in slp/data/transforms.py
def __call__(self, x: str) -> List[int]:
"""Call to tokenize function
Args:
x (str): Input string
Returns:
List[int]: List of token ids
"""
out: List[int] = self.tokenizer.encode(
x, add_special_tokens=self.add_special_tokens, max_length=65536
)
return out
__init__(self, lower=True, model='bert-base-uncased', add_special_tokens=True)
special
Apply one of huggingface tokenizers to a string
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lower | bool | Lowercase string. Defaults to True. | True |
model | str | Select transformer model. Defaults to "bert-base-uncased". | 'bert-base-uncased' |
add_special_tokens | bool | Insert special tokens to tokenized string. Defaults to True. | True |
Source code in slp/data/transforms.py
def __init__(
self,
lower: bool = True,
model: str = "bert-base-uncased",
add_special_tokens: bool = True,
):
"""Apply one of huggingface tokenizers to a string
Args:
lower (bool): Lowercase string. Defaults to True.
model (str): Select transformer model. Defaults to "bert-base-uncased".
add_special_tokens (bool): Insert special tokens to tokenized string. Defaults to True.
"""
self.tokenizer = AutoTokenizer.from_pretrained(model, do_lower_case=lower)
self.vocab_size = len(self.tokenizer.vocab)
self.add_special_tokens = add_special_tokens
detokenize(self, x)
Convert list of token ids to list of tokens
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | List[int] | List of token ids | required |
Returns:
Type | Description |
---|---|
List[str] | List of tokens |
Source code in slp/data/transforms.py
def detokenize(self, x: List[int]) -> List[str]:
"""Convert list of token ids to list of tokens
Args:
x (List[int]): List of token ids
Returns:
List[str]: List of tokens
"""
out: List[str] = self.tokenizer.convert_ids_to_tokens(x)
return out
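A minimal usage sketch (downloading the pretrained tokenizer requires network access the first time):
>>> from slp.data.transforms import HuggingFaceTokenizer
>>> tokenizer = HuggingFaceTokenizer(model="bert-base-uncased")
>>> ids = tokenizer("a big dog")   # wordpiece ids, with special tokens added
>>> tokenizer.detokenize(ids)      # back to wordpiece strings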
ReplaceUnknownToken
__call__(self, x)
special
Convert <unk> in a list of tokens to [UNK]
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | List[str] | List of tokens | required |
Returns:
Type | Description |
---|---|
List[str] | List of tokens |
Source code in slp/data/transforms.py
def __call__(self, x: List[str]) -> List[str]:
"""Convert <unk> in list of tokens to [UNK]
Args:
x (List[str]): List of tokens
Returns:
List[str]: List of tokens
"""
return [w if w != self.old_unk else self.new_unk for w in x]
__init__(self, old_unk='<unk>', new_unk='[UNK]')
special
Replace existing unknown tokens in the vocab to [UNK]. Useful for wikitext
Parameters:
Name | Type | Description | Default |
---|---|---|---|
old_unk | str | Unk token in corpus. Defaults to "<unk>". | '<unk>' |
new_unk | str | Desired unk value. Defaults to SPECIAL_TOKENS.UNK.value. | '[UNK]' |
Source code in slp/data/transforms.py
def __init__(
self,
old_unk: str = "<unk>",
new_unk: str = SPECIAL_TOKENS.UNK.value, # type: ignore
):
"""Replace existing unknown tokens in the vocab to [UNK]. Useful for wikitext
Args:
old_unk (str): Unk token in corpus. Defaults to "<unk>".
new_unk (str): Desired unk value. Defaults to SPECIAL_TOKENS.UNK.value.
"""
self.old_unk = old_unk
self.new_unk = new_unk
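A minimal usage sketch:
>>> from slp.data.transforms import ReplaceUnknownToken
>>> ReplaceUnknownToken()(["a", "<unk>", "day"])
['a', '[UNK]', 'day']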
SentencepieceTokenizer
__call__(self, x)
special
Call to tokenize function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | str | Input string | required |
Returns:
Type | Description |
---|---|
List[int] | List of token ids |
Source code in slp/data/transforms.py
def __call__(self, x: str) -> List[int]:
"""Call to tokenize function
Args:
x (str): Input string
Returns:
List[int]: List of tokens ids
"""
if self.lower:
x = x.lower()
ids: List[int] = self.pre_id + self.tokenizer.encode_as_ids(x) + self.post_id
return ids
__init__(self, lower=True, model=None, prepend_bos=False, append_eos=False, specials=<enum 'SPECIAL_TOKENS'>)
special
Tokenize sentence using pretrained sentencepiece model
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lower | bool | Lowercase string. Defaults to True. | True |
model | Optional[Any] | Sentencepiece model. Defaults to None. | None |
prepend_bos | bool | Prepend BOS for seq2seq. Defaults to False. | False |
append_eos | bool | Append EOS for seq2seq. Defaults to False. | False |
specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | SPECIAL_TOKENS |
Source code in slp/data/transforms.py
def __init__(
self,
lower: bool = True,
model: Optional[Any] = None,
prepend_bos: bool = False,
append_eos: bool = False,
specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS, # type: ignore
):
"""Tokenize sentence using pretrained sentencepiece model
Args:
lower (bool): Lowercase string. Defaults to True.
model (Optional[Any]): Sentencepiece model. Defaults to None.
prepend_bos (bool): Prepend BOS for seq2seq. Defaults to False.
append_eos (bool): Append EOS for seq2seq. Defaults to False.
specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
"""
self.tokenizer = spm.SentencePieceProcessor()
self.tokenizer.Load(model)
self.specials = specials
self.lower = lower
self.vocab_size = self.tokenizer.get_piece_size()
self.pre_id = []
self.post_id = []
if prepend_bos:
self.pre_id.append(self.tokenizer.piece_to_id(self.specials.BOS.value)) # type: ignore
if append_eos:
self.post_id.append(self.tokenizer.piece_to_id(self.specials.EOS.value)) # type: ignore
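A minimal usage sketch (the model path is an illustrative assumption; it should point to a trained sentencepiece .model file):
>>> from slp.data.transforms import SentencepieceTokenizer
>>> tokenizer = SentencepieceTokenizer(model="./cache/m.model")
>>> ids = tokenizer("a big dog")  # list of sentencepiece ids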
SpacyTokenizer
__call__(self, x)
special
Call to tokenize function
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | str | Input string | required |
Returns:
Type | Description |
---|---|
List[str] | List of tokens |
Source code in slp/data/transforms.py
def __call__(self, x: str) -> List[str]:
"""Call to tokenize function
Args:
x (str): Input string
Returns:
List[str]: List of tokens
"""
if self.lower:
x = x.lower()
out: List[str] = (
self.pre_id + [y.text for y in self.nlp.tokenizer(x)] + self.post_id
)
return out
__init__(self, lower=True, prepend_bos=False, append_eos=False, specials=<enum 'SPECIAL_TOKENS'>, lang='en_core_web_sm')
special
Apply spacy tokenizer to str
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lower | bool | Lowercase string. Defaults to True. | True |
prepend_bos | bool | Prepend BOS for seq2seq. Defaults to False. | False |
append_eos | bool | Append EOS for seq2seq. Defaults to False. | False |
specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | SPECIAL_TOKENS |
lang | str | Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_sm". | 'en_core_web_sm' |
Source code in slp/data/transforms.py
def __init__(
self,
lower: bool = True,
prepend_bos: bool = False,
append_eos: bool = False,
specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS, # type: ignore
lang: str = "en_core_web_sm",
):
"""Apply spacy tokenizer to str
Args:
lower (bool): Lowercase string. Defaults to True.
prepend_bos (bool): Prepend BOS for seq2seq. Defaults to False.
append_eos (bool): Append EOS for seq2seq. Defaults to False.
specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
lang (str): Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md".
"""
self.lower = lower
self.specials = SPECIAL_TOKENS
self.lang = lang
self.pre_id = []
self.post_id = []
if prepend_bos:
self.pre_id.append(self.specials.BOS.value)
if append_eos:
self.post_id.append(self.specials.EOS.value)
self.nlp = self.get_nlp(name=lang, specials=specials)
get_nlp(self, name='en_core_web_sm', specials=<enum 'SPECIAL_TOKENS'>)
Get spacy nlp object for given lang and add SPECIAL_TOKENS
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_sm". | 'en_core_web_sm' |
specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | SPECIAL_TOKENS |
Returns:
Type | Description |
---|---|
Language | spacy.Language: spacy text-processing pipeline |
Source code in slp/data/transforms.py
def get_nlp(
self,
name: str = "en_core_web_sm",
specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS, # type: ignore
) -> spacy.Language:
"""Get spacy nlp object for given lang and add SPECIAL_TOKENS
Args:
name (str): Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md".
specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
Returns:
spacy.Language: spacy text-processing pipeline
"""
nlp = spacy.load(name)
if specials is not None:
for token in specials.to_list():
nlp.tokenizer.add_special_case(token, [{ORTH: token}])
return nlp
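A minimal usage sketch (assumes the en_core_web_sm spacy model is installed):
>>> from slp.data.transforms import SpacyTokenizer
>>> tokenizer = SpacyTokenizer(lang="en_core_web_sm")
>>> tokenizer("The big brown fox")
['the', 'big', 'brown', 'fox']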
ToTensor
__call__(self, x)
special
Convert list of tokens or list of features to tensor
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | List[Any] | List of tokens or features | required |
Returns:
Type | Description |
---|---|
Tensor | Resulting tensor |
Source code in slp/data/transforms.py
def __call__(self, x: List[Any]) -> torch.Tensor:
"""Convert list of tokens or list of features to tensor
Args:
x (List[Any]): List of tokens or features
Returns:
torch.Tensor: Resulting tensor
"""
return mktensor(x, device=self.device, dtype=self.dtype)
__init__(self, device='cpu', dtype=torch.int64)
special
To tensor convertor
Parameters:
Name | Type | Description | Default |
---|---|---|---|
device | str | Device to map the tensor. Defaults to "cpu". | 'cpu' |
dtype | dtype | Type of resulting tensor. Defaults to torch.long. | torch.int64 |
Source code in slp/data/transforms.py
def __init__(self, device: str = "cpu", dtype: torch.dtype = torch.long):
"""To tensor convertor
Args:
device (str): Device to map the tensor. Defaults to "cpu".
dtype (torch.dtype): Type of resulting tensor. Defaults to torch.long.
"""
self.device = device
self.dtype = dtype
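A minimal usage sketch:
>>> from slp.data.transforms import ToTensor
>>> ToTensor()([1, 2, 3])
tensor([1, 2, 3])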
ToTokenIds
__call__(self, x)
special
Convert list of tokens to list of token ids
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x |
List[str] |
List of tokens |
required |
Returns:
Type | Description |
---|---|
List[int] |
List[int]: List of token ids |
Source code in slp/data/transforms.py
def __call__(self, x: List[str]) -> List[int]:
"""Convert list of tokens to list of token ids
Args:
x (List[str]): List of tokens
Returns:
List[int]: List of token ids
"""
return [
self.word2idx[w] if w in self.word2idx else self.word2idx[self.unk_value]
for w in x
]
__init__(self, word2idx, specials=<enum 'SPECIAL_TOKENS'>)
special
Convert List of tokens to list of token ids
Parameters:
Name | Type | Description | Default |
---|---|---|---|
word2idx | Dict[str, int] | Word to index mapping | required |
specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | SPECIAL_TOKENS |
Source code in slp/data/transforms.py
def __init__(
self,
word2idx: Dict[str, int],
specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS, # type: ignore
):
"""Convert List of tokens to list of token ids
Args:
word2idx (Dict[str, int]): Word to index mapping
specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
"""
self.word2idx = word2idx
self.unk_value = specials.UNK.value if specials is not None else "[UNK]" # type: ignore
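A minimal usage sketch (the toy word2idx mapping is an illustrative assumption; out-of-vocabulary tokens map to the [UNK] id):
>>> from slp.data.transforms import ToTokenIds
>>> to_ids = ToTokenIds({"[UNK]": 0, "big": 1, "dog": 2})
>>> to_ids(["a", "big", "dog"])
[0, 1, 2]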