Data manipulation
We provide out-of-the-box support for preprocessing NLP corpora, along with helpers for working with datasets the PyTorch way.

MultimodalSequenceClassificationCollator

__call__(self, batch) (special)

Call collate function.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch | List[Dict[str, torch.Tensor]] | Batch of samples. It expects a list of dictionaries mapping modalities to torch tensors | required |

Returns:

| Type | Description |
|---|---|
| Tuple[Dict[str, torch.Tensor], torch.Tensor, Dict[str, torch.Tensor]] | Tuple of (dict of batched modality tensors, labels, dict of modality sequence lengths) |

Source code in slp/data/collators.py
          def __call__(
    self, batch: List[Dict[str, torch.Tensor]]
) -> Tuple[Dict[str, torch.Tensor], torch.Tensor, Dict[str, torch.Tensor]]:
    """Call collate function
    Args:
        batch (List[Dict[str, torch.Tensor]]): Batch of samples.
            It expects a list of dictionaries from modalities to torch tensors
    Returns:
        Tuple[Dict[str, torch.Tensor], torch.Tensor, Dict[str, torch.Tensor]]: tuple of
            (dict batched modality tensors, labels, dict of modality sequence lengths)
    """
    inputs = {}
    lengths = {}
    for m in self.modalities:
        seq = self.extract_sequence(batch, m)
        lengths[m] = torch.tensor([s.size(0) for s in seq], device=self.device)
        if self.max_length > 0:
            lengths[m] = torch.clamp(lengths[m], min=0, max=self.max_length)
        inputs[m] = pad_sequence(
            seq,
            batch_first=True,
            padding_value=self.pad_indx,
            max_length=self.max_length,
        ).to(self.device)
    targets: List[Label] = [b[self.label_key] for b in batch]
    # Pad and convert to tensor
    ttargets: torch.Tensor = mktensor(
        targets, device=self.device, dtype=self.label_dtype
    )
    return inputs, ttargets.to(self.device), lengths

__init__(self, pad_indx=0, modalities={'audio', 'visual', 'text'}, label_key='label', max_length=-1, label_dtype=torch.float32, device='cpu') (special)

Collate function for sequence classification tasks.

- Perform padding
- Calculate sequence lengths

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pad_indx | int | Pad token index. Defaults to 0. | 0 |
| modalities | Set | Which modalities are included in the batch dict | {'audio', 'visual', 'text'} |
| max_length | int | Pad sequences to a fixed maximum length | -1 |
| label_key | str | String to access the label in the batch dict | 'label' |
| label_dtype | torch.dtype | dtype of the returned label tensor | torch.float32 |
| device | str | Device of returned tensors. Leave this as "cpu". The LightningModule will handle the conversion. | 'cpu' |

Examples:

>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=MultimodalSequenceClassificationCollator())

Source code in slp/data/collators.py
          def __init__(
    self,
    pad_indx=0,
    modalities={"visual", "text", "audio"},
    label_key="label",
    max_length=-1,
    label_dtype=torch.float,
    device="cpu",
):
    """Collate function for sequence classification tasks
    * Perform padding
    * Calculate sequence lengths
    Args:
        pad_indx (int): Pad token index. Defaults to 0.
        modalities (Set): Which modalities are included in the batch dict
        max_length (int): Pad sequences to a fixed maximum length
        label_key (str): String to access the label in the batch dict
        device (str): device of returned tensors. Leave this as "cpu".
            The LightningModule will handle the Conversion.
    Examples:
        >>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=MultimodalSequenceClassificationCollator())
    """
    self.pad_indx = pad_indx
    self.device = device
    self.max_length = max_length
    self.label_key = label_key
    self.modalities = modalities
    self.label_dtype = label_dtype
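
A minimal usage sketch (not taken from the library docs): the toy dataset below is hypothetical and only illustrates the expected batch format, i.e. each sample is a dict mapping modality names to variable-length feature tensors plus a label.

```python
import torch
from torch.utils.data import DataLoader, Dataset

from slp.data.collators import MultimodalSequenceClassificationCollator


class ToyMultimodalDataset(Dataset):
    """Hypothetical dataset: each sample maps modality names to (seq_len, dim) tensors."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        seq_len = (idx % 3) + 2  # variable sequence length per sample
        return {
            "text": torch.randn(seq_len, 300),
            "audio": torch.randn(seq_len, 74),
            "visual": torch.randn(seq_len, 35),
            "label": idx % 2,
        }


collate_fn = MultimodalSequenceClassificationCollator(label_dtype=torch.long)
loader = DataLoader(ToyMultimodalDataset(), batch_size=4, collate_fn=collate_fn)

inputs, labels, lengths = next(iter(loader))
print(inputs["text"].shape)            # (batch, max_seq_len_in_batch, 300)
print(labels.shape, lengths["audio"])  # labels tensor and per-modality lengths
```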

Seq2SeqCollator

__call__(self, batch) (special)

Call collate function.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch | List[Tuple[torch.Tensor, torch.Tensor]] | Batch of samples. It expects a list of (source, target) tuples, where each source and target is a sequence of features or ids | required |

Returns:

| Type | Description |
|---|---|
| Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor] | Tuple of batched tensors (inputs, labels, lengths_inputs, lengths_targets) |

Source code in slp/data/collators.py
          def __call__(
    self, batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Call collate function
    Args:
        batch (List[Tuple[torch.Tensor, torch.Tensor]]): Batch of samples.
            It expects a list of tuples (source, target)
            Each source and target are a sequences of features or ids.
    Returns:
        Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: Returns tuple of batched tensors
            (inputs, labels, lengths_inputs, lengths_targets)
    """
    inputs: List[torch.Tensor] = [b[0] for b in batch]
    targets: List[torch.Tensor] = [b[1] for b in batch]
    lengths_inputs = torch.tensor([s.size(0) for s in inputs], device=self.device)
    lengths_targets = torch.tensor([s.size(0) for s in targets], device=self.device)
    if self.max_length > 0:
        lengths_inputs = torch.clamp(lengths_inputs, min=0, max=self.max_length)
        lengths_targets = torch.clamp(lengths_targets, min=0, max=self.max_length)
    inputs_padded: torch.Tensor = pad_sequence(
        inputs,
        batch_first=True,
        padding_value=self.pad_indx,
        max_length=self.max_length,
    ).to(self.device)
    targets_padded: torch.Tensor = pad_sequence(
        targets,
        batch_first=True,
        padding_value=self.pad_indx,
        max_length=self.max_length,
    ).to(self.device)
    return inputs_padded, targets_padded, lengths_inputs, lengths_targets

__init__(self, pad_indx=0, max_length=-1, device='cpu') (special)

Collate function for seq2seq tasks.

- Perform padding
- Calculate sequence lengths

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pad_indx | int | Pad token index. Defaults to 0. | 0 |
| max_length | int | Pad sequences to a fixed maximum length | -1 |
| device | str | Device of returned tensors. Leave this as "cpu". The LightningModule will handle the conversion. | 'cpu' |

Examples:

>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=Seq2SeqCollator())

Source code in slp/data/collators.py
          def __init__(self, pad_indx=0, max_length=-1, device="cpu"):
    """Collate function for seq2seq tasks
    * Perform padding
    * Calculate sequence lengths
    Args:
        pad_indx (int): Pad token index. Defaults to 0.
        max_length (int): Pad sequences to a fixed maximum length
        device (str): device of returned tensors. Leave this as "cpu".
            The LightningModule will handle the Conversion.
    Examples:
        >>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=Seq2SeqCollator())
    """
    self.pad_indx = pad_indx
    self.max_length = max_length
    self.device = device
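
A minimal sketch of how Seq2SeqCollator is typically wired into a DataLoader; the toy dataset yielding (source, target) id sequences is hypothetical.

```python
import torch
from torch.utils.data import DataLoader, Dataset

from slp.data.collators import Seq2SeqCollator


class ToyTranslationDataset(Dataset):
    """Hypothetical dataset yielding (source, target) pairs of variable-length id sequences."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        src = torch.randint(0, 100, ((idx % 4) + 3,))
        tgt = torch.randint(0, 100, ((idx % 5) + 2,))
        return src, tgt


loader = DataLoader(
    ToyTranslationDataset(), batch_size=4, collate_fn=Seq2SeqCollator(pad_indx=0)
)
sources, targets, src_lengths, tgt_lengths = next(iter(loader))
print(sources.shape, targets.shape)  # both padded to the longest sequence in the batch
```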

SequenceClassificationCollator

__call__(self, batch) (special)

Call collate function.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch | List[Tuple[torch.Tensor, Union[numpy.ndarray, torch.Tensor, List[~T], int]]] | Batch of samples. It expects a list of (inputs, label) tuples. | required |

Returns:

| Type | Description |
|---|---|
| Tuple[torch.Tensor, torch.Tensor, torch.Tensor] | Tuple of batched tensors (inputs, labels, lengths) |

Source code in slp/data/collators.py
          def __call__(
    self, batch: List[Tuple[torch.Tensor, Label]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Call collate function
    Args:
        batch (List[Tuple[torch.Tensor, slp.util.types.Label]]): Batch of samples.
            It expects a list of tuples (inputs, label).
    Returns:
        Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Returns tuple of batched tensors (inputs, labels, lengths)
    """
    inputs: List[torch.Tensor] = [b[0] for b in batch]
    targets: List[Label] = [b[1] for b in batch]
    #  targets: List[torch.tensor] = map(list, zip(*batch))
    lengths = torch.tensor([s.size(0) for s in inputs], device=self.device)
    if self.max_length > 0:
        lengths = torch.clamp(lengths, min=0, max=self.max_length)
    # Pad and convert to tensor
    inputs_padded: torch.Tensor = pad_sequence(
        inputs,
        batch_first=True,
        padding_value=self.pad_indx,
        max_length=self.max_length,
    ).to(self.device)
    ttargets: torch.Tensor = mktensor(targets, device=self.device, dtype=torch.long)
    return inputs_padded, ttargets.to(self.device), lengths

__init__(self, pad_indx=0, max_length=-1, device='cpu') (special)

Collate function for sequence classification tasks.

- Perform padding
- Calculate sequence lengths

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pad_indx | int | Pad token index. Defaults to 0. | 0 |
| max_length | int | Pad sequences to a fixed maximum length | -1 |
| device | str | Device of returned tensors. Leave this as "cpu". The LightningModule will handle the conversion. | 'cpu' |

Examples:

>>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=SequenceClassificationCollator())

Source code in slp/data/collators.py
          def __init__(self, pad_indx=0, max_length=-1, device="cpu"):
    """Collate function for sequence classification tasks
    * Perform padding
    * Calculate sequence lengths
    Args:
        pad_indx (int): Pad token index. Defaults to 0.
        max_length (int): Pad sequences to a fixed maximum length
        device (str): device of returned tensors. Leave this as "cpu".
            The LightningModule will handle the Conversion.
    Examples:
        >>> dataloader = torch.utils.data.DataLoader(my_dataset, collate_fn=SequenceClassificationCollator())
    """
    self.pad_indx = pad_indx
    self.device = device
    self.max_length = max_length
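
The same pattern for single-modality classification; again the toy dataset is hypothetical and only the collator comes from slp.data.collators.

```python
import torch
from torch.utils.data import DataLoader, Dataset

from slp.data.collators import SequenceClassificationCollator


class ToyClassificationDataset(Dataset):
    """Hypothetical dataset yielding (token_ids, label) pairs."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        tokens = torch.randint(0, 1000, ((idx % 6) + 1,))
        return tokens, idx % 2


loader = DataLoader(
    ToyClassificationDataset(),
    batch_size=4,
    collate_fn=SequenceClassificationCollator(pad_indx=0),
)
inputs, labels, lengths = next(iter(loader))
print(inputs.shape, labels, lengths)  # inputs padded with pad_indx up to the batch max length
```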

EmbeddingsLoader

__init__(self, embeddings_file, dim, vocab=None, extra_tokens=None) (special)

Load word embeddings in text format.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| embeddings_file | str | File where embeddings are stored (e.g. glove.6B.50d.txt) | required |
| dim | int | Dimensionality of embeddings | required |
| vocab | Optional[Dict[str, int]] | Load only embeddings in vocab. Defaults to None. | None |
| extra_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Create random embeddings for these special tokens. Defaults to None. | None |

Source code in slp/data/corpus.py
          def __init__(
    self,
    embeddings_file: str,
    dim: int,
    vocab: Optional[Dict[str, int]] = None,
    extra_tokens: Optional[SPECIAL_TOKENS] = None,
) -> None:
    """Load word embeddings in text format
    Args:
        embeddings_file (str): File where embeddings are stored (e.g. glove.6B.50d.txt)
        dim (int): Dimensionality of embeddings
        vocab (Optional[Dict[str, int]]): Load only embeddings in vocab. Defaults to None.
        extra_tokens (Optional[slp.config.nlp.SPECIAL_TOKENS]): Create random embeddings for these special tokens.
            Defaults to None.
    """
    self.embeddings_file = embeddings_file
    self.vocab = vocab
    self.cache_ = self._get_cache_name()
    self.dim_ = dim
    self.extra_tokens = extra_tokens

__repr__(self) (special)

String representation of the class.

Source code in slp/data/corpus.py
          def __repr__(self):
    """String representation of class"""
    return f"{self.__class__.__name__}({self.embeddings_file}, {self.dim_})"

augment_embeddings(self, word2idx, idx2word, embeddings, token, emb=None)

Create a random embedding for a special token and append it to the embeddings array.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| word2idx | Dict[str, int] | Current word2idx map | required |
| idx2word | Dict[int, str] | Current idx2word map | required |
| embeddings | List[numpy.ndarray] | Embeddings array as list of embeddings | required |
| token | str | The special token (e.g. [PAD]) | required |
| emb | Optional[numpy.ndarray] | Optional value for the embedding to be appended. Defaults to None, where a random embedding is created. | None |

Returns:

| Type | Description |
|---|---|
| Tuple[Dict[str, int], Dict[int, str], List[numpy.ndarray]] | (word2idx, idx2word, embeddings) tuple |

Source code in slp/data/corpus.py
          def augment_embeddings(
    self,
    word2idx: Dict[str, int],
    idx2word: Dict[int, str],
    embeddings: List[np.ndarray],
    token: str,
    emb: Optional[np.ndarray] = None,
) -> Tuple[Dict[str, int], Dict[int, str], List[np.ndarray]]:
    """Create a random embedding for a special token and append it to the embeddings array
    Args:
        word2idx (Dict[str, int]): Current word2idx map
        idx2word (Dict[int, str]): Current idx2word map
        embeddings (List[np.ndarray]): Embeddings array as list of embeddings
        token (str): The special token (e.g. [PAD])
        emb (Optional[np.ndarray]): Optional value for the embedding to be appended.
            Defaults to None, where a random embedding is created.
    Returns:
        Tuple[Dict[str, int], Dict[int, str], List[np.ndarray]]: (word2idx, idx2word, embeddings) tuple
    """
    word2idx[token] = len(embeddings)
    idx2word[len(embeddings)] = token
    if emb is None:
        emb = np.random.uniform(low=-0.05, high=0.05, size=self.dim_)
    embeddings.append(emb)
    return word2idx, idx2word, embeddings

in_accepted_vocab(self, word)

Check if a word exists in the given vocabulary.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| word | str | Word from the embeddings file | required |

Returns:

| Type | Description |
|---|---|
| bool | Word exists |

Source code in slp/data/corpus.py
          def in_accepted_vocab(self, word: str) -> bool:
    """Check if word exists in given vocabulary
    Args:
        word (str): word from embeddings file
    Returns:
        bool: Word exists
    """
    return True if self.vocab is None else word in self.vocab

load(self)

Read the word vectors from a text file.

- Read embeddings
- Filter with given vocabulary
- Augment with special tokens

Returns:

| Type | Description |
|---|---|
| Tuple[Dict[str, int], Dict[int, str], numpy.ndarray] | (word2idx, idx2word, embeddings) tuple (types.Embeddings) |

Source code in slp/data/corpus.py
          @system.timethis(method=True)
def load(self) -> types.Embeddings:
    """Read the word vectors from a text file
    * Read embeddings
    * Filter with given vocabulary
    * Augment with special tokens
    Returns:
        types.Embeddings: (word2idx, idx2word, embeddings) tuple
    """
    # in order to avoid this time consuming operation, cache the results
    try:
        cache = self._load_cache()
        logger.info("Loaded word embeddings from cache.")
        return cache
    except OSError:
        logger.warning(f"Didn't find embeddings cache file {self.embeddings_file}")
        logger.warning("Loading embeddings from file.")
    # create the necessary dictionaries and the word embeddings matrix
    if not os.path.exists(self.embeddings_file):
        logger.critical(f"{self.embeddings_file} not found!")
        raise OSError(errno.ENOENT, os.strerror(errno.ENOENT), self.embeddings_file)
    logger.info(f"Indexing file {self.embeddings_file} ...")
    # create the 2D array, which will be used for initializing
    # the Embedding layer of a NN.
    # We reserve the first row (idx=0), as the word embedding,
    # which will be used for zero padding (word with id = 0).
    if self.extra_tokens is not None:
        word2idx, idx2word, embeddings = self.augment_embeddings(
            {},
            {},
            [],
            self.extra_tokens.PAD.value,  # type: ignore
            emb=np.zeros(self.dim_),
        )
        for token in self.extra_tokens:  # type: ignore
            logger.debug(f"Adding token {token.value} to embeddings matrix")
            if token == self.extra_tokens.PAD:
                continue
            word2idx, idx2word, embeddings = self.augment_embeddings(
                word2idx, idx2word, embeddings, token.value
            )
    else:
        word2idx, idx2word, embeddings = self.augment_embeddings(
            {}, {}, [], "[PAD]", emb=np.zeros(self.dim_)
        )
    # read file, line by line
    with open(self.embeddings_file, "r") as f:
        num_lines = sum(1 for line in f)
    with open(self.embeddings_file, "r") as f:
        index = len(embeddings)
        for line in tqdm(
            f, total=num_lines, desc="Loading word embeddings...", leave=False
        ):
            # skip the first row if it is a header
            if len(line.split()) < self.dim_:
                continue
            values = line.rstrip().split(" ")
            word = values[0]
            if word in word2idx:
                continue
            if not self.in_accepted_vocab(word):
                continue
            vector = np.asarray(values[1:], dtype=np.float32)
            idx2word[index] = word
            word2idx[word] = index
            embeddings.append(vector)
            index += 1
    logger.info(f"Loaded {len(embeddings)} word vectors.")
    embeddings_out = np.array(embeddings, dtype="float32")
    # write the data to a cache file
    self._dump_cache((word2idx, idx2word, embeddings_out))
    return word2idx, idx2word, embeddings_out
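
A short sketch of loading pretrained vectors; the GloVe path is a placeholder you need to provide, and slp.config.nlp.SPECIAL_TOKENS is the enum referenced above. Per the load() source, results are cached after the first run.

```python
from slp.config.nlp import SPECIAL_TOKENS
from slp.data.corpus import EmbeddingsLoader

loader = EmbeddingsLoader(
    "glove.6B.50d.txt",           # placeholder path to a text-format embeddings file
    50,                           # must match the dimensionality of the file
    extra_tokens=SPECIAL_TOKENS,  # random embeddings for special tokens, zeros for [PAD]
)
word2idx, idx2word, embeddings = loader.load()

print(embeddings.shape)                    # (number of loaded vectors, 50)
print(word2idx[SPECIAL_TOKENS.PAD.value])  # special tokens occupy the first indices
```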

HfCorpus

embeddings: None (property, readonly)
    Unused. Defined for compatibility.

frequencies: Dict[str, int] (property, readonly)
    Retrieve wordpiece occurrence counts.

idx2word: None (property, readonly)
    Unused. Defined for compatibility.

indices: List[List[int]] (property, readonly)
    Retrieve corpus as token indices.

raw: List[str] (property, readonly)
    Retrieve raw corpus.

tokenized: List[List[str]] (property, readonly)
    Retrieve tokenized corpus.

vocab: Set[str] (property, readonly)
    Retrieve the set of words in the vocabulary.

vocab_size: int (property, readonly)
    Retrieve vocabulary size.

word2idx: None (property, readonly)
    Unused. Defined for compatibility.


__getitem__(self, idx) (special)

Get the ith element in the corpus as token indices.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index in corpus | required |

Returns:

| Type | Description |
|---|---|
| List[int] | List of token indices for the sentence |

Source code in slp/data/corpus.py
          def __getitem__(self, idx) -> List[int]:
    """Get ith element in corpus as token indices
    Args:
        idx (List[int]): index in corpus
    Returns:
        List[int]: List of token indices for sentence
    """
    out: List[int] = (
        self.corpus_indices_[idx]
        if self.max_length <= 0
        else self.corpus_indices_[idx][: self.max_length]
    )
    return out

__init__(self, corpus, lower=True, tokenizer_model='bert-base-uncased', add_special_tokens=True, special_tokens=<enum 'SPECIAL_TOKENS'>, max_length=-1, **kwargs) (special)

Process a corpus using Hugging Face tokenizers.

Select one of the Hugging Face tokenizers and process the corpus.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus | List[str] | List of sentences | required |
| lower | bool | Convert strings to lower case. Defaults to True. | True |
| tokenizer_model | str | Hugging Face model to use. Defaults to "bert-base-uncased". | 'bert-base-uncased' |
| add_special_tokens | bool | Add special tokens to the sentence during tokenization. Defaults to True. | True |
| special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens to include in the vocabulary. Defaults to slp.config.nlp.SPECIAL_TOKENS. | <enum 'SPECIAL_TOKENS'> |
| max_length | int | Crop sequences above this length. Defaults to -1, which leaves sequences unaltered. | -1 |

Source code in slp/data/corpus.py
          def __init__(
    self,
    corpus: List[str],
    lower: bool = True,
    tokenizer_model: str = "bert-base-uncased",
    add_special_tokens: bool = True,
    special_tokens: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS,  # type: ignore
    max_length: int = -1,
    **kwargs,
):
    """Process a corpus using hugging face tokenizers
    Select one of hugging face tokenizers and process corpus
    Args:
        corpus (List[str]): List of sentences
        lower (bool): Convert strings to lower case. Defaults to True.
        tokenizer_model (str): Hugging face model to use. Defaults to "bert-base-uncased".
        add_special_tokens (bool): Add special tokens in sentence during tokenization. Defaults to True.
        special_tokens (Optional[SPECIAL_TOKENS]): Special tokens to include in the vocabulary.
             Defaults to slp.config.nlp.SPECIAL_TOKENS.
        max_length (int): Crop sequences above this length. Defaults to -1 where sequences are left unaltered.
    """
    self.corpus_ = corpus
    self.max_length = max_length
    logger.info(
        f"Tokenizing corpus using hugging face tokenizer from {tokenizer_model}"
    )
    self.tokenizer = HuggingFaceTokenizer(
        lower=lower, model=tokenizer_model, add_special_tokens=add_special_tokens
    )
    self.corpus_indices_ = [
        self.tokenizer(s)
        for s in tqdm(
            self.corpus_, desc="Converting tokens to indices...", leave=False
        )
    ]
    self.tokenized_corpus_ = [
        self.tokenizer.detokenize(s)
        for s in tqdm(
            self.corpus_indices_,
            desc="Mapping indices to tokens...",
            leave=False,
        )
    ]
    self.vocab_ = create_vocab(
        self.tokenized_corpus_,
        vocab_size=-1,
        special_tokens=special_tokens,
    )

__len__(self) (special)

Number of samples in the corpus.

Returns:

| Type | Description |
|---|---|
| int | Corpus length |

Source code in slp/data/corpus.py
          def __len__(self) -> int:
    """Number of samples in corpus
    Returns:
        int: Corpus length
    """
    return len(self.corpus_indices_)
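
A minimal sketch of processing a couple of sentences with HfCorpus; downloading bert-base-uncased from the Hugging Face hub is assumed to work in your environment.

```python
from slp.data.corpus import HfCorpus

sentences = ["The cat sat on the mat.", "A galaxy far, far away."]
corpus = HfCorpus(sentences, tokenizer_model="bert-base-uncased")

print(len(corpus))          # 2
print(corpus[0])            # wordpiece ids of the first sentence
print(corpus.tokenized[0])  # the corresponding wordpieces
print(corpus.vocab_size)    # number of wordpiece types seen in the corpus (plus specials)
```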

TokenizedCorpus

embeddings: None (property, readonly)
    Unused. Kept for compatibility.

frequencies: Dict[str, int] (property, readonly)
    Retrieve wordpiece occurrence counts.

idx2word: Dict[int, str] (property, readonly)
    Retrieve the idx2word mapping.

indices: Union[List[int], List[List[int]]] (property, readonly)
    Retrieve corpus as token indices.

raw: Union[List[str], List[List[str]]] (property, readonly)
    Retrieve raw corpus.

tokenized: Union[List[str], List[List[str]]] (property, readonly)
    Retrieve tokenized corpus.

vocab: Set[str] (property, readonly)
    Retrieve the set of words in the vocabulary.

vocab_size: int (property, readonly)
    Retrieve vocabulary size.

word2idx: Dict[str, int] (property, readonly)
    Retrieve the word2idx mapping.


__getitem__(self, idx) (special)

Get the ith element in the corpus as token indices.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index in corpus | required |

Returns:

| Type | Description |
|---|---|
| List[int] | List of token indices for the sentence |

Source code in slp/data/corpus.py
          def __getitem__(self, idx) -> List[int]:
    """Get ith element in corpus as token indices
    Args:
        idx (List[int]): index in corpus
    Returns:
        List[int]: List of token indices for sentence
    """
    out: List[int] = (
        self.corpus_indices_[idx]
        if self.max_length <= 0
        else self.corpus_indices_[idx][: self.max_length]
    )
    return out

__init__(self, corpus, word2idx=None, special_tokens=<enum 'SPECIAL_TOKENS'>, max_length=-1, **kwargs) (special)

Wrap a corpus that is already tokenized.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus | Union[List[str], List[List[str]]] | List of tokens or list of lists of tokens | required |
| word2idx | Dict[str, int] | Token to index mapping. Defaults to None. | None |
| special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special Tokens. Defaults to SPECIAL_TOKENS. | <enum 'SPECIAL_TOKENS'> |
| max_length | int | Crop sequences above this length. Defaults to -1, which leaves sequences unaltered. | -1 |

Source code in slp/data/corpus.py
          def __init__(
    self,
    corpus: Union[List[str], List[List[str]]],
    word2idx: Dict[str, int] = None,
    special_tokens: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS,  # type: ignore
    max_length: int = -1,
    **kwargs,
):
    """Wrap a corpus that's already tokenized
    Args:
        corpus (Union[List[str], List[List[str]]]): List of tokens or List of lists of tokens
        word2idx (Dict[str, int], optional): Token to index mapping. Defaults to None.
        special_tokens (Optional[SPECIAL_TOKENS], optional): Special Tokens. Defaults to SPECIAL_TOKENS.
    """
    self.corpus_ = corpus
    self.tokenized_corpus_ = corpus
    self.max_length = max_length
    self.vocab_ = create_vocab(
        self.tokenized_corpus_,
        vocab_size=-1,
        special_tokens=special_tokens,
    )
    if word2idx is not None:
        logger.info("Converting tokens to ids using word2idx.")
        self.word2idx_ = word2idx
    else:
        logger.info(
            "No word2idx provided. Will convert tokens to ids using an iterative counter."
        )
        self.word2idx_ = dict(zip(self.vocab_.keys(), itertools.count()))
    self.idx2word_ = {v: k for k, v in self.word2idx_.items()}
    self.to_token_ids = ToTokenIds(
        self.word2idx_,
        specials=SPECIAL_TOKENS,  # type: ignore
    )
    if isinstance(self.tokenized_corpus_[0], list):
        self.corpus_indices_ = [
            self.to_token_ids(s)
            for s in tqdm(
                self.tokenized_corpus_,
                desc="Converting tokens to token ids...",
                leave=False,
            )
        ]
    else:
        self.corpus_indices_ = self.to_token_ids(self.tokenized_corpus_)  # type: ignore

__len__(self) (special)

Number of samples in the corpus.

Returns:

| Type | Description |
|---|---|
| int | Corpus length |

Source code in slp/data/corpus.py
          def __len__(self) -> int:
    """Number of samples in corpus
    Returns:
        int: Corpus length
    """
    return len(self.corpus_indices_)
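
A minimal sketch for corpora that are already tokenized; with no word2idx given, ids are assigned by an iterative counter over the vocabulary (special tokens first, as created by create_vocab).

```python
from slp.data.corpus import TokenizedCorpus

tokenized = [["in", "a", "galaxy"], ["far", "far", "away"]]
corpus = TokenizedCorpus(tokenized)

print(corpus.word2idx)  # special tokens first, then corpus tokens
print(corpus[1])        # token ids of the second sentence
```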

WordCorpus

embeddings: ndarray (property, readonly)
    Retrieve the array of pretrained word embeddings.

frequencies: Dict[str, int] (property, readonly)
    Retrieve word occurrence counts.

idx2word: Dict[int, str] (property, readonly)
    Retrieve the idx2word mapping.

indices: List[List[int]] (property, readonly)
    Retrieve corpus as token indices.

raw: List[str] (property, readonly)
    Retrieve raw corpus.

tokenized: List[List[str]] (property, readonly)
    Retrieve tokenized corpus.

vocab: Set[str] (property, readonly)
    Retrieve the set of words in the vocabulary.

vocab_size: int (property, readonly)
    Retrieve vocabulary size for the corpus.

word2idx: Dict[str, int] (property, readonly)
    Retrieve the word2idx mapping.


__getitem__(self, idx) (special)

Get the ith element in the corpus as token indices.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index in corpus | required |

Returns:

| Type | Description |
|---|---|
| List[int] | List of token indices for the sentence |

Source code in slp/data/corpus.py
          def __getitem__(self, idx) -> List[int]:
    """Get ith element in corpus as token indices
    Args:
        idx (List[int]): index in corpus
    Returns:
        List[int]: List of token indices for sentence
    """
    out: List[int] = (
        self.corpus_indices_[idx]
        if self.max_length <= 0
        else self.corpus_indices_[idx][: self.max_length]
    )
    return out

__init__(self, corpus, limit_vocab_size=30000, word2idx=None, idx2word=None, embeddings=None, embeddings_file=None, embeddings_dim=300, lower=True, special_tokens=<enum 'SPECIAL_TOKENS'>, prepend_bos=False, append_eos=False, lang='en_core_web_md', max_length=-1, **kwargs) (special)

Load corpus embeddings, tokenize into words using spacy and convert to ids.

This class handles the preprocessing of a raw corpus:

- Tokenization into words (spacy)
- Loading of pretrained word embeddings
- Calculation of word frequencies / corpus statistics
- Conversion to token ids

You can either:

- Pass an embeddings file to load pretrained embeddings and create the word2idx mapping, or
- Pass an already loaded embeddings array and word2idx. This is useful for the dev / test splits, where we want to reuse the train split embeddings / word2idx.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus | List[str] | Corpus as a list of sentences | required |
| limit_vocab_size | int | Upper bound for the number of most frequent tokens to keep. Defaults to 30000. | 30000 |
| word2idx | Optional[Dict[str, int]] | Mapping of words to indices. Defaults to None. | None |
| idx2word | Optional[Dict[int, str]] | Mapping of indices to words. Defaults to None. | None |
| embeddings | Optional[numpy.ndarray] | Embeddings array. Defaults to None. | None |
| embeddings_file | Optional[str] | Embeddings file to read. Defaults to None. | None |
| embeddings_dim | int | Dimension of embeddings. Defaults to 300. | 300 |
| lower | bool | Convert strings to lower case. Defaults to True. | True |
| special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens to include in the vocabulary. Defaults to slp.config.nlp.SPECIAL_TOKENS. | <enum 'SPECIAL_TOKENS'> |
| prepend_bos | bool | Prepend the Beginning of Sequence token for seq2seq tasks. Defaults to False. | False |
| append_eos | bool | Append the End of Sequence token for seq2seq tasks. Defaults to False. | False |
| lang | str | Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md". | 'en_core_web_md' |
| max_length | int | Crop sequences above this length. Defaults to -1, which leaves sequences unaltered. | -1 |

Source code in slp/data/corpus.py
          def __init__(
    self,
    corpus: List[str],
    limit_vocab_size: int = 30000,
    word2idx: Optional[Dict[str, int]] = None,
    idx2word: Optional[Dict[int, str]] = None,
    embeddings: Optional[np.ndarray] = None,
    embeddings_file: Optional[str] = None,
    embeddings_dim: int = 300,
    lower: bool = True,
    special_tokens: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS,  # type: ignore
    prepend_bos: bool = False,
    append_eos: bool = False,
    lang: str = "en_core_web_md",
    max_length: int = -1,
    **kwargs,
):
    """Load corpus embeddings, tokenize in words using spacy and convert to ids
    This class handles the handling of a raw corpus. It handles:
    * Tokenization into words (spacy)
    * Loading of pretrained word embedding
    * Calculation of word frequencies / corpus statistics
    * Conversion to token ids
    You can pass either:
    * Pass an embeddings file to load pretrained embeddings and create the word2idx mapping
    * Pass already loaded embeddings array and word2idx. This is useful for the dev / test splits
      where we want to pass the train split embeddings / word2idx.
    Args:
        corpus (List[List[str]]): Corpus as a list of sentences
        limit_vocab_size (int): Upper bound for number of most frequent tokens to keep. Defaults to 30000.
        word2idx (Optional[Dict[str, int]]): Mapping of word to indices. Defaults to None.
        idx2word (Optional[Dict[int, str]]): Mapping of indices to words. Defaults to None.
        embeddings (Optional[np.ndarray]): Embeddings array. Defaults to None.
        embeddings_file (Optional[str]): Embeddings file to read. Defaults to None.
        embeddings_dim (int): Dimension of embeddings. Defaults to 300.
        lower (bool): Convert strings to lower case. Defaults to True.
        special_tokens (Optional[SPECIAL_TOKENS]): Special tokens to include in the vocabulary.
             Defaults to slp.config.nlp.SPECIAL_TOKENS.
        prepend_bos (bool): Prepend Beginning of Sequence token for seq2seq tasks. Defaults to False.
        append_eos (bool): Append End of Sequence token for seq2seq tasks. Defaults to False.
        lang (str): Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md".
        max_length (int): Crop sequences above this length. Defaults to -1 where sequences are left unaltered.
    """
    # FIXME: Extract super class to avoid repetition
    self.corpus_ = corpus
    self.max_length = max_length
    self.tokenizer = SpacyTokenizer(
        lower=lower,
        prepend_bos=prepend_bos,
        append_eos=append_eos,
        specials=special_tokens,
        lang=lang,
    )
    logger.info(f"Tokenizing corpus using spacy {lang}")
    self.tokenized_corpus_ = [
        self.tokenizer(s)
        for s in tqdm(self.corpus_, desc="Tokenizing corpus...", leave=False)
    ]
    self.vocab_ = create_vocab(
        self.tokenized_corpus_,
        vocab_size=limit_vocab_size if word2idx is None else -1,
        special_tokens=special_tokens,
    )
    self.word2idx_, self.idx2word_, self.embeddings_ = None, None, None
    # self.corpus_indices_ = self.tokenized_corpus_
    if word2idx is not None:
        logger.info("Word2idx was already provided. Going to used it.")
    if embeddings_file is not None and word2idx is None:
        logger.info(
            f"Going to load {len(self.vocab_)} embeddings from {embeddings_file}"
        )
        loader = EmbeddingsLoader(
            embeddings_file,
            embeddings_dim,
            vocab=self.vocab_,
            extra_tokens=special_tokens,
        )
        word2idx, idx2word, embeddings = loader.load()
    if embeddings is not None:
        self.embeddings_ = embeddings
    if idx2word is not None:
        self.idx2word_ = idx2word
    if word2idx is not None:
        self.word2idx_ = word2idx
        logger.info("Converting tokens to ids using word2idx.")
        self.to_token_ids = ToTokenIds(
            self.word2idx_,
            specials=SPECIAL_TOKENS,  # type: ignore
        )
        self.corpus_indices_ = [
            self.to_token_ids(s)
            for s in tqdm(
                self.tokenized_corpus_,
                desc="Converting tokens to token ids...",
                leave=False,
            )
        ]
        logger.info("Filtering corpus vocabulary.")
        updated_vocab = {}
        for k, v in self.vocab_.items():
            if k in self.word2idx_:
                updated_vocab[k] = v
        logger.info(
            f"Out of {len(self.vocab_)} tokens {len(self.vocab_) - len(updated_vocab)} were not found in the pretrained embeddings."
        )
        self.vocab_ = updated_vocab

__len__(self) (special)

Number of samples in the corpus.

Returns:

| Type | Description |
|---|---|
| int | Corpus length |

Source code in slp/data/corpus.py
          def __len__(self) -> int:
    """Number of samples in corpus
    Returns:
        int: Corpus length
    """
    return len(self.corpus_indices_)
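
A minimal sketch of building a word-level corpus. The GloVe path is a placeholder you must provide, and the en_core_web_sm spacy model is assumed to be installed.

```python
from slp.data.corpus import WordCorpus

train_sentences = ["The cat sat on the mat.", "The dog lay on the rug."]
corpus = WordCorpus(
    train_sentences,
    embeddings_file="glove.6B.300d.txt",  # placeholder path
    embeddings_dim=300,
    lang="en_core_web_sm",                # assumes this spacy model is installed
)

print(corpus.vocab_size)
print(corpus.embeddings.shape)  # pretrained vectors for the corpus vocabulary (+ specials)
print(corpus[0])                # token ids of the first sentence

# Dev / test splits reuse the train mappings instead of re-reading the embeddings file
dev_corpus = WordCorpus(
    ["A cat and a dog."],
    word2idx=corpus.word2idx,
    idx2word=corpus.idx2word,
    embeddings=corpus.embeddings,
    lang="en_core_web_sm",
)
```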

create_vocab(corpus, vocab_size=-1, special_tokens=None)

Create the vocabulary from a tokenized input corpus.

- Injects special tokens into the vocabulary
- Calculates the occurrence count for each token
- Limits the vocabulary to the vocab_size most common tokens

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus | Union[List[str], List[List[str]]] | The tokenized corpus as a list of tokens or a list of tokenized sentences | required |
| vocab_size | int | Limit the vocabulary to the vocab_size most common tokens. Defaults to -1, which keeps all tokens. | -1 |
| special_tokens | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens to include in the vocabulary. Defaults to None. | None |

Returns:

| Type | Description |
|---|---|
| Dict[str, int] | Dictionary of all accepted tokens and their corresponding occurrence counts |

Examples:

>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"])
{'far': 2, 'away': 1, 'galaxy': 1, 'a': 1, 'in': 1}
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3)
{'far': 2, 'a': 1, 'in': 1}
>>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3, special_tokens=slp.config.nlp.SPECIAL_TOKENS)
{'[PAD]': 0, '[MASK]': 0, '[UNK]': 0, '[BOS]': 0, '[EOS]': 0, '[CLS]': 0, '[SEP]': 0, 'far': 2, 'a': 1, 'in': 1}

Source code in slp/data/corpus.py
          def create_vocab(
    corpus: Union[List[str], List[List[str]]],
    vocab_size: int = -1,
    special_tokens: Optional[SPECIAL_TOKENS] = None,
) -> Dict[str, int]:
    """Create the vocabulary based on tokenized input corpus
    * Injects special tokens in the vocabulary
    * Calculates the occurence count for each token
    * Limits vocabulary to vocab_size most common tokens
    Args:
        corpus (Union[List[str], List[List[str]]]): The tokenized corpus as a list of sentences or a list of tokenized sentences
        vocab_size (int): [description]. Limit vocabulary to vocab_size most common tokens.
            Defaults to -1 which keeps all tokens.
        special_tokens Optional[SPECIAL_TOKENS]: Special tokens to include in the vocabulary. Defaults to None.
    Returns:
        Dict[str, int]: Dictionary of all accepted tokens and their corresponding occurence counts
    Examples:
        >>> create_vocab(["in", "a", "galaxy", "far", "far", "away"])
        {'far': 2, 'away': 1, 'galaxy': 1, 'a': 1, 'in': 1}
        >>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3)
        {'far': 2, 'a': 1, 'in': 1}
        >>> create_vocab(["in", "a", "galaxy", "far", "far", "away"], vocab_size=3, special_tokens=slp.config.nlp.SPECIAL_TOKENS)
        {'[PAD]': 0, '[MASK]': 0, '[UNK]': 0, '[BOS]': 0, '[EOS]': 0, '[CLS]': 0, '[SEP]': 0, 'far': 2, 'a': 1, 'in': 1}
    """
    if isinstance(corpus[0], list):
        corpus = list(itertools.chain.from_iterable(corpus))
    freq = Counter(corpus)
    if special_tokens is None:
        extra_tokens = []
    else:
        extra_tokens = special_tokens.to_list()
    if vocab_size < 0:
        vocab_size = len(freq)
    take = min(vocab_size, len(freq))
    logger.info(f"Keeping {vocab_size} most common tokens out of {len(freq)}")
    def take0(x: Tuple[Any, Any]) -> Any:
        """Take first tuple element"""
        return x[0]
    common_words = list(map(take0, freq.most_common(take)))
    common_words = list(set(common_words) - set(extra_tokens))
    words = extra_tokens + common_words
    if len(words) > vocab_size:
        words = words[: vocab_size + len(extra_tokens)]
    def token_freq(t):
        """Token frequeny"""
        return 0 if t in extra_tokens else freq[t]
    vocab = dict(zip(words, map(token_freq, words)))
    logger.info(f"Vocabulary created with {len(vocab)} tokens.")
    logger.info(f"The 10 most common tokens are:\n{freq.most_common(10)}")
    return vocab

CorpusDataset

__getitem__(self, idx) (special)

Get a processed sentence and its label from the corpus.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Sample index in the corpus | required |

Returns:

| Type | Description |
|---|---|
| Tuple[torch.Tensor, torch.Tensor] | (processed sentence, label) |

Source code in slp/data/datasets.py
          def __getitem__(self, idx):
    """Get a source and target token from the corpus
    Args:
        idx (int): Token position
    Returns:
        Tuple[torch.Tensor, torch.Tensor]: (processed sentence, label)
    """
    text, target = self.corpus[idx], self.labels[idx]
    if self.label_encoder is not None:
        target = self.label_encoder.transform([target])[0]
    for t in self.transforms:
        text = t(text)
    return text, target

__init__(self, corpus, labels) (special)

Labeled corpus dataset.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus | WordCorpus, HfCorpus etc. | Input corpus | required |
| labels | List[Any] | Labels for examples | required |

Source code in slp/data/datasets.py
          def __init__(self, corpus, labels):
    """Labeled corpus dataset
    Args:
        corpus (WordCorpus, HfCorpus etc..): Input corpus
        labels (List[Any]): Labels for examples
    """
    self.corpus = corpus
    self.labels = labels
    assert len(self.labels) == len(self.corpus), "Incompatible labels and corpus"
    self.transforms = []
    self.label_encoder = None
    if isinstance(self.labels[0], str):
        self.label_encoder = LabelEncoder().fit(self.labels)

__len__(self) (special)

Length of the corpus.

Returns:

| Type | Description |
|---|---|
| int | Corpus length |

Source code in slp/data/datasets.py
          def __len__(self):
    """Length of corpus
    Returns:
        int: Corpus Length
    """
    return len(self.corpus)

map(self, t)

Append a transform to self.transforms, in order to be applied to the data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| t | Callable[[str], Any] | Transform of input token | required |

Returns:

| Type | Description |
|---|---|
| CorpusDataset | self |

Source code in slp/data/datasets.py
          def map(self, t):
    """Append a transform to self.transforms, in order to be applied to the data
    Args:
        t (Callable[[str], Any]): Transform of input token
    Returns:
        CorpusDataset: self
    """
    self.transforms.append(t)
    return self
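
Putting the pieces together: a sketch of a full text-classification input pipeline (corpus, labeled dataset, tensor transform, collator). The sentences, labels and the GloVe path are placeholders, and the en_core_web_sm spacy model is assumed to be installed.

```python
from torch.utils.data import DataLoader

from slp.data.collators import SequenceClassificationCollator
from slp.data.corpus import WordCorpus
from slp.data.datasets import CorpusDataset
from slp.data.transforms import ToTensor

sentences = ["great movie", "terrible plot", "loved it", "boring and slow"]
labels = ["positive", "negative", "positive", "negative"]

corpus = WordCorpus(
    sentences,
    embeddings_file="glove.6B.50d.txt",  # placeholder path
    embeddings_dim=50,
    lang="en_core_web_sm",
)
dataset = CorpusDataset(corpus, labels).map(ToTensor(device="cpu"))

loader = DataLoader(
    dataset, batch_size=2, collate_fn=SequenceClassificationCollator(pad_indx=0)
)
inputs, targets, lengths = next(iter(loader))  # padded token ids, encoded labels, lengths
```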

CorpusLMDataset

__getitem__(self, idx) (special)

Get a source and target token from the corpus.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Token position | required |

Returns:

| Type | Description |
|---|---|
| Tuple[torch.Tensor, torch.Tensor] | source=corpus[idx], target=corpus[idx+1] |

Source code in slp/data/datasets.py
          def __getitem__(self, idx):
    """Get a source and target token from the corpus
    Args:
        idx (int): Token position
    Returns:
        Tuple[torch.Tensor, torch.Tensor]: source=coprus[idx], target=corpus[idx+1]
    """
    src, tgt = self.source[idx], self.target[idx]
    for t in self.transforms:
        src = t(src)
        tgt = t(tgt)
    return src, tgt

__init__(self, corpus) (special)

Wraps a tokenized dataset which is provided as a list of tokens.

Targets = source shifted one token to the left (next token prediction).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| corpus | List[str] or WordCorpus | List of tokens | required |

Source code in slp/data/datasets.py
          def __init__(self, corpus):
    """Wraps a tokenized dataset which is provided as a list of tokens
    Targets = source shifted one token to the left (next token prediction)
    Args:
        corpus (List[str] or WordCorpus): List of tokens
    """
    self.source = corpus[:-1]
    self.target = corpus[1:]
    self.transforms = []

__len__(self) (special)

Length of the corpus.

Returns:

| Type | Description |
|---|---|
| int | Corpus length |

Source code in slp/data/datasets.py
          def __len__(self):
    """Length of corpus
    Returns:
        int: Corpus Length
    """
    return int(len(self.source))

map(self, t)

Append a transform to self.transforms, in order to be applied to the data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| t | Callable[[str], Any] | Transform of input token | required |

Returns:

| Type | Description |
|---|---|
| CorpusLMDataset | self |

Source code in slp/data/datasets.py
          def map(self, t):
    """Append a transform to self.transforms, in order to be applied to the data
    Args:
        t (Callable[[str], Any]): Transform of input token
    Returns:
        CorpusLMDataset: self
    """
    self.transforms.append(t)
    return self
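
A small sketch of the shift-by-one behaviour on a flat token list; transforms such as ToTokenIds and ToTensor can be attached with .map() exactly as for CorpusDataset.

```python
from slp.data.datasets import CorpusLMDataset

tokens = ["in", "a", "galaxy", "far", "far", "away"]
dataset = CorpusLMDataset(tokens)

print(len(dataset))  # 5 source/target pairs
print(dataset[0])    # ('in', 'a')
print(dataset[3])    # ('far', 'far')
```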

HuggingFaceTokenizer

__call__(self, x) (special)

Call to tokenize function.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | str | Input string | required |

Returns:

| Type | Description |
|---|---|
| List[int] | List of token ids |

Source code in slp/data/transforms.py
          def __call__(self, x: str) -> List[int]:
    """Call to tokenize function
    Args:
        x (str): Input string
    Returns:
        List[int]: List of token ids
    """
    out: List[int] = self.tokenizer.encode(
        x, add_special_tokens=self.add_special_tokens, max_length=65536
    )
    return out

__init__(self, lower=True, model='bert-base-uncased', add_special_tokens=True) (special)

Apply one of the Hugging Face tokenizers to a string.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lower | bool | Lowercase string. Defaults to True. | True |
| model | str | Select transformer model. Defaults to "bert-base-uncased". | 'bert-base-uncased' |
| add_special_tokens | bool | Insert special tokens into the tokenized string. Defaults to True. | True |

Source code in slp/data/transforms.py
          def __init__(
    self,
    lower: bool = True,
    model: str = "bert-base-uncased",
    add_special_tokens: bool = True,
):
    """Apply one of huggingface tokenizers to a string
    Args:
        lower (bool): Lowercase string. Defaults to True.
        model (str): Select transformer model. Defaults to "bert-base-uncased".
        add_special_tokens (bool): Insert special tokens to tokenized string. Defaults to True.
    """
    self.tokenizer = AutoTokenizer.from_pretrained(model, do_lower_case=lower)
    self.vocab_size = len(self.tokenizer.vocab)
    self.add_special_tokens = add_special_tokens

detokenize(self, x)

Convert a list of token ids to a list of tokens.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | List[int] | List of token ids | required |

Returns:

| Type | Description |
|---|---|
| List[str] | List of tokens |

Source code in slp/data/transforms.py
          def detokenize(self, x: List[int]) -> List[str]:
    """Convert list of token ids to list of tokens
    Args:
        x (List[int]): List of token ids
    Returns:
        List[str]: List of tokens
    """
    out: List[str] = self.tokenizer.convert_ids_to_tokens(x)
    return out
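
A quick round-trip sketch; the bert-base-uncased tokenizer is fetched from the Hugging Face hub on first use.

```python
from slp.data.transforms import HuggingFaceTokenizer

tokenizer = HuggingFaceTokenizer(model="bert-base-uncased")

ids = tokenizer("The cat sat on the mat")  # wordpiece ids, including the [CLS]/[SEP] ids
tokens = tokenizer.detokenize(ids)         # ['[CLS]', 'the', 'cat', ..., '[SEP]']
print(len(ids) == len(tokens))             # True
print(tokenizer.vocab_size)
```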

ReplaceUnknownToken

__call__(self, x) (special)

Convert <unk> tokens in a list of tokens to [UNK].

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | List[str] | List of tokens | required |

Returns:

| Type | Description |
|---|---|
| List[str] | List of tokens |

Source code in slp/data/transforms.py
          def __call__(self, x: List[str]) -> List[str]:
    """Convert <unk> in list of tokens to [UNK]
    Args:
        x (List[str]): List of tokens
    Returns:
        List[str]: List of tokens
    """
    return [w if w != self.old_unk else self.new_unk for w in x]

__init__(self, old_unk='<unk>', new_unk='[UNK]') (special)

Replace existing unknown tokens in the vocab with [UNK]. Useful for wikitext.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| old_unk | str | Unk token in the corpus. Defaults to "<unk>". | '<unk>' |
| new_unk | str | Desired unk value. Defaults to SPECIAL_TOKENS.UNK.value. | '[UNK]' |

Source code in slp/data/transforms.py
          def __init__(
    self,
    old_unk: str = "<unk>",
    new_unk: str = SPECIAL_TOKENS.UNK.value,  # type: ignore
):
    """Replace existing unknown tokens in the vocab to [UNK]. Useful for wikitext
    Args:
        old_unk (str): Unk token in corpus. Defaults to "<unk>".
        new_unk (str): Desired unk value. Defaults to SPECIAL_TOKENS.UNK.value.
    """
    self.old_unk = old_unk
    self.new_unk = new_unk

SentencepieceTokenizer

__call__(self, x) (special)

Call to tokenize function.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | str | Input string | required |

Returns:

| Type | Description |
|---|---|
| List[int] | List of token ids |

Source code in slp/data/transforms.py
          def __call__(self, x: str) -> List[int]:
    """Call to tokenize function
    Args:
        x (str): Input string
    Returns:
        List[int]: List of tokens ids
    """
    if self.lower:
        x = x.lower()
    ids: List[int] = self.pre_id + self.tokenizer.encode_as_ids(x) + self.post_id
    return ids

__init__(self, lower=True, model=None, prepend_bos=False, append_eos=False, specials=<enum 'SPECIAL_TOKENS'>) (special)

Tokenize a sentence using a pretrained sentencepiece model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lower | bool | Lowercase string. Defaults to True. | True |
| model | Optional[Any] | Sentencepiece model. Defaults to None. | None |
| prepend_bos | bool | Prepend BOS for seq2seq. Defaults to False. | False |
| append_eos | bool | Append EOS for seq2seq. Defaults to False. | False |
| specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | <enum 'SPECIAL_TOKENS'> |

Source code in slp/data/transforms.py
          def __init__(
    self,
    lower: bool = True,
    model: Optional[Any] = None,
    prepend_bos: bool = False,
    append_eos: bool = False,
    specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS,  # type: ignore
):
    """Tokenize sentence using pretrained sentencepiece model
    Args:
        lower (bool): Lowercase string. Defaults to True.
        model (Optional[Any]): Sentencepiece model. Defaults to None.
        prepend_bos (bool): Prepend BOS for seq2seq. Defaults to False.
        append_eos (bool): Append EOS for seq2seq. Defaults to False.
        specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
    """
    self.tokenizer = spm.SentencePieceProcessor()
    self.tokenizer.Load(model)
    self.specials = specials
    self.lower = lower
    self.vocab_size = self.tokenizer.get_piece_size()
    self.pre_id = []
    self.post_id = []
    if prepend_bos:
        self.pre_id.append(self.tokenizer.piece_to_id(self.specials.BOS.value))  # type: ignore
    if append_eos:
        self.post_id.append(self.tokenizer.piece_to_id(self.specials.EOS.value))  # type: ignore

SpacyTokenizer

__call__(self, x) (special)

Call to tokenize function.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | str | Input string | required |

Returns:

| Type | Description |
|---|---|
| List[str] | List of tokens |

Source code in slp/data/transforms.py
          def __call__(self, x: str) -> List[str]:
    """Call to tokenize function
    Args:
        x (str): Input string
    Returns:
        List[str]: List of tokens
    """
    if self.lower:
        x = x.lower()
    out: List[str] = (
        self.pre_id + [y.text for y in self.nlp.tokenizer(x)] + self.post_id
    )
    return out

__init__(self, lower=True, prepend_bos=False, append_eos=False, specials=<enum 'SPECIAL_TOKENS'>, lang='en_core_web_sm') (special)

Apply the spacy tokenizer to a string.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lower | bool | Lowercase string. Defaults to True. | True |
| prepend_bos | bool | Prepend BOS for seq2seq. Defaults to False. | False |
| append_eos | bool | Append EOS for seq2seq. Defaults to False. | False |
| specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | <enum 'SPECIAL_TOKENS'> |
| lang | str | Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_sm". | 'en_core_web_sm' |

Source code in slp/data/transforms.py
          def __init__(
    self,
    lower: bool = True,
    prepend_bos: bool = False,
    append_eos: bool = False,
    specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS,  # type: ignore
    lang: str = "en_core_web_sm",
):
    """Apply spacy tokenizer to str
    Args:
        lower (bool): Lowercase string. Defaults to True.
        prepend_bos (bool): Prepend BOS for seq2seq. Defaults to False.
        append_eos (bool): Append EOS for seq2seq. Defaults to False.
        specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
        lang (str): Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md".
    """
    self.lower = lower
    self.specials = SPECIAL_TOKENS
    self.lang = lang
    self.pre_id = []
    self.post_id = []
    if prepend_bos:
        self.pre_id.append(self.specials.BOS.value)
    if append_eos:
        self.post_id.append(self.specials.EOS.value)
    self.nlp = self.get_nlp(name=lang, specials=specials)

get_nlp(self, name='en_core_web_sm', specials=<enum 'SPECIAL_TOKENS'>)

Get the spacy nlp object for the given language and add the SPECIAL_TOKENS as special cases.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_sm". | 'en_core_web_sm' |
| specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | <enum 'SPECIAL_TOKENS'> |

Returns:

| Type | Description |
|---|---|
| spacy.Language | spacy text-processing pipeline |

Source code in slp/data/transforms.py
          def get_nlp(
    self,
    name: str = "en_core_web_sm",
    specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS,  # type: ignore
) -> spacy.Language:
    """Get spacy nlp object for given lang and add SPECIAL_TOKENS
    Args:
        name (str): Spacy language, e.g. el_core_web_sm, en_core_web_sm etc. Defaults to "en_core_web_md".
        specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
    Returns:
        spacy.Language: spacy text-processing pipeline
    """
    nlp = spacy.load(name)
    if specials is not None:
        for token in specials.to_list():
            nlp.tokenizer.add_special_case(token, [{ORTH: token}])
    return nlp
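
A short sketch, assuming the en_core_web_sm spacy model is installed.

```python
from slp.data.transforms import SpacyTokenizer

tokenizer = SpacyTokenizer(lang="en_core_web_sm", prepend_bos=True, append_eos=True)
print(tokenizer("The cat sat on the mat."))
# ['[BOS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[EOS]']
```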

ToTensor

__call__(self, x) (special)

Convert a list of tokens or a list of features to a tensor.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | List[Any] | List of tokens or features | required |

Returns:

| Type | Description |
|---|---|
| torch.Tensor | Resulting tensor |

Source code in slp/data/transforms.py
          def __call__(self, x: List[Any]) -> torch.Tensor:
    """Convert list of tokens or list of features to tensor
    Args:
        x (List[Any]): List of tokens or features
    Returns:
        torch.Tensor: Resulting tensor
    """
    return mktensor(x, device=self.device, dtype=self.dtype)

__init__(self, device='cpu', dtype=torch.int64) (special)

To tensor convertor.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| device | str | Device to map the tensor to. Defaults to "cpu". | 'cpu' |
| dtype | torch.dtype | Type of the resulting tensor. Defaults to torch.long. | torch.int64 |

Source code in slp/data/transforms.py
          def __init__(self, device: str = "cpu", dtype: torch.dtype = torch.long):
    """To tensor convertor
    Args:
        device (str): Device to map the tensor. Defaults to "cpu".
        dtype (torch.dtype): Type of resulting tensor. Defaults to torch.long.
    """
    self.device = device
    self.dtype = dtype

ToTokenIds

__call__(self, x) (special)

Convert a list of tokens to a list of token ids.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | List[str] | List of tokens | required |

Returns:

| Type | Description |
|---|---|
| List[int] | List of token ids |

Source code in slp/data/transforms.py
          def __call__(self, x: List[str]) -> List[int]:
    """Convert list of tokens to list of token ids
    Args:
        x (List[str]): List of tokens
    Returns:
        List[int]: List of token ids
    """
    return [
        self.word2idx[w] if w in self.word2idx else self.word2idx[self.unk_value]
        for w in x
    ]

__init__(self, word2idx, specials=<enum 'SPECIAL_TOKENS'>) (special)

Convert a list of tokens to a list of token ids.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| word2idx | Dict[str, int] | Word to index mapping | required |
| specials | Optional[slp.config.nlp.SPECIAL_TOKENS] | Special tokens. Defaults to SPECIAL_TOKENS. | <enum 'SPECIAL_TOKENS'> |

Source code in slp/data/transforms.py
          def __init__(
    self,
    word2idx: Dict[str, int],
    specials: Optional[SPECIAL_TOKENS] = SPECIAL_TOKENS,  # type: ignore
):
    """Convert List of tokens to list of token ids
    Args:
        word2idx (Dict[str, int]): Word to index mapping
        specials (Optional[SPECIAL_TOKENS]): Special tokens. Defaults to SPECIAL_TOKENS.
    """
    self.word2idx = word2idx
    self.unk_value = specials.UNK.value if specials is not None else "[UNK]"  # type: ignore
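
A final sketch chaining the transforms: tokens to ids to tensor; out-of-vocabulary words fall back to the [UNK] index. The toy word2idx is made up for illustration.

```python
import torch

from slp.data.transforms import ToTensor, ToTokenIds

word2idx = {"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4}  # toy mapping

to_ids = ToTokenIds(word2idx)
to_tensor = ToTensor(dtype=torch.long)

tokens = ["the", "cat", "sat", "purring"]
ids = to_ids(tokens)   # [2, 3, 4, 1]  ("purring" is out of vocabulary -> [UNK])
print(to_tensor(ids))  # tensor([2, 3, 4, 1])
```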