> 2021年07月16日信息消化

### Notes to Myself on Software Engineering

原文:[Notes to Myself on Software Engineering](https://medium.com/s/story/notes-to-myself-on-software-engineering-c890f16f4e4d)

#### On the Development Process

1. Code isn’t just meant to be executed. Code is also a means of communication across a team, a way to describe to others the solution to a problem. Readable code is not a nice-to-have, it is a fundamental part of what writing code is about. This involves factoring code clearly, picking self-explanatory variable names, and inserting comments to describe anything that’s implicit. 代码不仅仅是用来执行的。代码也是团队间交流的手段,是向他人描述问题解决方案的一种方式。可读的代码不是可有可无的东西,而是编写代码的基本组成部分。这涉及清晰地划分代码,选择不言自明的变量名,并插入注释来描述任何隐含的东西。
2. Ask not what your pull request can do for your next promotion, ask what your pull request can do for your users and your community. Avoid “conspicuous contribution” at all cost. Let no feature be added if it isn’t clearly helping with the purpose of your product. 不要问你的拉取请求能为你的下一次晋升做什么,要问你的拉取请求能为你的用户和社区做什么。不惜一切代价避免“炫耀式贡献”。如果一个功能不能明显地服务于产品的目标,就不要添加它。
3. Taste applies to code, too. Taste is a constraint-satisfaction process regularized by a desire for simplicity. Keep a bias toward simplicity. 品味也适用于代码。品味是一个被简单性渴望所规范的约束满足过程。保持对简单的偏爱。
4. It’s okay to say no — just because someone asks for a feature doesn’t mean you should do it. Every feature has a cost that goes beyond the initial implementation: maintenance cost, documentation cost, and cognitive cost for your users. Always ask: Should we really do this? Often, the answer is simply no. 说“不”是可以的——仅仅因为有人要求一个功能,并不意味着你就应该做。每个功能都有超出最初实现的成本:维护成本、文档成本,以及用户的认知成本。总是问:我们真的应该做这个吗?通常情况下,答案是否定的。
5. When you say yes to a request for supporting a new use case, remember that literally adding what the user requested is often not the optimal choice. Users are focused on their own specific use case, and you must counter this with a holistic and principled vision of the whole project.
Often, the right answer is to extend an existing feature. 当你同意支持一个新用例的请求时,请记住,照字面添加用户要求的东西往往不是最佳选择。用户关注的是他们自己的特定用例,而你必须以对整个项目的整体性、原则性的眼光来平衡这一点。通常,正确的答案是扩展一个现有的功能。
6. Invest in continuous integration and aim for full unit test coverage. Make sure you are in an environment where you can code with confidence; if that isn’t the case, start by focusing on building the right infrastructure. 对持续集成进行投资,并以全面的单元测试覆盖率为目标。确保你处在一个可以放心编码的环境中;如果不是这样,就先专注于建立正确的基础设施。
7. It’s okay not to plan everything in advance. Try things and see how they turn out. Revert incorrect choices early. Make sure you create an environment where that is possible. 不必事先把一切都计划好。试着做一些事情,看看结果如何。尽早回退错误的选择。确保你创造了一个可以这样做的环境。
8. Good software makes hard things easy. Just because a problem looks difficult at first doesn’t mean the solution will have to be complex or hard to use. Too often, engineers go with reflex solutions that introduce undesirable complexity (*Let’s use ML! Let’s build an app! Let’s add blockchain!*) in situations where a far easier, though maybe less obvious, alternative is available. Before you write any code, make sure your solution of choice cannot be made any simpler. Approach everything from first principles. 好的软件使困难的事情变得简单。一个问题一开始看起来很困难,并不意味着解决方案就必须复杂或难以使用。很多时候,工程师会采用下意识的解决方案(*让我们用ML!让我们做个App!让我们加上区块链!*),在存在更简单(尽管可能不那么明显)的替代方案的情况下引入不必要的复杂性。在你写任何代码之前,确保你选择的解决方案不能再简化了。从第一性原理出发来处理一切。
9. Avoid implicit rules. Implicit rules that you find yourself developing should always be made explicit and shared with others or automated. Whenever you find yourself coming up with a recurring, quasi-algorithmic workflow, you should seek to formalize it into a documented process, so that other team members will benefit from the experience. In addition, you should seek to automate in software any part of such a workflow that can be automated (e.g., correctness checks).
避免隐性规则。你发现自己在形成的隐含规则,应该始终被明确化,并与他人分享或加以自动化。每当你发现自己想出一个经常性的、准算法式的工作流程时,你应该设法将其正式化为一个有文档记录的过程,以便其他团队成员从中受益。此外,你应该设法用软件将这种工作流程中任何可以自动化的部分(例如正确性检查)自动化。
10. The total impact of your choices should be taken into account in the design process, not just the bits you want to focus on — such as revenue or growth. Beyond the metrics you are monitoring, what total impact does your software have on its users, on the world? Are there undesirable side effects that outweigh the value proposition? What can you do to address them while preserving the software’s usefulness? 在设计过程中应该考虑到你的选择的总体影响,而不仅仅是你想关注的部分——如收入或增长。除了你所监控的指标之外,你的软件对它的用户、对世界有什么总体影响?是否有超过价值主张的不良副作用?你可以做什么来解决这些问题,同时保持软件的实用性?

> Design for ethics. Bake your values into your creations.

#### On API Design

1. Your API has users, thus it has a user experience. In every decision you make, always keep the user in mind. Have empathy for your users, whether they are beginners or experienced developers. 你的API有用户,因此它有用户体验。在你做的每一个决定中,都要把用户放在心上。对你的用户要有同理心,不管他们是初学者还是有经验的开发者。
2. Always seek to minimize the cognitive load imposed on your users in the course of using your API. Automate what can be automated, minimize the actions and choices needed from the user, don’t expose options that are unimportant, design simple and consistent workflows that reflect simple and consistent mental models. 在使用你的API的过程中,总是设法尽量减少强加给用户的认知负担。把可以自动化的东西自动化,尽量减少用户需要的操作和选择,不要暴露不重要的选项,设计反映简单一致的心智模型的简单一致的工作流程。
3. Simple things should be simple, complex things should be possible. Don’t increase the cognitive load of common use cases for the sake of niche use cases, even minimally. 简单的事情应该是简单的,复杂的事情应该是可能的。不要为了小众用例而增加常见用例的认知负荷,哪怕只是增加一点点。
4. If the cognitive load of a workflow is sufficiently low, it should be possible for a user to go through it from memory (without looking up a tutorial or documentation) after having done it once or twice. 如果一个工作流程的认知负荷足够低,那么用户在做了一两次之后,应该可以凭记忆完成它(不需要翻阅教程或文档)。
5. Seek to have an API that matches the mental models of domain experts and practitioners. Someone who has domain experience, but no experience with your API, should be able to intuitively understand your API using minimal documentation, mostly just by looking at a couple of code examples and seeing what objects are available and what their signatures are. 争取拥有一个与领域专家和从业者的心智模型相匹配的API。一个有领域经验但对你的API没有经验的人,应该能够用最少的文档直观地理解你的API,主要是通过看几个代码例子,看看有哪些对象可用、它们的签名是什么。
6. The meaning of an argument should be understandable without having any context about the underlying implementation. Arguments that have to be specified by users should relate to the mental models that the users have about the problem, not to implementation details in your code. An API is all about the problem it solves, not about how the software works in the background. 一个参数的含义应该无需任何关于底层实现的背景就能理解。必须由用户指定的参数应该与用户对问题的心智模型有关,而不是与你代码中的实现细节有关。API关乎它所解决的问题,而不是关乎软件如何在后台工作。
7. The most powerful mental models are modular and hierarchical: simple at a high level, yet precise as you need to go into details. In the same way, a good API is modular and hierarchical: easy to approach, yet expressive. There is a balance to strike between having complex signatures on fewer objects, and having more objects with simpler signatures. A good API has a reasonable number of objects, with reasonably simple signatures. 最强大的心智模型是模块化和层次化的:在高层次上简单,在需要深入细节时又足够精确。同样地,一个好的API也是模块化和分层的:易于上手,但又有表现力。在较少的对象配复杂的签名和较多的对象配简单的签名之间,要取得平衡。一个好的API有合理数量的对象,以及合理简单的签名。
8. Your API is inevitably a reflection of your implementation choices, in particular your choice of data structures. To achieve an intuitive API, you must choose data structures that naturally fit the domain at hand — that match the mental models of domain experts. 你的API不可避免地反映了你的实现选择,特别是你对数据结构的选择。为了实现直观的API,你必须选择自然适合手头领域的数据结构——与领域专家的心智模型相匹配。
9. Deliberately design end-to-end workflows, not a set of atomic features.
Most developers approach API design by asking: *What capabilities should be available? Let’s have configuration options for them.* Instead, ask: *What are the use cases for this tool? For each use case, what is the optimal sequence of user actions? What’s the easiest API that could support this workflow?* Atomic options in your API should answer a clear need that arises in a high-level workflow — they should not be added “because someone might need it.” 要刻意设计端到端的工作流程,而不是一组原子化的功能。大多数开发者在设计API时会问:*应该提供哪些功能?让我们为它们提供配置选项。* 相反,应该问:*这个工具的用例是什么?对于每个用例,用户操作的最佳顺序是什么?能支持这个工作流程的最简单的API是什么?* 你的API中的原子选项应该回应高层工作流程中出现的明确需求——它们不应该仅仅“因为有人可能需要”而被添加。
10. Error messages, and in general any feedback provided to a user in the course of interacting with your API, is part of the API. Interactivity and feedback are integral to the user experience. Design your API’s error messages deliberately. 错误信息,以及在与你的API交互过程中提供给用户的任何反馈,都是API的一部分。交互性和反馈是用户体验的组成部分。慎重地设计你的API的错误信息。
11. Because code is communication, naming matters — whether naming a project or a variable. Names reflect how you think about a problem. Avoid overly generic names (*x, variable, parameter*), avoid *OverlyLongAndSpecificNamingPatterns*, avoid terms that can create unnecessary friction (*master, slave*), and make sure you are consistent in your naming choices. Naming consistency means both internal naming consistency (don’t call “dim” what is called “axis” in other places) and consistency with established conventions for the problem domain. Before settling on a name, make sure to look up existing names used by domain experts (or other APIs). 因为代码就是交流,所以命名很重要——无论是给项目还是变量命名。名称反映了你对问题的思考方式。避免过于泛化的名称(*x、variable、parameter*),避免过于冗长和具体的命名模式,避免会造成不必要摩擦的术语(*master、slave*),并确保你的命名选择是一致的。命名的一致性既指内部命名的一致性(不要把其他地方称为“axis”的东西称为“dim”),也指与问题领域既定惯例的一致性。在确定一个名字之前,一定要查一下领域专家(或其他API)使用的现有名字。
12. Documentation is central to the user experience of your API. It is not an add-on.
Invest in high-quality documentation; you will see higher returns than investing in more features. 文档是你的API用户体验的核心。它不是附加物。投资于高质量的文档;你会看到比投资于更多功能更高的回报。
13. Show, don’t tell: Your documentation should not talk about how the software works, it should show how to use it. Show code examples for end-to-end workflows; show code examples for each and every common use case and key feature of your API. 展示,而不是讲述:你的文档不应该谈论软件如何工作,而应该展示如何使用它。展示端到端工作流程的代码示例;展示你的API每一个常见用例和关键功能的代码示例。

> Productivity boils down to high-velocity decision-making and a bias for action.

#### On Software Careers

1. **Career progress is not how many people you manage, it is how much of an impact you make**: the differential between a world with and without your work. 职业发展不是你管理了多少人,而是你产生了多大的影响:有你的工作和没有你的工作的世界之间的差异。
2. Software development is teamwork; it is about relationships as much as it is about technical ability. Be a good teammate. As you go on your way, stay in touch with people. 软件开发是团队工作;它既关乎人际关系,也关乎技术能力。做一个好队友。在你前进的道路上,与人们保持联系。
3. **Technology is never neutral. If your work has any impact on the world, then this impact has a moral direction.** The seemingly innocuous technical choices we make in software products modulate the terms of access to technology, its usage incentives, who will benefit, and who will suffer. Technical choices are also ethical choices. Thus, always be deliberate and explicit about the values you want your choices to support. Design for ethics. Bake your values into your creations. Never think, *I’m just building the capability; that in itself is neutral.* It is not because the way you build it determines how it will get used. 技术从来不是中立的。如果你的工作对世界有任何影响,那么这种影响就有道德方向。我们在软件产品中做出的看似无害的技术选择,调节着技术的获取条件、使用激励,以及谁将受益、谁将受损。技术选择也是道德选择。因此,对于你希望你的选择支持哪些价值观,一定要慎重和明确。为道德而设计。将你的价值观融入你的创作中。不要想,*我只是在构建能力,这本身是中立的。* 并非如此,因为你构建它的方式决定了它将被如何使用。
4. Self-direction — agency over your work and your circumstances — is the key to life satisfaction.
Make sure you grant sufficient self-direction to the people around you, and make sure your career choices result in greater agency for yourself. 自我导向——对自己的工作和处境拥有自主权——是生活满意度的关键。确保你给予周围的人足够的自我导向,并确保你的职业选择能为你自己带来更大的自主权。
5. **Build what the world needs — not just what you wish you had.** Too often, technologists live rarefied lives and focus on products catering to their own specific needs. Seek opportunities to broaden your life experience, which will give you better visibility into what the world needs. 创造世界所需要的东西——而不仅仅是你希望自己拥有的东西。技术人员往往过着与大众隔绝的生活,专注于满足自己特定需求的产品。寻找机会拓宽你的生活经验,这将使你更好地了解世界的需求。
6. When making any choice with long-term repercussions, place your values above short-term self-interest and passing emotions — such as greed or fear. Know what your values are, and let them guide you. 在做出任何具有长期影响的选择时,将你的价值观置于短期自我利益和短暂的情绪(如贪婪或恐惧)之上。知道你的价值观是什么,并让它们指导你。
7. When we find ourselves in a conflict, it’s a good idea to pause to acknowledge our shared values and our shared goals, and remind ourselves that we are, almost certainly, on the same side. 当我们发现自己处于冲突中时,最好暂停一下,承认我们共同的价值观和共同的目标,并提醒自己,我们几乎可以肯定是站在同一边的。
8. Productivity boils down to high-velocity decision-making and a bias for action. This requires *a*) good intuition, which comes from experience, so as to make generally correct decisions given partial information, *b*) a keen awareness of when to move more carefully and wait for more information, because the cost of an incorrect decision would be greater than the cost of the delay. The optimal velocity/quality decision-making tradeoff can vary greatly in different environments. 生产力可以归结为高速的决策和行动的偏向。这需要:*a*)良好的直觉,它来自经验,以便在只有部分信息的情况下做出基本正确的决定;*b*)敏锐地意识到何时应该更谨慎地行动、等待更多信息,因为错误决定的成本将大于延误的成本。在不同的环境中,最佳的速度/质量决策权衡会有很大的不同。
9. Making decisions faster means you make more decisions over the course of your career, which will give you stronger intuition about the correctness of available options.
Experience is key to productivity, and greater productivity will provide you with more experience: a virtuous cycle. 更快的决策意味着你在职业生涯中做出更多的决定,这将使你对可选方案的正确性有更强的直觉。经验是生产力的关键,而更高的生产力会为你提供更多的经验:一个良性循环。
10. In situations where you are aware that your intuition is lacking, adhere to abstract principles. Build up lists of tried-and-true principles throughout your career. Principles are formalized intuition that generalize to a broader range of situations than raw pattern recognition (which requires direct and extensive experience of similar situations). 在你意识到自己直觉不足的情况下,坚持抽象的原则。在职业生涯中建立起久经考验的原则清单。原则是形式化的直觉,比原始的模式识别(需要对类似情况有直接而广泛的经验)能推广到更多的情况。

### How to Train a BERT Model From Scratch

原文:[How to Train a BERT Model From Scratch](https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6)

the process looks a little like this:

- `pip install transformers`
- Initialize a pre-trained transformers model — `from_pretrained`.
- Test it on some data.
- *Maybe* fine-tune the model (train it some more).

#### Getting The Data

One of the largest datasets in the domain of text scraped from the internet is the OSCAR dataset.

https://huggingface.co/datasets/viewer/

- Dataset: oscar
- Subset: unshuffled_deduplicated_ja
- Size: 154116275128

```python
!pip install datasets

# check available datasets
import datasets
all_ds = datasets.list_datasets()
len(all_ds)

# load the Japanese OSCAR data
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_ja')
# Colab: OSError: Not enough disk space. Needed: 143.53 GiB

# check the data: {'id': 0, 'text': 'xxxxx'}
dataset['train']
dataset['train'].features
dataset['train'][0]
```

Great, now let’s store our data in a format `un formato` that we can use when building our tokenizer. We need to create a set of plaintext files containing just the `text` feature from our dataset, and we will split each *sample* using a newline `\n`.
```python
from tqdm.auto import tqdm

text_data = []
file_count = 0

for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we hit the 10K mark, save to file
        with open(f'../../data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# after saving in 10K chunks, we will have ~2082 leftover samples — save those now too
with open(f'../../data/text/oscar_it/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
```

#### Building a Tokenizer

When building our tokenizer we will feed it all of our OSCAR data, specify our vocabulary size (number of tokens in the tokenizer), and any special tokens. Now, the RoBERTa special tokens look like this:

| Token | Use |
| -------- | ----------------------------------------------------- |
| `<s>` | Beginning of sequence (BOS) or classifier (CLS) token |
| `</s>` | End of sequence (EOS) or separator (SEP) token |
| `<unk>` | Unknown token |
| `<pad>` | Padding token |
| `<mask>` | Masking token |

So, we make sure to include them within the `special_tokens` parameter of our tokenizer’s `train` method call.

```python
# get a list of paths to each file in our oscar_it directory
from pathlib import Path
paths = [str(x) for x in Path('../../data/text/oscar_it').glob('**/*.txt')]
```

Now we move on to training the tokenizer. We use a byte-level byte-pair encoding (BPE) tokenizer. This allows us to build the vocabulary from an alphabet of single bytes, meaning all words will be decomposable into tokens.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths[:5], vocab_size=30_522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])
```

Our tokenizer is now ready, and we can save it to file for later use:

```python
import os

os.mkdir('./filiberto')
tokenizer.save_model('filiberto')
```

Now we have two files that define our new *FiliBERTo* tokenizer:

- *merges.txt* — performs the initial mapping of text to tokens
- *vocab.json* — maps the tokens to token IDs

#### Initializing the Tokenizer

We first initialize the tokenizer using the two files we built before — using a simple `from_pretrained`:

```python
from transformers import RobertaTokenizer

# initialize the tokenizer using the files we trained and saved above
tokenizer = RobertaTokenizer.from_pretrained('filiberto', max_len=512)

# test our tokenizer on a simple sentence
tokens = tokenizer('ciao, come va?')
print(tokens)
>> {'input_ids': [0, 16834, 16, 488, 611, 35, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
tokens.input_ids
>> [0, 16834, 16, 488, 611, 35, 2]
```

From the encodings object `tokens` we will be extracting the `input_ids` and `attention_mask` tensors for use with FiliBERTo.

#### Creating the Input Pipeline

The input pipeline of our training process is the more complex part of the entire process. It consists of us taking our raw OSCAR training data, transforming it, and loading it into a `DataLoader` ready for training. 我们训练过程中的输入管道是整个过程中比较复杂的部分。它包括获取原始的OSCAR训练数据,对其进行转换,并将其加载到 `DataLoader` 中准备训练。

##### Preparing the Data

First, we need to open our files — the same files that we saved as *.txt* files earlier. We split each based on newline characters `\n` as this indicates the individual samples.
首先,我们需要打开我们的文件——也就是我们之前保存为 *.txt* 文件的那些文件。我们根据换行符 `\n` 来分割每个文件,因为这标记着各个样本。

```python
with open('../../data/text/oscar_it/text_0.txt', 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n')
```

Then we encode our data using the `tokenizer` — making sure to include key parameters like `max_length`, `padding`, and `truncation`. 然后,我们使用分词器对数据进行编码——确保包括 `max_length`、`padding` 和 `truncation` 等关键参数。

```python
batch = tokenizer(lines, max_length=512, padding='max_length', truncation=True)
len(batch)
```

And now we can move on to creating our tensors — we will be training our model through masked-language modeling (MLM). So, we need three tensors:

现在我们可以开始创建张量了——我们将通过掩码语言建模(MLM)来训练模型。因此,我们需要三个张量:

- **input_ids** — our *token_ids* with ~15% of tokens masked using the mask token `<mask>`.
- **attention_mask** — a tensor of **1**s and **0**s, marking the position of ‘real’ tokens/padding tokens — used in attention calculations.
- **labels** — our *token_ids* with **no** masking.

If you’re not familiar with MLM, I’ve explained it [here](https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c).

Our `attention_mask` and `labels` tensors are simply extracted from our `batch`. The `input_ids` tensor requires more attention, however: for this tensor we mask ~15% of the tokens, assigning them the `<mask>` token ID — `4`, given the special-token order we passed when training the tokenizer.

In the final output, we can see part of an encoded `input_ids` tensor. The very first token ID is `0` — the `<s>` (BOS/classifier) token. Dotted around the tensor we have several `4` token IDs — these are our newly added `<mask>` tokens.

##### Building the DataLoader

Next, we define our `Dataset` class — which we use to initialize our three encoded tensors as PyTorch `torch.utils.data.Dataset` objects.
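The masking step itself isn’t shown in code here, so below is a minimal sketch of how the three tensors might be built. The small `batch` stand-in and the hard-coded special-token IDs (`<s>`=0, `<pad>`=1, `</s>`=2, `<mask>`=4, following the order passed to the tokenizer’s `train` call) are assumptions for illustration — with a real tokenizer, prefer `tokenizer.mask_token_id` and friends:

```python
import torch

# stand-in for the `batch` produced by the tokenizer above; in the real
# pipeline these IDs come from tokenizer(lines, max_length=512, ...)
batch = {'input_ids': [[0, 10, 11, 12, 2, 1, 1, 1]] * 4,
         'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0]] * 4}

input_ids = torch.tensor(batch['input_ids'])
mask = torch.tensor(batch['attention_mask'])

# labels keep the original, unmasked token IDs
labels = input_ids.detach().clone()

# pick ~15% of positions at random, skipping <s> (0), <pad> (1), and </s> (2)
rand = torch.rand(input_ids.shape)
mask_arr = (rand < 0.15) * (input_ids != 0) * (input_ids != 1) * (input_ids != 2)

# overwrite the selected positions with the <mask> token ID (4 here)
input_ids[mask_arr] = 4
```

With the real `batch`, these three tensors are exactly the `input_ids`, `mask`, and `labels` that the `encodings` dictionary below expects.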
```python
encodings = {'input_ids': input_ids, 'attention_mask': mask, 'labels': labels}

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # store encodings internally
        self.encodings = encodings

    def __len__(self):
        # return the number of samples
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # return dictionary of input_ids, attention_mask, and labels for index i
        return {key: tensor[i] for key, tensor in self.encodings.items()}

# initialize our Dataset
dataset = Dataset(encodings)

# initialize the dataloader, which will load the data into the model during training
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```

Finally, our `dataset` is loaded into a PyTorch `DataLoader` object — which we use to load our data into our model during training.

#### Training the Model

We need two things for training, our `DataLoader` and a model.

##### Initializing the Model

For training, we need a raw (not pre-trained) model with a language-modeling head — here, `RobertaForMaskedLM`. To create that, we first need to create a RoBERTa config object to describe the parameters we’d like to initialize FiliBERTo with.

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=30_522,  # we align this to the tokenizer vocab_size
    max_position_embeddings=514,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)
```

Then, we import and initialize our RoBERTa model with a language modeling (LM) head.

```python
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config)
```

##### Training Preparation

Before moving on to our training loop we need to set up a few things. First, we set up GPU/CPU usage. Then we activate the training mode of our model — and finally, initialize our optimizer.

```python
# setup GPU/CPU usage
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device)

# activate training mode, and initialize our optimizer
# (AdamW — Adam with weight decay, which reduces the chance of overfitting)
from transformers import AdamW

model.train()
optim = AdamW(model.parameters(), lr=1e-4)
```

#### Training

Finally — training time! We train just as we usually would when training via PyTorch.

```python
epochs = 2

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # initialize calculated gradients (from prev step)
        optim.zero_grad()
        # pull all tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # process
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        # extract loss
        loss = outputs.loss
        # calculate gradients for every parameter that needs a grad update
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to the progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
```

#### The Real Test

Now it’s time for the real test. We set up an MLM pipeline — and ask Laura to assess the results.

```python
from transformers import pipeline

# save the trained model to disk first, so the pipeline can load it by path
model.save_pretrained('filiberto')

fill = pipeline('fill-mask', model='filiberto', tokenizer='filiberto')
fill(f'ciao {fill.tokenizer.mask_token} va?')
>> [{'sequence': 'ciao come va?',
    'score': 0.33601945638656616,
    'token': 482,
    'token_str': 'Ġcome'},
   {'sequence': 'ciao, va?',
    'score': 0.13736604154109955,
    'token': 16,
    'token_str': ','},
   {'sequence': 'ciao mi va?',
    'score': 0.05658061057329178,
    'token': 474,
    'token_str': 'Ġmi'}]
```

*“ciao* **come** *va?”* is the right answer!

### 一点收获

- Do you have a long to-do list and not sure where to start? Try the MoSCoW method - it takes 5 minutes!
  ✅ [@anthilemoon](https://twitter.com/anthilemoon/status/1415246894620725254?ck_subscriber_id=1187272855&utm_source=convertkit&utm_medium=email&utm_campaign=Maker+Mind%3A+The+perfect+productivity+system+%E2%9C%A8%20-%206230751)
- ![image-20210726112917650](https://raw.githubusercontent.com/Phalacrocorax/memo-image-host/master/uPic/image-20210726112917650.png)
- This week, Cybin agreed to sponsor a Kernel-run study using Flow to measure brain activity after administration of **ketamine**, an FDA-approved anaesthetic which has generated a lot of excitement lately for its potential off-label use. The study will be submitted for ethics and regulatory reviews, and should yield important data on how our state-of-the-art neuroimaging technology will introduce new levels of scientific rigor for insight and discovery about the brain and mind.