PyTorch Knowledge Notes 02

Posted 2022-06-03 · Updated 2023-04-13 · Category: Deep Learning

These notes record the PyTorch and Python knowledge points I picked up while implementing DDRQM.

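1. torch.utils.data.Dataset: an abstract class representing a dataset.

Its full signature is: CLASS torch.utils.data.Dataset(*args, **kwds).

All datasets that represent a map from keys to data samples should subclass it. Every subclass should override __getitem__() so that a data sample can be fetched for a given key. Subclasses may optionally override __len__(), which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset.

PS: By default, DataLoader constructs an index sampler that yields integer indices. To make it work with a map-style dataset that has non-integer indices/keys, a custom sampler must be provided.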
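As a rough illustration of the note above, the sketch below builds a map-style dataset keyed by strings and passes an explicit list of keys as the sampler. The class name KeyedDataset and the sample data are made up for this example; it relies on DataLoader accepting any iterable of keys as its sampler argument.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class KeyedDataset(Dataset):
    """Hypothetical map-style dataset whose keys are strings, not integers."""

    def __init__(self, samples):
        self.samples = samples  # dict: key -> tensor

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, key):
        return self.samples[key]


# Four toy samples keyed "a".."d"; sample i is a length-2 tensor filled with i.
samples = {name: torch.full((2,), float(i)) for i, name in enumerate(["a", "b", "c", "d"])}
dataset = KeyedDataset(samples)

# The default integer sampler would ask for dataset[0], which does not exist here,
# so an explicit iterable of keys is supplied as the sampler instead.
loader = DataLoader(dataset, sampler=list(samples.keys()), batch_size=2)

batches = [batch for batch in loader]
print([tuple(b.shape) for b in batches])
```

A custom Sampler subclass that yields keys in a shuffled order would plug into the same sampler argument.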


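2. Creating a Custom Dataset for your files: building a custom dataset for your own files.

A custom dataset class must implement three functions: __init__, __len__, and __getitem__. In the classic FashionMNIST-style implementation below, the images are stored in the directory img_dir and the labels are stored separately in a CSV file annotations_file. The following looks at what happens in each function: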

import os
import pandas as pd
from torch.utils.data import Dataset
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label

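__init__: runs once when the dataset object is instantiated. It initializes the directory containing the images, the annotations file, and both transforms. The labels.csv file looks like this: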

tshirt1.jpg, 0
tshirt2.jpg, 0
......
ankleboot999.jpg, 9

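__len__: returns the number of samples in the dataset.

__getitem__: loads and returns the sample at the given index idx. Based on the index, it locates the image on disk, converts it to a tensor with read_image, retrieves the corresponding label from the CSV data in self.img_labels, applies the transform functions to them if applicable, and returns the tensor image and its label as a tuple.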
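To see the three methods working together without image files on disk, here is a minimal sketch that follows the same pattern with in-memory tensors and feeds the dataset to a DataLoader. The class InMemoryImageDataset and the random data are assumptions made for this illustration, not part of the tutorial.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class InMemoryImageDataset(Dataset):
    """Hypothetical dataset mirroring CustomImageDataset, backed by tensors."""

    def __init__(self, images, labels, transform=None, target_transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image, label = self.images[idx], self.labels[idx]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label


images = torch.randint(0, 256, (10, 1, 28, 28), dtype=torch.uint8)  # fake grayscale images
labels = torch.arange(10) % 3                                        # fake class ids 0..2
dataset = InMemoryImageDataset(images, labels, transform=lambda img: img.float() / 255.0)

# DataLoader uses __len__ to know how many indices exist and __getitem__ to fetch each one.
loader = DataLoader(dataset, batch_size=4, shuffle=False)
batch_images, batch_labels = next(iter(loader))
print(len(dataset), tuple(batch_images.shape), tuple(batch_labels.shape))
```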

References:

Datasets & Dataloaders

Understanding __getitem__ method

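3. argparse: Parser for command-line options, arguments and sub-commands.

Its source code is in Lib/argparse.py. The following is reference information for the API. The argparse module makes it easy to write user-friendly command-line interfaces: the program defines which arguments it requires, and argparse figures out how to parse them out of sys.argv. The module also generates help and usage messages automatically and reports errors when users give the program invalid arguments. An example follows.

In programming, arguments are values passed between programs, subroutines, or functions; they are independent items (units of data) or variables that contain data or code. When an argument is used to customize a program for a user, it is usually called a parameter. In C, argc (ARGument Count) is a default variable available when the program runs, holding the number of arguments given on the command line.

The following code is a program that takes a list of integers as input and produces either their sum or their maximum: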

import argparse

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')

args = parser.parse_args()
print(args.accumulate(args.integers))

Assume the code above is saved in a file named prog.py. It can be run from the command line and provides useful help messages:

$ python prog.py -h
usage: prog.py [-h] [--sum] N [N ...]

Process some integers.

positional arguments:
 N           an integer for the accumulator

options:
 -h, --help  show this help message and exit
 --sum       sum the integers (default: find the max)

When run with valid arguments, it prints either the sum or the maximum of the integers:

$ python prog.py 1 2 3 4
4

$ python prog.py 1 2 3 4 --sum
10

If invalid arguments are passed in, an error is reported:

$ python prog.py a b c
usage: prog.py [-h] [--sum] N [N ...]
prog.py: error: argument N: invalid int value: 'a'

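The example is explained in detail below.

Creating a parser: the first step in using argparse is to create an ArgumentParser object:

parser = argparse.ArgumentParser(description='Process some integers.')

The ArgumentParser object holds all the information necessary to parse the command line into Python data types.

Adding arguments: an ArgumentParser is filled with information about the program's arguments by calling add_argument(). Generally, these calls tell the ArgumentParser how to take the strings on the command line and turn them into objects. This information is stored and used when parse_args() is called, for example: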

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('integers', metavar='N', type=int, nargs='+',
                    help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const',
                    const=sum, default=max,
                    help='sum the integers (default: find the max)')

args = parser.parse_args()

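Calling parse_args() returns an object with two attributes, integers and accumulate. The integers attribute is a list of one or more integers; accumulate is either the sum() function or the max() function.

Parsing arguments: ArgumentParser parses arguments through parse_args(). This inspects the command line, converts each argument to the appropriate type, and then invokes the appropriate action. In most cases, this means a simple Namespace object is built from the attributes parsed out of the command line: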
>>> parser.parse_args(['--sum', '7', '-1', '42'])
Namespace(accumulate=<built-in function sum>, integers=[7, -1, 42])
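For training scripts, a more typical setup mixes a positional argument, typed options with defaults, and a boolean flag. The sketch below is only an illustration: the option names (data_dir, --epochs, --lr, --multi-gpu) are made up here, and the argument list is passed to parse_args() explicitly so the snippet runs without a real command line.

```python
import argparse


def build_parser():
    # All option names below are hypothetical, chosen for illustration.
    parser = argparse.ArgumentParser(description="Train a model (illustrative options).")
    parser.add_argument("data_dir", help="directory containing the dataset")
    parser.add_argument("--epochs", type=int, default=10, help="number of training epochs")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--multi-gpu", action="store_true",
                        help="flag that is False unless given on the command line")
    return parser


# Passing a list stands in for sys.argv[1:]; '--multi-gpu' is stored as args.multi_gpu.
args = build_parser().parse_args(["./data", "--epochs", "5", "--multi-gpu"])
print(args.data_dir, args.epochs, args.lr, args.multi_gpu)
# prints: ./data 5 0.001 True
```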

For more detail, see the Argparse Tutorial.

References:

argparse

Argparse Tutorial

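4. Reading and Writing Files

open() returns a file object. It is most commonly called with two positional arguments and one keyword argument: open(filename, mode, encoding=None). For example:

f = open('workfile', 'w', encoding='utf-8')

The first argument is the file name.
The second argument is the mode in which the file is opened: r opens the file for reading only; w for writing only (an existing file with the same name is erased); a opens the file for appending, so any data written is added to the end; r+ opens the file for both reading and writing. The mode argument is optional and defaults to r.
The third argument is the file's encoding. Normally files are opened in text mode, meaning strings are read from and written to the file. If no encoding is specified, the default is platform dependent; because UTF-8 is the modern de facto standard, encoding="utf-8" is recommended. In text mode, platform-specific line endings are converted to \n when reading, and \n is converted back to platform-specific line endings when writing.

Using the with keyword is good practice when dealing with file objects. Its advantage is that the file is properly closed after the block finishes, even if an exception is raised. It is also much shorter than the equivalent try-finally block: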

>>> with open('workfile', encoding="utf-8") as f:
...     read_data = f.read()

>>> # We can check that the file has been automatically closed.
>>> f.closed
True
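The w, a, and r modes described above can be tried on a throwaway file. This is a small sketch using a temporary directory; the file name workfile.txt is arbitrary.

```python
import os
import tempfile

# A scratch location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "workfile.txt")

# 'w' creates the file (or erases an existing one) and writes from the start.
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\n")

# 'a' keeps the existing content and adds new data at the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("second line\n")

# The default mode 'r' reads the file; the with block closes it afterwards.
with open(path, encoding="utf-8") as f:
    lines = f.readlines()

print(lines)
print(f.closed)
```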

References:

Reading and Writing Files

5. threading.Thread: multithreading.
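As a quick reminder of the basic API, here is a minimal sketch that starts a few threads and waits for them to finish. The worker function square and the shared results list are made up for this example.

```python
import threading

results = [None] * 4


def square(i):
    # Each thread writes to its own slot, so no lock is needed in this toy case.
    results[i] = i * i


threads = [threading.Thread(target=square, args=(i,)) for i in range(4)]
for t in threads:
    t.start()   # begin running square(i) in a separate thread
for t in threads:
    t.join()    # block until that thread has finished

print(results)  # prints: [0, 1, 4, 9]
```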

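References:

Python 多线程 (Python multithreading)

6. multiprocessing.Process: multiprocessing.

References:

想要利用CPU多核资源一Python中多进程（一） (Using multiple CPU cores: multiprocessing in Python, part 1)

多进程multiprocess (Multiprocessing with multiprocess)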

7. When a Python file contains from PIL import PILLOW_VERSION, the following error may appear:

ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/storage/FT/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/PIL/__init__.py)

The reason is that PILLOW_VERSION has been removed in newer Pillow versions (it was dropped in Pillow 7.0.0). Use PIL.__version__ instead, or install an older Pillow version: pip install Pillow==6.1.

References:

ImportError: cannot import name ‘PILLOW_VERSION’ from ‘PIL’

PILLOW_VERSION constant

8. Python's logging package, as used in SCWSSOD:

import logging as logger
logger.basicConfig(level=logger.INFO, format='%(levelname)s %(asctime)s %(filename)s: %(lineno)d] %(message)s', datefmt='%Y-%m-%d %H:%M:%S', \
                           filename="train_%s.log"%(TAG), filemode="w")
...
logger.info(msg)

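This module defines functions and classes that implement a flexible event logging system for applications and libraries. The key benefit of having the logging API provided by a standard library module is that all Python modules can participate in logging, so an application log can include its own messages integrated with messages from third-party modules. A simple example: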

>>> import logging
>>> logging.warning('Watch Out!')
WARNING:root:Watch Out!
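A named logger with its own handler gives more control than the root logger. The following is a small sketch in which an io.StringIO buffer stands in for a log file so that the snippet is self-contained; the logger name "train" and the message are illustrative.

```python
import io
import logging

stream = io.StringIO()  # stands in for a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("train")  # hypothetical logger name
logger.setLevel(logging.INFO)        # messages below INFO are dropped
logger.addHandler(handler)
logger.propagate = False             # keep messages out of the root logger

logger.debug("not recorded: below the INFO threshold")
logger.info("epoch %d, loss %.3f", 1, 0.25)

print(stream.getvalue())
# prints: INFO train: epoch 1, loss 0.250
```

Swapping the StreamHandler for logging.FileHandler("train.log") would write the same records to disk.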

References:

Logging facility for Python

Logging HOWTo

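9. What does register mean in PyTorch?

In PyTorch documentation and method names, register means "the act of recording a name or piece of information in an official list".

For example, register_backward_hook(hook) adds the function hook to a list of functions kept by the nn.Module; these hooks are invoked when gradients with respect to the module are computed, that is, during the backward pass. (Recent PyTorch versions deprecate this method in favor of register_full_backward_hook.)

Similarly, register_parameter(name, param) adds an nn.Parameter param under the name name to the nn.Module's list of trainable parameters. Registering trainable parameters matters because it is how PyTorch knows which tensors to hand to the optimizer and which tensors to store in the nn.Module's state_dict.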
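The effect of registration can be observed directly. The sketch below uses a made-up module Scale and, as an addition not discussed above, also shows register_buffer (state that is saved but not trained) next to a plain tensor attribute that is not registered at all.

```python
import torch
from torch import nn


class Scale(nn.Module):
    """Hypothetical module used only to show what registration changes."""

    def __init__(self):
        super().__init__()
        # Registered parameter: trainable, returned by parameters(), saved in state_dict.
        self.register_parameter("weight", nn.Parameter(torch.ones(3)))
        # Registered buffer: saved in state_dict but not returned by parameters().
        self.register_buffer("running_mean", torch.zeros(3))
        # Plain tensor attribute: not registered, so nn.Module does not track it.
        self.scratch = torch.zeros(3)

    def forward(self, x):
        return (x - self.running_mean) * self.weight


m = Scale()
param_names = [name for name, _ in m.named_parameters()]
state_keys = list(m.state_dict().keys())
print(param_names)  # prints: ['weight']
print(state_keys)   # prints: ['weight', 'running_mean']
```

An optimizer built with m.parameters() would therefore update only weight.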

References:

What do we mean by ‘register’ in PyTorch?

10. Correspondence between PyTorch, CUDA, and GPU driver versions:

CUDA driver versions and matching CUDA Toolkit versions

PyTorch and cudatoolkit versions

| CUDA and PyTorch versions | Install command |
| --- | --- |
| CUDA 10.1, PyTorch 1.7.1 | `conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch` |
| CUDA 10.1, PyTorch 1.7.0 | `conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch` |
| CUDA 10.1, PyTorch 1.6.0 | `conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch` |
| CUDA 10.1, PyTorch 1.5.1 | `conda install pytorch==1.5.1 torchvision==0.6.1 cudatoolkit=10.1 -c pytorch` |
| CUDA 10.1, PyTorch 1.5.0 | `conda install pytorch==1.5.0 torchvision==0.6.0 cudatoolkit=10.1 -c pytorch` |
| CUDA 10.1, PyTorch 1.4.0 | `conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch` |

References:

pytorch版本，cuda版本，系统cuda版本查询和对应关系 (Looking up PyTorch, CUDA, and system CUDA versions and how they correspond)

INSTALLING PREVIOUS VERSIONS OF PYTORCH

CUDA Compatibility

11. Suppose the files are organized as follows:

/model
|_ vgg.py
|_ vgg_models.py
test.py

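Suppose test.py depends on vgg_models.py: from model.vgg_models import Back_VGG

and vgg_models.py in turn depends on vgg.py: from vgg import B2_VGG.

Running python test.py may then fail with a "No module named ..." import error.

This happens because running test.py puts the script's directory ./ on the module search path, so when vgg_models.py executes its import, Python looks only in ./ and not in ./model/, and the module is not found. One fix is to add ./model/ to the search path at the top of test.py:

import sys
sys.path.insert(0, './model')

An absolute path can be inserted as well:

import sys
sys.path.insert(0, '/storage/FT/SCWSSOD/SCWSSOD31')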
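A relative path such as './model' depends on the directory the script is launched from. One possible variation, sketched below with a hypothetical helper add_import_dir, resolves the directory relative to the script file itself so the import works from any working directory.

```python
import os
import sys


def add_import_dir(script_file, subdir):
    """Put <directory of script_file>/<subdir> at the front of sys.path.

    Hypothetical helper for illustration; returns the directory that was added.
    """
    base = os.path.dirname(os.path.abspath(script_file))
    target = os.path.join(base, subdir)
    if target not in sys.path:
        sys.path.insert(0, target)
    return target


# In test.py this would be called as: add_import_dir(__file__, "model")
```

Another option, not covered in these notes, is a package-relative import inside vgg_models.py (from .vgg import B2_VGG), which avoids touching sys.path.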

References:

import error: ‘No module named’ does exist

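12. Writing a file with cv2.imwrite may fail with an error such as "could not find a writer for the specified extension".

This happens because the output path save_path + name has no file extension; appending an extension such as .png to name fixes it.

References: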

cv::imwrite could not find a writer for the specified extension

13. When weights are initialized with code like the following:

def _initialize_weights(self, pre_train):
        keys = pre_train.keys()
        self.conv1.conv1_1.weight.data.copy_(pre_train[keys[0]])

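an error such as "'odict_keys' object does not support indexing" may appear.

This is because in Python 2 the keys() method of collections.OrderedDict returns a list, whereas in Python 3 it returns an odict_keys view, which cannot be indexed. Converting the odict_keys to a list solves the problem: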

def _initialize_weights(self, pre_train):
        keys = list(pre_train.keys())
        self.conv1.conv1_1.weight.data.copy_(pre_train[keys[0]])
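The behavior can be reproduced with a plain OrderedDict. In this sketch the dictionary contents are placeholders standing in for a real checkpoint.

```python
from collections import OrderedDict

# Placeholder entries standing in for pretrained weights.
pre_train = OrderedDict([("conv1_1.weight", "w0"), ("conv1_1.bias", "b0")])

keys = pre_train.keys()
try:
    keys[0]  # a keys view cannot be indexed in Python 3
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

first_key = next(iter(pre_train))  # first key without building a list
all_keys = list(pre_train)         # equivalent to list(pre_train.keys())
print(first_key, all_keys[1])
```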

References:

['odict_keys' object does not support indexing #1](https://github.com/taehoonlee/tensornets/issues/1)

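14. Why does PyTorch usually use PIL (that is, Pillow) rather than cv2 (that is, OpenCV)? There are several reasons:

- OpenCV loads images in BGR order, so a wrapper class might be needed to convert them to RGB internally.
- Supporting it would duplicate code in torchvision's functional transforms, since many of them are implemented with PIL operations.
- OpenCV loads images as np.array, and applying transformations to arrays is not as easy.
- The different image representations in PIL and OpenCV can lead to bugs that are hard for users to catch.
- PyTorch's model zoo also depends on the RGB format, and the maintainers want RGB to be easy to support.

References: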

Why is PIL used so often with Pytorch?

OpenCV transforms with tests #34

I wonder why Pytorch uses PIL not the cv2

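15. When loading model weights for testing, the following error may appear:

Missing keys & unexpected keys in state_dict when loading self trained model

A likely cause is that the model was trained with nn.DataParallel, so the keys of the saved weights differ from those of a model that is not wrapped. One fix is to wrap the model with nn.DataParallel when creating it as well: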

# Network
self.model = TRACER(args).to(self.device)
if args.multi_gpu:
	self.model = nn.DataParallel(self.model).to(self.device)

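Alternatively, strip the module. prefix from the keys directly:

import torch
check_point = torch.load('myfile.pth.tar')
print(check_point.keys())  # inspect the saved keys
state_dict = check_point  # assumes the file stores the state dict itself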


from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:] # remove 'module.' of dataparallel
    new_state_dict[name]=v

model.load_state_dict(new_state_dict)
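Slicing with k[7:] assumes every key starts with module.. A slightly more defensive variant, sketched below with a hypothetical helper strip_module_prefix and placeholder values instead of real tensors, only removes the prefix where it is present.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Keys saved without DataParallel are left unchanged.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Placeholder values stand in for weight tensors.
saved = OrderedDict([
    ("module.encoder.weight", 1),
    ("module.encoder.bias", 2),
    ("head.weight", 3),
])
print(list(strip_module_prefix(saved)))
# prints: ['encoder.weight', 'encoder.bias', 'head.weight']
```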

References:

Missing keys & unexpected keys in state_dict when loading self trained model

[solved] KeyError: ‘unexpected key “module.encoder.embedding.weight” in state_dict’

16. tensorboardX vs tensorboard:

References:

tensorboardX

VISUALIZING MODELS, DATA, AND TRAINING WITH TENSORBOARD

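17. When train.py sets os.environ["CUDA_VISIBLE_DEVICES"] = '1', code in other files it uses, such as utils.py, that calls fx = Variable(torch.from_numpy(fx)).cuda() or fx = torch.FloatTensor(fx).cuda() may still end up on GPU 0. The variable only takes effect if it is set before CUDA is initialized in the process, so in that case set it in utils.py as well:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'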

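18. With a recent scipy version, for example 1.7.3, saving an image with the following code:

from scipy import misc
misc.imsave(save_path + name, pred_edge_kk)

raises an error because imsave cannot be found in scipy.misc.

The reason is that scipy.misc.imsave has been removed in newer scipy versions. The fix is to change the code to:

import imageio
imageio.imwrite(save_path + name, pred_edge_kk)

References: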

My scipy.misc module appears to be missing imsave

19. Variable is deprecated

References:

Variable deprecated- how to change the code

20. Converting between tensors and numpy arrays:

numpy to tensor:

import cv2 
import torch
mask = cv2.imread('./mask.png', 0)
mask = torch.from_numpy(mask)

tensor to numpy:

import torch 
import cv2
# this is just my embedding matrix which is a Torch tensor object
embedding = learn.model.u_weight

embedding_list = list(range(0, 64382))

input = torch.cuda.LongTensor(embedding_list)
tensor_array = embedding(input)
# the output of the line below is a numpy array
tensor_array.cpu().detach().numpy()

References:

RuntimeError: Can only calculate the mean of floating types. Got Byte instead. for mean += images_data.mean(2).sum(0)

Pytorch tensor to numpy array

TORCH.FROM_NUMPY

21. L1/L2 regularization in PyTorch.

References:

How to add a L1 or L2 regularization to weights in pytorch

L1/L2 regularization in PyTorch

22. PyTorch raises the error "CUDA out of memory".

References:

How to avoid “CUDA out of memory” in PyTorch

Solving “CUDA out of memory” Error

RuntimeError: CUDA out of memory. Tried to allocate 12.50 MiB (GPU 0; 10.92 GiB total capacity; 8.57 MiB already allocated; 9.28 GiB free; 4.68 MiB cached)

FREQUENTLY ASKED QUESTIONS

How to allocate more GPU memory to be reserved by PyTorch to avoid “RuntimeError: CUDA out of memory”?

[How does "reserved in total by PyTorch" work?](https://discuss.pytorch.org/t/how-does-reserved-in-total-by-pytorch-work/70172)

pytorch如何使用多块gpu? (How to use multiple GPUs in PyTorch?)

pytorch多gpu并行训练 (Multi-GPU parallel training in PyTorch)

23. When defining a model, we usually use a skeleton like this:

from torch import nn

class Model(nn.Module):
	def __init__(self, config):
		super(Model, self).__init__()
		self.config = config
		self.layers = ...

	def forward(self, x):
		out = self.layers(x)
		return out

When training or testing the model, we use code like this:

net = Model(config)
net.train(True)
net.cuda()
out = net(image)

In the code above, `net(image)` is shorthand for `net.__call__(image)`. So where does the `forward` defined in `Model` get called? `__call__` is already defined in `nn.Module`; it runs all registered hooks and calls `forward`. That is why we call `net(image)` rather than `net.forward(image)`. Reference 4 below gives a quick introduction to Python hooks.

References:

1. [Is model.forward(x) the same as model.\__call\__(x)?](https://discuss.pytorch.org/t/is-model-forward-x-the-same-as-model-call-x/33460)
2. [PyTorch module__call__() vs forward()](https://stephencowchau.medium.com/pytorch-module-call-vs-forward-c4df3ff304b1)
3. [5 分钟掌握 Python 中的 Hook 钩子函数](https://cloud.tencent.com/developer/article/1761121)
4. [python hook 机制](https://zhuanlan.zhihu.com/p/275643739)

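The difference between calling the module and calling `forward` directly in item 23 can be made visible with a forward hook. This is a minimal sketch; the module `Doubler` and the hook `record_output` are made up for the demonstration.

```python
import torch
from torch import nn


class Doubler(nn.Module):
    def forward(self, x):
        return 2 * x


calls = []


def record_output(module, inputs, output):
    # Forward hook: invoked by nn.Module.__call__ after forward() returns.
    calls.append(output.clone())


net = Doubler()
net.register_forward_hook(record_output)

x = torch.tensor([1.0, 2.0])
out_call = net(x)             # goes through __call__, so the hook fires
out_forward = net.forward(x)  # bypasses __call__, so the hook does not fire

print(len(calls))  # prints: 1
```

Both calls return the same tensor; only the call through `net(x)` triggers the registered hook.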
24. The error `RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor` appears, as shown below:

![](https://raw.githubusercontent.com/Tom89757/ImageHost/main/hexo/20220914162425.png)

The cause is that the model and the data live on different devices (GPU and CPU). If the model is on the GPU (`model.to(device)`), add code like the following to move the data to the GPU as well:

inputs, labels = data # this is what you had
inputs, labels = inputs.cuda(), labels.cuda() # add this line
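Hard-coding `.cuda()` fails on machines without a GPU. A common device-agnostic pattern is sketched below; the tiny `nn.Linear` model and the random batch are placeholders for a real network and data loader.

```python
import torch
from torch import nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)   # placeholder model
inputs = torch.randn(8, 4)           # placeholder batch, created on the CPU
labels = torch.randint(0, 2, (8,))

# Move every tensor that the model touches to the model's device.
inputs, labels = inputs.to(device), labels.to(device)
outputs = model(inputs)
print(tuple(outputs.shape), outputs.device.type)
```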

References:

1. [RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same](https://stackoverflow.com/questions/59013109/runtimeerror-input-type-torch-floattensor-and-weight-type-torch-cuda-floatte)
2. [Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor](https://discuss.pytorch.org/t/input-type-torch-floattensor-and-weight-type-torch-cuda-floattensor-should-be-the-same-or-input-should-be-a-mkldnn-tensor-and-weight-is-a-dense-tensor/152430)

25. The error `AttributeError: module 'distutils' has no attribute 'version' : with setuptools 59.6.0` appears. Fix: install an older setuptools version with `pip install setuptools==59.5.0`.

References:

1. [AttributeError: module 'distutils' has no attribute 'version' : with setuptools 59.6.0 #69894](https://github.com/pytorch/pytorch/issues/69894)

26. Converting between a numpy array and a torch tensor:

numpy array to torch tensor

np_array = np.array(data)
x_np = torch.from_numpy(np_array)

torch tensor to numpy

na = a.to('cpu').numpy()
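One detail worth keeping in mind is that torch.from_numpy shares memory with the source array, while torch.tensor makes a copy. The sketch below uses made-up values to show the difference, plus the detach-then-cpu step needed when a tensor tracks gradients.

```python
import numpy as np
import torch

np_array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

shared = torch.from_numpy(np_array)  # shares memory with np_array
copied = torch.tensor(np_array)      # independent copy of the data

np_array[0, 0] = 100.0               # the change is visible through `shared` only
print(shared[0, 0].item(), copied[0, 0].item())  # prints: 100.0 1.0

# A tensor that requires grad must be detached before conversion;
# .cpu() is a no-op on CPU tensors and moves GPU tensors to host memory.
t = torch.ones(2, requires_grad=True) * 3
back = t.detach().cpu().numpy()
print(back)
```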

References:

Pytorch tensor to numpy array

27. Inspecting the attributes of a numpy array:

def numpy_attr(image):
    print("type: ", type(image))
    print("dtype: ", image.dtype)
    print("size: ", image.size)
    print("shape: ", image.shape)
    print("dims: ", image.ndim)

References:

1. [numpy库数组属性查看：类型、尺寸、形状、维度](https://blog.csdn.net/weixin_41770169/article/details/80565326)

28. The following error appears:

![](https://raw.githubusercontent.com/Tom89757/ImageHost/main/hexo/20220930172422.png)

The cause is that label values read from the dataset fall out of range: with n classes, every target t must satisfy 0 <= t < n. In my case the error occurred because the mask data had not been remapped:

mask[mask == 0.] = 255.
mask[mask == 2.] = 0.

References:

RuntimeError: cuda runtime error (59) : device-side assert triggered when running transfer_learning_tutorial #1204

29. The error `Boolean value of Tensor with more than one value is ambiguous in PyTorch` appears. Original code:

loss = CrossEntropyLoss(y_pred, y_true)

It should be changed to:

# Instantiate the loss
L = CrossEntropyLoss()
# Compute the loss
loss = L(y_pred, y_true)
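The original call fails because `CrossEntropyLoss(...)` is a constructor: the tensors are taken as configuration arguments rather than as predictions and targets. Below is a runnable sketch of the corrected pattern with made-up logits and labels; `F.cross_entropy` is shown as the functional equivalent.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
y_pred = torch.randn(4, 3)           # raw logits: 4 samples, 3 classes
y_true = torch.tensor([0, 2, 1, 2])  # class indices, each in [0, 3)

criterion = nn.CrossEntropyLoss()    # step 1: build the loss module
loss = criterion(y_pred, y_true)     # step 2: call it on predictions and targets

# The functional form computes the same value without creating a module.
loss_functional = F.cross_entropy(y_pred, y_true)
print(loss.item(), loss_functional.item())
```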

References:

1. [Bool value of Tensor with more than one value is ambiguous in Pytorch](https://stackoverflow.com/questions/52946920/bool-value-of-tensor-with-more-than-one-value-is-ambiguous-in-pytorch)

30. The cross-entropy loss is negative. Original code:

losse = torch.nn.BCELoss()
losse = losse(edge_map, edge)

Cause: `edge` or `edge_map` is not normalized to the range [0, 1]. Change the code to:

edge = edge / 255.0
# or
edge_map = F.softmax(out1, dim=1)
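Here is a small self-contained sketch of a correctly normalized setup. The logits and the 0/255 mask are made-up values, and a sigmoid is used as one common way to map raw outputs into (0, 1); `nn.BCEWithLogitsLoss` is shown as an alternative that takes raw logits directly.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, -1.0], [0.5, -3.0]])  # hypothetical raw network outputs
mask_uint8 = torch.tensor([[255, 0], [255, 0]], dtype=torch.uint8)  # mask stored as 0/255

edge_map = torch.sigmoid(logits)   # predictions squashed into (0, 1)
edge = mask_uint8.float() / 255.0  # targets rescaled into [0, 1]

loss = nn.BCELoss()(edge_map, edge)

# Alternative: apply the sigmoid inside the loss by passing raw logits.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, edge)
print(loss.item(), loss_from_logits.item())
```

With both inputs in [0, 1], every term of the binary cross-entropy is non-negative, so the loss cannot go below zero.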

References:

1. [解决pytorch 交叉熵损失输出为负数的问题](https://cloud.tencent.com/developer/article/1725343)