July 17, 2019

Compiling ARM stuff without an ARM board / Build PyTorch for the Raspberry Pi

I am in the process of building a self-driving RC car. It’s a fun process full of discovery (I hate it already). Once it is finished I hope to write a longer article here about what I learned so stay tuned!

While the electronics stuff was difficult for me (fingers still burnt from soldering) I hoped that the computer vision stuff would be easier. Right? Right? Well no.

Neural network inference on small devices #

To be clear, I didn’t expect to train my CNN on the Raspberry Pi that I have (it’s a Raspberry Pi 2, with an added USB WiFi dongle and a USB webcam), but I wanted to run inference with a model that I can train on my other computers.

I love using PyTorch and I use it for all my projects/work/research. Simply put it’s fantastic software.

Problem #1 - PyTorch doesn’t have official ARMv7 or ARMv8 builds. #

While you can get PyTorch if you have NVIDIA Jetson hardware, there are no builds for other generic boards. Insert sad emoji.

Problem #2 - ONNX, no real options #

I had the idea to export my trained model to ONNX (Open Neural Network eXchange format), but then what?

There are two projects:

Microsoft’s ONNX Runtime - doesn’t support RPi2
Snips Tract - Seems super-cool but Rust underneath (nothing against Rust, just not familiar)

So the only solution was: Build PyTorch from source.

“When your build takes two days you have time to think about life” - Anonymous programmer 2019. #

The PyTorch build process is fantastically simple. You get the code and run a single command. It’s robust and I used it many times before. So I jumped right in, it can’t take that long, yeah? NOP.

On my Raspberry Pi 2, with a decent SD card (Kingston UHS1 16GB), the build took 36 and a bit hours. Yes, you read that correctly. Not 3.6 hours. Thirty-six hours. During those 36 hours I had a lot of downtime, so I wondered how to do it quicker.

Option 1 - Cross compilation #

Cross compilation (or witchcraft in software development circles) is a process where you build software for one architecture on a machine of another architecture. Here I wanted to build for ARM on a standard x86_64 machine. From my (albeit small) experience, cross compilation is complicated and difficult. It was my first thought and I wanted to try it out, but then I discovered in the PyTorch GitHub issues that it is not supported for the project.

Option 2 - What about emulation? #

This seems reasonable. You emulate a generic ARM or ARMv8 board and build on it. QEMU/libvirt can emulate ARM just fine and there are clear instructions on how to achieve it. For example, the Fedora Wiki (I am using Fedora 30 both on the RPi and on my build machine) has a short guide on how to do it.

I tried this, and to be fair it worked fine. But it was slow. Almost unusably slow.

Option 3 - Witchcraft, sort of #

Remember cross compilation? I ran into an article which explains this weird setup for building ARM software. It is amazing. Basically, there is a qemu-user package that allows you to chroot into a rootfs of a different architecture with very little performance loss (!!!). Pair this with DNF’s ability to create a rootfs of any architecture, and you get something immensely powerful. Not just for building Python packages, but for building anything for ARM or ARMv8 (aarch64, as DNF calls it).

But then I read the last line. This was just a proposal.

So I went down the rabbit hole and followed the bug reports. All of them seemed closed. Could this feature work already? The answer was: YES!

Building PyTorch for the Raspberry Pi boards #

Once I discovered the qemu-user chroot thingy, everything clicked.

So here we go, this is how to do it.

We need the qemu and qemu-user packages. virt-manager is optional but nice to have.

sudo dnf install qemu-system-arm qemu-user-static virt-manager
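The chroot trick depends on the kernel’s binfmt_misc mechanism handing ARM binaries to the qemu-user emulator. Here is a minimal sketch for checking that such a handler is registered. The entry names (qemu-arm, qemu-arm-static) are assumptions and vary between distributions, and has_arm_binfmt is a hypothetical helper name.

```shell
# Return 0 if a binfmt_misc entry for little-endian 32-bit ARM exists.
# The entry names checked here are assumptions; list the directory
# yourself if neither of them matches on your system.
has_arm_binfmt() {
    dir="${1:-/proc/sys/fs/binfmt_misc}"
    for name in qemu-arm qemu-arm-static; do
        [ -e "$dir/$name" ] && return 0
    done
    return 1
}

if has_arm_binfmt; then
    echo "ARM binfmt handler registered"
else
    echo "no ARM binfmt handler found"
fi
```

If nothing is found, restarting the systemd-binfmt service after installing the packages may help.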

We now need the rootfs, which is a one-liner:

sudo dnf install --releasever=30 --installroot=/tmp/F30ARM --forcearch=armv7hl --repo=fedora --repo=updates systemd passwd dnf fedora-release vim-minimal openblas-devel blas-devel m4 cmake python3-Cython python3-devel python3-yaml python3-pillow python3-setuptools python3-numpy python3-cffi python3-wheel gcc-c++ tar gcc git make tmux -y

This will install an ARM rootfs to your /tmp directory along with everything you need to build PyTorch. Yes, it is that easy.
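Before chrooting, you can confirm that the rootfs really contains ARM binaries by reading the machine field of an ELF header. This is a rough sketch: elf_machine is a hypothetical helper, and the path to bash inside the rootfs is an assumption based on the install command above.

```shell
# Print the low byte of the ELF e_machine field (file offset 18) in decimal.
# 40 is EM_ARM, 62 is EM_X86_64, 183 is EM_AARCH64.
elf_machine() {
    od -An -tu1 -j18 -N1 "$1" | tr -d ' \n'
}

# Assumed location of bash inside the freshly created rootfs.
rootfs_bash=/tmp/F30ARM/usr/bin/bash
if [ -f "$rootfs_bash" ]; then
    echo "e_machine: $(elf_machine "$rootfs_bash")"
fi
```

For an armv7hl rootfs the printed value should be 40.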

Let’s chroot

sudo chroot /tmp/F30ARM

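Welcome to your “ARM board”. Verify the architecture it reports (the kernel version string is still the host’s x86_64 one, because qemu-user only emulates the userspace binaries):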

bash-5.0# uname -a
Linux toshiba-x70-a 5.1.12-300.fc30.x86_64 #1 SMP Wed Jun 19 15:19:49 UTC 2019 armv7l armv7l armv7l GNU/Linux

So cool, isn’t it? Some things are broken, but they are easy to fix: mainly, the network doesn’t work and DNF wrongly detects your arch.

# Fix for 1691430
sed -i "s/'armv7hnl', 'armv8hl'/'armv7hnl', 'armv7hcnl', 'armv8hl'/" /usr/lib/python3.7/site-packages/dnf/rpm/__init__.py
alias dnf='dnf --releasever=30 --forcearch=armv7hl --repo=fedora --repo=updates'

# Fixes for default python and network
alias python=python3
echo 'nameserver 8.8.8.8' > /etc/resolv.conf
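One caveat if you put these steps into a script: bash does not expand aliases in non-interactive shells unless expand_aliases is set. A shell function does the same job as the dnf alias. This is only a sketch, and dnf_arm is a hypothetical name.

```shell
# Function equivalent of the dnf alias above; it also works in scripts.
dnf_arm() {
    dnf --releasever=30 --forcearch=armv7hl --repo=fedora --repo=updates "$@"
}

# Example (inside the chroot):
#   dnf_arm install python3-pip
```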

Your configuration is now complete and you have a working emulated ARM board.

Get PyTorch source:

git clone https://github.com/pytorch/pytorch --recursive && cd pytorch
git checkout v1.1.0 # optional, you can build master if you are brave
git submodule update --init --recursive

Since we are building for a Raspberry Pi, we want to disable CUDA, distributed support, MKL-DNN, etc.

export NO_CUDA=1
export NO_DISTRIBUTED=1
export NO_MKLDNN=1 
export BUILD_TEST=0 # for faster builds
export MAX_JOBS=8 # I have 8 cores
# export NO_NNPACK=1 # update July 19, this is optional, can build with NNPACK
# export NO_QNNPACK=1 # same as above, can be omitted
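Instead of hard-coding the job count, you could derive it from the machine you are building on. A small sketch, assuming nproc from coreutils is available in the chroot:

```shell
# Use the number of available CPUs, falling back to 1 if nproc is missing.
MAX_JOBS="$(nproc 2>/dev/null || echo 1)"
export MAX_JOBS
echo "building with MAX_JOBS=$MAX_JOBS"
```

If the build runs out of memory, lowering the value by hand is a reasonable thing to try.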

All ready, build!

python setup.py bdist_wheel
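The finished wheel lands in the dist/ directory. Here is a sketch of locating and installing it, assuming pip is available: python3-pip is not in the package list used for the rootfs above, so you would have to add it first. find_wheel is a hypothetical helper.

```shell
# Print the first torch wheel found in the given directory (default: dist).
find_wheel() {
    for f in "${1:-dist}"/torch-*.whl; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Install only if a wheel was actually produced.
if wheel="$(find_wheel dist)"; then
    python3 -m pip install "$wheel"
else
    echo "no torch wheel in dist/ yet"
fi
```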

To build torchvision, install the built wheel and then bind-mount /dev into the rootfs:

sudo mount -o bind /dev /tmp/F30ARM/dev # run this from outside the chroot; the build needs /dev/urandom

And build:

cd ..
git clone https://github.com/pytorch/vision && cd vision
git checkout v0.3.0
git submodule update --init --recursive
python setup.py bdist_wheel
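The bind mount of /dev stays in place after the build. Before you delete the rootfs, make sure nothing is still mounted under it, otherwise a recursive delete would reach into the host’s /dev. A minimal sketch of such a check follows; rootfs_has_mounts is a hypothetical helper that reads /proc/mounts by default.

```shell
# Return 0 if any mount point in the mounts file is the rootfs or lives under it.
rootfs_has_mounts() {
    awk -v root="$1" '
        $2 == root || index($2, root "/") == 1 { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# Example (run outside the chroot):
#   if rootfs_has_mounts /tmp/F30ARM; then sudo umount /tmp/F30ARM/dev; fi
```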

Performance #

The RPi2 took 36+ hours. This? Under two. My laptop isn’t that new (i7 4700MQ) and I guess you can do it even faster with a faster CPU.

Conclusion #

Building for ARM shouldn’t be done on a board. There are probably some exceptions to the rule, but you really should consider the approach explained here. It’s faster, reproducible and easy. Fedora works remarkably well for this (as for all other things, hehe), both on the device and on the build system.

Let me know how it goes for you.

Oh, and if you just stumbled on this page via Google wanting a wheel/.whl of PyTorch for your RPi:

https://github.com/nmilosev/pytorch-arm-builds/

Image credit: https://xkcd.com/303/
