Troubleshooting and Maintenance¶
While the deployment procedures should be enough for weekly deployments, being on-call and performing regular maintenance requires more. The following are somewhat common procedures that you might need to perform:
Worker Managers¶
Azure Batch Worker Managers¶
We use Azure Batch worker managers to automatically spin up VMs and start CodaLab workers on them. For more information, see the following documentation.
Restarting the worker managers for worksheets.codalab.org¶
If for some reason the Docker container running the worker manager fails, go through the following steps to restart the container:
1. SSH into the server:

   ```
   azure/ssh.sh vm-clws-<env>-server-0
   ```

2. List all the containers, including the ones that exited:

   ```
   docker ps -a
   ```

3. Restart the failed container:

   ```
   docker restart <name of the failed worker manager container>
   ```
Add/remove workers with static VMs¶
If there is an issue with Azure Batch, we may want to start some workers with static VMs, while investigating the issue.
We use the `venv/bin/python manage.py` script for this.

To add and delete workers:

```
venv/bin/python manage.py -m <env> -t gpuworker -a create,install,start -i 0-5  # Adds 6 workers with GPUs, starting from index 0
venv/bin/python manage.py -m <env> -t gpuworker -a stop,delete -i 0-5           # Deletes the workers
```

Or simply run `./up-gpuworker` and `./down-gpuworker`.

Sometimes the commands fail. These operations are idempotent, so you can just run them again.
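Because these operations are idempotent, it is safe to rerun them on failure. As a sketch, a small retry wrapper like the following could automate that (the `retry` helper is our own illustration, not part of the repo):

```shell
# Hypothetical retry helper: rerun an idempotent command up to N times.
retry() {
  local max=$1; shift
  local i
  for i in $(seq 1 "$max"); do
    "$@" && return 0                              # success: stop retrying
    echo "attempt $i/$max failed; retrying..." >&2
  done
  return 1                                        # all attempts failed
}

# Example (assumed usage): retry the worker-creation command a few times.
# retry 3 venv/bin/python manage.py -m <env> -t gpuworker -a create,install,start -i 0-5
```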
Kubernetes Worker Managers¶
To configure a GKE cluster, follow the instructions here.
For codalab.stanford.edu¶
A Kubernetes worker manager is already running as a service for codalab.stanford.edu. If a new cluster needs to be configured and created for this instance, update the values of the environment variables accordingly in the start-stanford.sh script and then redeploy by following the instructions in this deployment documentation.
Adding disks¶
While Azure supports 32 TB disks, backups only work for disks up to 4 TB, so that's the size we use.
Do this when the disk usage of all the disks is greater than 80%.
1. Log into portal.azure.com and go to `vm-clws-prod-server-0`, then click on the "Disks" option in the side menu.

2. Click "Create and attach a new disk" to add a new disk called `disk-clws-prod-server-<id>`, where `<id>` is just incremented from the current number.

3. Choose Premium SSD and 4 TB for the size. Increment the value of "LUN" from the last number.

4. We can't make Linux partitions of more than 2 TB using `fdisk`, so we'll just format the entire disk:

   ```
   azure/ssh.sh vm-clws-prod-server-0
   sudo mkfs.ext4 /dev/sd<letter>  # get <letter> from the end of `dmesg`; e.g. "[sdm] Attached SCSI disk" means <letter> is m
   mkdir /data/codalab<id>
   sudo mount /dev/sd<letter> /data/codalab<id>
   sudo chown -R azureuser:azureuser /data/codalab<id>
   mkdir /data/codalab<id>/bundles
   ```

5. Finally, add an entry to the "If everything reboots" section of this page with the value `sudo mount /dev/sd<letter> /data/codalab<id>`, so that it is easy to remount this disk if needed.

6. Restart CodaLab on the prod server so that the new bundle partition will be recognized. From the deployment directory, run:

   ```
   venv/bin/ansible-playbook -i ansible/prod.inventory ansible/server.yml -e mode=prod -t stop,start
   ```

7. To test this out, upload a file, then run `ls /data/codalab<id>/bundles` to ensure that the file uses the new disk (it should by default, because CodaLab always prefers the disk with the most free space).
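To apply the 80% rule that triggers this procedure, a helper like the following (our own sketch, not part of the repo) can flag bundle disks that need attention from `df` output:

```shell
# Hypothetical helper: print /data/codalab* mounts whose usage exceeds a threshold.
# Feed it the output of `df` (fields: device, size, used, avail, use%, mountpoint).
over_threshold() {
  local df_output=$1 threshold=$2
  echo "$df_output" | awk -v t="$threshold" \
    '$5 + 0 > t && $6 ~ /^\/data\/codalab/ {print $6, $5}'
}

# Assumed usage on the server: over_threshold "$(df)" 80
```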
Freeing up disk space on dev¶
To free up disk space on dev, you need to delete bundles that are taking up too much space.
To see which bundles are taking up the most space, you can run `cl search size=.sort- .limit=100`.

On the other hand, if you want to check the raw disk to see which bundles are taking up space (in case some of these bundles were not properly cleaned up), run:

```
azure/ssh.sh vm-clws-dev-server-0
du -h /data | sort -h
```

Then you can delete the bundles (using `cl rm`) that are taking up the most space.
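To make scanning the `du` output easier, a tiny helper (our own sketch) can list just the biggest entries; note that `sort -h` (human-numeric sort) is a GNU coreutils feature:

```shell
# Hypothetical helper: show the N largest entries from human-readable `du` output.
largest() {
  local du_output=$1 n=$2
  echo "$du_output" | sort -rh | head -n "$n"  # sort -h understands K/M/G/T suffixes
}

# Assumed usage on the server: largest "$(du -h /data)" 10
```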
Bulk delete recently created bundles on dev¶
Sometimes, we might have a lot of staged bundles on dev that were created due to testing. If you need to bulk delete recently created bundles on dev (don't do this on prod!), you can run the following command:
```
for i in {1..10000}; do cl rm --force $(cl search .after=2021-08-02 .limit=1000 state=staged -u); done
```
If everything reboots¶
This should be very rare, but if something bad happens on Azure (like all our VMs get suspended), then here's how you bring everything back. You may also need to run these steps if you manually stopped / started one of the VMs on Azure.
Restart MySQL:

```
azure/ssh.sh vm-clws-prod-mysql-0  # log into database server
cd mysql-server
docker-compose up -d
```
Set things up on the main server:

```
azure/ssh.sh vm-clws-prod-server-0  # log into main server
# Mount disks manually
sudo mount /dev/sdc /data/codalab0
sudo mount /dev/sdd /data/codalab1
sudo mount /dev/sde /data/codalab2
sudo mount /dev/sdf /data/codalab3
sudo mount /dev/sdg1 /data/codalab4
sudo mount /dev/sdh1 /data/codalab5
sudo mount /dev/sdi1 /data/codalab6
sudo mount /dev/sdj1 /data/codalab7
sudo mount /dev/sdk /data/codalab8
sudo mount /dev/sdl /data/codalab9
sudo mount /dev/sdm /data/codalab10
# Stop the default nginx, or else it interferes with the version from Docker that we try to start
sudo service nginx stop
```
And then just run the usual `manage.py` script to re-deploy the server.
Then bring all the workers up too.
Sometimes the device names (such as `/dev/sdc`) may change. If the above commands don't work, run `lsblk` to find the right devices.
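When the device names have shifted, a helper like this (our own sketch over `lsblk` output) can list whole disks that currently have no mount point and are therefore candidates for the missing mounts:

```shell
# Hypothetical helper: from `lsblk -rno NAME,TYPE,MOUNTPOINT` output,
# print disks that are not mounted anywhere.
unmounted_disks() {
  local lsblk_output=$1
  echo "$lsblk_output" | awk '$2 == "disk" && $3 == "" {print $1}'
}

# Assumed usage on the server: unmounted_disks "$(lsblk -rno NAME,TYPE,MOUNTPOINT)"
```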
If things hang on codalab.stanford.edu¶
Try one or more of the following:
- Restart the Docker daemon: `sudo systemctl daemon-reload && sudo systemctl restart docker`
- Delete CodaLab networks: `python3 codalab_service.py stop && docker network prune`
- Make sure that the Docker daemon has the right IP ranges configured (you can ask support about this): `sudo cat /etc/docker/daemon.json`
And then restart CodaLab.
SSH into machines¶
This repo comes with an SSH script that uses the certs bundled with it to SSH into our Azure VMs. Use that script; you can't SSH into the VMs without this repo's certs:

```
azure/ssh.sh vm-clws-<env>-<machine-type>-<index>
```
Some examples:

```
azure/ssh.sh vm-clws-dev-server-0      # dev environment, server (only 1)
azure/ssh.sh vm-clws-prod-gpuworker-1  # prod environment, worker with GPUs (2nd one)
azure/ssh.sh vm-clws-prod-mysql-0      # prod environment, MySQL server (only 1)
```
It is handy to have a wrapper script in your PATH that contains:

```
cd <path to this repo> && azure/ssh.sh vm-clws-$1-$2-${3:-0}
```
Check docker logs¶
Sometimes you need to check the Docker logs of CodaLab components. To do so, SSH into the machine that runs the Docker container you want to check, then:

```
docker ps -a                                               # make sure the container is running
docker logs <container> --tail 100                         # print the last 100 lines of logs, probably more useful
docker logs <container> 2>&1 | grep <pattern>              # grep the logs, useful to find a bundle UUID or a particular error
docker logs <container> --since 60m 2>&1 | grep <pattern>  # same, but only match lines from the past hour
```
Manually clear worker caches and restart¶
TODO: our workers are sometimes flaky right now. Periodically, run this script to see how things are going:

```
./check-workers | tee workers.log
```
Sometimes you need to manually debug workers. If all workers are down a good idea is to clear their caches and restart them. To do so, for each worker:
```
azure/ssh.sh vm-clws-<env>-worker-0  # log in to the worker
cd worker                            # stop the worker Docker container
docker-compose down
sudo bash                            # become root to clear all worker caches
cd /mnt/scratch/bundles
rm -rf *
docker rmi $(docker images -q)
exit                                 # exit root; this brings you back to ~/worker as azureuser
docker-compose up -d                 # start the worker again
exit                                 # log out of the worker
```
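To run these steps across every worker, it helps to generate the VM names in a loop. This sketch follows the `vm-clws-<env>-worker-<index>` naming used above; the worker count is a hypothetical parameter:

```shell
# Hypothetical helper: print worker VM names for an environment.
worker_names() {
  local env=$1 count=$2
  local i
  for i in $(seq 0 $(( count - 1 ))); do
    echo "vm-clws-$env-worker-$i"
  done
}

# Assumed usage (cache-clearing commands elided; run them on each worker):
# for vm in $(worker_names prod 4); do azure/ssh.sh "$vm" ...; done
```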
Edit Ansible tasks¶
In the background, Ansible applies the tasks, with the variables and templates found in the following files, to the machines described in the inventory file.
You don't need to fully understand what's going on here, but familiarity with what Ansible does is good, so it's recommended to take a look. It's a good idea to start with the tasks files to see each individual step Ansible takes when setting up a machine:
```
# Main entry points (just source credentials, prod|dev, common, server|worker)
./ansible/server.yml
./ansible/worker.yml
# Credentials for all instances
./ansible/roles/credentials/vars/main.yml
# Specific to environments
./ansible/roles/common/vars/main.yml
./ansible/roles/prod/vars/main.yml
./ansible/roles/dev/vars/main.yml
# Installation of general environment
./ansible/roles/common/tasks/main.yml
# Server
./ansible/roles/server/vars/main.yml
./ansible/roles/server/tasks/main.yml
# Workers
./ansible/roles/worker/vars/main.yml
./ansible/roles/worker/tasks/main.yml
./ansible/roles/worker/templates/docker-compose.yml
```
Manually pushing default cpu/gpu images¶
Whenever `Dockerfile.default-cpu` or `Dockerfile.default-gpu` changes on a release, we usually release the new Docker images automatically through GitHub Actions.
Here are the steps to release these images manually, if needed:

Make sure the environment variables `$CODALAB_DOCKER_USERNAME` and `$CODALAB_DOCKER_PASSWORD` are set based on the credentials. Then run:

```
python3.6 codalab_service.py build default-cpu default-gpu -v latest --push
```

If you are doing a release of a version (e.g., `v0.3.3`), then also push that version:

```
python3.6 codalab_service.py build default-cpu default-gpu -v v0.3.3 --push
```
MySQL Issues¶
The Azure VM with MySQL server goes down¶
Check the server logs to see if the MySQL server went down:
```
sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (2003, "Can't connect to MySQL server on 'vm-clws-prod-mysql-0' (111)")
(Background on this error at: http://sqlalche.me/e/13/e3q8)
```
Resolve the issue by following these steps:
1. Log on to the Azure Portal.
2. Search for the server name in the search bar. If there is an issue with the MySQL server for our `prod` instance, you would search for `vm-clws-prod-mysql-0`.
3. Check the Service Health Events of the VM and verify that there was some disruption that caused the VM to go down. For more information, see the following documentation.
4. Restart the VM by clicking Restart, and wait for the restart to complete.
5. Follow the steps here to restart MySQL for the CodaLab instance.
6. Verify that the CodaLab instance is back up and can serve requests.
MySQL server connection issues during dev deployment¶
When deploying to dev, you may sometimes run into an exception that it can't connect to the database. For example,
```
return Connection(*args, **kwargs)", "process: File \"/usr/local/lib/python3.6/dist-packages/MySQLdb/connections.py\", line 164, in __init__", "process: super(Connection, self).__init__(*args, **kwargs2)", "process: sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (2003, \"Can't connect to MySQL server on 'vm-clws-dev-mysql-0' (111)\")", "process: (Background on this error at: http://sqlalche.me/e/13/e3q8)"]}
```
To fix this, go into the Azure console and restart vm-clws-dev-mysql-0 and vm-clws-dev-server-0. Also, make sure that the disks are properly attached for vm-clws-dev-mysql-0.
Manual intervention needed to resolve MySQL database issues¶
Sometimes you need to manually debug the database. To do so:
```
azure/ssh.sh vm-clws-prod-mysql-0  # log into database server
docker ps                          # make sure MySQL image is running
# enter MySQL interface (see `ansible/roles/credentials/vars/main.yml` for MySQL login)
docker exec -it mysqlserver_mysql_1 mysql -u bundles_user -p -D codalab_bundles
```
You can find the MySQL passwords in this file.
Manually sending password reset link¶
Rarely, we may have a situation in which someone requests a password reset for an email address `XXX@XXX.com` but doesn't receive the link due to some unknown issue, and they contact us for help. In this case, we need to do a manual reset. Follow these steps:
1. First, prove that the requester owns the given email. Send a test email to `XXX@XXX.com` and ask the requester to send you the contents of the email, to ensure that they have access to the inbox.

2. Go to the CodaLab instance website (e.g. https://worksheets.codalab.org/account/reset) and request a password reset for `XXX@XXX.com`.

3. Manually SSH into the MySQL DB (see the section above).

4. Run the following SQL query to get the latest created user reset code:

   ```sql
   select r.code from user_reset_code r
   inner join user u on r.user_id = u.user_id
   where u.email = 'XXX@XXX.com'
   order by date_created desc limit 1;
   ```

5. Send the user a URL in the following format to allow them to reset their password: https://worksheets.codalab.org/rest/account/reset/verify/[userresetcode]
Restoring MySQL database for codalab.stanford.edu¶
The following are the instructions to restore the MySQL database from a backup for codalab.stanford.edu. These instructions will also fix the "Schrödinger's table" issue, where a table exists and doesn't exist simultaneously.
Please be extra careful when following these instructions.
1. Run `cd /nlp/u/codalab/codalab-worksheets`.
2. Ensure all the services are stopped by running `python3 codalab_service.py stop`.
3. Move the problematic MySQL files by renaming the folder: `mv /home/codalab/mysql/ /home/codalab/mysql-corrupted/`.
4. Double check that the `codalab_bundles` database is now empty, using the next five steps:
5. Start just the MySQL service by running `./codalab_service.py start -p --service mysql`.
6. Run `docker exec -it codalab_mysql_1 /bin/bash`.
7. Run `mysql -u root --password=codalab`.
8. Run `use codalab_bundles;`.
9. Run `SHOW TABLES;`, which should show that there are no longer any tables in the database.
10. Find the right `mysqldump.gz` file to restore. The dump files are stored at `/juice/u/codalab/var/codalab/monitor/`.
11. Restore from the backup by running `zcat <path to dump file> | docker exec -i codalab_mysql_1 mysql -u root --password=codalab codalab_bundles`.
12. Repeat the check in steps 5-9 to confirm that the `codalab_bundles` database is correctly populated.
13. Restart codalab.stanford.edu by running `python3 codalab_service.py stop && ./start-stanford.sh`.
14. Once you verify that everything is working, delete the corrupted folder at `/home/codalab/mysql-corrupted/`. DO NOT DELETE `/home/codalab/mysql`.
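For the step of finding the right `mysqldump.gz` file, a small helper (our own sketch; the exact filename pattern is an assumption) can pick the newest dump from a directory listing, assuming the names sort chronologically:

```shell
# Hypothetical helper: pick the newest .gz dump from a listing of
# /juice/u/codalab/var/codalab/monitor/, assuming filenames sort by date.
latest_dump() {
  local listing=$1
  echo "$listing" | grep '\.gz$' | sort | tail -n 1
}

# Assumed usage: latest_dump "$(ls /juice/u/codalab/var/codalab/monitor/)"
```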
SendGrid¶
When we're not getting Slack notifications¶
If we're not getting Slack notifications through SendGrid, log on to SendGrid using these credentials, follow the steps here, and make sure the DNS records are installed for the codalab.org domain in our GoDaddy account.
Installing docker-compose on codalab.stanford.edu¶
If docker-compose is not available on codalab.stanford.edu, run (as root, since this writes to `/usr/bin`; the binary also needs to be made executable):

```
sudo wget https://github.com/docker/compose/releases/download/1.29.2/docker-compose-Linux-x86_64 -O /usr/bin/docker-compose
sudo chmod +x /usr/bin/docker-compose
```