Deployment process

The goal is to put CodaLab onto worksheets.codalab.org. BE CAREFUL, AS THIS AFFECTS the REAL SYSTEM! Take each step cautiously and double check logs to see that there are no error messages before proceeding.

Do this when there isn't much activity (check the CodaLab status page), and make sure there are no major deadlines coming up. When in doubt, discuss with the team, and ping users in the #codalab Slack channel.

1. Cut a new release

Create a new branch off of master with the name rc[version], such as the following:

git checkout -b rc0.5.19

When creating a new version, be sure to follow Semantic Versioning. The version should be formatted as MAJOR.MINOR.PATCH, where:
- MAJOR versions involve incompatible API changes. For example, if a REST endpoint changes so that old versions of the CodaLab CLI can no longer connect to CodaLab, or if we drop support for the upload of certain types of files, this should be a major version bump. Major version changes should generally be rare. Changes to the frontend usually don't involve major version changes.
- MINOR versions involve new features. Any new feature on the frontend / backend should require at least a minor version bump.
- PATCH versions only include backwards compatible bug fixes.
- If making a MAJOR version bump with breaking changes, ensure that the PR / release description includes a section called "Breaking changes" that describes what has changed and how users can adapt existing code / configuration for the new version. See this PR for v1.0.0 for an example.
Bump the version (the CODALAB_VERSION variable) in the code (example: #2825). Do not create a version tag yet.
Push the branch (git push origin rc0.5.19).
Create an rc0.5.19 -> master PR, which triggers the build of the staging Docker images. Do not merge this PR; at this point it is used only to trigger CI. In the PR, include draft release notes with a list of the major changes. You can base the release notes off of this PR: https://github.com/codalab/codalab-worksheets/pull/4041
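
The versioning rules above can be sketched as a small shell helper. This is only an illustration and not part of the CodaLab repository: `bump_version` and `rc_branch` are hypothetical names, and the actual bump still means editing the CODALAB_VERSION variable by hand.

```shell
# Hypothetical helper (not in the CodaLab repo): compute the next
# MAJOR.MINOR.PATCH version following Semantic Versioning.
bump_version() {
  version=$1
  part=$2
  major=${version%%.*}      # text before the first dot
  rest=${version#*.}        # text after the first dot
  minor=${rest%%.*}
  patch=${rest#*.}
  case "$part" in
    major) major=$(($major + 1)); minor=0; patch=0 ;;
    minor) minor=$(($minor + 1)); patch=0 ;;
    patch) patch=$(($patch + 1)) ;;
    *) echo "usage: bump_version VERSION major|minor|patch" >&2; return 1 ;;
  esac
  echo "$major.$minor.$patch"
}

# Release-candidate branch name for a given version, e.g. rc0.5.19.
rc_branch() {
  echo "rc$1"
}

rc_branch "$(bump_version 0.5.18 patch)"   # prints rc0.5.19
```

A minor or major bump resets the lower components to zero, so `bump_version 0.5.19 minor` gives `0.6.0`.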

2. Deploy release to dev

Wait for CI to complete, and then update codalab_version in ansible/roles/dev/vars/main.yml to point to the branch name (rc0.5.19).
To test backwards compatibility, create a run bundle that will run for some time (e.g. sleep 1800) and wait for a worker to start the job. You can do this by logging in to https://worksheets-dev.codalab.org, running cl run "sleep 1800" from the web CLI, and then waiting for the bundle to be in the "running" state.
One-time setup step: Make sure you have read and write (but not executable) perms on keys/azureuser. Do this with chmod 600 keys/azureuser. If you don't do this, you will get a WARNING: UNPROTECTED PRIVATE KEY FILE! error.
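
Before running the playbook, the key permissions can be checked with a short script. This is a sketch under the assumption that the key lives at keys/azureuser relative to the current directory; `key_perms_ok` is a hypothetical name.

```shell
# Succeeds only if the file exists and its mode is exactly 600
# (owner read/write, nothing for group or others).
key_perms_ok() {
  [ -n "$(find "$1" -prune -perm 600 2>/dev/null)" ]
}

key=keys/azureuser   # path from the setup step above
if key_perms_ok "$key"; then
  echo "$key: permissions OK"
else
  echo "$key: missing or not mode 600; run: chmod 600 $key"
fi
```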

Update the server (if something goes wrong, use the -v flag to debug):

venv/bin/ansible-playbook -i ansible/dev.inventory ansible/server.yml -e mode=dev -t stop,update,start

From the web CLI on worksheets-dev.codalab.org, run cl workers to check that the existing worker is still running, and make sure the sleep 1800 bundle you started before the update completes.
Double check that the site is running functionally. Post an announcement in the #deployment Slack channel, saying that dev has been deployed, a summary of the major changes, and asking everyone to test out the system. Make sure everyone has time to bang on the system (give around 24 hours for this).
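
If you would rather poll than watch the bundle by hand, a generic loop like the one below may help. It is a sketch: `wait_for_state` is a hypothetical name, and the `cl info -f state` call and the `ready` final state shown in the comment are assumptions about the CLI that you should confirm against your installed version.

```shell
# Poll a command until it prints the wanted state, or give up.
# Arguments: WANTED_STATE MAX_TRIES SLEEP_SECONDS COMMAND [ARGS...]
wait_for_state() {
  wanted=$1
  tries=$2
  delay=$3
  shift 3
  state=unknown
  while [ "$tries" -gt 0 ]; do
    # Run the supplied command; treat a failure as an unknown state.
    state=$("$@" 2>/dev/null) || state=unknown
    if [ "$state" = "$wanted" ]; then
      return 0
    fi
    tries=$(($tries - 1))
    sleep "$delay"
  done
  echo "gave up; last state: $state" >&2
  return 1
}

# Assumed real usage (the bundle UUID is a placeholder):
#   wait_for_state ready 60 30 cl info -f state 0x1234
# Stand-in demo that needs no CodaLab CLI:
wait_for_state ready 3 0 echo ready && echo "bundle finished"
```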

Note

It's fine to continue work on the master branch after a release has been cut.

If something goes wrong, you should check the logs and try to understand what's going on, or send a message on Slack. Don't proceed to deploying on prod unless everything is working perfectly.
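
As a first pass over the logs, a simple filter can count suspicious lines. This is a sketch: `count_error_lines` is a hypothetical name and the pattern is only a guess at what deserves a closer look, so it does not replace reading the logs.

```shell
# Count lines on stdin that look like errors (case-insensitive).
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_error_lines() {
  grep -ciE 'error|exception|traceback' || true
}

# Assumed usage on the dev server (container name taken from this page):
#   docker logs --since 10m codalab_rest-server_1 2>&1 | count_error_lines
printf 'started\nERROR: db timeout\n' | count_error_lines   # prints 1
```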

3. Run stress tests

NOTE: Stress tests on GitHub Actions are currently broken. To run stress tests, go to the "Manually stress testing" section to see how the tests are triggered.

It is required to stress test dev with the new changes applied before deploying to prod. We can run stress tests directly by triggering a GitHub Actions workflow. The following are the steps needed to run the stress tests on dev:

Go to https://github.com/codalab/codalab-worksheets/actions and select "Stress Test" on the workflows tab on the left.
Click on "Run workflow" -> "Run" (keep "master" selected as the branch).
Let everyone know on Slack once the tests have succeeded!
If the tests have failed, you could either try re-running the stress tests or manually running stress tests.

Manually stress testing

If you need to manually trigger stress testing, run the following commands:

SSH into dev:
```
azure/ssh.sh vm-clws-dev-server-0
```
Run the stress tests in the background inside a tmux session, so they keep running if your SSH connection drops. Start a session by running tmux, then run the command from the next step in that shell. (Note: you can detach from a tmux session by pressing Control + B and then D, and reattach to an existing session by running tmux attach.)

Run the stress tests using tests/stress/stress_test.py:

time docker exec -it codalab_rest-server_1 /bin/bash -c "python3 tests/stress/stress_test.py --heavy"

You may get prompted to login to dev. Enter the credentials specified here.

If any of the stress tests fail, you can clean up the bundles and worksheets they created by running:

docker exec -it codalab_rest-server_1 /bin/bash -c "python3 tests/stress/stress_test.py --instance https://worksheets-dev.codalab.org --cleanup-only"

Check on the status of the stress tests by going to the "Worksheets and Bundles with codalab-stress-test tag" worksheet on dev.
You can check to ensure the stress test is done by SSH'ing into dev and then running tmux attach to view the output from the stress test run.
Update other team members on the stress test results on Slack.
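
The tmux and stress-test steps above can also be combined into one detached invocation. The sketch below only prints the command so you can review it before running it on dev. The session name `stress`, the log file `stress.log`, and dropping `-it` from `docker exec` are all assumptions made for this example.

```shell
# Stress-test command from the step above, without -it so that it
# does not require an interactive terminal (an assumption).
stress_cmd() {
  echo "docker exec codalab_rest-server_1 /bin/bash -c 'python3 tests/stress/stress_test.py --heavy'"
}

# Print (do not run) a tmux invocation that starts the tests in a
# detached session and copies their output to a log file.
tmux_stress_cmd() {
  printf 'tmux new-session -d -s stress "%s 2>&1 | tee stress.log"\n' "$(stress_cmd)"
}

tmux_stress_cmd
```

After running the printed command, `tmux attach -t stress` reattaches to the session while the tests are still running.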

3.1 If manual testing fails

If staging stress tests fail or other issues come up when testing, just make PRs to the rc0.5.19 branch (so it will be included in the next release), and proceed in the deployment process only after the branch is working properly.
Note that if commits are made to the rc0.5.19 branch using this process, you must merge the branch back into master (see the "Merge changes back into master" section) after deployment is successful.

4. Deploy a new release to production

Perform the below steps only after the stress tests and other testing mentioned above have passed.

Create a new release. Choose the "rc0.5.19" branch for the "Target" and choose "v0.5.19" as the Tag version (don't forget the "v"). Title the release Version <version> (<date>) (just mimic the titles of previous releases). The notes should list the main changes since the last release.
Wait for the CI pipeline to complete; this involves building the right Docker images and uploading the latest CodaLab package to PyPI.
Update codalab_version in ansible/roles/prod/vars/main.yml to point to the new release tag (e.g., v0.5.19).
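
Since the tag is easy to mistype, it can be derived from the release branch name. This is a minimal sketch; `release_tag` is a hypothetical helper, not something in the repository.

```shell
# Map an rc branch name to its release tag: rc0.5.19 -> v0.5.19.
release_tag() {
  case "$1" in
    rc[0-9]*.[0-9]*.[0-9]*) echo "v${1#rc}" ;;
    *) echo "not an rc branch: $1" >&2; return 1 ;;
  esac
}

release_tag rc0.5.19   # prints v0.5.19
```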

If this change involves non-trivial database migrations, first stop the server and then back up the database by restarting the monitor script:

venv/bin/ansible-playbook -i ansible/prod.inventory ansible/server.yml -e mode=prod -t stop
azure/ssh.sh vm-clws-prod-server-0
cd codalab-worksheets
./codalab-service.py start -s monitor
docker logs codalab_monitor_1 -f  # Wait until the backup phase is done
./codalab-service.py delete -s monitor

Deploy to prod:

venv/bin/ansible-playbook -i ansible/prod.inventory ansible/server.yml -e mode=prod -t stop,update,start

Test the site manually (this should be the same as local testing).
Double check that the site is working. Once it is, ask other team members on Slack to try it out.

Todo

Currently, stopping the services isn't terribly graceful because of some transient errors when things get torn down. Need to update ansible with depends_on to do it in the right order.

5. Merge changes back into master

This step should be performed after deployment is successful, and it is only needed if additional commits have been made to the rc0.5.19 branch since the version bump.
Once deployment is successful, merge the rc0.5.19 -> master PR (use a merge commit, not a squash commit, so that master has the latest tag on it).