On-Call Duties (for core team)

The on-call team member is broadly responsible for keeping the public CodaLab instance running and ensuring we respond to user issues and questions in a timely manner. The on-call person is also responsible for the weekly release and deployment schedule. Here’s a checklist of responsibilities:

Daily: Reply to email

Reply to email sent to codalab.worksheets@gmail.com

If it is low quality or spammy, just ignore it and archive it.
If a user is experiencing a problem but didn’t provide enough details, ask for more details.
If the question is of general interest, first look in the FAQ. If it is not answered there, add the answer to the FAQ. Respond with the link to the answer.
Once you have responded to an email and no further action is needed on our end (beyond waiting for their reply), archive it. This keeps the inbox clean, so that it only contains emails that require our attention.

Daily: Check quota increase requests

Check disk quota and time quota increase requests on this spreadsheet. If you do not have access to view or edit the spreadsheet, switch to the codalab.worksheets@gmail.com account using the credentials here.

Check the user’s bundles to see if there’s an obvious way to reduce usage.
If not, and they are submitting to a competition (e.g., using BERT), grant them 20 GB (the default increase): log in as the codalab user and run cl uedit <username> -d 20g.
If they are asking for more, encourage them to find ways to reduce disk usage.
If the size they request is too large to grant, reply with this email format:

Hi XXX,
Thanks for your email. Unfortunately, we cannot fulfill your quota request at this time, as it exceeds our default quota increase of 20 GB. The team is working hard to let you host your bundles on servers external to CodaLab, which would remove this restriction, so please be on the lookout for that. In the meantime, feel free to reach out to us if you have any more questions.
Thanks,
The CodaLab Team

For time quota requests, run cl uedit <username> -t <new time quota amount>.
If you’re unsure, then ask in Slack.
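
The two quota commands above can be assembled programmatically before they are pasted into a shell. The following is a minimal sketch, not part of the CodaLab CLI: the function name quota_command and the constant DEFAULT_DISK_INCREASE are hypothetical, and the helper only builds the command string. It does not run cl or check the format of the quota value.

```python
import shlex

# Default disk increase named in the checklist above.
DEFAULT_DISK_INCREASE = "20g"


def quota_command(username, disk=None, time=None):
    """Return the `cl uedit` command line for one quota change (dry run only).

    Hypothetical helper: it builds the string and never executes it.
    """
    if not username or username.startswith("-"):
        raise ValueError("username must be a non-empty name, not a flag")
    if (disk is None) == (time is None):
        raise ValueError("pass exactly one of disk or time")
    # -d sets the disk quota and -t the time quota, as in the steps above.
    flag, value = ("-d", disk) if disk is not None else ("-t", time)
    return shlex.join(["cl", "uedit", username, flag, value])


# "some_user" is a placeholder username.
print(quota_command("some_user", disk=DEFAULT_DISK_INCREASE))
```

Building the command as a list and joining it with shlex.join keeps an unusual username from being misinterpreted by the shell.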

Daily: Check Slack

Check the #codalab channel in Slack (shared across multiple research groups).

If a user requests access to codalab.stanford.edu, log on to the instance as the root user and run cl uedit <username> --grant-access.

Daily: Check SendGrid

Log on to SendGrid using the credentials here and check the Activity Feed to ensure emails are being delivered to users and Slack.

Daily: Check GitHub issues

Check the new user-filed GitHub issues (in particular, focus on issues which are not yet assigned labels, e.g. Orphaned issues).

Try to understand and reproduce the problem. If there’s not enough information, ask for more.
Once it is reproduced, triage it onto the board based on severity.

Daily: Check server and workers are up

Make sure the server is alive:

Confirm that the public web UI responds.
Check that all workers are alive by running cl workers with the credentials specified here.
Check that not too many runs are stuck in the staged state on the What's Running on CodaLab? worksheet. Also check that no runs have been stuck in the running state for too long.
Check each failed bundle for the day in the Failed Runs worksheet.
Post in Slack if there are any doubts or if it is an urgent production issue (e.g. lots of bundles fail with the same error).
Bring up the failed bundles in our weekly status meetings.
Anything can happen, and if something catastrophic happens, you have to be able to react quickly.
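
The "web UI responds" check can be scripted. The sketch below is one possible approach using only the Python standard library; the function names http_status and ui_is_up are hypothetical, and the https://worksheets.codalab.org URL in the comment assumes the public instance is served over HTTPS at that host. It only tells you whether the front page answers, not whether workers or runs are healthy.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error status (e.g. 502).
        return error.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def ui_is_up(url, fetch=http_status):
    """Treat any 2xx or 3xx response as 'the web UI responds'."""
    status = fetch(url)
    return status is not None and 200 <= status < 400


# Example call (performs a real network request when uncommented):
# print(ui_is_up("https://worksheets.codalab.org"))
```

The fetch parameter exists so the decision logic can be exercised without touching the network.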

Possible problems

CodaLab is not responding.
Workers are down.
Runs are failing.
Runs are staged forever.
Ran out of disk space (due to worker cache, bundles, docker images, MySQL database, docker logs).
No more workers free.
Server or worker just crashed.
You might have to ssh into the server or workers to see what’s going on.
Be really careful about making changes on a live system. Always document what you’re doing and back things up. If you’re not sure, ask!
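
When you suspect the server or a worker has run out of disk space, a read-only check is a safe first step. The sketch below uses Python's shutil.disk_usage; the function name disk_report is hypothetical, and "." is only a default. Point it at whichever directory holds the worker cache, bundles, or Docker data on the machine you are inspecting.

```python
import shutil

GIB = 1024 ** 3


def disk_report(path="."):
    """Summarize usage of the filesystem that contains path (read-only)."""
    usage = shutil.disk_usage(path)
    percent_used = 100 * usage.used / usage.total if usage.total else 0.0
    return {
        "total_gib": usage.total / GIB,
        "free_gib": usage.free / GIB,
        "percent_used": percent_used,
    }


report = disk_report(".")
print(f"{report['free_gib']:.1f} GiB free, {report['percent_used']:.1f}% used")
```

Because it changes nothing on disk, this is safe to run on a live system before deciding what, if anything, to clean up.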

Daily: Check how users are using CodaLab

Check the running bundles on the What's Running on CodaLab? worksheet to see if anyone is using CodaLab inappropriately or maliciously (e.g., crypto mining). If you do identify a malicious user, follow these steps:

Log in as the root user using the credentials here.
Go to the What's Running on CodaLab? worksheet.
Get the user's username and run cl uedit -p 0 <username>. The command will prevent the user from running any more jobs on the public workers.
Kill the user's jobs by selecting all of their bundles and clicking the Kill button at the top of the page.

Important: double-check that the user really is malicious before changing their parallel run quota or killing their running jobs, so that you don't block a legitimate user.
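
One way to make that double-check harder to skip is to require the username twice before the blocking command is even produced. This is a minimal sketch under that assumption: block_user_command is a hypothetical helper, not a CodaLab feature, and it only returns the command string from the steps above without running it.

```python
import shlex


def block_user_command(username, confirmed_username):
    """Return the command that sets a user's parallel run quota to 0.

    The username must be given twice (for example, once copied from the
    worksheet and once typed by hand) so that a slip is caught early.
    """
    if not username or username != confirmed_username:
        raise ValueError("usernames do not match; re-check the worksheet")
    return shlex.join(["cl", "uedit", "-p", "0", username])


# "suspicious_user" is a placeholder username.
print(block_user_command("suspicious_user", "suspicious_user"))
```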

Weekly: Check Azure credits

Log on to Azure using the credentials here and check how many Azure credits we have left. Report in Slack if we are running low on credits or have run out.

Weekly: Release and deploy a new version of CodaLab

See the deployment docs for instructions.
Check in the #deployment channel in Slack to see when / whether to do this.
Always post in Slack in advance of and during a planned deployment (dev or prod), to keep the rest of the team in the loop.
Post the info of the release in the #codalab Slack channel.
Weekly deployment schedule:
- By the end of Monday: Dev Deployment
- By the end of Thursday: Prod Deployment

Weekly report

In advance of the weekly team meeting, write up a 5-minute report on the following (do this for both worksheets.codalab.org and codalab.stanford.edu):

Users
- How many new users joined? (run cl uls .joined_after=<on-call start date YYYY-MM-DD> .count)
Workers (run cl workers)
- How many workers are up? How many custom user workers?
- How many workers are down (our workers)?
Bundles (Check what's running on worksheets.codalab.org here and on codalab.stanford.edu here)
- How many new bundles have been created (run cl search .after=<on-call start date YYYY-MM-DD> .count)? What generally are people doing (competitions or just screwing around, etc.)?
- How many of them failed (run cl search .after=<on-call start date YYYY-MM-DD> state=failed .count) and why (see the failed-runs worksheets on worksheets.codalab.org and on codalab.stanford.edu)?
System (look in #prod-monitoring)
- What exceptions have been thrown?
- How much free disk is left on the server?
- How much disk is the DB using?
System (check Sentry)
- Go through all the Sentry events that have been reported in the last week, and if the issue is on our end, file a GitHub issue.
- This can be done by clicking "Link Github Issue" in the lower left-hand side of the webpage.
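
The counting commands in the report all take the same on-call start date, so they can be generated together. The following is a rough sketch: report_commands is a hypothetical helper, the date shown is only an example, and the commands are printed rather than executed. You still need to run them against both worksheets.codalab.org and codalab.stanford.edu.

```python
import shlex
from datetime import date


def report_commands(start):
    """Return the weekly-report `cl` commands for a given on-call start date."""
    day = start.isoformat()  # YYYY-MM-DD, the format the commands expect
    return {
        "new_users": shlex.join(["cl", "uls", f".joined_after={day}", ".count"]),
        "workers": shlex.join(["cl", "workers"]),
        "new_bundles": shlex.join(["cl", "search", f".after={day}", ".count"]),
        "failed_bundles": shlex.join(
            ["cl", "search", f".after={day}", "state=failed", ".count"]
        ),
    }


# Example start date only; substitute the first day of your on-call shift.
for label, command in report_commands(date(2024, 1, 1)).items():
    print(f"{label}: {command}")
```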