Lix infrastructure guide

Information about administering Lix's infrastructure.

Machine and service overview

The Lix infrastructure is maintained with Nix code at https://git.lix.systems/lix-project/web-services.

That repository is the source of truth for what's serving where, but we attempt to reflect that here as well for ease of reference. This page is the source of truth for points of contact and where a machine physically exists.

This page was previously called "Infrastructure overview", but that name falsely implied it was a good entry point for beginners seeking to understand how to get started in our tooling. This page is more of an operational reference on how things are deployed.

Hosts

lix.systems

Host info

Services

buildbot.lix.systems

Host info

Services

monitoring.lix.systems

Host info

Services

cache.lix.systems

"S3" host for the future binary cache.

Host info

Services

scratch.lix.systems

Scratch host to do staging things on.

Host info

pad.lix.systems

Host info

Services

core.lix.systems

Host info

Services

matrix.lix.systems

Host info

Services

Builders

build01.aarch64.lix.systems

build02.aarch64.lix.systems

build01.aarch64-darwin.lix.systems

epyc.infra.newtype.fr

Auth/SSO systems

A major part of Lix infrastructure is the authentication/SSO systems. Here, you can find information about how to run them.

Changing names, emails, etc

The Lix project endeavours to not deadname people, because we believe in human decency. However, some of our software has other ideas. This page documents the workarounds to manually fix profile updates that don't get applied because various software is busted.

Intended design

Ideally, contributors should be able to go to https://identity.lix.systems, change their usernames, display names, and emails, re-log into every service, and then every service will have the correct names and emails.

wiki.lix.systems

The wiki does not update emails when they are changed via OIDC. Furthermore, users can't change them themselves. Why it does this, we will never know; OIDC provides persistent UUIDs, so there is no reason for this behaviour.

To fix a user's email manually, go to https://wiki.lix.systems/settings/users, and select the user in question and edit them.

The wiki will also not update full names automatically, which is likewise broken, but users can simply change those themselves. It does not seem to use usernames at all.

git.lix.systems

!! Currently this is broken and we cannot change Forgejo usernames at all: https://git.lix.systems/lix-project/web-services/issues/93. The workaround here is to clear the username field when changing to a local account.

Forgejo blocks username changes for accounts with external sign-in for no reason. These have to be fixed by an administrator. Go to https://git.lix.systems/admin/users, click the edit icon next to the user in question, then set the Authentication Source to Local, fix the username, then press Update User Account. Next, set the Authentication Source back to Lix.

Users are able to change emails themselves by adding a new email then deleting the old one. They can also be changed by administrators in the same page as above.

OIDC provides persistent UUIDs; there is no reason for Forgejo to do this.

Forgejo does not update names or emails from Keycloak after initial login, which is broken as well.

gerrit.lix.systems

Gerrit will break accounts, rendering them unable to log in, if their username is changed. Changing email and display name works as expected. Gerrit appears to want to believe that usernames cannot change, which is a skill issue, because accounts have numeric IDs anyway.

Extremely untested scuffed-looking db hacking procedure: https://wikitech.wikimedia.org/wiki/SRE/LDAP/Renaming_users/Gerrit

pad.lix.systems

We think this one worked properly the last time we checked; it seems to just replace the profile on each login.

How accounts work

Lix has one source of truth for authentication: Keycloak (identity.lix.systems). Most services are bound to Keycloak for authentication via OAuth2, although it supports SAML as well.

GitHub vs Local accounts

GitHub accounts are used at Lix for two reasons:

We don't really care if people have the same username or other information on Lix as they have on GitHub. We don't care about whether people have first/last names on our Keycloak or if they are using pseudonyms.

Allow/ban listing

There is an allow-list and a ban list maintained at: https://git.lix.systems/lix-project/access-control (private repo, available only to Lix core team). To add people to a list, use ./add.sh list.txt gh-username. Once a list change is pushed, it can take up to five minutes for the change to take effect, as this is currently running on a 5m cron job.

In short, the process for adding a user to the ban or allow list is:

  1. Make sure you have the latest version of the repo (i.e. git pull) and that the GitHub gh command is installed.
  2. Run ./add.sh <relevant-list-file> <github username>.
  3. Commit and push the change.
  4. The ACL change will apply automatically within five minutes.
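
For example, a full session to allow a new user might look like this (the list file name and GitHub username below are made up; check the repo for the actual file names):

git pull
./add.sh allowed.txt some-github-user
git commit -am "Allow some-github-user"
git push
# the cron job applies the change within about five minutes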

Be warned -- the allow-list method of access control is temporary / established for the beta period.

Our allow/ban listing is done by GitHub ID, using keycloak-allowban-plugin, a custom Keycloak plugin that reads text files with allow/ban lists. The GitHub ID is put into a user profile attribute, which prevents ban-evasion via account unlinking since it will stick across unlinking.

Known weirdness with the allow/ban list plugin

If a user tries to log in via GitHub and they are not allowed by the plugin, the account is created anyway; it is simply not usable. This is a known issue: putting the plugin in the registration flow caused half-registered users, so it is only in the post-gh-login flow and the normal login flow (to catch unlinked banned accounts).

Local accounts

The Lix core team should have local accounts (linking to GitHub is OK), strongly preferably with 2FA. Other people can be given local accounts if they are trustworthy and prefer to have local accounts (since the usual ban process doesn't work on them; though it is not hard to ban them, just disable the account).

Note that GitHub backed accounts can be turned mostly into local accounts by the user simply setting up local auth and unlinking the GitHub account (though the GitHub ID will intentionally persist in properties so this doesn't degrade our bans story).

We would prefer for everyone to use WebAuthn for local accounts, but this is often not possible and passwords are OK as long as they're just put in a password manager.

To create a local account, get the following info:

Then create an account on https://identity.lix.systems/admin with the provided details. On the account's page, go to Credentials, select Credential Reset, then if WebAuthn is ok for the person, set "WebAuthn Register Passwordless" in the actions (otherwise just password reset) and send it.

Removing last names for people

Due to Keycloak being a silly little thing, we need to use "declarative user profiles" to allow not setting last names. For now, Lix core team members with necessary access will have to remove them manually on request.

This would be fixed by updating Keycloak to 24 on lix.systems and setting up declarative user profiles: https://git.lix.systems/lix-project/web-services/issues/64

How to ban someone

If a user has violated our community norms and needs to have their access to our infrastructure removed, follow these steps:

  1. Add them to the banned users list on https://git.lix.systems/lix-project/access-control and push the changes.
  2. Go to https://identity.lix.systems/admin and disable their account for good measure.
  3. Ban them from Matrix: FIXME
  4. (if you really don't like what's going on) invalidate all sessions:
    1. ssh root@git.lix.systems -- mysql -D forgejo -e 'delete from session;'
    2. ssh -p 2022 youruser@gerrit.lix.systems gerrit flush-caches --cache web_sessions
    3. FIXME: bookstack

How do permissions work?

In an ideal world, all permissions are managed directly in Keycloak and propagated down to downstream systems automatically. We mostly live in that world. We would also like more parts of profiles to propagate from Keycloak into downstream systems (see changing names document).

First, let's enumerate the access that we have available to grant.

Available access

Roles that exist in Keycloak

"sticky" is referring to whether later-removed permissions get stuck in the downstream system if they are removed upstream

Roles that we wish existed in Keycloak

These can't happen due to current technical limitations.

Structure of access

Keycloak appears to want its structure to work like:

Group -> Composite Realm Role -> Client Role

At the time of writing, we don't have enough groups or client roles to assign for composite realm roles to make much sense.

When creating new clients, make their roles client roles, which we can then assign to other objects inside Keycloak so we can do role-based access control at a later time without having to mess with the services.

Groups

Policy on who goes in groups

Assigning Groups

See How do permissions work? for implementation details.

tldr;

Note: most permissions only update after logging out and back into the appropriate application.

Tutorial: adding auto mapping of forgejo groups

Create a role on the Keycloak client: Screenshot_20240708_134308.png

Go into the group in question and map it to the role you just made: Screenshot_20240708_134350.png

Add a json snippet to map the role in the incoming tokens to the appropriate team on the org:

{"the-distro-committer": {"the-distro": ["committers"]}, "the-distro-org-owner": {"the-distro": ["owners"]}}

It needs to be added to: https://git.lix.systems/admin/auths/1

Buildbot runbook

Our buildbot instance has a habit of breaking due to excess load.

Restarting the worker

If the worker (primary, handling nix evaluations) explodes, it can be restarted.

ssh root@buildbot.lix.systems 'systemctl restart buildbot-worker.service'

Re-trying spurious CI failures

Those with the relevant permissions can click the "rebuild" button on a given CI job, but in order to count for Gerrit's checks and set the Verified +1 flag, you must restart the top-level lix/nix-eval job; restarting e.g. a single test or build will not affect things on the Gerrit side.

Why

Why? Why self-host all your own infrastructure?

We tried not to, at the very beginning of the project. We agreed that GitHub-style code review wasn't really fit for our kind of project, and wanted to use Gerrit for code review. We started setting up GerritHub for a repo hosted on GitHub, and ran into so many problems with that approach that it was actually easier to just self-host Gerrit instead (for starters, some members could not log in at all).

Then there was little reason left to use GitHub: none of us are really happy with GitHub's direction lately anyway, and Forgejo + GitHub-enabled SSO mean that contributors shouldn't have to jump through too many hoops to help out.

So now we have a fully independent and open source infrastructure stack, with (hopefully) a good onboarding path as well. And we're also in our own critical path: we run into Nix's papercuts and gashes alike every day, so we'd better fix them!

Here is our thought process from back then:

The Lix CI is broken though!

Yes, our buildbot is a high-maintenance service and it is janky. Multiple members of the Lix team have plans to write entirely new Nix CI systems, but they are otherwise busy with another major project in the form of Lix. This is the matrix of extant alternatives:

Obliterating history from Git

To obliterate history from the Git repo means removing it from three different sources: Gerrit, Forgejo, and GitHub.

A tool has been written, called gerrit-rewrite-branch, to rewrite Gerrit history completely, including the meta on past CLs.

To use it, build it with --release (it will stack overflow in debug mode), then find the following repos and make backups of them:

To start off, stop Gerrit and find its Git repo. The tool requires four things: the email address to obliterate, a replacement name, a replacement email address, and a cutoff date before which commits are removed. To find the cutoff, run git cat-file -p {commit} for a commit earlier than the oldest one you want to remove, and note down the timestamp on the committer line.
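
For example (the hashes, identity, and timestamp below are made up):

$ git cat-file -p {commit}
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent 1111111111111111111111111111111111111111
author Example Person <example@example.com> 1714000000 +0000
committer Example Person <example@example.com> 1714000000 +0000

some commit message

The number on the committer line (here 1714000000, a Unix timestamp) is the cutoff value to note down.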

Call the tool. It will churn for a while, and rewrite all previous Git commits, plus the Gerrit metadata of affected commits. As a bonus, run a git gc --prune=now.

Before turning Gerrit back on, run systemd-run -p DynamicUser=yes -p StateDirectory=gerrit -t gerrit reindex -d /var/lib/gerrit. This ensures Gerrit is aware of the changes made while it was offline.

For Forgejo, no special steps are needed; just run the same tool over these repos plus all their forks, and prune the reflog and unreachable commits as well:

[root@lix:/var/lib/forgejo/repositories]# for i in */lix.git; do pushd $i; sudo -u git git reflog expire --expire=all --expire-unreachable=all --all; sudo -u git git gc --prune=now; popd; done

Once Gerrit and Forgejo are back up, run ssh gerrit.lix.systems replication start --now --url github to propagate the changes to GitHub.

Don't forget to ban the commits as well, using ssh gerrit.lix.systems gerrit ban-commit lix {commits}.

Tooling improvements

We use a lot of tooling. There are papercuts we run into with our use cases that we would really like to have fixed.

Forgejo improvements

A brief overview of our code infrastructure for those not in the Lix project:

Stuff that works great

Forgejo does a lot of stuff better than GitHub and we love it very much for these things.

Stuff that makes us Very Sad

Stuff that would be Nice

UX

Operations

Stuff we patched that could probably be done Better upstream

Postmortems

buildbot.lix.systems out of free disk 2024-06-09

The buildbot box was returning "no free space" for basically any btrfs operation, including collecting garbage. Yet df -h claimed there was still free disk space.

Damn it!!

https://ohthehugemanatee.org/blog/2019/02/11/btrfs-out-of-space-emergency-response/

The box has another disk on it that did have space, but it was ext4. So I did something inadvisable:

# the device we were going to migrate nix store to
mount /dev/sda1 /mnt
fallocate -l 20GiB /mnt/ohno
losetup -f /mnt/ohno
losetup -a
# bad evil!! do not do this! this is a great way to break your fs if the machine goes down
btrfs device add /dev/loop0 /
btrfs balance start -dusage=10 /
nix-collect-garbage # for a bit, then ctrl-c'd
btrfs device delete /dev/loop0 /

This freed enough disk that the machine was unstuck. I then ran more of a garbage collection, which freed enough space to further recover the machine.

Working with S3

Introduction

We use garage, an open-source server compatible with Amazon's S3 API, hosted on our own infrastructure. Currently we store both documentation and binaries there; it may be used for other things down the line.

Configuring a client

You probably want to use rclone; it's friendly (to people who like the terminal) and not tightly bound to any specific storage service. You can also use the Amazon first-party S3 tooling, but this guide does not attempt to explain how.

To follow these steps, you will need to already have ssh access to the server garage runs on, which is s3.lix.systems. As of this writing, there is no guide about how to do that, but take a look at services/ssh.nix in the web-services repo and see whether it makes sense to you. Please feel free to write said guide and add it to this wiki. :)

Generating S3 credentials

Once you have ssh access, you will need to make s3 credentials. You can do it like this:

$ ssh root@s3.lix.systems
[root@cache:~]# garage key create some-key-name
Key name: some-key-name
Key ID: GKa653da6819c4140c3db9dfc5 
Secret key: ab2b6106fbb7681517cba875c26c8ea99e281f113e2fd809decd6e524ebbc639

Can create buckets: false

Key-specific bucket aliases:

Authorized buckets:

(Don't worry - those aren't real keys there, nor were they ever! They're synthetic examples so you know what they look like.)

You'll want to choose a key name that helps the rest of us know whose it is and what it's used for. Don't just create a key called some-key-name by copying the example verbatim, it will be confusing clutter!

The most important criterion for a key name is that reading the name should let you answer the questions "is anyone still using this?" and "what will break if this key is deleted?" If you need naming inspiration you can see other people's key names with garage key list; in particular, keys meant for individual use should probably start with your username.

Before you sign out of the server, also make sure to grant the key the permissions you need. For example, if you need to work with the docs bucket, do:

[root@cache:~]# garage bucket allow --read --write docs --key irenes-temp-delete
New permissions for GKa653da6819c4140c3db9dfc5 on docs: read true, write true, owner false.

You can see what buckets exist by doing garage bucket list.

The "Key ID" and "Secret key" values from the key you generated are what you'll need in the next step. Make sure you have them; there's no way to look up the secret part later.

Configuring your client (probably rclone)

You may find it useful to reference the garage documentation on this.

There are two ways to configure rclone, either of which will work. The one Irenes recommend is to put the credentials directly into the rclone configuration (rclone has its own tooling for securing them, which you can set up if you want). The other way is to let rclone read them out of the config file used by Amazon's first-party tooling. With the AWS config file, it's a little harder to figure out later what you did, if you happen to forget. Also, if Lix infrastructure isn't the only S3 service you use on a regular basis, the rclone config is probably a better place to keep track of everything, because AWS profiles are a pain to use.

If you're doing it the Irenes way, you can either run rclone config and go through the prompts, or just prepare a config file by hand.

Here's a sample rclone.conf:

[lix]
type = s3
provider = Other
env_auth = false
endpoint = s3.lix.systems
region = garage
access_key_id = GKa653da6819c4140c3db9dfc5
secret_access_key = ab2b6106fbb7681517cba875c26c8ea99e281f113e2fd809decd6e524ebbc639

For more information about where to put this config file, see man rclone; it's likely that ~/.config/rclone/rclone.conf is the right place.

Please notice that this example file uses lix as the name of the rclone "remote". That means that, when interacting with it, you'll use paths like lix: to refer to the entire thing, or lix:docs/ to refer to the root of the bucket named docs, and so on. You can use any name you find convenient for the remote, it doesn't have to be lix, but this document will assume it's that. If you think you might have done this configuration already but don't remember what you called the remote, do rclone listremotes.

If you're going through the interactive configuration, choose the generic S3-compatible service as the type of service. For the endpoint, write in s3.lix.systems, and for the region, write in garage. If you leave region blank you'll get weird errors about us-east-1, but we're not Amazon and we don't have a global network of highly-redundant data centers, so don't leave it blank. :)

If you're storing credentials in the AWS config file, everything is pretty similar except you'll need to prepare ~/.aws/credentials yourself, and tell rclone to use it; the rclone config wizard has options for that. The easy way is to use the default profile in the AWS credentials file; otherwise you'll have to make sure your environment sets AWS_PROFILE, since rclone has no option to manage that itself.
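
If you go that route, a minimal ~/.aws/credentials might look like the sketch below (reusing the synthetic example keys from earlier); with the keys stored there, the rclone remote would set env_auth = true instead of access_key_id/secret_access_key, and rclone should pick the credentials up via the standard AWS credential chain.

[default]
aws_access_key_id = GKa653da6819c4140c3db9dfc5
aws_secret_access_key = ab2b6106fbb7681517cba875c26c8ea99e281f113e2fd809decd6e524ebbc639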

Copying files into and out of s3

If you're using rclone, you may find it useful to do rclone ls lix: to get a sense of what's there. This will probably become increasingly bad advice as our usage of S3 grows! :)
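
For example (the file names and sizes below are made up):

$ rclone ls lix:
     3519 docs/index.html
    61440 docs/manual/installation.html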

Notice that the first path component in this output is the bucket name, so e.g. a file named index.html at the root of the docs bucket is listed as docs/index.html in this view. That is also how you will refer to it from the command line. Other S3 clients have different conventions in this regard, so if you're using something else, check its upstream documentation.

If you have a local file index.html and you want to overwrite the remote docs/index.html with it, do rclone copy index.html lix:docs/. You have to give a directory prefix, not a filename, for the second part.

In general, the rclone CLI lets you intermingle local and remote paths, so pay close attention to the colons. lix:something is a remote path, something is a local one. If you lose track of this you will end up sad.
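
For instance (both of these paths are hypothetical):

rclone copy ./public/ lix:docs/          # local directory -> remote bucket prefix
rclone copy lix:docs/index.html ./tmp/   # remote file -> local directory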

For any other rclone-related questions, rclone --help and man rclone are good references.

Happy filing!

Creating Matrix Rooms/Spaces

Actual explanation will follow; tldr here:

Merging Gerrit identities

Basically, follow https://ovirt-infra-docs.readthedocs.io/en/latest/General/Gerrit_account_merge/index.html.

If, for some reason, you don't have access to refs/meta/external-ids, you can still do it on the server directly, as long as you make sure to restore the permission bits for gerrit:gerrit on the git storage afterwards.

You can extract a worktree with git worktree add /tmp/external-ids refs/meta/external-ids, generate a commit that fixes things, and then complete the operation with git update-ref refs/meta/external-ids $commit_sha1.
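
A rough sketch of that server-side flow, assuming the standard NoteDb layout where external IDs live in the All-Users repository (the /var/lib/gerrit/git/All-Users.git path is an assumption; check gerrit.basePath on the host):

cd /var/lib/gerrit/git/All-Users.git
git worktree add /tmp/external-ids refs/meta/external-ids
cd /tmp/external-ids
# edit the external-id files for the accounts being merged, then:
git add -A
git commit -m "Merge duplicate external ids"
git update-ref refs/meta/external-ids "$(git rev-parse HEAD)"
cd /var/lib/gerrit/git/All-Users.git
git worktree remove /tmp/external-ids
# as noted above, restore ownership on the git storage afterwards
chown -R gerrit:gerrit /var/lib/gerrit/git/All-Users.git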

Note: