Flakiness using the `auth` action with Workload Identity Federation

This issue has been tracked since 2023-02-13.

TL;DR

We have recently seen flakiness and failures in our CI/CD system that appear to be related to the auth action when it is used with Workload Identity Federation.

Failures in our tests:

Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/***/instances/us-central1~***/connectSettings?alt=json&prettyPrint=false": oauth2/google: unable to generate access token: Post "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/***:generateAccessToken": oauth2/google: invalid response when retrieving subject token: Get "https://pipelines.actions.githubusercontent.com/wnFNgWBjsU8cogeeTzb2CiO5AuGdnZuICpyvLHwtISiGZGW9qa/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/c74b5a43-cfd0-4a22-8e48-7ecf7f3f3bfa/jobs/a87764f3-d2a4-5991-9b9b-9ec78441f076/idtoken?api-version=2.0&audience=https%!A(MISSING)%!F(MISSING)%!F(MISSING)iam.googleapis.com%!F(MISSING)projects%!F(MISSING)174904406655%!F(MISSING)locations%!F(MISSING)global%!F(MISSING)workloadIdentityPools%!F(MISSING)gh-13a715-cloudsql-proxy%!F(MISSING)providers%!F(MISSING)gh-13a715-cloudsql-proxy"

Flakybot issues on our repo for context:
GoogleCloudPlatform/cloud-sql-proxy#1649
GoogleCloudPlatform/cloud-sql-proxy#1648

Expected behavior

Build normally passes without issues.

Observed behavior

Flakiness caused by failures to generate an access token from the auth credentials.

Action YAML

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: v1 periodic
on:
  schedule:
    - cron:  '0 2 * * *'

jobs:
  integration:
    name: integration tests
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [macos-latest, windows-latest, ubuntu-latest]
      fail-fast: false
    permissions:
      contents: 'read'
      id-token: 'write'
    steps:
      - name: Checkout code
        uses: 'actions/[email protected]'
        with:
          ref: v1

      - name: Setup Go
        uses: actions/[email protected]
        with:
          go-version: 1.19

      - id: 'auth'
        name: 'Authenticate to Google Cloud'
        uses: 'google-github-actions/[email protected]'
        with:
          workload_identity_provider: ${{ secrets.PROVIDER_NAME }}
          service_account: ${{ secrets.SERVICE_ACCOUNT }}
          access_token_lifetime: 600s

      - id: 'secrets'
        name: Get secrets
        uses: 'google-github-actions/[email protected]'
        with:
          secrets: |-
            MYSQL_CONNECTION_NAME:${{ secrets.GOOGLE_CLOUD_PROJECT }}/MYSQL_CONNECTION_NAME
            MYSQL_USER:${{ secrets.GOOGLE_CLOUD_PROJECT }}/MYSQL_USER
            MYSQL_PASS:${{ secrets.GOOGLE_CLOUD_PROJECT }}/MYSQL_PASS
            MYSQL_DB:${{ secrets.GOOGLE_CLOUD_PROJECT }}/MYSQL_DB
            POSTGRES_CONNECTION_NAME:${{ secrets.GOOGLE_CLOUD_PROJECT }}/POSTGRES_CONNECTION_NAME
            POSTGRES_USER:${{ secrets.GOOGLE_CLOUD_PROJECT }}/POSTGRES_USER
            POSTGRES_USER_IAM:${{ secrets.GOOGLE_CLOUD_PROJECT }}/POSTGRES_USER_IAM
            POSTGRES_PASS:${{ secrets.GOOGLE_CLOUD_PROJECT }}/POSTGRES_PASS
            POSTGRES_DB:${{ secrets.GOOGLE_CLOUD_PROJECT }}/POSTGRES_DB
            SQLSERVER_CONNECTION_NAME:${{ secrets.GOOGLE_CLOUD_PROJECT }}/SQLSERVER_CONNECTION_NAME
            SQLSERVER_USER:${{ secrets.GOOGLE_CLOUD_PROJECT }}/SQLSERVER_USER
            SQLSERVER_PASS:${{ secrets.GOOGLE_CLOUD_PROJECT }}/SQLSERVER_PASS
            SQLSERVER_DB:${{ secrets.GOOGLE_CLOUD_PROJECT }}/SQLSERVER_DB

      - name: Enable fuse config (Linux)
        if: runner.os == 'Linux'
        run: |
          sudo sed -i 's/#user_allow_other/user_allow_other/g' /etc/fuse.conf

      - name: Run tests
        env:
          GOOGLE_CLOUD_PROJECT: '${{ secrets.GOOGLE_CLOUD_PROJECT }}'
          MYSQL_CONNECTION_NAME: '${{ steps.secrets.outputs.MYSQL_CONNECTION_NAME }}'
          MYSQL_USER: '${{ steps.secrets.outputs.MYSQL_USER }}'
          MYSQL_PASS: '${{ steps.secrets.outputs.MYSQL_PASS }}'
          MYSQL_DB: '${{ steps.secrets.outputs.MYSQL_DB }}'
          POSTGRES_CONNECTION_NAME: '${{ steps.secrets.outputs.POSTGRES_CONNECTION_NAME }}'
          POSTGRES_USER: '${{ steps.secrets.outputs.POSTGRES_USER }}'
          POSTGRES_USER_IAM: '${{ steps.secrets.outputs.POSTGRES_USER_IAM }}'
          POSTGRES_PASS: '${{ steps.secrets.outputs.POSTGRES_PASS }}'
          POSTGRES_DB: '${{ steps.secrets.outputs.POSTGRES_DB }}'
          SQLSERVER_CONNECTION_NAME: '${{ steps.secrets.outputs.SQLSERVER_CONNECTION_NAME }}'
          SQLSERVER_USER: '${{ steps.secrets.outputs.SQLSERVER_USER }}'
          SQLSERVER_PASS: '${{ steps.secrets.outputs.SQLSERVER_PASS }}'
          SQLSERVER_DB: '${{ steps.secrets.outputs.SQLSERVER_DB }}'
          TMPDIR: "/tmp"
          TMP: '${{ runner.temp }}'
        # specifying bash shell ensures a failure in a piped process isn't lost by using `set -eo pipefail`
        shell: bash
        run: |
          go test -race -v ./... | tee test_results.txt

      - name: Convert test output to XML
        if: ${{ github.event_name == 'schedule' && always() }}
        run: |
          go install github.com/jstemmer/go-junit-report/[email protected]
          go-junit-report -in test_results.txt -set-exit-code -out v1periodic_sponge_log.xml

      - name: FlakyBot (Linux)
        # only run flakybot on periodic (schedule) event
        if: ${{ github.event_name == 'schedule' && runner.os == 'Linux' && always() }}
        run: |
          curl https://github.com/googleapis/repo-automation-bots/releases/download/flakybot-1.1.0/flakybot -o flakybot -s -L
          chmod +x ./flakybot
          ./flakybot --repo ${{github.repository}} --commit_hash ${{github.sha}} --build_url https://github.com/${{github.repository}}/actions/runs/${{github.run_id}}

      - name: FlakyBot (Windows)
        # only run flakybot on periodic (schedule) event
        if: ${{ github.event_name == 'schedule' && runner.os == 'Windows' && always() }}
        run: |
          curl https://github.com/googleapis/repo-automation-bots/releases/download/flakybot-1.1.0/flakybot.exe -o flakybot.exe -s -L
          ./flakybot.exe --repo ${{github.repository}} --commit_hash ${{github.sha}} --build_url https://github.com/${{github.repository}}/actions/runs/${{github.run_id}}

      - name: FlakyBot (macOS)
        # only run flakybot on periodic (schedule) event
        if: ${{ github.event_name == 'schedule' && runner.os == 'macOS' && always() }}
        run: |
          curl https://github.com/googleapis/repo-automation-bots/releases/download/flakybot-1.1.0/flakybot-darwin-amd64 -o flakybot -s -L
          chmod +x ./flakybot
          ./flakybot --repo ${{github.repository}} --commit_hash ${{github.sha}} --build_url https://github.com/${{github.repository}}/actions/runs/${{github.run_id}}

Log output

https://github.com/GoogleCloudPlatform/cloud-sql-proxy/actions/runs/4149257897/jobs/7178042999#step:7:354

Additional information

No response

github-actions[bot] wrote this answer on 2023-02-24

Hi there @jackwotherspoon 👋!

Thank you for opening an issue. Our team will triage this as soon as we can. Please take a moment to review the troubleshooting steps, which list common error messages and their resolution steps.

sethvargo wrote this answer on 2023-02-24

Hi @jackwotherspoon

Thank you for opening an issue. I'm seeing a bunch of "!F(MISSING)" in that output. For example, I would expect:

https://pipelines.actions.githubusercontent.com/wnFNgWBjsU8cogeeTzb2CiO5AuGdnZuICpyvLHwtISiGZGW9qa/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/c74b5a43-cfd0-4a22-8e48-7ecf7f3f3bfa/jobs/a87764f3-d2a4-5991-9b9b-9ec78441f076/idtoken?api-version=2.0&audience=https%!A(MISSING)%!F(MISSING)%!F(MISSING)iam.googleapis.com%!F(MISSING)projects%!F(MISSING)174904406655%!F(MISSING)locations%!F(MISSING)global%!F(MISSING)workloadIdentityPools%!F(MISSING)gh-13a715-cloudsql-proxy%!F(MISSING)providers%!F(MISSING)gh-13a715-cloudsql-proxy

to be:

https://pipelines.actions.githubusercontent.com/wnFNgWBjsU8cogeeTzb2CiO5AuGdnZuICpyvLHwtISiGZGW9qa/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/c74b5a43-cfd0-4a22-8e48-7ecf7f3f3bfa/jobs/a87764f3-d2a4-5991-9b9b-9ec78441f076/idtoken?api-version=2.0&audience=https://iam.googleapis.com/projects/174904406655/locations/global/workloadIdentityPools/gh-13a715-cloudsql-proxy/providers/gh-13a715-cloudsql-proxy

It looks like some kind of stripping or variable substitution might be failing. Per the troubleshooting steps, can you please enable debug logging and provide the logs (or a link to the logs)? I can add a few retries, but I want to make sure I understand the problem first. Does reducing the concurrency help at all?

jackwotherspoon wrote this answer on 2023-02-24

I've enabled debug logging and will monitor the builds. I'll post the logs here once the failure happens again.

sethvargo wrote this answer on 2023-02-24

Thanks @jackwotherspoon. Do you know if each test is generating a new auth token through the WIF workflow? I wonder if generating an auth token and injecting it into the process instead of relying on ADC could help. That would mean you only have one auth exchange.

This line sets a token format, but you're not actually generating a token (token_format: 'access_token') or injecting it into the subsequent processes. That means each run does an ADC cycle, which might be why you're getting errors. I would expect the errors to be rate limits or quota though, not connection errors.

I wonder if our action or the nodejs action isn't properly cleaning up connections?

enocom wrote this answer on 2023-02-24

The error message also suggests the OAuth2 Go library might not be formatting URLs correctly.
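
One possible reading of the `%!A(MISSING)` and `%!F(MISSING)` fragments, offered as a sketch and not a confirmed diagnosis of the OAuth2 library: Go's `fmt` package emits exactly this text when a string containing percent-escapes such as `%3A` and `%2F` is used as a format string with no arguments. The helper names `asFormatString` and `asArgument` and the sample audience value are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// noArgs is an empty argument list. Passing it explicitly keeps the
// demo equivalent to fmt.Sprintf(msg) while avoiding the vet warning
// that newer Go toolchains raise for that call form.
var noArgs []any

// asFormatString treats msg as a format string, which is the mistake
// being illustrated: "%3A" parses as verb 'A' with width 3, and "%2F"
// as verb 'F' with width 2, and neither has an argument to consume.
func asFormatString(msg string) string {
	return fmt.Sprintf(msg, noArgs...)
}

// asArgument passes msg as data, so percent-escapes survive untouched.
func asArgument(msg string) string {
	return fmt.Sprintf("%s", msg)
}

func main() {
	// Hypothetical audience value, query-escaped as it would be in a URL.
	audience := url.QueryEscape("https://iam.googleapis.com/projects/123")

	fmt.Println(asArgument(audience))
	// https%3A%2F%2Fiam.googleapis.com%2Fprojects%2F123

	fmt.Println(asFormatString(audience))
	// https%!A(MISSING)%!F(MISSING)%!F(MISSING)iam.googleapis.com%!F(MISSING)projects%!F(MISSING)123
}
```

If that is what happened here, the request URL itself may have been well-formed and only the error message was garbled when it was formatted.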

jackwotherspoon wrote this answer on 2023-02-24

@sethvargo Thanks for the suggestions! I will look at how we can generate tokens/creds more efficiently, since we currently rely heavily on ADC.

This may be why we also see timeout errors in some of our runs:

Error: google-github-actions/get-secretmanager-secrets failed with: failed to access secret "projects/***/secrets/MYSQL_CONNECTION_NAME/versions/latest": request to https://pipelines.actions.githubusercontent.com/umAmnh0OhcfbtGEt7J16Yga6HsgM8dYIhPxbPiOYFLVwMnfbKz/00000000-0000-0000-0000-000000000000/_apis/distributedtask/hubs/Actions/plans/833be520-2cef-45cf-837f-a087e0c4b14d/jobs/8b492971-3af8-5c25-5c38-fc3954dced57/idtoken?api-version=2.0&audience=https%3A%2F%2Fiam.googleapis.com%2F*** failed, reason: connect ETIMEDOUT 13.107.42.16:443

Build: https://github.com/GoogleCloudPlatform/cloud-sql-python-connector/actions/runs/4185940413/jobs/7253728369

More Details About Repo
Owner Name: google-github-actions
Repo Name: auth
Full Name: google-github-actions/auth
Language: TypeScript
Created Date: 2021-09-16
Updated Date: 2023-03-24
Star Count: 573
Watcher Count: 16
Fork Count: 116
Issue Count: 3
