I’ve been working in a GitHub documentation repo recently with many different markdown files and I kept noticing broken links. Usually they were my own fault: links within the repo that I didn’t catch after moving, renaming, or otherwise reorganizing the content. Sometimes, though, they were external links that were valid months ago but no longer are.
I wanted to find all the broken links, correct them, and help reduce them in the future; that would require some automation to be effective. A quick search led me to this Markdown link check 🔗✔️ GitHub action.
Using the action is very straightforward; this is my initial action setup using it.
    # Validates markdown links to check for bad / invalid / broken links.
    # Uses mlc_config.json in root to configure patterns to ignore etc.
    #
    # https://github.com/marketplace/actions/markdown-link-check
    #
    name: Check Markdown links

    # Just running manually and weekly. Can take a few minutes potentially.
    on:
      workflow_dispatch:
      schedule:
        # Every Monday at 1p UTC https://crontab.guru/#0_13_*_*_1
        - cron: "0 13 * * 1"

    jobs:
      markdown-link-check:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@master
          - uses: gaurav-nelson/github-action-markdown-link-check@v1
            with:
              config-file: 'mlc_config.json'
              # Quiet mode only shows errors in output not successful links too
              use-quiet-mode: 'yes'
              # Specify yes to show detailed HTTP status for checked links.
              use-verbose-mode: 'yes'
- I used quiet mode as the number of successful links overwhelms the output and I’m mostly interested in the problems.
- Verbose mode is used to get a more detailed error dump for links that can’t be resolved.
- Initially I was running the action on every push but the action would take 2-10 minutes or so and the links didn’t need to be checked that aggressively.
- The action has various other settings; for example, it can check only modified markdown files.
- I found that a manual run dispatch plus a weekly scheduled run was a good middle ground for triggers.
- GitHub action schedule triggers are UTC so keep time zone conversion in mind for your local time.
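To make the UTC point concrete, here is a minimal Python sketch that converts the Monday 13:00 UTC schedule into a local wall-clock time. The function name `local_run_time` and the fixed UTC offsets are assumptions for illustration; fixed offsets ignore daylight saving time, so treat the result as approximate for zones that observe it.

```python
from datetime import datetime, timedelta, timezone

# 2024-01-01 was a Monday; it serves only as a reference date for the weekday.
REFERENCE_MONDAY = (2024, 1, 1)


def local_run_time(hour_utc, minute_utc, utc_offset_hours):
    """Return the local weekday and time for a Monday run scheduled in UTC.

    utc_offset_hours is a fixed offset (for example -5 or 5.5), so
    daylight saving time is not taken into account.
    """
    year, month, day = REFERENCE_MONDAY
    run_utc = datetime(year, month, day, hour_utc, minute_utc, tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return run_utc.astimezone(local_zone).strftime("%A %H:%M")


# The cron expression "0 13 * * 1" means Monday at 13:00 UTC.
print(local_run_time(13, 0, -5))   # UTC-5
print(local_run_time(13, 0, 11))   # UTC+11 rolls over into the next day
```

For zones far ahead of UTC the run can land on a different weekday, which is worth checking before picking a cron hour.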
Initially I ran the tool without a configuration file so there were no URL patterns to ignore and many ‘broken’ links. I say ‘broken’ as many of these links may be valid internal sites that GitHub.com can’t reach or public sites requiring authentication. Others may not require authentication but may return non-standard HTTP status codes when the URL is hit via a bot / automated process / outside of a user request in a browser.
By default the action looks for a config file named mlc_config.json in the repo root but a different filename can be given. URL patterns to ignore can be put here along with other configuration options. I found the easiest method was copying unreachable URLs from the GitHub action output into a tool like regexr.com, testing a pattern there, then copying it to the config file. A partial sample follows.
    {
      "ignorePatterns": [
        { "pattern": "(.*\\.)?company-domain\\.com.*" },
        { "pattern": "(.*\\.)?dev.azure\\.com" },
        { "pattern": "(.*\\.)?github.com/company-org/.*" },
        { "pattern": "https://github.com/orgs/company-org/.*" },
        { "pattern": "(.*\\.)?.azurewebsites.net.*" },
        { "pattern": "10.0.4.(?:[4-9]|10)*." },
        { "pattern": "^(http|https)://localhost" },
        { "pattern": "^(http|https)://redis.io" },
        { "pattern": "^(http|https)://www.linkedin.com" },
        { "pattern": "^(http|https)://help.octopus.com" }
      ],
      "retryOn429": true,
      "aliveStatusCodes": [200, 206]
    }
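If you would rather test patterns locally than in a web tool, something like the following Python sketch can help. It assumes the link checker applies each pattern as an unanchored regular expression search against the URL, and it uses Python's `re` module, whose syntax is close to but not identical to JavaScript's, so treat the results as a sanity check rather than a guarantee. The embedded config excerpt, the sample URLs, and the helper names `load_patterns` and `is_ignored` are hypothetical.

```python
import json
import re

# Hypothetical excerpt of an mlc_config.json; in practice, read the real file.
CONFIG_TEXT = r"""
{
  "ignorePatterns": [
    { "pattern": "(.*\\.)?company-domain\\.com.*" },
    { "pattern": "^(http|https)://localhost" }
  ]
}
"""


def load_patterns(config_text):
    """Compile every ignorePatterns entry found in the JSON config text."""
    config = json.loads(config_text)
    return [re.compile(entry["pattern"]) for entry in config.get("ignorePatterns", [])]


def is_ignored(url, patterns):
    """Return True if any pattern matches anywhere in the URL."""
    return any(pattern.search(url) for pattern in patterns)


patterns = load_patterns(CONFIG_TEXT)

# Sample URLs (made up) to confirm what would and would not be skipped.
for url in [
    "https://wiki.company-domain.com/page",
    "http://localhost:8080/health",
    "https://example.com/docs",
]:
    status = "ignored" if is_ignored(url, patterns) else "checked"
    print(f"{status}: {url}")
```

Pasting in the unreachable URLs from a failed run alongside a few links that should still be checked makes it easy to spot a pattern that is too broad.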
When there are broken links, Action output will look something like this.
With good URL ignore patterns in the configuration, the number of broken links should be minimal. I was quickly able to catch and correct at least a dozen invalid links after configuring the tool.
When all configured links are checked successfully:
It’s also helpful to add a workflow status badge to the repo’s README so the link check status is more visible than drilling into Actions.
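As a small illustration, this Python sketch builds the Markdown for such a badge using GitHub's workflow badge URL format (`.../actions/workflows/<workflow file>/badge.svg`). The owner, repo, and workflow filename shown are placeholders, not values from my repo, and the helper name `workflow_badge_markdown` is made up for this example.

```python
def workflow_badge_markdown(owner, repo, workflow_file, label="Check Markdown links"):
    """Return a README-ready Markdown badge that links to the workflow's runs."""
    workflow_url = f"https://github.com/{owner}/{repo}/actions/workflows/{workflow_file}"
    return f"[![{label}]({workflow_url}/badge.svg)]({workflow_url})"


# Placeholder values; substitute your own org, repo, and workflow filename.
print(workflow_badge_markdown("company-org", "docs", "markdown-link-check.yml"))
```

The printed line can be pasted near the top of the README.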
The package is Treeware, which I think is cool. They ask that you buy the world a tree to thank them for their work.