From 9762f1d6dee7da503d62cea41bbd49f1412280e7 Mon Sep 17 00:00:00 2001
From: Shulhan
Date: Thu, 15 Jan 2026 21:57:43 +0700
Subject: journal/2026: new journal "Gitignore package for Go"

This journal explains how to implement gitignore patterns in Go.
---
 _content/journal/2026/go_gitignore/index.adoc | 240 ++++++++++++++++++++++++++
 _content/journal/2026/index.adoc              |   8 +
 _content/journal/index.adoc                   |   2 +
 3 files changed, 250 insertions(+)
 create mode 100644 _content/journal/2026/go_gitignore/index.adoc
 create mode 100644 _content/journal/2026/index.adoc

diff --git a/_content/journal/2026/go_gitignore/index.adoc b/_content/journal/2026/go_gitignore/index.adoc
new file mode 100644
index 0000000..950abe4
--- /dev/null
+++ b/_content/journal/2026/go_gitignore/index.adoc
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: CC-BY-SA-4.0
+// SPDX-FileCopyrightText: 2026 M. Shulhan
+// REUSE-IgnoreStart
+= Gitignore package for Go
+:toc:
+:sectanchors:
+:lib_git_ref: https://git.sr.ht/~shulhan/pakakeh.go/tree/806359d5462fa8effde5b130da2071ed43d0da56/item/lib/git
+
+== Background
+
+I have several projects that do not use SPDX license identifiers yet,
+and I want to add them and, if possible, convert the existing copyright and
+license headers.
+My initial thought was: "Is there a tool to help convert the
+license headers to comply with SPDX?"
+
+I looked it up.
+
+There is one tool that closely matches my requirements,
+https://reuse.software/[reuse],
+which has options to set the copyright, license identifier, and year.
+
+----
+$ reuse annotate \
+  --recursive \
+  --copyright "author " \
+  --year $YEAR \
+  --license "license-name" \
+  .
+----
+
+An example of the result looks like this,
+
+----
+@@ -1,7 +1,8 @@
++// SPDX-FileCopyrightText: 2018 M. Shulhan
+ // Copyright 2018 Shulhan . All rights reserved.
+-// Use of this source code is governed by a BSD-style
+-// license that can be found in the LICENSE file.
++//
++// SPDX-License-Identifier: BSD-3-Clause
+----
+
+It does not remove the old copyright header; I think that is by design.
+I can run `sed` on those files to remove the lines that start with
+"// Copyright".
+However, there are other problems.
+If multiple files have different copyright years, I need to run another `sed`
+command to correct the years.
+And what if a file does not have a copyright year?
+We need to figure out from the git history the year it was created,
+
+----
+$ git log --follow --format=%ad --date=format:%Y $FILE | tail -1
+----
+
+Some big projects, like the
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/[Linux kernel],
+only set the SPDX license identifier without changing the copyright.
+We can do that.
+Or, we can write another tool that helps convert the license headers.
+
+Since this is a long holiday, let's take the hard way: writing a tool to
+convert the old headers to SPDX format.
+This should be simple, right?
+
+For each file in the directory:
+
+(1) If there is a line prefixed with "// SPDX", skip the file and continue
+    to the next file.
+
+(2) If there is a line prefixed with "// Copyright", capture the year,
+    author, and email using a regex, and replace the line with
+    "// SPDX-FileCopyrightText: ..."
+
+(3) If there is a line that matches "^//.*BSD-style", replace it with
+    "// SPDX-License-Identifier: BSD-3-Clause" and remove the line
+    that starts with "// license ..."
+
+(4) If there is no "// Copyright" line, get the year using the above
+    "git log" command and insert a new "// SPDX-FileCopyrightText: ..."
+    line using predefined values.
+
+It turns out there is another problem.
+
+A file can be
+https://reuse.software/faq/#exclude-file[excluded from REUSE compliance]
+if it is ignored by git, using the ".gitignore" file.
+And that is why we wrote a parser and checker for gitignore in Go.
+
+
+== Specification
+
+We use the
+https://git-scm.com/docs/gitignore[gitignore(5)^]
+manual as the specification for the implementation.
+
+In short, the rules are as follows:
+
+* Each line is a pattern that will be matched against a file name or path.
+* An empty line is ignored.
+* A line that starts with '#' is a comment, unless the '#' is escaped with
+  a backslash '\\'.
+* Trailing spaces are ignored, unless escaped with a backslash
+  '\\'.
+* The character '/' is the directory separator.
+* The special character '?' in a pattern matches one character except '/'.
+* The special character '*' in a pattern matches zero or more characters
+  except '/'.
+* A pattern that ends with '/' only matches a directory with the same
+  name.
+
+When reading the above rules, my first thought was that this is similar to
+https://pkg.go.dev/path/filepath#Match[filepath.Match^].
+
+I was wrong.
+
+According to the examples given in the manual, the pattern "foo/" matches a
+directory "foo" or "a/foo"; but the result of `filepath.Match` is different,
+
+----
+fmt.Println(filepath.Match("foo/", "foo"))
+fmt.Println(filepath.Match("foo/", "a/foo"))
+// Output:
+// false <nil>
+// false <nil>
+----
+
+Even if we remove the trailing slash, leaving the pattern "foo", the output
+is still not as expected,
+
+----
+fmt.Println(filepath.Match("foo", "foo"))
+fmt.Println(filepath.Match("foo", "a/foo"))
+// Output:
+// true <nil>
+// false <nil>
+----
+
+Continuing with the rules, there are other special characters that are not
+in line with [filepath.Match].
+
+* The special character '!' at the beginning of a pattern means negation.
+  A file or directory that is excluded by a previous pattern is included
+  again if it matches the negated pattern.
+
+* The pattern "\*\*/foo" matches any file or directory named "foo" with
+  zero or more directories before it.
+* The pattern "foo/\*\*" matches any file or directory inside the directory
+  "foo", but not the directory "foo" itself.
+* The pattern "foo/\*\*/bar" matches a file or directory named "bar" inside
+  the directory "foo", with zero or more directories in between.
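
To make the line-level rules above concrete, here is a minimal sketch of how a
single ".gitignore" line could be classified before any pattern matching
happens.
The names `rule`, `parseLine`, and `trimTrailingSpaces` are hypothetical and
are not the API of lib/git.
The sketch trims only trailing spaces, which is what gitignore(5) documents,
and it keeps the backslash of an escaped trailing space in the pattern instead
of unescaping it.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a hypothetical holder for one parsed line; it is not the
// IgnorePattern type from lib/git.
type rule struct {
	pattern string // pattern text without the leading '!' or trailing '/'
	negate  bool   // true if the line started with an unescaped '!'
	dirOnly bool   // true if the pattern ended with '/'
}

// trimTrailingSpaces removes trailing spaces unless the last one is
// preceded by a backslash.
// Simplification: a line ending in `\\ ` (escaped backslash, then a space)
// is treated as an escaped space too.
func trimTrailingSpaces(s string) string {
	for strings.HasSuffix(s, " ") && !strings.HasSuffix(s, `\ `) {
		s = s[:len(s)-1]
	}
	return s
}

// parseLine classifies a single .gitignore line.
// It returns ok=false for lines that carry no pattern (empty or comment).
func parseLine(line string) (r rule, ok bool) {
	line = trimTrailingSpaces(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return rule{}, false
	}
	if strings.HasPrefix(line, "!") {
		r.negate = true
		line = line[1:]
	}
	// A leading `\#` or `\!` stands for a literal '#' or '!'.
	if strings.HasPrefix(line, `\#`) || strings.HasPrefix(line, `\!`) {
		line = line[1:]
	}
	if strings.HasSuffix(line, "/") {
		r.dirOnly = true
		line = strings.TrimSuffix(line, "/")
	}
	if line == "" {
		return rule{}, false
	}
	r.pattern = line
	return r, true
}

func main() {
	lines := []string{"# comment", "", "build/  ", "!keep.log", `\#notes`, `name\ `}
	for _, line := range lines {
		r, ok := parseLine(line)
		fmt.Printf("%q => %+v (has pattern: %v)\n", line, r, ok)
	}
}
```

The wildcard characters are left untouched here; the sketch only separates the
flags (negation, directory-only) from the pattern text.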
+
+
+== Implementation
+
+Based on the above specification, it seems that a simple [filepath.Match] or
+[path.Match] is not sufficient to handle the patterns.
+
+We need to convert those patterns into regexes that comply with the above
+rules:
+
+* If the pattern ends with '/', mark it as a directory and remove the
+  trailing '/'.
+
+* Trim the "\*\*/" at the beginning of the pattern, since it means anything
+  before.
+  The pattern "\*\*/foo" or "\*\*/\*\*/foo" is equal to "foo".
+
+* Ignore the pattern if it ends up as an empty string or only '\*'.
+
+* Now, we need to detect whether the pattern contains the directory
+  separator '/'.
+  Let's find its index and store it as `$SEP_IDX` for later.
+
+* Escape the regex meta-characters '.', '+', '|', '(', and ')' with a
+  backslash '\\'.
+
+* Replace a single character '\*' with the regex "[^/]\*" (accept zero or
+  more characters except "/").
+
+* Replace a single character '?' with the regex "[^/]" (accept one character
+  except "/").
+
+* Replace the string "/\*\*/" with the regex "(/.\*)?/" (accept zero or more
+  directories in between).
+
+* Replace the string "/\*\*" with the regex "/(.\*)" (accept everything
+  inside a directory).
+
+* Replace the string "\*\*" with the regex "[^/]\*" (second pass for '\*').
+
+* Back to `$SEP_IDX`,
+** If no directory separator is found, prepend the pattern with the
+   regex "(/.\*)?/" (accept zero or more directories before).
+** If the directory separator is at the beginning or in the middle of the
+   pattern, prepend the pattern with the regex "^/?" (do not accept any
+   directory before).
+
+* If the pattern is a directory (it ended with '/'), as we marked before,
+  append back the '/' followed by '$'; otherwise append the regex "/?$"
+  (accept a file or directory).
+
+For example, here is a list of patterns and their conversions to regex,
+
+* foo or \*\*/foo => \^(.\*/|/)?foo/?$
+* foo\* => \^(.\*/|/)?foo[\^/]\*/?$
+* foo? => \^(.\*/|/)?foo[\^/]/?$
+* foo/ or \*\*/foo/ => \^(.\*/|/)?foo/$
+* foo/\*\* => \^(.\*/|/)?foo/(.*)/?$
+* /foo => \^/?foo/?$
+* /foo/ => \^/?foo/$
+* foo/bar => \^(.\*/|/)?foo/bar/?$
+* foo/bar/ => \^(.\*/|/)?foo/bar/$
+* /foo/bar => \^/?foo/bar/?$
+* foo/\*\*/bar => \^/?foo(/.*)?/bar/?$
+
+The result of the implementation can be viewed here:
+https://git.sr.ht/~shulhan/pakakeh.go/tree/main/item/lib/git/[lib/git^].
+
+
+The APIs are quite simple.
+First, load the ".gitignore" file from a directory using
+{lib_git_ref}/gitignore.go#L37[`LoadGitignore()`^],
+and then check whether a path is ignored using
+{lib_git_ref}/git.go#L246[`IsIgnored()`^].
+
+----
+func LoadGitignore(dir string) (ign *Gitignore)
+----
+
+LoadGitignore loads the gitignore file inside the directory `dir`. Any
+invalid pattern will be ignored.
+
+----
+func (ign *Gitignore) IsIgnored(path string) bool
+----
+
+IsIgnored returns true if the `path` is ignored by this Gitignore content.
+The `path` is relative to the Gitignore directory.
+
+There is also a type
+{lib_git_ref}/ignore_pattern.go[`IgnorePattern`^]
+that one can import and use for other implementations, for example handling
+the `path` value in the REUSE.toml annotations table.
+
+// REUSE-IgnoreEnd
diff --git a/_content/journal/2026/index.adoc b/_content/journal/2026/index.adoc
new file mode 100644
index 0000000..3f0c5cd
--- /dev/null
+++ b/_content/journal/2026/index.adoc
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: CC-BY-SA-4.0
+// SPDX-FileCopyrightText: 2026 M. Shulhan
+
+=== 2026
+
+link:/journal/2026/go_gitignore/[Gitignore package for Go].
+My thoughts when implementing a gitignore parser and checker for the Go
+programming language.
diff --git a/_content/journal/index.adoc b/_content/journal/index.adoc
index ec6df32..18c361d 100644
--- a/_content/journal/index.adoc
+++ b/_content/journal/index.adoc
@@ -3,6 +3,8 @@
 :toc:
 
+include::./2026/index.adoc[]
+
 include::./2025/index.adoc[]
 
 include::./2024/index.adoc[]
-- 
cgit v1.3