-rw-r--r-- _content/journal/2026/go_gitignore/index.adoc 240
-rw-r--r-- _content/journal/2026/index.adoc 8
-rw-r--r-- _content/journal/index.adoc 2
3 files changed, 250 insertions, 0 deletions
diff --git a/_content/journal/2026/go_gitignore/index.adoc b/_content/journal/2026/go_gitignore/index.adoc
new file mode 100644
index 0000000..950abe4
--- /dev/null
+++ b/_content/journal/2026/go_gitignore/index.adoc
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: CC-BY-SA-4.0
+// SPDX-FileCopyrightText: 2026 M. Shulhan <ms@kilabit.info>
+// REUSE-IgnoreStart
+= Gitignore package for Go
+:toc:
+:sectanchors:
+:lib_git_ref: https://git.sr.ht/~shulhan/pakakeh.go/tree/806359d5462fa8effde5b130da2071ed43d0da56/item/lib/git
+
+== Background
+
+I have several projects that do not use SPDX license identifiers yet.
+I want to add them and, if possible, convert the existing copyright and
+license headers.
+My initial thought was: "Is there a tool to help convert the
+license headers to comply with SPDX?"
+
+I looked it up.
+
+There is one tool that closely matches my requirements,
+https://reuse.software/[reuse],
+which has options to set copyright, license identifier, and year.
+
+----
+$ reuse annotate \
+ --recursive \
+ --copyright "author <email>" \
+ --year $YEAR \
+ --license "license-name" \
+ .
+----
+
+An example of the result looks like this,
+
+----
+@@ -1,7 +1,8 @@
++// SPDX-FileCopyrightText: 2018 M. Shulhan <ms@kilabit.info>
+ // Copyright 2018 Shulhan <ms@kilabit.info>. All rights reserved.
+-// Use of this source code is governed by a BSD-style
+-// license that can be found in the LICENSE file.
++//
++// SPDX-License-Identifier: BSD-3-Clause
+----
+
+It does not remove the old copyright header; I think this is by design.
+I can run `sed` on those files to remove the lines that start with
+"// Copyright".
+However, there are other problems.
+If the files have different copyright years, I need to run another `sed`
+command to correct the years.
+And what if a file does not have a copyright year?
+We need to figure out from the git history the year it was created,
+----
+$ git log --follow --format=%ad --date=format:%Y $FILE | tail -1
+----
+
+Some big projects, like
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/[Linux kernel],
+only set the SPDX license identifier without changing the copyright.
+We can do that.
+Or, we can write another tool that helps convert the license headers.
+
+Since this is a long holiday, let's do it the hard way and write a tool to
+convert the old headers to SPDX format.
+This should be simple, right?
+
+For each file in directory:
+
+(1) If there is a line prefixed with "// SPDX", skip the file and continue
+ to the next one. +
+(2) If there is a line prefixed with "// Copyright", capture the year,
+ author, and email using regex, and replace the line with
+ "// SPDX-FileCopyrightText: ..." +
+(3) If there is a line that matches "^//.*BSD-style", replace it with
+ "// SPDX-License-Identifier: BSD-3-Clause" and remove the line
+ that starts with "// license ..." +
+(4) If there is no "// Copyright" line, get the year using the "git log"
+ command above and insert a new "// SPDX-FileCopyrightText: ..." line
+ using a predefined value.
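+
+Steps (1) to (3) can be sketched in Go as below.
+This is only an illustration, not the actual tool: the function name
+`convertHeader` and the regular expression are assumptions made for this
+example, and step (4), which needs the "git log" command, is left out.
+

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// reCopyright captures the year and the holder (name and email) from a line
// like "// Copyright 2018 Shulhan <ms@kilabit.info>. All rights reserved."
var reCopyright = regexp.MustCompile(`^// Copyright (\d{4}),? (.+<[^>]+>)`)

// convertHeader is a hypothetical helper that rewrites old header lines into
// their SPDX form.
// The lines are returned unchanged if any of them already starts with
// "// SPDX".
func convertHeader(lines []string) []string {
	for _, line := range lines {
		if strings.HasPrefix(line, "// SPDX") {
			return lines // Step 1: already converted, skip the file.
		}
	}
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		if m := reCopyright.FindStringSubmatch(line); m != nil {
			// Step 2: replace the copyright line.
			out = append(out, "// SPDX-FileCopyrightText: "+m[1]+" "+m[2])
			continue
		}
		if strings.HasPrefix(line, "//") && strings.Contains(line, "BSD-style") {
			// Step 3: replace the first license line.
			out = append(out, "// SPDX-License-Identifier: BSD-3-Clause")
			continue
		}
		if strings.HasPrefix(line, "// license ") {
			continue // Step 3: drop the second license line.
		}
		out = append(out, line)
	}
	return out
}

func main() {
	old := []string{
		"// Copyright 2018 Shulhan <ms@kilabit.info>. All rights reserved.",
		"// Use of this source code is governed by a BSD-style",
		"// license that can be found in the LICENSE file.",
		"",
		"package example",
	}
	for _, line := range convertHeader(old) {
		fmt.Println(line)
	}
}
```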
+
+It turns out there is another problem.
+
+A file can be
+https://reuse.software/faq/#exclude-file[excluded from REUSE compliance]
+if it is ignored by git, using the ".gitignore" file.
+And that is why we wrote a parser and checker for gitignore in Go.
+
+
+== Specification
+
+We use the
+https://git-scm.com/docs/gitignore[gitignore(5)^]
+manual as specification for the implementation.
+
+In short, the rules are as follows:
+
+* Each line is a pattern that will be matched against a file name or path.
+* An empty line is ignored.
+* A line that starts with '#' is a comment, unless the '#' is escaped with
+ a backslash '\\'.
+* Spaces before and after the line are ignored, unless escaped with a
+ backslash '\\'.
+* Character '/' is the directory separator.
+* Special character '?' in the pattern matches one character except '/'.
+* Special character '*' in the pattern matches zero or more characters
+ except '/'.
+* A pattern that ends with '/' matches only a directory with the same
+ name.
+
+When reading the above rules, my first thought was that this is similar to
+https://pkg.go.dev/path/filepath#Match[filepath.Match^].
+
+I was wrong.
+
+According to the examples given in the manual, the pattern "foo/" matches
+"foo" or "a/foo"; but the result of `filepath.Match` is different,
+
+----
+fmt.Println(filepath.Match("foo/", "foo"))
+fmt.Println(filepath.Match("foo/", "a/foo"))
+// Output:
+// false <nil>
+// false <nil>
+----
+
+Even if we remove the trailing slash, so that the pattern is "foo", the
+output is still not as expected,
+
+----
+fmt.Println(filepath.Match("foo", "foo"))
+fmt.Println(filepath.Match("foo", "a/foo"))
+// Output:
+// true <nil>
+// false <nil>
+----
+
+Continuing with the rules, there are other special characters that are not
+in line with `filepath.Match`.
+
+* Special character '!' at the beginning of a pattern means negation.
+ A file or directory that is excluded by a previous pattern is included
+ again if it matches this one.
+
+* A pattern "\*\*/foo" matches any file or directory named "foo" with
+ zero or more directories before it.
+* A pattern "foo/\*\*" matches any file or directory inside directory
+ "foo" but not the directory named "foo" itself.
+* A pattern "foo/\*\*/bar" matches a file or directory named "bar" inside
+ directory "foo", with zero or more directories in between.
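+
+The negation rule means that the order of the patterns matters: the last
+pattern that matches a path decides the result.
+A minimal sketch of that evaluation is below.
+The `rule` type, `sampleRules`, and `isIgnored` are hypothetical names for
+this example only, and the two regular expressions are hand-written
+stand-ins for the patterns "*.log" and "!keep.log".
+

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical pair of a compiled pattern and its negation flag.
type rule struct {
	re     *regexp.Regexp
	negate bool // True if the pattern was written with a leading '!'.
}

// sampleRules mimics a .gitignore file with two lines, "*.log" followed by
// "!keep.log".
func sampleRules() []rule {
	return []rule{
		{re: regexp.MustCompile(`^(.*/|/)?[^/]*\.log/?$`)},
		{re: regexp.MustCompile(`^(.*/|/)?keep\.log/?$`), negate: true},
	}
}

// isIgnored walks the rules in order; the last rule that matches wins.
func isIgnored(rules []rule, path string) bool {
	ignored := false
	for _, r := range rules {
		if r.re.MatchString(path) {
			ignored = !r.negate
		}
	}
	return ignored
}

func main() {
	rules := sampleRules()
	fmt.Println(isIgnored(rules, "debug.log")) // true
	fmt.Println(isIgnored(rules, "keep.log"))  // false
	fmt.Println(isIgnored(rules, "main.go"))   // false
}
```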
+
+
+== Implementation
+
+Based on the above specification, it seems that a simple `filepath.Match` or
+`path.Match` is not sufficient to handle the patterns.
+
+We need to convert those patterns into a regex that complies with the above
+rules:
+
+* If the pattern ends with '/', mark it as a directory, and remove the
+ trailing '/'.
+
+* Trim the "\*\*/" at the beginning of the pattern since it means anything
+ before.
+ Pattern "\*\*/foo" or "\*\*/\*\*/foo" is equal to "foo".
+
+* Ignore the pattern if what remains is an empty string or only '\*'.
+
+* Now, we need to detect if the pattern contains the directory separator '/'.
+ Let's find the index and store it as `$SEP_IDX` for later.
+
+* Escape regex meta-characters '.', '+', '|', '(', and ')' with
+ backslash '\\'.
+
+* Replace single character '\*' with regex "[^/]\*" (accept zero or more
+ characters except "/").
+
+* Replace single character '?' with regex "[^/]" (accept one character
+ except "/").
+
+* Replace string "/\*\*/" with regex "(/.\*)?/" (accept zero or more
+ directories in between).
+
+* Replace string "/\*\*" with regex "/(.\*)" (accept everything inside a
+ directory).
+
+* Replace string "\*\*" with regex "[^/]\*" (second pass for '\*').
+
+* Back to `$SEP_IDX`,
+** If no directory separator is found, prepend the pattern with
+ regex "\^(.\*/|/)?" (accept zero or more directories before).
+** If the directory separator is at the beginning or in the middle of the
+ pattern, prepend the pattern with regex "^/?" (do not accept any
+ directory before).
+
+* If the pattern is a directory (it ended with '/'), as we marked before,
+ append the '/' back followed by '$'; otherwise append regex "/?$" (accept
+ file or directory).
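+
+The steps above can be sketched as a single pass over the pattern.
+This is an illustration under assumptions, not the code in the library:
+`toRegex` is a hypothetical name, the "anywhere" prefix is written as
+"^(.*/|/)?", a leading '/' is dropped before the "^/?" prefix is added, and
+a trimmed "\*\*/" is remembered so that a pattern like "\*\*/foo/bar" is not
+anchored.
+Because it follows the steps as written, a pattern with a separator in the
+middle, such as "foo/bar", comes out anchored as "^/?foo/bar/?$".
+

```go
package main

import (
	"fmt"
	"strings"
)

// toRegex is a hypothetical helper that converts one gitignore pattern into a
// regular expression string.
// It returns an empty string if the pattern should be ignored.
func toRegex(pattern string) string {
	// A trailing '/' marks the pattern as directory-only.
	isDir := strings.HasSuffix(pattern, "/")
	pattern = strings.TrimSuffix(pattern, "/")

	// A leading "**/" means "anything before", so drop it.
	// Assumption: remember that it was there, so that "**/foo/bar" is not
	// anchored by the separator check below.
	anywhere := false
	for strings.HasPrefix(pattern, "**/") {
		pattern = pattern[3:]
		anywhere = true
	}
	if pattern == "" || pattern == "*" {
		return ""
	}

	// The $SEP_IDX check: without any '/' the pattern matches at any depth.
	if strings.IndexByte(pattern, '/') < 0 {
		anywhere = true
	}
	// Assumption: a leading '/' is already covered by the "^/?" prefix.
	pattern = strings.TrimPrefix(pattern, "/")

	var sb strings.Builder
	if anywhere {
		sb.WriteString(`^(.*/|/)?`)
	} else {
		sb.WriteString(`^/?`)
	}
	for i := 0; i < len(pattern); {
		rest := pattern[i:]
		switch {
		case strings.HasPrefix(rest, "/**/"):
			// Zero or more directories in between; the closing '/' is
			// left for the next round.
			sb.WriteString(`(/.*)?`)
			i += 3
		case rest == "/**":
			sb.WriteString(`/(.*)`) // Everything inside the directory.
			i += 3
		case strings.HasPrefix(rest, "**"):
			sb.WriteString(`[^/]*`) // Any other "**" acts like a single '*'.
			i += 2
		case rest[0] == '*':
			sb.WriteString(`[^/]*`)
			i++
		case rest[0] == '?':
			sb.WriteString(`[^/]`)
			i++
		case strings.IndexByte(".+|()", rest[0]) >= 0:
			sb.WriteByte('\\') // Escape the regex meta-character.
			sb.WriteByte(rest[0])
			i++
		default:
			sb.WriteByte(rest[0])
			i++
		}
	}
	if isDir {
		sb.WriteString(`/$`)
	} else {
		sb.WriteString(`/?$`)
	}
	return sb.String()
}

func main() {
	for _, p := range []string{"foo", "/foo/", "foo/**/bar", "*.log"} {
		fmt.Printf("%-12s => %s\n", p, toRegex(p))
	}
}
```
+
+Only the characters listed in the steps are escaped, so a pattern that
+contains other regex meta-characters would need more care before it is
+compiled.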
+
+For example, here is a list of patterns and their conversion to regex,
+
+* foo or \*\*/foo => \^(.\*/|/)?foo/?$
+* foo\* => \^(.\*/|/)?foo[\^/]\*/?$
+* foo? => \^(.\*/|/)?foo[\^/]/?$
+* foo/ or \*\*/foo/ => \^(.\*/|/)?foo/$
+* foo/\*\* => \^(.\*/|/)?foo/(.*)/?$
+* /foo => \^/?foo/?$
+* /foo/ => \^/?foo/$
+* foo/bar => \^(.\*/|/)?foo/bar/?$
+* foo/bar/ => \^(.\*/|/)?foo/bar/$
+* /foo/bar => \^/?foo/bar/?$
+* foo/\*\*/bar => \^/?foo(/.*)?/bar/?$
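+
+To get a feeling for how these regexes behave, we can feed two of them to
+the standard `regexp` package.
+The helper `matches` below is a hypothetical name used only for this check;
+the two regexes are copied from the list above.
+

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether path is matched by the regex expr.
func matches(expr, path string) bool {
	return regexp.MustCompile(expr).MatchString(path)
}

func main() {
	const anyFoo = `^(.*/|/)?foo/?$` // From pattern "foo".
	const dirFoo = `^(.*/|/)?foo/$`  // From pattern "foo/".

	fmt.Println(matches(anyFoo, "foo"))     // true
	fmt.Println(matches(anyFoo, "a/foo"))   // true
	fmt.Println(matches(anyFoo, "foobar"))  // false
	fmt.Println(matches(anyFoo, "a/foo/b")) // false
	fmt.Println(matches(dirFoo, "foo"))     // false
	fmt.Println(matches(dirFoo, "a/foo/"))  // true
}
```
+
+The regex for "foo/" only matches a path that ends with '/', so with this
+scheme the caller presumably has to pass a directory with a trailing slash;
+that is a reading of the regex itself, not something the list states.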
+
+The result of the implementation can be viewed here:
+https://git.sr.ht/~shulhan/pakakeh.go/tree/main/item/lib/git/[lib/git^].
+
+
+The API is quite simple.
+First, load the ".gitignore" from a directory using
+{lib_git_ref}/gitignore.go#L37[`LoadGitignore()`^],
+and then check if a path is excluded using
+{lib_git_ref}/git.go#L246[`IsIgnored()`^].
+
+----
+func LoadGitignore(dir string) (ign *Gitignore)
+----
+
+LoadGitignore loads the gitignore file inside directory `dir`. Any invalid
+pattern will be ignored.
+
+----
+func (ign *Gitignore) IsIgnored(path string) bool
+----
+
+IsIgnored returns true if the `path` is ignored by this Gitignore content.
+The `path` is relative to the Gitignore directory.
+
+There is also a type
+{lib_git_ref}/ignore_pattern.go[`IgnorePattern`^]
+that one can import and use for other implementations, for example handling
+the `path` value in the REUSE.toml annotations table.
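+
+Going back to the header conversion tool, the check could be used to skip
+ignored files before touching them.
+The sketch below does not import the library so that it can run on its own:
+`ignoreChecker`, `suffixIgnorer`, and `keepTracked` are hypothetical names,
+and `suffixIgnorer` is only a stand-in for the value returned by
+`LoadGitignore`, which has an `IsIgnored` method with the same signature.
+

```go
package main

import (
	"fmt"
	"strings"
)

// ignoreChecker lists the only method this sketch needs.
type ignoreChecker interface {
	IsIgnored(path string) bool
}

// suffixIgnorer is a hypothetical stand-in that ignores every path with the
// given suffix, so the example runs without the library.
type suffixIgnorer string

func (s suffixIgnorer) IsIgnored(path string) bool {
	return strings.HasSuffix(path, string(s))
}

// keepTracked returns the paths that are not ignored, in their original
// order.
func keepTracked(ign ignoreChecker, paths []string) []string {
	var out []string
	for _, path := range paths {
		if ign.IsIgnored(path) {
			continue
		}
		out = append(out, path)
	}
	return out
}

func main() {
	paths := []string{"main.go", "build/app.bin", "README"}
	fmt.Println(keepTracked(suffixIgnorer(".bin"), paths)) // [main.go README]
}
```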
+
+// REUSE-IgnoreEnd
diff --git a/_content/journal/2026/index.adoc b/_content/journal/2026/index.adoc
new file mode 100644
index 0000000..3f0c5cd
--- /dev/null
+++ b/_content/journal/2026/index.adoc
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: CC-BY-SA-4.0
+// SPDX-FileCopyrightText: 2026 M. Shulhan <ms@kilabit.info>
+
+=== 2026
+
+link:/journal/2026/go_gitignore/[Gitignore package for Go].
+My thoughts when implementing a gitignore parser and checker for the Go
+programming language.
diff --git a/_content/journal/index.adoc b/_content/journal/index.adoc
index ec6df32..18c361d 100644
--- a/_content/journal/index.adoc
+++ b/_content/journal/index.adoc
@@ -3,6 +3,8 @@
:toc:
+include::./2026/index.adoc[]
+
include::./2025/index.adoc[]
include::./2024/index.adoc[]