**🌼 brokenlinks: refactoring the logic, simplify the code**
Previously, the scan logic ran in multiple goroutines, with one channel
to push and consume results and another channel to push and pop links
to be processed.
That logic was complicated, making it hard to read and debug.
These changes refactor it to use a single goroutine that pushes and
pops links from a slice used as a queue.
Another refactoring is in where we store link metadata.
Previously, we had [jarink.Link], [brokenlinks.Broken], and
[brokenlinks.linkQueue] to store the metadata for a link.
This release unifies them into the single struct [jarink.Link].
**🌱 brokenlinks: print the progress to stderr**
Each time a scan starts, a new link is queued, or a fetch starts, print
a message to stderr.
This removes the verbose option for a better user experience.
**🌼 brokenlinks: improve fetch logging and decrease timeout to 10s**
When fetching, print a log line after the fetch completes.
On success, print the URL along with the HTTP status code.
On failure, print the URL along with the error.
The timeout is now reduced to 10 seconds to prevent long delays when
working with a broken website.
**🌼 brokenlinks: mark the link in queue as seen with status code 0**
This fixes a duplicate URL being pushed to the queue.
Given the following queue and its parent,
----
/page2.html => /index.html
/brokenPage => /index.html
/brokenPage => /page2.html
----
Before scanning the second "/brokenPage" on parent page "/page2.html",
check whether it has been seen to get its status code before running
the scan.
This allows jarink to report "/brokenPage" as a broken link on both
pages, not just on "/index.html".
**🌼 brokenlinks: skip parsing non-HTML pages**
If the response Content-Type is anything other than "text/html", skip
parsing the content and return immediately.
We also skip processing "mailto\:" URLs.
**🌼 brokenlinks: make links that return HTML always end with a slash**
If a parent URL like "/page" returns an HTML body, the URL should end
with a slash so that the relative links inside it work when joined
with the parent URL.
**🌱 brokenlinks: store the anchor or image source in link**
In the struct `Link`, we add a field `Value` that stores the `href`
from an A element or the `src` from an IMG element.
This allows us to debug any error during a scan, especially when
joining paths and links.
**🌼 brokenlinks: fix possible panic in markAsBroken**
If the Link does not have a `parentUrl`, set the parent URL to the
link URL itself.
This only happens if the target URL that we are about to scan returns
an error.
|
|
Reword some paragraphs, format the code, and add an INSTALL section.
|
|
The build task sets the Version information based on the latest tag
and the number of commits.
|
|
If the Link does not have parentUrl, set the parent URL to the link
URL itself.
This only happens if the target URL that we are about to scan returns
an error.
|
|
In the struct Link, we add a field Value that stores the href from an
A element or the src from an IMG element.
This allows us to debug any error during a scan, especially when
joining paths and links.
|
|
If a parent URL like "/page" returns an HTML body, the URL should end
with a slash so that the relative links inside it work when joined
with the parent URL.
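A sketch of why the trailing slash matters: Go's net/url resolves a
relative link against the base URL's directory, so "/page" and "/page/"
give different results. The helper names below are illustrative only,
not jarink's actual code.

```go
package main

import (
	"fmt"
	"net/url"
)

// ensureDirURL appends a trailing slash so that relative links
// resolve against the page itself, not its parent directory.
func ensureDirURL(raw string) string {
	if len(raw) == 0 || raw[len(raw)-1] == '/' {
		return raw
	}
	return raw + "/"
}

// resolve joins a relative reference with a parent URL using the
// standard RFC 3986 resolution from net/url.
func resolve(parent, ref string) string {
	base, err := url.Parse(parent)
	if err != nil {
		return ""
	}
	rel, err := url.Parse(ref)
	if err != nil {
		return ""
	}
	return base.ResolveReference(rel).String()
}

func main() {
	// Without the slash, "style.css" resolves to /style.css.
	fmt.Println(resolve("https://web.tld/page", "style.css"))
	// With the slash, it resolves to /page/style.css.
	fmt.Println(resolve(ensureDirURL("https://web.tld/page"), "style.css"))
}
```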
|
|
This is to see the behaviour of [Node.Descendants] when traversing
elements recursively.
|
|
Given the following queue and its parent,
/page2.html => /index.html
/brokenPage => /index.html
/brokenPage => /page2.html
Before scanning the second "/brokenPage" on parent page "/page2.html",
check whether it has been seen to get its status code before running
the scan.
This allows jarink to report "/brokenPage" as a broken link on both
pages, not just on "/index.html".
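A minimal sketch of the seen-check, assuming a map from URL to status
code in which 0 marks a queued-but-unfetched link; the function and map
names are made up for illustration:

```go
package main

import "fmt"

// decide reports what to do with a link based on the seen map:
// not present = push to queue; 0 = already queued, wait for its
// fetch; >= 400 = known broken, report for this parent as well.
func decide(seen map[string]int, link string) string {
	code, ok := seen[link]
	switch {
	case !ok:
		return "queue"
	case code == 0:
		return "wait"
	case code >= 400:
		return "report-broken"
	default:
		return "ok"
	}
}

func main() {
	seen := map[string]int{
		"/page2.html": 200,
		"/brokenPage": 404, // fetched earlier via /index.html
	}
	// The second occurrence from /page2.html is reported too,
	// instead of being silently skipped as a duplicate.
	fmt.Println(decide(seen, "/brokenPage")) // report-broken
	fmt.Println(decide(seen, "/new.html"))   // queue
}
```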
|
|
If the request is redirected, use the "Location" value from the
response header as the parent URL instead of the original link in the
queue.
|
|
If the parent URL ends with .html or .htm, join the relative path with
the parent's directory instead of the current path.
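The rule above can be sketched with the standard path package; the
helper name is made up, and the real jarink code may join paths
differently:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// joinPath joins rel onto parentPath. If the parent ends with
// .html or .htm, the relative path is joined with the parent's
// directory instead of the path itself.
func joinPath(parentPath, rel string) string {
	base := parentPath
	if strings.HasSuffix(base, ".html") || strings.HasSuffix(base, ".htm") {
		base = path.Dir(base)
	}
	return path.Join(base, rel)
}

func main() {
	fmt.Println(joinPath("/docs/page.html", "img/a.png")) // /docs/img/a.png
	fmt.Println(joinPath("/docs/", "img/a.png"))          // /docs/img/a.png
}
```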
|
|
If the response Content-Type is anything other than "text/html", skip
parsing the content and return immediately.
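A sketch of the Content-Type check, using mime.ParseMediaType so that
parameters such as "; charset=utf-8" do not defeat the comparison; the
helper names are illustrative:

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isHTML reports whether a Content-Type header value denotes an
// HTML page. ParseMediaType strips parameters like "; charset=utf-8".
func isHTML(contentType string) bool {
	mt, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return mt == "text/html"
}

// skipLink reports whether a raw link should not be processed at
// all, e.g. "mailto:" URLs.
func skipLink(raw string) bool {
	return strings.HasPrefix(raw, "mailto:")
}

func main() {
	fmt.Println(isHTML("text/html; charset=utf-8")) // true
	fmt.Println(isHTML("application/pdf"))          // false
	fmt.Println(skipLink("mailto:me@web.tld"))      // true
}
```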
|
|
This fixes a duplicate URL being pushed to the queue.
|
|
TestScan_slow takes around 11 seconds because the test includes
[time.Sleep].
|
|
When fetching, print a log line after the fetch completes.
On success, print the URL along with the HTTP status code.
On failure, print the URL along with the error.
The timeout is now reduced to 10 seconds to prevent long delays when
working with a broken website.
|
|
Each time a scan starts, a new link is queued, or a fetch starts, print
a message to stderr.
This removes the verbose option for a better user experience.
|
|
Previously, we had [jarink.Link], [brokenlinks.Broken], and
[brokenlinks.linkQueue] to store the metadata for a link.
These changes unify them into the single struct [jarink.Link].
|
|
Previously, the scan logic ran in multiple goroutines, with one channel
to push and consume results and another channel to push and pop links
to be processed.
That logic was complicated, making it hard to read and debug.
These changes refactor it to use a single goroutine that pushes and
pops links from a slice used as a queue.
|
|
This is so the README can be rendered on pkg.go.dev and on git.sr.ht.
While at it, group the documentation files under the _doc/ directory.
|
|
**🌼 brokenlinks: fix infinite loop on unknown host**
On a link with an invalid domain, it should stop and return the error
immediately.
|
|
On a link with an invalid domain, it should stop and return the error
immediately.
|
|
**🌱 brokenlinks: add option to ignore a list of HTTP status codes**.
When a link is known to have issues, one can ignore its status code
while scanning for broken links using the "-ignore-status" option.
**🌱 brokenlinks: add option "insecure"**.
With the "-insecure" option, servers with invalid certificates are not
reported as errors.
**🌱 brokenlinks: implement caching for external URLs**.
Any successful fetch of an external URL will be recorded in the jarink
cache file, located in the user's cache directory.
For example, on Linux it would be `$HOME/.cache/jarink/cache.json`.
This helps speed up future rescans of the same or a different target
URL, minimizing network requests.
**🌼 brokenlinks: reduce the number of goroutines on scan**.
Previously, each scan ran in one goroutine and the result was pushed
from another goroutine.
This made one link scan consume two goroutines.
This changes the scan function to return the result and push it in the
same goroutine.
|
|
The version command prints the version of the program.
|
|
Previously, each scan ran in one goroutine and the result was pushed
by pushResult, also in another goroutine.
This made one link consume two goroutines.
This changes the scan function to return the result and push it in the
same goroutine.
|
|
Any successful fetch of an external URL will be recorded in the jarink
cache file, located in the user's cache directory.
For example, on Linux it would be `$HOME/.cache/jarink/cache.json`.
This helps speed up future rescans of the same or a different target
URL, minimizing network requests.
|
|
The test runs a server that contains several pages with various
[time.Sleep] durations before returning the response.
This allows us to see how the main scan loop works, waiting on
resultq and listWaitStatus.
|
|
There are two test cases: one for an invalid status code like "abc",
and one for an unknown status code like "50".
|
|
Before the Options are passed to the worker, they should be validated,
including the URL to be scanned.
|
|
This is for git.sr.ht to be able to render the README.
|
|
The insecure option allows servers with invalid certificates and does
not report them as errors.
|
|
When a link is known to have issues, one can ignore its status code
while scanning for broken links using the "-ignore-status" option.
|
|
The first release of jarink provides the command "brokenlinks" to scan
for broken links.
The output of this command is a list of pages with their broken links
in JSON format.
This command accepts the following options:
`-verbose`::
Print the page that is being scanned to standard error.
`-past-result=<path to JSON file>`::
Scan only the pages reported by a past scan, based on the content of
the JSON file.
This minimizes the time to re-scan the pages once we have fixed the
URLs.
|
|
When two or more structs share the same prefix, it is time to move
them into a group.
Also, we will group one command per package in the future.
|
|
Naming it page_links does not make sense when the result comes from
the brokenlinks command.
|
|
Using HTTP HEAD on certain pages may return
* 404, not found, for example on
https://support.google.com/accounts/answer/1066447
* 405, method not allowed, for example on
https://aur.archlinux.org/packages/rescached-git
For a 405 response we can retry with GET, but for a 404 it is
impossible to check whether the URL really exists, since 404 means the
page was not found.
|
|
When the call to HTTP HEAD or GET returns an error and the error is a
*net.DNSError with Timeout set, retry the call up to 5 times, until it
succeeds or times out again.
|
|
Previously, we only encoded BrokenlinksResult.PageLinks.
The struct may change in the future, so it is better to encode the
whole struct now rather than changing the output later.
|
|
The brokenlinks command now has an option "-past-result" that accepts
a path to the JSON file from a past result.
If it is set, the program will only scan the pages with broken links
in that report.
|
|
Previously, if we passed a URL with a path to brokenlinks, for example
"web.tld/path", it would scan all of the pages on the website
"web.tld".
Now, it only scans "/path" and its sub-paths.
|
|
The worker logs with date and time, while the main program does not.
|
|
The README contains the content from the usage function in
"cmd/jarink".
|