From 797faa817881ea63271d5c6794b80ccd644cc76c Mon Sep 17 00:00:00 2001 From: Shulhan Date: Fri, 2 Jan 2026 17:21:26 +0700 Subject: doc: add index and reformat some document using asciidoc This is for publication of doc under https://kilabit.info/project/vos . --- doc/.gitignore | 1 + doc/dev/GOAL | 263 ----------------------------------------------------- doc/dev/GOAL.adoc | 258 ++++++++++++++++++++++++++++++++++++++++++++++++++++ doc/dev/NOTES | 122 ------------------------- doc/dev/NOTES.adoc | 135 +++++++++++++++++++++++++++ doc/index.adoc | 26 ++++++ 6 files changed, 420 insertions(+), 385 deletions(-) create mode 100644 doc/.gitignore delete mode 100644 doc/dev/GOAL create mode 100644 doc/dev/GOAL.adoc delete mode 100644 doc/dev/NOTES create mode 100644 doc/dev/NOTES.adoc create mode 100644 doc/index.adoc diff --git a/doc/.gitignore b/doc/.gitignore new file mode 100644 index 0000000..2d19fc7 --- /dev/null +++ b/doc/.gitignore @@ -0,0 +1 @@ +*.html diff --git a/doc/dev/GOAL b/doc/dev/GOAL deleted file mode 100644 index 7578e06..0000000 --- a/doc/dev/GOAL +++ /dev/null @@ -1,263 +0,0 @@ -Vos Goals ----------- - Taken from CoSort Technical Specifications. - - -Legend: -- : unimplemented -+ : implemented -= : on going/half done -? : is it worth/why/what is that mean - - -Ease of Use ------------ - -- Processes record layouts and SQL­like field definitions from central data - dictionaries. - -- Converts and processes native COBOL copybook, Oracle SQL*Loader control - file, CSV, and W3C extended log format (ELF) file layouts. - -- SortCL data definition files are a supported MIMB metadata format. - -- Mix of on­line help, pre­runtime application validation, and runtime - error messages. - -- Leverages centralized application and file layout definitions (metadata - repositories). - -= Reports problems to standard error when invoked from a program, or - to an error log. - -- Runs silently or with verbose messaging without user intervention. 
- -- Allows user control over the amount of informational output produced. - -- Generates a query­ready XML audit log for data forensics and privacy - compliance. - -= Describes commands and options through man pages and on­line documentation. - - it's half done because the program is always moving to a new features. - it's not wise to mark this as 'done'. - -- Easy­to­use interfaces and seamless third­party sort replacements preclude - the need for training classes - - -Resource Control ----------------- - -+ Sets and allows user modification of the maximum and minimum number of - concurrent sort threads for sorting on multi­CPU and multi­core systems. - - using PROCESS_MAX variable. - -+ Uses a specified directory, a combination of directories, for temporary work - files. - - using PROC_TMP_DIR variable. - -+ Limits the amount of main and virtual memory used during sort operations. - - using PROCESS_MAX_ROW variable. - - Since input file size is unpredictable and a human is still need to - run the program, the amount of program memory still cannot decide by - human. What if it's set to 1 kilobytes ?. - -+ Sets the size of the memory blocks used as physical I/O buffers. - - using FILE_BUFFER_SIZE variable. - - -Input and Output  ----------------- - -= Processes any number of files, of any size, and any number of records, - fixed or variable length to 65,535 bytes passed from an input procedure, - from stdin, a named pipe, a table in memory, or from an application program. - - - TODO: from stdin - - TODO: from a named pipe. - - TODO: from a table in memory. - - TODO: from an application program. - -? Supports the use of environment variables. - - for what ? - -= Supports wildcards in the specification of input and output files, as well - as absolute path names and aliases. - - - TODO: supports wildcards in the specification of input files. - -+ Accepts and outputs fixed­ or variable­length records with delimited field. - -? 
Generates one or more output files, and/or summary information, including - formatted and dashboard­ready reports. - -- Returns sorted, merged, or joined records one (or more) at a time to an output - procedure, to stdout (or named pipe), a table in memory, one or more new or - existing files, or to a program. - -- Outputs optional sequence numbers with each record, at any starting value, for - indexed loads and/or reports. - - -Record Selection and Grouping ------------------------------ - -= Includes or omits input or output records using field­to­field or field­constant - comparisons. - - TODO: field-to-field comparisons - -- Compares on any number of data fields, using standard and alternate collating - sequences. - -+ Sorts and/or reformats groups of selected records. - - using SORT and CREATE statement. - -+ Matches two or more sorted or unsorted files on inner and outer join criteria using - SQL­based condition syntax. - - using JOIN with '+' or '-' statement. - -- Skips a specified number of records, bytes, or a file header or footer. - -- Processes a specified number of records or bytes, including a saved header. - -- Eliminates or saves records with duplicate keys. - - -Sort Key Processing -------------------- - -+ Allows any number of key fields to be specified in ascending or - descending order. - - using SORT x by x.f1 ASC; or - using SORT x by x.f1 DESC; - -+ Supports any number of fields from 0 to 65,535 bytes in length. - - almost unlimited, the limit is your memory. - -+ Orders fixed position fields, or floating fields with one or more - delimiters. - -- Supports numeric keys, including all C, FORTRAN, and COBOL data types. - -- Supports single­ and multi­byte character keys, including ASCII, EBCDIC, - ASCII in EBCDIC sequence, American, European, ISO and Japanese timestamps, - and natural (locale­dependent) values, as well as Unicode and double­byte - characters such as Big5, EUC­TW, UTF32, and S­JIS. 
- -- Allows left or right alignment and case shifting of character keys. - -- Accepts user compare procedures for multi­byte, encrypted and other - special data. - -- Performs record sequence checking. - -+ Maintains input record order (stability) on duplicate keys. - -- Controls treatment of null fields when specifying floating - (character separated) keys. - -- Collates and converts between many of the following data types (formats): - --- - - -Record Reformatting -------------------- - -+ Inserts, removes, resizes, and reorders fields within records; defines new - fields. - -- Converts data in fields from one format to another either using internal - conversion. - -- Maps common fields from differently formatted input files to a uniform sort - record. - -= Joins any fields from several files into an output record, usually based on a - condition. - - using JOIN statement. current support only in joining two input files. - -- Changes record layouts from one file type to another, including: Line - Sequential, Record Sequential, Variable Sequential, Blocked, Microsoft Comma - Separated Values (CSV), ACUCOBOL Vision, MF I­SAM, MFVL, Unisys VBF, VSAM - (within UniKik MBM), Extended Log Format (W3C), LDIF, and XML. - -- Maps processed records to many differently formatted output files, including - HTML. - -- Writes multiple record formats to the same file for complex report - requirements. - -- Performs mathematical expressions and functions on field data (including - aggregate data) to generate new output fields. - -- Calculates the difference in days, hours, minutes and seconds betweeen - timestamps. - - -Field Reformatting/Validation ------------------------------ - -- Aligns desired field contents to either the left or right of the target - field, where any leading or trailing fill characters from the source are - moved to the opposite side of the string. - -- Processes values from multi­dimensional, tab­delimited lookup files. 
- -- Creates and processes sub­strings of original field contents, where you can - specify a positive or negative offset and a number of bytes to be contained - in the sub­string. - -- Finds a user­specified text string in a given field, and replaces all - occurrences of it with a different user­specified text string in the target - field. - -- Supports Perl Compatible Regular Expressions (PCRE), including pattern - matching. - -- Uses C­style “iscompare” functions to validate contents at the field level - (for example, to determine if all field characters are printable), which can - also be used for record­filtering via selection statements. - -- Protects sensitive field data with field­level de­identification and AES­256 - encryption routines, along with anonymization, pseudonymization, filtering - and other column-level data masking and obfuscation techniques. - -- Supports custom, user­written field­level transformation libraries, and - documents an example of a field­level data cleansing routine from - Melissa Data (AddressObject). - - -Record Summarization --------------------- - -- Consolidates records with equal keys into unique records, while totaling, - averaging, or counting values in specified fields, including derived - (cross­calculated) fields. - -- Produces maximum, minimum, average, sum, and count fields. - -- Displays running summary value(s) up to a break (accumulating aggregates). - -- Nreaks on compound conditions. - -- Allows multiple levels of summary fields in the same report. - -- Re­maps summary fields into a new format, allowing relational tables. - -- Ranks data through a running count with descending numeric values. - -- Writes detail and summary records to the same output file for structured - reports. diff --git a/doc/dev/GOAL.adoc b/doc/dev/GOAL.adoc new file mode 100644 index 0000000..74370fb --- /dev/null +++ b/doc/dev/GOAL.adoc @@ -0,0 +1,258 @@ += Vos Goals + +Taken from CoSort Technical Specifications. 
+
+Legend:
+
+* - : unimplemented
+* + : implemented
+* = : on going/half done
+* ? : is it worth it/why/what does it mean
+
+
+== Ease of Use
+
+(-) Processes record layouts and SQL-like field definitions from central
+data dictionaries.
+
+(-) Converts and processes native COBOL copybook, Oracle SQL*Loader control
+file, CSV, and W3C extended log format (ELF) file layouts.
+
+(-) SortCL data definition files are a supported MIMB metadata format.
+
+(-) Mix of on-line help, pre-runtime application validation, and runtime
+error messages.
+
+(-) Leverages centralized application and file layout definitions
+(metadata repositories).
+
+(=) Reports problems to standard error when invoked from a program, or
+to an error log.
+
+(-) Runs silently or with verbose messaging without user intervention.
+
+(-) Allows user control over the amount of informational output produced.
+
+(-) Generates a query-ready XML audit log for data forensics and privacy
+compliance.
+
+(=) Describes commands and options through man pages and on-line
+documentation.
+
+It is half done because the program keeps gaining new features;
+it is not wise to mark this as 'done'.
+
+(-) Easy-to-use interfaces and seamless third-party sort replacements
+preclude the need for training classes.
+
+
+== Resource Control
+
+(+) Sets and allows user modification of the maximum and minimum number of
+concurrent sort threads for sorting on multi-CPU and multi-core systems.
+
+Using the PROCESS_MAX variable.
+
+(+) Uses a specified directory, or a combination of directories, for
+temporary work files.
+
+Using the PROC_TMP_DIR variable.
+
+(+) Limits the amount of main and virtual memory used during sort
+operations.
+
+Using the PROCESS_MAX_ROW variable.
+
+Since the input file size is unpredictable and a human still needs to
+run the program, the amount of program memory cannot be decided by a
+human. What if it were set to 1 kilobyte?
+
+(+) Sets the size of the memory blocks used as physical I/O buffers.
+
+Using the FILE_BUFFER_SIZE variable.
+
+
+== Input and Output
+
+(=) Processes any number of files, of any size, and any number of records,
+fixed or variable length up to 65,535 bytes, passed from an input procedure,
+from stdin, a named pipe, a table in memory, or from an application program.
+
+- TODO: from stdin.
+- TODO: from a named pipe.
+- TODO: from a table in memory.
+- TODO: from an application program.
+
+(?) Supports the use of environment variables.
+
+(=) Supports wildcards in the specification of input and output files, as
+well as absolute path names and aliases.
+
+- TODO: support wildcards in the specification of input files.
+
+(+) Accepts and outputs fixed- or variable-length records with delimited
+fields.
+
+(?) Generates one or more output files, and/or summary information,
+including formatted and dashboard-ready reports.
+
+(-) Returns sorted, merged, or joined records one (or more) at a time to an
+output procedure, to stdout (or a named pipe), a table in memory, one or
+more new or existing files, or to a program.
+
+(-) Outputs optional sequence numbers with each record, at any starting
+value, for indexed loads and/or reports.
+
+
+== Record Selection and Grouping
+
+(=) Includes or omits input or output records using field-to-field or
+field-constant comparisons.
+
+TODO: field-to-field comparisons.
+
+(-) Compares on any number of data fields, using standard and alternate
+collating sequences.
+
+(+) Sorts and/or reformats groups of selected records.
+
+Using the SORT and CREATE statements.
+
+(+) Matches two or more sorted or unsorted files on inner and outer join
+criteria using SQL-based condition syntax.
+
+Using the JOIN statement with '+' or '-'.
+
+(-) Skips a specified number of records, bytes, or a file header or footer.
+
+(-) Processes a specified number of records or bytes, including a saved
+header.
+
+(-) Eliminates or saves records with duplicate keys.
+
+
+== Sort Key Processing
+
+(+) Allows any number of key fields to be specified in ascending or
+descending order.
+
+    using SORT x by x.f1 ASC; or
+    using SORT x by x.f1 DESC;
+
+(+) Supports any number of fields from 0 to 65,535 bytes in length.
+
+Almost unlimited; the limit is your memory.
+
+(+) Orders fixed position fields, or floating fields with one or more
+delimiters.
+
+(-) Supports numeric keys, including all C, FORTRAN, and COBOL data types.
+
+(-) Supports single- and multi-byte character keys, including ASCII, EBCDIC,
+ASCII in EBCDIC sequence, American, European, ISO and Japanese timestamps,
+and natural (locale-dependent) values, as well as Unicode and double-byte
+characters such as Big5, EUC-TW, UTF32, and S-JIS.
+
+(-) Allows left or right alignment and case shifting of character keys.
+
+(-) Accepts user compare procedures for multi-byte, encrypted and other
+special data.
+
+(-) Performs record sequence checking.
+
+(+) Maintains input record order (stability) on duplicate keys.
+
+(-) Controls treatment of null fields when specifying floating
+(character separated) keys.
+
+(-) Collates and converts between many of the following data types
+(formats).
+
+
+== Record Reformatting
+
+(+) Inserts, removes, resizes, and reorders fields within records; defines
+new fields.
+
+(-) Converts data in fields from one format to another using internal
+conversion.
+
+(-) Maps common fields from differently formatted input files to a uniform
+sort record.
+
+(=) Joins any fields from several files into an output record, usually based
+on a condition.
+
+Using the JOIN statement; it currently supports joining only two input
+files.
+
+(-) Changes record layouts from one file type to another, including: Line
+Sequential, Record Sequential, Variable Sequential, Blocked, Microsoft Comma
+Separated Values (CSV), ACUCOBOL Vision, MF I-SAM, MFVL, Unisys VBF, VSAM
+(within UniKik MBM), Extended Log Format (W3C), LDIF, and XML.
+
+(-) Maps processed records to many differently formatted output files,
+including HTML.
+
+(-) Writes multiple record formats to the same file for complex report
+requirements.
+
+(-) Performs mathematical expressions and functions on field data (including
+aggregate data) to generate new output fields.
+
+(-) Calculates the difference in days, hours, minutes and seconds between
+timestamps.
+
+
+== Field Reformatting/Validation
+
+(-) Aligns desired field contents to either the left or right of the target
+field, where any leading or trailing fill characters from the source are
+moved to the opposite side of the string.
+
+(-) Processes values from multi-dimensional, tab-delimited lookup files.
+
+(-) Creates and processes sub-strings of original field contents, where you
+can specify a positive or negative offset and a number of bytes to be
+contained in the sub-string.
+
+(-) Finds a user-specified text string in a given field, and replaces all
+occurrences of it with a different user-specified text string in the target
+field.
+
+(-) Supports Perl Compatible Regular Expressions (PCRE), including pattern
+matching.
+
+(-) Uses C-style "iscompare" functions to validate contents at the field
+level (for example, to determine if all field characters are printable),
+which can also be used for record-filtering via selection statements.
+
+(-) Protects sensitive field data with field-level de-identification and
+AES-256 encryption routines, along with anonymization, pseudonymization,
+filtering and other column-level data masking and obfuscation techniques.
+
+(-) Supports custom, user-written field-level transformation libraries, and
+documents an example of a field-level data cleansing routine from
+Melissa Data (AddressObject).
+
+
+== Record Summarization
+
+(-) Consolidates records with equal keys into unique records, while
+totaling, averaging, or counting values in specified fields, including
+derived (cross-calculated) fields.
+ +(-) Produces maximum, minimum, average, sum, and count fields. + +(-) Displays running summary value(s) up to a break (accumulating +aggregates). + +(-) Breaks on compound conditions. + +(-) Allows multiple levels of summary fields in the same report. + +(-) Re­maps summary fields into a new format, allowing relational tables. + +(-) Ranks data through a running count with descending numeric values. + +(-) Writes detail and summary records to the same output file for structured +reports. diff --git a/doc/dev/NOTES b/doc/dev/NOTES deleted file mode 100644 index 92bf86c..0000000 --- a/doc/dev/NOTES +++ /dev/null @@ -1,122 +0,0 @@ - sometimes i forgot why i write code like this. - -- S.T.M.L - -- follow linux coding style - -- priority of source code (4S) : - + stable - + simple - + small - + secure (this option does not need for this program) - -- keep as small as possible: - + remove unneeded space - + remove unneeded variable - -- write comment/documentation as clear as possible - -- learn to use: - + if (1 == var) - -- learn to avoid: - + (i < strlen(str)) - on loop statement because strlen() need temporary variable. - try, - l = strlen(str); - while (i < l) { ... } - -- use function in libc as much as possible; if not, wrap it! - - - -001 - I/O Relation between Statement ------------------------------------------------------------------------------ -LOAD is an input statement. - -SORT, CREATE, JOIN is an output statement, but it can be an input. -i.e: - - 1 - load abc ( ... ) as x; - 2 - sort x by a, b; - 3 - create ghi ( x.field, ... ) as out_x; - -file output created by sort statement in line 2 will be an input by create -statement in line 3. - - -002 - Why we need '2nd-loser' ------------------------------------------------------------------------------ - -to minimize comparison and insert in merge tree. 
- - - -003 - Why we need 'level' on tree node ------------------------------------------------------------------------------ - -list of input file to merge is A, B, C contain sorted data : - - A : 10, 11, 12, 13 (1st file) - B : 1, 12, 100, 101 (2nd file) - C : 2, 13, 200, 201 (3rd file) - -if we use tree insert algorithm: - - if (root < node) - insert to left - else - insert to right - -after several step we will get: - -B-12 - \ - C-13 - / -A-12 - -which result in not-a-stable sort, - - B-1 C-2 A-10 A-11 B-12 A-12 ... - -they should be, - - B-1 C-2 A-10 A-11 A-12 B-12 ... - -Even if we choose different algorithm in insert: - - if (root <= node) - insert to left - else - insert to right - -there is also input data that will violate this, i.e: - - A : 2, 13, 200, 201 (1st file) - B : 1, 12, 100, 101 (2nd file) - C : 10, 11, 12, 13 (3rd file) - - -004 - recursives call + thread + free on SunOS 5.10 ------------------------------------------------------------------------------ - -i did not investigate much, but doing a recursive call + thread + free cause -SIGSEGV on SunOS 5.10 system, but not in GNU/Linux system. This odd's found -whee testing on Solaris and by using dbx the SIGSEGV "sometimes" catched in -str_destroy, - - if (str->buf) - free(str->buf); <= dbx catch here - -and "sometimes" below that (but not in vos function/stack). - -i.e: - list_destroy(**ptr) - { - if (! (*ptr)) - return; - list_destroy((*ptr)->next); - free((*ptr)); - } - -and no, it's not about double free. diff --git a/doc/dev/NOTES.adoc b/doc/dev/NOTES.adoc new file mode 100644 index 0000000..72d1306 --- /dev/null +++ b/doc/dev/NOTES.adoc @@ -0,0 +1,135 @@ + sometimes i forgot why i write code like this. 
+	-- S.T.M.L
+
+- follow linux coding style
+
+- priority of source code (4S):
+  + stable
+  + simple
+  + small
+  + secure (this one is not needed for this program)
+
+- keep as small as possible:
+  + remove unneeded space
+  + remove unneeded variables
+
+- write comments/documentation as clearly as possible
+
+- learn to use:
+  + if (1 == var)
+
+- learn to avoid:
+  + (i < strlen(str))
+    in a loop condition, because strlen() would be called on every
+    iteration. try,
+	l = strlen(str);
+	while (i < l) { ... }
+
+- use functions in libc as much as possible; if not, wrap them!
+
+
+
+== 001 - I/O Relations between Statements
+
+LOAD is an input statement.
+
+SORT, CREATE, and JOIN are output statements, but their output can also be
+an input, e.g.:
+
+----
+1 - load abc ( ... ) as x;
+2 - sort x by a, b;
+3 - create ghi ( x.field, ... ) as out_x;
+----
+
+The file output created by the sort statement on line 2 becomes an input
+of the create statement on line 3.
+
+
+== 002 - Why we need the '2nd-loser'
+
+To minimize comparisons and insertions in the merge tree.
+
+
+
+== 003 - Why we need a 'level' on tree nodes
+
+The list of input files to merge is A, B, C, each containing sorted data:
+
+----
+A : 10, 11, 12, 13    (1st file)
+B : 1, 12, 100, 101   (2nd file)
+C : 2, 13, 200, 201   (3rd file)
+----
+
+If we use this tree insert algorithm:
+
+----
+if (root < node)
+	insert to left
+else
+	insert to right
+----
+
+after several steps we will get:
+
+----
+B-12
+   \
+    C-13
+   /
+A-12
+----
+
+which results in a non-stable sort,
+
+    B-1 C-2 A-10 A-11 B-12 A-12 ...
+
+when it should be,
+
+    B-1 C-2 A-10 A-11 A-12 B-12 ...
+
+Even if we choose a different insert algorithm:
+
+----
+if (root <= node)
+	insert to left
+else
+	insert to right
+----
+
+there is also input data that will violate it, e.g.:
+
+----
+A : 2, 13, 200, 201   (1st file)
+B : 1, 12, 100, 101   (2nd file)
+C : 10, 11, 12, 13    (3rd file)
+----
+
+
+== 004 - recursive call + thread + free on SunOS 5.10
+
+I did not investigate much, but doing a recursive call + thread + free
+causes SIGSEGV on SunOS 5.10 systems, but not on GNU/Linux systems. This
+oddity was found while testing on Solaris; using dbx, the SIGSEGV was
+"sometimes" caught in str_destroy,
+
+----
+if (str->buf)
+	free(str->buf);	<= dbx catch here
+----
+
+and "sometimes" below that (but not in a vos function/stack),
+
+e.g.:
+----
+list_destroy(**ptr)
+{
+	if (! (*ptr))
+		return;
+	list_destroy((*ptr)->next);
+	free((*ptr));
+}
+----
+
+and no, it is not about a double free.
diff --git a/doc/index.adoc b/doc/index.adoc
new file mode 100644
index 0000000..1cc3be0
--- /dev/null
+++ b/doc/index.adoc
@@ -0,0 +1,26 @@
+= vos
+
+Vos is a program to process formatted data, e.g. CSV data.
+Vos is designed to process large input files, files whose size is larger
+than the size of memory, and it can be tuned to adapt to your machine
+environment.
+
+link:user/vos_user_manual.html[Vos User Manual] - User manual for the vos
+command line.
+
+
+== Development
+
+link:dev/GOAL.html[GOAL] - Lists the goals of this project.
+
+link:dev/NOTES.html[NOTES] - Miscellaneous notes from developing the
+project.
+
+link:dev/vos-sketch.odg[Vos sketch diagram].
+
+Performance logs,
+
+- link:dev/vos.test.create.log[vos.test.create.log].
+- link:dev/vos.test.create.mem.log[vos.test.create.mem.log].
+- link:dev/vos.test.join.log[vos.test.join.log].
+- link:dev/vos.test.join.log[vos.test.join.log].
-- 
cgit v1.3