| author | Shulhan <ms@kilabit.info> | 2026-01-02 17:21:26 +0700 |
|---|---|---|
| committer | Shulhan <ms@kilabit.info> | 2026-01-02 17:21:26 +0700 |
| commit | 797faa817881ea63271d5c6794b80ccd644cc76c (patch) | |
| tree | ce012b4e704d870c07e2fc4f50f6b099ffa82431 | |
| parent | a5817d2410f65c3a055e4c1ec212270aed50186d (diff) | |
| download | vos-797faa817881ea63271d5c6794b80ccd644cc76c.tar.xz | |
doc: add index and reformat some document using asciidoc
This is for publication of doc under https://kilabit.info/project/vos .
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | doc/.gitignore | 1 |
| -rw-r--r-- | doc/dev/GOAL | 263 |
| -rw-r--r-- | doc/dev/GOAL.adoc | 258 |
| -rw-r--r-- | doc/dev/NOTES.adoc (renamed from doc/dev/NOTES) | 85 |
| -rw-r--r-- | doc/index.adoc | 26 |
5 files changed, 334 insertions, 299 deletions
diff --git a/doc/.gitignore b/doc/.gitignore
new file mode 100644
index 0000000..2d19fc7
--- /dev/null
+++ b/doc/.gitignore
@@ -0,0 +1 @@
+*.html
diff --git a/doc/dev/GOAL b/doc/dev/GOAL
deleted file mode 100644
index 7578e06..0000000
--- a/doc/dev/GOAL
+++ /dev/null
@@ -1,263 +0,0 @@
-Vos Goals
-----------
- Taken from CoSort Technical Specifications.
-
-
-Legend:
-- : unimplemented
-+ : implemented
-= : on going/half done
-? : is it worth/why/what is that mean
-
-
-Ease of Use
------------
-
-- Processes record layouts and SQLlike field definitions from central
- data dictionaries.
-
-- Converts and processes native COBOL copybook, Oracle SQL*Loader control
- file, CSV, and W3C extended log format (ELF) file layouts.
-
-- SortCL data definition files are a supported MIMB metadata format.
-
-- Mix of online help, preruntime application validation, and runtime
- error messages.
-
-- Leverages centralized application and file layout definitions (metadata
- repositories).
-
-= Reports problems to standard error when invoked from a program, or
- to an error log.
-
-- Runs silently or with verbose messaging without user intervention.
-
-- Allows user control over the amount of informational output produced.
-
-- Generates a queryready XML audit log for data forensics and privacy
- compliance.
-
-= Describes commands and options through man pages and online documentation.
-
- it's half done because the program is always moving to a new features.
- it's not wise to mark this as 'done'.
-
-- Easytouse interfaces and seamless thirdparty sort replacements preclude
- the need for training classes
-
-
-Resource Control
-----------------
-
-+ Sets and allows user modification of the maximum and minimum number of
- concurrent sort threads for sorting on multiCPU and multicore systems.
-
- using PROCESS_MAX variable.
-
-+ Uses a specified directory, a combination of directories, for temporary work
- files.
-
- using PROC_TMP_DIR variable.
-
-+ Limits the amount of main and virtual memory used during sort operations.
-
- using PROCESS_MAX_ROW variable.
-
- Since input file size is unpredictable and a human is still need to
- run the program, the amount of program memory still cannot decide by
- human. What if it's set to 1 kilobytes ?.
-
-+ Sets the size of the memory blocks used as physical I/O buffers.
-
- using FILE_BUFFER_SIZE variable.
-
-
-Input and Output
-----------------
-
-= Processes any number of files, of any size, and any number of records,
- fixed or variable length to 65,535 bytes passed from an input procedure,
- from stdin, a named pipe, a table in memory, or from an application program.
-
- - TODO: from stdin
- - TODO: from a named pipe.
- - TODO: from a table in memory.
- - TODO: from an application program.
-
-? Supports the use of environment variables.
-
- for what ?
-
-= Supports wildcards in the specification of input and output files, as well
- as absolute path names and aliases.
-
- - TODO: supports wildcards in the specification of input files.
-
-+ Accepts and outputs fixed or variablelength records with delimited field.
-
-? Generates one or more output files, and/or summary information, including
- formatted and dashboardready reports.
-
-- Returns sorted, merged, or joined records one (or more) at a time to an output
- procedure, to stdout (or named pipe), a table in memory, one or more new or
- existing files, or to a program.
-
-- Outputs optional sequence numbers with each record, at any starting value, for
- indexed loads and/or reports.
-
-
-Record Selection and Grouping
------------------------------
-
-= Includes or omits input or output records using fieldtofield or fieldconstant
- comparisons.
- TODO: field-to-field comparisons
-
-- Compares on any number of data fields, using standard and alternate collating
- sequences.
-
-+ Sorts and/or reformats groups of selected records.
-
- using SORT and CREATE statement.
-
-+ Matches two or more sorted or unsorted files on inner and outer join criteria using
- SQLbased condition syntax.
-
- using JOIN with '+' or '-' statement.
-
-- Skips a specified number of records, bytes, or a file header or footer.
-
-- Processes a specified number of records or bytes, including a saved header.
-
-- Eliminates or saves records with duplicate keys.
-
-
-Sort Key Processing
--------------------
-
-+ Allows any number of key fields to be specified in ascending or
- descending order.
-
- using SORT x by x.f1 ASC; or
- using SORT x by x.f1 DESC;
-
-+ Supports any number of fields from 0 to 65,535 bytes in length.
-
- almost unlimited, the limit is your memory.
-
-+ Orders fixed position fields, or floating fields with one or more
- delimiters.
-
-- Supports numeric keys, including all C, FORTRAN, and COBOL data types.
-
-- Supports single and multibyte character keys, including ASCII, EBCDIC,
- ASCII in EBCDIC sequence, American, European, ISO and Japanese timestamps,
- and natural (localedependent) values, as well as Unicode and doublebyte
- characters such as Big5, EUCTW, UTF32, and SJIS.
-
-- Allows left or right alignment and case shifting of character keys.
-
-- Accepts user compare procedures for multibyte, encrypted and other
- special data.
-
-- Performs record sequence checking.
-
-+ Maintains input record order (stability) on duplicate keys.
-
-- Controls treatment of null fields when specifying floating
- (character separated) keys.
-
-- Collates and converts between many of the following data types (formats):
-
---
-
-
-Record Reformatting
-------------------
-
-+ Inserts, removes, resizes, and reorders fields within records; defines new
- fields.
-
-- Converts data in fields from one format to another either using internal
- conversion.
-
-- Maps common fields from differently formatted input files to a uniform sort
- record.
-
-= Joins any fields from several files into an output record, usually based on a
- condition.
-
- using JOIN statement. current support only in joining two input files.
-
-- Changes record layouts from one file type to another, including: Line
- Sequential, Record Sequential, Variable Sequential, Blocked, Microsoft Comma
- Separated Values (CSV), ACUCOBOL Vision, MF ISAM, MFVL, Unisys VBF, VSAM
- (within UniKik MBM), Extended Log Format (W3C), LDIF, and XML.
-
-- Maps processed records to many differently formatted output files, including
- HTML.
-
-- Writes multiple record formats to the same file for complex report
- requirements.
-
-- Performs mathematical expressions and functions on field data (including
- aggregate data) to generate new output fields.
-
-- Calculates the difference in days, hours, minutes and seconds betweeen
- timestamps.
-
-
-Field Reformatting/Validation
------------------------------
-
-- Aligns desired field contents to either the left or right of the target
- field, where any leading or trailing fill characters from the source are
- moved to the opposite side of the string.
-
-- Processes values from multidimensional, tabdelimited lookup files.
-
-- Creates and processes substrings of original field contents, where you can
- specify a positive or negative offset and a number of bytes to be contained
- in the substring.
-
-- Finds a userspecified text string in a given field, and replaces all
- occurrences of it with a different userspecified text string in the target
- field.
-
-- Supports Perl Compatible Regular Expressions (PCRE), including pattern
- matching.
-
-- Uses Cstyle “iscompare” functions to validate contents at the field level
- (for example, to determine if all field characters are printable), which can
- also be used for recordfiltering via selection statements.
-
-- Protects sensitive field data with fieldlevel deidentification and AES256
- encryption routines, along with anonymization, pseudonymization, filtering
- and other column-level data masking and obfuscation techniques.
-
-- Supports custom, userwritten fieldlevel transformation libraries, and
- documents an example of a fieldlevel data cleansing routine from
- Melissa Data (AddressObject).
-
-
-Record Summarization
---------------------
-
-- Consolidates records with equal keys into unique records, while totaling,
- averaging, or counting values in specified fields, including derived
- (crosscalculated) fields.
-
-- Produces maximum, minimum, average, sum, and count fields.
-
-- Displays running summary value(s) up to a break (accumulating aggregates).
-
-- Nreaks on compound conditions.
-
-- Allows multiple levels of summary fields in the same report.
-
-- Remaps summary fields into a new format, allowing relational tables.
-
-- Ranks data through a running count with descending numeric values.
-
-- Writes detail and summary records to the same output file for structured
- reports.
diff --git a/doc/dev/GOAL.adoc b/doc/dev/GOAL.adoc
new file mode 100644
index 0000000..74370fb
--- /dev/null
+++ b/doc/dev/GOAL.adoc
@@ -0,0 +1,258 @@
+= Vos Goals
+
+Taken from CoSort Technical Specifications.
+
+Legend:
+
+* - : unimplemented
+* + : implemented
+* = : on going/half done
+* ? : is it worth/why/what is that mean
+
+
+== Ease of Use
+
+(-) Processes record layouts and SQLlike field definitions from central
+data dictionaries.
+
+(-) Converts and processes native COBOL copybook, Oracle SQL*Loader control
+file, CSV, and W3C extended log format (ELF) file layouts.
+
+(-) SortCL data definition files are a supported MIMB metadata format.
+
+(-) Mix of online help, preruntime application validation, and runtime
+error messages.
+
+(-) Leverages centralized application and file layout definitions
+(metadata repositories).
+
+(=) Reports problems to standard error when invoked from a program, or
+to an error log.
+
+(-) Runs silently or with verbose messaging without user intervention.
+
+(-) Allows user control over the amount of informational output produced.
+
+(-) Generates a queryready XML audit log for data forensics and privacy
+compliance.
+
+(=) Describes commands and options through man pages and online
+documentation.
+
+it's half done because the program is always moving to a new features.
+it's not wise to mark this as 'done'.
+
+(-) Easytouse interfaces and seamless thirdparty sort replacements
+preclude the need for training classes
+
+
+== Resource Control
+
+(+) Sets and allows user modification of the maximum and minimum number of
+concurrent sort threads for sorting on multiCPU and multicore systems.
+
+Using PROCESS_MAX variable.
+
+(+) Uses a specified directory, a combination of directories, for temporary
+work files.
+
+Using PROC_TMP_DIR variable.
+
+(+) Limits the amount of main and virtual memory used during sort
+operations.
+
+Using PROCESS_MAX_ROW variable.
+
+Since input file size is unpredictable and a human is still need to
+run the program, the amount of program memory still cannot decide by
+human. What if it's set to 1 kilobytes ?.
+
+(+) Sets the size of the memory blocks used as physical I/O buffers.
+
+Using FILE_BUFFER_SIZE variable.
+
+
+== Input and Output
+
+(=) Processes any number of files, of any size, and any number of records,
+fixed or variable length to 65,535 bytes passed from an input procedure,
+from stdin, a named pipe, a table in memory, or from an application program.
+
+- TODO: from stdin
+- TODO: from a named pipe.
+- TODO: from a table in memory.
+- TODO: from an application program.
+
+(?) Supports the use of environment variables.
+
+(=) Supports wildcard in the specification of input and output files, as
+well as absolute path names and aliases.
+
+- TODO: supports wildcard in the specification of input files.
+
+(+) Accepts and outputs fixed or variablelength records with delimited
+field.
+
+(?) Generates one or more output files, and/or summary information,
+including formatted and dashboardready reports.
+
+(-) Returns sorted, merged, or joined records one (or more) at a time to an
+output procedure, to stdout (or named pipe), a table in memory, one or more
+new or existing files, or to a program.
+
+(-) Outputs optional sequence numbers with each record, at any starting
+value, for indexed loads and/or reports.
+
+
+== Record Selection and Grouping
+
+(=) Includes or omits input or output records using fieldtofield or field
+constant comparisons.
+
+TODO: field-to-field comparisons
+
+(-) Compares on any number of data fields, using standard and alternate
+collating sequences.
+
+(+) Sorts and/or reformats groups of selected records.
+
+Using SORT and CREATE statement.
+
+(+) Matches two or more sorted or unsorted files on inner and outer join
+criteria using SQLbased condition syntax.
+
+Using JOIN with '+' or '-' statement.
+
+(-) Skips a specified number of records, bytes, or a file header or footer.
+
+(-) Processes a specified number of records or bytes, including a saved
+header.
+
+(-) Eliminates or saves records with duplicate keys.
+
+
+== Sort Key Processing
+
+(+) Allows any number of key fields to be specified in ascending or
+descending order.
+
+ using SORT x by x.f1 ASC; or
+ using SORT x by x.f1 DESC;
+
+(+) Supports any number of fields from 0 to 65,535 bytes in length.
+
+Almost unlimited, the limit is your memory.
+
+(+) Orders fixed position fields, or floating fields with one or more
+delimiters.
+
+(-) Supports numeric keys, including all C, FORTRAN, and COBOL data types.
+
+(-) Supports single and multibyte character keys, including ASCII, EBCDIC,
+ASCII in EBCDIC sequence, American, European, ISO and Japanese timestamps,
+and natural (localedependent) values, as well as Unicode and doublebyte
+characters such as Big5, EUCTW, UTF32, and SJIS.
+
+(-) Allows left or right alignment and case shifting of character keys.
+
+(-) Accepts user compare procedures for multibyte, encrypted and other
+special data.
+
+(-) Performs record sequence checking.
+
+(+) Maintains input record order (stability) on duplicate keys.
+
+(-) Controls treatment of null fields when specifying floating
+(character separated) keys.
+
+(-) Collates and converts between many of the following data types
+(formats).
+
+
+== Record Reformatting
+
+(+) Inserts, removes, resizes, and reorders fields within records; defines
+new fields.
+
+(-) Converts data in fields from one format to another either using internal
+conversion.
+
+(-) Maps common fields from differently formatted input files to a uniform
+sort record.
+
+(=) Joins any fields from several files into an output record, usually based
+on a condition.
+
+Using JOIN statement. current support only in joining two input files.
+
+(-) Changes record layouts from one file type to another, including: Line
+Sequential, Record Sequential, Variable Sequential, Blocked, Microsoft Comma
+Separated Values (CSV), ACUCOBOL Vision, MF ISAM, MFVL, Unisys VBF, VSAM
+(within UniKik MBM), Extended Log Format (W3C), LDIF, and XML.
+
+(-) Maps processed records to many differently formatted output files,
+including HTML.
+
+(-) Writes multiple record formats to the same file for complex report
+requirements.
+
+(-) Performs mathematical expressions and functions on field data (including
+aggregate data) to generate new output fields.
+
+(-) Calculates the difference in days, hours, minutes and seconds between
+timestamps.
+
+
+== Field Reformatting/Validation
+
+(-) Aligns desired field contents to either the left or right of the target
+field, where any leading or trailing fill characters from the source are
+moved to the opposite side of the string.
+
+(-) Processes values from multidimensional, tabdelimited lookup files.
+
+(-) Creates and processes substrings of original field contents, where you
+can specify a positive or negative offset and a number of bytes to be
+contained in the substring.
+
+(-) Finds a userspecified text string in a given field, and replaces all
+occurrences of it with a different userspecified text string in the target
+field.
+
+(-) Supports Perl Compatible Regular Expressions (PCRE), including pattern
+matching.
+
+(-) Uses Cstyle “iscompare” functions to validate contents at the field
+level (for example, to determine if all field characters are printable),
+which can also be used for recordfiltering via selection statements.
+
+(-) Protects sensitive field data with fieldlevel deidentification and
+AES256 encryption routines, along with anonymization, pseudonymization,
+filtering and other column-level data masking and obfuscation techniques.
+
+(-) Supports custom, userwritten fieldlevel transformation libraries, and
+documents an example of a fieldlevel data cleansing routine from
+Melissa Data (AddressObject).
+
+
+== Record Summarization
+
+(-) Consolidates records with equal keys into unique records, while
+totaling, averaging, or counting values in specified fields, including
+derived (crosscalculated) fields.
+
+(-) Produces maximum, minimum, average, sum, and count fields.
+
+(-) Displays running summary value(s) up to a break (accumulating
+aggregates).
+
+(-) Breaks on compound conditions.
+
+(-) Allows multiple levels of summary fields in the same report.
+
+(-) Remaps summary fields into a new format, allowing relational tables.
+
+(-) Ranks data through a running count with descending numeric values.
+
+(-) Writes detail and summary records to the same output file for structured
+reports.
diff --git a/doc/dev/NOTES b/doc/dev/NOTES.adoc
index 92bf86c..72d1306 100644
--- a/doc/dev/NOTES
+++ b/doc/dev/NOTES.adoc
@@ -1,5 +1,5 @@
- sometimes i forgot why i write code like this.
- -- S.T.M.L
+ sometimes i forgot why i write code like this.
+ -- S.T.M.L
 - follow linux coding style
@@ -29,51 +29,57 @@
-001 - I/O Relation between Statement
------------------------------------------------------------------------------
+== 001 - I/O Relation between Statement
+
 LOAD is an input statement.
 SORT, CREATE, JOIN is an output statement, but it can be an input.
 i.e:
-	1 - load abc ( ... ) as x;
-	2 - sort x by a, b;
-	3 - create ghi ( x.field, ... ) as out_x;
+----
+1 - load abc ( ... ) as x;
+2 - sort x by a, b;
+3 - create ghi ( x.field, ... ) as out_x;
+----
 file output created by sort statement in line 2 will be an input by create
 statement in line 3.
-002 - Why we need '2nd-loser'
------------------------------------------------------------------------------
+== 002 - Why we need '2nd-loser'
 to minimize comparison and insert in merge tree.
-003 - Why we need 'level' on tree node
------------------------------------------------------------------------------
+== 003 - Why we need 'level' on tree node
 list of input file to merge is A, B, C contain sorted data :
-	A : 10, 11, 12, 13 (1st file)
-	B : 1, 12, 100, 101 (2nd file)
-	C : 2, 13, 200, 201 (3rd file)
+----
+A : 10, 11, 12, 13 (1st file)
+B : 1, 12, 100, 101 (2nd file)
+C : 2, 13, 200, 201 (3rd file)
+----
 if we use tree insert algorithm:
-	if (root < node)
-		insert to left
-	else
-		insert to right
+----
+if (root < node)
+	insert to left
+else
+	insert to right
+----
 after several step we will get:
+----
 B-12
   \
    C-13
  /
 A-12
+----
 which result in not-a-stable sort,
@@ -85,38 +91,45 @@ they should be,
 Even if we choose different algorithm in insert:
-	if (root <= node)
-		insert to left
-	else
-		insert to right
+----
+if (root <= node)
+	insert to left
+else
+	insert to right
+----
 there is also input data that will violate this, i.e:
-	A : 2, 13, 200, 201 (1st file)
-	B : 1, 12, 100, 101 (2nd file)
-	C : 10, 11, 12, 13 (3rd file)
+----
+A : 2, 13, 200, 201 (1st file)
+B : 1, 12, 100, 101 (2nd file)
+C : 10, 11, 12, 13 (3rd file)
+----
-004 - recursives call + thread + free on SunOS 5.10
------------------------------------------------------------------------------
+== 004 - recursives call + thread + free on SunOS 5.10
 i did not investigate much, but doing a recursive call + thread + free cause
 SIGSEGV on SunOS 5.10 system, but not in GNU/Linux system.
 This odd's found whee testing on Solaris and by using dbx the SIGSEGV
 "sometimes" catched in str_destroy,
-	if (str->buf)
-		free(str->buf); <= dbx catch here
+----
+if (str->buf)
+	free(str->buf); <= dbx catch here
+----
 and "sometimes" below that (but not in vos function/stack). i.e:
-	list_destroy(**ptr)
-	{
-		if (! (*ptr))
-			return;
-		list_destroy((*ptr)->next);
-		free((*ptr));
-	}
+----
+list_destroy(**ptr)
+{
+	if (! (*ptr))
+		return;
+	list_destroy((*ptr)->next);
+	free((*ptr));
+}
+----
 and no, it's not about double free.
diff --git a/doc/index.adoc b/doc/index.adoc
new file mode 100644
index 0000000..1cc3be0
--- /dev/null
+++ b/doc/index.adoc
@@ -0,0 +1,26 @@
+= vos
+
+Vos is a program to process formatted data, i.e. CSV data.
+Vos is designed to process a large input file, a file where their size is
+larger than the size of memory, and can be tuned to adapt with your machine
+environment.
+
+link:user/vos_user_manual.html[Vos User Manual] - User manual for vos
+command line.
+
+
+== Development
+
+link:dev/GOAL.html[GOAL] - List the goal of this project.
+
+link:dev/NOTES.html[NOTES] - Miscellaneous notes when developing the
+project.
+
+link:dev/vos-sketch.odg[Vos sketch diagram].
+
+Performance logs,
+
+- link:dev/vos.test.create.log[vos.test.create.log].
+- link:dev/vos.test.create.mem.log[vos.test.create.mem.log].
+- link:dev/vos.test.join.log[vos.test.join.log].
+- link:dev/vos.test.join.log[vos.test.join.log].
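Note 003 in the NOTES.adoc part of this diff argues that a plain tree insert cannot keep equal keys from different input files in file order, which is why a tree node carries a 'level'. The following is a minimal C sketch of that idea, not the code vos uses. It assumes that 'level' means the index of the input file (0 for the 1st file), and for brevity it replaces the merge tree with a linear scan over the head of each input. The names `struct input`, `struct merged`, `head_before` and `merge_stable` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One merged record: the key plus the input file it came from. */
struct merged {
	int key;
	size_t level;	/* index of the input file, 0 = 1st file */
};

/* One sorted input file, reduced to an array of keys. */
struct input {
	const int *keys;	/* keys sorted in ascending order */
	size_t len;		/* number of keys */
	size_t pos;		/* index of the next unread key */
};

/*
 * Return nonzero if record a must be emitted before record b.
 * Equal keys are ordered by level, so the merge stays stable.
 */
static int head_before(int key_a, size_t level_a, int key_b, size_t level_b)
{
	if (key_a != key_b)
		return key_a < key_b;
	return level_a < level_b;
}

/*
 * Merge n sorted inputs into out, which must have room for every key.
 * Return the number of records written.
 */
static size_t merge_stable(struct input *in, size_t n, struct merged *out)
{
	size_t count = 0;

	assert(in != NULL && out != NULL);

	for (;;) {
		size_t best = n;	/* n means "no input has data left" */
		size_t i;

		for (i = 0; i < n; i++) {
			if (in[i].pos >= in[i].len)
				continue;
			if (best == n
			    || head_before(in[i].keys[in[i].pos], i,
					   in[best].keys[in[best].pos], best))
				best = i;
		}
		if (best == n)
			break;

		out[count].key = in[best].keys[in[best].pos++];
		out[count].level = best;
		count++;
	}
	return count;
}
```

With the inputs A, B and C from note 003, the two records with key 12 come out as A then B, and the two records with key 13 as A then C, regardless of which file reaches that key first.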
