Diffstat (limited to 'doc/dev')
-rw-r--r-- doc/dev/GOAL | 263
-rw-r--r-- doc/dev/NOTES | 122
-rw-r--r-- doc/dev/TODO | 10
-rw-r--r-- doc/dev/slog | 206
-rw-r--r-- doc/dev/test | 1
-rw-r--r-- doc/dev/vos-sketch.odg | bin 0 -> 29491 bytes
-rw-r--r-- doc/dev/vos.test.create.log | 113
-rw-r--r-- doc/dev/vos.test.create.mem.log | 50
-rw-r--r-- doc/dev/vos.test.join.log | 39
-rw-r--r-- doc/dev/vos.test.sort-00.log | 65
10 files changed, 869 insertions, 0 deletions
diff --git a/doc/dev/GOAL b/doc/dev/GOAL
new file mode 100644
index 0000000..7578e06
--- /dev/null
+++ b/doc/dev/GOAL
@@ -0,0 +1,263 @@
+Vos Goals
+----------
+ Taken from CoSort Technical Specifications.
+
+
+Legend:
+- : unimplemented
++ : implemented
+= : on going/half done
+? : is it worth it/why/what does that mean
+
+
+Ease of Use
+-----------
+
+- Processes record layouts and SQL-like field definitions from central data
+ dictionaries.
+
+- Converts and processes native COBOL copybook, Oracle SQL*Loader control
+ file, CSV, and W3C extended log format (ELF) file layouts.
+
+- SortCL data definition files are a supported MIMB metadata format.
+
+- Mix of on-line help, pre-runtime application validation, and runtime
+ error messages.
+
+- Leverages centralized application and file layout definitions (metadata
+ repositories).
+
+= Reports problems to standard error when invoked from a program, or
+ to an error log.
+
+- Runs silently or with verbose messaging without user intervention.
+
+- Allows user control over the amount of informational output produced.
+
+- Generates a query-ready XML audit log for data forensics and privacy
+ compliance.
+
+= Describes commands and options through man pages and on-line documentation.
+
+ it's half done because the program keeps gaining new features;
+ it's not wise to mark this as 'done'.
+
+- Easy-to-use interfaces and seamless third-party sort replacements preclude
+ the need for training classes
+
+
+Resource Control
+----------------
+
++ Sets and allows user modification of the maximum and minimum number of
+ concurrent sort threads for sorting on multi-CPU and multi-core systems.
+
+ using PROCESS_MAX variable.
+
++ Uses a specified directory, or a combination of directories, for temporary
+ work files.
+
+ using PROC_TMP_DIR variable.
+
++ Limits the amount of main and virtual memory used during sort operations.
+
+ using PROCESS_MAX_ROW variable.
+
+ Since the input file size is unpredictable and a human still needs to
+ run the program, the amount of program memory cannot sensibly be decided
+ by a human. What if it's set to 1 kilobyte?
+
++ Sets the size of the memory blocks used as physical I/O buffers.
+
+ using FILE_BUFFER_SIZE variable.
+
+
+Input and Output 
+----------------
+
+= Processes any number of files, of any size, and any number of records,
+ fixed or variable length to 65,535 bytes passed from an input procedure,
+ from stdin, a named pipe, a table in memory, or from an application program.
+
+ - TODO: from stdin
+ - TODO: from a named pipe.
+ - TODO: from a table in memory.
+ - TODO: from an application program.
+
+? Supports the use of environment variables.
+
+ for what ?
+
+= Supports wildcards in the specification of input and output files, as well
+ as absolute path names and aliases.
+
+ - TODO: supports wildcards in the specification of input files.
+
++ Accepts and outputs fixed- or variable-length records with delimited fields.
+
+? Generates one or more output files, and/or summary information, including
+ formatted and dashboard-ready reports.
+
+- Returns sorted, merged, or joined records one (or more) at a time to an output
+ procedure, to stdout (or named pipe), a table in memory, one or more new or
+ existing files, or to a program.
+
+- Outputs optional sequence numbers with each record, at any starting value, for
+ indexed loads and/or reports.
+
+
+Record Selection and Grouping
+-----------------------------
+
+= Includes or omits input or output records using field-to-field or field-constant
+ comparisons.
+
+ TODO: field-to-field comparisons
+
+- Compares on any number of data fields, using standard and alternate collating
+ sequences.
+
++ Sorts and/or reformats groups of selected records.
+
+ using SORT and CREATE statement.
+
++ Matches two or more sorted or unsorted files on inner and outer join criteria using
+ SQL-based condition syntax.
+
+ using JOIN with '+' or '-' statement.
+
+- Skips a specified number of records, bytes, or a file header or footer.
+
+- Processes a specified number of records or bytes, including a saved header.
+
+- Eliminates or saves records with duplicate keys.
+
+
+Sort Key Processing
+-------------------
+
++ Allows any number of key fields to be specified in ascending or
+ descending order.
+
+ using SORT x by x.f1 ASC; or
+ using SORT x by x.f1 DESC;
+
++ Supports any number of fields from 0 to 65,535 bytes in length.
+
+ almost unlimited, the limit is your memory.
+
++ Orders fixed position fields, or floating fields with one or more
+ delimiters.
+
+- Supports numeric keys, including all C, FORTRAN, and COBOL data types.
+
+- Supports single- and multi-byte character keys, including ASCII, EBCDIC,
+ ASCII in EBCDIC sequence, American, European, ISO and Japanese timestamps,
+ and natural (locale-dependent) values, as well as Unicode and double-byte
+ characters such as Big5, EUC-TW, UTF32, and S-JIS.
+
+- Allows left or right alignment and case shifting of character keys.
+
+- Accepts user compare procedures for multi-byte, encrypted and other
+ special data.
+
+- Performs record sequence checking.
+
++ Maintains input record order (stability) on duplicate keys.
+
+- Controls treatment of null fields when specifying floating
+ (character separated) keys.
+
+- Collates and converts between many of the following data types (formats):
+ ---
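A minimal sketch of how several ASC/DESC sort keys, as in "SORT x by x.f1 ASC", could be compared one key after another. The names SortDir, SortKey and row_cmp are made up for illustration and are not taken from the vos source; fields are assumed to be NUL-terminated strings compared with strcmp().

```c
#include <assert.h>
#include <string.h>

/* Direction of one sort key (hypothetical names, not from vos). */
enum SortDir { SORT_ASC = 1, SORT_DESC = -1 };

struct SortKey {
	size_t field;		/* index of the field to compare */
	enum SortDir dir;	/* ascending or descending */
};

/*
 * Compare two rows key by key. Returns < 0 if row a sorts first,
 * > 0 if row b sorts first, and 0 if every key is equal.
 */
int row_cmp(const char *const *a, const char *const *b,
	    const struct SortKey *keys, size_t nkeys)
{
	size_t i;

	assert(keys != NULL || nkeys == 0);

	for (i = 0; i < nkeys; i++) {
		int c = strcmp(a[keys[i].field], b[keys[i].field]);

		if (c != 0)	/* first differing key decides; DESC flips the sign */
			return c < 0 ? -keys[i].dir : keys[i].dir;
	}
	return 0;
}
```

Returning 0 for rows with equal keys leaves the tie to the caller, which is where keeping input order (stability) would have to be handled.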
+
+
+Record Reformatting
+-------------------
+
++ Inserts, removes, resizes, and reorders fields within records; defines new
+ fields.
+
+- Converts data in fields from one format to another either using internal
+ conversion.
+
+- Maps common fields from differently formatted input files to a uniform sort
+ record.
+
+= Joins any fields from several files into an output record, usually based on a
+ condition.
+
+ using JOIN statement. currently only joining two input files is supported.
+
+- Changes record layouts from one file type to another, including: Line
+ Sequential, Record Sequential, Variable Sequential, Blocked, Microsoft Comma
+ Separated Values (CSV), ACUCOBOL Vision, MF I-SAM, MFVL, Unisys VBF, VSAM
+ (within UniKik MBM), Extended Log Format (W3C), LDIF, and XML.
+
+- Maps processed records to many differently formatted output files, including
+ HTML.
+
+- Writes multiple record formats to the same file for complex report
+ requirements.
+
+- Performs mathematical expressions and functions on field data (including
+ aggregate data) to generate new output fields.
+
+- Calculates the difference in days, hours, minutes and seconds between
+ timestamps.
+
+
+Field Reformatting/Validation
+-----------------------------
+
+- Aligns desired field contents to either the left or right of the target
+ field, where any leading or trailing fill characters from the source are
+ moved to the opposite side of the string.
+
+- Processes values from multi-dimensional, tab-delimited lookup files.
+
+- Creates and processes sub-strings of original field contents, where you can
+ specify a positive or negative offset and a number of bytes to be contained
+ in the sub-string.
+
+- Finds a user-specified text string in a given field, and replaces all
+ occurrences of it with a different user-specified text string in the target
+ field.
+
+- Supports Perl Compatible Regular Expressions (PCRE), including pattern
+ matching.
+
+- Uses C-style “iscompare” functions to validate contents at the field level
+ (for example, to determine if all field characters are printable), which can
+ also be used for record-filtering via selection statements.
+
+- Protects sensitive field data with field-level de-identification and AES-256
+ encryption routines, along with anonymization, pseudonymization, filtering
+ and other column-level data masking and obfuscation techniques.
+
+- Supports custom, user-written field-level transformation libraries, and
+ documents an example of a field-level data cleansing routine from
+ Melissa Data (AddressObject).
+
+
+Record Summarization
+--------------------
+
+- Consolidates records with equal keys into unique records, while totaling,
+ averaging, or counting values in specified fields, including derived
+ (cross-calculated) fields.
+
+- Produces maximum, minimum, average, sum, and count fields.
+
+- Displays running summary value(s) up to a break (accumulating aggregates).
+
+- Breaks on compound conditions.
+
+- Allows multiple levels of summary fields in the same report.
+
+- Re-maps summary fields into a new format, allowing relational tables.
+
+- Ranks data through a running count with descending numeric values.
+
+- Writes detail and summary records to the same output file for structured
+ reports.
diff --git a/doc/dev/NOTES b/doc/dev/NOTES
new file mode 100644
index 0000000..92bf86c
--- /dev/null
+++ b/doc/dev/NOTES
@@ -0,0 +1,122 @@
+ sometimes i forgot why i write code like this.
+ -- S.T.M.L
+
+- follow linux coding style
+
+- priority of source code (4S) :
+ + stable
+ + simple
+ + small
+ + secure (not needed for this program)
+
+- keep as small as possible:
+ + remove unneeded space
+ + remove unneeded variable
+
+- write comments/documentation as clearly as possible
+
+- learn to use:
+ + if (1 == var)
+
+- learn to avoid:
+ + (i < strlen(str))
+ in a loop condition, because strlen() is then called on every iteration.
+ instead, try:
+ l = strlen(str);
+ while (i < l) { ... }
+
+- use libc functions as much as possible; if that is not possible, wrap them!
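A small sketch of the strlen() advice from the list above: the length is taken once before the loop instead of inside the loop condition. The function count_spaces is a hypothetical example and does not come from the vos source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the spaces in str; strlen() runs once, not once per iteration. */
size_t count_spaces(const char *str)
{
	size_t i = 0;
	size_t n = 0;
	size_t l;

	assert(str != NULL);

	l = strlen(str);	/* hoisted out of the loop condition */
	while (i < l) {
		if (str[i] == ' ')
			n++;
		i++;
	}
	return n;
}
```

An optimizing compiler may hoist the call by itself when it can prove the string is not modified in the loop, but writing it this way does not depend on that.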
+
+
+
+001 - I/O Relation between Statement
+-----------------------------------------------------------------------------
+LOAD is an input statement.
+
+SORT, CREATE, and JOIN are output statements, but their output can also be an input.
+i.e:
+
+ 1 - load abc ( ... ) as x;
+ 2 - sort x by a, b;
+ 3 - create ghi ( x.field, ... ) as out_x;
+
+the output file created by the sort statement in line 2 will be the input of
+the create statement in line 3.
+
+
+002 - Why we need '2nd-loser'
+-----------------------------------------------------------------------------
+
+to minimize comparisons and inserts in the merge tree.
+
+
+
+003 - Why we need 'level' on tree node
+-----------------------------------------------------------------------------
+
+the input files to merge are A, B, and C, each containing sorted data:
+
+ A : 10, 11, 12, 13 (1st file)
+ B : 1, 12, 100, 101 (2nd file)
+ C : 2, 13, 200, 201 (3rd file)
+
+if we use this tree insert algorithm:
+
+ if (root < node)
+ insert to left
+ else
+ insert to right
+
+after several steps we will get:
+
+B-12
+ \
+ C-13
+ /
+A-12
+
+which results in an unstable sort,
+
+ B-1 C-2 A-10 A-11 B-12 A-12 ...
+
+it should be,
+
+ B-1 C-2 A-10 A-11 A-12 B-12 ...
+
+Even if we choose a different insert algorithm:
+
+ if (root <= node)
+ insert to left
+ else
+ insert to right
+
+there is still input data that breaks stability, e.g.:
+
+ A : 2, 13, 200, 201 (1st file)
+ B : 1, 12, 100, 101 (2nd file)
+ C : 10, 11, 12, 13 (3rd file)
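A minimal sketch of the idea behind 'level', assuming it records the position of the input file in the merge list (0 for the 1st file, 1 for the 2nd, and so on): equal keys are then ordered by level, so the result no longer depends on insert order. For brevity this uses a linear scan over the inputs instead of a merge tree; the tie-break in the compare function is the same either way. All names here (MergeSrc, src_before, merge_stable) are hypothetical and not taken from the vos source.

```c
#include <assert.h>
#include <stddef.h>

/* One merge input: a sorted array plus its position in the input list. */
struct MergeSrc {
	const int *val;	/* sorted values */
	size_t len;	/* number of values */
	size_t pos;	/* index of the next unread value */
	int level;	/* input order: 0 = 1st file, 1 = 2nd file, ... */
};

/* Non-zero if the head of a must be written before the head of b. */
static int src_before(const struct MergeSrc *a, const struct MergeSrc *b)
{
	int x = a->val[a->pos];
	int y = b->val[b->pos];

	if (x != y)
		return x < y;
	return a->level < b->level;	/* equal keys: earlier file wins */
}

/*
 * Merge n sources into out; out_level gets the level each value came
 * from. Returns the number of values written.
 */
size_t merge_stable(struct MergeSrc *src, size_t n, int *out, int *out_level)
{
	size_t count = 0;

	assert(src != NULL && out != NULL && out_level != NULL);

	for (;;) {
		struct MergeSrc *best = NULL;
		size_t i;

		for (i = 0; i < n; i++) {
			if (src[i].pos >= src[i].len)
				continue;	/* this input is exhausted */
			if (!best || src_before(&src[i], best))
				best = &src[i];
		}
		if (!best)
			break;	/* every input is exhausted */
		out[count] = best->val[best->pos++];
		out_level[count] = best->level;
		count++;
	}
	return count;
}
```

With the first A, B, C data set above, the two 12s come out as A-12 then B-12, matching the expected stable order.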
+
+
+004 - recursive call + thread + free on SunOS 5.10
+-----------------------------------------------------------------------------
+
+i did not investigate much, but doing a recursive call + thread + free causes
+SIGSEGV on a SunOS 5.10 system, but not on a GNU/Linux system. This oddity was
+found when testing on Solaris; using dbx, the SIGSEGV is "sometimes" caught in
+str_destroy,
+
+ if (str->buf)
+ free(str->buf); <= dbx catch here
+
+and "sometimes" below that (but not in vos function/stack).
+
+i.e:
+ void list_destroy(struct List **ptr) /* node type name assumed */
+ {
+ if (! (*ptr))
+ return;
+ list_destroy(&(*ptr)->next);
+ free((*ptr));
+ }
+
+and no, it's not about double free.
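The note above does not pin down the cause of the crash, so the following is not a confirmed fix. It is only a sketch of how the same cleanup can be written without recursion, which takes call depth out of the picture when many nodes are freed inside a thread. struct Node and list_destroy_iter are hypothetical names; the real vos list layout is not shown in these notes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node type, reduced to the link needed for the example. */
struct Node {
	struct Node *next;
};

/*
 * Free every node with a loop, so stack use stays constant no matter
 * how long the list is. Returns the number of nodes freed and leaves
 * *ptr set to NULL.
 */
size_t list_destroy_iter(struct Node **ptr)
{
	size_t n = 0;

	assert(ptr != NULL);

	while (*ptr) {
		struct Node *next = (*ptr)->next;	/* save link before free() */

		free(*ptr);
		*ptr = next;
		n++;
	}
	return n;
}
```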
diff --git a/doc/dev/TODO b/doc/dev/TODO
new file mode 100644
index 0000000..3dc6001
--- /dev/null
+++ b/doc/dev/TODO
@@ -0,0 +1,10 @@
+>>
+- add set variable
+ set process_compare_case_sensitive; (default)
+ set process_compare_case_notsensitive;
+
+ set process_tmp_dir "/path/to/tmp/dir";
+ set process_tmp_dir "/another/tmp/dir";
+<< DONE
+
+- Produces maximum, minimum, average, sum, and count fields.
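A possible shape for the compare switch behind the "set process_compare_case_*" variables listed above, as a sketch only. key_cmp and its case_sensitive flag are hypothetical names; strcasecmp() is the POSIX function from <strings.h>, and whether vos uses it is not stated in these notes.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>	/* strcasecmp(), POSIX */

/*
 * Compare two keys. A non-zero case_sensitive selects strcmp();
 * zero selects the case-insensitive strcasecmp().
 */
int key_cmp(const char *a, const char *b, int case_sensitive)
{
	assert(a != NULL && b != NULL);

	return case_sensitive ? strcmp(a, b) : strcasecmp(a, b);
}
```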
diff --git a/doc/dev/slog b/doc/dev/slog
new file mode 100644
index 0000000..72e2185
--- /dev/null
+++ b/doc/dev/slog
@@ -0,0 +1,206 @@
+ i have an odd habit: checking code every time
+ i get bored, which sometimes results in an error.
+ this file prevents me from over-checking it.
+ -- May Benot
+
+--- format ---
++ function_name
+@check : XXXX XXXX
+@last-check : year.month.day (last check)
+@auditor : thisman@thatserver.com (last auditor)
+@desc : fix algorithm
+--- tamrof ---
+
+
+vos_String
+-----------------------------------------------------------------------------
+
++ str_create
+@check : X
+@last-check : 2008.12.17
+@auditor : ms@kilabit.info
+
++ str_append_c
+@check : X
+@last-check : 2008.12.19
+@auditor : ms@kilabit.info
+
++ str_append
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+@desc : fix len increment
+
++ str_detach
+@check : X
+@last-check : 2008.12.19
+@auditor : ms@kilabit.info
+
++ str_rtrim
+@check : XX
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+@desc : removed
+
++ str_prune
+@check : X
+@last-check : 2008.12.19
+@auditor : ms@kilabit.info
+
++ str_destroy
+@check : X
+@last-check : 2008.12.19
+@auditor : ms@kilabit.info
+
++ str_raw_copy
+@check : X
+@last-check : 2008.12.19
+@auditor : ms@kilabit.info
+
++ str_raw_randomize
+@check : XX
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+@desc : 'x' should not be replaced
+
++ str_raw_hash
+@check : X
+@last-check : 2008.12.19
+@auditor : ms@kilabit.info
+
+
+vos_File
+-----------------------------------------------------------------------------
+
++ file_open
+@check : X
+@last-check : 2008.12.19
+@auditor : ms@kilabit.info
+
++ file_read
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
++ file_write
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
++ file_fetch_until
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
++ file_skip_until
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
++ file_skip_space
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
++ file_destroy
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
++ file_raw_get_size
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
++ file_raw_is_exist
+@check : X
+@last-check : 2009.01.25
+@auditor : ms@kilabit.info
+
+
+vos_LL
+-----------------------------------------------------------------------------
+
++ ll_add
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ ll_link
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ ll_print
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ ll_destroy
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
+
+vos_Field
+-----------------------------------------------------------------------------
+
++ field_soft_copy
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ field_add
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ field_print
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ _field_destroy
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
+
+vos_Record
+-----------------------------------------------------------------------------
+
++ record_new
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ _record_cmp
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ record_add_field
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ record_add_row
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ record_prune
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ record_destroy
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
++ record_print
+@check : X
+@last-check : 2009.01.26
+@auditor : ms@kilabit.info
+
diff --git a/doc/dev/test b/doc/dev/test
new file mode 100644
index 0000000..ff6cde3
--- /dev/null
+++ b/doc/dev/test
@@ -0,0 +1 @@
++ Accepts and outputs fixed- or variable-length records with delimited fields.
diff --git a/doc/dev/vos-sketch.odg b/doc/dev/vos-sketch.odg
new file mode 100644
index 0000000..45b86a6
--- /dev/null
+++ b/doc/dev/vos-sketch.odg
Binary files differ
diff --git a/doc/dev/vos.test.create.log b/doc/dev/vos.test.create.log
new file mode 100644
index 0000000..027c53f
--- /dev/null
+++ b/doc/dev/vos.test.create.log
@@ -0,0 +1,113 @@
+2009.10.12
+
+Comparing vos create process time by setting process max row and buffer size
+==============================================================================
+
+ is it disk or algorithm ?
+
+- file input size : 257,985,910 byte (~ 250 MB)
+- format of input field:
+
+ '\'':field01:'\''::';',
+
+- format of output field:
+
+ '':field01:''::'|',
+
+- number of fields in input & output : 11 fields
+- process max : 2 (this option does not actually affect the process)
+
+
+system copy time
+==============================================================================
+
+ real 0m2.906s
+ user 0m0.010s
+ sys 0m0.747s
+
+
+vos load+create
+==============================================================================
+
+test 000
+--------
+o process max row : 100,000
+o file buffer size : 8192
+
+ real 0m30.243s
+ user 0m55.680s
+ sys 0m1.567s
+
+
+test 001
+--------
+o process max row : 100,000
+o file buffer size : 1,024,000
+
+ real 0m30.296s
+ user 0m55.536s
+ sys 0m1.790s
+
+
+test 002
+--------
+
+o process max row : 200,000
+o file buffer size : 1,024,000
+
+ real 0m30.115s
+ user 0m55.956s
+ sys 0m1.500s
+
+
+test 003
+--------
+
+o process max row : 100,000
+o file buffer size : 51,200,000
+
+ real 0m29.924s
+ user 0m55.443s
+ sys 0m1.563s
+
+
+test 004
+--------
+
+o process max row : 500,000
+o file buffer size : 51,200,000
+
+ real 0m32.795s
+ user 0m57.013s
+ sys 0m1.697s
+
+
+(source change)
+before:
+- int record_read_filtered(struct Record **R, struct File *F,
+ struct Field *fld);
+after:
+- int record_read_filtered(struct Record **R, struct File *F,
+ struct Field *fld, struct String *str);
+
+
+test 005
+--------
+
+o process max row : 100,000
+o file buffer size : 8,192
+
+ real 0m29.783s
+ user 0m54.253s
+ sys 0m1.867s
+
+
+test 006
+--------
+
+o process max row : 100,000
+o file buffer size : 51,200,000
+
+ real 0m30.364s
+ user 0m56.000s
+ sys 0m1.570s
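One way to approach the "is it disk or algorithm ?" question with the numbers already in this log is to compare CPU time (user + sys) with elapsed time (real). This is a sketch with a hypothetical helper name, and the reading of the result is an inference from the time output alone, not a profile.

```c
#include <assert.h>

/*
 * CPU seconds spent per elapsed second. Values well below 1 mean the
 * run mostly waited (for example on disk); values above 1 mean more
 * than one thread was busy computing.
 */
double cpu_per_wall(double real_s, double user_s, double sys_s)
{
	assert(real_s > 0.0);

	return (user_s + sys_s) / real_s;
}
```

For the system copy this gives (0.010 + 0.747) / 2.906, about 0.26. For test 000 it gives (55.680 + 1.567) / 30.243, about 1.9, which suggests the run is dominated by CPU work rather than by the disk. That would also fit the observation that changing the file buffer size barely moves the timings.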
diff --git a/doc/dev/vos.test.create.mem.log b/doc/dev/vos.test.create.mem.log
new file mode 100644
index 0000000..f5db2f6
--- /dev/null
+++ b/doc/dev/vos.test.create.mem.log
@@ -0,0 +1,50 @@
+
+ How much memory vos load+create uses
+------------------------------------------------------------------------------
+
+o input file size : 51,521,908
+o input rows : 501,000
+o input fields : 11
+o output fields : 11
+o process max row : 100,000
+o process max : 2
+
+
+2009.01.14 - test 000
+------------------------------------------------------------------------------
+
+o file buffer size : 51,200,000
+o bytes allocated : 296,833,772
+o allocs : 11,000,473
+o running time (w/o memcheck) :
+
+ real 0m6.187s
+ user 0m10.789s
+ sys 0m0.363s
+
+
+2009.01.15 - test 001
+------------------------------------------------------------------------------
+
+o with new vos_process_create algorithm
+o file buffer size : 51,200,000
+o bytes allocated : 166,820,858 (~ 3 * input file size)
+o allocs : 3,500,662
+o running time (w/o memcheck) :
+
+ real 0m4.565s
+ user 0m8.026s
+ sys 0m0.327s
+
+
+2009.01.16 - test 002
+------------------------------------------------------------------------------
+o file buffer size : 8192 (default)
+o bytes allocated : 64,437,110 (~ 1.2 * input file size :)
+o allocs : 3,500,652
+o running time (w/o memcheck) :
+
+ real 0m4.361s
+ user 0m7.763s
+ sys 0m0.283s
+
diff --git a/doc/dev/vos.test.join.log b/doc/dev/vos.test.join.log
new file mode 100644
index 0000000..d3557f3
--- /dev/null
+++ b/doc/dev/vos.test.join.log
@@ -0,0 +1,39 @@
+ How fast vos_join is and how much memory it uses
+------------------------------------------------------------------------------
+
+o input file size 1 (already sorted) : 40,499,908
+o input file size 2 (already sorted) : 40,499,908
+
+o input rows : 501,000
+o input fields : 11
+o output fields : 22
+
+o process max row : 100,000
+o process max : 2
+o file buffer size : 8192 bytes
+
+
+2009.01.18 - test 000
+------------------------------------------------------------------------------
+
+o allocs : 24,048,866
+o bytes allocated : 417,724,740 (~ 5 * inputs file size)
+o running time (w/o memcheck) :
+
+ real 0m9.118s
+ user 0m8.483s
+ sys 0m0.237s
+
+
+2009.01.18 - test 001
+------------------------------------------------------------------------------
+
+o with new vos_join algorithm
+o allocs : 542
+o bytes allocated : 42,134 (~ 0.2 * inputs file size)
+o running time (w/o memcheck) :
+
+ real 0m5.336s
+ user 0m4.833s
+ sys 0m0.333s
+
diff --git a/doc/dev/vos.test.sort-00.log b/doc/dev/vos.test.sort-00.log
new file mode 100644
index 0000000..d297e74
--- /dev/null
+++ b/doc/dev/vos.test.sort-00.log
@@ -0,0 +1,65 @@
+ How much memory vos load+sort uses
+------------------------------------------------------------------------------
+
+o input file size : 51,521,908
+o input rows : 501,000
+o input fields : 11
+o output fields : 11
+o sorted fields : field03
+o process max row : 100,000
+o process max : 2
+
+
+2009.01.16 - test 000
+------------------------------------------------------------------------------
+
+o file buffer size : 8192 (default)
+o allocs : 24,048,740
+o bytes allocated : 417,820,691 (~ 8 * input file size)
+o running time (w/o memcheck) :
+
+ real 0m12.341s
+ user 0m15.849s
+ sys 0m0.627s
+
+
+2009.01.16 - test 001
+------------------------------------------------------------------------------
+
+o file buffer size : 51,200,000
+o allocs : 24,048,751
+o bytes allocated : 1,185,697,974 (~ 23 * input file size)
+o running time (w/o memcheck) :
+
+ real 0m12.341s
+ user 0m15.849s
+ sys 0m0.627s
+
+
+
+2009.01.16 - test 002
+------------------------------------------------------------------------------
+
+o with new sort_process algorithm
+o file buffer size : 8192
+o allocs : 18,624,738
+o bytes allocated : 332,184,755 (~ 6 * input file size)
+o running time (w/o memcheck) :
+
+ real 0m10.314s
+ user 0m13.059s
+ sys 0m0.583s
+
+
+2009.01.17 - test 003
+------------------------------------------------------------------------------
+
+o with new sort_process & vos_sort_merge algorithm
+o file buffer size : 8192
+o allocs : 6,600,924
+o bytes allocated : 123,352,391 (~ 2 * input file size)
+o running time (w/o memcheck) :
+
+ real 0m6.936s
+ user 0m9.379s
+ sys 0m0.560s