Monday, May 4, 2020

decomet - super fast command-line tool to remove comments and Unicode whitespace and to dedup repeated empty lines in source/text files


decomet is a very fast Windows command-line tool that minifies source code. By default it removes all comments that start with // or are enclosed in /* */.
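To illustrate the default behavior, here is a minimal Python sketch of comment removal with a small state machine. It is not decomet's code (the tool is written in C/C++); the function name strip_comments and the five states are assumptions made for this example. It tracks string and character literals so that // inside a quoted string is left alone.

```python
def strip_comments(src: str) -> str:
    """Remove // and /* */ comments, leaving quoted literals untouched."""
    CODE, LINE, BLOCK, STRING, CHAR = range(5)
    state = CODE
    out = []
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        nxt = src[i + 1] if i + 1 < n else ""
        if state == CODE:
            if ch == "/" and nxt == "/":
                state = LINE          # skip until end of line
                i += 2
                continue
            if ch == "/" and nxt == "*":
                state = BLOCK         # skip until the closing */
                i += 2
                continue
            if ch == '"':
                state = STRING
            elif ch == "'":
                state = CHAR
            out.append(ch)
        elif state == LINE:
            if ch == "\n":            # the newline itself is kept
                state = CODE
                out.append(ch)
        elif state == BLOCK:
            if ch == "*" and nxt == "/":
                state = CODE
                i += 2
                continue
        else:                         # inside a string or char literal
            out.append(ch)
            if ch == "\\" and nxt:    # copy the escaped character verbatim
                out.append(nxt)
                i += 2
                continue
            if (state == STRING and ch == '"') or (state == CHAR and ch == "'"):
                state = CODE
        i += 1
    return "".join(out)
```

For example, strip_comments('s = "http://x"; // c') keeps the URL inside the string and drops only the trailing comment. The sketch does not handle language-specific cases such as JavaScript regex literals or C++ raw strings.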

Features:
  1. remove all blank/empty lines. Blank means the line contains only whitespace*
  2. remove all Unicode control characters except the whitespace controls U+0009..U+000D (tab, line feed, vertical tab, form feed and carriage return)
  3. collapse runs of duplicate empty lines into 1 line, which keeps the code readable
  4. remove indent whitespace*
  5. minify and normalize whitespace* to a single space
  6. prefix each line with its line number: the number, a tab, then the line
  7. recurse into sub-directories
  8. funnel all output into a single output directory

*ISO 30112 defines POSIX whitespace characters for function iswspace() for locale 'en_US.UTF8' as Unicode characters U+0009..U+000D, U+0020, U+1680, U+180E, U+2000..U+2006, U+2008..U+200A, U+2028, U+2029, U+205F, and U+3000
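The whitespace definition above can be turned into a character class directly. The following Python sketch shows one way blank-line detection, indent removal, whitespace normalization and empty-line collapsing could work with that set. The function names are hypothetical, and this is an illustration of the idea rather than decomet's implementation.

```python
import re

# Code points copied from the whitespace definition above.
WS = (r"\u0009-\u000D\u0020\u1680\u180E"
      r"\u2000-\u2006\u2008-\u200A\u2028\u2029\u205F\u3000")

BLANK = re.compile(rf"[{WS}]*")     # a whole line made of whitespace (or empty)
WS_RUN = re.compile(rf"[{WS}]+")    # one or more whitespace characters
INDENT = re.compile(rf"^[{WS}]+")   # leading whitespace only


def is_blank(line: str) -> bool:
    """True if the line is empty or holds only whitespace from the set."""
    return BLANK.fullmatch(line) is not None


def strip_indent(line: str) -> str:
    """Remove leading whitespace."""
    return INDENT.sub("", line)


def normalize_spaces(line: str) -> str:
    """Replace every run of whitespace with a single ASCII space."""
    return WS_RUN.sub(" ", line)


def collapse_blank_lines(lines: list[str]) -> list[str]:
    """Keep the first line of each run of blank lines, drop the rest."""
    out = []
    prev_blank = False
    for line in lines:
        blank = is_blank(line)
        if blank and prev_blank:
            continue
        out.append(line)
        prev_blank = blank
    return out
```

Note that the set is narrower than "anything that looks like a space": U+2007 (figure space) and U+00A0 (no-break space) are not in the list, so a line holding only one of those is not treated as blank by this sketch.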
  
It is very fast and written in a mixture of C and C++.
It reads and writes UTF-8 source code.
File names may contain Unicode characters.
It reports elapsed time in human-readable form.

This project is based on the code at http://code.google.com/p/cpp-decomment/, greatly improved to handle Unicode spaces, control characters, UTF-8 files and UTF-8 file names. The state machine has also been optimized and corrected, and the code was reworked to make sure all the switches actually work.
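As an illustration of the control-character handling, here is a short Python sketch that removes the ranges the tool's help text lists for the -c switch (U+0001..U+0008, U+000E..U+001F and U+007F..U+009F). The function name is an assumption for this example; it shows the filtering rule only, not how decomet implements it.

```python
import re

# Ranges taken from the -c entry in the help text; U+0009..U+000D
# (tab, line feed, vertical tab, form feed, carriage return) are kept.
CONTROL_CHARS = re.compile(r"[\u0001-\u0008\u000E-\u001F\u007F-\u009F]")


def remove_control_chars(text: str) -> str:
    """Delete control characters in the listed ranges, keep everything else."""
    return CONTROL_CHARS.sub("", text)
```

For instance, remove_control_chars("a\x01b\tc") returns "ab\tc": the U+0001 character is dropped and the tab survives.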

Download decomet.zip.
The demo version outputs 10 lines of a single file and opens this page on each run.
Email metadataconsult@gmail.com to request a license.

As with all my software, this is 100% free of malware and spyware. I am trying to sell this, and bundling either would be a bad idea.

To save the help text to a file 'help.txt', redirect stderr: decomet -h 2> help.txt


Usage: decomet -[bcehimnprsv] [-d<DIR>] file1.c file2.js ...
 Decomment source files, optionally remove whitespace, control characters and duplicate empty lines

DEMO Edition - limited to 10 lines and 1 file!
               Get a license version from metadataconsult@gmail.com

  -b         remove all whitespace* blank/empty lines
  -c         preprocess & remove control characters in ASCII and UNicode range
             U+0001..U+0008, U+000E..U+001F and U+007F..U+009F, respectively.
             NOTE: U+001A 'SUB' Substitute character will terminate reading a text file unexpectedly.
  -e         Removes duplicate Unicode whitespace* entire lines aka 'empty lines', leaving 1 line.
             *ISO 30112 defines POSIX whitespace characters for function iswspace() for locale 'en_US.UTF8' as Unicode characters
             U+0009..U+000D, U+0020, U+1680, U+180E, U+2000..U+2006, U+2008..U+200A, U+2028, U+2029, U+205F, and U+3000
  -h         display help message
  -i         remove indent whitespace*
  -m         minify && normalize whitespace* to a single space
  -n         prefix with line number
  -p         preview files matching wildcard for recursive search
  -r         recursive search sub-dirs under the input-file's folder - file wildcard needed
  -s         output to stdout, instead of output-files (infile1.c.dec)
  -v         switch off verbose - default on

  -d<DIR>    output funnel directory, no space after -d

  file[*?].c input-files, file wildcard [?*] allowed. The output-file is 'filexxx.c.dec'

Features:

 Fast, written in mainly C, C++ for Unicode support
 Read and writes UTF-8 text files
 Implements a state machine for parsing to remove comments, enforce min. spaces, etc.
 Implements a stack for file/folder traversal

Limitations:

 Each line length is a max of 100,000 characters wide
 Does not handle long file paths (>260)

Notes:

 org src code - http://code.google.com/p/cpp-decomment/
 improved to handle Unicode, UTF-8 files && remove duplicate lines, Unicode whitespace
 fixed stack imp (org. failed if single double quote found with -m switch)
 improved to assure all switches work correctly, etc.

decomet demo version 2.0.2.0
copyright 2020 metadataconsulting.ca
https://metadataconsulting.blogspot.com/2020/05/decomet-super-fast-command-line-to-remove-comments-Unicode-whitespace-and-dedup-multiple-lines-from-source-text-files.html
Get a license version from metadataconsult@gmail.com

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
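The feature list in the help text mentions a stack for file/folder traversal. A minimal Python sketch of that general technique follows: an explicit stack of folders replaces recursion, and file names are matched against a wildcard. The function name find_files and the use of fnmatch are assumptions for this example and say nothing about how decomet itself is written.

```python
import fnmatch
import os


def find_files(root: str, pattern: str) -> list[str]:
    """Walk root iteratively with an explicit stack; return matching paths."""
    matches = []
    stack = [root]                      # folders still to be visited
    while stack:
        folder = stack.pop()
        with os.scandir(folder) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)      # visit this folder later
                elif entry.is_file() and fnmatch.fnmatch(entry.name, pattern):
                    matches.append(entry.path)
    return sorted(matches)
```

Calling find_files("src", "*.c") would list every .c file under a hypothetical src folder, including sub-directories, without any recursive function calls.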

Speed test on a 100-line C++ file.

1. input2.cpp
Input   100 lines.
Output  100 lines.
Removed 0 lines.
Elapsed 3ms.

Speed test on a 1 GB text file.

I:\WORK-CODE\Visual Studio Projects\decomment\Debug>decomet -e 1gb.txt
Input   42949674 lines.
Output  42949670 lines.
Removed 4 lines.
Elapsed 6min 29s 170ms.
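To put the numbers above in perspective, the sketch below formats a millisecond count in the same style as the "Elapsed" lines and works out the throughput of the 1 GB run. Both function names are hypothetical and the formatting rule is inferred from the two outputs shown here, so treat it as an approximation of what the tool prints.

```python
def format_elapsed(ms: int) -> str:
    """Format milliseconds like '6min 29s 170ms', omitting leading zero units."""
    minutes, rem = divmod(ms, 60_000)
    seconds, millis = divmod(rem, 1_000)
    parts = []
    if minutes:
        parts.append(f"{minutes}min")
    if minutes or seconds:
        parts.append(f"{seconds}s")
    parts.append(f"{millis}ms")
    return " ".join(parts)


def lines_per_second(lines: int, ms: int) -> float:
    """Average number of input lines processed per second."""
    return lines / (ms / 1000)


# The 1 GB run above: 42949674 input lines in 6min 29s 170ms = 389170 ms.
rate = lines_per_second(42949674, 389170)
```

Here rate comes out at roughly 110,000 lines per second. The command prompt in the run above shows a Debug folder, so a release build may behave differently.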
