Chapter 7: Unix Power Tools

7.1 Intro

In working through this section it's good to have some files of dummy data. The following services can generate interesting dummy files to work on.

Mockaroo

Dummy Data Me

7.2 GREP

grep stands for "global regular expression print". It is a powerful command-line search program.

grep string file.txt

It searches the file you give it for the string you give it, and prints to the screen every line that contains a match.

The search is case sensitive.

grep -i apples fruit.txt
options
-i Sets the search to case insensitive
-w Matches on whole words
-v Returns lines that don't match (inverse match)
-n Shows line numbers for the returned results
-c Shows the number of matching lines
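
The options above can be tried out on a small throwaway file. This is a minimal sketch: the file contents are made up for illustration and are not the fruit.txt used elsewhere in this chapter.

```shell
# Build a small throwaway file (hypothetical contents) to search.
tmp=$(mktemp -d)
printf 'Apples are red\npineapples are yellow\napples\npears\n' > "$tmp/fruit.txt"

# -i: case-insensitive, so "Apples" on line 1 matches too
insensitive=$(grep -i apples "$tmp/fruit.txt")

# -w: whole words only, so "pineapples" is skipped
whole=$(grep -w apples "$tmp/fruit.txt")

# -v: lines that do NOT contain "apples"
inverse=$(grep -v apples "$tmp/fruit.txt")

# -n: prefix each matching line with its line number
numbered=$(grep -n apples "$tmp/fruit.txt")

# -c: count of matching lines
count=$(grep -c apples "$tmp/fruit.txt")

printf '%s\n' "-i:" "$insensitive" "-w:" "$whole" "-v:" "$inverse" "-n:" "$numbered" "-c:" "$count"
rm -r "$tmp"
```

Note that -c counts lines, not individual occurrences: a line containing the string twice still counts once.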

7.2.1 Input multiple files

To search multiple files, give grep the path to a folder full of files and use the -R option

grep -R apple Users/smerth/some_interesting_files
options
-R searches recursively through the files in the folder and its subfolders
-n adds line numbers to the match results
-h suppresses the file name and prints just the matching lines
-l prints only the names of files that contain a match, not the matching lines
-L prints only the names of files that don't contain a match
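
A minimal sketch of the recursive options, assuming a hypothetical folder layout created just for the demonstration (docs/a.txt, docs/b.txt and docs/sub/c.txt are invented names):

```shell
# Create a small, hypothetical folder tree to search.
tmp=$(mktemp -d)
mkdir -p "$tmp/docs/sub"
printf 'apple pie\n'   > "$tmp/docs/a.txt"
printf 'banana\n'      > "$tmp/docs/b.txt"
printf 'green apple\n' > "$tmp/docs/sub/c.txt"

# -R with -l: names of the files that contain a match
with_match=$(cd "$tmp" && grep -Rl apple docs | sort)

# -R with -L: names of the files that contain no match
without_match=$(cd "$tmp" && grep -RL apple docs | sort)

# -R with -h: the matching lines only, without the file name prefix
lines_only=$(cd "$tmp" && grep -Rh apple docs | sort)

printf '%s\n' "$with_match" "---" "$without_match" "---" "$lines_only"
rm -r "$tmp"
```

The results are piped through sort only to make the order predictable; grep itself reports files in the order it walks the folder.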

7.2.2 Wildcards

Use wildcards to refine the folders and files that are searched

ls *.txt

lists only the .txt files

so

grep apple Users/smerth/unix_files/*.txt

will look for the string apple in only the text files

7.2.3 Pipe output from other sources into grep

You can pipe results from a command into grep

ps aux | grep terminal

ps aux lists the running processes, and grep filters that list down to the lines containing the string terminal

or

ps aux | grep Applications

or

history | grep docker

This will search through your terminal history and return any lines (commands) you have run containing the string docker.

7.2.4 Pipe output into grep and grep results into another command

history | grep drupal | less

This pipes the grep results (the history lines containing the string drupal, such as commands run with the Drupal CLI) into less to give a paginated result list

7.2.5 Coloring matched text

grep --color lorem lorem_ipsum.txt

highlights the search results in a different color

grep --color=auto lorem lorem_ipsum.txt

Note that --color=auto will only color results if returning them to terminal window. You don't want markup in results you pipe to another process.
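
The difference between auto and always can be checked by capturing the output instead of printing it to the terminal. This sketch assumes a grep that supports --color (GNU and BSD grep do); the exact escape sequences emitted vary between versions, so only the lengths are compared.

```shell
# Captured output is not a terminal, so --color=auto stays plain.
auto=$(printf 'lorem ipsum\n' | grep --color=auto lorem)

# --color=always emits color escape sequences even into a pipe.
always=$(printf 'lorem ipsum\n' | grep --color=always lorem)

echo "auto:   ${#auto} characters"
echo "always: ${#always} characters (extra characters are the color markup)"
```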

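In .bashrc you can set and export a shell variable that holds the default options you want grep to use, such as coloring. Be aware that GNU grep has declared GREP_OPTIONS obsolete and recent releases ignore it; on those systems an alias such as alias grep='grep --color=auto' is the usual substitute.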

export GREP_OPTIONS="--color=auto"

--color=auto means color is on only when the results go to a terminal window. Use --color=always to force color everywhere.

For case-insensitive search to be always on, use:

export GREP_OPTIONS="-i"

7.2.6 Grep examples

7.2.6.1 Ex. 1 - Look in all files for a string and export lines containing that string to a new file

grep -n "YOUR SEARCH STRING" * > output-file

-n (include line numbers)

* (look in all files)

7.2.6.2 Ex. 2 - To look in a specific file

grep -n "YOUR SEARCH STRING" input-file > output-file

7.2.6.3 Ex. 3 - Find all lines in the input-file containing a string, except the lines that contain another string, and redirect the result set to the output-file

grep -n "test" input-file | grep -v "modern" > output-file

-v (inverse switch, returns lines that don't match the condition)

7.2.6.4 Ex. 4 - Grep from a set of input files

Given a set of input files: demo-1, demo-2, demo-3, demo-4, demo-5

grep "this" demo-*

7.3 REGEX

regex is short for regular expression. A regex is a sequence of characters that defines a search pattern.

You can use regular expressions with grep.

7.3.0.1 Using regular expressions with grep

It's a good idea to put quotes around the regex because some regex symbols are also special characters in the Unix shell.

When using character classes such as [:alpha:] with grep you need to put them inside brackets, [[:alpha:]], so grep knows you are trying to match the character class and not the set of literal characters ":", "a", "l", "p" and "h".

There are basic and extended regular expressions. When using a text editor make sure you know which one it supports. In the terminal you can pass -E to grep to use the extended regex set, or set it as a default option:

export GREP_OPTIONS="-E"

7.3.0.2 example: grep with regex

grep "foo.*bar" demo_file

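matches lines that contain "foo" followed, anywhere later on the line, by "bar". The . matches any single character and the * repeats it zero or more times, so anything (or nothing) may sit in between. To require that the line begins with foo and ends with bar, anchor the pattern: "^foo.*bar$".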
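
A short sketch of how the pattern behaves with and without anchors, and of a bracketed character class. The contents of demo_file here are invented for the demonstration.

```shell
# Hypothetical demo_file with four lines to match against.
tmp=$(mktemp -d)
printf 'foo and bar\nfoobar\nbar then foo\nfood at the bar 42\n' > "$tmp/demo_file"

# Unanchored: "foo", then any run of characters (possibly empty), then "bar".
# Counts lines 1, 2 and 4; line 3 has no "bar" after "foo".
unanchored=$(grep -c 'foo.*bar' "$tmp/demo_file")

# Anchored: the line must start with "foo" and end with "bar".
anchored=$(grep '^foo.*bar$' "$tmp/demo_file")

# Character class inside a bracket expression, with -E for the + operator:
# lines made up of letters only.
letters=$(grep -E '^[[:alpha:]]+$' "$tmp/demo_file")

printf '%s\n' "$unanchored" "---" "$anchored" "---" "$letters"
rm -r "$tmp"
```

Single quotes are used so the shell leaves $, * and the brackets alone.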

7.3.0.3 Resources

N.B. Regex is a whole world unto itself; consult another document for more detail.

Regex One

Regexr - Online Sandbox

Regular expressions in Javascript

7.4 TR

Translating characters. tr finds and replaces individual characters, not whole strings.

The first argument to tr is the set of characters to search for; the second argument is the set of replacement characters

echo "a,b,c" | tr ',' '-'

Here it looks like tr searches the input string for , and replaces each one it finds with a -. What it actually does is map each character in the first set to the character at the same position in the second set.

tr '123456' 'EBGDAE'

For example, all instances of 1 are replaced by E, all instances of 2 are replaced by B, etc...

Run this example to see it at work.

echo "The first argument is the search string, the second is the replacement string" | tr 'ag' '12345'

If the replacement set is shorter than the search set, GNU and BSD tr repeat the last character of the replacement set until the two sets are the same length. If it is longer, as in the example above, the extra characters are ignored.
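
A small sketch of what happens when the two sets differ in length. The padding behavior shown is that of GNU and BSD tr; other implementations may differ, so treat the middle case as an assumption about your system.

```shell
# Equal-length sets: a->1, b->2, c->3
equal=$(echo 'aabbcc' | tr 'abc' '123')

# Shorter replacement set: the last character (2) is repeated,
# so both b and c are mapped to 2.
padded=$(echo 'aabbcc' | tr 'abc' '12')

# Longer replacement set: 3 and 4 are never used, and c is left alone.
extra=$(echo 'aabbcc' | tr 'ab' '1234')

printf '%s\n' "$equal" "$padded" "$extra"
```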

7.4.1 TR Examples

7.4.1.1 Ex. 1 - take a file as input

tr 'A-Z' 'a-z' < people.txt

takes the file as input and swaps uppercase alphabet for lowercase alphabet

7.4.1.2 Ex. 2 - take a file as input and output to another file

tr ',' '\t' < people.csv > people.tsv

replace comma with a tab in a csv delimited datafile, then output to a tab delimited file.

7.4.1.3 Ex. 3 - delete characters

-d (delete characters in a listed set)

echo 'abc1233deee567f' | tr -d '[:digit:]'

returns: abcdeeef

7.4.1.4 Ex. 4 - squeeze characters

-s (squeeze will delete repeats in a listed set)

echo 'abc1233deee567f' | tr -s '[:digit:]'

returns: abc123deee567f

7.4.1.5 Ex. 5 - return the complementary set

-c (use complementary set)

7.4.1.6 Ex. 6 - delete characters not in a listed set

echo 'abc1233deee567f' | tr -dc '[:digit:]'

returns: 1233567

7.4.1.7 Ex. 7 - squeeze characters not in a listed set

echo 'abc1233deee567f' | tr -sc '[:digit:]'

returns: abc1233de567f

7.4.1.8 Ex. 8 - specify an argument for each option

echo 'abc1233deee567f' | tr -ds '[:alpha:]' '[:digit:]'

When -d and -s are used together, the first set is deleted and the second set is squeezed. So this means: delete the alpha characters and squeeze the digits

returns: 123567
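
When -d and -s are combined, tr deletes the characters in the first set and then squeezes the characters in the second set, so the order of the two sets matters. A quick sketch that runs both orders on the same input:

```shell
# Delete letters (first set), then squeeze repeated digits (second set).
digits_left=$(echo 'abc1233deee567f' | tr -ds '[:alpha:]' '[:digit:]')

# Delete digits (first set), then squeeze repeated letters (second set).
letters_left=$(echo 'abc1233deee567f' | tr -ds '[:digit:]' '[:alpha:]')

printf '%s\n' "$digits_left" "$letters_left"
```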

7.4.1.9 Ex. 9 - remove all non-printable characters from a file

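tr -dc '[:print:]\n' < file1 > file2

The newline is added to the kept set because [:print:] does not include it; without it every line break in the file would be deleted as well.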

7.4.1.10 Ex. 10 - remove all surplus carriage return and end-of-file characters (cleaning Windows documents)

tr -d '\015\032' < windows.file > unix.file
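
To see what this removes, the sketch below builds a simulated Windows-style stream with printf instead of using a real windows.file: two lines ending in carriage return plus newline, followed by a Ctrl-Z (octal 032) end-of-file marker.

```shell
# Simulated Windows text: CRLF line endings and a trailing Ctrl-Z.
cleaned=$(printf 'one\r\ntwo\r\n\032' | tr -d '\015\032')

# Byte counts before and after; tr -d ' ' strips the padding some wc versions add.
before=$(printf 'one\r\ntwo\r\n\032' | wc -c | tr -d ' ')
after=$(printf 'one\r\ntwo\r\n\032' | tr -d '\015\032' | wc -c | tr -d ' ')

printf '%s\n' "$cleaned"
echo "bytes before: $before, bytes after: $after"
```

Three bytes disappear: the two carriage returns and the Ctrl-Z. The newlines are kept.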

7.4.1.11 Ex. 11 - remove double spaces from file

tr -s ' ' < file1 > file2

7.5 SED

Stream editor takes a stream of input and edits it.

The sed substitution command follows this pattern

sed 's/a/b/'

s: substitution
a: search string
b: replacement string

echo 'upstream' | sed 's/up/down/'

returns: downstream

By default it only changes the first occurrence of the search string on each line

7.5.1 sed: global replacement

Adding a 'g' flag for global changes the action to substituting all occurrences

echo 'upstream and upward' | sed 's/up/down/g'

returns: downstream and downward

the delimiter can be anything you choose:

sed 's/up/down/'

or

sed 's:up:down:'

or

sed 's|up|down|'

You can switch to another delimiter when the search string you are using contains the default delimiter.
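
File paths are the typical case. In this sketch the path /usr/local/bin is just an illustrative value; both commands produce the same result, but the second is much easier to read.

```shell
# With / as the delimiter, every slash in the path has to be escaped.
escaped=$(echo '/usr/local/bin' | sed 's/\/usr\/local/\/opt/')

# With | as the delimiter, the slashes need no escaping.
piped=$(echo '/usr/local/bin' | sed 's|/usr/local|/opt|')

printf '%s\n' "$escaped" "$piped"
```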

When feeding a file into sed, the file is processed line by line, so the substitution is applied to each line in turn.

sed 's/pear/mango/' fruit.txt

Notice that sed will take a file as input without using the '<' character (although you can use it)

You can string multiple sed commands using the -e option

sed -e 's/pear/mango/' -e 's/apple/mango/' fruit.txt
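
A sketch of chained substitutions on a file, with and without the g flag. The contents of fruit.txt are invented for the example.

```shell
# Hypothetical fruit.txt with a repeated word on the first line.
tmp=$(mktemp -d)
printf 'pear apple pear\napple\nplum\n' > "$tmp/fruit.txt"

# Each -e expression runs in order on every line; without g only the
# first pear on a line is replaced.
first_only=$(sed -e 's/pear/mango/' -e 's/apple/mango/' "$tmp/fruit.txt")

# With g every occurrence on the line is replaced.
every=$(sed -e 's/pear/mango/g' -e 's/apple/mango/g' "$tmp/fruit.txt")

# sed writes to standard output; the input file itself is untouched.
unchanged=$(cat "$tmp/fruit.txt")

printf '%s\n' "$first_only" "---" "$every" "---" "$unchanged"
rm -r "$tmp"
```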

7.5.2 sed: regex

Regex works with sed in the same way that it works with grep. As with grep you need to watch out for extended regex syntax (or set sed to use the extended regex set by default, for example with an alias in .bashrc)

In the following examples we will set sed to use extended regex set using

-E

7.5.3 sed: regex back-references

Back-references are part of regular expressions and sed makes good use of them

echo 'daytime' | sed -E 's/(...)time/\1light/'
returns: daylight

You can define up to 9 back-references in the search string, that is, up to 9 sets of parentheses each capturing part of the match. They are referenced by the numbers \1 through \9, in order, and can be used in the replacement string to reinsert the captured text.

echo 'Dan Stevens' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2, \1/'
Returns: Stevens, Dan

You could run that over an entire file.
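
For instance, a sketch that swaps first and last names on every line of a file. The file people.txt and its contents here are made up for illustration.

```shell
# Hypothetical list of "First Last" names, one per line.
tmp=$(mktemp -d)
printf 'Dan Stevens\nAda Lovelace\nGrace Hopper\n' > "$tmp/people.txt"

# \1 is the first captured word, \2 the second; the replacement swaps them.
swapped=$(sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2, \1/' "$tmp/people.txt")

printf '%s\n' "$swapped"
rm -r "$tmp"
```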

7.6 CUT

Cutting select text portions. cut selects text from each line in one of three ways:

options
-c characters
-b bytes
-f fields

7.6.0.1 Non de-limited files

cut -c 1-10 < mock_lorem_ipsum.txt
Cuts a column of data in the file defined by position of characters 1-10 on each line

you can cut multiple columns

cut -c 1-10,34-37,47- < mock_lorem_ipsum.txt
NB: When working with non-delimited files you define the columns you want to cut by indicating a range of characters (5-19)

7.6.0.2 De-limited files

But if the file is delimited, all you need to do is indicate the column numbers with the -f (fields) option. The default delimiter is a tab.

cut -f 2,3 < mock_tab_data.txt
cuts out just columns 2 and 3

You can specify a different delimiter with the -d option

cut -f 2,6 -d "," < presidents.csv
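
A minimal sketch of field and character cuts. The column layout of presidents.csv is not shown in this chapter, so the four-column file below is an assumption made for the demonstration.

```shell
# Hypothetical comma-delimited file: number, first name, last name, year.
tmp=$(mktemp -d)
printf '1,George,Washington,1789\n2,John,Adams,1797\n' > "$tmp/presidents.csv"

# Fields 2 and 3, using a comma as the delimiter.
names=$(cut -d ',' -f 2,3 "$tmp/presidents.csv")

# An open-ended range: field 3 through the end of the line.
from_third=$(cut -d ',' -f 3- "$tmp/presidents.csv")

# Character positions 1-3 of each line, ignoring delimiters entirely.
chars=$(cut -c 1-3 "$tmp/presidents.csv")

printf '%s\n' "$names" "---" "$from_third" "---" "$chars"
rm -r "$tmp"
```

Note that cut keeps the delimiter between the fields it outputs.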

macOS Note!

When using Terminal on macOS there is a trick to entering a literal tab character on the command line (for example as the argument to -d).

Press ctrl + v and then quickly press tab. It takes a little practice...

7.7 DIFF

Comparing files

diff mock_lorem_start.txt mock_lorem_edited.txt
In the output of this command

> marks a line that appears only in the second file (an insertion).

< marks a line that appears only in the first file (a deletion).

options
-i Case insensitive
-b Ignore changes in the amount of whitespace
-w Ignore all whitespace
-B Ignore blank lines
-r recursively compare directories
-s show identical files

7.7.1 diff: Alternative formats

options
-c Copied context
-u Unified context
-y side-by-side
-q Only whether files differ
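
A small sketch of the default output format and two of the options, using two invented files that differ by one changed line and one added line:

```shell
# Two hypothetical versions of a file.
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n'        > "$tmp/start.txt"
printf 'alpha\nBETA\ngamma\ndelta\n' > "$tmp/edited.txt"

# diff exits with status 1 when the files differ, hence the "|| true".
normal=$(diff "$tmp/start.txt" "$tmp/edited.txt" || true)

# -i ignores the beta/BETA case change, leaving only the added line.
ignore_case=$(diff -i "$tmp/start.txt" "$tmp/edited.txt" || true)

# -q only reports whether the files differ; the exit status says the same.
status=0
diff -q "$tmp/start.txt" "$tmp/edited.txt" > /dev/null || status=$?

printf '%s\n' "$normal" "---" "$ignore_case" "---" "exit status: $status"
rm -r "$tmp"
```

In the default format, 2c2 means line 2 was changed and 3a4 means a line was added after line 3 of the first file, becoming line 4 of the second.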

7.7.2 Using a text editor to view diffs

Do a search on your Mac for FileMerge. It comes with Apple's Developer Tools.

It's not pretty but gets the job done.

7.8 XARGS

Passing an argument list to commands

xargs is short for "execute as arguments".

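xargs parses an input stream into items and passes those items to a command as arguments. By default it passes as many items as it can to a single run of the command; options such as -n and -I make it loop over the items in smaller batches.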

Here's an example

wc mock_lorem_ipsum.txt

wc (word count) returns:

3077   37859  256409 mock_lorem_ipsum.txt

3077 - newline count

37859 - word count

256409 - byte count

echo 'mock_lorem_ipsum.txt' | wc

wc returns

       1       1      21

Because the string 'mock_lorem_ipsum.txt' (1 line, 1 word, 21 bytes including the newline) is being passed to wc instead of the contents of the file.

To pass the file to wc as an argument use xargs.

echo 'mock_lorem_ipsum.txt' | xargs wc

Returns the word count stats for the file (ie: the file is passed into the command wc as an argument)

You can use the -t option to see what it does

echo 'mock_lorem_ipsum.txt' | xargs -t wc

-t outputs the commands as they are run

Now try to pass multiple arguments into wc

echo 'lorem_1.txt lorem_2.txt' | xargs -t wc

With -t you can see that both file names were passed as arguments to a single run of wc (wc lorem_1.txt lorem_2.txt).

wc lists the results for each file one after another. Rather conveniently, because it received both files at once, wc also returns totals.

If you want the command to run once for every x arguments, use the -n option

echo 'lorem_1.txt lorem_2.txt' | xargs -t -n1 wc

Here -n is set to 1. So wc runs with argument 1, then runs again with argument 2, etc...

To see this clearly run:

echo 1 2 3 4 | xargs -t -n2

When no command is given, xargs runs echo. Here echo is run first with arguments 1 and 2, then run again with arguments 3 and 4.
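
The batching is easiest to see by comparing the output for different values of -n. A brief sketch, relying on the fact that xargs runs echo when no command is named:

```shell
# No -n: all four items go to a single echo, producing one line.
all_at_once=$(echo 1 2 3 4 | xargs)

# -n2: echo runs twice with two items each, producing two lines.
pairs=$(echo 1 2 3 4 | xargs -n2)

# -n1: echo runs four times; count the lines (tr strips wc's padding).
singles=$(echo 1 2 3 4 | xargs -n1 | wc -l | tr -d ' ')

printf '%s\n' "$all_at_once" "---" "$pairs" "---" "$singles"
```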

Here's another example:

cat mock_companies.csv | xargs -I {} echo "Buy stock in: {}"

-I specifies a placeholder that is replaced by each line of the output from cat, and the command is run once per line. So the result is something like

Buy stock in: Trilia
Buy stock in: Cogilith
Buy stock in: Centidel
etc...
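
With -I each whole input line becomes one item, so a name containing a space stays in one piece. The sketch below feeds xargs from printf instead of mock_companies.csv, and "Acme Corp" is a made-up name added to show the point.

```shell
# Each input line replaces {} in its own run of echo,
# including the line that contains a space.
stock=$(printf 'Trilia\nAcme Corp\n' | xargs -I {} echo "Buy stock in: {}")

printf '%s\n' "$stock"
```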

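When passing filenames as arguments there is an issue with names containing a space: xargs may see each word in the name as a new argument. Use the -0 (zero) option, which makes xargs split its input only on NUL characters. The input must then be NUL-delimited, so here tr converts the newlines coming from grep into NULs first.

ls ~/Library/ | grep 'A.*' | tr '\n' '\0' | xargs -0 -n1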
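
A sketch of the problem and the fix, using a temporary folder with invented file names instead of ~/Library. It uses find with -print0, which GNU and BSD find both provide, to produce NUL-delimited names directly.

```shell
# Two hypothetical files, one with a space in its name.
tmp=$(mktemp -d)
touch "$tmp/Application Support" "$tmp/Audio"

# Default splitting on whitespace: "Application Support" becomes two
# items, so three items are echoed for two files.
split_count=$(ls "$tmp" | xargs -n1 | wc -l | tr -d ' ')

# NUL-delimited input with -0: each file name stays a single item.
nul_count=$(find "$tmp" -type f -print0 | xargs -0 -n1 | wc -l | tr -d ' ')

echo "whitespace splitting: $split_count items, NUL splitting: $nul_count items"
rm -r "$tmp"
```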

7.8.1 xargs Examples

7.8.1.1 Example 1 - Take a list of company names and make a directory called "Companies" on the desktop containing a sub-directory for each company in the list

cat mock_companies.csv | sort | uniq | xargs -I {} mkdir -p ~/Desktop/Companies/{}
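
To try the same pipeline without touching the desktop, the sketch below runs it inside a temporary folder. The company list is a stand-in for mock_companies.csv and includes a duplicate to show what uniq is for.

```shell
# Hypothetical company list with one duplicate entry.
tmp=$(mktemp -d)
printf 'Trilia\nCogilith\nTrilia\nCentidel\n' > "$tmp/mock_companies.csv"

# sort groups duplicates together, uniq drops them, and xargs runs
# mkdir -p once per remaining name.
sort "$tmp/mock_companies.csv" | uniq | xargs -I {} mkdir -p "$tmp/Companies/{}"

created=$(ls "$tmp/Companies")
printf '%s\n' "$created"
rm -r "$tmp"
```

mkdir -p creates the parent Companies folder on the first run and does not complain if a folder already exists.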