NAME programming - a point of introduction to programming, in unix

PAGE DATE 19980122-19981020

ABOUT THIS FILE
This file is intended to provide a beginning point for learning computer programming. Most tutorials for a specific programming language will assume you know what's presented here. The concepts here also crop up in the more powerful user commands, such as find. Examples assume a unix environment. Explanations are intended to be practical, as opposed to theoretically or historically precise. Basic end-user skills are assumed. Topics introduced include Boolean logic, flow control, conditionals, and binary numbers. These concepts are fundamental to programming, and they are not as complicated as they sound. The summary includes a brief survey of programming tools for unix. This file is crawling with personal opinion, some of which is quite non-standard. Don't flame me, write a better version.

The programming language used for examples is dcon, which can be obtained as C sourcecode ( heh, as if you knew what the heck THAT was :o) at e.g. ftp://sunsite.unc.edu/pub/Linux/system/serial/dcon0.97.tgz and installs easily in Linux. You may even be able to comprehend the examples from just reading them, but playing with them is better.

The examples in this file should be viewed in a simple viewer that does not interpret backslash escapes. If you don't see a backslash and an n there ---> \n the examples will be messed up. more or an editor like pico or vi will be OK, as will Lynx. The *roff processor called by the typical ( non-cLIeNUX ) man command may screw it up.

CALCULATING
To pick a practical starting point for a discussion of programming, consider a simple electronic calculator. This will give us a reference point for explaining what a computer really is. This will also allow us to introduce some of the things you should know about "2 + 2" when you're programming that you don't need to know about when you're calculating. A lot of terms will be introduced or used in a programming sense in the next few paragraphs. If you find this boring, hang in there. Getting a computer to do what YOU want it to is very satisfying. Also, knowing the basics of programming makes many things clearer from the "end-user" point of view, particularly in unix, but less so, I hope, in cLIeNUX.

A calculator can perform the type of data processing we call arithmetic. Using the buttons on the calculator you can input a number or two and instruct it to perform a calculation on the numbers you have input, and it will display a result. You can then instruct the calculator to perform some other calculation. Even some of the very simplest calculators will allow you to use the answer from a calculation in the next calculation, since the number in the display is also stored in memory. In this way the previous calculation is preserved in the current state of the calculator.

On a Hewlett-Packard 10b business calculator, the button sequence... 41 / 7 = will produce a display of 5.857143 . ( Actually, the HP handheld has a dash-and-2-dots divide symbol. ) The same calculation in the xcalc calculator emulator for the X Window System gives an answer of 5.8571429 . Not only do these two highly regarded tools disagree, but neither is perfectly correct. Oh dear! Since 41 and 7 are integers, there is a very simple way to give a precise and correct answer to the calculation 41 / 7 . In integer arithmetic 41 / 7 equals "5 remainder 6". The remainder is also known as the modulo. The integer representation is strictly correct and precise, but very inconvenient, since the answer is two distinct numbers. Also, "5 remainder 6" may not really express what we want to know about 41 / 7 . A rational number such as 5.857143 is a better answer to "Approximately how big is a one-seventh slice of 41?", whereas integers better answer the question "How many whole sevens are there in 41?".

Our two example calculators are geared for the rational numbers. When you input a 12 to the HP or to xcalc they store it in an internal representation for rational numbers something like +12.00000. On the other hand, the unix bc calculator utility prefers integers. If you type bc and then tell bc 41 / 7 bc will answer with 5 . The modulo can be obtained separately with the modulo operator, e.g. 41 % 7 .
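You don't even have to enter bc's interactive mode; you can feed it an expression on a pipe. Something like this should show both the integer quotient and the modulo:

	$ echo "41 / 7" | bc
	5
	$ echo "41 % 7" | bc
	6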

The internal representation of rational numbers is slightly different in the HP calculator and in xcalc. bc's integers are very different, and integer operations are actually very different from operations on rationals. There is an internal commonality between all these calculating tools, however. Basic electronic digital circuits don't have zero through nine. The basic circuits don't have decimal points or exponents. Digital circuitry works on ones and zeros. Any data a computer can hold, from numbers to strings of text characters to pictures to phone calls to whatever, must at some point be converted to ones and zeros. A single minimum piece of digital data, which may be either a zero or a one, is called a bit. If a bit is being used to represent part of a number then it is one binary digit.

COMPUTING
All our example numbers were calculated using no other digits besides one and zero. Humans have 20 digits, in two sets of ten. Everyday numbers are base 10, also known as decimal. Ten was chosen instead of 20 because of the difficulty of counting on your toes. With just one and zero we can have a base 2 number system, known as binary. The conversion from the decimal numbers you input, to the binary form they were calculated in, and back to the decimal form the answer was displayed in, was hidden from you. In the context of a calculator that's the way you want it: you want to think about the kind of numbers you're working with, not the machinery underneath. Unlike a calculator, though, a computer might be asked to manipulate any kind of data.

Numbers of any size can be represented by bits in binary, just as numbers of any size can be represented by 0 through 9 in decimal. A group of 8 bits is called a byte. There are 256 possible settings of 8 bits, from 00000000 to 11111111, which is 0 to 255 decimal.
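Each bit position in a byte carries a place value that doubles from right to left, just as each decimal digit position is worth ten times the one to its right. The all-ones byte works out like this:

	place value   128   64   32   16    8    4    2    1
	bits            1    1    1    1    1    1    1    1

	11111111  =  128+64+32+16+8+4+2+1  =  255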

The points I hope I've made by now are: all digital computation is performed on ones and zeros; any data can be represented by ones and zeros; arithmetic is just one of many types of data processing; and even the simplest arithmetic can be performed a variety of ways, or rather, everyday arithmetic calculation usually involves some assumptions about the types of numbers in question.

A computer can perform any kind of data processing; it is not limited to arithmetic. Computing vaguely resembles a subset of thinking. However, thinking is rooted in the motives, emotions and experiences of the individual, things computers do not have. Computers can do well at strictly confined tasks, such as playing chess, but the most powerful computers currently in existence can only emulate the behavior of a microscopic quantity of biological brain cells. Thinking is also analog and massively parallel, as different from a typical computer as can be. Most computers only do one computation at a time, although very quickly, while a brain does many, many things at once.

There are several things a computer needs to be able to do that a calculator doesn't. A computer must be able to perform operations on individual bits. It must be able to perform certain operations conditionally, i.e. it must have some form of "IF". In other words, it must be able to "decide" whether or not to do something based on some condition. Also, a computer must be able to do things repetitively. These features of computers ( general data manipulation, conditionals, and repetition ) are the difference between calculations and computations, or formulas and algorithms. The distinction between computers and calculators dates from the days when both "calculators" and "computers" were typically human beings, perhaps aided by a sliderule or mechanical adding machine. It's not a matter of power or speed, it's a matter of versatility. There are many types of data manipulations that simply can't be done without conditional execution of steps, for instance.

I ran abruptly into this difference myself once with a "programmable calculator" I used to have. It could store and sequentially execute a program of 500 or so instructions, instructions in this case being calculator button strokes. It therefore had the repetition capability of a computer, but it didn't have conditional execution; it had no "if". The task I was unable to get it to do for me was to convert an azimuth to a bearing. Azimuths and bearings are two ways to express a direction. An azimuth is an angle between 0 and 360, usually measured from north. A bearing is an angle between 0 and 90 in a quadrant. The azimuth 100 is the bearing S80E, i.e. "south 80 east" in the southeast quadrant. As far as I can tell, that conversion requires a "decision" about which quadrant the azimuth falls in, which requires a conditional, and so a calculator simply can't do it.
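For the record, here's a sketch of that azimuth-to-bearing conversion, written in dcon, the language introduced in the PROGRAMMING section below. You might come back to it after reading that far. It uses only dcon constructs demonstrated later in this file, and it hardcodes the azimuth with let to keep it minimal; the quadrant tests are exactly the conditionals my old calculator lacked.

	#!/bin/dcon
	let a=100		# the azimuth to convert; edit and rerun
	if a>270 goto nw	# these three tests are the "decision"
	if a>180 goto sw	# a mere calculator can't make
	if a>90 goto se
	print "N"		# fell through all three: northeast quadrant
	print a"E\n"		# e.g. azimuth 45 is N45E
	exit 0
	:se
	let b=180-a		# measure back from south
	print "S"
	print b"E\n"		# azimuth 100 is S80E
	exit 0
	:sw
	let b=a-180
	print "S"
	print b"W\n"		# azimuth 200 is S20W
	exit 0
	:nw
	let b=360-a
	print "N"
	print b"W\n"		# azimuth 300 is N60W
	exit 0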

Here's a reminder of some terms I've snuck into the discussion already that were used in a computing sense....



	data 		input  	   	display       	memory     
	state     	integer    	modulo		bit     
	binary     	byte      	conditionals    emulator     
	algorithm	internal representation		instruction
	store 		program

PROGRAMMING
dcon is a very simple programming language for unix. It's the simplest, easiest to learn programming language I've seen that still is fairly useful. It allows you to edit instructions to the computer into a text file, and then you can execute the file, thus running the program you have written. dcon has just enough functionality that we can write a simple program that will demonstrate conditionals, repetition, and non-arithmetic data operations. dcon also has simple user input and output facilities, which are not strictly a requirement for something to be a computer, but are a practical necessity, particularly for a tutorial. Please install dcon if you haven't yet.

Put the following two lines in a text file as the first two lines....

	#!/bin/dcon
	print "Hello World! \n" 
	

Then chmod the file to executable and run it. If dcon is in /bin you should get a cheery response from your new executable. Don't go any further until you do. Move dcon or change the first line to find it if you have to. Don't ever get any more complicated than you have to when programming. Otherwise you will write un-debuggable code, which is very frustrating.
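Assuming you called the file hello ( any name will do ), the session should look something like this:

	$ chmod +x hello
	$ ./hello
	Hello World!
	$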

OK, it says hello? Every time? Great. You've written a unix command. It's a useless unix command, and a very common useless unix command, but you now have a toe-hold on programming. Your program runs. Now change the text string so you can say you've written something besides "Hello World".

The first line, #!/bin/dcon, tells unix to run the given interpreter, dcon, on the rest of the file. dcon is the type of programming language known as an interpreted language. So are Perl, Lisp, BASIC, and the various unix shells, among others. The interpreting of the first line this way is a facility of the operating system, not the language, but after that the language is interpreting the file. The second line is an instruction to print a string of characters to the user's terminal. Characters are given to the dcon print instruction within double quotes. The \n is the unix standard print command escape for "print a newline", or linefeed. Character terminal IO ( input/output ) still uses the terminology of paper-display teletypes. The second line is very similar to the C or Perl instructions for the same action. You can remove the \n to see the difference without it.

Now add a few more print lines to your program, like....


	#!/bin/dcon 
	print "Hello World!\n"
	print "I'm so proud"
	print " to be here!\n\n\n\n"
	print "Yeeeee HAW!\n"
...and run it. You can see several things from this. We've gotten clever with \n. We use ! too much, like the yahoos we are. And most important, our program is executed line by line from top to bottom. This is the flow of control of the program. Computers do things one at a time, not all at once, and top to bottom is the default order. The crucial repetition and conditional features of a computer are based on changing this order. We change the order of execution with flow control instructions.

dcon has a tiny set of flow control instructions. It has labels, goto, gosub and return. We'll use a label and a goto to write an idiotic endless loop, which will demonstrate only the repetition capability of programming. Change your proggy to this.....

	#!/bin/dcon
	:chorus
	print "habba hooba ya!      \n   "
	print "Ain't THIS a party!?!?! \n "
	goto chorus
Make sure there's a \n in the print statements inside the loop, i.e. between :chorus and goto chorus. When you run it now you'll have to do ^c to kill it. Do so. Ah, the power of repetition. The little virtual cybercretin you just created will chant "habba hooba ya" forever if you let it.

Perhaps the use of a conditional would cause something a bit more interesting to happen. Let's also input something, for the conditional to check.

	#!/bin/dcon

	:top
	print "Enter some text\n"
	input $a
	let $b=$b+$a
	print $b"\n"
	if len($b)>100 goto theend
	goto top
	:theend
	print $b"\n"
	print "OK, 100 character limit reached. I quit.\n"
This little program has one goto making a loop to :top, and one goto that breaks us out of this loop if we've accumulated more than 100 characters in our string. The program decides how many times we have to loop back to :top to get a string of more than 100 characters. In fancier languages this would be done with a do-until construct. dcon can be commented with #, rem or // . Here's the same thing with comments.
	#!/bin/dcon
	
	:top				# this is a label, a branch target
	print "Enter some text\n"	# tell the user what to input
	input $a			# get user input as string var. $a
	let $b=$b+$a			# + is "append" with strings. $b grows
	print $b"\n"			# print our current accumulated string
	if len($b)>100 goto theend	# break out if our string is >100 bytes
	goto top			# unconditional branch, making our loop
	:theend				# branch target for escaping the loop
	print $b"\n"			# print $b the last time 
	print "OK, 100 character limit reached. I quit.\n"
OK, we've got repetition and conditionals going on. This is almost useful, too. It does something that's not utterly predictable. Another thing about this little program is that it now has the same general outline as most interactive utilities, which usually consist of an endless loop at the "top" of their flow control, some action based on some input, and an escape condition.

Recall that to really be computing we have to be able to do arbitrary data manipulation. That means we have to be able to flip any bit we want. Let's see what kind of bit-twiddling we can do in dcon. This one just churns some integers.

	#!/bin/dcon
	:top			# top of endless loop
	let a=a+1		# increment a
	print a"     "		# print current increment value
	print a&16776961"\n"	# print a  ANDed with 16776961
	goto top		# loop
 
The & operator is the dcon bitwise Boolean Logic AND operation. You may not have known you wanted to learn this. You do. It's a very fancy name but what it does absolutely could not be any simpler. Computer logic is Boolean logic, named after George Boole, the guy who invented it. AND is one of a few crucial Boolean functions, along with NOT, OR and XOR.

Let's say we have two bits named a and b that we input to AND. AND will give an answer of 1 if and only if a and b are both 1. If either or both of a and b are 0, then a AND b = 0. With 2 input bits there are only 4 possible cases of AND, so we'll show them all in this little table...

	AND
	inputs...       output
	a       b	aANDb
	0	0	0
	0	1	0
	1	0	0
	1	1	1
When we use dcon's & operator on two numbers they are ANDed on a bitwise basis, i.e. each significance-aligned pair of bits in the two arguments is ANDed, and the answer is thus the bitwise AND of the two inputs. The number 16776961 in our example is 111111111111111100000001 in binary. That's our bitmask. When we & our bitmask with some other number, each bit position where the mask has a 0 comes out 0 in the result, and each position where the mask has a 1 passes the other argument's bit through. This makes our example program do a funny sort of counting. Check it out. ( Watch the second column: below 256 only the lowest bit of a survives the mask, so it just alternates between 1 and 0 while the first column counts normally; at 256 the masked-in higher bits start to show. ) This illustrates that dcon gives us bit-resolution control of integers. You want to be familiar with the few Boolean functions, because they're simple, and because they're what computers do well. They therefore pop up over and over again. Here are some "truth tables". Modify the example to use other functions and masks too. For example, note how XOR can be used to "toggle" bits ( if the mask bit is 1 ). OR is 1 if a OR b is 1....

	OR
	a	b	aORb
	0	0	0	
	0	1	1
	1	0	1
	1	1	1
XOR is the exclusive-OR function, A XOR B is 1 if A is not equal to B. The dcon XOR operator is ^ .
	a 	b 	aXORb
	0	0	0
	0	1	1
	1	0	1
	1	1	0
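For instance, here's the counting example reworked into a toggle by swapping ^ in for &. This is just a sketch; it leans on the fact that dcon variables start out at 0, which the counting example above relies on too.

	#!/bin/dcon
	:top
	let a=a^255		# XOR flips the low 8 bits of a each pass
	print a"\n"		# so this alternates 255, 0, 255, 0 ...
	goto top		# endless loop again; ^c to quit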
NOT is a unary function; it takes one argument. dcon doesn't seem to have it as a bitwise operator, but it bears mention anyway.
	a 	NOTa
	0	1
	1 	0
A function like addition or AND that takes two arguments is called a binary function. This is not the same as a binary number, so I stalled on introducing that use of the term. Everything computers do is made out of these few simple operators.

dcon is an extremely simple programming language. It was developed to write modem scripts. The author of dcon tends to refer to it as a utility. All programming languages are instruction sets for some virtual computer. dcon was kept simple by implementing a very typical virtual machine, i.e. it's similar to a generic microprocessor. It's kind of like interpreted assembly language. dcon's flow control instructions are very similar to the native machine language of a microprocessor. It's very generic, and so is good for introducing fundamentals. Other languages are better for complex tasks or for performance-critical tasks.

I hope you now are familiar with the programming connotations of these terms.......


	programming language		interpreter
	flow control			endless loop	
	Boolean Logic			argument	
	bitmask				unary function
Some terms I should also cover in an intro........

	subroutine	hexadecimal	octal	
	debuggability		
The number 16776961 is 111111111111111100000001 in binary. Binary doesn't convert very sensibly to base 10, as you can see. Often data is represented in base 8 or 16, octal and hexadecimal, because these number bases map to consistent chunks of bits in a binary number. dcon doesn't do hexadecimal or binary input or display of integers, so I used dc to figure out what decimal number my desired bitmask is. 16776961 is FFFF01 in hexadecimal. The hexadecimal digits are 0 1 2 3 4 5 6 7 8 9 a b c d e f . It takes exactly 4 bits to represent 16 things. That's why FFFF01 is a better representation of our bitmask than the decimal version. Each hex digit maps to four adjacent bits, and in a simple case like this you can perhaps do the conversion in your head. f is 1111, 0 is 0000, 1 is 0001. So, as I hope you see, hexadecimal is actually simpler and clearer than decimal when dealing with arbitrary binary numbers, such as memory addresses.
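If you want to repeat that conversion yourself: dc is a reverse-polish calculator, its i command sets the input radix, and it wants its hex digits in uppercase. So one way to do it is:

	$ echo "16i FFFF01 p" | dc
	16776961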

Octal is base 8, and the octal digits 0 through 7 always map to aligned sets of 3 bits, which is why the chmod command can accept octal digits for owner/group/other perms.
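For example, the popular chmod 755 unpacks bit-by-bit into the familiar rwx permission triplets:

	7  =  111  =  rwx	( owner )
	5  =  101  =  r-x	( group )
	5  =  101  =  r-x	( other )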

Here are the decimal numbers 0 through 16 in binary, octal, decimal, and hexadecimal. See the ascii manpage for how bytes are typically used to represent characters.

	binary        octal          DECIMAL       hexadecimal
	    0           0               0               0
	    1           1               1               1
	   10           2               2               2
	   11           3               3               3
	  100           4               4               4
	  101           5               5               5
	  110           6               6               6
	  111           7               7               7
	 1000          10               8               8 
	 1001          11               9               9
	 1010          12              10               a
	 1011          13              11               b
	 1100          14              12               c
	 1101          15              13               d
	 1110          16              14               e  
	 1111          17              15               f
	10000          20              16              10

					( good idea, Garen :o)  
Subroutines are modularized sections of a program. Subroutines are created in dcon with the :(label), gosub and return instructions. Subroutines save resources if a single routine gets called from several places in a program. Again, the dcon mechanism for subroutines doesn't add much to what microprocessors do to support subroutines. The thing that makes a subroutine modular is the return instruction. The return instruction sends program flow back to where the subroutine was called from, i.e. after a return the next instruction is the one just after the calling gosub. ( "The stack" you may have heard about elsewhere in unix is the data structure maintained by the CPU chip for keeping track of which "gosub" the CPU will be "return"ing to. ) Hopefully looking at some code will help clarify. The power of subroutines depends on their use in other code, so this example is unfortunately a bit lengthy....
	#!/bin/dcon 
	gosub subrou 
	gosub subrou 
	gosub subrou 
	:top let i=i+1 		# i is an increment counter 
	gosub subrou 
	if i/3*3=i gosub by3 	# is i a multiple of 3?  
	if i=30 exit 0 		# 30 repetitions should do 
	goto top # loop
	exit 1 			# if flow hits this something is broke

	:subrou			# prints on every iteration
	print i"   Call me?\n"	# i shows which iteration called us
	print "if there's more text here each call saves more typing\n"
	return

	:by3			# this will print if i is a multiple of 3
	print "           AND me?\n"
	return
DEBUGGABILITY
The goto instruction gets a bad rap. There is an idea rampant in the computer world that producing manageable, debuggable programs is largely a matter of avoiding goto. Most other programming languages provide other flow control abstractions such as for-next, do-while et al. These other constructs are actually made out of simple branches, the machine equivalent of gotos. You can't NOT use goto, you can just pretend you aren't. Abuse of goto will definitely cost you, however. I'll never forget the 100-line BASIC program I wrote that was one solid wad of "spaghetti code". It's a very interesting thing to write something you yourself cannot fix. On the other hand, there are dozens of gotos in the main() function of my H3sm program, but none of them cause flow to cross the areas of code they terminate. This is a case of using goto like return, and it doesn't cause incomprehensible tangles.

Simplicity always wins in debuggability. No code, no bugs. The simplest way to get something done is the best. If code is necessary, then the key to keeping things from getting out of hand is to keep things in modules of comprehensible size. Forth programmers call this "factoring". Subroutines are a big help with this. The design of Forth makes subroutines as innate to the language as top-to-bottom flow is in other languages. In any non-trivial language, though, subroutines let you bust problems into manageable chunks. If at any time you have to keep track of more than two or three branch targets ( goto labels ) you probably need to simplify. Subroutines can be debugged ( tested for correctness ) individually, particularly if they don't mess with data in ways that are too general. Simplifying and modularizing code is not easy, though. It is in fact an art.

Another huge aid to debuggability is intelligent commenting of your code. Proper commenting starts with the selection of names for variables, subroutines and other labels. Useful commenting is an art. It requires some thinking. First of all, comment your code so someone who knows the language, but doesn't know anything about your particular program, can figure out what's going on. Years, or days, after you write your program, that person could be you. For a particular routine or function, the code itself tells *how* something is done; it is wise to include a comment describing *what* is being done. Comments shouldn't read like code, and label names shouldn't read like data types; if they do, they aren't very descriptive. That is, "integer" is a pretty stupid name for a variable. A particular variable name should say something that is specific to that variable. Again, it's an art, since your code talks to a machine, but names and comments talk to humans.
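To make the how/what distinction concrete, here's the multiple-of-3 test from the subroutine example above, commented two ways:

	if i/3*3=i gosub by3	# divide i by 3, multiply by 3, compare to i
	if i/3*3=i gosub by3	# integer division drops the remainder, so
				# this fires only when i is a multiple of 3

The first comment restates the code and tells a reader nothing they couldn't see already. The second says what is actually being tested and why it works.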

OTHER MEANS OF PROGRAMMING
The actual instructions a computer chip accepts and understands are far removed from the ones in a dcon script. Scripts are characters in a text file. A CPU chip gets its actual instructions from RAM, and they must be in the exact form the CPU understands. Actual machine code is an unintelligible stream of digital data. The dcon script is interpreted by the dcon executable every time it is run, and thereby converted to actual instructions to the hardware. This is convenient, but an interpreted program runs CPU-inefficiently, since the interpreter spends the great bulk of its CPU time interpreting the script's instructions rather than executing the machine instructions they ultimately result in. The convenience is often worth it, however, particularly for box-specific tasks that don't run all the time.

An assembler is a tool for programming a CPU in its native instruction set. An assembly language is a translation of the native instructions of a CPU into a just slightly less cryptic form, e.g. JSR for the CPU's integer code for "jump to subroutine", the machine equivalent of gosub. There's usually a one-to-one mapping of machine instructions to the assembly language names ( mnemonics ) for them. An assembler takes a text file of assembly language instructions and assembles it into an executable binary file suitable for direct execution by the CPU. In unix the linker comes into play at this point. Binary executable code, also known as object code, can be archived and linked in a library to be used by other programs, or made into a stand-alone command. This is what ELF is about.

Even with mnemonics like JSR for machine instructions, and other rudimentary conveniences like named variables and being able to input numbers in hexadecimal instead of as ones and zeros, the process of writing assembly language is still tedious. Worse, assembly language is CPU-specific. Different CPUs have different instructions, and thus different assembly languages.

Compilers are like assemblers, but with greater translation capability. Compilers do not restrict themselves to the native instructions of the CPU. FORTRAN, COBOL, C, C++, Ada, and Pascal are compiled languages. A compiler takes a text file and makes an executable file, but through a more complex process than assembly. Perhaps more importantly, a compiler implements instructions for a virtual machine, and those instructions can be compiled on various real machines given the appropriate assembler "back end" for each CPU.

A particular programming language, such as C, is an implementation of a particular virtual computer. The virtual computer implemented by C has a variety of flow control constructs, a variety of data manipulation features, and syntactic and other features that can be built from machine instructions, but are not native to a particular CPU. A single instruction to the C virtual machine may result in hundreds of machine instructions when compiled. Languages like C are called high-level languages compared to assembler for this reason.

Having a high-level language produce hundreds of instructions for you based on simple directives is nice. More important, however, is the fact that the abstraction of the compiler allows code to be written in C and compiled on a variety of CPUs, sometimes unchanged. For example, the dcon interpreter is distributed as a single C sourcecode file. The author of dcon wrote dcon in 1989 on some unix, and was able to compile it in Linux in 1996 with no major modifications. If he had written it in assembler that would not have been possible. unix is written mostly in C for this reason, and unix's portability is key to its incredible durability. This is also why C is often called a systems programming language. The C virtual machine is easily implemented on most CPUs, so C can be compiled to code that performs well enough for the core of a system. Dennis Ritchie estimates that the performance cost of the first UNIX written in C was 40% versus the same UNIX in assembler. This has proven to be a cost worth bearing.

There are other families of programming language as well. Some implement virtual machines totally unlike a typical CPU, but very few completely hide you from the concepts shown above with dcon. For example, macro languages like m4 implement repetition via recursion, which is a whole nother ball of wax. My personal favorite programming language is Forth, which is what I call a compiling interpreter. It implements a virtual machine with a data stack in addition to the return stack of typical CPUs, but it's not as widely useful in unix as C, since there's all that existing C code, including the Linux kernel.

If I can make one point, let it be this: all these computer wonders are built from tiny concepts that, by definition, could not be any simpler. A familiarity with these fundamentals is therefore the most fruitful path to understanding, being in control of, and getting the most from your computer.
SEE ALSO
dcon.doc and other dcon distribution files
The usenet comp.lang. hierarchy
the comp.unix.programming FAQ
the man page for your language of curiosity.
Sections 2 and 3 of the standard unix manpages are all C.
search the web for e.g. "c tutorial" etc.
all those hundreds of megs of GNU/Linux sourcecode

RIGHTS

Copyright 1999 Richard Allen Hohensee
This file is released for redistribution only as part of an entire intact cLIeNUX Core.

I kinda wrote this for sillyhead.