Sild; Reading substrings
Jun 08, 2016In the last post, I made a simple little linked list.
I can't expect to make every single cell in a list by hand, of course. I need a way to turn some kind of input into a set of cells that are all semantically linked together. I need a function that reads a string and outputs a linked list! Here's a good first go, annotated:
C * read(char *s) {
if (*s == '\0') {
return NULL;
} else {
return makecell(*s, read(s + 1));
}
}
This function returns a pointer to the first cell in the new linked
list. As we saw before, from there we can access further elements by traversing
the next members in each cell.
This function accepts a string. In C, strings are not a primitive type,
they're instead represented by a pointer address that points to the location of
the first character in the string. C strings are NULL terminated, which means
that it goes from the "starting" point, to the next null byte. The null byte
here is represented by '\0', but it could just as well be NULL or 0 or
0.0 or 0x0. They are all equal to the same thing, which is absolute 0,
which is the null byte. Functions that operate on strings implement loops that
go until they run into a null byte, and this one will do that, too.
Notice the *s's. This is called 'dereferencing'. When we 'dereference' the
pointer we've been given, it returns the actual content of that memory address.
Let's say we're passing in a string like so:
read("a");
Now, it looks like that string constant is one byte long, and it is... sort of. The contents of that string are only one byte long, but it takes up two contiguous bytes in memory, and looks something like this:
['a']['\0']
This is the difference between single and double quotes in C. A double quote
represents a string literal, and a single quote represents a single char.
I'm going to change the struct that represents a cell now, so that the
valmember holds acharinstead of anint. This will make it easier to think about parsing strings. From now it will look like this:typedef struct C { char val; struct C * next; } C;
So, the function above would look at "a", and find that the first char in the
string is 'a', and create a cell using makecell() whose val is 'a', and
whose next cell is...
read(s + 1)
Why are we adding 1 to a ... string?
We're not adding 1 to a string, we're adding 1 to the address that we were
originally given. We're saying, in effect, "Tell me what is right after the
first character in memory." This call to read() sees whatever the first call
saw, but starting on the second char. In this case, that's the NULL byte!
On the second call, we're returning the NULL pointer. So the final structure
of this linked list looks like this, it is currently just one cell long:
{ 'a', 0x0 }
How about read("abcde")
{ 'a', 0x7fe6514034c0 },
{ 'b', 0x7fe6514034a0 },
{ 'c', 0x7fe651403480 },
{ 'd', 0x7fe651403460 },
{ 'e', 0x0 }
This is a little more complicated! Each character has been loaded into a cell
whose val is the char at the location it was given, and whose next member
contains the memory address of the return value of the successive call to
read(). Those numbers are pointers that represent memory addresses, and are
the return value of malloc() via makecell().
You might notice that the linked list structure we're using has something in common with the general string structure in C itself... they are both
NULLterminated! In fact, I've written aread()function that turns a C string (which takes up contiguous space in memory) to a linked list ofchars whose elements could be stored in arbitrary places.
Now we're cooking with gas!
Let's see...
read("a b c d e");
This will make a linked list that looks like this:
{ 'a', 0x7fe6514034c0 },
{ ' ', 0x7fe6514034a0 },
{ 'b', 0x7fe651403480 },
{ ' ', 0x7fe651403460 },
{ 'c', 0x7fe651403440 },
{ ' ', 0x7fe651403420 },
{ 'd', 0x7fe651403400 },
{ ' ', 0x7fe6514033fe },
{ 'e', 0x0 }
This is doing what I told it to do. A space is indeed a char ' '. But I
don't really care about space characters. If I add an if else clause to my
function, I can simply ignore the input if it's a space char.
C * read(char *s) {
if (*s == '\0') {
return NULL;
} else if (*s == ' ') {
return read(s + 1);
} else {
return makecell(*s, read(s + 1));
}
}
I'm going to change this from a series of if statements to a switch. Because
in some ways it's cleaner. The code below is functionally equivalent to the
code above, and while we're at it, let's also ignore \n newline chars in
the input:
C * read(char *s) {
switch(*s) {
case '\0':
return NULL;
case ' ':
case '\n':
return read(s + 1);
default:
return makecell(*s, read(s + 1));
}
}
It's easy to think of switch statements as a simple refactor of if and else
if and else clauses when dealing with the various possible states of a
single variable being compared to constant values, and they do often play
that role, but there are some subtle differences and weird little gotchas.
Look at this:
int main() {
int x = 1;
switch(x) {
case 1:
printf("%i", x);
case 2:
printf("%i", x);
case 3:
printf("%i", x);
default:
printf("%i", x);
}
return 0;
}
At a glance, you would expect this code to print out x once, no matter what
it is. But it does not!
If x is 1, then this will print 1111.
2 will output 222.
3 will print 33,
and any other value will print x one time.
Under the hood, a switch statement produces jump instructions that act like
simple goto statements. If x is 2 at the switch, then it will jump to the
point in the code where this case is stated. After that, it will execute all
the remaining lines in the switch unless you break it off, either explicitly
with a break statement, like this:
int main() {
int x = 1;
switch(x) {
case 1:
printf("%i", x);
break;
case 2:
printf("%i", x);
break;
case 3:
printf("%i", x);
break;
default:
printf("%i", x);
break;
}
return 0;
}
Or implicitly, with a return statement, as I've done above with the read function.
The print_list() function I've been using so far has been useful, but I'd like
more information about the cells that a list is composed of. Here is a function
called debug_list(), which has the same basic structure as print_list, but
prints out all of the information for each cell.
void debug_list(C *c) {
printf("Address: %p, Value: %c, Next: %p\n", c, c->val, c->next)
if (c->next) {
debug_list(c->next);
}
}
So something like debug_list(read("abcd")) would output something like
Address: 0x7fb9cbc033f0, Value: a, Next: 0x7fb9cbc033e0
Address: 0x7fb9cbc033e0, Value: b, Next: 0x7fb9cbc033d0
Address: 0x7fb9cbc033d0, Value: c, Next: 0x7fb9cbc033c0
Address: 0x7fb9cbc033c0, Value: d, Next: 0x0
Notice that, as you would expect, each cell's next member is the same as the
next cell's Address, with the exception of the last cell, whose next member
is, again, the NULL pointer.
It's all well and good to be able to read in individual chars as values for each cell, but it's not very practical in the long run to only have single char long labels for everything. I need to be able to read in arbitrarily long strings, instead.
First, I'll have to adjust the val member once again, to be a C string (a pointer to a char) instead of a char by itself.
typedef struct C {
char * val;
struct C * next;
} C;
The final new read function will look like this (let's work backwards):
C * read(char *s) {
switch(*s) {
case '\0':
return NULL;
case ' ': case '\n':
return read(s + 1);
default:
return makecell(read_substring(s), read(s + count_substring_length(s) + 1));
}
}
As you can see, I need two new functions. read_substring() and
count_substring_length().
char *read_substring(char *s) {
int len = count_substring_length(s);
char *out = malloc(len + 1);
for (int i = 0; i < len; i++) {
out[i] = s[i];
}
out[len] = '\0';
return out;
};
This function is manually copying the substring into a new, malloc()ed
location, and returning the pointer to that location. I need to know how much
space to allocate, though, and that's where count_substring_length() comes in.
int count_substring_length(char *s) {
int i = 0;
while (s[i] != ' ' && s[i] != '\0' && s[i] != '\n')
i++;
return i;
}
This function is very simple, it just increments a counter int i for every
character in a string that is not a space, a newline character, or a NULL
byte. What you get back is a number that is equal to the amount of bytes needed
to store the substring in question, including the one slot for the '\0' which
will terminate the new string.
I'll use that count_substring_length() in a couple of places. Once in the
read_substring() function itself, for the aforementioned allocation, and
another time in the succession call to read() at the tail of the switch
statement, when I need to know how much to increment to pointer so that I'm
starting at the end of the substring that I just processed. This will make
more sense with an example. Let's make a list out of "red balloons".
read("red balloons") will look at the first char it was given. It is the
default case in the switch, so it will
return makecell(read_substring(s), read(s + count_substring_length(s) + 1));
The first read_substring will first pass the string into
count_substring_length to figure out the allocation.
count_substring_length will increment until it hits a terminal char, in this
case a space. "red" is the substring, which is 3 bytes long.
Back in read_substring(), malloc() will allocate space for those 4 bytes,
plus 1 more for the terminating '\0' in the copied string, and then, one by
one, copy each char of the substring into the newly malloc()'ed space, then
tagging the end of it with a '\0', then returning that pointer, which is
loaded into the new cell's val member in the call to makecell(). Having
read in the substring, the read function needs to know how far ahead to skip in
the input string "red balloons" to get past the substring it's processed, so
it counts ahead again, in the line
return makecell(read_substring(s), read(s + count_substring_length(s) + 1));
and adds one more to get past the terminating space. The next call to read()
will then see "balloons" by itself, and then read in that substring, and
finally we'll end up with a list of just two elements, instead of an element
for each char as we had before.
Address: 0x7ffb21c033f0, Value: red, Next: 0x7ffb21c033e0
Address: 0x7ffb21c033e0, Value: balloons, Next: 0x0
Now, we have a way to turn some input (a random string) into a linked list structure that contains a bunch of cells that contain substrings of that original string. We also have a couple of functions that can operate on the resulting linked list. Those functions, as they develop, will serve as the prototypes for all functions that operate on these lists.