Sild - read more good

June 13, 2016

There is a complexity problem with my program!

I want to iterate over each character as few times as possible in order to read them in. How many times am I iterating over them now?

To find out, I’ll add an inner_reads global var and initialize it to 0. As a reminder, you can see this code in its working form in the repo, right about here.

int inner_reads = 0;

I’ll add an increment to that variable each place where a char is being accessed: Once in:

char *read_substring(char *s) {
    int len = count_substring_length(s);
    char *out = malloc(len);
    for (int i = 0; i < len; i++) {
        if (s[i] == 'i')
            inner_reads++;

        out[i] = s[i];
    }
    out[len] = '\0';
    return out;
};

once in:

int count_substring_length(char *s) {
    int i = 0;
    while (s[i] != ' ' && s[i] != '\0' && s[i]!= ')') {

        if (s[i] == 'i')
            inner_reads++;

        i++;
    }
    return i;
}

and once in

int count_list_length(char *s) {
    int depth = 1;
    int i = 1;
    while (depth > 0) {

        if (s[i] == 'i')
            inner_reads++;

        if (s[i] == '(') {
            depth += 1;
        } else if (s[i] == ')'){
            depth -= 1;
        }
        i++;
    }
    return i;
}

Notice I’m just incrementing on the character i, so that I can see how many times I access just that one char, like this:

int main() {
    C *a_list = read("i");
    debug_list(a_list);
    printf("%i", inner_reads);
    return 0;
}

This prints out:

LABEL- Address: 0x7f99684033d0, Value: i Next: 0x104197028
NIL- Address: 0x104197028
-------------------------------------------------------
3

Ok, so that’s not so so bad… I look at that char 3 separate times. Once in read_substring() when it calls count_substring_length() to figure out how much memory to malloc for the output, once when it actually copies over the substring to the memory that has been malloc‘d, and one final time in the main read function when it calls count_substring_length() to know how far to jump ahead in the input string. That, I can live with. But what about count_list_length()?

int main() {
    C *a_list = read("(((((((((((i)))))))))))");
    debug_list(a_list);
    printf("%i", inner_reads);
    return 0;
}

yields:

LIST- Address: 0x7fde4b403540, List_Value: 0x7fde4b403520 Next: 0x100186028
|   LIST- Address: 0x7fde4b403520, List_Value: 0x7fde4b403500 Next: 0x100186028
|   |   LIST- Address: 0x7fde4b403500, List_Value: 0x7fde4b4034e0 Next: 0x100186028
|   |   |   LIST- Address: 0x7fde4b4034e0, List_Value: 0x7fde4b4034c0 Next: 0x100186028
|   |   |   |   LIST- Address: 0x7fde4b4034c0, List_Value: 0x7fde4b403490 Next: 0x100186028
|   |   |   |   |   LIST- Address: 0x7fde4b403490, List_Value: 0x7fde4b403470 Next: 0x100186028
|   |   |   |   |   |   LIST- Address: 0x7fde4b403470, List_Value: 0x7fde4b403450 Next: 0x100186028
|   |   |   |   |   |   |   LIST- Address: 0x7fde4b403450, List_Value: 0x7fde4b403430 Next: 0x100186028
|   |   |   |   |   |   |   |   LIST- Address: 0x7fde4b403430, List_Value: 0x7fde4b403410 Next: 0x100186028
|   |   |   |   |   |   |   |   |   LIST- Address: 0x7fde4b403410, List_Value: 0x7fde4b4033f0 Next: 0x100186028
|   |   |   |   |   |   |   |   |   |   LIST- Address: 0x7fde4b4033f0, List_Value: 0x7fde4b4033d0 Next: 0x100186028
|   |   |   |   |   |   |   |   |   |   |   LABEL- Address: 0x7fde4b4033d0, Value: i Next: 0x100186028
|   |   |   |   |   |   |   |   |   |   |   NIL- Address: 0x100186028
|   |   |   |   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   |   |   |   NIL- Address: 0x100186028
|   |   |   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   |   |   NIL- Address: 0x100186028
|   |   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   |   NIL- Address: 0x100186028
|   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   NIL- Address: 0x100186028
|   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   NIL- Address: 0x100186028
|   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   NIL- Address: 0x100186028
|   |   |   |   -------------------------------------------------------
|   |   |   |   NIL- Address: 0x100186028
|   |   |   -------------------------------------------------------
|   |   |   NIL- Address: 0x100186028
|   |   -------------------------------------------------------
|   |   NIL- Address: 0x100186028
|   -------------------------------------------------------
|   NIL- Address: 0x100186028
-------------------------------------------------------
NIL- Address: 0x100186028
-------------------------------------------------------
14

The way I’ve got this written now, when a string is nested inside other lists, it gets examined 3 times for the actual reading of the string and once more for each list that it is nested inside. This is really crappy, and it means that for a string like

"(((((((((((i)))))))))))"

The i char in the middle is being accessed fourteen times. This isn’t including looking at all the parenthethes chars, either!

I can do better than this. At the very least, I want to be assured that, no matter what the nesting structure, each char will only ever be examined the same number of times!


Well, it turns out that I can get rid of the count_list_length() function completely by changing the way I approach the read() function itself!

Right now, read() operates on a pointer to the head of a string. When you pass this pointer into the function, it creates a local copy of that value and binds it to the argument name in the function signature. Which only exists until the end of the function. So, for example:

int derp(int local_i) {
    local_i++;
    return local_i;
}

int i = 10;
printf("%i", i); // 10
printf("%i", derp(i)); // 11
printf("%i", i); // 10

Further, if I call derp() inside of itself, I am still not affecting anything outside of the current function call.

int i = 10;

printf("%i", i); // 10

int derp(int local_i) {
    local_i++;
    if (local_i < 100) // have to have a base case or it will go on forever!
        derp(local_i);
    return local_i;
}

printf("%i", derp(i)); // 11
printf("%i", i); // 10

Avoiding complex global state is almost always a Good Thing. This seems like the expected behavior! We want the stuff that happens inside derp() to stay there, and not touch the global state, normally. But in the case of the read() function, because it is so recursive, it is much cleaner to have it operate on a pointer to a pointer instead of the pointer itself. This way, instead of copying the pointer each time the function is called, it copies a pointer to it, and you can operate directly on the pointer value itself, incrementing it regardless of how many stack frames you are inside!

So,

int derp(int *local_i) {
    (*local_i)++;
    return *local_i;
}

int i = 10;
printf("%i\n", i);        // 10
printf("%i\n", derp(&i)); // 11
printf("%i\n", i);        // 11

Now, the increment of the pointer inside the function affects the actual value itself, not the local copy.

How does this look in the read function? Well it looks like this:

C * read(char **s) { // **s is a 'pointer to a pointer', in this case a pointer to a string
    switch(**s) {    // dereferencing the pointer twice yields the actual char it points to (two levels of indirection)
        case '\0': case ')':
            (*s)++;              // increment the pointer at the end of a list
            return &nil;
        case ' ': case '\n':
            (*s)++;              // increment the pointer after ignoring whitespace
            return read(s);
        case '(':
            (*s)++;              // increment the pointer after starting a list

            // this is the magic part! the first call to read increments the pointer as it goes
            // so that when what looks like the same pointer is passed into the second call, it has
            // already passed by the entirety of the list!
            return makecell(LIST, (union V){.list = read(s)}, read(s));
            //                                         ^          ^
            //                                 so the 1st call & 2nd call are starting in different places!
        default: {
            // this part works like it did before, but we're guaranteed to never
            // read a char more than the 3 times necessary to make a copy and then jump over it.
            char *label_val = read_substring(*s);
            (*s) += count_substring_length(*s);
            return makecell(LABEL, (union V){.label = label_val}, read(s));
        }
    }
}

NOW WATCH THIS

int main() {
    char *a_string = "(((((((((((i)))))))))))";
    C *a_list = read(&a_string);
    debug_list(a_list);
    printf("%i", inner_reads);
    return 0;
}

with this new version of read, returns:

LIST- Address: 0x7f8c0a403540, List_Value: 0x7f8c0a403520 Next: 0x10d94a028
|   LIST- Address: 0x7f8c0a403520, List_Value: 0x7f8c0a403500 Next: 0x10d94a028
|   |   LIST- Address: 0x7f8c0a403500, List_Value: 0x7f8c0a4034e0 Next: 0x10d94a028
|   |   |   LIST- Address: 0x7f8c0a4034e0, List_Value: 0x7f8c0a4034c0 Next: 0x10d94a028
|   |   |   |   LIST- Address: 0x7f8c0a4034c0, List_Value: 0x7f8c0a403490 Next: 0x10d94a028
|   |   |   |   |   LIST- Address: 0x7f8c0a403490, List_Value: 0x7f8c0a403470 Next: 0x10d94a028
|   |   |   |   |   |   LIST- Address: 0x7f8c0a403470, List_Value: 0x7f8c0a403450 Next: 0x10d94a028
|   |   |   |   |   |   |   LIST- Address: 0x7f8c0a403450, List_Value: 0x7f8c0a403430 Next: 0x10d94a028
|   |   |   |   |   |   |   |   LIST- Address: 0x7f8c0a403430, List_Value: 0x7f8c0a403410 Next: 0x10d94a028
|   |   |   |   |   |   |   |   |   LIST- Address: 0x7f8c0a403410, List_Value: 0x7f8c0a4033f0 Next: 0x10d94a028
|   |   |   |   |   |   |   |   |   |   LIST- Address: 0x7f8c0a4033f0, List_Value: 0x7f8c0a4033d0 Next: 0x10d94a028
|   |   |   |   |   |   |   |   |   |   |   LABEL- Address: 0x7f8c0a4033d0, Value: i Next: 0x10d94a028
|   |   |   |   |   |   |   |   |   |   |   NIL- Address: 0x10d94a028
|   |   |   |   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   |   |   |   NIL- Address: 0x10d94a028
|   |   |   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   |   |   NIL- Address: 0x10d94a028
|   |   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   |   NIL- Address: 0x10d94a028
|   |   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   |   NIL- Address: 0x10d94a028
|   |   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   |   NIL- Address: 0x10d94a028
|   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   NIL- Address: 0x10d94a028
|   |   |   |   -------------------------------------------------------
|   |   |   |   NIL- Address: 0x10d94a028
|   |   |   -------------------------------------------------------
|   |   |   NIL- Address: 0x10d94a028
|   |   -------------------------------------------------------
|   |   NIL- Address: 0x10d94a028
|   -------------------------------------------------------
|   NIL- Address: 0x10d94a028
-------------------------------------------------------
NIL- Address: 0x10d94a028
-------------------------------------------------------
3

Only those three times!

What about:

"((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((i))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))";
|   |   |   |   |   -------------------------------------------------------
|   |   |   |   |   NIL- Address: 0x10c22f028
|   |   |   |   -------------------------------------------------------
|   |   |   |   NIL- Address: 0x10c22f028
|   |   |   -------------------------------------------------------
|   |   |   NIL- Address: 0x10c22f028
|   |   -------------------------------------------------------
|   |   NIL- Address: 0x10c22f028
|   -------------------------------------------------------
|   NIL- Address: 0x10c22f028
-------------------------------------------------------
NIL- Address: 0x10c22f028
-------------------------------------------------------
3 // <----- Boom.

With this tweak, I’ve removed a potentially major computation bottleneck and guaranteed that the reader function will operate in linear time at a max of O(3n), and I’ve completely eliminated the count_substring_length() function from the source. At this point, the source code is still only 116 lines long, and fully 35 of those lines are devoted to the debugging code that lets me see what I am doing!


I feel pretty good about having this in constant time, but I can do just one more better. I’m currently calling count_substring_length() twice- once in the read_substring() to malloc the right amount, and again in read() to jump ahead the right amount after reading a substring in. Why not just save that value somewhere so I only have to count once?

int current_substring_length = 0;
char *read_substring(char *s) {
    current_substring_length = count_substring_length(s);
    char *out = malloc(current_substring_length);
    for (int i = 0; i < current_substring_length; i++) {
        out[i] = s[i];
    }
    out[current_substring_length] = '\0';
    return out;
};

And then in the read function:

C * read(char **s) {
    switch(**s) {
        case '\0': case ')':
            (*s)++;
            return &nil;
        case ' ': case '\n':
            (*s)++;
            return read(s);
        case '(':
            (*s)++;
            return makecell(LIST, (union V){.list = read(s)}, read(s));
        default: {
             char *label_val = read_substring(*s);
             (*s) += current_substring_length;
             //           HERE ^
             return makecell(LABEL, (union V){.label = label_val}, read(s));
        }
    }
}

Now I’ve saved a whole iteration and only need to look at each char a maximum of two times! This can really add up. I don’t think I can go any lower than that- the initial lookahead is necessary to malloc appropriately, and I’d have to reallocate any temporary buffer that I might use to try and get around that, which would likely take more time to computer (and be more error prone!) than just looking ahead for the number on each substring.

But wait there’s even more!

I am operating on the very same pointer that I have a pointer to in the read() function… why can’t I use the same trick to get rid the global var that holds the substring length? Spoiler alert I totally can.

char *read_substring(char **s) {
    int current_substring_length = count_substring_length(*s);
    char *out = malloc(current_substring_length);
    for (int i = 0; i < current_substring_length; i++) {
        out[i] = **s;
        (*s)++;
    }
    out[current_substring_length] = '\0';
    return out;
};

I still have to pass a copy of the pointer into count_substring_length(), and it’s good to assign the current_substring_length to a local var, so I only have to count the length once for both the malloc and the for loop, but now when I increment the pointer with (*s)++, I am affecting the global state of that pointer. This means that I can remove the addition of current_substring_length in the main read() function.

C * read(char **s) {
    switch(**s) {
        case '\0': case ')':
            (*s)++;
            return &nil;
        case ' ': case '\n':
            (*s)++;
            return read(s);
        case '(':
            (*s)++;
            return makecell(LIST, (V){.list = read(s)}, read(s));
        default: {
            return makecell(LABEL, (V){read_substring(s)}, read(s));
        }
    }
}

Now every call to read() and read_substring() increments the global pointer as much as it needs to as soon as it can. I have a lot less global state to think about and don’t have to worry about passing that information through all these recursive calls!