Multithreaded wc

A common utility in Linux and Unix systems is the wc program which is used to count the number of lines, words and characters in the files provided as input.

In this blog we will create a multithreaded version of the wc command where each thread will be responsible for finding the count of the lines, words and characters in one file, and then update the common shared variables for the total lines, words and characters across files.

Note: The purpose of this blog is to use threads in a bare bones implementation of wc.

File Handling

To interact with a file (read and write), we need a FILE pointer (which is different and more portable and easier to use than a file descriptor, also FILE struct has a field for file descriptor).

The fopen() function has the signature:

FILE *fopen(const char *restrict path, const char *restrict mode);

Where path is the file path and mode are one of {r, w, a, r+, w+, a+} with postfix b if dealing with binary files.

You might ask what after I get the pointer? You can start reading or writing to the file, but there are some subtle differences based on the use case. For reading from files commonly used are fgets (for text files or formatted data) and fread (for binary or unformatted data), fgets terminates reading once a newline (unlike fread) or EOF or specified characters are read.

The fgets() prototype is:

char *fgets(char *restrict s, int n, FILE *restrict stream);

file handling (creating file pointers and closing the resources)

This snippet involves input of the file names and storing the FILE* for each and displaying an error if not found. The struct file_info stores the FILE* and the file name for a file. The file name can be found from the FILE* indirectly using by finding the file descriptor, but that is not target of this blog.

After performing all the logic in the function get_word_count, the files are closed to release resources.

Multithreading

Now for the core logic, since this is a multithreaded wc with each thread responsible for one file, we run a thread for each of the valid files stored in the files array of file_info type.

We will be using the POSIX thread library in C for thread handling. This provides APIs for thread creation, terminating, detaching and synchronization using mutex.

The pthread_create signature is:

int pthread_create(pthread_t *restrict thread, const pthread_attr_t *restrict attr, typeof(void *(void *)) *start_routine, void *restrict arg);

The start_routine is executed by the thread once it is created. The return type of this function is expected to be of type (void*). The attr of the thread (which we won't discuss in this blog) can be set to default by passing NULL. The arg allow to pass different arguments to different threads, for each thread we pass a different file_info type having the FILE* and the file_name, allowing for file reading of the specific file.

get_word_count function

In the code above, we create count number of threads and terminate them before the function ends as the tid has local scope, to prevent race condition when the threads update variables total_lines, total_words, total_chars, The mutex mutex is used to ensure that a single thread is capable of updating the counts, this is demonstrated in the below snippet (In case you get an error for compute_word_count it is mostly due to no return statement, in that case return NULL; at the end should work).

compute_word_count function

The logic for counting the words is that whenever the first character of a word is encountered, increment words, the state variable records if the current character is inside a word or a series of whitespaces.

This source code can be found at: wc.c

Steps to run:

Create a file wc.c and write the code
Compile (g++ -g wc.c)
Create file1 and file2 and add some text
Run (usage ./a.out file1 file2)

You will get an output of the count of lines, words and characters for each valid file and total count in the last row.

Search This Blog

madvise