# Project 1: Philphix (due 2/8)

## Goals

This project is designed to serve as an introduction to the C language. Although there is conceptually a lot to learn to complete this project, the actual amount of code you will need to write is short. For reference, the “official” solution only adds about 120 lines including comments.

To complete this project, you will have to:

• Use the C file input/output library (you can find more information and common examples here
• Do memory allocation (to understand GNU allocation in C (here)
• Create general allocation function (here)
• Manipulate strings (for basic but incomplete information on strings in C, visit here and coerce strings to void pointers and vice versa.

Tip: Start early! You likely will have to learn and experiment a lot with C to complete this project, so make sure you budget enough time.

Hint: Function headers can be confusing in C (the following functions exist in conjunction: printf, fprintf, sprintf, snprintf, vprintf, vfprintf, vsprintf, and vsnprintf); the man pages are extremely helpful in understanding them. Their basic usage is just man [FUNC_NAME] in any Linux/MacOS terminal (this functionality does not exist in Windows command line/Powershell unless you install some other things) — this gives you the function usage, arguments, return values, and error codes, amongst other information. If you don’t feel comfortable using terminal to access this information yet, the Linux Die pages are a good way to get the information in a more traditionally accessible way. (If you’re curious about the 8 functions listed earlier, you can find their information here!)

## Background

philphix is a very simple and silly replacement tool that accepts a single command line argument, the name of a replacement set to use. This replacement set consists of pairs of “words” (separated by an arbitrary number of tabs and spaces), each pair on its own line. The first word only consists of alphanumeric characters, but the second word can include any non-whitespace printable character. The first word is the target word the input word is being matched against; the second word is what is to replace the input word if at all due to priority.

NOTE: Replacement set when referenced in the future indicates the first word of each pair in the set given; the replacement word is the second in the pair.

For each word (sequence of letters and numbers unbroken by any other character) in the input, philphix processes it to see if it should be replaced. Here is the order of match priority:

1. The exact word is in the replacement set
2. The word with all but the first character converted to lower case is in the replacement set
3. The word converted completely to lower case is in the replacement set

If there is no match the word is printed to standard output unchanged; if there is a match, the highest priority replacement for the word is printed.

It accepts a single command line argument: the name of a file containing a replacement set to use. For example, to use philphix, all you would need to type into the terminal is:

$./philphix tests/sanity/replace  Of course, you can replace tests/sanity/replace with whichever file you wish to use for your replacement set. You should have no memory leaks which are a function of the input. Additionally, philphix must be reasonably efficient. It takes the staff solution less than a second to run for each test. As a measure of reasonably efficient when you’re planning out your solution, a good place to start is to ask yourself how many function calls am I making and why each line is useful. Additionally, is my hash function efficient? If you cannot immediately answer that question, it’s likely it might not be necessarily and only serves to clutter up your code. (Refer back to CS61B for asymptotic analysis review.) ## Getting Started Visit https://galloc.cs61c.org/assignments/proj1. Log in, connect your GitHub account, and start the Project 1 assignment. A GitHub repo will be created for you; this will be your personal repo for any project 1 work you do. Once you have done that, clone and enter into your repo: $ git clone https://github.com/61c-student/sp21-proj1-<Your User Name>.git proj1-philphix
\$ cd proj1-philphix


This copies over a Makefile for the project and several partially implemented source files.

## Overview

After you have a copy of the starter code, you have the following files:

• src/hashtable.c - Code for a generic hashtable. You will need to implement 2 functions in hashtable.c: insertData(HashTable *table, void *key, void *data), and *findData(HashTable *table, void *key).
• src/hashtable.h - Header file for hashtable.c. For more information on header files and how they are used, check out this link. You will need to fill in the definition of your struct HashTable in here.
• src/philphix.c - Contains various functions for our spellchecker. You will need to implement 4 functions in philphix.c: stringHash(void *s), stringEquals(void *s1, void *s2), readDictionary(char *filename), and processInput().
• src/philphix.h - Header file defining functions in philphix.c. You may not modify philphix.h. If you wish to declare additional helper functions for philphix.c, you will have to add them to philphix.c and add a function prototype at the beginning of the file.
• Makefile - Makefile for compiling philphix, as well as for running tests. For more information on Makefiles and how they are used, check out this link.
• diff.sh - A bash script to use for additional testing. This will only work on Linux/Mac machines†. First run chmod +x diff.sh and then run ./diff.sh to see how to use the script. Use this script for additional testing once your code works on the given testing, replacement, and reference files.
• tests/sanity/replace - Sample set of replacement rules.
• tests/sanity/test - Sample input file.
• tests/sanity/reference - Sample output file.

Spend some time looking over these files and figuring out what is in each file and what they are used for.

TL;DR: hashtable.c and hashtable.h define a simple generic chained hash table. hashtable.h acts as an interface through which other files interact with our hash table implementation. Likewise, philphix.h declares the functions that are defined in philphix.c. philphix.c contains the main function, which serves as the program’s entry point.

† Windows users, there are a few different options to run diff.sh — native Windows cmd line and Powershell don’t support running bash scripts. The easiest option (if you already set it up) is using WSL (Windows Subsystem for Linux; setup instructions here) as it essentially encases your Linux distribution into user-mode and runs it on Windows. Alternatively, you could always ssh into the Hive. If you’re up for a bit more adventure, you can install either Cygwin or MinGW MSYS; the latter is less heavy-duty than the former but both will do what you need it to do.

For those who want a review of hashing, these Hug slides are a good reference.

While completing this project will require getting familiar with many parts of the C language (C file input/output library, do memory allocation, manipulate strings, and coerce strings to void pointers and vice versa just to name a few), your final solution should not be all that long.

To finish this project, you will have to:

1. Fill in the definition of struct HashTable in hashtable.h

Additionally, you will have to correctly implement the following functions:

1. void insertData(HashTable *table, void *key, void *data) in hashtable.c
2. void *findData(HashTable *table, void *key) in hashtable.c
3. unsigned int stringHash(void *s) in philphix.c
4. int stringEquals(void *s1, void *s2) in philphix.c
5. void readDictionary(char *dictName) in philphix.c
6. void processInput() in philphix.c

We suggest working on the subparts in the order listed above. Also, note that you may NOT import additional header files (*.h) This means all your changes must be contained to ONLY hashtable.c, hashtable.h, and philphix.c.

Remember to regularly check the piazza thread and pull from the starter repo to ensure that you get any updates/bug fixes.

One significant aspect that you should focus on is testing. You can create some python scripts to produce larger sets of input. Other good tests for robustness include using large binaries on standard input (with the output redirected to /dev/null).

NOTE: /dev/null is the *nix equivalent of the void; anything directed to that part of the disk is effectively discarded, never to be seen again. You can find more detailed information about it (and how to use it) here and here.

One particular aspect of C is nondeterminism: something that may run perfectly fine on one computer doesn’t run the same way on another due to memory bugs or other issues. We will be grading exclusively on the Gradescope which resembles the hive cluster, so you must make sure your code works correctly on those systems. Although we will attempt to help you run your code on your own computer rather than the hive, please understand that running on your own computer may result in problems we are not equipped to debug.

NOTE: in additional to nondeterminism, running memory tests (and some versions of tests) in the VSCode terminal can result in strange errors that see like memory leaks and be rooted in issues with your code, especially if you attempt to valgrind the same code in the hive afterwards; we’re not entirely sure why this happens but if you run into issues like this and suspect this might be the issue (it’s not common), please let us know ASAP so you don’t waste too many hours on an issue that’s not your fault. (VSCode has had additional strange behaviour in the past; if you run into something at any point during this course, please let us know via Piazza so we’re aware of it.)

Included in the repo is a sample replacement set, input, and output. Your output should EXACTLY match ours, since we will be using automated scripts to grade your program. You can type make test in your project 1 directory to compile and test your program against the sample replacement set/input and the edge case. You can also safely output all sorts of debugging information to stderr, as this will be ignored by our scripts and by the test routine provided in the Makefile. We have also included a second edge case test to see if your program will not break if we pass in your program binary as an input! We recommend that you create additional tests like this to see if you pass the different edge cases. If you just want to run the basic test, you can just run make testbasic. If you just want to run the edge case we gave, you can run make testedge.

NOTE: we mention binary files a few times in the spec; you can normally recognise them as they’re named *.bin though they do not need to end in that. Binaries are non-text files and are typically used to story binary data. If you open them, they’ll be unreadable to the human eye, AKA you won’t be able to necessarily see if the inputs are valid or not. An example of a binary would be a program’s executable (aka binary executable).

Furthermore, you can initially assume that both the replacement set and the input won’t contain words longer than 60 characters. This gets you the majority of the credit. However, for full credit, you must ensure that your program fully works even if you get words which are arbitrarily longer than 60 characters.

You should assume that the replacement is well formatted (i.e., alphanumeric word separated by some number of spaces or tabs followed by any non-whitespace, printable character delimited by new lines though a valid dictionary may not end in a new line!) if it exists. You CANNOT assume anything about what comes in through standard input except for the length of “words” for a majority of credit.

Tip: Consider running your program under valgrind to detect any memory errors. For example, use valgrind -q ./philphix tests/sanity/replace to have valgrind print out any invalid memory accesses it detects during the run. In C, it’s easy to write a program that appears to work correctly for simple tests, but which fails or crashes with larger inputs. Valgrind catches many of these hidden bugs that might otherwise appear only in the autograder.

### Thinking through programming and writing new tests (and edge cases). Tips by Stephan Kaminsky

The first thing you should always do when you are creating a new program is create a design doc or specification. For projects in this class, we have already given you a specification so you do not need to worry about creating your own. Instead, we recommend that you create a summary of what your design task actually is. Writing a summary (or a design doc in the first place) is a good way for you to understand what your objectives are without a lot of the fluff. For this project, summarizing this spec into the expected inputs and outputs should make it very clear on what you will need to do (and be careful of!). This is very helpful when you may design abstraction layers are you can see the possible inputs and outputs of parts of your program, how they fit together, and how they will produce the overall expected output from any given input.

In addition, thinking through functions on “paper” can be helpful when referencing later as later project tasks can get rather large. This also can be helpful later when going to Office Hours because the TA on duty will be able to clearly see your delineated thoughts and it’ll take less time to understand your design and code. It can also be useful if you started the project early, put it aside for a bit, and then come back to it in refreshing what you were thinking; just remember to keep it updated if you make any major changes to your implementation.

#### Writing Tests!

Once you have created your summary and outlined the abstraction you have, your next step should be writing test cases which will test to see if you have fufilled your abstractions. For example, once you have sumarizied the spec, you should see that this spec does not require that the input nor the dictionary end in a new line. What this means is you should write a test to ensure that your program operates correctly even in that case. You should exhaustively do this though try not to repeat your tests, each test should be targeting one specific thing. If you end up targeting multiple things, it can make debugging even more difficult as you will now have to figure out which thing is causing the test to fail!

Once you have created a strong test suite, you should start implementing your abstractions (or functions) in your code. After that, you should start tieing those abstraction together by working through the main routine/function of your program (while paying close attention that you have fufilled what is required). While you are working on writing your program, you may think of a few new possible edge cases, we recommend that you add those tests to your suit as soon as you think of them as they can be valueable in confirming correctness.

After you believe you have completed your program, you can now run it against your test suite and see how your code is doing! You should note that you may find some tests fail and some pass. It is important to realize that if a test you wrote fails, you should not just assume your program is incorrect. Instead, you should go through your test case (the input and expected result) and confirm that those match the specification which was given to you (by referencing the summary and possibly the spec or design doc). If you find a bug in the test, fix it and try running it again. If the test is found to be correct (or still errors after it is fixed), this means your program breaks its specification in some way. While this may seem devastating as you now have to debug your program, fret not as there are multiple steps you should take when wading through your program with a debugger.

Some of you may immediately jump to your favorite debugging technique: launching gdb (I hope that would be your first goto but understand it can be scary to use), eyeballing your code, adding print statements, etc. While it is typically good to immediately start debugging, you should first look at HOW your program failed as that will give you clues into what went wrong. The C language really doesn’t have many error messages, it may segfault or it may complete. Because of this, it is important for you to understand why one error may come up and what you should do in those cases. In C, if you get a segfault, that means you should be looking at any read/write operation you do to memory as one of those is reading or modifying values in an unintended way (I will further go through this in the next paragraph). If it does not segfault, it does always mean that you do not have erroneous memory access but it does mean that something in your program is not working as inspected. Because it did not crash, it is important for you to check what IS your output. For example, if you are expecting to write to a file, does it actually create the file?, Is the file even there?, Is there a path to the file?, Does the contents of the file actually make sense?. If the contents are incorrect, how did it fail?. For example, in this project, if you do not see anything, are you sure you correctly wrote charaters to the file? Another example for this project is if you just receive the same output as you input with no changes to your dictionary. This could mean multiple things such as your dictionary is not adding entries, your string equals function is incorrect, you are not writing the value of a matching dictionary entry, etc. There can be many things which can be the problem, but it is very important that you iterate through the possible issues so you can start analyzing each issue to figure out if that is what is causing the failure.

Now that you have iterated a list of possible issues with your code, you should start debugging your issue. I strongly recommend that you use a debugger such as gdb though understand that some issues may be able to be solved with print debugging. I recommend gdb as it will allow you to test multiple things at once without having to constantly change, build, and launch your program to modify your print statements (in addition to print statements actually being able to change the state of your code!). The first thing you should look at (with gdb or print statements) is the first possible issue in your code. While this may sound obvious to some, it can be something which is easily overlooked. It is important to do this as you may have castcading errors! Once you have done that, step to some portion of your code. Once there, you should think through what the expected state of your code should be (aka for those print debuggers, this is where you print a whole bunch of variables). If you write out what you expect the state of your program to be at some point, this can help you spot errors which allow you to further isolate your issue. When I say you should write the expected state, I do not mean that you must know every address and every single variable your program has, instead it should be what do you expect the state of your function variables to be and maybe some global variables that the function interacts with. Once you have stepped to that point and written down what you expect the state to be, you should check each variable (ie print them out) to see if it matches your expectations. If it does not, you should look closer at what modifies it and see if there is some error in that (this step may force you to reiterate a list of possible issues for you to start looking at). If you see everything is working as inspected, you will need to go to the next possible issue you iterated through and do the same thing until you figure out your issue.

This process can be long but it is imporant that you stick to it. With practice and programming experience, you will begin to see common issues which will allow you to more quickly hone in on the issue. It is especially important that you understand this as larger projects can be too much for you to just walk line by line in your head to analyize (it can be VERY easy to make a mistake that way as you might thing the line does something different than it actually does). After you have fixed the issue, you will continue to do this process until you have passed all of your tests. Once you have passed all of your tests, it is always a good idea to rereview the specification of your program to ensure you are not missing any test or have any tests which may not align with the design. Sometimes you may feel you are getting stuck on the same thing over and over again. It can be very useful for you to take a break if you notice you are just running in circles as you should give your mind a break by doing something else. A refreshed mind may catch things which you were unable to catch as errors may have blended with the expected behavior in haste, tiredness, or a daze.

#### Learning by example

We have given you a basic sanity test in this project in addition to an edge case. Edge cases are commonly known as tests which take valid inputs to your program but may cause the logic of your program to fail if you are not careful. We have started the second step for you which is Writing Tests!. Before you continue with looking at the tests, I recommend you do the Understanding your task portion of this guide. Once you have created a summary, you should now start to create tests! We have already given you two tests, a sanity test and an edge case. A sanity test is generally a test which will check for very basic correctness but does not test the bounds of your program. Edge cases normally target the bounds of your program and are important to write so you can be sure your code works correctly. In our case, the sanity test gives you some examples of different inputs and how we expect for you to transform them in your program. When you look at the summary you have created of your task, you may notice that the spec we gave states that you cannot expect anything about the input to the program. This means you should be able to correctly handle any possible character thrown as input. Because of this, it means that your program should be able to take in ANYTHING (aka including the binary itself) as an input and operate correctly. Once you realize that this is a test you should create, you would create some binary which you input to your program to ensure it operates correctly. An easy way (which we gave you) to test this is have a replacement set which does not change anything then see if you pass the philphix binary to itself and see if the output is the exact same program. Once again, your program is just bits which can be interpreted in any way so it is a valid input! You should continue to look through the summary you have made of what the expected behavior of your program is. For example, your summary will show you that you cannot assume that there exists a new line at the end of the input or dictionary. This means YOU will need to create a test to ensure that your program functions correctly in those cases. Once you have written your test cases (for reference, the autograder has about 10 different types of tests), you should start writing your solution. Once you have finished your solution, you should run it through the tests you have created (please note that is it always a very good idea to test as you go though it is not always possible or may require you to write multiple testing frameworks). Then you should continue to debug and test until you have passed all of your tests. Once you have done that, it is always a good idea to step away from this so you can give yourself some time to think through possible other senarios which you have. Once you finish that, you are ready to ship your code! :)

## Submitting

Please submit using Gradescope to Project 1, using the GitHub submission option to ensure that your files are in the right place.

REMEMBER: the grading will be done almost entirely by automated scripts. We will be only using your hashtable.c, hashtable.h, and philphix.c files when grading! Your output must exactly match the specified format, making correctness the primary goal of this project. Upon submission, the autograder will give you the result of a basic sanity test but will not give you your complete grade. Rather, you are responsible for developing any tests you need to make sure that you meet the requirements of the project. We deliberately did not include a more comprehensive test. We want you to practice writing your own tests!

You are to do this work individually. It is OK to assist your classmates with their project, but don’t copy code, and don’t go searching for previous solutions!

NOTE: Please avoid working too closely with other classmates; as aforementioned, sharing general ideas and tips is okay but debugging by looking at other students’ code (even if you’re finished) is strongly warned against. This is not because we want to discourage helping each other but as it has been shown to lead to both pre-meditated and accidental academic dishonesty cases in many semesters regardless of if you intentionally memorised their code or not. Please do not put yourself and your classmates’ in that situation as it causes everyone involved a lot of additional stress at the end of the semester when it could’ve been easily avoided by simply not viewing each other’s code, even when just helping. That being said, if you’re not sure if something is okay to do or not, just ask on Piazza, either in a post or on the respective project thread as it’ll help everyone to clearly delineate what’s allowed and encouraged versus what is not.

## FAQ

### The autograder failed to execute correctly?!?!

This is typically due to you having a lot of print statements which will cause the autograder to hang or run out of memory when it is executing. If you run into this, please remove/comment out your print statements and resubmit. If you continue to have this issue after you have confirmed that you have done that AND you have confirmed that you pass the sanity test locally, please make a post on piazza with a link to your submission. If you do not include a link, we will not be able to easily find your submission!

### If a malloc call fails, what should I exit with?

You can simply exit with error code 1.