Files and Data Structures

There are many applications where reading data from and writing data to a text file goes hand-in-hand with using data structures. Files are often used to store meaningful data (information) that a program needs to complete its task effectively. Typically, the data from these files needs to be loaded into some type of data structure so that the program has a structured way of accessing, iterating over, and drawing answers from the data. This lesson shows how these two things work together. The ideas from the previous lesson flow very nicely into this discussion.

Recall the Thesaurus example from last time. In it, a dictionary with strings as keys and sets of strings as values was proposed as a way to track the words that are similar to other words. You also should have seen two examples of functions that use this structure. One was for reporting similar words, and the other for adding a new word to the thesaurus. However, there were several aspects of this design that we did not consider very carefully: Where does the thesaurus / similar word information come from? Will we just have a huge thesaurus dictionary hard-coded across thousands of lines of code in our .py file? Also, what happens when we add a new word? If the program stops and is restarted, how can we save such a value? All of these questions can be answered (and the problems solved) by incorporating file I/O into the program for loading and saving thesaurus information.

Loading from a File

First, the problem of where the data comes from will be addressed. It is possible to literally write out the entire contents of the thesaurus in the Python code. To do this, you’d have to put something like this somewhere near the top of your code file:

thesaurus = {
  'strong' : {'mighty', 'tough', 'robust'},
  'small' : {'tiny', 'little'},
  'fast' : {'speedy', 'quick'},
  'calm' : {'mellow', 'chill', 'relaxed', 'peaceful'},
  # Continue for THOUSANDS more words
}

This would take up many lines of code, and would not be very good stylistically or structurally. A good principle when designing software is to keep the code and the data separate. The code should contain the features and functionality that make the program operate. The data contains the raw information the program needs to execute. In this case, the code would be all the logic for finding words, printing them out, and adding new words. The data is the collection of base words and similar words. In order to separate these concerns, it will be best to keep all of the data in a separate text file, and then have the code load this in when it needs it. The developer of this program could create a file called thesaurus.txt with the following format:

strong : mighty tough robust
small : tiny little
fast : speedy quick
calm : mellow chill relaxed peaceful
...

Each base word gets its own line of the file, followed by a space-colon-space. After this, the similar words are provided, separated by single spaces. Now that the data has been pushed into a separate file, the thesaurus data structure can be declared like so in the code:

thesaurus = {}  # EMPTY :)

We will need some additional code whose job it will be to populate this dictionary with all the necessary information, and with the same structure we had before. This code would need to run each time the program starts up. It can be implemented using the following pair of functions:

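def add_line(thes, line):
    # Split 'base : word1 word2 ...' into the base word and its similar words.
    sp = line.split(' : ')
    for word in sp[1].split(' '):
        # Record each similar word under the base word.
        add_word(thes, sp[0], word)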

def load_thesaurus(thes):
    # The with block closes the file automatically once reading is done.
    with open('thesaurus.txt', 'r') as f:
        for line in f:
            add_line(thes, line.strip('\n'))

The first function is responsible for processing individual lines from the input file, and the second opens up the file, iterates through each line with a loop, and calls the former for each line. A program that wants to load up the thesaurus information and use it may be structured like so, assuming these functions are implemented elsewhere:

def main():
    thesaurus = {}
    load_thesaurus(thesaurus)
    # From here, use it! It should be populated.
    # ...

main()
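
The loading code above relies on an add_word function that is not shown in this lesson. As a rough, self-contained sketch, the version below assumes add_word simply creates a set for an unseen base word and adds the similar word to it. The path parameter, the blank-line check, and the temporary sample file are also assumptions made for illustration, so that the example runs without an existing thesaurus.txt.

```python
import os
import tempfile


def add_word(thes, base, similar):
    # Assumed behavior (the real add_word is not shown in this lesson):
    # create an empty set for a new base word, then record the similar word.
    if base not in thes:
        thes[base] = set()
    thes[base].add(similar)


def add_line(thes, line):
    # 'strong : mighty tough robust' -> base 'strong', three similar words.
    base, similar_words = line.split(' : ')
    for word in similar_words.split(' '):
        add_word(thes, base, word)


def load_thesaurus(thes, path):
    # The path parameter is an addition for this demo; the lesson's version
    # always reads 'thesaurus.txt'.
    with open(path, 'r') as f:
        for line in f:
            line = line.strip('\n')
            if line:  # skip blank lines rather than crashing on them
                add_line(thes, line)


def demo_load():
    # Write a tiny sample file to a temporary folder, then load it back.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, 'thesaurus.txt')
        with open(path, 'w') as f:
            f.write('strong : mighty tough robust\n')
            f.write('small : tiny little\n')
        thesaurus = {}
        load_thesaurus(thesaurus, path)
    return thesaurus


print(demo_load())
```

The printed dictionary should map 'strong' and 'small' to sets of their similar words. Because sets are unordered, the words inside each set may print in a different order from one run to the next.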

Saving to a File

The other half of the problem is saving the thesaurus. The saving feature is necessary in this case because we want the thesaurus software to allow users to add their own custom similarities. Due to this, the thesaurus information is not static, but can be added to as time goes on by the users. On any given run of the program, a user could decide to add a new similarity. If we don’t save this into a file, the additions will be lost each time the program is terminated. To solve this problem, we will re-write the entire thesaurus file each time the program exits. The old thesaurus.txt will get over-written each time the program stops. This can be done with a single, simple function:

def save_thesaurus(thes):
    # Opening in 'w' mode replaces the old contents of the file.
    with open('thesaurus.txt', 'w') as f:
        for base, similar in thes.items():
            f.write(base + ' : ' + ' '.join(similar) + '\n')

As long as this gets called before the program stops, it will ensure that the thesaurus file stays up-to-date with any changes between program runs.
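
One possible way to make sure the save happens before the program stops is to wrap the program's main logic in try/finally. The sketch below is only an illustration under stated assumptions: run_program is a hypothetical stand-in for the real program, the path parameter is added so the demo can write to a temporary file, and sorted() is used only to make the saved order predictable.

```python
import os
import tempfile


def save_thesaurus(thes, path):
    # Same line format as the lesson: 'base : similar1 similar2 ...'.
    # The path parameter is an assumption added so the demo can use a temp file.
    with open(path, 'w') as f:
        for base, similar_words in thes.items():
            # sorted() only makes the output order predictable; it is optional.
            f.write(base + ' : ' + ' '.join(sorted(similar_words)) + '\n')


def run_program(thes, path):
    # Hypothetical stand-in for the program's main logic.
    try:
        # Pretend the user adds a new similarity while the program runs.
        thes.setdefault('fast', set()).add('rapid')
    finally:
        # Runs whether the code above finishes normally or raises an error.
        save_thesaurus(thes, path)


def demo_save():
    # Run the program against a temporary file and return what was saved.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, 'thesaurus.txt')
        run_program({'fast': {'speedy', 'quick'}}, path)
        with open(path, 'r') as f:
            return f.read()


print(demo_save())  # prints: fast : quick rapid speedy
```

The finally block covers normal exits and most errors, but it cannot help if the process is killed abruptly, so a program with very valuable data may want to save more often than once at exit.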

Conclusion

Files that store our programs’ data are critical for the construction of useful programs. Many programs store and retrieve data from the computer’s hard drive while executing. In this lesson, we used a simple example of reading from and writing to a text file. There are other ways of storing and retrieving a program’s data, such as by using a database management system (DBMS). However, that technique is beyond the scope of this lesson.


