Part 1: Parsing

Goals and outcomes

This assignment is intended to help the student be able to:
• Combine loops and other structures to process text input
  o Branching statements
  o Container types
  o Membership operators
  o String methods
• Structure a program to make future modifications easier
This assignment is also intended to form the basis for a series of assignments over the next weeks. To that end, it is intended to facilitate discussions about symbol processing, structured data, and the fundamental principles of object-oriented design and programming.

Background

For many applications, an input stream needs to be broken up into meaningful pieces. Depending on the application, these pieces might be called words or tokens. Application areas where this is important include systems software programming, order processing, natural language processing (AI), network security, codebreaking, national security communications monitoring, and many more.

The task

Your assignment is to build a program that takes a string as input and produces a “frequency list” of all of the words in the string (see the definition of a word below). For this assignment's purposes, the input strings can be assumed not to contain escape characters (\n, \t, …) and to be readable with a single input() statement.

Before your program ends, it prints the list of words. In the output, each line contains a single word and the number of times that word occurred in the input. For readability, the number should be the first thing on the line, and the word should be second. For example, here are two runs of the program, showing the user’s input and the output:

Enter a text line: This is a very long line of text with many words in it, most of them only once.
1 this
1 is
1 a
1 very
1 long
1 line
2 of
1 text
1 with
1 many
1 words
1 in
1 it
1 most
1 them
1 only
1 once

Enter a line of text: This is a word, and so is this.
2 this
2 is
1 a
1 word
1 and
1 so

As both runs show, words are reported in lower case, so “This” and “this” count as the same word.
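The counting and output step described above can be sketched as follows. This is only one possible approach, not the required solution: the function names count_words and frequency_lines are hypothetical, and the sketch assumes the words have already been extracted from the input line and converted to lower case.

```python
def count_words(words):
    """Return a dict mapping each word to the number of times it occurs.

    Dicts keep insertion order, so the keys stay in order of first occurrence.
    """
    counts = {}
    for word in words:
        if word in counts:      # membership operator on a container type
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


def frequency_lines(counts):
    """Format each entry as '<count> <word>', with the count first."""
    return [f"{count} {word}" for word, count in counts.items()]


if __name__ == "__main__":
    # Hand-written word list standing in for a parsed input line (assumption).
    sample_words = ["this", "is", "a", "word", "and", "so", "is", "this"]
    for line in frequency_lines(count_words(sample_words)):
        print(line)
```

Keeping the counting separate from the formatting makes it easier to change the output later without touching the counting logic.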
Specific programming requirements

1. Use good prompts for all user input.
2. No “dead code”: remove all diagnostic prints, abandoned attempts, etc.
3. Use loops and data container types to reduce repeated code. No section of the program should be longer than necessary, and the whole program must be less than 150 lines long.
4. Use comments only when necessary to document the program.
5. Variable names must be mnemonic.
6. Your program must end normally (not crash or get stuck in an infinite loop).
Part 2: Rewrite the parser program and extend it with a brief statistical section.
Once the program has printed the frequency table, print this report:
                        number of words: 24
                 number of unique words: 20
     the average length of unique words: 5.63
the average length of words in the text: 4.84
Notice that the numbers are left-justified in fields that line up vertically. The labels are right-justified. Floating-point numbers for the averages should be printed with 2 decimal places.
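One way to produce this layout is sketched below, using Python's format specifications. The function name build_report is hypothetical, the sketch assumes a non-empty list of already-extracted words, and it assumes the label column is as wide as the longest label.

```python
def build_report(words):
    """Return the four report lines for a non-empty list of words."""
    unique_words = set(words)
    total_length = sum(len(word) for word in words)
    unique_length = sum(len(word) for word in unique_words)

    # Each row pairs a label with its already-formatted value.
    rows = [
        ("number of words:", str(len(words))),
        ("number of unique words:", str(len(unique_words))),
        ("the average length of unique words:",
         f"{unique_length / len(unique_words):.2f}"),
        ("the average length of words in the text:",
         f"{total_length / len(words):.2f}"),
    ]

    # Right-justify every label to the width of the longest one, so the
    # values all start in the same column (left-justified).
    label_width = max(len(label) for label, _ in rows)
    return [f"{label:>{label_width}} {value}" for label, value in rows]


if __name__ == "__main__":
    sample_words = ["this", "is", "a", "word", "and", "so", "is", "this"]
    for line in build_report(sample_words):
        print(line)
```

Storing the labels and values in a list means a new statistic can be added as one more row, without changing the formatting code.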
And remember the definition of a "word" for this assignment: it's a sequence of English letters, hyphen(s), and apostrophe(s). A hyphen or apostrophe is considered part of a word only if it has letters on both sides. Words are separated (delimited) by any character(s) that are not part of a word.
'Words' do not have to be real English words, and the input is not restricted to English text.
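The definition above can be turned into a character-by-character scan. The following is a minimal sketch under stated assumptions, not the required solution: the names extract_words and is_letter_at are hypothetical, "English letters" is taken to mean the ASCII letters a-z and A-Z, and words are converted to lower case to match the sample output.

```python
import string

LETTERS = string.ascii_letters   # assumption: "English letters" means a-z, A-Z
CONNECTORS = "-'"                # hyphen and apostrophe


def is_letter_at(text, index):
    """True if index is inside text and the character there is a letter."""
    return 0 <= index < len(text) and text[index] in LETTERS


def extract_words(text):
    """Return the words in text, in order of appearance, in lower case."""
    words = []
    current = ""
    for index, char in enumerate(text):
        if char in LETTERS:
            current += char
        elif (char in CONNECTORS
              and is_letter_at(text, index - 1)
              and is_letter_at(text, index + 1)):
            # A hyphen or apostrophe with letters on both sides stays in the word.
            current += char
        else:
            # Any other character is a delimiter and ends the current word.
            if current:
                words.append(current.lower())
            current = ""
    if current:                  # the text may end in the middle of a word
        words.append(current.lower())
    return words


if __name__ == "__main__":
    print(extract_words("Don't stop: well-known rock--roll 'quoted' x-"))
```

Under these assumptions a doubled hyphen, as in "rock--roll", splits the text into two words, because neither hyphen has a letter on both sides. Characters outside a-z and A-Z, such as accented letters, act as delimiters.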