icstars2

C - Occurrences In File

Hello all! I am in the last leg of my semester and my teachers and TAs no longer have office hours, which means there's no teacher assistance on this final project. I appreciate any help or assistance!

 

The project: Create a C program that counts and finds the word with the highest number of appearances from a given file. I understand how to get the program to read from the file, but getting it to count the occurrences of entire words in the file is what gets me. I've created a prior project (see below) that counted the occurrences of numbers, but not entire words.

 

The exact instructions:

umgw21g.png

 

The input file: https://cise.ufl.edu/class/cop3275sp15/input.txt

 

Previous program, which displayed both occurrences and the number:

#include <stdio.h>
int main(void)
{
  int digit_count [10] = {0};
  int digit;
  long n;
 
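  printf("Please enter a number to begin:\t");
  if (scanf("%ld", &n) != 1)
    return 1;  /* input was not a number */

  /* peel off the last digit until none remain (0 or a negative number counts nothing) */
  while (n > 0)
  {
    digit = n % 10;
    digit_count[digit]++;
    n /= 10;
  }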
 
  printf ("Digits:\t    ");
  for (digit = 0; digit <= 9; digit++)
    printf("%3d", digit);

  printf("\nAppearances:");
  for (digit = 0; digit <= 9; digit++)
    printf("%3d", digit_count[digit]);
   
  return 0;
}

 

The new program doesn't have to be anything like this, though it might be useful to start from somewhere not at the ground floor.



Tiddy-bits:

This would be so much easier in Python... Does anyone know a good way to use a hashmap in C? Probably is not what the professor is looking for, though =/

You should definitely split the input file into lines (split on \n) and words (split on spaces). Not sure if your professor wants you to count commas and periods or not. Is "Earth" the same as "Earth."?
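
On the "Earth" vs "Earth." question, one way to sidestep it in C is to skip everything that isn't a letter while tokenizing. A minimal sketch, assuming words are runs of ASCII letters (with inner apostrophes kept) and that case should be ignored; `next_word` is a hypothetical helper name, not something from the assignment:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies the next word from *cursor into out
   (lowercased, letters and inner apostrophes only) and advances *cursor.
   out_size must be at least 1. Words longer than out_size - 1 are truncated.
   Returns 1 if a word was found, 0 at the end of the text. */
int next_word(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;
    size_t n = 0;

    /* skip spaces, newlines, punctuation, digits */
    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;

    /* copy letters; an apostrophe after the first letter is kept ("don't") */
    while (isalpha((unsigned char)*p) || (*p == '\'' && n > 0)) {
        if (n + 1 < out_size)
            out[n++] = (char)tolower((unsigned char)*p);
        p++;
    }

    /* drop trailing apostrophes so a closing quote is not part of the word */
    while (n > 0 && out[n - 1] == '\'')
        n--;

    out[n] = '\0';
    *cursor = p;
    return n > 0;
}
```

Call it in a loop with a cursor that starts at the beginning of the text (say `const char *p = text;`) until it returns 0; each call leaves one cleaned-up word in `out`.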

Out of all the languages I would consider "pretty good", C is the worst at text processing. :P


Dunno C at all, but just in concept what I'd do if I were writing this in Lisp is have five variables: a bookmark to remember which word you last scanned in the file, a count for the highest number of occurrences so far, a count for the word it's currently counting, a string for the most common word, and a string to store the word it's currently counting. Then a table to keep track of the words scanned, and another table with punctuation to ignore.

 

Then a simple recursive function (recursing until the end of the file):

 

It scans until the first space, loads that as a string in the "currently scanning" variable, checks the table for punctuation to remove from the string, checks the final string against the table of already scanned words, and if it is a new word then it scans between spaces and compares those with the string until it reaches the end of the file (adding one to the "current occurrences" if they match). At the end of the file, it compares the "current occurrences" with "highest occurrences" and overwrites "most common word" with the string and "highest occurrences" with the number if "current occurrences" is higher. Then it adds the "currently scanning" string to the table of words scanned, and repeats with the next word in the file and adds one to that "bookmark" variable.

 

If it finds a word it's already scanned, it adds one to the bookmark and goes to the next word.
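
The steps above can be sketched in C roughly like this. It is only an approximation of the idea: it uses loops instead of recursion, assumes the words have already been split out and cleaned into an array, and checks "already scanned" by looking back at earlier words rather than keeping a separate table. `most_common_index` is a made-up name.

```c
#include <stddef.h>
#include <string.h>

/* Returns the index of the first occurrence of the most frequent word in
   words[0..count-1], or -1 if count is 0. *best_count receives how many
   times that word appears (0 when there are no words). */
long most_common_index(const char *words[], size_t count, size_t *best_count)
{
    long best = -1;
    *best_count = 0;

    for (size_t i = 0; i < count; i++) {        /* i plays the "bookmark" */
        int seen_before = 0;
        for (size_t j = 0; j < i; j++) {
            if (strcmp(words[j], words[i]) == 0) {
                seen_before = 1;
                break;
            }
        }
        if (seen_before)
            continue;   /* already counted: just move the bookmark on */

        size_t current = 0;                     /* "current occurrences" */
        for (size_t j = i; j < count; j++)
            if (strcmp(words[j], words[i]) == 0)
                current++;

        if (current > *best_count) {            /* new "highest occurrences" */
            *best_count = current;
            best = (long)i;
        }
    }
    return best;
}
```

This compares every new word against the rest of the text, so the work grows roughly with the square of the word count. That is probably fine for a small input file, but it is the cost that a hashmap avoids.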

Edited by TCK


TCK, where you want a "table" I want a hashmap. Either way, there's unfortunately no built-in equivalent in plain old C. You could keep a list of words and their counts, I guess...

 

Edit: Ugh, is grind-the-problem-down-until-it's-not-a-problem-anymore a single word?

Double Edit: Python FTW. I definitely "cheated" here with Python's collections module -- recreating Counter might be a small challenge. That's the part that I think really needs a hashmap.

 

from collections import Counter

word_list = []
with open('textfile.txt') as textfile:
    for line in textfile:
        for basic_word in line.split(sep=' '):
            word = basic_word.strip(' ,.:\n')  # remove punctuation
            if word != '':  # ignore the empty string
                word_list.append(word)

word_counts = Counter(word_list)  # now we count
for word in word_counts:
    print('{}: {}'.format(word, word_counts[word]))

# the assignment asks for the single most common word
if word_counts:
    top_word, top_count = word_counts.most_common(1)[0]
    print('Most common: {} ({})'.format(top_word, top_count))
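
For what it's worth, a bare-bones stand-in for Counter can be built in C with a small hash map. This is a sketch under assumptions, not a library API: the names `Counter`, `counter_add`, `counter_most_common`, and `counter_free` are invented here, the table has a fixed number of buckets (`BUCKETS`), and collisions are handled by chaining.

```c
#include <stdlib.h>
#include <string.h>

#define BUCKETS 1024   /* assumed table size, fixed for simplicity */

typedef struct Entry {
    char *word;
    unsigned long count;
    struct Entry *next;   /* chain of entries that share a bucket */
} Entry;

typedef struct {
    Entry *buckets[BUCKETS];
} Counter;

/* djb2-style string hash */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Adds one to the count for word. Returns the new count, or 0 if out of memory. */
unsigned long counter_add(Counter *c, const char *word)
{
    unsigned long b = hash_word(word) % BUCKETS;

    for (Entry *e = c->buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return ++e->count;

    /* first time this word is seen: prepend a new entry to the chain */
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = c->buckets[b];
    c->buckets[b] = e;
    return 1;
}

/* Returns the entry with the highest count, or NULL if the table is empty. */
const Entry *counter_most_common(const Counter *c)
{
    const Entry *best = NULL;
    for (size_t b = 0; b < BUCKETS; b++)
        for (const Entry *e = c->buckets[b]; e != NULL; e = e->next)
            if (best == NULL || e->count > best->count)
                best = e;
    return best;
}

/* Frees every entry and leaves the table empty. */
void counter_free(Counter *c)
{
    for (size_t b = 0; b < BUCKETS; b++) {
        Entry *e = c->buckets[b];
        while (e != NULL) {
            Entry *next = e->next;
            free(e->word);
            free(e);
            e = next;
        }
        c->buckets[b] = NULL;
    }
}
```

Start from a zeroed table (`Counter c = {0};`), call `counter_add(&c, word)` for every word read, then ask `counter_most_common(&c)` for the winner and finish with `counter_free(&c)`.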

C sure is fun to write in. 

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
#include <stdbool.h>

typedef struct Word {
    char *word;
    unsigned long instances;
} Word;

Word *SearchForWordWithString(Word **words, unsigned long wordsCount, const char *string) {
    for(unsigned long w=0;w<wordsCount;w++) {
        if(strcmp(words[w]->word,string) == 0) {
            return words[w];
        }
    }
    return NULL;
}

int main(int argc, const char * argv[]) {
    if(argc == 2) {
        FILE *file = fopen(argv[1],"r");
        if(file == NULL) {
            fprintf(stderr,"%s was not openable.\n",argv[1]);
            return 1; // nonzero exit status signals the failure
        }
        fseek(file,0,SEEK_END);
        size_t length = ftell(file);
        fseek(file,0,SEEK_SET);
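        char *data = calloc(length+1,sizeof(*data)); // +1 keeps a terminating zero byte
        if(data == NULL) {
            fclose(file);
            return 1;
        }
        fread(data, sizeof(*data), length, file); // element size first, then element count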
        fclose(file);
        
        for(size_t i=0;i<length;i++) {
            if(data[i] >= 'A' && data[i] <= 'Z') {
                data[i] = tolower(data[i]); //make lowercase
            }
            else if((data[i] < 'a' || data[i] > 'z') && data[i] != '\'') {
                data[i] = 0x0;
            }
        }
        
        
        // I may need to go to the hospital after writing this program.
        unsigned long wordsCount = 0;
        unsigned long wordsCapacity = 1000;
        Word **words = malloc(sizeof(*words) * wordsCapacity);
        
        for(size_t i=0;i<length;i++) {
            char *string = data + i;
            size_t stringLength = strlen(string);
            if(stringLength == 0) continue;
            
            Word *word = SearchForWordWithString(words, wordsCount, string);
            
            if(word) {
                word->instances++;
            }
            else {
                words[wordsCount] = malloc(sizeof(**words)); // size of a Word; sizeof(*words) is only a pointer
                words[wordsCount]->word = calloc(stringLength+1,sizeof(*(*words)->word));
                memcpy(words[wordsCount]->word,string,stringLength);
                words[wordsCount]->instances = 1;
                wordsCount++;
                
                if(wordsCount == wordsCapacity) {
                    wordsCapacity *= 2;
                    words = realloc(words, sizeof(*words) * wordsCapacity);
                }
            }
            
            i += stringLength;
        }
        
        Word *highScore = (wordsCount > 0) ? words[0] : NULL; // words[0] is uninitialized when no words were found
        bool multipleHighScores = false;
        
        // Is it over yet?
        for(unsigned long i=1;i<wordsCount;i++) {
            if(words[i]->instances > highScore->instances) {
                highScore = words[i];
                multipleHighScores = false;
            }
            else if(words[i]->instances == highScore->instances) {
                multipleHighScores = true;
            }
        }
        
        // High score! High score! Yay!
        if(highScore) {
            if(multipleHighScores) {
                printf("The words with the most instances are: \"%s\"",highScore->word);
                for(unsigned long i=0;i<wordsCount; i++) {
                    if(words[i] != highScore && words[i]->instances == highScore->instances) {
                        printf(", \"%s\"",words[i]->word);
                    }
                }
                printf(" at %lu instances.\n",highScore->instances);
            }
            else
                printf("The word with the most instances is \"%s\" at %lu times.\n",highScore->word,highScore->instances);
        }
        else {
            // Sorry, man 
            printf("There were no words in that file. You wasted my time :,(\n");
        }
        
        // Life is full of free statements.
        for(unsigned long i=0;i<wordsCount;i++) {
            free(words[i]->word);
            free(words[i]);
        }
        free(words);
        free(data);
        
    }
    else {
        printf("This program takes one argument - your input.txt file.\n");
    }
    return 0;
}

The free calls aren't strictly required in a standalone program, since the OS reclaims the memory when the process exits, but I put them there regardless, because I hate having a malloc that doesn't have a free.

 

The word with the most instances is "the" at 97 times. 
Edited by 002

Thank you everybody! I'll test that program out ASAP. I've never used Python before, but it seems a lot more intuitive than C.

 

 

 // I may need to go to the hospital after writing this program.

 

Please don't die.




You could probably do anything in C that you can do in any other language. A lot of times, though, C is not the best choice, and you're going to end up writing so much more code than you would in another language. C is here to stay, though.
Edited by 002
