OCR-ing the First 25
Selecting an OCR Engine
It will take a little over a semester for all 200 cookbooks to be scanned, so I decided to take the 25 books scanned so far and test my OCR and text pre-processing steps on them. This will let me see whether I have correctly planned the steps I’ll need to take with the full corpus to get meaningful results. Going into the project I had experience with and access to Tesseract OCR, an open-source command-line tool for converting images to machine-readable text, so I decided to use it to create the .txt files I’ll need for analysis. When I first settled on Tesseract, I had tested it on a few scanned cookbooks and noticed that it struggled to recognize the fractions in ingredient lists but otherwise seemed to handle the rest of the text fairly accurately.
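As a rough illustration of how the .txt files could be produced in bulk, here is a minimal sketch that calls the Tesseract command-line tool from Python. It assumes each page has been saved as its own image file; the folder names, the `*.tif` pattern, and the helper names `build_tesseract_command` and `ocr_folder` are hypothetical, not part of my actual workflow.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_dir):
    """Build the command that OCRs one page image into a .txt file.

    Tesseract takes an output *base* name and appends ".txt" itself.
    """
    image_path = Path(image_path)
    output_base = Path(output_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base)]


def ocr_folder(image_dir, output_dir, pattern="*.tif"):
    """Run Tesseract on every matching image in image_dir (assumed layout)."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(image_dir).glob(pattern)):
        # check=True raises an error if Tesseract fails on a page.
        subprocess.run(build_tesseract_command(image_path, output_dir), check=True)
```

Calling `ocr_folder("scans", "text")` would, under these assumptions, write one .txt file per page image into the `text` folder. It requires Tesseract to be installed and available on the system path.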
My colleague in Preservation, who is managing digitization, is also OCR-ing the texts as part of the digitization workflow we are testing for all digital collections. She is using ABBYY FineReader, a paid, licensed product with a GUI. Since we’re using both products for different parts of the workflow and 25 books are complete, I now have more examples to compare to see whether there is a significant difference in accuracy. With this larger sample, I can see a noticeable difference in how Tesseract and FineReader handle tables of contents and other non-standard text layouts, as well as images and other “noise” on the page.

Example of Tesseract OCR on left and ABBYY FineReader on right
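A side-by-side look like this could also be roughed out in code. The sketch below uses Python’s standard `difflib` module to compare two OCR outputs of the same page word by word. The function names are hypothetical, and the score only measures how much the two engines disagree with each other; without a hand-corrected transcription it cannot say which one is right.

```python
import difflib


def similarity(text_a, text_b):
    """Return a 0-1 similarity ratio between two OCR outputs, word by word."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return matcher.ratio()


def differing_words(text_a, text_b):
    """List (words_from_a, words_from_b) pairs where the outputs disagree."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For example, `differing_words("2 cups flour", "2 cups floor")` returns `[("flour", "floor")]`, which makes it easy to skim where the two engines diverge on a page.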
Determining Pre-Processing Steps
No matter which OCR option I choose, there will inevitably be mistakes that produce extra punctuation, numerals, or unrecognizable characters in the text. These will need to be either cleaned beforehand or ignored by Python during analysis. I’ll be testing different options to determine how much advance cleanup is necessary to keep these errors and extra characters from muddying the analysis.
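One cleanup option I could test looks something like the following. This is only a sketch of a fairly aggressive approach, not a settled pre-processing pipeline, and the function name `clean_ocr_text` is hypothetical. It lowercases the text, replaces anything that is not an unaccented letter or whitespace with a space, and drops single-letter leftovers.

```python
import re


def clean_ocr_text(text):
    """Lowercase OCR text and keep only alphabetic words of two or more letters."""
    text = text.lower()
    # Replace anything that is not a letter a-z or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Drop single-letter leftovers, which are often OCR fragments.
    words = [word for word in text.split() if len(word) > 1]
    return " ".join(words)


print(clean_ocr_text("2 cups flour, sifted ~~ |"))  # cups flour sifted
```

The trade-offs are worth noting: this version throws away all quantities, splits words at apostrophes, removes real one-letter words such as “a” and “I”, and strips accented letters, which would matter for any non-English recipe names in the corpus.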
First 25 Cookbooks List
Call Number | Title | Organization | Location | Date |
---|---|---|---|---|
TX715.2 M53 B545x | “Blessed are those who hunger … for they will be filled” | First Congregational Church Women’s Fellowship | Forest City, IA | |
TX715.2 M53 B6638x 1960 | A book of favorite recipes | First United Presbyterian Church | Fairfield, IA | 1960 |
TX715.2 M53 C6514x 1923 | A collection of choice and tried recipes | Philathea Society of the Whiting Congregational Church | Whiting, IA | 1923 |
TX715.2 M53 C6515x 1903 | A collection of choice recipes | Ladies of Des Moines to benefit Des Moines Missionary Sewing School | Des Moines, IA | 1903 |
TX715.2 M53 C6622x 1950 | Cook book | Woman’s Society of Christian Service of the Methodist Church | Blakesburg, IA | 1950 |
TX715.2 M53 F365x 1928 | The Farm Bureau cook book | Women of Delaware County Farm Bureau | Delaware County, IA | 1928 |
TX715.2 M53 F38 1960z | Favorite recipes | Altar & Rosary Society, St. Mary’s Catholic Church | Manchester, IA | 1960 |
TX715.2 M53 F753x 1992 | Friends of the Fairfield Public Library cookbook | Friends of the Fairfield Public Library | Fairfield, IA | 1992 |
TX715.2 M53 H6622x | Home town recipes of Oskaloosa | Altrusa Club of Oskaloosa | Oskaloosa, IA | |
TX715.2 M53 K586x 1976 | “Kitchen secrets” with love | Low-Rent Housing Agency of Ottumwa, Iowa | Ottumwa, IA | 1976 |
TX715.2 M53 M385x 1896 | Mason City cook book: a collection of choice recipes | Woman’s Society of the Congregational Church | Mason City, IA | 1896 |
TX715.2 M53 O67x 1975 | Oppskrifter | Palestine Lutheran Church Women | Huxley, IA | 1975 |
TX715.2 M53 O88x 1977 | Ottumwa cook book | | Ottumwa, IA | 1977 |
TX715.2 M53 O97x 1982 | Our best recipes, 1857-1982 | Winthrop Commercial Club | Winthrop, IA | 1982 |
TX715.2 M53 R432x | Recipe roundup | American Legion Auxiliary, Williams-Jobe-Gibson Unit 128 | Sidney, IA | |
TX715.2 M53 R43x 1971 | Recipe favorites of Holt mothers | Des Moines Holtap Mother’s Club | Des Moines, IA | 1971 |
TX715.2 M53 S363x 1999 | Schaller Chamber cookbook: celebrating 50 years of pop corn days | Schaller Chamber of Commerce | Schaller, IA | 1999 |
TX715.2 M53 S713x 1910 | St. John’s Lutheran Church cookbook | Lutheran Ladies Aid Society | Ogden, IA | 1910 |
TX715.2 M53 T472x 1924 | Tested cooky recipes | McGregor Tourist Club | McGregor, IA | 1924 |
TX715.2 M53 T476x 1935 | Tested recipes | Orphans’ Aid Society of the St. Paul’s Evangelical Lutheran Church | Waverly, IA | 1935 |
TX715.2 M53 U55x 1997 | Unitarian Universalist Fellowship of Ames 50th anniversary cookbook | Unitarian Universalist Fellowship of Ames | Ames, IA | 1997 |
TX715.2 M53 W355x 1970 | Walnut centennial cook book | Walnut Centennial Cookbook Committee | Walnut, IA | 1970 |
TX715.2 M53 W54x 1936 | When do we eat? | Division One of Willing Workers, Frankville Community Church | Winneshiek County, IA | 1936 |
TX715.2 M53 W555x 1972 | Wild rose recipes | 21st National Square Dance Convention | Des Moines, IA | 1972 |
TX723.5 C9 F38x 1963 | Favorite recipes of Z.C.B.J. Drill Team | Z.C.B.J. Drill Team | Cedar Rapids, IA | 1963 |