Background and History of the Project


As part of the Princeton-IBM Project Pegasus in the 1980's, an electronic archive of 2300 Cairo-Geniza transcriptions was created. IBM Corporation donated six PC-XT's, a 3812 Laser Printer and printer software to the project. The Department of Near Eastern Studies funded the keyboarding. The total volume of texts transcribed is close to 10 MB.

The texts were input with a version of Kedit for DOS that had been configured to write Hebrew alphabet letters from right to left. Dr. Michael Sperberg-McQueen, then computer consultant to the humanities at Princeton, adapted Mansfield Software's Kedit for DOS for Hebrew and Arabic letters. This product was called Kedit/Semitic and served as a limited right-to-left wordprocessor for MS-DOS vintage 1985. It was demonstrated at IBM higher education conferences.

In Kedit/Semitic the cursor starts at the right side of the screen and moves one space to the left after a Hebrew or Arabic letter is typed; similarly, pressing the space-bar moves the cursor one space to the left. Pressing the "Enter" key returns the cursor to the right side of the screen. This system would be awkward to use as a wordprocessor for Hebrew, since it lacks a function to wrap Hebrew text correctly when the typed text reaches the left margin, but it is quite adequate for line-by-line transcriptions that do not require continuous text wrap. The system also has a left-to-right mode in which it is a full-function wordprocessor, still widely used today for its power and simplicity.

Hebrew and Arabic letters are displayed on the screen and printed using the Duke Language Toolkit.

The texts come from a collection of roughly 15,000 Geniza texts (out of the total 200,000+) which deal with the daily life of the Jewish community in Cairo and those of other places in the Mediterranean, mainly in the 11th to 13th centuries. These so-called "Geniza documents" range in size from a few words to long letters of 80-100 lines.

This collection had been the focus of the scholarship of S. D. Goitein (1900-1985) at the Hebrew University in Jerusalem, from 1957 at the University of Pennsylvania, and after his retirement in 1971 at the Princeton Institute for Advanced Study. His multi-volume study, A Mediterranean Society: The Jewish Communities of the Arab World as Portrayed in the Documents of the Cairo Geniza, 6 vols. (Berkeley: University of California Press, 1967-1993), is the standard work on what has become known as the "documentary Geniza." The term distinguishes this sub-set of 15,000 fragments from others that contain religious or other literary material such as Bibles, rabbinic texts, liturgy, poetry, mysticism, religious philosophy, magical texts, and even fragments from books of Islamic literature.

The term "document" and hence "documentary Geniza" has a technical function in Geniza studies that is obscured by the common use of the term document to mean any electronic text file. In Geniza studies "document" means any text, whether a self-contained leaf or a leaf from a notebook, that originates in daily, routine activity (letters, legal documents, business accounts, marriage contracts, divorce documents and lists of all kinds) rather than being a page from a copied manuscript on a literary subject.

In 1985, under the supervision of the project director, Near Eastern Studies Professor Mark R. Cohen, keyboarders began computerizing "documents" in Judaeo-Arabic (Arabic written in Hebrew characters) and Hebrew from the Cairo Geniza. The goal was to create a free-text data-base of these texts that could be searched electronically for information on the economic and social history of Jews and Muslims in the medieval Mediterranean, as well as on the history of the Arabic language.

Over ten years, working at varying paces, keyboarders covered (1) all the major book-length publications of Geniza documents (many poor editions being corrected by comparison with photocopies of the manuscripts kept in the department's "S. D. Goitein Laboratory for Geniza Research"); (2) most of the documents published by S. D. Goitein in article form, incorporating corrections made by him in his personal offprints; (3) documents deciphered or re-edited by scholars M. Cohen and A. L. Udovitch of the Near Eastern Studies Department; (4) many documents "edited," that is, typed by Professor Goitein, but not published.

The total of about 2300 individual texts that has thus far been completed represents a "provisional corpus." Necessarily it encompasses mostly longer documents, of the type normally published in the course of research. Thus, although 2300 texts amount to about 15 percent of the target figure of 15,000 fragments by count, in terms of actual bytes of data the total comprises rather more than 15 percent: close to 10 Megabytes out of a completed corpus roughly estimated at between 40 and 50 Megabytes, or about 20 to 25 percent. The remainder consists of the most challenging texts, for they are the ones requiring original decipherment by skilled students of the language of the Geniza. However, it should be noted that the provisional corpus alone will greatly facilitate this task by providing decipherers quick access to words and phrases in the "Geniza vocabulary." Furthermore, the provisional corpus provides a solid, if far from exhaustive, basis on which significant research can already be done.

We would be remiss if we did not mention the transcribers without whose dedication and spirit this archive would not have been possible. Each document bears the initials of the transcriber and the date the document was last updated.

Although funding for active transcription work in the project from Pegasus ended in 1990, the archive has been maintained and augmented through the scholarship of Prof. Cohen, Prof. Udovitch and their students.

To make the archive keyword searchable was one of the top priorities of Prof. Cohen and his colleagues. Believing that no one can anticipate all possible items of information of interest to scholars, now and in the future, the project designers felt from the beginning that it was essential to use an "all-words-as-keywords" approach. The main problem during the past decade has been to find an effective "search engine" to retrieve information written in right-to-left font.

Prof. Cohen has built a searchable prototype using a small subset of the documents dealing with poverty and assistance to the poor using Nota Bene's Hebrew function and indexing option on a Toshiba Laptop. While this system works well with a small set of files, it would be impractical to index the entire corpus using Nota Bene. This system has been demonstrated at international Geniza conferences by Prof. Cohen and is used by him in a seminar on the subject of poverty and charity. It serves as an important tool in his work with one aspect of the communal life of medieval Mediterranean Jewish society.


In 1994, the new computer consultant to the humanities at Princeton University, Dr. Peter Batke, proposed a solution to the "search engine" dilemma. Given the history of the project in the DOS environment and the ongoing work of Prof. Cohen with the Nota Bene (NB) indexing option (and the problem of the machine hanging with large files), it made sense to look for a DOS-based tool of roughly the same vintage that is adept at handling right-to-left Hebrew and can also handle a corpus of over 10 MB.

WordCruncher 4.5 has the advantage that it indexes large corpora in relatively small memory regions. WordCruncher separates the function of "indexing" from the function of "retrieval." "Indexing" is done only once for a given corpus and can take a considerable amount of time, depending on the size of the corpus and the speed of the machine. An indexing job of all the Geniza transcriptions may take 2 hours to run on a 1985-vintage IBM PC-XT; the same job would run in 8 minutes on a modern Pentium. "Retrieval" is based on the files created during the "indexing" phase; it is nearly instantaneous and largely independent of the speed of the machine or the size of a file. Thus, the "retrieval" performance of an IBM PC-XT is practically as fast as that of a modern Pentium. This is because retrieval uses only "index-searching," which requires no CPU-intensive processing.
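The separation of "indexing" from "retrieval" can be illustrated with a minimal sketch in Python. This is not WordCruncher's actual file format or algorithm, which the text does not describe; it only shows the general idea of an inverted index, assuming whitespace-separated words. The function names (build_index, retrieve) and the transliterated sample lines are invented for illustration.

```python
from collections import defaultdict

def build_index(lines):
    """Indexing phase (hypothetical helper): scan every line once and
    record the line numbers on which each word occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for word in line.split():
            postings = index[word]
            # Record each line only once per word, even if the word repeats.
            if not postings or postings[-1] != line_no:
                postings.append(line_no)
    return dict(index)

def retrieve(index, word):
    """Retrieval phase (hypothetical helper): one dictionary lookup,
    with no rescan of the original texts."""
    return index.get(word, [])

# Invented sample lines standing in for transcription text.
corpus = [
    "kitab ila sayyidi",
    "salam ila sahibi",
]

index = build_index(corpus)      # slow step, done once per corpus
print(retrieve(index, "ila"))    # fast step, repeated for every query
```

The cost of build_index grows with the size of the corpus, while retrieve only consults the finished index; this is why, in a design of this kind, query speed depends little on the machine or on corpus size.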

Another advantage of WordCruncher is that the index files can be distributed via ftp or on a CD-ROM independently of the retrieval program. The retrieval program will work on any PC, no matter how old, provided it has enough diskspace to hold the files.

WordCruncher was designed to be a sophisticated index, search and retrieval program for MS-DOS based computers before the era of large RAM memory regions and the Windows interface. It was developed at Brigham Young University for large textual corpora like The Collected Works of Shakespeare and the Bible and the Book of Mormon. WordCruncher was also designed to handle files in Hebrew characters and to interface with the Duke Language Toolkit.

Furthermore, and very importantly, WordCruncher has an interactive feature (a "word-wheel") which displays all the words in the database alphabetically. Since the spelling of the Judaeo-Arabic Geniza documents is very inconsistent, this feature makes it possible for a user to search for and retrieve terms or passages that might otherwise be missed. At the same time, the interactive feature reveals keywords that the user might not expect to find in the corpus, thus considerably increasing the usefulness of the database.
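The idea behind such a "word-wheel" can be sketched in a few lines of Python. This is an illustration only, not WordCruncher's implementation: it assumes whitespace-separated words and plain code-point ordering, and the names build_word_wheel and browse, like the transliterated sample words, are invented for the example.

```python
from bisect import bisect_left

def build_word_wheel(lines):
    """Collect every distinct word in the corpus and sort the list
    (hypothetical helper; ordering here is by code point)."""
    return sorted({word for line in lines for word in line.split()})

def browse(wheel, prefix, count=5):
    """Return up to `count` neighbouring entries, starting at the
    position where `prefix` would sit in the sorted list."""
    start = bisect_left(wheel, prefix)
    return wheel[start:start + count]

# Invented sample lines containing three spellings of one word.
lines = [
    "kitab ila sayyidi",
    "kitaab min sahibi",
    "ktab ila sahibi",
]

wheel = build_word_wheel(lines)
print(browse(wheel, "ki", count=3))
```

Because variant spellings tend to sort near one another, a user scrolling the list from a rough starting point sees the neighbouring forms together and can pick all of them, instead of having to guess each spelling in advance.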

The revival of the project in 1994 was greatly aided by a grant of equipment and student labor from the Department of Near Eastern Studies. The work plan for the revival drew heavily on experience Dr. Batke had gained working with text archives he had built at Duke University and at Johns Hopkins. His work as one of the designers of the Duke Language Toolkit made it easy for him to grasp the quite involved procedures for entering and editing the texts with right-to-left display and Hebrew screen and printer fonts. This is a quite esoteric area of DOS computing.

Starting in the Fall of 1994, Dr. Batke designed a workplan to (1) demonstrate the technical feasibility of right-to-left indexing with WordCruncher; (2) consolidate the individual transcription files into large files encompassing entire collections; (3) design a markup that would retrieve the crucial shelf-information of a document and would be flexible enough not to preclude other markup-schemes.

A series of consultations between Prof. Cohen and Dr. Batke yielded a working prototype of some 200 "documents" in the provisional corpus that are housed in the Bodleian Library, Oxford.

On the basis of this working prototype, the Department of Near Eastern Studies allocated $6000 to purchase two new 486 PC's and to hire a post-doc with expertise in the Geniza to assemble the collections.

In June 1995, Dr. Hassan Khalilieh was hired to start work on the corpus. His tasks included marking up the shelf-information, standardizing the "document descriptions" and separating the Hebrew text from the English descriptions and formatting information. In addition, Dr. Khalilieh assigned each "document" a general category based on its content.

At the present moment (June, 1996) all 2300 texts are being coded for indexing in WordCruncher. Additional work of proofreading will then be done using the word-wheel. Soon thereafter, a CD-ROM will be prepared, including the provisional corpus plus the search-and-retrieval software. After trials at Princeton, the package will be made available to scholars and libraries. As the corpus is enlarged in subsequent years (subject to adequate funding), the CD-ROM will be updated periodically.

Looking down the road, another desideratum involves creating digitized images of the actual documents so users can compare transcriptions with originals on their monitors, and even recommend corrections to be incorporated into the database. A parallel goal entails making retrieval of text and of digitized images accessible on the World Wide Web.