Computer comes to the help of lexicography

We decide to make a thesaurus

Arvind first came to know of and use Roget’s work in 1952 and wished Hindi had such a wonderful tool. He hoped that in the new spirit of dictionary making in India, a Hindi thesaurus would soon be made too. Two decades later, Arvind was in Bombay (now Mumbai), editing a Hindi fortnightly magazine, Madhuri, for the Times of India group. There was still no Hindi thesaurus on the horizon. On the evening of Christmas Day 1973, it occurred to him that he would have to make it. The next morning, we discussed the idea during our walk and decided to go ahead with the work.

We were well aware that the colossal job would require our full-time dedication. Arvind would have to leave his lucrative job and, in the absence of any financial support, we would have to live simply off our savings.

We spent some months in collecting reference material. On 19 April 1976, we started work on a part-time basis, in our off hours. Arvind would write words on specially designed cards and Kusum would later create indexes for them on a set of smaller cards. In 1978, Arvind left Bombay and we moved to Delhi. The final plunge into the ocean of Hindi vocabulary had been taken.

Arvind had imagined that we would be able to complete the work in two years (it eventually took twenty!). He had reasoned that we could follow the pattern of Roget’s Thesaurus. We assigned numbers to all the concepts and put the numbered cards in the Rogetian sequence. All that remained to be done was to fill the cards with appropriate Hindi words. Alas, it was not that simple.

To check the model, Arvind went through the first few pages of a Hindi dictionary. A large number of necessary concepts were missing in Roget’s and there was no way to add more categories between the already assigned sequential numbers.

Roget’s work is based on the so-called scientific classification. Language, however, is anything but scientific. While the study of words is a science, people coin words in various unscientific ways, mostly associative, but sometimes just whimsical. Associations vary from people to people and time to time and have societal contexts. The scientific system is also handicapped by difficulties that the layman may have in making a straight association of concepts. For example, in modern Rogetian editions, wheat is listed with grasses. Among its associations are bamboo, banana. No relationship has been pointed out with cereal or food. Another example is that of steel. The user thinks of steel in the context of iron. But in Roget, it is counted among alloys with no reference to iron.

When Roget’s system failed us, we considered emulating Amar Singh. However, he was out of sync with new realities. Wars or arms no longer conjure up images of warriors from the kshatriya caste. Nor would one associate lion with a kshatriya or cow with a vaishya. The shudras are no longer menials or servants. In Amar Singh’s time, music was a heavenly activity, but a musician a menial. Thus, he put music in the first canto Heavens, and musician under Shudras in the second; this would not work in the contemporary context.

It was now plain to us that we had no model; that we were on our own. There were no pointers to what order, sequence, pattern or structure we would give to our word groups. We decided to evolve our own system as we progressed. There were at least five false starts. It was fourteen years before we came upon a viable structure.

The job of adding words was divided between the two of us. Arvind took care of categories like activities, ideas, abstract nouns, verbs, adjectives, adverbs, idioms and exclamations. Kusum was assigned words relating to things, animals, trees, herbs and mythological names. She had to face unforeseen difficulties. Hindi has many words for a tree/animal and a word may stand for many trees/animals. Her problem was how to find a way to distinguish and insert a word in the right place. Fortunately for her, Sir Monier-Williams’ excellent Sanskrit–English dictionary gives the New Latin technical names of such things. Kusum started making an index of New Latin technical terms, to check and re-check if her entries were right.

Computer and the Shabda Lexicographer

By 1990-91, we had a roomful of 60,000 hand-written cards with over 2,50,000 words. The cards were arranged subject-wise in specially designed wooden trays in which we were able to stack two or three rows of about 150 cards. The trays and rows were arranged in conceptual groups and subgroups. To change the sequence, we would inter-shift trays, or subgroups within a tray. The task of handling the data spread all over was getting out of hand. There was also much overlapping of categories and repetition of words.

We also had to think of the means to resolve the logistics of handling the data while publishing. The numerous cards would first have to go to typists who, we feared, would mix up their sequence or lose some cards. There could be typographical errors, or corrected type sheets could get mixed up. Typesetters at the printing press would add their own quota of errors. Even with careful proof-reading, it seemed unlikely that we would have an error-free work.

The formidable task of creating indexes also stared us in the face; once the thesaurus part of the book was typeset, a veritable army would be required to index it and, worse, indexers might supply their own share of unforgivable blunders. Without an index, a thematic thesaurus would have no meaning. Even fifteen years after starting it, the work was nowhere near completion.

At this time, our son, Dr Sumeet Kumar, a double gold-medallist mbbs, ms, from the Seth G.S. Medical College, Mumbai, was working as a resident surgeon at Dr Ram Manohar Lohia Hospital, New Delhi. There, viewing the first personal computers that were beginning to be used in India and the computerization of data at the hospital, he saw their great potential.

He suggested that we computerize our data. We initially turned down the idea, then submitted. However, having over the years supported our work from our savings, we had no money for a computer nor programmers. Sumeet took up an assignment as surgeon for the National Iranian Oil Company for one and a half years, with the explicit goal of returning to India as soon as he had saved enough money to computerize our work. He was back in Delhi in 1992. After some research, we purchased our first i386 computer in May 1993.

In Iran, Sumeet also educated himself about computers and computer applications. He had determined that our work required a database programme, not just a word processor.

The importance of a database for a thesaurus or dictionary cannot be overstated. It facilitates the handling and management of data in various ways. One can add as many new categories or concepts as one likes, include extra columns, rows and fields, enter any number of synonyms, and shift groups to change/modify the sequence. Once a data is in place, duplications show up and can be removed; records or expressions can be examined, edited, changed. And, more importantly, indexing is automatic.

To be of any use, databases need complex programming. We soon learned that there were no programmes available for making thesauruses. We would need to get our own software package developed and customized. But computer programmers do not come cheap. Further, we discovered, no one from the several software companies we approached had any previous experience to meet our specific requirements. The task of developing a custom-built solution would take time and cost an astronomical amount.

Sumeet found he had a natural and hitherto undiscovered talent for programming and took on the daunting task. He selected FoxPro 2.0 as the most appropriate platform for our database. Over the next six months, he wrote the initial application for converting our manually written cards. He kept upgrading the programme, adding new modules to satisfy our ever-increasing demands, enabling us to view and examine the growing data, edit it, and reorganize it. His programme allowed us to earmark individual records for selection to feature in various types of mono-, bi- and multilingual thesauruses and dictionaries. He has now evolved a foolproof, almost automatic system of converting DOS data into fully formatted Adobe PageMaker and Microsoft Word documents with multilingual indexes, ready for taking camera-ready printouts.

Our labour of love first bore fruit after twenty years in the shape of Samantar Kosh Hindi Thesaurus—the first ever in Hindi. It contains 1,60,850 expressions grouped in 1,100 categories and 23,759 sub-categories. National Book Trust, India, published it in 1996 as part of the golden jubilee celebrations of Independence. We were thrilled to present its first copy on 13 December 1996 to the then President of India, Dr Shankar Dayal Sharma.

We often wonder what would have happened if we had not taken the computer route. We may still have been writing cards!

Soon after that we busied ourselves in making an extended bilingual vesion of it, which was published by the Penguin India as our massive work The Penguin English-Hindi/Hindi-English Thesaurus and Dictionary द पेंगुइन इंग्लिश-हिंदी/हिंदी-इंग्लिश थिसारस ऐंड डिक्शनरी. And now is being marketed by the authors through Arvind Linguistics,, E-28, 1^st Floor, Kalindi Colony, New Delhi 110065.

Further on we have been able to come out with a still more extended version of it on the internet under the title Arvind Lexicon.

We decide to make a thesaurus

Computer and the Shabda Lexicographer

Comments