LINGUIST List 13.1428

Tue May 21 2002

Review: Software: Concordance 3.0

Editor for this issue: Terence Langendoen <terrylinguistlist.org>

What follows is another discussion note contributed to our Book Discussion Forum. We expect these discussions to be informal and interactive; and the author of the book discussed is cordially invited to join in.

If you are interested in leading a book discussion, look for books announced on LINGUIST as "available for discussion." (This means that the publisher has sent us a review copy.) Then contact Simin Karimi at siminlinguistlist.org or Terry Langendoen at terrylinguistlist.org.

Directory

Pernilla Danielsson, Concordance 3.0 (software review)

Message 1: Concordance 3.0 (software review)

Date: Tue, 21 May 2002 13:15:29 +0100
From: Pernilla Danielsson <pernillaclg.bham.ac.uk>
Subject: Concordance 3.0 (software review)

Concordance version 3.0, software package available at http://www.rjcw.freeserve.co.uk/.

Pernilla Danielsson: Centre for Corpus Linguistics, English Department, University of Birmingham

INTRODUCTION AND GENERAL INFORMATION

It seems unavoidable that any new concordance software will lack the one essential feature the researcher needs, whilst offering a number of less useful features instead. Every review of concordance software could therefore begin with a list of necessary but unavailable features and continue with a list of less useful but present ones. The only problem is that these lists would vary from researcher to researcher.

However, instead of compiling advantages and disadvantages, this review focuses on the positive features of this concordance software, such as the link from the word list to the concordances and the new option of saving results for web publishing. Some concerns are raised about time and space requirements and about the presentation of collocations.

The software reviewed here is Concordance 3.0, a Windows-based tool available for MS Windows 95, 98, NT and XP. Concordance 3.0, implemented by Rob Watts, was first released in January 1999 and has since been updated fairly regularly. A test version of the tool can be downloaded from the web site http://www.rjcw.freeserve.co.uk/, and a single license (also available on-line) costs GBP 55 or USD 89 for the first copy.

BACKGROUND

Naming a tool "Concordance" unavoidably raises expectations based on the connotations of the words 'concordance' and 'concordancer'. It is not a name to choose if you want to avoid comparison with similar software in the field. The word 'concordance' is defined as follows in Cobuild's English Dictionary for Advanced Learners: "a concordance is a list of the words in a text or group of texts, with information about where in the text each word occurs and how often it occurs. The sentences each word occurs in are often given" (Cobuild 2001). Hence, a concordancer can be understood as software that produces such a concordance. Concordancing is one of the oldest ways of browsing through a (computerised) text or corpus. The KWIC (KeyWord In Context) model was established early in text computing and is still the standard way of presenting concordance information. Concordance 3.0 can produce these traditional KWIC concordances but also includes the option of displaying full sentences.
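For readers unfamiliar with the KWIC layout, it can be illustrated with a few lines of Python. This is only a minimal sketch of the general idea, not the way Concordance 3.0 is implemented; the function name kwic, the width parameter and the sample text are all made up for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return KWIC lines: left context, bracketed keyword, right context."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords form a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}".rstrip())
    return lines


# Hypothetical sample text, used only to show the layout.
sample = "The cat sat on the mat. The dog saw the cat and the cat ran."
for line in kwic(sample, "cat", width=15):
    print(line)
```

Because the left context is padded to a fixed width, every hit places the keyword in the same column, which is what makes recurring patterns to the left and right easy to spot.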

When concordances first appeared on the linguistics scene they were criticised on a number of counts: they lacked lemmatisation, they could not distinguish between homographs, and they offered little choice of context. Developments in NLP have since made it possible to add some of this information to the text, enabling concordancers to include lemmatisation and part-of-speech disambiguation among their features. Whether linguistic annotation adds information to the text or removes it is not a consideration for this review (see John Sinclair's talk at the 6th TELRI Seminar for further discussion of this matter). What is notable is that, while lemmatisation is an option (you as the linguist may manually group words together to form a lemma), part-of-speech tagging is not supported in the software reviewed here.

USING THE NEW CONCORDANCE 3.0

Perhaps it is my own expectation of a piece of software named Concordance that makes this tool such a great mystery to me. The first thing the software does is index your text or corpus at word level and produce word lists; this leaves you waiting. The word list turns out to be vital for your searches, which sets this tool apart from most other available concordance tools. Instead of prompting you to type in a search word and returning concordance lines, as most concordancers do, this tool uses the word list, displayed in the left-hand window, as a direct link into the text. Once you get used to the idea that clicking on a word in the word list performs your search, you may indeed come to see this as a very attractive feature, perhaps not so much for what you get now as for what it could provide in a later version. What if every new search also gave you an immediate word list for the smaller set of results? A linguist's trained eye will probably pick up more regularities when confronted with these lists than any statistical calculation can provide. As such, this interface has many exciting possibilities for future development.
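The index-first design described above can be sketched in Python. This is a guess at the general technique (a word-level index built once, with the word list serving as the entry point for every search), not a description of the tool's internals; the names build_index, word_list and lookup and the sample text are hypothetical.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lower-cased word form to the character offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text):
        index[match.group().lower()].append(match.start())
    return index


def word_list(index):
    """Word list sorted by descending frequency, then alphabetically."""
    return sorted(
        ((word, len(offsets)) for word, offsets in index.items()),
        key=lambda item: (-item[1], item[0]),
    )


def lookup(text, index, word, width=20):
    """'Clicking' a word: fetch its contexts from the stored offsets."""
    return [
        text[max(0, pos - width):pos + len(word) + width]
        for pos in index.get(word.lower(), [])
    ]


# Hypothetical sample text.
sample = "The cat sat on the mat. The dog saw the cat."
index = build_index(sample)
print(word_list(index)[:2])
print(lookup(sample, index, "cat", width=8))
```

The cost of such a design falls on the indexing step, which has to read and sort the whole corpus before the first search; every later lookup only follows stored offsets and does not rescan the text.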

However, while the word list as a search link has attractive possibilities, it also has its faults. When you begin a new session, the indexing procedure leaves you waiting. This review was originally intended to include tests on three separate corpora: one of five hundred thousand words, another of 2 million words and a third of 20 million words. When I started with the small 500,000-word corpus, which in fact consists of only four novels, the system warned that this was a very large file. Considering that it is currently no challenge to store a 100-million-word corpus on your PC, I chose to proceed and decided to get a cup of coffee while waiting, leaving the system running. This involved a walk of about 40 metres plus the time it takes to fill the cup, and on my return I discovered that it was still indexing. To be more precise, the system took 22.85 seconds to analyse the file, another 147.22 seconds to sort it and 170.10 seconds to finally load it: 340.17 seconds in all, or more than five and a half minutes.

For those of us who are used to instant access to the 430 million words of the Bank of English, it is a stressful wait. However, it must also be acknowledged that those of us more comfortable with very large corpora are probably better off using UNIX-based products such as CWB (Christ 1994), CUE (Mason 1996) or Lookup (Clear 1987). Still, if we compare Concordance 3.0 with some of its MS Windows-based competitors, such as WordSmith (Scott 1999), the latter takes only a few seconds to load the same input on the same machine.

Of course, having indexed the file once, you can save the index and use it as a starting point the next time you want to search the same data. Don't hold your breath, though: loading a saved file also involves some waiting. You may choose the option "Display while load", but this makes loading even more time-consuming. Apart from the time spent, the saved index file takes up space on the hard disk. Despite the earlier warnings given by the software, there is no apparent upper limit to what it can handle. In the end, it is the size of your hard disk and your patience that will decide how big a corpus you can work with.

Once the texts have been imported into the concordancer, the next item on the agenda is to try out the query options. As mentioned above, the software may seem rather confusing at first glance if you are not familiar with it. However, once you know the search routine, you might come to see it as one of Concordance 3.0's strengths. The concordancer has a simple interface: compared with several of its competitors, it presents a rather clean window split between the word list and the relevant concordance lines. Although interface design comes down to personal preference, this one is at least easy to use in a teaching environment, as there are not too many distracting buttons that might lead students astray.

Researchers who do not work with English will be pleased to hear that this tool accommodates other languages. You may choose an alphabet from an extensive list (not only the Latin alphabet but also, for example, Greek and Hebrew), and if the tool discovers a character in your text that is not covered by the chosen alphabet, it offers to add it. After the additions, you may go in and sort the alphabet. I found this very useful for the Swedish characters "å", "ä" and "ö". They are sorted in that order at the end of our alphabet (although not in the same order as in the other Scandinavian countries), and it is a relief to be able to control the sorting manually. The tool also performed well when tried on the Chinese part of the Birmingham Centre for Corpus Linguistics' new Chinese-English translation database (on a machine running Chinese Windows NT).
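Why manual control over the alphabet matters can be shown with a short Python sketch. It assumes a user-defined alphabet string in which the Swedish letters å, ä and ö follow z; the names SWEDISH_ALPHABET, RANK and sort_key and the word list are invented for the example and say nothing about how Concordance 3.0 stores its alphabets.

```python
import unicodedata

# Hypothetical user-defined alphabet: a-z followed by the Swedish å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def sort_key(word):
    """Rank each character by its alphabet position; unknown characters sort last."""
    # Normalise to composed form so that 'å' is one character, not 'a' + ring.
    word = unicodedata.normalize("NFC", word).lower()
    return [RANK.get(ch, len(RANK)) for ch in word]


words = ["öl", "zebra", "äpple", "ål", "apa"]
print(sorted(words))                # code-point order puts äpple before ål
print(sorted(words, key=sort_key))  # ['apa', 'zebra', 'ål', 'äpple', 'öl']
```

A plain code-point sort places "ä" before "å", which is wrong for Swedish; ranking characters by their position in an explicit alphabet gives the expected order and can be rearranged for the other Scandinavian languages.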

The software includes all the normal features, such as statistics on the text you are working on: the text size, the number of tokens, the number of types and the type/token ratio. Moving on to statistics on relations between words, the collocation features found under the Context menu give information about collocates based on word positions around the keyword. This way of presenting collocates seems to have gained popularity and is also found in the WordSmith and LookUp software. It is not clear why the linguist is forced to look at the collocates in boxes per position. Although this is sometimes helpful, in many cases it is very annoying. On a list of wanted but unavailable features, alternative display options for the collocation information rank high.
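The statistics mentioned here are simple to state in code. The following Python sketch computes a type/token ratio and a collocate table broken down by position around the keyword, and then shows one possible alternative display in which the positions are pooled into a single list. The tokenisation (lower-cased runs of word characters), the function names and the sample text are assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Crude tokeniser: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def positional_collocates(tokens, keyword, span=2):
    """Count collocates separately for each position -span..+span."""
    table = defaultdict(Counter)
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                table[offset][tokens[j]] += 1
    return table


def merged_collocates(table):
    """Alternative display: pool all positions into one frequency list."""
    total = Counter()
    for counts in table.values():
        total.update(counts)
    return total


# Hypothetical sample text.
tokens = tokenize("The cat sat on the mat. The dog saw the cat.")
table = positional_collocates(tokens, "cat", span=2)
print(round(type_token_ratio(tokens), 2))
for offset in sorted(table):
    print(offset, table[offset].most_common(3))
print(merged_collocates(table).most_common(3))
```

The per-position table corresponds to the "boxes per position" view; the pooled list is one example of the kind of alternative display the review asks for.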

Even with all these useful features, one of the most interesting parts of this tool lies not in its linguistic features but in the way it exploits the opportunities of the web. In a simple save operation (separate from the normal save as text or PDF), the tool offers you a complete web version of your research: a four-part frame window comprising a headword list window, a window for the concordance, a window for the original text and a window for different sections, combined with a smaller window for moving between the sections. Exploiting the possibilities hypertext has to offer, this format enables you to jump easily from the concordance to a specific place in the original text and back, even without a web server to post it for the whole world to see. This innovative use of HTML for saving workspaces gives you a useful tool to present to students. It also has great possibilities for the future. Why not include a small program that lets students manipulate the concordance lines themselves (sorting, sub-querying, etc.)? Add a small window for the student to write down her findings and a submit button, and we have a very useful e-learning facility.

CONCLUDING REMARKS

Concordance 3.0 is a useful tool for corpus linguists, language teachers and lexicographers; the many users it already has around the world prove this. For those of you who have not yet tried it, the publishing of results as web sites and the direct link between word lists and concordances should be motivation enough to at least download a test version from the internet (http://www.rjcw.freeserve.co.uk/). If you need a tool that can handle millions and millions of words of language data, you might find this one a bit on the slow side. However, the software may still be of use in your teaching.

REFERENCES

Christ, O. (1994) A Modular and Flexible Architecture for an Integrated Corpus Query System. In Proceedings of COMPLEX'94: 3rd Conference on Computational Lexicography and Text Research, Budapest, Hungary, July 7-10, 1994, pp. 23-32.

Clear, J. (1987) Computing. In Sinclair, J. M. (ed.), Looking Up: An Account of the COBUILD Project. Glasgow: Collins ELT. ISBN 0-00-370256-1.

Mason, O. (1996) Corpus Access Software: The CUE System. Text Technology: The Journal of Computer Text Processing, Vol. 6, No. 4, Winter 1996, pp. 257-266. ISSN 1053-900X. Wright State University-Lake Campus.

Scott, M. (1999) WordSmith Tools version 3. Oxford: Oxford University Press. ISBN 0-19-459289-8.

ABOUT THE REVIEWER

Pernilla Danielsson holds a PhD (Gothenburg 2001) in Computational Linguistics. She is a senior researcher and deputy director at the Centre for Corpus Linguistics, University of Birmingham.