LINGUISTIC RESOURCES ON THE INTERNET
by Roger Harris
E-mail: rwsh@nationalfinder.com
General linguistic resources, and machine translation resources in particular, may be found in many parts of the Internet. Similar networks, such as Compuserve, also have linguistic sections (Compuserve: GO FLEFO or GO MACCIMSUP, which uses translation technology — Eng/Fre/Ger — from Intergraph Corp.), but these are not directly available on Internet.
Access to much of Internet's data may be obtained with a simple computer, modem and communications program. My own 'antique' system includes an Amstrad 1640 connected to a 1200 baud modem. Where linguistic resources include high-quality screen graphics, colour, and sound, then obviously you will need a more advanced computer system to take advantage of these features.
You may gain access to Internet either via a terminal connected to a large academic or corporate computer system or via a stand-alone computer connected to the telephone system. Whatever the case, the distant host computer to which you will be connected will have a directory structure similar in some ways to PC-DOS as used in IBM PCs and their clones. If you are accustomed only to Windows or Apple Macintosh icon screens then you might find this a problem. However, navigation is very simple and only a few commands will be needed.
Your computer will be connected to your local host computer and will exchange data using a protocol such as Kermit, XMODEM or YMODEM. The distant host computer, for example, in Brazil, will automatically set up a communications link with your local host. A little knowledge of the Unix operating system might be useful but is not essential. Other than that, instead of accessing, say, drive C: or drive F: on your computer, you will be accessing a distant hard disk drive which is identified by name instead of by letter. It's also a bit slower. That's all.
Various computer commands are shown below enclosed in single quotation marks, for example, 'dir'. The quotation marks are not to be used when typing in a command. The control key is represented by ^, as in '^C' for control C. The Carriage Return or Enter key is referred to below simply as Enter.
Help in the form of screens of commands and explanations is available almost everywhere by typing 'h' or 'help' or '?' at the prompt and pressing Enter; use data logging (see below) to build up your own file of help pages. Type 'cd' pathname to select a directory path. Type 'cd ..' to go up one directory level. Type 'cd' to go to the highest directory level. Type 'dir' to display the current directory which will probably look something like this:
MULBERRY.SRV.CS.CMU.EDU:/usr0/anon
drwxr-xr-x 2 root system 512 Aug 24 1994 misc
-rw-r--r-- 1 root system 1158 Nov 23 1993 READ_ME
drwxr-xr-x 2 root system 512 Feb 17 00:33 project
drwxr-xr-x 6 3973 0 2048 Aug 24 1994 sys
drwxr-xr-x 2 root system 1536 Feb 21 15:43 user
In the line above the body of the directory, the capitalised words refer to the full name of the computer being accessed (CMU = Carnegie Mellon University), while /usr0/anon is the directory path.
In the body of the listing, in column one, if the first character is a 'd' (as in, for example, drwxr-xr-x), then the line refers to a directory which may in turn contain other directories and/or files. If the first character is a '-', then the line refers to a file. Letters after the first character refer to file security and access.
In the second line, READ_ME is a text file which can be downloaded into your host computer by typing 'get' READ_ME and pressing Enter. The number 1158 is the number of bytes in that file and Nov 23 1993 is the date when it was last updated. Type 'cd' project to go to the project directory.
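As an illustration of the listing format just described, the following Python sketch splits one line of such a listing into its parts. The function name parse_listing_line is my own invention, and the sketch assumes the nine-column layout shown above and names without spaces; listings on other hosts may differ.

```python
def parse_listing_line(line):
    """Split one line of a Unix-style directory listing into its parts.

    Assumes the nine-column layout of the example listing and that the
    file or directory name contains no spaces.
    """
    fields = line.split()
    mode = fields[0]
    return {
        "is_directory": mode.startswith("d"),  # 'd' = directory, '-' = file
        "permissions": mode[1:],               # file security and access letters
        "size_bytes": int(fields[4]),
        "updated": " ".join(fields[5:8]),      # date, or date and time
        "name": fields[8],
    }


entry = parse_listing_line("-rw-r--r-- 1 root system 1158 Nov 23 1993 READ_ME")
print(entry["name"], entry["is_directory"], entry["size_bytes"])
# prints: READ_ME False 1158
```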
Usually, one can type in the full directory path. If that is rejected then type in the directory path names individually until you reach the desired directory. To examine the contents of a directory, type 'dir' and press Enter; this lets you check whether the next directory's name is the same as that in your reference source.
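The same "full path first, then one directory at a time" approach can be sketched in Python using the standard ftplib module. The function name change_directory is hypothetical; the only thing it requires of its first argument is a cwd method that raises ftplib.error_perm when the host refuses a path, as ftplib.FTP does.

```python
from ftplib import error_perm


def change_directory(ftp, path):
    """Try the full directory path; if refused, descend one level at a time."""
    try:
        ftp.cwd(path)
        return
    except error_perm:
        pass  # the host rejected the full path, so fall back to single steps
    if path.startswith("/"):
        ftp.cwd("/")  # start again from the top of the directory tree
    for part in path.strip("/").split("/"):
        ftp.cwd(part)
```

With a live connection one might write ftp = FTP("ai.sunnyside.com"), then ftp.login() for an anonymous log-on, then change_directory(ftp, "/pub/nl-kr"); whether a given host from this article still answers is of course not guaranteed.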
The distant computer could take up to a minute or more to respond depending upon data traffic levels. Repeated, impatient pressing of a key will significantly slow down the response; each keystroke is stored and will be executed once the distant computer is free to deal with your instructions. It may have to execute the dozens or even hundreds of keystrokes which you might have made. This could take minutes or more and the distant system might appear to have gone berserk. Sometimes it is best to hang up, type '^H' and press Enter, and log on again.
Be prepared to use your ingenuity in finding files and do not be surprised if the directory structure which you encounter does not exactly match the description obtained from a reference book or from this article. If you get lost or need help, then type 'h' or 'help' or '?' and press Enter. Internet is an evolving entity subject to continual change, revision, addition, and deletion.
Data logging
Any text which appears on your screen may be sent to a printer by using the print-screen key combination (on many IBM-compatible PCs, the Print Screen key while holding down the Shift key). The communications software will automatically store in a named file any text which is received from a distant source. This process may be called data logging. The software will store the text in a file whose suffix or extension name might be .LOG, or you can specify a filename. The data logging function must be invoked before the data appears on your computer screen.
A source of unique albeit cryptic filenames is based upon the date and time when you are about to store some data. For example, 7.33 pm on 16th February 1995 may be coded as 5=year (1995), 2=month, 16=day, 19=hour, 33=minutes, to give the filename of 52161933. Use A, B, and C to represent October, November, and December. Add the filename extension '.199' if you want to identify the decade. Such filenames will appear in date/time order in a sorted display of filenames.
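The naming scheme above can be written out as a short Python sketch. The function name log_filename and the with_decade switch are my own; I have also assumed that day, hour and minutes are always written as two digits, which the scheme needs if names are to sort correctly.

```python
from datetime import datetime

# Months 1 to 9 are coded as digits; A, B, C stand for October to December.
MONTH_CODES = "123456789ABC"


def log_filename(moment, with_decade=False):
    """Build an eight-character log filename from a date and time."""
    name = (
        f"{moment.year % 10}"              # last digit of the year
        f"{MONTH_CODES[moment.month - 1]}"
        f"{moment.day:02d}{moment.hour:02d}{moment.minute:02d}"
    )
    if with_decade:
        name += f".{moment.year // 10}"    # e.g. '.199' for the 1990s
    return name


print(log_filename(datetime(1995, 2, 16, 19, 33)))
# prints: 52161933
```

Calling log_filename(datetime(1995, 2, 16, 19, 33), with_decade=True) gives 52161933.199, which still fits the eight-plus-three character limit of PC-DOS filenames.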
Archie, Gopher and Veronica
These are programs which will perform keyword searches in order to locate files and programs; archie operates on ftp sites and veronica on gopher sites. Archie will only locate a source; veronica will locate a source and retrieve a file. A search allows Boolean parameters such as 'machine AND translation', 'machine OR translation', and 'translation NOT machine'.
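To show what the three Boolean parameters mean, here is a small Python sketch that applies them to a list of titles. It is not how archie or veronica work internally; the function name matches and the sample titles are invented for the illustration, and only two-keyword queries are handled.

```python
def matches(query, text):
    """Evaluate 'a AND b', 'a OR b' or 'a NOT b' against the words of text."""
    words = set(text.lower().split())
    left, operator, right = query.split()
    has_left = left.lower() in words
    has_right = right.lower() in words
    if operator == "AND":
        return has_left and has_right
    if operator == "OR":
        return has_left or has_right
    if operator == "NOT":
        return has_left and not has_right
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical titles, used only to demonstrate the three operators.
titles = [
    "machine translation survey",
    "translation of poetry",
    "machine tools catalogue",
]

for query in ("machine AND translation", "machine OR translation", "translation NOT machine"):
    print(query, "->", [t for t in titles if matches(query, t)])
```

AND narrows the search to the first title only, OR widens it to all three, and NOT keeps only the title about poetry.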
An archie search for 'linguist' will return addresses such as:
coombs.anu.edu.au /coombspapers/coombsarchives/linguistics/
csli.stanford.edu /linguistics/
ftp.sunet.se /pub/mac/umich/misc/linguistics/
ftp.univie.ac.at /systems/dos/simtel/linguist/
julian.uwo.ca /doc/FAQ/greek-faq/linguistics
knot.queensu.ca /pub/tcrunchers/Misc/linguist.list
Use the data logging function of your communications software to store the data as it is displayed.
USENET newsgroups
There are reputed to be some 7,000 to 8,000 newsgroups on Internet. Your Internet access provider may be connected to fewer than half that number. Some newsgroups are either empty or dormant. A newsgroup is composed of groups of comments called threads which in turn are composed of articles put there by posters (writers). You can post a reply directly to an article on the thread or to the poster. You can download articles or whole threads. A file in your computer may be uploaded and then posted to an existing newsgroup thread or to one which you can initiate. The following newsgroups are of linguistic interest:
alt.etext electronic texts
comp.ai.nat-lang Natural language processing by computer
comp.ai.nlang-know-rep Natural language and knowledge representation
comp.software.international Finding, using, and writing non-English applications
comp.speech Research and applications in speech recognition and production
comp.text.sgml Structured documents markup languages
sci.lang Natural languages, communication
sci.lang.translation Problems and concerns of translators
FAQs (Frequently Asked Questions) are extensive, detailed documents providing copious information about the subjects covered by a newsgroup. They are available at:
ftp: rtfm.mit.edu /pub/usenet/newsgroup's name
where 'newsgroup's name' is, for example 'sci.lang'.
Linguistic resource sites
In the following section, file locations are shown as follows:
protocol: computer_name /directory/directory/file
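For readers who prefer a single address string, the following Python sketch joins an entry written in this notation into one URL. The function name location_to_url is hypothetical, and the sketch assumes that any note in parentheses follows the path, as in the lists below.

```python
def location_to_url(entry):
    """Join 'protocol: computer_name /path' (or 'http://host /path') into a URL."""
    # Drop a trailing note in parentheses, such as '(word lists)'.
    entry = entry.split("(")[0].strip()
    fields = entry.split()
    if fields[0].endswith(":"):
        # 'ftp: host /path' form: protocol, computer name, optional path.
        scheme = fields[0][:-1]
        host = fields[1]
        path = fields[2] if len(fields) > 2 else ""
        return f"{scheme}://{host}{path}"
    # 'http://host /path' form: simply close the gap before the path.
    return "".join(fields)


print(location_to_url("ftp: ftp.cmu.edu /project/fgdata/dict/ (dictionaries)"))
# prints: ftp://ftp.cmu.edu/project/fgdata/dict/
```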
Computer network addresses are often shown as four numbers in addition to the more usual letters, for example:
ftp.cmu.edu [128.2.206.173].
Either format should work. All directory names are case-sensitive. You first log on using the computer name and then select the directory. The following is a list of useful linguistic resource sites:
Alex Catalogue of Electronic Texts
http://www.lib.ncsu.edu /stacks/alex-index.html
gopher://rsl.ox.ac.uk 70/11/lib-corn/hunter
gopher://gopher.lib.nmsu.edu /11/library/stacks/Alex
Association for Computational Linguistics (ACL)
http://www.cs.columbia.edu /~acl
Send e-mail to
listserv@cs.columbia.edu
with the following in the body of the message:
index acl-1
Brown University linguistics page
http://www.cog.brown.edu /pointers/linguistics.html
Colibri newsletter (language, linguistics, etc.)
http://colibri.let.ruu.nl
comp.ai.nat-lang Usenet newsgroup
Dragomir Radev is the editor of comp.ai.nat-lang FAQ and the source of several items in this list.
e-mail: radev@cs.columbia.edu
http://www.cs.columbia.edu /~radev/home.html
Consortium for Lexical Research
e-mail: lexical@nmsu.edu
ftp: crl.nmsu.edu /CLR/catalog
Corpora, dictionaries, wordlists etc.
e-mail: ingrid.maier@slaviska.uu.se (Russian corpus)
e-mail: ldc@unagi.cis.upenn.edu (CELEX, LDC)
ftp: black.ox.ac.uk /wordlists/ (word lists)
ftp: ftp.cmu.edu /project/fgdata/dict/ (dictionaries)
ftp: ftp.cs.vu.nu /dictionaries (word lists)
ftp: ftp.funet.fi /pub/doc/dictionaries/ (word lists)
ftp: ftp.uu.net /doc/dictionaries/DEC-collection/ (dictionaries)
ftp: ftp.white.toronto.edu /pub/words/sodict.gz (Shorter Oxford)
ftp: gatekeeper.dec.com /pub/misc/stolfi-wordlists (word lists)
ftp: wocket.vantage.gte.com /pub/standard_dictionary (word lists)
http://olymp.fer.uni-lj.si /dictionary/a2s.html (Eng.-Slovene)
http://philae.sas.upenn.edu /French/french.html
http://solar.rtd.utk.edu /friends/cyrillic/cyrillic.html
http://www.fmi.uni-passau.de /htbin/lt/lte (Eng.-Ger. dictionary)
Echo Eurodicautom
Translates words between Dan/Dut/Fre/Ger/Ita/Por/Spa:
http://www.uni-frankfurt.de /~felix/eurodicautom.html
EITS (Experimental Internet Translation Service)
Launched in 1994, this service apparently offered translations between many of the world's known languages, including Pig Latin and Ubby Dubby. You can get the full hilarious details by sending e-mail to
jens@panix.com
with the subject line as
request-file eitsfaq.txt.
ELSNET (European Language and Speech Network)
e-mail: elsnet-list@cogsci.ed.ac.uk
http://www.cogsci.ed.ac.uk /elsnet/home.html
International Organization for Standardization (ISO), ISO Online
http://www.iso.ch /welcome.html (English version)
http://www.iso.ch /welcomef.html (French version)
ISO-8859-1 FAQ (International Standards Organisation)
ftp: ftp.vlsivie.tuwien.ac.at /pub/8bit/FAQ-ISO-8859-1
Institute for Natural Language Processing at the University of Stuttgart
http://www.ims.uni-stuttgart.de/IMS.html
Lingsoft Corp Inc's demonstrations of their linguistic software
http://www.lingsoft.fi /cgi-pub/engcg (English parser)
http://www.lingsoft.fi /cgi-pub/engtwol (English morphology)
http://www.lingsoft.fi /cgi-pub/finhyp9 (Finnish hyphenation)
http://www.lingsoft.fi /cgi-pub/finstems (Finnish stems)
http://www.lingsoft.fi /cgi-pub/fintwol (Finnish morphology)
http://www.lingsoft.fi /cgi-pub/gertwol (German morphology)
http://www.lingsoft.fi /cgi-pub/swetwol (Swedish morphology)
LINGUIST list
http://www.ling.rochester.edu /linguist/contents.html
Send e-mail to
listserv@tamvm1.tamu.edu
with the following in the body of the message:
'subscribe linguist forename surname'
You will receive frequent bulletins on various linguistic subjects. To cancel, send the message 'unsubscribe linguist forename surname' to the same address.
Linguistic tools
clarity.princeton.edu /pub/
linc.cis.upenn.edu /pub/xtag/
speech.cse.ogi.edu /pub/tools/
Linguistics and MT document archive and e-print server
http://xxx.lanl.gov /cmp-lg/
Multilingual PC Directory
This book is a copious source of information about linguistics software, wordprocessors, fonts, suppliers' addresses, Internet resources, etc. (ISBN: 1-873091-03-5). There is also an electronic version in Windows Help File format, available for downloading from Compuserve's Foreign Language Forum (GO CIS:FLEFO) in the file MPCDIR.ZIP, and also from the site: http://knowledge.co.uk/xxx/, due on-line in May 1995. Contact: Knowledge Computing, 9 Ashdown Drive, Boreham Wood, Herts. WD6 4LZ, Tel: +44 (0)181-953 7722, Fax: +44(0)181-905 1879, E-Mail: 72240.3447@compuserve.com.
Natural language software list
ftp: ftphost.uni-koblenz.de /outgoing/software_list.ps.z
Natural Language Software Registry, Saarbrücken
http://cl-www.dfki.uni-sb.de /cl/registry/draft.html
For a descriptive document and questionnaire:
ftp: crlftp.nmsu.edu /pub/non-lexical/NL_Software_registry
ftp: dri.cornell.edu /pub/Natural_Language_Software_Registry or /pub/NLSR
NL-KR Digest (as published on Usenet newsgroup comp.ai.nlang-know-rep)
For subscriptions, send an e-mail request to:
nl-kr-request@ai.sunnyside.com
For submissions, questions etc., send:
nl-kr@ai.sunnyside.com
Back issues are available from the following:
ftp: ai.sunnyside.com /pub/nl-kr/Vnn/Nnn
(where Vnn = volume number, Nnn = issue number)
gopher: ai.sunnyside.com (Port 70) /pub/nl-kr
http://ai.sunnyside.com /pub/nl-kr
Software localisation
ftp: etext.archie.umich.edu /pub/Economics/FutureTalk/media-localising.txt.gz
http://gopher.gmu.edu /bcox/Economics/SoftwareLicensingPaper.html
Translators' Home Companion
http://www.rahul.net /lai/companion.html
Unicode Consortium FAQ
Full listings of ISO 639 and ISO CD 11639, and the use of ISO/IEC 6429 control functions to encode language:
http://www.stonehand.com /unicode.html
http://www.stonehand.com /unicode/standard/principles.html#x12
(deals with language tagging)
e-mail: unicode-inc@hq.metaphor.com
University of Virginia electronic text centre
http://www.lib.virginia.edu /etext.ETC.html
Usenet's 1000 most commonly used words and usage statistics
ftp: ftp.spies.com /Library/Article/Language/top1000.use
This site also contains further files of linguistic interest.
The above information has been collected from Usenet newsgroups, Internet searches, an FAQ edited by Dragomir Radev, and the Internet Golden Directory, 2nd edition. My thanks go to them all.