British Computer Society
Natural Language Translation
Specialist Group

URL: http://www.bcs-mt.org.uk/
WEB PAGE 9
Machine Translation Review
No. 1, April 1995   ISSN: 1358-8346
http://www.bcs-mt.org.uk/mtreview/1/9.htm

 
LINGUISTIC RESOURCES ON THE INTERNET

by Roger Harris
E-mail: rwsh@nationalfinder.com

General linguistic resources, and machine translation resources in particular, may be found in many parts of the Internet. Similar networks, such as CompuServe, also have linguistic sections (CompuServe: GO FLEFO, or GO MACCIMSUP, which uses Eng/Fre/Ger translation technology from Intergraph Corp.), but these are not directly available on Internet.
 
Access to much of Internet's data may be obtained with a simple computer, modem and communications program. My own 'antique' system includes an Amstrad 1640 connected to a 1200 baud modem. Where linguistic resources include high-quality screen graphics, colour, and sound, then obviously you will need a more advanced computer system to take advantage of these features.
 
You may gain access to Internet either via a terminal connected to a large academic or corporate computer system or via a stand-alone computer connected to the telephone system. Whatever the case, the distant host computer to which you will be connected will have a directory structure similar in some ways to that of PC-DOS as used in IBM PCs and their clones. If you are accustomed only to Windows or Apple Macintosh icon screens then you might find this a problem. However, navigation is very simple and only a few commands will be needed.
 
Your computer will be connected to your local host computer and will exchange data using a protocol such as Kermit, XMODEM or YMODEM. The distant host computer, for example, in Brazil, will automatically set up a communications link with your local host. A little knowledge of the Unix operating system might be useful but is not essential. Other than that, instead of accessing, say, drive C: or drive F: on your computer, you will be accessing a distant hard disk drive which is identified by name instead of by letter. It's also a bit slower. That's all.
 
Various computer commands are shown below enclosed in single quotation marks, for example, 'dir'. The quotation marks are not to be used when typing in a command. The control key is represented by ^, as in '^C' for control C. Each command is entered by pressing the Carriage Return or Enter key.
 
Help in the form of screens of commands and explanations is available almost everywhere by typing 'h' or 'help' or '?' at the prompt and pressing Enter; use data logging (see below) to build up your own file of help pages. Type 'cd' pathname to select a directory path. Type 'cd ..' to go up one directory level. Type 'cd' to go to the highest directory level. Type 'dir' to display the current directory, which will probably look something like this:

MULBERRY.SRV.CS.CMU.EDU:/usr0/anon         
drwxr-xr-x 2 root system  512  Aug 24  1994  misc     
-rw-r--r-- 1 root system 1158  Nov 23  1993  READ_ME  
drwxr-xr-x 2 root system  512  Feb 17  00:33 project  
drwxr-xr-x 6 3973 0      2048  Aug 24  1994  sys      
drwxr-xr-x 2 root system 1536  Feb 21  15:43 user

In the line above the body of the directory, the capitalised words refer to the full name of the computer being accessed (CMU = Carnegie Mellon University), while /usr0/anon is the directory path. In the body of the listing, in column one, if the first character is a 'd' (as in, for example, drwxr-xr-x), then the line refers to a directory which may in turn contain other directories and/or files. If the first character is a '-', then the line refers to a file. Letters after the first character refer to file security and access.
 
In the second line, READ_ME is a text file which can be downloaded to your host computer by typing 'get' READ_ME and pressing Enter. The number 1158 is the number of bytes in that file and Nov 23 1993 is the date when it was last updated. Type 'cd' project to go to the project directory.
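The reading of a directory line described above can be sketched in a few lines of Python. This is only an illustration: the function and field names are invented for the example, and it assumes that the columns are separated by spaces in the order shown in the sample listing.

```python
from typing import NamedTuple


class ListingEntry(NamedTuple):
    """Fields of interest from one directory line (names are hypothetical)."""
    is_directory: bool
    permissions: str
    size_bytes: int
    name: str


def parse_listing_line(line: str) -> ListingEntry:
    """Split one line of the sample directory listing into its fields."""
    fields = line.split()
    mode = fields[0]                 # e.g. 'drwxr-xr-x' or '-rw-r--r--'
    return ListingEntry(
        is_directory=mode.startswith("d"),  # 'd' marks a directory, '-' a file
        permissions=mode[1:],               # letters after the first character
        size_bytes=int(fields[4]),          # fifth column: size in bytes
        name=fields[-1],                    # last column: file or directory name
    )


entry = parse_listing_line("-rw-r--r-- 1 root system 1158  Nov 23  1993  READ_ME")
print(entry.name, entry.size_bytes, entry.is_directory)
```

Listings from other host computers may arrange their columns differently, so the column positions should be checked against what actually appears on the screen.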
 
Usually, one can type in the full directory path. If that is rejected, then type in the directory path names individually until you reach the desired directory. To examine the contents of a directory, type 'dir'; this lets you check whether the next directory's name differs from that in your reference source.
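The effect of the 'cd' commands on the directory path can be modelled as follows. This is a rough sketch which assumes Unix-style paths separated by '/'; the function name change_directory is invented for the illustration and nothing here talks to a real host computer.

```python
import posixpath


def change_directory(current: str, argument: str = "") -> str:
    """Return the path that results from typing 'cd' with the given argument."""
    if not argument:
        return "/"  # a bare 'cd' goes to the highest directory level
    # 'cd ..' goes up one level; any other name goes down into that directory
    return posixpath.normpath(posixpath.join(current, argument))


path = "/usr0/anon"
path = change_directory(path, "project")   # one directory name at a time
print(path)
path = change_directory(path, "..")        # back up one level
print(path)
```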
 
The distant computer could take a minute or more to respond, depending upon data traffic levels. Repeated, impatient pressing of a key will significantly slow down the response; each keystroke is stored and will be executed once the distant computer is free to deal with your instructions. It may have to execute the dozens or even hundreds of keystrokes which you might have made. This could take several minutes and the distant system might appear to have gone berserk. Sometimes it is best to hang up, type '^H', and log on again.
 
Be prepared to use your ingenuity in finding files and do not be surprised if the directory structure which you encounter does not exactly match the description obtained from a reference book or from this article. If you get lost or need help, then type 'h' or 'help' or '?' and press Enter. Internet is an evolving entity subject to continual change, revision, addition, and deletion.
 
 
Data logging
 
Any text which appears on your screen may be sent to a printer by pressing one key while holding down another (on most PCs, the Print Screen key together with Shift). The communications software will automatically store in a named file any text which is received from a distant source. This process may be called data logging. The software will store the text in a file whose suffix or extension name might be .LOG, or you can specify a filename. The data logging function must be invoked before the data appears on your computer screen.
 
A source of unique albeit cryptic filenames is based upon the date and time when you are about to store some data. For example, 7.33 pm on 16th February 1995 may be coded as 5=year (1995), 2=month, 16=day, 19=hour, 33=minutes, to give the filename of 52161933. Use A, B, and C to represent October, November, and December. Add the filename extension '.199' if you want to identify the decade. Such filenames will appear in date/time order in a sorted display of filenames.
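The filename scheme described above can be generated automatically. The following is a minimal sketch; the function name log_filename is invented for the illustration, and the padding of single-digit days, hours, and minutes with a leading zero is an assumption made here so that every name has eight characters.

```python
from datetime import datetime

# Months 1-9 are coded as digits; A, B, and C stand for October to December.
MONTH_CODES = "123456789ABC"


def log_filename(moment: datetime, with_decade: bool = False) -> str:
    """Build a date/time filename such as 52161933 for 16 Feb 1995, 7.33 pm."""
    year_digit = moment.year % 10
    month_code = MONTH_CODES[moment.month - 1]
    # Zero-padding is an assumption, not something stated in the article.
    name = f"{year_digit}{month_code}{moment.day:02d}{moment.hour:02d}{moment.minute:02d}"
    if with_decade:
        name += f".{moment.year // 10}"  # e.g. '.199' for the 1990s
    return name


print(log_filename(datetime(1995, 2, 16, 19, 33)))        # 52161933
print(log_filename(datetime(1995, 2, 16, 19, 33), True))  # 52161933.199
```

The eight-character name and three-character extension fit the filename limits of PC-DOS.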
 
 
Archie, Gopher and Veronica
 
These are programs which will perform keyword searches in order to locate files and programs; archie operates on ftp sites and veronica on gopher sites. Archie will only locate a source; veronica will locate a source and retrieve a file. A search allows Boolean parameters such as 'machine AND translation', 'machine OR translation', and 'translation NOT machine'.
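The meaning of the three Boolean parameters can be shown with a small sketch. It is not how archie or veronica work internally; the function name matches and the sample titles are invented purely for illustration, and only two-term queries are handled.

```python
def matches(text: str, query: str) -> bool:
    """Evaluate a two-term query: 'a AND b', 'a OR b', or 'a NOT b'."""
    left, operator, right = query.split()
    words = text.lower().split()
    has_left = left.lower() in words
    has_right = right.lower() in words
    if operator == "AND":
        return has_left and has_right
    if operator == "OR":
        return has_left or has_right
    if operator == "NOT":
        return has_left and not has_right
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical titles, used only to show the effect of each operator.
titles = ["machine translation review", "translation studies", "machine shop manual"]
for query in ("machine AND translation", "machine OR translation", "translation NOT machine"):
    print(query, "->", [title for title in titles if matches(title, query)])
```

AND narrows a search to items containing both words, OR widens it to items containing either, and NOT excludes items containing the second word.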
 
An archie search for 'linguist' will return addresses such as:

	coombs.anu.edu.au /coombspapers/coombsarchives/linguistics/
	csli.stanford.edu /linguistics/
	ftp.sunet.se /pub/mac/umich/misc/linguistics/
	ftp.univie.ac.at /systems/dos/simtel/linguist/
	julian.uwo.ca /doc/FAQ/greek-faq/linguistics
	knot.queensu.ca /pub/tcrunchers/Misc/linguist.list

Use the data logging function of your communications software to store the data as it is displayed.
 
 
USENET newsgroups
 
There are reputed to be some 7,000 to 8,000 newsgroups on Internet. Your Internet access provider may be connected to fewer than half that number. Some newsgroups are either empty or dormant. A newsgroup is composed of groups of comments called threads which in turn are composed of articles put there by posters (writers). You can post a reply directly to an article on the thread or to the poster. You can download articles or whole threads. A file in your computer may be uploaded and then posted to an existing newsgroup thread or to one which you can initiate. The following newsgroups are of linguistic interest:
alt.etext                     Electronic texts
comp.ai.nat-lang              Natural language processing by computer
comp.ai.nlang-know-rep        Natural language and knowledge representation
comp.software.international   Finding, using, and writing non-English applications
comp.speech                   Research and applications in speech recognition and production
comp.text.sgml                Structured documents markup languages
sci.lang                      Natural languages, communication
sci.lang.translation          Problems and concerns of translators

FAQs (Frequently Asked Questions) are extensive, detailed documents providing copious information about the subjects covered by a newsgroup. They are available at:
	ftp: rtfm.mit.edu /pub/usenet/newsgroup's name
where 'newsgroup's name' is, for example, 'sci.lang'.
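The substitution of a newsgroup's name into the FAQ location can be sketched as follows. The function name faq_location is invented for the illustration; the host and directory are those given above, and no check is made that a FAQ actually exists for a given newsgroup.

```python
def faq_location(newsgroup: str) -> str:
    """Return the FAQ location for a newsgroup, in this article's notation."""
    return f"ftp: rtfm.mit.edu /pub/usenet/{newsgroup}"


for group in ("sci.lang", "sci.lang.translation", "comp.ai.nat-lang"):
    print(faq_location(group))
```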
 
 
Linguistic resource sites
 
In the following section, file locations are shown as follows:
	protocol: computer_name /directory/directory/file 
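Software such as a Web browser usually expects the protocol, computer name, and path to be joined into a single address. The following sketch shows one way to convert the notation used in this article; the function name to_url is invented for the illustration, and it assumes that any trailing comment in brackets has been removed from the line first.

```python
def to_url(location: str) -> str:
    """Convert 'protocol: computer_name /path' into 'protocol://computer_name/path'."""
    protocol, rest = location.split(":", 1)
    # Drop any '//' that already follows the protocol, then split host from path.
    parts = rest.strip().lstrip("/").split(None, 1)
    host = parts[0]
    path = parts[1].strip() if len(parts) > 1 else ""
    return f"{protocol.strip()}://{host}{path}"


print(to_url("ftp: crl.nmsu.edu /CLR/catalog"))
print(to_url("http://www.iso.ch /welcome.html"))
```

The same function handles entries written with 'ftp:' followed by a space and entries written with 'http://'.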

Computer network addresses are often shown as four numbers in addition to the more usual letters, for example:
	 ftp.cmu.edu [128.2.206.173].

Either format should work. All directory names are case-sensitive. You first log on using the computer name and then select the directory. The following is a list of useful linguistic resource sites:
 
 
Alex Catalogue of Electronic Texts
  http://www.lib.ncsu.edu /stacks/alex-index.html
  gopher://rsl.ox.ac.uk 70/11/lib-corn/hunter
  gopher://gopher.lib.nmsu.edu /11/library/stacks/Alex

 
 
Association for Computational Linguistics (ACL)
	http://www.cs.columbia.edu /~acl
Send e-mail to
	listserv@cs.columbia.edu
with the following in the body of the message:
	index acl-1

 
Brown University linguistics page
	http://www.cog.brown.edu /pointers/linguistics.html
 
Colibri newsletter (language, linguistics, etc.)
	http://colibri.let.ruu.nl
 
comp.ai.nat-lang Usenet newsgroup
Dragomir Radev is the editor of comp.ai.nat-lang FAQ and the source of several items in this list.
	e-mail: radev@cs.columbia.edu
	http://www.cs.columbia.edu /~radev/home.html
 
Consortium for Lexical Research
	e-mail: lexical@nmsu.edu
	ftp: crl.nmsu.edu /CLR/catalog
 
Corpora, dictionaries, wordlists etc.
e-mail: ingrid.maier@slaviska.uu.se	   (Russian corpus)
e-mail: ldc@unagi.cis.upenn.edu	       (CELEX, LDC)
ftp: black.ox.ac.uk /wordlists/	       (word lists) 
ftp: ftp.cmu.edu /project/fgdata/dict/ (dictionaries)
ftp: ftp.cs.vu.nl /dictionaries	       (word lists)
ftp: ftp.funet.fi /pub/doc/dictionaries/ (word lists)
ftp: ftp.uu.net /doc/dictionaries/DEC-collection/ (dictionaries)
ftp: ftp.white.toronto.edu /pub/words/sodict.gz (Shorter Oxford)
ftp: gatekeeper.dec.com /pub/misc/stolfi-wordlists (word lists)
ftp: wocket.vantage.gte.com /pub/standard_dictionary (word lists)
http://olymp.fer.uni-lj.si /dictionary/a2s.html (Eng.-Slovene)
http://philae.sas.upenn.edu /French/french.html
http://solar.rtd.utk.edu /friends/cyrillic/cyrillic.html 
http://www.fmi.uni-passau.de /htbin/lt/lte  (Eng.-Ger. dictionary)

 
Echo Eurodicautom
Translates words between Dan/Dut/Fre/Ger/Ita/Por/Spa:
http://www.uni-frankfurt.de /~felix/eurodicautom.html

 
EITS (Experimental Internet Translation Service)
Launched in 1994, this service apparently offered translations between many of the world's known languages, including Pig Latin and Ubby Dubby. You can get the full hilarious details by sending e-mail to
	jens@panix.com
with the subject line as
	request-file eitsfaq.txt.

 
ELSNET (European Language and Speech Network)
	e-mail: elsnet-list@cogsci.ed.ac.uk
	http://www.cogsci.ed.ac.uk /elsnet/home.html

 
International Organization for Standardization (ISO), ISO Online
	http://www.iso.ch /welcome.html    (English version)
	http://www.iso.ch /welcomef.html   (French version)

 
ISO-8859-1 FAQ (International Organization for Standardization)
	ftp: ftp.vlsivie.tuwien.ac.at /pub/8bit/FAQ-ISO-8859-1

 
Institute for Natural Language Processing at the University of Stuttgart
	http://www.ims.uni-stuttgart.de/IMS.html

 
Lingsoft Corp Inc's demonstrations of their linguistic software
http://www.lingsoft.fi /cgi-pub/engcg		(English parser)
http://www.lingsoft.fi /cgi-pub/engtwol	(English morphology)
http://www.lingsoft.fi /cgi-pub/finhyp9	(Finnish hyphenation)
http://www.lingsoft.fi /cgi-pub/finstems	(Finnish stems)
http://www.lingsoft.fi /cgi-pub/fintwol	(Finnish morphology)
http://www.lingsoft.fi /cgi-pub/gertwol 	(German morphology)
http://www.lingsoft.fi /cgi-pub/swetwol	(Swedish morphology)

 
LINGUIST list
	http://www.ling.rochester.edu /linguist/contents.html

 
Send e-mail to
	listserv@tamvm1.tamu.edu
with the following in the body of the message:
	'subscribe linguist forename surname'

 
You will receive frequent bulletins on various linguistic subjects; a message reading 'unsubscribe linguist forename surname' will cancel the subscription.
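The subscription and cancellation messages can be composed with a short program. This is a sketch only: the function name listserv_request and the sample name are invented for the illustration, and the message is built but not sent.

```python
from email.message import EmailMessage


def listserv_request(action: str, forename: str, surname: str) -> EmailMessage:
    """Compose a 'subscribe' or 'unsubscribe' request for the LINGUIST list."""
    message = EmailMessage()
    message["To"] = "listserv@tamvm1.tamu.edu"
    # The command goes in the body of the message, without quotation marks.
    message.set_content(f"{action} linguist {forename} {surname}")
    return message


request = listserv_request("subscribe", "Jane", "Smith")  # placeholder name
print(request["To"])
print(request.get_content().strip())
```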
 
Linguistic tools
clarity.princeton.edu /pub/
linc.cis.upenn.edu /pub/xtag/                             
speech.cse.ogi.edu /pub/tools/

Linguistics and MT document archive and e-print server
	http://xxx.lanl.gov /cmp-lg/

Multilingual PC Directory
This book is a copious source of information about linguistics software, wordprocessors, fonts, suppliers' addresses, Internet resources, etc. (ISBN: 1-873091-03-5). There is also an electronic version in Windows Help File format, available for downloading from Compuserve's Foreign Language Forum (GO CIS:FLEFO) in the file MPCDIR.ZIP, and also from the site: http://knowledge.co.uk/xxx/, due on-line in May 1995. Contact: Knowledge Computing, 9 Ashdown Drive, Boreham Wood, Herts. WD6 4LZ, Tel: +44 (0)181-953 7722, Fax: +44(0)181-905 1879, E-Mail: 72240.3447@compuserve.com.
 
Natural language software list
	ftp: ftphost.uni-koblenz.de  /outgoing/software_list.ps.z

Natural Language Software Registry, Saarbrücken
	http://cl-www.dfki.uni-sb.de /cl/registry/draft.html

For a descriptive document and questionnaire:
ftp: crlftp.nmsu.edu /pub/non-lexical/NL_Software_registry
ftp: dri.cornell.edu /pub/Natural_Language_Software_Registry or /pub/NLSR

 
NL-KR Digest (as published on Usenet newsgroup comp.ai.nlang-know-rep)
For subscriptions, send an e-mail request to:
	 nl-kr-request@ai.sunnyside.com

 
For submissions, questions etc., send e-mail to:
nl-kr@ai.sunnyside.com

 
Back issues are available from the following:
	ftp: ai.sunnyside.com /pub/nl-kr/Vnn/Nnn 
		(where Vnn = volume number, Nnn = issue number)
	gopher: ai.sunnyside.com (Port 70) /pub/nl-kr
	http://ai.sunnyside.com /pub/nl-kr

 
Software localisation
ftp: etext.archie.umich.edu /pub/Economics/FutureTalk/media-localising.txt.gz
http://gopher.gmu.edu /bcox/Economics/SoftwareLicensingPaper.html

 
Translators' Home Companion
	http://www.rahul.net /lai/companion.html

Unicode Consortium FAQ
Full listings of ISO 639 and ISO CD 11639, and the use of ISO/IEC 6429 control functions to encode language:
http://www.stonehand.com /unicode.html
http://www.stonehand.com /unicode/standard/principles.html#x12
	(deals with language tagging)
e-mail: unicode-inc@hq.metaphor.com

University of Virginia electronic text centre
	http://www.lib.virginia.edu /etext/ETC.html

Usenet's 1000 most commonly used words and usage statistics
	ftp: ftp.spies.com /Library/Article/Language/top1000.use

This site also contains further files of linguistic interest.

The above information has been collected from Usenet newsgroups, Internet searches, an FAQ edited by Dragomir Radev, and the Internet Golden Directory, 2nd edition. My thanks go to them all.