ELFJ Corpus

What is the ELFJ Corpus?

The ELFJ (English as a Lingua Franca in Japan) Corpus consists of 28,956 words of written transcripts and 6 hours and 39 minutes of audio recordings generated from a collection of 19 naturally occurring conversations. The conversations were conducted online via Zoom between 18 Japanese English language learners and 18 interlocutors of various nationalities. The data was collected by five researchers from a university in Japan in 2019 as part of a research project supported by JSPS KAKENHI Grant Number JP18K00753. The project was approved by the ethics committee of the researchers’ institution prior to its commencement.

Who is the ELFJ Corpus intended for?

The corpus was created for researchers who have an academic interest in studying various aspects of naturally occurring spoken communication in an ELF context, particularly for those whose research focus is on low-proficiency English learners in Japan.

Who are the participants in the ELFJ Corpus?

Eighteen Japanese English speakers (JES), all university English learners in Japan, and eighteen non-Japanese English speakers (NJES), foreign interlocutors representing eight nationalities, participated in the research project. Participation was voluntary, and all the participants gave written consent, prior to the recordings, for their conversation(s) to be used in an audio format for the corpus. Four of the participants (i.e., NJES3/NJES4, NJES7/NJES8, JES8/JES15, JES17/JES19) each participated in the project on two separate occasions.

How were the conversations contained in the ELFJ Corpus conducted?

All the conversations were recorded in a Japanese university setting. Each conversation involved two participants (i.e., one JES and one NJES). The conversations were conducted online via Zoom. None of the paired participants had met (or had any contact) prior to the recording of their conversations. All the JES participants were physically at the university for the recording of the conversations. All but one of the NJES participants were outside Japan at the time of the recordings. The pairing of participants was based solely on their availability.

What is the nature of the conversations contained in the ELFJ Corpus?

Before each of the recordings, the participants were asked to have a conversation on any topic they wished for approximately 20 minutes. Only JES participants received an optional list of eight topical questions (e.g., What time is it there now?) and six general themes/topics (e.g., Culture/Customs) to help prompt conversation if they wished. This list, however, was rarely referred to by the JES participants.

Why doesn’t the written transcript always match the audio track word-for-word?

The conversations were manually transcribed by a professional transcription service provider in Japan. The level of transcription provided (素起こし) did not guarantee a strict verbatim textual representation of the audio data. Therefore, the final version of written text that is presented in the corpus may differ from its accompanying audio track to some extent.

Why is the timing of the captions sometimes not precise?

Automatic caption alignment software was used to transform the written transcripts into captions that were timed to the audio track. However, due to the nature of the conversations (non-native interlocutors; overlapping utterances; long pauses; omitted information) the timing is not always precise.

Why are some words/sections missing from the written and audio texts?

Some individual words have been omitted from the written and audio texts by the researchers to ensure the privacy of the participants. Additionally, the beginning section (i.e., greetings/introductions) and/or other extended portions of some conversations have been omitted for the same purpose.

Did the researchers have any involvement during the recording of conversations?

At least one of the researchers was always present during the recording of each conversation for supervision purposes. In a few instances, JES participants sought the aid of a researcher, either non-verbally (e.g., through eye contact) or verbally, when they encountered a problem (e.g., difficulty understanding their foreign interlocutor or an unstable Internet connection). When this occurred, the researcher(s) made every effort to keep their involvement to a minimum.

Why does the ngram count sometimes differ from the number of ngrams shown in the search results?

The ngram count is an accurate count of the number of occurrences of a specific ngram contained in the corpus transcripts. However, where an ngram spans multiple captions, we are not able to show the precise location of that ngram in the search results.
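The difference can be illustrated with a minimal sketch. It assumes whitespace tokenization and a made-up list of captions; the function names and data are hypothetical and do not describe how the ELFJ Corpus platform is actually implemented. Counting over the whole transcript finds every occurrence, whereas a search that can only point to a single caption misses occurrences that straddle a caption boundary.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: simple lowercase, whitespace-based tokenization.
    return text.lower().split()


def count_ngram(tokens: List[str], ngram: List[str]) -> int:
    # Count positions where the token window equals the ngram.
    n = len(ngram)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == ngram)


def count_in_transcript(captions: List[str], phrase: str) -> int:
    # Treat all captions as one continuous token stream,
    # so ngrams that cross caption boundaries are counted.
    tokens = [tok for caption in captions for tok in tokenize(caption)]
    return count_ngram(tokens, tokenize(phrase))


def count_within_captions(captions: List[str], phrase: str) -> int:
    # Count only ngrams that fall entirely inside a single caption,
    # i.e. those whose location can be shown precisely.
    ngram = tokenize(phrase)
    return sum(count_ngram(tokenize(caption), ngram) for caption in captions)


# Hypothetical captions: the first "time is it" spans captions 1 and 2.
captions = ["so what time is", "it there now", "what time is it here"]

total = count_in_transcript(captions, "time is it")
locatable = count_within_captions(captions, "time is it")

print(total)      # 2
print(locatable)  # 1
```

In this made-up example the ngram count is 2, but only one occurrence lies wholly within a single caption, so only one could be located in a caption-based result list.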

Why are annotations not provided in the transcripts?

As with any project, the creators of the ELFJ Corpus were bound by their own set of budgetary and logistical limitations. Moreover, subscribing to a single annotation convention would not ensure that the needs of every potential user are met. While we expect the needs of users to vary widely, the project team recognizes that the absence of annotations could be a prohibitive factor for some users. Nevertheless, because each transcript is accompanied by full-length audio, it is hoped that users for whom annotated text is a necessity (e.g., for academic presentations or publications) will add their own annotations to excerpts from the transcripts, based on the convention most appropriate for their individual needs.

What does the annotation "(xxxx)" in the transcripts mean?

This annotation represents words that were unknown or unrecognizable to the original transcriber. The number of x's contained in this annotation (i.e., "xxxx") is standardized in every case and, therefore, is neither an indication nor an approximation of the number of syllables contained in the words it represents.

How will my sign-in data be stored and used?

The Google email address with which users must sign in to access the corpus is not registered or stored in the database. Only the Google account ID number and user name are registered. Users' "country of residence", "affiliation" (optional), and "position" (optional) are also registered and may be used in analytical data. Users' "country of residence" information may be used by the project team for statistical purposes in academic presentations and publications.

How to cite ELFJ Corpus 1.0 Online

Recommended Citation

ELFJ Corpus. 2023. The English as a Lingua Franca in Japan Corpus (version ELFJ Corpus 1.0 Online). Principal investigator ELFJ Corpus 1.0: Blagoja Dimoski; Co-investigators: Satomi Kuroshima, Yuri Jody Yujobo, Tricia Okada, Rasami Chaikul. https://elfj-corpus.com (date of last access).

Short Citation

ELFJ Corpus. 2023. The English as a Lingua Franca in Japan Corpus (version ELFJ Corpus 1.0 Online). https://elfj-corpus.com (date of last access).


The researchers are sincerely grateful to all the participants for their valuable contributions to the project, and to Paul Raine for the time he invested, the technical expertise he applied, and his unwavering commitment to the development of the ELFJ Corpus online platform.