Published in Proceedings of the 4th Workshop on Argument Mining (ArgMining2017), 2017
We propose a method for the annotation of Japanese civil judgment documents, with the purpose of creating flexible summaries of these. The first step, described in the current paper, concerns content selection, i.e., the question of which material should be extracted initially for the summary. In particular, we utilize the hierarchical argument structure of the judgment documents. Our main contributions are a) the design of an annotation scheme that stresses the connection between legal issues (called issue topics) and argument structure, b) an adaptation of rhetorical status to suit the Japanese legal system and c) the definition of a linked argument structure based on legal sub-arguments. In this paper, we report agreement between two annotators on several aspects of the overall task.
Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2017. Annotation of argument structure in Japanese legal documents. In Proceedings of the 4th Workshop on Argument Mining (ArgMining2017). pages 22-31. http://www.aclweb.org/anthology/W17-5103
Published in Proceedings of the 9th International Conference on Knowledge and Systems Engineering (KSE 2017), 2017
We propose an annotation scheme for the summarization of Japanese judgment documents. This paper reports the details of the development of our annotation scheme for this task. We also conduct a human study in which we compare the annotations of independent annotators. The end goal of our work is summarization, and our categories and the link system are a consequence of this. We propose three types of generic summaries which are focused on specific legal issues relevant to a given legal case.
Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2017. Designing an annotation scheme for summarizing Japanese judgment documents. In Proceedings of the 9th International Conference on Knowledge and Systems Engineering (KSE 2017). pages 275-280. http://ieeexplore.ieee.org/document/8119471/
Published in The 41st annual Language Testing Research Colloquium, 2019
Recommended citation: Sawaki, Y., Ishii, Y., & Yamada, H. (2019). Japanese university students’ paraphrasing strategies in L2 summary writing. The 41st annual Language Testing Research Colloquium.
Published in The 6th Competition on Legal Information Extraction/Entailment (COLIEE-2019), 2019
Deep learning based approaches have achieved significant advances in various Natural Language Processing (NLP) tasks. However, such approaches have not yet been evaluated in the legal domain as thoroughly as in other domains such as news articles and colloquial texts. Since creating annotated data in the legal domain is expensive, applying deep learning models to the domain has been challenging. A fine-tuning approach can alleviate the situation; it allows a model trained with a large out-of-domain data set to be retrained on a smaller in-domain data set. A fine-tunable language model, “BERT”, was proposed and achieved state-of-the-art results in various NLP tasks. In this paper, we explored the fine-tuning based approach in the legal textual entailment task using the COLIEE task 2 data set. The experimental results show that the fine-tuning approach improves performance, achieving F1 = 0.50 with the COLIEE task 2 dry run data.
Recommended citation: Hiroaki Yamada and Takenobu Tokunaga. 2019. A performance study on fine-tuned large language models in the Legal Case Entailment task. In Proceedings of the 6th Competition on Legal Information Extraction/Entailment (COLIEE-2019). https://www.cl.c.titech.ac.jp/tokunaga/_media/publication/yamada_2019aa.pdf
Published in Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2019, 2019
This paper discusses the computer-assisted content evaluation of summaries. We propose a method to establish a correspondence between the segments of the source text and its summary. As the unit of segmentation, we adopt the “Idea Unit (IU)”, which was proposed in Applied Linguistics. Introducing IUs enables us to establish a correspondence even for sentences that contain multiple ideas. The IU correspondence is based on the similarity between vector representations of IUs. An evaluation experiment with two source texts and 20 summaries showed that the proposed method is more robust against rephrased expressions than the conventional ROUGE-based baselines. The proposed method also outperformed the baselines in recall. We implemented the proposed method in a GUI tool, “Segment Matcher”, that helps teachers establish links between corresponding IUs across the summary and source text.
Recommended citation: Marcello Gecchele, Hiroaki Yamada, Takenobu Tokunaga and Yasuyo Sawaki. 2019. Supporting content evaluation of student summaries by Idea Unit embedding. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA@ACL2019). https://www.aclweb.org/anthology/W19-4436
Published in Legal Knowledge and Information Systems - JURIX 2019: The Thirty-second Annual Conference, 2019
We address the legal text understanding task, and in particular we treat Japanese judgment documents in civil law. Rhetorical status classification (RSC) is the task of classifying sentences according to the rhetorical functions they fulfil; it is an important preprocessing step for our overall goal of legal summarisation. We present several improvements over our previous RSC classifier, which was based on CRF. The first is a BiLSTM-CRF based model which improves performance significantly over previous baselines. The BiLSTM-CRF architecture is able to additionally take the context in terms of neighbouring sentences into account. The second improvement is the inclusion of section heading information, which resulted in the overall best classifier. Explicit structure in the text, such as headings, is an information source which is likely to be important to legal professionals during the reading phase; this makes the automatic exploitation of such information attractive. We also considerably extended the size of our annotated corpus of judgment documents.
Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2019. Neural network based Rhetorical status classification for Japanese judgement documents. In The proceedings of the 32nd International Conference on Legal Knowledge and Information Systems (JURIX 2019). pages 133–142. https://doi.org/10.3233/FAIA190314
Published in The 2022 International Conference on Language Resources and Evaluation (LREC2022), 2022
This paper describes a comprehensive annotation study on Japanese judgment documents in civil cases. We aim to build an annotated corpus designed for Legal Judgment Prediction (LJP), especially for torts. Our annotation scheme records whether a tort is accepted by the judges, together with the corresponding rationales for explainability purposes. The scheme extracts decisions and rationales at the character level. Moreover, it can capture the explicit causal relation between the judges’ decisions and their corresponding rationales, allowing multiple decisions in a document. To obtain high-quality annotation, we developed the annotation scheme with legal experts and confirmed its reliability through agreement studies using Krippendorff’s alpha. The result of the annotation study suggests that the proposed annotation scheme can produce a Japanese LJP dataset with reasonable reliability.
Recommended citation: Hiroaki Yamada, Takenobu Tokunaga, Ryutaro Ohara, Keisuke Takeshita, and Mihoko Sumida. 2022. Annotation Study of Japanese Judgments on Tort for Legal Judgment Prediction with Rationales. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 779–790, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.83
Published in The 2022 International Conference on Language Resources and Evaluation (LREC2022), 2022
In this paper, we approach summary evaluation from an applied linguistics (AL) point of view. We provide computational tools to AL researchers to simplify the process of Idea Unit (IU) segmentation. The IU is a segmentation unit that can identify chunks of information. These chunks can be compared across documents to measure the content overlap between a summary and its source text. We propose a full revision of the annotation guidelines to allow machine implementation. The new guidelines also improve inter-annotator agreement, which rises from 0.547 to 0.785 (Cohen’s Kappa). We release L2WS 2021, an IU gold standard corpus composed of 40 manually annotated student summaries. We propose IUExtract, the first automatic segmentation algorithm based on the IU. The algorithm was tested on the L2WS 2021 corpus. Our results are promising, achieving a precision of 0.789 and a recall of 0.844. We tested an existing approach to IU alignment via word embeddings with the state-of-the-art model SBERT. The recorded precision for the top 1 aligned pair of IUs was 0.375. We deemed this result insufficient for effective automatic alignment. We propose “SAT”, an online tool to facilitate the collection of alignment gold standards for future training.
Recommended citation: Marcello Gecchele, Hiroaki Yamada, Takenobu Tokunaga, Yasuyo Sawaki, and Mika Ishizuka. 2022. Automating Idea Unit Segmentation and Alignment for Assessing Reading Comprehension via Summary Protocol Analysis. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4663–4673, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.498
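The agreement figures in the abstract above are Cohen's Kappa values. As a reminder of what that statistic measures, the following is a minimal sketch of the standard two-annotator formula. It is not the paper's evaluation code, and the function name and toy labels are purely illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: 1 marks a segment boundary, 0 marks no boundary.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over plain percentage agreement when one label dominates.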
Published in Findings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2022), 2022
This paper investigates pretrained language models (PLMs) specialised in the Japanese legal domain. We create PLMs using different pretraining strategies and investigate their performance across multiple domains. Our findings are (i) the PLM built with general domain data can be improved by further pretraining with domain-specific data, (ii) domain-specific PLMs can learn domain-specific and general word meanings simultaneously and can distinguish them, (iii) domain-specific PLMs work better on their target domains; still, the PLMs retain the information learnt in the original PLM even after being further pretrained with domain-specific data, (iv) the PLMs sequentially pretrained with corpora of different domains show high performance for the later learnt domains.
Recommended citation: Keisuke Miyazaki, Hiroaki Yamada and Takenobu Tokunaga. 2022. Cross-domain Analysis on Japanese Legal Pretrained Language Models. In Findings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, pages 274–281, Online. https://aclanthology.org/2022.findings-aacl.26
Published in Legal Knowledge and Information Systems - JURIX 2023: The Thirty-sixth Annual Conference, 2023
With the increasing demand for summaries of Japanese judgment documents, large language models (LLMs) are expected to generate high-quality summaries automatically. We propose a method that selects exemplars by nearest neighbor search for one-shot learning. Experiments showed that our method outperforms the baseline methods.
Recommended citation: Akito Shimbo, Yuta Sugawara, Hiroaki Yamada and Takenobu Tokunaga. 2023. Nearest Neighbor Search for Summarization of Japanese Judgment Documents. Legal Knowledge and Information Systems - JURIX 2023: The Thirty-sixth Annual Conference, pages 225–340, Maastricht, The Netherlands. https://doi.org/10.3233/FAIA230984
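The abstract above describes choosing a one-shot exemplar by nearest neighbor search. The sketch below illustrates the general idea only. Cosine similarity, the pool layout, the toy vectors and the prompt template are all assumptions made for illustration, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_exemplar(query_vec, pool):
    """Return the pool entry whose document vector is closest to the query.

    pool is a list of (vector, document, summary) tuples; the vectors are
    assumed to come from some document encoder (hypothetical here).
    """
    return max(pool, key=lambda item: cosine(query_vec, item[0]))


def build_one_shot_prompt(query_doc, exemplar):
    """Put the selected document/summary pair in front of the new document."""
    _, doc, summary = exemplar
    return (
        f"Document:\n{doc}\nSummary:\n{summary}\n\n"
        f"Document:\n{query_doc}\nSummary:\n"
    )


# Toy pool with two-dimensional vectors standing in for real embeddings.
pool = [
    ([1.0, 0.0], "judgment A", "summary of A"),
    ([0.0, 1.0], "judgment B", "summary of B"),
]
best = select_exemplar([0.9, 0.1], pool)
print(build_one_shot_prompt("new judgment", best))
```

Selecting the exemplar per input, rather than fixing one for all inputs, lets the prompt show the model a summary of a similar document.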
Published in The 16th International Conference on Computer Supported Education, 2024
This paper introduces our ongoing research project that aims to generate multiple-choice questions for the Japanese National Nursing Examination using large language models (LLMs). We report the progress and prospects of our project. A preliminary experiment assessing the LLMs’ potential for question generation in the nursing domain led us to focus on distractor generation, which is a difficult part of the entire question-generation process. Therefore, our problem is generating distractors given a question stem and key (correct choice). We prepare a question dataset from past National Nursing Examinations for the training and evaluation of LLMs. The generated distractors are evaluated by comparison with the reference distractors in the test set. We propose reference-based evaluation metrics for distractor generation by extending recall and precision, which are popular in information retrieval. However, as the reference is not the only acceptable answer, we also conduct human evaluation. We evaluate four LLMs: GPT-4 with few-shot learning, ChatGPT with few-shot learning, ChatGPT with fine-tuning and JSLM with fine-tuning. Our future plans include improving the LLMs’ performance by integrating question writing guidelines into the prompts and conducting a large-scale administration of automatically generated questions.
Recommended citation: Yusei Kido, Hiroaki Yamada, Takenobu Tokunaga, Rika Kimura, Yuriko Miura, Yumi Sakyo and Naoko Hayashi. 2024. Automatic Question Generation for the Japanese National Nursing Examination Using Large Language Models. In Proceedings of the 16th International Conference on Computer Supported Education (CSEDU 2024), pages 821-829. https://doi.org/10.5220/0012729200003693
Published in The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024
Interpretation methods provide saliency scores indicating the importance of input words for neural summarization models. Prior work has analyzed models by comparing them to human behavior, often using eye-gaze as a proxy for human attention in reading tasks such as classification. This paper presents a framework to analyze the model behavior in summarization by comparing it to human summarization behavior using eye-gaze data. We examine two research questions: RQ1) whether model saliency conforms to human gaze during summarization and RQ2) how model saliency and human gaze affect summarization performance. For RQ1, we measure conformity by calculating the correlation between model saliency and human fixation counts. For RQ2, we conduct ablation experiments removing words/sentences considered important by models or humans. Experiments on two datasets with human eye-gaze during summarization partially confirm that model saliency aligns with human gaze (RQ1). However, ablation experiments show that removing highly-attended words/sentences from the human gaze does not significantly degrade performance compared with the removal by the model saliency (RQ2).
Recommended citation: Fariz Ikhwantri, Hiroaki Yamada, and Takenobu Tokunaga. 2024. Analyzing Interpretability of Summarization Model with Eye-gaze Information. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 939–950, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.lrec-main.84
We present an annotation scheme describing the argument structure of judgement documents, a central construct in Japanese law. To support the final goal of this work, namely summarisation aimed at the legal professions, we have designed blueprint models of summaries of various granularities, and our annotation model in turn is fitted around the information needed for the summaries. In this paper we report results of a manual annotation study, showing that the annotation is stable. The annotated corpus we created contains 89 documents (37,673 sentences; 2,528,604 characters). We also designed and implemented the first two stages of an algorithm for the automatic extraction of argument structure, and present evaluation results.
Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2019. Building a Corpus of Legal Argumentation in Japanese Judgement Documents: Towards Structure-Based Summarisation. Artificial Intelligence and Law. Springer Netherlands, 27(2):141–170. https://doi.org/10.1007/s10506-019-09242-3
This paper provides the first broad overview of the relation between different interpretation methods and human eye-movement behaviour across different tasks and architectures. The interpretation methods of neural networks provide the information the machine considers important, while the human eye-gaze has been believed to be a proxy of the human cognitive process. Thus, comparing them explains machine behaviour in terms of human behaviour, leading to improvement in machine performance through minimising their difference. We consider three types of natural language processing (NLP) tasks: sentiment analysis, relation classification and question answering, and four interpretation methods based on: simple gradient, integrated gradient, input-perturbation and attention, and three architectures: LSTM, CNN and Transformer. We leverage two corpora annotated with eye-gaze information: the Zuco dataset and the MQA-RC dataset. This research sets up two research questions. First, we investigate whether the saliency (importance) of input words conforms with that derived from human eye-gaze features. To this end, we compute a saliency distance (SD) between input words (by an interpretation method) and an eye-gaze feature. SD is defined as the KL-divergence between the saliency distribution over input words and an eye-gaze feature. We found that the SD scores vary depending on the combinations of tasks, interpretation methods and architectures. Second, we investigate whether the models with good saliency conformity to human eye-gaze behaviour have better prediction performances. To this end, we propose a novel evaluation device called “SD-performance curve” (SDPC) which represents the cumulative model performance against the SD scores. SDPC enables us to analyse the underlying phenomena that were overlooked using only the macroscopic metrics, such as average SD scores and rank correlations, that are typically used in past studies.
We observe that the impact of good saliency conformity between humans and machines on task performance varies among the combinations of tasks, interpretation methods and architectures. Our findings should be considered when introducing eye-gaze information for model training to improve the model performance.
Recommended citation: Fariz Ikhwantri, Jan Wira Gotama Putra, Hiroaki Yamada, Takenobu Tokunaga. 2023. Looking deep in the eyes: Investigating interpretation methods for neural models on reading tasks using human eye-movement behaviour. Information Processing & Management. Elsevier Ltd, 60(2023) 103195. https://doi.org/10.1016/j.ipm.2022.103195
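The abstract above defines the saliency distance (SD) as a KL-divergence between a saliency distribution over input words and an eye-gaze feature. The following is a minimal sketch of such a computation. The direction of the divergence (gaze relative to saliency), the clipping of negative scores and the small smoothing constant are assumptions made here, since the abstract does not specify them.

```python
import math


def to_distribution(scores, eps=1e-9):
    """Normalise per-word scores into a probability distribution.

    Negative scores are clipped to zero and eps is added so that no word
    has zero probability (both choices are assumptions of this sketch).
    """
    shifted = [max(s, 0.0) + eps for s in scores]
    total = sum(shifted)
    return [s / total for s in shifted]


def saliency_distance(saliency, gaze):
    """KL(P_gaze || P_saliency) over the words of one input (direction assumed)."""
    if len(saliency) != len(gaze):
        raise ValueError("need one saliency score and one gaze value per word")
    p = to_distribution(gaze)
    q = to_distribution(saliency)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy example: fixation counts vs. saliency scores for a four-word input.
print(saliency_distance(saliency=[0.1, 0.6, 0.2, 0.1], gaze=[1, 5, 2, 1]))
```

A value near zero means the model's saliency profile closely follows the gaze profile for that input; larger values indicate stronger disagreement.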
This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), which features two tasks: tort prediction and its rationale extraction. The rationale extraction task identifies the arguments accepted by the court from among those alleged by plaintiffs and defendants, which is a novel task in the field. JTD is constructed from 3,477 Japanese Civil Code judgments annotated by 41 legal experts, resulting in 7,978 instances with 59,697 of their alleged arguments from the involved parties. Our baseline experiments show the feasibility of the proposed two tasks, and our error analysis by legal experts identifies sources of errors and suggests future directions of the LJP research.
Recommended citation: Hiroaki Yamada, Takenobu Tokunaga, Ryutaro Ohara, Akira Tokutsu, Keisuke Takeshita, and Mihoko Sumida. 2024. Japanese tort-case dataset for rationale-supported legal judgment prediction. Artificial Intelligence and Law. https://doi.org/10.1007/s10506-024-09402-0
The present paper provides an overview of an online module for formative assessment of summary writing skills for second language (L2) introductory academic writing instruction in Japan and presents initial empirical results on how Japanese undergraduate students’ summary writing performance changed with a series of automated summary content feedback delivered in the module. A key feature of this module was the provision of fine-grained feedback delivered as scaffolding during revisions in terms of two key aspects of summary content: main idea representation and paraphrasing. Participants were 64 Japanese undergraduate engineering majors in introductory academic writing courses at a private university in Tokyo. The students completed two summary writing tasks provided through the online module. Results of a multivariate analysis of variance showed significant improvement of the content analytic score on revision on the initial summary task, and that this improved performance level was retained on a transfer task. The language use analytic score also improved significantly on the transfer task. Detailed analyses of learner-produced summaries based on descriptive statistics further suggested that the learners made substantively meaningful changes concerning main idea coverage and verbatim copying of the source text while still meeting the length requirement, although the results differed somewhat across the source texts assigned. Despite some study limitations, these results provide initial support for immediate content feedback provision for the development of basic summary writing skills.
Recommended citation: Yasuyo Sawaki, Yutaka Ishii, Hiroaki Yamada, Takenobu Tokunaga. Developing and validating an online module for formative assessment of summary writing with automated content feedback for EFL academic writing instruction, Language Testing in Asia 14, 50 (2024). https://doi.org/10.1186/s40468-024-00325-w
We are accumulating electronic files of English compositions by university freshmen. For the past ten years or so, about 50 to 90 students enrolled in three English classes taught by the last author have been submitting 15 essays per year, each in three different versions. The first is what the students come up with in half an hour or so in class after engaging in what we call "oral response practice," in which the students in a group of three take turns reading a question card aloud, responding to the question and video-recording the interaction. Sets of 10 question cards around one topic are prepared by the teacher and distributed to the groups along with a video camera. After class, students spend some time "completing" their compositions, submit them during the next class and review other students' essays within groups of six. The students are then asked to revise the essays and submit the final version during the following class. In addition to the students' peer review comments, it would be desirable for the students to receive feedback based on statistical analysis of parsing and other processing of their own essays, but the submitted files have to go through some pre-processing for the parser and analyzers to work properly. In this presentation, we report on our preliminary study and experimentation.
The goal of foreign language education and/or learning is the attainment of proficiency in the target language, and learners should not only acquire knowledge of vocabulary, expressions and grammar but also achieve automatization of mental processing of that language. To what extent are English language education and learning in Japan achieving these goals?
Legal judgment prediction (LJP) is the task of predicting the outcome of a court case based on input facts. Predicting legal judgments can help not only legal professionals but also members of the general public who are not legal specialists. An LJP system allows everyone to predict and foresee the outcome of litigation when involved in legal disputes. This article provides a simple introduction to artificial intelligence and natural language processing research in the field of LJP, reviews the recent advances in legal judgment prediction and related topics, and discusses the challenges and possible directions for developing a smarter and more trustworthy LJP system.
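Emotion Recognition in Conversations (ERC), the task of recognizing the emotion of each utterance in a dialogue, has been attracting attention for purposes such as sentiment analysis on social media and building emotional and empathetic dialogue systems. In ERC, it is known that not only the content of an utterance but also the relations between utterances strongly affect the speaker's emotion. Many previous methods extract the relations between utterances and have achieved high recognition performance. Such methods often perform well on their own, but further improvement can be expected from combining models with different characteristics. This study proposes a method that combines the probability distribution over emotion labels output by a model that performs well on its own with a probability distribution built from nearest-neighbor examples retrieved using another model with different characteristics. In evaluation experiments, the proposed method outperformed the recognition rate of the base model alone on two of three ERC benchmark datasets. In addition, a permutation test showed that the improvement of the proposed method over the base model alone is statistically significant.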
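The abstract above combines a base model's probability distribution over emotion labels with a distribution built from retrieved nearest-neighbor examples. The sketch below shows one common way to do this, namely linear interpolation with distance-weighted neighbor votes. The interpolation weight `lam`, the exponential distance weighting and the toy labels are assumptions of this sketch rather than details taken from the abstract.

```python
import math
from collections import defaultdict


def knn_distribution(neighbours, labels, temperature=1.0):
    """Build a label distribution from retrieved (label, distance) pairs.

    Closer neighbours get exponentially larger weight (an assumed weighting).
    """
    if not neighbours:
        raise ValueError("at least one retrieved neighbour is required")
    weights = defaultdict(float)
    for label, distance in neighbours:
        weights[label] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {label: weights[label] / total for label in labels}


def combine(base_probs, knn_probs, lam=0.5):
    """Linearly interpolate the base model distribution with the kNN distribution."""
    return {
        label: (1.0 - lam) * base_probs[label] + lam * knn_probs[label]
        for label in base_probs
    }


labels = ["joy", "anger"]
base = {"joy": 0.5, "anger": 0.5}  # base model is undecided
retrieved = [("joy", 0.0), ("joy", 0.0), ("anger", 0.0)]  # toy retrieval result
final = combine(base, knn_distribution(retrieved, labels), lam=0.5)
print(max(final, key=final.get))
```

Because both inputs are probability distributions and the weights sum to one, the interpolated result is again a valid distribution.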
This paper reports MONETECH's participation in FinArg-1's Argument Unit Identification in Earnings Conference Call subtask. Our experiments are based on the BERT and FinBERT models with additional experimentation on Large Language Model-based data augmentation, data filtering, and the model's layer freezing. Our best-performing submission, which is based on data filtering and the model's layer freezing, scores 75.54% in micro F1 evaluation. Results from additional runs also show that the model's layer freezing and data filtering could further improve model performance beyond our best submission.
Recommended citation: Supawich Jiarakul, Hiroaki Yamada and Takenobu Tokunaga. MONETECH at the NTCIR-17 FinArg-1 Task: Layer Freezing, Data Augmentation, and Data Filtering for Argument Unit Identification. The 17th NTCIR Conference Evaluation of Information Access Technologies, 2023/12/12. https://doi.org/10.20736/0002001314
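Building a model that generates candidate human-like, natural interpretations (e.g., "a bright smile") for simile expressions (e.g., "a smile like a sunflower") is one of the tasks attracting attention in the field of natural language processing. In this study, we generate interpretations of simile expressions using the pretrained masked language model BERT. We also propose an extension of the masked language model (MLM) suited to adjective completion and a training framework that focuses on adjective-noun modification relations. With the proposed method, Recall@5, the score used for simile interpretation, reached 0.296, exceeding the other methods compared.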
In this paper, we explore the application of Generative Pre-trained Transformers (GPTs) in cross-lingual legal Question-Answering (QA) systems using the COLIEE Task 4 dataset. In the COLIEE Task 4, given a statement and a set of related legal articles that serve as context, the objective is to determine whether the statement is legally valid, i.e., if it can be inferred from the provided contextual articles or not, which is also known as an entailment task. By benchmarking four different combinations of English and Japanese prompts and data, we provide valuable insights into GPTs’ performance in multilingual legal QA scenarios, contributing to the development of more efficient and accurate cross-lingual QA solutions in the legal domain.