Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Page Not Found

Page not found. Your pixels are in another canvas.

Posts

Website open!

less than 1 minute read

Published: February 06, 2018

Now my own website is open!

publications

Annotation of argument structure in Japanese legal documents

Published in Proceedings of the 4th Workshop on Argument Mining (ArgMining2017), 2017

We propose a method for the annotation of Japanese civil judgment documents, with the purpose of creating flexible summaries of these. The first step, described in the current paper, concerns content selection, i.e., the question of which material should be extracted initially for the summary. In particular, we utilize the hierarchical argument structure of the judgment documents. Our main contributions are a) the design of an annotation scheme that stresses the connection between legal issues (called issue topics) and argument structure, b) an adaptation of rhetorical status to suit the Japanese legal system and c) the definition of a linked argument structure based on legal sub-arguments. In this paper, we report agreement between two annotators on several aspects of the overall task.

Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2017. Annotation of argument structure in Japanese legal documents. In Proceedings of the 4th Workshop on Argument Mining (ArgMining2017). pages 22-31. http://www.aclweb.org/anthology/W17-5103

Designing an annotation scheme for summarizing Japanese judgment documents

Published in Proceedings of the 9th International Conference on Knowledge and Systems Engineering (KSE 2017), 2017

We propose an annotation scheme for the summarization of Japanese judgment documents. This paper reports the details of the development of our annotation scheme for this task. We also conduct a human study where we compare the annotation of independent annotators. The end goal of our work is summarization, and our categories and the link system are a consequence of this. We propose three types of generic summaries which are focused on specific legal issues relevant to a given legal case.

Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2017. Designing an annotation scheme for summarizing Japanese judgment documents. In Proceedings of the 9th International Conference on Knowledge and Systems Engineering (KSE 2017). pages 275-280. http://ieeexplore.ieee.org/document/8119471/

Japanese university students’ paraphrasing strategies in L2 summary writing

Published in The 41st annual Language Testing Research Colloquium, 2019

Recommended citation: Sawaki, Y., Ishii, Y., & Yamada, H. (2019). Japanese university students' paraphrasing strategies in L2 summary writing. The 41st annual Language Testing Research Colloquium.

A performance study on fine-tuned large language models in the Legal Case Entailment task

Published in The 6th Competition on Legal Information Extraction/Entailment (COLIEE-2019), 2019

Deep learning based approaches achieved significant advances in various Natural Language Processing (NLP) tasks. However, such approaches have not yet been evaluated in the legal domain compared to other domains such as news articles and colloquial texts. Since creating annotated data in the legal domain is expensive, applying deep learning models to the domain has been challenging. A fine-tuning approach can alleviate the situation; it allows a model trained with a large out-domain data set to be retrained on a smaller in-domain data set. A fine-tunable language model “BERT” was proposed and achieved state-of-the-art results in various NLP tasks. In this paper, we explored the fine-tuning based approach in the legal textual entailment task using the COLIEE task 2 data set. The experimental results show that the fine-tuning approach improves the performance, achieving F1 = 0.50 with COLIEE task 2 dry run data.

Recommended citation: Hiroaki Yamada and Takenobu Tokunaga. 2019. A performance study on fine-tuned large language models in the Legal Case Entailment task. In Proceedings of the 6th Competition on Legal Information Extraction/Entailment (COLIEE-2019). https://www.cl.c.titech.ac.jp/tokunaga/_media/publication/yamada_2019aa.pdf

Supporting content evaluation of student summaries by Idea Unit embedding

Published in Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@ACL 2019, 2019

This paper discusses the computer-assisted content evaluation of summaries. We propose a method to make a correspondence between the segments of the source text and its summary. As a unit of the segment, we adopt “Idea Unit (IU)” which is proposed in Applied Linguistics. Introducing IUs enables us to make a correspondence even for the sentences that contain multiple ideas. The IU correspondence is made based on the similarity between vector representations of IUs. An evaluation experiment with two source texts and 20 summaries showed that the proposed method is more robust against rephrased expressions than the conventional ROUGE-based baselines. Also, the proposed method outperformed the baselines in recall. We implemented the proposed method in a GUI tool “Segment Matcher” that helps teachers establish a link between corresponding IUs across the summary and source text.
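
The similarity-based correspondence described above can be illustrated with a minimal sketch. It is not the paper's implementation: the choice of cosine similarity, the function names `cosine` and `align_ius`, and the toy three-dimensional vectors are all assumptions made for illustration, and real IU vectors would come from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def align_ius(summary_vecs, source_vecs):
    """For each summary IU vector, return (index of most similar source IU, similarity)."""
    links = []
    for s in summary_vecs:
        scores = [cosine(s, t) for t in source_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        links.append((best, scores[best]))
    return links

# Toy vectors standing in for IU embeddings (hypothetical values).
source_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
summary_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
links = align_ius(summary_vecs, source_vecs)
```

Here each summary IU is linked to its single best-matching source IU; a tool that supports teachers would more plausibly show several ranked candidates and let the teacher confirm the link.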

Recommended citation: Marcello Gecchele, Hiroaki Yamada, Takenobu Tokunaga and Yasuyo Sawaki. 2019. Supporting content evaluation of student summaries by Idea Unit embedding. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA@ACL2019). https://www.aclweb.org/anthology/W19-4436

Neural network based Rhetorical status classification for Japanese judgement documents

Published in Legal Knowledge and Information Systems - JURIX 2019: The Thirty-second Annual Conference, 2019

We address the legal text understanding task, and in particular we treat Japanese judgment documents in civil law. Rhetorical status classification (RSC) is the task of classifying sentences according to the rhetorical functions they fulfil; it is an important preprocessing step for our overall goal of legal summarisation. We present several improvements over our previous RSC classifier, which was based on CRF. The first is a BiLSTM-CRF based model which improves performance significantly over previous baselines. The BiLSTM-CRF architecture is able to additionally take the context in terms of neighbouring sentences into account. The second improvement is the inclusion of section heading information, which resulted in the overall best classifier. Explicit structure in the text, such as headings, is an information source which is likely to be important to legal professionals during the reading phase; this makes the automatic exploitation of such information attractive. We also considerably extended the size of our annotated corpus of judgment documents.

Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2019. Neural network based Rhetorical status classification for Japanese judgement documents. In The proceedings of the 32nd International Conference on Legal Knowledge and Information Systems (JURIX 2019). pages 133–142. https://doi.org/10.3233/FAIA190314

Annotation Study of Japanese Judgments on Tort for Legal Judgment Prediction with Rationales

Published in The 2022 International Conference on Language Resources and Evaluation (LREC2022), 2022

This paper describes a comprehensive annotation study on Japanese judgment documents in civil cases. We aim to build an annotated corpus designed for Legal Judgment Prediction (LJP), especially for torts. Our annotation scheme contains annotations of whether tort is accepted by judges as well as its corresponding rationales for explainability purposes. Our annotation scheme extracts decisions and rationales at character-level. Moreover, the scheme can capture the explicit causal relation between judges’ decisions and their corresponding rationales, allowing multiple decisions in a document. To obtain high-quality annotation, we developed an annotation scheme with legal experts, and confirmed its reliability by agreement studies with Krippendorff’s alpha metric. The result of the annotation study suggests the proposed annotation scheme can produce a dataset of Japanese LJP at reasonable reliability.

Recommended citation: Hiroaki Yamada, Takenobu Tokunaga, Ryutaro Ohara, Keisuke Takeshita, and Mihoko Sumida. 2022. Annotation Study of Japanese Judgments on Tort for Legal Judgment Prediction with Rationales. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 779–790, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.83

Automating Idea Unit Segmentation and Alignment for Assessing Reading Comprehension via Summary Protocol Analysis

Published in The 2022 International Conference on Language Resources and Evaluation (LREC2022), 2022

In this paper, we approach summary evaluation from an applied linguistics (AL) point of view. We provide computational tools to AL researchers to simplify the process of Idea Unit (IU) segmentation. The IU is a segmentation unit that can identify chunks of information. These chunks can be compared across documents to measure the content overlap between a summary and its source text. We propose a full revision of the annotation guidelines to allow machine implementation. The new guideline also improves the inter-annotator agreement, rising from 0.547 to 0.785 (Cohen’s Kappa). We release L2WS 2021, an IU gold standard corpus composed of 40 manually annotated student summaries. We propose IUExtract, the first automatic segmentation algorithm based on the IU. The algorithm was tested over the L2WS 2021 corpus. Our results are promising, achieving a precision of 0.789 and a recall of 0.844. We tested an existing approach to IU alignment via word embeddings with the state-of-the-art model SBERT. The recorded precision for the top 1 aligned pair of IUs was 0.375. We deemed this result insufficient for effective automatic alignment. We propose “SAT”, an online tool to facilitate the collection of alignment gold standards for future training.

Recommended citation: Marcello Gecchele, Hiroaki Yamada, Takenobu Tokunaga, Yasuyo Sawaki, and Mika Ishizuka. 2022. Automating Idea Unit Segmentation and Alignment for Assessing Reading Comprehension via Summary Protocol Analysis. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4663–4673, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.498

Cross-domain Analysis on Japanese Legal Pretrained Language Models

Published in Findings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP 2022), 2022

This paper investigates pretrained language models (PLMs) specialised in the Japanese legal domain. We create PLMs using different pretraining strategies and investigate their performance across multiple domains. Our findings are (i) the PLM built with general domain data can be improved by further pretraining with domain-specific data, (ii) domain-specific PLMs can learn domain-specific and general word meanings simultaneously and can distinguish them, (iii) domain-specific PLMs work better on their target domain; still, the PLMs retain the information learnt in the original PLM even after being further pretrained with domain-specific data, (iv) the PLMs sequentially pretrained with corpora of different domains show high performance for the later learnt domains.

Recommended citation: Keisuke Miyazaki, Hiroaki Yamada and Takenobu Tokunaga. 2022. Cross-domain Analysis on Japanese Legal Pretrained Language Models. In Findings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, pages 274–281, Online. https://aclanthology.org/2022.findings-aacl.26

Nearest Neighbor Search for Summarization of Japanese Judgment Documents

Published in Legal Knowledge and Information Systems - JURIX 2023: The Thirty-sixth Annual Conference, 2023

With the increasing demand for summarizing Japanese judgment documents, the automatic generation of high-quality summaries by large language models (LLMs) is expected. We propose a method to select exemplars using the nearest neighbor search for the one-shot learning method. The experiments showed our method outperforms baseline methods.

Recommended citation: Akito Shimbo, Yuta Sugawara, Hiroaki Yamada, Takenobu Tokunaga. 2023. Nearest Neighbor Search for Summarization of Japanese Judgment Documents. Legal Knowledge and Information Systems - JURIX 2023: The Thirty-sixth Annual Conference, pages 225–340, Maastricht, The Netherlands. https://doi.org/10.3233/FAIA230984

Automatic Question Generation for the Japanese National Nursing Examination Using Large Language Models

Published in The 16th International Conference on Computer Supported Education, 2024

This paper introduces our ongoing research project that aims to generate multiple-choice questions for the Japanese National Nursing Examination using large language models (LLMs). We report the progress and prospects of our project. A preliminary experiment assessing the LLMs’ potential for question generation in the nursing domain led us to focus on distractor generation, which is a difficult part of the entire question-generation process. Therefore, our problem is generating distractors given a question stem and key (correct choice). We prepare a question dataset from the past National Nursing Examination for the training and evaluation of LLMs. The generated distractors are evaluated by comparing them to the reference distractors in the test set. We propose reference-based evaluation metrics for distractor generation by extending recall and precision, which are popular in information retrieval. However, as the reference is not the only acceptable answer, we also conduct human evaluation. We evaluate four LLMs: GPT-4 with few-shot learning, ChatGPT with few-shot learning, ChatGPT with fine-tuning and JSLM with fine-tuning. Our future plan includes improving the LLMs’ performance by integrating question writing guidelines into the prompts to LLMs and conducting a large-scale administration of automatically generated questions.

Recommended citation: Yusei Kido, Hiroaki Yamada, Takenobu Tokunaga, Rika Kimura, Yuriko Miura, Yumi Sakyo and Naoko Hayashi. 2024. Automatic Question Generation for the Japanese National Nursing Examination Using Large Language Models. In Proceedings of the 16th International Conference on Computer Supported Education (CSEDU 2024), pages 821-829. https://doi.org/10.5220/0012729200003693

Analyzing Interpretability of Summarization Model with Eye-gaze Information

Published in The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024

Interpretation methods provide saliency scores indicating the importance of input words for neural summarization models. Prior work has analyzed models by comparing them to human behavior, often using eye-gaze as a proxy for human attention in reading tasks such as classification. This paper presents a framework to analyze the model behavior in summarization by comparing it to human summarization behavior using eye-gaze data. We examine two research questions: RQ1) whether model saliency conforms to human gaze during summarization and RQ2) how model saliency and human gaze affect summarization performance. For RQ1, we measure conformity by calculating the correlation between model saliency and human fixation counts. For RQ2, we conduct ablation experiments removing words/sentences considered important by models or humans. Experiments on two datasets with human eye-gaze during summarization partially confirm that model saliency aligns with human gaze (RQ1). However, ablation experiments show that removing highly-attended words/sentences from the human gaze does not significantly degrade performance compared with the removal by the model saliency (RQ2).

Recommended citation: Fariz Ikhwantri, Hiroaki Yamada, and Takenobu Tokunaga. 2024. Analyzing Interpretability of Summarization Model with Eye-gaze Information. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 939–950, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.lrec-main.84

Misalignment of Semantic Relation Knowledge between WordNet and Human Intuition

Published in The 13th International Global WordNet Conference (GWC2025), 2025

WordNet provides a carefully constructed repository of semantic relations, created by specialists. But there is another source of information on semantic relations, the intuition of language users. We present the first systematic study of the degree to which these two sources are aligned. Investigating the cases of misalignment could make proper use of WordNet and facilitate its improvement. Our analysis, which uses templates to elicit responses from human participants, reveals a general misalignment of semantic relation knowledge between WordNet and human intuition. Further analyses find a systematic pattern of mismatch among synonymy and taxonomic relations (hypernymy and hyponymy), together with the fact that WordNet path length does not serve as a reliable indicator of human intuition regarding hypernymy or hyponymy relations.

Recommended citation: Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga. 2025. Misalignment of Semantic Relation Knowledge between WordNet and Human Intuition. In the Proceedings of The 13th International Global WordNet Conference (GWC2025). https://github.com/unipv-larl/GWC2025/releases/download/papers/GWC2025_paper_5.pdf

Evaluation of LLM-Generated Distractors of Multiple-Choice Questions for the Japanese National Nursing Examination

Published in The 17th International Conference on Computer Supported Education (CSEDU 2025), 2025

This paper reports the evaluation results on the usefulness of distractors generated by large language models (LLMs) in creating multiple-choice questions for the Japanese National Nursing Examination. Our research questions are: “(RQ1) Do question writers adopt LLM-generated distractor candidates in question writing?” and “(RQ2) Does providing LLM-generated distractor candidates reduce the time for writing questions?”. We selected ten questions from the proprietary mockup examinations of the National Nursing Examination administered by a prep school, considering the analysis of the last ten years of questions of the National Nursing Examination. Distractors are generated by seven different LLMs, given a stem and a key for each question of the above ten, and they are compiled into the distractor candidate sets. Given a stem and a key for each question, 15 domain experts completed questions by filling in three distractors. Eight experts are provided with the LLM-generated distractor candidates; the other seven are not. The results of comparing the two groups provided us with affirmative answers to both RQs. The current evaluation remains subjective from the viewpoint of the question writers; it is necessary to evaluate whether questions generated with the assistance of LLMs work in a real examination setting. Our future plan includes administering a large-scale mockup examination using both human-made and LLM-assisted questions and analysing the differences in the responses to both types of questions.

Recommended citation: Yusei Kido, Hiroaki Yamada, Takenobu Tokunaga, Rika Kimura, Yuriko Miura, Yumi Sakyo, Naoko Hayashi. 2025. Evaluation of LLM-Generated Distractors of Multiple-Choice Questions for the Japanese National Nursing Examination. In the Proceedings of the 17th International Conference on Computer Supported Education (CSEDU 2025), pages 754-764. https://doi.org/10.5220/0013460300003932

Aligning Sizes of Intermediate Layers by LoRA Adapter for Knowledge Distillation

Published in The Sixth Workshop on Insights from Negative Results in NLP, 2025

Intermediate Layer Distillation (ILD) is a variant of Knowledge Distillation (KD), a method for compressing neural networks. ILD requires mapping to align the intermediate layer sizes of the teacher and student models to compute the loss function in training, while this mapping is not used during inference. This inconsistency may reduce the effectiveness of learning in intermediate layers. In this study, we propose LoRAILD, which uses LoRA adapters to eliminate the inconsistency. However, our experimental results show that LoRAILD does not outperform existing methods. Furthermore, contrary to previous studies, we observe that conventional ILD does not outperform vanilla KD. Our analysis of the distilled models’ intermediate layers suggests that ILD does not improve language models’ performance.

Recommended citation: Takeshi Suzuki, Hiroaki Yamada, and Takenobu Tokunaga. 2025. Aligning Sizes of Intermediate Layers by LoRA Adapter for Knowledge Distillation. In The Sixth Workshop on Insights from Negative Results in NLP, pages 100-105. https://aclanthology.org/2025.insights-1.10/

Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning

Published in The Sixth Workshop on Insights from Negative Results in NLP, 2025

Evaluating an LLM’s robustness against numerical perturbation is a good way to know if the LLM actually performs reasoning or just replicates patterns learned. We propose a novel method to augment math word problems (MWPs), producing numerical variations at a large scale utilizing templates. We also propose an automated error classification framework for scalable error analysis, distinguishing calculation errors from reasoning errors. Our experiments using the methods show LLMs are weak against numerical variations, suggesting they are not fully capable of generating valid reasoning steps, often failing in arithmetic operations.
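
As a rough illustration of template-based numerical variation (not the authors' method or data), the sketch below fills one hypothetical math word problem template with random numbers and recomputes the gold answer for each variant. The template text, the value range, and the names `TEMPLATE`, `answer`, and `generate_variations` are assumptions introduced only for this example.

```python
import random

# Hypothetical template; {a} and {b} are the numerical slots to be varied.
TEMPLATE = "Tom has {a} apples and buys {b} more. How many apples does Tom have now?"

def answer(a, b):
    """Gold answer for the template above, recomputed for every variation."""
    return a + b

def generate_variations(n, low, high, seed=0):
    """Create n numerical variations of the template with values in [low, high]."""
    rng = random.Random(seed)  # fixed seed so the generated set is reproducible
    problems = []
    for _ in range(n):
        a = rng.randint(low, high)
        b = rng.randint(low, high)
        problems.append({
            "question": TEMPLATE.format(a=a, b=b),
            "values": (a, b),
            "answer": answer(a, b),
        })
    return problems

variations = generate_variations(n=3, low=10, high=10_000)
```

Because the reasoning steps stay fixed while only the numbers change, a model that solves the original problem but fails on such variants is showing sensitivity to the numbers rather than to the problem structure.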

Recommended citation: Yuli Yang, Hiroaki Yamada, and Takenobu Tokunaga. 2025. Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning. In The Sixth Workshop on Insights from Negative Results in NLP, pages 171-180. https://aclanthology.org/2025.insights-1.16/

An Overview of the COLIEE 2025 Competition: Legal Case Law and Statute Law Information Retrieval and Entailment

Published in The 20th International Conference on Artificial Intelligence and Law (ICAIL2025), 2025

We summarize the 12th Competition on Legal Information Extraction and Entailment. In this edition, the competition included four tasks on case law and statute law, plus a new pilot task on Tort law. The case law component includes an information retrieval task (Task 1), and the confirmation of an entailment relation between an existing case and an unseen case (Task 2). The statute law component includes an information retrieval task (Task 3), and an entailment/question-answering task based on retrieved civil code statutes (Task 4). The new pilot task is tort prediction (TP) and its rationale extraction (RE). As in the previous 11 competitions, participation was open to any group using any approach. This year, ten different teams participated in the case law competition tasks, with most participating in more than one task. Eight teams submitted a total of 21 runs for Task 1, and six teams submitted a total of 18 runs for Task 2. For the statute law tasks, eight teams submitted a total of 22 runs for Task 3, and ten teams submitted a total of 29 runs for Task 4. For the pilot task, four teams submitted a total of 10 runs. In this paper, we summarize the variety of approaches used, present our official evaluation, and describe our analysis of the submitted results.

Recommended citation: Randy Goebel, Yoshinobu Kano, Mi-Young Kim, Calum Kwan, Ken Satoh, Hiroaki Yamada, Yoshioka Masaharu. 2025. An Overview of the COLIEE 2025 Competition: Legal Case Law and Statute Law Information Retrieval and Entailment. In the Proceedings of the 20th International Conference on Artificial Intelligence and Law (ICAIL2025). https://dl.acm.org/conference/icail/proceedings

On the Distinctive Co-occurrence Characteristics of Antonymy

Published in The 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025), 2025

Antonymy has long received particular attention in lexical semantics. Previous studies have shown that antonym pairs frequently co-occur in text, across genres and parts of speech, more often than would be expected by chance. However, whether this co-occurrence pattern is distinctive of antonymy remains unclear, due to a lack of comparison with other semantic relations. This work fills the gap by comparing antonymy with three other relations across parts of speech using robust co-occurrence metrics. We find that antonymy is distinctive in three respects: antonym pairs co-occur with high strength, in a preferred linear order, and within short spans. All results are available online.

Recommended citation: Zhihan Cao, Hiroaki Yamada, Takenobu Tokunaga. 2025. On the Distinctive Co-occurrence Characteristics of Antonymy. In the Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025). https://aclanthology.org/2025.starsem-1.10/

publications_jn

Building a corpus of legal argumentation in Japanese judgement documents: towards structure-based summarisation

Published: February 15, 2019

We present an annotation scheme describing the argument structure of judgement documents, a central construct in Japanese law. To support the final goal of this work, namely summarisation aimed at the legal professions, we have designed blueprint models of summaries of various granularities, and our annotation model in turn is fitted around the information needed for the summaries. In this paper we report results of a manual annotation study, showing that the annotation is stable. The annotated corpus we created contains 89 documents (37,673 sentences; 2,528,604 characters). We also designed and implemented the first two stages of an algorithm for the automatic extraction of argument structure, and present evaluation results.

Recommended citation: Hiroaki Yamada, Simone Teufel and Takenobu Tokunaga. 2019. Building a Corpus of Legal Argumentation in Japanese Judgement Documents: Towards Structure-Based Summarisation. Artificial Intelligence and Law. Springer Netherlands, 27(2):141–170. https://doi.org/10.1007/s10506-019-09242-3

Looking deep in the eyes: Investigating interpretation methods for neural models on reading tasks using human eye-movement behaviour

Published: March 01, 2023

This paper provides the first broad overview of the relation between different interpretation methods and human eye-movement behaviour across different tasks and architectures. The interpretation methods of neural networks provide the information the machine considers important, while the human eye-gaze has been believed to be a proxy of the human cognitive process. Thus, comparing them explains machine behaviour in terms of human behaviour, leading to improvement in machine performance through minimising their difference. We consider three types of natural language processing (NLP) tasks: sentiment analysis, relation classification and question answering, and four interpretation methods based on: simple gradient, integrated gradient, input-perturbation and attention, and three architectures: LSTM, CNN and Transformer. We leverage two corpora annotated with eye-gaze information: the Zuco dataset and the MQA-RC dataset. This research sets up two research questions. First, we investigate whether the saliency (importance) of input words conforms with that derived from human eye-gaze features. To this end, we compute a saliency distance (SD) between input words (by an interpretation method) and an eye-gaze feature. SD is defined as the KL-divergence between the saliency distribution over input words and an eye-gaze feature. We found that the SD scores vary depending on the combinations of tasks, interpretation methods and architectures. Second, we investigate whether the models with good saliency conformity to human eye-gaze behaviour have better prediction performances. To this end, we propose a novel evaluation device called “SD-performance curve” (SDPC) which represents the cumulative model performance against the SD scores. SDPC enables us to analyse the underlying phenomena that were overlooked using only the macroscopic metrics, such as average SD scores and rank correlations, that are typically used in past studies. 
We observe that the impact of good saliency conformity between humans and machines on task performance varies among the combinations of tasks, interpretation methods and architectures. Our findings should be considered when introducing eye-gaze information for model training to improve the model performance.
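
The saliency distance (SD) described above can be sketched as a KL-divergence between two normalised score vectors over the same input words. This is only an illustration under stated assumptions: the abstract does not give the direction of the divergence or the normalisation, so the sketch assumes KL(gaze || saliency), clips negative scores to zero, and adds a small smoothing constant; `normalise` and `saliency_distance` are hypothetical names.

```python
import math

def normalise(scores, eps=1e-12):
    """Turn raw scores into a probability distribution.

    Assumption: negative scores are clipped to zero and eps is added to
    every entry so that no probability is exactly zero.
    """
    shifted = [max(s, 0.0) + eps for s in scores]
    total = sum(shifted)
    return [s / total for s in shifted]

def saliency_distance(saliency, gaze):
    """KL(gaze || saliency) over the words of one input (direction is an assumption)."""
    p = normalise(gaze)
    q = normalise(saliency)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy per-word scores (hypothetical values): model saliency and fixation counts.
model_saliency = [0.1, 0.7, 0.1, 0.1]
fixation_counts = [1, 5, 2, 0]
sd = saliency_distance(model_saliency, fixation_counts)
```

A smaller SD means the model's saliency distribution is closer to the human eye-gaze feature for that input, and an SD of zero means the two normalised distributions coincide.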

Recommended citation: Fariz Ikhwantri, Jan Wira Gotama Putra, Hiroaki Yamada, Takenobu Tokunaga. 2023. Looking deep in the eyes: Investigating interpretation methods for neural models on reading tasks using human eye-movement behaviour. Information Processing & Management. Elsevier Ltd, 60(2023) 103195. https://doi.org/10.1016/j.ipm.2022.103195

Japanese tort-case dataset for rationale-supported legal judgment prediction

Published: May 11, 2024

This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), which features two tasks: tort prediction and its rationale extraction. The rationale extraction task identifies the arguments accepted by the court from the arguments alleged by plaintiffs and defendants, which is a novel task in the field. JTD is constructed based on 3477 Japanese Civil Code judgments annotated by 41 legal experts, resulting in 7978 instances with 59,697 of their alleged arguments from the involved parties. Our baseline experiments show the feasibility of the proposed two tasks, and our error analysis by legal experts identifies sources of errors and suggests future directions of the LJP research.

Recommended citation: Hiroaki Yamada, Takenobu Tokunaga, Ryutaro Ohara, Akira Tokutsu, Keisuke Takeshita, and Mihoko Sumida. 2024. Japanese tort-case dataset for rationale-supported legal judgment prediction. Artificial Intelligence and Law. https://doi.org/10.1007/s10506-024-09402-0

Emotion Recognition in Conversations Using Nearest-Neighbor Examples (近傍事例を用いた対話における感情認識)

Published: June 15, 2024

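Emotion Recognition in Conversations (ERC), the task of recognising the emotion of utterances in a dialogue, has been attracting attention for sentiment analysis on social media and for building emotional and empathetic dialogue systems. In ERC, it is known that utterances with similar content can express different emotions depending on the content of the surrounding sequence of utterances (the context). A representative way of capturing the context is to concatenate the sequence of utterances and feed it to a classification model. This conventional approach takes the target utterance and its preceding context (the dialogue) as input and predicts the emotion label of the target utterance with the classification model alone. This study proposes a method that augments the conventional classification model with a database external to the model. Specifically, we retrieve utterances that are semantically close to the target utterance from the training set, build a probability distribution from the emotion labels attached to the retrieved utterances (nearest-neighbor examples), and combine it with the probability distribution of the conventional classification model by a weighted linear sum. Furthermore, in addition to a weighted linear sum with a constant weight, we propose a method that changes the weight coefficient dynamically for each target utterance. In evaluation experiments on three ERC benchmark datasets, the proposed method with dynamically changing weight coefficients achieved state-of-the-art recognition performance, surpassing the conventional method.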
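
The weighted linear sum of the classifier's distribution and the nearest-neighbor label distribution can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label set, the probability values, the fixed weight `lam`, and the function names `knn_distribution` and `interpolate` are assumptions, and the retrieval step and the per-utterance computation of the dynamic weight are not shown because the abstract does not specify them.

```python
def knn_distribution(neighbour_labels, labels):
    """Probability distribution built from the emotion labels of retrieved neighbours."""
    if not neighbour_labels:
        raise ValueError("at least one retrieved neighbour is required")
    counts = {label: 0 for label in labels}
    for lab in neighbour_labels:
        counts[lab] += 1
    total = len(neighbour_labels)
    return {label: counts[label] / total for label in labels}

def interpolate(model_probs, knn_probs, lam):
    """Weighted linear sum: lam * kNN distribution + (1 - lam) * model distribution."""
    return {
        label: lam * knn_probs[label] + (1.0 - lam) * model_probs[label]
        for label in model_probs
    }

# Hypothetical label set, classifier output, and labels of four retrieved neighbours.
labels = ["joy", "anger", "neutral"]
model_probs = {"joy": 0.5, "anger": 0.2, "neutral": 0.3}
knn_probs = knn_distribution(["anger", "anger", "joy", "anger"], labels)

# Constant weight for illustration; a dynamic variant would set lam per utterance.
combined = interpolate(model_probs, knn_probs, lam=0.5)
prediction = max(combined, key=combined.get)
```

In this toy case the classifier alone would predict "joy", while the combined distribution favours "anger" because three of the four retrieved neighbours carry that label.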

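Recommended citation: 石渡太智, 後藤淳, 山田寛章 (Hiroaki Yamada), 徳永健伸 (Takenobu Tokunaga). 近傍事例を用いた対話における感情認識 (Emotion Recognition in Conversations Using Nearest-Neighbor Examples), 自然言語処理 (Journal of Natural Language Processing), 2024, Vol. 31, No. 2, pp. 504-533, 2024/06/15, Online ISSN 2185-8314, Print ISSN 1340-7619. https://doi.org/10.5715/jnlp.31.504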

Developing and validating an online module for formative assessment of summary writing with automated content feedback for EFL academic writing instruction

Published: October 28, 2024

The present paper provides an overview of an online module for formative assessment of summary writing skills for second language (L2) introductory academic writing instruction in Japan and presents initial empirical results on how Japanese undergraduate students’ summary writing performance changed with a series of automated summary content feedback delivered in the module. A key feature of this module was the provision of fine-grained feedback delivered as scaffolding during revisions in terms of two key aspects of summary content: main idea representation and paraphrasing. Participants were 64 Japanese undergraduate engineering majors in introductory academic writing courses at a private university in Tokyo. The students completed two summary writing tasks provided through the online module. Results of a multivariate analysis of variance showed significant improvement of the content analytic score on revision on the initial summary task, and that this improved performance level was retained on a transfer task. The language use analytic score also improved significantly on the transfer task. Detailed analyses of learner-produced summaries based on descriptive statistics further suggested that the learners made substantively meaningful changes concerning main idea coverage and verbatim copying of the source text while still meeting the length requirement, although the results differed somewhat across the source texts assigned. Despite some study limitations, these results provide initial support for immediate content feedback provision for the development of basic summary writing skills.

Recommended citation: Yasuyo Sawaki, Yutaka Ishii, Hiroaki Yamada, Takenobu Tokunaga. Developing and validating an online module for formative assessment of summary writing with automated content feedback for EFL academic writing instruction, Language Testing in Asia 14, 50 (2024). https://doi.org/10.1186/s40468-024-00325-w

Examining the consistency of instructor versus large language model ratings on summary content: Toward checklist-based feedback provision with second language writers

Published: July 20, 2025

This study examined the consistency between instructor ratings of learner-generated summaries and those estimated by a large language model (LLM) on summary content checklist items designed for undergraduate second language (L2) writing instruction in Japan. The effects of the LLM prompt design on the consistency between the two were also explored by comparing six types of prompts obtained by altering the amount of information included in the prompt and the direction concerning the order in which different parts of the LLM output (a checklist-based rating and its rationale) are generated. Ninety-seven summaries written by Japanese undergraduate students were analyzed by employing three checklist items on the use of topic sentences included in the source text. Satisfactory agreement between instructor and LLM ratings for low-stakes use was obtained for certain checklist item-by-prompt type combinations. When discrepancies between the two were observed, LLM ratings tended to be harsher than instructor ratings in general. Furthermore, the amount of information included in the LLM prompt affected the instructor-LLM rating agreement more than the order of generating a rating and its rationale in the output. The results offered initial empirical support for employing LLM-generated formative feedback on summary content in L2 writing classrooms.

Recommended citation: Yasuyo Sawaki, Yutaka Ishii, Hiroaki Yamada, Takenobu Tokunaga. Examining the consistency of instructor versus large language model ratings on summary content: Toward checklist-based feedback provision with second language writers, Language Testing (2025). https://doi.org/10.1177/02655322251349217

A comprehensive evaluation of semantic relation knowledge of pretrained language models and humans

Published: July 24, 2025

Recently, much work has concerned itself with the enigma of what exactly pretrained language models (PLMs) learn about different aspects of language, and how they learn it. One stream of this type of research investigates the knowledge that PLMs have about semantic relations. However, many aspects of semantic relations were left unexplored. Generally, only one relation has been considered, namely hypernymy. Furthermore, previous work did not measure humans’ performance on the same task as that performed by the PLMs. This means that at this point in time, there is only an incomplete view of the extent of these models’ semantic relation knowledge. To address this gap, we introduce a comprehensive evaluation framework covering five relations beyond hypernymy, namely hyponymy, holonymy, meronymy, antonymy, and synonymy. We use five metrics (two newly introduced here) for recently untreated aspects of semantic relation knowledge, namely soundness, completeness, symmetry, prototypicality, and distinguishability. Using these, we can fairly compare humans and models on the same task. Our extensive experiments involve six PLMs, four masked and two causal language models. The results reveal a significant knowledge gap between humans and models for all semantic relations. In general, causal language models, despite their wide use, do not always perform significantly better than masked language models. Antonymy is the outlier relation where all models perform reasonably well.

Recommended citation: Zhihan Cao, Hiroaki Yamada, Simone Teufel, Takenobu Tokunaga. A comprehensive evaluation of semantic relation knowledge of pretrained language models and humans, Language Resources and Evaluation (2025). https://doi.org/10.1007/s10579-025-09858-9

publications_nr

Toward Automatic Extraction of Characteristic Features of English Composition by Japanese College Students : Preliminary Experiments and Issues to be Addressed

Published: June 21, 2014

We are accumulating electronic files of English compositions by university freshmen. For the past ten years or so, about 50 to 90 students enrolled in three English classes taught by the last author have been submitting 15 essays per year, each in three different versions. The first is what the students come up with in half an hour or so in class after engaging in what we call "oral response practice," in which a group of three students take turns reading a question card aloud, responding to the question, and video recording the interaction. Sets of 10 question cards around one topic are prepared by the teacher and distributed to the groups along with a video camera. After class, students spend some time "completing" their compositions, submit them during the next class, and review other students' essays within groups of six. The students are then asked to revise the essays and submit the final version during the following class. In addition to the students' peer review comments, it would be desirable for the students to get feedback from statistical analysis based on parsing and other processing of their own essays, but the submitted files have to go through some pre-processing for the parser and analyzers to work properly. In this presentation, we report on our preliminary study and experimentation.

Recommended citation: 山田寛章, 石井雄隆, 原田康也. 日本人大学生の英語作文からの特徴量の自動抽出に向けて : 予備実験と今後の課題 (思考と言語). 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 114(100). pages 55-60, Jun 2014. ISSN 0913-5685. https://ci.nii.ac.jp/naid/110009925596

判決書自動要約のための修辞役割分類

Published: March 14, 2018

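We aim to make access to Japanese judgment documents easier and more efficient by automatically generating summaries of judgments that serve as cues for retrieval. If the common argument structure contained in judgment documents can be exploited appropriately, very high-quality automatic summarization becomes possible. The purpose of this study is to extract that argument structure. In this paper, using our own corpus of judgment documents annotated with argument structure, we introduce features such as modality expressions, functional expressions, names of laws, and cue phrases in addition to morpheme bigrams and similar features, and automatically classify rhetorical roles, the basic building blocks of argument structure, by machine learning with an SVM. We found that argumentative text in judgment documents can be discriminated even with basic features alone.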

Recommended citation: 山田寛章, Simone Teufel, 徳永健伸. 判決書自動要約のための修辞役割分類. 言語処理学会第24回年次大会発表論文集, pp. 785-788, 2018年3月. http://anlp.jp/proceedings/annual_meeting/2018/pdf_dir/P7-8.pdf

Autonomous Mutual Learning through Interaction – Difficulties in Automatization of Language Processing for Japanese EFL Learners

Published: March 18, 2019

The goal of foreign language education and/or learning is the attainment of proficiency in the target language, and learners should not only acquire knowledge of vocabulary, expressions and grammar but also achieve automatization of mental processing of that language. To what extent are English language education and learning in Japan achieving these goals?

Recommended citation: 原田康也, 森下美和, 鈴木正紀, 横森大輔, 遠藤智子, 前坊香菜子, 鍋井理沙, 桒原奈な子, 山田寛章, 河村まゆみ. 自律的相互学習の記録と分析からインタラクションの楽しさへ～外国語としての英語自動処理の難しさを超えて～ (思考と言語). 電子情報通信学会技術研究報告 = IEICE technical report : 信学技報, 118(516). pages 17-22, Mar 2019. ISSN 2432-6380.

見出し情報を考慮した階層型RNNによる日本語判決書のための修辞役割分類

Published: March 17, 2020

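This paper discusses the automation of rhetorical role classification for Japanese judgment documents and the improvement of its performance. Previous work on rhetorical role classification for Japanese judgment documents built a classifier using a Conditional Random Field (CRF) and achieved F=0.63 (macro average), but classification performance was relatively low for the important roles BACKGROUND (F=0.32) and CONCLUSION (F=0.39), leaving room for improvement. We show that a model based on a hierarchical RNN, which can take inter-sentence context into account, improves rhetorical role classification performance on Japanese judgment documents over the conventional CRF classifier. We also show that adding to the hierarchical RNN a dedicated network that handles the heading information appearing in judgment documents improves rhetorical role classification performance.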

Recommended citation: 山田寛章, Simone Teufel, 徳永健伸. 見出し情報を考慮した階層型RNNによる日本語判決書のための修辞役割分類. 言語処理学会第26回年次大会発表論文集, pp. 37-40, 2020年3月. https://www.anlp.jp/proceedings/annual_meeting/2020/pdf_dir/P1-10.pdf

日本語法律BERT を用いた判決書からの重要箇所抽出

Published: March 16, 2022

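For the task of extracting important parts from judgment documents, this study used a BERT pretrained only on legal-domain documents and a BERT obtained by additional pretraining starting from a BERT pretrained on Japanese Wikipedia, and compared their performance with a general-purpose Japanese BERT. The experiments confirmed that BERT models specialized for the legal domain outperform the general-purpose Japanese BERT.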

Recommended citation: 菅原祐太, 宮崎桂輔, 山田寛章, 徳永健伸. 日本語法律BERTを用いた判決書からの重要箇所抽出. 言語処理学会第28回年次大会発表論文集, pp. 838-841, 2022年3月. https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/PT1-10.pdf

日本語法律分野文書に特化したBERTの構築

Published: March 17, 2022

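This paper proposes BERT models specialized for the Japanese legal domain. Using a corpus of civil case judgment documents, we built a model that pretrains BERT from scratch and a model that additionally pretrains an existing general-purpose Japanese BERT. The experiments showed that, on the Masked Language Model and Next Sentence Prediction tasks using civil case judgment documents, additional pretraining of the existing general-purpose Japanese BERT yields the best accuracy.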

Recommended citation: 宮崎桂輔, 菅原祐太, 山田寛章, 徳永健伸. 日本語法律分野文書に特化したBERT の構築. 言語処理学会第28回年次大会発表論文集, pp. 1546-1551, 2022年3月. https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/PT3-7.pdf

法と人工知能の接点――判決予測システムの研究動向――

Published: May 31, 2022

Legal judgment prediction (LJP) is the task of predicting the outcome of a court case based on input facts. Predicting legal judgments can help not only legal professionals but also members of the general public who are not legal specialists. An LJP system allows anyone involved in a legal dispute to anticipate the outcome of litigation. This article provides a simple introduction to artificial intelligence and natural language processing research in the field of LJP, reviews recent advances in legal judgment prediction and related topics, and discusses the challenges and possible directions for developing a smarter and more trustworthy LJP system.

Recommended citation: 山田寛章. 法と人工知能の接点. 情報法制研究, 11 巻 p. 27-33, 2022年5月. https://www.jstage.jst.go.jp/article/alis/11/0/11_27/_article/-char/ja

近傍事例を用いた対話における感情認識

Published: March 04, 2023

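Emotion Recognition in Conversations (ERC), the recognition of the emotion of each utterance in a dialogue, is attracting attention for purposes such as sentiment analysis on social media and building emotional and empathetic dialogue systems. In ERC, it is known that not only the content of an utterance but also the relations between utterances strongly affect the speaker's emotion. Many conventional methods extract the relations between utterances and have achieved high recognition performance. Such methods often show high recognition performance on their own, but further improvement can be expected by combining models with different properties. This study proposes a method that combines the probability distribution over emotion labels output by a model that performs well on its own with a probability distribution built from nearest neighbors retrieved with a separate model of a different nature. In evaluation experiments, the proposed method outperformed the recognition rate of the base model alone on two of three ERC benchmark datasets. In a permutation test, the proposed method also showed statistically significant results against the base model alone.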

Recommended citation: 石渡太智, 美野秀弥, 後藤淳, 山田寛章, 徳永健伸. 近傍事例を用いた対話における感情認識. 言語処理学会第29回年次大会発表論文集, pp. 567-571, 2023年3月. https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/A3-1.pdf

敵対的学習を用いた知識蒸留への中間層蒸留と対照学習の導入

Published: March 04, 2023

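Knowledge distillation (KD) is one technique for compressing large neural networks. Among KD methods for language models, the best-performing one is CILDA, which introduces intermediate-layer outputs and contrastive learning into adversarial training. CILDA's training is divided into a maximization step and a minimization step, but intermediate-layer outputs and contrastive learning are used only in the maximization step. In this study, we aimed to improve performance by introducing intermediate-layer distillation and contrastive learning into the minimization step. However, we could not confirm a significant difference from the existing method. To analyze the cause, we ran a reproduction experiment of CILDA alone and found that, contrary to the claims of prior work, CILDA does not outperform earlier methods on several GLUE tasks.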

Recommended citation: 鈴木偉士, 山田寛章, 徳永健伸. 敵対的学習を用いた知識蒸留への中間層蒸留と対照学習の導入. 言語処理学会第29回年次大会発表論文集, pp. 783-788, 2023年3月. https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/Q3-2.pdf

参照例を使わないキャッチコピーの自動評価

Published: March 04, 2023

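Offline human evaluation of catch copy (advertising slogans), a type of advertising text, is costly. An automatic evaluator is needed to make research on automatic slogan generation faster and more efficient. Because no dataset suitable for building such an evaluator currently exists, we constructed a dataset of 23,641 slogans and their evaluation scores, the first of its kind for Japanese. Using this dataset, we built a reference-free evaluator with BERT and contrastive learning; in evaluation experiments, the correlation coefficient with the evaluation scores of the test data exceeded 0.28 on average. We also compared against training without contrastive learning and confirmed the usefulness of contrastive learning.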

Recommended citation: 新保彰人, 山田寛章, 徳永健伸. 参照例を使わないキャッチコピーの自動評価. 言語処理学会第29回年次大会発表論文集, pp. 1557-1562, 2023年3月. https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/A7-2.pdf

低資源な法ドメイン含意タスクにおけるデータ拡張

Published: March 04, 2023

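In the legal domain, annotation is costly, so training data is scarce. Using COLIEE Task 4, this paper examines the effect of rule-based augmentation of labeled training data and of pseudo training-data augmentation during language model pretraining. The experiments showed that the proposed data augmentation method based on argumentum a contrario (反対解釈) achieved the best performance.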

Recommended citation: 伊藤光一, 山田寛章, 徳永健伸. 低資源な法ドメイン含意タスクにおけるデータ拡張. 言語処理学会第29回年次大会発表論文集, pp. 990-885, 2023年3月. https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P4-2.pdf

Japanese Tort-case Dataset for Rationale-supported Legal Judgment Prediction

Published: December 01, 2023

This paper presents the first dataset for Japanese Legal Judgment Prediction (LJP), the Japanese Tort-case Dataset (JTD), which features two tasks: tort prediction and its rationale extraction. The rationale extraction task, which is novel in the field, identifies the arguments accepted by the court from among the arguments alleged by plaintiffs and defendants. JTD is constructed from 3,477 Japanese Civil Code judgments annotated by 41 legal experts, resulting in 7,978 instances with 59,697 alleged arguments from the involved parties. Our baseline experiments show the feasibility of the two proposed tasks, and our error analysis by legal experts identifies sources of errors and suggests future directions for LJP research.

Recommended citation: Hiroaki Yamada, Takenobu Tokunaga, Ryutaro Ohara, Akira Tokutsu, Keisuke Takeshita, Mihoko Sumida. Japanese Tort-case Dataset for Rationale-supported Legal Judgment Prediction. ArXiv Preprint, 2023/12/1. https://arxiv.org/abs/2312.00480

MONETECH at the NTCIR-17 FinArg-1 Task: Layer Freezing, Data Augmentation, and Data Filtering for Argument Unit Identification

Published: December 12, 2023

This paper reports MONETECH's participation in FinArg-1's Argument Unit Identification in Earnings Conference Call subtask. Our experiments are based on the BERT and FinBERT models with additional experimentation on Large Language Model-based data augmentation, data filtering, and the model's layer freezing. Our best-performing submission, which is based on data filtering and the model's layer freezing, scores 75.54% in micro F1 evaluation. Results from additional runs also show that the model's layer freezing and data filtering could further improve model performance beyond our best submission.

Recommended citation: Supawich Jiarakul, Hiroaki Yamada and Takenobu Tokunaga. MONETECH at the NTCIR-17 FinArg-1 Task: Layer Freezing, Data Augmentation, and Data Filtering for Argument Unit Identification. The 17th NTCIR Conference Evaluation of Information Access Technologies, 2023/12/12. https://doi.org/10.20736/0002001314

複数文書要約を用いた事実性の検証

Published: March 03, 2024

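Among the claims subject to fact verification, some can only be judged correctly by referring to multiple documents as information sources and going through multiple steps of reasoning. This study proposes a summarize-then-judge architecture as a fact verification mechanism suited to such claims. In the proposed method, the first stage performs multi-document summarization of the parts that support the claim from the multiple documents given as information sources, and the second stage judges the factuality of the claim using the generated summary. By condensing the source documents into a short summary in the first stage, we aim to make effective use of the reasoning ability of large language models when judging the claim. We evaluated the proposed method on the HoVer dataset and achieved performance exceeding conventional approaches.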

Recommended citation: 伊藤悠馬, 山田寛章, 徳永健伸. 複数文書要約を用いた事実性の検証. 研究報告自然言語処理（NL）, Volume 2024-NL-259, Issue 17, pp. 1-9, 2024年3月. http://id.nii.ac.jp/1001/00232766/

事前学習済みモデルを用いた日本語直喩表現の解釈

Published: March 04, 2024

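Building a model that generates candidate human-like, natural interpretations (e.g., "a bright smile") for simile expressions (e.g., ひまわりのような笑顔, "a smile like a sunflower") is one of the problems attracting attention in natural language processing. In this study, we generate interpretations of simile expressions using the pretrained masked language model BERT. We also propose an extension of the masked language model (MLM) suited to filling in adjectives and a training framework that focuses on adjective-noun modification relations. With the proposed method, Recall@5, the score for simile interpretation, reached 0.296, exceeding the other methods compared.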

Recommended citation: 鈴木颯仁, 山田寛章, 徳永健伸. 事前学習済みモデルを用いた日本語直喩表現の解釈. 言語処理学会第30回年次大会発表論文集, pp. 3137-3142, 2024年3月. https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/D11-6.pdf

対義関係バイアス: 事前訓練済み言語モデルと人間の意味関係間の弁別能力に関する分析

Published: March 04, 2024

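Discriminating between semantic relations is not an easy task for either humans or machines. In this study, we ask whether pretrained language models, which have shown outstanding performance on a variety of downstream tasks, can discriminate between semantic relations. We approach this question by proposing a measure called the degree of confusion (混淆度) and comparing the models with humans. As a result, we observed that pretrained language models are worse than humans at discriminating semantic relations and, at the same time, have a bias toward misrecognizing non-antonymous relations as antonymy.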

Recommended citation: Cao Zhihan, 山田寛章, 徳永健伸. 対義関係バイアス: 事前訓練済み言語モデルと人間の意味関係間の弁別能力に関する分析. 言語処理学会第30回年次大会発表論文集, pp. 2194-2198, 2024年3月. https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/D8-3.pdf

GPTs and Language Barrier: A Cross-Lingual Legal QA Examination

Published: March 04, 2024

In this paper, we explore the application of Generative Pre-trained Transformers (GPTs) in cross-lingual legal Question-Answering (QA) systems using the COLIEE Task 4 dataset. In the COLIEE Task 4, given a statement and a set of related legal articles that serve as context, the objective is to determine whether the statement is legally valid, i.e., if it can be inferred from the provided contextual articles or not, which is also known as an entailment task. By benchmarking four different combinations of English and Japanese prompts and data, we provide valuable insights into GPTs’ performance in multilingual legal QA scenarios, contributing to the development of more efficient and accurate cross-lingual QA solutions in the legal domain.

Recommended citation: Nguyen Ha Thanh, 山田寛章, 佐藤健. GPTs and Language Barrier: A Cross-Lingual Legal QA Examination. 言語処理学会第30回年次大会発表論文集, pp. 1062-1066, 2024年3月. https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/E4-5.pdf

大規模言語モデルを用いた日本語判決書の自動要約

Published: March 04, 2024

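With growing demand for automatic summarization of Japanese judgment documents, large language models (LLMs) are expected to produce high-quality summaries of judgment documents. This study proposes a method that uses nearest-neighbor retrieval to select the sample used for one-shot in-context learning. We show that the proposed method improves the accuracy of judgment document summarization compared with the baseline method.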

Recommended citation: 新保彰人, 菅原祐太, 山田寛章, 徳永健伸. 大規模言語モデルを用いた日本語判決書の自動要約. 言語処理学会第30回年次大会発表論文集, pp. 1056-1061, 2024年3月. https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/E4-4.pdf

日本語不法行為事件データセットの構築

Published: March 04, 2024

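This study proposes the Japanese Tort-case Dataset (JTD), a dataset for research on legal judgment prediction in the Japanese language and under Japanese law. JTD is designed for a tort prediction task and its rationale extraction task. The rationale extraction task extracts, from the arguments of the plaintiff or defendant, the arguments that served as important grounds for deciding whether a tort is established. JTD is built from 3,477 civil case judgment documents annotated by 41 legal experts and contains 7,978 instances (with 59,697 arguments by plaintiffs and defendants contained in those instances). Baseline experiments confirmed the feasibility of each JTD task and further showed that training the tort prediction and rationale extraction tasks jointly improves performance.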

Recommended citation: 山田寛章, 徳永健伸, 小原隆太郎, 得津晶, 竹下啓介, 角田美穂子. 日本語不法行為事件データセットの構築. 言語処理学会第30回年次大会発表論文集, pp. 1045-1050, 2024年3月. https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/E4-2.pdf

不法行為判断予測データセット構築におけるELSI課題

Published: June 11, 2024

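We constructed the Japanese Tort-case Dataset (JTD), a dataset for research on Legal Judgement Prediction, which has recently become a major area of AI application research in the legal field. JTD is built from 3,477 civil case judgment documents annotated by 41 legal experts and contains 7,978 instances (with 59,697 arguments by plaintiffs and defendants contained in those instances). Because JTD handles legal-domain data in the form of judgment documents, it had to be constructed while considering a variety of social issues: not only whether it is appropriate for computers to make predictions concerning judicial decisions and the biases latent in the constructed dataset, but also the circumstance particular to Japan that making judgment data open is still under deliberation. In particular, the environment for sharing with other researchers a dataset that handles data such as judgment documents, which can affect the interests of many stakeholders, is far from sufficiently developed. To contribute to deepening future discussion, this paper reports the issues we examined during construction.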

Recommended citation: 山田寛章, 小原隆太郎, 角田美穂子. 不法行為判断予測データセット構築におけるELSI課題, 人工知能学会全国大会論文集, Volume JSAI2024, 第38回 (2024), Online ISSN 2758-7347 2024年5月. https://doi.org/10.11517/pjsai.JSAI2024.0_3K1OS2a02

Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning

Published: March 11, 2025

Recommended citation: Yang Yuli，山田寛章，徳永健伸. Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning. 言語処理学会第31回年次大会発表論文集, pp. 851-856，2025年3月. https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/Q2-20.pdf

自動アノテーションを導入したG-Evalによる英文要約課題評価

Published: March 11, 2025

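For automatic evaluation of English summary writing tasks for learners of English, we propose a new method that uses large language models (LLMs) to evaluate summaries based on their content. By combining few-shot learning, automatic expansion of the scoring criteria, and automatic annotation of important concepts and expressions in the summary, this study enabled high-quality evaluation of summary content.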

Recommended citation: 藤田晃輔, 山田寛章, 徳永健伸, 石井雄隆, 澤木泰代. 自動アノテーションを導入したG-Evalによる英文要約課題評価. 言語処理学会第31回年次大会発表論文集, pp. 1056-1061，2025年3月. https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/P3-2.pdf

言語モデルを用いた看護師国家試験問題の誤答選択肢自動生成

Published: March 11, 2025

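This study uses large language models to automatically generate distractors (incorrect answer options) for questions in the national nursing examination. For generation, we use several Japanese large language models and several models available through APIs, and control the output using questions from past examinations and from preparatory-school mock examinations as a dataset. We presented the generated distractors as candidate options to people with experience writing national nursing examination questions and analyzed their contribution to reducing workload and improving efficiency in actual question writing. The results showed that distractors generated by large language models are useful and indicated the potential to improve the efficiency of question writing.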

Recommended citation: 城戸祐世, 山田寛章, 徳永健伸, 木村理加, 三浦友理子, 佐居由美, 林直子. 言語モデルを用いた看護師国家試験問題の誤答選択肢自動生成. 言語処理学会第31回年次大会発表論文集, pp. 1074-1079，2025年3月. https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/P3-5.pdf

LoRAを活用した言語モデルの中間層蒸留

Published: March 11, 2025

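In recent years, the growing size of language models has greatly increased computational cost, creating demand for methods that reduce the number of model parameters while preserving performance. Knowledge distillation is one such method, and intermediate-layer distillation is one form of it. Intermediate-layer distillation, which also uses the model's intermediate-layer outputs in computing the loss function, has been considered effective, but because the linear mapping is not used at inference time, the effect of training is not guaranteed. In this study, we propose LoRAILD, which replaces the linear mapping of intermediate-layer distillation with LoRA adapters so that the linear mapping is not removed at inference time, and we conducted experiments. The results were negative with respect to the effectiveness of intermediate-layer distillation.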

Recommended citation: 鈴木偉士, 山田寛章, 徳永健伸. LoRAを活用した言語モデルの中間層蒸留. 言語処理学会第31回年次大会発表論文集, pp. 1644-1649，2025年3月. https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/Q4-7.pdf

判決書要約文の自動評価

Published: March 12, 2025

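Automatic evaluation metrics used in general document summarization, such as ROUGE, are also used for the task of automatic summarization of judgment documents. However, existing automatic metrics cannot assess whether the elements essential to a judgment summary are included in the summary. In this study, we develop an evaluation rubric specialized for judgment summaries and have legal experts perform human evaluation based on it. We then use that evaluation data to build an automatic evaluator specialized for judgment summaries. We evaluate the evaluator with the kappa coefficient and show that the agreement between the automatic evaluator and the gold data partially exceeds the agreement among the human evaluators.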

Recommended citation: 新保彰人, 山田寛章, 徳永健伸. 判決書要約文の自動評価. 言語処理学会第31回年次大会発表論文集, pp. 1974-1979，2025年3月. https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/P5-10.pdf

思考発話を利用した個人の発話及び性格特性再現

Published: March 13, 2025

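This study proposes a method that reproduces an individual's utterances and personality traits by fine-tuning on dialogue data annotated with think-aloud utterances (思考発話). Specifically, we use an LLM to add the target person's think-aloud utterances to an existing dialogue dataset. We then train a model on that data to reproduce the target person's way of speaking, emotions, and thoughts. Compared with prior work that focused on reproducing celebrities and well-known characters, this study showed the possibility of reproducing the utterances and personality traits of individuals with diverse characteristics.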

Recommended citation: 石倉誠也, 山田寛章, 平岡達也, 山田広明, 徳永健伸. 思考発話を利用した個人の発話及び性格特性再現. 言語処理学会第31回年次大会発表論文集, pp. 4155-4160，2025年3月. https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/P10-20.pdf

言語で柔軟に操作可能なWhat-if社会シミュレータ

Published: March 14, 2025

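In social simulation research, reproducing heterogeneity such as each person's individuality and reproducing complex interactions such as dialogue have long been important concerns, but because implementation is so costly, these details have rarely been explored. However, now that large language models have been shown to reproduce dialogue in a wide range of situations conditioned on personality, it has become possible to model people's heterogeneity and interactions in detail. In this presentation, we survey existing research applying large language models to social simulation and point out that its challenge lies in the absence of a framework for adjusting the behavior of large language models so that they reproduce specific people and social situations. Based on this, we discuss the method we are working on for calibrating large language models.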

Recommended citation: 山田広明, 山田寛章, 平岡達也. 言語で柔軟に操作可能なWhat-if社会シミュレータ. 人工知能学会第二種研究会資料, 2025 (BI-026), 2025年3月. https://doi.org/10.11517/jsaisigtwo.2025.BI-026_11

publications_thesis

Extracting argument structure from Japanese judgment documents for structure-based summarisation

Published: March 26, 2021

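This thesis proposes methods for automatically extracting argument structure from Japanese judgment documents, with the aim of applying it to their automatic summarization. Judges, prosecutors, lawyers, and others involved in the practice of law spend an enormous amount of time researching related past cases. Judgment documents, the most important record of a trial, often run to dozens of pages and use long, complex sentences, so even experts need time to analyze them. Support for experts is therefore essential, and computer-generated summaries of judgment documents are of great significance. A judgment document is a judge's written record of legal argumentation, and an important characteristic is that it has a hierarchical argument structure whose top level is the decision, the judge's final judgment. A hierarchical argument structure is one in which an argument supports another argument as its grounds. The argument for each point at issue, called an "Issue Topic" (争点), supports the decision, and the conclusion of each Issue Topic is in turn supported by arguments at lower levels. This thesis therefore proposes a framework for a system that automatically extracts this structure and applies it to the summarization of judgment documents. The contributions of this thesis are the four components of this framework: 1) a formulation of argument structure extraction and the establishment of guidelines for human annotation, 2) the construction of a corpus of Japanese judgment documents based on the formulated argument structure extraction task, 3) models for automatic argument structure extraction, and 4) the application of argument structure to automatic summarization of judgment documents.

The proposed argument structure extraction task consists of four subtasks: rhetorical role classification, argumentative support relation extraction, Issue Topic identification, and Issue Topic linking. Rhetorical role classification is the task of classifying the role each sentence plays in the document; this thesis defines seven categories in total, including "conclusion" and "citation of or reference to legal provisions". Argumentative support relation extraction is the task of identifying, among relations between sentences, support relations in which one sentence provides grounds and the other develops a claim based on those grounds. Issue Topic identification identifies the sentences containing the topics presented as the central issues of a judgment document. Issue Topic linking links each sentence in the judgment document to the identified Issue Topics. To verify that human annotation can be performed stably for each task, we measured inter-annotator agreement, including Cohen's Kappa, and confirmed that annotation of each task can be carried out stably. We annotated each of the proposed tasks and built a corpus of judgment documents annotated with argument structure, the first in the Japanese legal domain. The corpus consists of 120 civil judgment documents and amounts to about 45,000 sentences, or 3.2 million characters. In addition, each judgment document in the corpus is accompanied by a summary written by experts.

Based on the constructed corpus, we proposed automatic extraction methods for each argument structure task. A notable contribution of this thesis is that it focuses on the relationship between the section headings present in judgment documents and argument structure, and incorporates heading information into the automatic extraction methods. For rhetorical role classification, building on a model that uses a hierarchical recurrent neural network (RNN) to take inter-sentence context into account, we proposed a method that predicts the rhetorical role of each sentence while also considering features from an independent heading encoder dedicated to processing the heading that the sentence belongs to. We also proposed a method that jointly learns an auxiliary task in which the heading encoder predicts, from a heading, the set of rhetorical roles that sentences under that heading can take. Both proposed models performed significantly better than the conventional method based on a hierarchical RNN model. For argumentative support relation extraction, because the supporting and supported sentences of a support relation take particular rhetorical roles, we proposed, in addition to a method trained on the support relation extraction task alone, a method that jointly learns rhetorical role classification, and ran comparative experiments. The results showed that joint learning with rhetorical role classification significantly improves performance on support relation extraction. For the Issue Topic identification and linking tasks, we automated identification and linking by fine-tuning the pretrained model BERT on each task. For Issue Topic identification, we showed that performance improves significantly when the model is trained with each input sentence augmented by the heading it belongs to and the headings above it in the hierarchy. For Issue Topic linking, exploiting the fact that sentences under the same heading are linked to the same Issue Topic, we reduced the task to binary classification of Issue Topic-heading pairs.

To verify that considering argument structure contributes to better summarization, we ran experiments comparing a summarizer equipped with a mechanism that uses argument structure to guide summary content against an ordinary summarizer. The guidance mechanism consists of a pre-processing stage, which uses rhetorical role classification and heading information to control the input to the summarizer, and a post-processing stage, which uses Issue Topic information and argumentative support relations to edit the output of the summarizer. In experiments using automatically extracted argument structure in the guidance mechanism, a significant improvement was observed in evaluation based on ROUGE-1; in experiments using the argument structure manually annotated in the corpus, the guidance mechanism yielded significant improvements in evaluation based on ROUGE-1, 2, and L. In sum, this thesis formulated an argument structure extraction task consisting of four subtasks, provided stable annotation guidelines for it, and proposed extraction models that exploit headings to automate each subtask. We conclude that the framework for automatic summarization using argument structure contributes to improved summarization performance, although the accuracy of argument structure extraction needs further improvement.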
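For readers unfamiliar with the agreement measure named in this abstract, the following is a minimal Python sketch of unweighted Cohen's Kappa for two annotators. The function name and the toy labels are hypothetical; the thesis may use other variants or additional agreement measures, which this sketch does not attempt to reproduce.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's Kappa for two annotators over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical rhetorical-role labels from two annotators.
    annotator_1 = ["CONCLUSION", "BACKGROUND", "CONCLUSION", "BACKGROUND"]
    annotator_2 = ["CONCLUSION", "BACKGROUND", "BACKGROUND", "BACKGROUND"]
    print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is commonly preferred over simple percentage agreement when label distributions are skewed.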

Recommended citation: Hiroaki Yamada. 2021. Extracting argument structure from Japanese judgment documents for structure-based summarisation. Doctoral thesis.