{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "89761690", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'C:\\\\Users\\\\Monoid\\\\anaconda3\\\\envs\\\\nn\\\\python.exe'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sys\n", "sys.executable" ] }, { "cell_type": "markdown", "id": "2303b263", "metadata": {}, "source": [ "먼저 파이썬 환경을 살펴봅니다." ] }, { "cell_type": "markdown", "id": "d9d4f2a3", "metadata": {}, "source": [ "개인적으로 이것을 하면서 가장 어려웠던 것은 환경을 구축하는 것 이였습니다. conda install로 설치했는데 transformers는 version 4.0.0 버전이 설치되지 않고 2.1.1가 설치되었어요. [Issue : Support transformers > 2.1.1 on Windows](https://github.com/conda-forge/transformers-feedstock/issues/16) 이거랑 무슨 연관이 있는 걸까요? 어쨋든, 해결하기위해서 transformers를 먼저 깔고 pytorch를 설치했어요. python 3.7에서 실행되더라고요." ] }, { "cell_type": "code", "execution_count": 2, "id": "f3ab8b44", "metadata": {}, "outputs": [], "source": [ "from read_data import readKoreanDataAll" ] }, { "cell_type": "code", "execution_count": 3, "id": "a5deb97e", "metadata": {}, "outputs": [], "source": [ "train, dev, test = readKoreanDataAll()" ] }, { "cell_type": "markdown", "id": "b3b5cede", "metadata": {}, "source": [ "데이터를 가져옵니다. 데이터는 단순히 리스트에 담겨있습니다." ] }, { "cell_type": "code", "execution_count": 4, "id": "13165c8c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sentence(word=['특히', '김병현', '은', '4', '회', '말', '에', '무', '기력', '하', '게', '6', '실점', '하', '면서'], pos=['MAG', 'NNP', 'JX', 'SN', 'NNB', 'NNG', 'JKB', 'XPN', 'NNG', 'XSA', 'EC', 'SN', 'NNG', 'XSV', 'EC'], namedEntity=['B', 'B', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I'], detail=['O', 'B-PS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[0]" ] }, { "cell_type": "markdown", "id": "b2baf1e3", "metadata": {}, "source": [ "0번 데이터를 확인해보면 이렇게 나옵니다.\n", "`word`, `pos`, `namedEntity`, `detail`로 나누었습니다." ] }, { "cell_type": "code", "execution_count": 5, "id": "3ee47f72", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "특히 김병현 은 4 회 말 에 무 기력 하 게 6 실점 하 면서\n" ] } ], "source": [ "sentence0 = \" \".join(train[0].word)\n", "print(sentence0)" ] }, { "cell_type": "markdown", "id": "d7bf4a99", "metadata": {}, "source": [ "문장 하나는 이렇게 될 것이고요. " ] }, { "cell_type": "code", "execution_count": 6, "id": "2939cd62", "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 7, "id": "da89bd0e", "metadata": {}, "outputs": [], "source": [ "longest_sentence_index = np.argmax([len(lst.word) for lst in train])" ] }, { "cell_type": "code", "execution_count": 8, "id": "3e4006b1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "245" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train[longest_sentence_index].word)" ] }, { "cell_type": "markdown", "id": "d54d0558", "metadata": {}, "source": [ "가장 긴거는 245 길이이고 이정도면 Bert에서 요구하는 512 token보다 짧기에 괜찮다." 
{ "cell_type": "code", "execution_count": 9, "id": "f6460490", "metadata": {}, "outputs": [], "source": [ "from transformers import BertTokenizer" ] }, { "cell_type": "code", "execution_count": 10, "id": "c3506c83", "metadata": {}, "outputs": [], "source": [ "PRETRAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n", "tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)" ] }, { "cell_type": "markdown", "id": "4f3045a1", "metadata": {}, "source": [ "Load the tokenizer as above. Now let's try it out." ] }, { "cell_type": "code", "execution_count": 11, "id": "c7b02470", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['특히', '김', '##병', '##현', '은', '4', '회', '말', '에', '무', '기', '##력', '하', '게', '6', '실', '##점', '하', '면', '##서']\n" ] } ], "source": [ "morph_to_tokens = tokenizer.tokenize(sentence0)\n", "print(morph_to_tokens)" ] }, { "cell_type": "markdown", "id": "78ba07a0", "metadata": {}, "source": [ "It works nicely." ] }, { "cell_type": "code", "execution_count": 12, "id": "0dbc6dc5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[39671,\n", " 8935,\n", " 73380,\n", " 30842,\n", " 9632,\n", " 125,\n", " 9998,\n", " 9251,\n", " 9559,\n", " 9294,\n", " 8932,\n", " 28143,\n", " 9952,\n", " 8872,\n", " 127,\n", " 9489,\n", " 34907,\n", " 9952,\n", " 9279,\n", " 12424]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs = tokenizer.convert_tokens_to_ids(morph_to_tokens)\n", "inputs" ] }, { "cell_type": "markdown", "id": "6bd8ca5b", "metadata": {}, "source": [ "The token ids can be obtained this way." ] }, { "cell_type": "code", "execution_count": 13, "id": "d7f74b0c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([1, 22])\n" ] }, { "data": { "text/plain": [ "{'input_ids': tensor([[ 101, 39671, 8935, 73380, 30842, 9632, 125, 9998, 9251, 9559,\n", " 9294, 8932, 28143, 9952, 8872, 127, 9489, 34907, 9952, 9279,\n", " 12424, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs = tokenizer(sentence0, return_tensors='pt')\n", "print(inputs['input_ids'].size())\n", "inputs" ] }, { "cell_type": "markdown", "id": "1a840377", "metadata": {}, "source": [ "Or you can get them like this. The difference is that the \\[CLS\\] token (101) and the \\[SEP\\] token (102) are inserted automatically; an attention_mask is built alongside, and the results come back as tensors." ] }, { "cell_type": "code", "execution_count": 14, "id": "e7ab65d2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([1, 22])\n" ] }, { "data": { "text/plain": [ "{'input_ids': tensor([[ 101, 39671, 8935, 73380, 30842, 9632, 125, 9998, 9251, 9559,\n", " 9294, 8932, 28143, 9952, 8872, 127, 9489, 34907, 9952, 9279,\n", " 12424, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs = tokenizer(sentence0, return_tensors='pt', padding='longest', truncation=True)\n", "print(inputs['input_ids'].size())\n", "inputs" ] }, { "cell_type": "markdown", "id": "4e06a1a7", "metadata": {}, "source": [ "There are padding and truncation options. The truncation option makes no difference here, since nothing exceeds 512 tokens.\n", "\n", "The padding option can be set as follows (from the transformers docs):\n", "\n", "padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:\n", "- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).\n", "- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.\n", "- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).\n", "\n" ] },
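{ "cell_type": "markdown", "id": "a1f20c13", "metadata": {}, "source": [ "To see padding in action, a minimal sketch with a two-sentence batch: the shorter sentence is padded with \\[PAD\\] (id 0) up to the longest one, and attention_mask zeroes those positions out." ] }, { "cell_type": "code", "execution_count": null, "id": "a1f20c14", "metadata": {}, "outputs": [], "source": [ "# Sketch: batch two sentences of different lengths, padding to the longest.\n", "batch = tokenizer([\"특히 김병현 은\", sentence0],\n", "                  return_tensors='pt', padding='longest', truncation=True)\n", "print(batch['input_ids'].size())    # both rows share the padded length\n", "print(batch['attention_mask'][0])   # trailing 0s mark the [PAD] positions" ] },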
{ "cell_type": "code", "execution_count": 15, "id": "74505ddd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['[CLS]',\n", " '특히',\n", " '김',\n", " '##병',\n", " '##현',\n", " '은',\n", " '4',\n", " '회',\n", " '말',\n", " '에',\n", " '무',\n", " '기',\n", " '##력',\n", " '하',\n", " '게',\n", " '6',\n", " '실',\n", " '##점',\n", " '하',\n", " '면',\n", " '##서',\n", " '[SEP]']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])" ] }, { "cell_type": "markdown", "id": "0bd66d52", "metadata": {}, "source": [ "Converting ids back to tokens works as shown above." ] }, { "cell_type": "markdown", "id": "a4a1d9f8", "metadata": {}, "source": [ "Now let's use BERT itself." ] }, { "cell_type": "code", "execution_count": 16, "id": "3dba1967", "metadata": {}, "outputs": [], "source": [ "from transformers import BertModel" ] }, { "cell_type": "code", "execution_count": 17, "id": "c762970c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']\n", "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" ] } ], "source": [ "PRETRAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n", "bert = BertModel.from_pretrained(PRETRAINED_MODEL_NAME)" ] }, { "cell_type": "markdown", "id": "22aea316", "metadata": {}, "source": [ "Load BERT, again using bert-base-multilingual-cased. The warning about unused `cls.*` weights is expected: the checkpoint carries the pre-training heads, which plain `BertModel` does not use." ] },
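{ "cell_type": "markdown", "id": "a1f20c15", "metadata": {}, "source": [ "Since we only extract features here (no fine-tuning yet), it is common to put the model in eval mode and disable gradient tracking. A minimal sketch (the cells below call the model directly for brevity):" ] }, { "cell_type": "code", "execution_count": null, "id": "a1f20c16", "metadata": {}, "outputs": [], "source": [ "import torch\n", "\n", "bert.eval()            # disable dropout for deterministic outputs\n", "with torch.no_grad():  # skip gradient bookkeeping during inference\n", "    _ = bert(**inputs)" ] },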
{ "cell_type": "code", "execution_count": 18, "id": "2e9e3b82", "metadata": {}, "outputs": [], "source": [ "outputs = bert(**inputs)" ] }, { "cell_type": "code", "execution_count": 19, "id": "6798d3d7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "odict_keys(['last_hidden_state', 'pooler_output'])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outputs.keys()" ] }, { "cell_type": "code", "execution_count": 21, "id": "7b1f9a60", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 22, 768])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outputs['last_hidden_state'].size()" ] }, { "cell_type": "code", "execution_count": 22, "id": "14315cd6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 768])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outputs['pooler_output'].size()" ] }, { "cell_type": "markdown", "id": "ef641edf", "metadata": {}, "source": [ "The representation is 768-dimensional. Since our input has 22 tokens, last_hidden_state has shape (1, 22, 768): one 768-dimensional vector per token. pooler_output is a single 768-dimensional vector for the whole sequence." ] }, { "cell_type": "code", "execution_count": 23, "id": "2b173e84", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BertConfig {\n", " \"_name_or_path\": \"bert-base-multilingual-cased\",\n", " \"architectures\": [\n", " \"BertForMaskedLM\"\n", " ],\n", " \"attention_probs_dropout_prob\": 0.1,\n", " \"classifier_dropout\": null,\n", " \"directionality\": \"bidi\",\n", " \"hidden_act\": \"gelu\",\n", " \"hidden_dropout_prob\": 0.1,\n", " \"hidden_size\": 768,\n", " \"initializer_range\": 0.02,\n", " \"intermediate_size\": 3072,\n", " \"layer_norm_eps\": 1e-12,\n", " \"max_position_embeddings\": 512,\n", " \"model_type\": \"bert\",\n", " \"num_attention_heads\": 12,\n", " \"num_hidden_layers\": 12,\n", " \"pad_token_id\": 0,\n", " \"pooler_fc_size\": 768,\n", " \"pooler_num_attention_heads\": 12,\n", " \"pooler_num_fc_layers\": 3,\n", " \"pooler_size_per_head\": 128,\n", " \"pooler_type\": \"first_token_transform\",\n", " \"position_embedding_type\": \"absolute\",\n", " \"transformers_version\": \"4.16.2\",\n", " \"type_vocab_size\": 2,\n", " \"use_cache\": true,\n", " \"vocab_size\": 119547\n", "}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bert.config" ] }, { "cell_type": "markdown", "id": "f7a9944a", "metadata": {}, "source": [ "Let's take a quick look at the config before wrapping up: max_position_embeddings confirms the 512-token limit, and hidden_size the 768-dimensional representations." ] }, { "cell_type": "code", "execution_count": null, "id": "d48edc53", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" } }, "nbformat": 4, "nbformat_minor": 5 }