{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "89761690",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'C:\\\\Users\\\\Monoid\\\\anaconda3\\\\envs\\\\nn\\\\python.exe'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"sys.executable"
]
},
{
"cell_type": "markdown",
"id": "2303b263",
"metadata": {},
"source": [
"먼저 파이썬 환경을 살펴봅니다."
]
},
{
"cell_type": "markdown",
"id": "d9d4f2a3",
"metadata": {},
"source": [
"개인적으로 이것을 하면서 가장 어려웠던 것은 환경을 구축하는 것 이였습니다. conda install로 설치했는데 transformers는 version 4.0.0 버전이 설치되지 않고 2.1.1가 설치되었어요. [Issue : Support transformers > 2.1.1 on Windows](https://github.com/conda-forge/transformers-feedstock/issues/16) 이거랑 무슨 연관이 있는 걸까요? 어쨋든, 해결하기위해서 transformers를 먼저 깔고 pytorch를 설치했어요. python 3.7에서 실행되더라고요."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f3ab8b44",
"metadata": {},
"outputs": [],
"source": [
"from read_data import readKoreanDataAll"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a5deb97e",
"metadata": {},
"outputs": [],
"source": [
"train, dev, test = readKoreanDataAll()"
]
},
{
"cell_type": "markdown",
"id": "b3b5cede",
"metadata": {},
"source": [
"데이터를 가져옵니다. 데이터는 단순히 리스트에 담겨있습니다."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "13165c8c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sentence(word=['특히', '김병현', '은', '4', '회', '말', '에', '무', '기력', '하', '게', '6', '실점', '하', '면서'], pos=['MAG', 'NNP', 'JX', 'SN', 'NNB', 'NNG', 'JKB', 'XPN', 'NNG', 'XSA', 'EC', 'SN', 'NNG', 'XSV', 'EC'], namedEntity=['B', 'B', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I'], detail=['O', 'B-PS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train[0]"
]
},
{
"cell_type": "markdown",
"id": "b2baf1e3",
"metadata": {},
"source": [
"0번 데이터를 확인해보면 이렇게 나옵니다.\n",
"`word`, `pos`, `namedEntity`, `detail`로 나누었습니다."
]
},
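{
"cell_type": "markdown",
"id": "0a1b2c3d",
"metadata": {},
"source": [
"To see how the tags line up with the words, we can print them side by side (a minimal inspection sketch, assuming only the four fields shown above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a1b2c4e",
"metadata": {},
"outputs": [],
"source": [
"# Print each word next to its NE boundary tag and detail tag\n",
"for word, ne, detail in zip(train[0].word, train[0].namedEntity, train[0].detail):\n",
"    print(f\"{word}\\t{ne}\\t{detail}\")"
]
},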
{
"cell_type": "code",
"execution_count": 5,
"id": "3ee47f72",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"특히 김병현 은 4 회 말 에 무 기력 하 게 6 실점 하 면서\n"
]
}
],
"source": [
"sentence0 = \" \".join(train[0].word)\n",
"print(sentence0)"
]
},
{
"cell_type": "markdown",
"id": "d7bf4a99",
"metadata": {},
"source": [
"문장 하나는 이렇게 될 것이고요. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2939cd62",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "da89bd0e",
"metadata": {},
"outputs": [],
"source": [
"longest_sentence_index = np.argmax([len(lst.word) for lst in train])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3e4006b1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"245"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train[longest_sentence_index].word)"
]
},
{
"cell_type": "markdown",
"id": "d54d0558",
"metadata": {},
"source": [
"가장 긴거는 245 길이이고 이정도면 Bert에서 요구하는 512 token보다 짧기에 괜찮다."
]
},
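{
"cell_type": "markdown",
"id": "1b2c3d4e",
"metadata": {},
"source": [
"As a quick sanity check we can also look at the overall length distribution (a numpy sketch; note these are word counts, and the WordPiece tokenizer below may split one word into several tokens)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b2c3d5f",
"metadata": {},
"outputs": [],
"source": [
"# Word counts per sentence: max, mean, and the 99th percentile\n",
"lengths = np.array([len(s.word) for s in train])\n",
"lengths.max(), lengths.mean(), np.percentile(lengths, 99)"
]
},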
{
"cell_type": "code",
"execution_count": 9,
"id": "f6460490",
"metadata": {},
"outputs": [],
"source": [
"from transformers import BertTokenizer"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c3506c83",
"metadata": {},
"outputs": [],
"source": [
"PRETAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n",
"tokenizer = BertTokenizer.from_pretrained(PRETAINED_MODEL_NAME)"
]
},
{
"cell_type": "markdown",
"id": "4f3045a1",
"metadata": {},
"source": [
"토크나이저를 다음코드로 불러옵니다. 이제 사용을 해보겠습니다."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c7b02470",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['특히', '김', '##병', '##현', '은', '4', '회', '말', '에', '무', '기', '##력', '하', '게', '6', '실', '##점', '하', '면', '##서']\n"
]
}
],
"source": [
"morph_to_tokens = tokenizer.tokenize(sentence0)\n",
"print(morph_to_tokens)"
]
},
{
"cell_type": "markdown",
"id": "78ba07a0",
"metadata": {},
"source": [
"잘 작동합니다."
]
},
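{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"WordPiece splits some words into several subword tokens ('김병현' became three pieces above), so the per-word NE tags no longer line up one-to-one with the tokens. One common workaround, sketched below, is to tokenize word by word and repeat each word's label over its subwords; labeling only the first subword is an equally common alternative, and this notebook does not fix that choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c3d4e60",
"metadata": {},
"outputs": [],
"source": [
"# Align per-word labels with WordPiece tokens by tokenizing word by word;\n",
"# here each word's label is simply repeated over its subwords\n",
"tokens, labels = [], []\n",
"for word, tag in zip(train[0].word, train[0].detail):\n",
"    pieces = tokenizer.tokenize(word)\n",
"    tokens.extend(pieces)\n",
"    labels.extend([tag] * len(pieces))\n",
"list(zip(tokens, labels))[:6]"
]
},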
{
"cell_type": "code",
"execution_count": 12,
"id": "0dbc6dc5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[39671,\n",
" 8935,\n",
" 73380,\n",
" 30842,\n",
" 9632,\n",
" 125,\n",
" 9998,\n",
" 9251,\n",
" 9559,\n",
" 9294,\n",
" 8932,\n",
" 28143,\n",
" 9952,\n",
" 8872,\n",
" 127,\n",
" 9489,\n",
" 34907,\n",
" 9952,\n",
" 9279,\n",
" 12424]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs = tokenizer.convert_tokens_to_ids(morph_to_tokens)\n",
"inputs"
]
},
{
"cell_type": "markdown",
"id": "6bd8ca5b",
"metadata": {},
"source": [
"아이디는 이렇게도 얻을 수 있어요."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d7f74b0c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch.Size([1, 22])\n"
]
},
{
"data": {
"text/plain": [
"{'input_ids': tensor([[ 101, 39671, 8935, 73380, 30842, 9632, 125, 9998, 9251, 9559,\n",
" 9294, 8932, 28143, 9952, 8872, 127, 9489, 34907, 9952, 9279,\n",
" 12424, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs = tokenizer(sentence0, return_tensors='pt')\n",
"print(inputs['input_ids'].size())\n",
"inputs"
]
},
{
"cell_type": "markdown",
"id": "1a840377",
"metadata": {},
"source": [
"아니면 이렇게 얻을 수 있어요. 차이점은 \\[CLS\\] 토큰(101) 과 \\[SEP\\] 토큰(102)이 자동으로 삽입됩니다. attention_mask도 같이 만들어 주고 tensor로 나와요."
]
},
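{
"cell_type": "markdown",
"id": "3d4e5f60",
"metadata": {},
"source": [
"You can confirm this with `tokenizer.decode`, which turns the IDs back into a string, special tokens included."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d4e5f71",
"metadata": {},
"outputs": [],
"source": [
"# Decode the IDs back to text; [CLS] and [SEP] appear at the ends\n",
"tokenizer.decode(inputs['input_ids'][0])"
]
},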
{
"cell_type": "code",
"execution_count": 14,
"id": "e7ab65d2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch.Size([1, 22])\n"
]
},
{
"data": {
"text/plain": [
"{'input_ids': tensor([[ 101, 39671, 8935, 73380, 30842, 9632, 125, 9998, 9251, 9559,\n",
" 9294, 8932, 28143, 9952, 8872, 127, 9489, 34907, 9952, 9279,\n",
" 12424, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs = tokenizer(sentence0, return_tensors='pt', padding='longest', truncation=True)\n",
"print(inputs['input_ids'].size())\n",
"inputs"
]
},
{
"cell_type": "markdown",
"id": "4e06a1a7",
"metadata": {},
"source": [
"padding 옵션과 truncation 옵션이 있다. truncation 옵션은 512 개가 넘어가지 않는 한 별 상관이 없다.\n",
"\n",
"패딩 옵션은 다음과 같이 설정할 수 있다.\n",
"\n",
"padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:\n",
"- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).\n",
"- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.\n",
"- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).\n",
"\n"
]
},
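{
"cell_type": "markdown",
"id": "4e5f6071",
"metadata": {},
"source": [
"Padding only matters once sentences of different lengths are batched together. In the sketch below (the short second sentence is just a fragment of the first, for illustration), the shorter sequence is padded with the pad token ID 0 and its attention_mask is 0 at the padded positions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e5f6082",
"metadata": {},
"outputs": [],
"source": [
"# Batch two sentences of different lengths; the shorter one is padded\n",
"batch = tokenizer([sentence0, \"특히 김병현 은\"], return_tensors='pt',\n",
"                  padding='longest', truncation=True)\n",
"print(batch['input_ids'].size())\n",
"batch['attention_mask']"
]
},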
{
"cell_type": "code",
"execution_count": 15,
"id": "74505ddd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['[CLS]',\n",
" '특히',\n",
" '김',\n",
" '##병',\n",
" '##현',\n",
" '은',\n",
" '4',\n",
" '회',\n",
" '말',\n",
" '에',\n",
" '무',\n",
" '기',\n",
" '##력',\n",
" '하',\n",
" '게',\n",
" '6',\n",
" '실',\n",
" '##점',\n",
" '하',\n",
" '면',\n",
" '##서',\n",
" '[SEP]']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])"
]
},
{
"cell_type": "markdown",
"id": "0bd66d52",
"metadata": {},
"source": [
"원래대로 돌릴려면 다음과 같이 하면 되요."
]
},
{
"cell_type": "markdown",
"id": "a4a1d9f8",
"metadata": {},
"source": [
"이제 BERT를 사용해보아요."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "3dba1967",
"metadata": {},
"outputs": [],
"source": [
"from transformers import BertModel"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "c762970c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']\n",
"- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
]
}
],
"source": [
"PRETAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n",
"bert = BertModel.from_pretrained(PRETAINED_MODEL_NAME)"
]
},
{
"cell_type": "markdown",
"id": "22aea316",
"metadata": {},
"source": [
"Bert를 불러옵니다. bert-base-multilingual-cased를 써요."
]
},
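{
"cell_type": "markdown",
"id": "5f607182",
"metadata": {},
"source": [
"Since we only run inference here, it is common practice to put the model in eval mode, which disables dropout; forward passes would then typically also be wrapped in `torch.no_grad()` so no gradients are tracked. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f607193",
"metadata": {},
"outputs": [],
"source": [
"# Eval mode disables dropout so repeated forward passes are deterministic;\n",
"# wrapping calls in torch.no_grad() additionally skips gradient tracking\n",
"_ = bert.eval()"
]
},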
{
"cell_type": "code",
"execution_count": 18,
"id": "2e9e3b82",
"metadata": {},
"outputs": [],
"source": [
"outputs = bert(**inputs)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "6798d3d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"odict_keys(['last_hidden_state', 'pooler_output'])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outputs.keys()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "7b1f9a60",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 22, 768])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outputs['last_hidden_state'].size()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "14315cd6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 768])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outputs['pooler_output'].size()"
]
},
{
"cell_type": "markdown",
"id": "ef641edf",
"metadata": {},
"source": [
"표현의 차원은 768차원 입니다. last_hidden_state는 워드 갯수가 22개이니 22차원이 나와요."
]
},
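{
"cell_type": "markdown",
"id": "60718293",
"metadata": {},
"source": [
"For a token-level task like NER, the per-token vectors in `last_hidden_state` are what a classifier head would consume. Dropping the \\[CLS\\] and \\[SEP\\] positions leaves one 768-dimensional vector per WordPiece token (a sketch, reusing the single-sentence `inputs` from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "607182a4",
"metadata": {},
"outputs": [],
"source": [
"# Per-token representations for NER: drop [CLS] (first) and [SEP] (last)\n",
"token_reprs = outputs['last_hidden_state'][0, 1:-1]\n",
"token_reprs.size()  # 20 WordPiece tokens, each a 768-dim vector"
]
},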
{
"cell_type": "code",
"execution_count": 23,
"id": "2b173e84",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"BertConfig {\n",
" \"_name_or_path\": \"bert-base-multilingual-cased\",\n",
" \"architectures\": [\n",
" \"BertForMaskedLM\"\n",
" ],\n",
" \"attention_probs_dropout_prob\": 0.1,\n",
" \"classifier_dropout\": null,\n",
" \"directionality\": \"bidi\",\n",
" \"hidden_act\": \"gelu\",\n",
" \"hidden_dropout_prob\": 0.1,\n",
" \"hidden_size\": 768,\n",
" \"initializer_range\": 0.02,\n",
" \"intermediate_size\": 3072,\n",
" \"layer_norm_eps\": 1e-12,\n",
" \"max_position_embeddings\": 512,\n",
" \"model_type\": \"bert\",\n",
" \"num_attention_heads\": 12,\n",
" \"num_hidden_layers\": 12,\n",
" \"pad_token_id\": 0,\n",
" \"pooler_fc_size\": 768,\n",
" \"pooler_num_attention_heads\": 12,\n",
" \"pooler_num_fc_layers\": 3,\n",
" \"pooler_size_per_head\": 128,\n",
" \"pooler_type\": \"first_token_transform\",\n",
" \"position_embedding_type\": \"absolute\",\n",
" \"transformers_version\": \"4.16.2\",\n",
" \"type_vocab_size\": 2,\n",
" \"use_cache\": true,\n",
" \"vocab_size\": 119547\n",
"}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bert.config"
]
},
{
"cell_type": "markdown",
"id": "f7a9944a",
"metadata": {},
"source": [
"Config 한번 보고 가요."
]
},
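{
"cell_type": "markdown",
"id": "718293a4",
"metadata": {},
"source": [
"Individual fields are available as attributes, which is handy for sizing downstream layers instead of hard-coding dimensions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "718293b5",
"metadata": {},
"outputs": [],
"source": [
"# Read model dimensions from the config instead of hard-coding them\n",
"bert.config.hidden_size, bert.config.max_position_embeddings, bert.config.vocab_size"
]
},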
{
"cell_type": "code",
"execution_count": null,
"id": "d48edc53",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}