{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "89761690",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'C:\\\\Users\\\\Monoid\\\\anaconda3\\\\envs\\\\nn\\\\python.exe'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sys\n",
"sys.executable"
]
},
{
"cell_type": "markdown",
"id": "2303b263",
"metadata": {},
"source": [
"먼저 파이썬 환경을 살펴봅니다."
]
},
{
"cell_type": "markdown",
"id": "d9d4f2a3",
"metadata": {},
"source": [
"개인적으로 이것을 하면서 가장 어려웠던 것은 환경을 구축하는 것 이였습니다. conda install로 설치했는데 transformers는 version 4.0.0 버전이 설치되지 않고 2.1.1가 설치되었어요. [Issue : Support transformers > 2.1.1 on Windows](https://github.com/conda-forge/transformers-feedstock/issues/16) 이거랑 무슨 연관이 있는 걸까요? 어쨋든, 해결하기위해서 transformers를 먼저 깔고 pytorch를 설치했어요. python 3.7에서 실행되더라고요."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f3ab8b44",
"metadata": {},
"outputs": [],
"source": [
"from read_data import readKoreanDataAll"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a5deb97e",
"metadata": {},
"outputs": [],
"source": [
"train, dev, test = readKoreanDataAll()"
]
},
{
"cell_type": "markdown",
"id": "b3b5cede",
"metadata": {},
"source": [
"데이터를 가져옵니다. 데이터는 단순히 리스트에 담겨있습니다."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "13165c8c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Sentence(word=['특히', '김병현', '은', '4', '회', '말', '에', '무', '기력', '하', '게', '6', '실점', '하', '면서'], pos=['MAG', 'NNP', 'JX', 'SN', 'NNB', 'NNG', 'JKB', 'XPN', 'NNG', 'XSA', 'EC', 'SN', 'NNG', 'XSV', 'EC'], namedEntity=['B', 'B', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I', 'B', 'I', 'I', 'I'], detail=['O', 'B-PS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train[0]"
]
},
{
"cell_type": "markdown",
"id": "b2baf1e3",
"metadata": {},
"source": [
"0번 데이터를 확인해보면 이렇게 나옵니다.\n",
"`word`, `pos`, `namedEntity`, `detail`로 나누었습니다."
]
},
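{
"cell_type": "markdown",
"id": "0a1b2c3d",
"metadata": {},
"source": [
"To see how the tags line up with the words, we can print them side by side (a minimal inspection sketch, assuming only the four fields shown above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a1b2c4e",
"metadata": {},
"outputs": [],
"source": [
"# Print each word next to its NE boundary tag and detail tag\n",
"for word, ne, detail in zip(train[0].word, train[0].namedEntity, train[0].detail):\n",
"    print(f\"{word}\\t{ne}\\t{detail}\")"
]
},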
{
"cell_type": "code",
"execution_count": 5,
"id": "3ee47f72",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"특히 김병현 은 4 회 말 에 무 기력 하 게 6 실점 하 면서\n"
]
}
],
"source": [
"sentence0 = \" \".join(train[0].word)\n",
"print(sentence0)"
]
},
{
"cell_type": "markdown",
"id": "d7bf4a99",
"metadata": {},
"source": [
"문장 하나는 이렇게 될 것이고요. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2939cd62",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "da89bd0e",
"metadata": {},
"outputs": [],
"source": [
"longest_sentence_index = np.argmax([len(lst.word) for lst in train])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3e4006b1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"245"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train[longest_sentence_index].word)"
]
},
{
"cell_type": "markdown",
"id": "d54d0558",
"metadata": {},
"source": [
"가장 긴거는 245 길이이고 이정도면 Bert에서 요구하는 512 token보다 짧기에 괜찮다."
]
},
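{
"cell_type": "markdown",
"id": "1b2c3d4e",
"metadata": {},
"source": [
"As a quick sanity check we can also look at the overall length distribution (a numpy sketch; note these are word counts, and the WordPiece tokenizer below may split one word into several tokens)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b2c3d5f",
"metadata": {},
"outputs": [],
"source": [
"# Word counts per sentence: max, mean, and the 99th percentile\n",
"lengths = np.array([len(s.word) for s in train])\n",
"lengths.max(), lengths.mean(), np.percentile(lengths, 99)"
]
},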
{
"cell_type": "code",
"execution_count": 9,
"id": "f6460490",
"metadata": {},
"outputs": [],
"source": [
"from transformers import BertTokenizer"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c3506c83",
"metadata": {},
"outputs": [],
"source": [
"PRETAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n",
"tokenizer = BertTokenizer.from_pretrained(PRETAINED_MODEL_NAME)"
]
},
{
"cell_type": "markdown",
"id": "4f3045a1",
"metadata": {},
"source": [
"토크나이저를 다음코드로 불러옵니다. 이제 사용을 해보겠습니다."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c7b02470",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['특히', '김', '##병', '##현', '은', '4', '회', '말', '에', '무', '기', '##력', '하', '게', '6', '실', '##점', '하', '면', '##서']\n"
]
}
],
"source": [
"morph_to_tokens = tokenizer.tokenize(sentence0)\n",
"print(morph_to_tokens)"
]
},
{
"cell_type": "markdown",
"id": "78ba07a0",
"metadata": {},
"source": [
"잘 작동합니다."
]
},
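{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"WordPiece splits some words into several subword tokens ('김병현' became three pieces above), so the per-word NE tags no longer line up one-to-one with the tokens. One common workaround, sketched below, is to tokenize word by word and repeat each word's label over its subwords; labeling only the first subword is an equally common alternative, and this notebook does not fix that choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c3d4e60",
"metadata": {},
"outputs": [],
"source": [
"# Align per-word labels with WordPiece tokens by tokenizing word by word;\n",
"# here each word's label is simply repeated over its subwords\n",
"tokens, labels = [], []\n",
"for word, tag in zip(train[0].word, train[0].detail):\n",
"    pieces = tokenizer.tokenize(word)\n",
"    tokens.extend(pieces)\n",
"    labels.extend([tag] * len(pieces))\n",
"list(zip(tokens, labels))[:6]"
]
},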
{
"cell_type": "code",
"execution_count": 12,
"id": "0dbc6dc5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[39671,\n",
" 8935,\n",
" 73380,\n",
" 30842,\n",
" 9632,\n",
" 125,\n",
" 9998,\n",
" 9251,\n",
" 9559,\n",
" 9294,\n",
" 8932,\n",
" 28143,\n",
" 9952,\n",
" 8872,\n",
" 127,\n",
" 9489,\n",
" 34907,\n",
" 9952,\n",
" 9279,\n",
" 12424]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs = tokenizer.convert_tokens_to_ids(morph_to_tokens)\n",
"inputs"
]
},
{
"cell_type": "markdown",
"id": "6bd8ca5b",
"metadata": {},
"source": [
"아이디는 이렇게도 얻을 수 있어요."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d7f74b0c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch.Size([1, 22])\n"
]
},
{
"data": {
"text/plain": [
"{'input_ids': tensor([[ 101, 39671, 8935, 73380, 30842, 9632, 125, 9998, 9251, 9559,\n",
" 9294, 8932, 28143, 9952, 8872, 127, 9489, 34907, 9952, 9279,\n",
" 12424, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs = tokenizer(sentence0, return_tensors='pt')\n",
"print(inputs['input_ids'].size())\n",
"inputs"
]
},
{
"cell_type": "markdown",
"id": "1a840377",
"metadata": {},
"source": [
"아니면 이렇게 얻을 수 있어요. 차이점은 \\[CLS\\] 토큰(101) 과 \\[SEP\\] 토큰(102)이 자동으로 삽입됩니다. attention_mask도 같이 만들어 주고 tensor로 나와요."
]
},
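{
"cell_type": "markdown",
"id": "3d4e5f60",
"metadata": {},
"source": [
"You can confirm this with `tokenizer.decode`, which turns the IDs back into a string, special tokens included."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d4e5f71",
"metadata": {},
"outputs": [],
"source": [
"# Decode the IDs back to text; [CLS] and [SEP] appear at the ends\n",
"tokenizer.decode(inputs['input_ids'][0])"
]
},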
{
"cell_type": "code",
"execution_count": 14,
"id": "e7ab65d2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch.Size([1, 22])\n"
]
},
{
"data": {
"text/plain": [
"{'input_ids': tensor([[ 101, 39671, 8935, 73380, 30842, 9632, 125, 9998, 9251, 9559,\n",
" 9294, 8932, 28143, 9952, 8872, 127, 9489, 34907, 9952, 9279,\n",
" 12424, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"inputs = tokenizer(sentence0, return_tensors='pt', padding='longest', truncation=True)\n",
"print(inputs['input_ids'].size())\n",
"inputs"
]
},
{
"cell_type": "markdown",
"id": "4e06a1a7",
"metadata": {},
"source": [
"padding 옵션과 truncation 옵션이 있다. truncation 옵션은 512 개가 넘어가지 않는 한 별 상관이 없다.\n",
"\n",
"패딩 옵션은 다음과 같이 설정할 수 있다.\n",
"\n",
"padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:\n",
"- True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).\n",
"- 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.\n",
"- False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).\n",
"\n"
]
},
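{
"cell_type": "markdown",
"id": "4e5f6071",
"metadata": {},
"source": [
"Padding only matters once sentences of different lengths are batched together. In the sketch below (the short second sentence is just a fragment of the first, for illustration), the shorter sequence is padded with the pad token ID 0 and its attention_mask is 0 at the padded positions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e5f6082",
"metadata": {},
"outputs": [],
"source": [
"# Batch two sentences of different lengths; the shorter one is padded\n",
"batch = tokenizer([sentence0, \"특히 김병현 은\"], return_tensors='pt',\n",
"                  padding='longest', truncation=True)\n",
"print(batch['input_ids'].size())\n",
"batch['attention_mask']"
]
},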
{
"cell_type": "code",
"execution_count": 15,
"id": "74505ddd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['[CLS]',\n",
" '특히',\n",
" '김',\n",
" '##병',\n",
" '##현',\n",
" '은',\n",
" '4',\n",
" '회',\n",
" '말',\n",
" '에',\n",
" '무',\n",
" '기',\n",
" '##력',\n",
" '하',\n",
" '게',\n",
" '6',\n",
" '실',\n",
" '##점',\n",
" '하',\n",
" '면',\n",
" '##서',\n",
" '[SEP]']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])"
]
},
{
"cell_type": "markdown",
"id": "0bd66d52",
"metadata": {},
"source": [
"원래대로 돌릴려면 다음과 같이 하면 되요."
]
},
{
"cell_type": "markdown",
"id": "a4a1d9f8",
"metadata": {},
"source": [
"이제 BERT를 사용해보아요."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "3dba1967",
"metadata": {},
"outputs": [],
"source": [
"from transformers import BertModel"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "c762970c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']\n",
"- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
]
}
],
"source": [
"PRETAINED_MODEL_NAME = 'bert-base-multilingual-cased'\n",
"bert = BertModel.from_pretrained(PRETAINED_MODEL_NAME)"
]
},
{
"cell_type": "markdown",
"id": "22aea316",
"metadata": {},
"source": [
"Bert를 불러옵니다. bert-base-multilingual-cased를 써요."
]
},
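{
"cell_type": "markdown",
"id": "5f607182",
"metadata": {},
"source": [
"Since we only run inference here, it is common practice to put the model in eval mode, which disables dropout; forward passes would then typically also be wrapped in `torch.no_grad()` so no gradients are tracked. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f607193",
"metadata": {},
"outputs": [],
"source": [
"# Eval mode disables dropout so repeated forward passes are deterministic;\n",
"# wrapping calls in torch.no_grad() additionally skips gradient tracking\n",
"_ = bert.eval()"
]
},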
{
"cell_type": "code",
"execution_count": 18,
"id": "2e9e3b82",
"metadata": {},
"outputs": [],
"source": [
"outputs = bert(**inputs)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "6798d3d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"odict_keys(['last_hidden_state', 'pooler_output'])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outputs.keys()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "7b1f9a60",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 22, 768])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outputs['last_hidden_state'].size()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "14315cd6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"torch.Size([1, 768])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"outputs['pooler_output'].size()"
]
},
{
"cell_type": "markdown",
"id": "ef641edf",
"metadata": {},
"source": [
"표현의 차원은 768차원 입니다. last_hidden_state는 워드 갯수가 22개이니 22차원이 나와요."
]
},
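{
"cell_type": "markdown",
"id": "60718293",
"metadata": {},
"source": [
"For a token-level task like NER, the per-token vectors in `last_hidden_state` are what a classifier head would consume. Dropping the \\[CLS\\] and \\[SEP\\] positions leaves one 768-dimensional vector per WordPiece token (a sketch, reusing the single-sentence `inputs` from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "607182a4",
"metadata": {},
"outputs": [],
"source": [
"# Per-token representations for NER: drop [CLS] (first) and [SEP] (last)\n",
"token_reprs = outputs['last_hidden_state'][0, 1:-1]\n",
"token_reprs.size()  # 20 WordPiece tokens, each a 768-dim vector"
]
},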
{
"cell_type": "code",
"execution_count": 23,
"id": "2b173e84",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"BertConfig {\n",
" \"_name_or_path\": \"bert-base-multilingual-cased\",\n",
" \"architectures\": [\n",
" \"BertForMaskedLM\"\n",
" ],\n",
" \"attention_probs_dropout_prob\": 0.1,\n",
" \"classifier_dropout\": null,\n",
" \"directionality\": \"bidi\",\n",
" \"hidden_act\": \"gelu\",\n",
" \"hidden_dropout_prob\": 0.1,\n",
" \"hidden_size\": 768,\n",
" \"initializer_range\": 0.02,\n",
" \"intermediate_size\": 3072,\n",
" \"layer_norm_eps\": 1e-12,\n",
" \"max_position_embeddings\": 512,\n",
" \"model_type\": \"bert\",\n",
" \"num_attention_heads\": 12,\n",
" \"num_hidden_layers\": 12,\n",
" \"pad_token_id\": 0,\n",
" \"pooler_fc_size\": 768,\n",
" \"pooler_num_attention_heads\": 12,\n",
" \"pooler_num_fc_layers\": 3,\n",
" \"pooler_size_per_head\": 128,\n",
" \"pooler_type\": \"first_token_transform\",\n",
" \"position_embedding_type\": \"absolute\",\n",
" \"transformers_version\": \"4.16.2\",\n",
" \"type_vocab_size\": 2,\n",
" \"use_cache\": true,\n",
" \"vocab_size\": 119547\n",
"}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bert.config"
]
},
{
"cell_type": "markdown",
"id": "f7a9944a",
"metadata": {},
"source": [
"Config 한번 보고 가요."
]
},
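{
"cell_type": "markdown",
"id": "718293a4",
"metadata": {},
"source": [
"Individual fields are available as attributes, which is handy for sizing downstream layers instead of hard-coding dimensions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "718293b5",
"metadata": {},
"outputs": [],
"source": [
"# Read model dimensions from the config instead of hard-coding them\n",
"bert.config.hidden_size, bert.config.max_position_embeddings, bert.config.vocab_size"
]
},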
{
"cell_type": "code",
"execution_count": null,
"id": "d48edc53",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}