Generating Text for Deep Learning Training with MATLAB

Generating Text Using Deep Learning

To follow along, you need two MATLAB toolboxes: Deep Learning Toolbox and Statistics and Machine Learning Toolbox.

This example walks through the text preprocessing needed to train an LSTM network to generate text.

To train a deep learning network for text generation, use a sequence-to-sequence LSTM network that predicts the next character in a sequence of characters.

To train the network to predict the next character, specify the responses as the input sequences shifted by one time step.
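As a small illustration of that shift, the sketch below pairs each character of a toy input with the character that follows it. The string 'hello' and the '$' end marker are placeholders chosen for this sketch and are not part of the example data.

```matlab
% Toy sequence; 'hello' and '$' are placeholders for this sketch only.
inputSeq = 'hello';
endMarker = '$';

% The response at time step t is the input character at time step t+1.
% The last response is the end marker, so both sequences have equal length.
responseSeq = [inputSeq(2:end) endMarker];

for t = 1:numel(inputSeq)
    fprintf('t=%d  input=%s  response=%s\n', t, inputSeq(t), responseSeq(t));
end
```

Each input character therefore has exactly one target, which is what a sequence-to-sequence network needs at every time step.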

Load the Training Data

Load the text file provided for training.

Read sonnets.txt and extract its text.

filename = "sonnets.txt";
textData = fileread(filename);

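The sonnets are indented by two whitespace characters and separated from one another by two newline characters.

Use replace to remove the indentation and split to divide the text into separate sonnets.

The first three elements hold the main title, and a sonnet title precedes each sonnet. Indexing with 5:2:end in step 2 removes both by keeping every second element from the fifth onward.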

1. Remove the indentation

textData = replace(textData,"  ","");

2. Split on blank lines and drop the titles

textData = split(textData,[newline newline]);
textData = textData(5:2:end);

3. Check the first two sonnets

textData(1:2)

ans = 2×1 cell array

{'From fairest creatures we desire increase,↵That thereby beauty's rose might never die,↵But as the riper should by time decease,↵His tender heir might bear his memory:↵But thou, contracted to thine own bright eyes,↵Feed'st thy light's flame with self-substantial fuel,↵Making a famine where abundance lies,↵Thy self thy foe, to thy sweet self too cruel:↵Thou that art now the world's fresh ornament,↵And only herald to the gaudy spring,↵Within thine own bud buriest thy content,↵And tender churl mak'st waste in niggarding:↵Pity the world, or else this glutton be,↵To eat the world's due, by the grave and thee.' }

{'When forty winters shall besiege thy brow,↵And dig deep trenches in thy beauty's field,↵Thy youth's proud livery so gazed on now,↵Will be a tatter'd weed of small worth held:↵Then being asked, where all thy beauty lies,↵Where all the treasure of thy lusty days;↵To say, within thine own deep sunken eyes,↵Were an all-eating shame, and thriftless praise.↵How much more praise deserv'd thy beauty's use,↵If thou couldst answer 'This fair child of mine↵Shall sum my count, and make my old excuse,'↵Proving his beauty by succession thine!↵This were to be new made when thou art old,↵And see thy blood warm when thou feel'st it cold.' }
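To see why the index 5:2:end keeps only the poems, the steps above can be tried on a miniature text with the same layout. The titles and lines below are made up for this sketch; only the layout (three title elements, then alternating poem titles and indented poem bodies) is assumed to match sonnets.txt.

```matlab
% Hypothetical miniature of the file layout, separated by blank lines.
nl = newline;
poem1 = "  line a" + nl + "  line b";
poem2 = "  line c" + nl + "  line d";
parts = ["THE TITLE"; "by Someone"; "A Collection"; "  I"; poem1; "  II"; poem2];
raw = join(parts, [nl nl]);

% Same three operations as in the steps above.
txt = replace(raw, "  ", "");      % remove the two-space indentation
pieces = split(txt, [nl nl]);      % split on two consecutive newlines
poems = pieces(5:2:end);           % skip 3 title elements and each poem title

disp(numel(pieces))
disp(poems)
```

Because this sketch starts from a string rather than the character vector returned by fileread, split returns a string array here instead of a cell array. The indexing works the same way in both cases.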

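Convert the Text Data to Sequences

Convert the text data to sequences of vectors for the predictors and categorical sequences for the responses.

Create special characters to represent "start of text", "whitespace", "end of text", and "newline". Use the special characters "\x0002" (start of text), "\x00B7" ("·", middle dot), "\x2403" ("␃", end of text), and "\x00B6" ("¶", pilcrow), respectively.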

startOfTextCharacter = compose("\x0002");
whitespaceCharacter = compose("\x00B7");
endOfTextCharacter = compose("\x2403");
newlineCharacter = compose("\x00B6");

For each observation, insert the start-of-text character at the beginning and replace each whitespace and newline with its corresponding special character.

textData = startOfTextCharacter + textData;
textData = replace(textData,[" " newline],[whitespaceCharacter newlineCharacter]);

Create a vocabulary of the unique characters in the text.

uniqueCharacters = unique([textData{:}]);
numUniqueCharacters = numel(uniqueCharacters);
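The position of a character in this vocabulary is what later selects the row of its encoded vector. A minimal sketch of that lookup, using a made-up two-element string array in place of textData:

```matlab
% Toy data standing in for textData; not taken from the sonnets.
toyText = ["abca"; "cab"];

% Concatenate all observations and keep each distinct character once.
% unique returns the characters in sorted order.
vocab = unique([toyText{:}]);

% Look up the vocabulary position of every character in the first observation.
[~, idx] = ismember(toyText{1}, vocab);

disp(vocab)
disp(idx)
```

A repeated character, such as the two occurrences of 'a' here, maps to the same index each time.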

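Loop over the text data and, for each observation, create a sequence of vectors representing its characters and a categorical sequence of characters for the responses. To mark the end of each observation, append the end-of-text character.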

numDocuments = numel(textData);
XTrain = cell(1,numDocuments);
YTrain = cell(1,numDocuments);
for i = 1:numel(textData)
    characters = textData{i};
    sequenceLength = numel(characters);
    
    % Get indices of characters.
    [~,idx] = ismember(characters,uniqueCharacters);
    
    % Convert characters to vectors.
    X = zeros(numUniqueCharacters,sequenceLength);
    for j = 1:sequenceLength
        X(idx(j),j) = 1;
    end
    
    % Create vector of categorical responses with end of text character.
    charactersShifted = [cellstr(characters(2:end)')' endOfTextCharacter];
    Y = categorical(charactersShifted);
    
    XTrain{i} = X;
    YTrain{i} = Y;
end
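The inner loop that fills X one column at a time can also be written as a single comparison. The sketch below is an alternative under the assumption that implicit expansion is available (MATLAB R2016b or later). It uses toy values in place of uniqueCharacters and one observation, and it checks that both versions produce the same matrix.

```matlab
% Toy inputs standing in for uniqueCharacters and one observation.
vocab = 'abc';
characters = 'abca';

% Loop version, following the same pattern as the example code.
[~, idx] = ismember(characters, vocab);
XLoop = zeros(numel(vocab), numel(characters));
for j = 1:numel(characters)
    XLoop(idx(j), j) = 1;
end

% Vectorized version: compare a column of vocabulary characters against the
% row of text characters. Implicit expansion yields a D-by-S logical matrix.
XVec = double(vocab(:) == characters);

disp(isequal(XLoop, XVec))
```

Every column contains exactly one 1, in the row of the character at that position. Either form gives the same D-by-S layout.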

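Display the first observation and the size of the corresponding sequence. The sequence is a D-by-S matrix, where D is the number of features (the number of unique characters) and S is the sequence length (the number of characters in the text).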

textData{1}

ans =

'From·fairest·creatures·we·desire·increase,¶That·thereby·beauty's·rose·might·never·die,¶But·as·the·riper·should·by·time·decease,¶His·tender·heir·might·bear·his·memory:¶But·thou,·contracted·to·thine·own·bright·eyes,¶Feed'st·thy·light's·flame·with·self-substantial·fuel,¶Making·a·famine·where·abundance·lies,¶Thy·self·thy·foe,·to·thy·sweet·self·too·cruel:¶Thou·that·art·now·the·world's·fresh·ornament,¶And·only·herald·to·the·gaudy·spring,¶Within·thine·own·bud·buriest·thy·content,¶And·tender·churl·mak'st·waste·in·niggarding:¶Pity·the·world,·or·else·this·glutton·be,¶To·eat·the·world's·due,·by·the·grave·and·thee.'

Printing the size shows that the sequence is 62-by-611: 62 unique characters and 611 characters in the first observation.

size(XTrain{1})

ans = 1×2

62 611

Display the corresponding response sequence. The sequence is a 1-by-S categorical vector of responses.

YTrain{1}

ans = 1×611 categorical array

F r o m · f a i r e s t · c r e a t u r e s · w e · d e s i r e · i n c r e a s e , ¶ T h a t · t h e r e b y · b e a u t y ' s · r o s e · m i g h t · n e v e r · d i e , ¶ B u t · a s · t h e · r i p e r · s h o u l d · b y · t i m e · d e c e a s e , ¶ H i s · t e n d e r · h e i r · m i g h t · b e a r · h i s · m e m o r y : ¶ B u t · t h o u , · c o n t r a c t e d · t o · t h i n e · o w n · b r i g h t · e y e s , ¶ F e e d ' s t · t h y · l i g h t ' s · f l a m e · w i t h · s e l f - s u b s t a n t i a l · f u e l , ¶ M a k i n g · a · f a m i n e · w h e r e · a b u n d a n c e · l i e s , ¶ T h y · s e l f · t h y · f o e , · t o · t h y · s w e e t · s e l f · t o o · c r u e l : ¶ T h o u · t h a t · a r t · n o w · t h e · w o r l d ' s · f r e s h · o r n a m e n t , ¶ A n d · o n l y · h e r a l d · t o · t h e · g a u d y · s p r i n g , ¶ W i t h i n · t h i n e · o w n · b u d · b u r i e s t · t h y · c o n t e n t , ¶ A n d · t e n d e r · c h u r l · m a k ' s t · w a s t e · i n · n i g g a r d i n g : ¶ P i t y · t h e · w o r l d , · o r · e l s e · t h i s · g l u t t o n · b e , ¶ T o · e a t · t h e · w o r l d ' s · d u e , · b y · t h e · g r a v e · a n d · t h e e . ␃
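One way to sanity-check the encoding is to map an encoded matrix back to characters. The sketch below does this for a toy matrix. The vocabulary and matrix values are made up for illustration, and the same two lines could be applied to XTrain{1} and uniqueCharacters, which should reproduce the first observation.

```matlab
% Toy vocabulary and one-hot matrix (D = 3 features, S = 4 time steps).
vocab = 'abc';
X = [1 0 0 1;
     0 1 0 0;
     0 0 1 0];

% The row index of the single 1 in each column is the vocabulary position.
[~, rowIdx] = max(X, [], 1);
decoded = vocab(rowIdx);

disp(decoded)
```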
