Content-based File Format Detection (File Extension Prediction)

Joonas' Note

2022. 5. 18. 23:29 joonas (3 min read)
  • Dataset
  • Code
  • Context
  • Conclusion

Dataset

https://www.kaggle.com/datasets/joonasyoon/file-format-detection

Programming Laungages and File Format Detection

Can you tell what format a file is, and which language it is written in?

Code

https://www.kaggle.com/code/joonasyoon/ml-content-based-file-format-detection

[ML] 💾 Content-based File Format Detection 📃


Context

I built everything myself, from the dataset through to the ML models, and checked how well it works.

I did not set out to write ML models. A file extension is just part of the file name, so I started from a simple question: with the extension stripped, can you predict which language a file is written in from its contents alone?

Figure: examples of D, C, and Go code

GitHub hosts a huge amount of public code, written in many programming languages and in many coding styles. So I collected some of it and built a dataset.

I gathered more than 80,000 text files from more than 30 repositories. Because the repositories were picked at random, Dart, Rust, C#, and Go ended up with the most files. The other languages are not small either once you take file size into account, so training did not look difficult.

Still, I decided to train and predict only on languages with more than 500 files. They are:

  • C
  • C#
  • C++
  • Dart
  • Diff
  • Elixir
  • GAS
  • GLSL
  • Go
  • JSON
  • Java
  • Javascript
  • Julia
  • Kotlin
  • Markdown
  • PHP
  • Ruby
  • Rust
  • SQL
  • Text
  • YAML
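
The 500-file cutoff can be sketched as a simple count-and-filter step. This is a minimal sketch, not the notebook's actual code: the `(path, label)` record layout and the name `filter_frequent_labels` are assumptions made for illustration.

```python
from collections import Counter


def filter_frequent_labels(records, min_files=500):
    """Keep only records whose label occurs more than `min_files` times.

    `records` is a list of (path, label) pairs; this layout is an
    assumption for the sketch, not the dataset's real schema.
    """
    counts = Counter(label for _, label in records)
    keep = {label for label, n in counts.items() if n > min_files}
    return [(path, label) for path, label in records if label in keep]


# Toy usage with a lowered threshold so the effect is visible.
records = [("a.go", "Go"), ("b.go", "Go"), ("c.go", "Go"), ("d.nim", "Nim")]
filtered = filter_frequent_labels(records, min_files=2)
```

With the toy data, only the three Go records survive, because Nim has a single file.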

JSON, Text, and YAML are in the list too, so "data format" seems a more accurate term than "language".

One thing that worries me is whether the model can reliably pick out Text, which is truly arbitrary content.

I was not aiming for high accuracy, so I simply vectorized everything with CountVectorizer and trained on it. The results were surprisingly good: even with only a little training, accuracy easily passes 80%.
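
To make the idea concrete, here is a minimal sketch of a bag-of-words classifier built with scikit-learn. It is not the notebook's code: the toy snippets, the choice of LogisticRegression with `max_iter=1000`, and the helper name `predict_language` are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: hypothetical snippets, not taken from the dataset.
samples = [
    'package main\nimport "fmt"\nfunc main() { fmt.Println("hi") }',
    "package util\nfunc Add(a int, b int) int { return a + b }",
    '#include <stdio.h>\nint main(void) { printf("hi"); return 0; }',
    "#include <stdlib.h>\nvoid release(char *ptr) { free(ptr); }",
    'fn main() { let total: i32 = 1; println!("{}", total); }',
    "pub fn add(a: i32, b: i32) -> i32 { let sum = a + b; sum }",
]
labels = ["Go", "Go", "C", "C", "Rust", "Rust"]

# CountVectorizer turns each file into token counts; the classifier
# then learns which tokens are typical of which label.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(samples, labels)


def predict_language(source: str) -> str:
    """Return the predicted label for one piece of source text."""
    return str(model.predict([source])[0])
```

Calling `predict_language("func main() { }")` returns one of the three toy labels. With six training samples the prediction is only illustrative.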

Reading all 80,000 files (about 1 GB) produces feature vectors of roughly 4 million dimensions.
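
A quick back-of-the-envelope calculation shows why a matrix of this shape is only practical in sparse form, which is what CountVectorizer returns. The figures are the rounded ones from this post, and 8 bytes per value assumes 64-bit counts.

```python
n_files = 80_000         # approximate number of files, from the post
n_features = 4_000_000   # approximate vocabulary size, from the post
bytes_per_value = 8      # assumption: 64-bit counts

# Size of the same matrix if every cell were stored explicitly.
dense_bytes = n_files * n_features * bytes_per_value
dense_terabytes = dense_bytes / 1e12
print(f"dense matrix: {dense_terabytes:.2f} TB")
```

Stored densely, the matrix would need about 2.56 TB. A sparse matrix stores only the non-zero counts, and each file contains only a tiny fraction of the full vocabulary.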

Conclusion

LinearSVC:
  elapsed time: 0:06:04.338887
  accuracy: 94.61%
  roc_auc: 0.9877388985308935
LogisticRegression:
  elapsed time: 1:06:56.671633
  accuracy: 97.02%
  roc_auc: 0.9882273614011244
RidgeClassifier:
  elapsed time: 0:04:10.478925
  accuracy: 55.39%
  roc_auc: None
random_forest:
  elapsed time: 0:05:06.379483
  accuracy: 93.68%
  roc_auc: 0.9825555088479243
k_neighbors:
  elapsed time: 0:00:33.741252
  accuracy: 87.67%
  roc_auc: 0.9706357200070198
SGD:
  elapsed time: 0:01:24.359626
  accuracy: 88.76%
  roc_auc: None

Apart from RidgeClassifier (55.39%), the scores are quite high overall. Pulling out a few samples, feeding the code to the model, and checking the predicted class also gives good results.

CountVectorizer's default tokenizer discards punctuation and whitespace (and stop words, if stop_words is set), yet the remaining words apparently still carry enough information to tell the languages apart. My guess is that each language's reserved words are what make them separable.
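
To see which tokens survive, the sketch below applies the same regular expression that CountVectorizer uses as its default `token_pattern`, after lowercasing as the vectorizer does by default. The helper name `tokenize` and the two snippets are made up for illustration.

```python
import re

# Same regex as CountVectorizer's default `token_pattern`:
# runs of two or more word characters; punctuation acts as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(source):
    """Lowercase the text and return the tokens the default pattern keeps."""
    return TOKEN_PATTERN.findall(source.lower())


go_tokens = tokenize('func main() { fmt.Println("hi") }')
rust_tokens = tokenize("fn main() { let x: i32 = 1; }")
print(go_tokens)    # ['func', 'main', 'fmt', 'println', 'hi']
print(rust_tokens)  # ['fn', 'main', 'let', 'i32']
```

Braces, quotes, and single-character names disappear, but keywords such as `func`, `fn`, and `let` remain, which fits the guess that reserved words do most of the work.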

The full results are in the Kaggle notebook linked at the top of this post.

You might wonder where this is useful, but VSCode does something like it, and Slack's snippets appear to use it as well.
