Enhancing Code Similarity with Augmented Data Filtering and Ensemble Strategies
DOI: http://dx.doi.org/10.30630/joiv.6.3.1259