I am a postdoctoral researcher at the Software Evolution and Analysis Laboratory (SEAL) at UCLA, working with Prof. Miryung Kim.
My research interests lie in program analysis and software testing, with a focus on making program analysis scale to the size and complexity of modern software systems. I aim to make software testing practical in real-world scenarios by using statistical methods to analyze dynamic information from program executions, which lets us reason about a program’s semantic properties and address empirical challenges in software testing.
Our paper “Evaluating LLM-based Regression Test Generation” has been accepted to appear at FSE 2026! Congrats to our amazing team!
Jan, 2026
I have been invited to serve on the Program Committee (PC) of the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026).
Dec, 2025
Our paper “In Bugs We Trust? On Measuring the Randomness of a Fuzzer Benchmarking Outcome” has been accepted to appear at FSE 2026! Congrats to our amazing team!
Oct, 2025
Our paper “Dependency-aware Residual Risk Analysis” has been accepted to appear at ICSE 2026! Congrats, Marcel!
Sep, 2025
Our paper “Cottontail: LLM-Driven Concolic Execution for Highly Structured Test Input Generation” has been accepted to appear at IEEE S&P 2026! Congratulations to the entire team!
2026
FSE
Evaluating LLM-based Regression Test Generation
Jing Liu, Seongmin Lee, Eleonora Losiouk, and Marcel Böhme
In Proceedings of the ACM International Conference on the Foundations of Software Engineering, Jul 2026
Large Language Models (LLMs) have shown tremendous promise in automated software engineering. In this paper, we investigate the opportunities of LLMs for just-in-time regression test generation for programs, such as parsers, interpreters, or compilers, that take highly structured, human-readable inputs. When a new bug fix or code change is committed, the repository (as part of the CI/CD workflow) runs an LLM for a few minutes to generate regression test cases for that commit that exercise the changed code and potentially trigger any bugs.
Specifically, we investigate LLM-based regression test generation as a machine translation task that takes the developer-provided commit message, the code change, and the name of the input format (e.g., XML) and produces regression test cases for the described change in the given input format. In our experiments testing 72 commits to Mujs, Libxml2, Poppler, JerryScript, Z3, PHP, JQ, and MicroPython, our feedback-directed, zero-shot LLM-based prototype Cleverest performed well, even when we did not provide the code change. In under 2 minutes on average, Cleverest found as many bugs as the state-of-the-art directed greybox fuzzer WAFLGo found in 24 hours, even though WAFLGo started with a commit-reaching seed corpus in the majority of cases. When we amplify the Cleverest-generated test cases by using them as a seed corpus for coverage-guided greybox fuzzing, the number of bugs found doubles. We call this integration with fuzzing ClevFuzz.
In addition, we found that some commit messages are more expressive than others, so we investigated how this impacts the effectiveness of Cleverest. Our results above demonstrate that Cleverest picks up on the intention behind a change. For instance, if the commit message describes that a patch changes how floating-point variables are treated in the Mujs JavaScript interpreter, then Cleverest generates JavaScript programs that contain floating-point variables. To study the impact of expressiveness, we minimally edited the commit messages to reduce or increase the information they convey, and found a substantial impact on effectiveness. For instance, adding 17 words on average (max. 43) to make ineffective commit messages more expressive significantly increased the number of bugs found.
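To make the translation-task framing concrete, here is a minimal sketch of how such a zero-shot prompt might be assembled. This is purely illustrative; the function name and prompt wording are assumptions, not Cleverest’s actual implementation.

# Hypothetical sketch of the "translation" prompt for LLM-based
# regression test generation; names and wording are assumptions,
# not Cleverest's actual implementation.

def build_prompt(commit_message: str, code_diff: str, input_format: str) -> str:
    """Assemble a zero-shot prompt that asks an LLM to translate a
    commit description into regression test inputs of a given format."""
    return (
        f"A commit to a program that parses {input_format} was just made.\n"
        f"Commit message:\n{commit_message}\n\n"
        f"Code change (unified diff):\n{code_diff}\n\n"
        f"Generate {input_format} test inputs that exercise the changed "
        f"code and could trigger a regression."
    )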
@inproceedings{FSE26-llmregression,title={Evaluating LLM-based Regression Test Generation},author={Liu, Jing and Lee, Seongmin and Losiouk, Eleonora and Böhme, Marcel},booktitle={Proceedings of the {{ACM}} International Conference on the Foundations of Software Engineering},year={2026},month=jul,}
FSE
In Bugs We Trust? On Measuring the Randomness of a Fuzzer Benchmarking Outcome
Ardi Madadi*, Seongmin Lee*, Cornelius Aschermann, and Marcel Böhme
In Proceedings of the ACM International Conference on the Foundations of Software Engineering, Jul 2026
In Google’s FuzzBench platform, we find that the outcome of a coverage-based evaluation agrees more strongly with the outcome of a bug-based evaluation than an independent bug-based evaluation does. Recently, Böhme et al. found that, despite a very strong correlation between coverage achieved and bugs found, there is no strong agreement between the outcomes of a coverage- and a bug-based evaluation: the fuzzer best at achieving coverage may be the worst at finding bugs. In trying to explain this moderate agreement, we wondered whether the outcome of bug-based benchmarking is itself much more “noisy”, and turned to applied statistics to develop the tools necessary to investigate this hypothesis.
In this paper, we call this degree of “noisiness” of a benchmarking outcome the concordance of the benchmarking procedure and quantify it using a measure of statistical reliability widely used in psychology, called mean split-half reliability, i.e., the expected agreement on the benchmark outcome between two random halves of the benchmarking suite. In our experiments with FuzzBench and Magma, we find that the concordance of coverage-based benchmarking is consistently strong while that of bug-based benchmarking is weak on FuzzBench and moderate on Magma. In contrast to FuzzBench, for the Magma benchmark suite (which was designed for bug-based evaluation) a coverage-based evaluation does not predict the outcome of a bug-based evaluation better than an independent bug-based evaluation.
Moreover, to demonstrate the utility of concordance also for developers of benchmarking suites, we investigate concordance as a measure of benchmarking efficiency, as in green fuzzer benchmarking. We empirically confirm that the resources of a procedure with higher concordance can be reduced more substantially (in terms of campaign length or benchmark sampling size) while maintaining a similar benchmark outcome as a procedure with lower concordance. We report the corresponding savings in terms of carbon emissions.
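To illustrate the measure, mean split-half reliability can be computed by repeatedly splitting the benchmark programs into two random halves, ranking the fuzzers on each half, and averaging the rank agreement. Below is a minimal sketch under the assumption that results are given as a fuzzer-by-program score matrix and that agreement is measured with Kendall’s tau; the paper’s exact agreement measure may differ.

import numpy as np
from scipy.stats import kendalltau

def mean_split_half_reliability(scores: np.ndarray, n_splits: int = 1000,
                                seed: int = 0) -> float:
    """Expected agreement (Kendall's tau) between fuzzer rankings
    computed on two random halves of the benchmark programs.
    scores[i, j] is fuzzer i's result (e.g., bugs found) on program j."""
    rng = np.random.default_rng(seed)
    n_programs = scores.shape[1]
    taus = []
    for _ in range(n_splits):
        perm = rng.permutation(n_programs)
        half_a, half_b = perm[: n_programs // 2], perm[n_programs // 2:]
        # Rank fuzzers by their aggregate score on each half.
        rank_a = scores[:, half_a].sum(axis=1)
        rank_b = scores[:, half_b].sum(axis=1)
        taus.append(kendalltau(rank_a, rank_b)[0])
    return float(np.mean(taus))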
@inproceedings{FSE26-concordance,title={In Bugs We Trust? {{On}} Measuring the Randomness of a Fuzzer Benchmarking Outcome},author={Madadi, Ardi and Lee, Seongmin and Aschermann, Cornelius and Böhme, Marcel},booktitle={Proceedings of the {{ACM}} International Conference on the Foundations of Software Engineering},year={2026},month=jul,}
S&P
Cottontail: LLM-Driven Concolic Execution for Highly Structured Test Input Generation
Haoxin Tu, Seongmin Lee, Yuxian Li, Peng Chen, Lingxiao Jiang, and Marcel Böhme
In 2026 IEEE Symposium on Security and Privacy (S&P’26), May 2026
How can we perform concolic execution to generate highly structured test inputs for systematically testing parsing programs? Existing concolic execution engines are significantly restricted by (1) input structure-agnostic path constraint selection, which wastes testing effort or misses coverage; (2) limited constraint-solving capability, which yields many syntactically invalid test inputs; and (3) reliance on manually acquired, highly structured seed inputs, which prevents continuous testing.
This paper proposes Cottontail, a new Large Language Model (LLM)-driven concolic execution engine, to mitigate these limitations. First, a more complete program path representation, named Expressive Structural Coverage Tree (ESCT), is constructed to select structure-aware path constraints. Then, an LLM-driven constraint solver based on a Solve-Complete paradigm solves the path constraints to obtain test inputs that not only satisfy the constraints but are also valid with respect to the input syntax. Finally, history-guided seed acquisition is employed to obtain new highly structured test inputs, either before testing starts or after testing saturates.
We implemented Cottontail on top of SymCC and evaluated it on eight extensively tested open-source libraries across four input formats (XML, SQL, JavaScript, and JSON). Cottontail significantly outperforms baseline approaches by 30.73% and 41.32% on average in terms of line and branch coverage, respectively. Moreover, Cottontail found six previously unknown vulnerabilities (six CVEs assigned). We have reported these issues to the developers, and four of them have been fixed so far.
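The Solve-Complete paradigm can be pictured, roughly, as a two-step LLM interaction: first solve the collected path constraints for concrete values, then complete the result into a syntactically valid input. The sketch below is a hypothetical illustration; the prompt wording and the ask_llm helper are assumptions, not Cottontail’s implementation.

# Hypothetical two-step "Solve-Complete" interaction; ask_llm is an
# assumed helper that sends a prompt to some LLM and returns its reply.

def solve_complete(path_constraints: str, input_format: str, ask_llm) -> str:
    # Step 1 (Solve): ask for concrete values satisfying the constraints.
    partial = ask_llm(
        f"Give concrete input bytes satisfying these path constraints:\n"
        f"{path_constraints}"
    )
    # Step 2 (Complete): embed those values in a syntactically valid input.
    return ask_llm(
        f"Complete this fragment into a syntactically valid {input_format} "
        f"input, preserving the given bytes:\n{partial}"
    )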
@inproceedings{Cottontail,title={Cottontail: LLM-Driven Concolic Execution for Highly Structured Test Input Generation},booktitle={2026 {{IEEE Symposium}} on {{Security}} and {{Privacy}} ({{S\&P}}'26)},author={Tu, Haoxin and Lee, Seongmin and Li, Yuxian and Chen, Peng and Jiang, Lingxiao and B{\"o}hme, Marcel},year={2026},month=may,publisher={IEEE},}
ICSE
Dependency-aware Residual Risk Analysis
Seongmin Lee and Marcel Böhme
In Proceedings of the IEEE/ACM 48th International Conference on Software Engineering (ICSE’26), Apr 2026
However much we test a software system, some residual risk of undiscovered bugs always remains. If we model test generation as a sampling process, the residual risk can be defined as the probability that the next test input reveals a bug. This risk is upper-bounded by the discovery probability (DP), i.e., the probability that the next test input covers new code, which itself is upper-bounded by the coverage rate, i.e., the expected number of new coverage elements per test input. Prior work introduced the Good-Turing estimator (GoTu) to estimate residual risk via the coverage rate. However, we find that GoTu substantially overestimates the residual risk, leading to undue optimism in bug finding, because (i) the coverage rate is only a loose upper bound, and (ii) GoTu ignores dependencies among coverage elements.
We propose dependency-aware DP estimation for residual risk analysis. Our estimator directly estimates DP and accounts for dependencies among coverage elements using Ma and Chao’s sample coverage estimation. A naive implementation requires space proportional to the number of coverage elements and executions, which can be prohibitively large. To make it practical, we introduce two optimizations: dependency-aware node removal, which reduces the number of coverage elements to observe, and online singleton cluster maintenance, which eliminates the need to record observed coverage elements in each execution.
A comparison of our estimator and GoTu on real-world software from FuzzBench demonstrates a substantial reduction in estimation error. If we stopped the campaign when the estimate of residual risk fell below a certain threshold, GoTu would lead a tester to waste 7× more time than our estimator before deciding to stop. Our estimator achieves a median absolute error of only one-fifth that of GoTu. Finally, our bug-based analysis shows that our estimator achieves one to two orders of magnitude lower error than GoTu.
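For reference, the Good-Turing estimate that we improve upon has a simple form for incidence data: the probability that the next input discovers new coverage is approximated by f1/n, where f1 is the number of coverage elements observed in exactly one of the n executions. A minimal sketch of this baseline (illustrative only; our dependency-aware estimator is more involved):

from collections import Counter

def good_turing_discovery_probability(executions: list[set[str]]) -> float:
    """Good-Turing (GoTu) estimate of the discovery probability: f1 / n,
    where f1 is the number of coverage elements observed in exactly one
    of the n executions."""
    counts = Counter(elem for exec_cov in executions for elem in exec_cov)
    f1 = sum(1 for c in counts.values() if c == 1)
    return f1 / len(executions)

# Example stopping rule: halt the campaign once the estimate drops
# below a chosen risk threshold, e.g., 1e-6.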
@inproceedings{lee2025dependenceResidual,title={Dependency-aware Residual Risk Analysis},author={Lee, Seongmin and B{\"o}hme, Marcel},booktitle={Proceedings of the {{IEEE}}/{{ACM}} 48th {{International Conference}} on {{Software Engineering}} (ICSE'26)},year={2026},month=apr,}
2025
ICSE
Accounting for Missing Events in Statistical Information Leakage Analysis
Seongmin Lee, Shreyas Minocha, and Marcel Böhme
In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (ICSE’25), Jun 2025
The leakage of secret information via a public channel is a critical privacy flaw in software systems. The more information is leaked per observation, the less time an attacker needs to learn the secret. Due to the size and complexity of modern software, and because some empirical facts are not available to a formal analysis of the source code, researchers have started investigating statistical methods that use program executions as samples. However, current statistical methods require high sample coverage. Ideally, the sample is large enough to contain every possible combination of secret × observable value, so as to accurately reflect the joint distribution of ⟨secret, observable⟩. Otherwise, the information leakage is severely underestimated, which is problematic as it can lead to overconfidence in the security of an otherwise vulnerable program.
In this paper, we introduce an improved estimator for information leakage and propose to use methods from applied statistics to improve our estimate of the joint distribution when sample coverage is low. The key idea is to reconstruct the joint distribution by casting our problem as a multinomial estimation problem in the absence of samples for all classes. We suggest two approaches and demonstrate the effectiveness of each on a set of benchmark subjects. We also propose novel refinement heuristics, which help adjust the joint distribution and achieve better estimation accuracy. Compared to existing statistical methods for information leakage estimation, our method safely overestimates the mutual information and provides a more accurate estimate from a limited number of program executions.
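The quantity being estimated is the mutual information between secret and observable, and the naive plug-in estimator computes it from the empirical joint distribution; missing ⟨secret, observable⟩ combinations are exactly where its underestimation arises. A minimal sketch of this plug-in baseline (illustrative only, not our corrected estimator):

import math
from collections import Counter

def plugin_mutual_information(samples: list[tuple[str, str]]) -> float:
    """Naive plug-in estimate of I(secret; observable) in bits from
    (secret, observable) samples; underestimates leakage when rare
    combinations are missing from the sample."""
    n = len(samples)
    joint = Counter(samples)
    secret_counts = Counter(s for s, _ in samples)
    observ_counts = Counter(o for _, o in samples)
    mi = 0.0
    for (s, o), c in joint.items():
        p_so = c / n
        # p(s,o) / (p(s) * p(o)) simplifies to c * n / (c_s * c_o).
        mi += p_so * math.log2(c * n / (secret_counts[s] * observ_counts[o]))
    return mi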
@inproceedings{lee2024accounting,title={Accounting for {{Missing Events}} in {{Statistical Information Leakage Analysis}}},booktitle={Proceedings of the {{IEEE}}/{{ACM}} 47th {{International Conference}} on {{Software Engineering}} (ICSE'25)},author={Lee, Seongmin and Minocha, Shreyas and B{\"o}hme, Marcel},year={2025},month=jun,publisher={Association for Computing Machinery},}
ICLR
How Much Is Unseen Depends Chiefly on Information About the Seen
Seongmin Lee and Marcel Böhme
In Proceedings of the 13th International Conference on Learning Representations (ICLR’25), May 2025
The paper was selected as ICLR’25 Spotlight (Top 5% of accepted papers).
The missing mass refers to the proportion of data points in an unknown population of classifier inputs that belong to classes not present in the classifier’s training data, which is assumed to be a random sample from that unknown population. We find that, in expectation, the missing mass is entirely determined by the numbers f_k of classes that appear in the training data exactly k times, plus an exponentially decaying error. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from an impractically high variance. However, our theory suggests a large search space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem to find a distribution-specific estimator with minimized mean squared error (MSE), given only the sample. In our experiments, our search algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimators’ MSE is roughly 80% of the Good-Turing estimator’s.
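For contrast, the classical Good-Turing baseline estimates the missing mass as f_1/n, the fraction of the sample made up of singletons. The toy simulation below illustrates this baseline on an assumed Zipf-like population; it is illustrative only and not our search-based estimator.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy population: Zipf-like distribution over 1,000 classes.
probs = 1.0 / np.arange(1, 1001)
probs /= probs.sum()

sample = rng.choice(len(probs), size=500, p=probs)
counts = Counter(sample)

# True missing mass: total probability of classes absent from the sample.
true_m0 = probs[[k for k in range(len(probs)) if k not in counts]].sum()

# Good-Turing estimate: (number of singletons) / (sample size).
f1 = sum(1 for c in counts.values() if c == 1)
gt_m0 = f1 / len(sample)

print(f"true missing mass = {true_m0:.4f}, Good-Turing = {gt_m0:.4f}")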
@inproceedings{leeHowMuchUnseen2025,title={How {{Much}} Is {{Unseen Depends Chiefly}} on {{Information About}} the {{Seen}}},author={Lee, Seongmin and B{\"o}hme, Marcel},year={2025},month=may,booktitle={Proceedings of the 13th International Conference on Learning Representations (ICLR'25)},}
SCP
Causal Program Dependence Analysis
Seongmin Lee, Dave Binkley, Robert Feldt, Nicolas Gold, and Shin Yoo
Science of Computer Programming, Feb 2025
Discovering how program components affect one another plays a fundamental role in helping engineers comprehend and maintain a software system. Although the degree to which one program component depends on another can vary in strength, traditional dependence analysis typically ignores this nuance. To account for it, we propose Causal Program Dependence Analysis (CPDA), a framework based on causal inference that captures the degree (or strength) of the dependence between program elements. For a given program, CPDA intervenes in the program execution to observe changes in value at selected points in the source code. It observes the association between program elements by constructing and executing modified versions of the program (requiring only lightweight parsing rather than sophisticated static analysis). CPDA then applies causal inference to the observed changes to identify and estimate the strength of the dependence relations between program elements. We explore the advantages of CPDA’s quantified dependence by presenting results for several applications. Our qualitative evaluation further demonstrates 1) that observing different levels of dependence facilitates grouping the various functional aspects found in a program and 2) how focusing on the relative strength of the dependences for a particular program element provides detailed context for that element. Furthermore, a case study that applies CPDA to debugging illustrates how it can improve engineer productivity.
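At its core, the intervention step can be pictured as mutating the value at one program point and measuring how often the value at another point changes, with the propagation frequency serving as a crude proxy for dependence strength. The sketch below is a toy illustration under that simplification; CPDA’s actual causal-inference machinery is richer.

import random

def dependence_strength(program, src: str, dst: str, n_trials: int = 100) -> float:
    """Estimate how strongly dst depends on src by intervening on the
    value at src and observing whether the value at dst changes.
    program(interventions) is an assumed helper that runs the
    instrumented program and returns a dict of observed values per
    program point."""
    changed = 0
    for _ in range(n_trials):
        baseline = program({})
        mutated = program({src: random.randint(-1000, 1000)})
        if mutated[dst] != baseline[dst]:
            changed += 1
    return changed / n_trials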
@article{leeCausalProgramDependence2025,title={Causal Program Dependence Analysis},author={Lee, Seongmin and Binkley, Dave and Feldt, Robert and Gold, Nicolas and Yoo, Shin},year={2025},month=feb,journal={Science of Computer Programming},volume={240},pages={103208},issn={0167-6423},doi={10.1016/j.scico.2024.103208},}
2024
ICSE
Extrapolating Coverage Rate in Greybox Fuzzing
Danushka Liyanage*, Seongmin Lee*, Chakkrit Tantithamthavorn, and Marcel Böhme
In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE’24), Apr 2024
A fuzzer can literally run forever. However, as more resources are spent, the coverage rate continuously drops, and the utility of the fuzzer declines. To tackle this coverage-resource tradeoff, we could introduce a policy to stop a campaign whenever the coverage rate drops below a certain threshold value, say 10 new branches covered per 15 minutes. During the campaign, can we predict the coverage rate at some point in the future? If so, how well can we predict the future coverage rate as the prediction horizon or the current campaign length increases? How can we tackle the statistical challenge of adaptive bias, which is inherent in greybox fuzzing (i.e., samples are not independent and identically distributed)?
In this paper, we i) evaluate existing statistical techniques to predict the coverage rate U(t0 + k) at any time t0 in the campaign after a period of k units of time in the future and ii) develop a new extrapolation methodology that tackles the adaptive bias. We propose to efficiently simulate a large number of blackbox campaigns from the collected coverage data, estimate the coverage rate for each of these blackbox campaigns, and conduct a simple regression to extrapolate the coverage rate for the greybox campaign.
Our empirical evaluation using the Fuzztastic fuzzer benchmark demonstrates that our extrapolation methodology exhibits at least one order of magnitude lower error than the existing methodology for 4 out of 5 experimental subjects we investigated. Notably, compared to the existing extrapolation methodology, our extrapolator excels at making long-term predictions, such as those extending up to three times the length of the current campaign.
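For contrast with our simulation-based methodology, the simplest baseline ignores adaptive bias entirely and just fits a regression to the observed per-interval coverage rates. A toy sketch of such a naive power-law extrapolation (illustrative only, not the paper’s estimator):

import numpy as np

def extrapolate_coverage_rate(times: np.ndarray, new_cov: np.ndarray,
                              horizon: float) -> float:
    """Naive baseline: fit log(rate) ~ log(t) to the observed per-interval
    coverage rates and extrapolate to a future time `horizon`.
    new_cov[i] is the number of new branches covered in the interval
    ending at times[i]."""
    rates = np.maximum(new_cov, 1e-9)  # avoid log(0) for empty intervals
    slope, intercept = np.polyfit(np.log(times), np.log(rates), 1)
    return float(np.exp(intercept + slope * np.log(horizon)))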
@inproceedings{liyanageExtrapolatingCoverageRate2024,title={Extrapolating {{Coverage Rate}} in {{Greybox Fuzzing}}},booktitle={Proceedings of the {{IEEE}}/{{ACM}} 46th {{International Conference}} on {{Software Engineering}} ({{ICSE}}'24)},author={Liyanage, Danushka and Lee, Seongmin and Tantithamthavorn, Chakkrit and B{\"o}hme, Marcel},year={2024},month=apr,series={{{ICSE}} '24},pages={1--12},publisher={Association for Computing Machinery},address={New York, NY, USA},doi={10.1145/3597503.3639198},isbn={9798400702174},}
2023
ESEC/FSE
Statistical Reachability Analysis
Seongmin Lee and Marcel Böhme
In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’23), Nov 2023
Given a target program state (or statement) s, what is the probability that an input reaches s? This is the quantitative reachability analysis problem. For instance, quantitative reachability analysis can be used to approximate the reliability of a program (where s is a bad state). Traditionally, quantitative reachability analysis is solved as a model counting problem for a formal constraint that represents the (approximate) reachability of s along paths in the program, i.e., probabilistic reachability analysis. However, in preliminary experiments, we failed to run state-of-the-art probabilistic reachability analysis on reasonably large programs.
In this paper, we explore statistical methods to estimate the reachability probability. An advantage of statistical reasoning is that the size and composition of the program are immaterial as long as the program can be executed. We are particularly interested in the error compared to the state-of-the-art probabilistic reachability analysis. We realize that existing estimators do not exploit the inherent structure of the program and develop structure-aware estimators to further reduce the estimation error given the same number of samples.
Our empirical evaluation on previous and new benchmark programs shows that (i) our statistical reachability analysis outperforms state-of-the-art probabilistic reachability analysis tools in terms of accuracy, efficiency, and scalability, and (ii) our structure-aware estimators further outperform (blackbox) estimators that do not exploit the inherent program structure. We also identify multiple program properties that limit the applicability of the existing probabilistic analysis techniques.
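The blackbox starting point is a plain Monte Carlo estimate: sample inputs, execute the program, and count how often s is reached; our structure-aware estimators reduce the variance of this estimate by exploiting the program’s branching structure. A minimal sketch of the blackbox baseline, assuming hypothetical execute and sample_input helpers:

def blackbox_reachability(execute, sample_input, n: int = 10_000) -> float:
    """Monte Carlo estimate of the probability that a random input
    reaches the target state s. execute(x) is assumed to return True
    iff running the program on input x reaches s; sample_input() draws
    an input from the operational distribution."""
    hits = sum(1 for _ in range(n) if execute(sample_input()))
    return hits / n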
@inproceedings{leeStatisticalReachabilityAnalysis2023,title={Statistical {{Reachability Analysis}}},booktitle={Proceedings of the 31st {{ACM Joint European Software Engineering Conference}} and {{Symposium}} on the {{Foundations}} of {{Software Engineering}} ({{ESEC}}/{{FSE}}'23)},author={Lee, Seongmin and B{\"o}hme, Marcel},year={2023},month=nov,series={{{ESEC}}/{{FSE}} 2023},pages={326--337},publisher={Association for Computing Machinery},address={New York, NY, USA},doi={10.1145/3611643.3616268},isbn={9798400703270},}