NIKET KUMAR CHOUDHARY
Phone: 845-337-6468
E-mail: nkchoudh@ece.ncsu.edu
Web: www4.ncsu.edu/~nkchoudh

EDUCATION
Ph.D. in Computer Engineering
North Carolina State University, Raleigh, NC, GPA: 3.84/4.0 (Aug 09–May 12, Expected)
M.S. in Computer Engineering
North Carolina State University, Raleigh, NC, GPA: 3.88/4.0 (Aug 07–Aug 09)
Bachelor in Information & Communication Technology
Dhirubhai Ambani Institute of Information and Communication Technology, India (Aug 01–May 05)

RESEARCH INTERESTS
Computer architecture, processor microarchitecture, low-power & low-effort design methodology, 3-D IC architecture design, dynamic binary translation & optimization, and emerging technologies and their interaction with architecture.

WORK EXPERIENCE
Research Assistant, North Carolina State University, NC, USA (Aug 07–Present)
Advisor: Eric Rotenberg
• FabScalar: Developed a novel approach to generate synthesizable RTL of an arbitrary out-of-order superscalar processor based on a canonical pipeline template. The generated processors' RTL differs in three major dimensions: superscalar width, pipeline depth, and sizes of structures for extracting instruction-level parallelism (ILP).
Note: The FabScalar infrastructure has been released to the research community and is currently being used at multiple universities.
• Design of Heterogeneous Multi-Core: Exploring heterogeneous multi-core architectures comprised of many FabScalar-generated superscalar cores, each customized to different application characteristics. This new paradigm enables higher computational efficiency with lower design & verification cost.
• Energy Efficient 3-D CPU: Exploring 3-D IC based heterogeneous multi-core architectures. Focusing on fast thread migration using through-silicon vias and integration of diverse cores designed on different process technology nodes.
• Design Space Exploration techniques: Application of classic search and machine-learning techniques for fast design space exploration to find an optimal processor design.
• Performance and Power modeling: Developed a detailed timing and power model of a processor in C++, and validated it against RTL implementations. The model is used to study performance and power bottlenecks.

Research Intern, Microsoft Corporation, Redmond, USA (May 11–Aug 11)
Group: XCG CAPS & Computer Architecture, Manager: Doug Burger
• Worked on the micro-architecture for the E2 Dynamic Multicore Processor. E2 is an advanced Explicit Datagraph Execution (EDGE) instruction set architecture.
• Developed components relating to composing cores together to achieve better single-thread performance.

Software Engineering Intern, Intel, Santa Clara, USA (May 10–Aug 10)
Group: Binary Translation (BiTS), Manager: Daniel M. Lavery
• Binary translation (BT) system analysis, including evaluating the impact of the dynamic BT system on processor microarchitecture.

Design Engineer, ARM Private Ltd, Bangalore, India (Aug 05–Jul 07)
Group: Processor Division, Manager: Rahoul Varma
• Implementation of a Dynamic Voltage and Frequency Scaling solution for the ARM1176JZF-S processor.
• Benchmarking of multiple ARM processors for power, performance, and area, targeted to different CMOS technologies (130, 90, & 65nm) using Cadence and Synopsys digital products.

Intern, Cadence Design Systems, Bangalore, India (Jan 05–May 05)
Group: RTL Compiler, Manager: Taher Abbasi
• Developed solutions for efficient synthesis of high-performance floating-point datapaths based on the IEEE 754 standard.

Intern, Reliance Infocomm, Mumbai, India (May 03–Jul 03)
• Undergraduate internship.

TECHNICAL SKILLS
Simulators: GEMS, SimpleScalar
Programming Languages: C, C++, SystemC, CUDA, OpenMP, Verilog HDL, SPICE, Perl
Professional Tools: Cadence (NC-Verilog, SOC Encounter, Virtuoso), Synopsys (Design Compiler, PrimeTime), MATLAB
Operating Systems: Linux, Unix, Windows

AWARDS AND ACHIEVEMENTS
• IEEE Micro's Top Picks (2012): FabScalar work selected as one of the 12 computer architecture papers of 2011. Top Picks recognizes the most significant research papers in computer architecture each year, based on novelty and long-term impact.
• Received 1st place in the graduate student category of the ACM Student Research Competition (2010)
• Awarded student travel grants to attend ISCA-2011, PACT-2010, PACT-2009, and ISCA-2009
• Member, National Scholars Honor Society
• Chairman, IEEE student branch, DAIICT (2004)

PUBLICATIONS
• Niket K. Choudhary et al. "FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template", IEEE Micro, Special Issue: Micro's Top Picks from 2011 Computer Architecture Conferences (MICRO TOP PICKS), Vol. 32, No. 3, May/June 2012.
• B.H. Dwiel, Niket K. Choudhary, and Eric Rotenberg. "FPGA Modeling of Diverse Superscalar Processors", Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
• George Patsilaras, Niket K. Choudhary, and James Tuck. "Efficiently Exploiting Memory Level Parallelism on Asymmetric Multicore Processors in the Dark Silicon Era", Proceedings of the ACM Transactions on Architecture and Code Optimization (TACO) special issue on High-Performance and Embedded Architectures and Compilers (HiPEAC), 2012.
• Niket K. Choudhary et al. "FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template", Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA), 2011.
• Niket K. Choudhary, Sandeep Navada, R. Ginjupali, and G. Khanna. "An Exploration of OpenCL on Multiple Hardware Platforms for a Numerical Relativity Application", Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), 2011.
• S. Navada, Niket K. Choudhary, and Eric Rotenberg. "Criticality-driven Superscalar Design Space Exploration", Proceedings of the IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), 2010.
• G. Patsilaras, Niket K. Choudhary, and James Tuck. "Design Trade-offs for Memory Level Parallelism on an Asymmetric Multicore System", Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA-3), in conjunction with ISCA-37, 2010.
• H. H. Najaf-abadi, Niket K. Choudhary, and Eric Rotenberg. "Core-Selectability in Chip Multiprocessors", Proceedings of the IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), 2009.
• Niket K. Choudhary et al. "FabScalar", Workshop on Architecture Research Prototyping (WARP-4), in conjunction with ISCA-36, 2009.
• Niket K. Choudhary et al. "ARM's IEM Implementation with Cadence Digital IC Products", Proceedings of the Cadence Designer Network Live Conference (CDNLive), October 2006.

PRESENTATIONS
• FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template, ISCA-38, San Jose, 2011.
• FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template, PACT-19, Vienna, 2010.
• ARM's IEM Implementation with Cadence Digital IC Products, CDNLive, Bangalore, 2006.

PROFESSIONAL SERVICE
• External reviewer for the International Conference on High-Performance Computer Architecture 2011.
• Reviewer for the ARM architecture material of the Computer Organization and Architecture textbook by William Stallings (latest edition).

COURSE PROJECTS
• Implementation of a multicluster microarchitecture simulator in C++, and integrated power and performance analysis of different design choices.
• Parallelization of gravitational wave modeling (Lax-Wendroff algorithm) using OpenCL for an ATI graphics processor.
• Parallelization of an image rotation function using CUDA for an NVIDIA graphics processor.
• Implementation of process scheduling algorithms, reader/writer locks, and virtual memory support on XINU (Intel x86 based operating system).
• Implementation of Owner and Sharer predictors in a Simultaneous Multiprocessing environment on the MOESI coherence protocol using the GEMS package in Simics.
• Implementation of the RegionScout mechanism on the MOSI coherence protocol using the GEMS package in Simics.
• Implementation of a superscalar out-of-order pipeline simulator based on Tomasulo's algorithm in C.
• Implementation of an L1 and L2 data cache simulator in C, with a Markov chain prefetcher and different write policies, for characterizing memory behavior.
• Implementation of a branch predictor simulator in C for characterizing branch behavior.
• Verilog RTL implementation of regular expression matching hardware.
• VLSI implementation of a 48-bit 4-port SRAM on a 45nm process node (schematic design to final layout), optimized for energy-delay product.
• Lexing, parsing, code generation, optimization, and scheduling for MiniC (a reduced form of C).
• Implementation of a simple reliable transport protocol on top of UDP on a UNIX machine.

RELEVANT COURSES
Advanced Microarchitecture
Advanced Parallel Computer Architecture
Multi-core/Many-core Architecture and Programming
Digital ASIC Design
Parallel Computer Architecture
Code Generation and Optimization
Operating System
VLSI Systems Design

REFERENCES
Dr. Eric Rotenberg
ericro@ece.ncsu.edu
Phone: 919-513-2822

Dr. Greg Byrd
gbyrd@ece.ncsu.edu
Phone: 919-513-2508

Other references available on request.

PERSONAL
Citizen of India holding an F-1 visa.