KRYS SUPPLEE

# IRIX Kernel Internals Student Handbook

Part Number: TR-IKI-0.7-6.5-S-SD-W

SGI Proprietary

July 1998

#### **RESTRICTION ON USE**

This document is protected by copyright and contains information proprietary to Silicon Graphics, Inc. Any copying, adaptation, distribution, public performance, or public display of this document without the express written consent of Silicon Graphics, Inc., is strictly prohibited. The receipt or possession of this document does not convey the rights to reproduce or distribute its contents, or to manufacture, use, or sell anything that it may describe, in whole or in part, without the specific written consent of Silicon Graphics, Inc.

Copyright © 1998 Silicon Graphics, Inc. All rights reserved.

#### U.S. GOVERNMENT RESTRICTED RIGHTS LEGEND

Use, duplication, or disclosure of the data and information contained in this document by the Government is subject to restrictions as set forth in FAR 52.227-19(c)(2) or subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 and/or in similar or successor clauses in the FAR, or the DOD or NASA FAR Supplement. Unpublished rights reserved under the Copyright Laws of the United States. Contractor/manufacturer is Silicon Graphics, Inc., 2011 N. Shoreline Blvd., P.O. Box 7311, Mountain View, CA 94039-7311.

The contents of this publication are subject to change without notice.

#### PART NUMBER

TR-IKI-0.7-6.5-S-SD-W, July 1998

#### RECORD OF REVISION

Revision 0.7, Version 6.5, March 16, 1998

#### SGI TRADEMARKS

IRIX, Silicon Graphics, and the SGI logo are registered trademarks of Silicon Graphics, Inc.

#### OTHER TRADEMARKS

Other brand or product names are the trademarks or registered trademarks of their respective holders.
## IRIX 6.5 Kernel Internals (IKI65)

TR-IKI rev 0.7b SGI Proprietary (22jul1998)

#### **Table of Contents**

**IKI: IRIX Kernel Internals Home Page**

- IKI65: IRIX 6.5 Kernel Internals
  - Training Materials
  - Training Material Utilities

**IRIX Software Training**

- 1 IRIX Software Training
  - Contents
  - Class Materials (SGI Employee Use Only)
    - I65RU
    - IKI65
    - OPET
    - PESTO
    - IFO
  - Reference Materials
    - Tech Digest links to many useful items ...
    - Internal Support Tools
      - Other Internal Support Tools
  - Cellular IRIX
  - Mail & Newsgroups
  - Performance
  - Performance Co-Pilot (PCP)
  - Application Programming
  - Hardware Reference Materials
  - Other Reference Materials

**Cray Origin2000 Architecture**

- 2 Cray Origin2000 Architecture
  - Cray Origin2000 Architecture Module
  - CRAY Origin2000 Multirack System
  - Router and Hypercube Connection
  - Hypercube
  - Origin2000 redundant paths
  - Module and Node Block Diagram
  - Node Board Components
  - Node Board, XBOW, and Router
  - MIPS ® R10000 Microprocessor (block ...
  - More About the ® R10000 Microprocessor
  - More About Memory
    - Cache memory systems
    - Origin2000 Distributed Shared-Memory
    - Origin2000 Memory Hierarchy Diagram
    - Origin2000 Memory Hierarchy Explanation
  - More About Cache
    - Non-Blocking Cache
    - Cache Types
      - Primary Data Cache
      - Primary Instruction Cache
      - Secondary Cache (for Data and ...
  - Determining What Hardware the System is ...
  - Determining What Memory Looks Like

**Memory and Addressing: Pages, TLB's, ...**

- 3 Memory and Addressing from a Hardware ...
  - HARDWARE MEMORY
  - Pages, and TLB's
    - Introductory Concepts About Pages
    - Introductory Concepts About the TLB
  - Memory Management Philosophies
    - Real Memory Machines and Swapping
    - Virtual Memory Machines and Paging
      - Where Are the Addresses the Process ...
  - Memory pages
  - HARDWARE ADDRESSING
  - All Addresses = (Page Number + Byte ...
  - Cray Origin2000 Memory Hierarchy and ...
  - Address Request Sequence
  - TLB Misses
    - Two Types of "TLB Miss"
  - TLB Size
  - Coprocessor 0 and the TLB
  - Binary, Hexadecimal, and Decimal
  - The 64-Bit Address Space and "Segments"
  - Illustrations of Segment Types
  - Segment Characteristics
  - Segment Types Overview
  - Table of Cray Origin2000 Segment Types
    - 32-Bit Compatibility Areas
    - Addresses Accessed Based on CPU Mode
  - Cray Origin2000 Segment Types
  - Interpreting the Segment Type From the ...
    - User Address Area Segment
      - xkuseg - Virtual User Memory - mapped, ...
    - Kernel Address Area Segments
      - xkseg - Virtual Kernel Memory - mapped, ...
      - xkphys - Physical Kernel Memory
        - xkphys - unmapped, possibly CACHED
        - xkphys - unmapped, UNcached
  - The 64-bit Word and the Virtual Address
  - A Different View of Memory Segments
  - "Unmapped" Virtual Address Segment Types
  - "Mapped" Virtual Address Segment Types
  - xkphys Memory Segments Diagram
  - xkphys Memory Segments - Detail
  - xkseg Memory Segment - Introductory
  - xkuseg Memory Segment - Introductory
  - xkuseg Memory Segment - Introduction
  - xkseg - Detail
    - xkseg Virtual-to-Physical Address
    - xkseg Virtual Addresses Mapped through ...
    - xkseg Wired Kernel TLB Entries - Diagram
    - xkseg Wired Kernel TLB Entries
  - Contents of xkseg Kernel Wired Entries
  - xkuseg - Detail
    - xkuseg "TLB Hit" - Diagram
  - Introduction to User Structures Related to ...
  - TLB Single Miss
  - Overview of Resolving a TLB Single Miss
  - Overview of Resolving a TLB Single Miss
  - Detail of Resolving a TLB Single Miss
  - Detail of Resolving a TLB Double Miss
  - Detail of Resolving a TLB Double Miss

**Kernel Source Tree**

- 4 Kernel Source Tree
  - Related On-Line Materials
  - Operating System Release Project Web
  - Source Code Location
  - Base Source Code Naming Convention
  - Where Is the Most Recent Version of the ...
  - Kernel Source Tree Location
  - Kernel Source Tree Contents
    - The Difference Between ".h" and ".c" ...
    - Where to Find ".h" Files
  - Operating System and Kernel Source Tree ...
  - Tools Available to Browse Source
  - Determining What Software the System Is ...
    - versions - show system software; list ...
    - uname - show system software
  - How Do I Know What Crashed My System?
  - Where Does the System Put Things When ...
  - What System Logs Exist?
    - System Logs in /var/adm
    - Description of system logs, ...

**Operating System Overview**

- 5 IRIX Operating System Overview
  - UNIX (IRIX) philosophy
  - IRIX system major components (user ...
  - IRIX system major components (kernel ...
  - When Does the Kernel Take Control Away ...
  - Kernel block diagram
  - Primary Kernel Activities
  - Summary of IRIX Kernel Primary Functions

**Interrupt and Exceptions (Preliminary)**

- 6 Interrupts and Exceptions (Preliminary)
  - Processor Operating Modes
  - Interrupt and Exception Types
  - How are Interrupts Different From ...
  - How are Interrupts Similar to ...
  - MIPS Processor Exception and Interrupt ...
  - General Exceptions
  - Hardware Interrupt Check
  - Software and Hardware Exception Check

**Process Management Overview**

- 7 Process Management Overview
  - Process Management Overview
  - Executable Files and Processes Diagram
    - Executable Files and Processes Diagram ...
  - Executable Files and elfdump(1)
  - Process Definition Diagram
    - Process Definition Diagram Explanation
  - User Stack Diagram
    - User Stack Diagram Explanation
  - Kernel Stack Diagram
    - Kernel Stack Diagram Explanation
  - Processes and Kernel Threads
  - Displaying process memory (gmemusage(1))
    - Cray Origin2000 System Workload
    - IRIX Physical Memory gmemusage(1)
    - Process Physical Memory gmemusage(1)
  - Process Control Diagram
    - Process Control Diagram Explanation
  - Process Segments or Regions
  - Kernel's Region Tables Diagram
    - Kernel's Region Tables Diagram ...
  - Region Sharing Diagram
    - Region Sharing Diagram Explanation
  - Multiprocessing
  - Process Execution Flow Diagram
    - Process Execution Flow Diagram ...
  - System Call Interface Diagram
- System Call Argument Processing
- icrash(1M) Samples
  - Process uthread Display
  - uthread Detail
  - Trace of open(2) System Call
  - Trace Detail (partial)
  - Frame For open()
  - Disassembly Code For open()
  - Frame For copen()
  - Disassembly Code For kernel copen()
- Register Aliases

**Memory Management Overview**

- 9 Memory Management Overview
  - Module Overview
  - Module Objectives
  - Hardware Memory Review
    - Origin2000 distributed-shared memory
    - Origin2000 Memory Hierarchy (in order ...
  - Hardware Address Sequence Review Diagram
    - Hardware Address Sequence Review
  - Memory Subsystem Introduction
  - Historical Solutions to Memory ...
  - Recent Solution to Memory Management
  - User Process Components Review
  - User Process Virtual Memory Image
  - User Process Virtual Addresses
  - Virtual to Physical Address Translation
  - Translation Lookaside Buffer (TLB)
  - Translation Lookaside Buffer (TLB)
  - TLB "Hits" and "Misses"
  - Virtual Addressing Summary
  - Demand Paging Overview
  - Demand Paging Page Load Procedure
  - Demand Paging Advantages and ...
  - Page Stealing
  - Page Stealing Page Selection
  - Page Stealing Page Actions
  - Page Stealing and Job Classes
  - Page Cache in IRIX
  - User Process Space and Swapping
  - Swap Space Management
  - The Swapper Process
  - The Swapper Process in IRIX
  - The Swapper Process Relationship to ...
  - Reporting Paging Activity (sar -p)
  - Reporting System Swapping and Switching
  - Reporting TLB Activity (sar -t)
  - Process Size (ps -l)
  - Reporting Memory Statistics (sar -R)
  - Reporting Unused Memory Pages and Disk ...
  - Reporting Memory Activity (gr_osview(1))

**UNIX Filesystem Overview**

- 10 UNIX Filesystem Overview
  - Sample UNIX FileSystem
  - Generic UNIX FileSystem
  - UNIX System V filesystem
    - Small UNIX file sample
    - Small UNIX file
    - Large UNIX file sample
    - Large UNIX file

**XFS Filesystem - Structure**

- 11 The Extent Filesystem (EFS)
- 11 XFS: the extension of EFS
- 11 A New XFS Filesystem
- 11 Allocation Group
- 11 Superblock
- 11 AGF: Free Space Block
- 11 AGFL - Allocation Group Free List
- 11 AGI: Inode Btree Control
- 11 AGI and Inode Btree
- 11 On-disk Inode
- 11 On-disk Inode (256 bytes) with local ...
- 11 1-block Directory
- 11 Btree Directory
- 11 Btree Directory - Index Block
- 11 Attribute Fork Inside Inode
- 11 Attributes Block
- 11 Data Fork - Binary Tree
- 11 Journaling Log
  - Sequence for replaying the log when the ...
- 11 I/O Performance
- 11 xfs_db printable block types
- 11 Mounted Filesystems

**XFS File Management**

- 12 File System Switch
- 12 XFS Code Architecture
- 12 Example IRIX read(2) Sequence
- 12 System Call Layer - Read
- 12 (XFS) Filesystem Layer - Read
- 12 System Buffers
  - detail on the vnode's page hash list
- 12 Example IRIX write(2) Sequence
- 12 (XFS) Filesystem Layer - Write

**XFS File Management**

- 13 Reference
  - mmap(2) - Memory Mapping a File

**pfdats**

- 14 Reference
  - pfdat's
  - Address Translation
  - Table locations
- dumpsys() Processing
- dumpvmcore() Processing
- Dump Level Configuration
  - Dump Level
  - Dump level meanings

**Bibliography**

- 17 Bibliography
  - BOOKS BY SGI EMPLOYEES (former or ...
  - BOOKS
  - ON-LINE DOCUMENTS
  - TOOLS
  - TRAINING MATERIALS WEB PAGES
  - INSTRUCTOR WEB PAGES (links to their ...
  - ENGINEER WEB PAGES

**Appendix A: Origin2000 Support**

- A Origin2000 Support Processes For High ...
  - Purpose
  - Getting Help
  - Getting Cray domain accounts
  - Site Planning
  - Installation Planning
  - System Registration
    - System Serial Number
  - Installation Reporting
    - Initial Mainframe Hardware Install
    - Initial Mainframe Software Install
    - Hardware & Software Installation Defects
  - System Failure Reporting
  - Problem Escalation
    - GTS's Hotlist
    - U.S. Escalation Model
    - International Escalation Model
  - Problem Reporting
    - Software Problem Reporting
  - Customer Communication
    - Pipeline
    - Cray Inform (CRInform)
    - Field Notices and FYIs/FIBs/NPIs
  - Related Information

**Appendix B: CPU R10000 Overview**

- B MIPS ® R10000 Microprocessor Overview
  - Instruction prefetch
  - Out-of-order execution
  - Queuing structures
  - Integer Queue
  - Floating Point Queue
  - Address Queue
  - Execution Units
  - Integer ALUs
  - Floating-Point units
  - Load/Store unit and the TLB
  - Secondary Cache Controller
  - System Interface
  - R10000 Branch Unit
    - Branch instruction problem
    - Branch prediction
**Appendix F: IRIX 6.5 Kernel Values**

- F IRIX 6.5 Kernel Values
  - Kernel Value Table
  - Column Meanings
  - Kernel Value Table
  - Sample "kerninfo" output
    - Live Indy Workstation (IRIX 6.5 beta)
    - O2000 system dump (IRIX 6.5 beta)
    - O2000 live system (flurry; IRIX 6.5 ...

**Appendix G: How to get a core dump from ...**

- G How to get a core dump from your Indy

**Figures**

- Figure A-0: Critical Problem Escalation
- Figure A-1: Cray Origin 2000 U.S
- Figure A-2: Cray Origin 2000 U.S. Field's
- Figure A-3: Cray Origin 2000 International

**Tables**

- Table 3-0: Segment Types and Characteristics for ...
- Table A-1: Site Planning Materials

## **IKI: IRIX Kernel Internals Home Page**

## IKI65: IRIX 6.5 Kernel Internals

## **Training Materials**

CRAY PRIVATE

- SPT course description
- Day 1: Introductory Lessons (with separate or merged TOC window)
  - IRIX source browsing (cscope(1) and dwarfdump(1))
  - Introduction to Dump Analysis (showcase) (html) (Matt Robinson)
  - icrash(1M) Tutorial Draft 1.0
  - Dump Analysis Draft 1.0
- Day 2: Lessons (with separate or merged TOC window)
  - Process Memory Study (lab 0)
  - User Virtual Address Study (lab 1)
- Day 3: File System Lessons (with separate or merged TOC window)
  - filesystem structure
  - file management
- Day 4: Input/Output Lessons (with separate or merged TOC window)
- Day 5: Lessons (with separate or merged TOC window)
  - Dump Analysis

## **Training Material Utilities**

- Request CrayRealm and Training domain accounts
- Search training materials
- Search training glossary

## **Module 1: IRIX Software Training**

## **IRIX Software Training**

#### CRAY PRIVATE

Request CrayRealm and Training domain accounts

#### **Contents**

1. Class Materials
2. Reference Materials
3. Cellular IRIX
4. Mail & Newsgroups
5. Performance
6. Performance Co-Pilot (PCP)
7. Application Programming
8. Hardware Reference Materials
9. Other Reference Materials

## Class Materials (SGI Employee Use Only)

**I65RU:** IRIX 6.5 Release Update

- SPT course description & class materials (HTML)

**IKI65:** IRIX Kernel Internals (IRIX 6.5/Kudzu)

- SPT course description & class materials (HTML & PostScript)

**OPET:** O2000 Performance Evaluation And Tuning

- SPT course description & class materials

**PESTO:** Performance Evaluation and System Tuning for Origin2000 and Onyx2

- Customer Education course description & class materials (same as OPET)

**IFO:** IRIX Functional Overview

- Customer Education Project Plan (Working Draft)
- Hardware & IRIX Operating System Overviews (Working Draft)
- File, I/O, Memory & Process Management (Working Draft)
- Interprocess Communication (Working Draft)
- Security Features (Working Draft)
- Data Migration Facility (DMF) (Working Draft)
- Domain Name Service (DNS) (Working Draft)
- Network File System (NFS) (Working Draft)
- Network Information Service (NIS) (Working Draft)
- TCP/IP (Working Draft)
- Unified Name Service (UNS) (Working Draft)

## **Reference Materials**

- Origin 2000 Support Processes and Tools (lesson) (slides only: PostScript | Showcase)
- Origin 2000 picture, Cray Origin 2000 picture, news items, and hardware Options/Enhancements milestones
- CrayRealm IRIX Source & Object code location and descriptions (lesson)
- Advanced Systems Division's Home for High Performance Computing: An internal SGI resource supporting the technical marketing and development of high-end compute products.
- HPCxchange newsletter at the new HPC Web Site
- Silicon Sales: Hardware, Software, Services & Support
- ASD Marketing System Administration Team provides HW and SW support for ASD's Technical Compute Division and the Graphics Division.
- Origin 2000 training material and presentation slides
- ORIGIN-LINKS
- Origin Benchmark Resources (MV & Eagan)
- How to reconfigure an Origin 2000 to 180MHz 1MB Cache system

| | |
|---|---|
| Projects & Products | Irix 6.4 - Ficus<br>Features by Number, Category<br>Exceptioned Features, SPR Query |
| Works | ProjectVision (PV) viewer and process<br>BugWorks: Web PV database interface<br>bwx: command line PV database interface<br>PatchWorks: patch database interface (for IRIX 6.4.1)<br>Patch Process FAQ, patch types, tools, colors, PV+ Tool, browsers |
| InfoWorks | Software Development & Release Information |
| SGI University | Software Engineering slides and videotapes |

- SGI's Top, Kudzu's, comp.sys.sgi's, and Matthias Fouquet-Lapar's Frequently Asked Questions (FAQs)
- Origin 128 detailed issues and problem information and individual responsibilities

To obtain Uncle Art's Big Book of IRIX:

1. telnet dist.engr as user guest and no password
2. cd /sgi/doc/swdev/BigIrixBook
3. ftp the postscript (\*.out) files from that location.

Alternately (may not work for you):

1. Install the handbook utilities: inst -f dist.engr:/sgi/infotools
2. Enter: handbook -s dist.engr:/sgi/doc/swdev/BigIrixBook

- Automating IRIX 6.5 Miniroot Installs with RoboInst

Legend: ● = Very useful, ● = Somewhat useful, ● = Not rated

- IRIX Kernel Development Information Index (John Hesterberg & Tom Cox): Information Index of IRIX Kernel Development Process at Cray; how to set up accounts, scan source code using cscope(1) and other tools. Has details down to building kernels and testing your code.
- INFOSEARCH
- Tech Digest links to many useful items for Field Analysts
  - TECH.Digest: Overview, Collections, FAQ, Search, Feedback
  - Internal: Sil.Junc, Sniff, Int.Sites
  - Views: Home, Comm, Gfx, HA, Hdw, Lang, MMed, Unix
  - External: Sil.Surf, TechCtr, FTP

| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Bugs | Bulletins | CMSInfo | CSE.Lab | DTbox | FTPInfo | HWDevBk | InstLoc | License |
| Matrices | ManPages | Oasis | Parts | Patches | Pipeline | PriceBook | PubDomSW | QNADocs |
| RelNotes | Security | STbox | TechMail | TechPubs | | Videos | FAQS: Top | SWEngr |

The IRIX section of the On-Line Technical Publications Library is an important resource for obtaining information about IRIX as well as all SGI software products. It has books for developers, system administrators, and end users, as well as a collection of books related to SGI hardware.

- IS Origin2000 System Info: descriptions of available O2000 systems and other information
- Understanding and debugging Problems on the Origin2000 and Onyx2 Systems
- IRIX Device Driver Programmer's Guide (007-0911-060): Documents the execution environment for kernel-level and user-level device drivers in IRIX. Covers development tools and methods used to create device
  Covers development tools and methods used to create device drivers.
- Dean Roehrich's SGI/IRIX Testing GrabBag: includes how to install Ficus on Drive 2 of your Indy

#### **Internal Support Tools**

(IST) Group products, distribution center, and information center

| Tool     | Description                     | User Guide   | Training Slides | Reference Manual |
|----------|---------------------------------|--------------|-----------------|------------------|
| AvailMon | Availability Monitor            | (PS \| PDF)  | (ShowCase)      | (PDF)            |
| FRU      | Field Replaceable Unit Analyzer | (PS \| PDF)  | (ShowCase)      | (PDF)            |
| ICRASH   | Irix Crash Analyzer             | (PS \| PDF)  | (ShowCase)      | (PDF)            |
| IPM      | Installation Planning Manager   |              |                 |                  |
| MDK      | Micro-diagnostic kernel         | (PS \| PDF)  |                 |                  |
| POD      | Origin2000 Power-on Diagnostics |              |                 | (PDF)            |
| RAT      | Remote Access Tool              | (PS \| PDF)  | (ShowCase)      | (PDF)            |
| SVP      | System Verification Program     | (PostScript) | (ShowCase)      | (PDF)            |

NOTE: The PDF format is readable by the Adobe Acrobat Reader. If you do not have a copy, please download the latest version.

### Other Internal Support Tools:

- Diagnostic Roadmap for Origin2000 and Onyx2 (preliminary)
- What Tool to use, and when (general)
- Pre-installation / Upgrade Support
- Installation / Upgrade Support
- Repair Support
- Preventive Support

## Cellular IRIX

- Cellular IRIX Plans; G. Broner (30jan97) and time estimates (27jan97)
  Discusses Cellular IRIX's overall long-term design direction. Describes intermediate deliverables needed to meet the short-term needs of the Enterprise and HPC computer markets.
- Cellular IRIX Documentation Navigator
- Common Operating System Plan for SN1 (28Jun96)
  Discusses OS direction for SN1 and transition from SN0 and T3E in support of SN1 Common OS plan.
- Cellular IRIX and Nexus OS Infrastructure
- Presentations
- Nexus Architecture and Infrastructure
- Cellular IRIX Subsystems
- Cellular IRIX Related Lego Design Documents
- Cellular IRIX Project Planning and Product Specification page (03feb97)
- Cellular IRIX 6.4 Technical Report

## Mail & Newsgroups

MAILMAN: web interface to the Majordomo mailing list manager.

| Newsgroup | Name | Actions |
|-------------------------------------|----------------------------------------|-------------------------------------------------------------------------------------------|
| SN0 Applications/Performance<br>SN0 OS | snappl.engr.sgi.com<br>sn0.csd.sgi.com | Subscribe, Unsubscribe |
| TechMail Archives | see TechMail Overview | Subscribe, Post, Unsubscribe<br>Sort Archive by Group/Month<br>Search Archive by string |
| CPS Majordomo News Groups | Quick Reference | List all, List subscribed |

## **Performance**

- Availability Monitor (AvailMon and IRS Audit/KPM) home page and training slides
- Miser: user-level program that generates a non-conflicting schedule of jobs with known time and space requirements.
- OPET: O2000 Performance Evaluation And Tuning class
- Origin2000/Origin200/Onyx2 Quick Reference
- Single-Processor Tuning (PostScript | PDF)
- Origin 2000 Performance Report (20May97: PostScript, HTML, Frame) and slides
- Origin 200 Performance Report (17mar97: PostScript, HTML, Frame)
- Performance analysis tools
  - Process Activity Reporter (par) (par for dummies)
  - Kernel function profiling (prfpr)
  - System monitoring: gr\_osview(1), sar(1), osview(1)
- Performance Tuning Optimization for Origin2000 and Onyx2 (007-3430-001) (summary | manual | glossary)

## **Performance Co-Pilot (PCP)**

PCP provides a range of services designed to help monitor and manage system performance.
- Origin Topology and Monitoring
- PCP engineering and marketing home pages
- PCP is developed and maintained within the EBU Performance Tools Group (PTG)
- PTG's projects and future PCP product releases.
- Installation, Licenses and Test Drive Instructions

## **Application Programming**

- Compiler Group:
- Mongoose Compiler 7.2 Project
- Project Caribou
- Compiler-related environment variables
- Irix 6.5 pthreads
- CPS: Support Planning Operations (SPO) 7.2 Compiler status
  Schedule and Dependencies, Features, Pre-Release Test Program, Customer Impact, Manufacturing Release (MR) Status
- SGI/CRAY Supercomputing Application Programming Interface (API) (004-2211-001: HTML | other)
  Describes the supercomputing API. Defines programming environment elements including compilers, language libraries, system libraries, environment variables, system calls, and a few associated utilities.
- PIPELINE ARTICLE 19960405: Programming Tips for IRIX
  Contains information on useful IRIX system utilities, programming interfaces, and large memory allocation troubleshooting techniques that may be of value to developers writing IRIX applications.
- Topics and IRIX Programming manual
- Programming on Silicon Graphics Computer Systems: An Overview manual
- Power Fortran Accelerator User's Guide manual
- Origin and Onyx2 Programmer's Reference Manual (007-3410-001)
  Describes memory maps and the physical and virtual address spaces, including Hub Special, I/O, Memory Special, and Uncached spaces.
- NCSA & Boston University's Silicon Graphics Origin2000 Supercomputer Repository
  NCSA and Boston U jointly announced this web site at the CUG Origin 2000 in Minneapolis Oct '97. It's intended to be a public repository for Origin2000 information, and a catalyst for discussion.
- gobo: Benchmarks Links
- Links to National Computational Science Alliance and other Origin2000 Sites
- Partial List of Origin 2000 Scientific Applications (Scalability & Performance charts)
- LANL's Preliminary Performance Study of the SGI Origin 2000
- Performance of Fortran 90 Array Intrinsic Functions on the SGI Origin2000

#### **Hardware Reference Materials**

- High End Engineering MFG Test Engineering pages contain very useful HW reference materials.
- Info Tools for High End Production
- Technical Overview of the Origin Family
  - Introduction
  - Origin2000 Components
  - What Makes Origin2000 Different
  - Scalability and Modularity
  - Systems Interconnections
  - Crossbar
  - Distributed Shared Address Space (Memory and I/O)
  - System Bandwidth
- CRAY Origin 2000 64 Processor Beta Information
- Origin 2000 system part numbers, descriptions, and quantities
- Lego Design Document Index
- USFO Sales Tools listed and documented
- Origin & Onyx2 World-Wide Service Support Tools project page and tool descriptions
- Remote Access Tool (RAT) User's Guide for the curses tool that talks to the System Controllers.
- Origin & Onyx2 Theory of Operations Manual (007-3439-001): TOC, architecture overview, boards, ASICs, glossary
- IP27prom Technical Reference Manual
  Covers usage of the IP27prom to boot or debug an Origin2000 system:
  - Module System Controller (MSC) including commands. MSC was formerly known as ELSC (Entry-Level System Controller).
  - Multi-Module System Controller (MMSC) including debug switches. MMSC was formerly known as FFSC (Full-Featured System Controller).
  - CrayLink Interconnect Topology Primer
  - IP27prom Operation including booting
  - IP27prom Command Set
  - IP27prom debugging including LED error codes and log messages

  Includes sufficient background Origin2000 information to minimize the number of documents required to use the IP27prom.
- Multi-Module System Controller (MMSC) commands, security, flashing
- MMSC Firmware Flashing: reloading a PROM's firmware image.
- MIPS R10000 Superscalar Microprocessor
- MIPS Programming Manuals:
  - MIPSpro Compiling and Performance Tuning Guide
  - MIPSpro 64-bit Porting and Transition Guide
  - MIPSpro Assembly Language Programmer's Guide
  - MIPSpro N32 ABI Handbook

#### Other Reference Materials

- CUG Papers, San Jose, May 1997:
  - Examples of Various Approaches to High-Performance Computing through Scalable Systems (CrayRealm)
  - A Comparison of Application Performance Across Cray Product Lines (CrayRealm)
- CPS: Support Planning Operations for O2, Octane, Onyx2, Origin 200, SGI & Cray Origin 2000, TPU, IRIX 6.5
  Project status, planning documentation & minutes, Service Readiness Review (SRR), service requirements, and systems
- IRIX 6.5 Support Readiness Information (Chandler Lai)
- Support Planning
- Presentations done by engineering groups which are planned to be available as videos on demand from servinfo
- CPS IRIX 6.5 (Kudzu) Project
- SSE System Administration class
- SSE Network Administration class
- IRIX 6.5 New Features and Differences class
- Server Central: Origin 2000 & IRIX Product Information
- Advanced Server & Workstation Environments Product MR Status: Trusted Irix, IRIX 6.2 | 6.3 | 6.4 | 6.5, XFS, DMF, DCE/DFS
- Origin 2000 Configuration Guide (PS | PDF), Data Sheet, & Product Guide
- Cray Origin 2000 System Descriptions (PDF)
- Cray Software Engineering Technical Forums: previous and planned
- CRAY Scalable Node and Origin 2000 project home pages and their list of SN-related links.
- IRIX 6.5 Public Technology Focus & Roadmap & Archives
- Kudzu Early Access Delivery List by Linda Conroy
- Kudzu 128P test plan by Bill Roske
- Silicon Sales: Presentations on Demand (POD) Origin 2000 Presentation Overview & Features
- Technical resources referenced in an SGI Pipeline article on Cellular Irix
- The Magic Garden Explained by Bernard Goodheart and James Cox
  Authoritative, in-depth description of the internal workings of, and programmatic interfaces to, the UNIX System V Release 4 OS. Explains various techniques, algorithms, and structures within the UNIX SVR4 kernel.
- Wind River Systems VxWorks R/T OS

Module 2: Cray Origin2000 Architecture
## Cray Origin2000 Architecture

## Cray Origin2000 Architecture Module Overview

This section provides a hardware overview of the Cray Origin2000 architecture. By the end of this section, the student should be able to describe each of the following:

- Hypercube Structure
- Module Architecture
- Router Connections
- Node Board, XBOW, and Router Relationship
- R10000 Chip Architecture
- Cache Memory Systems
- Non-Blocking Cache
- Cray Origin2000 Cache Types

## **CRAY Origin2000 Multirack System**

The CRAY Origin2000 system is a multirack system that can interconnect up to 128 CPUs in 9 racks (8 server racks and 1 CRAY router rack), arranged as 4 cubes of 32 processors each. The router rack interconnects the processors with CrayLink cables.

## **Router and Hypercube Connection**

The above diagram presents a conceptual view of the nine-rack, four-cube, 128-CPU system pictured in the previous illustration. Each "hypercube" has a variety of pathways to connect router vertices (V1, V2, etc.). A "module" is built around a pair of routers (R1 and R2). Each router connects to two node boards, so a module holds four node boards. Each node board (usually) contains two CPUs (Central Processing Units), also known as PEs (Processing Elements).

## Hypercube

The above diagram focuses on a single hypercube and its router connections, as well as the module configuration. Two router vertices and their four nodes are considered a module, so every module contains eight CPUs. Each of the eight router vertices connects to two nodes. Since each node contains two CPUs, a single hypercube contains:

$8 \times 2 \times 2 = 32 \text{ CPUs}$

A four-hypercube configuration contains 128 processors ($4 \times 32$).
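The counts quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are invented for this example and do not correspond to any IRIX interface, and the numbers are the ones given in the text (a fully populated system with two CPUs on every node board).

```python
# Figures taken from the text above; names are hypothetical, for illustration only.
ROUTERS_PER_MODULE = 2      # a module is built around a pair of routers
ROUTERS_PER_HYPERCUBE = 8   # eight router vertices per hypercube
NODES_PER_ROUTER = 2        # each router vertex connects to two node boards
CPUS_PER_NODE = 2           # a node board (usually) carries two CPUs


def cpus_per_module():
    """CPUs in one module: 2 routers x 2 nodes x 2 CPUs."""
    return ROUTERS_PER_MODULE * NODES_PER_ROUTER * CPUS_PER_NODE


def cpus_per_hypercube():
    """CPUs in one hypercube: 8 routers x 2 nodes x 2 CPUs."""
    return ROUTERS_PER_HYPERCUBE * NODES_PER_ROUTER * CPUS_PER_NODE


def cpus_in_system(hypercubes):
    """CPUs in a system built from the given number of hypercubes."""
    return hypercubes * cpus_per_hypercube()


if __name__ == "__main__":
    print("CPUs per module:   ", cpus_per_module())
    print("CPUs per hypercube:", cpus_per_hypercube())
    print("CPUs in 4 cubes:   ", cpus_in_system(4))
```

A node board populated with a single CPU would lower these totals, which is why the text says each node board "usually" contains two.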
## Origin2000 redundant paths

The interconnection fabric provides a minimum of two separate paths between every pair of Origin2000 nodes (and their total of four CPUs). The above diagram illustrates three different paths from node R1 to node R6. This redundancy allows the system to bypass failing routers or broken interconnection fabric links. Each fabric link is additionally protected by a CRC code and a link-level protocol, which retry any corrupted transmissions and provide fault tolerance for transient errors.

#### Module and Node Block Diagram

There are four nodes in a module. Each node contains two CPUs, for a total of eight processors per module. Each Origin2000 node board is a "system on a board".

The Origin2000 has a number of processing nodes linked together by an interconnection fabric. Each processing node contains:

- 1-2 R10000 processors
- A portion of shared memory (64 MB to 4 GB)
- A directory for cache coherence
- Two interfaces:
  - a connection to I/O devices
  - a connection to all the other system nodes through the interconnection fabric.

#### **Node Board Components**

The Origin2000 central node board can be viewed as a system controller from which all other system components radiate. Primary Origin2000 components are as follows:

#### Processor

The Origin2000 system uses the MIPS® R10000, a high-performance 64-bit superscalar processor which supports dynamic scheduling. Some of the important attributes of the R10000 are its large memory address space, together with a capacity for heavy overlapping of memory transactions (up to twelve per processor in Origin2000).

#### Memory

Each node board added to Origin2000 is another independent bank of memory, and each bank is capable of supporting up to 4 GB of memory. Up to 64 nodes can be configured in a system, which implies a maximum memory capacity of 256 GB.
#### I/O Controllers

Origin2000 supports a number of high-speed I/O interfaces, including Fast, Wide SCSI, Fibre Channel, 100BASE-TX, ATM, and HIPPI-Serial. Internally, these controllers are added through XIO cards, which have an embedded PCI-32 or PCI-64 bus. Origin2000 I/O performance is added one bus at a time.

#### CrayLink Interconnect

This is a collection of very high speed links and routers that is responsible for tying together the set of hubs that make up the system. The important attributes of CrayLink Interconnect are its low latency, scalable bandwidth, modularity, and fault tolerance.

#### XIO and Crossbow (XBOW)

These are the internal I/O interfaces originating in each Hub and terminating on the targeted I/O controller. XIO uses the same physical link technology as CrayLink Interconnect, but uses a protocol optimized for I/O traffic. The Crossbow ASIC is a crossbar routing chip responsible for connecting two nodes to up to six I/O controllers.

#### Hub

This ASIC is the distributed shared-memory controller. It is responsible for providing all of the processors and I/O devices with transparent access to all of distributed memory in a cache-coherent manner.

#### Directory Memory

This supplementary memory is controlled by the Hub. The directory on each node keeps information about the cache status of its assigned subset of physical memory. For every block of a node's local memory, the directory records whether that block is held in any processor's primary instruction cache, primary data cache, or secondary cache, and whether the cached data is "dirty" (modified, and needing to be written back to memory) or "clean" (an unchanged copy of the data in memory). This status information is used to provide scalable cache coherence, and to migrate data to a node that accesses it more frequently than the present node.

On architectures with the capacity for 32 or fewer CPUs, the directory memory uses part of the local memory assigned to the node.
On architectures with the capacity for 33 or more CPUs, the directory memory is on a separate Dual In-Line Memory Module (DIMM) plugged into the node board. The main memory DIMM holds 16 bits of directory memory. The extended directory DIMMs hold an additional 32 bits of directory memory.

Each block of memory has a directory entry that indicates the cached state of the block:

- Unowned (uncached): the memory block is not cached anywhere in the system.
- Exclusive: only one readable/writable copy exists in the system.
- Shared: zero or more read-only copies of the memory block may exist in the system. Bit vectors point to any cached location(s) of the memory block.
- Busy states: Busy Shared, Busy Exclusive, Wait. These three transient states handle situations in which multiple requests are pending for a given memory location.
- Poisoned: the page has been migrated to another node. Any access to the directory entry causes a bus error, indicating that the virtual-to-physical address translation in the TLB must be updated.

The Hub ASIC is responsible for determining the state of the memory block during any memory request. The protocol is implemented completely in hardware.

## Node Board, XBOW, and Router Relationship

The Hub, XBOW, and Router are ASICs (Application-Specific Integrated Circuits) which act as switches to provide the interconnectivity of the Origin2000 components.

- The Hub is responsible for providing all of the processors and I/O devices with transparent access to all of the Origin2000 distributed memory.
- The XBOW (Crossbow) is responsible for connecting two nodes to up to six I/O controllers.
- The Router is responsible for connecting a pair of nodes to other node boards on the system.

## MIPS® R10000 Microprocessor (block diagram)

The R10000 Microprocessor implements the MIPS® IV instruction set architecture.
The R10000 Microprocessor delivers performance of 800 MIPS at a frequency of 200 MHz, with a peak data transfer rate of 3.2 GBytes/second to secondary cache.

## More About the MIPS® R10000 Microprocessor

- Instruction prefetch
- Out-of-order execution
- Queuing structures
  - Integer Queue
  - Floating Point Queue
  - Address Queue
- Execution Units
  - Integer ALUs
  - Floating-Point units
  - Load/Store unit and the TLB
- Secondary Cache Controller
- System Interface
- R10000 Branch Unit
  - Branch Instruction Problem
  - Branch Prediction

The contents of the above links can be found in the appendix: "MIPS® R10000 Microprocessor Overview".

## **More About Memory**

#### Cache memory systems

A cache memory system consists of a small amount of fast memory that holds copies of a small section of main memory. Cache memory has much faster access times and can deliver data to the processor at a much higher rate than main memory. On-chip cache memory systems can greatly improve processor performance because they often allow accesses to complete in one cycle.

The on-chip cache holds a subset of the addresses held in the secondary cache. In turn, the secondary cache holds a subset of the addresses in main memory.

The picture above should be corrected to show a secondary cache size ranging from 512 Kbytes to 16 Mbytes.

#### Origin2000 Distributed Shared-Memory (DSM) and I/O

Origin2000 memory is located in a single shared address space but is physically dispersed throughout the system for faster processor access over the interconnection fabric. This differs from former systems, in which memory is centrally located on, and only accessible over, a single shared bus. Page migration hardware moves data into memory closer to a processor that frequently uses it.
This page migration scheme reduces *memory latency*, the time it takes to retrieve data from memory.

Although main memory is distributed, it is universally accessible and shared between all the processors in the system. Similarly, I/O devices are distributed among the nodes, and each device is accessible to every processor in the system.

The Origin2000 divides main memory into two classes: local and remote. Memory on, or assigned to, the same node as the processor is labeled local; all other memory in the system is labeled remote. Despite this distribution, all memory remains globally addressable. To a processor, main memory appears as a single addressable space containing many blocks, or pages.

Each node is allotted a static portion of the address space. This means there is a gap if a node is removed. The illustration below shows an address space in which each node is allocated 4 GB of address space, and Node 2 is removed, leaving a hole in the address space from 4G to 8G.

Address fields shown in the illustration: NODE | BANK | PAGE | OFFSET

## Origin2000 Memory Hierarchy Diagram

## Origin2000 Memory Hierarchy Explanation

Memory in Origin2000 systems is organized into the following hierarchy:

- Processor Registers
- Local Caches
- Memory
- Remote Caches

#### **Processor Registers**

The registers are closest to the processor making the memory request, which is the processor labeled P0 in the diagram. Since registers are physically on the chip, they have the lowest latency; that is, they have the fastest access times.

#### **Local Caches**

The primary and secondary caches located on P0 are shown above (processor P1 has an identical architecture, but is not involved in this scenario). Caches have the next lowest latency after the registers, since they are either on the R10000 chip (primary cache) or tightly coupled to the processor on a daughterboard (secondary cache).
Each CPU has a primary instruction cache, a primary data cache, and a secondary cache which is used to hold both instructions and data.

#### Memory

Memory can be either local or remote. The access is **local** if the memory reference is to an address in the portion of the memory space assigned to the node the processor is on. The access is **remote** if the memory reference is to anywhere else in memory, all of which has been assigned to other nodes.

In the diagram, local memory is the section of main memory assigned to Node 1, which means this area of memory is local to Processor 0 (and Processor 1).

#### Remote caches

Remote caches may be holding copies of a given memory block. If the requesting processor is writing, all other cache copies must be invalidated. None of this is a memory latency issue for the processor doing the writing. If the requesting processor is reading, memory latency will be an issue only if some other processor has the most up-to-date copy of the requested location. If this is true, then that other processor's cached copy of the information must first be written back to memory before the requesting processor can access that information for reading.

In the diagram, the blocks labeled "cache" on Nodes 2 and 3 are remote to Node 1 (as are all the rest of the caches on any module in the machine).

Caches are used to reduce the amount of time it takes to access memory (also known as a memory's latency) by moving faster memory physically close to, or even onto, the processor.

While data only exists in either local or remote memory, copies of the data can exist in various processor caches. Keeping these copies consistent is the responsibility of the logic of the various hubs. This logic is collectively referred to as a cache-coherence protocol.
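The read and write rules described above can be made concrete with a toy model of one directory entry. This is a deliberately simplified sketch, not the Hub's actual protocol: the class name, method names, and the `writebacks` counter are invented for illustration, only the Unowned, Shared, and Exclusive states are modeled, and the transient Busy states and the Poisoned state are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Toy directory entry for one memory block (illustrative names only)."""
    state: str = "Unowned"                      # "Unowned", "Shared", or "Exclusive"
    holders: set = field(default_factory=set)   # CPUs holding a cached copy
    writebacks: int = 0                         # times an owner's copy had to be written back

    def read(self, cpu):
        """A CPU asks for a readable copy of the block."""
        if self.state == "Exclusive" and cpu not in self.holders:
            # Another processor may hold the most up-to-date copy, so it is
            # written back to memory before the reader gets the data.
            self.writebacks += 1
            self.state = "Shared"
        elif self.state == "Unowned":
            self.state = "Shared"
        self.holders.add(cpu)

    def write(self, cpu):
        """A CPU asks for a writable copy; returns the CPUs that get invalidated.

        This sketch tracks state only; it does not model the transfer of a
        dirty copy from a previous owner to the writer.
        """
        invalidated = self.holders - {cpu}
        self.holders = {cpu}
        self.state = "Exclusive"
        return invalidated


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.read(0)
    entry.read(1)
    print(entry.state, sorted(entry.holders))
    print("invalidated:", sorted(entry.write(0)))
    entry.read(2)
    print(entry.state, sorted(entry.holders), "writebacks:", entry.writebacks)
```

The set of holders stands in for the bit vectors the directory uses to locate cached copies; a real entry is a fixed-width hardware field rather than a Python set.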
#### **More About Cache**

#### **Non-Blocking Cache**

In a typical implementation, the processor executes out of the cache until a cache miss is taken. A number of cycles elapse before data is returned to the processor and placed in the on-chip cache, allowing execution to resume. This type of implementation is referred to as a blocking cache because the cache cannot be accessed again until the cache miss is resolved.

Non-blocking caches allow subsequent cache accesses to continue even though a cache miss has occurred. Detecting cache misses as early as possible and performing the steps required to resolve them is crucial to overall cache system performance.

The major advantage of a non-blocking cache is the ability to stack memory references by queuing up multiple cache misses and servicing them simultaneously. The sooner the hardware can begin servicing the cache miss, the sooner data can be returned.

#### **Cache Types**

- Primary Data Cache
- Primary Instruction Cache
- Secondary Cache (For Both Data and Instructions)

The contents of the primary caches for data and instructions are a subset of the larger secondary cache, which can contain both. All three caches use a least-recently-used (LRU) replacement algorithm.

## **Primary Data Cache**

The primary data cache of the R10000 microprocessor is 32 Kbytes in size and is arranged as two identical 16-Kbyte banks. The cache is two-way interleaved, allowing memory accesses to be overlapped. Each of the two banks is two-way set associative (that is, two cache blocks are assigned to each set). The cache line size is 32 bytes, a fixed block size of 8 words.

The data cache uses a write-back protocol, which means a cache store writes data into the cache instead of writing it directly to memory. Sometime later this data is independently written to memory.
Write back from the primary data cache goes to the secondary cache, and write back from the secondary cache goes to main memory, through the system interface. The primary data cache is written back to the secondary cache before the secondary cache is written back to the system interface.

The data cache is indexed with a virtual address and tagged with a physical address.

Each primary cache block is in one of the following four states:

- Invalid
- CleanExclusive
- DirtyExclusive
- Shared

A primary data cache block is said to be Inconsistent when the data in the primary cache has been modified from the corresponding data in the secondary cache. The primary data cache is maintained as a subset of the secondary cache, where the state of a block in the primary data cache always matches the state of the corresponding block in the secondary cache.

A data cache block can be changed from one state to another as a result of any one of the following events:

- primary data cache read/write miss
- primary data cache write hit
- subset enforcement
- a CACHE instruction
- external intervention shared request
- intervention exclusive request
- invalidate request

#### **Primary Instruction Cache**

The instruction cache is 32 Kbytes in size and is two-way set associative. Instructions are partially decoded before being placed in the instruction cache. Four extra bits are appended to each instruction to identify which execution unit the instruction will be dispatched to. The instruction cache line size is 64 bytes, a fixed block size of 16 words.

The instruction cache is indexed with a virtual address and tagged with a physical address.
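The sizes quoted for the two primary caches determine how many sets each has and how many address bits select a byte within a line and a set within the cache. The sketch below works that out; the helper name is invented for this example, a 4-byte word is assumed (consistent with "32 bytes = 8 words" above), and the predecode bits stored with each instruction are ignored.

```python
WORD_BYTES = 4  # assumption: 4-byte words, matching "32-byte line = 8 words" in the text


def cache_geometry(size_bytes, ways, line_bytes):
    """Derive set count and address-bit split for a set-associative cache.

    Assumes size, associativity, and line size are all powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "words_per_line": line_bytes // WORD_BYTES,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


# Figures from the text: both primary caches are 32 Kbytes and two-way set associative.
dcache = cache_geometry(32 * 1024, ways=2, line_bytes=32)
icache = cache_geometry(32 * 1024, ways=2, line_bytes=64)

if __name__ == "__main__":
    print("primary data cache:       ", dcache)
    print("primary instruction cache:", icache)
```

In both cases the index and offset bits add up to 14, so each way of either cache spans 16 Kbytes.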
Each instruction cache block is in one of the following two states:

- Invalid
- Valid

An instruction cache block can be changed from one state to the other as a result of any one of the following events:

- a primary instruction cache read miss
- subset property enforcement
- any of various CACHE instructions
- external intervention exclusive and invalidate requests

#### **Secondary Cache (for Data and Instructions)**

The R10000 processor must have an external secondary cache, ranging in size from 512 Kbytes to 16 Mbytes, in powers of 2, as set by the SCSize mode bit. The SCBlkSize mode bit selects a block size of either 16 or 32 words; that is, the secondary cache line size is programmable at either 64 or 128 bytes.

The secondary cache interface of the R10000 microprocessor provides a 128-bit data bus which can operate at a maximum of 200 MHz, yielding a peak data transfer rate of 3.2 GBytes/second (16 bytes per transfer at 200 million transfers per second).

The secondary cache is two-way set associative (that is, two cache blocks are assigned to each set).

Each secondary cache block is in one of the following four states:

- Invalid
- CleanExclusive
- DirtyExclusive
- Shared

A secondary cache block can be changed from one state to another as a result of any of the following events:

- primary cache read/write miss
- primary cache write hit to a Shared or CleanExclusive block
- secondary cache read miss
- secondary cache write hit to a Shared or CleanExclusive block
- a CACHE instruction
- external intervention shared request
- intervention exclusive request
- invalidate request

## **Determining What Hardware the System is Running**

## hinv - Hardware inventory command

The "hinv" command displays the contents of the system hardware inventory table. This table is created each time the system is booted and contains entries describing various pieces of hardware in the system. The items in the table include main memory size, cache sizes, floating point unit, and disk drives.
Without arguments, the hinv command displays a one-line description of each entry in the table.

## **Determining What Memory Looks Like**

The following are useful for determining how memory is being used: how much is "userland", how much is "systemland", how much was the default allocation for cache, how much cache there is now, what is in it, and so on. Do a "man" or see the web pages for these:

- ps
- top
- gr\_top
- osview (/usr/sbin/osview)
- gr\_osview
- gmemusage (originally called "bloatview")

# Memory and Addressing from a Hardware Perspective

IRIX systems are *virtual memory* systems; that is, an entire process need *not* be completely in memory to run. As a result, there are special hardware and software considerations involved in translating and calculating a process's actual address in physical memory, and in determining whether the requested address is something which needs to be brought in from disk before the process can continue.

This section provides information about how the Cray Origin2000 hardware references memory and handles addressing. By the end of this section, the student should be able to:

- Explain how virtual memory systems differ from physical memory systems
- Explain the concept of memory divided into pages
- Describe the function of the TLB
- Describe the sequence of events involved in satisfying a memory request
- Describe the four important memory segment types for 64-bit architectures
- Interpret the segment type from a virtual address

#### HARDWARE MEMORY

#### Pages and TLBs

## **Introductory Concepts About Pages**

- Memory is managed in pages

  With IRIX, memory is managed in amounts called "pages". Pages are typically 16 Kbytes in size, although the size can vary. To a processor, main memory appears as a single addressable space containing many pages.
• IRIX is a virtual memory operating system

While there are only so many physical pages of memory on a machine, the IRIX operating system uses "virtual memory", which allows the memory requirements of all the processes on the machine to add up to more pages than the physical machine contains.

• A process does not need all of its pages in physical memory

Each process is assigned a range of *virtual* addresses, some of which are mapped to physical memory pages with actual data when the process is first created. The system requires only those pages a process is actually referencing to be physically present in memory. Unreferenced pages can remain as "virtual" addresses, that is, no physical page of memory has been allocated for the page or contains its data.

• Process pages do not need to be contiguous in physical memory

If the process never references a virtual page, then a physical page will never be assigned for it. If and when a process needs to reference a "virtual" page, that page will be mapped to a physical page and the data will be brought into main memory. Although the physical pages of a process do not need to be contiguous in physical memory, the operating system organizes the virtual process addresses into a contiguous virtual process image.

## **Introductory Concepts About the TLB**

• "TLB" - Translation Lookaside Buffer

The Cray Origin2000 R10000 MIPS processor chip hardware contains an array of 128 Translation Lookaside Buffer (TLB) entries. The R4000 and R5000 chips hold 64 TLB register entries.

• The TLB translates virtual addresses to physical addresses

The function of the TLB is to translate virtual addresses to physical addresses. The TLB is, in effect, a cache of address translations. The "data" cached by each TLB entry is the physical page number and page access permissions that match a particular virtual page address.

#### "TLB Hit" - the physical page reference is in the TLB already

When a processor wants to reference one of a process's virtual addresses, it looks first in its TLB to find which physical page matches the virtual address reference. If the process has already referenced this virtual page, and the matching physical page reference is still loaded in the TLB, then the physical address within this page can be calculated immediately, and the data can be accessed quickly and passed along to the CPU's secondary and primary caches.

#### "TLB Miss" - virtual page reference does not match a TLB entry

When a processor wants to reference a process's virtual address and does not find a matching TLB entry, the CPU takes an exception and switches to executing operating system kernel code. The kernel checks whether there is an existing physical page, loaded with process data, which matches the virtual address.

◦ If the physical page is currently loaded in memory...

The kernel does the calculations to translate the virtual process address to a physical memory page. If there is a valid translation, the kernel loads a TLB entry to describe that page, and restarts the instruction. This time, when the CPU tries to match the process's virtual address request with a TLB entry, the process gets a "TLB hit", and is able to reference the requested address.

◦ If the physical page is NOT currently loaded in memory...

In this case the kernel determines that there is no physical page, loaded with process data, that matches the virtual address the process now wants to reference. This is called a page fault. At this point, the kernel must calculate the disk location of the page containing the address the process wants to reference, and then move that page of disk information into a physical memory page. Once this is accomplished, the kernel can calculate a valid virtual-to-physical address translation, and load the TLB with a description of that page.
When the process instruction is restarted, the process (finally) gets a "TLB hit", and is able to reference the requested address.

#### **Memory Management Philosophies**

#### Real Memory Machines and Swapping

One of the major concerns for the operating system is how it manages the finite amount of physical memory installed in the system hardware. The aggregate amount of memory needed by all active processes on the system is constantly changing and generally is far greater than available physical memory.

On "real" memory machines, a process must be entirely located in physical machine memory in order to run. Earlier versions of UNIX used a method called *swapping* to manage main memory. With this method, whole processes were swapped from memory to disk to make room for other processes that needed to run.

Swapping was done by a special process called the swapper or sched (short for scheduler), which always had PID 0. "Swapper" is still the first process created on most UNIX-based systems. When the system first comes up, the first process, PID 0, does a lot of system initialization. When process 0 is finished, it renames itself "sched" and jumps into a loop of code which is the process swapping routine. This routine sleeps, and wakes up when there is work to do. In IRIX 6.5, the swapper is now a service thread.

#### **Virtual Memory Machines and Paging**

UNIX System V Release 4 adopted a concept referred to as *virtual memory*. A virtual machine allows programmers to ignore the physical layout and size of machine memory. A program is written to reference virtual addresses for both instructions and data, thus relieving the programmer from concern as to where things are physically located in memory.

On IRIX, an entire process does *not* have to be completely in memory to run. Instead, only those pieces, or "pages", of the process needed to execute are required to be physically in machine memory.
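The idea that only referenced pages need to be resident can be illustrated with a small Python sketch. This is a toy model, not IRIX code: the `Process` class, its fields, and the reference trace are hypothetical names chosen for illustration, and the model ignores page reclaim entirely.

```python
class Process:
    """Toy process image: pages become resident only when first referenced."""

    def __init__(self, total_pages):
        self.total_pages = total_pages   # size of the virtual image, in pages
        self.resident = set()            # virtual pages backed by a physical page
        self.faults = 0                  # number of first-touch references

    def reference(self, vpn):
        """Touch virtual page number vpn, bringing it in if needed."""
        if not 0 <= vpn < self.total_pages:
            raise ValueError("address outside the process image")
        if vpn not in self.resident:
            self.faults += 1             # page is brought into memory on first use
            self.resident.add(vpn)


# Hypothetical trace: a 1000-page image that only ever touches four pages.
proc = Process(total_pages=1000)
for vpn in [0, 1, 1, 2, 0, 500]:
    proc.reference(vpn)
```

In this trace the image spans 1000 virtual pages, yet only the four pages actually touched ever occupy physical memory.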
Virtual memory systems:

- Give the illusion that there is more memory available than is physically installed on the machine.
- Can run programs that are larger than physical memory.
- Do not require the process to be entirely in physical memory to run.
- Require a translation mechanism to convert virtual memory addresses to physical addresses at run time.

## Where Are the Addresses the Process Isn't Using?

Virtual memory is implemented using a hierarchical-storage scheme, as shown below. The parts of a process which are not being used and which will not fit in memory are held on secondary storage devices such as disk or remote disks accessible over the network.

The subsystems of the kernel and the hardware that cooperate to translate virtual to physical addresses comprise the memory management subsystem. This section focuses on the hardware aspects.

## Memory pages

IRIX implements a memory management architecture based upon pages. The kernel divides all of physical memory into a set of equal-sized blocks called pages. The size of each page is defined by the hardware. For IRIX-based systems a page is a multiple of 4 KB, and typically is 16 KB in size.

The sysconf(1) command, using an argument of either PAGESIZE or PAGE\_SIZE, can be used to display the definition of page size on an IRIX system.

Below is an example of a 64 GB Origin2000 system and a calculation showing the number of memory pages defined on the system. For example, assuming 16 KB pages, 64 GB / 16 KB = 4,194,304 (2<sup>22</sup>) pages.

#### HARDWARE ADDRESSING

## All Addresses = (Page Number + Byte Offset)

Every addressable location in physical memory is contained in a memory page.
Therefore, every memory location (byte) can be addressed by a pair of values: (page number, byte offset on page).

## Cray Origin2000 Memory Hierarchy and Latency

Page migration hardware moves data into memory closer to a processor that frequently uses it, in order to reduce *memory latency*, that is, how long it takes for a processor to access memory contents.

Remember from the Cray Origin2000 Architecture lesson, which contained the memory latency hierarchy explanation and diagram, that memory contents are accessed, in order of increasing memory latency (increasingly slow access), as follows:

- Processor registers for this CPU
- Primary Cache for this CPU
- Secondary Cache for this CPU
- Local Memory (the memory assigned to this node)
- Remote Memory (the rest of the system's memory)
- Remote Caches of other CPUs (the condition of the data in another CPU's cache could require time-consuming operations, such as a write, before this CPU can access the data)

Much of the rest of this section gives an overview of how an address is found.

#### **Address Request Sequence**

A CPU examines an instruction, and isolates that part of it which represents the address of the page, and the offset into that page, of the data that the CPU needs. This address might be something like the address of an instruction to fetch, or the address of an operand of an instruction. Then the CPU goes through the following steps in order to find that address.

1. The virtual address of the needed data is formed in the processor execution or instruction-fetch unit. Most addresses are then mapped from virtual to real through the Translation Lookaside Buffer (TLB). This process may have had a "TLB miss" if the virtual-to-physical mapping was not already in the TLB.
At that point, the CPU had to switch into kernel context in order to determine the physical address and then load it into the TLB. One way or another, at this point the TLB has a virtual-to-physical address mapping of the address the process wants, and the CPU "knows" what physical page of memory it must access.
2. Most addresses are presented to the primary instruction or primary data caches, depending on what is being addressed. These caches are in the processor chip. If a copy of the data with that address is found, it is returned immediately.
3. When the primary cache does not contain the data, the address is presented to the secondary cache, which is used to hold both data and instructions. If the secondary cache contains a copy of the data, the data is returned immediately.
4. When the secondary cache does not contain the data, the physical address reference is placed on the system bus and handed over to the HUB chip. The HUB knows which areas of memory have been assigned to which nodes, which area of memory has been assigned as "local" to this node, and which nodes are attached to which router connections. The HUB acts as a switch, and directs the request either to this node's local memory, or to whatever remote memory address is appropriate.
5. When the HUB chip recognizes that local memory does not contain the data, the address passes out through the "connection fabric", that is, through router connections to other nodes on this or other hypercubes in the system, to a memory module in another node, from which the data is returned.

#### **TLB Misses**

Each TLB entry on an R10000 chip describes two pages whose virtual addresses are adjacent, and maps each to the actual physical page of real memory containing the data. Remember that, although the virtual addresses are contiguous, the two corresponding physical pages probably are not.
On an R4000 or R5000 chip, a TLB entry describes only a single page of memory.

When a CPU tries to execute something relating to a process's address, it is using a virtual address. If the address falls in a page described by a TLB entry, the TLB supplies the physical memory address for that page. The translated address, now physical instead of virtual, is passed on to the secondary cache.

When the input address is not covered by any active TLB entry, the MIPS processor generates a "TLB miss" exception, which means that the CPU stops executing user code, and changes its context to execute IRIX kernel code, in order to handle this TLB-related situation.

The kernel inspects the address. When the address has a valid translation to some page in the address space, the kernel loads a TLB entry to describe that page, and restarts the original instruction.

## Two Types of "TLB Miss"

There are two kinds of "TLB miss" situations. In the first kind, the CPU examines the TLB and does not find a physical page reference because the page has never been loaded into memory to begin with. In the second kind, the page is in physical memory, but is not in the TLB for some reason. For example, the reference may have been loaded in the TLB earlier, but eventually stopped being referenced, aged, and was overwritten.

The TLB is hardware. A TLB miss is handled in software. There is more detail on how the kernel handles each of these two TLB miss situations in a later section.

#### **TLB Size**

The size of the TLB is important for performance. As long as the CPU finds virtual-to-physical address mappings readily available in the TLB, the process can continue to execute (until some other event forces a context switch).

The TLB associated with the R10000 chip holds 128 entries. The TLB associated with the R4000 and R5000 chips holds 64 entries.
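The hit path, the two kinds of miss, and the effect of a small TLB can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the IRIX or MIPS implementation: it assumes 16 KB pages, one page per TLB entry, and a first-in-first-out eviction policy chosen only to keep the example deterministic. The `ToyTLB` class, its counters, and the free-page numbering starting at 1000 are hypothetical.

```python
from collections import OrderedDict

PAGE_SHIFT = 14                        # assumption: 16 KB pages -> 14 offset bits
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class ToyTLB:
    """Bounded translation cache with software refill on a miss."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # vpn -> pfn for pages resident in memory
        self.entries = OrderedDict()   # vpn -> pfn, oldest entry first
        self.hits = self.refills = self.page_faults = 0
        self.next_free_pfn = 1000      # hypothetical free-page allocator

    def translate(self, vaddr):
        """Return the physical address for vaddr, handling misses in 'software'."""
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
        if vpn in self.entries:
            self.hits += 1             # TLB hit: no kernel involvement
        else:
            self._handle_miss(vpn)     # TLB miss: the "kernel" takes over
        return (self.entries[vpn] << PAGE_SHIFT) | offset

    def _handle_miss(self, vpn):
        if vpn not in self.page_table:
            # First kind of miss: page not in memory, so this is a page fault.
            self.page_faults += 1
            self.page_table[vpn] = self.next_free_pfn
            self.next_free_pfn += 1
        # Second kind of miss (or page now resident): load a TLB entry.
        self.refills += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry (FIFO)
        self.entries[vpn] = self.page_table[vpn]


# Hypothetical trace: a two-entry TLB, with pages 0 and 1 already in memory.
tlb = ToyTLB(capacity=2, page_table={0: 10, 1: 11})
paddrs = [tlb.translate((vpn << PAGE_SHIFT) | 0x20) for vpn in [0, 1, 0, 2, 0]]
```

With only two entries, the final reference to page 0 misses again even though the page never left memory, which is the second kind of miss described above. A larger TLB would have kept the entry and avoided the refill.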
## Coprocessor 0 and the TLB

Coprocessors are alternate execution units, with register files separate from the CPU. The MIPS architecture provides an abstraction for up to 4 coprocessor units, numbered 0 to 3. Each architecture level defines some of these coprocessors. Coprocessor 1 is used for the floating-point unit. Coprocessor 0 is always used for system control. Other coprocessors are architecturally valid, but do not have a reserved use. Some coprocessors are not defined and their opcodes are either reserved or used for other purposes.

Many of the coprocessor 0 registers are related to the TLB and exception processing, as shown below.

| Register No. | Register Name | Description |
|--------------|---------------|----------------------------------------------------------------------------------|
| 0 | Index | Programmable register to select TLB entry for reading or writing |
| 1 | Random | Pseudo-random counter for TLB replacement |
| 2 | EntryLo0 | Low half of TLB entry for even VPN (physical page number) |
| 3 | EntryLo1 | Low half of TLB entry for odd VPN (physical page number) |
| 4 | Context | Pointer to kernel virtual PTE table in 32-bit addressing mode |
| 5 | PageMask | Mask that sets the TLB page size |
| 6 | Wired | Number of wired TLB entries (lowest TLB entries not used for random replacement) |
| 7 | Undefined | Undefined |
| 8 | BadVAddr | Bad virtual address |
| 9 | Count | Timer count |
| 10 | EntryHi | High half of TLB entry (virtual page number and ASID) |
| 11 | Compare | Timer compare |
| 12 | Status | Processor Status Register |
| 13 | Cause | Cause of the last exception taken |
| 14 | EPC | Exception Program Counter |
| 15 | PRId | Processor Revision Identifier |
| 16 | Config | Configuration Register (secondary cache size, etc.) |
| 17 | LLAddr | Load Linked memory address |
| 18 | WatchLo | Memory reference trap address (low bits Adr[31:3]) |
| 19 | WatchHi | Memory reference trap address (high bits Adr[39:32]) |
| 20 | XContext | Pointer to kernel virtual PTE table in 64-bit addressing mode |
| 21 | FrameMask | Mask the physical addresses of entries which are written into the TLB |
| 22 | BrDiag | Branch Diagnostic register |
| 23 | Undefined | Undefined |
| 24 | Undefined | Undefined |
| 25 | PC | Performance Counters |
| 26 | ECC | Secondary cache ECC and primary cache parity |
| 27 | CacheErr | Cache Error and Status register |
| 28 | TagLo | Cache Tag register - low bits |
| 29 | TagHi | Cache Tag register - high bits |
| 30 | ErrorEPC | Error Exception Program Counter |

## Binary, Hexadecimal, and Decimal Address Conversions

Below is a brief reminder of the bit pattern significance for hexadecimal addressing.

| Binary | Hex | Decimal |
|--------|-----|---------|
| 0000 | 0 | 0 |
| 0001 | 1 | 1 |
| 0010 | 2 | 2 |
| 0011 | 3 | 3 |
| 0100 | 4 | 4 |
| 0101 | 5 | 5 |
| 0110 | 6 | 6 |
| 0111 | 7 | 7 |
| 1000 | 8 | 8 |
| 1001 | 9 | 9 |
| 1010 | A | 10 |
| 1011 | B | 11 |
| 1100 | C | 12 |
| 1101 | D | 13 |
| 1110 | E | 14 |
| 1111 | F | 15 |

## The 64-Bit Address Space and "Segments"

The 64-bit mode is an upward extension of 32-bit mode. All MIPS processors from the R4000 on support 64-bit mode.

There are four bits to one nibble, and two nibbles to one (eight-bit) byte. There are eight bytes to one (sixty-four bit) word.

The MIPS hardware divides the address space of system memory into *segments* based on the most significant bits, and treats each segment differently.
The ranges are shown graphically, below. These major segments define only a fraction of the 64-bit space. Most of the possible addresses are undefined and cause an addressing exception (segmentation fault) if used.

When operating in 64-bit mode, the MIPS architecture uses addresses that are 64-bit unsigned integers from (hexadecimal) 0x0000 0000 0000 0000 to 0xFFFF FFFF FFFF FFFF. This is an immense span of numbers: if it were drawn to a scale of 1 millimeter per terabyte, the drawing would be 16.8 kilometers long (just over 10 miles).

## **Illustrations of Segment Types**

Both of the next two illustrations show memory divided into the various segment types. The illustration on the left shows a better representation of the segment types. The illustration on the right gives a better representation of which hexadecimal addresses map to those segment types.

These illustrations are a good starting point for understanding the different segment types, but some of what is shown is somewhat confusing. See the explanation of segment types and characteristics, which follows.

Below is a simplified version of the earlier pair of illustrations, which summarizes the possible memory segment types for a Cray Origin2000 architecture.

#### **Segment Characteristics**

There are a number of different segment types shown in the previous illustrations. These segments differ, depending on two major characteristics:

- whether or not the address must be translated, or "mapped", from a virtual reference to a physical memory reference by the translation lookaside buffer (TLB).
- whether an address can be accessed when the CPU is operating in user mode or in kernel mode.

And there is an additional difference which is not segment-specific, but address-specific:

• whether this particular address will be cached or not.
For all segment types, each address is potentially cacheable, so whether a particular address is actually cached must be checked address by address. This last difference does not distinguish segment types. It is a distinction about the handling of each specific address, regardless of segment type. Whether to cache an address or not is determined by bit settings (explained below).

#### **Segment Types Overview**

The simplified diagram of memory segment areas attempts to display these essential concepts:

- 64-bit machines are compatible with 32-bit machines
  - (probably) ignore these segments
- Memory access is divided into kernel and user areas, based on CPU mode
  - Type is determined by the high-order bit (bit 63)
    - 0 = user
    - 1 = kernel
- Four possible areas (segment types)
  - Ignore supervisory mode related references [xksseg]
- User area:
  - xkuseg - virtual user memory
    - Virtual to physical address translation done through TLB ("mapped")
    - High order bits 63:56 = "00"
    - Might be cached
- Kernel areas:
  - xkseg - virtual kernel memory
    - Virtual to physical address translation done through TLB ("mapped")
    - High order bits 63:56 = "C0"
    - Might be cached
  - xkphys - physical kernel memory
    - Low-order 44 bits used as direct physical address ("unmapped") (no TLB)
    - Six subdivisions based on caching algorithms
      - only two used, ignore the rest
    - If high order byte (bits 63:56) = "A8", xkphys address might be cached
    - If high order byte (bits 63:56) = "96", xkphys address will never be cached

# Table of Cray Origin2000 Segment Types and Characteristics

The table below summarizes the bit patterns, mapping methodology, and caching characteristics of the four memory segment areas of most interest on a 64-bit machine.

The "6" in the uncached xkphys segment high bits "96" is the only part of this segment addressing scheme that is Origin2000 specific.
The interpretation of the bits, and the implementation of these characteristics, is heavily influenced by the CPU and also a little by the general SN0 architecture.

Table 3-0: Segment Types and Characteristics for Cray Origin2000 Architecture

| Space Type | Bit 63 | Bit 62 | Bits 63:56 (hex, binary) | Mapped Address (Through TLB) | User Accessible | Kernel Accessible | Cache Algorithm Determined By |
|------------|--------|--------|--------------------------|------------------------------|-----------------|-------------------|-------------------------------|
| xkseg (k2seg)<br>kernel, mapped, poss. cached | 1 (kernel) | 1 (mapped) | C0<br>1100 0000 | YES | NO | YES | TLB |
| xkphys (k0seg)<br>kernel, unmapped, CACHED | 1 (kernel) | 0 (unmapped) | A8<br>1010 1000<br>(bits 61:59 = 101) | NO | NO | YES | Bits 61:59 (bits indicate this address will be cached) |
| xkphys (k1seg)<br>kernel, unmapped, UNcached | 1 (kernel) | 0 (unmapped) | 96<br>1001 0110<br>(bits 61:59 = 010) | NO | NO | YES | Bits 61:59 (bits indicate this address will not be cached) |
| xksseg | UNUSED | | | | | | |
| xkuseg<br>user, mapped, prob. cached | 0 (user) | 0 (mapped) | 00<br>0000 0000 | YES | YES | YES | TLB |

#### 32-Bit Compatibility Areas

You can probably ignore all the 32-bit compatibility addresses on your machine, unless user code is running in that context.

On a 64-bit machine, the beginning and end of memory have areas to make the machine compatible with 32-bit architectures. These areas are listed in the illustrations above as [kseg, kseg0, kseg1, kseg2, and kseg3] in one drawing, and [cksseg, ckseg0, ckseg1, and ckseg3] in the other.

Comparing the two illustrations may lead to some confusion.
In the right-hand illustration, you may notice that while there is a cksseg, ckseg0, ckseg1, and ckseg3 area, there is no "ckseg2" area. This is because the ckseg2 area was split into the ckseg3 and cksseg areas. In the left-hand illustration, you may notice that the areas are numbered sequentially, kseg, kseg0, kseg1, kseg2, and even kseg3, but there is no "ksseg" area, although you may occasionally find it used to refer to "kernel mapped space" (more on "mapped" versus "unmapped" below).

#### Addresses Accessed Based on CPU Mode

The 64-bit compatible memory addresses are divided into kernel areas, which only CPUs running in kernel mode can access, and the user area, which CPUs running in either kernel or user mode can access.

You will probably encounter references to a third mode, *supervisory mode*, as well. This is a privilege level somewhere between user mode and kernel mode. This mode is NOT implemented. You can ignore all references to it in both documentation and in the code (it was easier to leave the code in than to remove it). You can ignore the area of memory addresses, **xksseg**, devoted to it.

There are really only four different types of memory segments you will probably need to know about. These are the three kernel-only address areas, composed of **xkseg** and two subdivisions of the **xkphys** area, and the fourth segment type, composed of the user-specific area **xkuseg**.

## Cray Origin2000 Segment Types

When a processor references an address, it looks at the high bits of the address to determine whether the address falls into user-accessible memory addresses (the high-order bit, bit 63, is a "0"), or kernel-only memory addresses (the high-order bit, bit 63, is a "1").

Most addresses presented to the CPU are virtual addresses and must be "mapped", or translated, through the TLB, into physical memory references.
Some addresses presented to the CPU are used as "direct", or "unmapped", references to a physical location, that is, the address is not translated through the TLB, but is interpreted instead as a direct reference to a physical area of memory.

## Interpreting the Segment Type From the Virtual Address

Each 64-bit address value is treated as shown below. The two most significant bits select the major segment. The xkuseg, xkphys, and xkseg segment types are discussed below. The xksseg segment is not utilized, so it is ignored.

The size of a page of virtual memory can vary from system to system and release to release, so always determine it dynamically. In a user-level program, call the <code>getpagesize()</code> function (see the <code>getpagesize(2)</code> reference page). In a kernel-level driver, use the <code>ptob()</code> kernel function (see the <code>ptob(D3)</code> reference page) or the constant NBPP (Number of Bytes Per virtual Page) declared in /usr/include/sys/immu.h.

When the page size is 16 KB, bits 13:0 of the address represent the offset within the page, and bits 39:14 select a Virtual Page Number (VPN) from the 2<sup>26</sup>, or 64 M, pages in the virtual segment.

## **User Address Area Segment**

xkuseg - Virtual User Memory - mapped, probably cached

#### • Distinguishing Bit Pattern?

If the high-order bit of an address, bit 63, is a 0, then the address refers to the user memory segment. In fact, the upper two "nibbles" (i.e., the upper 8 bits, 4 bits per nibble, the same bits as the upper byte) are always all zeros for user address references. (It may actually be the case that bits 62:56 could be something other than 0. Because that would be such a large virtual address value, this appears never to have been tested.)

#### • Accessible to Which CPU Modes?

These addresses are the only ones CPUs in user mode can access. CPUs in kernel mode can access these areas, as well as the kernel-only memory addresses.
The xkuseg area, and the xkseg area (kernel mapped, possibly cached, see below), are treated identically, except that only the kernel can access the xkseg area.

#### • What's it used for?

The xkuseg area is the area devoted to user process space. User address space takes up roughly half of memory (about 16 terabytes).

#### • Mapped or Unmapped?

All user area addresses are considered mapped addresses. This means that the 64-bit address the CPU is examining is a virtual address and cannot be used "as-is" to find an actual physical location. The CPU must go through the TLB in order to translate the address into a real physical memory reference.

The kernel creates a unique address space for each user process. Of the 2<sup>26</sup> possible pages in a process's address space, most are typically unassigned, and many are shared pages of program text from dynamic shared objects (DSOs) that are mapped into the address space of every process that needs them.

The Origin2000 architecture adds the complication that the location of a page, relative to the location where the process executes, has an effect on the performance of the process. The kernel uses a variety of strategies to locate pages of memory in the same node as the CPU that is running the process.

#### • Cached or Uncached?

User area address references are probably going to be cached. This means that CPUs must be concerned with cache coherency issues before loading or storing user area addresses. Any attempt to read that memory location must confirm that there is not a version which was read into cache and changed somewhere else in the machine. Such an occurrence would make the memory version incorrect, since the memory version would not be the most recent version of the address's contents. Any attempt to write to that memory location will make other cached versions on the machine outdated and invalid.
It's possible that a user could access the xkuseg area with uncached references, but the user would have to make special syssgi calls that are only used by SGI diagnostics to get uncached access to memory. Uncached access to the xkuseg segment is allowed by the architecture and CPU, but the kernel chooses not to use the hardware in this way.

## **Kernel Address Area Segments**

For all kernel-only address area segments, the high-order bit (bit 63) is a "1".

There are two major areas of kernel memory, xkseg and xkphys. The kernel distinguishes which of the two segment types an address falls into by examining additional bits in the address.

The second-highest bit, bit 62, determines whether this address reference is to an **xkseg** area (bit 62 is a "1"), or an **xkphys** area (bit 62 is a "0").

The **xkphys** area is further subdivided, based on cache-related issues. Bits 59, 60, and 61 (usually written "61:59") are examined to determine which caching algorithm to apply to memory in an xkphys segment.

There are only three kernel-specific segment types you probably need to know about: xkseg, and two of the subdivisions of xkphys.

#### xkseg - Virtual Kernel Memory - mapped, possibly cached

#### • Distinguishing Bit Pattern?

When bits 63:62 are "11", then the memory accessed is kernel virtual memory. Only code that is part of the kernel can access this space, which is a 2 Terabyte segment starting at 0xC000 0000 0000 0000. All addressing in the xkseg area starts with "C0" in the highest two nibbles (highest byte).

#### • Accessible to Which CPU Modes?

Only CPUs running in kernel context can access memory in the xkseg segment addresses.

#### • What's it used for?

This is the space in which the IRIX kernel allocates such objects as kernel stacks, per-process data that must be accessible on context switches, and user page tables. Certain important data structures may be replicated into each node for faster access.
This segment area is also the space in which kernel-level device drivers allocate memory, including automatic variables declared by loadable device drivers. The stack and data areas used by device drivers are in xkseg.

Since kernel space is mapped, addresses in the xkseg segment that are apparently contiguous need not be contiguous in physical memory. However, a device driver can allocate space that is both logically and physically contiguous, when that is required. A driver has the ability to request memory allocation in a particular node, in order to make sure that data about a device is stored in the same node where the device is attached and where device interrupts are taken.

#### • Mapped or Unmapped?

References to this space are mapped (that is, translated through the TLB) and possibly cached. The kernel uses the TLB to map kernel pages in memory as required, possibly in noncontiguous physical locations.

The kernel passes the address through the TLB, and the TLB examines the virtual address, which it translates to a physical address.

This segment area is treated exactly the same way as the mapped user area (xkuseg) in terms of how the TLB translates virtual to physical address references. The difference is that, although pages in kernel space are mapped, they are always associated with real memory. Kernel pages are never paged to secondary storage.

#### • Cached or Uncached?

The TLB itself has some bits set which define how it will handle cache coherency issues, that is, the appropriate caching algorithm to use is determined by the TLB mapping. Only two caching algorithm choices are used. One is "don't cache it", and the other is "cacheable coherent update on write".

## xkphys - Physical Kernel Memory

xkphys - unmapped, possibly CACHED

xkphys - unmapped, UNcached

#### • Distinguishing Bit Pattern?

When bits 63:62 are "10", then the memory accessed is kernel physical memory allocated in the xkphys segment.
For this particular segment type, three additional bits, bits 61, 60, and 59, are examined to determine whether cache residency is a relevant concern for memory addresses in this segment. There are six possible subdivisions of the xkphys memory area, based on what cache coherency algorithm to use for addresses in these sub-ranges, but only two of the subdivisions (and their caching algorithms) are actually used. See the "Cached or Uncached?" section, below.

#### • Accessible to Which CPU Modes?

Only CPUs running in kernel context can access memory in the xkphys segment addresses.

#### • What's it used for?

CPUs reference xkphys addresses in order to access kernel structures and data that will be needed "briefly", such as proc structures, vnode structures, buf structures, and all kernel dynamic data, all of which is managed by pfdats.

One-quarter of the 64-bit address space, that is, all addresses with bits 63:62 containing a bit pattern of "10", is devoted to special access to one or more 1 TB physical address spaces.

#### • Mapped or Unmapped?

Direct references to this space are "unmapped", that is, the TLB is not involved in calculating the appropriate physical memory address location.

The entire 64 bits are a virtual address that is not really a physical memory address reference, but the processor knows how to decode it into a physical address. The physical address selected is taken directly from the lower 40 bits, bits 39:0, of that part of the word normally interpreted as a virtual page number and offset into the page.

The three high order address bits discussed above, bits 61:59, determine whether this memory reference is cached. Those 3 bits are part of the actual virtual address, which happens to map pretty straightforwardly to a physical address.

Access to addresses whose bits 56:40 are not equal to 0 causes an Address Error exception.

#### • Cached or Uncached?
#### • xkphys - unmapped, possibly CACHED

Addressing in the xkphys area that starts with "A8" in the highest two nibbles (highest byte) has a bit pattern of "1010 1000". The first two bits, "10" (bits 63:62), indicate this is an xkphys segment. The next three bits, "10 1" (bits 61:59), indicate which of the six possible xkphys caching algorithms is to be used, which, in this case, is the "cacheable coherent update on write" algorithm. Again, only two of the six possible algorithms are used.

This particular cached area of the xkphys segment starts at address 0xA800 0000 0000 0000, and goes through 0xAFFF FFFF FFFF FFFF, but it is highly probable you will only see addresses in this area that start with "A8".

#### • xkphys - unmapped, UNcached

Addressing in the xkphys area that starts with "96" in the highest two nibbles (highest byte) has a bit pattern of "1001 0110". The first two bits, "10" (bits 63:62), indicate this is an xkphys segment. The next three bits, "01 0" (bits 61:59), indicate which of the six possible xkphys caching algorithms is to be used, which, in this case, is the bit pattern specifying that addressing in this particular range of xkphys should be "uncached". Again, only two of the six possible algorithms are used.

#### The 64-bit Word and the Virtual Address

1. Of the 64 bits in a virtual address, only 40 bits are actually address values.
2. The high order 24 bits can be considered "mode bits", which contain information about how to interpret the low order bits 39:0.
3.
The NASID (Node Address Space ID) is the "power of two" at which each node's memory begins.

## A Different View of Memory Segments - Diagram

## **Memory Segment Overview**

- The "stacked tuna cans" view of virtual memory is misleading
  - the consecutive numbering across memory segment address ranges is misleading (00...00 -> 96...00 -> A8...00 -> C0...00)
  - the uncached xkphys (96...), cached xkphys (A8...), and xkseg (C0...) memory segments all describe the same range of physical memory pages
  - the xkuseg virtual memory addresses can refer to almost any pages of physical memory
  - the high order "mode bits" of a virtual address word indicate how a physical memory page will be referenced
- Physical memory has address gaps between banks and between nodes
  - these physical addresses do not exist
  - attempts to reference these non-existent addresses will cause errors
- There are 4 virtual memory segment types of primary interest
  - xkphys (cached) (A8...)
  - xkphys (uncached) (96...)
  - xkseg (C0...)
  - xkuseg (00...)

The following illustrations reference a 64-bit architecture.
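The segment prefixes in the overview above (00, 96, A8, C0) can be decoded from the top bits of an address. The following Python sketch is not kernel code; it only models the bit tests this section describes (bits 63:62 for the segment, bits 61:59 for the xkphys caching algorithm, bits 39:0 for the address value). The function name, constant name, and label strings are hypothetical.

```python
# Illustrative model only; none of these names are IRIX kernel identifiers.

ADDR_MASK = (1 << 40) - 1   # bits 39:0 carry the address value


def classify(vaddr):
    """Return (segment label, low 40 address bits) for a 64-bit address."""
    top2 = (vaddr >> 62) & 0x3          # bits 63:62
    low40 = vaddr & ADDR_MASK
    if top2 == 0b11:
        return ("xkseg (mapped kernel)", low40)
    if top2 == 0b10:
        cache_bits = (vaddr >> 59) & 0x7    # bits 61:59
        if cache_bits == 0b101:             # the "A8" prefix
            return ("xkphys (unmapped, cached)", low40)
        if cache_bits == 0b010:             # the "96" prefix
            return ("xkphys (unmapped, uncached)", low40)
        return ("xkphys (caching algorithm not used by the kernel)", low40)
    if top2 == 0b00:
        return ("xkuseg (mapped user)", low40)
    return ("segment not covered in this section", low40)


if __name__ == "__main__":
    for addr in (0x0000000003800123, 0x9600000012345678,
                 0xA800000012345678, 0xC0000FC000007000):
        label, low40 = classify(addr)
        print(f"{addr:016x}  {label:30s}  bits 39:0 = {low40:#x}")
```

For the two xkphys cases the low 40 bits are the physical address itself; for xkseg and xkuseg they still have to be translated through the TLB.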
For all types of memory segment addresses, there are three ranges to be considered:

# 1) Maximum Address the Chip Size Allows

```
64-bit architecture-imposed address limit size
  = all 40 address bits set to 1
  = FFFFFFFFFF
  = 10,000,000,000 hex bytes (1 Terabyte)
```

1) The maximum possible (physical or virtual) address size the chip architecture will allow (that is, if all of the bits the chip uses to reference an address are set to "1")

- of the 64 possible bits, bits 39:0 are used to specify an address
- 40 bits constitute 10 hex characters of address
- an address 10 hex characters long could range from 0000000000 to FFFFFFFFFF

# 2) Highest Address Permitted by Configuration

- largest hardware-legal virtual address = 7FFFFFFF ("K2SIZE")
- software configuration tuning parameter limit: largest software-legal segment size = 11E000 hex pages ("SYSSEGSZ")
- xkuseg site-imposed (hardware or software) limit: rlimit\_vmem\_max + rlimit\_data\_max + rlimit\_stack\_max

2) The highest (physical or virtual) address size that has been configured as a software or hardware limit

- the actual number of banks of memory a site has purchased will limit the maximum legitimate physical address (xkphys), which is a number much smaller than the largest address the chip bit range could actually specify.
- the limit the site configures for total kernel space (xkseg) will be much smaller than what the chip bit range would allow, and probably much smaller than the machine's total range of physical memory pages.
- the user segment (xkuseg) space is limited to site-defined limits to the sum of rlimit\_vmem\_max + rlimit\_data\_max + rlimit\_stack\_max

# 3) Actual Number of Pages In Use

actual number of pages in use = (pfdat "in-use" pages) + (total pages used for the kernel static data)

3) The actual number of pages in use

- "xkphys" addresses
  - not all physical pages of memory will actually be in use at any given time
  - the "pfdat" table is used to manage physical memory, and is composed of two linked lists
    - the linked list of "free" physical memory pages
    - the linked list of "in-use" physical memory pages
  - almost all of the physical pages available on the system are listed in the pfdat tables. The physical pages used for the single 16Mbyte virtual page of kernel static information are not referenced with pfdat structures.
  - the total number of "xkphys" pages actually in use is the sum of: (pfdat "in-use" pages) + (total pages used for the kernel static data)
  - an "xkphys" address that is within the bounds of the chip bit range, and is within the bounds of the range of actually configured memory, is still invalid if the address refers to a physical page not currently in use
- "xkseg" and "xkuseg" addresses must be "mapped" to be valid
  - this means that, at some point, a page of physical memory must have been allocated and matched with that xkseg or xkuseg virtual address, and an entry has been made in a table to reflect this (a "PDE" <Page Descriptor Entry> in a "PTE" <Page Table Entry> table)
  - the only exceptions are the two kernel "wired" TLB (xkseg) entries which are not referenced in the kernel's PTE table
  - an "xkseg" or "xkuseg" address that is within the bounds of the chip bit range, and is within the bounds of the software configuration limits for that kind of address, is still invalid if the address refers to a virtual page which does not have a valid PDE entry, that is, that virtual page has not yet
been assigned a matching physical page of memory (again, with the exception of the special kernel "wired" entries).

## "Unmapped" Virtual Address Segment Types

#### **Unmapped Addresses - xkphys**

- xkphys virtual addresses are considered "unmapped", which means the last 40 bits of the word, bits 39:0, are treated as the reference to a specific physical page, or "PFN" (Physical Frame Number)

## "Mapped" Virtual Address Segment Types

## Mapped Addresses - xkseg and xkuseg

- xkseg and xkuseg virtual addresses are "mapped" addresses, which means the last 40 bits of the word, bits 39:0, must be translated to determine the matching physical page being referenced
- a page of physical memory must have been allocated and matched with that xkseg or xkuseg virtual address, and an entry has been made in a table to reflect this (a "PDE" <Page Descriptor Entry> in a "PTE" <Page Table Entry> table)
- virtual-to-physical address translations that have already been calculated are stored in a CPU's TLB (Translation Lookaside Buffer)
- older TLB entries are overwritten eventually with newer address translations
- a CPU's TLB can be considered a cache of the most recently used virtual-to-physical address translations
- the kernel maintains two special TLB entries that are considered "wired"
  - they are always resident in the TLB
  - they are not referenced in the kernel's PTE table, unlike other virtual addresses
    - these two kernel (xkseg) pages are allocated during system startup
    - the 32Mbytes of physical pages assigned to these two virtual kernel pages never change
    - it is unnecessary to keep a table entry showing what physical pages are currently assigned to these virtual addresses

## xkphys Memory Segments Diagram

## xkphys Memory Segments - Detail

- 1:1 correspondence between xkphys virtual addresses and
address range of physical memory, including "bad" (gaps, nonexistent) physical addresses
- "96" = value of first byte of uncached xkphys memory segment virtual address
- "A8" = value of first byte of cached xkphys memory segment virtual address
- bits 61:59 determine caching algorithm
- xkphys virtual addresses are considered "unmapped", which means the last 40 bits of the word, bits 39:0, are treated as the reference to a specific physical page, or "PFN" (Physical Frame Number)

## xkseg Memory Segment - Introductory Diagram

## **xkseg Memory Segment - Introduction**

- Only a small subset of physical pages are actually used for xkseg type memory references
- xkseg virtual addresses are "mapped", that is, the last 40 bits of the word, bits 39:0, are *not* treated as a direct translation to a physical page of memory
- "C0" = value of first byte of xkseg memory segment virtual address
- xkseg is used for CPU-specific and node-specific data (e.g., kernel tables, each node's copy of IRIX, etc.)

## xkuseg Memory Segment - Introductory Diagram

## **xkuseg Memory Segment - Introduction**

- the xkuseg virtual addresses are "mapped" addresses, that is, the last 40 bits of the word, bits 39:0, are *not* treated as a direct translation to a physical page of memory
- the xkuseg virtual address range covers much less than the total possible range of physical addresses
- the xkuseg memory segment is the size of *one* user process (the maximum permissible process size)
- each CPU uses the entire xkuseg memory segment to refer to a single user process address space
- each CPU refers to the *same* range of xkuseg virtual addresses to describe that CPU's currently connected process (e.g., "the" 10th page of "the" currently connected process)
- the same virtual address maps to different physical pages for each CPU.
- "sparsely populated": only a small subset of physical pages are actually used by a CPU referencing xkuseg virtual addresses

# xkseg - Detail

# xkseg Virtual-to-Physical Address Translation - Diagram

# xkseg Virtual Addresses Mapped through the TLB to Physical Addresses

1. A CPU is presented with an xkseg virtual address
2. The CPU examines its TLB to see if a virtual-to-physical address translation has already been calculated for the page containing the desired address (in this picture, this is the case; there are explanations later in this section for the more complicated cases where PTE tables must be examined to determine the translation)
3. The same virtual address presented to different CPUs can be translated to different physical addresses

# xkseg Wired Kernel TLB Entries - Diagram

# xkseg Wired Kernel TLB Entries

- TLB has first few entries "wired" (preset) and used for kernel references
- system default page size is 4Kbytes
- page size must be a multiple of 4Kbytes
- normal Cray Origin2000 page size is 16Kbytes
- page size is configurable
- special kernel page size set to 16Mbytes (= 400 hex pages of 'normal' pages of 16Kbytes each)
- each R10000 chip TLB register entry contains two virtual-to-physical address translations
- the first "wired" TLB entry contains references to two kernel-sized pages of 16Mbytes each
- a CPU referencing any virtual address within the first 32 megabytes of kernel memory (xkseg mapped "C0" prefix addresses) will always find a TLB entry mapping that virtual address to a physical page
- no delay to handle a "TLB miss"

## **Contents of xkseg Kernel Wired Entries**

- the first wired kernel TLB entry
  - refers to virtual address (C0...)
pointing to physical pages on that node
  - structures and data repeated on all nodes at same virtual addresses (different physical addresses)
- the second wired kernel TLB entry
  - refers to virtual address (C0...) pointing to physical pages on Node 0
  - some structures and data repeated on other nodes, but not at same virtual addresses (or physical addresses)
  - when these structures and data are referred to with xkphys addresses (A8... or 96...), the physical pages referenced are on the same node
- the pfdat table
  - one pfdat structure is used to manage every physical page (1:1) of memory except for
    - those physical pages used on each node to contain the kernel's 16MByte Read-Only data
    - those physical pages used on Node 0 to contain the kernel's 16MByte Read/Write data
  - only those pfdat entries reflecting the physical pages for that node are kept on that node
  - each node has multiple linked lists of collections of contiguous free pages and contiguous in-use pages on that node

#### Detail:

For CPU[4] (on Node 2) to locate an available, or "free", physical page of memory near CPU[1] (on Node 0), CPU[4] walks through the following structures and pointers:

1. CPU[4]
   - looks in CPU[4]'s pdaindr table for the location of CPU[1]'s PDA (the second entry)
   - CPU[1]'s PDA is located on Node 0
2. On Node 0, in CPU[1]'s PDA, find the sub-structure "p\_nodepda" (a structure called "nodepda\_s")
3. In the p\_nodepda structure is another sub-structure, "node\_pg\_data" ("pg\_data\_t")
4. The node\_pg\_data contains a sub-field, "pg\_freelst", which points to a structure called "pg\_free\_s"
5. The pg\_free\_s structure contains a sub-structure "phead" (of type "phead\_t")
   - This sub-structure is made up of an array of "phead" structures
6. Each phead structure itself contains an array of "ph\_list" structures (of type "plist\_t")
7.
The first field of the ph\_list array is "pf\_next", which is a linked list that points to the first set of free pfdats on that node.

# xkuseg - Detail

# xkuseg "TLB Hit" - Diagram

## User Addresses Are (Also) Mapped Through the TLB:

1. A user process xkuseg virtual address (00...) is presented to the CPU

Depending on the architecture, steps 2-4 may be done sequentially, or simultaneously:

2. CPU looks in primary cache for the byte offset into the page
3. CPU looks in secondary cache for the byte offset into the page
4. CPU looks in the TLB for a matching combination of both the virtual page number (VPN) and the Address Space ID (ASID)

# Address Space Identification (ASID) explanation

Each independent task, or process, has a separate address space, which is assigned a unique 8-bit Address Space Identifier (ASID). This identifier is stored with each TLB entry to distinguish between entries loaded for different processes. The ASID allows the processor to move from one process to another (that is, perform a "context switch") without having to invalidate TLB entries.

The processor's current ASID is stored in the low 8 bits of the EntryHi register in coprocessor 0. These bits are also used to load the ASID field of an entry during TLB refill. The ASID field of each TLB entry is compared to the EntryHi register; if the ASIDs are equal (or if the entry is global, which means the entry is valid for all processes), this TLB entry may be used to translate virtual addresses.

## Introduction to User Structures Related to "TLB Miss" (TLB Exception)

- PDA (Private Data Area)
  - Each CPU has its own PDA.
  - Used by each CPU for many things, including to track what context this CPU is in, what thread this CPU is connected to, inter-CPU communication, and much more.
  - See "pda.h" for structure contents.
  - Always appears at the same virtual address in each processor
  - It is one page (4K), and the bottom 1024 bytes are used as a boot/idle stack.
- uthread structure
  - Each process has at least one uthread structure associated with it.
  - The uthread structure is the "starting point" for a CPU to refer to a process. Each CPU's PDA points to its currently connected thread (uthread structure).
  - The uthread structure contains, or points to, many important things, directly or indirectly, including associated vnodes, valid user pages, associated buffers, this process's stack, etc.
  - The uthread at 6.5 is the focal point for referring to a process in the way that the "proc" structure was the focal point in earlier releases of IRIX.
- Segment Table
  - The segment table is used to manage the user process image pages.
  - Each segment table entry is one word long.
  - One (16Kbyte) page of a segment table holds 2048 words
  - The uthread structure points to the first entry in the segment table
  - For 32-bit binaries
    - each one-word entry in the segment table points to the beginning of a page of user PDE entries.
    - one page of (one word each) segment table entries refers to 2048 consecutive pages of user PTE pages (each of which contains 2048 words, referring to 2048 user pages)
  - For 64-bit binaries
    - each one-word entry in the first-level segment table points to the beginning of a page of secondary-level segment table one-word entries.
    - each one-word entry of the secondary-level segment table is used in the same way as the segment table for 32-bit binaries, that is, each secondary-segment table word points to a page of PDE entries, and each PDE entry contains a reference to a user's virtual page (VPN), and the physical page (PFN) (if any) that has been mapped to it.
- PTE Table and PDE entries
  - A PDE is a one-word entry (in the PTE table) which refers to a page of a user process
  - A PDE contains both the reference to the user's virtual page number (VPN), and the mapping to the actual physical page of memory (PFN) the page is using, if any.
  - The contents of a PDE word are used to load a TLB entry.
  - The PDE words in a user's PTE table form a consecutive one-to-one correspondence with that user's virtual user (xkuseg) pages
  - One (16Kbyte) page of user PDE's holds 2048 words, and can therefore refer to 2048 virtual user pages
  - The PTE (Page Table Entry) and PDE (Page Descriptor Entry) structures are "unioned" in the code, that is, both names are used to refer to the same structure (most of the code refers to "pde"s, except for "irix/kern/ml/tlb.s", which refers to "pte"s)

# **TLB Single Miss**

As before:

1. A user process xkuseg virtual address (00...) is presented to the CPU

Depending on the architecture, steps 2-4 may be done sequentially, or simultaneously:

2. CPU looks in primary cache for the byte offset into the page
3. CPU looks in secondary cache for the byte offset into the page
4. CPU looks in the TLB for a matching combination of both the virtual page number (VPN) and the Address Space ID (ASID)

This time, however, this user's VPN is not found in the TLB.

## Overview of Resolving a TLB Single Miss

- While this CPU is executing in user context:
  - The currently connected user process takes up all of the xkuseg memory segment, or "00..." range of addresses.
  - The VPN was extracted from the user's xkuseg ("00...") virtual address.
    - This VPN was presented to this CPU and was not found in this CPU's TLB.
    - This causes a "TLB Miss", or "TLB Exception" in the hardware.
    - This CPU will change context from user mode to kernel mode and enter the kernel routine "utlbmiss".
- While the CPU is executing in kernel context the CPU will begin to execute kernel code to do the following:
  - The kernel will find the base of this particular user's PTE table of PDE's.
    - The base of each user's PTE table is a "well-known kernel address" of c0000fc000000000, which is associated with a kernel variable named "KPTEBASE".
    - Each time a CPU is connected to a user process, this same xkseg (c0...) virtual address is remapped to point to a different physical page in memory, where the currently connected user's PTE table begins.
  - The CPU will calculate the offset from the beginning of the PTE table, "KPTEBASE", that represents the PDE for the user virtual page we want.
    - Each PDE describes a virtual user page (VPN) which may, or may not, be valid.
      - A virtual user page (VPN) is "valid" if it is associated with an actual physical page of memory and that association has been written into the appropriate PDE for that VPN.
    - Each virtual user page (VPN) is represented by a one-word PDE entry.
    - We want the "VPNth word" from the beginning of the user's PTE table.
      - If we are looking for user virtual page 10, then the tenth word of the PTE table will contain the information about whether VPN 10 is mapped to a physical memory page (PFN) or not.
      - If we are looking for user virtual page 3000, then the 3000th word of the PTE table will contain the information about whether VPN 3000 is mapped to a physical memory page or not.
        - On a 64-bit architecture, the default page size is 16 Kbytes, or 2048 words.
        - Each page of the PTE table contains 2048 one-word-long PDE entries.
        - The 3000th entry is located in the second page of PDE's.
        - The second page of PDE's will have a contiguous virtual address ("c0...") after the first page of PDE's. The physical address probably is not contiguous.
It is the contents of the physical page that we must examine to find the 3000th PDE.
    - The result of this calculation, like all virtual addresses, will be an offset from the beginning of some page (page # + offset).
    - The virtual address will be an xkseg ("c0...") address. These addresses are virtual "mapped" addresses, and must be looked up in the TLB in order to determine what physical page actually holds the information.
  - The CPU will extract the virtual page number (VPN) from the ("c0...") xkseg virtual address, and look in the TLB to find what physical page actually holds that set of user PDE entries.
  - The CPU will examine that physical page of memory, find the offset from the beginning of that page that contains the PDE information that is needed, and will then load the contents of that PDE entry into the TLB.
  - The CPU will perform a context switch back to user mode.
- While this CPU is executing in user context:
  - The CPU will re-execute the original user instruction which caused the "TLB Miss".
    - This time, the virtual page number (VPN) and its associated physical page (PFN) are found in the TLB. This is a "TLB Hit".
  - The secondary and primary caches are loaded, and the needed byte is presented to the CPU.
  - The instruction is executed.

## **Detail of Resolving a TLB Single Miss**

#### Calculate the VPN to look for in the TLB

1. A user process is connected to a CPU. On a 64-bit architecture, among other things:
   - The TLB is loaded with a pair of PDE entries, which represent the first and second pages of the user's PTE table. Each of the two xkseg ("c0...") virtual addresses is mapped to its physical page in its respective TLB entry.
2. The CPU is presented with a virtual user xkuseg ("00...") address of "0000000003800123".
3. The CPU looks in the primary and secondary caches and does not find this byte offset of this page. These are "cache misses".
4.
The CPU needs to look in the TLB for the virtual page number (VPN) (and matching ASID).
   - The virtual page number must be calculated from the 64-bit word.
   - The 64-bit word contains both mode bits and address bits.
   - The address bits contain both the page number and the offset into the page.
   - The offset is represented by bits 13:0.
   - Bit 13 falls in the middle of one of the 16 hex digits it takes to represent a 64-bit word.
   - Bits 39:14 contain the virtual page number. Examining these bits shows the VPN is "e00".
5. The CPU looks in the TLB for VPN "e00" (and the ASID that matches this user address space). In this case, the user's VPN "e00" is not found in the TLB.
6. This causes a "TLB Miss" (slang), or (the more technically accurate name) "TLB Exception" in the hardware.
7. The CPU does a context switch to kernel mode and enters kernel code to handle a TLB Exception.
8. Kernel code is performed to:
   - find the beginning of the kernel structure with this user's PDE's.
     - "KPTEBASE", xkseg address "c0000fc000000000", points to this.
   - find the particular one-word PDE entry that contains information about user VPN "e00" and what physical page (PFN) has been assigned to it.

1. Hex math is performed to find the byte offset from "KPTEBASE" (the beginning of the user's PTE table) which contains the PDE word for the requested user VPN.
   - A PDE entry is only one word (8 bytes of 8 bits each = 64 bits) long.
   - A 16Kbyte page of PDE's contains 2048 PDE word entries.
   - We want that page of PDE entries that has the "e00"th word.
   - (VPN) \* 8 = the first byte of the 8-byte PDE word we want.
   - (0xe00) \* 8 = 0x7000
   - The first byte of the PDE word with the information about user VPN e00 starts 0x7000 bytes from the beginning of the PTE table. See next illustration.

Hex math shows which page of PDE's contains byte 0x7000. See next illustration.
```
Hex math formula to calculate which page of PDE's contains
the first byte of the desired PDE:

  0xN  (the hex number of bytes offset from the start of the user's PDE table)
  ----------------------------------------------------------------------------  =  page number
  0x4000  (hex bytes per 16Kbyte page)          (any remainder is the offset into that page)

Examples:

  user VPN   byte offset   division          page   remainder   The desired PDE is in
  e00        0x7000        0x7000 / 0x4000   1      0x3000      PTE page 1, starting at byte 0x3000
  0          0x0           0x0 / 0x4000      0      0x0         PTE page 0, starting at byte 0x0
  1          0x8           0x8 / 0x4000      0      0x8         PTE page 0, starting at byte 0x8
```

The PDE that describes user VPN e00 starts 0x7000 bytes from KPTEBASE, the beginning of this user's PTE table. This puts it at offset 0x3000 bytes into the second page of user PDE's (page 1).

The kernel calculates the virtual address of byte 0x7000. See next illustration.

To find where in physical memory the page containing that word is, the virtual page number (VPN) must be looked up in the TLB.

The address of byte 0x7000 is an xkseg ("c0...") virtual address of "c0000fc000007000". This kind of address is a "mapped" address, just like the user's xkuseg virtual address.

- The VPN for this kernel virtual address has to be calculated from bits 39:14 in exactly the same way the user's VPN was determined.
- Then that VPN is looked for in the TLB.
- The TLB entry will match the virtual page number (VPN) with the actual physical page of memory where this page of user PDE's starts. See next illustration.

# Calculate the VPN to look for in the TLB

The kernel xkseg ("c0...") virtual address of the PDE data word we want is "c0000fc000007000". That word is offset in a page full of user PDE words. When the offset bits are stripped out of the virtual address, the virtual page number where that page of PDE's starts is "3000001".
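The hex arithmetic in this walkthrough can be checked mechanically. The following Python sketch is not kernel code; it only reproduces the calculations described in the text under the stated assumptions: 16 Kbyte pages (offset in bits 13:0), VPN in bits 39:14, one 8-byte PDE per user page, and the KPTEBASE value quoted in this section. The helper names are hypothetical.

```python
# Illustrative arithmetic only; helper names are not IRIX kernel identifiers.

PAGE_SHIFT = 14                  # 16 Kbyte pages: offset is bits 13:0
PAGE_SIZE = 1 << PAGE_SHIFT      # 0x4000 bytes
PDE_SIZE = 8                     # one 64-bit word per PDE
ADDR_MASK = (1 << 40) - 1        # bits 39:0 are the address bits
KPTEBASE = 0xC0000FC000000000    # value quoted in this section


def vpn_of(vaddr):
    """Virtual page number: bits 39:14 of the address."""
    return (vaddr & ADDR_MASK) >> PAGE_SHIFT


def page_offset(vaddr):
    """Byte offset into the page: bits 13:0 of the address."""
    return vaddr & (PAGE_SIZE - 1)


def pde_vaddr(user_vpn):
    """xkseg virtual address of the PDE word describing user_vpn."""
    return KPTEBASE + user_vpn * PDE_SIZE


user_va = 0x0000000003800123
user_vpn = vpn_of(user_va)                       # 0xe00
byte_off = user_vpn * PDE_SIZE                   # 0x7000 bytes past KPTEBASE
pte_page, pte_rem = divmod(byte_off, PAGE_SIZE)  # page 1, remainder 0x3000
pde_va = pde_vaddr(user_vpn)                     # 0xc0000fc000007000
kernel_vpn = vpn_of(pde_va)                      # 0x3000001
page_start = pde_va & ~(PAGE_SIZE - 1)           # 0xc0000fc000004000

print(f"user VPN          = {user_vpn:#x}, offset = {page_offset(user_va):#x}")
print(f"PDE byte offset   = {byte_off:#x} (PTE page {pte_page}, byte {pte_rem:#x})")
print(f"PDE virtual addr  = {pde_va:#018x}")
print(f"kernel VPN        = {kernel_vpn:#x}")
print(f"page starts at    = {page_start:#018x}")
```

Changing `user_va` shows how other user pages land in the PTE table; for example, any VPN below 0x800 (2048) falls in PTE page 0.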
The full kernel xkseg ("c0...") virtual address of that page can be calculated, as shown below.

### Translation of VPN back into full virtual address

1. The CPU, still in kernel context, looks in the kernel TLB entries for virtual page number (VPN) "3000001" of the xkseg memory segment pages.
   - The (kernel xkseg memory segment) virtual byte address "c0000fc000007000" exists within a (kernel xkseg memory segment) virtual page that starts at byte address "c0000fc000004000".
   - Of the 64 bits of virtual address, bits 39:14 constitute the virtual page number (VPN). The VPN for both "c0000fc000004000" (the start of the page) and "c0000fc000007000" (a word within the page) is the same, i.e., "3000001".
   - The VPN is what is used to look in the TLB entries, in order to find where in physical memory the entire page of data has been written.
   - In the illustration below, VPN "3000001" has been written to physical memory page (PFN) "2222" (a number chosen arbitrarily for this example).

1. The kernel finds VPN 3000001 mapped to physical page (PFN) 2222 in the TLB.
2. The kernel goes to physical page 2222 of memory, and then to the proper offset of 0x3000 more bytes into that page (a distance of e00 words from KPTEBASE), and is now positioned at the first byte of the PDE word that represents information about user VPN e00 and whether it has been assigned an actual physical page of memory or not.
   - In the illustration above, KPTEBASE[e00] shows that user virtual page number (VPN) e00 has been written to physical memory page (PFN) "9999" (a number chosen arbitrarily for this example).
3. The kernel now loads this PDE entry (and on this architecture, the next contiguous PDE entry as well) into the TLB for this CPU. See next illustration.

1. The CPU now does a context switch back to user mode.

1.
The CPU re-executes the original instruction where it was presented with the xkuseg ("00...") user virtual address "0000000003800123".
2. As before, the CPU looks in the primary and secondary caches for the desired byte, and as before, still does not find it.
3. As before, the CPU does hex math to extract the virtual page number (VPN) from the 64-bit xkuseg address, and determines that the VPN is "e00".
4. As before, the CPU looks in the TLB to see if VPN "e00" for this user's ASID exists as a valid TLB entry.
5. This time, the CPU does find this user's VPN of "e00" in the TLB, we have a "TLB Hit", and the CPU presents the physical page "frame" number (PFN), and offset into the page, to the HUB, and requests that it find the page and return enough bytes to load both the primary and secondary caches, as well as providing the CPU with the particular byte desired.

## **Detail of Resolving a TLB Double Miss**

The sequence of events which leads up to a "TLB Double Miss" is very similar to the events which cause a "TLB Single Miss", until the kernel tries to look in the TLB for the VPN containing the user's PDE entries.

In a TLB "Double Miss", not only is the user's VPN not referenced in the TLB (the "e00" in the previous example), but, after the CPU does a context switch to kernel mode, the kernel can't find the TLB entry for the page of user PDE's that refers to the VPN the user wanted either (the "3000001" in the previous example).

When the CPU in user context can't find a requested VPN, that's a single miss. After the context switch to handle the single miss, the double miss occurs when the kernel discovers the user is trying to reference a VPN contained in a page of PDE's that isn't referenced in the kernel's TLB entries.

# A TLB "Double Miss" Starts out the Same as a TLB Single Miss:

1. A user process is connected to a CPU. Among other things, this process is assigned an Address Space ID (ASID).
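The ASID assigned at this step is what the TLB compares, together with the VPN, to decide between a hit and a miss. The following Python sketch is a simplified software model of that comparison as the ASID explanation earlier in this chapter describes it (match on VPN, plus either an equal ASID or a global entry). It is not how the hardware or the kernel is implemented; the class, function, and sample values are hypothetical.

```python
# Simplified model of a TLB match; the real TLB compares all entries in
# hardware. All names and sample values here are made up for illustration.
from dataclasses import dataclass


@dataclass
class TlbEntry:
    vpn: int                 # virtual page number
    asid: int                # 8-bit Address Space ID stored with the entry
    pfn: int                 # physical frame number the VPN maps to
    is_global: bool = False  # global entries are valid for all processes


def tlb_lookup(tlb, vpn, current_asid):
    """Return the PFN on a TLB hit, or None on a TLB miss."""
    for entry in tlb:
        if entry.vpn == vpn and (entry.is_global or entry.asid == current_asid):
            return entry.pfn
    return None


tlb = [
    TlbEntry(vpn=0xE00, asid=7, pfn=0x9999),
    TlbEntry(vpn=0x10, asid=0, pfn=0x1234, is_global=True),
]

if __name__ == "__main__":
    print(tlb_lookup(tlb, 0xE00, current_asid=7))   # hit: same VPN and ASID
    print(tlb_lookup(tlb, 0xE00, current_asid=8))   # miss: ASID differs
    print(tlb_lookup(tlb, 0x10, current_asid=8))    # hit: entry is global
```

Because the ASID is part of the match, entries loaded for one process can stay in the TLB while another process runs without being mistaken for that process's pages.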
While the CPU is in user context:

1. A 64-bit user virtual address (in hex) is presented to the CPU.
   - Let's use the same example as before, the xkuseg ("00...") virtual address "0000000003800123".
2. The CPU looks in the primary and secondary caches for the byte address requested.
3. The CPU does hex math to determine the Virtual Page Number ("VPN") contained in bits 39:14 of the 64-bit virtual address.
   - In this example, the VPN is "e00".
4. The CPU looks in its TLB for a user VPN "e00" that matches this user's Address Space ID (ASID).
   - A valid TLB entry will match, or "map", a virtual page number (VPN) with a physical page of memory (Physical Frame Number, or "PFN").
   - If the VPN and ASID pair is found in the TLB, this is a "TLB Hit", and the CPU continues to process in user context. If the VPN and ASID pair is not found in the TLB, this is a "TLB Single Miss", which causes a hardware exception. The CPU then performs a context switch to kernel context and begins to execute kernel code, although the CPU is still considered "connected" to the user.

### Additional Information About the Single Miss:

At this point the "EXL" bit in the status register of coprocessor 0 is set to indicate we have had one TLB Miss already. For 32-bit binaries the kernel begins to execute at a "vector" or entry point called "UT\_VEC", for a "TLB Exception". For 64-bit binaries the kernel begins to execute at a "vector" or entry point called "XUT\_VEC", for an "Extended TLB Exception".

While the CPU is in kernel context:

1. Hex math is done to calculate the offset, from the beginning of this user's PTE table, of the word of data that describes which physical page of memory (PFN), if any, is being used by the requested VPN (in our example, "e00").
   - Each user process has a Page Table Entry (PTE) table made up of as many pages as are necessary to describe the range of xkuseg virtual addresses for a user process.
   - Each PTE table entry, or "Page Descriptor Entry" (PDE), is one word long. Each PDE describes one virtual page (VPN), and which physical page (PFN), if any, is being used to contain the contents of this virtual page.
   - On a system with a 16Kbyte page-size, this means that a single page of PDE's contains 2048 words and can, therefore, contain the information to map 2048 user VPN's to physical pages (PFN's).
   - The PTE table starts at a "well-known kernel (virtual) address" of "c0000fc000000000", which is pointed to by "KPTEBASE".
   - Every CPU uses the same virtual address for the beginning of its currently connected user process's PTE table. These virtual addresses are, however, mapped to different physical pages in memory for each CPU's process or thread.
   - Since one PDE represents the information for one user page, "KPTEBASE[VPN]" (that is, the word VPN words past KPTEBASE) contains the PDE information for any given VPN.

The user's PTE table entries are referenced with xkseg kernel mapped ("c0...") virtual addresses. Mapped addresses must be looked up in the TLB to determine what physical page holds the referenced data. The TLB is examined for the value of the VPN extracted from bits 39:14 of the 64-bit virtual address. In our example, the VPN for the appropriate page of user PDE entries was VPN 3000001.

The kernel enters a simple routine named "utlbmiss", which does simple math to calculate which page of user PDE words contains the PDE word that matches the VPN the user is interested in. Assuming a valid PDE entry exists, one of the user TLB registers is selected at random to load with the value of "KPTEBASE[VPN]". Because there is a 1:1 correspondence between PDE words and user VPN pages, this formula loads the appropriate PDE word for the user VPN.

#### Where a TLB "Double Miss" is Different:
1. At this point, for a "TLB Single Miss", the CPU would find VPN 3000001 in the kernel TLB entries. This TLB entry would reference the PFN containing a set of 2048 PDE entries, each containing the VPN-->PFN mapping information for a single user page. The CPU would offset from the beginning of the physical page to the appropriate PDE entry, and load its contents. Also, due to the machine architecture in our example (i.e., a single TLB register holds two PDE entries), the contents of the next physically contiguous PDE word would also be loaded into the TLB. These two entries represent contiguous virtual pages, but the matching physical pages are probably not contiguous.

   However, for a "TLB Double Miss", when the CPU looks in the TLB for VPN 3000001, the kernel does not find this VPN either. This results in a second "TLB Miss", this time due to actions of the kernel, not the user process.

2. In the event of a "TLB Double Miss", the CPU can follow the pointers starting with its own PDA, which eventually lead to the segment tables that describe the pages of the currently connected user's PDE words. Some of these structures were shown earlier, and some more detail is added in the diagram below. There is more detail on the structures in this chain in a later section.

Module 4: Kernel Source Tree

# **Kernel Source Tree**

This section provides an overview of the organization and location of operating system source code, with an emphasis on kernel code, and tools for browsing it.
By the end of this section, the student should be able to:

- Locate operating system release project web pages
- Locate operating system and kernel source code files
- Explain the kernel source tree subdirectory contents
- Explain the difference between ".h" files and ".c" files
- Determine if a ".c" file includes a specific ".h" file
- Describe the system logs, subdirectories, and files in /var/adm
- Describe tools to examine system activity
- Describe tools to examine a system dump
- Use the **cscope** tool to find operating system and kernel source code files
- Use the uname command to determine what software a system is running
- Use the versions command to determine what software a system is running

#### **Related On-Line Materials**

Additional information is available in the "Source and Object code maintenance" lesson, which provides detailed information about:

- Operating system release project web pages
- The location of operating system code source files
- The location of operating system code binary files
- How to track changes to the operating system source tree
- How to track changes to the operating system object code
- The location of kernel source code and tools to examine it
- Determining hardware and software system status with various tools and commands

That information is available as part of the selection of IRIX Class Materials located at: http://wwwtng.cray.com/~mix/irix.html#Class-Materials . The lesson itself, and its Table of Contents, are located at: http://wwwtng.cray.com/~mix/protect/irix/manual/current/object\_code.html .
## **Operating System Release Project Web Pages**

IRIX 6.4.x ("ficus") Project Web Page: http://info.engr.sgi.com/projects/bonnie\_proj/ficus/isms/status/

IRIX 6.5 ("kudzu") Project Web Page: http://info.engr.sgi.com/projects/bonnie\_proj/kudzu/isms/status

Project Web Page information includes:

- Description of platforms supported
- Release milestones status
- Bug status ("showstoppers", summary reports, etc.)
- ISM ("Independent Software Module") owner and status
- Source and Build Information (source tree, "Build Meister", etc.)
- Related news groups, news letters
- Features list

Each operating system release's web page holds a great deal of valuable information, including the location of important source code directories.

# **Source Code Location**

The official location of source code is on "bonnie" in the "/proj" directory. Once inside the SGI/CRAY firewall, there are two methods of getting to "bonnie" and the various source trees:

    telnet bonnie.engr.sgi.com
    Trying 192.26.80.202...
    IRIX (bonnie)
    login: guest        (no password necessary)
    cd /proj

or:

    cd /hosts/bonnie.engr.sgi.com/proj

# **Base Source Code Naming Convention Explanation**

Source code is not assigned a release number until it is close to being released. Once released, no further changes are made to that code directory tree. IRIX 6.4 (which had the in-house name "ficus") has been released; "kudzu" has not, but will be released as IRIX 6.5. The "irix6.5se" directory contains the most relevant source code for the IRIX 6.5 base release for the Cray Origin2000 systems.
```
bonnie 6% cd /hosts/bonnie.engr.sgi.com/proj; ls -la
drwxrwxr-x  3 root sys   57 Dec 29 10:10 kudzu
drwxr-xr-x  4 root sys 4096 Dec 26 16:58 irix6.5
drwxr-xr-x  5 root sys   52 Jan 22 12:58 irix6.5-features
drwxr-xr-x  3 root sys   51 Jan  5 14:41 irix6.5-unbundled
drwxr-xr-x  3 root sys   57 Dec 29 10:19 irix6.5f
drwxr-xr-x  3 root sys   22 Jan 16 17:07 irix6.5m
drwxr-xr-x  3 root sys   51 Jan 19 03:10 irix6.5se
lrwxr-xr-x  1 root sys   14 Feb 11  1997 ficus -> irix6.4-s2mp+o
drwxr-xr-x  3 root sys   56 Dec 14  1996 irix6.4-s2mp+o
drwxr-xr-x  3 root sys   36 Oct 29  1996 irix6.4-ssg
drwxr-xr-x  3 root sys   22 Nov 18  1996 irix6.4-ssg-unbundled
```

Other directory names are explained below.

The original IRIX 6.4 release can be found in the "irix6.4-ssg" directory. The "ssg" suffix refers to the "Scalable Systems Group". This release ran only on the high end "s2mp" (Scalable Symmetric Multi-Processor) architectures. This release has been superseded by the "irix6.4-s2mp+o" release.

The "irix6.4-ssg-unbundled" directory was never used, is empty, and can be ignored.

The code to support the low end "Octane" architecture was added in the "irix6.4-s2mp+o" version of the source, which is actually the IRIX 6.4.1 release.

The project leader for each release chooses the naming conventions, such as references to "ssg", "+o", or "unbundled". Do not rely on a consistent naming convention for released systems: naming and location conventions are entirely up to the manager of the group.

The original in-house name for the IRIX 6.4 release was "ficus". Note that the "ficus" directory has been symbolically linked to "irix6.4-s2mp+o/". The pre-release in-house name will always be linked to the most relevant base release directory name, once the operating system is released.

The all-platform release of IRIX 6.5 can be found in the "irix6.5" set of subdirectories.
The "irix6.5-unbundled" subdirectories are for products, like the compilers, which do not ship with the kernel. These may have their own release cycles, and may be optional software.

The subdirectory named "irix6.5m" is for maintenance on the 6.5 release. This is where fixes for low priority bugs are checked in (problems not significant enough to hold up the release).

# Where Is the Most Recent Version of the Source Code for the Upcoming IRIX 6.5 (kudzu) Release?

Source code for upcoming IRIX 6.5 release

- Always buildable
- Always viewable
- Changing constantly
- Being tested constantly
- Kept on "bonnie"

Source code is kept in a continuously releasable state until it is released. Changes and corrections are applied directly to the source code in the development tree for the release.

The source code for "kudzu", which will probably be released as "IRIX 6.5", is kept in:

/hosts/bonnie.engr.sgi.com/proj/irix6.5/isms

(/hosts/bonnie.engr.sgi.com/proj/kudzu/isms is outdated)

NOTE: "isms" directories should be:

- Independent bodies of code
- The most recent appropriate software module for the release

...but they may not be!

WARNING: The acronym "isms" stands for "Independent Software ModuleS". These are supposed to be large bodies of independent code, which can be built independently, but there may be some interconnections between them anyway (e.g., the graphics code, the man pages, and the IRIX kernel all have interdependencies with each other).

Any "isms" directory should contain "all" the source code for that release. However, compilers have their own separate release schedule and are not included in the same "isms" tree. Compiler versions can be found in subdirectories under:

/hosts/bonnie.engr.sgi.com/isms/cmplrs.src

(note: there is an Eagan cscope database of development IRIX 6.5 source, and there are IRIX 6.4 and 6.5 cscope databases on bonnie.engr.sgi.com in /cscope .)
# Where Is the Most Recent Version of the Source Code for the Current IRIX 6.4 (ficus) Release?

There isn't one. There is no policy for maintaining a recent viewable version of the source code for the current release.

The base source code for the IRIX 6.4 release is kept on "bonnie", in the directory:

/hosts/bonnie.engr.sgi.com/proj/irix6.4-s2mp+o

The code in this set of subdirectories is readable, buildable source.

There is no policy for maintaining a current viewable version of the source code for any released operating system. Operating systems are released to customers in binary form (object code). Once an operating system is released, it is no longer kept in a continuously releasable (buildable) state.

(note: there is an Eagan cscope database of released IRIX 6.4 source)

# Summary: Location of Operating System Source Code Base Release, Current Release, and Upcoming Release

- Source code for IRIX 6.4 base release (no patches)
  - Viewable
  - Buildable
  - Kept on "bonnie"
  - /hosts/bonnie.engr.sgi.com/proj/irix6.4-s2mp+o/isms (/hosts/bonnie.engr.sgi.com/proj/ficus/isms is linked to the above)
- Source code for IRIX 6.4 base release with released patches applied
  - Viewable
  - KERNEL source MIGHT be buildable
  - ALMOST all released patches applied
  - No patches in test mode
  - Kept on "bonnie"
  - /hosts/bonnie.engr.sgi.com/proj/irix6.4\_patched/isms
- Source code for IRIX 6.5 development
  - Viewable
  - Buildable
  - Kept on "bonnie"
  - /hosts/bonnie.engr.sgi.com/proj/irix6.5/isms

(note: Eagan cscope databases of all three trees)

The source code for the *base* release will always be viewable and buildable. It will be located on "bonnie.engr.sgi.com" in the "/proj" directory, under a name which includes the release number (e.g., "irix6.4-s2mp+o").

There is no policy to maintain a viewable, buildable version of the current release with all or most currently released patches applied.
NOTE: One of the developers has taken it upon himself to create a buildable version of the latest kernel, with resolved patches applied. Whenever a new patch is inserted into the "irix6.4\_patched/isms/irix/kern/\*" tree, this developer will apply a further patch of his own, so that the kernel source tree is again a buildable and "vi-able" version of the kernel with most of the released patches compiled in (remember the time lag before a released patch shows up in the source tree). That is, patch dependencies and conflicts will be resolved.

In short, what you find in the "irix6.4\_patched/isms/irix/kern" subdirectories will toggle between two states. It will always be something that "vi" can examine. However, the code will be a buildable version of the kernel source, with most of the recently released patches, only until the build team drops in a new patch or patches. Then the code will be an unbuildable version of the kernel source, with most of the released patches, and with possible patch conflicts and dependencies. When this condition is noticed, the developer will apply yet another patch to toggle the code back to a buildable, resolved state.

# **Kernel Source Tree Location**

All source files are on "bonnie.engr.sgi.com". The main kernel source directories are located in:

/hosts/bonnie.engr.sgi.com/proj/release-name/isms/irix/kern

IRIX kernel source code for the IRIX 6.4 release is located in:

/hosts/bonnie.engr.sgi.com/proj/irix6.4-s2mp+o/isms/irix/kern

IRIX kernel source code for the kudzu (IRIX 6.5) release is located in:

/hosts/bonnie.engr.sgi.com/proj/irix6.5/isms/irix/kern

All source code can be found on the "bonnie.engr.sgi.com" host, mostly in an "isms" directory of Independent Software Modules, that is, code (mostly) logically distinct from other code. Most of the "interesting" operating system code is in the subdirectories under "isms/irix/kern".
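The path pattern above is regular enough to capture in a small helper. The following is a minimal sketch only; the function name is hypothetical, and the release directory names are the ones quoted in this section.

```python
import posixpath

# Root of the release trees, as given in the text above.
PROJ_ROOT = "/hosts/bonnie.engr.sgi.com/proj"


def kernel_source_dir(release_name):
    """Return <PROJ_ROOT>/<release_name>/isms/irix/kern for one release."""
    return posixpath.join(PROJ_ROOT, release_name, "isms", "irix", "kern")


for release in ("irix6.4-s2mp+o", "irix6.5"):
    print(kernel_source_dir(release))
```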
The IRIX kernel source code for the kudzu (IRIX 6.5) release for the Cray Origin2000 architecture is located in:

/hosts/bonnie.engr.sgi.com/proj/irix6.5se/isms/irix/kern

Most of the kernel code is in a "/proj" subdirectory named for the appropriate release, each of which has its own "isms" subdirectory. In each release's "isms" directory of independent code modules, there is an "irix" directory, which leads to a "kern" directory. This is where the kernel code can be found.

Note: Many of the files and directories have been symbolically linked. Do not be confused if the path you "cd" to does not resemble your "pwd" output. The shell you use makes a difference. If you are using csh, then you could encounter something like this:

    cd /hosts/bonnie.engr/proj/irix6.4-s2mp+o/isms/irix/kern ; pwd
    /hosts/bonnie.engr/disks/xlv2/ficus/irix_ficus/kern

Using ksh rather than csh will avoid this confusion.

## **Kernel Source Tree Contents**

Below is an explanation of the files and directories where various parts of the kernel source tree, or useful system information, can be found. The "cscope" source browsing tool is introduced, and examples are given.

All source files are kept on the "bonnie.engr.sgi.com" system, in the "/proj" directory, which is divided into subdirectories by release name. Underneath the "...release\_name/isms" subdirectory, most of the major operating system release components can be found.

File names that end in ".s" are written in assembly language. Most of these are in the ".../isms/irix/kern/ml" subdirectory. File names that end in ".c" are written in the C programming language. File names that end in ".h" are "header files", and are also written in "C".

# The Difference Between ".h" and ".c" Files

The difference between ".h" and ".c" files is as follows.
Files that end in the characters ".h" are called "header files". Header files can contain either C language source code, or structures and constants used by "modules" (source code files).

Files that end in the characters ".c" are called "modules" or "source files". These files contain "C" language code which, among other things, can include directives to the compiler's pre-processor. Such directives can instruct the compiler to define a name to have a certain value, or to include a certain "header file" as part of the source when the source is compiled.

In a ".c" file, directives that begin with "#include" specify a header file to include in the source. Header file names that are surrounded by the characters "<>", such as:

    #include <sys/types.h>

...indicate that certain "standard directories" (this is system and implementation dependent) should be searched, each being used as a prefix to the file pathname specified. One of the most common of the standard directories is "/usr/include".

    # cd /usr/include/sys
    # ls types.h
    types.h

Header file names that are enclosed in quotation marks, such as:

    #include "region.h"

...indicate that the directory the source code resides in should be searched first, and, if the file is not found, then the standard directories are searched.

#### Where to Find ".h" Files

Since header files contain and define the format for kernel structures and pointers, they are very useful to examine, in order to understand kernel functions and solve system dumps.

Not all of the ".h" files will be on your system if you are not running source. For example, "region.h" is not a standard UNIX header file, but is specific to the IRIX methodology of memory management (paging). If you are running a binary version of the operating system at your site, you do not have all the files necessary to compile the operating system on your machine.
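The two "#include" forms described earlier differ only in where the search begins, which can be illustrated with a short sketch. This is a simplified model under stated assumptions, not a description of any particular compiler: the function name is hypothetical, and the directories are created in a throwaway temporary tree standing in for "/usr/include" and a kernel source directory.

```python
import os
import tempfile


def resolve_include(name, quoted, source_dir, standard_dirs):
    """Return the first existing path for an #include target, or None.

    quoted=True models  #include "name": source_dir is searched first,
    then the standard directories.
    quoted=False models #include <name>: only the standard directories
    are searched.
    """
    search_path = ([source_dir] if quoted else []) + list(standard_dirs)
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    """Build a throwaway tree and resolve both forms of #include."""
    with tempfile.TemporaryDirectory() as root:
        std = os.path.join(root, "usr", "include")     # stand-in for /usr/include
        src = os.path.join(root, "kern", "os", "as")   # stand-in for a source dir
        os.makedirs(os.path.join(std, "sys"))
        os.makedirs(src)
        open(os.path.join(std, "sys", "types.h"), "w").close()
        open(os.path.join(src, "region.h"), "w").close()

        angle = resolve_include("sys/types.h", False, src, [std])
        quote = resolve_include("region.h", True, src, [std])
        missing = resolve_include("region.h", False, src, [std])
        return (os.path.relpath(angle, root),
                os.path.relpath(quote, root),
                missing)


print(demo())
```

In the demo, "region.h" is found only through the quoted form, because it lives beside the source file and not in a standard directory. This mirrors the situation on a binary-only site, where such kernel-private headers are absent from "/usr/include".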
If you want to find a specific ".h" file, the best tool to find a source version of it (inside of the SGI firewall) is probably "cscope".

Header (".h") files may also include other header (".h") files. This may make examination of a source file confusing. It may be difficult to understand what structure definitions the code is referencing when the files which define them are not explicitly "included" at the top of the file. For example, in the directory:

/hosts/bonnie.engr.sgi.com/proj/irix6.5se/isms/irix/kern/os/as

...one can find the source file "region.c", and the header file "region.h". The "region.c" file uses but does not include the "region.h" file. Instead, the "region.c" source file includes a different header file, "as\_private.h", which itself includes the "region.h" file. In this way, the region.h header file is included *indirectly* as part of a *different* header file referenced in "region.c".

Once inside the SGI firewall, you can use the "cscope" tool to find such occurrences. In this case, we know (or suspect) that the "region.c" file uses, but does not include, the "region.h" header file. What we need to do is:

- Look at the list of all the different ".h" files that the "region.c" code includes. This can be done by examining the region.c source code directly.
- Make a list of all the files that explicitly include the "region.h" header file. This is one of the standard operations that the cscope tool will do.
- Compare the two lists, that is, compare the "region.c" header files to the list of all the files which include the "region.h" file directly.

and:

- Hope we find a match.

Unfortunately, the last two steps have to be done by hand. If we find such a file, then we will know that "region.c" does include "region.h", indirectly, by way of an intermediate header file which includes "region.h" directly in its directives.
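The comparison described above can also be scripted. The sketch below is an illustration only and is not part of cscope: the function names are hypothetical, headers are looked up only in the including file's directory plus any extra directories supplied, and the three small files it creates are stand-ins that reproduce the "region.c", "as\_private.h", "region.h" relationship rather than the real kernel sources.

```python
import os
import re
import tempfile

# Matches both  #include <name>  and  #include "name"
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def direct_includes(path):
    """Return the names that appear in the #include directives of one file."""
    with open(path, errors="replace") as handle:
        return INCLUDE_RE.findall(handle.read())


def find_include_chain(start, target, search_dirs=()):
    """Return the chain of files by which start pulls in target, or None.

    The chain begins with start and ends with the target header name.
    Headers that cannot be located on disk are skipped.
    """
    seen = set()

    def walk(path, chain):
        real = os.path.realpath(path)
        if real in seen:                  # guard against include cycles
            return None
        seen.add(real)
        for name in direct_includes(path):
            if os.path.basename(name) == target:
                return chain + [name]
            for directory in [os.path.dirname(path), *search_dirs]:
                candidate = os.path.join(directory, name)
                if os.path.isfile(candidate):
                    found = walk(candidate, chain + [candidate])
                    if found:
                        return found
                    break
        return None

    return walk(start, [start])


def demo():
    """Recreate the region.c / as_private.h / region.h relationship."""
    with tempfile.TemporaryDirectory() as src:
        files = {
            "region.c": '#include <sys/types.h>\n#include "as_private.h"\n',
            "as_private.h": '#include "region.h"\n',
            "region.h": "/* structure definitions would go here */\n",
        }
        for name, text in files.items():
            with open(os.path.join(src, name), "w") as handle:
                handle.write(text)
        chain = find_include_chain(os.path.join(src, "region.c"), "region.h")
        return [os.path.basename(step) for step in chain]


print(demo())
```

A chain of length two would mean the header is included directly; a longer chain names each intermediate header, which is the "match" the manual procedure is looking for.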
Below is an example of using cscope to find which ".h" header files include the "region.h" header file. Then a comparison is done to see if any of these are files used in "region.c".

Once you've invoked cscope, a window much like the above is generated. The last search option does exactly what we want: it finds all the files which have a directive to explicitly include the "region.h" file. Your keyboard arrow keys will allow you to select which of the nine possible searches you want to invoke.

In Step (1), the header file we're interested in is entered on the appropriate command line. This results in the generation of a list of all files (not just header files) which include the target file. The yellow horizontal bar shows that, in this case, 118 files were found. Pressing your space bar will cycle you through these choices. Many of the result files are source code file names ending in ".c", but we are trying to find a header file (a file ending in ".h"). The area where the file names are listed, however, is not always wide enough for the full file name to show, so you might have to select a file just to determine if its name ends in ".h" or not.

Following step (3), above, will open up a separate window, as shown below, with the specified file opened, and the cursor positioned at the line specified in the cscope window. In this case, a window will open up with the cursor at line 83 of the file "as\_private.h", which is located in the directory:

/hosts/bonnie.engr.sgi.com/proj/irix6.5se/isms/irix/kern/os/as

[Screen capture: "as\_private.h" open in an editor window, showing the end of its introductory comment block and the start of its list of "#include" directives.]

Now we know that the header file "as\_private.h" includes the header file "region.h". A little detective work shows that the source file "region.c" includes the header file "as\_private.h". We have now confirmed that the "region.c" file uses the structures and variables defined in the "region.h" header file.

# Operating System and Kernel Source Tree File and Directory Structure

Once you have gotten onto the "bonnie" system, selected the operating system release you are interested in from the "/proj" directory, and descended into the ".../isms/irix" subdirectory, you will find that the "/irix" directory contains the following subdirectories:

    % cd /hosts/bonnie.engr.sgi.com/proj/irix6.5se/isms/irix; ls
    Makefile   RCS   build   cmd   include   kern   lib   man

The "cmd" subdirectory contains the "savecore" subdirectory, where information about past system crashes or hangs is kept. In the "lboot" subdirectory there are a number of files important to the booting, configuration, and tuning of the system.
    % cd /hosts/bonnie.engr.sgi.com/proj/irix6.5se/isms/irix/cmd; ls
    Makefile  diskless   hinv        mkpart      savecore   xbstat
    RCS       dlpi       icrash      mmscd       sn0log     xfs
    bsd       dprof      ip26ecc     netman      sn0msc     xfsm
    btool     flash      lboot       sump        stress     xlv
    cached    flashio    linkstat    onlinediag  sysctlrd   xperform
    clshm     flashmmsc  mkmachfile  perfex      tokenring

On a live system, much of the configuration and tuning information can be found in the "/var/sysgen" directory. The "/var/sysgen/master.c" file is generated by the boot process and is full of SYSTUNE, system, and driver configuration information. The "bdevsw" and "cdevsw" structures (block special device and character special device switch tables) are also defined in the master.c file.

Most of the kernel code and structure definitions live in the subdirectories under ".../isms/irix/kern".

    $ cd /hosts/bonnie.engr.sgi.com/proj/irix6.5se/isms/irix/kern; ls
    Makefile   fs             master.d          sgi
    RCS        io             ml                stubs
    bsd        kcommondefs    mtune             sys
    btool      kcommonrules   os
    dp         ksys           protoklocaldefs

Below is an explanation of the subdirectories found in the kernel source tree.

bsd - "bsd" stands for "Berkeley Software Distribution". Directory contains networking related code, e.g., sockets, protocols, network device drivers.

btool - Contains code for "btool", a code coverage analysis tool.

dp - Contains distributed processing support, i.e., cellular IRIX support.

fs - Contains all the code for each of the file system types, in subdirectories like nfs, pipefs, procfs, xfs, cachefs.

io - Contains source code for I/O device drivers and I/O support routines, e.g., ql.c (SCSI chip logic and QLogic controller code), scsi.c (generic SCSI code), and dksc.c (generic disk drivers).

kcommondefs - Contains basic common flags/locations for kernel builds. The makefile includes these extensions for Make.

kcommonrules - Ensures that kernel builds end up in proper directories, install proper header files, etc. The makefile includes these extensions for Make.

ksys - Contains a directory of kernel private header files, which are never exported outside the kernel.

master.d - Contains a directory of configuration files for every device driver. A program called "lboot" reads these and decides whether or not to configure each device driver in. If you're writing a new driver, you'd add a new file for it in master.d.

ml - Contains low level machine level code for system startup, locks, interrupt management, and error handling. All the assembler files go here, e.g., MIPS assembly language files, the assembler language level locking code "llsclocks.s", etc. The ml directory has several interesting subdirectories, including:

LOCORE - A directory of the most common ".s" (shared assembler) files for all platforms.

mtune - Contains files with system tunable parameters. These are modified only indirectly, using the "systune" tool.

os - A directory containing the bulk of the operating system code, e.g., fork, exec, the main kernel files and directories, the vm directory for virtual memory, and the as directory for address space management (part of memory management; the NUMA [nonuniform memory access] support code files are in this subdirectory). The "as" subdirectory is important: it contains most of the code for page fault handling, the definitions of the region and pregion structures, etc.

protoklocaldefs - The prototype kernel local definitions file. Before a kernel is built, this file is copied into klocaldefs and modified. Binary sites do not need to think about this file.

sgi - This is a directory of somewhat odd code, like a random number generator for the kernel, and some with more obvious value, such as the code for kernel mallocs, kern\_heap.c, or chunkio.c, which deals with DMA (Direct Memory Access: I/O controllers get data out of memory by touching it directly, without going through the CPU). The "chunkio" code coalesces DMAs to do large efficient direct memory accesses instead of lots of small inefficient ones.

stubs - If there's an optional subsystem that some site doesn't want, a stub for it is put in here, and when that code gets referenced the stub will just return.

sys - This is where *most* of the public header files, i.e., those that get installed in /usr/include/sys, can be found. But some of the public header files are in other /irix/kern directories and get exported from there. This is also true of some of the private header files. Many kernel structure definitions can be found in this directory.

NOTE: Graphics has commands and libraries and kernel pieces, but it's not in /irix/kern. However, many related pieces, such as header files and device drivers, can be found under other "isms", for example, isms/gfx/kern, and the digital media drivers in /isms/dmedia/kern. The communications group has its own kern header files, such as the IBM x25 communications support under /isms/comm/kern.

NOTE: The IRIX 6.4 source code is scattered all over, but the "isms" directory should contain the complete list of independent software modules. There are exceptions to this, however. For example, the compilers group has its own release schedule, since compilers are unbundled. Compiler versions can be found in subdirectories under:
/hosts/bonnie.engr.sgi.com/isms/cmplrs.src

#### **Tools Available to Browse Source**

- cscope - interactively examine a C program
  - Tutorial available
  - See man page on "tokyo" (login as guest)
  - Not officially supported, just for internal use
  - Supported equivalent is "gid"
    - see man page on Indys
- dwarfdump - locate source patches
  - Example available
- ctags - create a tags file
  - For vi users
  - See man page
- etags - create a tags file
  - For emacs users
  - No man page

(note: there is an Eagan cscope database of released IRIX 6.4 source)

(note: there is an Eagan cscope database of patched IRIX 6.4 source)

(note: there is an Eagan cscope database of development IRIX 6.5 source)

## **Determining What Software the System Is Running**

- versions
  - Show system software
  - List installed patches
    - "versions -Inv | grep patch"
    - "versions -bv | grep patch"
  - Remove system software (e.g., patches)
  - Has man page
- uname
  - Show system software
    - "uname -a"
    - "uname -R"
  - Has man page

# versions - show system software; list installed patches

See "man versions". Versions has many options and three main functions. The "showprods" option displays information about the software that is currently installed on a system. "Showfiles" displays lists of files on your system and information about those files ("inst" can be used to remove installed software from your system). Typing "versions -Inv | grep patch", or "versions -bv", will generate a list of patches installed on the system (versions can also be used to remove patches).

## uname - show system software

The uname command has several options. Below are samples of uname output with explanation of the fields.

A word about the CPU board type. The CPU *type* would be more useful. The CPU type can be derived from the board type: IP numbers always increase with products.
Odd-numbered IPs are always high-end products. Even-numbered IPs are always low-end products. Recent numbers:

- IP19 = R4000 (or R4400) Challenge (other than S model), Onyx
- IP20 = R4x00 Indigo
- IP21 = R8000 POWER Challenge, POWER Onyx
- IP22 = R4x00 Indigo2, Indy, Challenge S
- IP23 = none
- IP24 = Indigo2
- IP25 = R10000 POWER Challenge R10000
- IP26 = R8000 POWER Indigo2
- IP27 = R10000 Origin O2000 and O200; they use different CPU boards, but both are R10000-based systems
- IP28 = R10000 POWER Indigo2 R10000
- IP29 = O200
- IP30 = R10000 OCTANE (low end machine, in-house "Speed Racer")
- IP32 = R10000 O2

## How Do I Know What Crashed My System? What Was Going On Just Before the System Crashed?

- icrash - IRIX system crash analysis utility
  - Has man page
  - Tutorials available
  - Tips available
  - Gives status of
    - networks
    - disks
    - tapes
    - OS state
    - PE states
    - register contents
- SYSLOG - writes messages onto the system log
  - Has man page
- utrace - basic kernel trace mechanism (NEW 12/97)
  - Has web page
  - Circular buffer of time-stamped events for each CPU
  - Requires kernel rebuild to enable

## Where Does the System Put Things When It Crashes?

System logs are in either:

- /usr/adm
- /var/adm

## What System Logs Exist?
#### System Logs in /var/adm:

\$ cd /var/adm ; ls

| SYSLOG | dtmp | pacct1 | pacct4 | pcplog | utmp |
|-----------|---------|---------|--------|------------|-------|
| acct | fee | pacct10 | pacct5 | sa | utmpx |
| avail | klogpp | pacct11 | pacct6 | sat | wtmpx |
| bds.log | lastlog | pacct12 | pacct7 | sulog | wtmp |
| crash | mkpts | pacct2 | pacct8 | sysmon.msg | |
| dodiskerr | pacct | pacct3 | pacct9 | sysmonpp | |

### Description of system logs, subdirectories, and files:

Most of the interesting system logs, subdirectories, and files are in /var/adm.

SYSLOG - log of everything happening on the system. See man SYSLOG.

acct - invoke accounting. See man acct.

avail - directory of logs; see /var/adm/avail/availlog, the primary log of the availmon tool, which keeps track of when you bring the machine up and down and why, and, if so configured, will send mail to SGI headquarters. The log records that the machine was rebooted, for what reason, how long it was down, etc. That log is processed, and there's a database that tries to summarize field info.

bds.log - logs all the opens and closes of, and performance data for, BDS (Bulk Data Services) files (which are used to transfer large quantities of data between machines).

crash - where the dumps are put (default directory).

dodiskerr - part of disk error accounting

dtmp - output from the acctdusg program

fee - output from the chargefee program, ASCII tacct records

klogpp - symbolic link to /usr/sbin/klogpp, which is the command that filters kernel messages for syslogd (to SYSLOG and /dev/console)

lastlog - record of who's logged in.
pacct - raw data file of all accounting activity that the acct command reads into reports

pcplog - performance copilot log

sa - unix acct

sat - directory of security audit trail related files

sulog - record of who tried to su to root or to some other login ID

sysmon.msg - sysmon allows a user to browse SYSLOG; this file contains a message from the SYSLOG file.

sysmonpp - program filter for log messages

utmp - see the man pages for this file and the three that follow. These are logs of who logged in to do what commands on which terminal. These files hold user and accounting information for such commands as who, last, write, and login.

utmpx - see "utmp", above.

wtmp - see "utmp", above.

wtmpx - see "utmp", above.

A more involved exploration of code examination and system dump analysis is handled in other modules.

Module 5: Operating System Overview

# **IRIX Operating System Overview**

This section provides an overview of the organization of the IRIX operating system, user memory components, memory management, process relationships, and the primary functions of the IRIX kernel code.
By the end of this section, the student should be able to:

- Explain the IRIX operating system philosophy
- Explain the concept of an interrupt
- Explain the concept of an exception
- Describe the major system components of system memory
- Describe the major system components of user memory
- Explain how kernel and user components are related
- Explain the primary memory management methodologies for moving pages in and out of memory
- Describe the relationship of sched and init to all other processes in the system
- Describe the functions of the fork and exec system calls
- Describe the process relationships for a user connecting to a system through a network
- Explain each of these primary kernel activities:
  - System Initialization
  - Process Management
  - User Program Interface
  - Memory Management
  - File Management
  - I/O Management
  - Communication Facilities

### UNIX (IRIX) philosophy

The philosophy of UNIX (IRIX) is to take advantage of work already done by others. As this onion-like diagram suggests, UNIX (IRIX) is built in layers, with each layer representing a building block that can be used to build other building blocks. Most commands, programs, and utilities supplied with the UNIX (IRIX) system can be used in combination with each other to build other tools. Complex mechanisms can be built from a set of simple commands to perform various functions.

The UNIX (IRIX) operating system kernel, in most instances, insulates the user from needing to know intimate details of the machine hardware. The machine hardware includes processors, peripheral devices, memory, hard disks, etc.

The layer immediately outside the IRIX kernel is referred to as the system call layer. System calls allow user-written applications at the outer layers to invoke various functions residing inside the IRIX kernel.
For example, an application needing to read a file would issue the read() system call, which would invoke a read handler inside the kernel. The handler would communicate with the device where the desired data resides and return the data to the application.

## IRIX system major components (user memory)

The above diagram illustrates the major components that comprise user memory in an IRIX system.

The init process (pid(1)) is created and started by the first kernel process, sched (pid(0)), during system start-up and becomes the parent of all other processes (except sched) in the system.

The getty, login, and shell (sh, csh, ksh ...) processes provide the interfaces between the system and users.

Several daemon processes exist to provide services to both users and the kernel. The network daemons (inetd, telnetd, etc.) provide the interface between the system and a user at a network terminal. Daemons such as nasd and cron execute user scripts or commands non-interactively. Other daemons provide functionality for the kernel. The kernel can "off-load" lengthy work to them, such as tape support, accounting, error logging, and so on.

User binaries represent the execution of program binaries as initiated by the shells.

# IRIX system major components (kernel logic)

This diagram illustrates the major components that comprise the IRIX operating system kernel. The kernel logic is the overall controlling component in the system.

## When Does the Kernel Take Control Away From a User Process?

Kernel code will take control of a CPU away from a user process when:

• The user process receives an interrupt. An interrupt is something generated externally to the process, which requires kernel intervention, such as the completion of I/O.
• The user process generates an exception. An exception is something generated by the process, such as when a user process makes a request to the kernel to take control of the CPU to do a system call.

## Kernel block diagram

[Kernel block diagram; the machine level shows SCSI disk, SCSI tape, and VME devices]

## **Primary Kernel Activities**

Kernel code selects one of several handlers to service the interrupt or exception. After all interrupts and the user's exception (if any) have been processed, the kernel will return control of the CPU to a user - but it may not be the same user who had the CPU before the kernel call. Which user the kernel connects to is determined by which process has the highest priority to run at that time.

A major part of the kernel code provides for the processing of system calls. A system call is a type of interruption to user processing, called an "exception". System calls can be organized into one of four areas: process management, file management, I/O management, and miscellaneous routines that are used by both the kernel and "outside" processes.

Another major portion of kernel logic is devoted to handling I/O interrupts, which signal an I/O completion.

On IRIX systems, the primary memory management methodology is based on moving pages in and out of memory, not processes, so invoking sched for that purpose is done as a last resort. The primary memory management routines for moving pages into memory are kernel routines triggered by a "TLB miss", when a CPU cannot find what it wants in its Translation Lookaside Buffer and must ask the kernel to make page information more local. The primary memory management routines for moving pages out of memory start with vhand.

To facilitate performance, a copy of the IRIX kernel resides in that part of main memory assigned to each individual node of a multi-node system.
Interrupts and exceptions generated by user processes, interprocess communication, system calls, etc. can be handled more efficiently (faster) when a copy of the IRIX kernel is located "nearby" in the node's local memory instead of a CPU having to access kernel code through the interconnect fabric from a part of memory that would not be considered local, and would therefore take more time to access. The kernel block diagram above shows various user and kernel modules and how they are related. This is a useful model, although interactions in the kernel are much more complex than this. The UNIX (IRIX) kernel is designed around two primary entities: files and processes. Therefore, two major components of the kernel are the file subsystem and the process control subsystem. The block diagram shows three levels: user, kernel, and hardware. The system call interface represents the boundary between user programs and the kernel. A system call is a request made by a user's program to execute a function residing in the operating system kernel. Library functions, which also invoke system calls, are actually linked together with the user's program. The diagram partitions the set of system calls into two groups; those that interact with the file subsystem and those that interact with the process control subsystem. The file subsystem manages the creation and removal of files, controls access to files, allocates file space, administers free space, and reads and writes data for users. A user's process interacts with the file subsystem using system calls like open(2), close(2), read(2), write(2), chmod(2), chown(2), and stat(2). The file subsystem provides user access to data using a buffering mechanism that controls the flow of data between the kernel and secondary storage. The kernel's buffering mechanism interacts with block I/O device drivers to initiate reads and writes of data to and from the kernel. Device drivers are kernel modules which control access to peripheral devices. 
A block I/O device is a device which is read and written in fixed units, referred to as blocks, or in multiples of a block. Data residing on a block I/O device can be accessed in a random manner. Disk drives are an example of block devices.

The file subsystem also interacts with character devices. Character devices include all devices which are not block devices, and can be read and written as little as one character at a time, such as terminals and tape devices.

The process control subsystem has responsibility for the creation/termination of user processes, interprocess communication (IPC), process scheduling, process synchronization, and memory management. A user's process interacts with the process control subsystem using system calls like fork(2), exec(2), wait(2), exit(2), brk(2), kill(2), and signal(2).

The memory management facility is responsible for making sure each process is allocated sufficient memory to perform its tasks. IRIX uses *demand paging* to control user memory space. "Demand paging" means that a page containing the requested information is made local to a CPU only when it is needed, or "demanded", by the executing process.

The scheduler facility is responsible for fairly allocating the CPUs to individual user processes. This is handled through queueing mechanisms. Processes with the highest priority are given CPU attention first. A process either voluntarily gives up its CPU while waiting for a resource (for example, I/O data or system call handling) or is preempted by the kernel when its time slice is consumed.

Interprocess communication is supported in several forms including signals, pipes, shared memory, and message queues.

Hardware control is responsible for handling device interrupts. Devices like terminals or disks may interrupt the CPU while a process is executing. Interrupts are serviced by special interrupt handling functions in the kernel.
#### **Summary of IRIX Kernel Primary Functions**

#### • System Initialization

A facility exists for the IRIX kernel to start up and initialize itself. The system provides a "bootstrap" facility to load a copy of the IRIX kernel into the system memory and start it running.

#### • Process Management

A facility to create, terminate, and control user processes. IRIX is a multiprocessing operating system, so the kernel ensures that each active user process is given its appropriate share of CPU attention and other resources. Therefore, all processes appear to execute in parallel.

#### • User Program Interface

The kernel provides a robust set of system calls allowing user programs to access the vast array of services provided by the operating system. System calls are invoked by library routine interfaces to the operating system.

#### • Memory Management

On IRIX systems, the total amount of memory needed to accommodate all currently active processes far exceeds the physical memory installed in the hardware. To simulate more memory than is physically available and help overcome this bottleneck, the IRIX kernel implements a virtual-memory system. The system maps virtual addresses to physical addresses at run time. Therefore, there are no memory restrictions on a user's process other than those imposed by the operating system or imposed by the system administrator.

#### • File Management

The IRIX operating system maintains many types of files which reside in file systems. A file system is an organized hierarchy of directories containing these various file types. File systems typically reside on physical media such as hard drives, and the operating system provides the services to access individual files within file systems. IRIX supports multiple file system types.
#### • I/O Management

The operating system provides several user-selected options to influence the path taken for input/output data, which affects I/O performance and the level of risk for data loss. The kernel supports familiar I/O methods such as sequential and random I/O, buffered and direct I/O, synchronous and asynchronous I/O, file locking mechanisms, etc.

#### • Communication Facilities

The operating system provides for inter-process communication, inter-machine communication (networks), and communication between processes and devices.

Module 6: Interrupt and Exceptions (Preliminary)

# **Interrupts and Exceptions (Preliminary Notes)**

This section provides an overview of Interrupts and Exceptions. By the end of this section, the student should be able to:

- describe the difference between an interrupt and an exception
- describe the five different initial entry points, or "vectors", into the kernel for interrupts and exceptions
- describe the logic flow through various interrupt and exception handlers

### **Processor Operating Modes**

The MIPS processor under IRIX operates in one of two modes: kernel and user. The processor enters the more privileged kernel mode when an interrupt, a system instruction, or an exception occurs. It returns to user mode only with a "Return from Exception" instruction. Certain instructions cannot be executed in user mode. Certain segments of memory can be accessed only in kernel mode, and other segments only in user mode.

### **Interrupt and Exception Types**

- Types and Entry Points (handler table)
  - TLB exception
  - Extended TLB exception
  - ECC Exception
  - Reset Interrupt
  - General Exception(s and Interrupts)
    - 32 subtypes

The hardware defines the entry points. The handling is done by software.
The processor supports five hardware, two software, one timer, and one nonmaskable interrupt. The hardware interrupt is described in great detail in Chapter 17 (17.3) of the R10000 Microprocessor User's Manual, in the section titled "Interrupt Exception" (http://www.sgi.com/MIPS/products/r10k/UMan\_V2.0/HTML/t5.Ver.2.0.book\_365.html#0). Software exceptions and interrupts are described in great detail in Chapter 6 (6.14) of the R10000 chip manual, in the section titled "Interrupts" (http://www.sgi.com/MIPS/products/r10k/UMan\_V2.0/HTML/t5.Ver.2.0.book\_129.html#HEADING169).

#### **How are Interrupts Different From Exceptions?**

- Interrupt
  - asynchronous to the currently executing process or thread
  - due to causes unrelated to the current user process
  - After an interrupt, control returns to the next instruction
- Exception
  - synchronous to the currently executing process or thread
  - caused by, or requested by, the currently executing process or thread
  - After an exception, control returns to the same instruction

Exceptions are occurrences which make a CPU stop operating in user mode and begin executing in kernel mode, as a direct result of something that user's process did, either accidentally (such as a floating point error), or on purpose (such as a system call request). Some examples of exceptions are floating point exceptions, system call exceptions, and page fault exceptions.

Interrupts are (probably) not due to actions of the currently connected process. If a CPU receives an I/O interrupt, the CPU will stop executing in user context, and start executing in kernel context, in order to handle the I/O interrupt. The I/O which has completed is probably the last step of some other process's I/O system call request, although it could be an asynchronous I/O which was requested earlier by the currently connected process.
An interrupt is caused externally and asynchronously, and is distinct from the currently executing process. Some examples of interrupts are disk interrupts, tty interrupts, hardware error interrupts, and clock interrupts.

Another difference between exceptions and interrupts is where the CPU resumes execution in the user process, once the CPU returns to user context. In general, exceptions occur "in the middle of instructions", therefore when the CPU returns to user context, it has to try to restart that same original instruction. An interrupt occurs "between" instructions, that is, when the CPU returns to user context, it executes the *next* instruction after the point the interrupt occurred.

#### **How are Interrupts Similar to Exceptions?**

- Both cause the CPU to do an exchange to kernel mode.
- Kernel then saves context of previous process

### MIPS Processor Exception and Interrupt Kernel Entry Points

On MIPS processors, there are five possible interrupt or exception vectors, but only FOUR of them are entry points to the kernel:

1. TLB exception
2. Extended TLB exception
3. ECC exception
4. Reset interrupt
5. General exception

The Reset Interrupt, like the NMI Interrupt, is caused by pushing a button, and is handled by the hardware, not by the kernel.

Examine the "sbd.h" file (use escope, or look on bonnie for the source file.
Here's the path on bonnie for the IRIX 6.5 version /hosts/bonnie.engr.sgi.com/proj/irix6.5se/isms/irix/kern/sys/sbd.h). (I think "sbd" stands for "system board" ?) This file defines the five entry points listed above. Although the comments say "Chip definitions for R3000 and R4000", these entry points apply to the R10000 chip as well. All of them are called "Exception vectors" (see at or about line 39).

### The TLB Exception

The first one, the "utlbmiss vector", listed above as the "TLB exception", is defined at (or about) line 49.

#define UT\_VEC COMPAT\_K0BASE /\* utlbmiss vector \*/

The Translation Lookaside Buffer is a piece of hardware used to contain mappings of virtual addresses to physical addresses. It is of limited size. With an R10000 chip, there are 64 registers, each of which holds two virtual-to-physical page mappings, so there is a maximum of 128 possible translations from virtual to physical that can be found in the TLB at any given time. With an R4000 or R5000 chip, there are 64 registers, each holding only one virtual-to-physical mapping. If a user tries to reference a virtual address that isn't one of the current 64 or 128 in the hardware, we take a "TLB exception" or "TLB miss" (these terms are synonymous), and the MIPS processor then starts executing code at the entry point for TLB exceptions (at address 80000000).

#### The Extended TLB Exception

#define XUT\_VEC (COMPAT\_K0BASE+0x80) /\* extended address tlbmiss \*/

This is the entry point for the "extended TLB handler". This handler has the same function as the TLB miss exception vector, but handles TLB misses on 64 bit addresses. The utlbmiss vector handles TLB misses on 32 bit addresses, which is the default. The TLB miss handlers have no time to do anything else, so their code handles little beyond the TLB miss situation itself.
#### The ECC Exception

#define ECC\_VEC (COMPAT\_K0BASE+0x100) /\* Ecc exception vector \*/

ECC exception code (Error Correcting Code exceptions) is used to handle single or multi-bit errors, or single or multi-bit cache errors. All MIPS processors jump to the ECC exception vector, at yet another "well-known place in memory". The R10000 puts this entry point at a fixed place in memory. It's unusual that an ECC exception happens, and it doesn't follow normal rules for handling. In general, the more memory your system has, the more frequently you will get ECC exceptions. You probably won't see these very often and it will be obvious when you do. The machine software and hardware will call out where the problem area is (if it's a double-bit error) and keep track of how often it happens and whether the memory is going bad (single-bit errors). Like the TLB exception handlers, the ECC exception handler deals with the situation and doesn't do much of anything else. Both the TLB handlers and the ECC handler are explicitly short exceptions.

## The Reset Interrupt

#define R\_VEC (COMPAT\_K1BASE+0x1fc00000) /\* reset vector \*/

The non-maskable interrupt, the reset interrupt, and the power-on interrupt are all caused by people pushing buttons. When a processor receives an NMI (Non-maskable interrupt), a reset interrupt, or a power-on interrupt, it doesn't actually get handled by the kernel. As a result of the button being pushed, the CPU goes off into PROM to handle it, so these are really not handled by the kernel.

#### Nonmaskable Interrupt (NMI) Core Dumps

It is possible to manually generate a system core dump without the benefit of a system panic. All high-end systems (CHALLENGE® L or CHALLENGE® XL servers, Onyx® workstations, Origin200 and Origin2000 systems) contain a special feature that enables system administrators to initiate system core dumps. They accomplish this by issuing a Nonmaskable Interrupt (NMI) request.
Depending on the system, selecting a system controller menu option or pressing a special button on the system controller will initiate an NMI. A system administrator (often at the request of SGI support) normally induces an NMI core dump when users of a system complain that the system is partially or completely hung. The resulting system core dump may provide a clue about the cause of the problem.

### The General Exception(s and Interrupt vector)

#define E\_VEC (COMPAT\_K0BASE+0x180) /\* Gen. exception vector \*/

Calling this vector the "General exception vector" is a bit of a misnomer. A better name would be the "General Exception AND INTERRUPT vector", because it actually is for both interrupts and exceptions. This entry point handles all the kinds of interrupts and exceptions not handled by the other vectors (e.g., system calls, clock interrupts, I/O, etc.).

On the MIPS architecture, all the reasons you'd end up here are subdivided into 32 different ways of dealing with why-you-got-here, sort of like 32 sub-entry-points, 32 potential vectors within the general exception vector, and those 32 are found in ..../irix/kern/os/startup.c. Look for "causevec". There's a vecint vector for interrupts, more TLB stuff (read misses, write misses, attempts to modify stuff when you don't have permission), read/write address errors, system calls, breakpoint instructions, FP overflow, etc. See the MIPS architecture manual, available from the SGI home page, for a definition of each of these: http://www.sgi.com/MIPS/products/r10k/UMan\_V2.0/HTML/t5.Ver.2.0.book\_1.html

In particular, go to the table of contents and take a look at Chapter 17 on CPU Exceptions and 6.14 on Interrupts. In that same book, go to the index, find "Cause register".
The MIPS architecture defines 8 interrupt levels, which are covered in the descriptions of the coprocessor 0 status register and the coprocessor 0 cause register.

Some background: MIPS defines things as part of the CPU. One of the concepts of "CPU" is the concept of "coprocessor". There used to actually be a co-processor, but now it's built into the CPU chip - but it's still called "the co-processor". In all MIPS CPUs, coprocessor 0 is for the special CPU control registers: TLB manipulation, status regs, cause regs, clock reg, count and compare regs, up to about 32 control regs - see the architecture manual. These can only be accessed in kernel mode. Coprocessor 1 is always the floating point processor, holding the FP status reg and FP control reg used for FP operations. The architecture also defines coprocessors 2 and 3, but they are not yet in use (possibly will be for media extensions and vector extensions). Actually, these are register sets, pieces of the CPU. The word "processor" is misleading. Every CPU has these.

The status register in coprocessor 0 has 8 hardware levels defined for interrupts, and we don't use them, because that's not enough on the big machines, which have so many interrupt sources. So on Origin and other platforms we provide, external to the CPU hardware, in the HUB chip, a register that says \*here\* are the interrupts that are really pending. They are prioritized into 128 levels, pretty much per-device; e.g., for each SCSI controller, it's common to use 30-40 of them, as defined by MV developers. The hardware prioritizes them, but the software just uses them as 128 levels. This is the same as the UNICOS concept: the hardware prioritizes interrupts by bit position, while the software doesn't care. The HUB register is called the interrupt pending register.
To see all about how to read what interrupt is pending in the register, see: http://babylon.engr.sgi.com/systemsw/projects/lego/hardware.html (SN0 used to be called Lego). Click on "Lego Hub, Router, IOC3, LINC Chips", then "ChipDoc - Chip Hub Programming Manual", and go to Chapter 2, "Hub Internal Register Definitions", sections 2.1.1.20 (INT\_PEND0) and 2.1.1.21 (INT\_PEND1).

What's important is that there is system hardware external to the processor that keeps track of which interrupts are actually pending at a given time, but these aren't icrash-locatable, so there's no way to find out what the processor actually knows.

So that's a quick walk-through of the entry points for the four/five exceptions/interrupts mentioned before. When it comes to handling interrupts and exceptions, there are special handlers for each of those 5. Four of them handle special cases, and the fifth one has the 32 subtypes in a table in **startup.c**. If you look at the table, you'll see there's a single subtype for all device interrupts (disk, tape, console, tty, etc.). We always come through the general exception vector, and always check to see if it's a hardware interrupt. If it is, the CPU spawns a thread for that kind of hardware and for the device handler, which has other subtypes, and looks at more and more sub-divisions of handlers until it gets to the point where some piece of software says this is the driver and the interrupt handler for this particular device, and spawns a thread to handle it (thread is spawned higher up).

## **General Exceptions**

The above illustration is a simple diagram of the kinds of interrupts and exceptions that the kernel checks for, and reflects the checks that the code goes through in the actual order that they are checked for. Details of how these checks are performed, and more detail about the different interrupt and exception types, are below.
Five interrupt and exception vectors were described earlier (see /hosts/bonnie.engr.sgi.com/proj/release/isms/irix/kern/sys/sbd.h):

```
#define UT_VEC   COMPAT_K0BASE               /* utlbmiss vector */
#define XUT_VEC  (COMPAT_K0BASE+0x80)        /* extended address tlbmiss */
#define ECC_VEC  (COMPAT_K0BASE+0x100)       /* Ecc exception vector */
#define R_VEC    (COMPAT_K1BASE+0x1fc00000)  /* reset vector */
#define E_VEC    (COMPAT_K0BASE+0x180)       /* Gen. exception vector */
```

The "UT\_VEC" and "XUT\_VEC" exceptions handle 32-bit and 64-bit architecture TLB misses, respectively. The ECC\_VEC, or "Error Correcting Code" vector, handles single or multi-bit errors, as well as single or multi-bit cache errors. The "R\_VEC", or "Reset" vector, handles NMI (Non-Maskable Interrupt) and reset interrupts.

The "Interrupt and Exception Roadmap" details the "E\_VEC", or "general exception" vector, which actually handles both hardware interrupts as well as software interrupts and exceptions. There are 32 possible types of "general exception", as defined in ..../isms/irix/kern/os/startup.c, shown in the escope screen extract below.

When a CPU acts on an "E\_VEC", or general exception, it executes the code found in .../release/isms/irix/kern/ml/LOCORE/gen\_exc.s. The primary function of the gen\_exc.s code is to determine which of the above 32 reasons caused the CPU to stop operating in user context, and which routine is appropriate to handle what must be done while executing in kernel context.

The gen\_exc.s code checks first for interrupts, then for system calls, by examining the bits in the k0 register after masking them against CAUSE\_EXCMASK and then EXC\_SYSCALL, as shown in the escope screen extract, below.

After the above, checks are made for a TLB read miss (exception), watch exception, and breakpoint exception, after which the code falls through to a routine which handles "everything else".
The CAUSE\_EXCMASK, EXC\_SYSCALL, and other exception comparison bit fields are explained in the following cscope screen extracts.

## Hardware Interrupt Check

The possible hardware interrupts are defined in .../release/isms/irix/kern/sys/sbd.h.

For an R10000 chip O2000 or O200 machine, the hardware interrupt bit mask is explained in .../release/isms/irix/kern/sys/SN/SN0/IP27.h.

The priority and definition of the above hardware interrupts can be found in .../release/isms/irix/kern/ml/SN/intr.c.

### Software and Hardware Exception Check

The possible hardware and software exceptions, like the hardware interrupts, are also defined in .../release/isms/irix/kern/sys/sbd.h. Notice that if the result of the masking operation is zero, we have a hardware interrupt of one of the kinds defined above. As stated above, this is the first thing the gen\_exc.s code checks for.

After the check for hardware interrupts, the check for hardware and software exceptions begins. A check is made to see if this is a system call (EXC\_SYSCALL). Then a check is made to see if this is a TLB read miss (EXC\_MOD or EXC\_RMISS). TLB write misses are handled later in the logic (see the logic path on the diagram leading to ecommon, then PDA, then VEC\_tlbmiss). If this is not a TLB read miss, then a check is made to see if we have a watch (EXC\_WATCH) or breakpoint (EXC\_BREAK) exception (these seem to be pathways related primarily to debugging). Finally, all the remaining exceptions (including the software exceptions, all of which have a prefix of "SEXC\_...", below) fall through the longway code, then to ecommon, and then to the appropriate handler.

[cscope screen extract of the exception code definitions in sbd.h: the EXC\_CODE() entries, including watchpoint reference and virtual coherency on data read, followed by the software-detected SEXC\_ codes]

Module 7: Process Management Overview

# **Process Management Overview**

This section provides an overview of IRIX processes and process management. By the end of this section, the student will be able to:

- Describe the difference between a process and an executable file
- Use the elfdump tool to examine an executable file
- Define "process" and describe a virtual process image
- Describe a stack format, and the differences between a user stack and a kernel stack
- Use the gmemusage tool to display and examine system and process physical memory usage
- Describe some of the key structures in process control, and their functions
- Describe the flow of execution in a context switch
- Describe the flow of execution in a system call

## **Process Management Overview**

The process management facilities within IRIX are at the heart of the operating system. They are responsible for coordinating all the tasks invoked by users as well as system tasks. The process management subsystem's responsibilities include the following:

- User process life cycle: Creation, execution, interruption, and termination of user processes.
- CPU scheduling: All runnable processes are placed on run queues. Run queues must be constantly maintained so that the highest priority processes are scheduled to run next at any point in time.
- Context switching: Once a process gets connected to a CPU, process management must determine how long the process is allowed to use the CPU before the CPU switches to another process.
- Accounting: The kernel must keep track of how much execution time a process has consumed in user mode as well as kernel mode.
- Memory usage: The memory management subsystem must be consulted to allocate and deallocate main memory when a process is initiated, or when it expands or contracts in memory.
- Exception handling: When programs executing within processes cause exceptions, the operating system must notify the affected process via the signal mechanism.
- I/O processing: Processes need to associate themselves with files for purposes of reading and writing. The process management subsystem must coordinate with the file and I/O management subsystems.

## **Executable Files and Processes Diagram**

| Linked view | Executable view |
|----------------------|----------------------|
| ELF header | ELF header |
| Program header table | Program header table |
| Section 1 | Segment 1 |
| ... | Segment 2 |
| Section n | ... |
| Section header table | Section header table |

## **Executable Files and Processes Diagram**

A process is created from an executable file stored in the file system. Executable files generated by any compiling system are called a.out files, because the default name for the compiler and linkage editor output is a.out. At the beginning of each a.out file is a header. The header contains information about the format and structure of the program within the file. The header tells the system how to build a process in memory from the executable file stored on disk. All a.out format files in IRIX use the format called the Executable and Linking Format (ELF).
The diagram above shows an overview of the ELF format. The stored format of an a.out file is called the linked view (a program). As the file is loaded into memory to execute, the format of the a.out file within memory is called the executable view (a process).

ELF a.out files are constructed from various sections that are described within the header. The sections describe the organization of the parts of an executable file as stored on disk. Sections are used to hold parts of a program such as instructions (code text), data, symbol tables, etc.

When an a.out file is loaded into memory for execution, three kinds of logical segments are set up: the text segment, the data segment (initialized data followed by uninitialized data, the latter actually being initialized to all 0's), and a stack. A segment holds the parts of the program for the execution view and may contain one or more sections from the executable file.

A single-threaded program loaded into memory may have multiple text and data segments, but only one stack segment. The text segments are not writable by the program; if other processes are executing the same a.out file, the processes will share the same text segments.

### Executable Files and elfdump(1)

ELFDUMP(1)

NAME

elfdump - dumps selected parts of a 32-bit or a 64-bit ELF object file/archive and displays them in ELF style

SYNOPSIS

elfdump [ options ] file

DESCRIPTION

The elfdump command dumps selected parts of a given ELF object file. This command works for 32-bit or 64-bit ELF object files or ELF archives only. It accepts these options and many others (see man page):

-f  Dumps the file (ELF) header.

-h  Dumps all section headers in the file.

The elfdump command displays selected portions of an a.out executable file. The options to elfdump(1) control which portions are displayed. The examples on the following pages illustrate only the use of the -f and -h options.
For additional options, see the elfdump(1) man page.

## **Process Definition Diagram**

## **Process Definition Diagram Explanation**

A process is the execution of a program or executable file stored in the file system. An IRIX process is partitioned into several regions as shown above. All processes have text, data, and stack regions, but may contain others such as shared memory and memory mapped regions.

#### Text

The text region contains a sequence of bytes that the CPU interprets as machine instructions. It has "read only" status and may be shared by multiple processes; that is, multiple processes may be executing concurrently, all issuing instructions from the same shared text area. Because the text area can be shared, individual processes are not allowed to modify it.

#### Data

The data region is a memory region private to the individual process, and can be read or written by the process's instructions (text). It consists of two parts: an initialized area and an uninitialized area, usually referred to as the bss or heap area. The heap area can grow dynamically as the process needs more space. Heap growth is in the direction of the stack, i.e., towards higher virtual addresses. As shown above, a process cannot read or write any other process's data or stack regions.

#### Stack

The stack is used to hold locally allocated variables and parameters passed to functions. It is automatically expanded as needed when the process invokes functions or subroutines. Stack growth is in the direction of the heap, i.e., towards lower virtual addresses.

A process has two stacks. The user stack is used while executing instructions in user mode. The kernel stack is used for executing instructions in kernel mode, that is, while the kernel is executing instructions "on behalf of" or "in the context of" the user process, such as when the kernel executes system call code which the user process requested.
As of IRIX 6.5, the user information traditionally stored in the "u area" on UNIX systems has been dispersed into other areas of the IRIX kernel. That information is now in various parts of the uthread, kthread, and proc structures.

The process's page table, the "pte", is used to load the TLB for the purpose of virtual to physical address translations.

#### **User Stack Diagram**

## **User Stack Diagram Explanation**

Attributes of the user stack are:

- It is automatically created, and its size is dynamically adjusted, at run time.
- Logical stack frames are pushed onto the stack when a function is called, and popped off the stack when returning.
- The stack pointer indicates the current position in the stack.
- A stack frame contains the parameters passed to the function, the function's local variables, the location of the previous frame, and the return address to the calling function.
- The kernel grows the stack as needed.

## Kernel Stack Diagram

## **Kernel Stack Diagram Explanation**

Because a process can execute in two modes, user or kernel, a separate stack is used for each mode. As stated on the previous page, the user stack contains the arguments and local variables for functions executing in user mode. The kernel stack contains the stack frames for functions executing in the kernel in kernel mode. The function and data entries on the kernel stack refer to functions and data in the kernel, not the user program. The kernel stack's construction is the same as that of the user stack.

The "kernelstack" contains information about what the kernel is doing on behalf of a particular process. This information can help to determine the cause of a system panic or hang. The kernelstack virtual address is the same for all processes running on the system. For example, on an IP27 system, the address of the kernelstack is 0xffffffffffff00. The kernelstack address is platform specific and is determined when the kernel is built.
Information is contained in the proc struct about the mapping of the kernelstack address to a particular physical memory page. Kernel stacks are limited in size to 1 page (or 2 pages for a 32-bit architecture).

### Processes and Kernel Threads

In IRIX revision 6.1 and earlier versions of IRIX, the process was the central mechanism for distributing processor resources over a collection of independent and cooperating tasks in the operating system. IRIX 6.2 introduced the migration toward the use of kernel threads (kthreads) as a central mechanism.

Kernel threads resemble the execution model of a UNIX process and consist of a code stream, private stack, and private register space. Unlike UNIX processes, kthreads are inexpensive to create and destroy, and can be quickly scheduled. A kthread has an associated user process context only if it is running on behalf of a system call or page fault; otherwise it has no logical connection to a user process.

With IRIX 6.2 and 6.3, only a partial conversion was made. The construct of a kthread was introduced, but it was still closely associated with entries in the proc table. In fact, a proc table entry was allocated for each active kthread in the system (even those that had no process context). Not until IRIX 6.4 was the conversion more or less complete (the evolution will continue with future revisions of IRIX). The kthread has now become the fundamental execution entity in the system.

There are three types of kthreads:

- User process
- Interrupt thread (ithread)
- Service thread (sthread)

### Displaying process memory (gmemusage(1))

The command gmemusage(1) can be used to display the memory usage for individual processes as shown below.
# Cray Origin2000 System Workload gmemusage(1) Display

# IRIX Physical Memory gmemusage(1) Display

# Process Physical Memory gmemusage(1) Display

# **Process Control Diagram**

### **Process Control Diagram Explanation**

This diagram provides a brief overview of how the kernel controls individual user processes.

1. pda structure: Each CPU has a private data area (pda) in main memory which points to the thread structure for the process currently connected to the CPU.
2. thread structure: When a process is created, a thread structure is dynamically allocated in the kernel which is used in controlling the process. The thread structure holds information such as status of signals, CPU scheduling information, and current system call and arguments passed.
3. proc structure: When a process is created, a proc structure is also dynamically allocated in the kernel which is used in controlling the process. The proc structure keeps track of who its parent, child, and sibling processes are as well as what process group it belongs to.
4. pte structure (page tables): Helps map virtual process pages to physical memory pages.

Because the thread and proc structures are always resident in memory, the information maintained in these structures for a particular process is always available to the kernel, even if the process is paged out. Therefore, the thread and proc structures contain all of the data needed about a process even though the process might not be present in main memory.
Conversely, if a process is paged out by the kernel to gain space for a different process, the information in the user area is not available to the kernel until the process is paged back into memory. Therefore, the user area contains process control information that is not needed by the kernel when the process is inactive or paged out of main memory.

### **Process Segments or Regions**

The IRIX kernel divides the virtual address space of a process into logical segments or regions. A segment or region is a contiguous area of virtual address space of a process which can be treated as a distinct object to be shared or protected.

Several processes can share a segment or region. For example, several processes may execute the same program, such as multiple users of the same shell program. Therefore, it makes sense for them to share the same copy of the text region. In a similar manner, several processes may cooperate by sharing a common shared-memory region.

The process region mechanism also allows the kernel to protect regions of a process's address space so that the process itself cannot alter the region. This is done with the text region of a process.

# **Kernel's Region Tables Diagram**

### Kernel's Region Tables Diagram Explanation

The kernel maintains a region table (not shown above) and allocates an entry in the table for each active region on the system. Region table entries keep track of where each region resides in physical memory. Process-independent attributes are kept in the region table entries.

Each process also has a per process region table where each entry is usually referred to as a pregion (preg in the diagram). Each preg entry has a pointer to a region table entry which has pointers to where the region resides in physical memory (shown as a dashed arrow in diagram).
The preg entry also contains a permission field that indicates the type of access allowed to the process: read-only, read-write, or read-execute. Process-specific attributes are kept in the pregion structure.

A process's pregions (pregs) are maintained in two separate lists. One list controls the regions which are considered private and the other controls those which are shared. The process's thread structure locates the private and shared pregion lists.

### **Region Sharing Diagram**

### **Region Sharing Diagram Explanation**

Several processes can share parts of their address spaces via a common region. Each process sharing a region accesses the region via a private pregion (preg) entry. The illustration above shows two processes (A and B) which are executing the same program and sharing a shared-memory region.

The obvious advantages of region sharing are:

- Reduction in physical memory requirements when multiple processes are executing the same program. For example, there is a significant reduction in physical memory requirements for a program like the shell, which has many concurrent users.
- Much less kernel paging is required with one copy of a program's text and multiple processes executing within the same text area.
- Process startup time is reduced if the desired program text is already in memory.

### Multiprocessing

The IRIX system is a multiprocessing environment. Each CPU can execute in only one of two locations at any time: a user process or the operating system kernel. Only one user process can execute at a time per CPU when the CPU is not "in" the kernel, yet the operating system gives users the impression that a single CPU can give attention to multiple processes simultaneously. The kernel provides this illusion by a mechanism called *time slicing*. Processes receive short bursts of CPU attention called time slices.
In general, a single process will not receive 100% of the attention of a CPU. Therefore, to the user it looks like multiple processes are all executing simultaneously, but in reality it is just one process at a time executing in short bursts. The CPU(s) must switch attention among multiple processes (residing in priority order on the run queue) very rapidly to provide this illusion. Processes are switched in and out of a CPU typically every few milliseconds. On a multi-CPU system, multiple processes execute simultaneously in the various CPUs.

# **Process Execution Flow Diagram**

### **Process Execution Flow Diagram Explanation**

Each process runs in its own address space and behaves as if all of the machine resources are available for its exclusive use. The execution of multiple processes in parallel is achieved by switching different processes in and out of the CPU every few milliseconds. The above diagram shows the typical flow of execution for a process.

1. When a CPU is executing in a user process, it is said to be operating in user mode. The process can only access memory that is within its address space.
2. When a user process makes a system call or generates an exception (fault), or when an interrupt occurs, such as a terminal interrupt (Ctrl-C) or a system clock interrupt, the user process's context is saved before the CPU executes kernel code.
3. When the CPU leaves the user process and enters the kernel, it is executing in the kernel on behalf of that user process and is said to be executing in kernel mode. In kernel mode, the CPU is authorized to execute privileged instructions and can access the code and data of any process. A user process cannot access kernel code, or the address space of any other process.
4. When the kernel has completed execution on behalf of the user process, it restores the context of the user process and returns CPU control to the process at the location where the user process was previously interrupted. Execution resumes in user mode.

From the viewpoint of a user program, a process's address space is a linear, flat, addressable area of memory starting at address zero and extending to a fixed address boundary set by both the hardware and the operating system kernel. However, to the kernel, a process's address space is divided into discrete regions called text, data, heap (bss), and stack, shown on previous pages.

#### **System Call Interface Diagram**

### **System Call Interface Diagram Explanation**

A system call is the mechanism a user uses to invoke a function or perform a task in the kernel. There are hundreds of system calls a user can invoke; for example, open(2), read(2), write(2), close(2), chown(2), kill(2), etc. The method used to invoke a system call and have control passed to the kernel for execution is illustrated above and explained below:

1. The user's program issues a system call by specifying the name of the specific call with an attached list of arguments. For every possible system call supported in IRIX, there is a unique library routine which processes the system call request. Control is passed to this library routine.
2. From the viewpoint of the kernel, every system call is known by a unique integer which must be passed to the kernel for identification. The library routine invokes an instruction that changes the process's execution mode to kernel mode and causes control to be passed to the kernel's system call handling code, passing along the integer identifier for the desired system call.
3. The kernel's system call handling code receives the integer that identifies the system call the user is invoking, and looks up the system call number in a table (sysent) to find the address of the appropriate system call handler.
4. Control passes to the identified system call handler and the system call executes.
5. When finished with the user's request (for example, a file read request), the system call handler returns to the kernel code from which it was called.
6. The kernel returns to the user process's library routine which originally passed control to the kernel, switching back to user mode.
7. The library routine returns to the user's program where the system call was made, with a return value indicating success or failure.

# **IRIX System Call Processing**

#### Unit covers:

- Overview of IRIX kernel system call processing
- List and role of key components in system call processing
- System call walk-through
- System call argument processing including return value conventions
- System call icrash(1m) examples

### **System Call Review**

- User processes access Operating System Services via a mechanism called System Calls.
- System calls are performed with these general steps:
  1. The user program prepares calling argument values according to each one's prototype as described in the call's man(2) page.
  2. The program performs a hardware instruction that causes an interrupt. The mnemonic for this in IRIX is syscall.
  3. The interrupt is "trapped" by the OS; the OS performs the operation (if legal) and returns one or more return values as described in the call's man(2) page. See the man page for intro(2) for a list of standard error return values.
  4. The OS returns control to the user process, passing any return values to the user program.
  5. The user program checks return values for errors and handles the error or continues processing.
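From the user program's side, the general steps above reduce to: make the call, then check the return value and errno before continuing. The sketch below shows this for open(2) using only standard calls; the helper name `try_open()` is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Open a file read-only and report the outcome.
 * Returns 0 on success, or the errno value set by open(2) on failure. */
int try_open(const char *path)
{
    assert(path != NULL);

    /* Steps 1-4: the library routine loads the arguments, traps into
     * the kernel, and hands back the kernel's return value. */
    int fd = open(path, O_RDONLY);

    /* Step 5: check the return value; -1 means errno holds the reason. */
    if (fd == -1)
        return errno;

    close(fd);
    return 0;
}
```

A caller would typically compare the result against the values listed in intro(2), for example ENOENT when the path does not exist.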
- Programs have access to System Calls with several methods:
  1. User codes the above "calling sequence" in the program using direct assembler code.
  2. User uses standard UNIX system call library routines (such as open(2), read(2), close(2)) to directly access the system call.
  3. User uses standard "higher level" library functions to access system services. These library routines take a lot of the clerical work out of accessing system services or provide services in themselves. For instance:
     - fopen(3), fclose(3), fread(3), and fwrite(3) provide file I/O with user library level (double) buffering and other services.
     - psignal(3) provides general access to kernel signal processing services.
     - malloc(3) allocates and manages user (heap / BSS) memory, making brk(2) system calls to request that the kernel expand user memory.
  4. Compilers generate System Call sequences as part of their command support.
  5. Compiler commands such as FORTRAN's OPEN and READ build upon library man(2) system call routines.
  6. Most compilers provide direct access to both man(2) and man(3) library routines.
  7. Many compilers allow embedding of assembler commands within their "normal" code. These assembler routines may make system calls as described above.
- In all cases, the same low level kernel system call is invoked and processed by the OS. This unit describes kernel system call processing at this level.

### System Call Component Diagram

### **System Call Overview**

The diagram shows the general flow of control for IRIX system call processing. The open(2) system call is used as a typical example.

1. User process "A" executes the open(2) system call. The system call library code invokes a syscall interrupt.
   - Control switches to the kernel.
2. Low level kernel exception handler (trap) routines decode the exception type and dispatch control to assembler routine systrap for system call exceptions.
3. Routine systrap checks for usage errors (e.g. the interrupt must be from user, not kernel) and saves the user's CPU register context.
   - Systrap fabricates a kernel stack and calls C function syscall().
4. Function syscall() accesses the user's calling arguments.
   - A check is made to make sure that the system call number fits within the scope of the sysent[] table.
   - The system call number argument is used to load the corresponding function address from the kernel sysent[] table.
   - The sysent[] table argument limit is used to check the caller's number of arguments.
   - If there are errors, syscall() returns to the user with a standard error code (errno.h).
   - If there is no error, syscall() calls the kernel function corresponding to the system call number.
5. Processing continues at the kernel system call function.
   - During call processing, error conditions may cause the function to return to syscall() without completing the required work. Syscall() returns the error (syserr.h) code to the user process.
   - The thread of logic started in the system call may block (sleep), waiting for some condition (resource) in the kernel.
   - This is called a "context switch" and is covered in another unit.
   - Eventually the system call processing completes.
   - Functions pass one or two "return values" to the calling user process indicating the success or failure of the call. The values are documented as the call's RETURN VALUES in the man(2) pages.
   - Additional data may be passed between the user and the kernel via user calling argument addresses; for example the path address in an open(2) or the buffer address in a read(2).
6. Returning to syscall(), the kernel:
   - Does final error checking.
   - Restores the calling process's (A) CPU register context -OR-
   - Calls soft\_trap() to schedule another process (B), restoring its context instead.
   - Process A resumes either immediately or when it is resumed by swtch().
   - Process A should check its return value(s) before proceeding with other program logic.

#### System Call Walk Through

### **User Makes System Call**

Calling arguments are loaded into CPU registers. By IRIX convention, these are the "a" registers starting with register a0. The man page for open(2) shows this C SYNOPSIS:

```
int open (const char *path, int oflag, ... /* mode_t mode */);
```

For this C open command:

int fd = open("/tmp/opensam",O\_CREAT,0700);

- The address of the path string literal "/tmp/opensam" is loaded into register a0.
- The open flag O\_CREAT (value 0x100) is loaded into register a1.
  - #define O\_CREAT 0x100 /\* open with file create (uses third open arg) \*/
- The mode value 0700 is loaded into register a2.
- Other operands may be used as defined in the open(2) man page.
- The program calls the open system call library function open() in open.s, which loads register v0 with the system call number. The number for each call is defined in file sys.s. This number represents a position (index) into a system call entry points table called sysent[]. For open(2), this is 1005 (the base of the table is 1000).
- The open library code executes the syscall machine instruction, causing an interrupt into the kernel.
- Register content summary:
  - a0-a2 contain the calling arguments.
  - v0 contains the system call number.
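The relationship between the number loaded into v0 and the sysent[] table can be sketched as a small table lookup. This is a minimal model under stated assumptions, not the kernel's code: only the base of 1000 and the open(2) number 1005 come from the text above. The structure layout, the field names `sy_narg` and `sy_call`, the `demo_` functions, and the `DEMO_ENOSYS` value are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SYSCALL_BASE 1000      /* from the text: open(2) is 1005 */
#define DEMO_ENOSYS  (-1L)     /* hypothetical "no such call" result */

typedef long (*sys_handler_t)(const long *args);

/* Hypothetical table entry: argument count plus handler address. */
struct sysent_entry {
    int           sy_narg;
    sys_handler_t sy_call;
};

/* Stand-in for the kernel's open handler; pretends to return fd 3. */
static long demo_open(const long *args)
{
    (void)args;
    return 3;
}

/* Only slot 5 is filled in; slots 0-4 are left empty (NULL handler). */
static const struct sysent_entry demo_sysent[] = {
    [5] = { 3, demo_open },
};

/* Turn a system call number into a table index, range-check it,
 * and call the handler found there. */
long demo_dispatch(long sysnum, const long *args)
{
    size_t nent = sizeof demo_sysent / sizeof demo_sysent[0];
    long index = sysnum - SYSCALL_BASE;

    assert(args != NULL);

    if (index < 0 || (size_t)index >= nent)
        return DEMO_ENOSYS;            /* number outside the table */
    if (demo_sysent[index].sy_call == NULL)
        return DEMO_ENOSYS;            /* empty slot */
    return demo_sysent[index].sy_call(args);
}
```

With the register values from the walk-through, a call with number 1005 and the three arguments (path address, 0x100, 0700) would reach `demo_open()` through index 5, while a number below the base or past the end of the table is rejected before any handler runs.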
### Sample assembler code for open(2)

```
main() {                                      from sample program opn.c
0x10000b38:  8f 84 80 1c  lw     a0,-32740(gp)      local memory pages
0x10000b3c:  24 84 10 00  addiu  a0,a0,4096         path string literal
0x10000b40:  24 05 01 00  li     a1,256             O_CREAT=0x100
0x10000b44:  24 06 01 c0  li     a2,448             mode=0700
0x10000b48:  8f 99 80 30  lw     t9,-32720(gp)      &open()
0x10000b4c:  03 20 f8 09  jalr   ra,t9              call open -> _open64

_open64:                                      from libc.so.1
0xfa3e4a0:   24 02 03 ed  li     v0,1005            index into sysent table
0xfa3e4a4:   00 00 00 0c  syscall
0xfa3e4a8:   14 e0 00 03  bne    a3,zero,0xfac5078  -> _cerror (indirectly)
0xfa3e4ac:   00 00 00 00  nop
                          jr     ra                 -> back to main()

0xfa384f8:   3c 0e 00 12  lui    t2,0x12
0xfa384fc:   65 ce 6f 58  daddiu t2,t2,28504
0xfa38500:   01 d9 70 2d  daddu  t2,t2,t9
0xfa38504:   8d c1 93 f8  lw     at,-27656(t2)      at=&errno (global error)
0xfa38508:   ac 22 00 00  sw     v0,0(at)           save v0 in global errno
0xfa3850c:   8d cc 93 a8  lw     t0,-27736(t2)
0xfa38510:   8d cd 93 f8  lw     t1,-27656(t2)
0xfa38514:   8d 8c 00 00  lw     t0,0(t0)
0xfa38518:   11 8d 00 02  beq    t0,t1,0xfaf484c
0xfa3851c:   00 00 00 00  nop
0xfa38520:   ad 82 00 00  sw     v0,0(t0)           save v0 in per-thread errno
0xfa38524:   03 e0 00 08  jr     ra                 -> back to main()

main() continued
0x10000b50:  00 00 00 00  nop
0x10000b54:  8f 85 80 40  lw     a1,-32704(gp)      &fd
0x10000b58:  ac a2 00 00  sw     v0,0(a1)           fd=v0 = open("/tmp/op
```

### **Kernel Traps the Interrupt**

- Low level interrupt handler routines test the interrupt (exception) code, calling systrap (systrap.s) for the syscall hardware exception.
- Assembler code in systrap:
  - Saves CPU registers in the user process's uthread exception frame area. Calling argument values from the a registers are now in the process thread area.
  - Switches system timers (accounting) over to the OS.
  - Uses the system call number (sysnumber) in v0 as an index into the sysent[] table to access the kernel function address, call argument count, and call flags.
- C function syscall() is called to dispatch the system call.

### Syscall() Dispatches the Call

- System call counts are incremented, both by call number and in total.
- The system call number and argument values are checked for sanity. (Specific checking of argument values is done by each system call function.) Errors are returned as described below.
- User system call arguments are copied (up to 8 of them) to the uthread ut\_scallargs[] array.
- The specific system call function is called as defined in the kernel sysent[] table.

### Kernel Performs Specific System Call

- The kernel executes functions as initiated by the "top level" system call function. For example: open() calls copen() which may call \*\*sopen(), and so on.
- Very often the kernel performing a system call must wait for some resource such as an I/O operation. In this case:
  - The function calls sleep() (or a variant of sleep()) to give the CPU to another process (thread).
  - The CPU selects another process thread and resumes processing that one.
  - Eventually the event the process was waiting for (e.g. I/O) completes. The interrupt handler for that event awakens this sleeping process with a wakeup() call.
  - The kernel will eventually select this process to resume where it left off.
- A system call may result in many sleep-wakeup situations before it completes.
- If successful, "top level" functions return specific RETURN values as defined in the system call man(2) page.
- Kernel functions check for errors, returning specific error codes as described below.
- In any case, control returns to the "top level" system call function, which returns to syscall().

### Syscall() Resumes Processing

- If the system call returns an error:
  - The error is posted to the user (set in errno).
  - The process may be sent a signal.
- If no error:
  - Set user return values rv1 and rv2 into registers v0 and v1 (see argument processing below).
- Clear flags indicating that the system call is finished.
- Return to systrap().

### Systrap() Resumes Processing

- Kernel system call functions may set a resched flag in the kernel when they perform some operation that changes the potential scheduling of processes (threads) in the system. Examples:
  - Fork creates a new process which may need to run.
  - Exit destroys a process, leaving the CPU free to run another.
  - A signal is sent to a process, awakening it from sleep.
  - I/O completes, awakening a process.
- When the resched flag is set, function qswtch() is called to check a process run queue. The most worthy process is selected to resume. (See process scheduling for more detail.)
- The kernel system call timer is stopped and the user timer is resumed.
- The first user stack TLB is loaded.
- The user's CPU context (register values) is restored.
- The ERET machine instruction is executed. Control resumes at the address in EPC, which was the user PC at the time of the syscall exception interrupt.

### **User Resumes Processing**

- The library routine returns control to the calling user function.
- The user SHOULD check the return value (v0) and errno for error before continuing.
- In the open sample used in this walk-through, the integer fd receives the return value. If there was no error, fd will be an index to the user's open file descriptor in their open file table, the product of the open(2) system call operation.

### **System Call Argument Processing**

Typical system call argument processing is shown using the open(2) system call as an example.

1. Open library code loads the calling argument values into registers a0, a1, and a2. The library code calls the kernel with the syscall instruction.
On entry to the kernel, all user registers are saved in the uthread exception frame. Systrap() creates a stack frame for the call to syscall() with room for syscall's local variables such as error and
2. The system call number is placed in the ut\_syscallno field and the user arguments are copied into the ut\_scallargs[8] array, both in the process's uthread area.
3. Before syscall() calls the specific system call function, the kernel sets a0 to point to the system call arguments, now in the uthread area. Register a1 is set to point to the return value pointer rvp (in the stack). Register a3 contains the system call number (not always used).
4. The called function uses a0 to locate the calling arguments. C code "maps" these arguments to their function use, as seen in the opena structure in the open(2) system call.
5. Kernel function open() calls copen(), passing the caller's arguments (values and pointers) to it in the a registers as
6. The system call functions proceed, each performing its service until it finishes or returns with an error.
7. At some time before returning, the system call functions store results in rv1 and possibly rv2. These values are also in registers v0 and v1. By convention, the system call functions return 0 (zero) to syscall() if there is no error, or non-zero on error.
8. Syscall() places the error, v0, and v1 values in the exception frame a3, v0, and v1 fields. Systrap() reloads the CPU registers (including the return values) from the exception area just before returning to the interrupted user process.
9. The system call library routine checks register a3 for non-zero, storing the error value in the user process global data area errno. Register v0 is delivered to the user process as the result of the system call. Depending on the specific system call, the user can test this for success or failure, and access errno for a more specific reason for failure in the case of an error.
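The dispatch and return-value steps above can be modeled in ordinary user-space C. This is a simplified sketch, not the IRIX source: `sysent_model`, `syscall_model`, `model_open`, and the struct layouts are hypothetical names invented for illustration. The only details taken from the text are the table base of 1000, index 5 for open, the 8-entry argument array, and the rv1/rv2 and a3/v0 roles. Returning the error code in v0 on failure is an assumption based on the sample assembler, which stores v0 into errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SYSCALL_BASE 1000   /* base of the table, per the walk-through */
#define MAX_ARGS     8      /* size of ut_scallargs[] in the text */

/* Return values rv1/rv2; the names follow the text, the layout is assumed. */
struct rval { long rv1; long rv2; };

/* A handler gets the saved arguments and a return-value pointer, and
 * returns 0 on success or an errno value on failure. */
typedef int (*sy_call_t)(const long *args, struct rval *rvp);

struct sysent_model {
    sy_call_t call;   /* kernel function address */
    int nargs;        /* call argument count */
};

/* Toy stand-in for open(): pretends file descriptor 3 was allocated. */
static int model_open(const long *args, struct rval *rvp)
{
    if (args[0] == 0)
        return EFAULT;               /* null path pointer */
    rvp->rv1 = 3;
    return 0;
}

static const struct sysent_model sysent_tbl[] = {
    [5] = { model_open, 3 },         /* 1005 - 1000 = 5 */
};

/* Model of the dispatch step: *a3 plays the error register and *v0 the
 * result register that the library stub examines on return. */
void syscall_model(long sysnum, const long *user_args, long *v0, long *a3)
{
    size_t n = sizeof sysent_tbl / sizeof sysent_tbl[0];
    size_t idx = (size_t)(sysnum - SYSCALL_BASE);
    long saved[MAX_ARGS] = {0};      /* stands in for ut_scallargs[] */
    struct rval rv = {0, 0};
    int error;

    assert(user_args && v0 && a3);
    if (sysnum < SYSCALL_BASE || idx >= n || sysent_tbl[idx].call == NULL) {
        error = ENOSYS;              /* sanity check on the call number */
    } else {
        for (int i = 0; i < sysent_tbl[idx].nargs; i++)
            saved[i] = user_args[i]; /* copy user arguments */
        error = sysent_tbl[idx].call(saved, &rv);
    }
    *a3 = error;                     /* non-zero means failure */
    *v0 = error ? error : rv.rv1;    /* assumption: v0 carries the error code */
}
```

Calling `syscall_model(1005, args, &v0, &a3)` with three arguments leaves `a3 == 0` and the toy descriptor in `v0`; an unknown call number leaves `a3` non-zero, which is the condition the library routine tests in step 9.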
### icrash(1M) Samples

### **Process uthread Display**

#### uthread Detail

```
>> px *(uthread_t *)0xa800000201508c00
struct uthread_s {
    ut_kthread = kthread_t {
        k_regs = {
            [0] 0xa800000201508c00
            [1] 0x80
            [2] 0xa800000201508c84
            [3] 0x80
            [4] 0x0
            [5] 0x0
            [6] 0x100197d8
            [7] 0x10019798
            [8] 0xfffffffffffb940
            [9] 0x0
            [10] 0xc000000000200490
            [11] 0x68a1
            [12] 0x0
        k_i = 0x10000005a
    (edited)
    ut_syscallno = 0x5        adjusted down from 1005, indexes to open()
    ut_scallargs = {
        [0] 0x10019138        user address of path
        [1] 0x0               open flags (is zero this case)
        [2] 0x0               open mode (is zero this case)
        [3] 0x0
        [4] 0x0
        [5] 0x200e6c
        [6] 0x0
        [7] 0x200e70
    {edited}
    ut_rsa_runable = 0x0
    ut_rsa_npgs = 0x0
    ut_rsa_locore = 0x0
    ut_rsa_pad = **
```

### Trace of open(2) System Call

#### Trace Detail (partial)

```
>> trace -f a8000002014fcc00
STACK TRACE FOR UTHREAD 0xa800000201508c00 (xlv_plexd, PID=27):
(edited)
 8 copen[../os/vncalls.c: 211, 0xc000000001cb728]
   RA=0xc000000001cb5c4, SP=0xfffffffffffffbe30, FRAME SIZE=128
   ffffffffffbe30: a8000002006a0270 0000000301419c18
   ffffffffffffbe40: a80000201508c00 a800000201508c84
   ffffffffffffbe50: a80000201508c00 0000000010019138
   fffffffffffffbe60: 0000000000000000 0000000000000000
   ffffffffffbe70: 0000000000000001 ffffffffffbed8
   fffffffffffbe80: c000000001cb5c4 c000000001418d30
   fffffffffffbe90: 000000000020000 0000000000020000
   ffffffffffffbea0: a800000201508ea8 c00000000018cbac
 9 open[../os/vncalls.c: 145, 0xc000000001cb5bc]
   RA=0xc0000000018cbac, SP=0xffffffffffffbeb0, FRAME SIZE=32
```

### Frame For open()

### Disassembly Code For open()

### Frame For copen()

8 copen[../os/vncalls.c: 211, 0xc000000001cb728]
RA=0xc00000000001cb5c4, SP=0xffffffffffffbe30, FRAME SIZE=128
### Disassembly Code For kernel copen()

### **Register Aliases**

- \$pc - current user pc
- \$sp - current value of stack pointer
- \$rn - register n
- \$fn - single precision floating point register
- \$dn - double precision floating point register
- \$mmhi - most significant multiply/divide result register
- \$mmlo - least significant multiply/divide result register
- \$fcsr - floating point control and status register
- \$feir - floating point exception instruction register
- \$cause - exception cause register

(The following is correct for 32bit abi programs)

| Alias | Alternate Alias | Description |
|------------|-----------|-------------------------------------------------------|
| \$r0 | \$zero | always 0 |
| \$r1 | \$at | reserved for assembler |
| \$r2..\$r3 | \$v0..\$v1 | expression evaluations, static links, returned values |
| \$r4..\$r7 | \$a0..\$a3 | arguments |
| \$r8..\$r15 | \$t0..\$t7 | temporaries |
| \$r16..\$r23 | \$s0..\$s7 | saved across procedure calls |
| \$r24..\$r25 | \$t8..\$t9 | temporaries |
| \$r26..\$r27 | \$k0..\$k1 | reserved for kernel |
| \$r28 | \$gp | global pointer |
| \$r29 | \$sp | stack pointer |
| \$r30 | \$s8 | saved across procedure calls |
| \$r31 | \$ra | return address |

New: \$r4..\$r11 = \$a0..\$a7

## Module 9: Memory Management Overview

### **Module Overview**

This module provides an overview of the hardware and software mechanisms used to manage the system memory. Emphasis is on virtual addressing and memory paging, which are used to give users the illusion that their processes can consume all available memory or even more memory than is physically available.

### **Module Objectives**

After completing this module, you will be able to:

- Explain the characteristics of a swapping type of UNIX system.
- Describe the concepts of virtual address and memory page in IRIX systems.
- Describe the role of the TLB in memory address translation.
- Describe how virtual addresses are translated to physical memory addresses.
- Explain the concepts of demand paging and page stealing in IRIX.
- Use sar(1) to produce reports on memory, swapping, paging, and TLB activity.
- Use ps(1) to determine total size and current memory consumption of user processes.
- Use gr\_osview(1) to display dynamic memory, swapping, paging, and TLB activity.

### **Hardware Memory Review**

The hardware aspects of memory management were presented in the Hardware Overview section of this course. Please review those pages for the details. Following is a summary and diagram of the hardware memory concepts presented there.

### Origin2000 distributed-shared memory

- Memory is located in a single shared address space but is physically dispersed across system nodes.
- Earlier systems had memory centrally located and only accessible over a single shared bus.
- The interconnection fabric is a mesh of multiple point-to-point links connected by the routing switches. These links and switches allow multiple memory accesses to occur simultaneously.
- To a processor, main memory appears as a single addressable space containing many blocks or pages.
- Page migration hardware moves data into memory closer to a processor that frequently uses it to reduce memory latency.
### Origin2000 Memory Hierarchy (in order of increasing memory latency)

- Processor registers
- Cache (primary and secondary)
- Local memory
- Remote memory
- Remote caches

## Hardware Address Sequence Review Diagram

### Hardware Address Sequence Review Diagram Explanation

A CPU examines an instruction and isolates the part of it that represents the address of the page, and the offset into that page, of the data that the CPU needs. This address might be the address of an instruction to fetch, or the address of an operand of an instruction. The CPU then goes through the following steps to find that address.

1. The virtual address of the needed data is formed in the processor execution or instruction-fetch unit. Most addresses are then mapped from virtual to real through the Translation Lookaside Buffer (TLB). This step may incur a "TLB miss" if the virtual-to-physical mapping is not already in the TLB. In that case, the CPU has to switch into kernel context to determine the physical address and load it into the TLB. One way or another, at this point the TLB has a virtual-to-physical mapping for the address the process wants, and the CPU 'knows' what physical page of memory it must access.
2. Most addresses are presented to the primary instruction or primary data caches, depending on what is being addressed. These caches are in the processor chip. If a copy of the data with that address is found, it is returned immediately.
3. When the primary cache does not contain the data, the address is presented to the secondary cache, which is used to hold both data and instructions. If the secondary cache contains a copy of the data, the data is returned immediately.
4. When the secondary cache does not contain the data, the physical address reference is placed on the system bus and handed over to the HUB chip.
The HUB knows which areas of memory have been assigned to which nodes, which area of memory has been assigned as "local" to this node, and which nodes are attached to which router connections. The HUB acts as a switch, and directs the request either to this node's local memory or to whatever remote memory is appropriate.
5. When the HUB chip recognizes that local memory does not contain the data, the address passes out through the "connection fabric", that is, through router connections to other nodes on this or other hypercubes in the system, to a memory module in another node, from which the data is returned.

#### Memory Subsystem Introduction

One of the major concerns for the operating system is how it manages the finite amount of physical memory installed in the system hardware. The total amount of memory needed by all active processes on the system is constantly changing and is generally far greater than the physical memory actually available. The operating system kernel must answer questions such as:

- Where will a process reside in main memory?
- How will it prioritize which processes are the most eligible to occupy main memory?
- What scheme will be used to move processes in and out of main memory?
- How will it allocate more memory to a process as its needs grow?
- How will it free unused memory when a process wants to shrink?

### Historical Solutions to Memory Management (Swapping)

Earlier versions of UNIX used a method called swapping to manage main memory. With this method, whole processes were swapped from memory to disk to make room for other processes that needed to run, as shown below.

Historically, UNIX was a swapping system and the swapping was done by a special process called the swapper or sched (short for scheduler), which always has a process ID (PID) of 0. More recent versions of UNIX still have a swapper process or sched.
### **Recent Solution to Memory Management (Virtual Memory)**

To overcome the constraint of requiring an entire process to be resident in physical memory in order to run, UNIX System V Release 4 adopted a concept referred to as *virtual memory*. Virtual memory allows programmers to ignore the physical layout and size of machine memory. A program is written to reference virtual addresses for both instructions and data, thus relieving the programmer from concern as to where things are physically located in memory.

Some attributes of virtual memory systems are:

- Gives the illusion that there is more memory available than is physically installed on the machine.
- Can run programs that are larger than physical memory.
- A process does not have to be entirely in memory to execute.
- A translation mechanism is needed to convert virtual memory addresses to physical addresses at run time.

### **User Process Components Review**

A process is the execution of a program and consists of a pattern of bytes that the CPU interprets as machine instructions (text), data, and stack. A program executing in a process reads and writes its data and stack areas and possibly shared memory areas. Following is a more complete description of the components that comprise a process.

- Text contains the executable code (machine instructions) for a process. It is usually marked read-only so that a process cannot alter its own code or be altered by other processes. Text areas can be shared by many user processes that are concurrently executing the same code; for example, multiple users using the same shell program (sh, csh, ksh).
- Data holds the data used and modified by the process during execution. It is usually marked for reading and writing. It is never shared with other processes; otherwise, a process could alter the data area of another process.
- Stack holds the data necessary for the program to call and return from code modules called subroutines and for allocation of local data values. It is marked for reading and writing and cannot be shared with other processes.
- Shared memory is an area of memory accessible to multiple processes. One process can write data into the shared memory area and another process can read the data.

### **User Process Virtual Memory Image**

Assume that the physical memory of a system is addressable with the first byte located at byte offset 0, and that the last byte has a byte offset one less than the amount of memory on the system (in other words, the maximum physical byte memory location). Compilers (C, Fortran, C++, etc.) generate machine code that is 0-based, that is, the program is assumed to begin at byte offset 0 and consumes as many bytes as needed.

If the system were to treat the compiler-generated addresses in a user's program as address locations in physical memory, it would be impossible to execute two processes concurrently because their addresses would overlap. This is why compilers generate program addresses for a virtual address space within a given address range. The compiler assumes that every program begins at address 0 and can consume as much space as needed within the given range. The machine's memory management unit (MMU) then translates the virtual addresses generated by the compiler (0-based) into address locations in physical memory.

The compiler does not need to know (nor does it care) where in physical memory the kernel will later load the program for execution. Furthermore, several copies of the same program can coexist in memory. They all would execute using the same virtual memory addresses but would be referencing different physical addresses.
#### **User Process Virtual Addresses**

Compilers generate virtual addresses (0-based) for data and instruction references without regard to the physical page size defined on the system. This makes program code more portable from system to system. However, for purposes of this explanation of virtual addressing, assume that the machine's physical page size is defined to be 4K (4096 bytes). A process' virtual page will map onto a physical page somewhere in the system's memory (location controlled by the kernel) when the process executes.

Every byte within a user's process is addressable with a virtual address. A virtual address consists of two parts: a virtual page number and an offset within that page.

Assume physical page size = 4K bytes (4096 dec or 1000 hex)

Virtual address = (virtual page number (the VPN) + byte offset into page)

The above illustration of a machine with physical page size defined as 4K bytes shows that the 0-based addresses generated by the compiler for instruction and data references actually serve as virtual addresses also. When the program executes, the CPU will interpret the 0-based addresses as virtual addresses and map the virtual addresses to physical memory locations.

It is easy for the CPU to translate a compiler-generated (0-based) address into a virtual address. For a system with page size defined as 4K bytes (as above), the rightmost 12 bits in the address (remember 1 hex digit = 4 binary digits) are the byte offset into the page and the remaining leftmost bits comprise the virtual page number (VPN). Likewise, for a system with page sizes of 16K bytes, like the Cray Origin2000, the rightmost 14 bits in the address are interpreted as the byte offset.

### Virtual to Physical Address Translation

As a user process executes, it references instruction code and data by using the virtual addresses generated by the compiler.
These virtual addresses are transparently translated into physical addresses by a combination of hardware and software.

Every process has its own page table which provides a mapping from the process' virtual address space (virtual page numbers) to physical memory locations (physical page numbers). Using a combination of hardware and software, a process's virtual page number is looked up in its page table (whose entries are called ptes) to produce a physical page number. The physical page number is then combined with the page offset to yield a real address in physical memory.

A process's memory space does not have to be entirely resident in physical memory at once. Only the pages currently being referenced need to be memory resident. Therefore, virtual addressing allows a process' virtual address space to be larger than the machine's physical address space.

The kernel keeps track of which pages are currently in memory by maintaining a flag in each page table entry (called the valid bit). If a page is not currently in main memory then it is invalid, and the memory management system must keep information about where the page resides in secondary storage. When a process references a non-resident (invalid) page, the process must wait until the system brings the page into physical memory.

### Translation Lookaside Buffer (TLB)

The per-process page table (made up of ptes) is a structure maintained in physical memory. Each time a process references memory, the virtual page number needs to be looked up in the process's page table to locate the physical page of memory where that virtual page is mapped. However, searching the page table every time a process references memory would be very damaging to the process' performance. Therefore, if a process's page table (or a portion of it) could be stored in memory built into the CPU's chip, then virtual to physical address translations could be performed very quickly.
This type of memory is referred to as associative memory. The purpose of associative memory is to "associate" a given virtual page number to a physical page number. However, associative memory on the CPU is limited to a small area due to the lack of space for this type of memory on the chip. On MIPS CPUs, this area is called the Translation Lookaside Buffer (TLB). The number of TLB entries varies by MIPS processor type.

| Processor Type | Number of TLB Entries |
|----------------|-----------------------|
| R4x00 | 96 |
| R5000 | 96 |
| R8000 | 384 |
| R10000 | 128 |

### Translation Lookaside Buffer (TLB) (continued)

When a user's virtual address is presented to the CPU, the TLB is first checked for a match on the virtual page number. The virtual page number is presented to all TLB entries at the same time. Note that each TLB entry points to two adjacent pages within the process' address space.

### TLB "Hits" and "Misses"

Ideally, the TLB would be large enough to hold an entry to translate every possible virtual address presented to the CPU by a process during its execution. However, large TLBs are not practical and they can only hold a subset of the page table entries for a given process.

Each TLB entry can map to two adjacent pages in the user process' virtual address space. These pages do not need to be adjacent in physical memory.

When the virtual memory address requested in the CPU falls within a page described by a TLB entry, the TLB supplies the physical memory address for the desired page. The offset is then applied to locate any desired byte location in physical memory. This is referred to as a TLB hit.

When the virtual memory address requested in the CPU is not covered by any active TLB entry, the MIPS processor generates an exception (interrupt) to the kernel, which is then handled by an IRIX kernel routine. The kernel inspects the requested address.
If the address is found to be valid (in other words, it resides within the process' virtual address space), the kernel loads a TLB entry from the appropriate entry in the process' page table. The kernel then restarts the instruction, which now will find an appropriate TLB entry to perform the address translation.

### Virtual Addressing Summary

This illustration summarizes how virtual addresses within a user's process are translated to locations within physical memory.

At compile time, a user's program is compiled using 0-based addresses to locate instructions and data. These 0-based addresses are referred to as virtual addresses and consist of two parts: the virtual page number and a byte offset within the page. At execution time, when these virtual addresses are presented to the CPU for resolution, they must be translated to physical addresses in real memory.

When a process is loaded into memory to execute, the kernel establishes a page table for the process. Each virtual page within the process' virtual address space will have an entry (pte) in the page table. If a particular page currently resides in physical memory, the page table will point to where it is located in real memory. Otherwise, the virtual page is marked "invalid" in the page table.

The task of translating virtual addresses occurs in the TLB. The TLB is an on-chip associative memory limited in size. The TLB basically contains a subset of the entries in the process' page table. If the CPU finds a match on the desired virtual page in the TLB, this is considered a "TLB hit". It is quick and easy to determine the page's physical address in memory. If the CPU does not find a match on the desired virtual page in the TLB, this is a "TLB miss". The kernel must get involved to load a TLB entry with the appropriate entry from the process' page table and then re-issue the affected instruction.
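The translation path summarized above can be sketched as a small user-space model. This is an illustration under stated assumptions, not IRIX or MIPS code: the type and function names (`mmu_model`, `translate`, and so on) are hypothetical, the TLB is shrunk to two slots with round-robin replacement, and each slot maps a single page rather than the pair of adjacent pages a real MIPS TLB entry covers. The 14-bit offset corresponds to the 16K page size mentioned for the Origin2000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 14u                    /* 16K pages: 14 offset bits */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NPAGES     8u                     /* toy virtual address space */
#define TLB_SLOTS  2u                     /* deliberately tiny */

struct pte_model { uint32_t pfn; bool valid; };
struct tlb_slot  { uint32_t vpn; uint32_t pfn; bool used; };

struct mmu_model {
    struct pte_model page_table[NPAGES];  /* per-process page table */
    struct tlb_slot  tlb[TLB_SLOTS];
    unsigned next_victim;                 /* round-robin refill position */
    unsigned hits, misses;
};

/* Translate vaddr. Returns true and stores the physical address, or
 * false when the page is invalid (the case demand paging would handle). */
bool translate(struct mmu_model *m, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;   /* virtual page number */
    uint32_t off = vaddr & PAGE_MASK;     /* byte offset within the page */

    assert(m && paddr);
    for (unsigned i = 0; i < TLB_SLOTS; i++) {
        if (m->tlb[i].used && m->tlb[i].vpn == vpn) {
            m->hits++;                    /* TLB hit */
            *paddr = (m->tlb[i].pfn << PAGE_SHIFT) | off;
            return true;
        }
    }

    m->misses++;                          /* TLB miss: consult the page table */
    if (vpn >= NPAGES || !m->page_table[vpn].valid)
        return false;

    /* Refill one TLB slot from the page table entry, then translate. */
    struct tlb_slot *s = &m->tlb[m->next_victim];
    m->next_victim = (m->next_victim + 1u) % TLB_SLOTS;
    s->vpn = vpn;
    s->pfn = m->page_table[vpn].pfn;
    s->used = true;
    *paddr = (s->pfn << PAGE_SHIFT) | off;
    return true;
}
```

With virtual page 2 mapped to physical page 5, the first access to an address in page 2 counts as a miss followed by a refill, and a second access to the same page counts as a hit. Real hardware re-issues the faulting instruction after the refill; the model simply returns the translated address.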
The physical pages that correspond to a user's process can be anywhere within the user portion of system memory. When the kernel assigns physical pages of memory to a process, it need not assign the pages contiguously or in any particular order. The purpose of paged memory is to allow greater flexibility in assigning physical memory.

### **Demand Paging Overview**

When a process is created on the system, only a small amount of physical memory is initially consumed. Some memory is needed by the kernel to control and manage each process. The code (text) and data areas associated with the new process remain in the file containing the program which is being executed. Therefore, most of the page table entries for a newly created process are marked invalid.

Pages are created and allocated for a process only when they are referenced by the currently running process. This mechanism is referred to as demand paging. The entire process does not need to reside in memory in order to execute. The kernel loads pages of a process on demand when the process references the pages.

With demand paging, physical memory pages are created for only the parts of the program which actually execute. The parts of a program that never execute can remain in secondary storage. An example of a piece of a program that might not execute is error or signal handling code, which will not execute unless there is an error or a signal is delivered.

Processes tend to execute instructions in smaller portions of their entire text (instruction) space, such as in looping constructs and frequently called subroutines. Also, a process tends to reference data in small subsets or clusters of the process' total data space. Each process has a set of pages that need to be in main memory to ensure it runs efficiently. This set of pages is referred to as its working set. As a process executes, its working set changes depending on its pattern of memory references.
When a process tries to access an address that is not in the working set, a page fault occurs so that the kernel can read into memory the page containing the desired address and attach the page to the process's address space. The kernel suspends the execution of the user process until it has read the needed page into memory and attached it to the process's address space. After the page has been loaded into memory, the process re-issues the instruction it was executing when it incurred the fault.

### **Demand Paging Page Load Procedure**

The procedure for loading a page into memory is as follows:

1. The process references a memory address.
2. The CPU attempts to translate the user's virtual address in the TLB. Assume no entry in the TLB can translate the virtual address to a physical address.
3. The currently running process is suspended and a fault is generated. This is called a TLB miss and control passes to the IRIX kernel.
4. The kernel's exception handler searches the process's page table for a valid entry corresponding to the virtual address that was not found in the TLB. If one is found, the entry is placed in the TLB and the instruction is re-issued. Now when the instruction is executed, the virtual address is found in the TLB (TLB hit), translated to the physical memory address, and accessed by the CPU.
5. If the kernel's exception handler cannot find a matching entry in the page table, then the desired page is not residing in physical memory. The kernel then locates a free page in physical memory and associates it with the currently executing process by adding an entry to the process' page table.
6. The kernel then initiates an input operation to fetch the requested data from either the file system where the executable program resides or the swap device. The process then voluntarily goes to sleep, allowing other processes to run while the input operation is in progress.
After the data arrives in memory, the process is awakened and again scheduled to run.

### **Demand Paging Advantages and Disadvantages**

The IRIX kernel supports a demand paging algorithm, which means that pages of memory are swapped between main memory and a swap device. This kernel feature gives the illusion that a single user process has all of system memory available if needed.

Some advantages of demand paging systems:

- Frees processes from size limitations otherwise imposed by the amount of physical memory available on the system.
- Transparent to user programs.
- Allows more processes to fit simultaneously into main memory as compared to a swapping system.

Some disadvantages of demand paging systems:

- Processes must wait for a page while it is being loaded.
- During its initial stages, a process will usually generate many page faults, which leads to slower startup times and many disk operations.

### **Page Stealing**

Kernel's page stealer process (vhand): the kernel needs to load a page from secondary storage but no physical memory is available.

Eventually the operating system will need to bring into memory a page from secondary storage (swap device or executable program file) but all physical memory pages are in use. The kernel handles this situation by a mechanism called *page stealing*.

The kernel has a process called the *page stealer* (or *vhand*) that swaps out memory pages to the swap device. The kernel creates the page stealer during system initialization and invokes it throughout the lifetime of the system whenever the system is low on free pages; in other words, whenever the number of free pages falls below a configurable threshold (called the low-water mark). The page stealer examines pages that are already allocated to a process and steals some of them so that they can be used by other processes.
It keeps stealing pages from processes until the number of free pages reaches a configurable threshold (called the high-water mark). Note that the low-water and high-water thresholds need to be set appropriately in order to reduce how often the page stealer needs to execute. Otherwise, the page stealer process can get into a thrashing situation where it is called very frequently with little work to do, which degrades system performance.

### **Page Stealing Page Selection**

The page stealer has to decide which pages are the best candidates to steal. The best candidates are those pages whose next reference will be the farthest into the future. Since that is very difficult to predict, the most common method used in UNIX System V systems is called *Not Recently Used* (NRU).

With the Not Recently Used method, every page has a modified bit and a referenced bit in its page table entry. When a page is referenced, its referenced bit is set. Likewise, when a page is modified, its modified bit is set. When the page stealer needs to steal some pages, it does so in the following order:

- First, it selects those pages which have not been referenced for "a long time". Pages that are not included within a process's working set are ideal candidates.
- If that does not yield enough pages, then it selects those which have been referenced but not yet modified.
- If that still does not yield enough pages, then it selects those that have been referenced and modified.

Eventually, most of the pages will have their referenced bit set, so the page stealer makes a sweep through memory clearing the referenced bit of every page in memory.

### **Page Stealing Page Actions**

After selecting which pages will be stolen, the following actions are taken:

- If page has not been modified: Page is simply placed back on the list of free pages.
These pages are invalidated in their respective page tables and cleared from any TLB entries pointing to them. Future access to these pages will require reloading from the swap device or secondary file storage. However, stolen pages are added to the back of the kernel's free list of pages so that they can be quickly reclaimed by the original owning process.
- If page has been modified (dirty page): Page is first written to the swap device before being placed on the list of free pages.

Note that under heavy load conditions, it is possible for a process to have less than its working set of pages available in main memory. This condition can lead to excessive kernel paging because immediately after a page is stolen from a process, the process may need it to be paged back in. This thrashing situation may be going on in every active process on the system, causing excess system overhead because much of the system's resources are devoted to paging processes in and out instead of getting user work done.

If a system is thrashing, entire processes may have their pages stolen and written out to the swap device. In some cases, large processes will be killed to free up memory (see "The Swapper Process in IRIX", below). If a system is constantly swapping processes in and out, this may be evidence that the system does not have enough installed physical memory.

### **Page Stealing and Job Classes**

### **Job Classes**

Page stealing priority (highest priority first):

- Real time: priority driven
- Batch critical: Miser
- Time share: earnings driven
- Batch opportunistic: Miser
- Weightless: idle driven

Work is managed in an IRIX system within job classifications. When the page stealer needs to release memory pages, it applies its page selection criteria (discussed above) to processes representing jobs in the lowest classification first, and then works its way up through the above ordered list.
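The class-ordered scan described above, combined with the Not Recently Used selection order, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not IRIX kernel code: the `Page` record and the functions `nru_rank` and `select_victims` are hypothetical names introduced only for this example, and a real page stealer tracks far more state.

```python
from dataclasses import dataclass

# Job classes ordered from lowest to highest priority, as listed in the text.
JOB_CLASSES = ["weightless", "batch opportunistic", "time share",
               "batch critical", "real time"]


@dataclass
class Page:
    owner_class: str   # job class of the process that owns the page
    referenced: bool   # referenced bit from the page table entry
    modified: bool     # modified (dirty) bit from the page table entry


def nru_rank(page):
    """Lower rank means a better candidate to steal (NRU order)."""
    if not page.referenced:
        return 0       # not referenced recently
    if not page.modified:
        return 1       # referenced but not yet modified
    return 2           # referenced and modified


def select_victims(pages, needed):
    """Pick up to `needed` pages: lowest job class first, NRU order within a class."""
    class_rank = {name: i for i, name in enumerate(JOB_CLASSES)}
    ordered = sorted(pages, key=lambda p: (class_rank[p.owner_class], nru_rank(p)))
    return ordered[:needed]
```

Sorting on the pair (job class, NRU rank) is one simple way to express "exhaust the lowest class before moving up"; it is a modelling choice for the sketch, not a statement about how vhand is implemented.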
The IRIX job classifications are listed above, with brief explanations below:

- Weightless: A job that is about to go idle is placed in this class.
- Batch opportunistic: Batch requests submitted to Miser are placed in this class and are specified with CPU time requirements. The job completes when there is opportunity. If the job cannot complete in the specified time interval, it is moved to the Batch critical class.
- Time share: Typical interactive IRIX processes are placed in this class.
- Batch critical: Batch requests are submitted to Miser with a specified CPU time limit and placed into the Batch opportunistic class. If a job cannot complete in the specified interval of time, it moves to this class with higher priority.
- Real time: Highest priority jobs in the system. Guaranteed a specified amount of CPU attention in a specified interval.

### Page Cache in IRIX

When IRIX steals a page from a process and that page has been modified by the process, it cannot simply be released but must be written to the swap device. IRIX implements an intermediate staging area for those pages which are being moved to the swap device. This area is called the page cache and resides within the kernel's buffer cache. The kernel normally uses the buffer cache as an intermediate staging area for I/O operations.

Modified stolen pages are temporarily staged in the page cache before being written to the swap device. The kernel assumes the overhead of writing the stolen pages to the swap device later, when they age out of the buffer cache. The page cache gives the page stealer greater performance because it does not have to wait for completion of physical I/O to the swap device. Also, if a process re-accesses the stolen page while it still resides in the page cache and before it has actually been written to the swap device, it can be reclaimed from the page cache simply and efficiently.
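The staging behaviour described above can be illustrated with a toy model. The sketch below is a deliberately simplified assumption, not the IRIX implementation: the class `PageCache` and its methods `steal`, `age_out`, and `fault` are hypothetical names, and the "swap device" is just a dictionary.

```python
class PageCache:
    """Toy staging area for modified pages on their way to the swap device."""

    def __init__(self):
        self.staged = {}   # page id -> contents waiting to be written out
        self.swap = {}     # page id -> contents already on the swap device

    def steal(self, page_id, data):
        # The page stealer only stages the dirty page; it does not wait for I/O.
        self.staged[page_id] = data

    def age_out(self):
        # Later, staged pages age out and are written to the swap device.
        self.swap.update(self.staged)
        self.staged.clear()

    def fault(self, page_id):
        # A page still in the staging area is reclaimed without any disk I/O.
        if page_id in self.staged:
            return self.staged.pop(page_id), "reclaimed from page cache"
        return self.swap[page_id], "read from swap"
```

In this model a page that is touched again before `age_out` runs never reaches the swap dictionary at all, which mirrors the cheap reclaim path the text describes.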
### **User Process Space and Swapping**

A process's memory space is partitioned into several regions. The user is able to modify the memory in some of these regions but not others.

The text region has read-only status and is shareable with other processes. Text regions cannot be modified. This means that if the page stealer steals a text page from a process and that page is needed again, it can simply be reloaded from the executable file image (because the text page has not been modified). This also means that the page stealer can simply release the text page; no write to a swap device is necessary.

Conversely, the data and stack regions have read/write status and can be modified. The IRIX kernel allocates pages for the data and stack regions as needed by the process. The pages allocated for the data and stack areas are not associated with the executable file from which the user program was loaded. Therefore, these pages are referred to as anonymous memory pages associated with the process. If the page stealer needs to steal an anonymous page, then it must find a location on secondary storage where it can temporarily store the page for later recall, if accessed. The swap device(s) serve this role for the page stealing operation.

### **Swap Space Management**

As explained in the previous section, only anonymous memory pages associated with user processes need to be swapped by the kernel's page stealing operation. The IRIX kernel therefore needs to maintain a mapping between anonymous pages and the swap space. Each anonymous page is mapped to a page-sized block of swap space.

IRIX must have at least one disk partition or file allocated for swap space. Additional swap areas can be added or removed while the system is running.
When a swap area is added by the system administrator, the number of pages that can be stored in that swap area is calculated. The kernel then adds a swap info structure to its swap table to keep track of the new swap area.

When new anonymous memory pages are generated by user processes, the kernel's swap management routines spread the anonymous pages across all swap areas to maintain performance. If some swap areas are full, all swap areas will be searched until free space is found.

### The Swapper Process

As stated at the beginning of this section, earlier versions of UNIX used a method called *swapping* to manage main memory. With this method, whole processes were swapped from memory to disk to make room for other processes that needed to run.

Historically, UNIX was a swapping system and the swapping was done by a special process called the *swapper* or sched (short for scheduler), which always has PID 0. More recent versions of UNIX still have a swapper process or sched.

Demand paging systems still need a swapper process because there can be times when demands for memory are so high that the page stealer cannot maintain a large enough list of free pages.

When free memory falls below a specified level, the kernel's swapper or sched process is invoked. The swapper process calls a kernel function to select a process to swap out to the swap device. All of the memory pages associated with the selected process are then freed. A flag is cleared in the selected process's proc table entry, indicating it is no longer eligible to run.

At a later time, the kernel's swapper or sched process is invoked again. If the amount of free memory is above a specified level, a kernel function is called to select a process to swap back into memory (now residing on the swap device).
A flag is set in the selected process's proc table entry, indicating it is now eligible to run again. As this process receives CPU attention, its memory pages will be faulted back into main memory when they are accessed, by the demand paging mechanism described earlier in this section.

### The Swapper Process in IRIX

The implementation of sched, or the swapper process, in IRIX systems differs from that of a typical UNIX system. The swapper process is implemented such that it never swaps whole processes to and from a swap device.

Currently, if memory is oversubscribed to an extent where the page stealer cannot keep up with the demand, then IRIX will begin removing processes from the system. The largest processes are the first candidates.

### The Swapper Process Relationship to Other Processes

The swapper or sched process is a special kernel process which serves as the origin (or great-great-... grandparent) of all processes on the system. For instance, the swapper generates the init process, which is responsible for initiating all of the major daemons that run on the system. sched always has a process ID (PID) of zero because it is always the first process created on the system.

The diagram to the left shows how the swapper or sched is related to all other processes on the system.

### Reporting Paging Activity (sar -p)

The sar(1) command (with option -p) will report paging activity for the host system.
```
$ sar -p

IRIX64 flurry 6.5-ALPHA-1274427934 02121253 IP27    02/27/98

          vflt/s dfill/s cache/s pgswp/s pgfil/s  pflt/s  cpyw/s steal/s  rclm/s
```

The sar(1) output data columns have the following interpretation (/s means per second):

| Column header | Interpretation |
|---------------|----------------|
| vflt/s | Address translation page faults (valid page not in memory) |
| dfill/s | Address translation fault on demand fill or demand zero page |
| cache/s | Address translation fault page reclaimed from page cache |
| pgswp/s | Address translation fault page reclaimed from swap space |
| pgfil/s | Address translation fault page reclaimed from file system |
| pflt/s | (Hardware) Protection faults including illegal access to page and writes to (software) writable pages |
| cpyw/s | Protection fault on shared copy-on-write page |
| steal/s | Protection fault on unshared writable page |
| rclm/s | Pages reclaimed by paging daemon |

### Reporting System Swapping and Switching Activity (sar -w)

The sar(1) command (with option -w) will report system swapping and switching activity for the host system.

\$ sar -w

IRIX64 flurry 6.5-ALPHA-1274427934 02121253 IP27 03/01/98

| 01:37:05 | swpin/s | bswin/s | swpot/s | bswot/s | pswot/s | pswch/s | kswch/s |
|----------|---------|---------|---------|---------|---------|---------|---------|
| 01:37:05 | unix restarts | | | | | | |
| 01:40:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 181 | 632 |
| 01:50:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 20 | 557 |
| 02:00:07 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 30 | 599 |
| 02:10:07 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 822 | 575 |
| 02:20:07 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 24 | 566 |
| 02:30:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 23 | 563 |
| 02:40:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 30 | 564 |
| 02:50:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 22 | 560 |
| 03:00:07 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 21 | 558 |
| 03:10:06 | 0.00 | 0.0 | 0.00 | 0.8 | 0.00 | 20 | 558 |
| 03:20:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 22 | 560 |
| 03:30:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 25 | 563 |
| 03:40:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 21 | 559 |
| (abbreviated) | | | | | | | |
| 07:20:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 456 | 601 |
| 07:30:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 694 | 699 |
| 07:40:07 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 634 | 772 |
| 07:50:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 624 | 813 |
| 08:00:07 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 607 | 747 |
| 08:10:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 824 | 850 |
| 08:20:06 | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 1199 | 1002 |
| Average | 0.00 | 0.0 | 0.00 | 0.0 | 0.00 | 187 | 610 |

The sar(1) output data columns have the following interpretation (/s means per second):

| Column header | Interpretation |
|---------------|----------------|
| swpin/s | Pages swapped in |
| bswin/s | Number of 512-byte units swapped in |
| swpot/s | Pages swapped out |
| bswot/s | Number of 512-byte units swapped out |
| pswot/s | Processes swapped out |
| pswch/s | Processes switched |
| kswch/s | Kernel switches |

### Reporting TLB Activity (sar -t)

The sar(1) command (with option -t) will report TLB activity for the host system.
```
$ sar -t

IRIX64 flurry 6.5-ALPHA-1274427935 02241507 IP27    03/02/98

08:50:07  tflt/s  rflt/s vmwrp/s  sync/s  flush/s idwrp/s idget/s  idprg/s vmprg/s
09:00:06    0.00    0.20    0.00    2.76   352.82    0.00  220.70   946.84    0.00
09:10:06    0.00    3.45    0.00   51.19  6552.07    0.00  464.64  1854.08    2.55
09:20:06    0.00    0.93    0.00   10.94  1400.16    0.00   474.8  2057.31    0.02
09:30:06    0.00    0.98    0.00   38.76  4960.34    0.00  51.125  2124.62    1.68
09:40:06    0.00    0.73    0.00    9.98  1277.49    0.00   363.0  1580.55    0.00
09:50:06    0.00    0.59    0.00    8.12  1039.60    0.00  351.25  1432.23    0.07
10:00:06    0.00    0.73    0.00    9.98  1271.49    0.00   363.0  1580.55    0.00
09:50:06    0.00    0.73    0.00    9.98  1271.49    0.00   363.0  1580.55    0.00
09:50:06    0.00    0.73    0.00    9.02  1155.18    0.00   367.4  1476.70    0.04
Average     0.00    1.37    0.00   18.69  2392.05    0.00  392.83  1639.04    0.62
```

The sar(1) output data columns have the following interpretation (/s means per second):

| Column header | Interpretation |
|---------------|----------------|
| tflt/s | User page table or kernel virtual address translation faults: address translation not resident in TLB |
| rflt/s | Page reference faults (valid page in memory, but hardware valid bit disabled to emulate hardware reference bit) |
| sync/s | TLB flushes on all processors |
| vmwrp/s | Syncs caused by clean (with respect to TLB) kernel virtual memory depletion |
| flush/s | Single processor TLB flushes |
| idwrp/s | Flushes because TLB ids have been depleted |
| idget/s | New TLB ids issued |
| idprg/s | TLB ids purged from process |
| vmprg/s | Individual TLB entries purged |

### Process Size (ps -l)

The ps(1) command (specifying the -l option) will display the total size of individual processes as well as the amount of main memory currently being consumed by those processes. The example below shows a typical ps display.
Sizes are listed in units of pages of memory.

The SZ and RSS columns of the ps(1) display are explained on the ps(1) man page and reproduced below:

- SZ: Total size (in pages) of the process, including code, data, shared memory, mapped files, shared libraries and stack. Pages associated with mapped devices are not counted. (Refer to sysconf(1) or sysconf(3C) for information on determining the page size.)
- RSS: Total resident size (in pages) of the process. This includes only those pages of the process that are physically resident in memory. Mapped devices (such as graphics) are not included. Shared memory (shmget(2)) and the shared parts of a forked child (code, shared objects, and files mapped MAP\_SHARED) have the number of pages prorated by the number of processes sharing the page. Two independent processes that use the same shared objects and/or the same code each count all valid resident pages as part of their own resident size.

The page size can be either 4096 or 16384 bytes, as determined by the return value of the getpagesize(2) system call.

### Reporting Memory Statistics (sar -R)

The sar(1) command (with option -R) will report memory statistics for the host system.
\$ sar -R

IRIX64 flurry 6.5-ALPHA-1274427934 02121253 IP27 02/27/98

| 05:55:50 | physmem | kernel | user | fsctl | fsdelwr | fsdata | freedat | empty |
|----------|---------|--------|------|-------|---------|--------|---------|-------|
| 05:55:50 | unix restarts | | | | | | | |
| (abbreviated) | | | | | | | | |
| 14:00:07 | 2342912 | 52900 | 187836 | 2342 | 550 | 465263 | 262795 | 1371226 |
| 14:22:30 | unix restarts | | | | | | | |
| Average | 2342912 | 43246 | 70195 | 1904 | 925 | 233329 | 38919 | 1954394 |

The sar(1) output data columns have the following interpretation:

| Column header | Interpretation |
|---------------|----------------|
| physmem | Physical pages of memory on system |
| kernel | Pages in use by the kernel |
| user | Pages in use by user programs |
| fsctl | Pages in use by file system to control buffers |
| fsdelwr | Pages in use by file system for delayed-write buffers |
| fsdata | Pages in use by file system for read-only data buffers |
| freedat | Pages of free memory that may be reclaimable |
| empty | Pages of free memory that are empty |

### Reporting Unused Memory Pages and Disk Blocks (sar -r)

The sar(1) command (with option -r) will report unused memory pages and disk blocks for the host system.

\$ sar -r

IRIX64 flurry 6.5-ALPHA-1274427935 02241507 IP27 03/02/98

| 08:50:07 | freemem | freeswap | vswap |
|----------|---------|----------|-------|
| 09:00:06 | 2270596 | 27611520 | 3083933 |
| 09:10:06 | 222598 | 27611520 | 3085163 |
| 09:20:06 | 2234598 | 27611520 | 3085163 |
| 09:20:06 | 2234598 | 27611520 | 3085475 |
| 09:30:06 | 2218342 | 27611520 | 3081868 |
| 09:40:06 | 2167508 | 27611520 | 2890365 |
| 09:50:06 | 2071240 | 27611520 | 2890365 |
| 09:50:06 | 2071240 | 27611520 | 2895279 |
| 10:10:07 | 2059354 | 27611520 | 2905279 |
| 10:10:07 | 2059354 | 27611520 | 1335055 |
| 10:20:06 | 1878832 | 27611520 | 1335474 |
| 10:40:06 | 1832440 | 27611520 | 1335474 |
| 10:40:06 | 1832440 | 27611520 | 1395596 |
| 10:50:06 | 1805964 | 27611520 | 1301303 |
| Average | 2097120 | 27611520 | 2299571 |

The sar(1) output data columns have the following interpretation:

| Column header | Interpretation |
|---------------|----------------|
| freemem | Average pages available to user processes |
| freeswap | Disk blocks available for process swapping |
| vswap | Virtual pages available to user processes |

### Reporting Memory Activity (gr\_osview(1))

The gr\_osview(1) command will produce a graphical display of memory management activity including memory usage, page faults, TLB activity, and page swapping.
An example of a gr\_osview(1) display is shown below for a 128-CPU Origin2000 system with the user's .grosview file set to (see gr\_osview(1) man page for details):

cpu(sum) strip creepscale rmem strip creepscale interval(2) fault strip creepscale colors(1,2,3,4,5,6,72,92) tlb strip creepscale pswap strip creepscale swp strip creepscale nettcp strip creepscale

System: Origin2000 system (128 CPUs)

Module 10: UNIX Filesystem Overview

# **UNIX Filesystem Overview**

### Unit covers:

- Generic layout of a UNIX filesystem
- Layout of a UNIX System V filesystem
- Layout of an IRIX EFS filesystem (TBD)
- Layout of an IRIX XFS filesystem (TBD)

### Sample UNIX FileSystem

### Generic UNIX FileSystem

- Hierarchy of regular and directory files
- File composed of inode and data
- Inode
  - File type
  - Access permissions
  - Pointers to where file's data "lives" on disk, called here extents (exts)
- Directory files
  - Inode type directory (d)
  - File data (contents) lists file names and corresponding inode numbers (locations on disk)
    - File name "." (dot) references inode of "self"
    - File name ".." (dot-dot) references inode of parent directory (except root, see below)
- Data files
  - Inode type regular file (f)
  - Data appears as sequential or random list of file characters (bytes)
- Root directory "/"
  - Topmost directory in filesystem
  - Parent directory is itself
- Mount point
  - Several filesystems may be "joined" together to form (the perception of) a single filesystem
  - Root directory of one filesystem is mounted to (associated with) an otherwise empty directory in another (previously mounted) filesystem

**UNIX System V filesystem**

### Small UNIX file sample

### Small UNIX file

- Filesystem made up of blocks 0 through n (filesystem relative block number)
  - In sample, each block is 1024 bytes (256 four-byte words)
- Inode contains block number (blkno) array
  - First 10 array items point directly to a file's data blocks
  - Array indices 11-13 are used for indirect descriptors (see next subject)
- Sample file
  - Composed of four filesystem blocks, in order 5, 16, 4, 21
  - Each f.s. block holds 1024 bytes of user data
  - User sees data as a stream of (up to) 4096 characters; file size in inode
  - For example, f.s. blkno 21 is user block 4 (or character positions 3072-4095)

### Large UNIX file sample

### Large UNIX file

- Single indirect
  - When the file grows beyond what fits in ten direct block descriptors, the file continues to grow using single (level) indirect
  - Inode array index 11's block is a block (full of) data block descriptors, each pointing to a data block
  - Sample
    - Inode element 11 points to f.s. block 23
    - F.S. block 23 holds 256 block descriptors, each pointing to a 1024-byte block
    - File capacity is 10 + 256 = 266 blocks
- Double indirect
  - When the file grows beyond what fits in the single indirect block capacity, the file continues to grow using double indirect
  - Inode array index 12's block is a block (full of) block descriptors, each pointing to another block of block descriptors, each pointing to a data block
  - Sample
    - Inode element 12 points to f.s. block 32
    - F.S. block 32 holds 256 block descriptors, each pointing to a 256-word block
    - Each of those 256 (possible) blocks holds 256 descriptors pointing to data blocks
    - File capacity is 10 + 256 + 256\*256 = 65802 blocks
- Triple indirect
  - In the extreme case when the file grows beyond what fits in the double indirect block capacity, the file continues to grow using triple indirect
  - Inode array index 13's block is a block (full of) block descriptors, each pointing to another block of block descriptors, each pointing to another block of block descriptors, each pointing to a 1024-byte data block
  - Sample
    - Inode element 13 points to f.s. block 34
    - F.S. block 34 holds 256 block descriptors
    - Those 256 descriptors point to 256 blocks of block descriptors
    - Each of the resulting 256\*256 descriptors points to another block of 256 block descriptors
    - Each of the resulting 256\*256\*256 descriptors points to a 1024-byte data block
    - File capacity is 10 + 256 + 256\*256 + 256\*256\*256 = 16,843,018 blocks
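The capacity figures quoted for the direct, single, double, and triple indirect cases can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, it uses the sample's 1024-byte blocks and four-byte descriptors, and it numbers logical blocks from 0 (the text numbers user blocks from 1).

```python
BLOCK_SIZE = 1024        # bytes per filesystem block (from the sample)
DESCRIPTOR_SIZE = 4      # bytes per block descriptor (one four-byte word)
DIRECT = 10              # direct block descriptors held in the inode


def capacity_blocks(levels):
    """Data blocks addressable using the direct slots plus `levels` indirect levels."""
    per_block = BLOCK_SIZE // DESCRIPTOR_SIZE   # 256 descriptors per block
    return DIRECT + sum(per_block ** k for k in range(1, levels + 1))


def indirection_level(block_index):
    """Number of indirect hops (0 to 3) needed to reach 0-based logical block `block_index`."""
    for level in range(4):
        if block_index < capacity_blocks(level):
            return level
    raise ValueError("block index beyond triple indirect capacity")


print(capacity_blocks(1))   # 266
print(capacity_blocks(2))   # 65802
print(capacity_blocks(3))   # 16843018
```

For instance, `indirection_level(9)` is 0 (the last direct block), while `indirection_level(10)` is 1, because the eleventh logical block is the first one reached through the single indirect block.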
Module 11: XFS Filesystem - Structure

# The Extent Filesystem (EFS)

#### Limitations:

- Filesystem max: 8GB
- File max: 2GB
- Number of files fixed at mkfs time
- Less than full hardware bandwidth
- Slow crash recovery (minutes)
- No support of sparsely-allocated files
- Slow performance for large files
  - linear bitmap structures for tracking free space made finding contiguous space slow
  - linear searching of large directories

# xFS: the extension of EFS

#### The main ideas:

- "x" for to-be-determined (but the name stuck)
- large filesystems
- large files
- large number of inodes
- large directories
- large I/O
- parallel access to inodes
- B-tree (btree) algorithms for searching large lists
- asynchronous metadata transaction logging for quick recovery
- delayed allocation to improve data contiguity
- ACLs: Access Control Lists (see chacl(1), acl(4), acl\_get\_file(3c), acl\_set\_file(3c))

# A New XFS Filesystem:

- 3 "allocation groups"
  - AGs are used to keep the size of freespace and inode management data structures manageable
  - These structures use AG-relative block and inode pointers
  - Directories are round-robined through the AGs
  - Files cluster around their directory, but are not limited to a single AG
  - Free space and inode structures in memory are locked per-AG, so parallelism can be achieved filesystem wide
- built with a journaling log within the data fork (i.e.
not an XLV logical volume with "data", "realtime", and "log" forks)
- mkfs -b size=blocksize \
  -d name=special-file,agcount=3 \ /\* "data" section: device to build on, count of allocation groups \*/
  -i size=256,maxpct=25 \ /\* "inode" section: inode size, percent of fs for inodes \*/
  -l internal=1,size=512b /\* "log" section: log is internal with inodes/data, size in blocks \*/
- example on a regular file:

```
mkfs -b size=4096 \
     -d name=/tmp/cpw.fs,file=1,size=145133568,agcount=5 \
     -i size=256,maxpct=25,align=1 \
     -l internal=1,size=512b

meta-data=/tmp/cpw.fs     isize=256    agcount=5, agsize=7087 blks
data     =                bsize=4096   blocks=35433, imaxpct=25
         =                sunit=0      swidth=0 blks
log      =internal log    bsize=4096   blocks=512
realtime =none            extsz=65536  blocks=0, rtextents=0
```

[Figure: layout of the start of an allocation group, showing the superblock (512 bytes), agf, agi, and agfl; in xfs\_db the superblock is displayed with `sb [n]` followed by `print`]

- superblock: identical one in each allocation group; describes basic characteristics of the filesystem and location of some key components
- agf: allocation group free space; points to the structures for locating free space on the filesystem
- agi: allocation group inodes; points to the structures for locating inodes in the allocation group
- agfl: used by XFS internally

# Superblock:

```
sb_magicnum     magic number: 'XFSB'
sb_blocksize    block size (bytes)
sb_dblocks      number of data blocks
sb_logstart     starting block of internal log
sb_rootino      root inode number
sb_agblocks     size of allocation group
sb_agcount      number of allocation groups
sb_logblocks    number of log blocks
sb_inodesize    inode size (bytes)
sb_inopblock    inodes per block
sb_ifree        free inodes
sb_fdblocks     free data blocks
...
```

Structure type: xfs\_sb\_t. In xfs\_db: `sb [n]`, then `print`.

- xfs\_db "man page"
- xfs\_db example:

```
xfs_db: sb
xfs_db: print
magicnum = 0x58465342
blocksize = 4096
dblocks = 35433
logstart = 16388
rootino = 128
...
agblocks = 7087
agcount = 5
...
logblocks = 512
...
inodesize = 256
inopblock = 16
...
imax_pct = 25
icount = 64
ifree = 61
fdblocks = 34897
```

# **AGFL - Allocation Group Free List:**

- The Allocation Group Free List is only used internally by XFS to control agf btree blocks

Located in the 4th 512-byte block of each allocation group, the agfl freelist for internal btree space allocation is maintained for each allocation group.
This acts as a reserved pool of space separate from the general filesystem free space (not used for user data)

- xfs\_db "man page"
- xfs\_db example:

# **AGI: Inode Btree Control:**

```
agi_magicnum      magic number: 'XAGI'
agi_seqno         sequence number of allocation group; starting from 0
agi_count         number of allocated inodes
agi_root          block number of root of inode btree
agi_level         levels in inode btree
agi_freecount     number of free inodes
agi_newino        new inode - just allocated
agi_dirino        last directory inode chunk
agi_unlinked[64]  hash table of unlinked inodes (but still being referenced)
```

- structure: xfs\_agi\_t
- display with xfs\_db: > agi [n] followed by > print
- xfs\_db "man page"
- xfs\_db example: [empty filesystem]
- XFS dynamically allocates inodes as needed
- XFS has replaced free space bitmaps with B-trees
- each entry in the inode btree block represents a chunk of 64 inodes (the btree pictured has only one level)
- xfs\_db "man page"
- xfs\_db example:

```
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents) /* this inode will have extents -- currently there are none */
...
```

# **On-disk Inode:**

```
u.sfdir.list[3].inumber = 8388838
u.sfdir.list[3].namelen = 12
u.sfdir.list[3].name = "libmalloc.so"
u.sfdir.list[4].inumber = 8609299
u.sfdir.list[4].namelen = 3
u.sfdir.list[4].name = "cpp"
```

# 1-block Directory:

# **Btree Directory:**

- pictured above is a 1-level btree structure for a directory
- the first extent is not directory data, but a table that tells which extent contains a given key
  - any link with hash value up to the first entry would be found in extent[1] of the directory
  - any link with hash value up to the second entry would be found in extent[2] of the directory
- each "leaf" of the directory contains a list of hash values, each associated with an index to the full link name within the block
  - there may be duplicate hash keys

## **Btree Directory - Index Block:**

- The above diagram provides a little more detail of the "xfs\_da\_intnode\_t" block
- The "before" field is the index into the inode extents

### **Attribute Fork Inside Inode:**

- structure: xfs\_dinode\_t
- display with xfs\_db: > inode [ino] followed by > print
- A small set of "attributes" may be stored entirely within an inode
- A larger set will be stored separately, as shown below

#### **Attributes Block:**

- Access Control Lists (ACLs) are stored as attributes
- DMF bitmapped file id's are stored as attributes
  - it is highly recommended that filesystems to be DMF-managed be made with 512-byte inodes so that the DMF bfid's can be stored within the inodes
- The attributes of a file may be anything that its owner wishes to store with it
- example:

```
$ attr -s character_set -V kanji filex
Attribute "character_set" set to a 5 byte value for filex:
kanji
$ attr -s revision -V 5.1 filex
Attribute "revision" set to a 3 byte value for
filex:
5.1
$ attr -l filex
Attribute "character_set" has a 5 byte value for filex
Attribute "revision" has a 3 byte value for filex
$ attr -g character_set filex
Attribute "character_set" had a 5 byte value for filex:
kanji
```

### Data Fork - Btree

- above is a file with a 2-level btree of extents
- the inode points to blocks of indirect pointers
  - the indirect blocks point to the leaf blocks containing the actual extent descriptions
- reference: reading an inode

```
xfs_iget()
    xfs_iread()
        xfs_iformat
            xfs_iformat_btree    gets only the root of the btree into memory (attached to if_broot)
```

- reference: reading the entire extent list

```
xfs_iread_extents()
    xfs_bmap_read_extents    reads all the extents (and attaches them to if_extents)
```

## **Journaling Log**

- Examining all the filesystem metadata to reconstruct it after a crash would take too long
  - large filesystems
  - inodes not in a fixed location
- XFS does write-ahead logging of all structural updates to filesystem metadata
  - inodes
  - directory blocks
  - free extent tree blocks
  - inode allocation tree blocks
  - file extent map blocks
  - AG header blocks
  - the superblock
- log entries must be written to disk before the metadata itself reaches disk
- the log is circular, with a tail chasing a head
- each record has a Log Sequence Number (lsn)

[Figure: XFS Log]

- The life cycle of a transaction:
- 1. allocate a transaction structure in memory and assign it a unique id (tid)
- modify the metadata resource and remember what was modified: call xfs\_trans\_log\_buf(xfs\_trans\_t \*tp, buf\_t \*bp, uint first, uint last) to remember specified bytes of a buffer
- 5.
commit the transaction: call xfs\_trans\_commit(xfs\_trans\_t \*tp, uint flags, xfs\_lsn\_t \*commit\_lsn\_p)
  - record the transaction (and all its modified metadata) in the in-core log (log records are collected in one of 2 or more circular buffers)
  - "pin" the resources in memory until the transaction reaches disk (see buf\_t field b\_pincount)
  - unlock the resources
  - when the buffer is full, it is written as a large, sequential write
- 6. write of the transaction to disk completes: handler was specified by call to xfs\_trans\_callback()
  - unpin the resources
  - free the transaction structure
  - modified resources associated with the transaction are placed in an Active Items List (AIL) (there is one AIL per filesystem)
  - a metadata item stays "active" until it reaches disk
- 7. modified resource (metadata) is flushed to disk (because of reuse, log buffer becomes nearly full, or some cleaning daemon pushes it out)
  - remove the item from the AIL (the log images are no longer needed)
- a log entry consists of a header describing the metadata image that follows, and a copy of that new image (see structure xfs\_buf\_log\_item)
- recovery consists of replaying the log
- xfs\_repair(1M) exists for correcting the results of errors that corrupt random blocks in the filesystem
- xfs\_logprint(1M) exists for displaying the contents of a log
- example: change an inode with chmod(2)

```
set up a vattr structure with user's argument as mode and AT_MODE as mask
namesetattr(uap->fname, FOLLOW, &vattr, 0)
    vp=lookupname(fnamep,UIO_USERSPACE,followlink,NULLVPP,&vp,NULL)
    VOP_SETATTR(vp, vap, flags, get_current_cred(), error)
        xfs_setattr(bhv_desc_t *bdp, vattr_t *vap, int flags,cred_t *credp)
```

```
each item: bcopy to m_log->log->buffers
xfs_trans_unlock_items(tp) /* free the "chunks" of each item */
xfs_iunlock(ip, lock_flags) /* unlock the inode */
```

#### structures for the above
example

/\* Transaction types. Used to distinguish types of buffers. \*/

- code references for write to log: xfs\_log\_write->xlog\_state\_release\_iclog->xlog\_sync->bwrite

#### Sequence for replaying the log when the filesystem is mounted:

- Each log record has a header of one sector (xlog\_rec\_header\_t)

#### I/O Performance

The main ideas behind XFS's increase in I/O performance are:

- allocate large (contiguous) extents for files
  - writes to buffer cache reserve "virtual extents"; real blocks are not assigned until buffers are flushed to disk
  - short-lived files may never be allocated; their metadata updates are reduced
  - randomly-written files (with no holes) will be contiguous
  - a filesystem's block size may range from 512 bytes to 64KB; large
blocks reduce fragmentation
- perform I/O in parallel
  - clustering: large minimum read buffers; combining writes of dirty buffers into large chunks
  - read ahead: multiple (2-3) read-ahead buffers
  - write behind: balanced buffering of dirty blocks and asynchronous flushing to disk
  - request parallelism: multiple processes are allowed to read/write the same file (inode lock) at the same time
    - IRIX supports asynchronous I/O by using multiple threads
  - direct I/O: avoids copying into/out of the buffer cache, and allows program control of I/O requests
    - buffer cache is kept coherent
- handle metadata efficiently
  - write-ahead transaction log makes updates fast
    - gather multiple updates into single I/O
    - write to the log asynchronously (sync. if exported via NFS)
    - modified metadata cannot be written until the log is on-disk
    - the log may be placed on a separate device (XLV "fork")
  - do searches and updates faster than linearly
  - allow parallelism
    - all resources (except the log) are independent across AG's or individual inodes
    - inodes and blocks can be allocated and freed in parallel

### xfs\_db printable block types:

- xfs\_db "man page"

| type | block contents |
|------|----------------|
| sb | superblock |
| agf | ag freespace control |
| agi | ag inode control |
| agfl | ag freelist |
| bnobt | block of freespace btree sorted by block number |
| cntbt | block of freespace btree sorted by count |
| inobt | block of inode btree |
| dir | directory leaf block |
| inode | |
| bmapbtd | block of an inode's extent btree for the data fork |
| bmapbta | block of an inode's extent btree for the attribute fork |
| attr | attribute leaf block |
| dqblk | disk quota block |
| symlink | symbolic link |
| data | hex dump of the block |

# **Mounted Filesystems**

- the inode points to the mount structure for the filesystem on which it resides
- the mount table points to a table of functions that do operations on the filesystem
- the mount table is dynamically allocated and "virtualized" i.e.
a filesystem-independent vfs\_t structure points to a filesystem-dependent structure (in the diagram, the xfs\_mount\_t)
- view the root vfs with icrash:

```
sb_rsumino = 130
sb_rextsize = 16
sb_agblocks = 122436
sb_agcount = 8
sb_rbmblocks = 0
sb_logblocks = 1000
sb_versionnum = 388
sb_sectsize = 512
sb_inodesize = 256
sb_inopblock = 16
sb_fname = ""
sb_fpack = ""
sb_blocklog = 12
sb_sectlog = 9
sb_inodelog = 8
sb_inopblog = 4
sb_agblklog = 17
sb_rextslog = 0
sb_imax_pct = 25
sb_icount = 48128
sb_ifree = 914
sb_fdblocks = 495020
sb_frextents = 0
sb_uquotino = 0
sb_pquotino = 0
sb_qflags = 0
sb_inoalignmt = 2
sb_unit = 0
sb_width = 0
}
m_sb_bp = 0xa80000000015be700
m_fsname = 0xa800000000daf720 = "/"
m_fsname_len = 2
m_dev = 1457
m_rtdev = 0
m_sgirotor = 0
m_agirotor = 0
m_agirotor = 2
m_ipinlock = 2
m_ipinlock = 2
m_ihash = 0xa8000000016b4000
m_ihashmask = 1023
```

```
m_inodes = 0xa80000172384a600
m_ilock = mutex_t {
    m_bits = 0
    m_queue = (nil)
}
m_ireclaims = 0
m_readio_log = 16
m_readio_blocks = 16
m_writeio_log = 16
m_writeio_blocks = 16
m_log = 0xa8000000015bea000
m_logbufs = -1
m_logbsize = -1
m_rsumlevels = 0
m_rsumsize = 0
m_rsumip = 0xa8000000015ed000
m_rootip = 0xa8000000015ed000
m_rootip = 0xa8000000015ece000
m_quotainfo = (nil)
m_ddevp = 0xa800000100499200
m_rtdevp = (nil)
m_dircook_elog = 8
m_blkbit_log = 15
m_blkbb_log = 3
m_agno_log = 21
m_nreadaheads = 4
m_inode_cluster_size = 8192
m_blockwask = 4095
m_blockwask = 4095
m_blockwask = 1023
m_alloc_mxr = { [0] 510 [1] 340 }
m_alloc_mnr = { [0] 255 [1] 170 }
m_bmap_dmxr = { [0] 254 [1] 254 }
```

```
m_bmap_dmnr = { [0] 127 [1] 127 }
m_inobt_mxr = { [0] 255 [1] 510 }
m_inobt_mnr = { [0] 127 [1] 255 }
m_ag_maxlevels = 3
m_bm_maxlevels = { [0] 5 [1] 3 }
m_in_maxlevels = 2
m_perag = 0xa8000000000d9cc00
m_peraglock = mrlock_t {
    mr_lbits = 4
    mr_un = union {
        mr_st = struct {
            qcount = 0
            qflags = 0
        }
        qbits = 0
    }
m_growlock = sema_t {
    s_un = union {
        s_st = struct {
            count = 1
            flags = 0
        }
        s_lock = 65536
    }
    s_queue = (nil)
}
m_rbmrotor = 0
m_fixedfsid = { [0] -2035985001 [1] -1485377611 }
```

```
m_dmevmask = 0
m_flags = 0
m_attroffset = 120
m_da_node_ents = 510
m_ialloc_inos = 64
m_ialloc_blks = 4
m_litino = 156
m_inoalign = 2
m_fflags = 0
m_reservations = xfs_trans_reservations_t {
    tr_write = 74424
    tr_itruncate = 203704
    tr_rename = 86328
    tr_link = 43008
    tr_remove = 43320
    tr_symlink = 52280
    tr_symlink = 52280
    tr_ifree = 14520
    tr_ichange = 1592
    tr_ichange = 1592
    tr_growdata = 27264
    tr_swrite = 384
    tr_addafork = 31416
    tr_writeid = 384
    tr_attrinval = 128000
    tr_attrset = 128312
    tr_attrre = 128312
    tr_clearagi = 640
}
m_maxicount = 3917920
m_inoadd = 0
m_swidth = 0
m_sinoalign = 0
```

# Module 12: XFS File Management

## File System Switch

- at boot time, the type of the root filesystem is taken from rootfstype
- for each filesystem type there are 2 central logic tables:
  - XXX\_vfsops: functions that do operations on the filesystem (mount, unmount, sync, ...)
  - XXX\_vnodeops: functions that do operations on files (open, close, read, write, ...)
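The two tables can be pictured as arrays of function pointers selected by filesystem type name. The following is a minimal sketch in C under stated assumptions, not the IRIX implementation: every demo\_ name, the "demofs" type, and the reduced set of operations are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced operation tables; the real tables have many more entries. */
typedef struct demo_vfsops {
    int (*vfs_mount)(const char *dev);   /* operations on the filesystem */
    int (*vfs_sync)(void);
} demo_vfsops_t;

typedef struct demo_vnodeops {
    long (*vop_read)(int file_id, char *buf, long len);   /* operations on files */
} demo_vnodeops_t;

/* One switch entry per filesystem type (illustrative layout only). */
typedef struct demo_vfssw {
    const char            *vsw_name;
    const demo_vfsops_t   *vsw_vfsops;
    const demo_vnodeops_t *vsw_vnodeops;
} demo_vfssw_t;

/* A toy filesystem type, "demofs", that fills read buffers with 'x'. */
static int demofs_mount(const char *dev) { return dev != NULL ? 0 : -1; }
static int demofs_sync(void) { return 0; }
static long demofs_read(int file_id, char *buf, long len)
{
    (void)file_id;
    if (len < 0)
        return -1;
    memset(buf, 'x', (size_t)len);
    return len;
}

static const demo_vfsops_t   demofs_vfsops   = { demofs_mount, demofs_sync };
static const demo_vnodeops_t demofs_vnodeops = { demofs_read };

/* Entry 0 is a placeholder with no operations. */
static const demo_vfssw_t demo_switch[] = {
    { "BADVFS", NULL, NULL },
    { "demofs", &demofs_vfsops, &demofs_vnodeops },
};

/* Find the switch entry for a filesystem type name; NULL if unknown. */
const demo_vfssw_t *demo_vfs_lookup(const char *fstype)
{
    size_t i;
    for (i = 0; i < sizeof(demo_switch) / sizeof(demo_switch[0]); i++) {
        if (strcmp(demo_switch[i].vsw_name, fstype) == 0)
            return &demo_switch[i];
    }
    return NULL;
}

/* Dispatch a read through the per-type vnode operations table. */
long demo_read(const char *fstype, int file_id, char *buf, long len)
{
    const demo_vfssw_t *sw = demo_vfs_lookup(fstype);
    if (sw == NULL || sw->vsw_vnodeops == NULL)
        return -1;
    return sw->vsw_vnodeops->vop_read(file_id, buf, len);
}
```

A caller never names the filesystem-specific function directly: demo\_read("demofs", 3, buf, 4) reaches demofs\_read() only through the table, which is the property that lets one system call path serve several filesystem types.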
- the pointer to the XXX\_vfsops is stored in the mount table whenever a filesystem is mounted
- the vfsops functions store the XXX\_vnodeops pointer in the in-memory inode when it is allocated
- displaying the vfssw with icrash:

>> fstype

| IND | NAME | INIT | VFSOPS | VNODEOPS | FLAG |
|-----|------|------|--------|----------|------|
| 0 | BADVFS | 0 | c000000001426ab0 | c000000001426b80 | 0 |
| 1 | xfs | c0000000002c5690 | c00000000143a450 | c00000000143a218 | 0 |
| 2 | umfs | c0000000002e29fc | c00000000143af10 | c00000000143ad48 | 0 |
| 3 | lofs | c0000000002eae0c | c00000000143b090 | c00000000143b108 | 0 |
| 4 | cachefs | c000000000302020 | c00000000143bd60 | c00000000143be28 | 0 |
| 5 | autofs | c00000000031c4dc | c00000000143c1e0 | c00000000143c018 | 0 |
| 6 | nfs3 | c00000000032d880 | c00000000143c510 | c00000000143c258 | 0 |
| 7 | nfs | c0000000002ec55c | c00000000143b2d0 | c00000000143b348 | 0 |
| 8 | efs | c000000000350994 | c00000000143d190 | c00000000143cfb0 | 0 |
| 9 | namefs | c000000000353408 | c00000000143d3d0 | c00000000143d208 | 0 |
| 10 | fd | c000000000354890 | c00000000143d688 | c00000000143d4c0 | 0 |
| 11 | hwgfs | c000000000356610 | c00000000143d8c8 | c00000000143d700 | 0 |
| 12 | fifofs | c000000000357388 | c00000000143d940 | c00000000143d9d8 | 0 |
| 13 | proc | c00000000035bccc | c00000000143dba0 | c00000000143dc18 | 0 |
| 14 | pipefs | c000000000365c74 | c00000000143dde0 | c00000000143de58 | 0 |
| 15 | specfs | c0000000002ded74 | c00000000143a870 | c00000000143a6a8 | 0 |

16 vfs structs found

>> findsym c00000000143a450

0xc00000000143a450 --> xfs\_vfsops

1 symbol found

>> findsym c00000000143a218

0xc00000000143a218 --> xfs\_vnodeops

1 symbol found

#### **XFS Code Architecture**

- XFS is called by POSIX-compliant system calls
  - uses buffer/page cache
  - uses directory name lookup cache
  - uses dynamic vnode cache
- Space Manager
  - manages filesystem free space
  - manages allocation of inodes
  - manages allocation of space within files
- I/O Manager
  - satisfies file I/O requests
- Directory Manager
  - implements "name space" (addressing of objects by name)
- Transaction Manager
  - updates all filesystem metadata (inodes, superblock, agf,...)
- Buffer Cache
  - frequently accessed blocks from the underlying volumes
  - integrated memory pages and all-filesystem cache
- XLV Volume Manager
  - provides disk concatenation, striping and mirroring (plexing)
- display defined logical volumes with xlv\_mgr:

```
/dev/dsk/dks14d3s7 (17777424 blks)
/dev/dsk/dks14d4s7 (177779016 blks)
/dev/dsk/dks20d3s7 (17779016 blks)
/dev/dsk/dks20d3s7 (17779016 blks)
/dev/dsk/dks22d3s7 (17779016 blks)
/dev/dsk/dks22d3s7 (17777424 blks)
/dev/dsk/dks22d4s7 (17777424 blks)
/dev/dsk/dks32d4s7 (17779232 blks)
/dev/dsk/dks36d4s7 (17769232 blks)
/dev/dsk/dks36d4s7 (17769232 blks)
/dev/dsk/dks46d4s7 (17769232 blks)
/dev/dsk/dks46d4s7 (17769232 blks)
/dev/dsk/dks54d4s7 (17769232 blks)
/dev/dsk/dks54d4s7 (17769232 blks)
/dev/dsk/dks54d4s7 (17769232 blks)
/dev/dsk/dks54d4s7 (17769232 blks)
VOL ptmp (complete) (node=flurry)
VE ptmp.data.0.0 [active] start=0, end=17769231, (cat)grp_size=1
/dev/dsk/dks12d5s7 (17769232 blks)
VE ptmp.data.0.1 [active]
Show active logical volumes with xlv_mgr:
xlv_mgr> show kernel
VOL tmp flags=0x4 (Block_IO)
PLEX 0 flags=0x4 (Block_IO)
PLEX 0 flags=0x4 (Block_IO)
PLEX 0 flags=0x6 (Block_IO)
/dev/dsk/dks6d3s7 (17777424 blks)
/dev/dsk/dks6d3s7 (17777424 blks)
/dev/dsk/dks6d4s7 (17777424 blks)
/dev/dsk/dks12d3s7 (17777424 blks)
/dev/dsk/dks12d3s7 (17777424 blks)
/dev/dsk/dks12d4s7 (17777424 blks)
/dev/dsk/dks12d4s7 (17777424 blks)
/dev/dsk/dks12d4s7 (17777424 blks)
/dev/dsk/dks12d4s7 (17777424 blks)
/dev/dsk/dks12d4s7 (17777424 blks)
/dev/dsk/dks6d4s7 (17777424 blks)
/dev/dsk/dks6d4s7 (17777424 blks)
/dev/dsk/dks6d4s7 (17777424 blks)
/dev/dsk/dks6d4s7 (17777424 blks)
/dev/dsk/dks12d3s7 (17777424 blks)
/dev/dsk/dks12d3s7 (17777424 blks)
/dev/dsk/dks2d4s7
/dev/dsk/dks3d4s7 (17769232 blks)
/dev/dsk/dks3d4s7 (17769232 blks)
/dev/dsk/dks44d4s7 (17769232 blks)
/dev/dsk/dks46d4s7 (17769232 blks)
/dev/dsk/dks52d4s7 (17769232 blks)
/dev/dsk/dks52d4s7 (17769232 blks)
VOL polar_ptmp flags=0x1, [complete] (node=NULL)
DATA flags=0x0() open_flag=0x0() device=(192, 5)
PLEX 0 flags=0x0
VE 0 [active] start=0, end=17769231, (cat)grp_size=1
/dev/dsk/dks30d3s7 (17769232 blks)
```

### Example IRIX read(2) Sequence

```
read(2)
----- user->kernel -----
systrap
syscall
----- sysent[] ----- ml->os -----
read
_read
VOP_READ
----- xfs_vnodeops[] ----- os->xfs -----
xfs_read
xfs_read_file
chunkread
VOP_STRATEGY
----- xfs_vnodeops[] -----
xfs_strategy          (bp->b_target set to mount mp->m_ddev_targp, which has a pointer to the bdevsw entry)
xfs_strat_read
xfsbdstrat(mp, bp)    (use bp->b_target->bdevsw)
bdrv [bdstrat macro]
----- bdevsw[] ----- xfs->xlv -----
xlvstrategy           (minor is index to subvolume)
lvstrategy
xlv_lower_strategy    (major 0: use hwgraph)
bdrv [bdstrat macro]
----- bdevsw[] ----- xlv->disk driver -----
dkscstrategy
dksccommand
dkscstart
----- lun vertex->target vertex->controller vertex->scicommand
qlcommand
ql_entry
ql_start_scsi
QL_PCI_OUTH           moves to registers
```

### System Call Layer - Read

read(struct rwa \*uap, rval\_t \*rvp)

- user arguments:
- \_read(uap, rvp, readwrite)

\_read(struct rwa \*uap, rval\_t \*rvp, enum rwrtn wherefrom)

-
getf((int)uap->fdes, &fp) sets fp by following the chain: the pda's p\_curuthread -> thread, ut\_proc -> proc, p\_fdt -> fd\_list -> uf\_ofile[fd]
  - proc points to the expandable table of the user's open files
  - user's file descriptor (fd) indexes into the table of the user's open files
  - the open file table points to the system table of open files (vfile anchors a list of "physical" files each with its own offset)
- store user args in a local uio structure
  - uio\_resid = user buffer size
  - there is only one iovec for a read(2) -- there may be many for a readv(2) [similar to UNICOS listio(2)]
- if vfile\_t is flagged as a socket, do a socket receive (vfile points to a socket if the file type is FSOCKET)
  - for regular files, the vfile points to the vnode
- set vnode pointer "vp" to the vnode\_t
  - the xfs\_inode describes the location of the data
- set ioflag from vf\_flag if special handling (FDIRECT, FSYNC, ...)
- set file offset in uio struct by calling pfile\_getoffset() via the pfile\_ops[] table ("physical" file)
- VOP\_READ(vp, &uio, ioflag, fp->vf\_cred, &ut->ut\_flid, error);
  - locates the vnode's behavior description; this points to the xfs\_vnodeops[] for an XFS file
    - this is a table of functions that do operations on the file; in the case of a file on an XFS filesystem it uses the xfs vnode operations
  - for an XFS file, the VOP\_READ macro calls xfs\_read()
    - the vnode points to the inode for this type of filesystem; for a file on an XFS filesystem the vnode points to an xfs\_inode

>> proc | grep make

a800000047b60c00 1 346986 342801 346986 8827 a800000047b61398 make

>> proc -f- a800000047b60c00

PROC ST PID PPID PGID UID WCHAN NAME

a800000047b60c00 1 346986 342801 346986 8827 a800000047b61398 make

SELECTED FIELDS FROM THE KTHREAD STRUCT AT 0xa800000026af2400:

K\_FLAGS(0x240020)=KT\_SLEEP|KT\_HOLD|KT\_WSV

K\_W2CHAN=0x0,K\_STACK=0xfffffffffffff8000, K\_STACKSIZE=16384

K\_PRTN=1, K\_PRI=-2, K\_BASEPRI=-2, K\_SQSELF=0, K\_ONRQ=-1

K\_SONPROC=-1, K\_BINDIMG=-1, K\_MUSTRUN=-1

K\_LASTRUN=5, K\_CPUSET=1, K\_EFRAME=0x0, K\_LINK=0x0

K\_INHERIT=0x0, K\_INDIRECTWAIT=0x0

K\_RFLINK=0xa800000026af2400, K\_RBLINK=0xa800000026af2400

K\_FLINK=0xa800000026af2400, K\_BLINK=0xa800000026af2400

SELECTED FIELDS FROM THE PROC STRUCT:

P\_CHILDPIDS=0xa800000266f722c0, P\_SLINK=0x0, P\_SHADDR=0x0

OPEN FILES FOR PROC 0xa800000047b60c00:

| FD | FILE | RCNT | DATA | BH | FLAGS |
|----|------------------|------|------------------|------------------|-------|
| 0 | a8000003a38c7a80 | 14 | a800000340040600 | a800000366c4e358 | 3 |
| 1 | a8000003a38c7a80 | 14 | a800000340040600 | a800000366c4e358 | 3 |
| 2 | a8000003a38c7a80 | 14 | a800000340040600 | a800000366c4e358 | 3 |
| 3 | a800000220a666a0 | 1 | a800003767312000 | a8000002241b38d8 | 1 |

1 active processes found

- The process's open files may also
be displayed with the "file" directive:

```
The address of the file may be used:
>> file a800000220a666a0
            FILE  RCNT              DATA                BH  FLAGS
a800000220a666a0     1  a800003767312000  a8000002241b38d8      1
1 file struct found
```

The address of a proc may be specified to show all open files:

>> file -p a800000047b60c00

OPEN FILES FOR PROC 0xa800000047b60c00:

| FD | FILE | RCNT | DATA | BH | FLAGS |
|----|------------------|------|------------------|------------------|-------|
| 0 | a8000003a38c7a80 | 14 | a800000340040600 | a800000366c4e358 | 3 |
| 1 | a8000003a38c7a80 | 14 | a800000340040600 | a800000366c4e358 | 3 |
| 2 | a8000003a38c7a80 | 14 | a800000340040600 | a800000366c4e358 | 3 |
| 3 | a800000220a666a0 | 1 | a800003767312000 | a8000002241b38d8 | 1 |

4 file structs found

- FLAGS:

```
#define FREAD   0x01
#define FWRITE  0x02
```

- The FILE addresses point to vfile\_t structures: (the first 3 seem to point to stdin/stdout/stderr, so use the 4th)

```
>> print *(vfile_t *)a800000220a666a0
struct vfile {
    vf_bh = bhv_head_t {
        bh_first = 0xa8000002241b38d8
```

```
    cu_ckpt = -1
}
```

- The DATA addresses point to vnode\_t's:

```
    m_queue = (nil)
}
v_pc = vnode_pcache_t {
    v_pcacheref = 0
    v_pcacheflag = 0
    v_pagecache = pcache_t {
        pc_size = 1
        pc_count = 0
        pc_un_list = (nil)
        pc_un_hash = (nil)
    }
}
v_ckpt = (nil)
```

- the BH addresses are bhv\_desc\_t's in the pfile\_t's
- the v\_bh structure in the vnode (see the px output above) points to the inode's bhv\_desc\_t:

```
bd_next = (nil)
```

- you can tell from its pointer to vnode operations that this is an XFS file:

```
di_version = '\001'
di_format = '\002'
di_onlink = 3
di_uid = 8827
di_gid = 1037
di_nlink = 3
di_projid = 0
di_pad = { [0] 0 [1] 0 [2] 0 [3] 0 [4] 0 [5] 0 [6] 0 [7] 0 [8] 0 [9] 0 }
di_atime = xfs_timestamp_t { t_sec = 888010746 t_nsec = 978164536 }
di_mtime = xfs_timestamp_t { t_sec = 888010676 t_nsec = 342110995 }
di_ctime = xfs_timestamp_t { t_sec = 342110995 }
di_size = 4096
di_nblocks = 1
di_extsize = 0
di_nextents = 1
di_anextents = 1
di_anextents = 0
di_forkoff = 0
di_forkoff = 0
di_forkoff = 0
di_di_dmewmask = 0
di_di_dmeymask = 0
di_flags = 0
di_gen = 0
}
```

- The vnode may also be displayed with the "vnode" directive:
  - DEV: v\_rdev -- device number for special inodes (VCHR, VBLK)
  - BH: v\_bh -- address of the filesystem-dependent inode's behavior structure

### (XFS) Filesystem Layer - Read

xfs\_read(bhv\_desc\_t \*bdp, uio\_t \*uiop, int ioflag, cred\_t \*credp, flid\_t \*fl)

- set vp to the vnode (via behavior descriptor argument)
- set ip to the inode
- set mp to the mount structure
- check preferred iosizes (in the uio; IO\_UIOSZ)
- for a regular file:
  - if (ioflag & IO\_DIRECT) call xfs\_diordwr()
  - else (buffer cache I/O): xfs\_read\_file(bdp, uiop, ioflag, credp)

xfs\_read\_file(bhv\_desc\_t \*bdp, uio\_t \*uiop, int ioflag, cred\_t \*credp)

- while uio\_resid != 0: /\* uio\_resid is the length of the user's request \*/
  - nbmaps = m\_nreadaheads

```
the filesystem's mount table m_nreadaheads value:
>> print ((xfs_mount_t *)0xa800000026970800)->m_nreadaheads
4
```

  - call xfs\_iomap\_read() to map the request (and read-aheads) into 4 bmapval structures:
    - call xfs\_bmapi() to create a list of the inode's extents from current offset to end of file
    - set iosize to starting/ending blocks (rounded down to page boundaries)
    - call xfs\_next\_bmap() to convert the beginning of the request into a bmapval (using the list of extents)
      - bmapval's offset and length returned in disk sector units
    - for each additional bmapval:
      - call xfs\_next\_bmap() to convert the next part of the request into a bmapval (using the list of extents)
    - return the number of filled-in bmapval's (nbmap) to xfs\_read\_file's "nbmaps"
  - while uio\_resid && nbmaps:
    - /\* see discussion of system buffers below \*/
    - call chunkread(vp, bmapp, read\_bmaps, credp):
      - call fetch\_chunk(vp, bmap, cred) to set up the buf\_t and its pages for a bmapval:
        - select page size from policy module
        - call gather\_chunk(vp, bmap, &gc\_flags, &start\_page\_size, &end\_page\_size):
          - call bp\_insert() to locate or insert a buf\_t in the vnode's v\_buf tree
          - while bmap length: find and/or allocate pages to link to the buf\_t's b\_pages list
          - mark the buf B\_DONE or B\_PARTIAL (partly filled)
          - return bp
        - return bp
      - if the buf\_t is B\_DONE, call chunkreada(vp, ++bmap, cred) for each of the following bmapval's:
        - bp = fetch\_chunk(vp, bmap, cred);
        - VOP\_STRATEGY(vp, bp) (for XFS filesystems, jump thru xfs\_vnodeops[] to xfs\_strategy())
        - return bp
      - if the buf\_t is B\_PARTIAL, call patch\_chunk() to fill the initialized pages of this buf\_t, then chunkreada(vp, ++bmap, cred) for each of the following bmapval's; return bp
      - (buf is not B\_DONE or B\_PARTIAL) VOP\_STRATEGY(vp, bp); call chunkreada(vp, ++bmap, cred) for each of the following bmapval's; sleep in biowait(bp)
      - (chunkread) return bp
    - if bp->b\_resid != 0 mark buf\_t B\_DONE and break from while loop
    - else (bp->b\_resid == 0) call biomove()/uiomove() to move any already-buffered data from the buf b\_pages list to the user buffer at iov\_base; these decrement uio\_resid for each move
    - bmapp++; nbmaps--
    - continue until all bmapval's are done (nbmaps) or request done (uio\_resid == 0)
  - continue until request done (uio\_resid == 0)

```
xfs_strategy(bhv_desc_t *bdp, buf_t *bp)
```

- call xfs\_strat\_read(bdp, bp);

xfs\_strat\_read(bhv\_desc\_t \*bdp, buf\_t \*bp)

- call xfs\_bmapi() to locate the file's extents
- for each block of the request decide:
  - if a hole in the file, zero out the buffer
  - if not a hole, call getrbuf() to allocate a buf\_t (rbp) for the driver; initialize the rbp->buf\_t
from the bp->buf\_t; xfsbdstrat(mp, rbp); iowait(rbp)
- iodone(bp);

xfsbdstrat(struct xfs\_mount \*mp, struct buf \*bp)

- my\_bdevsw = get\_bdevsw(bp->b\_edev); /\* use hwgraph for major 0, or bdevsw[] for XLV major 192 \*/
- bdstrat(my\_bdevsw, bp); /\* which is a macro for: \*/ bdrv(my\_bdevsw,DC\_STRAT,bp)

bdrv(bdevsw \*my\_bdevsw, int routine, ...)

- case DC\_STRAT: func = (bdevfunc\_t)my\_bdevsw->d\_strategy;
- (\*func)(a1, a2, a3, a4, a5); /\* calls XLV driver xlvstrategy(bp) or disk driver dkscstrategy(bp) \*/
- display a file's extents:

## **System Buffers**

- page cache
  - all pages of physical memory
  - indexed by pfdat
  - shared by virtual memory system and file cache
  - if nothing but file I/O is going on almost all of the page cache will become dedicated to caching file pages
  - a file's pages may be mapped into a user's address space with mmap(2)
- chunk cache
  - the variable-length buffers headed by buf\_t structures
  - used for file I/O
  - groups of pages, indexed by file
- monitor the system buffers with bufview(1)
- system buffer cache is anchored in global\_buf\_hash
- the device list is anchored in global\_dev\_hash
- number of buf\_t's:

```
>> whatis v
0xc0000000014b8380
>> print *(struct var *)0xc0000000014b8380 | grep buf
v_buf = 125000
v_hbuf = 8192
```

- ngeteblkdev(dev\_t dev, size\_t len) gets a free buf from the bfreelist[]
  - buf is chained onto the device hash list
  - buf is assigned a buffer (may re-use memory from the fraglist[])

Both the chunk cache and page cache for a file are anchored in the vnode:

- the buf's are sorted into a btree
- the tree structure is balanced using b\_balance (see bp\_balance())
- there is a delwri (delayed write) chain using b\_dforw (and b\_dback)
- the buffers consist of a list of pages, represented by pfdat's
- fetch\_chunk()->gather\_chunk() locates a buf\_t and gathers pages to one buf's b\_pages list to
represent one file extent
  - it allocates buffers where none exist already
- all pages representing data from the file are linked to the vnode via a hash table (or directly, if a small number)
  - all such pfdat's point back to the vnode (pf\_vp)
  - any pfdat not associated with a file is "anonymous", with pf\_tag pointing to an "anon" structure (pf\_vp/pf\_tag is a union); anon structures connect the page with space on the swap device
- pfind() aka vnode\_pfind() -> vnode\_pfind\_nolock() -> pcache\_find() -> pcache\_search() to search for a vnode's page in a hash list
- all dirty (written-to) pages are on the v\_dpages list
- the "bdflush" service thread flushes dirty pages from the chunk cache
- the "pdflush" service thread flushes dirty pages from the page cache
- the "vhand" service thread frees up unused pages to a freelist
  - when pf\_use == 0 the page can go onto the free list
- the "coalesced" service thread frees up physically contiguous pages to combine as large pages

#### detail on the vnode's page hash list:

```
To view a hash table attached to a vnode:
>> print (*(vnode_t *)0xa800002626770900)->v_pc
<<< notice pc_un_hash: the hash table and pc_size: #elements in hash table >>>
>> whatis -l pfdat
```

- there are 32 disk blocks (sectors) in a 16KB page
- Detail of the buf structure:

```
typedef struct buf {
	/*
	 * These first 4 fields must match the hbuf definition
	 * below. DO NOT CHANGE THEM OR REARRANGE THEM.
	 */
	sema_t		b_lock;		/* lock for buffer usage */
	uint64_t	b_flags;	/* see defines below */
	struct buf	*b_forw;	/* headed by d_tab of conf.c */
	struct buf	*b_back;	/* " */
	struct buf	*bd_forw;	/* device based hash chain */
	struct buf	*bd_back;	/* " */
	struct dhbuf	*bd_hash;	/* hash bucket head */
	struct buf	*av_forw;	/* position on free list, */
	struct buf	*av_back;	/* if not BUSY */
	dev_t		b_edev;		/* major+minor device name */
	int		b_error;	/* returned after I/O */
	off_t		b_offset;	/* vnode offset (in basic blocks) */
	buftarg_t	*b_target;	/* route to I/O device */
	unsigned	b_bcount;	/* transfer count */
	unsigned	b_resid;	/* words not transferred after error */
	unsigned	b_remain;	/* virt b_bcount for PAGEIO use only */
	unsigned	b_bufsize;	/* size in bytes of allocated buffer */
	__psunsigned_t	b_sort;		/* key with which to sort on queue */
	union {
		caddr_t		b_addr;		/* low order core address */
		int		*b_words;	/* words for clearing */
		struct pfdat	*b_pfdat;	/* pointer into b_pages list */
		daddr_t		*b_daddr;	/* disk blocks */
	} b_un;
	struct vnode	*b_vp;		/* object associated with bp */
	daddr_t		b_blkno;	/* block # on device */
	clock_t		b_start;	/* request start time */
	struct pfdat	*b_pages;	/* page list for PAGEIO */
	void		*b_alenlist;	/* address, length lists */
	void		(*b_relse)(struct buf *);	/* function called by brelse */
	sema_t		b_iodonesema;	/* lock for waiting on I/O done */
	void		(*b_iodone)(struct buf *);	/* function called by bwrite */
	void		*b_private;	/* for driver's use */
	void		*b_fsprivate;	/* private ptr for file systems */
	void		*b_fsprivate2;	/* private ptr for file systems */
	void		*b_fsprivate3;	/* private ptr for file systems */
	short		b_pincount;	/* count of times buf is pinned */
	ushort		b_pin_waiter;	/* someone waiting for unpin? */
	char		b_ref;		/* # of free trips through freelist */
	char		b_balance;	/* tree balance */
	char		b_listid;	/* free list number */
	char		b_bvtype;	/* bufview flags */
	struct buf	*b_dforw;	/* vnode delwri chain */
	struct buf	*b_dback;	/* vnode delwri chain */
	struct buf	*b_parent;
	void		*b_grio_private;	/* private data for grio */
	struct buf	*b_grio_list;	/* list of bps used in grio */
#ifdef DEBUG_BUFTRACE
	struct ktrace	*b_trace;	/* per buffer trace buffer */
#endif
} buf_t;
```

- Detail of the pfdat structure:

```
typedef struct pfdat {
	struct pfdat	*pf_next;	/* Next free pfdat. */
	union {
		struct pfdat	*prev;		/* Previous free pfdat. */
		sm_swaphandle_t	swphdl;		/* Swap hdl for anon pages */
	} p_swpun;
#ifdef _VCE_AVOIDANCE
	int		pf_vcolor:8,	/* Virtual cache color. */
					/* bit 0: PE_RAWWAIT rawwait */
					/* bits 1..7: PE_COLOR */
			pf_flags:24;	/* Regular page flags */
#else
	uint		pf_flags;	/* Regular & NUMA page flags */
#endif
	...				/* Share use count. */
	...				/* Count of processes */
					/* doing raw I/O to page */
	unsigned long	pf_pageno;	/* Object page number */
	union {
		struct vnode	*vp;	/* Page's incore vnode. */
		void		*tag;	/* Generic hash tag. */
	} p_un;
	struct pfdat	*pf_hchain;	/* Hash chain link */
	union pde	*pf_pdep1;	/* Primary pde ptr */
	union {
		struct rmap	*pf_revmapp;	/* Reverse map pointer */
		union pde	*pf_pdeptr;	/* Page tbl entry ptr */
#if CELL
		...		pf_utimestamp;	/* timestamp for zero use cnt */
#endif
	} p_rmapun;
#if MULTIKERNEL
	__uint64_t	pf_exported_to;	/* bitstring of cells page is currently exported to */
#endif
} pfd_t, pfde_t;
```

- There is a "buffer cache" used for filesystem metadata
  - these buf's are indexed by device rather than file
  - see calls to get\_buf()
- Some control structures of the "buffer" cache

# Example IRIX write(2) Sequence

```
write(2)
    ----- user->kernel -----
systrap
syscall
    ----- sysent[] ----- ml->os -----
second_thread
bdflush
clusterwrite
VOP_STRATEGY
    ----- xfs_vnodeops[] -----
xfs_strategy        (bp->b_target set to mount mp->m_ddev_targp)
    queue to tail of xfsd_list or
xfs_strat_write
xfs_bmapi
xfsbdstrat (mp, bp) (use bp->b_target->bdevsw)
bdrv [bdstrat macro]
    ----- bdevsw[] ----- xfs->xlv -----
xlvstrategy         (minor is index to subvolume)
xlv_lower_strategy
griostrategy        (major 0: use hwgraph)
bdrv [bdstrat macro]
    ----- bdevsw[] ----- xlv->disk driver -----
dkscstrategy
dksccommand
    ----- lun vertex->target vertex->controller vertex->scicommand
qlcommand
ql_entry
ql_start_scsi
Ql_PCI_OUTH moves to registers
```

write(struct rwa \*uap, rval\_t \*rvp)

- user arguments:

```
struct rwa {
	sysarg_t	fdes;
	char		*cbuf;
	usysarg_t	count;
	sysarg_t	off64;	/* off64 is the offset for 64 bit apps */
	sysarg_t	off1;	/* off1 and off2 is the 64bit offset */
	sysarg_t	off2;	/* for 32 bit apps */
};
```

- \_write(uap, rvp, readwrite)

\_write(struct rwa \*uap, rval\_t \*rvp, enum rwrtn wherefrom)

- getf((int)uap->fdes, &fp)
  - set fp by finding the pda's p\_curuthread->thread ut\_proc->proc p\_fdt->fd\_list->uf\_ofile[fd]
  - proc points to the expandable table of the user's open files
  - user's file descriptor (fd) indexes into the table of the user's open files
  - the open file table points to the system table of open files (vfile anchors a list of "physical" files, each with its own offset)
- store user args in a local uio structure
  - uio\_resid = user buffer size
  - there is only one iovec for a write(2) -- there may be many for a writev(2) [similar to UNICOS listio(2)]
- if vfile\_t is flagged as a socket, do a socket send (vfile points to a socket if the file type is FSOCKET)
  - for regular files, the vfile points to the vnode
- set vnode pointer "vp" to the vnode\_t
  - the xfs\_inode describes the location of the data
- set ioflag from vf\_flag if special handling (FAPPEND, FSYNC, ...)
- set file offset in uio struct by using VFILE\_GETOFFSET\_LOCKED
- if IO\_DIRECT, call xfs\_diordwr()
- for buffered I/O: VOP\_WRITE(vp, &uio, ioflag, fp->vf\_cred, &ut->ut\_flid, error); locates the vnode's behavior description; this points to the xfs\_vnodeops[] for an XFS file
  - for an XFS file, the VOP\_WRITE macro calls xfs\_write()
  - the vnode points to the inode for this type of filesystem; for a file on an XFS filesystem the vnode points to an xfs\_inode

### (XFS) Filesystem Layer - Write

xfs\_write(bhv\_desc\_t \*bdp, uio\_t \*uiop, int ioflag, cred\_t \*credp, flid\_t \*fl)

- set vp to the vnode (via behavior descriptor argument)
- set ip to the inode
- set mp to the mount structure
- if append mode, uio\_offset = file size
- check preferred iosizes (in the uio; IO\_UIOSZ)
- for a regular file:
  - if (ioflag & IO\_DIRECT) call xfs\_diordwr()
  - else (buffer cache I/O): xfs\_write\_file(bdp, uiop, ioflag, credp, &commit\_lsn)

xfs\_write\_file(bhv\_desc\_t \*bdp, uio\_t \*uiop, int ioflag, cred\_t \*credp, xfs\_lsn\_t \*commit\_lsn\_p)

- while uio\_resid != 0: /\* uio\_resid is
the length of the user's request \*/
  - call xfs\_build\_gap\_list to build a list of gaps we are filling (in case a reader is about to read them)
  - call xfs\_iomap\_write to map the request into bmapval structures representing each extent being written to
    - calls xfs\_write\_bmap to form the bmap
    - returns the number of bmapval's it filled in
  - for each of the bmapval's formed above:
    - getchunk(vp, bmapp, credp) to get the buf
      - use chunkread for special cases
    - biomove(bp, bmapp->pboff, bmapp->pbsize, UIO\_WRITE, uiop) to move user data into the buffer
      - decrement uio\_resid
    - adjust the "gap list" with xfs\_delete\_gap\_list
    - mark the buf for delayed write (bdwrite(bp))
      - use bwrite(bp) for sync. writes
  - continue while loop on user request (uio\_resid length)

getchunk(vnode\_t \*vp, bmapval\_t \*bmap, cred\_t \*cred)

- call delalloc\_reserve() to ensure that too many delayed allocations are not pending
  - delallocleft is a global counter of how many such allocations may still be made (out of the max. that may exist simultaneously) in the system buffers
  - call clusterwrite to flush some out if delallocleft is zero
- call fetch\_chunk(vp, bmap, cred)
  - call gather\_chunk(vnode\_t \*vp, bmapval\_t \*bmap, int \*outflagsp, size\_t \*start\_page\_sizep, size\_t \*end\_page\_sizep)
    - call bp\_insert(vnode\_t \*vp, struct bmapval \*bmap)
      - find the buffer matching the bmap; look in vp->v\_buf b\_forw/b\_back list
      - may call ngeteblkdev to get a free one for the list
      - bp->b\_offset = bmap->offset
    - collect all pages assoc. with this buf and link to b\_pages list
  - start reads as necessary (call cread, which uses VOP\_STRATEGY)
  - release(bhead, btail, bp) each buf
- biowait for outstanding I/O
- return (bp)

#### bdwrite(bp)

mark the buffer for delayed write: bp->b\_flags |= B\_DELWRI | B\_DONE

Control returns to the user. The user's write to the system buffers is complete. (or -- buf could "float" to top of freelist)

#### clock

every second, wake up service thread second\_thread by releasing second\_sema

#### second\_thread

every bdflushcnt seconds, wake up bdflush by releasing semaphore bdwakeup (bdflushcnt is 1, unless changed with a syssgi(2))

#### bdflush

- walks thru nbuf/bdflushr entries of global\_buf\_table[] (bdflushr is a system tuneable)
- write to disk by calling clusterwrite(bp, age) (if attached to a vnode with b\_vp) (call bwrite(bp) if not attached)
- sleeps by psema(&bdwakeup, PZERO)

clusterwrite(buf\_t \*bp, clock\_t start)

- works on vp = bp->b\_vp; vnode's list
- works on b\_forw/b\_back/b\_parent list
- call getrbuf to get extra buf's for writes
- write via VOP\_STRATEGY(clusterbp->b\_vp, clusterbp);
- biowait(clusterbp);
- brelse(clusterbp);

#### VOP\_STRATEGY

xfs\_strategy(bhv\_desc\_t \*bdp, buf\_t \*bp)

- call xfs\_strat\_write(bdp, bp);

xfs\_strat\_write(bhv\_desc\_t \*bdp, buf\_t \*bp)

- for each extent of the request decide:
  - transaction to allocate storage
    - allocate (xfs\_bmapi())
  - write with xfsbdstrat(mp, rbp);
    - iowait(rbp)
- biodone(bp)
  - brelse()

xfsbdstrat(struct xfs\_mount \*mp, struct buf \*bp)

- my\_bdevsw = get\_bdevsw(bp->b\_edev); /\* use hwgraph for major 0, or bdevsw[] for XLV major 192 \*/
- bdstrat(my\_bdevsw, bp); /\* which is a macro for: \*/ bdrv(my\_bdevsw, DC\_STRAT, bp)

bdrv(bdevsw \*my\_bdevsw, int routine, ...)
- case DC\_STRAT: func = (bdevfunc\_t)my\_bdevsw->d\_strategy;
- (\*func)(a1, a2, a3, a4, a5); /\* calls XLV driver xlvstrategy(bp) or disk driver dkscstrategy(bp) \*/

# Module 13: XFS File Management

# Reference:

# mmap(2) - Memory Mapping a File

mmap64() or mmap(0, size, PROT\_READ, MAP\_SHARED, fd, 0) /\* arguments: addr, length, prot, flags, fd, off \*/

- call mmap\_common()

mmap\_common()

- follow the fd to the vnode (vp)
- set isspec if VCHR or VBLK (character/block special)

# Reference

# pfdat's

- Each node has a list of pfdat's
  - the pfdat's for a node are physically located in each node's own memory
  - each pfdat's PFN is implied by its offset from a well known base of 0x1800000 (NODE\_PFDAT\_OFFSET); a pfdat structure is 64 bytes (0x40) in length
    - given the address of a pfdat: PFN = address bits 39-32 (i.e. node number) << 18 | ((address bits 31 thru 0 - 0x1800000)/0x40)
    - given a PFN:
  - each pfdat contains information about the state and use of that page
  - the list of in-use pfdat anchors can be found by locating the master node's nodepda\_t and following it to the array of pg\_free\_t's. Then use any node number (nasid) as an index into the array. (This is how the icrash "pfdat -a" directive lists all pfdat's.)

```
>> dump master_nasid
c00000001450008: 0000000000000000
>> dump Nodepdaindr 4
c0000000014bc180: a8000000006d9448 a8000001003f4000
c0000000014bc190: a8000002003f4000 a8000003003f4000
>> px *(nodepda_t *)a800000006d9448 | grep pg_freelst
pg_freelst = 0xa8000000070f500
pg_freelst_lock = 0x2

or use the ">> nodepda" directive:

>> nodepda -f | pg
NODE NODEPDA
NODE 0 a800000006d9448
NODE PG DATA:
PG_FREELIST=0xa8000000070f500
NODE_FREEMEM=11637, NODE_FUTURE_FREEMEM=43018
NODE_EMPTYMEM=11087, NODE_TOTAL_MEM=47932

Example: if you were looking for node 2's pfdat's:

>> whatis -l pg_free
>> px 0xa8000000070f500+(2*136)
0xa8000000070f610
...
) hiwat = { [0] 0x7825
pheadend = { [0] 0xa8000002004222b8
pfd_low = 0xa800000201804240
pfd_high = 0xa800000201e7ffc0
HASH PFN
1 0 a800000201e7ffc0 0 2001 630783
2 pfdat structs found
```

- The page numbers in a node are not consecutive because there are "gaps" between memory banks of a node
- To display the populated memory banks in each node, dump the "slot\_psize\_cache" array
  - each entry in the array is a 16-bit integer
  - there are "numnodes" rows (major groups); each row represents a node
  - there are "slots\_per\_node" columns in a row; each column represents a memory "slot"
  - on an ORIGIN, every 4th "slot" contains the number of pages in the bank

```
>> dump numnodes
c0000000145cca4: 0000040            0x40, or 64 nodes
>> dump slots_per_node
c00000001450040: 000002000001000    0x20, or 32 possible "slots"/node
```

Every fourth "slot" represents a bank of memory, therefore "slot"/4 -> bank. This "slot" is distinct from the hardware DIMM slots. On the Origin there are 16 DIMM (Dual In-Line Memory Module) slots on a node board (slots MMXL0-7 and MMXH0-7). MMXL0/MMXH0 are bank 0, MMXL1/MMXH1 are bank 1, etc., for a total of 8 banks.
So take the "slot" number from the PFN and divide by 4. The result is a bank number, which is a pair of hardware DIMM's.

in memsupport.c:

```
#ifdef SN0
	size = (__int64_t)banks->membnk_bnksz[slot/4];

>> dump slot_psize_cache 20
[annotated dump: one row of 16-bit "slot" entries per node; node 0 starts at c0000000145e280 and node 1 at c0000000145e2c0; populated entries read 2000 or 1000]
```

The total number of pages for node 0 is 0xc000 (if you add up all the slots).

```
page size:
>> px *(struct icrashdef_s *)&icrashdef | grep page
i_pagesz = 0x4000
i_mapped_kern_page_size = 0x1000000
```

0xc000 \* 0x4000 = 0x30000000, decimal 805,306,368

The icrash "mem" directive agrees that there is 768 MB of memory in node 0:

>> mem

NODE MEMORY:

| MODULE | SLOT | NODEID | NASID | MEM_SIZE | ENABLED |
|--------|------|--------|-------|----------|---------|
| 1 | n1 | 0 | 0 | 768 | Y |
| 1 | n2 | 1 | 1 | 768 | Y |

actual value of 768 MB:

>> print 768\*1024\*1024

805306368

Node 0 has all 16 physical slots populated (it takes 2 hardware slots to make a bank, and we have 8 banks):

```
$ hinv -c memory -v > mem
...
Memory at Module 1/Slot 1: 768 MB (enabled)
  Bank 0 contains 128 MB (Premium) DIMMS (enabled)
  Bank 1 contains 128 MB (Premium) DIMMS (enabled)
  Bank 2 contains 128 MB (Premium) DIMMS (enabled)
  Bank 3 contains 128 MB (Premium) DIMMS (enabled)
  Bank 4 contains 64 MB (Premium) DIMMS (enabled)
  Bank 5 contains 64 MB (Premium) DIMMS (enabled)
  Bank 6 contains 64 MB (Premium) DIMMS (enabled)
  Bank 7 contains 64 MB (Premium) DIMMS (enabled)
```

- Position of the "slot" number in an address
  - "dump slot\_shift" shows that the slot number is bit 27 and up (left) in a kseg0 address

```
>> dump slot_shift
c000000001450030: 0000001b000000000
```

  - "dump slot\_bitmask" shows the size of the slot field in an address (i.e. 5 bits)

```
>> dump slot_bitmask
c000000001450038: 000000000000001f
```

#### Relationship of node number and nasid:

- "nasid\_shift" shows position of nasid (node number) in an address:

```
>> dump nasid_shift
c00000001450020: 0000002000000020    0x20, or 32 bits from the right
```

- "nasid\_bitmask" shows the size of the nasid

```
>> dump nasid_bitmask
c00000001450028: 000000000000000     nasid is 8 bits
```

- the "compact\_to\_nasid\_node" array shows that node numbers are the same as nasid's:

```
>> dump compact_to_nasid_node 16
c00000000145d810: 0000000100020003 0004000500060007
c00000000145d820: 00080009000a000b 000c000d000e000f
c00000000145d830: 0010001100120013 0014001500160017
c00000000145d840: 00180019001a001b 001c001d001e001f
c00000000145d850: 0020002100220023 0024002500260027
c00000000145d860: 00280029002a002b 002c002d002e002f
c00000000145d870: 0030003100320033 0034003500360037
c00000000145d880: 00380039003a003b 003c003d003e003f
```

- Address to PFN:
  - PFN is bits 39-14 of a physical address (i.e. a800000....)
  - divide an address by 0x4000 to shift right 14
- Node and bank to PFN:

Given a node and memory bank, form PFN as node (or nasid) << 18 | bank << 15.
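The shifts and masks above can be collected into a few helper routines. What follows is a minimal sketch in C, not the kernel's own code: the function and macro names are hypothetical, chosen for illustration, and the constants assume the values shown in this module (16 KB pages, nasid at bit 32 and 8 bits wide, "slot" at bit 27 with a 5-bit mask, four "slots" per bank).

```c
#include <stdint.h>

/* Values read from the icrash dumps in this module.  The names are
 * illustrative only; they are not the kernel's identifiers. */
#define EX_PAGE_SHIFT      14        /* i_pagesz = 0x4000 (16 KB)   */
#define EX_NASID_SHIFT     32        /* nasid_shift = 0x20          */
#define EX_NASID_MASK      0xffULL   /* nasid is 8 bits             */
#define EX_SLOT_SHIFT      27        /* slot_shift = 0x1b           */
#define EX_SLOT_MASK       0x1fULL   /* slot_bitmask = 0x1f         */
#define EX_SLOTS_PER_BANK  4         /* every 4th "slot" is a bank  */

/* PFN is bits 39-14 of the address: drop the upper bits, shift by 14. */
uint64_t addr_to_pfn(uint64_t addr)
{
    return (addr & 0xffffffffffULL) >> EX_PAGE_SHIFT;
}

/* Node number (nasid) is the 8-bit field starting at bit 32. */
unsigned addr_to_nasid(uint64_t addr)
{
    return (unsigned)((addr >> EX_NASID_SHIFT) & EX_NASID_MASK);
}

/* "Slot" is the 5-bit field starting at bit 27. */
unsigned addr_to_slot(uint64_t addr)
{
    return (unsigned)((addr >> EX_SLOT_SHIFT) & EX_SLOT_MASK);
}

/* Four "slots" make up one memory bank. */
unsigned slot_to_bank(unsigned slot)
{
    return slot / EX_SLOTS_PER_BANK;
}

/* First PFN of a bank: node (or nasid) << 18 | bank << 15. */
uint64_t node_bank_to_pfn(unsigned node, unsigned bank)
{
    return ((uint64_t)node << 18) | ((uint64_t)bank << 15);
}
```

With these definitions, `node_bank_to_pfn(1, 1)` gives 294912 (0x48000), and `addr_to_pfn(0xa800000120000000)` gives the same PFN, with `addr_to_nasid` returning 1 and `addr_to_slot` returning 4 (bank 1) for that address.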
In node 0, the PFN's would be:

| bank | first PFN formed as | pages | PFN range (hex) | PFN range (decimal) |
|------|---------------------|-------|-----------------|---------------------|
| 0 | 0 << 18 \| 0 << 15 | 0x2000 | 0-1fff | 0-8191 |
| 1 | 0 << 18 \| 1 << 15 | 0x2000 | 8000-9fff | 32768-40959 |
| 2 | 0 << 18 \| 2 << 15 | 0x2000 | 10000-11fff | 65536-73727 |
| 7 | 0 << 18 \| 7 << 15 | 0x1000 | 38000-38fff | 229376-233471 |

In node 1, the PFN's would be:

| bank | first PFN formed as | pages | PFN range (hex) | PFN range (decimal) |
|------|---------------------|-------|-----------------|---------------------|
| 0 | 1 << 18 \| 0 << 15 | 0x2000 | 40000-41fff | 262144-270335 |
| 1 | 1 << 18 \| 1 << 15 | 0x2000 | 48000-49fff | 294912-303103 |

## Compare this with >> pfdat -a output:

| PFDAT | USE | FLAGS | PGNO | VP/TAG | HASH | PFN |
|------------------|-----|-------|-------|------------------|------------------|--------|
| a8000000018071c0 | 1 | 2800 | 0 | 0 | 0 | 455 |
| a800000001807200 | 1 | 2800 | 0 | 0 | 0 | 456 |
| ... | | | | | | |
| a80000000187ff80 | 1 | 2100 | 74212 | a8000000270447e0 | -900000102633690 | 8190 |
| a80000000187ffc0 | 1 | 20002800 | 74212 | a8000000270447e0 | 2800000102633180 | 8191 |
| a800000001a00000 | 0 | | 23960 | 0 | 0 | 32768 |
| a800000001a00040 | 0 | 1a000 | 4817 | 0 | 0 | 32769 |
| ... | | | | | | |
| a800000001a7ff80 | 0 | 2009 | | a800000a80487b00 | 0 | 40958 |
| a800000001a7ffc0 | 1 | 2800 | 0 | a800000027ffc000 | 0 | 40959 |
| a800000001c00000 | 1 | 210c | 86758 | a8000000270447e0 | a800000102228540 | 65536 |
| a800000001c00040 | 1 | 2800 | 0 | a800000040004000 | 0 | 65537 |
| ... | | | | | | |
| a80000000243ff80 | 0 | 2009 | 1143 | a8000013611ecc00 | 0 | 200702 |
| a80000000243ffc0 | 1 | 210c | 94437 | a8000000270447e0 | a800000002617fc0 | 200703 |
| a800000002600000 | 1 | 210c | 74982 | a8000000270447e0 | a800000101838000 | 229376 |
| a800000002600040 | 1 | 210c | 0 | a800000027c4ea60 | a800000002217f40 | 229377 |
| ... | | | | | | |
| a80000000263ff80 | 1 | 2100 | 95716 | a8000000270447e0 | a800000001c67f80 | 233470 |
| a80000000263ffc0 | 1 | 2800 | 33,10 | 0 | 0 | 233471 |
| a800000101804380 | 1 | 210c | | a8000000270447e0 | -900000103606fa0 | 262414 |
| a8000001018043c0 | 2 | 2100 | 01003 | a80000000270492100 | 0 | 262415 |
| ... | | | | | | |
| a80000010187ff80 | 1 | 2100 | 52700 | a8000000270447e0 | a600000102021160 | 270334 |
| a80000010187ffc0 | 1 | 210c | 57829 | a8000000270447e0 | a800000101e33fc0 | 270335 |
| a800000101a00000 | 1 | 210c | 37094 | a800000270447e0 | a800002101c44000 | 294912 |
| a800000101a00040 | 1 | 210c | 38119 | a800000270447e0 | a800002101c08040 | 294913 |

Here's another icrash option (but it is extremely slow):

```
>> pfdat -n
VALID PFNS :
Node 0       0 --   8191   (8192)
Node 0   32768 --  40959   (8192)
Node 0   65536 --  73727   (8192)
Node 0   98304 -- 106495   (8192)
Node 0  131072 -- 135167   (4096)
Node 0  163840 -- 167935   (4096)
Node 0  196608 -- 200703   (4096)
Node 0  229376 -- 233471   (4096)
Node 1  262144 -- 270335   (8192)
Node 1  294912 -- 303103   (8192)
Node 1  327680 -- 335871   (8192)
```

The memory at the base of each node is reserved for .... (?)
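The correspondence between the PFDAT and PFN columns in the listing above follows from the pfdat layout described earlier (a 64-byte pfdat per page, starting at offset 0x1800000 within each node). A minimal sketch in C, under those assumptions: the function and macro names are hypothetical, and the inverse routine assumes the "a8000000..." kernel physical address prefix seen in the listing.

```c
#include <stdint.h>

/* Constants taken from this module; the names are illustrative only. */
#define EX_NODE_PFDAT_OFFSET  0x1800000ULL  /* base of a node's pfdat array   */
#define EX_PFDAT_SIZE         0x40ULL       /* one pfdat is 64 bytes          */
#define EX_PFN_NODE_SHIFT     18            /* node number sits above bit 17  */
#define EX_NASID_SHIFT        32            /* nasid position in an address   */
#define EX_KPHYS_PREFIX       0xa800000000000000ULL /* assumed address prefix */

/* PFN = nasid << 18 | ((low 32 address bits - 0x1800000) / 0x40) */
uint64_t pfdat_addr_to_pfn(uint64_t pfdat_addr)
{
    uint64_t nasid    = (pfdat_addr >> EX_NASID_SHIFT) & 0xff;
    uint64_t node_off = pfdat_addr & 0xffffffffULL;

    return (nasid << EX_PFN_NODE_SHIFT) |
           ((node_off - EX_NODE_PFDAT_OFFSET) / EX_PFDAT_SIZE);
}

/* Inverse mapping: the low 18 bits of the PFN index the node's pfdat
 * array, and the remaining bits select the node. */
uint64_t pfn_to_pfdat_addr(uint64_t pfn)
{
    uint64_t nasid = pfn >> EX_PFN_NODE_SHIFT;
    uint64_t index = pfn & 0x3ffffULL;

    return EX_KPHYS_PREFIX | (nasid << EX_NASID_SHIFT) |
           (EX_NODE_PFDAT_OFFSET + index * EX_PFDAT_SIZE);
}
```

For example, `pfdat_addr_to_pfn(0xa8000000018071c0)` gives 455 and `pfn_to_pfdat_addr(262414)` gives 0xa800000101804380, in agreement with the first and the node 1 rows of the listing.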
### **Address Translation**

#### kernel physical addresses

- begin with "a8000000..."
- are used by the kernel to address most dynamically allocated data
- are not translated through the TLB

#### kernel mapped addresses

- begin with "c0000000..."
- are used by the kernel to address kernel text (mapped to the copy on each node0) and data from the unix binary
- a TLB miss causes the kernel to go to the kernel page table (kptbl->) for a pte to load into the TLB

#### kernel stack addresses (mapped)

- begin with "ffffff..."
- are used by the kernel to address the kernel stack associated with the current thread
- the pte\_t for the kernel stack is kept in the uthread\_t
- no TLB miss should occur
- the virtual base of each kernel stack is fffffffffff8000 (KSTACKPAGE)

#### user virtual addresses

- begin with "00000000..."
- a user may only use a virtual address (no kernel addressing modes are permitted by the hardware)
- a TLB miss causes the kernel to go to the current thread's preg\_t's to find a virtual region containing the missed virtual address
- the kernel goes to this pregion's table of pte's for a pte to load into the TLB
- the kernel may use a user virtual address -- the kernel executes with the same address space id (tlbpid, or asid) as the current user thread, so address translation is the same as in user mode
- the kernel is given user addresses via system calls
- structure shown above assumes PMAP\_SEGMENT, which is used for a 32-bit mode binary
- For a PMAP\_TRILEVEL (64-bit mode) process the segment table points to other segment tables which point to pages of pte's. The diagram does not illustrate this case.

### Table locations

The pfdat's are local to the nodes they describe. They are indexed by the low 18 bits of the PFN (i.e. excluding the nasid). They correspond positionally to the pages on the node.

There is only one kernel page table.
It is physically located on node 0.

# Module 15: Disk I/O

# Example IRIX read(2) Sequence

```
read(2)
    ----- user->kernel -----
systrap
syscall
    ----- sysent[] ----- ml->os -----
read
VOP_READ
    ----- xfs_vnodeops[] ----- os->xfs -----
xfs_read
xfs_read_file
chunkread
VOP_STRATEGY
    ----- xfs_vnodeops[] -----
xfs_strategy        (bp->b_target set to mount mp->m_ddev_targp,
                     which has a pointer to the bdevsw entry)
xfs_strat_read
xfsbdstrat (mp, bp) (use bp->b_target->bdevsw)
bdrv [bdstrat macro]
    ----- bdevsw[] ----- xfs->xlv -----
xlvstrategy         (minor is index to subvolume)
xlv_lower_strategy
griostrategy        (major 0: use hwgraph)
    ----- bdevsw[] ----- xlv->disk driver -----
dkscstrategy
dksccommand
dkscstart
    ----- lun vertex->target vertex->controller vertex->scicommand
qlcommand
ql_entry
ql_start_scsi
Ql_PCI_OUTH moves to registers
```

# **XLV Structure**

- see man xlv
- concatenated filesystem: make multiple VE's of 1 partition each
- striped filesystem: make 1 VE of multiple partitions
- the VE is the unit of recovery; i.e. on a disk error, the VE is taken offline
- xlv\_make(1M) is used to create logical volumes by writing their definitions to the labels of the disks that will constitute the logical volume objects
- xlv\_assemble(1M) is run at boot time to collect the labels from all the disks, and pass them to the kernel
- xlv\_assemble(1M) also creates all the nodes in /dev/xlv; it assigns major 192 (XLV\_MAJOR) and creates a minor number
- the kernel strings all the subvolumes together in one table (see below), with minor number being the index into that table
- example:

```
$ ls -la /dev/xlv
brw------- 1 root root 192, 8 Mar 4 10:43 kud_src
brw------- 1 root root 192, 4 Mar 4 10:43 opt
brw------- 1 root root 192, 9 Mar 4 10:43 people
brw------- 1 root root 192, 5 Mar 4 10:43 ptmp
brw------- 1 root root 192, 7 Mar 4 10:43 tmp
brw------- 1 root root 192, 6 Mar 4 10:43 utmp
```

- to see your configuration:
- also see xlv\_shutdown, prtvtoc, fx, dvhtool

# **XLV Driver Layer**

xlvstrategy(bp, 0) /\* assume a single plex \*/

- minor device is taken from bp->b\_edev
- xlv\_p = &xlv\_tab->subvolume[minor\_dev];
- if num\_plexes > 1, split the I/O among the volume's plexes (i.e. mirrors)
- xlv\_lower\_strategy(bp, 0);

xlv\_lower\_strategy(bp, plex\_num)

- plex = xlv\_p->plex[plex\_num];
- calculate the number of buffers needed to do the transfers to the real disk partitions
- allocate an array of buf\_t structures for the transfers ("lvm")
- break the bp->buf up into separate requests in buf\_list->buf's
  - b\_edev will be a partition device number (i.e. a hwgraph vertex handle)
- for each buf (tlvb) in list: griostrategy(tlvb)

## griostrategy(bp)

- check for guaranteed rate I/O information on this (b\_edev) disk; if none:
  - my\_bdevsw = get\_bdevsw(bp->b\_edev); /\* this macro assumes b\_edev is a vertex if major is 0, else indexes into bdevsw[] \*/
  - bdstrat(my\_bdevsw, bp); /\* this is a macro that calls bdrv(my\_bdevsw, DC\_STRAT, bp) \*/

bdrv(bdevsw \*my\_bdevsw, int routine, ...)

- case DC\_STRAT: func = (bdevfunc\_t)my\_bdevsw->d\_strategy;
- (\*func)(a1, a2, a3, a4, a5); /\* calls disk driver dkscstrategy(bp) \*/

# **ORIGIN** module overview

- The diagram above illustrates how the 12 I/O slots are connected to the 8 processors in an ORIGIN module

# Disk Device Connections to be Pictured in the Hardware Graph

- half of a module is pictured above
- the OS probes the hardware at startup time and builds the hardware graph in memory
- the hwgfs file system is mounted on /hw
- each controller is assigned a unique number by ioconfig(1M) at boot time
  - before mounting filesystems (called by /etc/bcheckrc)
  - controller numbers stored in /etc/ioconfig.conf
  - compare hinv output and /etc/ioconfig.conf to /hw, /dev/dsk and /dev/rdsk
- disk device naming conventions: dksCNTRLdDRIVElLUNsPARTITION
  - example: /dev/dsk/dks14d5s7 is a SCSI disk in controller 14, drive (or "target") 5, [only logical unit 0], partition 7
  - "QL" is for QLogic scsi controller
  - "jag" for scsi via Jaguar VME; "fd" for floppy disk
- partition conventions:
  - 0 root
  - 1 swap
  - 6 usr
  - 7 entire disk except header and log
  - 8 volume header
  - 9 reserved (bad blocks)
  - 10 entire disk
  - 11 xfs log (external)
- disk labels:
  - controller parameters; e.g. device geometry
  - partition map
  - sgi info; e.g. serial number
  - boot info; e.g. root and swap partitions; name of file to boot
  - directory; e.g.
sash 15-5.a 22jul1998 TR-IKI rev 0.7b SGI Proprietary # **Hardware Graph format** - group 0 index 1 (i.e. handle 1) is the root of /hw - built dynamically in kernel memory at boot time - accessible externally as the /hw filesystem - meaning of a hwgraph "minor device number": ``` $ cd /hw/disk $1s -1 a root swap brw------ 1 root 0,3842 Mar 2 17:45 root brw------ 1 root 0,3847 Mar 1 08:41 swap $ dtb 3842 11110 0000010 /* 1E 2 i.e. group 0x1E vertex 2 */ $ dtb 3847 11110 0000111 /* 1E 7 */ • using icrash to display the hwgraph control structure ○ each vertex has a name ○ below the names: 1=>2 means vertex handle 1 points to handle 2 >> hwpath -f /* be patient - this takes a while! */ /hw/module/1/elsc/nvram 1=>2=>3=>4=>5 /hw/module/1/elsc/tty 1=>2=>3=>4=>5 /hw/module/1/slot/n1/node/memory/.master 1=>2=>3=>27=>28=>29=>31=>29 /hw/module/1/slot/n1/node/cpu/b/.master 1=>2=>3=>27=>28=>29=>32=>33=>29 /hw/module/1/slot/n1/node/cpu/b/.master 1=>2=>3=>27=>28>>29=>32=>35=>29 /hw/module/1/slot/n1/node/hub/.master 1=>2=>3=>27=>28=>29=>36=>29 ... 
>> dump hwgraph c0000000013daba0: a800000000da8000 ``` TR-IKI rev 0.7b SGI Proprietary graph\_num\_group = 128 >> print \*(graph\_t \*)a800000000da8000 struct graph\_s { graph\_name = 0xa80000000dac008 = \*hwg\* 22jul1998 15-6.a • using icrash to display all the vertices in the hwgraph ``` INFO LIST 1 a800001000501400 \begin{array}{lll} \textbf{CONNECT\_POINT=0} \\ \textbf{INDEX[0]=0$\times$0, } & \textbf{INDEX[1]=0$\times$0, } & \textbf{INDEX[2]=0$\times$0} \end{array} EDGES: LABEL NAME VERTEX EDGE a800000001360968 module a8000000013613a0 nodenu nodenum cpunum xplink midi 74 713 714 767 833 a800000001361490 a800000001362f18 a800000001362f78 a800000001363200 a8000002004a4098 .id .invent a8000002004a4098 a8000002e004a4140 a8000001004a4e60 a8000001004a53a0 1596 2750 3023 machdep ttys unknown a8000001004a5898 ``` disk TR-IKI rev 0.7b SGI Proprietary a8000001004a43f8 a8000001004a5aa8 a8000001004a5748 22jul1998 3844 15-6.b ``` a80000000248d100 external_int 3851 a80000000248d130 sharena a80000000248d178 zero a80000000248d1c0 mem 3867 3868 3869 a80000000248d1d8 using icrash to display a disk vertex o 2305 is binary 10010 0000001 o group id is 0x12 o group index is 0x1 >> vertex -n -f 2305 HNDL VERTEX REFCNT NUM_EDGE NUM_INFO INFO_LIST CONNECT_POINT=2679 INDEX[0]=0x0, INDEX[1]=0xa77, INDEX[2]=0xa8000018004b0330 EDGE LABEL NAME VERTEX 0 a8000001004a42d8 volume 1 a8000002004a4320 volume_header 2690 2 a8000003004a4128 partition 3678 INFOS: INFO LABEL NAME INFO_DESC INFO 20 a8000018004ac460 fffffffffffffff a800001c0050a280 1 graph_vertex_s struct found ``` meaning of icrash "vertex -l -n" headings: HNDL: decimal handle of this vertex 15-6.c 22jul1998 TR-IKI rev 0.7b SGI Proprietary ``` ■ VERTEX: hex address of this vertex ■ REFCNT: reference count ■ NUM_EDGE: number of graph_edge_t's ■ NUM_INFO: number of graph_info_t's ■ INFO_LIST: hex address of the list of graph_edge_t's and graph_info_t's ■ CONNECT_POINT: parent handle (arbitrary_info_t[1] following the graph_vertex_t) ■ 
INDEX[0]: hex; arbitrary_info_t[0] following the graph_vertex_t; can be a pointer to driver table
■ INDEX[1]: hex; arbitrary_info_t[1] following the graph_vertex_t; same as CONNECT_POINT
■ INDEX[2]: hex; arbitrary_info_t[2] following the graph_vertex_t; "fastinfo"
■ the arbitrary_info_t's:

/* Reserve room in every vertex for 3 pieces of fast access indexed information */
#define HWGRAPH_NUM_INDEX_INFO 3
#define HWGRAPH_DEVSW     0 /* (b,c)devsw for this device */
#define HWGRAPH_CONNECTPT 1 /* connect point (parent) */
#define HWGRAPH_FASTINFO  2 /* reserved for creator of vertex */

o for disk-related devices HWGRAPH_FASTINFO (index 2) points to:
    scsi controller   scsi_ctlr_info_t
    scsi target       scsi_targ_info_t
    scsi lun          scsi_lun_info_t
    scsi unit         scsi_unit_info_t
    scsi device       scsi_dev_info_t
    scsi disk         scsi_disk_info_t
    scsi partition    scsi_part_info_t

EDGES: graph_edge_t's
  o EDGE: index into the graph_edge_t's
  o LABEL: hex address of the graph_edge_t
  o NAME: e_label->character string
  o VERTEX: decimal vertex handle of the subordinate (child) vertex
INFOS: graph_info_t's
  o INFO: index into graph_info_t's
  o LABEL: hex address of the graph_info_t
  o NAME: i_label->character string (see HW Graph Information Labels below for full list)
  o INFO_DESC:
```

  - -1: INFO\_DESC\_PRIVATE - info is not exported as an attribute on a /hwgfs file
  - 0: INFO\_DESC\_EXPORT - info is in the arbitrary\_info\_t itself
  - NN: size of the structure that i\_info points to
- INFO: i\_info; the arbitrary\_info\_t within the graph\_info\_t; this is either information or a pointer to information
- jump-off for further study: example code sequence that sets a bdevsw pointer into the hwgraph

```
main
initialize_io
edt_init
edt[]
wd93edtinit
wd93hinv
wd93_inq
scsi_device_update
enci_disk_update
scsi_disk_update
scsi_part_vertex_add
hwgraph_device_add
hwgraph_block_device_add
hwgraph_bdevsw_set
```

# **Hwgraph Example** (from
vmcore.320.comp) mple: /hw/module/3/slot/io1/baseio/pci/0/scsi\_ctlr/0/target/4/lun/0/disk/partition/7/block logical controller drive [target] hw/disk/dks4d4s7 [lun is 0] partition 1084 dkscstrategy() sci\_command sci\_info si ctlr info sti\_ctlr\_info sti\_targ (unit mtroller-t harinformation sesi\_targ\_info\_t entory\_t 9200000808200000 (see "I/O Space" (see "I/O Spac diagram to interpret) sli\_targ\_info sli\_lun (lun 0) sasi\_lun\_info\_: [2] inventory\_t partition" sdi\_lun\_info scsi\_disk\_info 3683 🖥 "block" spi\_disk\_info spi\_part (7) scsi\_part\_info\_t 22jul1998 · associating partition name and physical location with icrash 15-7.a 22jul1998 TR-IKI rev 0.7b SGI Proprietary ``` >> vertex -f -n 1394 HNDL VERTEX REFCNT NUM_EDGE NUM_INFO INFO_LIST CONNECT_POINT=1093 INDEX[0]=0x0, INDEX[1]=0x445, INDEX[2]=0xa8000008004f4400 LABEL NAME 0 a8000000013613e8 .master 1 a8000005004a41e8 usrpci 2 a8000001004a40e0 scsi_ctlr 3 a8000005004a4260 base 4 a8000002004a4050 io 1407 1941 5282 4 a8000002004a4050 10 mem 6 a8000005004a4218 config 7 a8000005004a4638 dma 8 a8000050004a46e0 intr 9 a800000041329940 rom 5284 5285 INFOS: LABEL NAME INFO INFO_DESC 1 graph_vertex_s struct found >> vertex -n 1941 HNDL VERTEX REFCNT NUM_EDGE NUM_INFO INFO_LIST HNDL VERTEX REFCN: NUM_EDGE NUM_INFO INFO_LIST 1941 a80000150050e3f0 4 1 1 a800001c254873f0 EDGES: EDGE LABEL NAME VERTEX ``` ``` 0 a800000001360980 0 1949 INFOS: INFO INFO_DESC LABEL NAME INFO 0 a80000000248d298 _hwgfs_list ffffffffffffff a800001c24e83480 1 graph_vertex_s struct found O Notice above that 1941 is a scsi_ctlr edge "0" leading to the controller vertex #1949: o display the controller number for vertex 1949: >> vertex -f -n 1949 HNDL VERTEX REFCNT NUM_EDGE NUM_INFO INFO_LIST 1949 a80000150050e570 7 2 2 a800001c24e829a0 CONNECT POINT=1941 INDEX[0]=0x0, INDEX[1]=0x795, INDEX[2]=0xa800000100504300 EDGES: EDGE LABEL NAME VERTEX 0 a8000001004a4110 bus 1 a8000001004a4128 target INFOS: LABEL NAME INFO_DESC 20 a8000018004ac1c0 
ffffffffffffffff a800001c24e82a60 1 graph_vertex_s struct found /* the _invent label points to an inventory_s structure (of x20 bytes)*/ TR-IKI rev 0.7b SGI Proprietary 22jul1998 15-7.c >> print *(inventory_t *)0xa8000018004ac1c0 struct inventory_s { inv_next = (nil) inv_next = (nil) inv_class = 2 inv_type = 1 inv_controller = 4 inv_unit = 0 inv_state = 9 /* INV_DISK 2 */ /* INV_SCSICONTROL 1 */ /* matches /hw/disk/dks4d4s7 and /dev/dsk/dks4d4s7 */ o the controller logical number is assigned in conjunction with ioconfig(1M) (see ioctl (fd, DS_MKALIAS)) • Alternate method: dump all paths to a file: >> hwpath -f | cat > hwpaths O Search the file for the name and vertex number: /hw/disk/dks4d4s7 1=>3844=>3685 ``` • walk forward in this list to the scsi\_ctlr: • then display that vertex as above (>> vertex -f -n 1949) /hw/module/3/slot/iol/baseio/pci/0/scsi\_ctlr/0/target/4/lun/0/disk/partition/7/block 1=>2=>11=>135=>1084=>1085=>1093=>1394=>1941=>1949=>2179=>2626=>2675=>2679=>2305=>3678=>3683=>3685 # **HWGraph Information Labels** see iograph.h: ``` Symbolic name ASCII label: Contains: INFO_LBL_CPU_INFO "_cpu" Controller_name info_LBL_CPU_INFO LBL_CPU_INFO "_cpu" &cpuinfo_t &invent_miscinfo_t &invent_meminfo_t &invent_routerinfo_t &device_desc_t INFO_LBL_DEVICE_DESC "_device_desc" &diag_inv_t &dksc_local_info_t &device_driver_t not used? not used? "_diag_reason" "_dkiotime" "_driver" INFO_LBL_DIAGVAL INFO_LBL_DIAGVAL INFO_LBL_DKIOTIME INFO_LBL_DRIVER INFO_LBL_ELSC INFO_LBL_GIOIO "_gioio" not used? "_gioio_ops" &gioio_provide "_grio_disk" &grio_disk_info "_hwgfs_list" &hwgstat_t %hwg_traverse" &traverse_fn_t INFO_LBL_GFUNCS INFO_LBL_GRIO_DSK &gioio_provider_t &grio_disk_info_t INFO_LBL_HUB_INFO INFO_LBL_HWGFSLIST INFO_LBL_TRAVERSE INFO_LBL_INVENT INFO_LBL_MIRESET INFO_LBL_MODULE_INFO INFO_LBL_MONDATA "_invent" "_mlreset" "_module" "_mon" &inventory_t not used? 
&router_info_t &hubstat_t INFO_LBL_MDPERF_DATA "_mdperf" &md_perf_monitor_t "_nic" "_node" "_pcibr_hints" INFO_LBL_NIC char * INFO_LBL_NICE_INFO INFO_LBL_PCIBR_HINTS INFO_LBL_PCIIO INFO_LBL_PFUNCS &hubbinfo_t pcibr_hints_t pciio_info_t &pciio_provider_t "_pciio" "_pciio_ops"
```

# **Disk Driver Layer**

dkscstrategy(bp)

- bp->b\_edev is the disk partition's vertex in the hwgraph (this is stored in the /hw/disk/dks.... node):

\$ ls -la /hw/disk/dks0d4s1

brw----- 1 root root 0,4042 Apr 27 11:23 /hw/disk/dks0d4s1

\$ ls -la \$(find /hw/module -name block -depth) | grep 4042

brw----- 1 root root 0,4042 Apr 27 11:26 /hw/module/6/slot/io1/baseio/pci/0/scsi\_ctlr/0/target/4/lun/0/disk/partition/1/block

[also see /etc/ioconfig.conf for path and controller number]

- set "part\_info" to partition vertex's information
- set "disk\_info" to disk vertex's information
- set "dk" to disk's queue control structure
- calculate disk-relative block number (into b\_sort)
- buf may, or may not, be sorted into normal or priority queue
- dksccommand(dk, &sc); /\* sc a pointer to lock dk->dk\_lock \*/

dksccommand(struct dksoftc \*dk, int \*spl)

- set bp to top request on the dk-> retry, priority and normal queues
- set sp to dk\_freereq (a scsi request structure)
- dkscstart(dk, bp, sp);

dkscstart(struct dksoftc \*dk, buf\_t \*bp, scsi\_request\_t \*sp)

- set lun\_info (follow scsi request to lun vertex)
- copy the scsi read or command from dk->structure to scsi request
- set addresses and length in the scsi request
- set interrupt handler (sr\_notify) to dk\_intr()
- find the controller vertex (follow lun to target to controller vertex) and call scsi driver (sci\_command)(sp)

qlcommand(sp) /\*\*\*\* assuming a QLogic scsi controller \*\*\*\*\*/

# **SCSI Driver Layer**

qlcommand(struct scsi\_request \*req)

-
ql\_entry(req);

ql\_entry(scsi\_request\_t\* request)

- ha = ha\_information (hardware address) (follow request's pointer to lun vertex, to target vertex, to controller vertex)
- initialize the request return status fields
- place the request on the ha->waitf/waitb wait queue
- ql\_start\_scsi(ha)

ql\_start\_scsi(pHA\_INFO ha) (preliminary description - key points may be missing)

- isp = ha->ha\_base /\* base address for communicating with this controller \*/ (see I/O Address Space, below)
- move any request from the ha->wait queue to the ha->req\_forw/back request queue
- work on first request in ha->req\_forw request queue
  - check controller's queue space
  - remove request from ha->req\_forw queue, link to it from the ha->reqi\_block[][]
  - build a request at q\_ptr->queue\_entry from request->scsi\_request\_t
  - adjust request for scatter/gather
  - inform controller of request: QL\_PCI\_OUTH(&isp->mailbox4, ha->request\_in); [\*((volatile short\*)&isp->mailbox4) = ha->request\_in ]
- move any other ha->wait request to ha->req\_forw and repeat above

# I/O Address Space

Big Window 0 (of 8 512M windows per node) (window 0 controls the other 7)

- Each cpu can address an I/O device at an "I/O" space address, as diagrammed above.
- nasid's: use "hinv -mv | grep Slice" to determine nasid assignments
- even numbered nodes are connected to high numbered I/O slots
- odd numbered nodes are connected to low numbered I/O slots
- see http://www-ssd.engr/doc/chips/hub2\_prog\_man/hub2.regbook.doc.html for I/O space addressing details
- see http://www-ssd.engr/doc/ for the general table of contents
- see the IRIX Device Driver Programmer's Guide (6.4)
- see the IRIX Device Driver Programmer's Guide (6.5)

Each xtalk device is accessible in the hwgraph via its I/O slot number, and via its controlling hub.
For example:

\$ ls -l /hw/module/1/slot/n1/node/xtalk/8

-- 1 root root 29 Mar 6 11:11 /hw/module/1/slot/n1/node/xtalk/8 -> /hw/module/1/slot/io1/baseio

# **Interrupt Processing**

I/O interrupt

- CPU enters at E\_VEC 0x80000180

### exception

- code was copied here by \_hook\_exceptions
- is\_intr processes an "interrupt"

intrfast

intrnorm

ecommon

- pda's p\_causevec[CAUSE exception code]
- this table was copied to pda from causevec[]

VEC\_int(ep)

intr(ep, ...

- if (cause & SRB\_DEV0) (devices bit) (cause\_intr\_tbl[SRB\_DEV0\_IDX])(ep) (calls intpend0 for device interrupt)

intpend0(ep)

- while Hub PI\_INT\_PEND0
  - level = leading bit
  - clear the bit
  - pda: p\_intmasks.dispatch0->vectors[level]

```
|iv_func|---> qlintr()
|iv_args|---> ha_information
|iv_sync|
```

  - if THD\_OK (i.e. threaded, which is normal) vsema(&iv\_sync) (wakeup intpend0\_intrd)

intpend0\_intrd(intr\_vector\_t \*ivp) (interrupt thread)

- (iv\_func)(iv\_arg) (call to qlintr(&ha\_information))
- ipsema(&iv\_sync)

qlintr(pHA\_INFO ha)

- read ha\_base->bus\_isr to test for interrupt (ISP\_REGS: register definitions)

ql\_service\_interrupt(ha)

- process "mailbox" request completions first
- pull all responses (ha->response\_out to ha->response\_in) to a linked list
- inform the adapter that they are received
- completion processing:
- ql\_notify\_responses(ha, cmp\_forw)
- start any more requests on ha->waitf or ha->req\_forw (ql\_start\_scsi())

ql\_notify\_responses(ha, forwp)

- for each request on sr\_ha list:
  - (\*request->sr\_notify)(request)

## dk\_intr(sp)

- bp = sp->sr\_bp
- remove buf from dk->dk\_active\_list

## biodone(bp)

- check for guaranteed rate I/O
- if bp\_biodone present, call (\*bp\_biodone)()
- set B\_DONE
- vsema(&b\_iodone\_sema)

## vsema(sema\_t \*sp)

- increment the semaphore's lock count

### semawake(sp)

- find the thread (kt = semadq(sp))
- make the thread
runnable (thread\_unblock(kt, ...))
Module 16: IRIX Dumps - 6.5

# IRIX Dumps - 6.5

## This module describes:

- Situations in IRIX that cause system dumps
- Dump creation overview
- Dump contents
- Configuring dump levels

# **Dump Scenario Diagram**

## **Dump Scenarios**

### Four causes

### Hardware exceptions

Interrupts caused by hardware conditions.

For more serious machine errors, low-level interrupt handlers call panic(), or call cmn\_err() directly with the CE\_PANIC argument.

Root cause may be:

- Hardware malfunction: e.g.
ECC (Error Correction Code) error; double bit error
- Software: BadVAddr (Bad Virtual Address); kernel address pointer error

Resulting panic message provides abbreviated reason for interrupt

### NMI (Non-Maskable Interrupt)

- NMI interrupt triggered by operator pressing switch, or (remote) console request
- Usually result of degraded or hung system
- Way to "force" a dump for diagnostic purposes
- Resulting panic message indicates "nmi" as cause

### Assertions

- ASSERT macros scattered throughout kernel code test system "sanity"
- Commonly conditionally compiled (#ifdef'd); used for testing / debugging
- Resulting panic message includes failed test (source code expression)

### Panics

- PANIC (assembler) macro calls panic() function
- Panics typically indicate a logic error within the kernel; a very few may be caused by hardware
- Resulting panic message provides abbreviated reason for the panic

### Common code

- All panic conditions call common error handling routine icmn\_err() with the CE\_PANIC argument
- Function icmn\_err() invokes the system dump sequence for the CE\_PANIC condition
- Function sequence syncreboot() -> dumpsys() -> real\_dumpsys() -> dumpvmcore():
  - Flushes memory cache
  - Flushes buffer cache areas to preserve file integrity
  - Saves CPU context data in memory
  - Writes selected portion of virtual memory to the dump device
  - Calls mdboot() to attempt to restart the system

## Panic within panic

Hardware exceptions, NMI interrupts, ASSERT macro calls, PANIC macro calls, and panic calls that occur during panic processing cause the corresponding handling routines to be called.
Function icmn\_err() detects this nested panic condition; then:

- Dump processing is terminated; then ->
- A system reboot is attempted; if this fails ->
- Persistent re-entries of icmn\_err() result in hanging all CPUs in a hard loop

# **Processing Activities**

## **Hardware Exception**

### **Processing**

1. Low level kernel interrupt handler routines "trap" hardware errors and call functions such as VEC\_int() and intr().
2. Function intr() checks the cause for the interrupt:
   - Errors affecting a user application usually only result in killing the application. The error is logged; a user core file may result.
   - Errors that are deemed "correctable" are corrected. The error is logged; the system continues.
   - Errors that compromise the integrity of the system are logged; then the system panics.
3. Depending on the situation, hardware routines may cause a panic by calling cmn\_err() with the CE\_PANIC argument, by executing ASSERT or PANIC macros, or by calling the panic() function. In each case a panic message suggesting the reason for the panic is passed in the call.
4. In any of the above cases, function icmn\_err() is eventually called to process the panic (see below).

### Panic Example

```
<0>PANIC: CPU 28: Fatal Craylink error.
/hw/module/4/slot/n4/node/cpu/a: 1: RR_ERROR_STATUS = 0x9829000100041bf0
NOTICE - cpu 30 didn't dump TLB, may be hung
```

### **Stack Trace Example**

### NMI (Non-Maskable Interrupt)

#### **Processing**

1. NMI is a hardware interrupt, and thus is trapped by low level kernel code the same as other hardware interrupts. However, the kernel takes a unique logic path for NMI in an attempt to save as much CPU context as possible for the dump.
2. Assembler routine nmi\_dump sets up a dump stack area, and jumps to cont\_nmi\_dump.
3. C function cont\_nmi\_dump:
   - Places the system in panic (dump processing) state.
   - Indicates an NMI occurred: nmied=1.
   - Saves CPU registers in low memory; IP27\_NMI\_EFRAME\_OFFSET=0x11800
   - Collects router error information; prints and logs any problems found.
   - Calls cmn\_err(CE\_PANIC, "User requested vmcore dump (NMI), cpu \_\_\_ handling").
4. Function icmn\_err() prints panic message and processes dump (see below).

### Panic Example

No usable example, yet.

### **Stack Trace Example**

No usable example, yet.

#### **Assertions**

### **Processing**

There are many different ASSERT-type macros coded in IRIX; however, they all work on the same principle.

```
/* Consider two definition samples from irix/kern/sys/debug.h: */
#define ASSERT(EX) ((!doass||(EX))?((void)0):assfail(#EX, __FILE__, __LINE__))
#define ASSERT_ALWAYS(EX) ((EX)?((void)0):assfail(#EX, __FILE__, __LINE__))

/* Usage examples from irix/kern/sgi/fs_bio.c: */
ASSERT(!bp->b_vp);
ASSERT(bp->b_blkno == blkno && bp->b_bcount == BBTOB(len));

/* Also from irix/kern/os/fdt.c: */
ASSERT_ALWAYS(fp->vf_count > 0);
```

- Macro ASSERT is conditionally compiled into kernels with #define DEBUG set, hence it is used in development debugging. Variable doass is set to "1" (true) by default.
- Macro ASSERT\_ALWAYS is unconditionally compiled into every kernel.
- In any case, assertions perform the following:
  1. Evaluate the expression "EX" for "true" or "false" (non-zero or zero).
  2. If the expression is true, return (void)0; effectively a NO-OP.
  3. If false, call function assfail with the ASCII expression, the source code file name and the line number as arguments.
  4. Assembler routine assfail saves the calling CPU's registers in an exception frame called \_assregs and calls C function \_assfail().
  5. Function \_assfail:
     1. Sets spin lock 7; forces other CPUs to spin on this lock.
     2. Prints assertion message: "assertion failed cpu CPUN: , file: FILE, line: LINE".
     3. Prints values in \_assregs.
     4. Sets system in panic mode.
     5.
Calls cmn\_err(CE\_PANIC, "assertion failure!")
  6. cmn\_err() prints the panic message and processes the dump (see below)

### Panic Example

### **Stack Trace Example**

#### **Panics**

### **Processing**

Assembler macro PANIC is a machine language interface to the C language panic() function. It passes the address of the PANIC message in the call. Function panic() simply calls icmn\_err() with arguments CE\_PANIC and the message address.

```
/* For example, from irix/kern/bsd/socket/uipc_socket.c: */
m = so->so_rcv.sb_mb;
if (m == 0)
    panic("receive 1");
```

Function icmn\_err() prints panic message and processes dump (see below).

#### Panic Example

```
<0>PANIC: CPU 4: receive 1
```

### Stack Trace Example

```
>> ctrace 4
STACK TRACE FOR CPU 4
| Systrape r2/v0:000000000000044a r5/a1:000000001003c000 r8/a4:000000001005d11c
CAUSE=1000002c, SR=fffffffffa400ffb3, BADVADDR=fabaccc
```

#### **Common Routines**

#### cmn\_err() Panic Processing

Called with flag (e.g. CE\_PANIC), "printf" format string, and operand values; for instance:

cmn\_err(CE\_PANIC, "%s\nIRIX Killed due to fatal memory ECC error.\n", buf)

Calls icmn\_err() with CE\_PANIC flag, passing format string and operand values as arguments.

### icmn\_err() Panic Processing

1. Gets CPU number of panicking CPU.
2. Blocks all CPU interrupts.
3. Tests whether this is the first CPU here; if not, spin wait.
4. Sets this CPU in panic state.
5. Formats panic message.
6. Increments panic level counter by 1.
7. If first time in panic processing (panic level counter == 1):
   1. Stops interval timer.
   2. Sets each other CPU's panic\_spin flag and sends it an interrupt.
   3. Flushes console buffers.
   4. Prints panic error message to console.
   5. Formats and prints syslog message for availmon.
   6. Flushes console buffers.
   7. Calls dumptlb() to save TLB at tlbdumptbl.
   8.
If this CPU had ECC error:
      - Calls panicspin(), which selects another CPU to do the dumping (this CPU spins here while another CPU calls dumpsys() to finish the dump)
   9. Tests if all TLBs are dumped; prints error message if not.
   10. Calls syncreboot() to dump and restart the system (see below):
       1. Locks putbuf
       2. Calls dumpsys() to write memory to disk (see below)
       3. NO RETURN from dumpsys()
8. If second time in panic processing (panic level counter == 2):
   1. Stops interval timer.
   2. Prints "DOUBLE PANIC" message.
   3. Updates availmon log.
   4. Flushes console buffers.
   5. Calls mdboot(AD\_HALT) which calls mprboot() to attempt to reboot. (NO RETURN from mprboot)
9. If third time in panic processing (panic level counter == 3):
   - Calls mdboot(AD\_HALT) which calls mprboot() to attempt to reboot. (NO RETURN from mprboot)
10. If fourth (or more) time in panic processing (panic level counter > 3):
    - Hangs in infinite spin loop.

#### syncreboot() Processing

1. Locks putbuf
2. Calls dumpsys() to write memory to disk (see below)
3. NO RETURN from dumpsys()

### dumpsys() Processing

1. Uses setjmp to save registers in dumpregs.
2. Saves current process pointer in dumpproc.
3. Creates stack frame in preparation to call real\_dumpsys().
4. Flags this process to run on this (and only this) CPU.
5. Uses longjmp to call real\_dumpsys(); executes a NULL ASSERT if it ever returns.
   1. Sets spin-lock 7 (locks other CPUs in kernel).
   2. Panics or hangs system if the dump device is improperly configured.
   3. Sets dump in progress flag.
   4. Prints "Dumping to device at block number, space: 0xnumber pages".
   5. Sets physaddr to first valid PFN in the system.
   6. Calls dumpvmcore() to dump memory to disk (see below).
   7. Clears dump in progress flag.
   8. Calls mdboot() which calls mprboot() to attempt to restart the system; hangs the CPU if it returns (i.e. disables interrupts).

### dumpvmcore() Processing

1.
Disables ECC interrupts.
2. Calls flush\_cache() to save cache memory to "primary" memory.
3. Sets up dump device header; writes it to disk.
   - Note: each section of the dump is separated by a descriptive dump header.
4. Writes putbuf data to disk.
5. Writes errbuf data to disk.
6. Prints "Dumping low memory".
7. Calls dump\_page() to dump physaddr to physaddr+0x4000:
   - Memory is compressed before writing to disk.
   - Bad or inaccessible pages are skipped.
   - Stops if "out of dump space".
   - Prints a "." for each "dump block" written.
8. Checks dump\_level config variable - see "Dump Level Configuration" below.
9. if **dump\_level >= 1**:
   1. Prints "Dumping static kernel pages...". (Kernel code and compiled (static) data)
   2. Calls dump\_lowmem:
      - On node zero - dumps from physaddr+0x4000 to last page before pfdat table.
      - On non-zero nodes - dumps from 0 to last page before pfdat table.
      - Does NOT dump copies of kernel on non-zero nodes.
   3. Prints "Dumping dynamic kernel pages...".
   4. Calls pdf\_scan(SCAN\_KERN\_NONBULK, DUMP\_SELECTED) (Kernel dynamic (malloc'd) memory, excluding system buffers and mbufs).
10. if **dump\_level >= 2**:
    1. Prints "Dumping buffer pages..."
    2. Calls pdf\_scan(SCAN\_KERN\_BULK, DUMP\_SELECTED) (Kernel dynamic (malloc'd) system buffer and mbuf memory).
11. if **dump\_level >= 3**:
    1. Prints "Dumping remaining in-use pages..."
    2. Calls pdf\_scan(SCAN\_INUSE, DUMP\_INUSE) (Allocated memory, not yet dumped).
12. if **dump\_level >= 4**:
    1. Prints "Dumping free pages..."
    2. Calls pdf\_scan(SCAN\_FREE, DUMP\_FREE) (The rest of memory).
13. Flushes unwritten dump buffers to disk.
14. Prints "Updating dump header..." Prints "Dump complete."
15. On SN0 - prints "System dump completed" to PROMLOG.
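The dump\_level checks in the dumpvmcore() steps are cumulative: each level adds stages to those of the levels below it, and low memory is dumped regardless of level. A minimal sketch of that selection logic in C follows. The `dump_stages` table and the `stages_for_level()` helper are hypothetical names used only for illustration (they are not taken from the IRIX source); only the console messages and the level thresholds come from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stage table. The messages and minimum levels follow the
 * dumpvmcore() steps; the struct and names are illustrative only. */
struct dump_stage {
    int min_level;        /* stage runs when dump_level >= min_level */
    const char *message;  /* console message printed for the stage */
};

static const struct dump_stage dump_stages[] = {
    { 0, "Dumping low memory" },               /* always dumped */
    { 1, "Dumping static kernel pages..." },
    { 1, "Dumping dynamic kernel pages..." },
    { 2, "Dumping buffer pages..." },
    { 3, "Dumping remaining in-use pages..." },
    { 4, "Dumping free pages..." },
};

#define NUM_DUMP_STAGES (sizeof(dump_stages) / sizeof(dump_stages[0]))

/* Return the number of stages selected by dump_level.
 * When verbose is non-zero, also print each selected stage's message. */
static size_t stages_for_level(int dump_level, int verbose)
{
    size_t i, count = 0;

    assert(dump_level >= 0);  /* assumption: negative levels are invalid */
    for (i = 0; i < NUM_DUMP_STAGES; i++) {
        if (dump_level >= dump_stages[i].min_level) {
            if (verbose)
                printf("%s\n", dump_stages[i].message);
            count++;
        }
    }
    return count;
}
```

For example, calling `stages_for_level(2, 1)` prints the low memory, static kernel, dynamic kernel, and buffer page messages in that order and returns 4. The table-driven form is a design choice for the sketch; the steps above describe a sequence of separate level checks.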
### **Dump Level Configuration**

#### **Dump Level**

Configuration variable dump\_level is set and displayed by systune(1M).

### **Dump level meanings**

Dump routines repeatedly scan the kernel memory management table pfdat. Each page is flagged according to its use in the kernel. Once a page is dumped, it is marked as "dumped" so it does not get dumped again. Dumping proceeds IN THIS ORDER:

- Level 1 or greater: kernel code, static data, and non-bulk dynamic data
  - First valid PFN to first valid PFN+0x4000 (always dumped)
  - Kernel static memory (code and static data)
    - Node zero: first valid PFN+0x4000 to last page before pfdat table
    - Non-zero nodes: first valid PFN to last page before pfdat table
  - Kernel dynamic (malloc'd) memory excluding system buffers and mbufs
- Level 2 or greater: kernel bulk (dynamic) data
  - System buffers
  - mbufs (Message buffers)
- Level 3 or greater: in-use memory
  - Memory pages marked as allocated in the pfdat table
- Level 4 or greater: free memory
  - Memory pages marked as free in the pfdat table

Module 17: Bibliography

# **Bibliography**

This bibliography contains suggested references organized by the following categories:

- BOOKS BY SGI EMPLOYEES (former or current)
- BOOKS
- ON-LINE DOCUMENTS
- TOOLS
  - Browsing Kernel Source Code
  - Browsing Executable Code (".c" files)
  - Browsing Memory
- TRAINING MATERIALS WEB PAGES
- INSTRUCTOR WEB PAGES (links to their class materials)
- ENGINEER WEB PAGES

# **BOOKS BY SGI EMPLOYEES (former or current)**

Scalable Shared-Memory Multiprocessing by Daniel E.
Lenoski and Wolf-Dietrich Weber

(One of our Principal Engineers said: "...recommend the very scholarly well-written and complete introduction to NUMA-style architectures (including the concept of directories) written in part by our own Dan Lenoski when he was at Stanford. Dan then joined SGI and became the head of the SN0 project. (at least the first half of the book) should be \*required reading\* for anyone that teaches classes about our hardware... you might want to recommend it for the students.")

Unix Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers (Addison-Wesley Professional Computing Series) by Curt Schimmel

Order books on-line from: http://www.amazon.com/exec/obidos/ISBN%3D1558603158/1963-2407536-869114

#### **BOOKS**

The Magic Garden Explained: The Internals of Unix System V Release 4: An Open Systems Design, by Berny Goodheart, James Cox / Paperback / Published 1994

Order books on-line from: http://www.amazon.com/exec/obidos/ISBN%3D1558603158/1963-2407536-869114

#### **ON-LINE DOCUMENTS**

Documents / Technical Information Menu
http://www-devtoolbox.engr.sgi.com/toolbox/documents/

Hardware Developer Handbook, Release 2.0
http://www-devtoolbox.engr.sgi.com/toolbox/hardware/hwHandbook/

R10000 Microprocessor User's Manual - Version 2.0 - Copyright 1996, 1997, MIPS Technologies, Inc.
-- 09 DEC 96
http://www.sgi.com/MIPS/products/r10k/UMan\_V2.0/HTML/t5.Ver.2.0.book\_1.html

MIPS IV Instruction Set
http://coral.mti.sgi.com/arch/MIPS4\_3.2/APP.book\_1.html

R10000 Microprocessor User's Manual
http://www.sgi.com/MIPS/products/r10k/UMan\_V2.0/HTML/t5.Ver.2.0.book\_1.html

IRIX 6.3/6.4 Migration Course (Virtual Memory Overview)
http://snt.engr.sgi.com/webverter/cameron/6.3\_6.4\_Migration/CDROM/

(Memory, Swap, Tuning Information)
http://catlady.engr.sgi.com/~saragon/CoCreate/customization/chap3.htm#sgiopt-2

Pthreads Home Page
http://www-devtoolbox.engr.sgi.com/toolbox/documents/pthreads/index.html

IRIX 6.4 Device Driver Programming Guide (Operating System Overview Information, mostly still accurate for 6.5)
http://www-devtoolbox.engr.sgi.com/toolbox/documents/DevDriver/irix6.4/DD6.4toc.html

How to download it
http://www-devtoolbox.engr.sgi.com/toolbox/documents/DevDriver/irix6.4/html/

Architecture Documentation Database (SN0 architecture specifics)
http://b7.asd.sgi.com/doc/arch/

including:

Lego System Specification
http://b7.asd.sgi.com/doc/arch/lego/sys\_spec.book.doc.html

Link Level Protocol Specification
http://b7.asd.sgi.com/doc/arch/llp/llp\_spec.book.doc.html

Lego Cache Coherence Protocol Specification (including DIMMs <Directory Memory> Explanation and Block Diagrams)
http://b7.asd.sgi.com/doc/arch/coherence/coherence\_spec.book.doc.html

ASD/NSD 1996 Next Generation Product Specification
http://b7.asd.sgi.com/doc/arch/prod\_spec/ProdSpec.book.html

PCI-to-PCI Bridge Issues in Origin Systems
http://b7.asd.sgi.com/doc/arch/pci\_to\_pci/pci\_to\_pci.doc.html

Origin I/O Memory Model
http://b7.asd.sgi.com/doc/arch/io\_mem\_model/io\_mem\_model.doc.html

### **TOOLS**

#### **Browsing Kernel Source Code**

cscope - see tutorial "IRIX source browsing"
http://wwwtng.cray.com/~mix/irix.html#Class-Materials

#### **Browsing Executable Code (".c" files)**
elfdump, dwarfdump - see tutorial "IRIX source browsing" http://wwwtng.cray.com/~mix/irix.html#Class-Materials

#### **Browsing Memory**

IRIX Crash Command Set (icrash) - a web page listing the icrash commands. Clicking on any command takes you to a brief definition of that command and some examples. http://bits.csd.sgi.com/digest/saca/icrash/commands.html

icrash - see tutorial "IRIX source browsing" http://wwwtng.cray.com/~mix/irix.html#Class-Materials

### TRAINING MATERIALS WEB PAGES

IRIX Software Training http://www.tng.cray.com/~mix/irix.html

## **INSTRUCTOR WEB PAGES (links to their class materials)**

- Howard Mundy's web page: http://wwwtng.cray.com/~hlm
- Dave Wright's web page: http://wwwtng.cray.com/~daw
- Cliff Wickman's web page: http://wwwtng.cray.com/~cpw
- Mike Conrad's web page: http://wwwtng.cray.com/~conrad

### **ENGINEER WEB PAGES**

Chandler Lai's web page: http://chandler.csd.sgi.com/~clai2 which includes:

- IRIX 6.5 Support Readiness Information: http://chandler.csd.sgi.com/~clai2/6-5info.html
- IRIX 6.5 Engineering Technical Information Overview Sessions: http://chandler.csd.sgi.com/~clai2/ETIO-TACbeta.html
- (list of publications): http://chandler.csd.sgi.com/~clai2

# Appendix A: Origin2000 Support Processes For High End Systems

CRAY PRIVATE

### **Purpose**

- Communicate Cray Origin2000 system processes and procedures.
- Introduce the tools supporting these processes and procedures.

These processes have evolved over time and have been driven by two key areas of need:

1. A strong desire to avoid any unexpected surprises after the system has shipped. Most of these systems have an acceptance period with specific criteria which must be met before customer acceptance is completed and revenue flows.
2. Feedback and process improvement. The general belief is that the closer to the factory a problem is identified and fixed, the less costly it is to fix. Thus, it is important to provide input on how well installations go; then, we can go back and fix problems and improve processes.

A Cray Origin2000 system is defined as a high end Origin2000 or array of Origin2000 systems greater than 32 CPUs. Origin2000 systems which start out as configurations smaller than 32 CPUs and are upgraded to above 32 CPUs are then considered Cray Origin2000 systems and follow the procedures described below.

### Getting Help

Please contact Dave Walls (walls@cray.com, +1-612-683-5352) regarding any questions on the material in this document.
### Getting Cray domain accounts

Some of the tools and processes described below require Cray domain accounts.

Request CrayRealm and Training domain accounts

### Site Planning

Currently available site planning materials (05/97):

Table A-1: Site Planning Materials

| Part number | Title |
|---|---|
| 007-3452-xxx | Site Preparation for Origin Family and Onyx2 |
| HR-04122 | CRAY Origin2000 or Onyx2 InfiniteReality Rack System |
| 10658642 | Origin2000 Rack Site Planning Packet |
| 10658643 | Challenge RAID Site Planning Packet |
| 10658644 | Operators System Console Site Planning Packet |

Planning packets can be requested by mailing site@cray.com or calling +1-715-726-2820. Additional information concerning power requirements and floor plan layouts is available at the Cray Site Engineering Group's home page.

Power Requirements tool

### **Installation Planning**

A new tool has been created to aid Branch Support Managers (BSMs) in planning for the delivery, installation, and acceptance of large Origin2000 systems. This Installation Plan Manager (ipm) tool supports the writing of installation plans for all Origin2000 systems. You are requested to use the ipm tool to write an install plan for all large Origin2000 systems (Origin2000 systems with more than 16 CPUs). Completing the installation plan helps to ensure accurate delivery and success of your system installation and acceptance by the customer.

Besides identifying the field account team responsible for this work, the plan includes other valuable information:

- customer's intended use of the system
- production hardware and software configurations
- installation timeline
- upgrade timeline
- acceptance timeline

### **System Registration**

The system serial number for Origin systems with more than 32 CPUs must be registered in the CRUISE database.
This registration provides these benefits:

1. Customer calls within the U.S. for these systems are routed to the Eagan TAC instead of the Mountain View TAC.
2. Availmon data from this list of registered systems is used to calculate MTTI for large Origin systems.

Some traditional Cray customers still wish to use the CRInform tool to report software problems (SPRs). This CRUISE registration is necessary to validate their use of CRInform and report problems.

To register the system under CRUISE, send this information to cruise\_reg@cray.com:

- Customer Name
- System Serial Number
- Support Branch in which the system is located

Register Origin System Under CRUISE

Currently, this is the only use of CRUISE required for Origin systems.

#### System Serial Number

The Origin system serial number needed to register the system in CRUISE is taken from the lower module located at the left end of the system as you face the front of the system.

- Serial number format is Knnnnnnn.
- The serial number is located on a white sticker at the back of the module/rack behind the power cord connection to the fan tray (left side). It is important that this physical module's serial number be used as the system serial number for tracking purposes.
  - This is not always the same as the Sales Order's serial number.
- Some early systems require that the upper module's serial number be used instead. In these cases, the upper module had a lower valued serial number than the lower module.
- Software commands to obtain serial numbers are:

| Command | Prints |
|---|---|
| `/usr/etc/amsysinfo` | system serial number |
| `sysinfo -vv \| cut -d - -f2` | all serial numbers (1st is system) |
| `hinv -vm` | serial numbers of all modules |

### **Installation Reporting**

In the U.S., a Clarify install ticket should be opened to track the status of the installation. This can be done through the ESCALL tool in Supportfolio or by using Clarify directly.
Information that should be included is:

- the install time
- any installation problems
- any hardware fallout

Resources are allocated to monitor this information and work with the appropriate groups to resolve the problems and ensure better future installations.

### **Initial Mainframe Hardware Install Reporting**

Update your system's Clarify install ticket to report:

- hardware installation status
- time taken to install the hardware
- how well the initial hardware install went

An Initial Mainframe Install-HW form is not required for Origin systems.

| Origin2000 and Onyx2 Deskside And Rackmount Installation Instructions | 108-0155-xxx |
|-----------------------------------------------------------------------|--------------|
| Origin2000 Power-On Diagnostics | 108-0161-xxx |

### **Initial Mainframe Software Install Reporting**

Update your system's Clarify install ticket to report:

- software installation status
- time taken to install the software
- how well the initial software install went

An Initial Mainframe Install-SW form is not required for Origin systems.

| System Verification Program (svp) Reference Guide | 108-0165-xxx | FAQ |
|---------------------------------------------------|--------------|-----|
| FRU Analyzer Reference Guide | 108-0166-xxx | |
| ICRASH Reference Guide | 108-0167-xxx | |

### **Hardware & Software Installation Defects**

Update your system's Clarify install ticket to report problems encountered in hardware and software installation processes. Install Hardware Defect & Install Software Defect forms are not required for Origin systems.
### **System Failure Reporting**

Once the system has been accepted, all known hardware or system failures requiring a reboot of the system should be reported. This information is used to determine Mean Time To Interrupt (MTTI) reliability metrics.

MTTI reliability metrics are used to:

- provide input back to the engineering groups on system stability
- identify areas for improvement

All Origin2000 sites, both U.S. and international, should run Availmon if the customer allows it. Availmon captures the failure data used to calculate system MTTI metrics. Previously, sites were asked to report failure information in the CRUISE database.

AvailMon More Information

#### **Problem Escalation**

One escalation process applies to all Origin systems, regardless of size. The next diagram, originated by Jack La Salle of RTS, summarizes this overall escalation model.

Figure A-0: Critical Problem Escalation

In North America, escalation of large Origin system problems to the backbone support groups, GTS/CTS, follows standard GTS escalation procedures. Escalation of large Origin system problems originating with SSEs in North America begins when they escalate to the RTS organization (1-888-800-4RTS). If Regional Technical Analysts (RTAs) require further escalation, they escalate to GTS either through the GTS Escalation Pager (1-415-588-4826) for critical system down/customer satisfaction issues or through standard GTS escalation procedures for normal escalations.

Escalation Procedure & Information Guideline

More information: RTS's Escalation Procedure for U.S. & Canada, RTA Location & Expertise

Home Pages: BSM/SSE, CSC, CTS, GTS, RTS, Eagan & MV TAC, SGI & Cray Logistics

#### GTS's Hotlist

Employees may monitor escalation status by viewing GTS's Hotlist. This Hotlist is the forum for escalating issues into Engineering.

- Hot Site List
- Hot Accounts
- Top Issues
- Hot Bugs

The Cray Weekly Site Review (WSR) process and report are no longer used for escalating issues and problems for any Origin systems. Work is in process to merge these two escalation forums.

### **U.S. Escalation Model**

High end Origin system customers have access to some traditional Cray processes and tools that have not yet been merged into a single set of processes and tools. This results in a need to deviate from the standard processes used for low end Origin systems. Processes and tools which necessitate deviation are:

- customer access to the CRInform tool
- the ability to submit SPRs in lieu of PVs
- high end U.S. region calls routed to the Eagan TAC

The next charts help explain problem flow through the organizations with these differences added.

Figure A-1: Cray Origin 2000 U.S. Customer's Problem Flow

As shown, the backline support team works on one set of queues to address all Origin2000 escalations, regardless of system size.

Figure A-2: Cray Origin 2000 U.S. Field's Problem Flow

"Field" in this diagram includes SSEs, RTAs, and ASEs.

### **International Escalation Model**

Figure A-3: Cray Origin 2000 International Problem Flow

International regions are expected to handle escalations locally, with existing processes. When it becomes necessary to escalate to GTS in the U.S., use the current escalation procedures defined for any other IRIX-based system. Both non-critical and critical escalations require that a Clarify call be created. This can be done through the ESCALL tool in Supportfolio.
The Clarify call that is created will be assigned to the appropriate support group in the U.S.

### **Problem Reporting**

Clarify is the call routing tool used to initiate requests for help for all Origin2000 systems in the U.S. and to escalate calls from International to the U.S. backline support groups.

- Critical (system down) problems are escalated via direct contact.
- Non-critical problems are passed via Clarify.

Clarify is used to document all problems, hardware and software, critical and non-critical. This supports a consistent process for getting help and escalating problems for all Origin2000 machines, regardless of size. Clarify directs escalations to the correct backline support groups based on system registration information.

Report problems by any of these means:

- Use Clarify directly.
- Use Clarify indirectly by contacting the call center assigned to the customer.
- Use the Electronic Escalation CallLog (ESCALL) capability in Worldwide Customer Service Supportfolio OnLine.

| Tool | Links |
|---|---|
| Clarify | Launch |
| Supportfolio OnLine | Data Sheet, Launch |
| More information | Supportfolio, Supportfolio OnLine, Electronic Escalation CallLog (ESCALL) |

#### **Software Problem Reporting**

All Cray Origin2000 support organizations are using PVs internally. Software Problem Reports (SPRs) go into the SPR database. Cray Origin2000 system SPRs automatically convert into PVs to support the MV and CF engineering organizations' need to use existing tools. The ProjectVision (PV) number will match the SPR number.

At customer request, field personnel can file SPRs to report software problems against Cray Origin2000 systems as an alternative to PVs. Field personnel not familiar with filing SPRs can initiate a Clarify call or ESCALL and request that a problem also be recorded as an SPR.
Using the SPR database allows customers to use CRInform to check the status of their problems, from problem report through patch (fix) availability.

Long-term plans are to replace SPR and PV with a new SCOPUS-based tool (BugMaster) that can be used by all engineering organizations. BugWorks is planned to bridge to this future database from its current PV database.

Display SPR or PV (example: try 706050).

### **Customer Communication**

Customer communication mechanisms:

- Pipeline
- CRInform
- Field Notices and FYIs/FIBs/NPIs

### Pipeline

The Pipeline publication:

- is the communication mechanism for topics related to Cray Origin2000 systems.
- is made available on the Supportfolio CD-ROM.
- will be sent to all Origin2000 customers.

Pipeline home page, On-line Viewer

#### Cray Inform (CRInform)

CRInform is an online information and problem-reporting service for SGI/Cray customers and employees. Cray Origin2000 customers will be provided with a CRInform account. This CRInform account will be set up when the Origin system is registered under CRUISE.

CRInform provides customers with the capability to:

- track the status of software problems they have reported against their Cray Origin2000 system
- view Field Notices that have been written against Cray Origin2000 systems.
#### CRInform features:

- Report software problems (customers only)
- Request technical assistance (customers only); similar to ESCALL
- Search information repositories, including SPRs, Field Notices, Cray Service Bulletins, and Software Release Documents
- Order software, software updates, software fixes, and software publications
- Access the Publications and Training Catalogs
- Access customer bulletin boards
- Customize a user profile for receiving email notification when information of interest to you is available
- Read about new products

#### Field Notices and FYIs/FIBs/NPIs

A Field Notice (FN) is an SGI/Cray document that communicates technical or procedural information about SGI/Cray products to customers, employees, and third-party service providers. FNs are similar in function to SGI's Field Information Bulletins (FIBs).

Both Field Notices and FYIs/FIBs/NPIs will be used to communicate information concerning Cray Origin2000 systems. Plans are to combine these two communication processes into one process. In the interim, to ensure all areas are covered:

- All Cray Origin2000 Field Notices will have an associated FYI, FIB, or NPI with similar information.
- FYIs, FIBs, and NPIs do not go to customers. FYIs, FIBs, and NPIs which contain information that is to be sent to Cray customers will be rewritten as a Field Notice (FN) and distributed to field personnel who are registered to receive Cray Origin2000 Field Notices and to customers via CRInform.
- FYIs, FIBs, and NPIs that apply to Cray Origin2000 systems and contain information that will not be distributed to customers will be mailed to field personnel registered to receive Cray Origin2000 Field Notices. The format of these FYIs, FIBs, and NPIs re-issued as Field Notices will not be changed to Field Notice format.
Field personnel who need to receive all Origin2000 communications should register to receive:

- Field Bulletin System (FBS) mailings
- Field Notice (FN) mailings

### **Related Information**

- Cray Origin2000 Support Tools and Planning
- Origin/Onyx2 Firmware information on the Lego Software home page

Supporting documentation:

| Origin2000 Deskside Owner's Guide | 007-3453-xxx |
|------------------------------------------|--------------|
| Onyx2 Deskside Workstation Owner's Guide | 007-3454-xxx |
| Origin Vault Owner's Guide | 007-3455-xxx |
| Origin2000 Rackmount Owner's Guide | 007-3456-xxx |
| Onyx2 Rackmount Owner's Guide | 007-3457-xxx |
| Internal Support Tools CD | 814-0640-001 |

# Appendix B: CPU R10000 Overview

MIPS<sup>®</sup> R10000 Microprocessor Overview

### **Instruction prefetch**

Prefetching of instructions is a technique whereby the processor can request a cache block prior to the time it is actually needed. For example, assume the compiler is progressing sequentially through a segment of code. The compiler can make the assumption that this sequence will continue beyond the range of addresses available in the on-chip cache and issue a prefetch instruction which fetches the next block of instructions in the sequence and places them in the secondary cache. Therefore, when the processor requires the next sequence, the block of instructions exists in the secondary cache or a special instruction buffer, as opposed to main memory, and can be fetched by the processor at a much faster rate.
If for some reason the block of instructions is not needed, the area in the secondary cache or the buffer is simply overwritten with other instructions.

#### **Out-of-order execution**

In a typical pipelined processor which executes instructions *in-order*, each instruction depends on the previous instruction which produced its operands. Execution cannot begin until those operands become valid. If the operands required to execute a given instruction are not valid, the pipeline stalls until those operands become valid. Because instructions execute *in order*, stalls usually delay all subsequent instructions.

In an *in-order* superscalar machine where multiple instructions are fetched each cycle, several consecutive instructions can begin execution simultaneously if all of their corresponding operands are valid. However, the processor stalls at any instruction whose operands are not valid.

In an *out-of-order* superscalar machine each instruction is eligible to begin execution as soon as its operands become available, regardless of the original instruction sequence. The hardware effectively re-arranges instructions in order to keep the various execution units busy. This process is called *dynamic issuing*.

### **Queuing structures**

The R10000 Microprocessor contains three instruction queues. These queues dynamically issue instructions to the various execution units. Each queue uses instruction tags to track instructions in each execution pipeline stage. Each queue performs dynamic scheduling and can determine when the operands that each instruction needs are available. In addition, the queues determine the execution order based on the availability of the corresponding execution units. When the resources become available, the queue releases the instruction to the appropriate execution unit.
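The dynamic issuing described above can be illustrated with a small C sketch. This is not R10000 logic: it omits instruction tags, register renaming, and execution latencies, and the names `struct insn`, `issue_cycle`, `complete`, `QUEUE_SIZE`, and `NUM_REGS` are hypothetical, chosen only for this illustration. Each cycle, the queue is scanned and any instruction whose source operands are ready is issued, regardless of program order, up to the number of free execution units.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_SIZE 16  /* illustrative queue depth for this sketch */
#define NUM_REGS   8   /* hypothetical, tiny register file */

/* One queued instruction: two source registers, one destination. */
struct insn {
    int src1, src2, dst;
    bool issued;
};

/*
 * One issue cycle: scan the queue and issue up to `units` instructions
 * whose source operands are ready, regardless of program order.
 * Indices of the issued entries are stored in `issued_idx`.
 * Returns the number of instructions issued.
 */
int issue_cycle(struct insn *q, int n, const bool *ready,
                int units, int *issued_idx)
{
    int count = 0;

    assert(n <= QUEUE_SIZE);
    for (int i = 0; i < n && count < units; i++) {
        if (!q[i].issued && ready[q[i].src1] && ready[q[i].src2]) {
            q[i].issued = true;
            issued_idx[count++] = i;
        }
    }
    return count;
}

/* Assumption: results become valid at the end of the cycle they issue in. */
void complete(const struct insn *q, const int *issued_idx, int count,
              bool *ready)
{
    for (int k = 0; k < count; k++)
        ready[q[issued_idx[k]].dst] = true;
}
```

With a queue holding `r4 = r1 + r2`, `r5 = r2 + r3`, and `r6 = r5 + r3`, where `r1` is still waiting on a load, the sketch issues the second instruction first, then the third, and the first only after `r1` becomes ready. An in-order machine would have stalled all three behind the first.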
### **Integer Queue**

The integer queue contains 16 entries and issues instructions to the two integer arithmetic logic units (ALUs). Integer instructions are written into empty queue entries and up to four entries may be written each cycle. Integer instructions remain in the queue until being issued to an ALU.

### **Floating Point Queue**

The floating point queue contains 16 entries and issues instructions to the floating-point adder and floating-point multiplier execution units. Floating point instructions are written into empty queue entries and up to four entries may be written each cycle. Instructions remain in the queue until being issued to an execution unit. The floating-point queue also contains multiple-pass sequencing logic for instructions such as the multiply-add. This instruction is dispatched first to the multiply unit, then passed directly to the adder unit.

### **Address Queue**

The address queue issues instructions to the Load-Store unit and contains 16 entries. The queue is organized as a circular FIFO (first-in first-out) buffer. Instructions can be issued in any order, but must be written to or removed from the queue in sequential order. Up to four instructions can be written every cycle. The FIFO maintains the program's original instruction sequence so that memory address dependencies may be computed easily.

An issued instruction may fail to complete because of a memory dependency, a cache miss, or a resource conflict. In these cases the address queue must re-issue the instruction until it is completed.

### **Execution Units**

The R10000 Microprocessor contains five execution units which operate independently of one another.
There are two integer arithmetic logic units (ALUs), two primary floating-point units (supplemented by two secondary FP units which handle long-latency instructions such as divide and square root), and a load/store unit for address calculation.

### **Integer ALUs**

There are two integer ALUs in the R10000 microprocessor, defined as ALU1 and ALU2. Both ALUs perform standard add, subtract, and logical operations. ALU1 handles all branch and shift instructions, while ALU2 handles all multiply and divide operations using iterative algorithms.

## Floating-Point units

The R10000 Microprocessor contains two primary floating point units. The adder unit handles add operations and the multiply unit handles multiply operations. In addition, two secondary floating point units exist (not shown in block diagram) which handle long-latency operations such as divide and square root.

#### Load/Store unit and the TLB

The Load/Store unit consists of the address queue, address calculation unit, translation lookaside buffer (TLB), address stack, store buffer, and primary data cache. The Load/Store unit performs load, store, prefetch, and cache instructions.

All load or store instructions begin with a 3-cycle sequence which issues the instruction, calculates its virtual address, and translates the virtual address to physical. The address is translated only once during the operation. The data cache is accessed and the required data transfer is completed provided there was a primary data cache hit. If there is a cache miss, or if the necessary shared register ports are busy, the data cache and data cache tag access must be repeated after the data is obtained from either the secondary cache or main memory.

The Cray Origin2000 TLB contains 128 entries and translates virtual addresses to physical addresses.
The virtual address can originate from either the address calculation unit or the program counter (PC).

#### **Secondary Cache Controller**

Secondary cache support for the R10000 microprocessor is provided by an internal secondary cache controller with a dedicated secondary cache port. A dedicated 128-bit bus transfers data at the 200 MHz internal operating frequency of the R10000 CPU, yielding a maximum secondary cache data transfer rate of 3.2 GBytes/second (16 bytes per cycle × 200 million cycles per second). The R10000 microprocessor also provides a 64-bit system interface data bus.

The secondary cache is implemented as two-way set associative. Maximum cache size is 16 MBytes. Minimum cache size is 512 KBytes. Transfer width is 128 bits, or four 32-bit words. Consecutive cycles are used to transfer larger blocks of data.

## **System Interface**

The system interface of the R10000 microprocessor provides a gateway between the R10000 and its associated secondary cache, and the rest of the computer system. The system interface operates at the frequency of SysClk being supplied to the processor. The programmability of the system interface allows for clock speeds of 200, 133, 100, 80, 67, 57, and 50 MHz. All system interface outputs, as well as all inputs, are clocked on the rising edge of SysClk, allowing the system interface to run at the highest possible clock frequency.

In most microprocessor systems only one system transaction can occur at any given time. The R10000 microprocessor supports a split-transaction bus protocol. Split-transaction allows additional processor and external requests to be issued while waiting for a previous response. A maximum of four outstanding transactions at any given time is supported.

### **R10000 Branch Unit**

The branch unit of the R10000 microprocessor can decode and execute one branch instruction per cycle.
A branch bit is appended to each instruction during instruction decode. These bits are used to locate branch instructions in the instruction fetch pipeline.

The path a branch will take is predicted using a branch history RAM. This RAM keeps track of how often each particular branch was taken in the past. The history code stored in this RAM is updated whenever a final branch decision is made.

Any instruction fetched after a branch instruction is *speculative*, meaning that it is not known at the time these instructions are fetched whether or not they will be completed. The R10000 microprocessor allows up to four outstanding branch predictions, which can be resolved in any order.

Special on-chip branch stack circuitry contains an entry for each branch instruction being speculatively executed. Each entry contains the information needed to restore the processor's state if the speculative branch is predicted incorrectly. The branch stack allows the processor to restore the pipeline quickly and efficiently when a branch misprediction occurs.

### **Branch instruction problem**

All computer programs contain branch instructions. Some branches are unconditional, meaning that the program flow is always interrupted as soon as the branch instruction is executed. Other branches are conditional, meaning that the branch is taken only if certain conditions are met. Program flow interruption is inherent to all computer software, and the microprocessor hardware has little choice but to deal with branches in the most efficient way possible.

When a branch is taken, the new address at which the program is to resume may or may not reside in the secondary cache. The latency is increased depending on where the new instruction block is located. Since the access times of the main memory and secondary cache are far greater than that of the on-chip cache, as shown in the figure below, branching can often degrade processor performance.
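The branch history mechanism described under R10000 Branch Unit can be sketched in C as a table of small saturating counters indexed by the branch address. This is a generic textbook scheme offered under stated assumptions, not a description of the actual R10000 circuitry: the 2-bit counter width, the table size `BHT_ENTRIES`, the indexing function, and the names `predict_taken` and `update_history` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 512  /* assumed table size, not taken from this handbook */

/*
 * Each entry is a 2-bit saturating counter (assumed width):
 * values 0 and 1 predict "not taken", values 2 and 3 predict "taken".
 */
static uint8_t bht[BHT_ENTRIES];

/* MIPS instructions are 4 bytes, so drop the low two bits of the PC. */
static unsigned bht_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) % BHT_ENTRIES);
}

/* Predict the direction of the branch at `pc` from its recorded history. */
bool predict_taken(uint64_t pc)
{
    return bht[bht_index(pc)] >= 2;
}

/* Update the history once the final branch decision is known. */
void update_history(uint64_t pc, bool taken)
{
    uint8_t *c = &bht[bht_index(pc)];

    if (taken) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
}
```

Because the counter saturates, a branch that is almost always taken (a loop back-edge, for example) stays predicted as taken after a single exit from the loop; a one-bit "what happened last time" scheme would mispredict twice around each loop exit.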
The branching problem is further compounded in super-scalar machines where multiple instructions are fetched every cycle and progress through stages of a pipeline toward execution. At any given time, depending on the size of the pipeline, numerous instructions can be in various stages of execution. When a conditional branch instruction is fetched, it is not known until many cycles later, when the instruction is actually executed, whether or not the branch should have been taken.

Implementation of branching is an important architectural problem. To improve performance many current architectures incorporate branch prediction circuitry, which can be implemented in a number of ways.

## **Branch prediction**

Since branch instructions interrupt the pipeline flow, branch prediction schemes are needed to minimize the number of interruptions. Branches occur frequently, averaging about one out of every six instructions. In super-scalar architectures where more than one instruction at a time is fetched, branch prediction becomes increasingly important. For example, in a four-way super-scalar architecture, where four instructions per cycle are fetched, a branch instruction can be encountered every other clock.

Most branch prediction schemes use algorithms which keep track of how a conditional branch instruction behaved the last time it was executed. For example, if the branch history circuit shows that the branch was taken the last time the instruction was executed, the assumption could be made that it will be taken again. A hardware implementation of this assumption would mean that the program would vector to the new target address and that all subsequent instruction fetches would occur at the new address. The pipeline now contains a conditional branch instruction fetched from some address, and numerous instructions fetched afterward from some other address.
Therefore, all instructions fetched between the time the branch instruction is fetched and the time it is executed are said to be *speculative*. That is, it is not known at the time they are fetched whether or not they will be completed. If the branch was predicted incorrectly, the instructions in the pipeline must be aborted. Appendix B-16 22jul1998 ## region.c source file (excerpt) Appendix C-1 22jul1998 ``` #include <sys/lock.h> #include <sys/lpage.h> #include <sys/mman.h> #include <sys/mman.h> #include <sys/nodemask.h> #include <sys/par.h> #include <sys/param.h> #include <sys/param.h> #include <sys/param.h> #include <sys/param.h> #include <sys/param.h> #include *sys/param.h> #include *sys/prctl.h> #include *fetchop.h* #include *sys/rmap.h> #include <sys/sysmacros.h> #include <sys/sysmacros.h> #include <sys/sysmacros.h> #include <sys/sysmacros.h> #include <sys/systm.h> #include <sys/uio.h> #include <sys/uic.h> #include <sys/viie.h> <sys/miser_public.h> #include <sys/miser_public.h> #include <os/viinser_public.h> #include <os/viinser_public.h> #include <os/viinser_public.h> ``` # Appendix D: sbd.h ``` #indef __SYS_SBD_H__ #copyright (C) 1990, Silicon Graphics, Inc. * These coded instructions, statements, and computer programs contain are protected by Federal copyright law. They may not be disclosed to third parties or copied or duplicated in any form, in whole or in part, without the prior written consent of Silicon Graphics, Inc. * Copyright (C) 1986, MIPS Computer Systems */ * Copyright (C) 1986 AT&T */ * THIS IS UNPUBLISHED PROPRIETARY SOURCE CODE OF AT&T */ * The copyright notice above does not evidence any */ * include <sys/mips_addrspace.h> * include *sys/beast.h* * else /* R4000 || R10000 */ /* Chip definitions for R3000 and R4000 ``` ``` * constants for coprocessor 0 */ /* * Exception vectors * UT_VEC points to compatibility space in 64 bit R4000 systems. 
This is * where the architecture spec defines it to be (and is needed so things * like stack backtrace work). However, there are places in the kernel * that we use UT_VEC for cache flush calls and the cache flush routine * then tries to do IS_KSEGO(addr) which fails when it shouldn't. In these * case, KO_UT_VEC should be used. *// #define #define KO_UT_VEC KOBAS#define /* Size (bytes) of an exc. vec */ /* utlbmiss vector */ COMPAT_KOBASE KOBASE #define R_VEC (COMPAT_K1BASE+0x1fc00000) /* reset vector */ #if R4000 || R10000 XUT_VEC ECC_VEC (COMPAT_K0BASE+0x80) /* extended address tlbmiss */ (COMPAT_K0BASE+0x100) /* Ecc exception vector */ (COMPAT_K0BASE+0x180) /* Gen. exception vector */ #define #define #define E_VEC #endif #if R4000 /* 128 k */ /* 4M */ /* max r4k scache line size */ /* max r4k primary cache size */ #define #define MINCACHE MAXCACHE 0x20000 #define R4K_MAXPCACHESIZE 0x8000 #if _PAGESZ > R4K_MAXPCACHESIZE 0x8000 #if _PAGESZ > CACHECOLOPETED #else #else #define CACHECOLORSIZE (R4K_MAXPCACHESIZE/NBPP) #define CACHECOLORMASK (CACHECOLORSIZE - 1) #if _PAGESZ >= R4K_MAXPCACHESIZE #define CACHECOLORSHIFT 0 #dese #if _PAGESZ == 16384 #define CACHECOLORSHIFT 1 #else #if _PAGESZ == 4096 ``` Appendix D-0.a 22jul1998 ``` #define CACHECOLORSHIFT 3 #define CACHECOLORSHIFT 3 #else #ifdef _KERNEL <<BOMB -- need define for unanticipated page size >> #endif /* _KERNEL */ #endif /* !4096 */ #endif /* !16384 */ #endif /* !>= R4K_MAXPCACHESIZE */ #endif /* R4000 */ #if R10000 #ifndef R4000 #define MINCACHE #endif /* R4000 */ 0x80000 /* 512 k */ #ifdef R4000 #undef MAXCACHE #endif /* R4000 */ MAXCACHE 0x1000000 /* 16M */ #define R10K_MAXCACHELINESIZE 128 #define R10K_MAXPCACHESIZE 0x8000 #ifndef R4000 #if PACEST /* max r10k scache line size */ /* max r10k primary cache size */ #if _PAGESZ > R10K_MAXPCACHESIZE #define CACHECOLORSIZE 1 #define CACHECOLORSIZE (R10K_MAXPCACHESIZE/NBPP) #define CACHECOLORMASK (CACHECOLORSIZE - 1) #if _PAGESZ >= R10K_MAXPCACHESIZE #define 
CACHECOLORSHIFT 0 #else #if _PAGESZ == 16384 #define CACHECOLORSHIFT 1 #else #if _PAGESZ == 4096 #define CACHECOLORSHIFT 3 #endif #endif /* !4096 */ #endif /* !16384 */ ``` ``` #endif /* ! >= R10K_MAXPCACHESIZE */ #endif /* R4000 */ #endif /* R10000 */ * TLB size constants #if R10000 && R4000 #if R10000 && R4000 #define R10K_NTLBENTRIES 64 #define R4K_NTLBENTRIES 48 #define MAX_NTLBENTRIES R10K_NTLBENTRIES #ifdef _KERNEL #ifdef _LANGUAGE_C extern int ntlbentries; #define NTLBENTRIES ntlbentries #endif /* _LANGUAGE_C */ #endif /* _KERNEL */ #else /* R10000 && R4000 */ #if R10000 #define #endif NTLBENTRIES #if R4000 NTLBENTRIES #endif #define MAX_NTLBENTRIES NTLBENTRIES #endif /* R10000 && R4000 */ #ifdef MAPPED_KERNEL NKMAPENTRIES #define #define KMAP_INX #ifdef MH_R10000_SPECULATION_WAR #define NKMAPENTRIES #else #define NKMAPENTRIES #endif /* MH_R10000_SPECULATION_WAR */ #endif #if _PAGESZ == 4096 #define NWIREDENTRIES #endif (8 + NKMAPENTRIES) /* WAG for now */ ``` 22jul1998 Appendix D-0.c ``` #if _PAGESZ == 16384 #define NWIREDENTRIES (6 + NKMAPENTRIES) /* WAG for now */ #endif TLBWIREDBASE (1 + NKMAPENTRIES) NWIREDENTRIES TLBRANDOMBASE #define #if R10000 && R4000 #define R10K_NRANDOMENTRIES (R10K_NTLBENTRIES - NWIREDENTRIES) #define R4K_NRANDOMENTRIES (R4K_NTLBENTRIES - NWIREDENTRIES) #define MAX_NRANDOMENTRIES R10K_NRANDOMENTRIES #ifdef _KERNEL #ifdef _LANGUAGE_C extern int nrandomentries; #define NRANDOMENTRIES nrandomentries #endif /* _LANGUAGE_C */ #endif /* _KERNEL */ #else /* R10000 && R4000 */ #endif /* R10000 && R4000 */ TLBFLUSH_NONPDA TLBWIREDBASE #define TLBFLUSH_RANDOM TLBRANDOMBASE /* flush all random tlbs TLBINX_PROBE 0x80000000 #if R4000 || R10000 #if _PAGESZ == 4096 #define TLBHI_VPNSHIFT #define TLBHI_VPNMASK _S_EXT_(0xffffe000) #endif #if _PAGESZ == 16384 #define TLBHI TLBHI_VPNSHIFT TLBHI_VPNMASK _S_EXT_(0xffff8000) #define #endif #if _M #if _MIPS_SIM != #define I != _ABI64 TLBHI_VPNZEROFILL 0 #else #define TLBHI_VPNZEROFILL 
0x3fffff0000000000 ``` ``` #endif TLBHI_VPN2MASK TLBHI_VPN2SHIFT TLBHI_PIDMASK TLBHI_PIDSHIFT TLBHI_NPID #define /* As named in the arch. spec.*/ TLBHI VPNMASK (TLBHI_VPNSHIFT+1) #define #define /* 255 to fit in 8 bits */ #define 255 /* cache coherency algorithm */ #if _RUN_UNCACHED #define TLBLO_NONCOHRNT #define TLBLO_EXI TLBLO_UNCACHED TLBLO_UNCACHED TLBLO_UNCACHED TLBLO_EXL #define TLBLO_EXLWR #else /* Cacheable non-coherent */ /* Exclusive */ /* Exclusive write */ #define TLBLO_NONCOHRNT #define TLBLO_EXL 0 \times 20 #define TLBLO_EXLWR 0x28 #endif #ifdef R10000 #define TLBLO____ #endif /* R10000 */ TLBLO_D TLBLO_UNCACHED_ACC 0x38 /* Uncached Accelerated */ 0x4 0x2 0x1 /* writeable */ /* valid bit */ /* global access bit */ #define TLBLO_G /* * TLBLO Uncached attributes field. #ifdef R10000 #define TLBLO_UATTRMASK #define TLBLO_UATTRSHIFT 0xC00000000000000 #endif TLBRAND_RANDMASK 0x3f #define TLBRAND_RANDSHIFT #define TLBWIRED WIREDMASK 0x3 f ``` Appendix D-0.e 22jul1998 ``` #define TLBCTXT_BASEMASK TLBCTXT_BASESHIFT TLBCTXT_VPNMASK 0xff800000 #define 0x7ffff0 #define #define TLBCTXT_VPNNORMALIZE ! 
#define TLBCTXT_VPNSHIFT #define #ifdef R10000 TLBEXTCTXT_BASEMASK 0xffffffe000000000 TLBEXTCTXT_BASESHIFT TLBEXTCTXT_VPNMASK 37 0x7fffffff0 #define #else /* R10000 */ #define TLBEXTCTXT_BASEMASK 0xfffffffel #define TLBEXTCTXT_BASESHIFT 31 #define TLBEXTCTXT_VPNMASK 0x7ffffff0 #define TLBEXTCTXT_REGIONMASK 0x0000000180000000 #define TLBEXTCTXT_REGIONSHIFT 31 #endif /* R10000 */ 0xfffffffe00000000 #define TLBPGMASK_4K #define TLBPGMASK_16K #define TLBPGMASK_64K #define TLBPGMASK_16M #define TLBPGMASK_16M 0x0006000 0x001e000 0x07fe000 0x1ffe000 #if _PAGESZ == 4096 #define TLBPG #endif TLBPGMASK_MASK TLBPGMASK_4K #if _PAGESZ == 16384 #define TLBPGN TLBPGMASK_MASK TLBPGMASK_16K #endif #endif /* R4000 || R10000 */ * Status register #ifdef R10000 #define #ifdef R4000 SR_CUMASK 0x70000000 /* coproc usable bits */ sr_xx #define SR_CU3 ``` ``` #endif /* R4000 */ #else 0xf0000000 0x80000000 #define SR_CUMASK /* coproc usable bits */ /* Coprocessor 3 usable */ #define SR_CU3 #endif /* R10000 */ /* Enable Mips 4 inst. execution */ /* Coprocessor 2 usable */ /* Coprocessor 1 usable */ /* Coprocessor 0 usable */ SR_XX #define 0x80000000 #define SR_CU2 SR_CU1 0x40000000 0x20000000 #define SR_CU0 0×10000000 /* Diagnostic status bits */ #if R4000 || R10000 #define SR SR 0x00100000 /* soft reset occured */ /* Cache hit for last 'cache' op */ #define #ifdef R10000 0x00040000 SR_CH SR_NMI #define #endif /* R10000 */ #ifdef R4000 #define SR_CE #endif /* R4000 */ 0x00020000 /* Create ECC */ #define SR_DE #endif /* R4000 || R10000 */ 0x00010000 /* ECC of parity does not cause error */ /* TLB shutdown */ /* use boot exception vectors */ 0x00200000 *define #define SR_BEV 0x00400000 Interrupt enable bits (NOTE: bits set to 1 enable the corresponding level interrupt) */ #if IP32 /* * Moosehead has different status register bit assignments and * uses an external interrupt controller. 
Only softints and * count/compare go directly to SR, CRIME controls all other * external interrupt sources in the system. /* Interrupt mask */ /* mask level 8 */ /* mask level 7 */ /* mask level 6 */ SR_IMASK SR_IMASK8 SR_IMASK7 0x00000000 0x00008000 #define #define SR_IMASK6 0x00008400 ``` 22jul1998 Appendix D-0.g ``` /* mask level 5 */ /* mask level 4 */ /* mask level 3 */ /* mask level 2 */ /* mask level 1 */ /* mask level 0 */ SR_IMASK5 #define #define #define SR IMASK4 0 \times 000008400 SR_IMASK3 0x00008400 *define SR IMASK2 0x00008400 #define SR IMASK1 0x00008600 #define SR_IMASKO 0x00008700 #else /* Interrupt mask */ /* mask level 8 */ /* mask level 7 */ /* mask level 6 */ /* mask level 5 */ SR_IMASK SR_IMASK8 SR_IMASK7 #define 0x0000ff00 0x0000000 #define #define 0x00008000 SR_IMASK6 SR_IMASK5 0x0000c000 0x0000e000 #define #define /* mask level 4 */ /* mask level 3 */ /* mask level 2 */ SR_IMASK4 SR_IMASK3 SR_IMASK2 #define #define 0x0000f000 0x0000f800 *define 0x0000fc00 #define SR_IMASK1 #define SR_IMASK0 #endif /* IP32 */ /* mask /* mask mask level 1 mask level 0 0x0000fe00 0x0000ff00 /* bit level 8 */ /* bit level 7 */ /* bit level 6 */ /* bit level 5 */ /* bit level 4 */ /* bit level 3 */ /* bit level 2 */ /* bit level 1 */ #define 0x00008000 SR_IBIT8 #define #define SR_IBIT7 SR_IBIT6 SR_IBIT5 0x00004000 0x00002000 *define 0x00001000 SR_IBIT4 SR_IBIT3 0x00000800 #define #define 0x00000400 #define SR TRITT2 0×00000200 SR_IBIT1 0x00000100 #define #if R4000 || R10000 /* SR_RP is undefined on R10000 - should be 0 */ #ifdef R4000 SR_RP #define 0x0800000 /* enable reduced-power operation */ #endif 0x04000000 * enable additional fp registers */ #define SR FR #define SR_RE 0x02000000 /* reverse endian in user mode */ /* extended-addr TLB vec in kernel */ /* xtended-addr TLB vec supervisor */ /* xtended-addr TLB vec in user mode */ /* 2 bit mode: 00b=>k, 10b=>u */ /* 2 bit mode: 00b=>k, 10b=>u */ /* 0-->kernel 1-->supervisor */ #define SR KX 0x00000080 
0x00000040 0x00000020 #define #define SR_UX #define SR_KSU_MSK SR_KSU_USR 0x00000018 #define 0x00000010 /* 0-->kernel 1-->supervisor */ /* Error level, 1=>cache error */ #define SR KSU KS 0x00000008 0x00000004 ``` ``` 0x00000002 /* Exception level, l=>exception */ 0x00000001 /* interrupt enable, l=>enable */ SR_IE /* compat with R3000 source */ SR_KSU_MSK /* previous kernel/user mode */ /* No pagesize bits in SR */ /* No FP Debug Mode bits in SR */ /* Bits to preserve in SR */ #define SR_IE SR_IEC SR_PREVMODE #define #define #define #define SR_PAGESIZE #define SR_DM #define SR_DEFAULT #if R10000 #define SR_KERN_SET #else /* R10000 */ SR_KADDR|SR_UKADDR /*Bits to set in SR for kernel mode*/ #define SR_KERN_SET #endif /* R10000 */ SR_KADDR /* Bits to set in SR for kernel mode*/ #define SR_KERN_USRKEEP /* Bits to keep in SR from user mode*/ * SR_KADDR defines the desired state of the kernel address mode bit * in CO_SR, if such bits exist. We could actually enable SR_KX when * compiled under 32-bit compilers, though there is no real reason to do so. #if _MIPS_SIM == _ABI64 #define SR_KADDR #define SR_UXADDR SR_KX * kernel 64 bit addressing *, /* user ext. addressing and opcodes */ SR UX #else #define SR_KADDR #define SR_UXADDR /* kernel 32 bit addressing */ /* no user ext. address/opcodes */ #endif /* R4000 || R10000 */ #define SR_IMASKSHIFT #if IP32 #define SR_CRIME_INT_OFF 0xfffffbff #define SR_CRIME_INT_ON 0x00000400 #if IP20 || IP22 || IP28 || IP32 || IPMHSIM * The following value is used as a flag to indicate whether * a status register value is saved in the pda. This is for * assertions that check for nested spsemahi calls. This value * can be any bit that does not conflict with the mips processor * level nor (on the IP6 with the lio mask value). 
*/ #define OSPL_SPDBG 0x00000040 ``` Appendix D-0.i 22jul1998 ``` #endif /* IP20 || IP22 || IP28 || IP32 || IPMHSIM */ * Cause Register #define CAUSE_BD CAUSE_CEMASK CAUSE_CESHIFT /* Branch delay slot */ /* coprocessor error */ 0x80000000 #define 0×3000000 #define /* External level 8 pending */ /* External level 7 pending */ /* External level 6 pending */ /* External level 5 pending */ /* External level 4 pending */ /* External level 3 pending */ /* Software level 2 pending */ /* Software level 1 pending */ 0x00008000 #define #define CAUSE_IP7 CAUSE_IP6 0x00004000 0x00002000 CAUSE_IP5 CAUSE_IP4 #define 0x00001000 #define 0x00000800 CAUSE_IP3 CAUSE_SW2 CAUSE_SW1 #define 0x00000400 #define 0x00000200 #define 0x0000100 #define CAUSE_IPMASK CAUSE_IPSHIFT 0x0000FF00 /* Pending interrupt mask */ #define #if R4000 || R10000 #define CAUS CAUSE_EXCMASK 0x0000007C /* Cause code bits */ #endif #define CAUSE_EXCSHIFT 2 #define CAUSE_FMT "\20\40BD\36CE1\35CE0\20IP8\17IP7\16IP6\15IP5\14IP4\13IP3\12SW2\11SW1\1INT" #define setsoftclock() siron(CAUSE_SW1) #define setsoftnet() siron(CAUSE_SW2) acksoftclock() siroff(CAUSE_SW1) siroff(CAUSE_SW2) #define #define acksoftnet() /* Cause register exception codes */ #define EXC_CODE(x) /* Hardware exception codes */ EXC_CODE(0) EXC_CODE(1) EXC_CODE(2) EXC_CODE(3) /* interrupt */ /* TLB mod */ /* Read TLB Miss */ /* Write TLB Miss */ #define #define EXC_MOD #define EXC RMISS EXC_WMISS ``` ``` EXC_CODE (4) EXC_CODE (5) EXC_CODE (6) EXC_CODE (7) EXC_CODE (8) EXC_CODE (9) /* Read Address Error */ /* Write Address Error */ /* Instruction Bus Error */ /* Data Bus Error */ /* SYSCALL */ /* BREAKpoint */ #define EXC_RADE #define EXC_WADE EXC_IBE EXC_DBE EXC_SYSCALL #define #define #define EXC_SYSCAL EXC_BREAK EXC_II EXC_CPU EXC_OV #define /* Illegal Instruction */ /* CoProcessor Unusable */ /* OVerflow */ EXC_CODE(10) #define #define EXC CODE (11) #if R4000 || R10000 /* Trap exception */ /* Virt. Coherency on Inst. 
fetch */ /* Floating Point Exception */ /* Watchpoint reference */ /* Virt. Coherency on data read */ EXC_TRAP EXC_VCEI EXC_CODE(13) EXC_CODE(14) EXC_CODE(15) #define #define EXC FPE EXC_WATCH EXC_CODE (23) #define Virt. Coherency on data read */ #define EXC VCED EXC_CODE (31) #endif #if R4000 #define SEXC_EOP EXC_CODE(39) /* end-of-page trouble */ #endif #endir #ifdef _MEM_PARITY_WAR #ifdef _MEM_PARITY_WAR #define SEXC_ECC_EXCEPTION EXC_CODE(40) /* ECC/Parity error recovery */ #endif /* _MEM_PARITY_WAR */ #define SEXC_UTINTR EXC_CODE(41) /* post-interrupt uthrea /* post-interrupt uthread processing */ /* * Coprocessor 0 operations /* read ITLB entry addressed by C0_INDEX */ /* write ITLB entry addressed by C0_INDEX */ /* write ITLB entry addressed by C0_RAND */ /* probe for ITLB entry addressed by TLBHI */ /* restore for exception */ /* wait for interrupt */ #define CO_READI 0x1 CO_WRITEI 0x2 #define CO_WRITER 0x6 CO_PROBE 0x8 #define #define #define CO_RFE 0x10 #define CO_WAIT 0x20 #if R4000 ``` 22jul1998 Appendix D-0.k ``` /* Target cache */ /* specifies primary inst. cache */ /* primary data cache */ /* secondary instruction cache */ /* secondary data cache */ #define CACH_PI 0x0 CACH_PD CACH_SI CACH_SD #define 0x1 0x2 0x3 #define Cache operations /* index invalidate (inst, 2nd inst) */ /* index writeback inval (d, sd) */ /* index load tag (all) */ /* index store tag (all) */ /* create dirty exclusive (d, sd) */ /* hit invalidate (all) */ /* hit writeback inv. (d, sd) */ /* fill (i) */ /* hit writeback (i, d, sd) */ /* hit set virt. (si, sd) */ C_IINV C_IWBINV C_ILT C_IST C_CDX C_HINV 0 \times 0 #define 0x0 0x4 0x8 #define #define #define 0xc 0x10 #define #define #define #define C_HWBINV C_FILL C_HWB 0x14 0x14 0x18 #define #define #ifdef TRITON 0x1c /* Triton invalidate all (s) */ /* Triton invalidate page (s) */ #define C_INVALL #define C_INVPAGE #endif /* TRITON */ 0 \times 0 0x14 /* * CO_CONFIG register definitions */ Ox. Ox. 
/* 1 == Master-Checker enabled */ /* System Clock ratio */ /* Transmit Data Pattern */ /* Secondary cache block size */ #define 0x80000000 CONFIG CM CONFIG_EC CONFIG_EP 0x70000000 0x0f000000 #define #define CONFIG SB 0x00c00000 /* Split scache: 0 == I&D combined */ /* scache port: 0==128, 1==64 */ /* System Fort width: 0==64, 1==32 */ /* 0 -> 2nd cache present */ /* 0 -> Dirty Shared Coherency enabled*/ /* Endian-ness: 1 --> BE */ /* 1 -> ECC mode, 0 -> parity */ /* Block order:1->sequent,0->subblock */ #define CONFIG_SS CONFIG_SW CONFIG_EW 0x00200000 0x00100000 0x000c0000 #define #define CONFIG_SC CONFIG_SM 0x00020000 0x00010000 #define #define CONFIG_BE 0x00008000 CONFIG_EM 0x00004000 CONFIG_EB 0x00002000 #define #define CONFIG_IC 0x00000e00 /* Primary Icache size */ ``` ``` #define /* Primary Dcache size */ /* Icache block size */ /* Dcache block size */ /* Update on Store-conditional */ CONFIG_DC CONFIG_IB CONFIG_CU #define 0x00000020 #define #define 0x00000010 0x00000008 #define CONFIG KO 0x00000007 /* KOSEG Coherency algorithm #ifdef TRITON #define CONFIG_TR_SS #define CONFIG_TR_SE #define CONFIG_TR_SE #endif /* TRITON */ /* Triton SS (2nd cache size) */ /* Triton SC (2nd cache present) */ /* Triton SE (2nd cache enabled) (R/W) */ 0x00300000 CONFIG_SC 0x00001000 #define CONFIG_UNCACHED 0x00000002 /* K0 is uncached */ #if _RUN_UNCACHED #define CON CONFIG_UNCACHED CONFIG_UNCACHED CONFIG_UNCACHED CONFIG_NONCOHRNT CONFIG_COHRNT_EXLWR #define #define CONFIG_COHRNT_EXL #else #define CONFIG_NONCOHRNT CONFIG_COHRNT_EXLWR CONFIG_COHRNT_EXL 0x00000003 #define 0x00000005 0x00000004 #define #endif #ifdef R10000 #define CON/ #endif /* R10000 */ CONF CONFIG_UNCACHED_ACC 0x00000007 CONFIG_SB_SHFT 22 CONFIG_IC_SHFT 9 CONFIG_DC_SHFT 6 CONFIG_BE_SHFT 15 /* shift SB to bit position 0 */ /* shift IC to bit position 0 */ /* shift DC to bit position 0 */ /* shift BE to bit position 0 */ - TR *o bit position 0 */ #define #define #define #define CONFIG_IB_SHFT 5 #define 
CONFIG_DB_SHFT 4 /* shift IB to bit position 0 */ /* shift DB to bit position 0 */ #ifdef TRITON #define CONFIG_TR_SS_SHFT 20 #endif /* TRITON */ /* shift TR_SS to bit position 0 */ /* * CO_TAGLO definitions for setting/getting cache states and physaddr bits /* 31..13 -> scache paddr bits 35..17 */ /* 9..7: prim virt index bits 14..12 */ /* bits 12..10 hold scache line state */ /* invalid --> 000 == state 0 */ /* clean exclusive --> 100 == state 4 */ #define SADDRMASK 0xFFFFE000 #define SVINDEXMASK #define SSTATEMASK #define SINVALID 0 \times 000000380 0x00001c00 0x00000000 #define SCLEANEXCL 0x00001000 ``` Appendix D-0.m 22jul1998 ``` #define SDIRTYEXCL 0x00001400 /* dirty exclusive --> 101 == state 5 */ /* low 7 bits are ecc for the tag */ /* shift STagLo (31..13) to 35..17 */ #define SECC MASK 0x0000007f #define SADDR_SHIFT /* PTagLo31..8->prim paddr bits35..12 */ /* roll bits 35..12 down to 31..8 */ /* bits 7..6 hold primary line state */ /* invalid --> 000 == state 0 */ /* clean exclusive --> 10 == state 2 */ /* dirty exclusive --> 11 == state 3 */ /* low bit is parity bit (even). */ #define PADDRMASK 0xFFFFFF00 #define PADDR_SHIFT #define PSTATEMASK 0×00C0 *define PINVALID *define PCLEANEXCL *define PDIRTYEXCL 0x0000 0x0080 0x00C0 #define PPARITY_MASK * CO_CACHE_ERR definitions. /* 0: inst ref, 1: data ref */ /* 0: primary, 1: secondary */ /* 1: data error */ /* 1: tag error */ /* 1: external ref, e.g. snoop*/ /* error on SysAD bus */ /* complicated, see spec. */ /* complicated, see spec. */ CACHERR_ER #define 0x80000000 0x4000000 0x2000000 0x10000000 #define CACHERR EC CACHERR_ED CACHERR_ET *define #define #define CACHERR_ES 0x08000000 #define 0x04000000 #define CACHERR EB 0x02000000 #define #if IP19 #define CACHERR EI 0x01000000 CACHERR_EW 0x00800000 /* complicated, see spec. 
*/ #endif #define CACHERR_SIDX_MASK #define CACHERR_PIDX_MASK #define CACHERR_PIDX_SHIFT 12 #endif /* R4000 */ bits 31.3 are bits 31..3 of physaddr to watch bit 2: reserved; must be written as 0. bit 1: when set causes a watchpoint trap on load accesses to paddr. bit 0: when set traps on stores to paddr; CO WATCHHI DATCHNI bits 31..4 are reserved and must be written as zeros - R4000 bits 3..0 are bits 35..32 of the physaddr to watch - R4000 bits 31..8 are reserved and must be written as zeros - R10000 bits 3..0 are bits 39..32 of the physaddr to watch - R10000 ``` ``` #define WATCHLO_WTRAP 0x0000001 #define WATCHLO_WIRAP 0x0 #define WATCHLO_RIRAP 0x6 #define WATCHLO_ADDRMASK 0xf #define WATCHLO_VALIDMASK 0xf #if R4000 && (! defined(_NO_R4000)) #define WATCHHI_VALIDMASK 0x0 #elif R10000 0x00000002 0xfffffff8 0xfffffffb 0x0000000f #define WATCHHI_VALIDMASK 0x000000ff #endif #endif /* R4000 || R10000 */ * Coprocessor 0 registers * Some of these are r4000 Some of these are r4000 specific. #ifdef _LANGUAGE_ASSEMBLY #define CO_INX #define CO_RAND CO_TLBLO CO_CTXT CO_BADVADDR #define #define #define $8 CO_BADVAL CO_TLBHI CO_SR CO_CAUSE CO_EPC CO_PRID $10 $12 $13 #define #define *define #define $15 /* revision identifier */ #if R4000 || R10000 CO_TLBLO_0 CO_TLBLO_1 CO_PGMASK CO_TLBWIRED CO_COUNT $2 $3 $5 $6 $9 #define #define /* page mask */ /* # wired entries in tlb */ /* free-running counter */ /* counter comparison reg. */ /* hardware configuration */ /* load linked address */ /* watchpoint */ /* Extended context */ /* Scache FCC and primary re #define #define #define #define CO_COMPARE $11 CO_COMPARE CO_CONFIG CO_LLADDR CO_WATCHLO CO_WATCHHI CO_EXTCTXT $16 $17 $18 $19 $20 #define #define #define C0_ECC C0_CACHE_ERR C0_TAGLO /* S-cache ECC and primary parity */ /* cache error status */ /* cache operations */ #define #define $26 $27 #define $28 ``` 22jul1998 Appendix D-0.o ``` /* cache operations */ /* ECC error prg. 
counter */ $30 #ifdef R10000 /* Frame Mask */ /* Indices of tlb wired entries */ /* performance counter 0 */ /* performance counter 1 */ /* performance control reg 0 */ /* performance control reg 1 */ #define CO_PRFCNTO #define CO_PRFCNTO #define CO_PRFCNTI $25 #define CO_PRFCRTLO #define CO_PRFCRTLI $25 #endif /* R10000 */ $25 $25 else /* ! _LANGUAGE_ASSEMBLY CO_INX CO_RAND CO_TLBLO CO_CTXT CO_BADVADDR #define #define #define #define #define CO_TLBHI CO_SR CO_CAUSE 10 #define #define CO_EPC CO_PRID #define #define /* revision identifier */ /* page mask */ /* # wired entries in tlb */ /* free-running counter */ /* counter comparison reg. */ /* hardware configuration */ /* load linked address */ /* watchpoint */ /* Extended context */ /* cache ECC and primary parity */ /* cache operations */ /* cache operations */ /* ECC error prg. counter */ CO_CONFIG CO_LLADDR #define #define #define CO_WATCHLO 18 #define #define CO_WATCHHI CO_EXTCTXT 20 #define C0_ECC C0_CACHE_ERR #define #define CO_TACHO #define CO_TACHI #define CO_ERROR_EPC #endif /* R4000 | R10000 */ 28 ``` ``` #ifdef R10000 /* Frame Mask */ /* Indices of tlb wired entries */ /* performance counter 0 */ /* performance counter 1 */ /* performance control reg 0 */ /* performance control reg 1 */ #endif /* _LANGUAGE_ASSEMBLY */ #ifdef R10000 #include "sys/R10k.h" #endif /* R10000 */ #endif /* R4000 || R10000 */ #if _MIPS_SIM == _ABIO32 #define _S_EXT_(addr) (addr) #else #define _S_EXT_(addr) ((addr) | 0xffffffff0000000) /* CO_PRID Defines common to all cpus' */ //* * coprocessor revision identifiers #ifdef _KERNEL #ifdef _LANGUAGE_C typedef union rev_id { unsigned int ri_uint; #ifdef MIPSEB unsigned int Ri_fill:16, Ri_imp:8, Ri_majrev:4, Ri_minrev:4; /* implementation id */ /* major revision */ /* minor revision */ #endif /* MIPSEB */ #ifdef MIPSEL /* minor revision */ /* major revision */ /* implementation id */ unsigned int Ri minrev: 4. 
Ri_majrev:4, Ri_imp:8, Ri_fill:16; ``` Appendix D-0.q 22jul1998 TR-IKI rev 0.7b SGI Proprietary ``` #endif /* MIPSEL */ } Ri; rev_id_t; #define ri_imp #define ri_majrev #define ri_minrev #endif /* _LANGUAGE_C */ #endif /* _KERNEL */ Ri.Ri_imp Ri.Ri_majrev Ri.Ri_minrev #define CO_IMPMASK #define CO_IMPSHIFT #define CO_REVMASK 0xf0 #define CO_MAJREVMASK 0xf0 #define CO_MINREVMASK 0xf 0xff00 #define CO_MINREVSHIFT 0 #define CO_IMP_UNDEFINED 0x24 #define CO_IMP_R5000 0 #define CO_IMP_R5000 0 #define CO_IMP_R4550 0 #define CO_IMP_R4700 0 #define CO_IMP_R4700 0 #define CO_IMP_R4700 0 #define CO_IMP_R8000 0 #define CO_IMP_R10000 0 #define CO_IMP_R10000 0 #define CO_IMP_R6000 0 #define CO_IMP_R400 0x04 #define CO_IMP_R400 0x04 #define CO_IMP_R4000 0x04 #define CO_IMP_R3000 0 #define CO_IMP_R3000 0x03 #define CO_MAJREVMIN_R3000 0x02 #define CO_IMP_R3000 0x02 #define CO_IMP_R2000A 0x02 #define CO_IMP_R2000A 0x02 #define C0_IMP_UNDEFINED 0x24 0x23 CO_IMP_R5000 0x22 0x21 0x20 0x10 0x0e 0x06 0x04 0 \times 02 #define C0_IMP_R2000A 0: #define C0_MAJREVMIN_R2000A 0x01 #define C0_IMP_R2000 0: 0x02 0×01 * Defines for the CO_PGMASK register. ``` Appendix D-0.r 22jul1998 22jul1998 Appendix D-0.s | | Appendix E: kldir.h header file - has map of kernel in low memory | |-----------------------------------------|-------------------------------------------------------------------| | . ************************************* | | | | | | and the short | | | | | | | | | | | | | | | | | | | | | all of the second | | | PC 1001 - | | | | | | colorer one | | | | | | | | | or decreasing | | | e de la constante | | | | | | 4. 

# kldir.h header file (excerpt)

**MEMORY MAP PER NODE**

| Address | Region below this address |
|---|---|
| 0x2000000 (32M) | IO6 BUFFERS FOR FLASH ENET IOC3 |
| 0x1F80000 (31.5M) | IO6 TEXT/DATA/BSS/stack |
| 0x1C00000 (30M) | IO6 PROM DEBUG TEXT/DATA/BSS/stack |
| 0x0800000 (28M) | IP27 PROM TEXT/DATA/BSS/stack |
| 0x1B00000 (27M) | IP27 CFG |
| 0x1A00000 (26M) | Graphics PROM |
| 0x1800000 (24M) | 3rd Party PROM drivers |
| 0x1600000 (22M) | Free |
| 0.400000 (0)4 | UNIX DEBUG Version |
| 0x190000 (2M) | SYMMON (For UNIX Debug only) |
| 0x34000 (208K) | SYMMON STACK [NUM_CPU_PER_NODE] (For UNIX Debug only) |
| 0x25000 (148K) | KLCONFIG - II (temp) |
| | UNIX NON-DEBUG Version |
| 0x19000 (100K) | |

The lower portion of the memory map contains information that is permanent and is used by the IP27PROM, IO6PROM and IRIX.

| Address | Region below this address |
|---|---|
| 0x19000 (100K) | PI Error Spools (32K) |
| 0x12000 (72K) | Unused |
| 0x11c00 (71K) | CPU 1 NMI Eframe area |
| 0x11a00 (70.5K) | CPU 0 NMI Eframe area |
| 0x11800 (70K) | CPU 1 NMI Register save area |
| 0x11600 (69.5K) | CPU 0 NMI Register save area |
| 0x11400 (69K) | GDA (1k) |
| 0x11000 (68K) | Early cache Exception stack and/or kernel/io6prom nmi registers |
| 0x10800 (66k) | cache error eframe |
| 0x10400 (65K) | Exception Handlers (UALIAS copy) |
| 0x10000 (64K) | KLCONFIG - I (permanent) (48K) |
| 0x4000 (16K) | NMI Handler (Protected Page) |
| 0x3000 (12K) | ARCS PVECTORS (master node only) |
| 0x2c00 (11K) | ARCS TVECTORS (master node only) |
| 0x2800 (10K) | LAUNCH [NUM_CPU] |
| 0x2400 (9K) | Low memory directory (KLDIR) |
| 0x2000 (8K) | ARCS SPB (1K) |
| 0x1000 (4K) | Early cache Exception stack and/or kernel/io6prom nmi registers |
| 0x800 (2k) | cache error eframe |
| 0x400 (1K) | Exception Handlers |
| 0x0 (0K) | |

# Appendix F: IRIX 6.5 Kernel Values

## Unit covers:

- List of kernel system values and constants extracted by icrash
- Examples of 32 and 64 bit (IRIX 6.5) systems
- Short definition of values
- Sample output of training icrash "kerninfo" command

## **Kernel Value Table**

The following table displays key constant and define values for the IRIX 6.5 kernel (Beta).
Please consider this table a work in progress; other information will be added at a later date.

## **Column Meanings**

- **icrash Name**: Name of the value in icrash. NOTE: assume the prefix "K\_" before each name when looking for the actual name in icrash.
- **icrash Structure**: Name of the structure in icrash where the value is defined/stored.
- **Indy Live System**: Sample values from an Indy workstation running IRIX 6.5 (Beta).
- **O2000 Dump**: Sample values from an O2000 dump of IRIX 6.5 beta (called IRIX64).
- **O2000 Live System**: Sample values from a recent (April 1998) O2000 (flurry) running IRIX 6.5 beta (called IRIX64).
- **Description**: Short description of the field's meaning/usage. Actual kernel name equivalents are specified as "kernel: name".

### **Kernel Value Table**

| icrash Name (K_xxxxxxxxxx) | icrash Structure Name | Indy Live System (32 bit machine) | O2000 Dump (64 bit machine) | O2000 Live System (64 bit machine) | Description |
|---|---|---|---|---|---|
| ACTIVEFILES | global_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | Kernel: activefiles, linked list of active files (DEBUG only) |
| BLOCK_ALLOC | callback_s | 0x10014b40 (268520256) | 0x10014b70 (268520304) | 0x10014b70 (268520304) | |
| BLOCK_FREE | callback_s | 0x10014ba8 (268520360) | 0x10014be0 (268520416) | 0x10014be0 (268520416) | |
| CORE_TYPE | coreinfo_s | /dev/kmem | corefile | /dev/kmem | Core type:<br>dump="corefile"<br>live sys="/dev/kmem" |
| COREFILE | coreinfo_s | /dev/mem | vmcore.15.comp | /dev/mem | Core (file) name |
| CORE_FD | coreinfo_s | 0x3 (3) | 0x3 (3) | 0x4 (4) | |
| DEFKTHREAD | global_s | 0x0 (0) | 0xa800000103897000 | 0x0 (0) | "Current" default kernel thread |
| DUMPCPU | global_s | 0x0 (0) | 0x4 (4) | 0x0 (0) | CPU that executed dumpsys() |
| DUMPKTHREAD | global_s | 0x0 (0) | 0xa800000103897000 | 0x0 (0) | Kernel thread that executed dumpsys() |
| DUMPPROC | global_s | 0x0 (0) | 0xa80000010261b000 | 0x0 (0) | If dump, pointer to proc that executed dumpsys() |
| DUMPREGS | global_s | 0x0 (0) | 0x10e627d8 (283518936) | 0x0 (0) | Kernel:dumpregs<br>Area where PANIC registers are saved |
| DUMP_HDR | coreinfo_s | (null) | CrshDump | (null) | Header from dump corefile |
| END | sysinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | |
| ERROR_DUMPBUF | global_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | Kernel:error_dumpbuf (EVEREST only) |
| EXTSTKIDX | kerninfo_s | 0x1 (1) | 0x0 (0) | 0x0 (0) | Extend stack (yes/no)<br>32 bit kernels only |
| EXTUSIZE | kerninfo_s | 0x1 (1) | 0x0 (0) | 0x0 (0) | Extend stack index<br>32 bit kernels only |
| FLAGS | global_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | |
| HWGRAPH | global_s | 0x10ab1000 (279646208) | 0x10e77000 (283602944) | 0x10dff000 (283111424) | |
| HWGRAPHP | global_s | 0xc0002000 | 0xa80000000055a000 | 0xa800000001370000 | |
| ICRASHDEF | coreinfo_s | (null) | (null) | (null) | icrash definition file, if specified on the command line |
| IP | sysinfo_s | 0x16 (22) | 0x1b (27) | 0x1b (27) | Processor IPxx |
| IRIX_REV | kerninfo_s | 0x5 (5) | 0x5 (5) | 0x5 (5) | IRIX revision:<br>IRIX6_x |
| K0BASE | kerninfo_s | 0x80000000 | 0xa800000000000000 | 0xa8000000000000000 | Base of kernel K0 memory |
| K0SIZE | kerninfo_s | 0x20000000 (536870912) | 0x1000000000 | 0x1000000000 | Size of kernel K0 memory |
| K1BASE | kerninfo_s | 0xa0000000 | 0x9600000000000000 | 0x9600000000000000 | Base of kernel K1 memory |
| K1SIZE | kerninfo_s | 0x20000000 (536870912) | 0x1000000000 | 0x1000000000 | Size of kernel K1 memory |
| K2BASE | kerninfo_s | 0xc0000000 | 0xc000000000000000 | 0xc000000000000000 | Base of kernel K2 memory |
| K2SIZE | kerninfo_s | 0x20000000 (536870912) | 0xfff80000000 | 0xfff80000000 | Size of kernel K2 memory |
| KERNELSTACK | kerninfo_s | 0xffffc000 | 0xfffffffffff8000 | 0×11111111111x0 | End of stack "page"<br>start here for stack traces |
| KERNSTACK | kerninfo_s | 0xffffd000 | 0xffffffffffc000 | 0xffffffffffc000 | Base of kernel stack area (mapped on all CPUs) |
| KEXTSTACK | kerninfo_s | 0xffffb000 | 0x0 (0) | 0x0 (0) | Extend stack amount<br>32 bit kernels only |
| KPTBL | global_s | 0x8838c000 | 0xa800000000540000 | 0xa80000000071c000 | Kernel:kptbl<br>Base of kernel page (PDE) table |
| KPTBLP | global_s | 0x0 (0) | 0x10e79008 (283611144) | 0x0 (0) | icrash (local) copy of kptbl (dump only) |
| KPTEBASE | kerninfo_s | 00000811x0 | 0xc0000fc000000000 | 0xc0000fc000000000 | Base of segment table for regular and sproc processes (threads) |
| KPTE_SHDUBASE | kerninfo_s | 0xfffffffeff800000 | 0xc00007c0000000000 | 0xc00007c000000000 | Base of user segment (pde) table |
| KPTE_USIZE | kerninfo_s | 0x200000000 | 0x8000000000 | 0x80000000000 | Size of segment (pde) table for user processes |
| KSTKIDX | kerninfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | |
| KUBASE | kerninfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | Lower limit of user address space |
| KUSIZE | kerninfo_s | 0x80000000 | 0x10000000000 | 0x1000000000 | Upper limit of user address space (size) |
| LBOLT | global_s | 0x88337af4 | 0xc00000000144d284 | 0xc0000000145d80c | Kernel:lbolt<br>time in HZ since last boot |
| MAPPED_PAGE_SIZE | kerninfo_s | 0x0 (0) | 0x1000000 (16777216) | 0x1000000 (16777216) | Page size used to map the mapped kernel |
| MAPPED_RO_BASE | kerninfo_s | 0x0 (0) | 0xc000000000000000 | 0xc00000000000000 | Kernel K2 memory base (K2BASE),<br>read-only portion |
| MAPPED_RW_BASE | kerninfo_s | 0x0 (0) | 0xc000000001000000 | 0xc00000001000000 | Kernel K2 memory base of read-write portion |
| MASTER_NASID | sysinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | Node ID of "master" node |
| MAXCPUS | sysinfo_s | 0x1 (1) | 0x6 (6) | 0x80 (128) | Number of CPUs on the system |
| MAXNODES | sysinfo_s | 0x1 (1) | 0x3 (3) | 0x40 (64) | Number of nodes on the system |
| MAXPFN | kerninfo_s | 0x1ffff (131071) | 0x3ffffff (67108863) | 0x3ffffff (67108863) | Highest PFN (page frame number) on the system (physmem>>pnumshift) |
| MAXPHYS | kerninfo_s | 0x1fffffff (536870911) | 0xfffffffff | 0xfffffffff | Maximum physical (portion of) address; same as K_TO_PHYS_MASK |
| MEMBER_BASEVAL | callback_s | 0x10014ab0 (268520112) | 0x10014ae0 (268520160) | 0x10014ae0 (268520160) | |
| MEMBER_BITLEN | callback_s | 0x10014a88 (268520072) | 0x10014ab0 (268520112) | 0x10014ab0 (268520112) | |
| MEMBER_OFFSET | callback_s | 0x10014980 (268519808) | 0x100149a0 (268519840) | 0x100149a0 (268519840) | |
| MEMBER_SIZE | callback_s | 0x10014a00 (268519936) | 0x10014a20 (268519968) | 0x10014a20 (268519968) | |
| MEM_PER_BANK | sysinfo_s | 0x0 (0) | 0x20000000 (536870912) | 0x20000000 (536870912) | Memory per node divided by banks per node (4 or 8) |
| MEM_PER_NODE | sysinfo_s | 0x0 (0) | 0x100000000 | 0x100000000 | (1 << NASID_SHIFT (below)) |
| MEM_PER_SLOT | sysinfo_s | 0x0 (0) | 0x8000000 (134217728) | 0x8000000 (134217728) | |
| MLINFOLIST | global_s | 0x885dc2a0 | 0xa80000010084efa0 | 0xa8000008009240c0 | |
| NAMELIST | coreinfo_s | /unix | /unix.15 | /unix | |
| NASID_BITMASK | sysinfo_s | 0x0 (0) | 0xff (255) | 0xff (255) | |
| NASID_SHIFT | sysinfo_s | 0x0 (0) | 0x20 (32) | 0x20 (32) | Node ID (NASID) shift amount; position of NASID in addr |
| NBPC | kerninfo_s | 0x1000 (4096) | 0x4000 (16384) | 0x4000 (16384) | Number of bytes per "click"<br>(_PAGESZ) |
| NBPS | kerninfo_s | 0x400000 (4194304) | 0x2000000 (33554432) | 0x2000000 (33554432) | Number of bytes per segment<br>(Bytes.per.page * 
pages.per.segment) | |------------------|------------|---------------------------|----------------------|----------------------|-------------------------------------------------------------------| | NBPW | kerninfo_s | 0x4 (4) | 0x8 (8) | 0x8 (8) | Number bytes per word<br>(32/8) or (64/8) | | NCPS | kerninfo_s | 0x400 (1024) | 0x800 (2048) | 0x800 (2048) | Number clicks per segment<br>Bytes.per.click / PDE size(8)) | | NODEPDAINDR | global_s | 0x88338010 | 0xc0000000014cf9d0 | 0xc0000000014bd288 | Kernel:Nodepdaindr Array of addresses to node private data areass | | NPROCS | kerninfo_s | 0x1a8 (424) | 0x528 (1320) | 0x5688 (22152) | Kernel:v_proc Maximum number of user processes | | NTLBENTRIES | sysinfo_s | 0x0 (0) | 0x40 (64) | 0x0 (0) | CPU TLB size based on IP number dump only | | NUMCPUS | sysinfo_s | 0x1 (1) | 0x6 (6) | 0x80 (128) | Number CPUs in system | | NUMNODES | sysinfo_s | 0x1 (1) | 0x3 (3) | 0x40 (64) | Number nodes in system | | PAGESZ | kerninfo_s | 0x1000 (4096) | 0x4000 (16384) | 0x4000 (16384) | Working pages size in system (_PAGESZ) | | PANIC_TYPE | coreinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | Panic type "1" means NMI | | PARCELS_PER_SLOT | sysinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | | | PARCEL_BITMASK | sysinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | | | PARCEL_SHIFT | sysinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | | | PDAINDR | global_s | 0x88337488 | 0xc00000000144e5a0 | 0xc00000000145eaa0 | Kernel: pdaindr Array pointing to CPU PDA (private data area) | | PDE_PG_CC | pdeinfo_s | 0x38 (56) | 0x38 (56) | 0x38 (56) | PDE CC bit mask | | PDE_PG_D | pdeinfo_s | 0x80000000 | 0x200000000 | 0x200000000 | PDE D bit mask | | PDE_PG_EOP | pdeinfo_s | 0x4000000 (67108864) | 0x0 (0) | 0x0 (0) | PDE EndOfPage bit mask | | PDE_PG_G | pdeinfo_s | 0x1 (1) | 0x1 (1) | 0x1 (1) | PDE G bit mask | | PDE_PG_M | pdeinfo_s | 0x4 (4) | 0x4 (4) | 0x4 (4) | PDE M bit mask | | PDE_PG_N | pdeinfo_s | 0x10 (16) | 0x28 (40) | 0x28 (40) | PDE N bit mask | | PDE_PG_NR | pdeinfo_s | 
0x18000000<br>(402653184) | 0x40000000 | 0x40000000 | PDE NR bit mask | 22jul1998 Appendix F-4.c | PDE_PG_SV | pdeinfo_s | 0x40000000<br>(1073741824) | 0x100000000 | 0x100000000 | PDE SV bit mask | |--------------|------------|----------------------------|------------------------|------------------------|-------------------------------------------------------------------------------------------------------------------------| | PDE_PG_VR | pdeinfo_s | 0x2 (2) | 0x2 (2) | 0x2 (2) | PDE VR bit mask | | PFDAT | global_s | 0x8827c128 | 0x0 (0) | 0x0 (0) | Kernel: pfdat<br>non-NUMA: pfdat table address<br>NUMA: see Kernel p_nodepda in PDA for<br>node resident pdat addresses | | PFN_MASK | pdeinfo_s | 0x3ffffc0 (67108800) | 0xffffff00 | 0xffffff00 | Mask to extract PFN from address | | PFN_SHIFT | pdeinfo_s | 0x6 (6) | 0x8 (8) | 0x8 (8) | PFN shift amount, rightmost bit in address | | PG_CC_SHIFT | pdeinfo_s | 0x3 (3) | 0x3 (3) | 0x3 (3) | PDE CC shift amount | | PG_D_SHIFT | pdeinfo_s | 0x1f (31) | 0x21 (33) | 0x21 (33) | PDE D shift amount | | PG_EOP_SHIFT | pdeinfo_s | 0x1a (26) | 0xffffffffffff | 0xfffffffffffff | PDE EndOfPage bit shift amount | | PG_G_SHIFT | pdeinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | PDE G bit shift amount | | PG_M_SHIFT | pdeinfo_s | 0x2 (2) | 0x2 (2) | 0x2 (2) | PDE M bit shift amount | | PG_NR_SHIFT | pdeinfo_s | 0x1b (27) | 0x22 (34) | 0x22 (34) | PDE NR bit shift amount | | PG_N_SHIFT | pdeinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | PDE N bit shift amount | | PG_SV_SHIFT | pdeinfo_s | 0x1e (30) | 0x20 (32) | 0x20 (32) | PDE SV bit shift amount | | PG_VR_SHIFT | pdeinfo_s | 0x1 (1) | 0x1 (1) | 0x1 (1) | PDE VR bit shift amount | | PHYSMEM | sysinfo_s | 0x6000 (24576) | 0x5000 (20480) | 0x23c000 (2342912) | Kernel: physmem<br>Physical memory size in pages | | PIDACTIVE | global_s | 0x88338090 | 0xc0000000014d0788 | 0xc0000000014be040 | Kernel: pidactive<br>list of active pids (processes) | | PIDTAB | global_s | 0xc0026000 | 0xa800000000f14000 | 
0xc000000003ab4000 | Kernel: pidtab<br>pid table array address | | PIDTABSZ | global_s | 0x1a8 (424) | 0x528 (1320) | 0x5688 (22152) | Kernel: pidtabsz (v_proc)<br>size of pidtab | | PID_BASE | global_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | Kernel: pid_base<br>CELL only | | PNUMSHIFT | kerninfo_s | 0xc (12) | 0xe (14) | 0xe (14) | PNUM shift amount,<br>right bit of PNUM in address | | PRINT_ERROR | callback_s | 0x1009e678<br>(269084280) | 0x1009f760 (269088608) | 0x1009f760 (269088608) | | | PROGRAM | global s | icrash | icrash/icrash | icrash | icrash called as (or fru) | | PTRSZ | kerninfo_s | 0x20 (32) | 0x40 (64) | 0x40 (64) | Size of pointer in bits<br>used to determine word size of kernel | |----------------|------------|---------------------------|------------------------|------------------------|-----------------------------------------------------------------------------------------------------------------------| | PUTBUF | global_s | 0x8839bc00 | 0xa80000000051b000 | 0xa8000000005c2c00 | Kernel: putbuf address of kernel put (console message) buffer | | RAM_OFFSET | kerninfo_s | 0x8000000<br>(134217728) | 0x0 (0) | 0x0 (0) | Kernel: _physmem_start | | REGSZ | kerninfo_s | 0x8 (8) | 0x8 (8) | 0x8 (8) | Register size in bytes (always 8) | | RW_FLAG | coreinfo_s | 0x2 (2) | 0x0 (0) | 0x2 (2) | icrash core file read-write flag<br>core dump always read-only | | SLOTS_PER_NODE | sysinfo_s | 0x0 (0) | 0x20 (32) | 0x20 (32) | Kernel: slots_per_node<br>lots per node (SN only) | | SLOT_BITMASK | sysinfo_s | 0x0 (0) | 0x1f (31) | 0x1f (31) | Kernel: slot_bitmask<br>extract slot number from address (SN only) | | SLOT_SHIFT | sysinfo_s | 0x0 (0) | 0x1b (27) | 0x1b (27) | Kernel: slot_shift<br>slot shift amount, right bit of slot number in<br>address(SN only) | | STHREADLIST | global_s | 0x8835e710 | 0xc0000000014cd370 | 0xc0000000014bab08 | Kernel: sthreadlist service thread linked list | | STRST | global_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | Kernel: strst<br>STREAMS statistics 
structure | | STRUCT_LEN | callback_s | 0x10014908<br>(268519688) | 0x10014920 (268519712) | 0x10014920 (268519712) | | | SYM_ADDR | callback_s | 0x10014870<br>(268519536) | 0x10014880 (268519552) | 0x10014880 (268519552) | | | SYSMEMSIZE | sysinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | ? (unset) | | SYSSEGSZ | kerninfo_s | 0x3800 (14336) | 0x2800 (10240) | 0x11e000 (1171456) | Kernel: syssegsz<br>tuning variable "syssegsz" specifies the size<br>os the K2 (dynamically allocated) memory<br>pool | | SYSTEMSIZE | sysinfo_s | 0x0 (0) | 0x0 (0) | 0x0 (0) | ? (unset) | | ТІМЕ | global_s | 0x88337af0 | 0xc00000000144d280 | 0xc00000000145d808 | Kernel: time<br>time in seconds since 1970 | | TLBDUMPSIZE | sysinfo_s | 0x0 (0) | 0x608 (1544) | 0x0 (0) | Size of TLB table in dump (dump only) | Appendix F-4.e 22jul1998 | TLBENTRYSZ | sysinfo_s | 0x0 (0) | 0x18 (24) | 0x0 (0) | Size of a TLB entry in the dump (dump only) | |--------------|------------|------------------------|--------------------|-------------------|-----------------------------------------------------| | TO_PHYS_MASK | kerninfo_s | 0x1fffffff (536870911) | 0xfffffffff | 0xfffffffff | Mask to exttract physical part of address | | UPGIDX | kerninfo_s | Oxfffffffffffff | 0xffffffffffffff | 0xfffffffffffff | -1 (not used) | | USIZE | kerninfo_s | 0x1 (1) | 0x1 (1) | 0x1 (1) | Number of pages in the kernel stack | | UTSNAME | sysinfo_s | IRIX | IRIX64 | IRIX64 | Kernel: utsname<br>from name.c | | XTHREADLIST | global_s | 0x883575d8 | 0xc000000001482a70 | 0xc000000014b6060 | Kernel: xthreadlist<br>Kernel interrupt thread list | ## Sample "kerninfo" output Code for the "kerninfo" command was added to icrash as an instrument of study in order to produce this table. The code for this command is not (yet) available in supported icrash, but will be made available from training upon request. The following icrash output as used to create the table above. 
#### Live Indy Workstation (IRIX 6.5 beta)

```
>> kerninfo
icrash coreinfo_s data fields
CORE TYPE: /dev/kmem
PANIC_TYPE: 0x0 (0)
COREFILE: /dev/mem
NAMELIST: /unix
ICRASHDEF: (null)
CORE_FD: 0x3 (3)
DUMP_HDR: (null)
RW_FLAG: 0x2 (2)

icrash sysinfo_s data fields
UTSNAME: IRIX
IP: 0x16 (22)
PHYSMEM: 0x6000 (24576)
NUMCPUS: 0x1 (1)
MAXCPUS: 0x1 (1)
NTLBENTRIES: 0x0 (0)
TLBENTRYSZ: 0x0 (0)
TLBDUMPSIZE: 0x0 (0)
NUMNODES: 0x1 (1)
MAXNODES: 0x1 (1)
MASTER_NASID: 0x0 (0)
NASID_SHIFT: 0x0 (0)
SLOT_SHIFT: 0x0 (0)
PARCEL_SHIFT: 0x0 (0)
PARCEL_BITMASK: 0x0 (0)
PARCELS_PER_SLOT: 0x0 (0)
SLOTS_PER_NODE: 0x0 (0)
MEM_PER_SLOT: 0x0 (0)
MEM_PER_BANK: 0x0 (0)
NASID_BITMASK: 0x0 (0)
SLOT_BITMASK: 0x0 (0)
SYSTEMSIZE: 0x0 (0)
SYSMEMSIZE: 0x0 (0)
END: 0x0 (0)

icrash kerninfo_s data fields
IRIX_REV: 0x5 (5)
SYSSEGSZ: 0x3800 (14336)
NPROCS: 0x1a8 (424)
PTRSZ: 0x20 (32)
REGSZ: 0x8 (8)
PAGESZ: 0x1000 (4096)
NBPW: 0x4 (4)
NBPC: 0x1000 (4096)
NCPS: 0x400 (1024)
NBPS: 0x400000 (4194304)
PNUMSHIFT: 0xc (12)
TO_PHYS_MASK: 0x1fffffff (536870911)
RAM_OFFSET: 0x8000000 (134217728)
MAXPFN: 0x1ffff (131071)
MAXPHYS: 0x1fffffff (536870911)
USIZE: 0x1 (1)
EXTUSIZE: 0x1 (1)
UPGIDX: 0xfffffffffffffff
KSTKIDX: 0x0 (0)
EXTSTKIDX: 0x1 (1)
KUBASE: 0x0 (0)
KUSIZE: 0x80000000
K0BASE: 0x80000000
K0SIZE: 0x20000000 (536870912)
K1BASE: 0xa0000000
K1SIZE: 0x20000000 (536870912)
K2BASE: 0xc0000000
K2SIZE: 0x20000000 (536870912)
KERNSTACK: 0xffffd000
KEXTSTACK: 0xffffb000
KPTE_USIZE: 0x20000000
KPTE_SHDUBASE: 0xfffffffeff800000
MAPPED_RO_BASE: 0x0 (0)
MAPPED_RW_BASE: 0x0 (0)
MAPPED_PAGE_SIZE: 0x0 (0)

icrash pdeinfo_s data fields
PFN_MASK: 0x3ffffc0 (67108800)
PFN_SHIFT: 0x6 (6)
PDE_PG_VR: 0x2 (2)
PG_VR_SHIFT: 0x1 (1)
PDE_PG_G: 0x1 (1)
PG_G_SHIFT: 0x0 (0)
PDE_PG_M: 0x4 (4)
PG_M_SHIFT: 0x2 (2)
PDE_PG_N: 0x10 (16)
PG_N_SHIFT: 0x0 (0)
PDE_PG_SV: 0x40000000 (1073741824)
PG_SV_SHIFT: 0x1e (30)
PDE_PG_D: 0x80000000
PG_D_SHIFT: 0x1f (31)
PDE_PG_EOP: 0x4000000 (67108864)
PG_EOP_SHIFT: 0x1a (26)
PDE_PG_NR: 0x18000000 (402653184)
PG_NR_SHIFT: 0x1b (27)
PDE_PG_CC: 0x38 (56)
PG_CC_SHIFT: 0x3 (3)

icrash callback_s data fields
SYM_ADDR: 0x10014870 (268519536)
STRUCT_LEN: 0x10014908 (268519688)
MEMBER_OFFSET: 0x10014980 (268519808)
MEMBER_SIZE: 0x10014a00 (268519936)
MEMBER_BITLEN: 0x10014a88 (268520072)
MEMBER_BASEVAL: 0x10014ab0 (268520112)
BLOCK_ALLOC: 0x10014b40 (268520256)
BLOCK_FREE: 0x10014ba8 (268520360)
PRINT_ERROR: 0x1009e678 (269084280)

icrash global_s data fields
PROGRAM: icrash
FLAGS: 0x0 (0)
DUMPCPU: 0x0 (0)
DUMPKTHREAD: 0x0 (0)
DUMPPROC: 0x0 (0)
ERROR_DUMPBUF: 0x0 (0)
KPTBL: 0x8838c000
LBOLT: 0x88337af4
MLINFOLIST: 0x885dc2a0
NODEPDAINDR: 0x88338010
PDAINDR: 0x88337488
PIDACTIVE: 0x88338090
PIDTAB: 0xc0026000
PFDAT: 0x8827c128
PUTBUF: 0x8839bc00
STHREADLIST: 0x8835e710
STRST: 0x0 (0)
TIME: 0x88337af0
XTHREADLIST: 0x883575d8
PIDTABSZ: 0x1a8 (424)
PID_BASE: 0x0 (0)
DUMPREGS: 0x0 (0)
KPTBLP: 0x0 (0)
HWGRAPH: 0x10ab1000 (279646208)
```

#### O2000 system dump (IRIX 6.5 beta)

```
>> kerninfo
icrash coreinfo_s data fields
CORE TYPE: corefile
PANIC_TYPE: 0x0 (0)
COREFILE: /ptmp/mix/samples/vmcore.15.comp
NAMELIST: /ptmp/mix/samples/unix.15
ICRASHDEF: (null)
CORE_FD: 0x3 (3)
DUMP_HDR: CrshDump
RW_FLAG: 0x0 (0)

icrash sysinfo_s data fields
UTSNAME: IRIX64
IP: 0x1b (27)
PHYSMEM: 0x5000 (20480)
NUMCPUS: 0x6 (6)
MAXCPUS: 0x6 (6)
NTLBENTRIES: 0x40 (64)
TLBENTRYSZ: 0x18 (24)
TLBDUMPSIZE: 0x608 (1544)
NUMNODES: 0x3 (3)
MAXNODES: 0x3 (3)
MASTER_NASID: 0x0 (0)
NASID_SHIFT: 0x20 (32)
SLOT_SHIFT: 0x1b (27)
PARCEL_SHIFT: 0x0 (0)
PARCEL_BITMASK: 0x0 (0)
PARCELS_PER_SLOT: 0x0 (0)
SLOTS_PER_NODE: 0x20 (32)
MEM_PER_SLOT: 0x8000000 (134217728)
MEM_PER_BANK: 0x20000000 (536870912)
NASID_BITMASK: 0xff (255)
SLOT_BITMASK: 0x1f (31)
SYSTEMSIZE: 0x0 (0)
END: 0x0 (0)

icrash kerninfo_s data fields
```

```
PDAINDR: 0xc00000000144e5a0
PIDACTIVE: 0xc0000000014d0788
PIDTAB: 0xa800000000f14000
PFDAT: 0x0 (0)
PUTBUF: 0xa80000000051b000
STHREADLIST: 0xc0000000014cd370
STRST: 0x0 (0)
TIME: 0xc00000000144d280
XTHREADLIST: 0xc000000001482a70
PIDTABSZ: 0x528 (1320)
PID_BASE: 0x0 (0)
DUMPREGS: 0x10e627d8 (283518936)
KPTBLP: 0x10e79008 (283611144)
HWGRAPH: 0x10e77000 (283602944)
```

#### O2000 live system (flurry; IRIX 6.5 beta)

```
>> kerninfo
icrash coreinfo_s data fields
CORE TYPE: /dev/kmem
PANIC_TYPE: 0x0 (0)
COREFILE: /dev/mem
NAMELIST: /unix
ICRASHDEF: (null)
CORE_FD: 0x4 (4)
DUMP_HDR: (null)
RW_FLAG: 0x2 (2)

icrash sysinfo_s data fields
UTSNAME: IRIX64
IP: 0x1b (27)
PHYSMEM: 0x23c000 (2342912)
NUMCPUS: 0x80 (128)
MAXCPUS: 0x80 (128)
NTLBENTRIES: 0x0 (0)
TLBENTRYSZ: 0x0 (0)
TLBDUMPSIZE: 0x0 (0)
NUMNODES: 0x40 (64)
MAXNODES: 0x40 (64)
MASTER_NASID: 0x0 (0)
NASID_SHIFT: 0x20 (32)
SLOT_SHIFT: 0x1b (27)
PARCEL_SHIFT: 0x0 (0)
PARCEL_BITMASK: 0x0 (0)
PARCELS_PER_SLOT: 0x0 (0)
SLOTS_PER_NODE: 0x20 (32)
MEM_PER_NODE: 0x100000000
MEM_PER_SLOT: 0x8000000 (134217728)
MEM_PER_BANK: 0x20000000 (536870912)
NASID_BITMASK: 0xff (255)
SLOT_BITMASK: 0x1f (31)
SYSTEMSIZE: 0x0 (0)
SYSMEMSIZE: 0x0 (0)
END: 0x0 (0)

icrash kerninfo_s data fields
```

```
PDE_PG_G: 0x1 (1)
PG_G_SHIFT: 0x0 (0)
PDE_PG_M: 0x4 (4)
PG_M_SHIFT: 0x2 (2)
PDE_PG_N: 0x28 (40)
PG_N_SHIFT: 0x0 (0)
PDE_PG_SV: 0x100000000
PG_SV_SHIFT: 0x20 (32)
PDE_PG_D: 0x200000000
PG_D_SHIFT: 0x21 (33)
PDE_PG_EOP: 0x0 (0)
PG_EOP_SHIFT: 0xfffffffffffff
PDE_PG_NR: 0x400000000
PG_NR_SHIFT: 0x22 (34)
PDE_PG_CC: 0x38 (56)
PG_CC_SHIFT: 0x3 (3)

icrash callback_s data fields
SYM_ADDR: 0x10014880 (268519552)
STRUCT_LEN: 0x10014920 (268519712)
MEMBER_OFFSET: 0x100149a0 (268519840)
MEMBER_SIZE: 0x10014a20 (268519968)
MEMBER_BITLEN: 0x10014ab0 (268520112)
MEMBER_BASEVAL: 0x10014ae0 (268520160)
BLOCK_ALLOC: 0x10014b70 (268520304)
BLOCK_FREE: 0x10014be0 (268520416)
PRINT_ERROR: 0x1009f760 (269088608)

icrash global_s data fields
PROGRAM: icrash
FLAGS: 0x0 (0)
DUMPCPU: 0x0 (0)
HWGRAPHP: 0xa800000001370000
ACTIVEFILES: 0x0 (0)
DUMPPROC: 0x0 (0)
ERROR_DUMPBUF: 0x0 (0)
LBOLT: 0xc00000000145d80c
MLINFOLIST: 0xa8000008009240c0
NODEPDAINDR: 0xc0000000014bd288
PDAINDR: 0xc00000000145eaa0
PIDACTIVE: 0xc0000000014be040
PIDTAB: 0xc000000003ab4000
PFDAT: 0x0 (0)
PUTBUF: 0xa8000000005c2c00
STHREADLIST: 0xc0000000014bab08
STRST: 0x0 (0)
TIME: 0xc00000000145d808
XTHREADLIST: 0xc0000000014b6060
PIDTABSZ: 0x5688 (22152)
PID_BASE: 0x0 (0)
DUMPREGS: 0x0 (0)
KPTBLP: 0x0 (0)
HWGRAPH: 0x10dff000 (283111424)
```

# Appendix G: How to get a core dump from your Indy (2 methods)

There are two ways to force a core dump file on your Indy. Both require root privileges on the machine.

**METHOD 1**: Change a systune parameter

- only available as of 6.5
- means you always get a core file when you push the power button from then on, until you change the systune parameter again

**METHOD 2**: Use the debugger to damage /dev/kmem.

By default, the core dump file and related files will be placed in the /usr/adm/crash directory (which is the same as the /var/adm/crash directory).

## METHOD 1 - change systune parameter

The SYSTUNE parameter "power\_button\_changed\_to\_crash\_dump" controls whether pushing the restart button causes a core dump to be taken or not.

You must have root privileges on the machine to change SYSTUNE settings.

1. "su" to "root".
2. Use "grep" to search through your current SYSTUNE settings for the current value of the parameter "power\_button\_changed\_to\_crash\_dump".
3. Set the parameter value to anything but 0. An example is shown below:

```
myindy# systune power_button_changed_to_crash_dump 1
```

4. The system will respond with the following two lines.
   Type "y" as your answer to the question, as shown below:

```
power_button_changed_to_crash_dump = 0 (0x0)
Do you really want to change power_button_changed_to_crash_dump to 1 (
```

## METHOD 2 - damage /dev/kmem

You must have root privileges on the machine to use the **dbx** debugger on kernel memory.

1. "su" to "root".
2. Type the following:

```
dbx -k /unix /dev/kmem
```

3. At the "dbx>" prompt, type:

```
p &fork
```

4. That gives you the starting address of the fork routine, 0x\_something\_. An example, and the output generated, are shown below:

```
(dbx) p &fork
0x88196c44
```

5. Now overwrite that memory location, damage the kernel's copy of the "fork" code, and confuse the system, by changing the starting address of the "fork" routine to "0". An example is shown below, using the address displayed in the previous step:

```
assign ((int *) 0x88196c44 ) = 0
```

NOTE! As \*SOON\* as you hit "return" after typing the above, your Indy will panic. So have everything else already set up first, because you will not get a chance to exit from the debugger, or do anything else.
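After either method has produced a dump, the crash directory can be checked for the new files. The Python sketch below is an illustration, not an SGI tool. It assumes the dump files follow the `vmcore.N.comp` and `unix.N` naming seen in the Appendix F sample output (`vmcore.15.comp`, `unix.15`); the function name `find_dumps` and its default path argument are hypothetical.

```python
import os
import re

# Assumed naming pattern, taken from the sample "vmcore.15.comp".
VMCORE_RE = re.compile(r"^vmcore\.(\d+)\.comp$")


def find_dumps(crash_dir="/var/adm/crash"):
    """List (number, corefile, namelist-or-None) for each dump in crash_dir."""
    names = set(os.listdir(crash_dir))
    dumps = []
    for name in names:
        match = VMCORE_RE.match(name)
        if match:
            number = int(match.group(1))
            namelist = f"unix.{number}"
            # Report None when the matching namelist file is absent.
            dumps.append((number, name, namelist if namelist in names else None))
    return sorted(dumps)


if __name__ == "__main__":
    directory = "/var/adm/crash"
    if os.path.isdir(directory):
        for number, corefile, namelist in find_dumps(directory):
            print(number, corefile, namelist or "(no matching namelist)")
    else:
        print(f"{directory} not found")
```

Pairing each core file with its namelist matters because icrash is started with both, as the NAMELIST and COREFILE fields in the Appendix F dump sample show.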