The Battle Field
Two branded Intel Xeon Processor 7200 Series servers. One ran Windows Server 2003 (SBS) with Exchange 2003 (I will call it S1). This was a new server that had been added to an existing Windows 2000 domain; it held all FSMO (Flexible Single Master Operations) roles. The second, older server ran Windows 2000 Advanced Server with AD, Terminal Server (Application Mode), DNS and DHCP (I will call it S2). File Replication was configured and running. A third server ran Windows Server 2003 (Std. Ed.) with ISA 2004 (I will call it S3).
[This is a very small network with just 100 users spread over 3 buildings within a radius of about 1 km. Connectivity is a combination of fiber optic and 100/1000 Mbps LAN.]
The Problems Reported to ME...(be ready to... freak out)
1. Users can only be created at times, and when they are created on one of the DCs they are not replicated to the other.
2. Newly created users cannot log on to the domain.
3. "The directory service has exhausted the pool of relative identifiers." (This pops up when the DC that holds the RID Master FSMO role cannot be reached, for example when you try to create new objects, as new RIDs cannot be allocated.)
4. The AD MMC consoles report that the DCs are non-functional and cannot be contacted.
5. Internet access is only intermittently available (reason: ISA cannot contact the DCs). Duh.. anyone can tell that.
6. The Exchange server also processes mail requests erratically. Duh...duh.
7. Thin clients are automatically disconnected. In regard to this, the on-site IT admin observed that the TS clients being disconnected were all on one specific network adapter, so he disabled NIC2 on S2. (MAJOR MISTAKE - MS admins beware)
8. SYSVOL contains the "NtFrs_PreExisting___See_EventLog" folder. (This I checked on-site, as I was aware replication was not working.)
9. Server Event Logs - A sight for sore eyes!
What’s working?
1. Existing users (thin clients / desktop users) can log on to the server.
2. Network-wide shares and printers are available.
My first visit to the site.
I have handled larger setups in terms of no. of servers and users but for the first time in ten years I was nervous (of what - I have no idea).
Anyway, I log on to S2. The DC cannot be contacted, so I start off by checking DNS.
Click Start, Run, type cmd, and then click OK.
c:\>nslookup (took almost 10 sec. for the prompt to appear) (thought I had the cat in the bag)
Querying S2 returned results (this seemed OK).
Querying S1 returned results (this seemed OK).
I typed exit, and nslookup just refused to quit and kept flashing an error - "non existent", I think; sorry, I cannot recollect it exactly.
So first I made sure DNS was functioning optimally (I will not explain this, as even a junior admin with a little experience should be able to handle DNS). I will create a separate tutorial for complex DNS server scenarios.
Now, to get on with life you need tools, and a couple of them don't ship with the OS, i.e. 2K or 2K3. These "Resource Kit Tools" can be downloaded from the Microsoft website; some require validation. Use legal software, PERIOD. We owe our jobs to them (the developers).
DCDiag, NetDiag, Ntdsutil, Ntfrsutl, Replmon, Repadmin, Ldf, Linkd, Netdom, and how can I do anything without my soul mate "regedit".
Originally my plan was to walk you through the troubleshooting process as is, but first you need some theory about how the Domain Controller works with the Active Directory database. So let's start with that.
In the %systemroot% of the server, acquaint yourself with the NTDS and SYSVOL folders. NTDS stores the actual AD database file, i.e. "ntds.dit", and SYSVOL has the structure shown below.
\SYSVOL
\SYSVOL\domain
\SYSVOL\staging\domain
\SYSVOL\staging areas
\SYSVOL\domain\Policies
\SYSVOL\domain\scripts
\SYSVOL\SYSVOL
Within the \SYSVOL\SYSVOL\domain.com\policies folder there are two policies which are required for authentication:
1. {6AC1786C-016F-11D2-945F-00C04fB984F9} - Default Domain Controllers Policy
2. {31B2F340-016D-11D2-945F-00C04FB984F9} - Default Domain Policy
The SYSVOL must contain the following reparse points
\SYSVOL\SYSVOL\domain.com
and
\SYSVOL\staging areas\domain.com
You can verify this with the following procedure:
Go to Start, click Run, type cmd, and then click OK. Type ntfrsutl ds | findstr /i "root stage" and press ENTER.
For this to work the File Replication service must be running.
Typing net start ntfrs at the command prompt will start it. (It is usually running on a DC by default.)
Output like the following will be displayed.
Root: C:\WINNT\SYSVOL\domain
Stage: C:\WINNT\SYSVOL\staging\domain
Now, for your domain objects to be available across the wire, two shares must exist, i.e. the SYSVOL and NETLOGON shares. The SYSVOL share is \SYSVOL\SYSVOL and the NETLOGON share is \SYSVOL\domain\scripts.
If either of these shares is missing, for whatever reason, clients cannot join or log on to the domain.
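The share check lends itself to a small script. Below is a minimal Python sketch, assuming you have captured the text printed by net share on the DC. The function name missing_dc_shares and the sample text are made up for illustration (the sample is not output from a real server), and the parsing simply treats the first word of each line as a share name.

```python
REQUIRED_SHARES = ("SYSVOL", "NETLOGON")


def missing_dc_shares(net_share_output):
    """Return the required DC shares that do not appear in the given text.

    Assumption: each share is listed on its own line with the share name
    as the first whitespace-separated word.
    """
    found = set()
    for line in net_share_output.splitlines():
        words = line.split()
        if words:
            found.add(words[0].upper())
    return [share for share in REQUIRED_SHARES if share not in found]


# Hypothetical sample text, not captured from a real server.
sample = """
Share name   Resource                        Remark
C$           C:\\                             Default share
SYSVOL       C:\\WINNT\\SYSVOL\\sysvol         Logon server share
"""
print(missing_dc_shares(sample))
```

An empty list means both shares were found; anything else names the share you need to go looking for.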
Using DCDIAG and NETDIAG.
These two are essential command line tools. DCDIAG tests the health of the Domain Controller on which you run it, checks whether its replication partners are present, and the list goes on. NETDIAG, on the other hand, gives you clues about your network connectivity and availability.
Both tools report any errors back to you. In setups containing 2 or more DCs you should run them at least once a week.
[There are a lot of third party GUI tools that do more or less the same job but I’m from the old school, seeing is believing and I want to see these tests run in front of me!]
Some common command line tests and their variations
DCDIAG /a /v /c
DCDIAG /e /test:frssysvol
DCDIAG /test:fsmocheck
DCDIAG /test:Knowsofroleholders /v
NETDIAG /v /debug
NETDIAG /test:DNS /v
NETDOM query fsmo
[Command outputs are sometimes very lengthy, so save to a file for later viewing and analysis.]
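Saving the output can be automated. The following is a rough Python sketch, assuming Python is available on the machine you run the tools from; the file naming scheme and the helper names log_name and run_and_save are my own invention, and only the commands themselves come from the list above.

```python
import subprocess
from datetime import date

# Commands taken from the list above.
DIAGNOSTICS = [
    "DCDIAG /a /v /c",
    "DCDIAG /e /test:frssysvol",
    "DCDIAG /test:fsmocheck",
    "DCDIAG /test:Knowsofroleholders /v",
    "NETDIAG /v /debug",
    "NETDIAG /test:DNS /v",
    "NETDOM query fsmo",
]


def log_name(command, day):
    """Build a dated file name for a command (hypothetical naming scheme)."""
    safe = command.lower().replace("/", "").replace(":", "-").replace(" ", "_")
    return f"{safe}_{day.isoformat()}.txt"


def run_and_save(command, day=None):
    """Run one diagnostic and write everything it prints to a text file."""
    day = day or date.today()
    path = log_name(command, day)
    with open(path, "w") as out:
        # check=False: the tools may exit non-zero when they find problems,
        # and we still want the log.
        subprocess.run(command.split(), stdout=out,
                       stderr=subprocess.STDOUT, check=False)
    return path
```

Calling run_and_save on each entry of DIAGNOSTICS once a week leaves you a dated trail of logs to compare when something starts to drift.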
Active Directory defines five FSMO roles:
1. Schema master = Forest wide and one per forest.
2. Domain naming master = Forest wide and one per forest
3. RID master = Domain specific and one for each domain.
4. PDC master = Domain specific and one for each domain. (also known as PDC Emulator)
5. Infrastructure master = Domain specific and one for each domain.
Understand these five roles and you're KING of THE CASTLE!
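To make the forest/domain split concrete, here is a small Python sketch. The dictionary restates the five roles listed above; the function roles_lost_with and the holders mapping are hypothetical helpers for illustration, not part of any Microsoft tool.

```python
# Scope of each FSMO role, as listed above.
FSMO_SCOPE = {
    "Schema master": "forest",
    "Domain naming master": "forest",
    "RID master": "domain",
    "PDC emulator": "domain",
    "Infrastructure master": "domain",
}


def roles_lost_with(server, holders):
    """Return the roles held by a failed server.

    holders maps each role name to the server that currently holds it,
    for example as read off the output of a role query.
    """
    return sorted(role for role, holder in holders.items()
                  if holder.lower() == server.lower())


# Example: every role sits on one DC, as in the setup described here.
holders = {role: "S2" for role in FSMO_SCOPE}
print(roles_lost_with("S2", holders))
print(roles_lost_with("S1", holders))
```

If the first list comes back with all five names, you are in exactly the situation the next paragraphs worry about.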
As in my setup, two servers are running in sync and suddenly one of them becomes unavailable. What next for Active Directory? If the server that has gone down does not hold any of the above roles, there is nothing to worry about; solving this problem is a breeze. If the server that has gone down holds some or all of the above roles, don’t panic; put on your thinking hat and plan your attack. As we all know, "Failing to Plan is like Planning to Fail", and if you're in one of those losing battles, you must really plan as many steps ahead as you can.
Oh yes, the big question: what really does happen if the server that has gone down holds all the roles?
Well, if you have only one server and you have no ASR or any sort of backup strategy in place, now would be a good time to take a walk on the beach. Watch a sunset, feel the breeze in your face and kick yourself in the ass! Run as far as you can from your boss.
OK, seriously, this is a serious problem. If there are two or more DCs in your Active Directory, all five roles are usually held by your first DC, or they can be moved around your DCs as you see fit. Now I will tell you why you have to constantly monitor the replication on your DCs. TSL - Tomb Stone Limit (the tombstone lifetime): 60 days by default in Windows 2000, and 180 days by default in forests created on Windows Server 2003 SP1 or later. This is how long deleted objects are kept as tombstones before they are purged. A DC that has not replicated within this period misses those deletions, changes that were never replicated are at risk of being lost, and the DCs start to become inconsistent.
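The arithmetic behind the TSL is worth keeping in front of you. This is a minimal Python sketch, assuming you know the date of the last successful replication for a DC; the function name and the example dates are invented for illustration, and the 60-day default is the Windows 2000 figure quoted above.

```python
from datetime import date


def days_left_before_tombstone(last_replication, today, tombstone_days=60):
    """Days remaining before a DC falls outside the tombstone lifetime.

    A negative result means the DC is already past the limit.
    """
    return tombstone_days - (today - last_replication).days


# Hypothetical dates: last good replication on 1 January 2007.
print(days_left_before_tombstone(date(2007, 1, 1), date(2007, 2, 1)))
print(days_left_before_tombstone(date(2007, 1, 1), date(2007, 4, 11)))
```

The first call leaves 29 days of margin; the second is 40 days past the limit, which is roughly the state this site was found in.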
Why would DCs just stop replicating? I’ll give you the silly reasons first.
1. Insufficient disk space on the Active Directory partition.
2. Disabling or modifying network settings on servers in your spare time, instead of planning a date for it!
3. Time (what?) (What time?) - That time you see in your systray, of course. I have never heard of a BIOS or a BATTERY failure on a server motherboard, be it the old entry-level servers (the 440, if I remember it right, from Intel) or even the newer server boards.
Rule No. 1 : FOR WHATEVER REASON DO NOT CHANGE THE TIME AND DATE ON YOUR DOMAIN CONTROLLERS. (I have heard some of the lamest excuses why an admin had to change the time on the DC.)
Rule No. 2 : DO NOT FORGET RULE NO. 1
Now for the serious reasons:
Hardware failure (usually a NIC or a faulty power supply), virus issues, and out-of-date restores, i.e. restores from old backups.
Now I will discuss the troubleshooting steps and solutions to our FRS and domain-not-available problem.
FRS had not been functioning for way beyond the TSL; nobody monitored it or noticed until things reached a critical stage, and the site was left to die slowly. (Sorry to say: human error.)
No proper backups were in place; the media was as old as 30 days (with stale system state) and disorganized. Some backups were stored back on the servers' own hard drives.
My original plan was to at least get a proper copy of the AD's SYSVOL and then try a D4/D2. But wait, I see more problems: both servers, S1 and S2, hold bad copies and, surprise, the shares, i.e. SYSVOL and NETLOGON, no longer exist.
I say to myself, "Now I'm in real trouble", and then, "Wait, why am I in trouble? I am only an external consultant, I really do not have anything on the line". But now you wait and listen: if you want to make it big in IT, it's "ALL FOR ONE and ONE FOR ALL". If admins had not shared their experiences back then, we'd never be called techies, geeks or whatever pleases you. We would not have communities or newsgroups, and yes, you must not forget the days of the BBS. So I am in this as much as the poor IT admin who is in this mess!
I start the diagnostics with the commands outlined above, export the results to txt files and study them inside out.
I need to know what’s ticking and what’s not, and if not, WHY. By this time I know that FRS is running but has nothing to replicate.
I recheck the DNS servers; I have made S2 my primary DNS server and S1 the secondary.
I enable the second network adapter (the one previously disabled by the admin) and verify my DNS again. Results: very good, my DNS is functioning like clockwork. (Perfect)
I run DCDIAG and NETDIAG... Results: I'd say good for the situation, and my AD still will not replicate (I know it won't - TSL, remember).
Now the Active Directory MMC consoles are working well, there are no RID run-out errors and I am able to create users. Wow, this looks good..
I test a thin client: logon and all network apps are working (they have always been working). I move over to a desktop, but no success: "CANNOT LOGON TO DOMAIN". I was a bit troubled by this, but then it hit me: my SYSVOL and NETLOGON shares do not exist.
Now this is where everything took an ugly turn. The S2 server starts beeping: that’s a hard-drive failure. I'm not worried, as it's a RAID 5, so we call the server hardware tech support, and the support man comes promptly the next day with a replacement drive. The admin and I are away for an hour in a meeting with the boss to explain the status. What happened in that hour is still a big mystery to me: the RAID array collapsed. How can this happen, I ask the server tech support guy, and get no explanation. I start the server with a 2K3 boot CD, and all partitions show as unformatted or corrupt. I was really done for.
Now to gauge the data loss and take inventory of the situation..... There is a backup of user and company accounting data.... but that still does not solve my problem: I've lost the FSMO role holder, as all roles were on S2. There is only one thing left to do: "Seize the roles". I used ntdsutil to do this; the procedure is given below.
c:\>ntdsutil ENTER
ntdsutil: roles
fsmo maintenance: connections
server connections: connect to server S1
server connections: q
fsmo maintenance: (enter each of the following commands in turn)
Seize domain naming master
Seize infrastructure master
Seize PDC
Seize RID master
Seize schema master
Seize all the roles this way, clicking Yes at each warning.
To verify, use netdom query fsmo or ntdsutil:
ntdsutil
domain management
connections
connect to server S1
q
select operation target
list roles for connected server
That’s sorted out. What next? My plan of a D4/D2 is still in the back of my head, as it's the only way to get a working replica of the SYSVOL, but what is there to do a D2 on? I pondered this for a while and said to myself, the rest is all on me. Let's risk it big, do a D4 on S1 and recreate the SYSVOL manually. I can’t consider a nonauthoritative restore of this Domain Controller, as the backups all hold stale data.
DON'T EVER TRY THIS IN A PRODUCTION ENVIRONMENT UNLESS YOU REALLY HAVE TO...!!!
I verified the SYSVOL paths in the registry first.
To do this:
Go to Start, click Run, type regedit and then press ENTER.
Then navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters.
Right-click the SysVol value and verify its setting.
NEXT STEP
Set S1 to D4 to make FRS perform an authoritative restore of the SYSVOL (D4 is the authoritative setting; D2 is the non-authoritative one). Open regedit and make the following changes…
Go to the key
\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Cumulative Replica Sets\18d…
Right-click the BurFlags value, and set the value to D4 hexadecimal. If you have other DCs you can set them to D2, but in my case I am left with just one,
so I set only its BurFlags value to D4 hexadecimal.
You can alternatively set this in the key below
\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup
“Make sure that the FRS service is stopped when you do this procedure”
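For those who prefer a script to regedit, the BurFlags change can be sketched in Python. This is only an illustration under assumptions: it uses the global "Process at Startup" key quoted above, it relies on Python's standard winreg module (Windows only), and the names BURFLAGS and set_burflags are mine. Stop the FRS service before calling it (net stop ntfrs) and start it again afterwards (net start ntfrs).

```python
# Global BurFlags key, as quoted above (path relative to HKEY_LOCAL_MACHINE).
BURFLAGS_KEY = (r"SYSTEM\CurrentControlSet\Services\NtFrs\Parameters"
                r"\Backup/Restore\Process at Startup")

# D4 = authoritative restore, D2 = non-authoritative restore.
BURFLAGS = {"authoritative": 0xD4, "non-authoritative": 0xD2}


def set_burflags(mode):
    """Write the BurFlags DWORD for the chosen restore mode.

    Assumption: run on the DC itself, as an administrator, with the
    File Replication service stopped.
    """
    import winreg  # imported here because the module exists only on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, BURFLAGS_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "BurFlags", 0, winreg.REG_DWORD,
                          BURFLAGS[mode])


# The hexadecimal values expressed in decimal, for tools that expect that.
print(BURFLAGS["authoritative"], BURFLAGS["non-authoritative"])
```

In my situation the call would have been set_burflags("authoritative") on S1, and the same warning applies to the script as to the manual procedure.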
I also manually made a copy of my required policies from \SYSVOL\SYSVOL\domain.com\policies to \SYSVOL\SYSVOL\domain.com\scripts.
As mentioned earlier in this document the following two policies are required for authentication:
1. {6AC1786C-016F-11D2-945F-00C04fB984F9} - Default Domain Controllers Policy
2. {31B2F340-016D-11D2-945F-00C04FB984F9} - Default Domain Policy
Restart the DC. Now another word of caution: do not expect instant results. This process can take some time. On restart my DC showed SYSVOL as shared, but NETLOGON still did not appear, so I manually shared my scripts folder as NETLOGON.
VICTORY AT LAST.
ALL SYSTEMS GO….
EVERYTHING IS WORKING NORMALLY. My next task is to make a proper backup of the stabilized AD and then reorganize this messed-up AD. That, even a newbie could do.
Hope this document helps you. Please comment and fire at will. I would like to improve this and add more tutorials and help files. If you wish, you can mail me @ anil . colaco [at] gmail . com